Bulk Object Storage data migration with SkyPilot
Moving large datasets between S3 buckets is often slow, unreliable and frustrating — especially across accounts or clouds. In this post, we share a fast, fully open-source workaround using SkyPilot, s5cmd and Nebius AI Cloud. It’s a clever setup that distributes transfer workloads across multiple nodes with RAM disk acceleration, offering high throughput and built-in verification. While not a production-grade tool for every scenario, it’s a powerful option for engineers who need to migrate terabytes of data quickly and efficiently.
Moving terabytes of data between S3 buckets sucks. It’s slow, it breaks halfway through and then you’re stuck wondering which files were actually migrated. I’ve been there — running a single-threaded script for days, only to have it fail at 87 percent completion. Then starting over because who has time to figure out which files transferred successfully? There’s got to be a better way.
Turns out, there is! It involves SkyPilot.
The S3 migration problem we don’t talk about
Everyone’s moving data between clouds these days, but we don’t often talk about how painful it is. The standard approaches all have issues:
- AWS CLI: Kind of slow (try moving 50 TB with aws s3 sync and watch your hair turn gray) and, crucially, it doesn’t support cross-account transfers by using different profiles.
- Custom scripts: You’ll spend more time debugging edge cases than transferring data (especially for cross-account scenarios).
- Managed services (e.g., DataSync): Often expensive and/or vendor-locked.
What if we could distribute the work across multiple machines? That’s where SkyPilot comes in.
SkyPilot + Nebius: New BFFs
SkyPilot recently integrated with Nebius AI Cloud, giving us an easy way to spin up distributed compute resources. While primarily designed for AI workloads, it’s perfect for our data transfer needs.
The best part? While we’re focusing on Nebius as the target, this approach works with literally any S3-compatible storage. No vendor lock-in here!
A small note: there’s an awesome open-source project called Skyplane that was purpose-built for exactly this kind of transfer, but it’s no longer maintained — more on that in the benchmark section below.
The secret sauce: s5cmd and RAM disk
One of the biggest advantages of our approach is the ability to handle cross-account transfers easily.
Here’s the hack that makes this all work — a combination of:
- SkyPilot for orchestration: Manages cluster provisioning and distribution.
- s5cmd for transfers: A blazing fast S3 client.
- RAM disk for temporary storage: Eliminates disk I/O bottlenecks.
- GNU parallel: Efficiently manages concurrent transfers.
This approach gives us:
- Distributed workload across multiple nodes
- Parallel processing within each node
- Memory-speed temporary storage
- Error handling and post-transfer verification
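Under the hood, SkyPilot’s setup phase installs this tooling on every node before the transfer starts. The repository’s YAML takes care of it; the snippet below is only a rough sketch of what that setup could look like, with the s5cmd version and download URL being illustrative assumptions rather than the repo’s exact commands:
# Hedged sketch of per-node setup (the real commands live in s3_migration.yaml)
sudo apt-get update -y
sudo apt-get install -y parallel                 # GNU parallel for per-node concurrency
# s5cmd ships as a static binary; version pinned here purely for illustration
curl -sSL https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz \
  | sudo tar -xzf - -C /usr/local/bin s5cmd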
How it actually works (the nerdy details)
Here’s the play-by-play of what happens:
1. SkyPilot provisions your migration cluster on Nebius.
2. Each node mounts a RAM disk for temporary storage.
3. The head node lists all objects in the source bucket.
4. That list gets evenly split (by count, not by size) into chunks among all worker nodes.
5. Each node processes its chunk in parallel.
6. Everything gets verified, with detailed reporting at the end.
Let’s look at some of the key code pieces that make this work:
# Mount RAM disk for temporary storage
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=8g tmpfs /mnt/ramdisk
mkdir -p /mnt/ramdisk/s3migration/temp
This is crucial — by using RAM instead of disk for temporary storage, we eliminate the I/O bottleneck that often plagues transfer operations.
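A quick sanity check after mounting confirms the tmpfs is there with the expected capacity:
df -h /mnt/ramdisk   # should report a tmpfs filesystem of roughly 8G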
The head node distributes work by splitting object lists:
# Split files by node using modulo on line number
for i in $(seq 0 $((NUM_NODES-1))); do
  awk -v node=$i -v nodes=$NUM_NODES 'NR % nodes == node' \
    $TEMP_DIR/filtered_objects.txt > $TEMP_DIR/node_${i}_objects.txt
  # ...
done
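To make the modulo split concrete, here’s a toy example: with four nodes, node 0 receives every line whose number is a multiple of four, node 1 receives lines 1, 5, 9 and so on, giving an even split by object count:
# Toy demo of the same awk filter: node 0 of 4 keeps lines 4 and 8
printf 'a\nb\nc\nd\ne\nf\ng\nh\n' | awk -v node=0 -v nodes=4 'NR % nodes == node'
# prints:
# d
# h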
Then, each worker processes its assignments with high concurrency:
cat $TEMP_DIR/my_objects.txt | parallel -j $NUM_CONCURRENT process_object
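The actual process_object function lives in the repository’s script; the sketch below only illustrates the idea, assuming s5cmd’s --profile and --endpoint-url flags and the RAM disk path from earlier (the function body and temp-file handling here are mine, not the repo’s exact code):
# Hedged sketch: copy one object from source to target via the RAM disk
process_object() {
  local key="$1"
  local tmp="/mnt/ramdisk/s3migration/temp/$(basename "$key")"
  s5cmd --profile "$SOURCE_AWS_PROFILE" --endpoint-url "$SOURCE_ENDPOINT_URL" \
    cp "$SOURCE_BUCKET/$key" "$tmp" && \
  s5cmd --profile "$TARGET_AWS_PROFILE" --endpoint-url "$TARGET_ENDPOINT_URL" \
    cp "$tmp" "$TARGET_BUCKET/$key" && \
  rm -f "$tmp"
}
export -f process_object   # required so GNU parallel can invoke the function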
The verification step ensures nothing was lost:
# Compare counts
SOURCE_COUNT=$(wc -l < $TEMP_DIR/final_source_files.txt)
TARGET_COUNT=$(wc -l < $TEMP_DIR/final_target_files.txt)

if [ $SOURCE_COUNT -eq $TARGET_COUNT ]; then
  echo "✅ Migration completed successfully! Object counts match."
  # Even if counts match, check for differences in file lists
  if diff $TEMP_DIR/source_keys.txt $TEMP_DIR/target_keys.txt > $TEMP_DIR/diff_output.txt; then
    echo "✅ All objects match between source and target buckets."
  else
    echo "⚠️ Warning: Although counts match, some objects differ between buckets."
    # ...
  fi
fi
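For context, source_keys.txt and target_keys.txt are simply sorted object-key listings of the two buckets, built by the script before the diff. A rough sketch of how such listings can be produced, assuming s5cmd’s ls output puts the key in the last column (file names and parsing here are illustrative):
# Hedged sketch: dump sorted key lists for both sides before diffing
s5cmd --profile "$SOURCE_AWS_PROFILE" --endpoint-url "$SOURCE_ENDPOINT_URL" \
  ls "$SOURCE_BUCKET/*" | awk '{print $NF}' | sort > $TEMP_DIR/source_keys.txt
s5cmd --profile "$TARGET_AWS_PROFILE" --endpoint-url "$TARGET_ENDPOINT_URL" \
  ls "$TARGET_BUCKET/*" | awk '{print $NF}' | sort > $TEMP_DIR/target_keys.txt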
Getting started: A complete setup guide
Let’s walk through the entire setup process, step by step.
Set up the Nebius CLI and credentials
First, you need to install the Nebius CLI. Then, download and run the Nebius setup script:
wget https://raw.githubusercontent.com/nebius/nebius-solution-library/refs/heads/main/skypilot/nebius-setup.sh
chmod +x nebius-setup.sh
./nebius-setup.sh
When prompted, select your Nebius tenant and project ID from the list. The script will create the required credentials expected by SkyPilot.
Install SkyPilot with Nebius support
Next, install SkyPilot with Nebius integration:
pip install "skypilot-nightly[nebius]"
Verify it’s working:
sky check nebius
> Checking credentials to enable clouds for SkyPilot.
🎉 Enabled clouds 🎉
Nebius [compute, storage]
Using SkyPilot API server: http://127.0.0.1:46580
Download the migration script and SkyPilot YAML
Clone the repository with the migration scripts:
git clone https://github.com/nebius/nebius-solution-library.git
cd nebius-solution-library/skypilot/s3-migration
Configure your credentials
Ensure you have AWS credentials configured for both the source and target buckets:
# For source bucket (e.g., AWS)
aws configure --profile default
# For target bucket (Nebius)
aws configure --profile nebius
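Both profiles end up in the standard ~/.aws/credentials file, which the job mounts onto every node. If you prefer to edit it by hand instead of running aws configure, it looks like this (values are placeholders):
[default]
aws_access_key_id = <your-aws-access-key>
aws_secret_access_key = <your-aws-secret-key>

[nebius]
aws_access_key_id = <your-nebius-access-key>
aws_secret_access_key = <your-nebius-secret-key>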
Launch your transfer job
Now, you’re ready to launch:
export SOURCE_AWS_PROFILE=default
export SOURCE_ENDPOINT_URL=https://s3.us-east-1.amazonaws.com # change to your region
export SOURCE_BUCKET=s3://source-bucket
export TARGET_AWS_PROFILE=nebius
export TARGET_ENDPOINT_URL=https://storage.eu-north1.nebius.cloud:443 # change to your region
export TARGET_BUCKET=s3://target-bucket
sky launch -c migration s3_migration.yaml \
--env SOURCE_AWS_PROFILE \
--env SOURCE_ENDPOINT_URL \
--env SOURCE_BUCKET \
--env TARGET_AWS_PROFILE \
--env TARGET_ENDPOINT_URL \
--env TARGET_BUCKET
This will:
- Provision the specified resources on Nebius (by default, eight nodes with 16 CPUs each).
- Set up the environment with all dependencies.
- Mount your AWS credentials.
- Start the distributed transfer process with 16 concurrent transfers per node.
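Once the job is running, the usual SkyPilot commands let you follow progress and clean up afterward:
sky logs migration    # stream the transfer logs from the cluster
sky status            # check cluster state
sky down migration    # tear the cluster down when the migration is done
Don’t skip the last step: the nodes keep billing until you tear them down.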
Performance tuning
You can adjust several parameters to optimize performance:
- Number of nodes: Change num_nodes: 8 in the YAML file to scale out more.
- Concurrency: Modify NUM_CONCURRENT: 16 to control parallel transfers per node.
- RAM disk size: Adjust the size=8g in the mount command for larger files.
- CPUs per node: Change cpus: 16 in the resources section of the YAML.
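Some of these can also be overridden at launch time instead of editing the YAML. For example, assuming NUM_CONCURRENT is declared in the YAML’s envs section (as the script references above), something like this should work:
sky launch -c migration s3_migration.yaml \
  --num-nodes 16 \
  --env NUM_CONCURRENT=32 \
  --env SOURCE_AWS_PROFILE --env SOURCE_ENDPOINT_URL --env SOURCE_BUCKET \
  --env TARGET_AWS_PROFILE --env TARGET_ENDPOINT_URL --env TARGET_BUCKET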
For extremely large transfers (millions of files), consider modifying the script so that the post-transfer verification samples objects instead of checking every file.
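A hedged sketch of what that sampling could look like, reusing the key lists from the verification step (the exact file names in the repo’s script may differ):
# Hypothetical sampling check: compare a random ~1% of source keys against the target
TOTAL=$(wc -l < $TEMP_DIR/source_keys.txt)
SAMPLE=$(( TOTAL / 100 + 1 ))
shuf -n "$SAMPLE" $TEMP_DIR/source_keys.txt | while read -r key; do
  grep -qxF "$key" $TEMP_DIR/target_keys.txt || echo "Missing in target: $key"
done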
Benchmark: ImageNet-scale transfer performance
To evaluate data transfer performance across clouds and regions, we recreated the 70 GB ImageNet-style dataset that was used in the original Skyplane benchmark:
for i in $(seq 1 1152); do
  dd if=/dev/urandom of="dummy_file_$i.dat" bs=1M count=$((70*1024/1152))
done
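To stage the dummy dataset in the source bucket, a wildcard upload with s5cmd should do the trick (bucket name and prefix below are placeholders):
# Upload the generated files to the source bucket
s5cmd --profile default --endpoint-url https://s3.us-east-1.amazonaws.com \
  cp "dummy_file_*.dat" s3://source-bucket/imagenet-bench/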
We then ran several transfer scenarios using S3-compatible endpoints and compared the timings with Skyplane’s published benchmarks.
Test scenarios and results
| Transfer type | Source | Target | Time |
|---|---|---|---|
| Intra-region (Nebius) | eu-north1 | eu-north1 | 26 sec |
| Inter-cloud | AWS us-east-1 | Nebius eu-north1 | 146 sec |
| Inter-cloud | Nebius eu-north1 | AWS us-east-1 | 53 sec |
| Skyplane (reference) | Various multi-cloud pairs | Various | ~19–28 sec |
| AWS DataSync (reference) | AWS to GCP/AWS | Various | ~300–422 sec |
Observations
- Nebius intra-region performance is comparable to Skyplane’s fastest results, clocking in at just 26 seconds.
- Cross-cloud transfer from AWS to Nebius (across continents) completed in 2 minutes 26 seconds, significantly outperforming AWS DataSync, which took up to 7 minutes for similar long-haul transfers.
- Skyplane remains the top performer for optimized, cross-cloud transfers, but unfortunately it only supports AWS, GCP, Azure and Cloudflare. Since the project is stale, I doubt there’ll be support for other S3-compatible storage solutions anytime soon.
Comparison to other options
| Solution | What’s good | What’s not |
|---|---|---|
| Our s5cmd + SkyPilot solution | Distributed, works across clouds, performs post-migration verification | Requires some initial setup |
| Skyplane | Purpose-built for S3, extremely fast, uses compression and bandwidth tiering | Abandoned project |
| AWS DataSync | Managed service | AWS-specific and costs money |
| Manual S3 CLI | Simple, everybody knows it | Slow, no distribution, no cross-account support with different profiles |
The bottom line
By combining SkyPilot, s5cmd and some clever distribution techniques, we have a data transfer solution that’s:
- Fully open-source
- Way faster than single-machine approaches
- Reliable with built-in verification
- Works with any S3-compatible storage
But remember: this is more of a cool hack than a recommended production solution for all scenarios — there are limitations.
Next time you need to move lots of data between object stores, give this approach a try if your use case fits. You can also use it with minimal changes to move data from external S3 to Nebius object storage.
The full code is available in the nebius-solution-library repository, under skypilot/s3-migration.