Run GPU jobs at scale, on your terms
Schedule and manage GPU workloads at any scale with a complete set of tools on Nebius.
Choose the orchestration that fits your workflow, from fully managed Slurm clusters running on our own Soperator to integrated third-party job schedulers.
Simple, quick access to GPU compute
Focus on your models, not on infrastructure setup. Our orchestration tools provision and configure everything automatically, so ML engineers can start scheduling jobs without any DevOps expertise.
Fault-tolerant by design
Run and scale ML jobs without worrying about cluster stability. Scheduled and active health checks detect issues early, and the underlying system automatically remediates node failures in minutes, protecting goodput across long training runs.
Optimized performance from day one
Every cluster on Nebius is pre-validated across drivers, cluster interconnect, and topology-aware scheduling. No manual tuning, no warm-up period. Just high-performance compute, ready to use the moment you deploy.
Soperator
Run fully managed Slurm clusters on Nebius, battle-tested for large-scale training with cloud-native simplicity.
Launch in minutes, not weeks
A production-scale Slurm environment ready in 20–30 minutes from a simple console configuration. Node provisioning, dependencies, and NVIDIA drivers, all handled automatically.
Zero DevOps experience required
No Slurm installation, no manual configuration, no painful node scaling or cluster extension. The full environment is provisioned and maintained by Nebius.
Extra fault tolerance
An additional layer of scheduled and active health checks detects and remediates node failures automatically, on top of standard cluster resilience. Long training runs stay on track.
Powered by open-source software
Soperator is built and maintained in-house, fully open-sourced, and available on GitHub. Use it as a managed service on Nebius or deploy it yourself on any cloud.
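Once a cluster is running, jobs are submitted as they would be on any Slurm installation. As a rough illustration only, the sketch below renders a standard multi-node `sbatch` script in Python. The job name, node count, GPUs per node, and the `python train.py` command are hypothetical placeholders, not Nebius defaults, and the `render_sbatch` helper is not part of Soperator.

```python
def render_sbatch(job_name: str, nodes: int, gpus_per_node: int,
                  command: str, time_limit: str = "24:00:00") -> str:
    """Render a Slurm batch script for a multi-node GPU job.

    All argument values are illustrative assumptions; adjust them
    to match the cluster you actually provisioned.
    """
    if nodes < 1 or gpus_per_node < 1:
        raise ValueError("nodes and gpus_per_node must be positive")

    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        # One task per GPU is a common layout for distributed training.
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        # %x expands to the job name and %j to the job ID.
        "#SBATCH --output=%x-%j.out",
        "",
        # srun launches the command once per allocated task.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: 4 nodes with 8 GPUs each.
script = render_sbatch("llm-pretrain", nodes=4, gpus_per_node=8,
                       command="python train.py")
print(script)
```

Saving the printed text to a file and passing it to `sbatch` on a login node would queue the job. The directives shown are standard Slurm options.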
Launching a Slurm cluster in minutes with Soperator
This video demonstrates how quickly you can set up and launch a Slurm cluster for AI training using Soperator.
SkyPilot
Use SkyPilot to provision, schedule, and execute jobs on Nebius without changing your existing workflow. Nebius hosts a SkyPilot API server so your team’s configs and job history stay in the cloud environment, with no local server to maintain and no credentials to distribute across the team.
Connect your existing SkyPilot setup to Nebius in minutes.

Ray and Anyscale
Deploy Ray clusters directly on Nebius through the Applications marketplace and start running distributed training, hyperparameter tuning, or data processing jobs on your GPU infrastructure.
Bring your Anyscale workflows to Nebius for enterprise-grade managed clusters with SLAs backed by Anyscale.

Built on managed 100% upstream Kubernetes
All orchestration tools on Nebius run on managed Kubernetes, our fully managed container orchestration layer, optimized for AI workloads and also available as a standalone service.
For teams that need direct, DevOps-level control over multi-node environments, managed Kubernetes gives you full visibility into every layer of the stack.

Get started
Have questions about which orchestration setup fits your workload? Reach out to our team and we will help you find the right configuration.
