Run GPU jobs at scale, on your terms

Schedule and manage GPU workloads at any scale with a complete set of tools on Nebius.

Choose the orchestration that fits your workflow, from fully managed Slurm clusters with our own Soperator to integrated third-party job schedulers.

Simple, quick access to GPU compute

Focus on your models, not on infrastructure setup. Our orchestration tools provision and configure everything automatically, so ML engineers can start scheduling jobs without any DevOps expertise.

Fault-tolerant by design

Run and scale ML jobs without worrying about cluster stability. Scheduled and active health checks detect issues early, and the underlying system automatically remediates node failures in minutes, protecting goodput across long training runs.

Optimized performance from day one

Every cluster on Nebius is pre-validated across drivers, cluster interconnect, and topology-aware scheduling. No manual tuning, no warm-up period. Just high-performance compute, ready to use the moment you deploy.

Soperator

Run fully managed Slurm clusters on Nebius, battle-tested for large-scale training with cloud-native simplicity.
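
Because Soperator provides a Slurm environment, jobs are typically submitted with an ordinary Slurm batch script. The sketch below is illustrative only: it renders such a script from a small job description using common sbatch directives. The TrainingJob class, the render_sbatch helper, the job name, and the train.py command are hypothetical and not part of Soperator or any Nebius API.

```python
from dataclasses import dataclass


@dataclass
class TrainingJob:
    """Hypothetical job description; field names are illustrative only."""
    name: str
    nodes: int
    gpus_per_node: int
    time_limit: str  # Slurm time format, e.g. "24:00:00"
    command: str     # command each node runs (assumed to exist on the cluster)


def render_sbatch(job: TrainingJob) -> str:
    """Render a standard Slurm batch script for a multi-node GPU job."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --nodes={job.nodes}",
        f"#SBATCH --gpus-per-node={job.gpus_per_node}",
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --time={job.time_limit}",
        # %x expands to the job name, %j to the job ID
        "#SBATCH --output=%x-%j.out",
        "",
        # srun launches the command once per task, here one per node
        f"srun {job.command}",
    ]
    return "\n".join(lines) + "\n"


# Example values are assumptions chosen for illustration.
job = TrainingJob(
    name="llm-pretrain",
    nodes=4,
    gpus_per_node=8,
    time_limit="24:00:00",
    command="python train.py",
)
script = render_sbatch(job)
print(script)
```

Saving the printed text to a file and passing it to sbatch on a login node would queue the job in the usual Slurm way. The right node counts, GPU counts, and time limits depend on your cluster configuration.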

Launch in minutes, not days

A production-scale Slurm environment ready in 20–30 minutes from a simple console configuration. Node provisioning, dependencies, and NVIDIA drivers, all handled automatically.

Zero DevOps experience required

No Slurm installation, no manual configuration, no painful node scaling or cluster extension. The full environment is provisioned and maintained by Nebius.

Extra fault tolerance

An additional layer of scheduled and active health checks detects and remediates node failures automatically, on top of standard cluster resilience. Long training runs stay on track.

Powered by open-source software

Soperator is built and maintained in-house, fully open-sourced, and available on GitHub. Use it as a managed service on Nebius or deploy it yourself on any cloud.

Launching a Slurm cluster in minutes with Soperator

This video demonstrates how quickly you can set up and launch a Slurm cluster for AI training using Soperator.

Read more about Soperator

SkyPilot

Use SkyPilot to provision, schedule, and execute jobs on Nebius without changing your existing workflow. Nebius hosts a SkyPilot API server, so your team’s configs and job history stay in the cloud. There is no local server to maintain and no credentials to distribute across the team.

Connect your existing SkyPilot setup to Nebius in minutes.

Read the setup guide

Ray and Anyscale

Deploy Ray clusters directly on Nebius through the Applications marketplace and start running distributed training, hyperparameter tuning, or data processing jobs on your GPU infrastructure.

Bring your Anyscale workflows to Nebius for enterprise-grade managed clusters with SLAs backed by Anyscale.

Explore Nebius Applications

Built on managed, 100% upstream Kubernetes

All orchestration tools on Nebius run on Managed Kubernetes, our fully managed container orchestration layer, which is optimized for AI workloads and available as a standalone service.

For teams that need direct, DevOps-level control over multi-node environments, Managed Kubernetes gives you full visibility into every layer of the stack.

Learn more about Managed Kubernetes

Get started

Have questions about which orchestration setup fits your workload? Reach out to our team and we will help you find the right configuration.