Soperator
A Slurm-based workload manager for ML and HPC clusters with a modern and simplified user experience.
Robust job scheduling
Slurm can schedule and orchestrate an immense number of jobs on thousands of compute nodes. Together with granular hardware control, it creates a powerful tool for solving the most complicated ML and HPC tasks.
Fault-tolerant training
Thanks to hardware health checks and Kubernetes high-availability mechanisms, Soperator ensures seamless and predictable ML training without disruptions caused by GPU failures.
Simplified environment management
A shared root filesystem provides a single file environment for all nodes of the cluster, allowing ML practitioners to focus on model development rather than on complicated package management.
Use cases
Distributed training of any scale
Soperator is a perfect solution for orchestrating highly intensive distributed training with a scale of up to tens of thousands of GPU nodes.
Collaboration within the same cluster
Running jobs in parallel and queuing jobs from different projects one after another can save time and money when your ML team shares a single cluster.
How it works
Open source solution
At Nebius, we believe that only together can we create better technologies. That’s why we made this product open-source, providing ML enthusiasts and HPC practitioners with the opportunity to use this technology for their endeavors and improve it according to their needs.
Service features
Hardware health checks
The system constantly monitors the availability of every hardware unit within the cluster and reports if any issue occurs.
Shared root filesystem
All system files are shared across all cluster nodes, removing the need to manually keep every node of the cluster in an identical state.
Easy bootstrap
The solution is ready to go and can be deployed within 20 to 30 minutes. We also provide a Terraform recipe for our cloud that simplifies deployment even further.
Pre-installed GPU and network drivers
Soperator comes with NVIDIA GPU, InfiniBand and other drivers pre-installed, covering everything necessary for running an ML training cluster.
Easy scaling
You can easily scale your GPU cluster up and down based on the upcoming workloads, new model development tasks or team expansion.
Advanced scheduling mechanism
Slurm allows you to split one large job into many steps, some executed sequentially and others in parallel.
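As a rough illustration of job steps, the sketch below generates a batch script in which each `srun` line is one step. Steps listed first run one after another; steps launched with a trailing `&` run side by side until the final `wait`. The helper name `build_batch_script` and the example commands are hypothetical, and whether the parallel steps actually run concurrently depends on the resources in the allocation and on your Slurm version and step options.

```python
def build_batch_script(sequential_steps, parallel_steps, nodes=2, ntasks=2):
    """Return the text of an sbatch script (hypothetical helper, not part of Soperator).

    sequential_steps: commands that run one after another as job steps.
    parallel_steps:   commands launched in the background as job steps,
                      followed by a single `wait`.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
        "",
    ]

    # Each srun call inside a batch script creates one job step.
    # Without a trailing '&', the shell blocks until the step finishes.
    for command in sequential_steps:
        lines.append(f"srun --ntasks={ntasks} {command}")

    # A trailing '&' sends the step to the background so the next one
    # can start immediately; 'wait' blocks until all of them are done.
    for command in parallel_steps:
        lines.append(f"srun --ntasks=1 {command} &")
    if parallel_steps:
        lines.append("wait")

    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder commands; replace with your own programs.
    print(build_batch_script(
        sequential_steps=["python prepare_data.py"],
        parallel_steps=["python eval_a.py", "python eval_b.py"],
    ))
```

The generated text can be saved to a file and submitted with `sbatch`; the script names above are placeholders only.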
Granular hardware control
Slurm can distinguish hardware like CPU sockets, CPU/GPU cores and hyperthreads, allowing you to provision even the smallest compute units.
Cluster accounting
Slurm accounting provides detailed statistics about cluster resource consumption, job duration, errors and other system data.
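To give a feel for working with accounting data, here is a minimal sketch that summarizes a pipe-delimited report of the kind produced by `sacct --parsable2 --format=JobID,JobName,State,Elapsed`. The sample rows are made up for illustration, the function names are hypothetical, and the parser assumes elapsed times in the `HH:MM:SS` form with an optional `D-` day prefix.

```python
import csv
import io

# Made-up sample shaped like pipe-delimited sacct output (not real cluster data).
SAMPLE_REPORT = """JobID|JobName|State|Elapsed
101|train|COMPLETED|01:30:00
102|eval|FAILED|00:05:30
103|train|COMPLETED|1-00:00:10
"""


def elapsed_to_seconds(text):
    """Convert '[D-]HH:MM:SS' to seconds (assumed format)."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def summarize_by_state(report):
    """Count jobs and total elapsed seconds per job state."""
    summary = {}
    for row in csv.DictReader(io.StringIO(report), delimiter="|"):
        entry = summary.setdefault(row["State"], {"jobs": 0, "seconds": 0})
        entry["jobs"] += 1
        entry["seconds"] += elapsed_to_seconds(row["Elapsed"])
    return summary


if __name__ == "__main__":
    for state, stats in summarize_by_state(SAMPLE_REPORT).items():
        print(state, stats["jobs"], stats["seconds"])
```

On a real cluster, the report text would come from the accounting command rather than from the sample string, and the set of columns may differ from the ones assumed here.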
High performance filesystem
Delivers up to 100 GB/s throughput and 1M IOPS for quick checkpoint restoration during large-scale training and effective dataset streaming.
Terraform support
Configuring the cluster via Terraform simplifies setup even if your team doesn’t have deep DevOps expertise.
Questions and answers about Soperator
What is Slurm?
Slurm is an open source, fault-tolerant and highly scalable cluster management and job scheduling system for large and small Linux clusters.
Why is Slurm good for distributed training?
Is Soperator on Nebius ready to use for training out of the box?
Is Soperator a paid service?
Should I pay for Managed Service for Kubernetes when using Soperator?
Can I use Soperator in other public clouds or on-premises?
SchedMD
By partnering directly with SchedMD, the developer of the Slurm Workload Manager, Nebius AI provides exceptional support to Slurm users. SchedMD's robust Slurm workload manager streamlines job scheduling and resource allocation. Its scalability and reliability make it a versatile solution that can meet a variety of business needs.
* Slurm is a registered trademark of SchedMD LLC.