Introducing Managed Soperator: Your quick access to Slurm training
We’re thrilled to announce that Managed Soperator, our fully managed Slurm-on-Kubernetes solution, is now available to everyone in self-service.
This means you can get a ready-to-work Slurm training cluster literally in minutes. The cluster runs on provisioned NVIDIA GPUs and is delivered with all the necessary libraries and drivers pre-installed, allowing you to start ML training immediately.
By developing it as a managed service on our platform, we give you the shortest time to value: you can skip negotiations with customer service and avoid daunting manual Slurm configuration.
Just set the cluster parameters, click a button and enjoy your ready-to-work training environment.
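Once the cluster is up, you submit work the same way as on any Slurm installation. As a minimal sketch (the script name, GPU counts and time limit below are illustrative assumptions, not defaults of the service), a multi-node training job might look like this:

```shell
#!/bin/bash
# train.sbatch -- illustrative batch script; node and GPU counts,
# job name and the train.py entry point are assumptions, not
# values prescribed by Managed Soperator.
#SBATCH --job-name=llm-train
#SBATCH --nodes=2                 # two worker nodes
#SBATCH --gres=gpu:8              # 8 GPUs per node
#SBATCH --ntasks-per-node=1       # one launcher process per node
#SBATCH --time=04:00:00

# Launch one torchrun per node; Slurm supplies the node list,
# and the first node acts as the rendezvous endpoint.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)

srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc-per-node=8 \
  --rdzv-backend=c10d \
  --rdzv-endpoint="${head_node}:29500" \
  train.py
```

You would submit this with `sbatch train.sbatch` and watch its progress with `squeue`, exactly as on a self-managed Slurm cluster.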
AI training clusters in one click
We hear a lot of feedback from our customers and colleagues that, for all its powerful job-scheduling capabilities, Slurm falls short of the expectations of many ML developers and researchers, especially among the younger generation.
This isn’t a problem if you are backed by a dedicated team of DevOps engineers, but for a small group of ML developers or independent researchers, provisioning a Slurm-powered cluster can be a very frustrating experience.
Nebius’ Slurm/Soperator has enabled us to orchestrate distributed computing workloads with real-time streaming of petabyte-scale data to train models. Its ease of use, fault tolerance and efficient orchestration are unique among computing infrastructure solutions.
Dhruv Pal, Tilde
In this managed implementation, all infrastructure provisioning and configuration happens in the background, hidden from the user’s view. Additionally, Managed Soperator comes with the full set of features shared by all managed applications on Nebius AI Cloud.
We developed Managed Soperator with a very simple idea in mind: to put this job-scheduling tool in the hands of a broad audience of modern AI developers, allowing them to focus on inspiring AI research and creativity rather than tedious operational routines.
Slurm becomes cloud-native
The core technology behind our managed solution is Soperator, our in-house Kubernetes operator for Slurm, which we released as an open-source project last autumn. Today, it helps us quickly deploy thousand-GPU clusters for our clients, simplifying the deployment process and cutting provisioning time from weeks to a couple of days.
It has also proven reliable for multi-host, fault-tolerant training. Our recent results in MLPerf® Training v5.0, the industry’s most trusted peer-reviewed benchmark suite, demonstrate the value of Soperator as an orchestration tool for training on 512 and 1,024 GPUs.
One of the most notable features of Soperator is its shared root filesystem, which allows us to scale the cluster easily, without complicated package management on every new node.
Figure 1. Running Slurm on Kubernetes with Soperator
This approach helped us to wrap Slurm capabilities into a cloud-native format, significantly simplifying the operational overhead of this solution.
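To illustrate what the shared root filesystem means in practice, here is a hedged sketch (the package name and node count are illustrative assumptions): software installed once on the login node becomes visible to every node in the cluster, with no per-node setup.

```shell
# On the login node: install a library once. With Soperator's shared
# root filesystem, all Slurm nodes see the same environment.
pip install deepspeed   # package choice is illustrative

# A job running on any set of nodes can use it immediately --
# no configuration management or per-node installs required.
srun --nodes=4 python -c "import deepspeed"
```

Without a shared root, the same change would typically require baking a new image or running configuration management across every node.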
Slurm-on-Kubernetes solutions by Nebius
Today, we have three options to get a Slurm-based cluster:
- Managed Soperator as a fully managed application on Nebius AI Cloud.
- Professional Soperator as a customized and tailored solution deployed by Nebius solution architects for large-scale training installations.
- Soperator as open-source software.
The table below shows the differences and features of each of these options.
| | Managed Soperator | Professional Soperator | Soperator |
|---|---|---|---|
| Solution | Slurm-based clusters | Slurm-based clusters | Kubernetes operator for Slurm |
| Delivery model | Self-service app | Professional service | Open-source software |
| Cloud environment | Nebius | Nebius | Cloud-agnostic |
| Pre-installed AI/ML drivers and libraries | Yes | Yes | Yes |
| All types of containers supported | Yes | Yes | Yes |
| Passive health checks | Coming soon | Yes | No |
| Active health checks | Coming soon | Yes | No |
| Topology-aware job scheduling | Coming soon | Yes | No |
| Auto-healing mechanism | Coming soon | Yes | On Nebius AI Cloud only |
| Free software, consumption-based pricing | Yes | Yes | Yes |
You may notice that several important features are marked as “coming soon” and are currently available only in the Professional version. This reflects our vision of bringing the best of this technology to its managed implementation; our developers are working on it right now.
Getting started
To start using Managed Soperator, sign up for the console.
In its current version, you can create a cluster with up to 32 NVIDIA GPUs. If you need more, feel free to contact us via the contact form or Support Center, and we will be happy to provide a bigger cluster and upgrade it to the Professional version if necessary.
If you want to contribute to Soperator as open-source software, check out the project on GitHub.