Managed Soperator

A fully managed Slurm-on-Kubernetes solution for simplified AI training on NVIDIA GPU clusters.

One-click cluster setup

Launch your training environment in minutes, not days. Our solution handles everything — from node provisioning to pre-installed dependencies — so you can start scheduling jobs instantly with zero infrastructure configuration.

Fault-tolerant training

Train your models without stress. Automatic health checks and recovery keep your jobs running, even during hardware or node failures. Integrated monitoring dashboards and logging provide advanced visibility into the cluster and full control over it.

Maximum GPU utilization

Make the most of your AI hardware. Smart scheduling and topology-aware job placement boost efficiency for large-scale training. Optimized dependencies ensure quick execution of your model training frameworks.

How to launch a Slurm cluster in minutes

This video demonstrates how quickly you can set up and launch a Slurm cluster for AI training by using Managed Soperator.

Try Managed Soperator

Sign up for the console, add billing details, and set up your cluster parameters. That’s it! The system will provision the compute and deploy Slurm automatically.

Running Slurm on Kubernetes

Managed Soperator is powered by Soperator, our custom-made Kubernetes operator for Slurm. It allows us to combine the advanced job scheduling capabilities of Slurm with the cloud-native flexibility of Kubernetes in one training environment.
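As a rough illustration of what Slurm job scheduling looks like from the user's side, the sketch below renders a batch script for a multi-node GPU job. The `TrainingJob` class, the job name, the node and GPU counts, and `train.py` are hypothetical placeholders. The `#SBATCH` directives are standard Slurm options, and nothing here is specific to Managed Soperator.

```python
from dataclasses import dataclass


@dataclass
class TrainingJob:
    """Hypothetical description of a multi-node GPU training job."""
    name: str
    nodes: int
    gpus_per_node: int
    command: str


def render_sbatch_script(job: TrainingJob) -> str:
    """Render a Slurm batch script from standard sbatch directives."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --nodes={job.nodes}",
        f"#SBATCH --gpus-per-node={job.gpus_per_node}",
        "#SBATCH --ntasks-per-node=1",
        "#SBATCH --output=%x-%j.out",  # %x expands to the job name, %j to the job ID
        "",
        f"srun {job.command}",  # srun starts the command on every allocated node
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed example values; adjust to the cluster you actually provisioned.
    job = TrainingJob(
        name="llm-pretrain",
        nodes=4,
        gpus_per_node=8,
        command="python train.py",
    )
    print(render_sbatch_script(job), end="")
```

Saving the rendered text to a file and passing it to `sbatch` would queue the job on a Slurm login node.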

How it works

A shared root filesystem provides a single file environment for all nodes of the cluster, which simplifies package management and cluster scaling.


Open source solution

At Nebius, we believe that better technologies can only be created together. That’s why we made Soperator open source, giving ML enthusiasts and HPC practitioners the opportunity to use this technology in their own work and improve it according to their needs.

Slurm-on-Kubernetes solutions by Nebius

| | Managed Soperator | Professional Soperator | Soperator |
|---|---|---|---|
| Solution | Slurm-based clusters | Slurm-based clusters | Kubernetes operator for Slurm |
| Delivery model | Self-service app | Professional service | Open-source software |
| Cloud environment | Nebius | Nebius | Cloud agnostic |
| Pre-installed AI/ML drivers and libraries | Yes | Yes | Yes |
| All types of containers supported | Yes | Yes | Yes |
| Passive health checks | Coming soon | Yes | No |
| Active health checks | Coming soon | Yes | No |
| Topology-aware job scheduling | Coming soon | Yes | No |
| Auto-healing mechanism | Coming soon | Yes | On Nebius cloud only |
| Free software, consumption-based pricing | Yes | Yes | Yes |

Getting started

Sign up for the console to get started immediately with Managed Soperator, or contact our team for Professional Soperator, where we handle the complete installation and configuration.

Questions and answers

Slurm is an open source, fault-tolerant and highly scalable cluster management and job scheduling system for large and small Linux clusters.

SchedMD

By partnering directly with SchedMD, the developer of the Slurm Workload Manager, Nebius provides exceptional support to Slurm users. SchedMD's robust Slurm workload manager streamlines job scheduling and resource allocation. Its scalability and reliability make it a versatile solution that can meet a variety of business needs.

Slurm is a registered trademark of SchedMD LLC.