Announcing the integration between Nebius and dstack
The increasing adoption of AI cloud solutions highlights the need for efficient, vendor-independent orchestration tools. dstack provides an open-source, AI-focused orchestration platform that emphasizes flexibility and ease of use for AI development. We are announcing an integration with dstack to enhance the developer experience for professional ML teams. You can now choose Nebius in dstack and start managing dev environments, executing training jobs and deploying models on our AI infrastructure.
April 10, 2025
5 min read
Traditional orchestration tools like Kubernetes and Slurm present challenges for ML teams. Kubernetes, with its low-level interface, can be difficult for AI workflows beyond inference, such as development and training, which often require custom setups. Slurm, optimized for training workloads, lacks support for other essential ML tasks, including dev environment management, cluster management and inference deployment.
dstack is an open-source container orchestration platform that bridges the gap between Kubernetes and Slurm, designed for ML teams to manage GPU workloads across GPU clouds and on-premises data centers.
A dev environment lets you provision an instance and access it with your desktop IDE.
Create the following configuration file inside the repo:
type: dev-environment
name: vscode

# If `image` is not specified, dstack uses its default image
python: "3.11"
#image: dstackai/base:py3.13-0.7-cuda-12.1

ide: vscode

resources:
  gpu: L40S
Apply the configuration by using dstack apply:
$ dstack apply -f .dstack.yml
# BACKEND REGION RESOURCES SPOT PRICE
1 nebius eu-north1 8xCPU, 32GB, 1xL40S no $1.5484
2 nebius eu-north1 16xCPU, 64GB, 1xL40S no $1.7468
3 nebius eu-north1 16xCPU, 96GB, 1xL40S no $1.8172
Submit the run vscode? [y/n]: y
Launching `vscode`...
████████████████████████████████████████ 100%
To open in VS Code Desktop, use this link:
vscode://vscode-remote/ssh-remote+vscode/workflow
Click the link to access the dev environment from your desktop IDE.
Now, imagine you’d like to run training on either a cluster or a single node. Below is an example of a multi-node task:
type: task
# The name is optional, if not specified, generated randomly
name: train-distrib

# The size of the cluster
nodes: 2

python: "3.12"

# Commands to run on each node
commands:
  - git clone https://github.com/pytorch/examples.git
  - cd examples/distributed/ddp-tutorial-series
  - pip install -r requirements.txt
  - torchrun
    --nproc-per-node=$DSTACK_GPUS_PER_NODE
    --node-rank=$DSTACK_NODE_RANK
    --nnodes=$DSTACK_NODES_NUM
    --master-addr=$DSTACK_MASTER_NODE_IP
    --master-port=12345
    multinode.py 50 10

resources:
  gpu: L40S
  shm_size: 16GB
If you apply this configuration, dstack automatically provisions the cluster and runs the commands on each node, propagating system environment variables such as DSTACK_MASTER_NODE_IP, DSTACK_NODE_RANK and DSTACK_GPUS_PER_NODE.
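To illustrate how a training launcher consumes these variables, here is a minimal sketch in plain Python. The fallback defaults are hypothetical, only there so the snippet runs outside a dstack cluster; inside a run, dstack injects the real values on every node.

```python
import os

# dstack sets these on every node of the cluster; the defaults below
# are hypothetical fallbacks for running this sketch on a single machine.
master_addr = os.environ.get("DSTACK_MASTER_NODE_IP", "127.0.0.1")
node_rank = int(os.environ.get("DSTACK_NODE_RANK", "0"))
nnodes = int(os.environ.get("DSTACK_NODES_NUM", "1"))
gpus_per_node = int(os.environ.get("DSTACK_GPUS_PER_NODE", "1"))

# torchrun derives the world size the same way: nodes x processes per node
world_size = nnodes * gpus_per_node
print(f"node {node_rank}/{nnodes}: rendezvous at {master_addr}:12345, "
      f"world size {world_size}")
```

This is exactly what the `torchrun` flags in the task above express: the master node's address for rendezvous, this node's rank, and the total process count.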
The examples above are just two of the many features dstack provides. Others include running services, managing fleets and volumes, and more.
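For instance, a service turns a run into a model endpoint behind an HTTP port. The sketch below follows dstack's service configuration format; the run name, model and serving command are illustrative assumptions, not taken from this article:

```yaml
type: service
# Illustrative name
name: llama-serve

python: "3.11"

# Hypothetical serving command; any HTTP server listening on `port` works
commands:
  - pip install vllm
  - vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 8000
port: 8000

resources:
  gpu: L40S
```

Applied the same way with `dstack apply`, this provisions a GPU instance on Nebius and exposes the server on the configured port.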