Announcing the integration between Nebius and dstack

The increasing adoption of AI cloud solutions highlights the need for efficient, vendor-independent orchestration tools. dstack provides an open-source, AI-focused orchestration platform that emphasizes flexibility and ease of use for AI development. We are announcing an integration with dstack to improve the developer experience for professional ML teams: you can now select Nebius as a backend in dstack and start managing dev environments, running training jobs and deploying models on our AI infrastructure.

Traditional orchestration tools like Kubernetes and Slurm present challenges for ML teams. Kubernetes, with its low-level interface, can be difficult for AI workflows beyond inference, such as development and training, which often require custom setups. Slurm, optimized for training workloads, lacks support for other essential ML tasks, including dev environment management, cluster management and inference deployment.

dstack is an open-source container orchestration platform that bridges the gap between Kubernetes and Slurm, designed for ML teams to manage GPU workloads across GPU clouds and on-premises data centers.

Getting started with dstack and Nebius

Setting up the server

To use dstack with Nebius, configure your backend:

  1. Log in to your Nebius AI Cloud account.
  2. Navigate to Access and then select Service Accounts.
  3. Create a new service account, assign it to the editors group and then upload an authorized key.

Then, configure the backend via ~/.dstack/server/config.yml:

projects:
  - name: main
    backends:
      - type: nebius
        creds:
          type: service_account
          service_account_id: serviceaccount-e002dwnbz81sbvg2bs
          public_key_id: publickey-e00fciu5rkoteyzo69
          private_key_file: ~/path/to/key.pem

Proceed with installing and starting the dstack server:

$ pip install "dstack[nebius]"
$ dstack server

Once the server is up, go ahead and initialize a project repo:

$ mkdir quickstart && cd quickstart
$ dstack init

Now you can use the dstack CLI to manage dev environments, tasks, services and fleets.
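For example, fleets let you provision a pool of instances up front that later runs can reuse. A minimal fleet configuration might look like the following sketch (the name and resource values here are illustrative, not prescribed by dstack):

type: fleet
name: my-fleet

# The number of instances in the fleet
nodes: 2

resources:
  gpu: L40S

Applying this file with dstack apply creates the instances ahead of time, so subsequent runs can start faster on the pre-provisioned pool.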

Running a dev environment

A dev environment lets you provision an instance and access it with your desktop IDE.

Create the following configuration file inside the repo, saving it as .dstack.yml:

type: dev-environment
name: vscode

# If `image` is not specified, dstack uses its default image
python: "3.11"
# image: dstackai/base:py3.13-0.7-cuda-12.1

ide: vscode

resources:
  gpu: L40S

Apply the configuration with dstack apply:

$ dstack apply -f .dstack.yml

 #  BACKEND  REGION     RESOURCES             SPOT  PRICE
 1  nebius   eu-north1  8xCPU, 32GB, 1xL40S   no    $1.5484
 2  nebius   eu-north1  16xCPU, 64GB, 1xL40S  no    $1.7468
 3  nebius   eu-north1  16xCPU, 96GB, 1xL40S  no    $1.8172

Submit the run vscode? [y/n]: y

Launching `vscode`...
████████████████████████████████████████ 100%

To open in VS Code Desktop, use this link:
 vscode://vscode-remote/ssh-remote+vscode/workflow

Click the link to open the dev environment in your desktop IDE.
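Once you no longer need the environment, you can stop the run by its name (the name field from the configuration):

$ dstack stop vscode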

Running a task

Now, imagine you’d like to run training on either a cluster or a single node. Below is an example of a multi-node task:

type: task
# The name is optional; if not specified, it's generated randomly
name: train-distrib

# The size of the cluster
nodes: 2

python: "3.12"

# Commands to run on each node
commands:
  - git clone https://github.com/pytorch/examples.git
  - cd examples/distributed/ddp-tutorial-series
  - pip install -r requirements.txt
  - torchrun
    --nproc-per-node=$DSTACK_GPUS_PER_NODE
    --node-rank=$DSTACK_NODE_RANK
    --nnodes=$DSTACK_NODES_NUM
    --master-addr=$DSTACK_MASTER_NODE_IP
    --master-port=12345
    multinode.py 50 10

resources:
  gpu: L40S
  shm_size: 16GB

If you apply it, dstack will automatically set up the cluster and run the commands on each node, propagating system environment variables such as DSTACK_MASTER_NODE_IP, DSTACK_NODE_RANK and DSTACK_GPUS_PER_NODE. To run the same training on a single node, set nodes to 1 or simply omit it.

The examples above cover just two of dstack's capabilities; it also supports running services, managing fleets, volumes and more.
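As a taste of services, here is a minimal sketch of a configuration that deploys a model behind an endpoint with vLLM (the run name, model and port are illustrative, not fixed by dstack):

type: service
name: llm-service

python: "3.12"

# Commands to start the model server
commands:
  - pip install vllm
  - vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 4096

# The port the server listens on
port: 8000

resources:
  gpu: L40S

Applying it with dstack apply provisions an instance and exposes the endpoint through the dstack server or a configured gateway.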

Check out dstack’s documentation for details.
