Managed SkyPilot API Server on Nebius AI Cloud: Technical overview and setup

Since our initial SkyPilot integration in April 2025, developers have been able to deploy AI workloads seamlessly on Nebius infrastructure through SkyPilot’s unified interface. This integration has enabled teams to provision GPU instances with simple YAML configs by specifying resources like accelerators: H100:8 and letting SkyPilot handle the infrastructure provisioning. Teams could also mount Nebius object storage buckets directly to compute instances by using file_mounts, while leveraging high-performance interconnect for distributed training workloads.
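For reference, a task of this kind can be captured in a short SkyPilot YAML file. The sketch below is illustrative only: the bucket name, mount path and training script are placeholders, not values from a real deployment.

```yaml
# task.yaml -- illustrative SkyPilot task (bucket name and script are placeholders)
resources:
  infra: nebius        # 'infra' in recent SkyPilot versions; older versions use 'cloud'
  accelerators: H100:8

file_mounts:
  # Mount a Nebius object storage bucket onto the instance
  /datasets: nebius://my-training-data

run: |
  python train.py --data-dir /datasets
```

Launching is then a single command: sky launch task.yaml.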

However, this initial integration relied on SkyPilot’s local API server model — where each developer ran SkyPilot commands from their individual machines. While this approach worked well for individual users, it created several limitations for teams.

Limitations of the local server

The local server approach created some challenges for teams. Workloads were tied to the developer’s laptop, meaning if their machine shut down or lost connectivity, they couldn’t monitor or manage their jobs. Each team member operated in isolation, with no resource sharing, which led to inefficient resource utilization and duplicate infrastructure provisioning. Teams also lacked centralized visibility, with no unified view of running workloads, which made it difficult to coordinate usage and optimize costs.

These limitations became increasingly problematic as teams scaled their AI workloads and needed more sophisticated infrastructure management.

The solution: Managed SkyPilot API Server

Today, we’re excited to announce the launch of Managed SkyPilot API Server on Nebius AI Cloud — a fully managed solution that transforms SkyPilot from a single-user tool into a scalable, multi-user platform, while eliminating operational overhead entirely.



Benefits of centralized AI Infrastructure

The centralized API server approach changes how both individuals and teams work with AI infrastructure, delivering immediate value across multiple areas.

Productivity

The server enables asynchronous execution, letting you launch multiple jobs without blocking your terminal to submit hyperparameter sweeps or experiment batches in parallel. With multi-device access, you can start a training job from your office and monitor it from home simply by connecting to the same API server. Production workflow integration becomes seamless with orchestrators like Apache Airflow, while SkyPilot manages your diverse infrastructure.
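As a concrete sketch, a small hyperparameter sweep can be submitted without blocking the terminal by detaching from each launch. Cluster names, flags and the training script below are illustrative and may vary by SkyPilot version.

```shell
# Submit three runs in parallel; -d detaches from log streaming after launch
for lr in 0.0001 0.0003 0.001; do
  # ${lr//./-} replaces dots so the value is a valid cluster name
  sky launch -d -c "sweep-lr-${lr//./-}" --gpus H100:1 -- python train.py --lr "${lr}"
done

# Later, from another machine connected to the same API server:
sky status                 # all three clusters are visible
sky logs sweep-lr-0-0001   # stream logs for one run
```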

Collaboration and efficiency

Centralized deployment means you set up once, and then team members onboard by using a single endpoint, without distributing credentials or managing individual installations. Teams benefit from resource sharing where they can share GPU clusters, monitor each other’s jobs and coordinate resource usage across the organization. When one team member spins up an expensive multi-GPU cluster, others can submit jobs to it. This enables true collaboration where teams share configurations, coordinate experiments and maintain institutional knowledge.
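For example, once a colleague’s cluster is visible through the shared endpoint, other users can queue work on it with sky exec. The cluster name and script here are hypothetical.

```shell
# Queue an evaluation job on a teammate's running H100 cluster
sky exec shared-h100-cluster --gpus H100:1 -- python eval.py

# Follow the job's output
sky logs shared-h100-cluster
```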

Operational excellence

The platform provides unified visibility through a single view for all running clusters, jobs and services across your entire infrastructure. Job observability allows you to monitor training progress, view logs and track resource utilization across all team workloads. This centralized visibility helps identify idle resources and optimize cloud spending. The cloud-native deployment ensures fault tolerance, making sure your workloads survive infrastructure changes.
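From the CLI, this unified view corresponds to a handful of commands, shown here as a sketch; exact flags depend on your SkyPilot version.

```shell
sky status -u      # clusters across all users on the server
sky jobs queue     # managed jobs submitted through the server
sky serve status   # deployed services
```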

Why choose managed over self-hosted?

Managing a production-grade deployment involves significant operational complexity that can distract teams from their core AI research and development.

Self-hosting challenges

  • Kubernetes expertise required: Setting up persistent storage, configuring ingress controllers, managing SSL certificates.
  • Ongoing maintenance: Monitoring server health, handling upgrades, managing authentication.
  • Database management overhead: Maintaining a separate PostgreSQL database for job state, handling backups, ensuring high availability and managing schema migrations.
  • Infrastructure overhead: Provisioning and maintaining the underlying Kubernetes cluster.

Advantages of the Nebius-managed solution

  • One-click deployment: Provision your SkyPilot API server directly from our console.
  • Zero operational overhead: We handle all infrastructure management, monitoring and maintenance.
  • High availability: Fault-tolerant deployment with automatic failover and recovery, with state backed by Nebius' Managed Service for PostgreSQL.

This managed approach is designed specifically for small to mid-size ML teams who want to focus on AI innovation rather than infrastructure operations. While larger enterprises often have dedicated DevOps teams and require extensive customization, most ML teams simply want a reliable, shared platform that works out of the box.

Deploying your server

Ready to eliminate infrastructure complexity and unlock centralized AI workload management? Managed SkyPilot API Server is available through the Nebius console. You can deploy your API server in minutes through the web UI, share a single endpoint URL for instant team access and start launching jobs and sharing resources right away.

Current capabilities and roadmap

The current release supports two compute infrastructure options:

  • Nebius Platform API: Direct access to high-performance interconnected GPU clusters
  • Nebius Managed Kubernetes Service: Your existing Kubernetes clusters are automatically discovered and available for workloads

Coming soon: Support for additional cloud providers (AWS, GCP, Azure) will be added, preserving SkyPilot’s core value proposition of multi-cloud flexibility and failover capabilities, as well as Nebius' design principle of ecosystem interoperability to avoid lock-in. SSO authentication support will also be available for seamless enterprise integration.

Technical implementation

Connecting to your API server

Once the SkyPilot API server is provisioned, you can connect to the server:

$ export SKY_USERNAME='<replace with your username>'
$ export SKY_PASSWORD='<replace with your password>'
$ sky api login -e "https://${SKY_USERNAME}:${SKY_PASSWORD}@public-...nebius.cloud"

Configuring the SkyPilot API server

After provisioning, you’ll have access to the SkyPilot API server web UI:

The admin user needs to configure the SkyPilot API server by specifying the project ID and, optionally, IB fabric filesystem IDs. Access the configuration through the button in the top right corner of the web UI and update it with your Nebius settings:



nebius:
  region_configs:
    us-central1:
      # replace with your project_id
      project_id: project-u00w5qcypr00qq5k6de40z
      fabric: us-central1-a
      filesystems:
        # replace with your filesystem_id
        - filesystem_id: computefilesystem-u00ka4d0q8959st67e
          mount_path: /mnt/data
          attach_mode: READ_WRITE

This configuration allows you to:

  • Specify which project ID to use in each region.
  • Optionally, configure shared filesystems for your compute instances (highly recommended).

Nebius shared filesystems provide high-performance storage that automatically mounts to your SkyPilot instances. This enables efficient multi-node data access for training datasets and model checkpointing without data duplication across instances.
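With a filesystem mounted at /mnt/data as in the configuration above, every node in a multi-node task sees the same storage, so datasets and checkpoints need no per-instance copies. The following is a hedged sketch: the script and flags are placeholders, and torchrun rendezvous flags are omitted for brevity.

```yaml
resources:
  infra: nebius/us-central1
  accelerators: H100:8
num_nodes: 2

run: |
  # /mnt/data is the shared filesystem from the region config above
  torchrun --nnodes 2 --nproc-per-node 8 train.py \
    --data /mnt/data/datasets \
    --checkpoint-dir /mnt/data/checkpoints
```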

Working with Kubernetes clusters

Nebius Managed Kubernetes clusters are automatically discovered by the SkyPilot API server. You can verify available clusters:

$ sky check k8s
Checking credentials to enable infra for SkyPilot.
  Kubernetes: enabled [compute]
    Allowed contexts:
    └── nebius-mk8s-my-cluster-1: enabled.
    └── nebius-mk8s-my-cluster-2: enabled.
    ...

GPU workload requirements: To run GPU workloads on these Kubernetes clusters, ensure they are properly configured with GPU drivers (for example, via the NVIDIA GPU Operator), the NVIDIA device plugin and network operators (for interconnect support in distributed training).
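One way to sanity-check these prerequisites is to confirm that nodes advertise GPU resources and that the operator pods are healthy. The namespace below is an assumption; it depends on how the GPU operator was installed.

```shell
# Nodes report nvidia.com/gpu capacity once the device plugin is running
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu'

# GPU operator components (namespace may differ in your installation)
kubectl get pods -n gpu-operator
```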

Advanced team management

Once the admin is connected to the API server dashboard, they’ll be able to create other users, control their permissions and organize them into teams and workspaces. A newly created user can immediately start launching GPU clusters:

# Example: User launches a test job
$ sky launch -c my-cluster --gpus=H200:1 --infra nebius/us-central1 -- nvidia-smi
Command to run: nvidia-smi
Considered resources (1 node):
--------------------------------------------------------------------------------------------------------
INFRA                  INSTANCE                         vCPUs   Mem(GB)   GPUS     COST ($)   CHOSEN
--------------------------------------------------------------------------------------------------------
Nebius (us-central1)   gpu-h200-sxm_1gpu-16vcpu-200gb   16      200       H200:1   3.50          ✔
--------------------------------------------------------------------------------------------------------
Launching a new cluster 'my-cluster'. Proceed? [Y/n]: Y
...
(sky-cmd, pid=4095) Wed Sep  3 14:16:04 2025
(sky-cmd, pid=4095) +-----------------------------------------------------------------------------------------+
(sky-cmd, pid=4095) | NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
(sky-cmd, pid=4095) |-----------------------------------------+------------------------+----------------------+
(sky-cmd, pid=4095) | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
(sky-cmd, pid=4095) | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
(sky-cmd, pid=4095) |                                         |                        |               MIG M. |
(sky-cmd, pid=4095) |=========================================+========================+======================|
(sky-cmd, pid=4095) |   0  NVIDIA H200                    On  |   00000000:8D:00.0 Off |                    0 |
(sky-cmd, pid=4095) | N/A   31C    P0             75W /  700W |       0MiB / 143771MiB |      0%      Default |
(sky-cmd, pid=4095) |                                         |                        |             Disabled |
(sky-cmd, pid=4095) +-----------------------------------------+------------------------+----------------------+
(sky-cmd, pid=4095)
(sky-cmd, pid=4095) +-----------------------------------------------------------------------------------------+
(sky-cmd, pid=4095) | Processes:                                                                              |
(sky-cmd, pid=4095) |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
(sky-cmd, pid=4095) |        ID   ID                                                               Usage      |
(sky-cmd, pid=4095) |=========================================================================================|
(sky-cmd, pid=4095) |  No running processes found                                                             |
(sky-cmd, pid=4095) +-----------------------------------------------------------------------------------------+
✓ Job finished (status: SUCCEEDED).

# Alternatively, you can run the same job on your Managed K8s cluster:
# $ sky launch -c my-cluster --gpus=H200:1 --infra k8s/nebius-mk8s-my-cluster-1 -- nvidia-smi

The workspaces feature provides powerful multi-team management capabilities:

  • Resource isolation: Separate infrastructure configurations and track usage by team or project.
  • Flexible access control: Create private workspaces with specific user permissions.
  • Multi-tenant management: Manage complex cloud computing environments with minimal administrative overhead.



Transform your AI infrastructure today

Managed SkyPilot API Server represents our commitment to making enterprise-grade AI infrastructure accessible to teams of all sizes. By eliminating the operational barriers that have historically prevented teams from adopting centralized AI infrastructure management, we’re enabling ML teams to get more done in less time.

Explore Nebius AI Cloud

Explore Nebius AI Studio
