Fault-tolerant training: How we build reliable clusters for distributed AI workloads
When you launch a job, you expect it to run without interruption. That expectation holds across many domains, but it resonates especially with machine learning engineers running large-scale pre-training jobs. Maintaining a stable training environment is crucial for delivering AI results on schedule and within budget.
Advancing confidential computing and cross-border bioinformatics at the ELIXIR BioHackathon
Open-source, global-scale infrastructure is crucial to accelerating scientific innovation. Developed at the ELIXIR BioHackathon and enabled by Nebius, the BioHackCloud project is designed as a reference implementation of a standards-based, federated multi-cloud platform. It has already demonstrated an Attested TLS proof of concept for private LLM inference and confidential task execution.
Introducing self-service NVIDIA Blackwell GPUs in Nebius AI Cloud
NVIDIA HGX B200 instances are now publicly available as self-service AI clusters in Nebius AI Cloud. This means anyone can access NVIDIA Blackwell — the latest generation of NVIDIA’s accelerated computing platform — with just a few clicks and a credit card.
Nebius monthly digest: July 2025
Our July digest describes how customers such as Stanford and Shopify use flexible compute capacity, along with the steps we are taking to boost performance. Nebius also marked its first anniversary in July, celebrated by Nasdaq in Times Square, and there was news from across the ocean as well.
Slurm Workload Manager: The go-to scheduler for HPC and AI workloads
Slurm Workload Manager is a cornerstone of high-performance computing (HPC) infrastructure, trusted by supercomputing centers worldwide for its scalability and flexibility. As AI workloads grow in size and complexity, Slurm is gaining traction among ML teams as well. In this article, we will look at why it remains relevant, how it supports GPU clusters and what to consider when using it in AI workflows.
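As a quick illustration of the submission workflow the article covers, here is a minimal sketch that submits a GPU training job to Slurm from Python. The batch script contents are illustrative: resource requests such as `--gres=gpu:8` and the `train.py` entry point will vary by cluster and workload.

```python
import subprocess

# A minimal Slurm batch script requesting GPUs. Node counts, GPU
# counts and the training entry point are illustrative placeholders.
JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --time=04:00:00
srun python train.py
"""

# sbatch reads a batch script from stdin when no file argument is given.
result = subprocess.run(
    ["sbatch"], input=JOB_SCRIPT, text=True,
    capture_output=True, check=True,
)
print(result.stdout.strip())  # e.g. "Submitted batch job 12345"
```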
Introducing Nebius MCP Server: The LLM-native way to manage your AI Cloud
Skip the CLI commands and web console clicks — just ask Claude about your cloud infrastructure. Today, we’re excited to announce the Nebius MCP Server, an integration that connects Anthropic’s Claude and other AI chatbots to Nebius AI Cloud.
Q2 2025: Nebius AI Cloud updates
Welcome back to our quarterly roundup featuring all the improvements and updates delivered on Nebius AI Cloud over the past three months. This overview highlights the key developments we’ve made to enhance your AI infrastructure experience.
Agent 101: Launching production-grade agents at scale
To go from prototype to production, AI agents need more than just a good model. In this guide, we break down the four components that matter most: reliable LLMs, orchestration frameworks, evaluation tools, and memory systems. We cover how teams are using Nebius AI Studio with CrewAI, ADK, LangChain, and more to ship scalable, observability-friendly agent workflows, all powered by fast, cost-efficient inference.
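To make the “reliable LLMs” component concrete: frameworks such as CrewAI and LangChain ultimately call a model endpoint, and Nebius AI Studio exposes an OpenAI-compatible API for this. A minimal sketch follows; the base URL and model ID are illustrative, so check your AI Studio console for the current values.

```python
import os
from openai import OpenAI  # pip install openai

# Nebius AI Studio is OpenAI API-compatible; base URL and model ID
# below are illustrative assumptions, not authoritative values.
client = OpenAI(
    base_url="https://api.studio.nebius.com/v1/",
    api_key=os.environ["NEBIUS_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative model ID
    messages=[{"role": "user", "content": "Summarize today's tickets."}],
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the OpenAI protocol, the same client object can usually be passed straight into an agent framework’s LLM wrapper instead of being called directly.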
Nebius monthly digest: June 2025
Our standout moment in June was NVIDIA GTC Paris, where Nebius was recognized multiple times. At the conference, we announced the opening of our UK cloud region and introduced integrations with NVIDIA DGX Cloud Lepton, the NVIDIA AI Enterprise stack and other systems. Our ISEG2 supercomputer also ranked #13 on the Top500 list; we hosted the AI Discovery Award and launched Managed Soperator for Slurm-based workloads. Read about these and other updates in today’s digest.
Nebius AI Studio Q2 2025 updates
We kicked off Q2 with a single mission: turn raw compute horsepower into concrete business outcomes. Nebius AI Studio shipped a significant set of updates this quarter, including groundbreaking models, streamlined fine-tuning, scalable throughput and seamless integrations.
From genome analysis to quantum chemistry: Nebius powers the next generation of biotech research with NVIDIA
As part of NVIDIA GTC Paris at VivaTech, Nebius announced deeper integration of the NVIDIA AI Enterprise software suite. This includes NVIDIA BioNeMo, a collection of tools, applications, generative AI solutions, and pre-trained microservices (NVIDIA NIM) designed specifically for the biopharma sector.
What is object storage: Key differences from traditional storage explained
Learn the fundamentals of object storage, how it differs from traditional block storage and why it is becoming the go-to choice for modern data management. With data volumes exploding and cloud deployments becoming the norm, traditional storage systems are struggling to keep up. That’s where object storage comes in: built for the cloud, made to scale and ready for anything. This guide cuts through the jargon to show you why object storage is the future and how it outperforms block storage where it counts.
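To ground the comparison: object storage is typically accessed over an S3-compatible HTTP API, addressing data by bucket and key rather than through a mounted filesystem the way block storage is. A minimal sketch, assuming an S3-compatible endpoint (the endpoint URL, bucket and key below are placeholders):

```python
import boto3  # pip install boto3

# Object storage speaks an HTTP API; the endpoint URL is a placeholder
# for whatever S3-compatible service you use.
s3 = boto3.client("s3", endpoint_url="https://storage.example.com")

# Each object lives flat under a key, with its own metadata, instead of
# as a file inside a hierarchical filesystem.
with open("shard-000.parquet", "rb") as f:
    s3.put_object(
        Bucket="my-bucket",
        Key="datasets/train/shard-000.parquet",
        Body=f,
        Metadata={"source": "preprocessing-v2"},
    )

obj = s3.get_object(Bucket="my-bucket", Key="datasets/train/shard-000.parquet")
data = obj["Body"].read()
```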
Introducing Managed Soperator: Your quick access to Slurm training
Managed Soperator, our fully managed Slurm-on-Kubernetes solution, is now available to everyone in self-service. It provides a ready-to-work Slurm training cluster, powered by modern hardware with all the necessary libraries and drivers pre-installed, so you can start ML training immediately.
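Once the cluster is provisioned, a first sanity check might look like the sketch below, run from the login node. This assumes standard Slurm tooling is on PATH; the commands and resource names are illustrative.

```python
import subprocess

# Illustrative first steps on a fresh Slurm cluster: list the nodes,
# then run nvidia-smi under srun to confirm GPU drivers are in place.
subprocess.run(["sinfo", "--Node", "--long"], check=True)
subprocess.run(["srun", "--gres=gpu:1", "nvidia-smi"], check=True)
```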