Fault-tolerant training: How we build reliable clusters for distributed AI workloads
When you start a job, you expect it to run without interruption. That expectation holds across many domains, but it resonates especially with machine learning engineers launching large-scale pre-training jobs. A stable training environment is crucial for delivering AI results on schedule and within budget.
Incident post-mortem analysis: us-central1 service disruption on September 3, 2025
A detailed analysis of the incident on September 3, 2025 that led to service outages in the us-central1 region.
The incident impacted API operations and Console functionality due to persistent routing loops between network domains, while other regions remained operational.
What is Jupyter Notebook in the context of AI
Jupyter Notebook is a browser-based tool for interactive coding, data exploration and documentation. It lets you run code step by step while combining results, visualizations and explanations in one place. Widely used in machine learning, it speeds up experimentation, supports reproducibility and makes collaboration easier. This article looks at how Jupyter supports ML workflows, its key features and the tasks it handles best.
Nebius proves bare-metal-class performance for AI inference workloads in MLPerf® Inference v5.1
Today, we’re happy to share our new performance milestone — the latest submission of MLPerf® Inference v5.1 benchmarks, where Nebius achieved leading performance results for three AI systems built on the most in-demand NVIDIA platforms on the market: GB200 NVL72, HGX B200 and HGX H200.
Nebius monthly digest: August 2025
In August, we introduced self-service NVIDIA Blackwell GPUs in Nebius AI Cloud and published several in-depth technical articles, including ones on cluster reliability and liquid cooling. We also continued to cover customer success — all this and more in the latest digest.
Advancing confidential computing and cross-border bioinformatics at the ELIXIR BioHackathon
Open-source, global-scale infrastructure is crucial to accelerate scientific innovation. Developed in the ELIXIR BioHackathon and enabled by Nebius, the BioHackCloud project is designed as a reference implementation for a standards-based, federated multi-cloud platform — having successfully demonstrated an Attested TLS proof of concept for private LLM inference and confidential task execution.
How many epochs do you need to train a model? Key considerations explained
Deciding how many epochs to train is a practical engineering choice: too few leads to underfitting, while too many waste compute and increase the risk of overfitting. This guide explains what an epoch means in day-to-day training, how to recognize when you’ve trained enough, and which factors determine a sensible stopping point.
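The stopping-point logic described above is often implemented as early stopping: halt once validation loss stops improving. A minimal sketch (the `patience` value and loss curve below are hypothetical, for illustration only):

```python
def should_stop(val_losses, patience=3):
    """Stop when validation loss hasn't improved for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    # No epoch in the last `patience` beat the earlier best: time to stop.
    return min(val_losses[-patience:]) >= best_before

# Validation loss plateaus after epoch 4, so training should stop.
losses = [0.90, 0.70, 0.55, 0.50, 0.51, 0.52, 0.53]
print(should_stop(losses))  # True
```

Frameworks ship their own versions of this check (e.g. callbacks with a `patience` parameter), but the underlying decision rule is the same.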
What it takes to build a reasoning model
Modern language models are great at generating text: they complete phrases, answer questions and write code. But in scenarios that demand analytical thinking or step-by-step problem solving, producing a likely next word isn’t enough. What’s needed is a logically consistent answer that unfolds in a structured way. That’s where reasoning models come in.
Exploring cluster orchestration tools for AI
Cluster orchestration tools automatically coordinate the distribution of AI workloads across thousands of nodes in a cluster. They can scale clusters up or down to match usage and handle failure conditions without interrupting operations. This article explores various functions and types of cluster orchestration tools, including best practices that promote efficiency.
Introducing self-service NVIDIA Blackwell GPUs in Nebius AI Cloud
NVIDIA HGX B200 instances are now publicly available as self-service AI clusters in Nebius AI Cloud. This means anyone can access NVIDIA Blackwell — the latest generation of NVIDIA’s accelerated computing platform — with just a few clicks and a credit card.
Epochs vs iterations in machine learning: what’s the difference
When you’re training a machine learning model, you’ll often hear terms like epochs, iterations and batches. They’re sometimes used interchangeably, but each one refers to something different. Knowing the difference helps you train your model more effectively.
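The relationship between the three terms comes down to simple arithmetic: one iteration processes one batch, and one epoch is enough iterations to cover the whole dataset. A quick sketch (the dataset and batch sizes are made-up numbers):

```python
import math

dataset_size = 10_000   # total training examples (hypothetical)
batch_size = 32         # examples processed per iteration
epochs = 5              # full passes over the dataset

# One epoch = enough batches to see every example once.
iterations_per_epoch = math.ceil(dataset_size / batch_size)
total_iterations = iterations_per_epoch * epochs

print(iterations_per_epoch)  # 313
print(total_iterations)      # 1565
```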
Nebius monthly digest: July 2025
Our July digest describes how customers like Stanford and Shopify use flexible compute capacity and the steps we are taking to boost performance. Nebius’ first anniversary also took place in July, marked by Nasdaq in Times Square, and there was news from across the ocean as well.
AI model components: key elements of a GenAI inference setup explained
In this article, we break down the inference pipeline — from weight loading and batching logic to routing, autoscaling and monitoring — and explore how engineering choices at each level affect end-to-end behavior in production.