Elevating the craft: Introducing the Inference Frontier Program
Today we’re introducing the Inference Frontier Program, a new builder-to-builder initiative dedicated to production inference systems. The program surfaces real architectures, optimizations and engineering tradeoffs from teams running large-scale inference in production.
NVIDIA Nemotron 3 Super now available on Nebius Token Factory
NVIDIA Nemotron 3 Super is now available on Nebius Token Factory, bringing a 120B hybrid MoE model optimized for multi-agent systems and complex reasoning workflows to production deployments. With long-context inference and OpenAI-compatible APIs, teams can run Nemotron 3 Super through dedicated GPU endpoints and autoscaling infrastructure without managing their own serving stack.
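Because the endpoints are OpenAI-compatible, calling the model is a matter of posting a standard chat-completions payload. The sketch below only constructs such a request; the base URL and model id are placeholders, not real Nebius Token Factory values.

```python
import json

# Placeholders -- substitute your actual Token Factory endpoint URL,
# deployed model name and API key. These are NOT real values.
BASE_URL = "https://example.invalid/v1"
MODEL = "nvidia/nemotron-3-super"

def build_chat_request(messages, max_tokens=512):
    """Build an OpenAI-compatible /chat/completions request payload."""
    return {
        "url": f"{BASE_URL}/chat/completions",
        "headers": {
            "Authorization": "Bearer $API_KEY",  # replace with a real key
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": MODEL,
            "messages": messages,
            "max_tokens": max_tokens,
        }),
    }

req = build_chat_request([{"role": "user", "content": "Summarize our plan."}])
```

Any OpenAI SDK or plain HTTP client can send this request unchanged, which is what lets existing applications switch to a dedicated endpoint without code changes.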
OpenClaw security: architecture and hardening guide
Self-hosted AI agents offer control and flexibility, but they also introduce real security risks. Incidents involving malicious ClawHub skills, exposed default ports and prompt-injection attacks show that running OpenClaw is not just an installation task, but an infrastructure decision. This guide explains OpenClaw’s architecture and maps real threats to concrete hardening controls, so teams can deploy it safely in production.
Meet SWE-rebench-V2: A multilingual, executable dataset for training Software Engineering Agents
We’re introducing SWE-rebench-V2, the next iteration of our large-scale dataset of reinforcement learning (RL) environments for training autonomous software engineering (SWE) agents.
Introducing Dedicated Endpoints and Custom Weights Hub in Nebius Token Factory
We are introducing Dedicated Endpoints and a Custom Weights Hub in Nebius Token Factory. You can now choose GPU type, define GPUs per replica, set scaling limits, select region and deploy your own model weights to isolated endpoints. Deployment becomes a defined, controllable part of your production architecture.
Nebius and Toloka to introduce an integration bringing human experts on demand to AI agents
Today, Nebius and Toloka are announcing plans to bring Tendem into the Nebius ecosystem. This integration further strengthens the Nebius AI stack, anchoring the raw intelligence of Token Factory and the autonomy of Tavily agentic search with a programmable layer of human reliability. Originally designed as the market’s pioneering hybrid human-AI agent, Tendem is now the first platform to embed vetted human experts directly into agentic workflows — making expert judgment callable via the Model Context Protocol (MCP), the emerging standard for AI tool integration.
Scaling efficient production-grade inference with NVIDIA Run:ai on Nebius
NVIDIA and Nebius ran joint benchmarks using NVIDIA Run:ai, the AI workload orchestration and optimization software platform, on the Nebius AI Cloud. The goal was simple: test whether fractional GPU allocation could improve efficiency and scalability for real-world inference workloads — without compromising performance.
Routing in LLM inference is the difference between scaling and stalling
When inference becomes distributed, routing strategy can determine whether a system scales or stalls. In this article, we examine a real agent-style workload where cache-aware routing in vLLM reduced average step time by nearly 50 percent and cut P95 latency from over a minute to under 20 seconds — with the same model, hardware and traffic.
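The intuition behind cache-aware routing is that requests sharing a long prompt prefix — such as an agent’s system prompt — should land on the same replica, so its KV cache is reused instead of recomputed. The sketch below is an illustrative toy router, not vLLM’s actual routing API; the replica names and prefix length are made up.

```python
import hashlib

class PrefixAwareRouter:
    """Toy cache-aware router: requests sharing a prompt prefix are
    pinned to the same replica so its KV cache can be reused.
    (Illustrative sketch only -- not vLLM's actual router.)"""

    def __init__(self, replicas, prefix_words=8):
        self.replicas = replicas
        self.prefix_words = prefix_words

    def route(self, prompt: str) -> str:
        # Hash the leading words of the prompt so every request that
        # shares this prefix (e.g. the same agent system prompt)
        # lands on one replica instead of scattering cache entries.
        prefix = " ".join(prompt.split()[: self.prefix_words])
        h = int(hashlib.sha256(prefix.encode()).hexdigest(), 16)
        return self.replicas[h % len(self.replicas)]

router = PrefixAwareRouter(["replica-0", "replica-1", "replica-2"])
a = router.route("SYSTEM: You are a coding agent. Now fix bug #1")
b = router.route("SYSTEM: You are a coding agent. Now fix bug #2")
# a == b: both requests share the system-prompt prefix,
# so they hit the same replica's warm KV cache.
```

A round-robin router would spread these two requests across replicas, forcing each one to recompute the shared prefix — which is exactly the overhead the article’s agent-style workload exposed.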
The energy behind AI: Why power efficiency matters
As AI adoption accelerates, energy use increasingly sets the boundaries of how far these systems can scale. Power availability, efficiency and infrastructure design are becoming practical constraints. This shift is prompting the industry to think about concrete ways to manage the energy footprint of AI systems by optimizing energy use and creating measurable efficiency gains. In our latest whitepaper, we explain how Nebius improves efficiency across the stack, from software engineering to hardware design and data center operations.
FinOps efficiency for AI workloads with FOCUS-compliant billing data
We recently introduced support for exporting billing data from Nebius AI Cloud in the FOCUS format. This update is a small but important step toward making cloud integration simpler and financial operations smoother for teams building and scaling AI workloads. At Nebius, we believe billing data should be easy to work with, easy to integrate and easy to trust — especially when AI infrastructure costs are a core part of your business model.
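Because FOCUS standardizes column names such as `ServiceName` and `BilledCost`, exported billing data can be aggregated with a few lines of generic code. The snippet below is a minimal sketch using column names from the FOCUS spec; the sample rows are illustrative, not real Nebius billing data.

```python
import csv
import io
from collections import defaultdict

# Illustrative FOCUS-format rows (column names per the FOCUS spec;
# the values are made up, not real Nebius billing data).
sample = """ServiceName,ChargePeriodStart,BilledCost,BillingCurrency
Compute,2025-06-01T00:00:00Z,120.50,USD
Compute,2025-06-02T00:00:00Z,98.25,USD
Object Storage,2025-06-01T00:00:00Z,14.10,USD
"""

def cost_by_service(focus_csv: str) -> dict:
    """Sum BilledCost per ServiceName from a FOCUS-format CSV export."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(focus_csv)):
        totals[row["ServiceName"]] += float(row["BilledCost"])
    return dict(totals)

totals = cost_by_service(sample)
```

The same code works against any FOCUS-compliant export, which is the point of the standard: FinOps tooling written once keeps working across providers.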
Why large MoE models break latency budgets and what speculative decoding changes in production systems
Large mixture-of-experts language models promise significant gains in model quality, but deploying them in real products often exposes hard latency limits. This article explains why MoE systems that look efficient in benchmarks struggle under production constraints, and how architectural decisions around routing, batching and serving determine whether latency budgets hold under worst-case inputs and real user behavior.
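At the core of speculative decoding is a simple verification rule: a cheap draft model proposes several tokens, and the large target model accepts each drafted token t with probability min(1, p_target(t) / p_draft(t)). The toy sketch below shows only this acceptance loop; it omits the corrected-resampling step performed on rejection, and the two-token "models" are stand-ins for real draft and target distributions.

```python
import random

def speculative_step(draft_probs, target_probs, drafted, rng):
    """One speculative-decoding verification pass (toy sketch).

    `drafted` holds tokens proposed by the cheap draft model; each
    token t is accepted with probability min(1, p_target/p_draft),
    the standard acceptance rule. Returns the accepted prefix.
    (The resampling of a corrected token on rejection is omitted.)"""
    accepted = []
    for t in drafted:
        p_draft = draft_probs[t]
        p_target = target_probs.get(t, 0.0)
        if rng.random() < min(1.0, p_target / p_draft):
            accepted.append(t)
        else:
            break  # first rejection ends the speculative run
    return accepted

rng = random.Random(0)
draft = {"a": 0.5, "b": 0.5}
agree = speculative_step(draft, {"a": 1.0}, ["a", "a", "a"], rng)
disagree = speculative_step(draft, {"b": 1.0}, ["a", "a", "a"], rng)
```

When draft and target agree, several tokens clear in one target forward pass, which is why speculative decoding can restore latency budgets for large MoE models whose per-token decode cost is otherwise prohibitive; when they disagree, throughput degrades toward plain decoding.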
OpenHands trajectories with Qwen3 Coder 480B
While reinforcement learning drives agents to state-of-the-art performance, rejection fine-tuning serves as a powerful baseline. Stemming from our extensive experiments with different models and scaffolding, we are sharing a dataset of 67k high-quality OpenHands trajectories from Qwen3 Coder 480B for research purposes. We also include two RFT checkpoints — Qwen3 Instruct 30B and 235B, achieving ~50% and ~60% on SWE-bench Verified, respectively.