
Nebius proves bare-metal-class performance for AI inference workloads in MLPerf® Inference v5.1
Today, we’re happy to share a new performance milestone. In our latest submission of MLPerf® Inference v5.1 benchmarks, Nebius achieved leading results for three AI systems built on the most in-demand NVIDIA platforms on the market: NVIDIA GB200 NVL72, HGX B200 and HGX H200.
Just this spring, we achieved top-tier performance numbers in the MLPerf® Training round by MLCommons®, confirming the smooth, scalable operation of Nebius clusters for multi-host, distributed training. The current submission demonstrates Nebius’ ability to deliver outstanding token throughput for inference workloads, with significant performance gains across all tested configurations.
MLPerf® is a peer-reviewed benchmark suite developed by MLCommons®.
This round of MLCommons® benchmarks reflects the continuous improvements our engineering team makes to deliver exceptional value to our customers and keep Nebius a leading choice among AI infrastructure providers.
Virtualized instances with bare-metal performance
This benchmarking submission gives us the opportunity to showcase our engineering expertise in building high-performance virtualized environments for modern generative AI workloads. We run our virtual GPU instances without virtualizing the NVIDIA GPUs or NVIDIA ConnectX InfiniBand adapters, which results in zero performance degradation and allows us to stay on par with the best industry benchmarks.
MLPerf® Inference v5.1 results, achieved on single-host installations, prove that Nebius AI Cloud delivers top-tier performance for inference of large foundation models such as Llama 2 70B and Llama 3.1 405B.
Raising the bar for NVIDIA GB200 NVL72 performance
With increased GPU memory, the NVIDIA GB200 NVL72 easily handles large foundation models. Equipped with NVFP4 precision, fifth-generation NVIDIA NVLink™, the NVLink Switch System and high-speed NVIDIA InfiniBand networking for unparalleled GPU-to-GPU interconnects, these systems are a great choice for distributed training and reasoning-model inference.
In MLPerf® Inference v5.1, Nebius AI Cloud set a new performance peak for NVIDIA GB200 NVL72 systems: compared to the best results of the previous round, our single-host installation achieved 6.7% and 14.2% performance increases running Llama 3.1 405B inference in the offline and server scenarios, respectively [1][2][3][4]. These two results also place Nebius first among MLPerf® Inference v5.1 submitters for this model on GB200 systems.
Table 1. Nebius sets new peak performance benchmarks for Llama 3.1 405B inference on NVIDIA GB200 NVL72
Compared to the previous generation of NVIDIA GPUs, the GB200 NVL72 demonstrated a significant leap in inference throughput: a single host with 4x Blackwell GPUs achieved a 54.8% performance gain over an H200 host powered by 8x GPUs, while a normalized per-chip comparison showed a 3.1x performance increase (Figure 1) [3][5].
Figure 1. Per-chip and per-host comparison of GB200 NVL72 and HGX H200 inference throughput [3][5]
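For readers who want to reproduce the normalization, here is a minimal sketch in Python. It relies only on the per-host gain and GPU counts stated above; the absolute tokens/s values live in the cited result tables [3][5].

```python
# Back-of-the-envelope check of the per-host vs. per-chip comparison above.
# Inputs are the figures quoted in the text, not the raw MLPerf tables.

h200_gpus_per_host = 8
gb200_gpus_per_host = 4
per_host_gain = 1.548  # GB200 host throughput / H200 host throughput (a 54.8% gain)

# The GB200 host reaches 1.548x the throughput with half as many GPUs,
# so each Blackwell chip does roughly 1.548 * (8 / 4) of the work of an H200 chip.
per_chip_gain = per_host_gain * (h200_gpus_per_host / gb200_gpus_per_host)
print(f"per-chip gain ~{per_chip_gain:.1f}x")  # ~3.1x
```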
Demonstrating best-in-class results for HGX platforms
The other NVIDIA Blackwell system Nebius presented in this submission, HGX B200, shows very promising performance for LLM inference, confirming the advantages of the new platform over the previous generation of air-cooled HGX systems.
Running Llama 3.1 405B inference on NVIDIA HGX B200, we recorded 1,660 tokens/s in the offline scenario and 1,280 tokens/s in the server scenario [6][7]. Compared to our NVIDIA HGX H200 results in this MLPerf® submission, that is a 3x increase in the offline scenario and a 4.3x increase in the server scenario [5][8].
Figure 2. Per-host comparison of NVIDIA HGX B200 and HGX H200 inference throughput for Llama 3.1 405B [5][6][7][8]
When it comes to the Llama 2 70B model, the difference between HGX B200 and HGX H200 is also substantial. The B200 system outperforms the H200 by 3x and 2.9x, delivering 101,611 and 101,246 tokens/s in the server and offline scenarios, respectively [9][10][11][12].
Figure 3. Per-host comparison of NVIDIA HGX B200 and HGX H200 inference throughput for Llama 2 70B [9][10][11][12]
For both models, Nebius demonstrated the best results in the server scenario, while showing parity with other submitters in the offline and interactive scenarios.
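As a rough sanity check of the Llama 2 70B factors quoted above, the small sketch below derives the approximate H200 throughput implied by the B200 numbers and the rounded speedups; the exact H200 tokens/s values are published in [9][10], so treat the derived figures as approximations only.

```python
# B200 throughputs are the values quoted in the text [11][12]; the H200
# throughputs below are *implied* by the rounded 3x / 2.9x factors,
# not copied from the result tables [9][10].
b200_tokens_per_s = {"server": 101_611, "offline": 101_246}
stated_speedup = {"server": 3.0, "offline": 2.9}

for scenario, b200 in b200_tokens_per_s.items():
    implied_h200 = b200 / stated_speedup[scenario]
    print(f"{scenario}: B200 {b200:,} tok/s -> implied H200 ~{implied_h200:,.0f} tok/s")
```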
Proving the efficient way to scale AI workloads
NVIDIA Hopper systems are an excellent choice for building cost-optimized inference environments for AI applications. Featuring increased GPU memory, a battle-tested software stack and affordable pricing, NVIDIA HGX H200 systems remain popular among customers using Nebius’ self-service and reserved capacity options.
In this submission, Nebius recorded an 11% performance increase for Llama 2 70B inference on HGX H200 systems, compared to the best results achieved on H100-powered systems in the MLPerf® Inference v5.0 round (Figure 4) [9][10][13][14].
Figure 4. Per-host comparison of NVIDIA HGX H200 and HGX H100 inference throughput for Llama 2 70B [9][10][13][14]
Conclusion
Researchers and engineers from MLCommons® do a great job of supplying the industry with standardized benchmarks to measure inference and training performance on modern AI platforms. These benchmarks help AI companies stay informed about the current state of AI infrastructure, while providing GPU vendors and neoclouds with a measurement system to evaluate and improve the quality of their products.
But synthetic benchmarks only reflect general trends in how AI infrastructure performance evolves. In real-world scenarios, each case is unique, with multiple dependencies and hidden factors that can hinder or accelerate system performance. That’s why we offer our customers the ability to thoroughly test and fine-tune AI clusters for their specific machine learning setups.
These submission results confirm Nebius’ ability to run AI workloads in highly efficient virtualized environments while delivering performance on par with bare metal installations. They reflect the company’s commitment to exceptional AI infrastructure, built on custom hardware, proprietary software, and energy-efficient data centers. This combination gives AI labs and enterprises supercomputer-level performance and reliability, coupled with the flexibility and simplicity of a hyperscaler.
[1] MLPerf® v5.0 Inference Closed Llama3.1-405B offline. Retrieved from mlcommons.org/benchmarks/inference-datacenter
[2] MLPerf® v5.0 Inference Closed Llama3.1-405B server. Retrieved from mlcommons.org/benchmarks/inference-datacenter
[3] MLPerf® v5.1 Inference Closed Llama3.1-405B offline. Retrieved from mlcommons.org/benchmarks/inference-datacenter
[4] MLPerf® v5.1 Inference Closed Llama3.1-405B server. Retrieved from mlcommons.org/benchmarks/inference-datacenter
[5] MLPerf® v5.1 Inference Closed Llama3.1-405B offline. Retrieved from mlcommons.org/benchmarks/inference-datacenter
[6] MLPerf® v5.1 Inference Closed Llama3.1-405B offline. Retrieved from mlcommons.org/benchmarks/inference-datacenter
[7] MLPerf® v5.1 Inference Closed Llama3.1-405B server. Retrieved from mlcommons.org/benchmarks/inference-datacenter
[8] MLPerf® v5.1 Inference Closed Llama3.1-405B server. Retrieved from mlcommons.org/benchmarks/inference-datacenter
[9] MLPerf® v5.1 Inference Closed Llama2-70B server. Retrieved from mlcommons.org/benchmarks/inference-datacenter
[10] MLPerf® v5.1 Inference Closed Llama2-70B offline. Retrieved from mlcommons.org/benchmarks/inference-datacenter
[11] MLPerf® v5.1 Inference Closed Llama2-70B server. Retrieved from mlcommons.org/benchmarks/inference-datacenter
[12] MLPerf® v5.1 Inference Closed Llama2-70B offline. Retrieved from mlcommons.org/benchmarks/inference-datacenter
[13] MLPerf® v5.0 Inference Closed Llama2-70B server. Retrieved from mlcommons.org/benchmarks/inference-datacenter
[14] MLPerf® v5.0 Inference Closed Llama2-70B offline. Retrieved from mlcommons.org/benchmarks/inference-datacenter