GPU clusters with NVIDIA Quantum-2 InfiniBand

NVIDIA Quantum-2 InfiniBand is a networking technology that connects GPU-equipped servers within a cluster. Use GPU clusters with an InfiniBand interconnect to accelerate your data-intensive ML workloads.

Get a multi-node infrastructure with up to 3.2 Tbit/s of per-host networking performance.

More about Nebius GPU clusters

High-speed data transfer

InfiniBand provides the ultra-high data transfer speeds essential for ML workflows that require rapid processing and analysis of large volumes of data.

Scalability

InfiniBand supports large-scale clustering and parallel processing, which makes it well suited to complex and resource-intensive algorithms.

Reliability

High reliability and fault tolerance provided by InfiniBand ensure that ML workflows run smoothly with minimal downtime.

Low latency

InfiniBand offers the low-latency communication that real-time ML applications need for quick decision-making and fast response times.

Cost-effectiveness

InfiniBand can be a cost-effective solution for engineers who need high-performance networking capabilities without investing in expensive proprietary hardware.

Training-optimized configuration

We provide GPU clusters with NVIDIA® Hopper H100 SXM GPUs.

Each host has 8 GPUs. Each GPU has a connection of up to 400 Gbit/s, which adds up to as much as 3.2 Tbit/s of network bandwidth per host.
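
The per-host figure follows directly from the per-GPU figure: 8 × 400 Gbit/s = 3,200 Gbit/s = 3.2 Tbit/s. The short Python sketch below reproduces that arithmetic and estimates an ideal transfer time. The function names are illustrative, and the estimate assumes the peak rate is fully sustained with no protocol overhead, so treat it as an upper bound on speed rather than a measurement.

```python
# Figures taken from the text above; everything else is an illustrative assumption.
GPUS_PER_HOST = 8
LINK_GBIT_PER_S = 400  # peak per-GPU connection, Gbit/s


def host_bandwidth_tbit(gpus=GPUS_PER_HOST, link_gbit=LINK_GBIT_PER_S):
    """Aggregate peak network bandwidth of one host, in Tbit/s."""
    return gpus * link_gbit / 1000


def transfer_seconds(size_gbyte, bandwidth_tbit):
    """Ideal time to move size_gbyte (decimal gigabytes) at the peak rate.

    Ignores protocol overhead, congestion and storage limits.
    """
    size_gbit = size_gbyte * 8  # bytes -> bits
    return size_gbit / (bandwidth_tbit * 1000)


if __name__ == "__main__":
    bandwidth = host_bandwidth_tbit()
    print(f"Peak per-host bandwidth: {bandwidth} Tbit/s")
    print(f"Ideal time for 400 GB: {transfer_seconds(400, bandwidth):.2f} s")
```

Under these assumptions, 400 GB of data could leave a host in roughly one second at peak rate; real transfers will be slower.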

You can use images adapted for GPU clusters, e.g. Ubuntu 22.04 LTS for NVIDIA GPU clusters (NVIDIA CUDA 12).

Why InfiniBand is essential for ML workloads

Model parallelism

Model parallelism splits a single ML model across the GPUs of the cluster and relies on InfiniBand for the communication between them. It is well suited to training transformer models with a very large number of parameters.
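
As a rough illustration of one form of model parallelism, the pure-Python sketch below splits the weight matrix of a single linear layer by rows across several simulated devices. Each device computes only its slice of the output, and the slices are then concatenated, which stands in for the gather step that would travel over the interconnect. All names are hypothetical; production training would use a distributed framework rather than this toy code.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def split_rows(weights, num_devices):
    """Give each simulated device a contiguous block of output rows."""
    per_device = -(-len(weights) // num_devices)  # ceiling division
    return [weights[i:i + per_device] for i in range(0, len(weights), per_device)]


def model_parallel_matvec(weights, x, num_devices):
    """Compute weights @ x with the weights sharded across devices."""
    blocks = split_rows(weights, num_devices)
    # On a real cluster each block lives on a different GPU and these
    # products run concurrently; here they run one after another.
    partial_outputs = [matvec(block, x) for block in blocks]
    # Stand-in for the gather step that crosses the network.
    return [value for part in partial_outputs for value in part]


if __name__ == "__main__":
    weights = [[1, 2], [3, 4], [5, 6], [7, 8]]
    x = [1, 1]
    print(model_parallel_matvec(weights, x, num_devices=2))
    print(matvec(weights, x))  # same result on a single device
```

No single device ever holds the full weight matrix, which is what lets a model larger than one GPU's memory be trained at all. The cost is the extra communication, which is where interconnect bandwidth and latency matter.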

Data parallelism

To speed up training, you can split your data into segments and train on multiple GPU nodes in parallel. Distributing the workload evenly across the available resources significantly reduces the overall training time.
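
The idea can be sketched in a few lines of pure Python. Each simulated worker computes a gradient on its own data segment, and the partial results are then summed, which stands in for the all-reduce step that would run over the interconnect. The toy model (y = w * x with a squared-error loss) and all function names are assumptions made for illustration, not part of any Nebius or NVIDIA API.

```python
def shard(data, num_workers):
    """Split data into num_workers contiguous, near-equal segments."""
    base, extra = divmod(len(data), num_workers)
    shards, start = [], 0
    for rank in range(num_workers):
        size = base + (1 if rank < extra else 0)
        shards.append(data[start:start + size])
        start += size
    return shards


def local_gradient_sum(w, samples):
    """Sum of d/dw (w * x - y)^2 over one worker's samples."""
    return sum(2 * (w * x - y) * x for x, y in samples)


def full_batch_gradient(w, data):
    """Reference: mean gradient computed on a single worker."""
    return local_gradient_sum(w, data) / len(data)


def data_parallel_gradient(w, data, num_workers):
    """Mean gradient computed from per-worker partial sums."""
    # On a real cluster each partial sum is computed concurrently on its own GPU.
    partial_sums = [local_gradient_sum(w, s) for s in shard(data, num_workers)]
    # Stand-in for all-reduce: combine the partial results from every worker.
    return sum(partial_sums) / len(data)


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]
    print(data_parallel_gradient(0.0, data, num_workers=2))
    print(full_batch_gradient(0.0, data))  # matches the sharded result
```

Because the combined gradient matches the single-worker one, adding workers shortens each training step without changing the update, as long as the combine step stays fast.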

Hardware efficiency

We understand the importance of using hardware as efficiently as possible. That’s why we design and assemble servers specifically tailored for hosting modern GPUs like the NVIDIA H100.

Our latest server rack generation presents node solutions tuned for ML training and inference.

Nebius AI hardware: how we design hardware

How we built ISEG, the world’s 16th-ranked supercomputer

We weren’t aiming to create a supercomputer. But our R&D team decided to test a part of the platform that was free of customers’ workloads at the time, using a benchmark from the TOP500 list.

ISEG now ranks 16th in the world and 4th in Europe.

Discover ISEG