Leveraging high-speed, rack-scale GPU interconnect with NVIDIA GB200 NVL72

At Nebius, we’ve been providing access to the NVIDIA Grace Blackwell platform for quite some time now. Throughout 2025, our team of architects has accumulated substantial knowledge on how to extract maximum performance from these systems and their networking. This post kicks off our effort to share that expertise.

Let’s explore one of the key features that makes the new NVIDIA GB200 NVL72 stand out: the fifth generation NVIDIA NVLink™ scale-up fabric. We’ll discuss how it redefines data center AI infrastructure by moving beyond the traditional 8-GPU NVLink interconnect. You’ll see a practical example of how to take advantage of this capability in Slurm using the block topology plugin.

Finally, we’ll examine a real-world use case: pre-training the Nemotron-4 340B LLM. In this example, the DGX Cloud Benchmarking recipe for GB200 demonstrates how the rack-scale NVIDIA NVLink fabric delivers the highest inter-GPU bandwidth for the most demanding NCCL collectives. This will provide us with valuable insights into how to adapt AI workloads to get the best performance on NVIDIA GB200 NVL72.

In short, we’ll use a workload developed by NVIDIA to closely examine the latest features of GB200 NVL72 and how workloads can be designed for such cutting-edge hardware systems. This material will be valuable for anyone planning to design workloads for NVIDIA GB200 NVL72 or GB300 NVL72 in the future.

Context: The implications of GPU interconnect for parallel compute

Pre-NVIDIA GB200 NVL72: Traditional approach to interconnect

Before the GB200 NVL72 release, the overwhelming majority of GPU servers on the market offered fast GPU-GPU transport inside the server and slower GPU-GPU connectivity between servers. NVIDIA Hopper™ based servers each feature 8 GPUs connected through a local NVLink fabric; the Hopper architecture uses fourth-generation NVLink, providing 900 GB/s of bandwidth per GPU. (For comparison, the fifth-generation NVLink in GB200 NVL72 provides 1.8 TB/s per GPU, or 130 TB/s of aggregate GPU-to-GPU bandwidth across the rack.) Between servers, connectivity is provided via RDMA-enabled networking such as NVIDIA Quantum-2 InfiniBand or Spectrum-X Ethernet with RoCE.

The typical InfiniBand fabric topology is a rail-optimized fat-tree, which we implement at Nebius according to NVIDIA’s recommendations:

  • The GPU servers are connected to a group of network switches (called leaf switches) in groups of 32 servers; each such group is called a Scalable Unit (SU).

  • Each Network Interface Card (NIC) of a GPU server is connected to a different leaf switch within the same SU. This lets the GPUs inside a single server leverage NVIDIA NVSwitch to reach any of the server’s 8 NICs and talk to another GPU in the same SU over just a single leaf-switch hop, reducing latency.

  • GPUs in different SUs can still talk to each other; however, the latency will be slightly higher due to an additional switch hop between SUs.

Figure 1. Example: NVIDIA Quantum-2 InfiniBand fabric topology. This concept is explained in detail in this NVIDIA blog post. See our documentation for more details on the fabric topology.

For smaller workloads, these considerations are not mission-critical: in a proper fat-tree network there is no over-subscription, meaning there is no actual bandwidth penalty for the additional switch level, only a small latency overhead. This does not mean we ignore topology altogether. For a large-scale cluster that spans multiple SUs, or even occupies the whole InfiniBand fabric, topology-aware scheduling is beneficial, especially for small message sizes, which are more sensitive to latency.

With the GB200 NVL72, NVIDIA brings NVLink to the rack level, connecting 72 Blackwell GPUs into a single, tightly coupled system. Every GPU is connected to the external NVLink switches in the rack; in fact, there is no direct GPU-to-GPU connectivity inside a single server or compute tray at all. All 72 GPUs are equivalent from the high-speed NVLink connectivity perspective.

What about workloads requiring >72 GPUs? To enable GPU to GPU connectivity across racks, we leverage the same classic NVIDIA Quantum InfiniBand network. Each GPU in the rack gets a dedicated NIC — same as for the classic 8-GPU servers.

Eliminating communication overhead with rack-scale GPU systems

This extended connectivity is a game changer: we can now drastically reduce the communication overhead of distributed AI workloads, both for pre-training and inference. That, in turn, allows for more efficient use of the GPU compute budget.

However, in order to benefit from this new functionality, we need to make some modifications to the setup we typically use for scheduling these distributed workloads. As discussed above, for traditional servers with 8 GPUs connected over NVLink, the cross-server GPU-to-GPU bandwidth is the same for all servers within a single NVIDIA Quantum InfiniBand fabric, and the main advantage of leveraging network topology is latency reduction. At Nebius, we provide users with access to the InfiniBand topology.

Figure 2. Slurm’s network topology configuration

We cannot get away with ignoring topology on NVIDIA GB200 NVL72 clusters, since the bandwidth difference between NVLink and InfiniBand is massive. Schedulers like Slurm therefore provide a block topology plugin, which can be leveraged at job submission time with the --segment option of sbatch, letting us place worker nodes in equally sized, rack-aligned groups across multiple racks.
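As a rough illustration of how segment sizing works, here is a minimal Python sketch that picks a --segment value such that each block of compute trays fits inside a single NVL72 rack. The rack geometry constants and the train.sbatch file name are assumptions made for this sketch, not values produced by Slurm or any Nebius tooling.

```python
# Illustrative sketch only: choose a --segment value so that each segment of
# compute trays fits inside one NVL72 rack and can be served by a single
# NVLink domain. Geometry constants come from the article (4 GPUs per tray,
# 18 trays per rack); "train.sbatch" is a hypothetical job script name.
import math

GPUS_PER_TRAY = 4       # GB200 NVL72 compute tray
TRAYS_PER_RACK = 18     # compute trays in one NVL72 rack


def suggest_segment(total_gpus: int) -> tuple[int, int]:
    """Return (num_nodes, segment_size) for an sbatch submission."""
    nodes = math.ceil(total_gpus / GPUS_PER_TRAY)
    # Largest divisor of `nodes` that still fits inside a single rack,
    # so the scheduler can satisfy every segment from one NVLink domain.
    segment = max(d for d in range(1, TRAYS_PER_RACK + 1) if nodes % d == 0)
    return nodes, segment


if __name__ == "__main__":
    nodes, segment = suggest_segment(128)   # 128 GPUs -> 32 trays
    print(f"sbatch -N {nodes} --segment={segment} "
          f"--gpus-per-node={GPUS_PER_TRAY} train.sbatch")
    # -> sbatch -N 32 --segment=16 --gpus-per-node=4 train.sbatch
```

With 128 GPUs this suggests 32 trays submitted with --segment=16, i.e. two equally sized blocks, one per rack. This is the allocation we assume in the Nemotron example below.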

Optimizing 340B model pre-training on NVIDIA GB200 NVL72

We will use the Nemotron-4 340B pre-training example to better understand how the unique NVIDIA GB200 NVL72 architecture can be leveraged to accelerate LLM pre-training.

For more details about this model’s architecture and pre-training, see the following publication.

Figure 3. Inside NVIDIA GB200 NVL72: trays, GPU hosts and NVIDIA NVLink fabric

The relative simplicity of the Nemotron model recipe makes it a good example for studying NVIDIA GB200 NVL72 capabilities, as it only uses 3D parallelism: Tensor parallelism (TP), Pipeline parallelism (PP) and Data parallelism (DP). We will look into the smallest configuration, requiring 128 Blackwell GPUs (TP=8, PP=4 and DP=4), which fits into a cluster of 2 GB200 NVL72 racks.
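To make the decomposition concrete, here is a minimal Python sketch (an illustration, not Megatron Core code) that maps a global rank to its TP, DP and PP coordinates using the ordering discussed later in this post: TP varies fastest and PP slowest.

```python
# Simplified rank -> (tp, dp, pp) mapping for the Nemotron-4 340B example:
# TP changes fastest, then DP, and PP changes slowest.
TP, PP, DP = 8, 4, 4
WORLD_SIZE = TP * PP * DP        # 128 GPUs, i.e. 32 compute trays across 2 racks


def coords(rank: int) -> dict[str, int]:
    return {
        "tp": rank % TP,
        "dp": (rank // TP) % DP,
        "pp": rank // (TP * DP),
    }


for rank in (0, 7, 8, 32, 127):
    print(rank, coords(rank))
# 0   -> {'tp': 0, 'dp': 0, 'pp': 0}
# 7   -> {'tp': 7, 'dp': 0, 'pp': 0}   same TP group as rank 0
# 8   -> {'tp': 0, 'dp': 1, 'pp': 0}   next data-parallel replica, same stage
# 32  -> {'tp': 0, 'dp': 0, 'pp': 1}   second pipeline stage of replica 0
# 127 -> {'tp': 7, 'dp': 3, 'pp': 3}   last rank
```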

Tensor Parallelism (TP)

For the TP recipe, NVIDIA Megatron Core utilizes an optimized implementation from the Transformer Engine library where GEMMs are overlapped with communication leveraging CUDA multicast. One significant limitation of this approach is that it requires all devices to be on the same NVLink fabric, which on NVIDIA Hopper systems means a single tensor-parallel group is limited to the 8 GPUs of one server. NVIDIA GB200 NVL72 alleviates this limitation, essentially enabling a multi-node paradigm for this style of inter-GPU communication.

Nemotron-4 340B uses TP=8, meaning each TP group spans 2 compute trays of 4 Blackwell GPUs each. Ideally, these groups should be formed this way:

Figure 4. Illustration of tensor parallelism for Nemotron-4 340B

We end up with 16 TP groups, each spanning 2 compute trays. A single TP group of 8 ranks is always contained within a single GB200 NVL72 rack.

For example, the first TP group includes the following ranks: [0, 1, 2, 3, 4, 5, 6, 7]. TP groups require the fastest possible interconnect, so when an ML framework such as Megatron Core builds the mapping that assigns GPU ranks to parallelism groups, it considers TP first: ranks within a TP group are consecutive.
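The sketch below (illustrative only; it assumes ranks are packed sequentially onto trays of 4 GPUs, with 16 trays allocated per rack as in a --segment=16 job) enumerates the 16 TP groups and confirms that each group occupies exactly 2 trays and never crosses a rack boundary.

```python
# Enumerate TP groups for the 128-GPU Nemotron example and locate them
# physically, assuming contiguous rank -> tray placement (an assumption of
# this sketch, not something read back from the scheduler).
TP = 8
GPUS_PER_TRAY = 4
TRAYS_PER_SEGMENT = 16           # trays allocated per rack with --segment=16

for group_id in range(128 // TP):                 # 16 TP groups
    ranks = list(range(group_id * TP, (group_id + 1) * TP))
    trays = sorted({r // GPUS_PER_TRAY for r in ranks})
    rack = trays[0] // TRAYS_PER_SEGMENT + 1
    print(f"TP group {group_id:2d}: ranks {ranks[0]:3d}-{ranks[-1]:3d}, "
          f"trays {trays}, rack {rack}")
# TP group  0: ranks   0-  7, trays [0, 1], rack 1
# ...
# TP group  8: ranks  64- 71, trays [16, 17], rack 2
```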

Pipeline Parallelism (PP)

Let’s apply another parallelism dimension to this problem, PP. Our model has a PP degree of 4, which means that our 96-layer model is split into 4 pipeline stages of 24 layers each. Within a single model replica, each stage corresponds to one TP group from earlier, so one replica spans 4 TP groups (32 GPUs). For example, model replica 1 occupies the following GPUs:

[0, 1, 2, 3, 4, 5, 6, 7, 32, 33, 34, 35, 36, 37, 38, 39, 64, 65, 66, 67, 68, 69, 70, 71, 96, 97, 98, 99, 100, 101, 102, 103].

Here, we can see how the 4 pipeline stages (= 1 model replica) are placed across the 2 racks:

Figure 5. Illustration of pipeline parallelism for Nemotron-4 340B

Important note: this simplification does not take interleaved pipelining into account. The Nemotron-4 340B recipe uses VP=12, meaning the 24 layers inside each pipeline stage are split into 12 groups of 2 layers in order to reduce the pipeline bubble.

This illustration raises an interesting question: why did Megatron Core place these pipeline stages on both racks?

This seems strange as 4 TP groups (8 compute trays) would easily fit into a single 18-tray NVIDIA GB200 NVL72 rack.

Actually, this is a deliberate choice by the Megatron Core developers. The code that creates the rank → parallelism group mapping considers PP at the very end when building rank groups, and not by mistake: compared to the other parallelism types in this recipe, PP is the least demanding in terms of communication bandwidth, as the sketch after the following list also illustrates:

  • TP relies heavily on AllReduce, AllGather and ReduceScatter communications. See the following documentation from PyTorch for more context;

  • DP extensively utilizes the AllReduce collective for the gradients of each model parameter during training, as well as AllGather and ReduceScatter for the distributed optimizer used in this recipe (see the NeMo documentation);

  • PP mostly uses point-to-point communication (send and recv in NCCL, see NVIDIA’s blog post), which boils down to passing activations and gradients between pipeline stages for every microbatch.
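A small sketch under the same placement assumptions as before (contiguous ranks, 64 ranks per rack, interleaving ignored) shows where the four pipeline stages of one model replica land, and that only one of the PP send/recv hops has to cross the rack boundary over InfiniBand.

```python
# Where do the pipeline stages of model replica 0 live? Illustrative sketch;
# assumes the contiguous rank -> rack placement described above.
TP, DP, PP = 8, 4, 4
RANKS_PER_RACK = 64              # 16 trays x 4 GPUs with --segment=16


def rank_of(tp: int, dp: int, pp: int) -> int:
    # TP changes fastest, PP slowest.
    return tp + TP * dp + TP * DP * pp


replica = 0                      # the first data-parallel model replica
for pp in range(PP):
    ranks = [rank_of(tp, replica, pp) for tp in range(TP)]
    rack = ranks[0] // RANKS_PER_RACK + 1
    print(f"stage {pp}: ranks {ranks[0]}-{ranks[-1]} on rack {rack}")
# stage 0: ranks 0-7 on rack 1
# stage 1: ranks 32-39 on rack 1
# stage 2: ranks 64-71 on rack 2
# stage 3: ranks 96-103 on rack 2
# Only the stage 1 -> stage 2 activation/gradient exchange crosses the rack
# boundary (over InfiniBand); all other PP traffic stays on NVLink.
```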

Data Parallelism (DP)

Megatron Core assigns GPU ranks to DP groups that end up physically in the same NVIDIA GB200 NVL72 rack, since DP has a higher priority than PP (see the previous section). The following animation illustrates how the model (placed on the group of 32 GPUs covered in the PP section) is replicated across the allocated compute trays.

Figure 6. Illustration of data parallelism for Nemotron-4 340B

The most important observation from this illustration is that all replicas of a given pipeline stage are kept within the same rack. The gradient AllReduce during the backward pass therefore never has to span the 2 racks; it only runs across the replicas of a single pipeline stage, so the fast NVLink is still leveraged for this demanding collective.
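We can check this rack-locality with a short sketch under the same assumptions (TP-fastest/PP-slowest rank ordering and 64 contiguous ranks per rack): every one of the 32 data-parallel groups, i.e. every group that performs the gradient AllReduce, stays inside a single rack.

```python
# Verify that every data-parallel (gradient AllReduce) group is rack-local.
# Illustrative sketch; same ordering and placement assumptions as above.
TP, DP, PP = 8, 4, 4
RANKS_PER_RACK = 64


def rank_of(tp: int, dp: int, pp: int) -> int:
    return tp + TP * dp + TP * DP * pp


for pp in range(PP):
    for tp in range(TP):
        group = [rank_of(tp, dp, pp) for dp in range(DP)]
        racks = {r // RANKS_PER_RACK for r in group}
        assert len(racks) == 1, f"DP group {group} crosses racks!"

print("All 32 DP groups are rack-local, so gradient AllReduce runs over NVLink.")
```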

Figure 7. Pipeline stage replicas in AllReduce operation

Wrapping up

It is essential to leverage the 2-tier topology of NVIDIA GB200 NVL72 clusters (NVLink within the rack, InfiniBand across racks); otherwise, we would be leaving a significant part of the performance on the table. An individual Blackwell GPU in GB200 NVL72 is somewhat more powerful than a standard NVIDIA HGX B200 GPU, but the large NVLink fabric is the real performance enabler.

Always remember: a distributed, parallelized workload runs at the speed of its slowest constituent. If the parallelism groups with the heaviest communication bandwidth demands are not placed within NVIDIA GB200 NVL72 racks to fully leverage NVLink connectivity, end-to-end workload performance will suffer drastically.

Thus, bringing your AI workload to NVIDIA GB200 NVL72 clusters requires preparation. It is imperative to carefully engineer your workload to fully benefit from the large NVLink fabric.

Understanding how your framework creates parallelism groups (RankGenerator in Megatron Core, DeviceMesh in PyTorch, etc.) will take some effort and learning. The same goes for finding out which collective operations and message sizes are most prevalent in the GPU-to-GPU communication of your workload. Scaling your workload must be researched carefully as well, since it may change how the communication groups are distributed.
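As a starting point, here is a scaled-down, hypothetical DeviceMesh sketch (PP=2, DP=2, TP=2, 8 CPU processes launched with torchrun on a recent PyTorch; mesh_demo.py is just a placeholder file name) that mirrors the TP-innermost, PP-outermost ordering discussed above. It only shows how to inspect the groups your framework creates; the sizes and names are not the Nemotron recipe values.

```python
# Scaled-down illustration, runnable on CPU with:
#   torchrun --nproc_per_node=8 mesh_demo.py
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh


def main() -> None:
    # Listing "tp" last makes it the fastest-varying (innermost) dimension,
    # mirroring the TP-first, PP-last group construction discussed above.
    mesh = init_device_mesh(
        "cpu", mesh_shape=(2, 2, 2), mesh_dim_names=("pp", "dp", "tp")
    )
    rank = dist.get_rank()
    for dim in ("tp", "dp", "pp"):
        peers = dist.get_process_group_ranks(mesh.get_group(dim))
        print(f"rank {rank}: {dim} group -> {peers}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```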

Be prepared to invest significant effort in these areas, or contact us for a free PoC with NVIDIA GB200 NVL72, fully assisted by me or my colleagues. It’s always interesting to work with new workloads and setups, as this allows us to keep building expertise and make the process smoother for every new customer. I hope I was able to share some unique insights today, and that these techniques will help you in configuring your clusters.
