InfiniBand vs Ethernet in GPU Clouds

Once a training job spans more than one node, the GPUs spend a great deal of time talking to each other. At that point the network fabric connecting them can matter as much as the GPUs themselves. A cluster of the fastest accelerators connected by a weak network will scale poorly, leaving expensive silicon idle while it waits on data. The two dominant fabrics in GPU clouds are InfiniBand and high-speed Ethernet. This advanced guide explains how they differ, why the difference shows up in distributed training, and what to ask a provider before committing to a multi-node job.

Why interconnect determines scaling

Distributed training relies on collective operations, most importantly all-reduce, in which every GPU contributes its gradients and receives the combined result each step. As you add nodes, more of that traffic has to cross the network instead of staying inside a node, and every collective has more participants to wait for. If the network cannot keep up, communication time starts to dominate and adding GPUs yields diminishing returns. The goal is near-linear scaling, where doubling the GPU count nearly halves training time, and that is only achievable when the fabric sustains the required bandwidth at low latency.
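
To make the bandwidth and latency dependence concrete, here is a rough sketch of the textbook cost model for a ring all-reduce, one common algorithm for this collective. The function name, its parameters, and the example figures (1 GB of gradients, 50 GB/s or 12.5 GB/s per GPU, 5 microseconds per step) are illustrative assumptions, not measurements from any particular fabric. Real communication libraries choose among several algorithms and overlap communication with compute.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, bandwidth_bytes_per_s, latency_s):
    """Estimate ring all-reduce time with a simple latency + bandwidth model.

    A ring all-reduce takes 2 * (N - 1) steps; in each step every GPU sends
    one chunk of size message_bytes / N to its neighbour.
    """
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    steps = 2 * (num_gpus - 1)
    chunk_bytes = message_bytes / num_gpus
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


# Hypothetical numbers for illustration only.
GRADIENT_BYTES = 1e9   # assumed 1 GB of gradients per step
LATENCY_S = 5e-6       # assumed latency per ring step

for gpus in (8, 64, 512):
    fast = ring_allreduce_seconds(GRADIENT_BYTES, gpus, 50e9, LATENCY_S)
    slow = ring_allreduce_seconds(GRADIENT_BYTES, gpus, 12.5e9, LATENCY_S)
    print(f"{gpus:4d} GPUs: {fast * 1e3:8.2f} ms at 50 GB/s, {slow * 1e3:8.2f} ms at 12.5 GB/s")
```

In this model the bandwidth term levels off near twice the message size divided by the bandwidth as the GPU count grows, while the latency term keeps growing with the number of steps. That is one reason both properties matter at scale.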

What InfiniBand brings

InfiniBand was designed for high-performance computing and has long been the default fabric for large GPU clusters. Its strengths are directly relevant to training:

Low latency: InfiniBand offers very low and consistent latency, which matters because a collective operation finishes only when its slowest participant does.
Native RDMA: remote direct memory access lets the network adapter read and write memory on another machine without involving the host CPU in the data path, reducing overhead and freeing host resources. With GPUDirect RDMA, the adapter can access GPU memory directly.
In-network features: some InfiniBand fabrics can perform parts of collective operations inside the switches (NVIDIA's SHARP is one example), accelerating all-reduce further.
Predictable performance: a dedicated, lossless fabric delivers steady throughput under heavy load.

For the largest, most tightly coupled training runs, these properties have made InfiniBand the traditional choice.

What modern Ethernet brings

Ethernet has historically lagged for tightly coupled training, but modern high-speed Ethernet has narrowed the gap considerably, especially with RDMA over Converged Ethernet, known as RoCE. The case for Ethernet includes:

RoCE for RDMA on Ethernet: RoCE brings the CPU-bypass benefits of RDMA to Ethernet networks, recovering much of the efficiency that made InfiniBand attractive.
Ubiquity and ecosystem: Ethernet is everywhere, with broad tooling, operational familiarity, and a large vendor ecosystem.
Cost and flexibility: Ethernet can be more economical and integrates naturally with the rest of a cloud's networking.
Rapid improvement: purpose-built AI Ethernet fabrics are closing the performance gap for many workloads.

For many training jobs, well-engineered Ethernet with RoCE now delivers scaling that is competitive with InfiniBand.

Dimension	InfiniBand	High-speed Ethernet (RoCE)
Latency	Very low, consistent	Low and improving; historically more variable
RDMA	Native	Via RoCE
Ecosystem	HPC-focused	Broad and ubiquitous
Tuning sensitivity	Lower	Needs careful lossless configuration
Best fit	Largest, tightly coupled training	Most distributed training, cost-sensitive scale

It is not just the fabric type

Choosing InfiniBand or Ethernet is only part of the picture. Several other factors shape real-world scaling:

Bandwidth per GPU: a fabric is only as good as the bandwidth allocated to each GPU. Ask for the per-GPU figure, not just the headline link speed.
Topology and bisection bandwidth: how nodes are wired together determines whether the network bottlenecks under all-to-all traffic.
Placement and locality: GPUs scheduled close together in the network communicate faster than ones spread across the datacenter.
Configuration quality: RoCE in particular depends on careful lossless network tuning, typically Priority Flow Control and ECN-based congestion control. A poorly configured Ethernet fabric falls well short of its potential.
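
The gap between a headline link speed and the per-GPU figure is simple arithmetic, sketched below. The function name, the parameters, and the example node shapes are hypothetical, and the model is a deliberate simplification: it ignores protocol overhead and treats topology as a single oversubscription ratio (downlink capacity divided by uplink capacity).

```python
def per_gpu_bandwidth_gbytes(link_gbits_per_s, nics_per_node, gpus_per_node,
                             oversubscription=1.0):
    """Rough per-GPU cross-node bandwidth in gigabytes per second.

    oversubscription=1.0 models a non-blocking fabric; 2.0 means the
    uplinks carry half the capacity of the downlinks.
    """
    node_gbits = link_gbits_per_s * nics_per_node
    per_gpu_gbits = node_gbits / gpus_per_node / oversubscription
    return per_gpu_gbits / 8.0  # bits to bytes


# Hypothetical node shapes, all advertised as "400 Gb/s".
print(per_gpu_bandwidth_gbytes(400, nics_per_node=8, gpus_per_node=8))       # one NIC per GPU
print(per_gpu_bandwidth_gbytes(400, nics_per_node=2, gpus_per_node=8))       # NICs shared by 4 GPUs
print(per_gpu_bandwidth_gbytes(400, nics_per_node=8, gpus_per_node=8,
                               oversubscription=2.0))                        # 2:1 oversubscribed
```

Three clusters with the same advertised link speed can differ by a factor of four in what each GPU actually gets, which is why the per-GPU number is the one to ask for.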

How communication patterns stress the fabric

Different parallelism strategies lean on the network in different ways, and matching them to the fabric is part of good cluster design. Data-parallel training relies heavily on all-reduce, where every GPU contributes to and receives a combined gradient each step. This is bandwidth-intensive and latency-sensitive, exactly the pattern InfiniBand and well-tuned RoCE handle best. Pipeline parallelism, by contrast, sends activations point to point between stages, a lighter and more predictable pattern. Tensor parallelism is the most demanding of all: it runs collectives inside every layer's forward and backward pass, so you generally want to keep it inside a single node over NVLink rather than across the cluster network.

The practical lesson is to align your parallelism layout with the fabric's strengths. Keep the heaviest, most frequent communication on the fastest links, and push the dimensions that tolerate latency onto the slower cross-node fabric. A layout that ignores this can saturate the network with traffic that should have stayed local, dragging down a job that the hardware was perfectly capable of running fast.
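
One way to picture such a layout is as a mapping from ranks to communication groups. The sketch below is a minimal illustration, assuming ranks are numbered node by node and using only tensor and data parallelism; the function name and parameters are hypothetical and do not correspond to any specific framework's API.

```python
def build_groups(num_nodes, gpus_per_node, tensor_parallel_size):
    """Return (tensor_parallel_groups, data_parallel_groups) as lists of ranks.

    Assumes ranks are numbered node by node:
        rank = node * gpus_per_node + local_gpu
    """
    if gpus_per_node % tensor_parallel_size != 0:
        raise ValueError("tensor_parallel_size must divide gpus_per_node")
    world_size = num_nodes * gpus_per_node
    tp = tensor_parallel_size
    # Consecutive ranks form a tensor-parallel group, so each group stays
    # inside one node and its heavy traffic can use the intra-node links.
    tp_groups = [list(range(start, start + tp)) for start in range(0, world_size, tp)]
    # Ranks holding the same shard form a data-parallel group; these groups
    # may span nodes and carry the per-step gradient all-reduce.
    dp_groups = [list(range(offset, world_size, tp)) for offset in range(tp)]
    return tp_groups, dp_groups


tp_groups, dp_groups = build_groups(num_nodes=2, gpus_per_node=4, tensor_parallel_size=4)
print("tensor-parallel groups:", tp_groups)
print("data-parallel groups:  ", dp_groups)
```

Because the tensor-parallel size divides the GPUs per node, no tensor-parallel group ever straddles a node boundary, and only the data-parallel traffic reaches the cross-node fabric.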

When the interconnect is not your bottleneck

A note of balance is in order. For single-node jobs, small clusters, or workloads that are compute-bound rather than communication-bound, the choice between InfiniBand and Ethernet barely registers. Spending heavily on the fastest possible fabric for a job that never stresses the network is its own form of overpaying. Profile your training to see how much of each step is actually spent waiting on communication. If that fraction is small, a standard Ethernet fabric may serve you just as well, and you can direct the savings toward more GPUs or longer runs. The premium fabric earns its keep specifically when communication is a large share of your step time.
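
A quick back-of-the-envelope check can turn a profile into a decision. The sketch below assumes you have already measured the step time and the communication time that is not hidden behind compute; the function names and example numbers are hypothetical, and the speedup estimate is an Amdahl-style bound that assumes only the communication portion gets faster.

```python
def communication_fraction(step_seconds, exposed_comm_seconds):
    """Share of a training step spent waiting on communication."""
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    return exposed_comm_seconds / step_seconds


def speedup_from_faster_fabric(comm_fraction, fabric_speedup):
    """Overall step speedup if only the communication part runs faster."""
    return 1.0 / ((1.0 - comm_fraction) + comm_fraction / fabric_speedup)


# Hypothetical profiles: (step time, exposed communication time) in seconds.
for step_s, comm_s in ((1.0, 0.10), (1.0, 0.50)):
    fraction = communication_fraction(step_s, comm_s)
    gain = speedup_from_faster_fabric(fraction, fabric_speedup=2.0)
    print(f"comm fraction {fraction:.0%}: a 2x faster fabric gives about {gain:.2f}x overall")
```

With these made-up numbers, doubling fabric speed helps a job that spends half its step in communication far more than one that spends a tenth, which is the trade-off described above.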

What to ask a provider

Before running a multi-node training job, get concrete answers:

What fabric connects the nodes, InfiniBand or RoCE Ethernet, and what is the per-GPU bandwidth?
What is the network topology, and how is bisection bandwidth affected at the scale I plan to run?
Can I request placement so my GPUs are network-close to each other?
Are there reference scaling benchmarks for jobs of my size on this fabric?

Then validate with your own scaling test. Run your real model on one node, then two, then more, and measure how close you stay to linear scaling. The fabric's quality shows up immediately in that curve.
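
Scoring that test takes only a few lines. The sketch below assumes you have recorded throughput (for example, samples per second) at each node count; the function name and the sample measurements are hypothetical placeholders for your own results.

```python
def scaling_efficiency(throughput_by_nodes):
    """Map node count -> efficiency relative to perfect linear scaling.

    The smallest node count in the input is used as the baseline, so an
    efficiency of 1.0 means throughput grew exactly in proportion to nodes.
    """
    base_nodes = min(throughput_by_nodes)
    base_throughput = throughput_by_nodes[base_nodes]
    return {
        nodes: throughput / (base_throughput * nodes / base_nodes)
        for nodes, throughput in sorted(throughput_by_nodes.items())
    }


# Hypothetical measurements: node count -> samples per second.
measured = {1: 1000.0, 2: 1900.0, 4: 3600.0}
for nodes, efficiency in scaling_efficiency(measured).items():
    print(f"{nodes} node(s): {efficiency:.0%} of linear")
```

An efficiency that drops sharply as nodes are added points at the fabric, the topology, or the placement rather than at the GPUs.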

Conclusion

For single-node and small jobs, the interconnect barely matters. For distributed training at scale, it can matter more than the choice of GPU. InfiniBand remains the gold standard for the largest, most tightly coupled runs thanks to its low latency and native RDMA, while modern Ethernet with RoCE has become a strong, cost-effective option for a wide range of distributed workloads. Look past the fabric label to per-GPU bandwidth, topology, placement, and configuration quality, and always validate with your own scaling benchmark. Get the interconnect right, and your expensive GPUs spend their time computing instead of waiting on the network.
