Multi-GPU NVLink Clusters in the Cloud: 8x H100 Nodes Compared
An intermediate guide to 8x H100 NVLink nodes in the cloud, covering NVSwitch topology, memory pooling, and what to check when comparing providers.
The 8x H100 node has become the default building block of serious AI training in the cloud. It is not just eight GPUs in a box. The defining feature is how those GPUs talk to each other. Inside a well-designed node, all eight cards connect through NVLink and NVSwitch, giving them high-bandwidth, low-latency communication that ordinary networking cannot match. That fabric is what makes large-scale training practical. This guide explains what an 8x H100 NVLink node really delivers and what to compare when shopping across cloud providers.
Why the interconnect, not the GPU, is the story
A single H100 is fast, but training a large model means splitting work across many GPUs and constantly exchanging gradients, activations, and parameters. The speed at which GPUs can share data often determines whether you get near-linear scaling or hit a wall. NVLink provides direct GPU-to-GPU links far faster than PCIe: on the SXM version of the H100, up to 900 GB/s of bidirectional bandwidth per GPU, against roughly 128 GB/s for a PCIe Gen5 x16 slot. NVSwitch then lets every GPU in the node reach every other GPU at full NVLink bandwidth. The result is that an 8x H100 NVLink node behaves much more like one large accelerator than like eight separate cards.
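To put rough numbers on that gap, here is a minimal sketch that estimates the bandwidth-only time of a ring all-reduce, in which each GPU sends 2(N-1)/N times the payload. The 14 GB payload (bf16 gradients for a hypothetical 7-billion-parameter model) and the per-direction bandwidths (450 GB/s for NVLink on an H100 SXM, 64 GB/s for PCIe Gen5 x16) are illustrative peak figures, not measurements. Real collectives add latency and may use a different algorithm.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int, gb_per_s: float) -> float:
    """Bandwidth-only estimate of a ring all-reduce.

    Each GPU sends (and concurrently receives) 2 * (N - 1) / N times the
    payload. Latency and protocol overhead are ignored.
    """
    if num_gpus < 2:
        return 0.0  # nothing to synchronize with a single GPU
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_bytes / (gb_per_s * 1e9)


# Assumed values for illustration only.
GRADIENT_BYTES = 7e9 * 2   # hypothetical 7B parameters, 2 bytes each (bf16)
NVLINK_GB_S = 450.0        # per-direction peak, H100 SXM
PCIE_GEN5_GB_S = 64.0      # per-direction peak, x16 slot

nvlink_s = ring_allreduce_seconds(GRADIENT_BYTES, 8, NVLINK_GB_S)
pcie_s = ring_allreduce_seconds(GRADIENT_BYTES, 8, PCIE_GEN5_GB_S)
print(f"NVLink: {nvlink_s * 1e3:.1f} ms, PCIe: {pcie_s * 1e3:.1f} ms per all-reduce")
```

Because this synchronization happens on every training step, even a modest per-step difference compounds over a long run.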
Inside the node: NVLink, NVSwitch, and memory pooling
Full mesh through NVSwitch
In a reference 8-GPU design, NVSwitch creates a non-blocking fabric so any GPU can communicate with any other at full NVLink bandwidth simultaneously. This matters for collective operations like all-reduce, which underpin distributed training. A node with full NVSwitch connectivity sustains those collectives far better than one where GPUs are only partially linked.
Effective memory pooling
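Because the GPUs share a fast fabric, model and tensor parallel strategies can spread a single large model across all eight cards while treating their memory as a near-unified pool. With 80 GB of HBM3 on each H100 SXM card, that pool is 640 GB in aggregate. The pooling is effective rather than literal: the training framework shards weights, gradients, and optimizer state across the GPUs, and NVLink makes the resulting exchanges cheap. This is how models too large for one GPU still train efficiently within a single node, avoiding the much slower hop to other nodes over the network.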
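A back-of-the-envelope check shows when a model needs the pooled memory. The sketch below assumes mixed-precision training with Adam at about 16 bytes of state per parameter (2 for half-precision weights, 2 for gradients, 12 for fp32 master weights and the two optimizer moments), perfect sharding across GPUs, and a hypothetical 80% usable fraction that leaves headroom for activations. All three are assumptions for illustration.

```python
GPU_MEMORY_GB = 80.0  # H100 SXM


def training_state_gb(num_params: float, bytes_per_param: float = 16.0) -> float:
    """Weights + gradients + optimizer state, excluding activations."""
    return num_params * bytes_per_param / 1e9


def fits(num_params: float, num_gpus: int, usable_fraction: float = 0.8) -> bool:
    """True if the sharded training state fits in the assumed usable memory."""
    budget_gb = num_gpus * GPU_MEMORY_GB * usable_fraction
    return training_state_gb(num_params) <= budget_gb


for params in (7e9, 30e9, 70e9):
    print(
        f"{params / 1e9:.0f}B params: {training_state_gb(params):.0f} GB of state, "
        f"fits on 1 GPU: {fits(params, 1)}, fits on 8 GPUs: {fits(params, 8)}"
    )
```

Under these assumptions a 7B-parameter model already exceeds a single 80 GB card but sits comfortably inside one node, while a 70B-parameter model needs either multiple nodes or a leaner memory strategy.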
When you outgrow one node
Eight H100s go a long way, but the largest jobs span many nodes. At that point the cross-node fabric matters as much as NVLink does inside the node. Providers connect nodes with high-bandwidth networking, typically InfiniBand or a comparable RDMA Ethernet fabric. The quality of that fabric, including bandwidth per GPU and topology, decides whether your job scales gracefully past one node. Within a node you lean on NVLink; between nodes you lean on the cluster network. Both must be strong for large training to scale.
What to compare across providers
Not all 8x H100 nodes are equal. When comparing offers, look past the GPU count and check the surrounding design.
- NVLink and NVSwitch topology: confirm full GPU-to-GPU connectivity, not a partial or PCIe-only arrangement.
- Inter-node fabric: ask about InfiniBand or RDMA Ethernet bandwidth per GPU if you plan to scale beyond one node.
- Host resources: CPU cores, system RAM, and local NVMe should be generous enough to feed eight GPUs without starving them.
- Storage throughput: training needs fast access to datasets and checkpoints; a slow shared filesystem bottlenecks the whole node.
- Placement and locality: for multi-node jobs, the nodes should sit close together in the network topology, with as few switch hops between them as possible, to keep latency low.
| Layer | Technology | What it affects |
|---|---|---|
| Within node | NVLink + NVSwitch | GPU-to-GPU bandwidth, memory pooling |
| Across nodes | InfiniBand or RDMA Ethernet | Multi-node scaling efficiency |
| Host | CPU, RAM, NVMe | Data feeding and preprocessing |
| Storage | Shared high-throughput filesystem | Dataset and checkpoint speed |
How parallelism maps onto the node
To get value from eight tightly linked GPUs, your training has to use a parallelism strategy that matches the hardware. Three approaches commonly combine inside a single node:
- Data parallelism: each GPU holds a full copy of the model and processes a different slice of the batch, then synchronizes gradients through all-reduce over NVLink. Simple and effective when the model fits on one GPU.
- Tensor parallelism: a single layer's math is split across GPUs, which demands very high bandwidth between them. This is exactly where NVSwitch shines, since the GPUs exchange data constantly.
- Pipeline parallelism: different layers live on different GPUs, and activations flow between stages. This eases memory pressure but needs careful balancing to avoid idle bubbles.
Large-model training often blends all three. The reason the 8x H100 node is so effective is that its fabric makes tensor parallelism practical within the node, so you can keep the most bandwidth-hungry communication local and reserve the slower cross-node network for the parallelism dimensions that tolerate it.
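One way to see how these dimensions map onto hardware is to lay out the ranks explicitly. The sketch below uses a common convention in which tensor-parallel ranks are consecutive, so a tensor-parallel group of up to eight GPUs lands inside one node. The ordering and the helper names (`rank_coords`, `tensor_parallel_groups`, `groups_stay_in_node`) are assumptions for illustration and do not reproduce any particular framework's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coords:
    tp: int  # tensor-parallel index (fastest varying)
    dp: int  # data-parallel index
    pp: int  # pipeline stage (slowest varying)


def rank_coords(rank: int, tp_size: int, dp_size: int) -> Coords:
    """Decompose a global rank as rank = pp * (dp_size * tp_size) + dp * tp_size + tp."""
    return Coords(
        tp=rank % tp_size,
        dp=(rank // tp_size) % dp_size,
        pp=rank // (tp_size * dp_size),
    )


def tensor_parallel_groups(world_size: int, tp_size: int) -> list[list[int]]:
    """Consecutive blocks of tp_size ranks form one tensor-parallel group."""
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    return [list(range(start, start + tp_size)) for start in range(0, world_size, tp_size)]


def groups_stay_in_node(groups: list[list[int]], gpus_per_node: int = 8) -> bool:
    """True if no group spans more than one node."""
    return all(len({rank // gpus_per_node for rank in group}) == 1 for group in groups)


# Two 8-GPU nodes: tensor parallelism of 8 keeps each group on NVLink.
print(groups_stay_in_node(tensor_parallel_groups(16, 8)))
# A tensor-parallel size of 16 would push that traffic onto the cluster network.
print(groups_stay_in_node(tensor_parallel_groups(16, 16)))
```

A layout check like this is cheap to run before launching a job, and it makes explicit which communication pattern ends up on which fabric.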
Software and driver readiness
A fast node is only as good as the software that drives it. Confirm the provider ships current GPU drivers, a compatible communication library for collective operations (such as NCCL), and a runtime that recognizes the NVLink topology. Misconfigured collectives can silently fall back to slower paths, wasting the fabric you are paying for. Before a long run, validate that your framework is actually using NVLink for intra-node communication and the high-speed fabric for cross-node traffic. A short diagnostic at the start can save days of slow training later.
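As one possible diagnostic, the sketch below scans the connectivity matrix printed by `nvidia-smi topo -m` and reports any GPU pair that is not linked by NVLink (entries beginning with `NV`). The sample matrices are hand-written illustrations, not captured output, and the exact layout can differ between driver versions, so treat the parser as a starting point.

```python
import re

GPU_LABEL = re.compile(r"^GPU\d+$")


def non_nvlink_pairs(topo_text: str) -> list[tuple[str, str, str]]:
    """Return (gpu_a, gpu_b, link) for every GPU pair not connected via NVLink."""
    rows = []
    for line in topo_text.splitlines():
        tokens = line.split()
        # Matrix rows start with a GPU label and contain "X" on the diagonal;
        # the header row also starts with GPU0 but has no "X".
        if tokens and GPU_LABEL.match(tokens[0]) and "X" in tokens:
            rows.append(tokens)
    count = len(rows)
    problems = []
    for i, row in enumerate(rows):
        for j, link in enumerate(row[1 : 1 + count]):  # GPU columns come first
            if i != j and not link.startswith("NV"):
                problems.append((row[0], rows[j][0], link))
    return problems


# Illustrative samples only (4 GPUs for brevity).
SAMPLE_FULL = """
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      NV18    NV18    NV18    0-47
GPU1    NV18     X      NV18    NV18    0-47
GPU2    NV18    NV18     X      NV18    48-95
GPU3    NV18    NV18    NV18     X      48-95
"""
SAMPLE_PARTIAL = SAMPLE_FULL.replace(
    "GPU0     X      NV18    NV18    NV18", "GPU0     X      NV18    NV18    SYS"
).replace(
    "GPU3    NV18    NV18    NV18     X", "GPU3    SYS     NV18    NV18     X"
)

print(non_nvlink_pairs(SAMPLE_FULL))
print(non_nvlink_pairs(SAMPLE_PARTIAL))
```

On a real node the input would come from running `nvidia-smi topo -m` and capturing its standard output. An empty result suggests full NVLink connectivity. Setting the `NCCL_DEBUG=INFO` environment variable is a complementary check, since NCCL then logs which transports it selects.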
Cost and utilization considerations
An 8x H100 node is a premium rental, so utilization is everything. A few practices keep the cost justified:
- Profile your training to confirm the GPUs are actually saturated, not waiting on data loading or slow storage.
- Use efficient parallel strategies that match the node topology, so the heaviest communication stays on NVLink and the host or PCIe path never becomes the limiting factor.
- Checkpoint frequently so an interruption on spot or preemptible capacity does not waste hours of compute.
- Right-size: if your model fits and trains well on fewer GPUs, do not rent eight out of habit.
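The arithmetic behind these practices is simple enough to sketch. The example below computes the effective price of a useful GPU-hour at a given utilization, and the expected compute lost to interruptions under the assumption that a preemption is equally likely at any point between checkpoints, so it costs half an interval on average. The hourly price is a hypothetical placeholder, not a quote from any provider.

```python
def cost_per_useful_gpu_hour(node_price_per_hour: float, num_gpus: int, utilization: float) -> float:
    """Price paid for each GPU-hour of actual work, given average utilization in (0, 1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return node_price_per_hour / (num_gpus * utilization)


def expected_lost_node_hours(checkpoint_interval_hours: float, interruptions: int) -> float:
    """Expected rework, assuming each interruption loses half an interval on average."""
    return interruptions * checkpoint_interval_hours / 2


NODE_PRICE = 32.0  # hypothetical price per node-hour, for illustration only

for utilization in (0.5, 0.8):
    effective = cost_per_useful_gpu_hour(NODE_PRICE, 8, utilization)
    print(f"utilization {utilization:.0%}: {effective:.2f} per useful GPU-hour")

for interval in (4.0, 0.5):
    lost = expected_lost_node_hours(interval, interruptions=6)
    print(f"checkpoint every {interval} h: about {lost:.1f} node-hours lost")
```

The same structure applies to right-sizing: plug in the price and measured utilization of a smaller configuration and compare the cost per useful GPU-hour directly.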
Conclusion
The 8x H100 NVLink node earns its place as the workhorse of cloud AI training because of its interconnect, not just its eight powerful GPUs. NVLink and NVSwitch turn the node into something close to a single large accelerator with a pooled memory space, which is exactly what large-model training needs. When you compare providers, weigh the fabric inside the node, the network between nodes, and the host and storage that feed it. Get those right, keep utilization high, and a single 8x H100 node will carry a remarkable amount of training before you ever need to think about a larger cluster.