Tensor Parallelism for LLM Inference

Some models are simply too large to fit in the memory of a single GPU. When the weights plus the working memory for activations and the attention cache exceed what one accelerator can hold, you must spread the model across several GPUs. Tensor parallelism is the most common way to do this for inference. It splits the math inside each layer across multiple GPUs so they work on the same request together. Understanding how it works, and what it costs in communication overhead, helps you choose the right hardware configuration and avoid paying for GPUs that sit underused.

Why One GPU Is Not Always Enough

A model's memory footprint comes from more than its weights. You also need room for activations during computation and for the key-value cache that stores attention state for every token in the context. As models grow into the tens or hundreds of billions of parameters, and as context windows lengthen, the total can far exceed a single GPU's memory. At that point you have two choices: use a smaller or more heavily quantized model, or split the model across multiple GPUs. Tensor parallelism is the standard answer when you need the full model.
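A rough footprint estimate makes the point concrete. The sketch below is back-of-the-envelope arithmetic only: the model shape (70 billion parameters, 80 layers, 8 key-value heads of dimension 128), the 16-bit precision, and the 80 GiB GPU are all illustrative assumptions, and activation memory is left out entirely.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bytes_per_param=2):
    """Memory for the weights alone (2 bytes per parameter at 16-bit)."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   batch_size, bytes_per_value=2):
    """Keys plus values for every layer, head, token, and sequence."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_tokens * batch_size * bytes_per_value)


# Hypothetical model shape and workload, chosen only for illustration.
weights = weight_bytes(70_000_000_000)
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                       context_tokens=8192, batch_size=16)
gpu_memory = 80 * GIB  # assumed single-GPU capacity

total = weights + cache
fits_on_one_gpu = total <= gpu_memory
print(f"weights {weights / GIB:.1f} GiB, cache {cache / GIB:.1f} GiB, "
      f"fits on one GPU: {fits_on_one_gpu}")
```

Even before activations are counted, the weights alone exceed the assumed GPU, and the cache grows linearly with both context length and concurrency.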

How Tensor Parallelism Works

Inside a transformer layer, the heavy operations are large matrix multiplications. Tensor parallelism partitions these matrices across GPUs so each device computes a slice of the result. The attention heads and the feed-forward projections are divided so that, for example, four GPUs each handle a quarter of the work in every layer. After each split block, the GPUs must combine their partial results, which requires them to exchange data. In the common Megatron-style layout this means one all-reduce after the attention block and another after the feed-forward block, so the exchange recurs in every layer for every token generated. That repeated synchronization is the central cost of the technique.
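To see why the partial results can be combined with a single sum, here is a minimal single-process sketch of a feed-forward block split the usual way: the first weight matrix by columns, the second by rows. Everything here is a simulation with plain Python lists. The function names are made up for this example, the "GPUs" are just loop iterations, and `all_reduce_sum` stands in for the collective a real framework would run over the interconnect.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def split_columns(m, parts):
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m]
            for i in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Element-wise sum of one partial matrix per simulated GPU."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def reference_mlp(x, w1, w2):
    return matmul(relu(matmul(x, w1)), w2)


def tensor_parallel_mlp(x, w1, w2, n_gpus):
    w1_shards = split_columns(w1, n_gpus)  # each GPU gets some hidden units
    w2_shards = split_rows(w2, n_gpus)     # and the matching rows of w2
    partials = []
    for w1_i, w2_i in zip(w1_shards, w2_shards):
        hidden_i = relu(matmul(x, w1_i))   # no communication needed here
        partials.append(matmul(hidden_i, w2_i))
    return all_reduce_sum(partials)        # the one exchange for this block


x = [[1, -2, 3, 0]]
w1 = [[1, 0, -1, 2], [0, 1, 2, -1], [1, -1, 0, 1], [2, 1, 1, 0]]
w2 = [[1, 2], [0, -1], [3, 1], [-2, 0]]
same = tensor_parallel_mlp(x, w1, w2, n_gpus=2) == reference_mlp(x, w1, w2)
print("sharded result matches unsharded:", same)
```

Splitting the first matrix by columns lets the nonlinearity run locally on each shard, which is why only one exchange is needed for the whole block rather than one per matrix multiplication.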

The Communication Tax

Because the GPUs must synchronize their partial results repeatedly, tensor parallelism is extremely sensitive to the speed of the link between GPUs. On a server where GPUs are connected by a high-bandwidth interconnect within a single node, the communication overhead is manageable and you get a real speedup. Spread the same model across GPUs in different servers connected only by ordinary networking, and the communication can dominate, leaving expensive GPUs waiting on each other. This is why tensor parallelism is almost always kept within a single node with fast interconnects.
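A simple latency-plus-bandwidth model shows how quickly a slow link comes to dominate. The sketch below uses the textbook ring all-reduce estimate and assumes two all-reduces per layer. The bandwidth and latency figures are placeholders rather than measurements of any real hardware, and real communication libraries may choose other algorithms.

```python
def ring_all_reduce_seconds(message_bytes, n_gpus,
                            bandwidth_bytes_per_s, latency_s):
    """Latency-plus-bandwidth estimate for one ring all-reduce."""
    if n_gpus < 2:
        return 0.0
    steps = 2 * (n_gpus - 1)  # reduce-scatter phase, then all-gather phase
    bytes_per_step = message_bytes / n_gpus
    return steps * (latency_s + bytes_per_step / bandwidth_bytes_per_s)


def comm_seconds_per_step(batch_tokens, hidden_size, n_layers, n_gpus,
                          bandwidth_bytes_per_s, latency_s,
                          bytes_per_value=2):
    """Total all-reduce time for one forward step through the model."""
    message = batch_tokens * hidden_size * bytes_per_value
    all_reduces = 2 * n_layers  # assumed: one per attention, one per MLP
    return all_reduces * ring_all_reduce_seconds(
        message, n_gpus, bandwidth_bytes_per_s, latency_s)


# Placeholder link characteristics, for illustration only.
fast_link = comm_seconds_per_step(16, 8192, 80, 4,
                                  bandwidth_bytes_per_s=300e9, latency_s=5e-6)
slow_link = comm_seconds_per_step(16, 8192, 80, 4,
                                  bandwidth_bytes_per_s=12.5e9, latency_s=50e-6)
print(f"fast link: {fast_link * 1e3:.2f} ms per step")
print(f"slow link: {slow_link * 1e3:.2f} ms per step")
```

Because the cost is paid on every generation step, a per-step difference of a few milliseconds compounds into a large gap in tokens per second.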

Tensor Parallelism Versus Pipeline Parallelism

Tensor parallelism is one of several ways to split a model. The main alternative for inference is pipeline parallelism, and they solve different problems.

Dimension	Tensor parallelism	Pipeline parallelism
What is split	Each layer split across GPUs	Different layers on different GPUs
Communication	Frequent, within every layer	Less frequent, between stages
Interconnect need	Very high bandwidth required	More tolerant of slower links
Latency impact	Lower latency per request	Can add pipeline bubbles
Best for	Fitting one model, low latency	Spanning many GPUs or nodes

In practice, large deployments often combine the two: tensor parallelism within a node, where the interconnect is fast, and pipeline parallelism across nodes, where it is slower. The combination lets you serve very large models while keeping the chatty tensor-parallel communication on the fast links.
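One way to picture the combined layout is as a mapping from each GPU to a pipeline stage and a position within that stage's tensor-parallel group. The sketch below assumes one pipeline stage per node and an even split of layers. The function and field names are invented for illustration and do not correspond to any particular serving framework.

```python
def parallel_layout(n_nodes, gpus_per_node, n_layers):
    """Assign every GPU a pipeline stage (its node) and a tensor rank."""
    layers_per_stage = -(-n_layers // n_nodes)  # ceiling division
    layout = []
    for node in range(n_nodes):
        first = node * layers_per_stage
        last = min(n_layers, first + layers_per_stage)
        for local_gpu in range(gpus_per_node):
            layout.append({
                "global_rank": node * gpus_per_node + local_gpu,
                "pipeline_stage": node,     # crosses the slow link
                "tensor_rank": local_gpu,   # stays on the fast link
                "layers": range(first, last),
            })
    return layout


for gpu in parallel_layout(n_nodes=2, gpus_per_node=4, n_layers=80):
    print(gpu["global_rank"], gpu["pipeline_stage"], gpu["tensor_rank"],
          f"layers {gpu['layers'].start}-{gpu['layers'].stop - 1}")
```

GPUs that share a pipeline stage hold slices of the same layers and synchronize constantly, while traffic between stages crosses the node boundary only once per stage hand-off.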

What This Means for Cost

Splitting a model across GPUs has a direct cost consequence: you are now renting several accelerators to serve what is logically one model. Two factors decide whether that is money well spent.

Utilization: if the GPUs spend time waiting on each other due to communication overhead, you are paying for idle silicon. Fast interconnects keep them busy.
Batching: serving many requests at once across the parallel GPUs spreads the fixed overhead over more useful work, improving cost per token.

The takeaway is that the GPU type and the interconnect between GPUs matter as much as the raw count. A configuration with fewer GPUs on a fast interconnect can outperform more GPUs on a slow one, both in latency and in cost per token, because the fast link keeps the expensive hardware productive.
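The comparison comes down to simple arithmetic once you have measured throughput. In the sketch below, the GPU counts, hourly prices, and tokens-per-second figures are invented placeholders chosen to illustrate the calculation. Substitute your own measurements.

```python
def cost_per_million_tokens(n_gpus, price_per_gpu_hour, tokens_per_second):
    """Dollars spent to generate one million tokens at a given throughput."""
    dollars_per_hour = n_gpus * price_per_gpu_hour
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000


# Hypothetical configurations; none of these numbers are benchmarks.
fewer_gpus_fast_link = cost_per_million_tokens(
    n_gpus=4, price_per_gpu_hour=3.00, tokens_per_second=2400)
more_gpus_slow_link = cost_per_million_tokens(
    n_gpus=8, price_per_gpu_hour=2.50, tokens_per_second=2000)

print(f"4 GPUs, fast link: ${fewer_gpus_fast_link:.2f} per million tokens")
print(f"8 GPUs, slow link: ${more_gpus_slow_link:.2f} per million tokens")
```

The configuration with more GPUs loses on cost whenever its extra hardware does not buy proportionally more throughput, which is exactly what a slow interconnect tends to cause.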

Practical Guidance for Choosing a Setup

Estimate the model's full memory footprint including weights, activations, and the attention cache at your target context length.
If it fits comfortably on one GPU, do not split it. Parallelism you do not need adds only overhead.
If it does not fit, prefer tensor parallelism within a single node connected by a high-bandwidth interconnect.
Only span multiple nodes when one node cannot hold the model, and use pipeline parallelism across that slower boundary.
Batch requests aggressively to amortize the communication overhead across more work.
Measure GPU utilization. Idle GPUs waiting on communication are the clearest sign your configuration is wrong.
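The sizing part of this checklist can be sketched as a small decision function. It is a simplification under stated assumptions: memory divides evenly across GPUs, tensor-parallel degrees are powers of two (as is `gpus_per_node`), and 10% of each GPU is held back as headroom. The function name and return shape are hypothetical.

```python
import math


def choose_setup(footprint_gib, gpu_mem_gib, gpus_per_node, headroom=0.9):
    """Pick tensor- and pipeline-parallel degrees from a memory estimate."""
    usable = gpu_mem_gib * headroom  # assumed safety margin per GPU
    if footprint_gib <= usable:
        return {"tensor_parallel": 1, "pipeline_parallel": 1}

    # Grow the tensor-parallel group inside one node first.
    tp = 2
    while tp < gpus_per_node and tp * usable < footprint_gib:
        tp *= 2
    if tp <= gpus_per_node and tp * usable >= footprint_gib:
        return {"tensor_parallel": tp, "pipeline_parallel": 1}

    # One node is not enough: fill each node, then pipeline across nodes.
    node_capacity = gpus_per_node * usable
    stages = math.ceil(footprint_gib / node_capacity)
    return {"tensor_parallel": gpus_per_node, "pipeline_parallel": stages}


print(choose_setup(footprint_gib=60, gpu_mem_gib=80, gpus_per_node=8))
print(choose_setup(footprint_gib=171, gpu_mem_gib=80, gpus_per_node=8))
print(choose_setup(footprint_gib=1000, gpu_mem_gib=80, gpus_per_node=8))
```

A function like this only proposes a starting point. The last two steps of the checklist, batching and measuring utilization, still have to be checked on the running system.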

Quantization as an Alternative

Before reaching for multiple GPUs, consider whether a quantized version of the model would fit on fewer. Reducing weight precision shrinks the memory footprint and can let a model that needed several GPUs run on one or two, eliminating most of the communication overhead. The tradeoff is a potential, usually small, quality reduction that you should validate on your own workload. When quality holds, quantization is often the cheapest path because it sidesteps the parallelism tax entirely.
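The effect on GPU count is easy to estimate. The sketch below considers weights only and ignores the small extra storage that quantization schemes need for scales and similar metadata. The 70-billion-parameter model, the 80 GB GPU, and the 20 GB reserved for cache and activations are assumptions for illustration.

```python
def quantized_weight_bytes(n_params, bits_per_param):
    """Weight memory at a given precision, ignoring quantization metadata."""
    return n_params * bits_per_param // 8


def gpus_needed(weight_bytes, gpu_bytes, reserved_bytes):
    """Smallest GPU count whose usable memory holds the weights."""
    usable = gpu_bytes - reserved_bytes
    return -(-weight_bytes // usable)  # ceiling division


N_PARAMS = 70_000_000_000      # hypothetical model size
GPU_BYTES = 80_000_000_000     # assumed GPU capacity
RESERVED = 20_000_000_000      # assumed room for cache and activations

gpu_counts = {}
for name, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    size = quantized_weight_bytes(N_PARAMS, bits)
    gpu_counts[name] = gpus_needed(size, GPU_BYTES, RESERVED)
    print(f"{name}: {size / 1e9:.0f} GB of weights, "
          f"{gpu_counts[name]} GPU(s)")
```

Under these assumptions each halving of precision removes at least one GPU from the group, and with it a share of the synchronization traffic.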

How Parallelism Interacts With the Attention Cache

One detail that surprises teams is how tensor parallelism interacts with the key-value cache that holds attention state for the context. When the model is split across GPUs, that cache is also distributed: because the attention heads are partitioned, each GPU stores the keys and values only for its own heads. This is helpful because it spreads the memory burden of long context across the devices, allowing larger context windows than a single GPU could support. But serving long context at high concurrency still pushes against the combined memory of the group, and sharding the cache does nothing to reduce the per-layer synchronization of activations. When you size a tensor-parallel deployment, account for the attention cache at your target context length and concurrency, not just the model weights, or you will run out of memory under real load.
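A per-GPU view of the cache shows how concurrency is bounded. The sketch below assumes the cache is sharded by attention head, that a GPU keeps at least one key-value head even when there are fewer heads than GPUs, and that weights divide evenly across the group. The model shape, the 80 GiB GPU, and the 8 GiB activation reserve are illustrative assumptions.

```python
GIB = 1024 ** 3


def kv_bytes_per_token_per_gpu(n_layers, n_kv_heads, head_dim, tp_size,
                               bytes_per_value=2):
    """Cache bytes one GPU stores for each token of context."""
    # Assumed: heads are split across the group, never below one per GPU.
    heads_here = max(1, n_kv_heads // tp_size)
    return 2 * n_layers * heads_here * head_dim * bytes_per_value


def max_concurrent_sequences(gpu_bytes, weight_bytes, tp_size,
                             reserved_bytes, per_token_bytes,
                             context_tokens):
    """How many full-length sequences fit in the memory left on one GPU."""
    free = gpu_bytes - weight_bytes // tp_size - reserved_bytes
    return max(0, free // (per_token_bytes * context_tokens))


per_token = kv_bytes_per_token_per_gpu(n_layers=80, n_kv_heads=8,
                                       head_dim=128, tp_size=4)
limit = max_concurrent_sequences(gpu_bytes=80 * GIB,
                                 weight_bytes=140_000_000_000,
                                 tp_size=4,
                                 reserved_bytes=8 * GIB,
                                 per_token_bytes=per_token,
                                 context_tokens=8192)
print(f"{per_token} cache bytes per token per GPU, "
      f"up to {limit} sequences at 8192 tokens")
```

Doubling the context length in this model halves the number of sequences that fit, which is why sizing from weights alone tends to fail under real load.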

Batching ties all of this together. Every step pays the synchronization cost whatever the batch size, and at small batches that cost is mostly fixed latency rather than data volume, so processing more requests in the same step spreads it over more useful output. A tensor-parallel deployment that serves one request at a time pays the full communication tax for very little work, which is the worst possible efficiency. The same hardware serving a healthy batch amortizes the overhead and delivers a far better cost per token. This is why throughput-oriented serving and tensor parallelism go hand in hand.
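A toy model makes the amortization visible. It assumes each step costs a fixed overhead plus a small increment per request in the batch. Both timing constants and the GPU price are invented for illustration, and real step times stop scaling this cleanly once the GPUs become compute-bound.

```python
def step_seconds(batch_size, fixed_overhead_s=0.004, per_request_s=0.0002):
    """Toy step time: fixed synchronization cost plus per-request work."""
    return fixed_overhead_s + per_request_s * batch_size


def dollars_per_token(batch_size, n_gpus, price_per_gpu_hour):
    """Each step emits one token per request in the batch."""
    dollars_per_second = n_gpus * price_per_gpu_hour / 3600
    return dollars_per_second * step_seconds(batch_size) / batch_size


def overhead_share(batch_size, fixed_overhead_s=0.004, per_request_s=0.0002):
    """Fraction of the step spent on the fixed overhead."""
    total = step_seconds(batch_size, fixed_overhead_s, per_request_s)
    return fixed_overhead_s / total


for batch in (1, 8, 32):
    cost = dollars_per_token(batch, n_gpus=4, price_per_gpu_hour=3.00)
    print(f"batch {batch:>2}: ${cost * 1e6:.2f} per million tokens, "
          f"{overhead_share(batch):.0%} of step is fixed overhead")
```

In this model the fixed share of each step shrinks as the batch grows, and cost per token falls with it until the per-request term dominates.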

Tensor parallelism is a powerful tool for serving models that no single GPU can hold, but it is not free. The communication between GPUs is its defining cost, and that cost is governed by the interconnect. Keep tensor-parallel groups on fast links, batch heavily to stay efficient, account for the attention cache when sizing memory, and always check whether a smaller footprint through quantization would let you avoid splitting the model at all. Match the parallelism strategy to the hardware, and you serve large models without burning money on GPUs that wait instead of work.
