GPU Sizing for LLM Serving

The first hard constraint in serving a language model is whether it fits in GPU memory at all. Pick a GPU with too little VRAM and the model simply will not load, or it loads but cannot serve more than a handful of users. Pick one with far more memory than you need and you overpay for idle silicon. Right-sizing is a budgeting exercise: you account for the memory the weights consume, the memory the KV cache consumes under load, and a margin for overhead, then match that total to available hardware.

Three claims on VRAM

GPU memory is spent on three things during inference, and you must budget for all of them.

Model weights: a fixed amount set by the parameter count and the precision you load them in.
KV cache: a variable amount that grows with context length and the number of concurrent requests.
Overhead: activations, the runtime, fragmentation, and headroom, typically a modest but real slice.

Many sizing mistakes come from counting only the weights and forgetting that the cache, under real concurrency, can rival or exceed them.

Estimating weight memory

Weight memory scales with parameter count and precision. At sixteen-bit precision, each parameter takes two bytes, so a model's weights occupy roughly two bytes times its parameter count; a 7-billion-parameter model, for example, needs about 14 GB. Quantizing to eight-bit halves that, and four-bit halves it again, at some quality cost you should validate on your own tasks. The table below gives approximate footprints to build intuition.

Model size (parameters)	16-bit weights (approx.)	8-bit weights (approx.)	4-bit weights (approx.)
7 to 8 billion	About 15 GB	About 8 GB	About 4 GB
13 to 14 billion	About 28 GB	About 14 GB	About 7 GB
70 billion	About 140 GB	About 70 GB	About 35 GB

These are approximations to plan around, not exact figures, since architecture details shift them. The lesson is that precision is a powerful lever: quantization can move a model from needing several GPUs to fitting on one.
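The bytes-per-parameter rule behind the table can be sketched in a few lines of Python. This is a minimal illustration, not a measurement: the function name `weight_memory_gb` is made up for this example, 1 GB is taken as 10^9 bytes, and real checkpoints deviate somewhat because of embeddings, quantization scales, and other architecture details.

```python
# Bytes per parameter at each precision (bits divided by 8).
BYTES_PER_PARAM = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight footprint in GB, assuming 1 GB = 1e9 bytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Reproduce the rough shape of the table for three parameter counts.
for num_params in (7e9, 14e9, 70e9):
    row = ", ".join(
        f"{precision}: {weight_memory_gb(num_params, precision):.1f} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{num_params / 1e9:.0f}B parameters -> {row}")
```

The results land within rounding of the table's figures, which is all a planning estimate needs.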

Budgeting the KV cache

After the weights, reserve memory for the KV cache, which holds the attention keys and values for every token of every in-flight request. Its size grows linearly with context length and concurrency, so estimate it from your real workload: how long are typical prompts and outputs, and how many requests run at once. Multiply per-request cache by target concurrency to get total cache demand. This is the part that determines how many users a GPU can actually serve, not just whether the model loads. A model that fits in weights but leaves no room for cache will reject requests the moment traffic arrives.
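A rough per-request estimate follows from the model's shape: for standard attention, each token stores one key vector and one value vector in every layer. The sketch below assumes that layout; the model shape (32 layers, 8 KV heads, head dimension 128) and the workload (8,192 tokens per request, 16 concurrent requests) are hypothetical values chosen only for illustration, and serving runtimes may allocate cache in blocks that add a little on top.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: float = 2.0) -> float:
    """Cache bytes stored per token, at 16-bit cache precision by default."""
    # Leading factor of 2: one key vector and one value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_gb(tokens_per_request: int, concurrency: int,
                bytes_per_token: float) -> float:
    """Total cache in GB (1 GB = 1e9 bytes) if every request uses its full length."""
    return tokens_per_request * concurrency * bytes_per_token / 1e9


# Hypothetical model shape and workload, for illustration only.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_request_gb = kv_cache_gb(8192, 1, per_token)
total_gb = kv_cache_gb(8192, 16, per_token)
print(f"{per_token / 1024:.0f} KiB per token, {per_request_gb:.2f} GB per request, "
      f"{total_gb:.1f} GB at 16 concurrent requests")
```

Under these assumptions the cache comes to roughly 17 GB, about the same as the 16-bit weights of a 7 to 8 billion parameter model, which is why counting only weights understates the budget.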

A simple sizing formula

Compute weight memory from parameter count and precision.
Estimate per-request cache from your context length and chosen cache precision.
Multiply per-request cache by target concurrency.
Add an overhead margin, often around ten to twenty percent.
Sum the weights, total cache, and overhead, then choose a GPU with at least that much VRAM, or split the model across GPUs.
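The steps above can be sketched as a small calculator. The names `SizingBudget` and `size_budget` are invented for this example, and the overhead margin is applied as a fraction of weights plus cache, which is one reasonable reading of the rule of thumb rather than a fixed convention.

```python
from dataclasses import dataclass


@dataclass
class SizingBudget:
    weights_gb: float
    cache_gb: float
    overhead_gb: float

    @property
    def total_gb(self) -> float:
        return self.weights_gb + self.cache_gb + self.overhead_gb


def size_budget(num_params: float, bytes_per_param: float,
                cache_gb_per_request: float, concurrency: int,
                overhead_fraction: float = 0.15) -> SizingBudget:
    weights_gb = num_params * bytes_per_param / 1e9       # step 1
    cache_gb = cache_gb_per_request * concurrency          # steps 2 and 3
    overhead_gb = overhead_fraction * (weights_gb + cache_gb)  # step 4
    return SizingBudget(weights_gb, cache_gb, overhead_gb)     # step 5 sums them


# Hypothetical inputs: 8B parameters at 16-bit, 1 GB of cache per request,
# 8 concurrent requests, 15 percent overhead.
budget = size_budget(8e9, 2.0, cache_gb_per_request=1.0, concurrency=8)
print(f"weights {budget.weights_gb:.1f} GB + cache {budget.cache_gb:.1f} GB "
      f"+ overhead {budget.overhead_gb:.1f} GB = {budget.total_gb:.1f} GB")
```

With these made-up inputs the total is about 27.6 GB, so a 24 GB card would fall short even though the weights alone would load.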

When one GPU is not enough

If the total exceeds a single GPU's memory, you have options before jumping to bigger hardware.

Quantize the weights to drop the largest term first.
Quantize the KV cache, for example from sixteen-bit to eight-bit, to roughly halve cache memory.
Reduce max context or concurrency per replica and add replicas instead.
Use tensor parallelism to split one large model across multiple GPUs when it genuinely cannot fit, accepting some communication overhead.

Spreading a model across GPUs raises throughput ceilings but adds interconnect cost and complexity, so reach for it when the model truly will not fit or when a single GPU cannot meet your throughput target.
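To see how these levers interact, the sketch below compares a lower bound on GPU count across a few quantization choices. All inputs are assumptions for illustration: a 70-billion-parameter model with the approximate weight footprints from the table, a 40 GB cache at 16-bit, a 15 percent overhead margin, and GPUs with 80 GB of VRAM. The function `gpus_needed` is hypothetical and only divides the budget by per-GPU memory; real tensor-parallel setups usually constrain the GPU count further (for instance, to values that divide the attention head count), so the practical number may be higher.

```python
import math


def gpus_needed(weights_gb: float, cache_gb: float,
                gpu_vram_gb: float = 80.0,
                overhead_fraction: float = 0.15) -> int:
    """Lower bound on GPU count: total budget over per-GPU VRAM, rounded up."""
    total_gb = (weights_gb + cache_gb) * (1 + overhead_fraction)
    return math.ceil(total_gb / gpu_vram_gb)


# (weights GB, cache GB) per scenario; the cache figures are assumed.
scenarios = {
    "16-bit weights, 16-bit cache": (140.0, 40.0),
    "8-bit weights, 16-bit cache": (70.0, 40.0),
    "8-bit weights, 8-bit cache": (70.0, 20.0),
    "4-bit weights, 8-bit cache": (35.0, 20.0),
}
gpu_counts = {name: gpus_needed(w, c) for name, (w, c) in scenarios.items()}
for name, count in gpu_counts.items():
    print(f"{name}: at least {count} GPU(s)")
```

In this example the count drops from three GPUs to one as precision is reduced, which is the case for trying quantization before tensor parallelism, provided the quality cost is acceptable.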

Matching hardware to the budget

Once you have a memory budget, choose hardware with enough VRAM plus headroom for traffic growth, and weigh memory bandwidth too, since generation speed is often bound by how fast the GPU can read weights and cache. A GPU that fits the model but has low bandwidth may serve tokens slowly. Balance VRAM, bandwidth, and price rather than chasing VRAM alone.

A worked sizing example

Imagine serving a model of roughly seven billion parameters, whose weights take about fourteen gigabytes at sixteen-bit precision. Suppose your product allows moderate context lengths and you want to serve a handful of concurrent requests per replica. After estimating per-request cache from your context length and multiplying by that concurrency, you might find the cache budget rivals the weights. Add overhead and the total could comfortably exceed a small GPU but fit a mid-tier one with headroom. If you instead quantize the weights to eight-bit, the weight term roughly halves to about seven gigabytes, freeing memory that you could spend on more concurrency or a longer context window. This is the everyday tradeoff: precision buys you either smaller hardware or more headroom on the same hardware.
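Putting hypothetical numbers on this example makes the tradeoff concrete. The sketch below assumes 2 GB of cache per request, a 15 percent overhead margin, and two made-up GPU sizes of 24 GB and 40 GB; `max_concurrency` is an illustrative helper that solves the budget for the number of requests that fit.

```python
import math


def max_concurrency(gpu_vram_gb: float, weights_gb: float,
                    cache_gb_per_request: float,
                    overhead_fraction: float = 0.15) -> int:
    """Largest request count whose total budget still fits in VRAM."""
    # Memory left for cache after removing overhead and weights.
    usable_gb = gpu_vram_gb / (1 + overhead_fraction) - weights_gb
    if usable_gb <= 0:
        return 0
    return math.floor(usable_gb / cache_gb_per_request)


for gpu_vram_gb in (24.0, 40.0):
    for label, weights_gb in (("16-bit", 14.0), ("8-bit", 7.0)):
        n = max_concurrency(gpu_vram_gb, weights_gb, cache_gb_per_request=2.0)
        print(f"{gpu_vram_gb:.0f} GB GPU, {label} weights: up to {n} concurrent requests")
```

Under these assumptions, six concurrent requests at sixteen-bit need the 40 GB card, while eight-bit weights either fit the same six requests on the 24 GB card or raise the 40 GB card's ceiling from ten requests to thirteen.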

Leaving room to grow

Static sizing for today's load is a trap, because traffic grows and context features expand. Build in headroom so a modest increase in concurrency or context length does not immediately force a hardware change or start rejecting requests. A common practice is to size for a realistic near-term peak rather than the current average, then rely on autoscaling to add replicas beyond that. It is also wise to re-run your sizing math whenever you change models, adjust quantization, or raise context limits, since any of those shifts the budget. Treat sizing as a living calculation tied to your real workload rather than a one-time decision, and you will avoid both the painful surprise of an out-of-memory failure under load and the slow bleed of paying for capacity you never use.
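One way to quantify headroom is to ask how much the cache can grow before the budget no longer fits. The sketch below assumes cache demand scales with the product of concurrency growth and context growth, which is the worst case where every request uses the longer context. The helper names and the example figures (a 40 GB GPU, 14 GB of weights, 12 GB of cache today, 15 percent overhead) are hypothetical.

```python
def cache_growth_headroom(gpu_vram_gb: float, weights_gb: float,
                          cache_gb_today: float,
                          overhead_fraction: float = 0.15) -> float:
    """Factor by which today's cache can grow before VRAM is exhausted."""
    usable_for_cache_gb = gpu_vram_gb / (1 + overhead_fraction) - weights_gb
    return usable_for_cache_gb / cache_gb_today


def fits_after_growth(headroom: float, concurrency_growth: float,
                      context_growth: float) -> bool:
    # Worst case: every request grows to the longer context.
    return concurrency_growth * context_growth <= headroom


headroom = cache_growth_headroom(40.0, 14.0, 12.0)
print(f"cache can grow about {headroom:.2f}x")
print("25% more users and 25% longer context fits:",
      fits_after_growth(headroom, 1.25, 1.25))
print("50% more users and 50% longer context fits:",
      fits_after_growth(headroom, 1.5, 1.5))
```

Re-running a check like this after each model, quantization, or context-limit change is a cheap way to keep the sizing calculation current.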

Memory bandwidth deserves a closer look

VRAM capacity tells you whether the model fits, but memory bandwidth often tells you how fast it serves. Token generation is frequently memory-bound, meaning the GPU spends much of each step reading weights and cache from memory rather than doing arithmetic. Two GPUs with identical VRAM can deliver very different tokens per second if their bandwidth differs. When you compare hardware, look at both numbers together: enough capacity to hold weights plus cache, and enough bandwidth to hit your throughput and latency targets. Paying for a high-bandwidth GPU can be worth it even when a cheaper card has the same capacity, because the faster card serves more tokens per second and therefore lowers your effective cost per token. Conversely, if your workload is latency-tolerant and throughput-light, a lower-bandwidth card with ample capacity may be the more economical fit.
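A back-of-the-envelope bound shows why bandwidth matters. The sketch below assumes decoding is purely memory-bound: each step reads the full weights once plus every active request's cache, and emits one token per request. It ignores compute time, kernel launch overhead, and prefill, so it is an upper bound rather than a prediction. The function name and the two bandwidth figures are hypothetical.

```python
def decode_tokens_per_second_bound(bandwidth_gb_s: float, weights_gb: float,
                                   cache_gb_per_request: float,
                                   batch_size: int) -> float:
    """Upper bound on aggregate decode throughput under a memory-bound model."""
    # Data read per decode step: weights once, plus each request's cache.
    gb_read_per_step = weights_gb + batch_size * cache_gb_per_request
    steps_per_second = bandwidth_gb_s / gb_read_per_step
    # Each step yields one token for every request in the batch.
    return steps_per_second * batch_size


# Two hypothetical cards with the same VRAM but different bandwidth.
for bandwidth_gb_s in (1000.0, 2000.0):
    bound = decode_tokens_per_second_bound(
        bandwidth_gb_s, weights_gb=16.0, cache_gb_per_request=1.0, batch_size=8
    )
    print(f"{bandwidth_gb_s:.0f} GB/s -> at most {bound:.0f} tokens/s across the batch")
```

In this simplified model, doubling bandwidth doubles the throughput ceiling at identical capacity, which is the comparison to weigh against the price difference between cards.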

Conclusion

GPU sizing for LLM serving is a three-part memory budget: fixed weights, variable KV cache, and overhead. Estimate the weights from parameter count and precision, the cache from your real context length and concurrency, and add a margin, then pick hardware that holds the sum with room to grow. Quantization is the strongest lever for shrinking both weights and cache, and when a model still will not fit, splitting across GPUs or trimming per-replica concurrency keeps you serving. Budget all three claims on VRAM up front and you avoid both the model that will not load and the GPU you overpaid for.
