KV Cache Explained: How It Drives Inference Memory and Cost
The key-value cache stores attention state during generation and consumes GPU memory that scales with context and concurrency. This guide explains how it works and how to manage its cost.
Ask why a model that fits comfortably in GPU memory still runs out of room under load, and the answer is almost always the key-value cache. The KV cache is the memory a transformer uses to remember its own attention computations while it generates a response. It is invisible in a model size chart, yet it often consumes more memory than the weights once you have many users with long contexts. Understanding how it grows is the difference between guessing at GPU sizing and planning it.
What the KV cache stores
A transformer generates text one token at a time. For each new token, the attention mechanism needs to look back at every previous token in the sequence. Rather than recomputing the key and value vectors for all prior tokens on every step, the model stores them once and reuses them. That stored set of key and value vectors is the KV cache. It is a pure speed optimization: without it, generation would be far slower because each step would redo all the work of the steps before it.
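As a minimal sketch of that reuse, the toy below runs single-head attention over five made-up token vectors twice: once appending each token's key and value to a cache, and once re-projecting every earlier token at each step. The dimensions, random weights, and helper names (`matvec`, `attend`) are illustrative assumptions, not part of any real model or library.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def attend(query, keys, values):
    """Single-head attention for one query over all stored keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * value[i] for e, value in zip(exps, values))
            for i in range(dim)]

random.seed(0)
DIM = 4  # toy head dimension (assumption)

def random_matrix():
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]

w_query, w_key, w_value = random_matrix(), random_matrix(), random_matrix()
tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]

# With a cache: project each token's key and value once, then reuse them.
key_cache, value_cache, cached_outputs = [], [], []
cached_projections = 0
for token in tokens:
    key_cache.append(matvec(w_key, token))
    value_cache.append(matvec(w_value, token))
    cached_projections += 1
    cached_outputs.append(attend(matvec(w_query, token), key_cache, value_cache))

# Without a cache: re-project every earlier token at every step.
recomputed_outputs = []
recomputed_projections = 0
for step in range(1, len(tokens) + 1):
    keys = [matvec(w_key, t) for t in tokens[:step]]
    values = [matvec(w_value, t) for t in tokens[:step]]
    recomputed_projections += step
    recomputed_outputs.append(
        attend(matvec(w_query, tokens[step - 1]), keys, values))

print(cached_outputs == recomputed_outputs)        # both paths give the same outputs
print(cached_projections, recomputed_projections)  # 5 versus 15 key/value projections
```

The outputs match, but the uncached path does work that grows quadratically with sequence length, while the cached path trades that work for memory that grows by one key and one value per token.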
The catch is that the cache grows with every token generated and with every token in the prompt. A long conversation or a large retrieved context means a large cache, held in GPU memory for the entire duration of that request.
Why it dominates memory under load
The size of the cache scales with several factors at once, and they multiply.
- Sequence length: longer prompts and longer outputs mean more tokens to remember.
- Concurrency: every simultaneous request keeps its own cache, so memory scales with the number of active users.
- Model depth and width: more layers and larger hidden dimensions mean more vectors stored per token.
- Precision: memory scales with the bytes used per stored value, so a half-precision cache uses half the memory of a full-precision one.
Multiply a long context by many concurrent users and the cache can easily exceed the size of the weights themselves. This is why a GPU that loads the model fine still rejects new requests under load: it has run out of room for caches, not weights.
| Driver | Effect on KV cache | Lever you control |
|---|---|---|
| Context length | Linear growth | Trim prompts, cap max tokens |
| Concurrent requests | Linear growth | Limit concurrency per GPU |
| Cache precision | Proportional to bytes per value | Quantize the cache |
| Model architecture | Fixed per model | Choose efficient attention designs |
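These drivers combine into a simple product. The sketch below uses the standard sizing arithmetic: two tensors (keys and values) per layer, each holding one vector per key-value head per token. The function name and the example model shape (32 layers, 32 key-value heads, head dimension 128, sixteen-bit cache) are illustrative assumptions, not figures from a specific deployment.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element=2, num_requests=1):
    """Estimate KV cache size in bytes.

    The leading 2 accounts for storing both a key and a value vector
    for every token, in every layer, for every key-value head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * num_requests

GIB = 1024 ** 3

# Hypothetical model shape, sixteen-bit cache (2 bytes per element).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
one_request = kv_cache_bytes(32, 32, 128, seq_len=4096)
sixteen_requests = kv_cache_bytes(32, 32, 128, seq_len=4096, num_requests=16)

print(per_token // 1024, "KiB per token")
print(one_request / GIB, "GiB for one 4096-token request")
print(sixteen_requests / GIB, "GiB for sixteen such requests")
```

Under these assumptions a single 4096-token request holds 2 GiB of cache, and sixteen of them hold 32 GiB, which is more than the sixteen-bit weights of a model in the seven-billion-parameter range.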
Techniques that reduce KV cache cost
Because the cache is the binding constraint on concurrency, the techniques that shrink it directly increase how many users a single GPU can serve.
Paged attention
Traditional caches reserve a contiguous block of memory sized for the maximum possible sequence, which wastes space on shorter requests. Paged attention allocates cache in small fixed-size pages on demand, much like virtual memory in an operating system. This drastically cuts fragmentation and lets more requests share the same GPU.
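The bookkeeping behind on-demand pages can be sketched in a few lines. The class below is a simplified illustration, not the implementation of any particular serving engine: the name `PagedKVAllocator`, the page size of 16 tokens, and the pool of 64 pages are all assumptions, and it tracks only page ownership, not the tensors themselves.

```python
class PagedKVAllocator:
    """Toy page-table bookkeeping for a paged KV cache."""

    def __init__(self, total_pages, page_size):
        self.page_size = page_size               # tokens per page
        self.free_pages = list(range(total_pages))
        self.page_tables = {}                    # request id -> list of page ids
        self.lengths = {}                        # request id -> tokens stored

    def append_token(self, request_id):
        """Reserve room for one more token, taking a new page only when needed."""
        table = self.page_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.page_size:  # no page yet, or last page full
            if not self.free_pages:
                raise MemoryError("no free KV cache pages")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        """Return a finished request's pages to the free pool."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

allocator = PagedKVAllocator(total_pages=64, page_size=16)
for _ in range(40):
    allocator.append_token("request-a")   # 40 tokens fit in 3 pages
for _ in range(10):
    allocator.append_token("request-b")   # 10 tokens fit in 1 page

pages_in_use = 64 - len(allocator.free_pages)
print(pages_in_use)  # 4 pages in use
```

For comparison, reserving contiguous space for a 512-token maximum would take 32 pages per request under the same page size, so these two short requests alone would fill the whole 64-page pool.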
Cache quantization
Storing the keys and values in lower precision, such as eight-bit instead of sixteen-bit, can roughly halve cache memory with modest quality impact for many workloads. Always validate quality on your own data before adopting it broadly.
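To make the trade-off concrete, here is a minimal sketch of one simple scheme: symmetric eight-bit quantization with a single scale per vector. Serving frameworks may use different schemes (per-channel scales, other number formats), so treat the function names and the example vector as assumptions for illustration only.

```python
def quantize_int8(vector):
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    scale = max(abs(x) for x in vector) / 127 or 1.0  # fall back to 1.0 for all-zero input
    codes = [round(x / scale) for x in vector]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

key_vector = [0.82, -1.94, 0.03, 1.27, -0.55]   # made-up values
codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))

print(codes)
print(max_error <= scale / 2 + 1e-12)  # rounding error is at most half a step
```

Each code needs one byte instead of two, and the per-vector scale adds a small overhead, which is why the saving is roughly, not exactly, half. The rounding error shown here is the source of the quality impact that needs validating.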
Efficient attention variants
Architectures that share key and value heads across query heads, such as grouped-query attention, store far fewer vectors per token. When you choose a model, this design choice has a direct and lasting effect on serving cost.
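The saving follows directly from the head counts. The sketch below shows how query heads map onto shared key-value heads and how many vectors each token must store; the shape used (32 layers, 32 query heads, 8 key-value heads) is a hypothetical example, and the function names are illustrative.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Return the index of the shared key-value head a query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into key-value heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def kv_vectors_per_token(num_layers, num_kv_heads):
    """Count cached vectors per token: one key and one value per KV head per layer."""
    return 2 * num_layers * num_kv_heads

# Hypothetical shape: 32 query heads sharing 8 key-value heads.
mapping = [kv_head_for_query_head(h, 32, 8) for h in range(32)]
multi_head = kv_vectors_per_token(num_layers=32, num_kv_heads=32)
grouped = kv_vectors_per_token(num_layers=32, num_kv_heads=8)

print(mapping[:8])           # first eight query heads use KV heads 0 and 1
print(multi_head // grouped) # 4 times fewer cached vectors per token
```

With the head dimension unchanged, the cache per token shrinks by the same factor as the number of key-value heads.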
Prefix caching
If many requests share the same long prefix, such as a system prompt or a shared document, the cache for that prefix can be computed once and reused across requests, saving both compute and memory.
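A minimal sketch of the idea, assuming whole-prefix matching on token IDs: the `PrefixCache` class and the `fake_compute_kv` stand-in are hypothetical, and real servers may match prefixes at a finer granularity and evict entries under memory pressure.

```python
class PrefixCache:
    """Toy prefix cache keyed by the exact token sequence of the prefix."""

    def __init__(self, compute_kv):
        self._compute_kv = compute_kv   # expensive prefill function (assumed)
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, prefix_tokens):
        """Return cached KV state for the prefix, computing it on first use."""
        key = tuple(prefix_tokens)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._compute_kv(key)
        return self._store[key]

def fake_compute_kv(tokens):
    """Stand-in for running prefill; returns a placeholder per token."""
    return [("kv", token) for token in tokens]

system_prompt = [101, 7592, 2088, 102]   # made-up token IDs
cache = PrefixCache(fake_compute_kv)

# Three requests that all begin with the same system prompt.
states = [cache.get(system_prompt) for _ in range(3)]

print(cache.misses, cache.hits)   # prefill ran once, was reused twice
print(states[0] is states[2])     # every request shares the same stored state
```

Because all three requests point at one stored object, the prefix occupies memory once instead of three times, and the prefill compute is paid once.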
Planning capacity around the cache
To size GPUs correctly, budget memory in two parts: the weights, which are fixed, and the cache, which scales with your traffic. Estimate the cache per request from your typical context length, multiply by your target concurrency, and add a safety margin. The memory remaining after weights, divided by the per-request cache, gives a rough ceiling on concurrent requests per GPU.
- Measure or estimate average and maximum context length in production.
- Compute cache memory per request at your chosen precision.
- Subtract weight memory from GPU memory to find cache headroom.
- Divide headroom by per-request cache to get a concurrency ceiling.
- Apply paged attention and quantization to raise that ceiling before adding hardware.
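The steps above can be sketched as a short calculation. All figures here are illustrative assumptions (an 80 GiB GPU, 15 GiB of weights, 2 GiB of cache per request, a 10 percent safety margin), and the function name is hypothetical; substitute measurements from your own workload.

```python
def concurrency_ceiling(gpu_bytes, weight_bytes, cache_bytes_per_request,
                        safety_margin_percent=10):
    """Rough upper bound on concurrent requests that fit on one GPU.

    Uses integer arithmetic throughout so results are exact.
    """
    reserved = gpu_bytes * safety_margin_percent // 100   # safety margin
    headroom = gpu_bytes - reserved - weight_bytes        # memory left for caches
    if headroom <= 0:
        return 0
    return headroom // cache_bytes_per_request

GIB = 1024 ** 3

# Hypothetical plan: sixteen-bit cache at a typical 4096-token context.
baseline = concurrency_ceiling(80 * GIB, 15 * GIB, 2 * GIB)

# Same plan after halving the per-request cache with eight-bit quantization.
quantized = concurrency_ceiling(80 * GIB, 15 * GIB, 1 * GIB)

print(baseline, quantized)   # 28 versus 57 concurrent requests
```

This is a ceiling rather than a forecast: it ignores activation memory and allocator overhead, and it assumes every request sits at the typical context length, so size against the maximum context as well if long requests are common.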
How the cache shapes your pricing
The KV cache is not just a memory concern; it is a cost concern, because the concurrency ceiling it sets determines how many users a GPU can serve and therefore your cost per user. Two products on the same model and hardware can have very different unit economics purely because one allows long contexts and the other keeps them short. Long-context features such as document chat or large retrieval windows are expensive precisely because each request holds a large cache for its entire duration, crowding out other requests. When you price a feature, account for its typical context length, since per-request cache grows linearly with it and the achievable concurrency shrinks in proportion.
Practical ways to keep the cache small
- Cap maximum context and output length so a single verbose request cannot monopolize memory.
- Summarize or truncate long histories in conversational apps rather than carrying the full transcript forward indefinitely.
- Reuse shared prefixes so common system prompts or documents are cached once instead of per request.
- Quantize the cache after validating quality, since moving from sixteen-bit to eight-bit roughly halves the dominant per-request term.
- Prefer models with grouped-query attention, which store far fewer key and value vectors per token by design.
Each of these directly raises the number of concurrent requests a GPU can hold, which lowers cost per token. Prefix reuse does so without changing the model's answers, while truncation and cache quantization can affect output quality and should be validated first. Because the cache is usually the binding constraint under load rather than the weights, attacking it is often the single most effective way to serve more users on the same hardware budget.
Conclusion
The KV cache is the hidden tax on every token a model generates, and it, not the weights, usually decides how many users a GPU can serve at once. It grows with context length and concurrency, so any product with long conversations or large retrieved contexts feels it first. The path to lower cost runs through paged attention, cache quantization, efficient attention architectures, and prefix reuse, all of which let you pack more concurrent requests into the same hardware. Treat the cache as a first-class budget item in capacity planning and you will size GPUs with confidence instead of surprise.