How much GPU VRAM do you need?

Running out of GPU memory is the single most common wall engineers hit when they move a model to the cloud, and it is almost always avoidable with a back-of-the-envelope estimate first. VRAM - the high-bandwidth memory soldered to the GPU - has to hold your model weights plus a surprising amount of overhead. This guide gives you the rules of thumb to size VRAM correctly for inference, fine-tuning, and full training, so you rent the right GPU the first time.

The core rule: about 2 GB per billion parameters at FP16

Model weights dominate the memory budget. At 16-bit precision (FP16 or BF16), each parameter takes 2 bytes, so:

FP16/BF16 (2 bytes/param): roughly 2 GB per billion parameters.
FP8 or INT8 (1 byte/param): roughly 1 GB per billion parameters.
INT4 / 4-bit quantized (0.5 bytes/param): roughly 0.5 GB per billion parameters.

So a 7B model needs about 14 GB just for FP16 weights, a 70B model about 140 GB, and a 405B model about 810 GB. That is the floor for inference. Everything else in this guide is overhead on top of it.

Inference: weights plus KV cache

For inference you need the weights plus the KV cache - the stored keys and values for every token in the context window. KV cache grows with sequence length, batch size, and the number of layers, and it can become the dominant cost for long contexts. A few practical points:

Short prompts and small batches: KV cache is modest; budget 10-20% over the weight size.
Long context (tens of thousands of tokens) or large batch: KV cache can rival or exceed the weights, especially without optimizations like grouped-query attention or paged attention.
Quantizing weights to 4-bit shrinks the weight floor dramatically, which is why a 70B model can run on a single 80 GB card at INT4 with room for KV cache.

For bursty inference you do not have to own a full GPU - a serverless GPU option bills only while you generate. Compare the cards that fit your model in the GPU catalog.

Training: the expensive case

Full training (or full fine-tuning) needs far more than weights. With a standard Adam-style optimizer you carry, per parameter:

The weights (2 bytes at FP16, or 4 bytes if kept in FP32).
Gradients (another 2-4 bytes).
Optimizer state - momentum and variance (typically 8 bytes in FP32).
Activations, which scale with batch size and sequence length.

A common rule of thumb: full FP16 training with Adam needs roughly 16-20 GB per billion parameters, before activations. That is why a 7B model that infers happily on 14 GB can need well over 100 GB to train fully, and why large training spans multiple GPUs.

Fine-tuning: LoRA and QLoRA change everything

You rarely need full training. Parameter-efficient fine-tuning freezes the base weights and trains a small number of adapter parameters, collapsing the optimizer and gradient overhead:

LoRA: Keep the base model in FP16 and train low-rank adapters. Memory is dominated by the frozen weights, so it is close to inference cost plus a small adapter overhead.
QLoRA: Quantize the frozen base to 4-bit and train adapters on top. This is the big unlock - a 70B model fine-tunes on a single 80 GB GPU, and a 7B model fits comfortably on a 24 GB card.

For most teams, QLoRA is the difference between needing a multi-GPU node and needing one card.

Model size to GPU mapping

Model size	FP16 inference	QLoRA fine-tune	Full FP16 training
7B	1x 24 GB GPU	1x 24 GB GPU	1x 80 GB GPU
13B	1x 40-48 GB GPU	1x 24-40 GB GPU	2x 80 GB GPUs
70B	1x 80 GB (INT4) or 2x 80 GB FP16	1x 80 GB GPU	8x 80 GB GPUs
180B+	4x 80 GB GPUs (quantized)	2-4x 80 GB GPUs	Multi-node cluster

These are guidelines, not guarantees - context length, batch size, and framework overhead all shift them. Always leave 10-20% headroom. The 80 GB tier maps to the A100 80GB and H100; when you need more memory per card, the H200 at 141 GB lets you run bigger models without sharding.

When one GPU is not enough: multi-GPU sharding

When a model exceeds a single card, you split it across GPUs:

Tensor parallelism: Split individual layers across GPUs within a node. Communication-heavy, so it wants fast NVLink.
Pipeline parallelism: Put different layers on different GPUs. Lower communication, but needs careful batching to avoid idle bubbles.
Fully sharded data parallel (FSDP / ZeRO): Shard weights, gradients, and optimizer state across GPUs, dramatically cutting per-GPU memory for training.

Sharding adds communication overhead, so for multi-node work fast interconnect like InfiniBand matters as much as the GPU itself.

How to estimate before you rent

Start with the weight floor. Multiply billions of parameters by 2 (FP16), 1 (INT8), or 0.5 (INT4).
Add for the task. Inference: +10-20% for KV cache, more for long context. QLoRA: roughly the 4-bit weight size plus a small adapter. Full training: 8-10x the FP16 weight size.
Add headroom. Reserve 10-20% so you do not OOM at peak batch.
Match to a card. Pick the smallest GPU that fits from the catalog; if nothing single-card fits, plan a sharded multi-GPU setup.
Price it. Use the GPU training cost calculator to compare a bigger single GPU against a sharded smaller-GPU setup.

Takeaway

Start every sizing exercise with the 2 GB-per-billion-params FP16 rule, then layer on the overhead your task demands: light for inference, modest for QLoRA, heavy for full training. Quantization and LoRA shrink the budget enough that most fine-tuning fits on a single card, while full training and very large models still need multi-GPU sharding and fast interconnect. Size first, then pick the smallest GPU that fits from the GPU catalog or reach for serverless GPU for bursty inference - and confirm the bill with the cost calculator.