Quantize a Model to INT8 for Cheaper Deployment, Step by Step
A practical, step-by-step tutorial for quantizing a large language model to INT8 so you can serve it on smaller, cheaper GPUs without unacceptable quality loss.
Quantization is one of the most reliable levers for cutting inference cost. By moving model weights, and sometimes activations, from 16-bit floating point to 8-bit integers, you roughly halve the weight memory footprint, fit larger models on smaller GPUs, and often improve throughput. This tutorial walks through INT8 quantization end to end, from picking a method to validating that quality survives the squeeze. The goal is a model you can deploy on a cheaper instance with confidence, not a benchmark trophy.
Why INT8 Saves Money
A model stored in FP16 uses two bytes per parameter. The same model in INT8 uses one byte, so a 13B parameter model drops from roughly 26 GB to roughly 13 GB of weight memory. That difference decides whether you need a 40 GB card or a 24 GB card, and on most cloud price lists that is a large hourly gap. Smaller memory also leaves more room for the key-value cache, which means longer context windows or more concurrent requests on the same hardware.
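The arithmetic above is easy to reproduce for any model size. The sketch below is a back-of-the-envelope estimate only: the helper name `weight_memory_gb` is illustrative, and it counts weights alone, ignoring activations, the key-value cache, and runtime overhead.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Weight memory in GB (10^9 bytes); excludes KV cache and runtime overhead."""
    return num_params * bytes_per_param / 1e9


# The 13B example from the text: two bytes per parameter in FP16, one in INT8.
fp16_gb = weight_memory_gb(13e9, 2)
int8_gb = weight_memory_gb(13e9, 1)

print(f"FP16 weights: {fp16_gb:.0f} GB")
print(f"INT8 weights: {int8_gb:.0f} GB")
```

Whatever the card size, leave headroom on top of this figure, since the weights are only part of what has to fit.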
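There is a second, quieter win: speed. Token-by-token generation is often limited by how fast weights can be read from GPU memory, so halving their size can lift tokens per second even with weight-only quantization. When activations are quantized too, many GPUs can also execute the matrix multiplies themselves in INT8 faster than in FP16, provided the runtime ships well-tuned kernels. The net effect is lower cost per token from two directions at once: cheaper hardware and more work per second.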
Choose a Quantization Method
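Not all INT8 paths are equal. The options below answer two separate questions: what gets quantized (weights only, or weights and activations) and when (after training, or during fine-tuning). The right combination depends on how much accuracy you can spare and how much engineering time you have.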
- Weight-only INT8: weights become 8-bit, activations stay higher precision. Simple, fast to apply, very low accuracy loss. A strong default.
- Weight and activation INT8 (full INT8): both are quantized, which unlocks faster kernels but is more sensitive and usually needs calibration.
- Post-training quantization (PTQ): applied to an already trained model with a small calibration set. No retraining. This is what most teams use.
- Quantization-aware training (QAT): the model learns to tolerate low precision during fine-tuning. Best accuracy, highest effort.
For a first deployment, start with weight-only PTQ. It gives most of the savings with the least risk.
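To make weight-only quantization concrete, here is a minimal sketch of symmetric INT8 rounding on a plain list of floats. It is an illustration, not how any particular toolkit implements it: the function names are hypothetical, and it uses a single scale for the whole list with round-to-nearest.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


# Toy values chosen for illustration only.
weights = [0.5, -1.27, 0.01, 1.0]
quantized, scale = quantize_symmetric(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("integers:", quantized)
print("max round-trip error:", max_error)
```

The round-trip error of each value is bounded by half a quantization step, which is why a scale stretched by one large outlier hurts every other value that shares it.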
Run Calibration
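Calibration shows the quantizer what real activations look like so it can pick sensible scaling factors. It matters most when activations are quantized, because basic weight scales can be computed from the weights themselves. In that setting, skipping calibration is a common cause of quality collapse.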
- Assemble a small calibration set, often 128 to 512 representative samples drawn from your actual prompts.
- Run the model in evaluation mode over those samples while the toolkit observes activation ranges.
- Let the toolkit compute per-channel scales for weights, which preserves more accuracy than a single per-tensor scale.
- Export the quantized checkpoint.
Use prompts that match production. If you serve code, calibrate on code. A calibration set that looks nothing like real traffic produces scales tuned for the wrong distribution.
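The calibration steps above can be sketched in plain Python. This is a toy model of what a toolkit does, under stated assumptions: `AbsMaxObserver` and the helper functions are hypothetical names, the observer uses a simple running absolute maximum (real toolkits may use percentiles or other statistics), and each row of the weight matrix is treated as one output channel.

```python
QMAX = 127  # symmetric signed INT8 range is [-127, 127]


class AbsMaxObserver:
    """Tracks the largest absolute activation seen across calibration batches."""

    def __init__(self):
        self.max_abs = 0.0

    def observe(self, activations):
        self.max_abs = max(self.max_abs, max(abs(a) for a in activations))

    def scale(self):
        return self.max_abs / QMAX if self.max_abs > 0 else 1.0


def per_tensor_scale(matrix):
    """One scale shared by every weight in the matrix."""
    return (max(abs(w) for row in matrix for w in row) / QMAX) or 1.0


def per_channel_scales(matrix):
    """One scale per row, so a large row cannot stretch the scale of a small one."""
    return [(max(abs(w) for w in row) / QMAX) or 1.0 for row in matrix]


def roundtrip_error(row, scale):
    """Worst-case error after quantizing a row and converting it back."""
    return max(abs(w - round(w / scale) * scale) for w in row)


# Toy data for illustration: the second row is much larger than the first.
weights = [[0.02, -0.01, 0.015], [4.0, -3.5, 2.0]]
calibration_batches = [[0.3, -1.2, 0.8], [2.5, -0.4, 0.1]]

observer = AbsMaxObserver()
for batch in calibration_batches:
    observer.observe(batch)

shared = per_tensor_scale(weights)
per_row = per_channel_scales(weights)
err_tensor = roundtrip_error(weights[0], shared)
err_channel = roundtrip_error(weights[0], per_row[0])

print("activation scale:", observer.scale())
print("small-row error, per-tensor scale :", err_tensor)
print("small-row error, per-channel scale:", err_channel)
```

With a shared scale, the small row is rounded with a step sized for the large row, so most of its values collapse toward zero. Per-channel scales avoid that at the cost of storing one extra number per row.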
Validate Accuracy
Never ship a quantized model on faith. Compare it against the FP16 baseline on tasks you care about.
| Check | What to measure | Acceptable signal |
|---|---|---|
| Task accuracy | Score on a held-out eval set | Within a small margin of FP16 |
| Perplexity | Language modeling loss on held-out text | Minor increase only |
| Output spot check | Side-by-side generations on real prompts | No obvious regressions |
| Latency and memory | Tokens per second, peak VRAM | Clear improvement |
If accuracy drops too far, confirm that per-channel weight scales are actually in use, exclude the most sensitive layers from quantization, or move to a method that mixes precision across layers.
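The first two rows of the table can be turned into an automated gate. The sketch below assumes you already have per-token log-probabilities (natural log) and an accuracy score for both models. The thresholds and all numbers are placeholders for illustration, not recommendations or measurements, and `passes_gate` is a hypothetical helper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def passes_gate(baseline, candidate, max_accuracy_drop=0.01, max_ppl_increase=0.02):
    """True if the candidate stays within both tolerances of the baseline.

    max_accuracy_drop is absolute; max_ppl_increase is relative (0.02 = 2%).
    Both defaults are placeholders; choose values that fit your product.
    """
    accuracy_ok = baseline["accuracy"] - candidate["accuracy"] <= max_accuracy_drop
    ppl_ok = candidate["perplexity"] <= baseline["perplexity"] * (1 + max_ppl_increase)
    return accuracy_ok and ppl_ok


# Placeholder inputs, not real measurements.
fp16 = {"accuracy": 0.800, "perplexity": perplexity([-1.0, -2.0, -1.5])}
int8 = {"accuracy": 0.795, "perplexity": perplexity([-1.0, -2.0, -1.53])}

print("promote INT8 checkpoint:", passes_gate(fp16, int8))
```

A gate like this does not replace the side-by-side spot check, which catches regressions that aggregate metrics hide.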
Deploy on Cheaper Hardware
Once the quantized checkpoint passes validation, point your serving stack at it and pick a smaller instance. Confirm that the runtime actually loads the INT8 weights rather than silently upcasting, since some loaders fall back to FP16 if a kernel is missing. Watch peak memory during a load test with realistic concurrency, because the key-value cache grows with context length and batch size, not just with the weights.
Re-run a short load test to confirm throughput. The savings are only real once you have downsized the instance and verified the model still serves correctly under load.
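To see why peak memory depends on more than the weights, it helps to estimate the key-value cache before the load test. The sketch below uses the standard accounting for a transformer decoder: one key tensor and one value tensor per layer, per token, per sequence. The model shape is a hypothetical example, not the configuration of any specific model, and real runtimes add their own overhead.

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                bytes_per_value=2):
    """Approximate KV cache size in GB (10^9 bytes).

    The leading 2 accounts for storing both keys and values in every layer.
    bytes_per_value=2 assumes the cache is kept in FP16.
    """
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * seq_len * batch_size * bytes_per_value)
    return total_bytes / 1e9


# Hypothetical model shape: 40 layers, 40 KV heads, head dimension 128.
single = kv_cache_gb(40, 40, 128, seq_len=4096, batch_size=1)
batched = kv_cache_gb(40, 40, 128, seq_len=4096, batch_size=8)

print(f"KV cache, 1 sequence : {single:.2f} GB")
print(f"KV cache, 8 sequences: {batched:.2f} GB")
```

Under these assumptions, a batch of eight long sequences needs more cache memory than the INT8 weights of a 13B model, which is why the load test should use realistic concurrency.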
Common Pitfalls
- Calibrating on data that does not match production traffic.
- Assuming faster kernels exist on your GPU when they do not, which leaves you with smaller memory but no speedup.
- Quantizing sensitive layers such as the final projection, which can hurt quality disproportionately.
- Forgetting to re-test after a base model or runtime upgrade.
Quantization is not a one-time event. Treat the validated checkpoint as an artifact you regenerate whenever the base model changes.
Understand the Precision Trade-off Curve
It helps to picture precision as a curve rather than a switch. FP16 sits at one end with full quality and full memory cost. INT8 sits in a sweet spot where memory roughly halves and quality loss is usually small. INT4 sits further along, with even larger savings but a steeper quality risk that varies by model and task. The right point on the curve depends on your tolerance for error and the size of the GPU you are trying to fit into. Most production teams find INT8 is the comfortable default, reaching for INT4 only when the savings are decisive and the eval results hold up.
The curve is not uniform across a model either. Some layers are far more sensitive to precision loss than others. Mixed-precision approaches exploit this by keeping the most sensitive layers at higher precision while quantizing the rest aggressively. That is why a blanket quantization sometimes underperforms a selective one that protects a handful of layers.
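One simple way to express a selective scheme is a per-layer precision plan driven by sensitivity scores. The sketch below assumes you have already measured a score per layer, for example the eval loss increase when only that layer is quantized. The layer names, the scores, and the `choose_precision` helper are all hypothetical.

```python
def choose_precision(sensitivity, keep_high_precision=2):
    """Keep the most sensitive layers in FP16 and quantize the rest to INT8.

    sensitivity maps layer name -> score, where higher means more fragile.
    """
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    protected = set(ranked[:keep_high_precision])
    return {name: ("fp16" if name in protected else "int8") for name in sensitivity}


# Placeholder scores for illustration, not measurements.
sensitivity = {
    "layers.0.attn": 0.002,
    "layers.0.mlp": 0.001,
    "layers.1.attn": 0.015,
    "lm_head": 0.040,
}

plan = choose_precision(sensitivity, keep_high_precision=2)
for name, precision in plan.items():
    print(f"{name:15s} -> {precision}")
```

Protecting a few layers costs little memory because most parameters still end up in INT8, so it is worth re-running the eval suite after each change to the protected set.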
Plan for Throughput, Not Just Memory
It is tempting to treat quantization purely as a memory play, but throughput is where the recurring savings often live. A model that fits a smaller GPU is cheaper per hour, yet a model that also runs faster per request lowers cost per token on every single call. To capture both, confirm three things during your load test: that the runtime uses INT8 kernels rather than upcasting, that batching is tuned so the GPU stays busy, and that the key-value cache has enough room to support your target concurrency. Skipping any of these can leave the memory win on paper while the cost per token barely moves.
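Cost per token combines both effects in one number, which makes it a useful figure to track across load tests. The sketch below is plain arithmetic. The hourly prices and throughput figures are placeholders, not quotes from any provider or measurements from any model.

```python
def cost_per_million_tokens(hourly_price_usd, tokens_per_second):
    """Serving cost in USD per one million generated tokens at steady load."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


# Placeholder scenarios for illustration only.
fp16_large_gpu = cost_per_million_tokens(hourly_price_usd=4.00, tokens_per_second=1000)
int8_small_gpu = cost_per_million_tokens(hourly_price_usd=2.00, tokens_per_second=1200)

print(f"FP16 on larger instance : ${fp16_large_gpu:.2f} per 1M tokens")
print(f"INT8 on smaller instance: ${int8_small_gpu:.2f} per 1M tokens")
```

Use the sustained throughput from your own load test as the input. If the runtime silently upcasts or the GPU sits idle between batches, the hourly price falls but this number barely moves.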
Operationalize the Workflow
Treat quantization as a repeatable pipeline, not a manual experiment you run once. A clean workflow looks like this: pull the base model, run calibration on a current sample of production prompts, produce the quantized checkpoint, run the full eval suite against the FP16 baseline, and promote the checkpoint only if it clears your accuracy threshold. Wiring this into automation means that when the base model is updated or your traffic distribution shifts, you can regenerate a validated quantized artifact without rediscovering the steps from scratch. It also gives you an audit trail that ties each deployed checkpoint to the calibration data and eval scores behind it, which matters when someone asks why quality changed.
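The workflow above might be wired together along these lines. This is a skeleton under stated assumptions: `quantize` and `evaluate` are callables you supply for your own toolkit and eval suite, the stubs below only stand in for them, and the model identifier, scores, and threshold are placeholders.

```python
import hashlib
import json


def run_pipeline(base_model_id, calibration_prompts, quantize, evaluate,
                 max_accuracy_drop=0.01):
    """Quantize, evaluate against the baseline, and return an audit record."""
    checkpoint = quantize(base_model_id, calibration_prompts)
    baseline_score = evaluate(base_model_id)
    candidate_score = evaluate(checkpoint)
    # Fingerprint the calibration data so each checkpoint can be traced to it.
    fingerprint = hashlib.sha256(
        json.dumps(calibration_prompts).encode("utf-8")
    ).hexdigest()
    return {
        "base_model": base_model_id,
        "checkpoint": checkpoint,
        "calibration_sha256": fingerprint,
        "baseline_score": baseline_score,
        "candidate_score": candidate_score,
        "promoted": baseline_score - candidate_score <= max_accuracy_drop,
    }


# Stubs standing in for a real quantization toolkit and eval suite.
def fake_quantize(model_id, prompts):
    return model_id + "-int8"


def fake_evaluate(model_id):
    placeholder_scores = {"example-base": 0.800, "example-base-int8": 0.795}
    return placeholder_scores[model_id]


record = run_pipeline("example-base", ["prompt one", "prompt two"],
                      fake_quantize, fake_evaluate)
print(json.dumps(record, indent=2))
```

Storing the returned record next to the checkpoint gives you the audit trail: when quality changes, you can see which calibration data and eval scores stood behind each deployed artifact.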
INT8 quantization is one of the highest-leverage cost optimizations available for LLM inference, and weight-only PTQ delivers most of the benefit for a fraction of the effort. Calibrate carefully, validate honestly, then downsize the instance. Done well, you keep the answers your users expect while paying for far less GPU. From there you can explore INT4 or mixed precision if your accuracy budget allows, always measuring before you commit.