Quantization for Cheaper Inference: FP8, INT8, and INT4 Tradeoffs
An advanced guide to using quantization for cheaper LLM inference, comparing FP8, INT8, and INT4 on memory, speed, and quality, with guidance on when to use each.
Quantization is one of the most effective levers for cutting inference cost. By representing a model's weights, and sometimes its activations, with fewer bits, you shrink memory use, fit larger models on cheaper GPUs, and often increase throughput. The catch is that lower precision can degrade quality, and how much it degrades depends on the format and the model. This guide compares the formats teams reach for most, FP8, INT8, and INT4, and explains how to trade cost against quality without guessing.
Why fewer bits save money
Models are usually trained and distributed in sixteen bit precision. Each weight at that precision takes two bytes, so weight memory is roughly two bytes times the parameter count. Memory is often the binding constraint for serving large models, because the weights plus the attention state (the key and value cache kept for each in-flight request) must fit on the GPU. Halve the bit width and you roughly halve the memory the weights consume, which can let a model fit on a smaller, cheaper accelerator, or leave room for more concurrent requests. Lower precision arithmetic can also run faster on hardware that supports it, raising throughput and lowering cost per token.
- Less memory: fit bigger models on cheaper GPUs, or more requests per GPU.
- More speed: faster low precision math on supported hardware raises throughput.
- The cost: reduced numerical precision can lower output quality if pushed too far.
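The memory arithmetic above can be sketched in a few lines. This is a back-of-the-envelope estimate for weights only: the 70 billion parameter count is a hypothetical example, and the function name `weight_memory_gb` is ours. It ignores the attention state, quantization scales, and runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 70B-parameter model; swap in your own parameter count.
params = 70e9
for label, bits in [("sixteen bit", 16), ("FP8 / INT8", 8), ("INT4", 4)]:
    print(f"{label:>12}: {weight_memory_gb(params, bits):.0f} GB")
```

Under these assumptions the weights drop from 140 GB to 70 GB at eight bits and 35 GB at four bits, which is the difference between needing several accelerators and needing one.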
The formats compared
FP8: eight bit floating point
FP8 keeps a floating point representation in eight bits, splitting them between a sign, an exponent, and a short mantissa. The exponent preserves a wide dynamic range, so small and large values both keep roughly the same relative precision. Because it retains floating point structure, it tends to hold quality well, and on hardware with native FP8 support it can deliver strong speedups. It is increasingly popular as a default for high quality, lower cost serving when the hardware supports it, because the quality loss is often small.
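To make the dynamic range point concrete, here is a minimal sketch that rounds a few values to an FP8 grid and to a uniform eight bit integer grid. It assumes the E4M3 layout (four exponent bits, three mantissa bits, largest finite value 448), which is one common FP8 variant and not something this guide prescribes. The helper names are ours, and saturating at the maximum is a simplification. Real FP8 pipelines usually also apply a per-tensor scale, which is omitted here.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the E4M3 layout assumed here


def round_to_e4m3(x: float) -> float:
    """Round to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(mag)   # mag = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)       # unbiased exponent, clamped at the smallest normal exponent
    step = 2.0 ** (e - 3)      # spacing between neighbours with 3 mantissa bits
    return sign * min(round(mag / step) * step, E4M3_MAX)


def round_to_int8_grid(x: float, scale: float) -> float:
    """Round to a uniform integer grid with the given scale, clamped to [-127, 127]."""
    q = max(-127, min(127, round(x / scale)))
    return q * scale


values = [0.01, 0.3, 5.0, 300.0]
int8_scale = max(values) / 127  # one absmax scale shared by every value
for v in values:
    print(v, round_to_e4m3(v), round_to_int8_grid(v, int8_scale))
```

With one shared scale, the integer grid rounds 0.01 to zero because its step is set by the largest value, while the floating point grid keeps it at about 0.0098. That is the sense in which FP8 tolerates tensors whose values span several orders of magnitude.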
INT8: eight bit integer
INT8 uses eight bit integers and has long been a workhorse of quantization. It cuts memory significantly and is widely supported across hardware and tooling. With good calibration, quality loss is usually modest, which makes INT8 a safe, mature choice when you want meaningful savings with low risk. Because an integer grid is evenly spaced, it needs care in how values are scaled, especially activations: a few outliers can stretch the scale and leave too few levels for everything else, which degrades sensitive layers.
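As an illustration of what eight bit integer quantization does to a tensor, the sketch below uses symmetric absmax scaling, one of the simplest schemes. Production tools typically use per-channel scales and more careful calibration, so treat this as a toy round trip rather than a recommended method. The function names are ours.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [x * scale for x in q]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value is within half a quantization step of the original.
print(q, [round(r, 4) for r in restored])
```

The rounding error per value is bounded by half the scale, so anything that inflates the scale, such as a single large outlier, raises the error for every other value in the tensor.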
INT4: four bit integer
INT4 pushes to four bits, roughly quartering the memory of sixteen bit weights. This is the most aggressive common option and enables the largest models to run on modest GPUs. The tradeoff is the highest risk to quality, and the outcome depends heavily on the quantization method and the model. Modern four bit techniques have become quite good, and for many tasks the quality remains acceptable, but you must validate carefully because some models and some tasks suffer.
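One reason the method matters so much at four bits is that there are only a handful of levels to spend. A common way to use them better is to give each small group of weights its own scale. The sketch below assumes symmetric rounding to integers in [-7, 7] and a tiny group size for readability. It illustrates the idea and is not any specific published four bit method.

```python
def quantize_int4_groupwise(weights, group_size=4):
    """Symmetric four bit quantization with one scale per group of weights."""
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        absmax = max(abs(w) for w in group)
        scale = absmax / 7 if absmax > 0 else 1.0
        scales.append(scale)
        q.extend(max(-7, min(7, round(w / scale))) for w in group)
    return q, scales


def dequantize_groupwise(q, scales, group_size=4):
    """Multiply each integer by the scale of the group it belongs to."""
    return [x * scales[i // group_size] for i, x in enumerate(q)]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Made-up weights: four small values followed by four large ones.
w = [0.01, -0.02, 0.015, 0.005, 2.0, -1.5, 0.5, 1.0]
for size in (8, 4):  # one shared scale versus one scale per group of four
    q, scales = quantize_int4_groupwise(w, group_size=size)
    restored = dequantize_groupwise(q, scales, group_size=size)
    print(size, max_error(w[:4], restored[:4]))
```

With a single shared scale the small weights all collapse to zero, while per-group scales keep them distinguishable. The scales themselves cost memory: assuming one sixteen bit scale per group of 128 weights, that is 16 / 128 = 0.125 extra bits per weight, so the saving is slightly less than a clean quarter.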
Side by side
| Format | Memory savings | Quality risk | Typical use |
|---|---|---|---|
| FP8 | About half vs sixteen bit | Low | High quality, lower cost on supported hardware |
| INT8 | About half vs sixteen bit | Low to moderate | Mature, broadly supported savings |
| INT4 | About three quarters vs sixteen bit | Moderate to high | Fit large models on small GPUs |
Weights only versus weights and activations
An important distinction is whether you quantize only the weights or also the activations that flow through the model at runtime. Weight only quantization mostly buys memory savings and is generally safer for quality. Quantizing activations as well can unlock more speed by using low precision arithmetic throughout, but it is more sensitive and demands careful calibration. Many teams start with weight only quantization to capture the memory win, then explore activation quantization for additional throughput.
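The difference between the two modes shows up in where the arithmetic happens. The sketch below computes one dot product both ways, assuming symmetric eight bit absmax scaling for illustration. In weight only mode the stored integers are expanded back to floats and the math stays in floating point. In the weights and activations mode the inputs are quantized at runtime, the products are accumulated as integers, and the result is rescaled once. Function names are ours.

```python
def absmax_quantize(values, qmax=127):
    """Symmetric quantization to integers in [-qmax, qmax]; returns (ints, scale)."""
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) for v in values], scale


def dot_weight_only(x, w_q, w_scale):
    """Weights are stored as integers but dequantized before a float multiply."""
    return sum(xi * (wq * w_scale) for xi, wq in zip(x, w_q))


def dot_weights_and_activations(x, w_q, w_scale):
    """Activations are quantized too, so the inner loop is pure integer math."""
    x_q, x_scale = absmax_quantize(x)
    acc = sum(xq * wq for xq, wq in zip(x_q, w_q))  # integer accumulator
    return acc * x_scale * w_scale                  # single rescale at the end


x = [0.2, -1.3, 0.7, 3.1]      # made-up activations
w = [0.5, -0.25, 0.1, 0.05]    # made-up weights
w_q, w_scale = absmax_quantize(w)
exact = sum(a * b for a, b in zip(x, w))
print(exact, dot_weight_only(x, w_q, w_scale), dot_weights_and_activations(x, w_q, w_scale))
```

The second path is where hardware integer units can help, but it also adds a second source of rounding error that depends on the activations seen at runtime, which is why it needs more careful calibration.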
Post training versus quantization aware
Most teams use post training quantization, which converts an already trained model and requires no retraining. It is fast and usually good enough, especially at eight bits. When aggressive low bit quantization hurts quality more than you can accept, quantization aware approaches, which account for low precision during training or fine tuning, can recover quality at the cost of more effort. For most inference cost projects, post training quantization with careful calibration is the practical starting point.
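Calibration in post training quantization usually means running a sample of representative inputs through the model and choosing scales from the values observed. A minimal sketch of one such rule follows, assuming percentile clipping, which is only one of several options. The function name, the default percentile, and the sample data are illustrative.

```python
def calibrate_scale(samples, percentile=99.9, qmax=127):
    """Derive a quantization scale from calibration samples.

    Values above the chosen percentile of magnitudes are treated as outliers
    and will saturate at +/-qmax instead of stretching the scale.
    """
    magnitudes = sorted(abs(s) for s in samples)
    index = min(len(magnitudes) - 1, int(len(magnitudes) * percentile / 100))
    clip = magnitudes[index]
    return clip / qmax if clip > 0 else 1.0


# Made-up calibration data: a thousand ordinary values plus one extreme outlier.
samples = [i / 1000 for i in range(1000)] + [50.0]
print("absmax scale:    ", calibrate_scale(samples, percentile=100))
print("percentile scale:", calibrate_scale(samples, percentile=99.9))
```

Here the single outlier makes the plain absmax scale about fifty times coarser than the clipped one. Whether clipping helps or hurts depends on how much the outliers matter to the model, which is something only an evaluation on your tasks can settle.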
How to choose without guessing
- Decide your binding constraint: is it memory to fit the model, or throughput to lower cost per token?
- Start conservative. Try FP8 or INT8 first, since quality risk is low.
- Build an evaluation set from your real tasks and measure quality before and after.
- Only move to INT4 if you need the memory savings, and validate task by task.
- Confirm your hardware and serving engine support the chosen format efficiently.
- Recheck quality whenever you change the model or the quantization method.
Common pitfalls
The biggest mistake is assuming a benchmark score from one model transfers to yours. Quantization sensitivity varies by model and by task, and a format that is harmless for summarization may hurt precise reasoning or code generation. Another pitfall is choosing a format your hardware does not accelerate, which gives memory savings without the speed benefit. Always pair the format choice with the hardware and engine that support it.
How quantization interacts with other levers
Quantization rarely stands alone. It compounds with the serving engine, the batch strategy, and the GPU choice, and the interactions matter. By shrinking the weights, quantization frees memory that the serving engine can use to hold more concurrent requests, which raises throughput beyond the raw speed of low precision math. That means the cost benefit of quantizing is often larger in practice than the memory number alone suggests, because it lets a single GPU serve more traffic at once. When you measure, look at end to end throughput on your serving stack, not just a microbenchmark of the matrix operations.
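A rough way to see this effect is to count how many requests fit in the memory left over after the weights. The sketch below is a simplification with made-up numbers: it assumes a fixed attention state size per request and a flat overhead allowance, whereas real serving engines manage this memory dynamically. All names and figures are hypothetical.

```python
def max_concurrent_requests(gpu_gb, weight_gb, kv_gb_per_request, overhead_gb=2.0):
    """Requests whose attention state fits after weights and a fixed overhead."""
    free_gb = gpu_gb - weight_gb - overhead_gb
    if free_gb <= 0:
        return 0  # the weights alone do not fit
    return int(free_gb // kv_gb_per_request)


# Hypothetical 30B-parameter model on an 80 GB GPU, 0.5 GB of attention state per request.
for label, weight_gb in [("sixteen bit", 60.0), ("eight bit", 30.0), ("four bit", 15.0)]:
    print(label, max_concurrent_requests(80.0, weight_gb, 0.5))
```

In this example, halving the weights from 60 GB to 30 GB raises the request capacity from 36 to 96, more than double, because the freed memory all goes to serving traffic.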
There is also a model sizing decision hiding here. Sometimes the right move is not to quantize your current model aggressively, but to run a larger model at a moderate precision that still fits, trading a format choice for a capability choice. A larger model at eight bits may beat a smaller model at sixteen bits on both quality and cost, because quantization lets the bigger model fit on the same hardware. Treat precision and model size as two dials you turn together, guided by the same evaluation set.
A practical rollout sequence
- Establish a quality baseline at the model's native precision on your own tasks.
- Apply INT8 quantization and confirm quality holds, then measure the throughput and memory gain.
- If you need more speed, test FP8 on supported hardware for a strong quality to speed balance.
- Reserve INT4 for cases where fitting the model on cheaper hardware is the goal, and validate each task.
- Lock in an evaluation gate so future model or method changes cannot regress quality unnoticed.
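The evaluation gate in the last step can be as simple as a per-task comparison against the baseline. The sketch below assumes each task yields a single score where higher is better, and that a fixed maximum drop is acceptable. The task names, scores, and threshold are placeholders, not measurements.

```python
def passes_quality_gate(baseline_scores, candidate_scores, max_drop=0.01):
    """Return (ok, failures); a task fails if it is missing or drops more than max_drop."""
    failures = {}
    for task, base in baseline_scores.items():
        cand = candidate_scores.get(task)
        if cand is None or base - cand > max_drop:
            failures[task] = (base, cand)
    return not failures, failures


# Placeholder scores for illustration only.
baseline = {"summarization": 0.82, "code": 0.61}
int8_run = {"summarization": 0.82, "code": 0.605}
int4_run = {"summarization": 0.815, "code": 0.55}

print(passes_quality_gate(baseline, int8_run))
print(passes_quality_gate(baseline, int4_run))
```

Checking each task separately, instead of an average, matches the point that a format can be harmless for one task and harmful for another. Wiring a check like this into the deployment pipeline is what prevents a later model or method change from regressing quality unnoticed.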
Quantization is a powerful and well understood way to make inference cheaper, but it is a tradeoff, not free. FP8 and INT8 usually offer large savings with low quality risk, while INT4 unlocks the biggest memory wins at higher risk. The right choice depends on whether memory or throughput is your constraint, on your hardware support, and above all on quality measured against your own tasks. Start conservative, validate rigorously, and push to lower precision only as far as your evaluations allow.