Throughput vs Latency in LLM Inference

When teams say they want their model to be faster, they usually mean one of two different things, and the two pull against each other. Latency is how quickly a single request finishes. Throughput is how many requests the system completes per unit of time. You can usually improve one by sacrificing the other, so the first job in inference optimization is deciding which metric your product is actually graded on. Optimizing the wrong one wastes money and frustrates users.

Defining the metrics precisely

Vague talk about speed causes most of the confusion, so pin down the numbers that matter in token generation.

Time to first token (TTFT): how long from request arrival until the first token streams back. This dominates perceived responsiveness in chat.
Time per output token (TPOT): the steady-state pace of tokens after the first. This sets how fast a long answer renders.
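End-to-end latency: total time for a full response, roughly TTFT + TPOT × (output tokens − 1), since TTFT already covers the first token. For long responses this is close to TTFT plus TPOT times the output length.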
Throughput: total tokens or requests served per second across all concurrent users, the metric that drives cost per token.
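
These definitions can be turned into a few lines of arithmetic over per-token timestamps. The following is a minimal sketch, not a standard API: the RequestTrace type and its field names are hypothetical, and it assumes you log the arrival time and the emission time of each output token in seconds.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request log record (all times in seconds)."""
    arrival: float            # when the request reached the server
    token_times: list[float]  # when each output token was emitted


def ttft(trace: RequestTrace) -> float:
    # Time to first token: arrival until the first token is emitted.
    return trace.token_times[0] - trace.arrival


def tpot(trace: RequestTrace) -> float:
    # Average gap between consecutive tokens after the first.
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def end_to_end(trace: RequestTrace) -> float:
    # Arrival until the last token is emitted.
    return trace.token_times[-1] - trace.arrival


def throughput_tokens_per_s(traces: list[RequestTrace]) -> float:
    # Total tokens across all requests divided by the wall-clock window.
    total_tokens = sum(len(t.token_times) for t in traces)
    window_start = min(t.arrival for t in traces)
    window_end = max(t.token_times[-1] for t in traces)
    return total_tokens / (window_end - window_start)
```

Note that throughput is computed over the whole set of traces, while the other three are per-request, which is why the two kinds of metric can move in opposite directions.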

Why the two fight each other

The lever in the middle is batch size. A GPU is most efficient when it processes many requests together, because each decoding step has to read the model weights from memory once, and a larger batch amortizes that fixed cost across more requests. Larger batches raise throughput, which lowers cost per token. But a request that waits for a batch to fill, or shares the GPU with many others, takes longer to finish, which raises latency. Push batch size up and you serve more users cheaply, but each one waits a bit more. Push it down and each request flies, but the GPU sits underused and your cost per token climbs.

Tuning choice	Effect on latency	Effect on throughput	Effect on cost per token
Larger batch size	Worse	Better	Lower
Smaller batch size	Better	Worse	Higher
More replicas	Neutral to better	Better	Roughly flat
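
A toy cost model makes the first two rows of the table concrete. This sketch assumes each decoding step costs a fixed amount (reading the weights) plus a small per-request amount. The linear shape and both constants are illustrative assumptions, not measurements of any real GPU or serving stack.

```python
def decode_step_seconds(batch_size: int,
                        fixed_s: float = 0.020,
                        per_request_s: float = 0.001) -> float:
    # Assumed model: fixed weight-read cost plus a per-request cost.
    # Both constants are made up for illustration.
    return fixed_s + per_request_s * batch_size


def tokens_per_second(batch_size: int) -> float:
    # Each step produces one token per request in the batch.
    return batch_size / decode_step_seconds(batch_size)


for b in (1, 8, 32, 128):
    step_ms = decode_step_seconds(b) * 1000
    print(f"batch={b:4d}  TPOT={step_ms:6.1f} ms  "
          f"throughput={tokens_per_second(b):7.1f} tok/s")
```

Under these assumptions, the per-token time each user sees grows with batch size while total tokens per second grows too, with diminishing returns once the per-request cost dominates the fixed cost.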

Pick the metric your workload lives by

Different products care about different numbers, and the right optimization follows from that.

Latency-first workloads

Interactive chat, coding assistants, voice agents, and anything a human waits on in real time should optimize TTFT and TPOT. Here you favor smaller effective batches, fast first-token paths, and streaming so the user sees progress immediately. You accept a higher cost per token because the experience depends on responsiveness.

Throughput-first workloads

Offline batch jobs, bulk classification, embedding generation, and evaluation runs care only about total work per dollar. Maximize batch size, pack the GPU, and let individual requests wait, because no human is watching the clock on any single one. This is where cost per token drops the most.

Tuning in practice

Most real systems sit between the extremes, so tune deliberately rather than by feel.

Write down your target: a TTFT ceiling for interactive work, or a cost-per-token target for batch work.
Load test at increasing concurrency and record TTFT, TPOT, and throughput at each level.
Find the batch size where latency still meets your ceiling but throughput is as high as possible.
If you cannot satisfy both, add replicas to raise throughput without enlarging batches, which protects latency.
Re-test after any model, quantization, or hardware change, since the curve shifts.
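
The selection step above can be sketched as a small filter over load-test results. The LoadPoint record, its field names, and the sample numbers are hypothetical. The sketch uses concurrency as a stand-in for effective batch size and assumes throughput scales linearly with replica count, which ignores load-balancing overhead.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadPoint:
    """One row of a load test (hypothetical schema)."""
    concurrency: int
    p95_ttft_s: float
    tokens_per_s: float


def best_operating_point(points: list[LoadPoint],
                         ttft_ceiling_s: float) -> Optional[LoadPoint]:
    # Keep only the points that meet the latency ceiling,
    # then take the one with the highest throughput.
    within_ceiling = [p for p in points if p.p95_ttft_s <= ttft_ceiling_s]
    if not within_ceiling:
        return None
    return max(within_ceiling, key=lambda p: p.tokens_per_s)


def replicas_needed(target_tokens_per_s: float, point: LoadPoint) -> int:
    # Assumes each replica sustains the chosen point's throughput.
    return math.ceil(target_tokens_per_s / point.tokens_per_s)


# Hypothetical load-test results, for illustration only.
results = [
    LoadPoint(concurrency=1, p95_ttft_s=0.20, tokens_per_s=50.0),
    LoadPoint(concurrency=8, p95_ttft_s=0.35, tokens_per_s=300.0),
    LoadPoint(concurrency=32, p95_ttft_s=0.90, tokens_per_s=600.0),
]
chosen = best_operating_point(results, ttft_ceiling_s=0.5)
```

With these sample numbers the concurrency-32 point has the best throughput but breaks a 0.5 s ceiling, so the concurrency-8 point is chosen and any remaining throughput gap is covered by replicas rather than bigger batches.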

Continuous batching, where the server admits new requests into an in-flight batch instead of waiting to assemble a fixed one, softens the tradeoff considerably and is worth enabling for mixed workloads. It lets you keep the GPU busy while still admitting latency-sensitive requests promptly.
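
To see why this helps, here is a simplified step-based simulation comparing the two schedulers. It is a sketch under strong assumptions, not how any particular server implements scheduling: every decoding step takes one time unit regardless of batch size, prefill is ignored, and the function names and request tuples are made up for illustration.

```python
from collections import deque

# Each request is (arrival_step, output_tokens), sorted by arrival,
# with output_tokens >= 1.


def static_batching(requests, batch_size):
    """Fixed batches: a batch holds the GPU until its longest request ends."""
    finish = [0] * len(requests)
    gpu_free_at = 0
    for start in range(0, len(requests), batch_size):
        group = requests[start:start + batch_size]
        # Wait for the GPU and for the last member of the batch to arrive.
        begin = max(gpu_free_at, max(arrival for arrival, _ in group))
        for offset, (_, n_tokens) in enumerate(group):
            finish[start + offset] = begin + n_tokens
        gpu_free_at = begin + max(n_tokens for _, n_tokens in group)
    return finish


def continuous_batching(requests, max_batch):
    """Admit waiting requests into free slots before every decode step."""
    pending = deque(enumerate(requests))
    active = {}   # request id -> tokens still to generate
    finish = {}
    step = 0
    while pending or active:
        while (pending and len(active) < max_batch
               and pending[0][1][0] <= step):
            rid, (_, n_tokens) = pending.popleft()
            active[rid] = n_tokens
        # One decode step: every active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finish[rid] = step + 1
        step += 1
    return [finish[i] for i in range(len(requests))]


requests = [(0, 2), (0, 10), (1, 2), (1, 2)]
print(static_batching(requests, batch_size=2))    # [2, 10, 12, 12]
print(continuous_batching(requests, max_batch=2)) # [2, 10, 4, 6]
```

In this example the two short requests that arrive at step 1 finish at steps 4 and 6 under continuous batching, instead of waiting behind the 10-token request until step 12.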

Measuring under realistic load

A single-request benchmark tells you almost nothing about a production system, because the tradeoff only appears under concurrency. Always test with a request mix and arrival pattern that resemble production, including bursty arrivals. Watch tail latency, such as the 95th or 99th percentile, not just the average, since the slowest few percent of requests often define user satisfaction and any latency commitments you have made.
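
A short example shows how an average can hide the tail. This sketch uses the nearest-rank percentile definition (other definitions interpolate and give slightly different values), and the latency samples are invented for illustration.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    # Nearest-rank percentile: the smallest sample such that at least
    # pct percent of all samples are less than or equal to it.
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 95 fast requests and 5 slow ones, in seconds.
latencies = [0.2] * 95 + [2.0] * 5

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"mean={mean:.2f}s  p50={p50:.2f}s  p99={p99:.2f}s")
```

Here the mean and median both look healthy while the 99th percentile is ten times the median, which is the kind of gap an average-only dashboard would miss.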

Streaming changes the perception of latency

For interactive products, the metric users actually feel is often time to first token, not total time. Streaming tokens to the client as they are produced means the user sees a response begin almost immediately, even if the full answer takes several seconds to finish. This decouples perceived responsiveness from total generation time and lets you tolerate a slightly slower per-token pace without users noticing. If your product streams, prioritize a fast first token and a steady stream over raw end-to-end speed, because a response that starts instantly and flows smoothly feels faster than one that arrives all at once after a pause.
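
On the client side, first-token time and stream steadiness can be measured by timestamping each token as it arrives. The sketch below works on any Python iterator of tokens. The time_stream name and the returned keys are hypothetical, and the clock parameter exists only so the function can be exercised with a fake clock.

```python
import time


def time_stream(token_iter, clock=time.perf_counter):
    """Consume a token stream and report what the user would perceive."""
    start = clock()
    stamps = []
    for _ in token_iter:
        stamps.append(clock())  # arrival time of each token
    if not stamps:
        return None
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    return {
        "ttft_s": stamps[0] - start,          # wait before anything appears
        "max_gap_s": max(gaps, default=0.0),  # longest stall mid-stream
        "total_s": stamps[-1] - start,        # full response time
    }


# Usage with a fake clock standing in for real arrival times.
fake_ticks = iter([0.0, 0.3, 0.35, 0.5])
stats = time_stream(["Hello", ",", " world"], clock=lambda: next(fake_ticks))
```

Tracking the largest gap alongside TTFT is one way to catch a stream that starts quickly but stalls partway through.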

Hardware and model choices move the curve

Tuning batch size shifts where you sit on the throughput-versus-latency curve, but the shape of the curve itself is set by your model and hardware. A smaller or quantized model usually generates tokens faster, improving both latency and throughput at some quality cost. A GPU with higher memory bandwidth produces tokens faster because generation is frequently bound by how quickly the weights and the key-value cache can be read from memory. So before you wring the last few percent out of batch tuning, confirm the model size and hardware are appropriate for your target, since those decisions set the ceiling that tuning operates beneath.
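
A back-of-the-envelope bound illustrates the bandwidth argument. This sketch assumes batch size 1 and that every weight is read once per generated token, and it ignores key-value cache reads and compute time, so real numbers will be lower. The model size and bandwidth figures are illustrative and do not describe any particular GPU.

```python
def decode_tokens_per_s_upper_bound(params_billion: float,
                                    bytes_per_param: float,
                                    bandwidth_gb_s: float) -> float:
    # Weights read per token, in GB (1e9 params * bytes / 1e9 bytes per GB).
    weight_gb = params_billion * bytes_per_param
    # If every token requires reading all weights once, bandwidth
    # divided by weight size caps the single-stream token rate.
    return bandwidth_gb_s / weight_gb


# Illustrative inputs: a 7B-parameter model on a 1000 GB/s device.
fp16 = decode_tokens_per_s_upper_bound(7, 2.0, 1000)   # 16-bit weights
int4 = decode_tokens_per_s_upper_bound(7, 0.5, 1000)   # 4-bit weights
print(f"16-bit: ~{fp16:.0f} tok/s upper bound")
print(f"4-bit:  ~{int4:.0f} tok/s upper bound")
```

Under these assumptions, quartering the bytes per weight quadruples the bound, and doubling memory bandwidth doubles it, which is why model and hardware choices move the whole curve rather than a point on it.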

A checklist for picking the right metric

Is a human waiting in real time? Optimize latency, specifically time to first token.
Is the work offline and deadline-free? Optimize throughput and cost per token.
Does the product stream responses? Prioritize first-token speed and a steady token stream.
Are you bound by a latency commitment? Treat tail latency, not average, as the constraint.
Have you changed model or hardware? Re-measure, because the curve shifted.

Conclusion

Throughput and latency are two ends of the same lever, and batch size is the hand on it. Interactive products should optimize first-token and per-token latency and accept a higher cost per token, while offline pipelines should pack batches for maximum throughput and lowest cost. The expensive mistake is optimizing whichever metric is easiest to measure rather than the one your users feel. Define the target, load test honestly under realistic concurrency, and tune batch size and replica count to hit the number that actually matters for your workload.
