Measure Tokens Per Second on Your GPU: A Benchmarking Tutorial
A hands-on tutorial for measuring LLM inference throughput in tokens per second on a cloud GPU, including warmup, batching, and reporting.
Tokens per second is the metric that turns a vague claim like "this GPU is fast" into something you can put on a spreadsheet and defend. When you compare cloud GPU instances for LLM inference, the headline price per hour means little until you know how many tokens that hardware actually produces under your workload. This tutorial walks through a repeatable way to measure tokens per second so you can compare providers, GPU models, and serving stacks on equal footing.
What tokens per second actually measures
Tokens per second (often written as tok/s) counts how many tokens a model generates or processes in one second of wall-clock time. Two flavors matter. Prompt processing throughput (often called prefill) measures how fast the model reads your input. Generation throughput (often called decode) measures how fast it produces new tokens one step at a time. They behave very differently, because prefill processes all prompt tokens in parallel while generation is sequential and typically memory-bandwidth bound.
You should also separate single-request latency from aggregate throughput. A single user streaming one response cares about per-request tok/s. A serving platform handling many users cares about total tok/s across a full batch. Report both, because a GPU that looks slow for one request can be excellent when it batches dozens of concurrent requests.
Set up a clean benchmark environment
Consistency is everything. A noisy environment produces numbers you cannot trust. Before you run anything, lock down the variables that move your results.
- Pin the model, the quantization, and the context length you plan to use in production.
- Record the GPU model, driver version, and the inference runtime (for example vLLM, TGI, TensorRT-LLM, or llama.cpp).
- Disable other workloads on the instance so you are not sharing memory bandwidth.
- Run a warmup pass to load weights into memory and trigger any one-time compilation or kernel autotuning.
The warmup step is the one people skip most often. The first request after a cold start pays for weight loading and graph compilation, which can make it several times slower than steady state. Throw the warmup numbers away.
A simple measurement procedure
The core idea is to send a known amount of work, time it with a high-resolution clock, and divide. Here is a procedure that works across most serving stacks.
- Fix a prompt of a known token length and a fixed number of output tokens, for example 512 in and 512 out.
- Warm up with three to five throwaway requests.
- Time at least 20 to 50 requests so noise averages out.
- Record the total output tokens and the total elapsed time.
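- Compute output tok/s as total output tokens divided by total elapsed seconds. Because the elapsed time includes prefill, this is an end-to-end rate rather than a pure generation rate.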
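The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a complete harness: `generate` is a hypothetical callable that you would write for your own serving stack, which sends one request with the fixed prompt and returns the number of output tokens. The `clock` parameter is also an assumption, included only so the timer can be swapped out in tests.

```python
import time
from typing import Callable


def benchmark(
    generate: Callable[[], int],
    warmup: int = 3,
    requests: int = 20,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time `requests` sequential calls and return output tokens per second.

    `generate` is a hypothetical callable: it sends one request with your
    fixed prompt and returns the number of output tokens it produced.
    """
    for _ in range(warmup):
        generate()  # warmup results are deliberately thrown away

    total_tokens = 0
    start = clock()
    for _ in range(requests):
        total_tokens += generate()
    elapsed = clock() - start

    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return {
        "requests": requests,
        "output_tokens": total_tokens,
        "elapsed_s": elapsed,
        "tok_per_s": total_tokens / elapsed,
    }
```

`time.perf_counter` is used because it is monotonic and high resolution, so system clock adjustments during the run do not skew the result.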
For streaming endpoints, capture the timestamp of the first token and the timestamp of the last token. The gap between request start and first token is your time to first token, a latency metric. The number of tokens generated after the first one, divided by the gap between first and last token, gives a clean generation rate that excludes prefill.
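As a rough sketch, the streaming metrics can be computed from a list of per-token arrival timestamps. The function name and the input shape are assumptions for illustration; how you collect the timestamps depends on your client library. The rate here counts the intervals between tokens (one fewer than the token count), which is one common convention rather than the only one.

```python
def stream_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive time to first token and decode rate from token timestamps.

    `token_times` holds the arrival time of each streamed token, in seconds,
    taken from the same clock as `request_start`.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two token timestamps")

    first, last = token_times[0], token_times[-1]
    decode_window = last - first
    if decode_window <= 0:
        raise ValueError("token timestamps must be increasing")

    # Tokens generated after the first one; prefill is excluded because
    # the window starts when the first token arrives.
    intervals = len(token_times) - 1
    return {
        "ttft_s": first - request_start,
        "decode_tok_per_s": intervals / decode_window,
        "s_per_token": decode_window / intervals,
    }
```

Reporting both `decode_tok_per_s` and `s_per_token` makes it harder to confuse a rate with its inverse.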
Measuring throughput under concurrency
Real serving rarely processes one request at a time. Continuous batching lets a runtime pack many requests into the same forward pass, which raises total tok/s dramatically until the GPU saturates. To find the useful operating point, sweep concurrency.
| Concurrent requests | Total tok/s | Per request tok/s | Notes |
|---|---|---|---|
| 1 | Baseline | Highest per request | Latency optimal |
| 8 | Several times baseline | Slightly lower | Good balance |
| 32 | Near peak | Lower | Throughput optimal |
| 64+ | Plateaus | Falls off | Queueing begins |
The exact numbers depend on your GPU, model, and context length, so treat the table as a shape rather than a promise. The point is to find where total throughput stops climbing. That plateau is your effective serving capacity, and it is the throughput figure to use when you convert your hourly cost into cost per million tokens.
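One way to sketch a concurrency sweep is shown below. It assumes a hypothetical `generate` callable that sends one blocking request and returns its output token count; threads are used on the assumption that the client is I/O bound while it waits on the server. The 5% `min_gain` threshold for calling something a plateau is an arbitrary illustrative choice, not a standard.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_concurrent(
    generate: Callable[[], int], concurrency: int, requests: int
) -> float:
    """Return total output tok/s with up to `concurrency` requests in flight."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(lambda _: generate(), range(requests)))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed


def find_plateau(
    sweep: list[tuple[int, float]], min_gain: float = 0.05
) -> tuple[int, float]:
    """Pick the first concurrency level where the next step adds little.

    `sweep` is a list of (concurrency, total_tok_per_s) measurements.
    Returns the last point if throughput never flattens in the tested range.
    """
    if not sweep:
        raise ValueError("sweep must contain at least one measurement")
    points = sorted(sweep)
    for (conc, tps), (_, next_tps) in zip(points, points[1:]):
        if (next_tps - tps) / tps < min_gain:
            return conc, tps
    return points[-1]
```

If `find_plateau` returns the last point you measured, throughput was still climbing, so extend the sweep to higher concurrency before drawing conclusions.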
Turning tok/s into cost per token
Tokens per second only becomes a buying signal once you connect it to price. Take your steady-state total tok/s at a sensible concurrency, multiply by 3600 to get tokens per hour, and divide the instance hourly rate by that figure. The result is your cost per token; multiply it by one million to get cost per million tokens for easy comparison across providers.
This is where surprising results show up. A pricier GPU with much higher throughput often wins on cost per million tokens even though it loses on price per hour. On DeployCue we encourage readers to compare on cost per unit of output, not on raw hourly rate, because that is what your bill actually tracks.
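The arithmetic is short enough to write down directly. The hourly rates and throughput figures below are made-up numbers chosen only to illustrate the calculation; they are not measurements of any real GPU or provider.

```python
def cost_per_million_tokens(hourly_rate: float, total_tok_per_s: float) -> float:
    """Convert an hourly instance price and a throughput into cost per 1M tokens."""
    if total_tok_per_s <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = total_tok_per_s * 3600
    return hourly_rate / tokens_per_hour * 1_000_000


# Hypothetical instances: one cheaper per hour, one pricier but faster.
cheaper = cost_per_million_tokens(hourly_rate=2.00, total_tok_per_s=1000)
pricier = cost_per_million_tokens(hourly_rate=4.00, total_tok_per_s=3000)
print(f"cheaper instance: {cheaper:.3f} per million tokens")
print(f"pricier instance: {pricier:.3f} per million tokens")
```

With these illustrative inputs the instance that costs twice as much per hour works out to roughly 0.37 per million tokens against roughly 0.56 for the cheaper one, because it produces three times the output.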
Common mistakes that produce misleading numbers
Benchmarks go wrong in predictable ways. Watch for these.
- Counting characters or words instead of tokens. Always use the model tokenizer.
- Forgetting to warm up, which inflates the latency of the first request and drags down average throughput.
- Measuring only one request and reporting it as throughput.
- Mixing prefill and generation into a single rate, which hides where time goes.
- Comparing different quantizations or context lengths and pretending they are equal.
Account for input and output ratios
The mix of input and output tokens shifts your results more than most people expect. A workload with long prompts and short answers spends most of its time in prefill, where the GPU runs efficiently in parallel. A workload with short prompts and long answers spends most of its time in sequential generation, which is memory bandwidth bound and slower per token. If you benchmark with a 50 in, 500 out shape but your production traffic is 2000 in, 100 out, your numbers will not predict real behavior.
The fix is to benchmark with the same input and output distribution you actually serve. Sample real prompts and real response lengths, or at least pick a representative average for each. When you publish or compare numbers, always state the input and output token counts alongside the tok/s figure, because a rate without that context is close to meaningless.
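A minimal sketch of drawing benchmark shapes from observed traffic might look like this. It assumes you already have a list of (input tokens, output tokens) pairs from your own logs; the function names are hypothetical. A fixed seed keeps the sample identical between runs so results stay comparable.

```python
import random


def sample_shapes(
    observed: list[tuple[int, int]], n: int, seed: int = 0
) -> list[tuple[int, int]]:
    """Draw `n` (input_tokens, output_tokens) pairs from observed traffic."""
    rng = random.Random(seed)  # private generator, reproducible per seed
    return [rng.choice(observed) for _ in range(n)]


def describe(shapes: list[tuple[int, int]]) -> str:
    """Summarize the token shape to publish alongside a tok/s figure."""
    n = len(shapes)
    mean_in = sum(i for i, _ in shapes) / n
    mean_out = sum(o for _, o in shapes) / n
    return f"{n} requests, mean {mean_in:.0f} in / {mean_out:.0f} out"
```

Printing the `describe` summary next to every result is a cheap way to make sure the input and output counts always travel with the rate.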
Use a script you can rerun
The most valuable benchmark is one you can run again without re-deriving anything. Wrap the whole procedure in a small script that takes the model, the concurrency, and the input and output lengths as parameters, runs the warmup, fires the requests, and prints tok/s along with the conditions it ran under. Store that script in version control next to your project.
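A skeleton for such a script could look like the following. The flag names and defaults are illustrative assumptions, and `generate` is a hypothetical function you would supply to talk to your own endpoint; it receives the parsed arguments and returns the output token count for one request.

```python
import argparse
import json
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Measure output tokens per second.")
    p.add_argument("--model", required=True)
    p.add_argument("--concurrency", type=int, default=1)
    p.add_argument("--input-tokens", type=int, default=512)
    p.add_argument("--output-tokens", type=int, default=512)
    p.add_argument("--warmup", type=int, default=3)
    p.add_argument("--requests", type=int, default=20)
    return p


def run(
    args: argparse.Namespace, generate: Callable[[argparse.Namespace], int]
) -> dict:
    """Warm up, fire the timed requests, and return tok/s with its conditions."""
    for _ in range(args.warmup):
        generate(args)  # discarded

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=args.concurrency) as pool:
        counts = list(pool.map(lambda _: generate(args), range(args.requests)))
    elapsed = time.perf_counter() - start

    report = dict(vars(args))  # record every setting the run used
    report["tok_per_s"] = round(sum(counts) / elapsed, 1)
    return report


def main(argv: list[str], generate: Callable[[argparse.Namespace], int]) -> None:
    args = build_parser().parse_args(argv)
    print(json.dumps(run(args, generate), indent=2))
```

Calling `main(sys.argv[1:], my_generate)` from your entry point prints a JSON report that carries the model, concurrency, and token lengths with the result, so a number copied into a spreadsheet can always be traced back to its conditions.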
When a provider changes pricing, a new GPU appears, or you upgrade your inference runtime, you rerun the same script and get comparable numbers in minutes. This turns benchmarking from a one-off chore into a repeatable measurement you trust. On DeployCue we treat reproducible scripts like this as the backbone of any honest cloud GPU comparison, because they remove guesswork from the buying decision.
Conclusion
A trustworthy tokens per second benchmark is mostly discipline. Pin your model and settings, warm up, time many requests, separate prefill from generation, and sweep concurrency to find your real serving capacity. Once you have those numbers, convert them to cost per million tokens and the right cloud GPU choice usually becomes obvious. Repeatable measurement, not vendor marketing, is what should drive your inference infrastructure decisions.