Benchmark LLM Inference Providers Fairly

Inference provider benchmarks are everywhere, and most of them are misleading. A vendor showing off its fastest number on an idle endpoint with a trivial prompt tells you almost nothing about what you will experience under real traffic. Fair benchmarking is not hard, but it requires discipline: realistic workloads, honest measurement of the slow tail, load that matches production, and cost normalized so comparisons are apples to apples. This guide lays out a methodology you can use to compare providers in a way that predicts real behavior.

Start With a Realistic Workload

The single biggest source of misleading benchmarks is an unrealistic prompt. A one-word input prefills instantly and decodes a short answer, which flatters every provider equally and tells you nothing. Your benchmark prompts should mirror your actual application: the same prompt length distribution, the same output length, and the same structure, such as system prompts, retrieved context, or tool definitions. If your production prompts average a few thousand tokens of context, benchmark at that length, because prefill cost and time to first token both scale with it.

Match input length to your real prompt distribution, not a single average.
Match output length, since decode time scales with tokens generated.
Include the structural elements you really use, like long system prompts.
Use a variety of prompts, not one repeated request that may hit a cache.
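
One way to honor the first point is to resample prompt lengths from what production actually sends instead of benchmarking at the mean. The sketch below assumes you have token counts from your own logs; the numbers in `observed` and the helper name `sample_prompt_lengths` are made up for illustration.

```python
import random

def sample_prompt_lengths(observed_lengths, n, seed=0):
    """Draw n prompt lengths by resampling lengths observed in production."""
    rng = random.Random(seed)  # a fixed seed keeps the benchmark set repeatable
    return [rng.choice(observed_lengths) for _ in range(n)]

# Hypothetical token counts pulled from production logs.
observed = [350, 420, 1800, 2100, 2400, 3900, 5200, 7600]
lengths = sample_prompt_lengths(observed, n=200)
```

Each sampled length can then be used to pick or build a distinct prompt of about that size, which also avoids sending one repeated request that may hit a cache.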

Measure the Right Latency Numbers

Latency is not one number. For streaming workloads you need at least three.

Metric	What it captures	Why it matters
Time to first token	Delay before output starts	Dominates perceived responsiveness
Inter-token latency	Pace of streamed tokens	Decides if output keeps up with reading
End-to-end latency	Total time to full response	Matters for non-streaming and batch use
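
All three metrics in the table can be derived from timestamps you record while reading a streamed response. This is a minimal sketch, assuming you note the time the request was sent and the arrival time of each token or chunk; the function name and the dictionary keys are illustrative, not any provider's API.

```python
def streaming_metrics(request_sent, token_times):
    """Derive the three latency metrics from one streamed response.

    request_sent: time the request was sent, in seconds.
    token_times: arrival time of each streamed token or chunk, in seconds.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    end_to_end = token_times[-1] - request_sent
    # Gaps between consecutive arrivals give the streaming pace.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "mean_inter_token": mean_gap, "end_to_end": end_to_end}

metrics = streaming_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
# ttft is 0.5 s, mean inter-token gap is 0.25 s, end-to-end is 1.25 s
```

If the provider streams several tokens per chunk, the gaps measure chunk pace rather than token pace, so divide by tokens per chunk if you need a per-token figure.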

Report Percentiles, Not Just Averages

Averages hide the experiences that matter most. A provider can have a great median and a terrible slow tail, and that tail is exactly what frustrates a meaningful share of users. Always report higher percentiles, such as the 95th and 99th, alongside the median so you can see how bad the worst common cases are. A provider with a slightly slower median but a tight tail is often the better choice for a product where consistency matters.
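
To see how a tail hides behind an average, here is a small sketch using the nearest-rank definition of a percentile. The latency samples are invented: 90 fast requests and 10 slow ones.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Invented latencies in seconds: 90 fast requests, 10 slow ones.
latencies = [0.2] * 90 + [3.0] * 10
mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)  # 0.2
p95 = percentile(latencies, 95)  # 3.0
```

Here the mean lands near half a second, a latency that no request actually experienced, while the median and the 95th percentile together show the real picture: most requests are fast and one in ten is fifteen times slower.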

Test Under Realistic Load

An idle endpoint is the easiest condition in the world, and no production system runs idle. Throughput and latency change dramatically as concurrency rises, because requests queue and the serving stack batches them together on the GPU. Run your benchmark at the concurrency level you expect in production, and ideally sweep across several load levels to see where latency starts to degrade. This reveals the point at which a provider's capacity runs out, which a single-request test can never show.
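
A concurrency sweep can be sketched with `asyncio` and a semaphore that caps in-flight requests. To keep the example runnable without credentials, `fake_request` is a stand-in for a real provider call, and its capacity of four simultaneous requests is an arbitrary assumption; swap in your own client call when adapting it.

```python
import asyncio
import time

async def fake_request(service_slots, service_time=0.01):
    """Stand-in for a provider call: a server that handles a few requests at once."""
    async with service_slots:
        await asyncio.sleep(service_time)

async def run_level(concurrency, total_requests, send):
    """Issue total_requests calls with at most `concurrency` in flight; return latencies."""
    limiter = asyncio.Semaphore(concurrency)
    latencies = []

    async def one_request():
        async with limiter:
            start = time.perf_counter()
            await send()
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(one_request() for _ in range(total_requests)))
    return latencies

async def sweep(levels, total_requests=20):
    """Run the same workload at each concurrency level."""
    results = {}
    for level in levels:
        slots = asyncio.Semaphore(4)  # hypothetical server capacity
        results[level] = await run_level(level, total_requests, lambda: fake_request(slots))
    return results

results = asyncio.run(sweep([1, 4, 16]))
```

Feeding each level's latencies into a percentile summary shows where the tail starts to stretch. With the simulated server, that happens once concurrency exceeds its four slots and requests begin to queue.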

Watch for Rate Limits and Throttling

Part of load testing is discovering the provider's limits. Some endpoints throttle or queue aggressively once you exceed a rate, which can make a fast provider effectively slow at your volume. Record not just latency but error rates and any throttling responses, because a provider that returns errors under load is not a viable option regardless of its best-case speed.
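
Recording outcomes can be as simple as tallying the status code of every response. The sketch below assumes the provider signals throttling with HTTP 429 (Too Many Requests), which is the standard code for it; some providers may signal limits differently, so check their documentation before relying on this classification.

```python
from collections import Counter

def summarize_outcomes(status_codes):
    """Return the share of ok, throttled, and error responses in a load test."""
    if not status_codes:
        raise ValueError("no responses recorded")

    def label(code):
        if 200 <= code < 300:
            return "ok"
        if code == 429:  # assumption: throttling is reported as HTTP 429
            return "throttled"
        return "error"

    counts = Counter(label(code) for code in status_codes)
    total = len(status_codes)
    return {name: counts[name] / total for name in ("ok", "throttled", "error")}

# Invented run: eight successes, one throttled response, one server error.
summary = summarize_outcomes([200] * 8 + [429, 500])
```

Report these rates next to the latency percentiles for each load level, since latency figures computed only over successful requests will flatter a provider that is shedding load.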

Normalize Cost for a Fair Comparison

Providers price in different units, so raw price tags are not comparable until you normalize them. Convert everything to a common basis, typically cost per million input tokens and cost per million output tokens, since input and output are often priced differently. Then weight by your real input-to-output ratio to get an effective cost per request for your workload. Remember to account for features that change the math, such as cached input tokens billed at a reduced rate or batch tiers that discount asynchronous jobs.

Convert all pricing to cost per million input and output tokens.
Weight by your actual input-to-output token ratio.
Account for prompt caching discounts if you reuse prefixes.
Check for batch or async tiers if your work is not real time.
Express the result as effective cost per representative request.
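
The normalization steps above reduce to a short calculation. This sketch assumes prices are already expressed per million tokens and models a cache discount as a fraction off the input price; the prices and token counts in the example are invented, not any provider's actual rates.

```python
def effective_cost_per_request(input_tokens, output_tokens,
                               input_price_per_m, output_price_per_m,
                               cached_fraction=0.0, cached_discount=0.0):
    """Cost of one representative request, in the currency of the prices.

    cached_fraction: share of input tokens served from a prompt cache (0 to 1).
    cached_discount: fraction taken off the input price for cached tokens (0 to 1).
    """
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    input_cost = uncached * input_price_per_m + cached * input_price_per_m * (1 - cached_discount)
    output_cost = output_tokens * output_price_per_m
    return (input_cost + output_cost) / 1_000_000

# Invented workload: 4,000 input tokens and 500 output tokens per request,
# at hypothetical prices of 1.00 and 4.00 per million tokens.
plain = effective_cost_per_request(4000, 500, 1.00, 4.00)
with_cache = effective_cost_per_request(4000, 500, 1.00, 4.00,
                                        cached_fraction=0.5, cached_discount=0.5)
```

In this example the request costs 0.006 without caching and 0.005 when half the input is cached at half price. Running the same function with each provider's prices and your own token counts yields directly comparable numbers.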

Do Not Forget Quality

Speed and price are meaningless if the answers are wrong. Two providers serving the same open model may produce slightly different output due to differences in how they run it, such as quantization choices or sampling defaults. Run a quality evaluation on a held-out set so you are comparing providers that deliver acceptable answers, not just fast ones. If a cheaper, faster provider quietly uses a more aggressively quantized variant, the quality check is what reveals the hidden tradeoff.
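
As a rough illustration, a quality check can be a score computed over the same held-out prompts for every provider. The sketch below uses exact match after light normalization, which is only one simple metric and an assumption here; tasks with open-ended answers need a scoring method suited to them. The sample answers are invented.

```python
def exact_match_rate(answers, references):
    """Fraction of answers equal to their reference after lowercasing and trimming whitespace."""
    if not references or len(answers) != len(references):
        raise ValueError("answers and references must be non-empty and the same length")

    def normalize(text):
        return " ".join(text.lower().split())

    hits = sum(normalize(a) == normalize(r) for a, r in zip(answers, references))
    return hits / len(references)

# Invented held-out references and one provider's answers to the same prompts.
references = ["paris", "42", "red"]
provider_answers = ["Paris", " 42 ", "blue"]
score = exact_match_rate(provider_answers, references)  # two of three match
```

Scoring every provider against the same references with the same sampling settings makes a quality gap between nominally identical models visible.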

Keep the Benchmark Reproducible

A benchmark you cannot rerun is a one-time anecdote. Fix the prompts, the concurrency, the region, and the time window, and record them alongside the results. Providers change their infrastructure and pricing frequently, so a benchmark from months ago may no longer hold. A reproducible harness lets you rerun the comparison whenever you suspect something changed, which is far more valuable than a single snapshot.

A Fair Benchmarking Checklist

Use prompts that match your real length and structure.
Measure time to first token, inter-token latency, and end-to-end latency.
Report higher percentiles, such as the 95th and 99th, not just the median.
Load test at production concurrency and record errors and throttling.
Run from the region your users live in.
Normalize cost to a common per-token basis weighted by your usage.
Validate answer quality on a held-out set.
Keep the harness reproducible so you can rerun it later.

Beware Apples-to-Oranges Model Variants

A subtle trap in provider benchmarking is assuming that the same model name means the same model behavior. Two providers serving an open model may run it differently: one might use a more aggressive quantization to fit more requests on a GPU, another might use different sampling defaults, and a third might run a slightly different version of the weights. These differences can shift both speed and quality. A provider that looks fastest may simply be running a more compressed variant that trades quality for throughput. The only way to catch this is to pair your latency and cost measurements with a quality evaluation on identical inputs, so you are comparing providers that deliver comparable answers rather than just comparable speed.

It also helps to record the exact conditions of each run: the model identifier, the region, the concurrency, the prompt set, and the date. Providers iterate on their infrastructure constantly, and a result that held last month may not hold today. Treat every benchmark as a snapshot with an expiry date, and rerun it before making a decision that depends on numbers gathered a while ago.
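
One lightweight way to capture those conditions is to write a small record next to each result file. This is a sketch under assumptions: the field names, the example values, and the choice of a SHA-256 digest to fingerprint the prompt set are all illustrative.

```python
import hashlib
import json

def run_record(model_id, region, concurrency, prompts, run_date):
    """Build a JSON-serializable description of one benchmark run."""
    # Hashing the prompt set makes it obvious when two runs used different inputs.
    digest = hashlib.sha256("\n".join(prompts).encode("utf-8")).hexdigest()
    return {
        "model_id": model_id,
        "region": region,
        "concurrency": concurrency,
        "prompt_count": len(prompts),
        "prompt_sha256": digest,
        "run_date": run_date,
    }

# Hypothetical values for illustration only.
record = run_record("example-model-v1", "eu-west", 16,
                    ["Summarize this document.", "Translate this sentence."],
                    "2025-01-15")
serialized = json.dumps(record, sort_keys=True, indent=2)
```

Storing `serialized` alongside the raw latencies lets you confirm later that a rerun used the same prompts and settings before comparing its numbers to the earlier snapshot.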

Fair benchmarking is less about clever measurement and more about honesty: test what you will actually run, under the load you will actually see, and report the numbers that actually hurt. Do that, and your provider comparison will predict production behavior instead of flattering a vendor's marketing page. The extra effort pays for itself the first time it steers you away from a provider that looked great on paper and folded under real traffic.
