Cost Per Million Tokens Compared Across Top Inference APIs
A practical framework for comparing cost per million tokens across LLM inference APIs, with the variables that make a fair comparison harder than it looks.
Cost per million tokens is the headline metric every inference API advertises, and it is the number buyers reach for first when comparing providers. It is also one of the easiest figures to misread. Two APIs can quote similar per-million rates yet produce very different monthly bills once you account for the input and output split, model capability, context handling, and assorted fees. This guide shows how to compare cost per million tokens in a way that reflects what you will actually pay, rather than what the marketing page implies.
Why a Single Number Is Never Enough
The phrase cost per million tokens hides a fork in the road. Nearly every provider charges a different rate for input tokens and output tokens, so a single blended figure can be misleading. An API that looks cheap on input might be expensive on output, and your true cost depends heavily on your own input-to-output ratio. Before comparing anything, separate the two rates and map them onto how your application actually uses tokens.
The Input and Output Ratio Decides the Winner
Consider two workloads. A document-summarization service sends large inputs and returns short outputs, so it is dominated by input pricing. A code-generation assistant sends short prompts and returns long completions, so it is dominated by output pricing. The same two providers can swap places at the top of your shortlist depending on which of these workloads you run. There is no universal cheapest API, only the cheapest API for your token profile.
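A short sketch makes the swap concrete. The provider names, rates, and token counts below are invented for illustration and are not quotes from any real price list.

```python
# Hypothetical rates in dollars per million tokens (not real provider prices).
RATES = {
    "provider_a": {"input": 1.00, "output": 8.00},
    "provider_b": {"input": 2.00, "output": 4.00},
}

# Hypothetical token profiles: (input tokens, output tokens) per request.
WORKLOADS = {
    "summarization": (8_000, 300),
    "code_generation": (500, 2_000),
}


def request_cost(rates, input_tokens, output_tokens):
    """Dollar cost of one request at the given per-million rates."""
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


def cheapest(workload):
    """Name of the provider with the lowest cost for the named workload."""
    input_tokens, output_tokens = WORKLOADS[workload]
    return min(RATES, key=lambda name: request_cost(RATES[name], input_tokens, output_tokens))


for workload in WORKLOADS:
    print(workload, "->", cheapest(workload))
```

With these made-up numbers, provider_a wins the input-heavy summarization workload and provider_b wins the output-heavy code-generation workload, even though neither rate card changed.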
The Variables That Move the Real Price
A fair comparison weighs several factors beyond the sticker rate:
- Model tier: flagship models cost more per token than compact or distilled models. Compare like for like in capability, not just price.
- Input vs output split: always pull both rates and weight them by your actual ratio.
- Context window pricing: some providers charge more once a request exceeds a long-context threshold.
- Prompt caching: discounts on repeated input prefixes can dramatically cut effective cost for fixed system prompts.
- Batch pricing: asynchronous batch endpoints often carry a meaningful discount for non-urgent work.
- Minimums and platform fees: some hosted offerings add per-request or subscription costs on top of token rates.
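Several of these adjustments can be folded into one function. The sketch below is a simplification under stated assumptions: the parameter names are hypothetical, the cache discount is applied only to the cached share of input tokens, the batch discount is applied to all token charges, and long-context surcharges are left out. Real providers may structure each of these differently, so check the actual terms before relying on it.

```python
def adjusted_request_cost(
    input_tokens,
    output_tokens,
    input_rate,
    output_rate,
    cached_fraction=0.0,
    cache_discount=0.0,
    batch_discount=0.0,
    per_request_fee=0.0,
):
    """Dollar cost of one request after caching, batch, and fee adjustments.

    Rates are dollars per million tokens. Fractions and discounts are
    values between 0 and 1. All discount mechanics here are assumptions.
    """
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    # Cached input tokens are billed at a reduced rate; uncached at full rate.
    billable_input = uncached + cached * (1 - cache_discount)
    input_cost = billable_input * input_rate / 1_000_000
    output_cost = output_tokens * output_rate / 1_000_000
    # Batch discount scales the token charges; the flat fee is added afterwards.
    return (input_cost + output_cost) * (1 - batch_discount) + per_request_fee


# Hypothetical request: 4,000 input tokens, 500 output tokens.
baseline = adjusted_request_cost(4_000, 500, input_rate=2.0, output_rate=8.0)
with_cache = adjusted_request_cost(
    4_000, 500, input_rate=2.0, output_rate=8.0,
    cached_fraction=0.75, cache_discount=0.9,
)
print(f"baseline:   ${baseline:.4f}")
print(f"with cache: ${with_cache:.4f}")
```

Keeping every adjustment as an explicit parameter makes it easy to see which one moves the result most for a given workload.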
A Comparison Framework You Can Reuse
To compare any set of inference APIs fairly, normalize them against a single representative request rather than against the marketing rate. The table below shows the structure of a sound comparison without inventing specific prices, which always change.
| Step | What to measure | Why it matters |
|---|---|---|
| 1 | Average input tokens per request | Anchors the input side of the bill |
| 2 | Average output tokens per request | Anchors the output side of the bill |
| 3 | Input rate per million | Provider-specific, model-specific |
| 4 | Output rate per million | Often several times the input rate |
| 5 | Effective cost per request | The number that actually compares |
Compute cost per request for each provider, then multiply by your expected volume. This effective cost per request is the only figure that lets you compare providers honestly, because it folds the input and output split into a single workload-specific number.
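The five steps in the table, plus the volume multiplication, fit in a few lines. This is a minimal sketch: the `Quote` type, the provider names, and every number are placeholders rather than real prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """Per-million-token rates for one model on one provider (hypothetical)."""
    name: str
    input_rate: float   # dollars per million input tokens
    output_rate: float  # dollars per million output tokens


def cost_per_request(quote, avg_input_tokens, avg_output_tokens):
    """Step 5: fold steps 1 through 4 into one dollar figure per request."""
    input_cost = avg_input_tokens * quote.input_rate / 1_000_000
    output_cost = avg_output_tokens * quote.output_rate / 1_000_000
    return input_cost + output_cost


def rank_by_monthly_cost(quotes, avg_input_tokens, avg_output_tokens, requests_per_month):
    """Return (name, monthly dollars) pairs, cheapest first."""
    totals = [
        (q.name, cost_per_request(q, avg_input_tokens, avg_output_tokens) * requests_per_month)
        for q in quotes
    ]
    return sorted(totals, key=lambda pair: pair[1])


# Placeholder quotes and workload: 2,000 input and 400 output tokens per request.
quotes = [Quote("provider_a", 0.50, 6.00), Quote("provider_b", 1.50, 3.00)]
for name, dollars in rank_by_monthly_cost(quotes, 2_000, 400, 1_000_000):
    print(f"{name}: ${dollars:,.2f} per month")
```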
Capability Versus Price
The cheapest token is not the cheapest answer. A less capable model that needs two attempts, longer prompts, or heavy post-processing can cost more in practice than a pricier model that gets it right the first time. When you compare cost per million tokens, hold quality constant. Run the same evaluation prompts through each candidate and judge whether the cheaper option meets your bar before you let price decide.
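One rough way to hold quality constant is to compare cost per successful answer instead of cost per attempt. The sketch below assumes each attempt succeeds independently with the same probability and that you retry until one succeeds, so the expected number of attempts is one divided by the success rate. The pass rates and per-attempt costs are invented.

```python
def cost_per_success(cost_per_attempt, success_rate):
    """Expected dollars spent per acceptable answer.

    Assumes independent attempts retried until success, so the expected
    number of attempts is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Invented numbers: a cheaper model that passes 40% of the time
# against a pricier model that passes 95% of the time.
cheap = cost_per_success(cost_per_attempt=0.002, success_rate=0.40)
pricey = cost_per_success(cost_per_attempt=0.004, success_rate=0.95)
print(f"cheap model:  ${cheap:.5f} per accepted answer")
print(f"pricey model: ${pricey:.5f} per accepted answer")
```

In this made-up case the model with twice the per-attempt cost ends up cheaper per accepted answer. The success rate should come from your own evaluation set, not from a published benchmark.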
Hidden Costs in Retries and Tooling
Real applications rarely make one clean call per task. Retries, tool-calling loops, multi-step agents, and re-prompting all multiply token usage. An agent that loops several times per task can spend far more than its single-call cost suggests. Factor your real call patterns into the comparison, not the idealized single request.
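To see why loops are expensive, consider an agent that resends its growing history on every step. The sketch below assumes each step's input is the original prompt plus all earlier step outputs, which is a common pattern but not a universal one. Tool results, which would add further tokens, are ignored.

```python
def agent_task_tokens(prompt_tokens, output_tokens_per_step, steps):
    """Total (input, output) tokens for a loop that resends its history.

    Assumes every step's input is the original prompt plus the outputs
    of all earlier steps.
    """
    total_input = 0
    context = prompt_tokens
    for _ in range(steps):
        total_input += context             # this step reads the whole context
        context += output_tokens_per_step  # its output joins the next step's input
    return total_input, output_tokens_per_step * steps


single = agent_task_tokens(prompt_tokens=1_000, output_tokens_per_step=200, steps=1)
looped = agent_task_tokens(prompt_tokens=1_000, output_tokens_per_step=200, steps=5)
print("single call (input, output):", single)
print("five-step loop (input, output):", looped)
```

Under these assumptions, five steps consume seven times the input tokens of a single call, not five times, because the context grows with every step.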
Latency and Throughput Belong in the Comparison
Price per token is one axis, but speed is another, and the two interact. A provider that is marginally cheaper per token yet noticeably slower can cost you in user experience, in higher infrastructure needs to handle queued requests, and in engineering time spent working around latency. For interactive applications, time to first token and tokens per second often matter as much as the raw rate. For batch jobs, throughput and the availability of discounted asynchronous endpoints matter more than latency.
The practical move is to treat the comparison as multidimensional. Build a small scorecard that captures effective cost per request, task quality on your evaluation set, and the latency or throughput your workload requires. A provider rarely wins on all three, so the scorecard forces an honest tradeoff rather than letting a single cheap number dominate the decision.
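A scorecard can be as simple as a filter followed by a sort. The sketch below treats quality and latency as minimum requirements and ranks whatever survives by cost. The field names, thresholds, and figures are all hypothetical, and using 95th-percentile latency as the speed measure is one choice among several.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One row of the scorecard (all values are illustrative)."""
    name: str
    cost_per_request: float  # dollars, from your effective-cost model
    quality: float           # fraction of your evaluation set passed
    p95_latency_s: float     # 95th-percentile latency in seconds


def shortlist(candidates, min_quality, max_latency_s):
    """Drop candidates that miss the quality or latency bar, then sort by cost."""
    eligible = [
        c for c in candidates
        if c.quality >= min_quality and c.p95_latency_s <= max_latency_s
    ]
    return sorted(eligible, key=lambda c: c.cost_per_request)


candidates = [
    Candidate("provider_a", cost_per_request=0.0030, quality=0.82, p95_latency_s=1.8),
    Candidate("provider_b", cost_per_request=0.0042, quality=0.93, p95_latency_s=2.4),
    Candidate("provider_c", cost_per_request=0.0038, quality=0.91, p95_latency_s=6.0),
]
for c in shortlist(candidates, min_quality=0.90, max_latency_s=3.0):
    print(c.name, c.cost_per_request)
```

Gating on quality and latency before sorting by price avoids a weighted score in which a very low price can paper over a failed requirement.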
Rate Volatility and Lock-In
Inference pricing changes frequently as competition intensifies and new model generations arrive. A rate that wins today may be undercut next quarter, so avoid hardwiring your architecture to one provider's quirks. Designing your application to call models through a thin abstraction layer makes it cheaper to re-benchmark and switch when the market shifts. The cost of portability is usually small compared with a forced migration, and it preserves your ability to chase better pricing. Treat your comparison as a living document you refresh on a schedule rather than a one-time decision.
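A thin abstraction layer can be a single interface that application code depends on, with one adapter per provider behind it. The sketch below is illustrative only: `ChatBackend`, `Completion`, and `EchoBackend` are made-up names, and the stand-in backend calls no real API. A real adapter would wrap a provider's SDK and report the token counts that provider returns.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Completion:
    """Provider-neutral result, including the token counts needed for costing."""
    text: str
    input_tokens: int
    output_tokens: int


class ChatBackend(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, prompt: str) -> Completion: ...


class EchoBackend:
    """Stand-in backend for illustration; it uppercases the prompt and
    counts whitespace-separated words as a crude proxy for tokens."""

    def complete(self, prompt: str) -> Completion:
        words = prompt.split()
        return Completion(text=prompt.upper(), input_tokens=len(words), output_tokens=len(words))


def summarize(backend: ChatBackend, document: str) -> Completion:
    """Application logic that never names a specific provider."""
    return backend.complete(f"Summarize: {document}")


result = summarize(EchoBackend(), "hello world")
print(result.text, result.input_tokens, result.output_tokens)
```

Returning token counts from the interface means the same effective-cost model can be re-run against any backend without touching application code.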
How to Run a Sound Comparison
- Define a representative workload with realistic input and output sizes.
- Pull the input and output rate for the specific model on each provider.
- Compute effective cost per request for each candidate.
- Layer in caching, batch, and fee adjustments where they apply.
- Validate quality with a shared evaluation set before choosing on price.
- Multiply by realistic volume, including retries and agent loops.
The Takeaway
Cost per million tokens is a useful starting point and a poor finishing line. The provider that wins on a pricing page can lose on your bill once you weight input against output, match capability, and account for caching, batching, and retries. Build a small effective-cost model around your own representative request, keep it updated as rates change, and let it drive the decision. That discipline turns a noisy marketing metric into a reliable comparison you can trust.