LLM API Pricing Explained: Tokens, Context, and Blended Cost
A practical breakdown of how LLM API pricing works: per-million token rates, input vs output vs cached pricing, context-window cost, and blended cost per request.
LLM API pricing looks simple on the marketing page - a dollar figure per million tokens - and then your first real invoice arrives three times higher than you modeled. The gap is almost always the same handful of details: output tokens cost more than input tokens, your context window is silently re-billed on every turn, and the model you picked is priced differently depending on who hosts it. This article unpacks the full pricing model so you can estimate cost before you ship, not after.
The unit: what a token actually is
Providers bill in tokens, not words or characters. A token is a sub-word chunk produced by the model's tokenizer; in English, a rough rule of thumb is that one token is about four characters, and 1,000 tokens is roughly 750 words. Code, JSON, non-English text, and unusual whitespace tokenize less efficiently, so a source file or JSON payload can use noticeably more tokens than English prose of the same length. Because tokenizers differ between model families, the same prompt can be a different number of tokens on two providers - one more reason a like-for-like comparison matters. You can sanity-check real per-request cost with the LLM token cost calculator.
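As a rough illustration of the four-characters-per-token rule of thumb, the sketch below estimates a token count from character length alone. The function name and default are hypothetical, and this is not a tokenizer: for code, JSON, or non-English text the real count will usually be higher, so treat the result as a lower-bound guess and use the provider's own tokenizer for anything billing-critical.

```python
import math

# Rule-of-thumb ratio for English prose; an assumption, not a real tokenizer.
CHARS_PER_TOKEN = 4.0


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Return a rough token estimate based only on character count."""
    return math.ceil(len(text) / chars_per_token)


prompt = "Summarize the attached incident report in three bullet points."
print(estimate_tokens(prompt))
```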
Input vs output tokens: the asymmetry that bites
Almost every API splits pricing into two rates: a price per million input (prompt) tokens and a higher price per million output (completion) tokens. Output is usually 2x to 5x the input rate because generation is sequential: each output token needs its own forward pass, and decoding is typically limited by memory bandwidth, while the prompt is processed in a single parallel pass.
This asymmetry changes how you should design prompts. A long retrieval-augmented prompt that produces a one-line answer is cheap on the output side but expensive on input. A short prompt that asks for a 2,000-word draft flips the ratio. Knowing your input:output ratio is the single most useful number for forecasting spend.
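To make the ratio concrete, here is a minimal sketch that computes what fraction of a request's cost comes from output. The rates are hypothetical placeholders (output at 4x input), and the token counts are invented to mirror the two cases above.

```python
# Hypothetical rates in dollars per million tokens (not any provider's real prices).
INPUT_RATE = 1.00
OUTPUT_RATE = 4.00  # 4x input, inside the 2x-5x range discussed above


def output_share(input_tokens: int, output_tokens: int) -> float:
    """Fraction of a request's cost that comes from output tokens."""
    input_cost = input_tokens * INPUT_RATE
    output_cost = output_tokens * OUTPUT_RATE
    return output_cost / (input_cost + output_cost)


# Retrieval-heavy request: long prompt, one-line answer.
rag_share = output_share(input_tokens=8_000, output_tokens=50)
# Drafting request: short prompt, long completion.
draft_share = output_share(input_tokens=200, output_tokens=2_700)

print(f"RAG: {rag_share:.1%} of cost is output")
print(f"Draft: {draft_share:.1%} of cost is output")
```

With these assumed numbers the retrieval request spends about 2% of its cost on output and the drafting request about 98%, which is why the two workloads call for different optimizations.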
Cached tokens: the discount most teams miss
Many providers now offer prompt caching: if the leading portion of your prompt (a long system prompt, a fixed instruction block, a document you keep re-sending) is identical to a recent request, those tokens are billed at a steep discount - often 10% to 50% of the normal input rate - because the model can reuse the computed key-value state.
Caching only helps when the cached prefix is stable and reused within the cache window, which is typically minutes. Two things to watch:
- Prefix stability. Put static content (instructions, schemas, few-shot examples) at the front and user-specific content at the end, or the cache never hits.
- Cache writes. On some providers the call that populates the cache is billed slightly above the normal input rate, so caching pays off only with repetition.
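The break-even point for caching can be sketched in a few lines. The multipliers below are assumptions for illustration (a 1.25x write premium and a 0.10x read rate); actual values vary by provider, and the sketch assumes every call lands inside the cache window.

```python
def prefix_cost_units(
    calls: int,
    cached: bool = True,
    write_mult: float = 1.25,  # hypothetical premium on the call that fills the cache
    read_mult: float = 0.10,   # hypothetical discount on later cache hits
) -> float:
    """Cost of sending one prefix `calls` times, in units of one
    uncached send (prefix tokens x normal input rate)."""
    if calls <= 0:
        return 0.0
    if not cached:
        return float(calls)
    # First call writes the cache; every later call reads from it.
    return write_mult + (calls - 1) * read_mult


for n in (1, 2, 10):
    with_cache = prefix_cost_units(n)
    without = prefix_cost_units(n, cached=False)
    print(f"{n} calls: cached={with_cache:.2f} uncached={without:.2f}")
```

Under these assumptions a single call is more expensive with caching than without, and the second call is already enough to come out ahead.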
How a single request is actually priced
Blended cost per request is the sum of three line items. A worked structure looks like this:
| Component | What it covers | Typical relative rate |
|---|---|---|
| Cached input tokens | Repeated prefix (system prompt, docs) | Lowest (0.1x-0.5x of input) |
| Fresh input tokens | New prompt content this turn | Baseline (1x) |
| Output tokens | The generated completion | Highest (2x-5x of input) |
So request cost = (cached_in x cache_rate + fresh_in x input_rate + out x output_rate) / 1,000,000, where each rate is quoted in dollars per million tokens. Model the three counts separately; a single "average tokens per call" number hides the part of the bill you can most easily optimize.
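The formula translates directly into code. This is a minimal sketch: the `Rates` class, the function name, and the example prices are all hypothetical, with rates expressed in dollars per million tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Per-million-token prices in dollars (hypothetical values below)."""
    cache_rate: float
    input_rate: float
    output_rate: float


def request_cost(cached_in: int, fresh_in: int, out: int, rates: Rates) -> float:
    """Dollar cost of one request from its three token counts."""
    total = (
        cached_in * rates.cache_rate
        + fresh_in * rates.input_rate
        + out * rates.output_rate
    )
    return total / 1_000_000


example_rates = Rates(cache_rate=0.10, input_rate=1.00, output_rate=4.00)
cost = request_cost(cached_in=6_000, fresh_in=500, out=300, rates=example_rates)
print(f"${cost:.4f} per request")
```

In this made-up example the 300 output tokens account for more than half of the cost even though they are under 5% of the tokens, which is the kind of detail a single averaged count would hide.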
The hidden multiplier: context window in multi-turn chats
This is where forecasts go wrong. In a conversation, most APIs are stateless - you resend the entire history on every turn. Turn 10 of a chat re-bills turns 1 through 9 as input, so cumulative input tokens grow roughly with the square of the number of turns. A 20-message conversation can cost an order of magnitude more than 20 isolated calls, even though the user typed the same amount.
The same applies to agents and RAG: every tool result and retrieved chunk you stuff into context is re-paid on each subsequent step. Large context windows (128K, 200K, 1M tokens) are a capability, not a free feature - filling them is exactly what makes input cost explode. Prompt caching and trimming history are the main defenses, covered in our guide on cutting LLM inference costs.
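The re-billing of history is easy to underestimate, so a small simulation helps. The sketch below assumes a stateless API, no caching, no history trimming, and fixed per-turn sizes; the function name and token counts are hypothetical.

```python
def chat_input_tokens(turns: int, user: int, assistant: int, system: int = 0) -> int:
    """Total input tokens billed across a chat when the full history
    is resent on every turn (no caching, no trimming)."""
    total = 0
    history = 0  # tokens from all earlier user and assistant messages
    for _ in range(turns):
        # Each call sends the system prompt, the whole history, and the new message.
        total += system + history + user
        # The new message and its reply join the history for the next turn.
        history += user + assistant
    return total


turns, user, assistant = 20, 50, 150
resent = chat_input_tokens(turns, user, assistant)
isolated = turns * user  # the same 20 messages sent as independent calls
print(resent, isolated, resent / isolated)
```

With these assumed sizes the chat bills 39,000 input tokens against 1,000 for the isolated calls. The output tokens are identical in both cases, so the entire difference is resent context.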
Why the same model costs different amounts
Open-weight models are hosted by many vendors, and a closed model may be available both first-party and through a cloud marketplace. Prices diverge for real reasons:
- Hardware and batching. A host running the model on cheaper or better-utilized GPUs can charge less. Throughput tiers differ even within one provider.
- Quantization and serving config. An FP8 or INT8 deployment is cheaper to run than full precision, sometimes at a small quality cost.
- Service tier. Batch or off-peak tiers can be 50% cheaper than real-time; priority tiers cost more.
- Margin and bundling. Some vendors subsidize inference to sell adjacent services.
The practical takeaway: never assume a model has one price. Compare the live rates side by side on the LLM inference comparison before committing a workload.
How to estimate your monthly bill
- Measure a representative request: count fresh input, cached-eligible input, and output tokens.
- Multiply each by the provider's three rates and divide by one million to get cost per request.
- For chat, multiply by the average number of turns and account for resent history.
- Multiply by expected requests per day, then by 30. Add a 20-30% buffer for retries, longer outputs, and traffic spikes.
- Re-run the math for two or three providers - the cheapest on input may be the most expensive on output for your ratio.
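The steps above can be sketched as a short script. Everything here is illustrative: the provider names, rates, workload, and traffic figure are invented, and the 25% buffer is simply the midpoint of the 20-30% range. For chat workloads, the per-request counts should already include resent history.

```python
def request_cost(fresh_in: int, cached_in: int, out: int, rates: dict) -> float:
    """Dollar cost of one request; rates are dollars per million tokens."""
    total = (
        fresh_in * rates["input"]
        + cached_in * rates["cache"]
        + out * rates["output"]
    )
    return total / 1_000_000


def monthly_cost(cost_per_request: float, requests_per_day: int,
                 days: int = 30, buffer: float = 0.25) -> float:
    """Scale a per-request cost to a month and add a safety buffer."""
    return cost_per_request * requests_per_day * days * (1 + buffer)


# Hypothetical providers: A is cheaper on input, B is cheaper on output.
PROVIDERS = {
    "provider_a": {"input": 0.50, "cache": 0.05, "output": 4.00},
    "provider_b": {"input": 1.00, "cache": 0.10, "output": 2.00},
}

# Output-heavy workload measured from a representative request.
workload = {"fresh_in": 200, "cached_in": 0, "out": 2_000}

estimates = {
    name: monthly_cost(request_cost(**workload, rates=rates), requests_per_day=10_000)
    for name, rates in PROVIDERS.items()
}
for name, dollars in estimates.items():
    print(f"{name}: ${dollars:,.2f} per month")
```

For this output-heavy workload the provider with the cheaper input rate ends up roughly twice as expensive overall, which is the ratio effect the last step warns about.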
Plug your numbers into the token cost calculator to skip the arithmetic, and check whether a smaller or different model on the comparison table meets your quality bar at a fraction of the cost.
Takeaway
LLM API pricing is three rates, not one: cached input, fresh input, and output. The biggest surprises come from output being several times pricier than input and from context being re-billed every turn. Forecast those three counts separately, exploit caching for stable prefixes, and compare the same model across hosts before you lock in. If your volume is high and your traffic is steady, that pricing literacy also tells you when an API stops being the cheapest option - the threshold where serverless GPU or dedicated hosting starts to win.