LLM Token Pricing Explained: Input vs Output Token Costs
An introduction to large language model token pricing, explaining what tokens are, why input and output rates differ, and how to forecast costs.
If you have ever looked at an LLM provider's pricing page and felt lost in talk of tokens, prompts, and per-million rates, you are not alone. Token pricing is the standard way modern inference APIs charge for work, and once you understand the model it becomes easy to reason about. This guide explains what a token is, why input and output tokens are priced differently, and how to turn those numbers into a realistic estimate of what your application will cost to run.
What Is a Token?
A token is a chunk of text that the model reads and writes. It is not exactly a word and not exactly a character. Tokenizers break text into common fragments, so a short common word might be a single token while a long or unusual word might split into several. As a rough rule of thumb often cited across the industry, a token corresponds to roughly four characters of English text, which works out to somewhere near three quarters of a word on average. Punctuation, spaces, and formatting all consume tokens too.
Because the model thinks in tokens, providers bill in tokens. Every request you send is measured by how many tokens it contains, and every response is measured by how many tokens the model generates.
Input Tokens Versus Output Tokens
This is the single most important concept in LLM pricing. Your bill is split into two parts:
- Input tokens: everything you send to the model. This includes your prompt, any system instructions, conversation history, and documents you paste in for context.
- Output tokens: everything the model generates in response.
Almost universally, output tokens cost more than input tokens, often by a multiple. The reason is rooted in how the models work. Reading your input can be processed efficiently, while generating output happens one token at a time in sequence, which is more compute-intensive per token. So providers charge a premium for the tokens the model produces.
Why the Split Matters for Your Bill
Two applications can send the same number of total tokens and pay very different amounts. A summarization tool that ingests long documents and returns short summaries is input-heavy, which tends to be cheaper. A creative writing or code-generation tool that takes a short prompt and produces pages of output is output-heavy, which tends to be more expensive. Knowing which side of the ledger dominates your workload tells you where to focus optimization.
How Pricing Is Quoted
Providers typically quote prices per million tokens, listing a separate figure for input and output. The structure looks like this in general form:
| Component | Billed by | Relative cost |
|---|---|---|
| Input tokens | Per million | Lower |
| Output tokens | Per million | Higher |
Prices vary widely between providers and between model tiers, so always read the specific rate for the specific model you intend to call. Larger, more capable models generally cost more per token than smaller, faster ones, which is why model selection is itself a cost decision.
Estimating Your Cost
You can estimate a single request with a simple calculation. Count your input tokens and your expected output tokens, multiply each by its per-token rate, and add them together. To estimate at scale, multiply by your expected number of requests per day or month.
- Estimate average input tokens per request, including history and context.
- Estimate average output tokens per request.
- Apply the input rate to input tokens and the output rate to output tokens.
- Add the two to get cost per request.
- Multiply by request volume for a period total.
The biggest forecasting mistake beginners make is forgetting that conversation history is re-sent on every turn. In a chat application, each new message carries the entire prior conversation as input. A long chat can quietly multiply your input token count even though each user message looks short.
Practical Ways to Reduce Token Costs
Once you understand the split, several levers become obvious:
- Trim the prompt: remove redundant instructions and avoid pasting more context than the task needs.
- Cap output length: set a sensible maximum so the model does not ramble into expensive territory.
- Manage history: summarize or truncate old conversation turns instead of resending everything.
- Match the model to the task: use a smaller, cheaper model for simple jobs and reserve the large model for hard ones.
- Use prompt caching where offered: many providers discount repeated input prefixes, which helps fixed system prompts.
Worked Examples to Build Intuition
Numbers make the model concrete. Imagine a customer support assistant that receives a short user question of roughly thirty tokens, carries a system prompt and a few knowledge snippets totaling around five hundred tokens, and produces an answer of about two hundred tokens. The input side here is large relative to the output, so the input rate matters most even though output is priced higher per token. Trimming that five hundred token context is the most direct way to cut the bill.
Now imagine a marketing copy generator that takes a fifty token brief and produces eight hundred tokens of polished text. This workload is output-heavy, so the premium output rate dominates. Here the lever is capping and shaping the output, since every extra paragraph the model writes costs the expensive rate. Two applications, two opposite optimization strategies, both revealed simply by knowing the input-to-output balance.
These examples also show why intuition fails. People assume the visible user message is the bill, but the invisible system prompt, retrieved context, and accumulated history often dwarf it. Always measure the full request as the model sees it, not just the part a user types.
How Token Pricing Compares to Older Models
Token pricing is usage-based, which means you pay in proportion to the work you do rather than for a fixed allotment of capacity. This is a strength for unpredictable or spiky workloads, because idle time costs nothing. It can be a weakness for very high, steady volume, where a more capable model called millions of times can add up quickly. At that scale, teams evaluate batch endpoints, cheaper model tiers, prompt caching, and in some cases self-hosting an open model on their own GPUs. The right path depends on volume, latency needs, and how much operational work you want to take on.
Putting It Together
Token pricing feels abstract until you map it onto your own workload. Once you know roughly how many input and output tokens a typical request uses, and you remember that output is the pricier side, you can forecast costs with confidence and spot where savings hide. Start by measuring real usage rather than guessing, since actual token counts almost always differ from intuition. With that grounding, the per-million numbers on a pricing page stop being intimidating and become a tool you can use to compare providers, choose models, and keep your inference bill predictable.