Prompt Caching and Pricing: How Cached Tokens Cut Your Bill
An explainer on prompt caching for LLM APIs, covering how cached tokens are discounted, when caching pays off, and how to structure prompts to benefit.
If your application sends the same large block of context to a language model again and again, you may be paying full price to process identical tokens repeatedly. Prompt caching exists to fix exactly that. By storing the model's work on a stable prefix of your prompt, providers can offer a steep discount on the cached portion of subsequent requests. For agents, retrieval-augmented systems, and any workload with a long fixed preamble, the savings can be substantial. This guide explains how cached-token pricing works and how to structure prompts to capture it.
What prompt caching actually does
When a model processes your prompt, it does internal work on every token before generating a response. If two requests share a long common prefix, such as a system prompt, a tool catalog, or a document being questioned, that prefix work is identical each time. Prompt caching stores the intermediate computation for the prefix so that later requests reusing it skip the expensive recomputation.
From a billing perspective, the cached portion of your input is charged at a reduced rate compared with fresh input tokens. The exact discount varies by provider, but cached input is consistently and meaningfully cheaper than uncached input. You still pay full price for the unique part of each prompt and for the generated output.
How cached-token pricing is structured
Providers typically split input pricing into a few components once caching is involved. Understanding the parts lets you reason about net savings rather than assuming caching is free.
- Cache write: the first time a prefix is cached, there may be a small premium to store it, since the provider does the full work plus caching.
- Cache read: subsequent requests that hit the cached prefix pay a deeply reduced rate on those tokens.
- Uncached input: the variable part of each prompt, charged at the standard input rate.
- Output: generated tokens, charged at the standard output rate, unaffected by caching.
The economics hinge on reuse. A cached prefix pays off only when it is read enough times to offset any write premium. One-off prompts gain nothing from caching, and where a write premium applies they cost slightly more than an uncached request, while a prefix reused thousands of times delivers large savings.
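The break-even point can be worked out directly from the two cache rates. The sketch below is illustrative only: `break_even_requests` is a hypothetical helper, and the multipliers (1.25 for a write, 0.1 for a read, both relative to the standard input rate) are made-up example values, not any provider's published pricing. It also assumes every request after the first arrives while the cache is still warm.

```python
import math


def break_even_requests(write_multiplier: float, read_multiplier: float) -> int:
    """Smallest number of requests sharing a prefix at which caching
    costs no more than sending the prefix uncached every time.

    Both multipliers are relative to the standard input rate (1.0).
    Cached cost for n requests:   write + (n - 1) * read
    Uncached cost for n requests: n
    """
    if not 0 <= read_multiplier < 1:
        raise ValueError("read_multiplier must be in [0, 1)")
    if write_multiplier < 0:
        raise ValueError("write_multiplier must be non-negative")
    # Solve write + (n - 1) * read <= n for n.
    n = (write_multiplier - read_multiplier) / (1 - read_multiplier)
    return max(1, math.ceil(n))


# Hypothetical rates: 25% write premium, 90% read discount.
print(break_even_requests(1.25, 0.1))  # -> 2
```

With these example rates, the second request already repays the write premium. A larger write premium or a shallower read discount pushes the break-even point further out.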
When prompt caching pays off
Caching shines in specific patterns. Recognizing them tells you where to invest the engineering effort to structure prompts for cache hits.
Long stable system prompts
Agents and assistants often carry a long system prompt with instructions, tools, and examples that never change between turns. Caching that block means you pay the full (or write) price once and the reduced read rate on later requests, across every conversation that arrives while the cache is still warm.
Document question answering
When users ask multiple questions about the same document, the document text is a perfect cache candidate. The first question warms the cache; the rest read it cheaply.
Few-shot prompting
Prompts that include a fixed set of examples before the variable query benefit directly, since the example block is identical every time.
How to structure prompts for caching
Capturing the discount requires putting the stable content where the cache can find it. The cache matches on a prefix, so order matters.
- Place all stable content at the very start: system instructions, tool definitions, fixed examples, and any shared document.
- Put the variable content, such as the user's specific question, after the stable block.
- Keep the stable prefix byte-for-byte identical across requests. Even small changes can invalidate the cache.
- Make the cached prefix long enough to be worth caching, since very short prefixes save little and some providers only cache prefixes above a minimum token count.
- Reuse the prefix within the cache lifetime, because caches expire after a period of inactivity.
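The ordering rules above can be sketched as a small prompt builder. Everything here is hypothetical: `STABLE_PREFIX`, `build_prompt`, and `shared_prefix_fingerprint` are illustrative names, and the prompt text is invented. The sketch only shows the stable-first layout and a way to check that the prefix has not drifted; it does not call any provider, and some providers may also require the cacheable block to be marked explicitly in the request.

```python
import hashlib

# Stable content first: instructions, tool descriptions, fixed examples.
# This text must stay identical across requests (assumed example content).
STABLE_PREFIX = "\n".join([
    "You are a support assistant for an example company.",
    "Tools: lookup_order(order_id), refund_policy().",
    "Example: Q: Where is my order? A: Let me look that up.",
])


def build_prompt(question: str) -> str:
    """Append the variable part after the stable block, never before it."""
    return STABLE_PREFIX + "\n\nQuestion: " + question


def shared_prefix_fingerprint(prompt: str) -> str:
    """Hash the leading characters that should match STABLE_PREFIX.

    Logging this value per request makes accidental prefix changes visible:
    if the fingerprint varies, the cache is likely being missed.
    """
    head = prompt[: len(STABLE_PREFIX)]
    return hashlib.sha256(head.encode("utf-8")).hexdigest()


first = build_prompt("What is the refund window?")
second = build_prompt("Who approves exceptions?")
# Different questions, same leading block, so the fingerprints match.
print(shared_prefix_fingerprint(first) == shared_prefix_fingerprint(second))
```

Hashing characters is only a proxy, since caches typically match on tokens, but an identical character prefix is a reasonable first check.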
Estimating your savings
To judge whether caching helps, compare the cost of your prompt with and without it. The table below frames the inputs.
| Input | Role in the estimate |
|---|---|
| Prefix token count | The tokens eligible for caching |
| Reuse count | How many requests share the prefix |
| Cache read discount | Reduced rate on cached tokens |
| Cache write premium | Any extra cost on the request that first stores the prefix |
| Variable token count | Charged at full input rate |
| Cache lifetime | Whether reuse happens before expiry |
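To estimate cached spend, charge the prefix once at the cache write rate, then multiply the prefix tokens by the number of remaining requests and by the cache read rate. Add the variable tokens at the full input rate in both scenarios, and compare the total against paying full price for the prefix on every request. The larger the prefix and the higher the reuse, the bigger the gap in caching's favor.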
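As a rough sketch of that comparison, the function below totals input cost with and without caching, counting one cache write followed by reads. The names (`CachingEstimate`, `estimate_input_cost`) and all rates in the usage example are hypothetical, chosen for illustration rather than taken from any provider. It assumes every reuse lands within the cache lifetime and ignores output tokens, which caching does not change.

```python
from dataclasses import dataclass


@dataclass
class CachingEstimate:
    uncached_cost: float  # total input cost with no caching
    cached_cost: float    # total input cost with one write, then reads

    @property
    def savings(self) -> float:
        return self.uncached_cost - self.cached_cost


def estimate_input_cost(
    prefix_tokens: int,
    variable_tokens: int,
    reuse_count: int,
    input_rate: float,        # price per input token (assumed unit)
    write_multiplier: float,  # cache write rate relative to input_rate
    read_multiplier: float,   # cache read rate relative to input_rate
) -> CachingEstimate:
    if reuse_count < 1:
        raise ValueError("reuse_count must be at least 1")
    # The variable part is billed at the full rate in both scenarios.
    variable = reuse_count * variable_tokens * input_rate
    uncached = reuse_count * prefix_tokens * input_rate + variable
    # First request writes the cache; the remaining ones read it.
    prefix_cached = prefix_tokens * input_rate * (
        write_multiplier + (reuse_count - 1) * read_multiplier
    )
    return CachingEstimate(uncached, prefix_cached + variable)


# Hypothetical workload: 10,000-token prefix reused by 100 requests.
est = estimate_input_cost(10_000, 200, 100, 1e-6, 1.25, 0.1)
print(f"uncached={est.uncached_cost:.4f} cached={est.cached_cost:.4f}")
```

Running the same function with `reuse_count=1` shows the other side of the trade: a single request pays the write premium and ends up with negative savings.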
Pitfalls to avoid
Caching is powerful but easy to misuse. Inserting dynamic content, such as a timestamp or a per-user detail, into the stable prefix breaks the cache and erases the savings. Letting the cache expire between uses forces a fresh write each time, which can cost more than not caching at all for sparse traffic. And caching tiny prefixes adds complexity for negligible benefit. Audit your prompts to confirm the stable block is truly stable and genuinely reused.
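The expiry pitfall is easy to check against real traffic. The sketch below replays a list of request times and counts how many would write the cache versus read it. `count_cache_events` is a hypothetical helper, and it assumes an inactivity-based lifetime that is refreshed on every use, with a request arriving exactly at the limit still counted as a hit; actual expiry rules vary by provider.

```python
from typing import Iterable, Tuple


def count_cache_events(
    request_times: Iterable[float], lifetime_seconds: float
) -> Tuple[int, int]:
    """Return (writes, reads) for requests sharing one prefix.

    A request reads the cache if the previous request happened within
    `lifetime_seconds`; otherwise the cache has expired (or never
    existed) and the request pays for a fresh write.
    """
    writes = 0
    reads = 0
    last_use = None
    for t in sorted(request_times):
        if last_use is not None and t - last_use <= lifetime_seconds:
            reads += 1
        else:
            writes += 1
        last_use = t  # each use refreshes the lifetime (assumption)
    return writes, reads


# Steady traffic: one write, then reads.
print(count_cache_events([0, 60, 120, 180], lifetime_seconds=300))   # -> (1, 3)
# Sparse traffic: every request misses and pays the write premium.
print(count_cache_events([0, 600, 1200], lifetime_seconds=300))      # -> (3, 0)
```

If the write count is close to the request count, the traffic is too sparse for caching to help at that lifetime.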
Caching and latency, a second benefit
The cost discount is the headline, but caching also improves latency, which can matter just as much for user experience. Because the model skips recomputing the cached prefix, responses that reuse a long preamble often start generating faster. For interactive agents carrying a large system prompt across many turns, this speedup compounds with the savings, giving you a cheaper and snappier application at once. When you weigh whether to invest in cache-friendly prompt structure, count the latency gain alongside the dollars saved.
Combining caching with other cost levers
Prompt caching does not work in isolation. It pairs naturally with other ways to trim an LLM bill, and the effects stack. Trimming unnecessary tokens from the variable portion of your prompt reduces the part that caching never discounts. Choosing a smaller model where quality permits lowers the rate on every token, cached or not. Batching requests that do not need an immediate answer can earn a discount or higher throughput on some platforms. Caching is most powerful as one tool in a kit: it handles the repeated-context problem cleanly while you address output length, model choice, and request patterns separately.
Prompt caching turns repeated context from a recurring expense into a near-fixed cost. By placing stable content first, keeping it identical, and reusing it within the cache window, you can cut input costs significantly for agents, document workflows, and few-shot prompts. Measure your prefix size and reuse rate, structure prompts deliberately, and prompt caching becomes one of the simplest, highest-leverage ways to lower an LLM bill.