Hidden Token Cost of Tool Use | DeployCue Skip to content
DeployCue
LLM Inference

Function Calling and Tool Use: The Hidden Token Overhead

Jun 20, 2026

An analysis of the hidden token overhead in function calling and tool use, covering tool schemas, multi-turn loops, and tactics to keep agentic inference affordable.

Function calling makes large language models far more useful by letting them invoke tools, query data, and take actions. It also quietly inflates your token bill in ways that are easy to miss. Every tool definition you provide, every round trip in a multi-step loop, and every tool result fed back to the model consumes tokens that you pay for. For agentic applications that chain many tool calls, this hidden overhead can dwarf the cost of the user's original question. Understanding where the tokens go is the first step to keeping tool use affordable.

Where the Tokens Actually Go

When you enable tool use, several things get added to the input the model processes, often on every single turn.

  • Tool definitions: the schemas describing each available tool, including names, descriptions, and parameter structures, are sent as part of the prompt.
  • System instructions: guidance on when and how to use tools adds to the standing prompt.
  • Tool call output: the model's structured request to call a tool is generated output you pay for.
  • Tool results: whatever the tool returns is fed back into the next prompt as input tokens.
  • Conversation history: in a multi-step loop, prior turns accumulate and are re-sent each time.

The schemas are the most overlooked. A rich tool with many parameters and detailed descriptions can be surprisingly large, and if you attach a dozen tools, those definitions are prepended to every request whether or not the model uses any of them.

The Multiplier Effect of Agent Loops

The real cost explosion comes from multi-step loops. An agent that calls a tool, reads the result, calls another tool, and so on, re-sends the growing conversation on every step. By the time it has taken several tool steps, the prompt may include the original question, all the tool definitions, every prior tool call, and every prior tool result. Each step pays to process all of it again. A task that looks like one question to the user can become many model invocations, each larger than the last.

Source of tokensWhen it is chargedHow it grows
Tool schemasEvery turnWith number and complexity of tools
Tool resultsEvery step after a callWith how verbose the tool output is
History replayEvery stepWith the number of steps taken
Reasoning outputEvery stepWith how much the model deliberates

Tactics to Reduce the Overhead

Trim and Prune Tool Definitions

Only expose the tools relevant to the current task. If your application has many tools, consider selecting a subset based on context rather than attaching all of them to every request. Keep tool descriptions concise but clear: the model needs enough to choose correctly, but verbose prose in every schema is paid for on every turn. Tight, well-written schemas reduce overhead and often improve tool selection accuracy at the same time.

Control Tool Result Size

Tool results are frequently the largest hidden cost. A search tool that returns full documents, or a database query that returns hundreds of rows, dumps all of that into the next prompt. Truncate, summarize, or paginate tool output so the model receives only what it needs. Returning the top few results instead of everything can cut input tokens sharply with no loss of usefulness.

Manage Conversation History

Because history is replayed on every step, long agent loops pay for their own past repeatedly. Summarize older turns once they are no longer needed in full detail, or drop intermediate tool exchanges that the final answer does not depend on. The goal is to keep the working context focused on what the next step actually requires rather than carrying the entire transcript forward.

Exploit Prompt Caching

Tool definitions and system instructions are usually identical across requests, which makes them ideal candidates for prompt caching. If your provider caches a stable prefix, the repeated tool schemas can be billed at a reduced cached rate instead of full price on every turn. Structuring your prompt so the unchanging parts come first maximizes how much of it can be cached.

Cap the Loop

An agent that loops without limit can run up a large bill on a single hard task, or worse, get stuck repeating tool calls. Set a maximum number of steps and a token budget per task. When the cap is reached, return the best answer so far or escalate gracefully. This protects against runaway costs from edge-case prompts and makes your spending predictable.

Choose the Right Model for the Loop

Tool-using loops multiply token counts, so the per-token price of the model you use is amplified across every step. For routine tool orchestration, a smaller and cheaper model may handle the mechanical work of calling tools and stitching results perfectly well, reserving an expensive model only for steps that need real reasoning. Combining model routing with tool use is one of the highest-leverage ways to keep agentic workloads affordable.

A Practical Checklist

  1. Attach only the tools relevant to the current task, not your entire catalog.
  2. Keep tool schemas concise while still clear enough for correct selection.
  3. Truncate or summarize tool results before feeding them back.
  4. Summarize or prune conversation history in long loops.
  5. Order prompts so stable tool definitions can be cached.
  6. Cap the number of steps and tokens per task.
  7. Route mechanical tool steps to cheaper models where quality allows.

Parallel Tool Calls and Their Tradeoffs

Some models can request several tool calls in a single turn rather than one at a time. This is often cheaper than a long serial loop, because it collapses what would have been multiple round trips, each replaying the full history, into fewer turns. If your task naturally needs several independent lookups, encouraging parallel tool calls can cut the number of times you re-send the schemas and the conversation. The tradeoff is that the model commits to all the calls before seeing any results, so it cannot adapt based on what the first tool returned. Use parallel calls for independent fetches and serial calls when each step depends on the last.

Whatever pattern you use, instrument it. Token usage for tool-heavy workloads is rarely intuitive, and the only reliable way to find waste is to log the input and output token counts for every turn of a full agent run. Once you can see which turns are large and why, the fixes become obvious: a bloated tool result here, an unpruned history there, a schema that could be trimmed. Without that visibility, teams routinely underestimate agent cost by a wide margin because they reason about the user's question rather than the dozens of token-laden turns it triggers.

Function calling is one of the most powerful capabilities modern models offer, but its costs are structural and easy to overlook. The tokens hide in the schemas, the tool results, and the replayed history of every step. By trimming each of those, caching what stays constant, and capping how long loops run, you can build capable tool-using applications without watching your inference bill grow with every extra step. Measure the token count of a full agent run, not just the user's question, and you will quickly see where the real spending lives.