Tracking Cost Per Request: Unit Economics for AI Features
A practical guide to building cost-per-request unit economics for AI inference, covering token accounting, allocation methods, and the metrics that keep margins healthy.
When an AI feature ships, the demo looks magical and the bill arrives later. The gap between those two moments is where unit economics live. Cost per request is the single most useful number for understanding whether an AI feature pays for itself, yet most teams cannot state it precisely. This guide walks through how to measure cost per request for inference workloads, how to allocate shared GPU spend fairly, and how to turn that number into pricing and capacity decisions you can defend.
Why Cost Per Request Matters More Than Total Spend
Total monthly GPU spend tells you what you paid, but it hides the structure underneath. Two features can cost the same in aggregate while one earns healthy margin and the other quietly loses money on every call. Cost per request normalizes spend against the unit your business actually sells or serves, which makes it comparable across features, models, and customer tiers.
Once you have a reliable cost per request, you unlock several decisions at once. You can set usage-based pricing with a known floor, forecast spend as traffic grows, identify which prompts or endpoints are expensive outliers, and decide whether a smaller model or a cheaper provider would protect margin without hurting quality.
The Building Blocks of a Per-Request Cost
A request's cost is the sum of the resources that request consumed. Where you can measure a resource per request, attribute it directly; where a resource is shared, apportion it across the requests that used it. For LLM inference, the major components are usually compute time on the GPU, input and output tokens, and any retrieval or pre-processing that runs alongside the model.
Token-Based Costing
If you call a hosted inference API, token pricing is explicit. You pay one rate per input token and a different rate per output token. Capturing both counts per request is the foundation. Log the prompt token count, the completion token count, and the model name on every call, because output tokens are typically priced several times higher than input tokens and tend to dominate cost for generative features.
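As a minimal sketch of the token accounting described above, the function below prices one request from its logged token counts. The model names and rates are hypothetical placeholders, not any provider's actual prices, and the per-million-token unit is an assumption about how rates are quoted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRates:
    """Prices in USD per one million tokens (hypothetical values)."""
    input_per_million: float
    output_per_million: float


# Hypothetical pricing table; replace with your provider's published rates.
PRICING = {
    "example-model-small": TokenRates(input_per_million=0.50, output_per_million=1.50),
    "example-model-large": TokenRates(input_per_million=5.00, output_per_million=15.00),
}


def token_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request from its token counts."""
    rates = PRICING[model]
    return (
        input_tokens * rates.input_per_million
        + output_tokens * rates.output_per_million
    ) / 1_000_000
```

With these placeholder rates, a request with 1,000 input tokens and 500 output tokens on `example-model-large` is priced at 0.0125 USD, and the output tokens account for more than half of it despite being half as numerous.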
GPU-Time Costing
If you self-host on rented GPUs, you pay for the instance by the hour or second whether or not a request is in flight. Here the cost per request depends on throughput. The formula is straightforward: take the hourly instance rate, divide by the number of requests served in that hour, and you have an average cost per request. Higher utilization drives this number down, which is why batching and concurrency tuning matter so much for self-hosted economics.
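The formula above can be sketched in a few lines. The function name and the 4.00 USD hourly rate used in the note below are illustrative assumptions, not figures from any particular provider.

```python
def gpu_cost_per_request(hourly_rate: float, requests_in_hour: int) -> float:
    """Average cost per request: hourly instance rate / requests served that hour."""
    if requests_in_hour <= 0:
        # With no traffic the hour's cost cannot be assigned to any request.
        raise ValueError("requests_in_hour must be positive")
    return hourly_rate / requests_in_hour
```

At an assumed rate of 4.00 USD per hour, serving 1,200 requests in the hour gives roughly 0.0033 USD per request, while serving 4,000 gives 0.001 USD. This is an average: it says nothing about which individual requests were long or short.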
Allocating Shared and Idle Costs
Real systems rarely map one request to one dedicated resource. A single GPU node serves many requests, some capacity sits idle, and supporting services like vector databases and load balancers add overhead. You need an allocation method that is consistent and explainable.
- Direct attribution: assign costs you can measure per request, such as API token charges, straight to that request.
- Throughput allocation: spread fixed GPU-hour costs across the requests served in the same window, so busy hours carry a lower per-request share than quiet ones.
- Overhead loading: add a flat percentage for shared infrastructure that you cannot cleanly split, reviewed quarterly so it stays honest.
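One way the three methods above might combine into a single per-request figure is sketched here. It assumes a one-hour allocation window and applies the overhead percentage to the sum of the direct and throughput shares; both choices, and all parameter names, are assumptions rather than a prescribed method.

```python
def allocated_cost(
    direct_cost: float,
    gpu_hourly_rate: float,
    requests_in_window: int,
    overhead_pct: float,
) -> float:
    """Combine direct attribution, throughput allocation, and overhead loading.

    direct_cost        -- measured per-request charges (e.g. API token fees)
    gpu_hourly_rate    -- fixed GPU cost for the one-hour window (assumed window)
    requests_in_window -- requests served in that same window
    overhead_pct       -- flat loading for shared infrastructure, e.g. 0.10 for 10%
    """
    if requests_in_window <= 0:
        raise ValueError("requests_in_window must be positive")
    throughput_share = gpu_hourly_rate / requests_in_window
    subtotal = direct_cost + throughput_share
    return subtotal * (1 + overhead_pct)
```

Keeping the overhead percentage as an explicit parameter makes the quarterly review simple: change one number and the history of the rate is visible in version control.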
Idle time deserves special attention. If a GPU runs at twenty percent utilization, eighty percent of its cost pays for idle capacity, and the requests that do arrive absorb all of it, so each one costs roughly five times what it would at full utilization. Tracking utilization alongside cost per request shows whether your problem is the model, the traffic shape, or simply too much reserved capacity.
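To make the idle share visible, a per-request cost can be split into the part that paid for useful work and the part that paid for idle capacity. This sketch assumes utilization is expressed as the fraction of the billed window in which the GPU was busy; the function name is hypothetical.

```python
def split_busy_idle(cost_per_request: float, utilization: float) -> tuple[float, float]:
    """Split a per-request cost into (busy share, idle share).

    utilization is the fraction of the billed window the GPU spent serving
    requests, so it must lie in the interval (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    busy = cost_per_request * utilization
    idle = cost_per_request - busy
    return busy, idle
```

For a request that costs 0.02 USD at twenty percent utilization, this attributes 0.004 USD to useful compute and 0.016 USD to idle time. Reporting the two shares separately helps distinguish an expensive model from an underused node.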
A Worked Example
Consider a summarization feature running on a self-hosted setup. The table below shows how utilization changes the math on an instance billed at a steady hourly rate.
| Scenario | Requests per hour | Relative cost per request |
|---|---|---|
| Low traffic, idle GPU | 200 | High (20x the batched case) |
| Moderate traffic | 1,200 | Medium (about 3.3x the batched case) |
| Batched, high concurrency | 4,000 | Low (1x, baseline) |
The instance rate never changed, yet the per-request cost fell sharply as throughput rose. This is the core lesson of inference unit economics: for self-hosted workloads, the lever is utilization, and for API-based workloads, the lever is token efficiency.
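The scenarios above can be turned into numbers once an hourly rate is chosen. The sketch below uses a hypothetical rate of 4.00 USD per hour, which is not given in the example itself; the ratios between scenarios hold for any rate.

```python
HOURLY_RATE_USD = 4.00  # hypothetical instance rate, chosen only for illustration

SCENARIOS = {
    "Low traffic, idle GPU": 200,
    "Moderate traffic": 1_200,
    "Batched, high concurrency": 4_000,
}


def cost_table(hourly_rate: float, scenarios: dict[str, int]) -> dict[str, tuple[float, float]]:
    """Map each scenario to (cost per request, multiple of the cheapest scenario)."""
    baseline = hourly_rate / max(scenarios.values())
    rows = {}
    for name, requests_per_hour in scenarios.items():
        cost = hourly_rate / requests_per_hour
        rows[name] = (cost, cost / baseline)
    return rows


if __name__ == "__main__":
    for name, (cost, ratio) in cost_table(HOURLY_RATE_USD, SCENARIOS).items():
        print(f"{name}: ${cost:.4f} per request ({ratio:.1f}x baseline)")
```

Under this assumed rate the low-traffic scenario costs 0.02 USD per request and the batched scenario 0.001 USD, a twentyfold difference produced by throughput alone.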
Instrumenting Your Pipeline
Good unit economics require good telemetry. At minimum, emit a structured event per request that records the model, input tokens, output tokens, latency, and a feature or tenant identifier. Send those events to a store where you can aggregate by feature and by day. Many teams attach the cost calculation at write time using a small pricing table, so each event already carries an estimated cost.
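A structured event with the cost attached at write time might look like the sketch below. The field names, the pricing version string, and the rates are all hypothetical; the point is only that the event carries both the raw counts and the estimate, plus the pricing version used to compute it.

```python
import json
from datetime import datetime, timezone

# Hypothetical versioned pricing table: USD per one million tokens.
PRICING_VERSION = "hypothetical-v1"
RATES_PER_MILLION = {
    "example-model": {"input": 1.00, "output": 3.00},
}


def build_event(
    model: str,
    input_tokens: int,
    output_tokens: int,
    latency_ms: float,
    feature: str,
    tenant: str,
) -> dict:
    """Build one per-request telemetry event with an estimated cost attached."""
    rates = RATES_PER_MILLION[model]
    estimated_cost = (
        input_tokens * rates["input"] + output_tokens * rates["output"]
    ) / 1_000_000
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "feature": feature,
        "tenant": tenant,
        "pricing_version": PRICING_VERSION,
        "estimated_cost_usd": estimated_cost,
    }


if __name__ == "__main__":
    event = build_event("example-model", 2000, 1000, 850.0, "summarize", "tenant-a")
    print(json.dumps(event))  # ship this line to your event store of choice
```

Storing the pricing version with each event means a later rate change does not silently rewrite history: old events can be recomputed deliberately from their raw counts if needed.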
- Log model name, token counts, and latency on every inference call.
- Maintain a versioned pricing table that maps models to rates.
- Compute an estimated cost per event and store it alongside the raw counts.
- Roll up daily by feature, tenant, and model to spot trends and outliers.
- Reconcile your estimate against the provider invoice each billing cycle.
That final reconciliation step keeps the system trustworthy. Estimates drift when pricing changes or when hidden charges like egress or storage creep in, and a monthly check against the real bill catches the drift early.
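The rollup and reconciliation steps can be sketched as two small functions. The event field names (`day`, `feature`, `tenant`, `model`, `estimated_cost_usd`) are assumptions about how events are stored, and defining drift relative to the invoice is one reasonable convention among several.

```python
from collections import defaultdict


def daily_rollup(events: list[dict]) -> dict[tuple, float]:
    """Sum estimated cost by (day, feature, tenant, model)."""
    totals: dict[tuple, float] = defaultdict(float)
    for event in events:
        key = (event["day"], event["feature"], event["tenant"], event["model"])
        totals[key] += event["estimated_cost_usd"]
    return dict(totals)


def reconciliation_drift(estimated_total: float, invoiced_total: float) -> float:
    """Return the share of the invoice that the estimates failed to explain.

    Positive values mean the invoice was higher than the estimate, for
    example because of charges the per-request events never captured.
    """
    if invoiced_total <= 0:
        raise ValueError("invoiced_total must be positive")
    return (invoiced_total - estimated_total) / invoiced_total
```

If the estimates for a billing cycle sum to 950 USD and the invoice reads 1,000 USD, the drift is 0.05, meaning five percent of the bill is unexplained and worth investigating.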
From Cost Per Request to Decisions
With a dependable number in hand, the strategic moves become clear. If a feature's cost per request exceeds the revenue it generates, you can route it to a cheaper model, cap output length, cache common responses, or reprice. If a customer tier is unprofitable, usage limits or higher pricing restore the margin. If utilization is low, consolidating onto fewer, busier nodes lowers everyone's per-request share.
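The first check in that list, whether a feature's cost per request exceeds its revenue per request, is simple enough to automate. The sketch below assumes you already have both figures per feature; the feature names and amounts in the usage note are invented for illustration.

```python
def unprofitable_features(per_request: dict[str, tuple[float, float]]) -> list[str]:
    """Return features whose cost per request exceeds revenue per request.

    per_request maps a feature name to (revenue_per_request, cost_per_request).
    """
    return sorted(
        name
        for name, (revenue, cost) in per_request.items()
        if cost > revenue
    )
```

Given `{"summarize": (0.010, 0.004), "long-report": (0.020, 0.035)}`, the function returns `["long-report"]`, which flags that feature as the candidate for a cheaper model, an output cap, caching, or repricing.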
Cost per request also sharpens build-versus-buy conversations. When you know the self-hosted cost at your current utilization, you can compare it directly against hosted API pricing and choose based on numbers rather than instinct.
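A direct comparison can be sketched as follows. It assumes the self-hosted cost is the hourly instance rate divided by requests served per hour, and that the hosted API cost per request is already known from token accounting; the function names and the figures in the usage note are hypothetical.

```python
def breakeven_requests_per_hour(hourly_rate: float, api_cost_per_request: float) -> float:
    """Requests per hour at which self-hosting matches the hosted API cost."""
    if api_cost_per_request <= 0:
        raise ValueError("api_cost_per_request must be positive")
    return hourly_rate / api_cost_per_request


def cheaper_option(hourly_rate: float, requests_per_hour: int, api_cost_per_request: float) -> str:
    """Return 'self-hosted' or 'api' based on average cost per request."""
    if requests_per_hour <= 0:
        # With no traffic, a pay-per-call API costs nothing while the GPU still bills.
        return "api"
    self_hosted_cost = hourly_rate / requests_per_hour
    return "self-hosted" if self_hosted_cost < api_cost_per_request else "api"
```

With an assumed instance rate of 4.00 USD per hour and an API cost of 0.005 USD per request, the break-even point is about 800 requests per hour: below that volume the API is cheaper, above it the self-hosted node wins. The sketch ignores operational costs such as engineering time, which a real comparison should add to the self-hosted side.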
Conclusion
Cost per request turns a vague monthly bill into a precise operating metric. Capture token counts and GPU time, allocate shared costs with a method you can explain, reconcile against the invoice, and watch utilization closely. Do that and every AI feature carries a price tag you can trust, which makes pricing, forecasting, and optimization far less of a guessing game and far more of an engineering discipline.