Self-Hosting LLMs vs Using an API: The Break-Even Math
The break-even math for self-hosting an LLM vs an API: GPU-hour cost, throughput, utilization, ops burden, and latency, with a worked example in ranges.
"Should we self-host?" usually gets answered with vibes - someone saw a low GPU-hour price and assumed it beats the API, or someone got burned by a tuning project and swore off it. The honest answer is a number you can compute. Self-hosting wins when your effective cost per million tokens, derived from GPU-hour price and throughput, drops below the API's per-token rate - and that hinges almost entirely on utilization. Here is the full math, with a worked example in ranges.
The two cost models
A per-token API charges only for tokens processed: you pay nothing when idle, and cost scales linearly with usage. Self-hosting inverts this. You rent a GPU by the hour and pay whether or not it is doing anything; your per-token cost is whatever you can amortize that fixed hourly bill across. The whole decision is the tension between "pay only for what you use" and "pay for capacity and use it well."
The formula
Self-hosted cost per million tokens is the GPU-hour cost divided by the millions of tokens that GPU produces in an hour:
- Effective tokens/hour = sustained tokens/second × 3600 × utilization
- Cost per 1M tokens = (GPU $/hour) / (effective tokens/hour / 1,000,000)
Three variables drive everything:
- GPU $/hour - rent price for the accelerator. Check live rates on the GPU comparison; spot or interruptible instances can be much cheaper than on-demand.
- Throughput (tokens/second) - depends on model size, the GPU, quantization, and especially batching. A serving stack like vLLM that batches concurrent requests can deliver many times the throughput of serving one request at a time.
- Utilization - the fraction of the hour the GPU is actually producing tokens. This is the variable people forget, and it is usually the one that decides the outcome.
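As a minimal sketch, the formula translates directly into a few lines of Python. The function name and the input checks are illustrative choices, not part of any library; the sample inputs are placeholders, not quoted prices.

```python
def cost_per_million_tokens(gpu_usd_per_hour: float,
                            tokens_per_second: float,
                            utilization: float) -> float:
    """Self-hosted cost in USD per 1M tokens.

    utilization is the fraction of the hour (0 < u <= 1) the GPU
    spends producing tokens at the sustained rate.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    effective_tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_usd_per_hour / (effective_tokens_per_hour / 1_000_000)


# Placeholder inputs: $3/hour, 2,000 tok/s sustained, 60% utilization.
print(round(cost_per_million_tokens(3.0, 2000, 0.6), 2))  # 0.69
```

Halving utilization doubles the result, which is why that argument matters as much as the hourly price.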
A worked example (in ranges)
Take a mid-size open-weight model on a single high-end GPU. Suppose:
- GPU rent: roughly $2-$4 per hour on-demand (less on spot).
- Sustained batched throughput: on the order of 1,000-3,000 output tokens/second for a well-served mid-size model with concurrency.
At the favorable end (3,000 tok/s, $2/hr, 100% utilization), that GPU produces about 10.8M tokens/hour, so cost is roughly $2 / 10.8 = about $0.19 per million tokens. At the unfavorable end (1,000 tok/s, $4/hr, and only 20% utilization because traffic is bursty), effective output is about 0.72M tokens/hour and cost balloons to roughly $4 / 0.72 = about $5.56 per million.
| Scenario | Throughput | Utilization | GPU $/hr | Approx $/1M tokens |
|---|---|---|---|---|
| Best case | 3,000 tok/s | 100% | $2 | ~$0.19 |
| Realistic steady | 2,000 tok/s | 60% | $3 | ~$0.69 |
| Bursty / underused | 1,000 tok/s | 20% | $4 | ~$5.56 |
The same hardware spans a roughly 30x range: about 15x from batching and utilization (3x throughput times 5x utilization) and another 2x from the hourly rate. Compare those figures against the per-token rate of the same or a comparable model on the LLM inference comparison: if the API charges less than your computed self-hosted figure, the API wins; if it charges more, self-hosting is in play. Use the GPU cost calculator to sanity-check the hardware side of the math.
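To check the three scenarios yourself, the sketch below recomputes them from the table's inputs. The helper and variable names are illustrative; the printed figures should agree with the table to within rounding.

```python
def cost_per_million_tokens(gpu_usd_per_hour, tokens_per_second, utilization):
    effective_tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_usd_per_hour / (effective_tokens_per_hour / 1_000_000)


# (label, tokens/second, utilization, GPU $/hour), copied from the table.
SCENARIOS = [
    ("Best case", 3000, 1.00, 2.0),
    ("Realistic steady", 2000, 0.60, 3.0),
    ("Bursty / underused", 1000, 0.20, 4.0),
]

costs = {
    label: cost_per_million_tokens(price, tps, util)
    for label, tps, util, price in SCENARIOS
}

for label, cost in costs.items():
    print(f"{label:<20} ${cost:.2f} per 1M tokens")

# Ratio between the most and least expensive scenario.
spread = costs["Bursty / underused"] / costs["Best case"]
print(f"Worst-to-best spread: {spread:.0f}x")  # 30x
```

Swap in your own throughput, utilization, and hourly rate to see where your workload lands between the extremes.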
Utilization is the whole game
An idle GPU is the most expensive token machine there is. To make self-hosting pay you need either genuinely steady, high traffic, or a way to fill the gaps:
- Batch aggressively. Continuous batching keeps the GPU busy across many concurrent requests and is the single biggest throughput lever.
- Consolidate workloads. Route multiple apps to the same endpoint so the GPU rarely sits empty.
- Use spot capacity for fault-tolerant or batch jobs to cut the hourly rate.
- Scale to zero when traffic is spiky by using serverless GPU, which only bills while a request runs - you accept cold-start latency in exchange for zero idle cost.
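One way to make "high enough utilization" concrete is to solve the cost formula for the utilization at which self-hosting matches the API rate. The sketch below does that. The function name is made up for illustration, and the $1.00 per million API rate is a placeholder assumption, not a quoted price.

```python
def break_even_utilization(gpu_usd_per_hour: float,
                           tokens_per_second: float,
                           api_usd_per_million: float) -> float:
    """Utilization at which self-hosted $/1M tokens equals the API rate.

    A result above 1.0 means this GPU cannot break even at this price,
    even if it ran flat out all hour.
    """
    # What the API would charge for one full hour of this GPU's output.
    api_value_of_full_hour = (
        tokens_per_second * 3600 / 1_000_000 * api_usd_per_million
    )
    return gpu_usd_per_hour / api_value_of_full_hour


# Placeholder inputs: $3/hour GPU, 2,000 tok/s, API at $1.00 per 1M tokens.
needed = break_even_utilization(3.0, 2000, 1.00)
print(f"Break-even utilization: {needed:.0%}")  # 42%
```

If your honest utilization sits below that threshold, the levers in the list above are what close the gap.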
The costs the spreadsheet misses
The per-token figure is necessary but not sufficient. Self-hosting also carries:
- Ops burden. You own the serving stack, autoscaling, monitoring, model updates, and on-call. Budget engineering time, not just GPU dollars.
- Capacity risk. Top accelerators like the H100, H200, and A100 80GB can be supply-constrained; you may queue or pay premiums.
- Memory ceiling. The model and its KV cache must fit in VRAM; large context windows and high concurrency eat memory fast, sometimes forcing a bigger or second GPU.
- Reliability. Matching an API's uptime means redundancy, which raises your effective cost.
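The memory ceiling is the easiest of these to estimate up front. As a rough sketch, a standard transformer stores one key and one value vector per layer per KV head for every cached token. The model shape below is hypothetical, and the result is a worst case in which every sequence fills its whole context; serving stacks that allocate the cache in pages, as vLLM does, usually use less in practice.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   bytes_per_value: int,
                   context_tokens: int,
                   concurrent_sequences: int) -> int:
    """Worst-case KV cache size in bytes, excluding the model weights."""
    # Factor of 2: one key vector and one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrent_sequences


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, fp16 values,
# serving 32 concurrent sequences with an 8,192-token context each.
size = kv_cache_bytes(32, 8, 128, 2, 8192, 32)
print(f"KV cache: {size / GIB:.1f} GiB")  # 32.0 GiB
```

Add the weights on top of that figure before deciding whether a single GPU is enough.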
Latency considerations
Self-hosting can reduce tail latency because you control batching and hardware locality and share the GPU with no noisy neighbors - useful for real-time UX. But batching tuned purely for throughput can hurt per-request latency, and serverless cold starts can add seconds to the first call. APIs hide all of this and generally offer predictable, low latency at the cost of less control. Decide whether your product needs control over latency or simply low latency.
How to decide
- Measure your real monthly token volume and how steady it is across the day.
- Pick a candidate model and GPU; estimate batched throughput for that pairing.
- Compute cost per million tokens at your honest utilization, not the best case.
- Compare against the per-token API rate on the comparison table.
- Add a realistic ops cost and a reliability buffer to the self-hosted side.
- If traffic is spiky, price serverless GPU as the middle path before committing to a dedicated instance.
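The steps above can be sketched as a monthly comparison for a dedicated, always-on deployment. Every input here is a placeholder assumption (token volume, prices, the $2,000 ops figure, and a second GPU for redundancy), and the function names are illustrative.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def monthly_self_hosted_usd(gpu_usd_per_hour, gpu_count, ops_usd_per_month):
    """Always-on GPUs plus a flat allowance for engineering time."""
    return gpu_usd_per_hour * HOURS_PER_MONTH * gpu_count + ops_usd_per_month


def monthly_api_usd(tokens_per_month, api_usd_per_million):
    return tokens_per_month / 1_000_000 * api_usd_per_million


def monthly_capacity_tokens(tokens_per_second, gpu_count):
    """Tokens the fleet could produce at 100% utilization."""
    return tokens_per_second * 3600 * HOURS_PER_MONTH * gpu_count


# Placeholder inputs.
tokens_per_month = 5_000_000_000
self_hosted = monthly_self_hosted_usd(3.0, gpu_count=2, ops_usd_per_month=2000)
api = monthly_api_usd(tokens_per_month, api_usd_per_million=1.00)
utilization = tokens_per_month / monthly_capacity_tokens(2000, gpu_count=2)

print(f"Self-hosted: ${self_hosted:,.0f}/month")   # $6,380/month
print(f"API:         ${api:,.0f}/month")           # $5,000/month
print(f"Implied utilization: {utilization:.0%}")   # 48%
print("Cheaper option:", "self-host" if self_hosted < api else "API")
```

With these particular numbers the API comes out ahead. The point of the sketch is that ops cost and redundancy can flip a result that looked favorable on GPU dollars alone.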
Takeaway
The break-even is not a fixed dollar figure - it is your GPU-hour cost divided by realistic, utilization-adjusted throughput, compared against the API's per-token rate. APIs win at low or bursty volume and for small teams that should not run GPU fleets; self-hosting wins when traffic is high and steady enough to keep a well-batched GPU busy and you have the ops capacity to run it. Run the numbers with live rates from the GPU and LLM comparisons and the GPU cost calculator before you buy capacity you can't fill.