Open-Weight vs Closed LLMs: Cost, Control, and Privacy
A clear comparison of open-weight and closed LLMs across cost, control, data privacy, and self-host vs API tradeoffs - and when each one wins.
The choice between an open-weight model and a closed API is not really about which one is "better" - it is about which tradeoffs you can live with. Open-weight models hand you the weights and the freedom to run them anywhere; closed models hand you frontier capability with none of the operational burden. Getting this decision right shapes your cost structure, your data-privacy posture, and how much infrastructure your team has to own.
Definitions, precisely
The terms get used loosely, so it helps to be exact:
- Open-weight models publish the trained parameters so you can download, run, fine-tune, and self-host them. "Open weight" does not always mean "open source" - the training data and code may stay private, and licenses range from fully permissive to research-only or usage-capped. Examples of open-weight families include Llama, Mistral, Qwen, Gemma, and DeepSeek.
- Closed models are accessible only through a vendor's API. You never see the weights; you send tokens in and get tokens out. Examples include the flagship offerings from OpenAI, Anthropic, and Google.
The practical line is portability: with open weights you can move providers or move on-prem; with a closed model you are tied to one vendor's endpoint and terms.
Cost structure: per-token vs per-GPU-hour
Closed models bill per token with no idle cost - ideal when traffic is low or spiky. Open-weight models can also be consumed per token from a hosting vendor, but their real advantage shows up when you self-host: you rent a GPU by the hour (or amortize one you own) and pay the same amount regardless of tokens served. That flips the economics. At low utilization, per-token APIs win easily. At high, steady utilization, a GPU kept busy with well-batched requests can push your effective cost per million tokens below API rates.
This article avoids quoting exact prices because they move weekly and vary by host; check the live numbers on the LLM inference comparison and price the underlying hardware on the GPU comparison. The break-even math is worked through in self-hosting LLMs vs using an API.
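The utilization argument can be made concrete with a small calculation. The following is a minimal sketch, not a pricing tool: the function names are hypothetical, and the numbers in the usage section are placeholders chosen for easy arithmetic, not real prices or measured throughput. It assumes a single GPU billed by the hour and a known peak throughput in tokens per second.

```python
def self_host_cost_per_million(gpu_hourly_usd: float,
                               peak_tokens_per_sec: float,
                               utilization: float) -> float:
    """Effective USD per million tokens for a GPU billed by the hour.

    utilization is the fraction of peak throughput actually served (0 < u <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def break_even_utilization(gpu_hourly_usd: float,
                           peak_tokens_per_sec: float,
                           api_usd_per_million: float) -> float:
    """Utilization above which self-hosting undercuts the API rate.

    A result greater than 1 means the GPU cannot beat the API rate
    even when fully busy.
    """
    cost_at_full_load = self_host_cost_per_million(
        gpu_hourly_usd, peak_tokens_per_sec, 1.0
    )
    return cost_at_full_load / api_usd_per_million


if __name__ == "__main__":
    # Placeholder inputs only: replace with live prices and your own benchmarks.
    gpu_hourly_usd = 2.0
    peak_tokens_per_sec = 1000.0
    api_usd_per_million = 1.0

    for u in (0.1, 0.5, 0.9):
        cost = self_host_cost_per_million(gpu_hourly_usd, peak_tokens_per_sec, u)
        print(f"utilization {u:.0%}: {cost:.2f} USD per million tokens")

    u_star = break_even_utilization(
        gpu_hourly_usd, peak_tokens_per_sec, api_usd_per_million
    )
    print(f"break-even utilization: {u_star:.0%}")
```

The sketch ignores engineering time, redundancy, and storage, all of which push the real break-even point higher.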
Control: what you actually get
| Dimension | Open-weight (self-hosted) | Closed (API) |
|---|---|---|
| Model version pinning | Full - you choose when to upgrade | Vendor may deprecate or silently update |
| Fine-tuning | Full - LoRA or full fine-tune on your data | Limited to vendor's tuning offering, if any |
| Latency tuning | You control hardware, batching, quantization | Fixed by the provider |
| Quantization choice | Yours (FP16, FP8, INT8, INT4) | Hidden |
| Rate limits | Only your hardware | Vendor quotas and throttling |
| Frontier quality | Strong, often a step behind the best closed models | Typically the capability frontier |
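Open weights mean you decide the precision/quality/cost tradeoff. Each halving of weight precision (FP16 to INT8, or INT8 to INT4) roughly halves the memory the weights occupy and often raises throughput, at a quality cost that is usually modest at 8 bits and grows at lower bit widths - a knob you simply do not have with a closed endpoint.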
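The memory side of that tradeoff is simple arithmetic: weight memory is roughly parameter count times bytes per parameter. Here is a minimal sketch, assuming a hypothetical 70-billion-parameter model and counting weights only (KV cache, activations, and per-block quantization scales are ignored, so real footprints are higher). The helper name and table are illustrative, not taken from any serving library.

```python
# Bytes used to store one parameter at each precision listed in the table above.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    num_params = 70e9  # hypothetical model size, for illustration only
    for precision in ("FP16", "FP8", "INT8", "INT4"):
        gb = weight_memory_gb(num_params, precision)
        print(f"{precision}: about {gb:.0f} GB of weights")
```

Comparing the result against the memory of the accelerator you plan to use gives a quick first check on how many GPUs a given precision needs before any benchmarking.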
Data privacy and compliance
For many regulated teams this is the deciding factor.
- Self-hosted open-weight can run entirely inside your VPC or on-prem, so prompts and completions never leave your perimeter. That makes data residency, air-gapped deployments, and strict compliance regimes tractable.
- Closed APIs send your data to a third party. Reputable vendors offer no-training guarantees, zero-retention modes, and regional endpoints, and for most companies that is sufficient. But you are trusting a contract rather than a network boundary.
If your data cannot leave a controlled environment, open-weight self-hosting on your own GPUs or bare metal is often the only option that clears legal review.
Self-host vs API: the operational reality
Self-hosting is not free even when the GPU math looks good. You take on:
- Serving infrastructure (vLLM, TGI, or similar), autoscaling, and load balancing.
- Capacity planning and GPU procurement, where supply can be tight for top accelerators like the H100 or H200.
- On-call ownership of latency, OOMs, and model updates.
- Utilization risk - an idle GPU still bills.
A middle path is serverless GPU, which runs open-weight models on demand and scales to zero, trading some cold-start latency for far less operational work. Closed APIs remove all of this and let a small team ship in an afternoon.
When each one wins
Choose a closed API when:
- You need the absolute best reasoning or multimodal quality available today.
- Volume is low, spiky, or unpredictable - you want zero idle cost.
- Your team is small and should be building product, not running GPU fleets.
- Time-to-market matters more than per-token economics.
Choose open-weight (self-hosted) when:
- Data cannot leave your environment for legal or contractual reasons.
- Volume is high and steady enough to keep GPUs well utilized.
- You need to pin a version, fine-tune deeply, or control latency and cost knobs.
- You want to avoid vendor lock-in and keep the option to move hosts.
Many mature stacks do both: open-weight on self-hosted or serverless GPUs for the high-volume, privacy-sensitive bulk, and a closed API for the hardest reasoning tasks at the quality frontier.
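A hybrid stack like this needs a routing rule in front of the two backends. The following is a minimal sketch under stated assumptions: the `Request` fields and the backend labels are hypothetical, and in practice the two flags would come from your own data classification and task tagging rather than being set by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt: str
    contains_sensitive_data: bool   # assumed to be set by an upstream classifier
    needs_frontier_reasoning: bool  # assumed to be set by task type or caller


def route(req: Request) -> str:
    """Pick a backend label for one request."""
    # Privacy is a hard constraint, so it is checked before quality.
    if req.contains_sensitive_data:
        return "self_hosted_open_weight"
    # Only non-sensitive, genuinely hard tasks go to the closed API.
    if req.needs_frontier_reasoning:
        return "closed_api"
    # Everything else is bulk traffic that keeps the GPUs busy.
    return "self_hosted_open_weight"


if __name__ == "__main__":
    examples = [
        Request("Summarize this patient record", True, False),
        Request("Plan a multi-step proof", False, True),
        Request("Classify this support ticket", False, False),
    ]
    for r in examples:
        print(f"{r.prompt!r} -> {route(r)}")
```

Checking the privacy flag first means a request that is both sensitive and hard stays in your perimeter, which matches the article's ordering of constraints before quality.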
How to decide
- Start with constraints: if data residency forbids third parties, open-weight self-hosting is decided for you.
- Estimate steady-state token volume and expected GPU utilization.
- Compare per-token API cost from the LLM inference comparison against amortized GPU-hour cost from the GPU comparison, converted to cost per million tokens at your expected utilization.
- Factor in ops capacity honestly - a fleet you can't keep healthy erases the savings.
- Prototype on a closed API for speed, then migrate the high-volume path to open weights if the numbers and constraints justify it.
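The checklist above can be read as an ordered decision procedure. Here is a minimal sketch of that ordering, assuming you have already estimated both per-million-token costs yourself; the `Workload` fields and the returned labels are hypothetical, and a real decision would weigh more factors than these four.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    data_must_stay_in_perimeter: bool
    api_usd_per_million: float        # your estimate from current API pricing
    self_host_usd_per_million: float  # your estimate at expected utilization
    team_can_run_gpus: bool           # honest assessment of ops capacity


def recommend(w: Workload) -> str:
    """Apply the checklist in order: constraints, ops capacity, then cost."""
    # Step 1: a hard residency constraint decides the question outright.
    if w.data_must_stay_in_perimeter:
        return "open-weight, self-hosted"
    # Step 2: without ops capacity, projected savings are unlikely to hold.
    if not w.team_can_run_gpus:
        return "closed API"
    # Step 3: otherwise let the estimated unit economics decide.
    if w.self_host_usd_per_million < w.api_usd_per_million:
        return "open-weight, self-hosted"
    return "closed API"


if __name__ == "__main__":
    # Placeholder cost figures, for illustration only.
    print(recommend(Workload(False, 1.0, 0.6, True)))
    print(recommend(Workload(False, 1.0, 0.6, False)))
    print(recommend(Workload(True, 1.0, 5.0, False)))
```

The last case shows why constraints come first: a residency requirement selects self-hosting even when the cost and staffing inputs point the other way.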
Takeaway
Open-weight models accept operational burden in exchange for control, portability, and on-prem privacy; closed APIs give up control in exchange for frontier quality and near-zero ops. Cost follows utilization - APIs win when GPUs would sit idle, open-weight self-hosting wins when they stay busy. Decide on your hard constraints first, validate with real numbers on the LLM and GPU comparisons, and don't be afraid to run both.