Blending Reserved and Spot Capacity for Maximum GPU Savings
A strategy guide for combining reserved, on-demand, and spot GPU capacity into a layered model that matches steady baseload and bursty demand for maximum savings.
The cheapest GPU strategy is almost never all of one pricing model. Commit everything to reserved capacity and you pay for idle GPUs whenever demand dips below what you committed to. Run everything on spot and you save aggressively but expose critical work to interruption and capacity gaps. The better answer is a deliberate blend that layers reserved, on-demand, and spot capacity against the actual shape of your demand. This is portfolio thinking applied to compute, and for demand that mixes a steady base with bursts it usually beats any single-model approach.
The Three Capacity Types
Each pricing model trades cost against commitment and reliability differently. Understanding the tradeoffs is the prerequisite for blending them well.
| Capacity type | Cost | Reliability | Best for |
|---|---|---|---|
| Reserved or committed | Low rate, paid whether used or not | Guaranteed | Predictable baseload |
| On-demand | Highest per hour, no commitment | Guaranteed, flexible | Unpredictable bursts, deadlines |
| Spot or preemptible | Lowest per hour, can fluctuate | Interruptible | Fault-tolerant, restartable work |
Map Your Demand Curve First
Blending well starts with knowing your demand profile. Plot GPU usage over a representative period and you will usually see a stable floor that is almost always in use, a variable middle that flexes with activity, and occasional spikes. Those three bands map cleanly onto the three capacity types.
- The floor is your baseload, the capacity you use nearly all the time. This is what reserved or committed-use discounts are made for.
- The middle is flexible, fault-tolerant work that can ride interruptible spot capacity for deep savings.
- The spikes are unpredictable or deadline-bound bursts where on-demand reliability is worth the premium.
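One way to turn a usage plot into the three bands is to read them off percentiles of the usage series. The sketch below is a minimal illustration, not a prescribed method: the function name `demand_bands` and the 5th and 95th percentile cutoffs are assumptions chosen for the example, and the right cutoffs depend on your own data.

```python
from statistics import quantiles


def demand_bands(gpu_usage, floor_pct=5, spike_pct=95):
    """Split a GPU usage series into floor, middle, and spike bands.

    gpu_usage: list of GPUs in use per sample (for example, one per hour).
    floor_pct and spike_pct are illustrative cutoffs, not recommendations.
    """
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = quantiles(gpu_usage, n=100, method="inclusive")
    floor = cuts[floor_pct - 1]
    spike_start = cuts[spike_pct - 1]
    return {
        "floor": floor,                         # candidate reserved layer
        "middle": spike_start - floor,          # candidate spot layer
        "spike": max(gpu_usage) - spike_start,  # candidate on-demand layer
    }
```

Feeding it a few weeks of hourly samples gives a first estimate of how many GPUs sit in each band. A low floor percentile keeps the reserved candidate conservative, because usage sits at or above that level in nearly every sample.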
Building the Layered Model
With the demand curve mapped, the blend follows naturally. The art is in sizing each layer conservatively so you capture savings without overcommitting. The instinct to maximize the reserved layer for its low rate is exactly the instinct to resist, because every reserved GPU you do not use is a sunk cost that quietly erodes the discount you committed to capture.
- Size the reserved layer to your reliable floor or slightly below it, not your average and certainly not your peak. Commit only to capacity you are confident you will use almost continuously.
- Push fault-tolerant work onto spot. Training with checkpointing, batch inference, and preprocessing belong here. The per-hour discount on this layer is typically the deepest.
- Keep on-demand for the unpredictable and the urgent. Use it as the flexible buffer that absorbs spikes and guarantees deadline-critical runs that cannot tolerate preemption.
- Let spot and on-demand trade off dynamically. When spot is plentiful, lean on it. When spot capacity is scarce or a deadline looms, fall back to on-demand.
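The layering rules above can be sketched as a small allocation function for a single scheduling interval. This is a simplified model under stated assumptions: the name `allocate`, the split of demand into "critical" and "fault-tolerant" GPUs, and the idea that spot availability is known as a single number are all hypothetical simplifications.

```python
def allocate(critical, fault_tolerant, reserved, spot_available):
    """Split one interval's GPU demand across the three layers.

    All arguments are GPU counts for a single scheduling interval.
    """
    # Reserved capacity is already paid for, so fill it first.
    critical_on_reserved = min(critical, reserved)
    reserved_left = reserved - critical_on_reserved
    tolerant_on_reserved = min(fault_tolerant, reserved_left)

    # Remaining fault-tolerant work rides spot while spot is available.
    tolerant_left = fault_tolerant - tolerant_on_reserved
    on_spot = min(tolerant_left, spot_available)

    # Whatever is left falls back to on-demand.
    on_demand = (critical - critical_on_reserved) + (tolerant_left - on_spot)
    used_reserved = critical_on_reserved + tolerant_on_reserved
    return {
        "reserved": used_reserved,
        "spot": on_spot,
        "on_demand": on_demand,
        "idle_reserved": reserved - used_reserved,
    }
```

A persistently non-zero `idle_reserved` across intervals suggests the reserved layer is oversized, while a persistently large `on_demand` figure for fault-tolerant work suggests spot is too scarce in the pools you request from.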
Sizing the Commitment Carefully
The biggest risk in this strategy is over-committing the reserved layer. Reservations lock in savings only if you actually use the capacity. A few guardrails keep the commitment safe:
- Commit below your floor, not at it. Leave a margin so a quiet week does not turn your discount into waste.
- Prefer shorter or more flexible commitments when your demand is still evolving, accepting a smaller discount for less lock-in risk.
- Ladder commitments by staggering their start and end so you can adjust the floor as workloads change rather than facing one large renewal cliff.
- Revisit the floor regularly, because baseload grows and shrinks as projects come and go.
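The reason under-used reservations hurt can be made concrete with a break-even calculation. The sketch below assumes a simple model in which a reserved GPU is billed for every hour whether or not it is used; the function names and any rates passed in are hypothetical, not real provider prices.

```python
def break_even_utilization(reserved_rate, on_demand_rate):
    """Fraction of hours a reserved GPU must be busy to match on-demand cost."""
    return reserved_rate / on_demand_rate


def effective_rate(reserved_rate, utilization):
    """Cost per *used* GPU-hour when a reservation is only partly used."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # The reservation is billed for every hour, so idle hours inflate
    # the price of the hours that actually did work.
    return reserved_rate / utilization
```

With a hypothetical reserved rate of 2.0 and an on-demand rate of 4.0 per GPU-hour, the break-even utilization is 0.5: below that, the same work would have been cheaper on-demand. This is why committing slightly below the floor is safer than committing at it.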
Making Spot Reliable Enough to Lean On
The blend only delivers if the spot layer is genuinely usable, which depends on resilience engineering. Checkpoint training jobs so an interruption costs only the work done since the last checkpoint. Diversify spot requests across instance types and regions so you are not preempted everywhere at once. Finally, design an automatic fallback to on-demand when spot capacity dries up, so a capacity gap degrades gracefully instead of stalling the pipeline. With those pieces in place, you can route a large share of work through the cheapest tier with confidence.
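The diversification and fallback ideas can be sketched as a single acquisition loop. This is a provider-neutral illustration: `try_spot` and `try_on_demand` are hypothetical callbacks standing in for whatever your cloud SDK or scheduler exposes, and the pool labels are placeholders rather than real instance types or regions.

```python
from typing import Callable, Optional, Sequence, Tuple

Pool = Tuple[str, str]  # (instance_type, region), placeholder labels


def acquire_with_fallback(
    pools: Sequence[Pool],
    try_spot: Callable[[Pool], bool],
    try_on_demand: Callable[[Pool], bool],
) -> Optional[Tuple[str, Pool]]:
    """Try spot in every pool first, then fall back to on-demand."""
    # Diversify: walk all instance type and region combinations for spot.
    for pool in pools:
        if try_spot(pool):
            return ("spot", pool)
    # Spot is dry everywhere, so degrade gracefully to on-demand.
    for pool in pools:
        if try_on_demand(pool):
            return ("on_demand", pool)
    # Nothing available; the caller decides whether to queue or retry.
    return None
```

Ordering `pools` by price or by historical interruption rate is a natural refinement. A job launched this way still needs checkpointing, since a spot grant can be revoked after it is given.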
A Worked Example of the Layers
To make the blend concrete, picture a team whose GPU usage rarely drops below a steady floor, flexes through a wide middle band during active project work, and occasionally spikes for big experiments. The layered model maps onto that shape directly. The reserved layer covers the floor at the lowest rate available for guaranteed capacity. The spot layer absorbs most of the flexible middle for fault-tolerant training and batch work. The on-demand layer sits ready to catch spikes and guarantee any run that cannot tolerate interruption.
| Demand band | Capacity layer | Why |
|---|---|---|
| Always-on floor | Reserved | Lowest rate for guaranteed steady use |
| Flexible middle | Spot | Deep discount for restartable work |
| Occasional spikes | On-demand | Guaranteed, no commitment, deadline-safe |
The exact proportions differ for every team, which is the whole point: you size each layer to your own demand curve rather than copying a generic ratio.
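To compare a blend against a single-model baseline, it helps to total the cost per layer. The sketch below is a back-of-the-envelope calculator, not a billing model: the function name, the 720-hour period, and any rates you pass in are assumptions for illustration only.

```python
def blended_cost(reserved_gpus, spot_gpu_hours, on_demand_gpu_hours,
                 rates, hours_in_period=720):
    """Total cost of a three-layer blend over one billing period.

    rates: dict with per-GPU-hour prices under the keys
    "reserved", "spot", and "on_demand" (hypothetical values).
    """
    # Reserved GPUs are billed for the whole period, used or not.
    reserved = reserved_gpus * hours_in_period * rates["reserved"]
    spot = spot_gpu_hours * rates["spot"]
    on_demand = on_demand_gpu_hours * rates["on_demand"]
    return {
        "reserved": reserved,
        "spot": spot,
        "on_demand": on_demand,
        "total": reserved + spot + on_demand,
    }
```

Running the same GPU-hours through the on-demand rate alone gives the baseline to compare against. The comparison is only fair if the reserved GPUs were actually busy, which is what the utilization review later in this guide checks.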
Routing Workloads to the Right Layer
The savings materialize only when individual jobs land on the right capacity type, which is a scheduling and policy decision as much as a purchasing one. A useful default is to classify each workload by its tolerance for interruption and its urgency, then route accordingly.
- Interruptible, non-urgent work goes to spot first, falling back to on-demand only when spot is scarce.
- Steady, continuous work runs on the reserved floor you already pay for.
- Urgent or non-restartable work goes to on-demand, where guaranteed availability is worth the premium.
- Latency-sensitive production inference generally avoids spot entirely, living on reserved or on-demand for reliability.
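These routing defaults can be written down as a small policy function. The sketch below is one possible encoding, with assumptions labeled: the `Job` fields, the `route` name, and the rule that steady or latency-sensitive work overflows to on-demand when no reserved capacity is free are illustrative choices, not part of any particular scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    interruptible: bool
    urgent: bool
    steady: bool = False
    latency_sensitive: bool = False


def route(job: Job, reserved_free: bool = False,
          spot_available: bool = True) -> str:
    """Pick a capacity layer for a job using the defaults listed above."""
    # Production inference and steady work stay off spot entirely.
    if job.latency_sensitive or job.steady:
        return "reserved" if reserved_free else "on_demand"
    # Urgent or non-restartable work pays the premium for reliability.
    if job.urgent or not job.interruptible:
        return "on_demand"
    # Everything else tries spot first and falls back when it is scarce.
    return "spot" if spot_available else "on_demand"
```

For instance, `route(Job(interruptible=True, urgent=False))` returns `"spot"`, and the same job returns `"on_demand"` when called with `spot_available=False`.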
Review and Rebalance
A capacity blend is not set-and-forget. Demand shifts, spot prices and availability fluctuate, and new hardware changes the price-performance picture. Schedule a recurring review that checks reserved utilization, spot interruption rates, and the share of spend in each layer, then rebalance. Catch an under-used reservation early and you can let it lapse rather than renew. Notice spot serving most of your bursts reliably and you can shrink the on-demand buffer.
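The three review metrics named above are simple ratios, and computing them the same way every cycle makes trends easier to spot. The sketch below assumes you can export the raw counts from your billing and scheduling data; the function and key names are hypothetical.

```python
def review_metrics(reserved_used_hours, reserved_paid_hours,
                   spot_interruptions, spot_launches, spend_by_layer):
    """Compute the recurring-review ratios for a capacity blend.

    spend_by_layer: dict mapping a layer name to its spend in the period.
    """
    def ratio(numerator, denominator):
        # Avoid dividing by zero when a layer was unused this period.
        return numerator / denominator if denominator else 0.0

    total_spend = sum(spend_by_layer.values())
    return {
        "reserved_utilization": ratio(reserved_used_hours, reserved_paid_hours),
        "spot_interruption_rate": ratio(spot_interruptions, spot_launches),
        "spend_share": {
            layer: ratio(amount, total_spend)
            for layer, amount in spend_by_layer.items()
        },
    }
```

What counts as a healthy value is a judgment call for each team. A falling `reserved_utilization` is the signal to let a commitment lapse, and a low `spot_interruption_rate` supports shrinking the on-demand buffer.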
Done well, a blend of reserved, spot, and on-demand capacity gives you the best of every model: the discounted rate of commitment for the work you always run, the deep discounts of interruptible capacity for the work that can tolerate it, and the guaranteed flexibility of on-demand for the moments that demand it. Match the layers to your real demand curve, size the commitment with discipline, and rebalance on a schedule, and your GPU bill will track your actual needs far more tightly than any single pricing model could.