Spot-to-On-Demand GPU Fallback

Spot GPUs are dramatically cheaper than on-demand, but the provider can reclaim them with little warning. That tradeoff scares many teams away from real savings. The fix is a fallback: run on spot when it is available, and automatically shift to on-demand when spot is interrupted or unavailable. This tutorial shows how to build that pattern so you capture most of the discount while keeping the workload reliable enough for production.

Why Spot Is Cheap and Risky

Spot capacity is the provider's spare inventory, offered at a steep discount because the provider can take it back when demand for regular capacity rises. For interruptible work the discount is excellent. For anything that must keep running, an unhandled interruption means a stalled job or a dropped request. The fallback pattern keeps the discount while removing the fragility.

Decide What Can Tolerate Spot

Not every workload suits spot, even with a fallback. Sort your work by interruption tolerance.

Great fit: batch jobs, training with checkpointing, queue-driven processing.
Workable with care: stateless inference behind a load balancer that can shift traffic.
Poor fit: long single operations with no checkpoint and tight deadlines.

The common thread is the ability to pause, move, or retry. If losing an instance means losing hours of irrecoverable work, address that with checkpointing before relying on spot.

Handle the Interruption Signal

Providers usually give a short warning before reclaiming a spot instance, typically somewhere between about 30 seconds and two minutes depending on the provider. Reacting to that signal is the heart of a graceful fallback.

Watch for the interruption notice the provider emits.
On notice, stop accepting new work on that instance.
Checkpoint progress or drain in-flight requests.
Trigger replacement capacity before the instance disappears.

The warning window is short, so the response must be automated. A human reacting to an alert will not be fast enough. Design the handler to do the right thing without intervention.
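The four steps above can be sketched as a small handler. This is a minimal illustration, not any provider's API: how the notice is detected differs by provider, so it is abstracted here as a hypothetical poll_notice callable, and stop_accepting, checkpoint_or_drain, and request_replacement are likewise placeholder hooks you would wire to your own system.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class InterruptionHandler:
    """Runs the response steps once a reclaim notice is seen.

    Every callable is a hypothetical hook, not a real provider API.
    """
    poll_notice: Callable[[], bool]          # True once a reclaim notice exists
    stop_accepting: Callable[[], None]       # e.g. fail health checks, leave the queue
    checkpoint_or_drain: Callable[[], None]  # save state or finish in-flight requests
    request_replacement: Callable[[], None]  # ask the fallback logic for capacity
    log: List[str] = field(default_factory=list)

    def run_once(self) -> bool:
        """Check for a notice; if one exists, run the response and return True."""
        if not self.poll_notice():
            return False
        self.stop_accepting()
        self.log.append("stopped accepting work")
        self.checkpoint_or_drain()
        self.log.append("checkpointed or drained")
        self.request_replacement()
        self.log.append("replacement requested")
        return True

    def watch(self, interval_s: float = 1.0, max_polls: Optional[int] = None) -> bool:
        """Poll until a notice is handled, or until max_polls is reached."""
        polls = 0
        while max_polls is None or polls < max_polls:
            if self.run_once():
                return True
            polls += 1
            time.sleep(interval_s)
        return False
```

The order here follows the list above. If requesting replacement capacity is non-blocking in your system, moving that call ahead of the checkpoint step gives the new instance more time to boot.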

Build the Fallback Logic

The fallback decides where replacement capacity comes from. The simplest reliable approach prefers spot but falls back to on-demand when spot cannot be obtained.

Condition	Action
Spot available	Launch or keep running on spot
Spot interrupted	Drain or checkpoint, request replacement
Spot unavailable	Fall back to on-demand to maintain service
Spot returns	Optionally shift back to spot to save again

Many orchestration systems support a capacity preference list, trying spot first and on-demand second. If yours does not, a small controller that watches capacity and requests instances accordingly achieves the same outcome.
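If you do need to write that small controller yourself, its core could look roughly like the sketch below. The launch callable is a hypothetical stand-in for whatever your provider or orchestrator offers; it is assumed to return an instance id on success and None when that capacity type cannot be obtained.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical launcher: takes a capacity type, returns an instance id,
# or None if that capacity type is unavailable right now.
Launcher = Callable[[str], Optional[str]]


def acquire_capacity(
    launch: Launcher,
    preference: Sequence[str] = ("spot", "on-demand"),
) -> Tuple[str, str]:
    """Try each capacity type in order; return (capacity_type, instance_id)."""
    for capacity_type in preference:
        instance_id = launch(capacity_type)
        if instance_id is not None:
            return capacity_type, instance_id
    raise RuntimeError("no capacity available from any source in the preference list")


def should_shift_back(current_type: str, spot_available: bool) -> bool:
    """The 'Spot returns' row: shifting back only applies while on on-demand."""
    return current_type == "on-demand" and spot_available
```

Shifting back is optional in the table for a reason: each move costs a drain or checkpoint cycle, so it may be worth waiting until spot has been available for a while before acting on should_shift_back.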

Make Workloads Resilient

The fallback only works if the workload itself can survive a swap. A few patterns make that possible.

Checkpointing: training saves state regularly so a new instance resumes instead of restarting.
Statelessness: inference holds no critical local state, so any instance can serve any request.
Queues: work is pulled from a durable queue, so an interrupted item simply returns for retry.
Load balancing: traffic shifts off a draining instance automatically.

Checkpointing is the single most valuable habit for spot training. Without it, an interruption is a disaster; with it, an interruption is a minor delay.
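As a rough illustration of the checkpointing habit, the sketch below saves a step counter at a fixed interval and resumes from the last saved step. It is deliberately simplified: the "training step" is a stand-in, the stop_at parameter exists only to simulate an interruption, and the checkpoint path is assumed to live on storage that outlives the instance (a network volume or synced object store), which this example does not set up.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename it into place, so an interruption
    mid-write never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, checkpoint_every: int,
          stop_at: Optional[int] = None) -> int:
    """Resume from the last checkpoint and run until done or 'interrupted'."""
    step = load_checkpoint(path)["step"]
    while step < total_steps:
        if stop_at is not None and step == stop_at:
            return step  # simulated reclaim: progress since last save is lost
        step += 1  # stand-in for one real training step
        if step % checkpoint_every == 0:
            save_checkpoint(path, {"step": step})
    save_checkpoint(path, {"step": step})
    return step
```

With a checkpoint every 10 steps, an interruption at step 37 costs 7 steps of repeated work rather than all 37. The checkpoint interval is the knob: shorter intervals lose less work per interruption but spend more time saving.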

Watch the Real Economics

Spot saves money only when interruptions are infrequent enough that restart overhead and time spent on the on-demand fallback do not erode the discount. Track how often spot is reclaimed, how many hours run at the spot rate versus the fallback rate, and the effective blended cost per useful GPU hour. If a particular GPU type or region is reclaimed constantly, a different type or region may give steadier spot capacity and better net savings.
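One way to compute the blended cost is sketched below. The function names and the idea of a wasted_hours term (billed hours that produced no progress, such as restart time and work lost since the last checkpoint) are assumptions made for illustration; plug in your own measured hours and your provider's actual rates.

```python
def blended_hourly_cost(
    spot_hours: float,
    on_demand_hours: float,
    spot_rate: float,
    on_demand_rate: float,
    wasted_hours: float = 0.0,
) -> float:
    """Total bill divided by the GPU hours that produced useful work."""
    billed = spot_hours * spot_rate + on_demand_hours * on_demand_rate
    useful_hours = spot_hours + on_demand_hours - wasted_hours
    if useful_hours <= 0:
        raise ValueError("no useful hours recorded")
    return billed / useful_hours


def savings_vs_on_demand(blended_rate: float, on_demand_rate: float) -> float:
    """Fraction saved relative to running everything on on-demand."""
    return 1.0 - blended_rate / on_demand_rate


# Illustrative numbers only, not real prices.
rate = blended_hourly_cost(spot_hours=90, on_demand_hours=10,
                           spot_rate=1.0, on_demand_rate=3.0)
print(f"blended rate: {rate:.2f}, savings: {savings_vs_on_demand(rate, 3.0):.0%}")
```

Dividing by useful hours rather than billed hours is the point of the exercise: a pool with a low sticker price but heavy restart overhead shows up with a higher blended rate than its price suggests.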

Common Pitfalls

Relying on spot for work that cannot checkpoint or retry.
Handling interruptions manually instead of automatically.
Running without an on-demand fallback, so a spot shortage becomes an outage.
Ignoring interruption frequency and the overhead it adds.

Diversify Across Capacity Pools

Spot availability is not uniform. A specific GPU type in a specific region may be reclaimed constantly while a slightly different type or a neighboring region stays stable for hours. Spreading your spot requests across several capacity pools reduces the chance that a shortage in one pool stalls the whole workload. The orchestration layer requests from a prioritized list of pools, and only falls back to on-demand when none of them can supply capacity.

Allow multiple GPU types that can run the workload, not just one.
Allow multiple regions or zones where latency permits.
Rank pools by price and observed stability, trying the best first.
Fall back to on-demand only after the spot pools are exhausted.

This diversification often does more for reliability than any single tuning change, because it removes the single point of failure of depending on one scarce pool.
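The ranking and fallback order described above could be sketched as follows. The scoring rule is just one possible heuristic, assumed here for illustration: it inflates each pool's price by the time expected to be lost to interruptions. The Pool fields, the try_spot and try_on_demand callables, and the default restart penalty are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Pool:
    gpu_type: str
    zone: str
    hourly_price: float
    interruptions_per_hour: float  # observed from your own history


def effective_price(pool: Pool, restart_penalty_hours: float = 0.25) -> float:
    """Price adjusted for the hours expected to be lost to restarts."""
    lost_fraction = pool.interruptions_per_hour * restart_penalty_hours
    return pool.hourly_price * (1.0 + lost_fraction)


def rank_pools(pools: Sequence[Pool], restart_penalty_hours: float = 0.25) -> List[Pool]:
    """Best pool first: lowest price after accounting for instability."""
    return sorted(pools, key=lambda p: effective_price(p, restart_penalty_hours))


def acquire_from_pools(
    pools: Sequence[Pool],
    try_spot: Callable[[Pool], Optional[str]],  # instance id, or None if no capacity
    try_on_demand: Callable[[], str],
) -> Tuple[str, str]:
    """Walk the ranked spot pools; use on-demand only when all are exhausted."""
    for pool in rank_pools(pools):
        instance_id = try_spot(pool)
        if instance_id is not None:
            return "spot", instance_id
    return "on-demand", try_on_demand()
```

With this scoring, a slightly pricier pool that is rarely reclaimed can outrank a cheaper pool that is reclaimed constantly, which matches the intent of ranking by both price and observed stability.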

Right-Size the On-Demand Safety Net

The on-demand fallback exists to protect availability, but it should not carry more load than it has to. If a large share of traffic ends up on the fallback, the savings erode and you pay premium rates while believing you are saving. Monitor the split between spot and on-demand continuously. A healthy system runs most of the time on spot with brief, occasional excursions to on-demand. If the fallback is constantly active, that is a signal to diversify pools further, choose a steadier GPU type, or accept that this particular workload may simply not be a good spot candidate.
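A simple way to watch that split is a rolling window over periodic instance counts, as in the sketch below. The class name, the window size, and the 20 percent threshold are arbitrary assumptions for illustration; choose values that fit how often you sample and how much fallback usage your budget tolerates.

```python
from collections import deque


class FallbackShareMonitor:
    """Tracks the on-demand share of capacity over the most recent samples."""

    def __init__(self, window: int = 60, max_on_demand_share: float = 0.2):
        # Each sample is (spot_instances, on_demand_instances) at one point in time.
        self.samples = deque(maxlen=window)
        self.max_on_demand_share = max_on_demand_share

    def record(self, spot_instances: int, on_demand_instances: int) -> None:
        self.samples.append((spot_instances, on_demand_instances))

    def on_demand_share(self) -> float:
        spot = sum(s for s, _ in self.samples)
        on_demand = sum(o for _, o in self.samples)
        total = spot + on_demand
        return on_demand / total if total else 0.0

    def fallback_overused(self) -> bool:
        """True when on-demand carries more than the allowed share of the window."""
        return self.on_demand_share() > self.max_on_demand_share
```

A brief excursion to on-demand barely moves the windowed share, while a fallback that stays active pushes it over the threshold, which is the distinction the paragraph above draws.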

Decide When Spot Is Worth It

Spot is not free reliability engineering. Building checkpointing, interruption handling, pool diversification, and fallback logic takes effort, and that effort only pays off when the workload runs enough GPU hours for the discount to matter. For a small, short-lived job, the simplicity of on-demand may be worth more than the savings. For a large, long-running, interruptible workload, the spot discount compounds into real money and easily justifies the engineering. Weigh the size and duration of the workload against the cost of building the resilience before committing to the pattern.
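That weighing can be reduced to a rough break-even estimate: how many GPU hours the workload must run before the per-hour savings repay the engineering effort. The function below is a back-of-the-envelope sketch; the engineering cost and both rates in the example are made-up inputs, and the blended rate should come from your own measurements.

```python
def break_even_gpu_hours(
    engineering_cost: float,
    on_demand_rate: float,
    blended_rate: float,
) -> float:
    """GPU hours needed for the savings to cover the one-time build cost."""
    saving_per_hour = on_demand_rate - blended_rate
    if saving_per_hour <= 0:
        return float("inf")  # the pattern never pays for itself
    return engineering_cost / saving_per_hour


# Illustrative numbers only: a 12,000 build cost, saving 1.50 per GPU hour.
hours = break_even_gpu_hours(engineering_cost=12_000, on_demand_rate=3.0, blended_rate=1.5)
print(f"break-even after {hours:.0f} GPU hours")
```

A small job that will never reach the break-even figure is the case where plain on-demand is the better choice. The estimate ignores ongoing maintenance of the fallback machinery, so treat the result as a lower bound.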

A spot-to-on-demand fallback lets you run cheap GPUs without betting reliability on borrowed capacity. Pick workloads that tolerate interruption, automate the response to reclaim signals, prefer spot but fall back to on-demand, and make the workload resilient with checkpointing, queues, or statelessness. Measure the blended cost so you know the savings are real. With the pattern in place, spot becomes a dependable way to cut GPU spend rather than a gamble.
