Using Spot Instances for Training: Checkpointing Against Preemption
A practical engineering guide to running GPU training on spot and preemptible instances safely, centered on checkpointing strategy and graceful preemption handling.
Spot and preemptible GPU instances are among the cheapest compute you can rent, often priced well below on-demand for the exact same hardware. The catch is in the name: the provider can reclaim the instance with little warning when it needs the capacity back. For interactive work that is a dealbreaker, but training can be made restartable, which makes spot capacity an unusually good deal. The whole game is making your training job survive interruption without losing meaningful progress, and that comes down to checkpointing strategy.
Why Training Is a Natural Fit for Spot
Training is a long-running process, but all of its progress lives in a small, serializable bundle of state: model weights, optimizer state, and a few bookkeeping values such as the current step and random number generator state. If you can persist that bundle periodically and reload it on a fresh instance, an interruption costs you only the work done since the last save. That property is what makes training fundamentally different from a stateful database or a live web server, and it is why interruptible capacity suits training so well.
Checkpointing: The Core Discipline
A checkpoint is a saved snapshot of everything needed to resume training exactly where it left off. Done well, checkpointing turns preemption from a catastrophe into a minor hiccup.
What to Save
- Model weights. The obvious one, but not sufficient on its own.
- Optimizer state. Momentum and adaptive learning rate buffers matter. Resuming without them can disturb convergence.
- Training position. The current epoch, step, and data loader position so you do not repeat or skip samples.
- Random number generator state. For reproducibility and to keep augmentation and shuffling consistent across resumes.
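The four items above can be sketched in a few lines. This is a minimal, framework-free illustration, not a production implementation: `save_checkpoint`, `load_checkpoint`, and the dictionary keys are hypothetical names, the weights and optimizer state are plain Python objects, and only the standard library's `random` generator is captured. In a real job you would store your framework's own state objects (for example PyTorch's `state_dict()` for the model and optimizer) and the RNG state of every library that draws random numbers.

```python
import os
import pickle
import random
import tempfile


def save_checkpoint(path, weights, optimizer_state, epoch, step):
    """Write all four pieces of training state to a single file."""
    state = {
        "weights": weights,
        "optimizer_state": optimizer_state,
        "position": {"epoch": epoch, "step": step},
        "rng_state": random.getstate(),
    }
    # Write to a temporary file and rename it into place, so an interruption
    # mid-write never leaves a truncated file under the final name.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Load a checkpoint and restore the RNG so shuffling continues in sequence."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    random.setstate(state["rng_state"])
    return state
```

Restoring the RNG state means the random draws after a resume match the ones an uninterrupted run would have produced, which is what keeps shuffling and augmentation consistent.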
How Often to Checkpoint
Checkpoint frequency is a tradeoff. Save too rarely and a preemption near the end of an interval wastes a lot of compute. Save too often and the input/output overhead eats into throughput. A useful way to reason about it:
| Checkpoint interval | Work at risk | Overhead |
|---|---|---|
| Very frequent (minutes) | Minimal | High write cost |
| Moderate (tens of minutes) | Acceptable | Low to moderate |
| Rare (hours) | Large | Negligible |
A common starting point is to size the interval so the worst-case lost work stays within a tolerable fraction of total runtime, then tune based on how often preemptions actually occur in your chosen region and instance type.
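One way to put numbers on this tradeoff is a first-order model that is not taken from this guide: Young's approximation, which balances save overhead against expected rework. The sketch below assumes preemptions arrive at random with a known mean time between them, that a preemption costs on average half an interval of redone work, and that the interval is much shorter than the time between preemptions. The function and parameter names are illustrative.

```python
import math


def wasted_fraction(interval_s, save_s, mean_time_between_preemptions_s):
    """Approximate fraction of runtime lost to saving plus redone work."""
    # Each interval pays for one checkpoint write.
    save_overhead = save_s / interval_s
    # On average a preemption lands mid-interval, losing half an interval.
    expected_rework = (interval_s / 2) / mean_time_between_preemptions_s
    return save_overhead + expected_rework


def suggested_interval(save_s, mean_time_between_preemptions_s):
    """Young's approximation: the interval that minimizes wasted_fraction."""
    return math.sqrt(2 * save_s * mean_time_between_preemptions_s)


if __name__ == "__main__":
    # Illustrative inputs only: a 30-second save, one preemption every 6 hours.
    interval = suggested_interval(30, 6 * 3600)
    print(f"interval: {interval / 60:.1f} min")
    print(f"wasted:   {wasted_fraction(interval, 30, 6 * 3600):.1%}")
```

Treat the result as a starting value and re-estimate it once you have measured save times and observed preemption rates for your own region and instance type.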
Handling the Preemption Signal
Most providers send a short termination notice before reclaiming a spot instance. That window is precious and brief, typically somewhere between about 30 seconds and two minutes depending on the provider, so check the documented figure for yours. Wire up a handler that listens for the signal and triggers an immediate emergency checkpoint, so you capture the most recent state before the instance disappears.
- Detect the termination notice via the provider metadata endpoint or signal.
- Pause new training steps gracefully.
- Write a final checkpoint to durable, external storage.
- Flush logs and metrics so the run record stays complete.
Treat the emergency checkpoint as separate from your regular interval saves. Its only job is to capture the latest state fast, so keep it lean and make sure it writes to durable external storage rather than the doomed local disk. A handler that tries to do too much in that short window risks running out of time before the instance is reclaimed.
Designing for Automatic Resume
Surviving preemption is only half the job. The other half is coming back automatically. Store checkpoints on durable object storage that outlives any single instance, never on the instance local disk alone. On startup, every job should first check for an existing checkpoint and resume from it if present, rather than starting fresh. Pair this with an orchestrator that relaunches the job when capacity returns, and the whole loop becomes hands-off.
- Externalize state. Object storage for checkpoints, decoupled from compute lifecycle.
- Make resume the default. The job should always look for a checkpoint first.
- Diversify capacity. Spreading requests across instance types and regions reduces the chance of being preempted everywhere at once.
- Consider a hybrid floor. Keep a small on-demand baseline for deadline-critical runs and burst onto spot for the rest.
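Making resume the default can be as simple as a lookup at startup. The sketch below assumes checkpoints are written under a hypothetical `step_<N>.ckpt` naming scheme into a directory that stands in for a mounted object-storage bucket. With a real object store you would list keys through its client library instead of `os.listdir`.

```python
import os
import re

# Hypothetical naming scheme: step_100.ckpt, step_200.ckpt, ...
CHECKPOINT_PATTERN = re.compile(r"^step_(\d+)\.ckpt$")


def find_latest_checkpoint(checkpoint_dir):
    """Return (step, path) for the highest-numbered checkpoint, or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    best = None
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match is None:
            continue  # Skips temp files and anything not fully renamed.
        step = int(match.group(1))  # Compare numerically, not as strings.
        if best is None or step > best[0]:
            best = (step, os.path.join(checkpoint_dir, name))
    return best


def starting_step(checkpoint_dir):
    """Step to start from: 0 on a fresh run, otherwise the latest saved step."""
    latest = find_latest_checkpoint(checkpoint_dir)
    return 0 if latest is None else latest[0]
```

Because the job always calls the lookup first, the same launch command works for a fresh start and for a relaunch after preemption, which is what lets an orchestrator restart it without special handling.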
Multi-Node Training on Spot
Single-node spot training is straightforward, but distributed training across many GPUs adds wrinkles worth planning for. In a tightly synchronized job, losing one node can stall the entire cluster, because the surviving workers wait on the missing one. Two approaches soften this. Elastic training frameworks let the job continue with fewer workers when a node disappears and re-expand when capacity returns, rather than failing outright. Alternatively, frequent synchronized checkpoints let the whole cluster restart cleanly from a consistent state. Either way, the principle is the same: assume any node can vanish at any moment and make that assumption cheap to absorb.
- Synchronize checkpoints across ranks so every worker resumes from the same step, avoiding inconsistent state.
- Prefer elastic frameworks that tolerate changing worker counts mid-run for large clusters.
- Spread the cluster across capacity pools so a single pool reclaiming instances does not take down every node simultaneously.
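To make the "same step on every rank" rule concrete, here is a small sketch of the selection logic at restart. It assumes each rank writes its own checkpoint shard and that you can list which steps each rank has finished saving. The `steps_by_rank` mapping and the function name are hypothetical, and frameworks that write a single consolidated checkpoint would not need this.

```python
def latest_common_step(steps_by_rank):
    """Newest step for which every rank has a saved shard, or None.

    steps_by_rank maps a rank id to the steps that rank has fully saved.
    """
    if not steps_by_rank:
        return None
    # A step is only usable if every rank finished writing it.
    common = set.intersection(*(set(steps) for steps in steps_by_rank.values()))
    return max(common) if common else None
```

If a preemption interrupts a save partway through, some ranks may hold a newer shard than others. Resuming from the newest step common to all ranks discards the partial save instead of mixing states.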
Estimating the Real Savings
Before committing to a spot strategy, it helps to reason about the net benefit rather than the headline discount alone. The raw per-hour saving on spot can be large, but interruptions add some wasted compute and engineering overhead. The honest calculation weighs the discount against the expected fraction of work lost to preemption plus the cost of building resilience.
| Factor | Effect on savings |
|---|---|
| Per-hour discount versus on-demand | Primary gain |
| Frequency of preemption | Reduces gain via lost work |
| Checkpoint interval | Controls worst-case lost work |
| Engineering overhead | One-time cost, amortized |
In practice, for restartable training with solid checkpointing, the net saving usually stays close to the headline discount, because lost work is a small fraction of runtime and the resilience tooling is built once and reused across every job.
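The weighing described in this section can be written as a short calculation. This is a simplified sketch under stated assumptions: `wasted_fraction` is the share of billed spot hours that produce no progress (checkpoint writes plus redone work), the discount is constant over the run, and the one-time engineering cost is left out. All names and the example figures are illustrative, not measured prices.

```python
def effective_cost(on_demand_rate, discount, useful_hours, wasted_fraction):
    """Spot cost of a job that needs `useful_hours` of productive compute."""
    spot_rate = on_demand_rate * (1 - discount)
    # Wasted time inflates the hours you are billed for.
    billed_hours = useful_hours / (1 - wasted_fraction)
    return spot_rate * billed_hours


def net_saving_fraction(discount, wasted_fraction):
    """Saving versus on-demand after accounting for wasted compute."""
    return 1 - (1 - discount) / (1 - wasted_fraction)


if __name__ == "__main__":
    # Illustrative figures only: a 60% discount with 5% of billed time wasted.
    print(f"net saving: {net_saving_fraction(0.60, 0.05):.1%}")
```

The formula shows why the checkpoint interval matters less than it might seem once waste is small: at a few percent of wasted time, the net saving sits only slightly below the headline discount.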
When Spot Is Not Worth It
Spot is not the right fit for every workload. Very short jobs may not justify the engineering overhead. Tightly synchronized multi-node training is more fragile on spot unless you use elastic training or frequent synchronized checkpoints. And if a hard deadline leaves no room for restarts, the predictability of on-demand or reserved capacity may be worth the premium. Knowing when to opt out is part of using spot well.
Used deliberately, spot training delivers large savings for the price of solid checkpointing hygiene, something you should have anyway for reproducibility and fault tolerance. Build the checkpoint and resume loop once, prove it survives a forced interruption in testing, and you get access to some of the cheapest GPU compute available with very little ongoing effort.