Set Up a Fault-Tolerant Spot Training Job From Scratch
A step-by-step tutorial for running model training on interruptible spot GPUs, with checkpointing, resume logic, and interruption handling that keeps runs safe.
Spot capacity can cut the cost of a training run dramatically, but it comes with a catch: the provider can reclaim your GPU with little warning. A training job that crashes and restarts from zero every time that happens is worse than useless on spot. The solution is to build the job so an interruption is a brief pause rather than a lost run. This tutorial walks through making a training job fault-tolerant from scratch, covering checkpointing, automatic resume, interruption handling, and a fallback for the jobs that absolutely must finish on time.
Why Spot Needs Fault Tolerance
Spot, preemptible, and interruptible instances are sold from a provider's spare capacity at a discount. The trade is that the provider can take the capacity back when it needs it, sometimes after only a short notice. For a stateless web request that is harmless, but a training run holds hours of accumulated work in volatile GPU and host memory. Without protection, a reclaim erases all of it.
Fault tolerance turns that risk into a manageable cost. If the job saves its progress regularly and can resume from the latest save on a new node, the worst case is losing the minutes of work since the last checkpoint. That small, bounded loss is what makes spot pricing worth chasing for training.
Step One: Make Training Checkpointable
The foundation is the ability to save and restore complete training state. A checkpoint must capture everything needed to continue exactly where you left off, not just the model weights.
- Model weights: the parameters being learned.
- Optimizer state: momentum and other running statistics, without which resumed training behaves differently.
- Training position: the current epoch and step, so you do not repeat or skip data.
- Scheduler and RNG state: learning-rate schedule position and random seeds, for reproducible continuation.
Save these together as a single checkpoint to durable storage that survives the instance, such as an object store or a persistent volume. Local disk on the instance is not enough, because it disappears with the node.
Step Two: Checkpoint at the Right Cadence
Checkpoint frequency is a trade-off. Save too rarely and an interruption costs a lot of recomputation. Save too often and the overhead of writing checkpoints slows the run and adds storage churn.
| Checkpoint interval | Recompute on interruption | Overhead |
|---|---|---|
| Very frequent | Minimal | Higher |
| Moderate | Small | Balanced |
| Infrequent | Large | Lower |
A practical approach is to checkpoint on a time basis, for example every several minutes of training, so the maximum work at risk is bounded regardless of how fast steps run. Keep a small number of recent checkpoints rather than every one, to control storage cost.
Step Three: Resume Automatically
Saving checkpoints is only half the job. The training script must, on startup, check for an existing checkpoint and resume from it rather than starting fresh. This makes restart the default behavior, which is exactly what you want when a job lands on a new node after a reclaim.
- On launch, look for the latest valid checkpoint in durable storage.
- If one exists, load weights, optimizer state, and training position from it.
- If none exists, start a fresh run.
- Continue training from the restored position to completion.
With this logic, you can kill and relaunch the job at any time and it simply picks up where it left off. That property is what lets it survive an interruption without human intervention.
Step Four: Handle the Interruption Signal
Many providers send a warning shortly before reclaiming a spot instance. Catching that signal lets you write one final checkpoint at the moment of interruption, minimizing lost work beyond the last scheduled save.
Set up a handler that listens for the termination notice, triggers an immediate checkpoint, and exits cleanly. Even a short notice window is usually enough to flush state to durable storage. This single step often turns the typical loss from minutes down to seconds, because the last save happens right before the node goes away rather than at the previous scheduled interval.
Step Five: Add a Fallback for Deadlines
Spot is cheap but not guaranteed. If a run keeps getting interrupted, it may miss a deadline. A fallback protects the jobs that must finish on time.
- Track how many times the job has been interrupted.
- If interruptions exceed a threshold, or a deadline approaches, relaunch on on-demand capacity instead.
- Because the job resumes from its checkpoint regardless of capacity type, the switch costs nothing but the higher rate for the remainder.
This gives you the best of both: spot pricing for the common case, and a guaranteed path to completion when conditions turn against you.
Step Six: Orchestrate the Loop
Tie it together with an orchestrator that relaunches the job whenever it exits before completion. Whether the exit was a spot reclaim, a transient error, or an intentional stop, the orchestrator brings the job back up, it finds its checkpoint, and it continues. The training script and the orchestrator together form a self-healing system that marches toward completion across as many interruptions as the run encounters.
Verifying It Works
Before trusting a long run to this setup, test the failure path deliberately. Start a short training job, kill it partway through, and confirm it resumes from the checkpoint with consistent loss and training position rather than restarting. A resume that produces a discontinuity in the loss curve usually means a piece of state, often the optimizer or the data position, was not saved. Fix that on a small run, not a large one.
Conclusion
A fault-tolerant spot training job is the combination of complete checkpoints, a sensible save cadence, automatic resume on startup, interruption-signal handling, and a deadline fallback, all wrapped in an orchestrator that relaunches on exit. Build those pieces once and you can run training on interruptible capacity with confidence, capturing the discount while bounding the risk to a few minutes of recomputable work. Test the resume path before you rely on it, and your cheap training runs will finish reliably no matter how often the GPU is reclaimed.