Spot Instance Pricing Guide: How Much You Save and What You Risk
A guide to spot and interruptible GPU pricing: how the discount works, the interruption risk, which workloads fit, and how to use spot capacity safely.
Spot instances are the cloud's clearance aisle: spare GPU capacity offered at a steep discount, with the catch that the provider can reclaim it when it needs the hardware back. For the right workload, spot pricing is one of the biggest levers you have to cut GPU costs. For the wrong workload, an interruption at the wrong moment can cost more than you saved. This guide explains how spot pricing works, how much you can realistically save, what you risk, and how to use interruptible capacity safely.
What spot instances are and why they are cheap
Providers build out capacity to meet peak demand, which means there is usually idle hardware at any given moment. Rather than let it sit unused, they offer it at a discount as spot or preemptible capacity. The deal is simple: you get a much lower rate, and in exchange the provider can take the instance back, often with little notice, when demand for on-demand capacity rises or the hardware is needed elsewhere. The discount is the compensation for that uncertainty.
Because the price reflects spare supply, spot rates can vary, and availability shifts with demand. The savings are real and frequently substantial compared with on-demand, but they come bundled with the possibility of interruption that you must design around.
How much you actually save
The discount versus on-demand can be large, often a major fraction of the standard rate, though the exact figure depends on the provider, the GPU type, the region, and current demand. Rather than fixate on a specific percentage, think in terms of the tradeoff: you are trading reliability for a lower rate. The more your workload can tolerate interruption, the more of that discount you can safely capture.
| Workload trait | Spot suitability |
|---|---|
| Can checkpoint and resume | Excellent |
| Stateless and retryable | Excellent |
| Short, independent tasks | Very good |
| Long run with no checkpoints | Poor |
| Latency-sensitive user serving | Poor |
What you risk
The central risk is interruption. When the provider reclaims your instance, your job stops, sometimes with only a short warning and sometimes with almost none. The consequences depend entirely on whether your workload can absorb that.
- Lost progress: a long training run without checkpoints can lose hours of work if reclaimed near the end.
- Availability gaps: spot capacity for a given GPU can dry up, leaving you unable to get instances when you want them.
- Operational complexity: handling interruptions gracefully requires extra engineering that on-demand does not.
- Price variability: rates can move, so the savings are not perfectly predictable.
The risk is not that interruptions happen, because they will. The risk is being unprepared when they do.
Workloads that fit spot well
Spot shines for anything that is fault-tolerant by nature or can be made so cheaply.
- Checkpointed training: save progress regularly and resume after an interruption, losing only the work since the last checkpoint.
- Batch processing: large independent jobs that can retry failed pieces without harm.
- Hyperparameter sweeps: many short experiments where losing one is cheap to rerun.
- Stateless inference at scale: request handling behind a load balancer that can shift to other capacity when an instance disappears.
Workloads to keep on demand
Some jobs simply do not belong on spot. Latency-sensitive serving where a user is waiting, single long runs that cannot checkpoint, and anything where an interruption causes outsized damage should stay on reserved or on-demand capacity. The savings are not worth a failed launch or a corrupted long job. A common pattern is to keep a reliable on-demand or reserved baseline and add spot capacity on top for the elastic, interruptible portion of the work.
How to use spot safely
Capturing the discount without the pain comes down to designing for interruption from the start.
- Checkpoint frequently. Save state often enough that any interruption costs minutes, not hours. This single practice unlocks most of spot's value for training.
- Make jobs idempotent and retryable. Design tasks so that rerunning a piece after interruption produces the same result without side effects.
- Handle the reclaim signal. Where the provider gives a warning before reclaiming, use it to checkpoint and shut down cleanly.
- Spread across capacity. Diversify across instance types or regions so a shortage in one pool does not stop everything.
- Automate fallback. If spot is unavailable, fall back to on-demand for critical work rather than waiting indefinitely.
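The first three practices above can be sketched together in a few lines of Python. This is a minimal illustration, not a provider-specific implementation: it assumes the reclaim warning reaches the process as `SIGTERM`, which is common but not universal (some providers expose the notice through a metadata endpoint instead). The names `ReclaimFlag`, `save_checkpoint`, `load_checkpoint`, and `run` are hypothetical, and a real job would write checkpoints to durable storage rather than a local file.

```python
import json
import os
import signal
import tempfile


class ReclaimFlag:
    """Remembers that a termination signal arrived so the loop can stop cleanly."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        self.requested = True


def install_reclaim_handler():
    """Assumption: the platform sends SIGTERM before reclaiming the instance."""
    flag = ReclaimFlag()
    signal.signal(signal.SIGTERM, flag.handler)
    return flag


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a reclaim mid-write cannot corrupt the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Resume from the last checkpoint, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, total_steps, checkpoint_every, flag, step_fn):
    """Run until done or until a reclaim is requested.

    Returns True if the job finished, False if it stopped early after checkpointing.
    """
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if flag.requested:
            save_checkpoint(path, state)  # use the warning window to save progress
            return False
        step_fn(state)  # one unit of real work, supplied by the caller
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return True
```

Rerunning `run` with the same checkpoint path picks up from the saved step, so an interruption costs at most the work done since the last save. Because each step is applied exactly once to the saved state, the result after a resume matches an uninterrupted run, which is the idempotence property described above.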
Combining spot with other pricing
The smartest cost strategies rarely use one pricing model alone. Reserved or committed capacity covers your steady, predictable baseline at a discount. On-demand handles the work that needs reliability and flexibility. Spot soaks up the rest, the interruptible and fault-tolerant work, at the lowest rate. Blending the three lets you match each portion of your workload to the cheapest pricing it can safely use, which usually beats putting everything on a single model.
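As a rough illustration of the blend, the arithmetic can be written as a small function. The rates and hours below are placeholders chosen for easy arithmetic, not real prices, and `blended_cost` is a hypothetical helper.

```python
def blended_cost(hours_by_tier, rate_by_tier):
    """Total cost when each tier's hours are billed at that tier's hourly rate."""
    return sum(hours * rate_by_tier[tier] for tier, hours in hours_by_tier.items())


# Illustrative rates, normalized so on-demand costs 1.0 per GPU-hour (not real prices).
rates = {"reserved": 0.6, "on_demand": 1.0, "spot": 0.3}

# Hypothetical split of 1,000 GPU-hours across the three tiers.
blend = {"reserved": 500, "on_demand": 100, "spot": 400}
all_on_demand = {"on_demand": 1000}

blended = blended_cost(blend, rates)
baseline = blended_cost(all_on_demand, rates)
saving_fraction = 1 - blended / baseline
```

With these made-up numbers the blend costs roughly half of the all-on-demand baseline. The useful part is the structure of the calculation rather than the figure: plug in your own rates and the share of work that can safely run on each tier.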
Estimating your real spot savings
The discount on the price tag is not quite your true saving, because interruptions cost some efficiency. Each reclaim throws away the work done since your last checkpoint, and time spent waiting for replacement capacity is time the job is not progressing. To estimate the real benefit, start from the headline discount and subtract an allowance for lost-and-redone work and restart overhead. For a well-checkpointed job that loses only minutes per interruption, the adjustment is small and spot remains a large net win. For a poorly checkpointed job that loses hours each time, the adjustment can erase the saving entirely, which is exactly why checkpointing is the deciding practice.
- Start with the discount. Note the spot rate against on-demand.
- Estimate interruption frequency. Consider how often the pool you use gets reclaimed.
- Add redo cost. Account for work lost between checkpoints and restart time.
- Compare the adjusted figure. If it still beats on-demand comfortably, spot is worth it.
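The four steps above can be turned into a back-of-the-envelope calculation. The sketch below uses a deliberately simple steady-state model, which is an assumption of this example rather than anything a provider publishes: interruptions arrive at a constant average rate, each one loses half a checkpoint interval of work on average plus a fixed restart time, and that lost time is billed. All function names and input values are hypothetical.

```python
import math


def progress_fraction(interruptions_per_hour, checkpoint_interval_hours, restart_hours):
    """Share of each billed hour that turns into useful progress.

    Assumes each interruption loses half a checkpoint interval on average,
    plus restart time during which the instance is billed but idle.
    """
    lost_per_interruption = checkpoint_interval_hours / 2 + restart_hours
    return max(0.0, 1.0 - interruptions_per_hour * lost_per_interruption)


def effective_spot_rate(spot_rate, interruptions_per_hour,
                        checkpoint_interval_hours, restart_hours):
    """Spot price per hour of useful work, after redo and restart overhead."""
    fraction = progress_fraction(
        interruptions_per_hour, checkpoint_interval_hours, restart_hours
    )
    if fraction == 0.0:
        return math.inf  # the job never makes net progress under this model
    return spot_rate / fraction


def spot_saving(on_demand_rate, spot_rate, interruptions_per_hour,
                checkpoint_interval_hours, restart_hours):
    """Fraction saved versus on-demand; negative means spot costs more."""
    effective = effective_spot_rate(
        spot_rate, interruptions_per_hour, checkpoint_interval_hours, restart_hours
    )
    return 1.0 - effective / on_demand_rate


# Illustrative inputs only: on-demand normalized to 1.0, spot at 0.4.
well_checkpointed = spot_saving(1.0, 0.4, interruptions_per_hour=0.05,
                                checkpoint_interval_hours=0.25, restart_hours=0.1)
poorly_checkpointed = spot_saving(1.0, 0.4, interruptions_per_hour=0.2,
                                  checkpoint_interval_hours=8.0, restart_hours=0.1)
```

Under these made-up inputs, the well-checkpointed job keeps nearly all of the headline discount, while the poorly checkpointed job ends up with a negative saving. The model ignores unbilled waiting time and price movement, so treat the result as a sanity check rather than a forecast.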
Monitoring and automation make spot practical
Spot becomes genuinely low-effort once the handling is automated rather than manual. Tooling that watches for reclaim signals, checkpoints on warning, requests replacement capacity, and resumes the job turns interruptions into a non-event you rarely think about. Many orchestration systems and managed training services include this behavior, so you are not building it from scratch. The upfront investment in automation is what lets you run large portions of your workload on the cheapest capacity without babysitting it, which is where the real, sustained savings come from.
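To make the automation concrete, here is a minimal sketch of a supervisor loop that resumes after interruptions and falls back to on-demand when spot capacity stays unavailable. Everything here is hypothetical: `run_on` stands in for whatever launches or resumes your job on a given pool, and the two exception types stand in for the signals your orchestration layer would report. Managed services typically implement this logic for you.

```python
import time


class Interrupted(Exception):
    """The instance was reclaimed while the job was running."""


class CapacityUnavailable(Exception):
    """No instance could be obtained from the requested pool."""


def supervise(run_on, max_unavailable=3, max_attempts=50, backoff_seconds=0.0):
    """Keep the job moving until it completes.

    run_on(pool) is a caller-supplied function that launches or resumes the job
    on "spot" or "on_demand". It returns normally on completion and raises
    Interrupted or CapacityUnavailable otherwise. Returns a list of events.
    """
    events = []
    unavailable_streak = 0
    for _ in range(max_attempts):
        # Prefer spot until it has been unavailable several times in a row.
        pool = "spot" if unavailable_streak < max_unavailable else "on_demand"
        try:
            run_on(pool)
        except Interrupted:
            events.append(f"{pool}:interrupted")
            unavailable_streak = 0  # capacity existed, so try spot again
            continue
        except CapacityUnavailable:
            events.append(f"{pool}:unavailable")
            unavailable_streak += 1
            time.sleep(backoff_seconds)
            continue
        events.append(f"{pool}:done")
        return events
    raise RuntimeError("job did not complete within max_attempts")
```

The loop assumes the job itself checkpoints and resumes, so each call to `run_on` continues from saved state rather than starting over. The cap on attempts is there so a persistent failure surfaces as an error instead of retrying forever.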
Spot instances offer some of the deepest discounts in GPU cloud, but the price reflects a real tradeoff between cost and reliability. Use spot for fault-tolerant, checkpointed, retryable work, keep latency-sensitive and uninterruptible jobs on reliable capacity, estimate the saving after redo cost, and automate interruption handling so reclaims cost minutes rather than hours. Treat spot as one layer in a blended strategy, and you can capture large savings without gambling your important workloads.