Serverless GPU vs dedicated: when to switch
A practical breakdown of serverless GPU vs dedicated rental, covering cold starts, utilization break-even math, and which workloads fit each billing model.
The single biggest GPU cost mistake is paying for idle silicon. A dedicated H100 that sits at 8 percent utilization still bills 100 percent of its hourly rate, while a serverless endpoint that scales to zero between requests bills nothing. But serverless is not free of tradeoffs: cold starts, per-second premiums, and concurrency caps can quietly erase the savings. This guide walks through the actual break-even math so you can decide when to switch.
The two billing models, precisely
Dedicated GPU rental means you reserve a specific accelerator - an H100, A100 80GB, or H200 - and pay by the hour (or by the minute) for as long as the instance exists, idle or not. You control the full machine, choose your driver and CUDA stack, and keep model weights resident in VRAM indefinitely.
Serverless GPU flips the unit of billing. You deploy a container or a model artifact, and the platform spins workers up on demand, bills by the second or finer (some platforms meter in 10-100 ms increments) only while a request executes, and scales the worker count back to zero when traffic stops. You give up control of placement and the always-warm guarantee in exchange for never paying for idle.
Utilization is the whole game
The decision reduces to one number: what fraction of the wall-clock hours you rent will actually run GPU work. Call it your duty cycle.
- High duty cycle (above ~50-65 percent): dedicated almost always wins. The per-second serverless premium and cold-start overhead stop paying for themselves once the GPU is busy most of the time.
- Low duty cycle (below ~25-35 percent): serverless almost always wins. You only pay for the seconds you compute, so a bursty endpoint that is idle 80 percent of the day costs a fraction of a 24/7 rental.
- The murky middle (~30-55 percent): it depends on cold-start frequency, request shape, and the serverless per-second markup. Model both before committing.
A rough rule of thumb: serverless platforms charge a 1.5x to 3x premium on the effective per-GPU-second rate compared with a dedicated hourly rate amortized to seconds. Ignoring cold starts and warm replicas, the break-even duty cycle is the inverse of that premium: about 67 percent at 1.5x, 50 percent at 2x, and 33 percent at 3x. Below the break-even serverless is cheaper; above it dedicated is cheaper.
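The inverse relationship can be checked in a few lines. This is a minimal sketch of the simplified model only (dedicated bills wall-clock time, serverless bills active time at a multiple of the dedicated rate); the function name `break_even_duty_cycle` is illustrative, and cold-start or warm-replica charges are deliberately left out.

```python
def break_even_duty_cycle(premium: float) -> float:
    """Duty cycle at which serverless and dedicated cost the same.

    Simplified model (an assumption, not a provider's billing formula):
    dedicated cost  = rate * wall_clock_time
    serverless cost = premium * rate * active_time
    Setting the two equal gives active_time / wall_clock_time = 1 / premium.
    """
    if premium <= 0:
        raise ValueError("premium must be positive")
    # Cap at 1.0: with a premium at or below 1x, serverless never costs more.
    return min(1.0, 1.0 / premium)


for premium in (1.5, 2.0, 3.0):
    duty = break_even_duty_cycle(premium)
    print(f"{premium:.1f}x premium -> break-even near {duty:.0%} utilization")
```

Any billed loading time or warm replica pushes the real break-even lower than this figure, so treat it as an upper bound on where serverless still wins.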
A worked break-even example
Suppose a dedicated GPU rents for a given hourly rate, and the comparable serverless endpoint bills at roughly 2x that rate per active second. You serve inference requests that each take 400 ms of GPU time.
| Daily request volume | Active GPU hours/day | Duty cycle (vs 24h) | Cheaper model |
|---|---|---|---|
| 5,000 | ~0.56 | ~2% | Serverless (by a wide margin) |
| 50,000 | ~5.6 | ~23% | Serverless |
| 120,000 | ~13.3 | ~55% | Roughly even (dedicated slightly ahead) |
| 250,000 | ~27.8 | over 100% (needs >1 GPU) | Dedicated (scale out) |
The pattern is clear: low and spiky volume favors serverless, steady high volume favors dedicated. Use the GPU training-cost calculator to plug in your own active seconds and rates, and check the serverless GPU comparison table for live per-second pricing across providers.
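The table rows can be reproduced with a short script. This is a sketch under the example's stated assumptions (400 ms of GPU time per request, a 2x serverless premium) plus two of its own: requests pack perfectly onto whole dedicated GPUs, and cold-start costs are ignored. Costs are expressed in units of one dedicated GPU-day, so no real prices are involved; `daily_comparison` is an illustrative name.

```python
import math

SECONDS_PER_DAY = 86_400


def daily_comparison(requests_per_day, gpu_seconds_per_request=0.4, premium=2.0):
    """Compare serverless and dedicated cost for one day of traffic.

    Costs are in units of one dedicated GPU-day, so no real prices are needed.
    """
    active_seconds = requests_per_day * gpu_seconds_per_request
    duty_cycle = active_seconds / SECONDS_PER_DAY
    # Dedicated GPUs are rented whole; assume perfect packing across GPUs.
    gpus_needed = max(1, math.ceil(duty_cycle))
    dedicated_cost = float(gpus_needed)
    # Serverless pays the premium, but only for active seconds.
    serverless_cost = premium * duty_cycle
    return {
        "active_hours": active_seconds / 3600,
        "duty_cycle": duty_cycle,
        "gpus_needed": gpus_needed,
        "dedicated_cost": dedicated_cost,
        "serverless_cost": serverless_cost,
        "cheaper": "serverless" if serverless_cost < dedicated_cost else "dedicated",
    }


for volume in (5_000, 50_000, 120_000, 250_000):
    row = daily_comparison(volume)
    print(f"{volume:>7,} req/day  {row['active_hours']:5.1f} h  "
          f"{row['duty_cycle']:4.0%}  serverless={row['serverless_cost']:.2f}  "
          f"dedicated={row['dedicated_cost']:.2f}  -> {row['cheaper']}")
```

At 120,000 requests per day the model puts serverless at about 1.11 dedicated GPU-days against 1.00, which is why that row is close to even rather than a clear win for either side.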
Cold starts: the hidden tax
Serverless GPUs are not instant. When traffic arrives and no warm worker exists, the platform must schedule a node, pull the container image, load model weights into VRAM, and warm the CUDA context. For a multi-gigabyte LLM checkpoint this can take anywhere from a few seconds to over a minute.
Cold starts hurt in three ways:
- Latency: the first request after idle pays the full warm-up, which is unacceptable for interactive UX without mitigation.
- Cost: some platforms bill the loading time; even when they do not, you may keep a minimum of one replica warm to dodge cold starts, which reintroduces idle cost.
- Throughput cliffs: a traffic spike that outpaces autoscaling queues requests behind cold workers.
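How often the latency penalty bites depends on how traffic gaps compare with the platform's idle timeout. The sketch below is a rough estimate, not a provider formula: it assumes Poisson (memoryless) arrivals, a single worker that shuts down after a fixed idle timeout, and negligible request duration. Under those assumptions the share of requests that land on a cold worker is the probability that the gap since the previous request exceeds the timeout. Both function names and the `idle_timeout_s` parameter are hypothetical.

```python
import math


def cold_start_fraction(requests_per_hour: float, idle_timeout_s: float) -> float:
    """Estimated share of requests that hit a cold worker.

    Assumes Poisson arrivals, so gaps between requests are exponentially
    distributed and P(gap > timeout) = exp(-rate * timeout).
    """
    rate_per_second = requests_per_hour / 3600
    return math.exp(-rate_per_second * idle_timeout_s)


def mean_latency_s(warm_latency_s: float, cold_start_s: float, cold_fraction: float) -> float:
    """Average latency when a fraction of requests also pays the cold start."""
    return warm_latency_s + cold_fraction * cold_start_s


# Illustrative inputs only: 30 requests/hour, 60 s idle timeout, 20 s cold start.
fraction = cold_start_fraction(requests_per_hour=30, idle_timeout_s=60)
print(f"cold-start share: {fraction:.0%}")
print(f"mean latency: {mean_latency_s(0.4, 20.0, fraction):.1f} s")
```

Real traffic is usually burstier than Poisson, which tends to mean fewer cold starts inside bursts and more after long quiet periods, so measured logs are a better input than this estimate when they exist.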
Mitigations
- Keep one warm replica during business hours, scale to zero overnight - a hybrid that captures most savings.
- Shrink images and lazy-load weights from fast local storage or a cached layer.
- Quantize models so the checkpoint loads faster and fits more requests per GPU.
- Use provider snapshot/restore features, where available, that resume from a pre-warmed state far faster than a full image pull and weight load.
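The warm-replica hybrid from the first mitigation is worth costing out before adopting it. The sketch below assumes the warm replica is billed at the full serverless rate for every warm hour, busy or idle; some platforms discount idle warm time, so check the actual billing rules. Rates are relative (dedicated = 1.0 per hour), not real prices, and the function names are illustrative.

```python
def hybrid_daily_cost(dedicated_hourly, premium, warm_hours, offhours_active_hours):
    """Daily cost of one always-warm replica for `warm_hours`, scale-to-zero otherwise.

    Assumption: warm time is billed at the full serverless rate even when idle.
    """
    serverless_hourly = premium * dedicated_hourly
    # Warm window is billed in full; off-hours bill only active compute time.
    return serverless_hourly * (warm_hours + offhours_active_hours)


def dedicated_daily_cost(dedicated_hourly):
    """Daily cost of one dedicated GPU rented around the clock."""
    return 24 * dedicated_hourly


# Illustrative inputs: 2x premium, 8 warm hours, 0.5 active hours overnight.
hybrid = hybrid_daily_cost(1.0, premium=2.0, warm_hours=8, offhours_active_hours=0.5)
dedicated = dedicated_daily_cost(1.0)
print(f"hybrid: {hybrid:.1f}  dedicated: {dedicated:.1f}")
```

Under this billing assumption the hybrid stops saving money once warm hours approach 24 divided by the premium (12 hours at 2x), so the length of the warm window matters as much as the premium itself.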
Batch vs spiky: match the shape
Workload shape matters as much as volume.
- Spiky, latency-sensitive inference (chat backends, image generation behind a UI, intermittent embeddings): serverless shines because traffic is unpredictable and idle gaps are large. The scale-to-zero behavior directly maps to the cost you want to pay.
- Steady, high-QPS inference (a popular API serving constant load): dedicated wins. You will keep the GPU near-saturated, so paying for idle is not a concern, and you avoid serverless premiums and concurrency caps.
- Batch and training jobs (fine-tuning, large offline inference, nightly embedding refresh): dedicated or spot dedicated wins. These jobs saturate the GPU for hours, so the serverless per-second premium would be pure overhead. For multi-hour GPU rentals, look at spot or interruptible dedicated instances for the lowest rate.
Other factors beyond raw cost
- Concurrency limits: serverless platforms cap simultaneous workers per account; a viral spike can hit a ceiling that a reserved fleet would not.
- Operational overhead: dedicated means you patch drivers, manage autoscaling, and own uptime; serverless offloads most of that.
- Data gravity and egress: moving large inputs to a serverless endpoint can incur transfer cost - watch your egress pricing if payloads are big.
- Reproducibility: dedicated gives you a pinned, fully controlled environment, which matters for compliance and exact-result reproduction.
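For the concurrency-limit point, Little's law gives a quick estimate of how many workers a spike needs: average requests in flight equals arrival rate times time per request. The sketch below is a sizing aid under simple assumptions (steady arrivals at the peak rate, a fixed number of requests handled concurrently per worker); `workers_needed` and `requests_per_worker` are illustrative names, and the result is an average, so real deployments need headroom on top of it.

```python
import math


def workers_needed(peak_rps: float, gpu_seconds_per_request: float,
                   requests_per_worker: int = 1) -> int:
    """Workers required to keep up with a sustained peak, via Little's law."""
    if requests_per_worker < 1:
        raise ValueError("requests_per_worker must be at least 1")
    # Little's law: average in-flight requests = arrival rate * service time.
    in_flight = peak_rps * gpu_seconds_per_request
    return math.ceil(in_flight / requests_per_worker)


# Illustrative spike: 50 requests/second at 0.5 s of GPU time each.
print(workers_needed(50, 0.5))                          # one request per worker
print(workers_needed(50, 0.5, requests_per_worker=4))   # batching 4 per worker
```

If the first number exceeds the account's worker cap, the spike will queue no matter how fast autoscaling reacts.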
How to decide in practice
- Measure or estimate your active GPU-seconds per day (requests times per-request GPU time).
- Divide by 86,400 (the number of seconds in a day) to get your duty cycle against a 24/7 rental on one GPU.
- Pull the serverless per-second rate and the dedicated hourly rate from the live comparison table and the GPU table.
- If duty cycle is under ~30 percent, default to serverless. If over ~55 percent, default to dedicated. In between, factor in cold-start tolerance and any warm-replica cost.
- Re-evaluate quarterly - traffic that grows past the break-even should migrate to dedicated, and a shrinking or seasonal workload should move the other way.
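The first four steps above can be sketched as one function. The 30 and 55 percent thresholds come from the guidance in this section; the function name `decide` and the rates in the example call are placeholders, not real provider prices.

```python
SECONDS_PER_DAY = 86_400


def decide(requests_per_day, gpu_seconds_per_request,
           dedicated_hourly, serverless_per_second,
           low=0.30, high=0.55):
    """Apply the duty-cycle procedure and return the inputs to the decision."""
    # Step 1: active GPU-seconds per day.
    active_seconds = requests_per_day * gpu_seconds_per_request
    # Step 2: duty cycle against a 24/7 rental of one GPU.
    duty_cycle = active_seconds / SECONDS_PER_DAY
    # Step 3: implied serverless premium over the dedicated rate.
    premium = serverless_per_second * 3600 / dedicated_hourly
    # Step 4: default choice from the rule-of-thumb thresholds.
    if duty_cycle < low:
        default = "serverless"
    elif duty_cycle > high:
        default = "dedicated"
    else:
        default = "model both"
    return {
        "duty_cycle": duty_cycle,
        "premium": premium,
        "serverless_daily": active_seconds * serverless_per_second,
        "dedicated_daily": 24 * dedicated_hourly,
        "default": default,
    }


# Placeholder rates for illustration only.
result = decide(50_000, 0.4, dedicated_hourly=2.0, serverless_per_second=0.001)
print(result)
```

The "model both" branch is where cold-start tolerance and warm-replica cost decide the outcome, so the function reports both daily costs rather than picking a winner there.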
Takeaway
Serverless GPU and dedicated rental are not rivals so much as tools for different duty cycles. If your GPU would sit idle most hours, serverless scale-to-zero saves real money and operational toil despite the per-second premium and cold-start risk. If your GPU stays busy - steady high-QPS serving, batch jobs, or training - dedicated or spot dedicated is cheaper and gives you full control. Run the duty-cycle math, watch the cold-start tax, and let your live traffic, not a vendor pitch, pick the model. Compare current rates on the serverless GPU page and the dedicated GPU page before you commit.