Deploy a Serverless Inference Endpoint on Modal
A tutorial for deploying a serverless GPU inference endpoint on Modal, with cold start handling, autoscaling, and cost control.
Serverless GPU platforms let you run inference without managing a fleet of always-on instances. Modal is one such platform, built around the idea that you describe your environment and function in Python, and the platform provisions GPUs on demand, runs your code, and scales back to zero when traffic stops. For spiky or low-volume inference, this can be far cheaper than keeping a dedicated GPU running around the clock. This tutorial walks through deploying a serverless inference endpoint and keeping it fast and affordable.
When serverless GPU inference makes sense
The economics of serverless hinge on utilization. A dedicated GPU instance bills every hour whether or not it serves a request. A serverless endpoint bills only for the seconds your code actually runs on a GPU. That makes serverless a clear win for workloads that are intermittent, bursty, or still finding their traffic.
- Prototypes and demos that get occasional traffic.
- Internal tools used during business hours only.
- Spiky public endpoints where load varies wildly.
- Batch jobs that run a few times a day.
For a steady, high-volume stream of requests that keeps a GPU busy nearly all the time, a dedicated or reserved instance usually costs less per token. The decision comes down to how full you can keep the hardware.
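The utilization argument can be made concrete with a little arithmetic. The sketch below computes the GPU utilization at which a dedicated instance and a per-second serverless rate cost the same. The function name and both prices are hypothetical placeholders, not published Modal or cloud rates.

```python
def break_even_utilization(dedicated_per_hour: float,
                           serverless_per_second: float) -> float:
    """Fraction of each hour the GPU must be busy before dedicated is cheaper."""
    if dedicated_per_hour <= 0 or serverless_per_second <= 0:
        raise ValueError("prices must be positive")
    # What serverless would cost if the GPU were busy for the whole hour.
    serverless_per_hour_at_full_load = serverless_per_second * 3600
    # Utilization cannot exceed 100%, so cap the result at 1.0.
    return min(1.0, dedicated_per_hour / serverless_per_hour_at_full_load)


# Hypothetical prices, for illustration only.
threshold = break_even_utilization(dedicated_per_hour=2.00,
                                   serverless_per_second=0.001)
print(f"Dedicated wins above {threshold:.0%} utilization")
```

Below the threshold, paying per second is cheaper. Above it, the always-on instance is. A result of 1.0 means the dedicated instance never pays off at those prices.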
Define the environment
On Modal you describe everything in code, which is what makes deployments reproducible. You define an image with your dependencies, attach a GPU type to the function that needs it, and declare how the function is invoked. The image build happens once and is cached, so later deployments are fast.
- Define an image starting from a base and adding your Python and system dependencies.
- Pin library versions so the deployed environment matches what you tested.
- Bake in or mount your model weights so they do not download on every cold start.
- Attach the GPU type your model needs to the inference function.
Choosing the GPU type is a real decision. A smaller model serves fine on a modest GPU, while a large model needs a high-memory data-center GPU. Pick the smallest GPU that comfortably holds your model and hits your latency target, because the larger the GPU, the more each second of execution costs.
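As a minimal sketch of the steps above, an environment definition might look like the following. It assumes a recent release of the `modal` Python package. The app name, package pins, GPU choice, and function body are illustrative placeholders, and the exact parameters are worth checking against the current Modal documentation for your installed version.

```python
import modal

# Hypothetical app name; choose your own.
app = modal.App("inference-demo")

# Pinned dependencies so the deployed image matches what was tested.
# The versions below are placeholders, not recommendations.
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch==2.4.0",
    "transformers==4.44.2",
)


# The GPU type here is an assumption; size it to your model and latency target.
@app.function(image=image, gpu="A10G")
def generate(prompt: str) -> str:
    # Placeholder body: real code would run the model here.
    return f"echo: {prompt}"
```

Because the image is declared in code, rebuilding it from the same pins should produce the same environment, which is the reproducibility benefit described above.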
Expose a web endpoint
To serve requests over HTTP, you decorate a function as a web endpoint. The platform gives you a URL that triggers the function on each request. Inside the function you load the model, run inference, and return the response. To avoid reloading weights on every call, load the model once when the container starts and keep it warm in memory for the life of that container.
This warm container pattern is essential for performance. Loading a multi-gigabyte model from disk on every request would dominate your latency. By loading it once per container and reusing it across requests, only the first request to a fresh container pays the load cost.
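The load-once idea is independent of any particular platform, so it can be sketched in plain Python. Here `WarmModel` and `fake_loader` are hypothetical names, and the loader stands in for whatever code reads your weights. On Modal you would typically hook the load into the container's startup lifecycle rather than the first request.

```python
import threading


class WarmModel:
    """Loads a model once per process and reuses it across requests."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None
        self._lock = threading.Lock()
        self.load_count = 0  # exposed only to make the behavior observable

    def get(self):
        # Double-checked locking: concurrent first requests trigger one load.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
                    self.load_count += 1
        return self._model

    def predict(self, prompt: str) -> str:
        return self.get()(prompt)


def fake_loader():
    # Stand-in for an expensive weight load; returns a callable "model".
    return lambda prompt: prompt.upper()


warm = WarmModel(fake_loader)
results = [warm.predict(p) for p in ["a", "b", "c"]]
print(results, "loads:", warm.load_count)
```

Three requests are served, but the loader runs only once, which is the property that keeps per-request latency low inside a warm container.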
Tame cold starts
The tradeoff of scaling to zero is the cold start: when no container is running, the next request must wait for one to spin up and load the model. Several techniques reduce this pain.
| Technique | Effect | Cost impact |
|---|---|---|
| Keep warm containers | Eliminates cold starts for steady traffic | Pays for idle warm capacity |
| Bake weights into the image | Avoids download time on start | Larger image, no runtime cost |
| Smaller or quantized model | Faster load, faster inference | Lower per-request cost |
| Scale to zero | Cheapest when idle | Cold start on first request |
Keeping a minimum number of warm containers is the most direct fix. You trade some idle cost for predictable latency. For workloads that can tolerate an occasional slow first request, scaling fully to zero remains the cheapest option.
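To put a number on that trade, the sketch below estimates what idle warm capacity costs per day. It assumes warm containers are billed at the same per-second rate whether or not they are serving a request. The function name and all figures are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def warm_pool_idle_cost(min_containers: int,
                        gpu_rate_per_second: float,
                        busy_seconds_per_day: float) -> float:
    """Daily cost of the time warm containers spend idle."""
    if min_containers < 0 or gpu_rate_per_second < 0 or busy_seconds_per_day < 0:
        raise ValueError("inputs must be non-negative")
    capacity_seconds = min_containers * SECONDS_PER_DAY
    # Time spent serving requests is not idle, so subtract it (never below zero).
    idle_seconds = max(0.0, capacity_seconds - busy_seconds_per_day)
    return idle_seconds * gpu_rate_per_second


# Hypothetical: one warm container at $0.0005/s that is busy two hours a day.
idle_cost = warm_pool_idle_cost(1, 0.0005, 2 * 3600)
print(f"Idle warm capacity costs about ${idle_cost:.2f} per day")
```

Comparing this figure with how much a slow first request costs your users is one way to decide between a warm minimum and scaling to zero.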
Configure autoscaling and limits
Serverless platforms scale by adding containers as concurrent requests rise. Set sensible bounds so a traffic spike does not spin up an unbounded and expensive number of GPUs. Define a maximum number of concurrent containers as a cost ceiling, and tune how many requests each container handles at once based on your model's batching ability. A model that batches well can serve several concurrent requests per container, which raises throughput per GPU and lowers cost per request.
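The relationship between load, per-container concurrency, and the container ceiling can be sketched as follows. This is a simplified model of the scaling arithmetic, not Modal's actual autoscaler, and the function and parameter names are hypothetical.

```python
import math


def containers_needed(concurrent_requests: int,
                      requests_per_container: int,
                      max_containers: int,
                      min_containers: int = 0) -> int:
    """Containers required for a load level, clamped to the configured bounds."""
    if requests_per_container < 1:
        raise ValueError("requests_per_container must be at least 1")
    if max_containers < min_containers:
        raise ValueError("max_containers must be >= min_containers")
    wanted = math.ceil(max(0, concurrent_requests) / requests_per_container)
    # The ceiling caps spend; the floor keeps warm capacity available.
    return max(min_containers, min(wanted, max_containers))


for load in (0, 3, 40, 500):
    print(load, containers_needed(load, requests_per_container=8, max_containers=10))
```

With a ceiling of 10, a burst of 500 concurrent requests still uses at most 10 containers. The extra requests wait instead of adding GPUs. Raising `requests_per_container` for a model that batches well lowers the container count at every load level.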
Watch the bill and iterate
Serverless cost is execution time multiplied by the GPU rate, so the levers are obvious: make each request faster and keep the GPU well utilized. Monitor your average execution time, cold start frequency, and concurrency. If cold starts hurt users, add warm capacity. If idle warm capacity costs too much during quiet hours, let it scale to zero overnight. The right balance shifts with your traffic, so revisit it as usage grows.
Handle model weights efficiently
Where your model weights live has an outsized effect on cold start time. Downloading several gigabytes of weights from a remote store on every fresh container is the most common reason serverless inference feels slow. Avoid it by making weights available locally to the container with minimal fetching.
Two patterns work well. You can bake the weights into the container image so they are present the instant a container starts, at the cost of a larger image. Or you can store weights on a fast shared volume that mounts into containers, so they load quickly without bloating the image. For frequently updated models the volume approach is flexible, while for stable models baking them in is simplest. Either way, the goal is the same: never download the full model fresh on the request path.
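Both patterns share one rule: check for a complete local copy before fetching anything. The sketch below shows that check using a marker file. `ensure_weights`, `fake_fetch`, and the `.complete` marker name are hypothetical, and the temporary directory stands in for a mounted volume or an image build path.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_weights(weights_dir: Path,
                   fetch: Callable[[Path], None],
                   marker_name: str = ".complete") -> bool:
    """Fetch weights only if no completed copy exists.

    Returns True if a fetch happened, False if the local copy was reused.
    """
    marker = weights_dir / marker_name
    if marker.exists():
        return False
    weights_dir.mkdir(parents=True, exist_ok=True)
    fetch(weights_dir)
    marker.touch()  # written last, so a partial download is never trusted
    return True


def fake_fetch(target: Path) -> None:
    # Stand-in for a real download from a model store.
    (target / "model.bin").write_bytes(b"\x00" * 16)


with tempfile.TemporaryDirectory() as tmp:
    volume = Path(tmp) / "models" / "demo"
    first = ensure_weights(volume, fake_fetch)   # fetches
    second = ensure_weights(volume, fake_fetch)  # reuses the local copy
    print(first, second)
```

Run at image build time or against a shared volume, a check like this keeps the download off the request path. Only the first run pays for the transfer.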
Add observability from the start
You cannot tune what you cannot see. Instrument your endpoint so you know how it behaves under real traffic. Track request latency (broken out into time to first token where relevant), the cold-start rate, GPU execution time per request, and error rates. These signals tell you whether to add warm capacity, switch to a smaller GPU, or adjust concurrency.
- Latency percentiles reveal whether cold starts hurt the tail of your traffic.
- Execution time per request feeds directly into your cost-per-request math.
- Error rates surface out-of-memory issues or model loading failures early.
With these metrics in hand, the cost and performance tradeoffs become concrete decisions rather than guesses, which is exactly what you want before scaling a serverless endpoint to real users.
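As one way to turn raw request logs into those signals, the sketch below summarizes a batch of records into latency percentiles, cold-start rate, error rate, and cost per request. The `RequestRecord` fields, the nearest-rank percentile choice, and the GPU rate are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_s: float    # end-to-end latency seen by the caller
    gpu_seconds: float  # GPU execution time attributed to the request
    cold_start: bool
    error: bool = False


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records, gpu_rate_per_second):
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    latencies = [r.latency_s for r in records]
    mean_gpu_seconds = sum(r.gpu_seconds for r in records) / n
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "cold_start_rate": sum(r.cold_start for r in records) / n,
        "error_rate": sum(r.error for r in records) / n,
        "cost_per_request": mean_gpu_seconds * gpu_rate_per_second,
    }


# Hypothetical sample: nine warm requests and one cold start.
records = [RequestRecord(0.4, 0.3, False) for _ in range(9)]
records.append(RequestRecord(12.0, 0.3, True))
summary = summarize(records, gpu_rate_per_second=0.0005)
print(summary)
```

In this sample the median looks healthy while the 95th percentile is dominated by the single cold start, which is the kind of tail effect the percentile breakdown is meant to expose.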
Conclusion
A serverless inference endpoint on Modal lets you ship a GPU-backed model that scales with demand and bills only for the seconds it runs. Define a pinned image, attach the smallest sufficient GPU, load the model once per warm container, and manage cold starts with warm capacity or baked-in weights. For intermittent and bursty inference, this approach often beats a dedicated GPU on cost while keeping deployment simple.