Autoscale LLM Inference With KEDA

LLM inference traffic is rarely flat. It spikes during business hours, surges with a product launch, and goes quiet overnight. Running a fixed number of GPU pods means either paying for idle capacity or dropping requests during peaks. KEDA (Kubernetes Event-driven Autoscaling) lets you scale GPU-backed inference pods up and down based on the signals that actually matter, like the depth of a request queue. This tutorial shows how to autoscale LLM inference on Kubernetes with KEDA in a way that respects both performance and cost.

Why default autoscaling falls short for inference

The built-in Horizontal Pod Autoscaler is most often configured to scale on CPU utilization. For LLM inference that is a poor signal: the real work happens on the GPU, so CPU can stay low while the GPU is saturated and requests pile up. You need to scale on a metric that reflects actual inference pressure.

KEDA extends Kubernetes autoscaling to react to external and custom metrics. Instead of guessing from CPU, it can scale on the number of requests waiting in a queue, the length of a message backlog, or a custom metric your serving stack exposes. That alignment between the scaling signal and the real bottleneck is what makes inference autoscaling work.

Pick the right scaling signal

The best signal depends on how requests reach your model. Common choices include:

Queue depth, when requests flow through a message queue before reaching GPU workers.
Pending request count or concurrency exposed by your inference server.
Request rate, when load is fairly predictable and proportional to traffic.
A custom latency or saturation metric scraped by your monitoring stack.

Queue-based scaling is especially clean for inference. Producers drop requests onto a queue, GPU worker pods pull from it, and KEDA scales the worker count to keep the queue from growing. When the queue drains, KEDA scales workers back down, potentially to zero.
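To make the queue-based rule concrete, here is a minimal Python sketch of the arithmetic, assuming the scaler holds a target backlog per worker (roughly how an average-value metric is handled by the Horizontal Pod Autoscaler that KEDA drives). The function name, the target of 5 requests per worker, and the replica bounds are illustrative assumptions, not values from any real deployment.

```python
import math


def desired_workers(queue_depth: int, target_per_worker: int,
                    min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Workers needed so the backlog per worker stays at or below the target."""
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    # Round up: a partial worker's worth of backlog still needs a whole pod.
    needed = math.ceil(queue_depth / target_per_worker)
    # Clamp to the replica bounds (hypothetical defaults).
    return max(min_replicas, min(max_replicas, needed))


# An empty queue allows zero workers; a deep queue is capped at max_replicas.
for depth in (0, 4, 23, 500):
    print(depth, desired_workers(depth, target_per_worker=5))
```

With a minimum of zero, an empty queue yields zero workers, which is the scale-to-zero case described above.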

Set up KEDA and a scaler

The general flow to get KEDA autoscaling your inference deployment is:

Install KEDA into the cluster so it can manage scaling objects.
Make sure your inference deployment requests GPU resources and lands on GPU nodes.
Expose the metric you want to scale on, such as queue length or a custom metric endpoint.
Create a scaling rule that points KEDA at that metric and sets target thresholds.
Define minimum and maximum replica counts as your cost and performance bounds.

The scaling rule is where you encode your intent. You tell KEDA which metric to read, what target value to hold, and how many replicas it may run. For example, you might target a queue depth of a few requests per worker, so KEDA adds a GPU pod whenever the backlog per worker climbs above that.
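As a rough illustration of such a rule, the sketch below builds a KEDA ScaledObject manifest in Python and prints it as JSON. It assumes a Prometheus trigger. The Prometheus address, the metric name `inference_queue_depth`, the deployment name, and the threshold of 5 are all hypothetical placeholders to replace with your own values.

```python
import json


def build_scaled_object(name, deployment, query, threshold,
                        min_replicas, max_replicas):
    """Return a ScaledObject manifest as a plain dict."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name},
        "spec": {
            # The deployment whose replica count KEDA will manage.
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            "triggers": [{
                "type": "prometheus",
                "metadata": {
                    # Hypothetical in-cluster Prometheus address.
                    "serverAddress": "http://prometheus.monitoring.svc:9090",
                    "query": query,
                    # Trigger metadata values are strings.
                    "threshold": str(threshold),
                },
            }],
        },
    }


manifest = build_scaled_object(
    name="llm-inference-scaler",          # hypothetical
    deployment="llm-inference",           # hypothetical
    query="sum(inference_queue_depth)",   # hypothetical metric
    threshold=5,
    min_replicas=1,
    max_replicas=8,
)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests, the printed output could be piped to `kubectl apply -f -`. Check the trigger fields against the documentation for your KEDA version first.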

Handle GPU-specific constraints

GPU autoscaling has wrinkles that CPU autoscaling does not. Address them deliberately.

Constraint	Why it matters	Mitigation
Pod startup time	Loading model weights is slow	Scale early on leading indicators, keep a warm minimum
Node provisioning	New GPU nodes take minutes	Pair KEDA with a cluster autoscaler and headroom
GPU scarcity	Capacity may be unavailable	Set realistic max replicas, spread across zones
Cost of idle GPUs	GPUs are expensive when idle	Allow scaling to a low or zero minimum off-peak

The slow startup of GPU pods is the biggest challenge. Because loading a large model can take minutes, reactive scaling alone may lag a traffic spike. Keeping a small warm minimum of pods absorbs sudden bursts, while KEDA adds capacity for sustained load.
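A back-of-the-envelope sketch can show why the warm minimum matters. It assumes a constant arrival rate and a fixed per-pod throughput, and it treats new pods as contributing nothing until they finish loading. All numbers below are made up for illustration.

```python
import math


def backlog_during_startup(arrival_rps, per_pod_rps, warm_pods, startup_s):
    """Requests that queue up while newly added pods are still loading weights."""
    # Only traffic beyond what the warm pods can serve accumulates.
    excess_rps = max(0.0, arrival_rps - per_pod_rps * warm_pods)
    return excess_rps * startup_s


def warm_pods_for_burst(burst_rps, per_pod_rps):
    """Warm pods needed to serve a burst with no backlog at all."""
    return math.ceil(burst_rps / per_pod_rps)


# Hypothetical burst: 12 req/s, 2 req/s per pod, 90 s pod startup.
print(backlog_during_startup(12, 2, warm_pods=2, startup_s=90))
print(backlog_during_startup(12, 2, warm_pods=6, startup_s=90))
print(warm_pods_for_burst(12, 2))
```

The estimate is deliberately crude. Its point is that the backlog grows linearly with startup time, so every warm pod shrinks the queue that builds up before the new capacity arrives.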

Balance scale to zero against cold starts

KEDA can scale a deployment all the way to zero when no work exists, which is the biggest cost saver for spiky workloads. The catch is the cold start: the first request after scaling to zero waits for a GPU pod, and possibly a GPU node, to come up. Decide per workload whether that wait is acceptable.

For internal or batch-style workloads, scale to zero off-hours and accept the cold start.
For user-facing, latency-sensitive endpoints, keep a small minimum warm.
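One way to frame that per-workload decision is to compare an estimated cold-start time against the wait the workload can tolerate. The sketch below does that in Python. The timings and the helper names are illustrative assumptions, and real cold starts vary with image size, model size, and cloud provider.

```python
def cold_start_seconds(node_provision_s, image_pull_s, model_load_s,
                       node_available=False):
    """Estimated wait for the first request after scaling to zero."""
    # If a GPU node is already up, only the pod-level delays apply.
    node_wait = 0 if node_available else node_provision_s
    return node_wait + image_pull_s + model_load_s


def min_replicas_for(latency_budget_s, cold_start_s, warm_floor=1):
    """Allow scale to zero only when the cold start fits the latency budget."""
    return 0 if cold_start_s <= latency_budget_s else warm_floor


# Hypothetical timings: 240 s node provisioning, 60 s image pull, 90 s model load.
cold = cold_start_seconds(240, 60, 90)

# A nightly batch job that tolerates a 10-minute wait can scale to zero.
print(min_replicas_for(latency_budget_s=600, cold_start_s=cold))
# A chat endpoint with a 5-second budget keeps one pod warm.
print(min_replicas_for(latency_budget_s=5, cold_start_s=cold))
```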

Test the autoscaling behavior

Before trusting autoscaling in production, load test it. Drive synthetic traffic that ramps up and down, and watch how quickly KEDA reacts, whether the queue stays bounded, and how long pods take to become ready. Tune the target thresholds so the system scales up before latency degrades and scales down without flapping. Flapping, where pods rapidly scale up and down, wastes money and destabilizes serving, so add a cooldown or scale-down stabilization window to smooth it out.
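The effect of a scale-down stabilization window can be seen in a small simulation. The Python sketch below assumes the behavior used by the Horizontal Pod Autoscaler, where the applied replica count never drops below the highest recommendation seen within the window while scale-up stays immediate. The sample series and the window length are invented for illustration.

```python
from collections import deque


def stabilized_scale_down(recommendations, window):
    """Apply the highest recommendation seen in the last `window` samples."""
    recent = deque(maxlen=window)
    applied = []
    for rec in recommendations:
        recent.append(rec)
        # Scale-up is immediate because the current sample is in the window.
        applied.append(max(recent))
    return applied


def count_changes(series):
    """Number of times the replica count changes between samples."""
    return sum(1 for a, b in zip(series, series[1:]) if a != b)


# Hypothetical raw recommendations that oscillate with bursty traffic.
raw = [2, 6, 2, 6, 2, 6, 2, 2, 2, 2]
smooth = stabilized_scale_down(raw, window=3)

print(raw, count_changes(raw))
print(smooth, count_changes(smooth))
```

In KEDA, this kind of behavior is typically configured by passing Horizontal Pod Autoscaler settings through the ScaledObject, while `cooldownPeriod` controls how long KEDA waits before scaling to zero. Confirm the exact fields against the documentation for your version.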

Coordinate pod and node autoscaling

KEDA scales pods, but a pod cannot run if there is no GPU node to host it. On a cluster with a fixed node pool, KEDA can only scale up to the GPUs you already have. To scale beyond that, you pair KEDA with a cluster autoscaler that adds GPU nodes when pending pods cannot be scheduled. The two layers work in sequence: KEDA decides it needs more pods, the scheduler finds no room, and the cluster autoscaler provisions a node.

The catch is timing. Provisioning a fresh GPU node can take minutes, far longer than starting a pod on an existing node. To avoid users waiting through both delays during a spike, keep a little node headroom so KEDA can place new pods immediately, and let the cluster autoscaler refill the buffer in the background. This trades a small amount of idle node cost for much faster response to bursts.
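The value of headroom can be estimated with a simple model. The Python sketch below assumes that pods placed on free GPU slots wait only for pod startup, that pods beyond the headroom also wait for a new node, and that new nodes provision in parallel. The timings are hypothetical.

```python
def pod_ready_times(new_pods, free_gpu_slots, pod_start_s, node_provision_s):
    """Seconds until each newly requested pod is ready to serve."""
    times = []
    for i in range(new_pods):
        if i < free_gpu_slots:
            # Fits in existing headroom: only model loading delays it.
            times.append(pod_start_s)
        else:
            # Must wait for a fresh GPU node first (assumed parallel provisioning).
            times.append(node_provision_s + pod_start_s)
    return times


# Hypothetical spike needing 4 more pods, 90 s pod start, 300 s node provisioning.
print(pod_ready_times(4, free_gpu_slots=2, pod_start_s=90, node_provision_s=300))
print(pod_ready_times(4, free_gpu_slots=0, pod_start_s=90, node_provision_s=300))
```

With two free slots, half of the new capacity arrives after pod startup alone. With none, every pod pays both delays.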

Right-size requests and limits

Autoscaling only behaves well if each pod's resource requests are accurate. Kubernetes allocates GPUs as whole devices, or as fixed slices where the device plugin is configured for sharing, rather than by GPU memory. If a pod claims too small a share, or requests too little host memory, the scheduler may pack incompatible pods onto a node and cause out-of-memory failures. If it requests too much, you waste capacity and scale out sooner than necessary. Set GPU requests to match what your model actually needs, including room for the batch sizes you serve.

Request a full GPU per pod when the model needs the whole device.
Confirm pods land only on GPU nodes using node selectors or taints.
Leave memory headroom for peak concurrency so batching does not trigger out-of-memory errors.

Accurate sizing makes the autoscaler's decisions trustworthy, because each added pod delivers the throughput you assumed when you set the scaling thresholds.
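The sizing guidance above might translate into a pod template like the one sketched below, built as a Python dict and printed as JSON. It assumes the NVIDIA device plugin, which exposes GPUs as the `nvidia.com/gpu` resource. The node label, the taint key, the image name, and the memory figure are hypothetical and depend on how your cluster is set up.

```python
import json


def gpu_pod_template(image, gpus=1, host_memory="32Gi"):
    """Pod template fragment that claims whole GPUs and targets GPU nodes."""
    return {
        "spec": {
            # Hypothetical label; use whatever label your GPU nodes carry.
            "nodeSelector": {"accelerator": "nvidia-gpu"},
            # Assumes GPU nodes are tainted with this key by the cluster admin.
            "tolerations": [{
                "key": "nvidia.com/gpu",
                "operator": "Exists",
                "effect": "NoSchedule",
            }],
            "containers": [{
                "name": "inference",
                "image": image,
                "resources": {
                    # Host memory request equals the limit for predictable packing.
                    "requests": {"memory": host_memory},
                    # GPUs are set under limits; the request defaults to match.
                    "limits": {"nvidia.com/gpu": gpus, "memory": host_memory},
                },
            }],
        }
    }


template = gpu_pod_template("registry.example.com/llm-server:latest")  # hypothetical image
print(json.dumps(template, indent=2))
```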

Conclusion

KEDA brings demand-aware autoscaling to GPU inference on Kubernetes, scaling pods on signals that reflect real load rather than misleading CPU numbers. Choose a metric like queue depth, set sensible replica bounds, account for slow GPU startup with a warm minimum, and decide deliberately where scale to zero is worth the cold start. With careful thresholds and load testing, you get inference that grows with demand and shrinks your bill when traffic fades.
