Run a GPU Workload on Kubernetes

Kubernetes is the standard way to orchestrate containerized workloads, and it handles GPUs well once a few pieces are in place. The path from nothing to a running GPU pod has a predictable shape: provision GPU nodes, make the cluster aware of the GPUs through a device plugin, request a GPU in your pod spec, and verify the scheduler placed it correctly. This intermediate tutorial walks through each step and explains why it is needed, so you understand the mechanism rather than just copying commands.

How Kubernetes Sees GPUs

By default, Kubernetes knows about CPU and memory but not GPUs. GPUs are exposed as an extended resource, which means the scheduler only places GPU pods on nodes that advertise them. The component that does the advertising is a device plugin, which runs on each GPU node, detects the accelerators, and registers them with the node's kubelet. The kubelet then reports the GPU count as part of the node's capacity, and that is what the scheduler reads. Until the plugin is in place, your GPU nodes look like ordinary machines and any pod that asks for a GPU stays Pending.

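Once the plugin is running, asking for a GPU means adding the GPU resource to the container's resource limits, much like requesting CPU or memory. Unlike CPU and memory, extended resources cannot be overcommitted: if you also set a request, it must equal the limit. The scheduler then places the pod only on a node with an available GPU.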

Step One: Create a GPU Node Pool

Start by adding GPU-equipped nodes to your cluster. On managed Kubernetes services this is usually a node pool configured with a GPU instance type; on self-managed clusters it means joining GPU machines as worker nodes. Either way, the goal is a set of nodes that physically have accelerators attached.

It is common to keep GPU nodes in a separate pool from CPU nodes. GPU nodes are expensive, so isolating them lets you scale them independently, often down to zero when no GPU work is pending, while your cheaper CPU nodes handle everything else.

Step Two: Install the Device Plugin

With GPU nodes present, install the device plugin that advertises the GPUs. The plugin runs as a workload on every GPU node, typically deployed as a DaemonSet so it lands automatically on any node with the right hardware. After it starts, the GPU nodes report their accelerator count to the cluster.

You can confirm this worked by inspecting a GPU node and checking that it now lists an allocatable GPU resource. Seeing that count is the signal that the cluster is ready to schedule GPU pods. If the count is missing, the plugin did not start correctly or the node lacks working drivers, and that is the first thing to fix.
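The check described above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive tool: it assumes the NVIDIA device plugin, whose resource name is `nvidia.com/gpu` (other vendors use other names), and it assumes the input is the JSON that `kubectl get nodes -o json` produces. The node names in the sample are hypothetical.

```python
import json

# Assumption: the NVIDIA device plugin is in use; other vendors advertise
# their GPUs under a different extended-resource name.
GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(node_list_json: str, resource: str = GPU_RESOURCE) -> dict:
    """Map each node name to its allocatable GPU count (0 if none advertised)."""
    counts = {}
    for node in json.loads(node_list_json)["items"]:
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Extended-resource quantities are reported as strings, e.g. "1".
        counts[name] = int(allocatable.get(resource, "0"))
    return counts


if __name__ == "__main__":
    # Hypothetical sample shaped like `kubectl get nodes -o json` output.
    sample = json.dumps({
        "items": [
            {"metadata": {"name": "gpu-node-1"},
             "status": {"allocatable": {"cpu": "8", GPU_RESOURCE: "1"}}},
            {"metadata": {"name": "cpu-node-1"},
             "status": {"allocatable": {"cpu": "4"}}},
        ]
    })
    for node_name, count in allocatable_gpus(sample).items():
        print(f"{node_name}: {count}")
```

A node that reports 0 here is either a CPU node or a GPU node whose plugin or drivers are not working, which is the case to investigate first.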

Step Three: Request a GPU in a Pod

Now you can ask for a GPU. In the pod specification, add the GPU resource to the container's resource limits. This tells the scheduler the pod needs an accelerator, and it will only place the pod on a node that has one free.

Resource limit: specify the GPU resource name and the whole number of GPUs your container needs.
Container image: use an image built with GPU support so your code can actually use the accelerator.
Node selection: optionally target the GPU node pool explicitly with labels, to keep GPU pods off CPU nodes.

Because GPUs are whole-unit resources by default, the count must be an integer, and a request for one GPU reserves a full device for the pod. If multiple pods request GPUs, the scheduler places each one on a node with enough free devices, and any pod that cannot be placed stays Pending until capacity frees up.
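To make the three ingredients concrete, here is a minimal sketch that builds a pod manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f -` accepts alongside YAML. Several names are assumptions for illustration: the resource name `nvidia.com/gpu` assumes the NVIDIA device plugin, and the image, pod name, and `pool: gpu` label are hypothetical placeholders.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, node_selector=None,
                     resource="nvidia.com/gpu"):
    """Build a single-container pod that claims `gpus` whole GPUs.

    `resource` assumes the NVIDIA device plugin; substitute the name your
    plugin advertises.
    """
    container = {
        "name": name,
        "image": image,
        # Extended resources go under limits; the request defaults to the limit.
        "resources": {"limits": {resource: gpus}},
    }
    # restartPolicy Never suits a one-shot run; this is a choice made for
    # the sketch, not a GPU requirement.
    spec = {"restartPolicy": "Never", "containers": [container]}
    if node_selector:
        # Optional: pin the pod to the GPU node pool by label.
        spec["nodeSelector"] = dict(node_selector)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest(
        name="gpu-test",                                # hypothetical name
        image="registry.example.com/cuda-app:latest",   # hypothetical image
        node_selector={"pool": "gpu"},                  # hypothetical label
    )
    print(json.dumps(manifest, indent=2))
```

Piping the printed JSON into `kubectl apply -f -` would submit the pod. Generating the manifest in code rather than by hand keeps the GPU count and node selector in one place if you later create many similar pods.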

Step Four: Verify the Pod Scheduled

After applying the pod, confirm it reached the Running phase on a GPU node. A pod stuck in Pending usually means no node has an available GPU, either because the device plugin is not reporting them or because all GPUs are already allocated. The events shown by kubectl describe pod include the scheduler's reason for not placing it.

Symptom	Likely cause
Pod pending, no GPU nodes ready	Device plugin not running or drivers missing
Pod pending, GPUs all in use	No free GPU capacity; scale the node pool
Pod running but not using GPU	Image lacks GPU support, or code targets CPU
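
The table can be read as a small decision procedure. The sketch below is a simplification under stated assumptions: it looks only at the pod's phase and at a per-node count of advertised GPUs that you supply, and it ignores other reasons a pod can be Pending, such as image pulls or unrelated resource shortages. The function name and the wording of its messages are illustrative.

```python
def diagnose(pod: dict, gpus_by_node: dict) -> str:
    """Suggest a likely cause from a pod object and advertised GPU counts.

    `pod` is a pod object as returned by the Kubernetes API (parsed JSON).
    `gpus_by_node` maps node name to the allocatable GPU count it reports.
    """
    phase = pod.get("status", {}).get("phase", "Unknown")
    if phase == "Pending":
        if not any(count > 0 for count in gpus_by_node.values()):
            # No node advertises a GPU at all: first row of the table.
            return "no GPUs advertised: check the device plugin and drivers"
        # GPUs exist but the pod was not placed: second row of the table.
        return "GPUs advertised but none free: scale the node pool"
    if phase == "Running":
        # Scheduling worked; GPU use still has to be checked in the container.
        return "scheduled: confirm GPU use from inside the container"
    return f"unexpected phase {phase!r}: inspect the pod's events"


if __name__ == "__main__":
    pending_pod = {"status": {"phase": "Pending"}}
    print(diagnose(pending_pod, {"gpu-node-1": 0}))
    print(diagnose(pending_pod, {"gpu-node-1": 1}))
```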

Once the pod is running, exec into it and confirm the GPU is visible from inside the container. If the accelerator shows up, the full chain works: the node has a GPU, the plugin advertised it, the scheduler placed the pod, and the container can use the hardware.
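
One way to run that in-container check from code is sketched below. It assumes an NVIDIA GPU and that the `nvidia-smi` tool is available inside the container, which is usually the case when the NVIDIA container runtime injects it but is not guaranteed for every image. It also assumes the image ships a Python interpreter.

```python
import shutil
import subprocess


def visible_gpus() -> list:
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if none."""
    # Assumption: NVIDIA hardware; other vendors ship different tools.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        # The tool exists but cannot reach a device or driver.
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"{len(gpus)} GPU(s) visible")
    for line in gpus:
        print(line)
```

An empty result on a pod that was scheduled onto a GPU node points at the image or the container runtime rather than at the scheduler. If the image has no Python, running `nvidia-smi -L` directly through kubectl exec answers the same question.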

Step Five: Scale and Clean Up

For real workloads, configure the cluster autoscaler to add GPU nodes when GPU pods are pending and remove them when idle. Because GPU nodes are costly, scaling the GPU pool to zero when there is no work is one of the most effective ways to control spend on a Kubernetes cluster.

Submit GPU work as pods or jobs that request GPU resources.
Let the autoscaler grow the GPU node pool to meet pending demand.
Run the workload to completion.
Allow the autoscaler to remove emptied GPU nodes so you stop paying for them.
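
To build intuition for the second step, the sketch below estimates how many GPU nodes a set of pending pods would need. It is a deliberately simplified first-fit model and an assumption of this tutorial, not the cluster autoscaler's actual algorithm: it considers only GPU counts and ignores CPU, memory, and existing nodes. The one constraint it does capture is that a pod's GPUs must all come from a single node.

```python
def gpu_nodes_needed(pending_requests: list, gpus_per_node: int) -> int:
    """Estimate nodes needed to place pods that each need N whole GPUs."""
    free = []  # free GPUs left on each node added so far
    # Placing the largest requests first tends to pack nodes more tightly.
    for need in sorted(pending_requests, reverse=True):
        if need > gpus_per_node:
            raise ValueError(
                f"a pod needs {need} GPUs but a node has only {gpus_per_node}"
            )
        for i, capacity in enumerate(free):
            if capacity >= need:
                free[i] -= need  # fits on a node we already added
                break
        else:
            free.append(gpus_per_node - need)  # add a new node for this pod
    return len(free)


if __name__ == "__main__":
    # Three pending pods needing 2, 1, and 1 GPUs on 4-GPU nodes.
    print(gpu_nodes_needed([2, 1, 1], gpus_per_node=4))
```

With no pending requests the function returns 0, which corresponds to the scale-to-zero state described above.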

For batch-style work, expressing it as a Kubernetes job rather than a bare pod gives you automatic retries and clean completion semantics, which fits training and processing tasks well.

Common Gotchas

Two issues catch people repeatedly. The first is driver mismatch: the GPU nodes need compatible drivers for the device plugin and your container runtime to work together, and a mismatch leaves GPUs unadvertised. The second is image compatibility: a container built without GPU support will schedule onto a GPU node and then quietly run on the CPU, so always use a GPU-enabled image and verify usage from inside the pod.

Sharing GPUs Across Pods

By default Kubernetes treats a GPU as an indivisible unit, so one pod claims a whole device. For small models and light inference that is wasteful, and the cluster can do better with GPU sharing. On supported hardware, partitioning splits a single GPU into isolated slices that each pod can request independently, giving several workloads hardware-enforced fractions of one card. Time-slicing is a simpler alternative that rotates GPU access between pods. It suits workloads that tolerate contention, such as development and light batch jobs, but it does not give each pod a guaranteed share of the device.

Enabling sharing changes what the device plugin advertises, so the cluster reports more schedulable GPU units per physical card. This is one of the highest-leverage ways to raise utilization on an expensive GPU pool, but reserve it for workloads that tolerate co-tenancy, and keep latency-sensitive serving on isolated devices.
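
The effect on advertised capacity is simple arithmetic, sketched below. The parameter name `units_per_gpu` is a placeholder for whatever your plugin calls the number of slices or replicas per card, and the model assumes every card on the node is shared the same way.

```python
def advertised_gpu_units(physical_gpus: int, units_per_gpu: int = 1) -> int:
    """Schedulable GPU units a node reports once sharing is enabled.

    units_per_gpu=1 models the default of one whole device per pod.
    """
    if physical_gpus < 0 or units_per_gpu < 1:
        raise ValueError("need physical_gpus >= 0 and units_per_gpu >= 1")
    return physical_gpus * units_per_gpu


if __name__ == "__main__":
    # A node with 2 cards, each shared 4 ways, can host 8 single-unit pods.
    print(advertised_gpu_units(2, units_per_gpu=4))
```

The count rises but the hardware does not: under time-slicing those eight pods still share two cards' worth of compute and memory.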

Running Training as a Job

For training and other batch work, prefer a Kubernetes job over a bare pod. A job manages the workload to completion and retries on failure up to a configurable limit, which matches the lifecycle of a training run far better than a long-lived pod. Finished jobs and their pods stay in the cluster until they are deleted, unless you set a time-to-live so they are removed automatically. Combined with a GPU request and the autoscaler, a job can pull up GPU capacity on demand, run to completion, and let that capacity scale back down, so you pay for the GPU only while the run is actually executing.
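
A job manifest for such a run might look like the following sketch, again built as a Python dictionary and printed as JSON for `kubectl apply -f -`. The fields `backoffLimit` and `ttlSecondsAfterFinished` are standard parts of the batch/v1 Job spec. The resource name `nvidia.com/gpu` assumes the NVIDIA device plugin, and the job name, image, and command are hypothetical placeholders.

```python
import json


def gpu_job_manifest(name, image, command, gpus=1, backoff_limit=3,
                     ttl_seconds=3600, resource="nvidia.com/gpu"):
    """Build a batch/v1 Job whose single container claims `gpus` whole GPUs."""
    container = {
        "name": name,
        "image": image,
        "command": list(command),
        "resources": {"limits": {resource: gpus}},
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Number of retries before the job is marked failed.
            "backoffLimit": backoff_limit,
            # Delete the finished job and its pods after this many seconds.
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    # Jobs require Never or OnFailure; Never makes each
                    # retry start in a fresh pod.
                    "restartPolicy": "Never",
                    "containers": [container],
                }
            },
        },
    }


if __name__ == "__main__":
    job = gpu_job_manifest(
        name="train-run",                               # hypothetical name
        image="registry.example.com/trainer:latest",    # hypothetical image
        command=["python", "train.py"],                 # hypothetical command
    )
    print(json.dumps(job, indent=2))
```

Because a retry restarts the container from scratch, a long training run benefits from writing checkpoints to storage that outlives the pod.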

Conclusion

Running a GPU workload on Kubernetes follows a clear chain: create a GPU node pool, install the device plugin so the cluster can see the accelerators, request a GPU in the pod spec, and verify the pod scheduled and can reach the hardware. Add autoscaling to grow and shrink the GPU pool with demand, scaling to zero when idle to control cost. Master this loop and Kubernetes becomes a powerful, repeatable platform for everything from one-off GPU jobs to production inference and training at scale.
