Set Up Multi-GPU Distributed Training With PyTorch DDP
A practical guide to setting up multi-GPU distributed training with PyTorch DistributedDataParallel, including process groups, launch, and debugging.
When a single GPU is no longer enough, PyTorch DistributedDataParallel, usually shortened to DDP, is the standard way to scale training across many GPUs. It replicates the model on each GPU, splits the data, and synchronizes gradients so every replica stays consistent. Done right it scales nearly linearly. Done wrong it hangs, stalls, or silently trains slower than a single card. This tutorial walks through setting up DDP, launching a multi-GPU run, and recognizing the failure modes that waste GPU hours.
How DDP Works
DDP runs one process per GPU. Each process holds a full copy of the model and trains on its own shard of the data. After the backward pass, DDP averages gradients across all processes so the weight updates match. The averaging happens through a collective communication step, and that step is where most distributed problems live.
Because each process sees different data but ends with identical gradients, the result is mathematically similar to a much larger batch trained on one device, but spread across hardware. This is data parallelism, the most common and most approachable form of distributed training.
Prepare the Environment
DDP relies on a few environment pieces being correct before any training code runs.
- A communication backend suited to GPUs for fast collective operations.
- Coordinates for each process: a world size, a rank, and a local rank.
- A rendezvous address so processes can find each other.
- Consistent library and driver versions across all GPUs and nodes.
On a single multi-GPU machine this is simpler because all processes share a host. Across multiple nodes you also need network reachability between hosts and a shared understanding of who the coordinator is.
Wire DDP Into the Training Code
The code changes are modest but specific. Each one matters.
- Initialize the process group at startup so processes can communicate.
- Pin each process to its own GPU using the local rank.
- Wrap the model in DDP so gradients synchronize automatically.
- Use a distributed sampler so each process sees a unique slice of the data.
- Set the sampler epoch each epoch so shuffling stays correct across processes.
- Guard logging and checkpointing so only one process writes, avoiding clobbered files.
The distributed sampler is easy to forget and important. Without it, every process trains on the same data, which wastes the extra GPUs entirely and quietly hurts results.
Launch the Run
DDP jobs are started with a launcher that spawns one process per GPU and supplies the coordination variables. On a single node it launches all local processes. Across nodes you start the launcher on each host with matching rendezvous settings.
| Scope | Setup | Watch for |
|---|---|---|
| Single node | One launcher, many GPUs | Each process on its own GPU |
| Multi node | Launcher per host, shared rendezvous | Network reachability, matching config |
Start small. Confirm a two-GPU run is correct before scaling to many GPUs or many nodes, because bugs are far easier to find at small scale.
Common Failure Modes
- Hang at startup: processes cannot reach the rendezvous address, often a networking or port issue.
- Mismatched world size: the number of processes does not match what was configured, so the group never forms.
- No speedup: a missing distributed sampler means every GPU trains identical data.
- Slow scaling: communication overhead dominates, common with small models or slow interconnects.
- Corrupted checkpoints: multiple processes writing the same file at once.
When scaling efficiency is poor, suspect communication. Larger batches per GPU, faster interconnects, and overlapping computation with gradient sync all help. If the model is tiny, the synchronization cost may simply outweigh the benefit of more GPUs.
Decide What to Scale
DDP shines when the model fits on one GPU and you want to train faster on more data. If the model itself does not fit on a single GPU, data parallelism alone is not enough and you move toward model or tensor parallelism. Knowing which problem you have prevents a lot of wasted effort.
Tune for Scaling Efficiency
Getting DDP to run is one milestone; getting it to scale efficiently is another. As you add GPUs, the share of time spent communicating gradients rises, and at some point adding more hardware stops helping. A few levers push that point further out.
- Larger per-GPU batches: more compute between synchronization points improves the ratio of work to communication.
- Gradient accumulation: simulate a larger batch without exceeding memory, reducing sync frequency.
- Faster interconnect: high-bandwidth links between GPUs and nodes shrink the communication cost directly.
- Overlap: DDP can overlap gradient communication with the backward pass, hiding some of the cost.
Measure scaling efficiency explicitly. Compare throughput on one GPU against throughput on many, and watch how far it falls short of perfect scaling. If efficiency drops sharply, communication is dominating and the levers above are where to look.
Checkpoint and Recover Cleanly
Long distributed runs will be interrupted, whether by a spot reclaim, a node failure, or a planned restart. A run that cannot resume wastes every GPU hour it had accumulated. Save checkpoints on a regular cadence from a single designated process, so replicas do not clobber each other, and store them where any replacement instance can read them. On restart, every process should load the same checkpoint and rejoin the process group cleanly. Getting checkpoint and recovery right turns an interruption from a catastrophe into a short delay, which is what makes long multi-GPU runs economically viable, especially on interruptible capacity.
Know When to Go Beyond Data Parallelism
DDP solves one specific problem: training faster on more data when the model fits on a single GPU. When the model itself is too large for one GPU, data parallelism alone cannot help, because every replica still needs a full copy of the model. That is the boundary where teams move toward sharded, tensor, or pipeline parallel strategies that split the model across devices. Recognizing which problem you have, a data-bound one or a model-size-bound one, saves you from forcing DDP to do something it structurally cannot, and points you at the right tool the moment you outgrow it.
PyTorch DDP is the dependable workhorse of distributed training. Get the process group, GPU pinning, model wrapper, and distributed sampler right, launch with the proper coordination variables, and validate at small scale before going big. Most distributed pain traces back to communication setup or a missing sampler, so check those first when a run hangs or fails to speed up. With the fundamentals solid, DDP turns a fleet of GPUs into near-linear training throughput.