H100 vs A100 vs H200: which training GPU to choose

When you spin up a training or fine-tuning run, the choice usually comes down to three NVIDIA data-center GPUs: the A100 80GB, the H100, and the H200. They span two generations and a wide price band, and picking the wrong one means either paying for capability you cannot use or fighting out-of-memory errors you should not have to. This guide compares them on the dimensions that actually move your training time and bill.

The short version

A100 80GB: Ampere generation, still capable, and the value pick when you can find it cheaply. Best for budget fine-tuning and models that fit comfortably in 80 GB.
H100: Hopper generation, the current default for serious training. Roughly 3x the A100's BF16/FP16 throughput, plus FP8 support. The safe mainstream choice.
H200: Same Hopper compute as H100 but with far more, faster memory. The pick when memory capacity or bandwidth is your bottleneck.

Spec comparison

Metric	A100 80GB	H100 (SXM)	H200 (SXM)
Architecture	Ampere	Hopper	Hopper
VRAM	80 GB HBM2e	80 GB HBM3	141 GB HBM3e
Memory bandwidth	~2.0 TB/s	~3.35 TB/s	~4.8 TB/s
BF16/FP16 dense	Baseline	~3x A100	~3x A100
FP8 support	No	Yes	Yes
NVLink	Yes	Yes (faster)	Yes (faster)
Relative on-demand price	Low	Mid-high	Highest

The throughput figures are approximate and workload-dependent; treat them as rough ratios, not benchmarks. For current hourly rates, see the live pages for the A100 80GB, H100, and H200.

Memory: the dimension that usually decides it

For large-model work, VRAM capacity and bandwidth often matter more than raw compute. The A100 and H100 both ship 80 GB, while the H200 jumps to 141 GB of HBM3e. That extra headroom is decisive in two cases:

Avoiding sharding. When model weights, optimizer states, and activations together overflow 80 GB, you are forced onto multiple GPUs, which adds communication overhead. If the job fits in 141 GB, one H200 can replace two 80 GB cards for that run.
Bandwidth-bound work. LLM inference and many training steps are memory-bandwidth bound, not compute bound. The H200's ~4.8 TB/s feeds the cores faster, so it can post real throughput gains over the H100 even though their compute units are the same.
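
To make the capacity question concrete, here is a minimal sketch of a memory estimate. It assumes mixed-precision training with Adam at about 16 bytes per parameter (BF16 weights and gradients plus FP32 master weights, momentum, and variance), and it takes activation memory as a number you supply, since that depends heavily on batch size, sequence length, and checkpointing. The function names and the 10% headroom are illustrative assumptions, not part of any tool mentioned in this guide.

```python
# Rough training-memory estimate for mixed-precision Adam.
# All byte counts are assumptions; adjust them for your optimizer and precision.

GPU_VRAM_GB = {"A100 80GB": 80, "H100": 80, "H200": 141}  # capacities from the spec table


def training_memory_gb(params_billion, activation_gb, bytes_per_param=16):
    """Estimate weights + gradients + optimizer state, plus caller-supplied activations.

    bytes_per_param=16 assumes BF16 weights (2) + BF16 gradients (2)
    + FP32 master weights, momentum, and variance (4 + 4 + 4) for Adam.
    """
    state_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB
    return state_gb + activation_gb


def cards_that_fit(required_gb, headroom=0.9):
    """Return the GPUs whose VRAM, derated by a safety headroom, covers the estimate."""
    return [name for name, vram in GPU_VRAM_GB.items() if required_gb <= vram * headroom]


# Hypothetical job: full fine-tune of a 6B-parameter model with ~10 GB of activations.
need = training_memory_gb(params_billion=6, activation_gb=10)
print(f"Estimated need: {need} GB, fits on: {cards_that_fit(need)}")
```

Under these assumptions the hypothetical 6B job needs about 106 GB, so only the H200 holds it on a single card. Parameter-efficient methods such as LoRA, or 8-bit optimizers, change the bytes-per-parameter figure substantially, so treat the default as a starting point.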

Compute: where H100 and H200 pull ahead

Both Hopper cards deliver roughly 3x the A100's BF16/FP16 dense throughput and add native FP8. If your training framework and model support FP8, for example through Transformer Engine mixed precision, Hopper can roughly double effective throughput again on compatible layers. The A100 offers neither the higher base throughput nor FP8, which is the main reason a newer card with a higher hourly rate often finishes the same job for less total money.
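
The phrase "on compatible layers" matters: the FP8 gain applies only to the share of each step spent in layers that can run in FP8. A rough Amdahl-style sketch shows how that share caps the overall speedup. The 2x per-layer figure is taken from the estimate above, and the fractions are hypothetical inputs you would measure with a profiler.

```python
def overall_speedup(fp8_fraction, fp8_layer_speedup=2.0):
    """Amdahl-style estimate: only the FP8-compatible share of step time gets faster.

    fp8_fraction: share of step time spent in FP8-capable layers (0 to 1).
    fp8_layer_speedup: assumed speedup of those layers when run in FP8.
    """
    if not 0.0 <= fp8_fraction <= 1.0:
        raise ValueError("fp8_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - fp8_fraction) + fp8_fraction / fp8_layer_speedup)


for frac in (0.5, 0.8, 1.0):
    print(f"{frac:.0%} of step time in FP8-capable layers -> {overall_speedup(frac):.2f}x")
```

With half of the step time in FP8-capable layers the end-to-end gain is about 1.33x, and with 80% it is about 1.67x. The full 2x only appears if nearly all of the step runs in FP8.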

Price-per-hour vs price-per-result

The hourly rate is the wrong number to optimize alone. What matters is cost to a finished, trained model:

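Total cost = hourly rate x GPU count x ideal run hours / utilization, where ideal run hours is the runtime at full GPU utilization and utilization is the fraction of billed time (between 0 and 1) that the GPUs spend on useful work.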

An H100 may cost meaningfully more per hour than an A100 but finish a compute-bound training run in roughly a third of the wall-clock time, making it cheaper overall. An H200 may cost more per hour than an H100 but let you drop from two GPUs to one and avoid sharding overhead. Plug your real numbers into the GPU training cost calculator before deciding; intuition about hourly rates is often wrong here.
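
The formula above is easy to sanity-check in a few lines. This sketch reads the hours term as the runtime at full utilization, which is an assumption about how the formula is meant to be used. The rates and runtimes are hypothetical placeholders, not quotes for any provider.

```python
def total_cost(hourly_rate, gpu_count, ideal_hours, utilization):
    """Total cost = hourly rate x GPU count x hours / utilization.

    ideal_hours is the runtime at full utilization; dividing by utilization
    (a fraction in (0, 1]) converts it to billed hours.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate * gpu_count * ideal_hours / utilization


# Hypothetical rates and runtimes, chosen only to illustrate the comparison:
# the H100 costs twice as much per hour but finishes in a third of the time.
a100 = total_cost(hourly_rate=2.00, gpu_count=1, ideal_hours=90, utilization=0.9)
h100 = total_cost(hourly_rate=4.00, gpu_count=1, ideal_hours=30, utilization=0.9)
print(f"A100: ${a100:.2f}  H100: ${h100:.2f}")
```

With these made-up inputs the A100 run comes to about $200 and the H100 run to about $133, even though the H100's hourly rate is double.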

When each one makes sense

Choose the A100 80GB when

You are fine-tuning or training a model that fits in 80 GB and you are cost-sensitive.
You do not need FP8 and can tolerate longer wall-clock times.
You find it deeply discounted on spot or community capacity.

Choose the H100 when

You want the mainstream, well-supported default for training in 2026.
Your model fits in 80 GB and you want FP8 plus roughly 3x the A100's BF16/FP16 throughput.
You are building multi-node clusters where H100 plus InfiniBand is widely available.

Choose the H200 when

Your model or batch overflows 80 GB and you want to avoid sharding.
Your workload is memory-bandwidth bound, such as long-context LLM training or inference.
Consolidating onto fewer, larger GPUs simplifies your topology.
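
If you are unsure whether the bandwidth case applies to you, a back-of-envelope ceiling for single-stream LLM decoding can help. The sketch below assumes each generated token reads every weight from memory once and ignores KV-cache traffic and compute time, so it is an upper bound rather than a prediction. The bandwidth figures come from the spec comparison; the 13B model is a hypothetical example.

```python
BANDWIDTH_TBPS = {"A100 80GB": 2.0, "H100": 3.35, "H200": 4.8}  # approximate, from the spec table


def decode_tokens_per_s_ceiling(params_billion, bytes_per_param, bandwidth_tbps):
    """Upper bound for batch-1 decoding: every token reads all weights once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tbps * 1e12 / weight_bytes


# Hypothetical 13B-parameter model stored in BF16 (2 bytes per parameter).
for name, bw in BANDWIDTH_TBPS.items():
    ceiling = decode_tokens_per_s_ceiling(13, 2, bw)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s per stream")
```

The absolute numbers matter less than the ratio: in this bandwidth-bound regime the H200's ceiling is about 1.4x the H100's, purely from memory bandwidth.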

How to decide in practice

Size your memory. Estimate model + optimizer + activation memory. If it overflows 80 GB, the H200 is in play immediately.
Check your precision. If your stack depends on FP8 (for example through Transformer Engine), rule out the A100.
Estimate runtime on each. Use rough throughput ratios to project wall-clock hours per card.
Compute total cost. Feed rates, GPU count, and runtime into the calculator.
Compare side by side. Use the comparison tool to line up the finalists and confirm against live rates in the GPU catalog.
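
The first four steps can be sketched as a small filter-and-rank routine. Everything here is illustrative: the Gpu class, the function name, and the hourly rates are hypothetical, and the relative speeds are the rough ratios from the spec comparison, which ignore the H200's bandwidth advantage and any multi-GPU sharding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    vram_gb: int
    fp8: bool
    relative_speed: float  # rough BF16/FP16 throughput vs. the A100, from the spec table


GPUS = [
    Gpu("A100 80GB", 80, False, 1.0),
    Gpu("H100", 80, True, 3.0),
    Gpu("H200", 141, True, 3.0),
]


def rank_candidates(required_gb, needs_fp8, a100_hours, hourly_rates):
    """Filter on memory and precision, project runtime, then sort by total cost."""
    results = []
    for gpu in GPUS:
        if required_gb > gpu.vram_gb:   # would need sharding; out of scope for this sketch
            continue
        if needs_fp8 and not gpu.fp8:   # FP8 requirement rules out Ampere
            continue
        hours = a100_hours / gpu.relative_speed  # projected single-GPU wall-clock hours
        results.append((gpu.name, hours * hourly_rates[gpu.name]))
    return sorted(results, key=lambda item: item[1])


# Hypothetical hourly rates; replace them with live rates before deciding.
rates = {"A100 80GB": 2.0, "H100": 4.5, "H200": 6.5}
print(rank_candidates(required_gb=60, needs_fp8=False, a100_hours=90, hourly_rates=rates))
print(rank_candidates(required_gb=100, needs_fp8=False, a100_hours=90, hourly_rates=rates))
```

With these placeholder rates the 60 GB job ranks the H100 cheapest, while the 100 GB job leaves the H200 as the only single-card option. The final step, checking finalists against live rates, still has to happen outside the script.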

Takeaway

The A100 80GB is the budget option for jobs that fit; the H100 is the mainstream training default; the H200 is the answer when memory capacity or bandwidth is your wall. Decide on memory and precision first, then optimize for cost-per-result rather than cost-per-hour. Confirm the current rates on the H100, A100 80GB, and H200 pages, and let the training cost calculator settle close calls.