H100 vs A100: When Older GPUs Still Win on Cost

GPU Infrastructure

H100 is faster than A100 on every benchmark. But faster doesn't mean cheaper per job. Here's the break-even maths and the workloads where A100 at $1.20/hr still wins.

GPUaaS.com Team
Infrastructure Research
May 26, 2026

The H100 is faster than the A100 on every benchmark that matters for LLM work. But faster doesn't mean cheaper per job — and for a significant set of workloads in 2026, the A100 at $1.20/GPU/hr still delivers a lower total cost than the H100 at $1.49/GPU/hr. This guide tells you exactly which side of that line your workload sits on.

Key takeaways
  • H100 and A100 both carry 80 GB of VRAM. The H100 uses HBM3 at 3.35 TB/s vs the A100's HBM2e at roughly 2.0 TB/s (2,039 GB/s on the datasheet), a 64% bandwidth advantage that drives most real-world performance differences (NVIDIA)
  • H100 delivers 2–4x faster LLM inference than A100 in real-world serving. On FP8-optimised workloads, the gap can reach 10x+ — but that requires Hopper-specific kernels and framework support
  • A100 on GPUaaS.com starts at $1.20/GPU/hr on-demand vs H100 at $1.49/GPU/hr, a 24% hourly premium for H100. The H100 has to be at least 1.24x faster to be cheaper per job; below that speedup, the A100 wins on cost-per-job
  • A100 has no native FP8 support. BF16 is the practical precision ceiling. Teams on CUDA 11.x, TensorFlow 2.x, or older PyTorch without Hopper kernels see near-identical performance on both GPUs (roughly 1.0–1.1x)
  • The A100 wins on cost wherever the measured H100 speedup stays under about 1.24x, most often: sub-30B inference at INT8, QLoRA fine-tuning up to 70B, batch jobs without latency SLAs, legacy stacks, and MIG-partitioned multi-tenant inference

Most GPU comparison guides reach the same conclusion: H100 wins, upgrade now. That's the right call for production inference serving high-traffic applications. It's the wrong call for research teams fine-tuning 13B models twice a week, batch jobs that run overnight, or anyone on a CUDA stack that doesn't expose Hopper's FP8 kernels. This guide does the maths on both sides.

For VRAM sizing on either GPU, see the VRAM guide. For the H100 vs H200 decision, see the H200 vs H100 rental guide. For cluster options, see the GPUaaS.com cluster catalogue.

◆ SPECS COMPARISON
Specs side by side: what Hopper actually adds

The A100 (Ampere, 2020) and H100 (Hopper, 2022) are one data-centre generation apart. The headline VRAM is identical at 80 GB on the SXM variants, but the memory type, bandwidth, compute architecture, and precision support are all meaningfully different. The differences that matter most in practice are bandwidth and FP8.

| Spec | A100 SXM4 | H100 SXM5 | Impact |
| --- | --- | --- | --- |
| Architecture | Ampere (GA100) | Hopper (GH100) | Different architecture |
| VRAM | 80 GB HBM2e | 80 GB HBM3 | Identical capacity, faster type |
| Memory bandwidth | 2.0 TB/s | 3.35 TB/s | +64%, primary inference speedup |
| FP8 support | No | Yes (Transformer Engine) | Doubles throughput on FP8 workloads |
| BF16 TFLOPS | 312 | 989 | 3.2x compute advantage |
| FP8 TFLOPS | N/A | 1,979 | H100-only; A100 tops out at BF16 |
| NVLink bandwidth | 600 GB/s | 900 GB/s | +50% multi-GPU communication |
| TDP | 400W | 700W | H100 draws 75% more power |
| MIG partitions | 7 | 7 | Identical MIG support |
| Price/GPU/hr (GPUaaS.com) | from $1.20 | from $1.49 | 24% H100 premium hourly |

The key insight

The VRAM is the same. The A100's bottleneck is bandwidth (2.0 TB/s vs 3.35 TB/s) and the hard ceiling of BF16 — no FP8, no Transformer Engine. For workloads that don't need FP8 and don't saturate the bandwidth ceiling, A100 and H100 land closer than the spec sheet suggests.
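To make the bandwidth point concrete, here is a rough sketch of a memory-bound decode estimate. It assumes single-stream generation where every new token reads all model weights once, and it ignores KV cache traffic, batching, compute limits, and kernel overhead, so treat the result as an upper bound rather than a benchmark. The function name and the 13B INT8 example are illustrative, not taken from any framework.

```python
def decode_tokens_per_sec_bound(params_billion: float,
                                bytes_per_param: float,
                                bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode speed when each token reads all weights once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes


if __name__ == "__main__":
    # Bandwidth figures from the specs table: A100 2.0 TB/s, H100 3.35 TB/s.
    a100 = decode_tokens_per_sec_bound(13, 1.0, 2.0)    # 13B model at INT8
    h100 = decode_tokens_per_sec_bound(13, 1.0, 3.35)
    print(f"A100 bound: {a100:.0f} tok/s")
    print(f"H100 bound: {h100:.0f} tok/s")
    print(f"ratio: {h100 / a100:.2f}x")
```

Under this assumption the ratio between the two GPUs equals the bandwidth ratio regardless of model size, which is why a memory-bound workload sees far less than the 3.2x compute advantage on the spec sheet.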

◆ PERFORMANCE
Real-world performance: the 2x–4x gap and when it disappears

For LLM inference on Llama 3 70B, H100 delivers roughly 250–300 tokens/sec with TensorRT-LLM and FP8 optimisation. A100 at BF16 lands at ~130 tokens/sec on the same model — a roughly 2x gap. That gap is real and it compounds at scale. But it's also specific to large-model, production-serving scenarios with optimised runtimes.

For a 13B model at INT8, the gap narrows significantly. Both GPUs fit the model in 80 GB. The A100's lower bandwidth costs you some throughput, but when your target is 200 tok/s for a development API, neither GPU is the bottleneck. The limiting factor becomes what you're paying per hour, not what the hardware can do.

Where H100 wins clearly

  • FP8-optimised inference (vLLM 0.4+, TensorRT-LLM 0.10+)
  • High-traffic real-time serving with sub-100ms latency SLAs
  • Pre-training or continued pre-training on 70B+ models
  • Large batch inference where bandwidth saturates the A100
  • Teams fully on PyTorch 2.1+ with Hopper kernel support

Where A100 matches or beats H100 on cost

  • Sub-30B inference at INT8 where throughput target is modest
  • QLoRA fine-tuning on models up to 70B (4-bit base plus adapters fit in 80 GB)
  • Overnight batch jobs without latency SLAs
  • Older CUDA stacks (11.x) without Hopper kernel support
  • MIG-partitioned multi-tenant inference at 7 partitions

Llama 3 70B inference throughput, tokens per second (single GPU, TensorRT-LLM)

  • H100 FP8: ~275 tok/s
  • H100 BF16: ~150 tok/s
  • A100 BF16: ~130 tok/s

Representative throughput figures. Actual results vary with batch size, context length, and framework version. Source: Cudo Compute benchmarks.

⚡ The BF16 comparison is the honest one

When people quote 10x or 30x H100 vs A100 gains, they're comparing H100 FP8 against A100 BF16. That pits a GPU running a precision it supports natively against one that has no FP8 path at all. The like-for-like comparison, H100 BF16 vs A100 BF16, is closer to 1.5x–2x. Know which comparison is relevant for your actual stack before you upgrade.

For Llama 3 70B inference at BF16, H100 delivers roughly 150 tokens per second against roughly 130 on A100 in the single-GPU figures above; on the same framework stack the like-for-like gap is usually quoted at 1.5x–2x, and it narrows further on sub-30B models. Source: Tom's Hardware.

◆ COST-PER-JOB
Cost-per-job maths: when A100 actually wins

Hourly rate isn't the right metric. Cost per completed job is. A100 at $1.20/hr is cheaper per hour, but if H100 finishes the same job in half the time, the H100 costs less in total. The crossover depends on how large the actual speedup is for your specific workload.

The break-even formula

A100 wins on total cost when: A100 hours × $1.20 < H100 hours × $1.49. Equivalently, H100 needs to finish the job in under 80.5% of the A100 time — a 1.24x speedup minimum — to be cheaper in total. Below that threshold, stick with A100.
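The formula above can be sketched in a few lines of Python. This is a minimal illustration that assumes pure per-GPU-hour billing at the two on-demand rates quoted in this article; the function names are hypothetical, and the speedup is something you would measure on your own workload.

```python
A100_RATE = 1.20  # $/GPU/hr, on-demand figure quoted in this article
H100_RATE = 1.49  # $/GPU/hr, on-demand figure quoted in this article


def break_even_speedup(slow_rate: float = A100_RATE,
                       fast_rate: float = H100_RATE) -> float:
    """Minimum speedup the pricier GPU needs to match the cheaper GPU's job cost."""
    return fast_rate / slow_rate


def job_costs(a100_hours: float, h100_speedup: float) -> tuple[float, float]:
    """Return (A100 cost, H100 cost) for a job that takes `a100_hours` on A100."""
    a100_cost = a100_hours * A100_RATE
    h100_cost = (a100_hours / h100_speedup) * H100_RATE
    return a100_cost, h100_cost


def cheaper_gpu(h100_speedup: float) -> str:
    """Name the GPU with the lower total job cost at a given H100 speedup."""
    a100_cost, h100_cost = job_costs(1.0, h100_speedup)
    return "A100" if a100_cost < h100_cost else "H100"


if __name__ == "__main__":
    print(f"break-even speedup: {break_even_speedup():.2f}x")
    for speedup in (1.1, 1.3, 1.5, 2.0):
        a100, h100 = job_costs(10.0, speedup)  # a job that takes 10 hours on A100
        print(f"{speedup:.1f}x: A100 ${a100:.2f}, H100 ${h100:.2f}, "
              f"cheaper: {cheaper_gpu(speedup)}")
```

Plugging in a measured speedup from a short benchmark run is more reliable than relying on a published multiple, since the result flips close to 1.24x.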

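| Workload | H100 speedup | Cost-per-job winner | Why |
| --- | --- | --- | --- |
| 70B inference, H100 FP8 vs A100 BF16 | ~3–4x | H100 | 3–4x speedup more than offsets the 24% hourly premium |
| 70B inference, H100 BF16 vs A100 BF16 | ~1.5–2x | H100 | 1.5x+ still clears the 1.24x break-even |
| 13B inference, INT8, modest throughput | ~1.2–1.3x | A100 | Speedup hovers around the 1.24x break-even, so the H100 premium buys little; A100 cheaper per job at the low end |
| QLoRA fine-tuning, 13B–70B | ~1.3–1.5x | Close call | Just above the 1.24x break-even: H100 is about 4–17% cheaper per job at these speedups, and A100 wins if your measured speedup falls below 1.24x |
| Pre-training, 70B+ BF16/FP8 | ~3–5x | H100 | Transformer Engine + FP8 compounds over millions of steps |
| Overnight batch inference (no SLA) | ~2x | H100 per hour, A100 per fixed window | At 2x, per-hour billing halves the GPU-hours and H100 is cheaper; A100 wins only if you pay for the whole overnight window regardless of finish time |
| Legacy CUDA 11.x stack (no Hopper kernels) | ~1.0–1.1x | A100 | Without Hopper kernels, H100 offers almost no advantage |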

⚠ The overnight batch trap

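H100 finishing a batch job twice as fast matters if you're time-constrained. If you're running overnight anyway, finishing at 2am vs 4am has zero business value in itself. What matters is the bill: on per-hour on-demand pricing, a job that finishes in half the time also bills half the hours, so the break-even formula above still decides. Don't pay the H100 premium when you're charged for the whole overnight window regardless, or when your measured speedup is under 1.24x.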

GPUaaS.com infrastructure data: teams running QLoRA fine-tuning on 13B–70B models on A100 at $1.20/hr vs H100 at $1.49/hr see roughly 20% lower total job cost on A100. A saving of that size implies an effective H100 speedup close to 1.0x on those jobs; a sustained 1.3–1.5x speedup would clear the 1.24x break-even threshold.

◆ WORKLOAD DECISION GUIDE
Workload decision guide: A100 or H100?

Three questions determine your answer. Work through them in order.

Q1

Is your framework stack FP8-capable? (PyTorch 2.1+, vLLM 0.4+, or TensorRT-LLM 0.10+)

No → The biggest H100 gains disappear. Benchmark your BF16 speedup against the 1.24x break-even before paying the 24% hourly premium; on stacks without Hopper kernels it often falls short.

Yes → Continue to Q2.

Q2

Does your workload have a hard latency SLA or time constraint that finishing faster directly solves?

No → Speed has no business value beyond the bill. A100 wins on total cost whenever the measured speedup stays under the 1.24x break-even.

Yes → Continue to Q3.

Q3

Is your model large enough that H100's FP8 and bandwidth advantage materially changes throughput? (Generally 30B+ on optimised frameworks)

No → Rent A100. The speedup on smaller models rarely clears the cost break-even.

Yes → Rent H100. At 2x+ genuine speedup, it's cheaper per job.
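The three questions can be encoded as a small function. This is a sketch rather than a definitive rule: `recommend_gpu`, `_by_cost`, and the optional `measured_speedup` argument are hypothetical names introduced here, and the 30B cut-off is the article's rule of thumb. When a measured speedup is supplied, every "No" branch falls back to the break-even check instead of assuming A100.

```python
from typing import Optional

A100_RATE = 1.20  # $/GPU/hr, figure quoted in this article
H100_RATE = 1.49  # $/GPU/hr, figure quoted in this article
BREAK_EVEN_SPEEDUP = H100_RATE / A100_RATE  # about 1.24x


def _by_cost(measured_speedup: Optional[float]) -> str:
    """Pick on cost alone; with no measurement, default to the cheaper hourly rate."""
    if measured_speedup is not None and measured_speedup > BREAK_EVEN_SPEEDUP:
        return "H100"
    return "A100"


def recommend_gpu(fp8_capable: bool,
                  has_time_constraint: bool,
                  model_params_billion: float,
                  measured_speedup: Optional[float] = None) -> str:
    # Q1: without an FP8-capable stack the largest H100 gains are unavailable.
    if not fp8_capable:
        return _by_cost(measured_speedup)
    # Q2: without a latency SLA or deadline, only the bill matters.
    if not has_time_constraint:
        return _by_cost(measured_speedup)
    # Q3: the article's rule of thumb is that FP8 and bandwidth pay off at 30B+.
    if model_params_billion < 30:
        return _by_cost(measured_speedup)
    return "H100"


if __name__ == "__main__":
    print(recommend_gpu(True, True, 70))            # FP8 stack, SLA, large model
    print(recommend_gpu(True, False, 13))           # no deadline, small model
    print(recommend_gpu(False, False, 70, 1.6))     # legacy stack, measured 1.6x
```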

Quick reference by workload type

| Workload | Recommended | Why |
| --- | --- | --- |
| Mistral 7B / Llama 3 8B inference | A100 | Small model; H100 speedup minimal on bandwidth-light workloads |
| 13B–30B inference, INT8 | A100 | Fits in 80 GB; H100 gain doesn't clear cost break-even |
| Llama 3 70B inference, FP8, real-time | H100 | FP8 Transformer Engine gives 3–4x speedup; H100 cheaper per job |
| QLoRA fine-tuning, any model up to 70B | A100 | Adapters + 4-bit base fit in 80 GB; H100 speedup modest |
| Full fine-tuning / pre-training, 70B+ | H100 | FP8 + Transformer Engine compounds over millions of training steps |
| Overnight batch scoring / embeddings | A100 | No time constraint, so only the bill matters; A100 wins when the measured speedup is under 1.24x |
| Research / experimentation (short runs) | A100 | Low total hours; cheapest entry cost matters more than throughput |
| MIG multi-tenant inference (7 partitions) | A100 | Same 7-partition MIG as H100 at lower hourly cost |

◆ THE FP8 GAP
The FP8 gap: why the A100 ceiling matters

The A100's hard ceiling is BF16. It supports INT8 and INT4 for inference quantisation, but it has no FP8 hardware support and no Transformer Engine. That means the largest performance gap between the two GPUs only opens when your runtime actually uses FP8 — which requires PyTorch 2.1+, vLLM 0.4+, or TensorRT-LLM 0.10+ with Hopper-specific kernels enabled.

Teams on CUDA 11.x, older TensorFlow builds, or inference frameworks not yet updated for Hopper won't see those gains. They're comparing H100 BF16 against A100 BF16, where the gap is 1.5–2x, not 4–10x. Before you decide to upgrade hardware, check your actual framework version. A framework upgrade is free; wasted hourly spend isn't.

FP8 prerequisites (H100 only)

  • PyTorch 2.1+ with transformer_engine.pytorch
  • vLLM 0.4+ (--kv-cache-dtype fp8)
  • TensorRT-LLM 0.10+
  • CUDA 12.x drivers
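A quick way to audit a stack against the list above is to compare version strings with the listed minimums. The sketch below is pure Python and does not import any of the frameworks; `fp8_blockers` and `parse_version` are hypothetical helpers, the dictionary keys are labels chosen for this example, and the compute-capability threshold (8.9, with A100 reporting 8.0 and H100 reporting 9.0) is an assumption stated in the code. Only the components you pass in are checked, since any one of the listed frameworks is enough.

```python
# Minimum versions listed in this article for FP8 on H100.
FP8_MINIMUMS = {
    "torch": (2, 1),
    "vllm": (0, 4),
    "tensorrt_llm": (0, 10),
    "cuda": (12, 0),
}

# Assumption: FP8 tensor cores start at compute capability 8.9;
# A100 reports 8.0 and H100 reports 9.0.
MIN_FP8_CAPABILITY = (8, 9)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.3.1+cu121' into (2, 3, 1); stops at the first non-numeric part."""
    parts = []
    for chunk in text.split("+")[0].split("."):
        if not chunk.isdigit():
            break
        parts.append(int(chunk))
    return tuple(parts)


def fp8_blockers(installed: dict[str, str],
                 compute_capability: tuple[int, int]) -> list[str]:
    """List the reasons FP8 is unavailable; an empty list means nothing blocks it."""
    blockers = []
    if compute_capability < MIN_FP8_CAPABILITY:
        blockers.append(f"compute capability {compute_capability} has no FP8 hardware")
    for name, version_text in installed.items():
        minimum = FP8_MINIMUMS.get(name)
        if minimum is not None and parse_version(version_text) < minimum:
            wanted = ".".join(str(part) for part in minimum)
            blockers.append(f"{name} {version_text} is below {wanted}")
    return blockers


if __name__ == "__main__":
    print(fp8_blockers({"torch": "2.0.1+cu118", "cuda": "11.8"}, (8, 0)))
    print(fp8_blockers({"vllm": "0.4.2", "cuda": "12.1"}, (9, 0)))
```

In practice the version strings could come from `importlib.metadata.version("torch")` and similar calls for whichever packages are installed.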

A100 precision ceiling

  • Training: BF16, FP16, TF32, FP32
  • Inference: INT8, INT4
  • No FP8 hardware support
  • No Transformer Engine

When does FP8 matter?

FP8 matters when you're serving large models at high throughput and can tolerate the slight accuracy trade-off (typically within measurement noise for most production workloads). For KV cache quantisation specifically, FP8 halves cache memory usage relative to BF16, directly enabling longer context windows on H100 that the A100 can't match. See the KV cache inference cost guide for exact numbers.
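The halving is easy to check with the standard KV cache size formula: two tensors (keys and values) per layer, per KV head, per token. The sketch below assumes the publicly documented Llama 3 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and a 32K-token sequence; those values and the function name are illustrative assumptions, so substitute your own model config.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int,
                   batch_size: int = 1) -> int:
    """Bytes needed to hold keys and values for every layer and token."""
    # The factor of 2 covers one key tensor and one value tensor.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


# Assumed Llama 3 70B shape: 80 layers, 8 KV heads (GQA), head dimension 128.
LLAMA3_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for label, width in (("BF16", 2), ("FP8", 1)):
        size = kv_cache_bytes(**LLAMA3_70B,
                              context_tokens=32 * 1024,
                              bytes_per_value=width)
        print(f"{label}: {size / 2**30:.1f} GiB per 32K-token sequence")
```

Under these assumptions one 32K-token sequence needs 10 GiB of cache at BF16 and 5 GiB at FP8, so the same memory budget holds twice as many tokens or twice as many concurrent sequences.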

GPUaaS.com infrastructure data: teams migrating from A100 to H100 for real-time inference on Llama 3 70B with FP8 enabled see 3–4x throughput gains and 40–60% lower cost per million output tokens, despite the higher hourly GPU rate.
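Cost per million tokens follows directly from the hourly rate and sustained throughput. As a rough cross-check, the sketch below plugs in this article's representative Llama 3 70B figures (A100 BF16 at about 130 tok/s, H100 FP8 at about 275 tok/s) and the quoted on-demand rates; real serving throughput depends on batch size and context length, and the function name is hypothetical.

```python
def cost_per_million_tokens(rate_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per million output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return rate_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    a100 = cost_per_million_tokens(1.20, 130)  # A100 BF16, representative figure
    h100 = cost_per_million_tokens(1.49, 275)  # H100 FP8, representative figure
    print(f"A100 BF16: ${a100:.2f} per million tokens")
    print(f"H100 FP8:  ${h100:.2f} per million tokens")
    print(f"H100 saving: {1 - h100 / a100:.0%}")
```

With those inputs the H100 comes out roughly 41% cheaper per million tokens, at the low end of the range quoted above; a larger throughput multiple widens the saving.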

◆ MIGRATION PATH
Migration path: A100 to H100 when the time comes

When your workload eventually does warrant the upgrade, A100 to H100 is one of the more straightforward migrations in the NVIDIA lineup. Both GPUs share 80 GB VRAM, an SXM form factor for NVLink clusters (SXM4 and SXM5 respectively), and the same CUDA software stack at the driver level. The key changes are framework versions and precision flags.

Migration checklist

Moving from A100 to H100:

  • Upgrade to CUDA 12.x if not already there
  • Update to PyTorch 2.1+ and enable transformer_engine if training
  • Switch vLLM to 0.4+ and add --kv-cache-dtype fp8 for inference
  • Update TensorRT-LLM if applicable
  • Re-run benchmarks at production batch size; your old BF16 throughput numbers don't reflect FP8 potential

Everything at the model and data layer stays the same.

Signs you've outgrown A100

  • Serving latency regularly exceeds your SLA under load
  • You're running 2+ A100s to hit throughput targets that 1 H100 would cover
  • You need 32K+ context windows and KV cache pressure is chronic
  • You've moved to vLLM 0.4+ or TensorRT-LLM 0.10+ and want FP8

Signs you're fine on A100

  • Models are under 30B and fit in 80 GB with room to spare
  • Throughput targets are below GPU saturation at BF16
  • Workloads are batch jobs, research runs, or low-traffic APIs
  • Framework stack is below the FP8 threshold

◆ FAQ
Frequently asked questions

Is the A100 still worth renting in 2026?

Yes, for the right workloads. The A100 delivers better cost-per-job than H100 for sub-30B inference, QLoRA fine-tuning, batch jobs without time constraints, and teams on CUDA stacks that don't expose FP8. At $1.20/GPU/hr on GPUaaS.com, it's about 19% cheaper per hour than H100 (the same gap as H100's 24% premium), and for many workloads that gap isn't closed by H100's throughput advantage.

How much faster is the H100 than the A100?

It depends on the precision. H100 at FP8 vs A100 at BF16: roughly 3–4x faster on 70B+ models with optimised frameworks. H100 at BF16 vs A100 at BF16: roughly 1.5–2x faster. Claims of 10x–30x come from heavily optimised FP8 scenarios or highly specific benchmarks. For practical planning, use 2–3x as the realistic range for large-model inference on modern frameworks.

How do A100 and H100 prices compare on GPUaaS.com?

A100 SXM starts at $1.20/GPU/hr and H100 SXM5 starts at $1.49/GPU/hr on-demand, a 24% hourly premium for H100. H100 needs to complete a job in under 80.5% of the A100 time (1.24x speedup) to be cheaper in total cost. For many workloads, the real-world speedup doesn't clear that threshold.

Can I run the same models and code on both GPUs?

Yes. Both have 80 GB VRAM and the same CUDA software stack at the driver level. Any model that fits and runs on A100 also runs on H100 without code changes. The difference is precision: H100 can use FP8 quantisation natively, while A100 tops out at BF16 for training and INT8 for inference. Migrating to H100 means updating your framework to expose FP8, not rewriting anything.

Which GPU is better for fine-tuning?

For QLoRA fine-tuning on models up to 70B, A100 is often the better cost choice. The 4-bit base plus adapters plus optimizer states fit in 80 GB on both GPUs, and A100's lower hourly rate wins whenever the measured H100 speedup falls below the 1.24x break-even. That is worth benchmarking, because QLoRA speedups of 1.3–1.5x sit only just above it. For full fine-tuning on 70B+ models where you need FP8 and the Transformer Engine to keep training cost reasonable, H100 is the right call. See the VRAM guide for exact memory figures by training method.

Is the A100 being discontinued?

The A100 NVL variant reached end-of-life in 2025, and NVIDIA no longer manufactures new A100 units. Cloud availability remains strong through 2026 as data centres continue running existing hardware. Prices have dropped as B200 and H200 supply grew, making A100 increasingly attractive for cost-sensitive workloads. For teams on A100 today, there's no urgency to migrate unless your workload has hit the performance ceiling described above. Check A100 cluster availability on GPUaaS.com for current on-demand slots.

Last reviewed: May 27, 2026. For live A100 and H100 cluster pricing, visit the A100 cluster page or H100 cluster page on GPUaaS.com.
