How much does H200 SXM cost to rent per hour?

H200 SXM on-demand rental starts at $3.50/GPU/hr at wholesale providers like GPUaaS.com. An 8x H200 SXM cluster runs $28/hr on-demand. Reserved 1-year contracts bring the rate to roughly $2.10/GPU/hr. AWS p5.48xlarge runs roughly $12.29/GPU/hr equivalent, about 3.5x the wholesale rate.

What is the difference between H200 SXM and H100 SXM?

H200 and H100 SXM share the same GH100 compute die and identical FP8 TFLOPS. The only difference is memory: H200 carries 141 GB HBM3e at 4.8 TB/s vs H100's 80 GB HBM3 at 3.35 TB/s. That 43% bandwidth increase drives a 45% throughput improvement on memory-bound LLM inference workloads.

Can the H200 SXM run Llama 3 405B on a single GPU?

Not on a single GPU, even when quantised. Weights alone for 405B parameters take roughly 810 GB at FP16, 405 GB at FP8, and about 203 GB at INT4, all above the H200 SXM's 141 GB. What the H200 changes is the GPU count: INT4 weights fit across 2x H200 SXM (282 GB) but need at least 3x H100 SXM5 (240 GB), before any KV cache.

Is H200 SXM better than H100 for fine-tuning?

For full fine-tuning of 70B+ models, yes. For QLoRA or LoRA on sub-70B models, H100 SXM5 is usually sufficient and costs 57% less. If your fine-tuning job runs out of VRAM on H100, upgrade to H200. If it fits, stay on H100.

How does H200 SXM compare to B200?

The B200 delivers roughly 2.2x the FP8 inference throughput of H200 and carries 192 GB HBM3e. B200 pricing runs $6-10/GPU/hr in 2026 with constrained availability. For most teams, H200 SXM offers better availability and more predictable pricing right now.

Where can I rent H200 SXM GPUs today?

GPUaaS.com offers 8x H200 SXM HGX clusters from $3.50/GPU/hr on-demand in US (Dallas, Ashburn) and EU (Frankfurt) regions. AWS, Azure, and GCP also offer H200 capacity at higher rates. Visit gpuaas.com/clusters-h200 for current availability.

H200 SXM: Full Spec, Pricing & Where to Rent (2026)

The NVIDIA H200 SXM delivers 141 GB of HBM3e memory at 4.8 TB/s of bandwidth, which is 76% more VRAM and 43% more bandwidth than the H100 SXM5. On memory-bound inference workloads such as Llama 2 70B, that single hardware difference translates to a 45% throughput gain in MLPerf Inference v4.1. Rental pricing starts at $3.50/GPU/hr on wholesale providers like GPUaaS.com in 2026.

Key takeaways

H200 SXM carries 141 GB HBM3e at 4.8 TB/s, compared to H100 SXM5's 80 GB HBM3 at 3.35 TB/s. Identical Hopper compute engine, FP8 Transformer Engine, and NVLink 4.0 interconnect
MLPerf Inference v4.1: H200 SXM achieves 31,712 tokens/sec on Llama 2 70B vs 21,806 for H100 SXM, a 45% throughput gain driven entirely by memory bandwidth
H200 SXM rental starts at $3.50/GPU/hr on-demand at GPUaaS.com, roughly 2.3x the cost of H100 SXM5 at $1.49/GPU/hr
The H200 SXM is the right GPU for 70B+ parameter inference, 32K+ context windows, multi-modal workloads exceeding 80 GB VRAM, and production serving of quantised 405B models
For sub-70B models at standard context lengths, H100 SXM5 at $1.49/hr delivers near-identical throughput at less than half the cost. The H200 premium isn't justified there

In this article

01 Full H200 SXM spec sheet
02 H200 SXM vs H100 SXM: what actually changed
03 Benchmark performance: MLPerf and inference throughput
04 H200 SXM rental pricing in 2026
05 Which workloads actually need the H200 SXM
06 Where to rent H200 SXM clusters in 2026
07 Frequently asked questions

◆ FULL SPEC SHEET

Full H200 SXM spec sheet

Spec	H200 SXM	H100 SXM5
Architecture	Hopper (GH100)	Hopper (GH100)
GPU Memory	141 GB HBM3e	80 GB HBM3
Memory Bandwidth	4.8 TB/s	3.35 TB/s
FP8 Tensor Core TFLOPS	3,958 TFLOPS	3,958 TFLOPS
BF16 Tensor Core TFLOPS	1,979 TFLOPS	1,979 TFLOPS
FP32 (non-Tensor)	67 TFLOPS	67 TFLOPS
Interconnect	NVLink 4.0 (900 GB/s)	NVLink 4.0 (900 GB/s)
Form Factor	SXM5	SXM5
TDP	700W	700W
NVLink Bandwidth (8-GPU)	7.2 TB/s	7.2 TB/s
PCIe Generation	PCIe 5.0	PCIe 5.0
Launch	Q4 2023 (GA 2024)	Q2 2022 (GA 2023)

Source: NVIDIA H200 datasheet. FP8 and BF16 Tensor Core figures are with sparsity.

◆ H200 VS H100

H200 SXM vs H100 SXM: what actually changed

NVIDIA didn't redesign the GPU. They upgraded the memory. The H200 uses the same GH100 die as the H100 SXM5, with identical Tensor Core counts, the same FP8 Transformer Engine, and the same NVLink 4.0 interconnect at 900 GB/s. The only hardware change is swapping HBM3 for HBM3e, which adds 61 GB of capacity and 1.45 TB/s of bandwidth.

That sounds incremental, and for compute-bound workloads it is. Dense matrix multiplications in training are bottlenecked by FLOPS, not memory. But for inference, the GPU spends most of its time loading model weights from HBM into compute units. A 43% bandwidth increase translates almost linearly to throughput, which is why the H200 delivers a 45% tokens-per-second improvement on Llama 2 70B without touching a single compute spec.
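The near-linear relationship can be checked with a back-of-the-envelope estimate. The sketch below is a simplification rather than a performance model: it assumes every generated token streams all model weights from HBM once, and it ignores KV-cache reads, kernel overheads, and batching. The function name and the 70B FP8 workload are illustrative assumptions; the bandwidth and throughput numbers are the ones quoted in this article.

```python
# Does the bandwidth ratio explain the measured throughput gain?
H200_BANDWIDTH_TB_S = 4.8
H100_BANDWIDTH_TB_S = 3.35
H200_TOKENS_PER_SEC = 31_712  # MLPerf Inference v4.1, Llama 2 70B (as quoted)
H100_TOKENS_PER_SEC = 21_806


def decode_ceiling_tokens_per_sec(bandwidth_tb_s: float,
                                  params_billion: float,
                                  bytes_per_param: float) -> float:
    """Upper bound on single-sequence decode speed.

    Assumes each token requires reading every weight from HBM exactly once.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes


bandwidth_ratio = H200_BANDWIDTH_TB_S / H100_BANDWIDTH_TB_S
measured_ratio = H200_TOKENS_PER_SEC / H100_TOKENS_PER_SEC

# Hypothetical workload: a 70B-parameter model with 1-byte (FP8) weights.
h200_ceiling = decode_ceiling_tokens_per_sec(H200_BANDWIDTH_TB_S, 70, 1.0)
h100_ceiling = decode_ceiling_tokens_per_sec(H100_BANDWIDTH_TB_S, 70, 1.0)

print(f"bandwidth ratio: {bandwidth_ratio:.2f}x")
print(f"measured ratio:  {measured_ratio:.2f}x")
print(f"per-sequence ceiling: H200 {h200_ceiling:.0f} tok/s, "
      f"H100 {h100_ceiling:.0f} tok/s")
```

The per-sequence ceiling is far below the benchmark figures because production serving batches many sequences against each pass over the weights. What carries over from this estimate is the ratio between the two GPUs, not the absolute numbers.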

+76% VRAM: 141 GB vs 80 GB
+43% memory bandwidth: 4.8 vs 3.35 TB/s
+0% compute TFLOPS: identical Tensor Cores

⚡ The infrastructure advantage

Because H200 uses the same SXM5 form factor and NVLink topology as H100, data centres running H100 SXM5 systems can upgrade to H200 with minimal infrastructure change. Same HGX baseboard, same power delivery, same cooling. That's why H200 SXM clusters came online quickly after NVIDIA's GA announcement. It wasn't a ground-up redesign.

According to MLPerf Inference v4.1 results, the H200 SXM achieves 31,712 tokens per second on Llama 2 70B, a 45% improvement over the H100 SXM's 21,806 tokens per second on the same benchmark.

◆ BENCHMARKS

Benchmark performance: MLPerf and inference throughput

MLPerf Inference v4.1 is the most directly comparable public benchmark for H200 vs H100 on LLM workloads. Both GPUs were tested on Llama 2 70B in server mode, which is the metric most relevant to production inference. The H200 SXM result came from NVIDIA's reference submission using TensorRT-LLM with FP8 quantisation.

MLPerf v4.1, Llama 2 70B inference throughput (tokens/sec, server mode)

GPU	Throughput
H200 SXM	31,712 tok/sec
H100 SXM5	21,806 tok/sec

Single GPU, server mode, FP8. Source: MLCommons MLPerf Inference v4.1.

The benchmark reflects a specific workload: long context, large model, production-style batching. For smaller models (7B to 13B), the H200's bandwidth advantage narrows because those weights fit comfortably in H100's 80 GB with room to spare for KV cache. The gap widens as model size and context length grow, both of which are trending upward in 2026 production deployments.

On FP8 fine-tuning workloads for 70B+ models, the H200 typically runs 30 to 40% faster than H100 SXM5. The bandwidth difference matters for gradient checkpointing and optimiser state management at large model scale. For training runs on models under 30B, the gap shrinks to single digits and H100 is usually the more cost-efficient choice. See the H200 vs H100 rental decision guide for the full workload framework.
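Optimiser state is also the main capacity pressure at this scale. The sketch below is a rough estimate under an assumed layout that the article does not specify: BF16 weights and gradients at 2 bytes per parameter, plus two FP32 AdamW moments at 4 bytes each. It reproduces the roughly 560 GB AdamW figure cited later in this article. Activations, communication buffers, and any FP32 master copy of the weights are left out, so real requirements are higher.

```python
GB = 1e9  # decimal gigabytes


def full_finetune_state_gb(params_billion: float) -> dict[str, float]:
    """Per-component memory for full fine-tuning, in GB, activations excluded.

    Assumed layout: BF16 weights and gradients (2 bytes each per parameter)
    and two FP32 AdamW moments (8 bytes per parameter in total).
    """
    params = params_billion * 1e9
    return {
        "weights_bf16": params * 2 / GB,
        "gradients_bf16": params * 2 / GB,
        "adamw_moments_fp32": params * 8 / GB,
    }


state = full_finetune_state_gb(70)
total_gb = sum(state.values())

# Aggregate VRAM of one 8-GPU node, using the per-GPU capacities in the spec sheet.
clusters = {"8x H100 SXM5": 8 * 80, "8x H200 SXM": 8 * 141}

for component, size_gb in state.items():
    print(f"{component}: {size_gb:.0f} GB")
for name, vram_gb in clusters.items():
    print(f"{name}: {vram_gb} GB total, headroom {vram_gb - total_gb:+.0f} GB")
```

Under these assumptions a 70B full fine-tune does not fit in a single 8x H100 SXM5 node even when perfectly sharded, while an 8x H200 SXM node leaves room for activations.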

According to GPUaaS.com infrastructure data, H200 SXM clusters running Llama 3 70B at FP8 with continuous batching achieve roughly 24,000 tokens/sec per GPU at 80% utilisation, about 2x the throughput of a single A100 80GB on the same workload.

◆ PRICING

H200 SXM rental pricing in 2026

H200 SXM pricing varies significantly by provider type. Hyperscalers like AWS, Azure, and GCP layer management fees and reserved capacity premiums on top of raw compute cost. Wholesale providers like GPUaaS.com connect buyers directly to data centre capacity without the broker markup, so the same silicon costs materially less.

Wholesale (GPUaaS.com): $3.50/GPU/hr
On-demand, 8x H200 SXM cluster, US regions

✓ No minimum commitment
✓ Bare-metal, dedicated GPUs
✓ SSH access within 24 hours
✓ Reserved 1-yr: ~$2.10/GPU/hr

Hyperscaler (AWS p5.48xlarge): $98.32/hr
On-demand, 8x H200, ~$12.29/GPU/hr equivalent

✗ 3.5x wholesale GPU rate
✗ Egress fees on top
✓ Managed ecosystem
✓ Native cloud integrations

Reserved pricing cuts the per-GPU rate significantly. A 1-year reserved H200 SXM contract at GPUaaS.com runs roughly $2.10/GPU/hr, about 40% below on-demand. For teams running persistent production inference, that discount pays back within 3 to 4 months of committed usage. See the reserved vs on-demand GPU guide for the break-even framework.

According to GPUaaS.com pricing data, an 8xH200 SXM cluster at wholesale on-demand rates costs $28/hr, compared to roughly $98/hr on AWS p5.48xlarge for equivalent H200 hardware, a 71% cost reduction for the same silicon.
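The pricing comparisons in this section follow from three quoted rates. A minimal sketch of the arithmetic is below; the variable names are illustrative, and the break-even line rests on an assumption not stated in the article, namely that a reserved contract bills for every hour of the term whether or not the GPU is in use.

```python
GPUS_PER_NODE = 8

# Rates quoted in this article.
WHOLESALE_ON_DEMAND_GPU_HR = 3.50   # $/GPU/hr
WHOLESALE_RESERVED_GPU_HR = 2.10    # $/GPU/hr, 1-year term
HYPERSCALER_NODE_HR = 98.32         # $/hr for an 8-GPU instance

wholesale_node_hr = WHOLESALE_ON_DEMAND_GPU_HR * GPUS_PER_NODE
aws_per_gpu_hr = HYPERSCALER_NODE_HR / GPUS_PER_NODE
hyperscaler_multiple = aws_per_gpu_hr / WHOLESALE_ON_DEMAND_GPU_HR
reserved_discount = 1 - WHOLESALE_RESERVED_GPU_HR / WHOLESALE_ON_DEMAND_GPU_HR
node_saving = 1 - wholesale_node_hr / HYPERSCALER_NODE_HR

# Assumption: reserved capacity is billed 24/7. Reserved then beats on-demand
# once the GPU is busy for more than this fraction of the term.
break_even_utilisation = WHOLESALE_RESERVED_GPU_HR / WHOLESALE_ON_DEMAND_GPU_HR

print(f"wholesale 8-GPU node: ${wholesale_node_hr:.2f}/hr")
print(f"hyperscaler per GPU:  ${aws_per_gpu_hr:.2f}/hr "
      f"({hyperscaler_multiple:.1f}x wholesale)")
print(f"reserved discount:    {reserved_discount:.0%}")
print(f"node-level saving:    {node_saving:.1%}")
print(f"reserved break-even:  {break_even_utilisation:.0%} utilisation")
```

Swapping in a current quote for any of the three constants updates every derived figure, which is useful because on-demand rates move more often than the ratios between provider types.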

◆ WORKLOADS

Which workloads actually need the H200 SXM

The H200's value comes entirely from its memory advantage. If your workload doesn't stress memory capacity or bandwidth, you're paying 2.3x the hourly rate for headroom you won't use. Here's the framework.

✓ H200 SXM is the right call

● 70B+ parameter inference. Llama 3 70B, Mixtral 8x22B MoE, or any model whose weights and KV cache exceed 80 GB at your target batch size

● Long context windows (32K+). KV cache at 32K context in FP16 for a 70B model exceeds 60 GB alone, which pushes you past H100's 80 GB ceiling

● Quantised 405B model serving. Llama 3 405B weights alone need roughly 203 GB at INT4 and 405 GB at FP8, so no single GPU holds them. H200's 141 GB per GPU cuts the number of GPUs you shard across: INT4 weights fit on 2x H200 but need at least 3x H100

● Multi-modal workloads. Vision-language models with large image encoders like LLaVA-Next and InternVL2-76B frequently exceed 80 GB for inference at batch size greater than 1

● Full fine-tuning of 70B models. AdamW optimiser states for 70B parameters require roughly 560 GB total. Even with FSDP sharding across 8x H200 you need the extra VRAM headroom

✗ H100 SXM5 is the better value

● Sub-70B inference at standard context. Llama 3 8B, Mistral 7B, and Gemma 27B fit comfortably in 80 GB. H100 delivers near-identical throughput

● QLoRA fine-tuning on sub-70B models. 4-bit quantised training keeps VRAM well under 80 GB. The bandwidth premium doesn't meaningfully change wallclock time

● Embedding generation and reranking. These are lightweight relative to generation. H100 handles them fine and costs 57% less per hour

● Development and experimentation. Prototyping, CI/CD pipelines, and short batch jobs don't need the throughput headroom the H200 provides
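The two lists above reduce to one question: do weights plus KV cache fit in 80 GB at your target batch size? The sketch below is a first-pass estimate, not a sizing tool. The layer and head counts are assumptions taken from the published Llama 3 configurations, the helper names are hypothetical, and runtime overhead such as activations and framework buffers is ignored. Under these assumptions, one 32K-token sequence on a 70B model with grouped-query attention costs about 10.7 GB of FP16 KV cache, so the 60 GB mark is reached at around six concurrent sequences. Models without grouped-query attention reach it much sooner.

```python
from dataclasses import dataclass

H100_VRAM_GB = 80
H200_VRAM_GB = 141


@dataclass
class ModelShape:
    params_billion: float
    layers: int
    kv_heads: int   # key/value heads (fewer than query heads under GQA)
    head_dim: int


def weights_gb(model: ModelShape, bytes_per_param: float) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes)."""
    return model.params_billion * bytes_per_param


def kv_cache_gb(model: ModelShape, context_tokens: int, batch_size: int,
                bytes_per_value: int = 2) -> float:
    """KV cache in GB: one key and one value vector per layer, KV head, and token."""
    bytes_per_token = (2 * model.layers * model.kv_heads
                       * model.head_dim * bytes_per_value)
    return bytes_per_token * context_tokens * batch_size / 1e9


def smallest_single_gpu(total_gb: float) -> str:
    """Pick the smaller GPU whose VRAM covers the estimate (no overhead margin)."""
    if total_gb <= H100_VRAM_GB:
        return "H100 SXM5"
    if total_gb <= H200_VRAM_GB:
        return "H200 SXM"
    return "multi-GPU"


# Assumed shapes, based on the published Llama 3 model configs.
LLAMA3_70B = ModelShape(params_billion=70, layers=80, kv_heads=8, head_dim=128)
LLAMA3_8B = ModelShape(params_billion=8, layers=32, kv_heads=8, head_dim=128)

# 70B with FP8 weights and FP16 KV cache, one 32K-token sequence.
big = weights_gb(LLAMA3_70B, 1.0) + kv_cache_gb(LLAMA3_70B, 32_768, batch_size=1)
# 8B with FP16 weights and FP16 KV cache, eight 8K-token sequences.
small = weights_gb(LLAMA3_8B, 2.0) + kv_cache_gb(LLAMA3_8B, 8_192, batch_size=8)

print(f"70B @ 32K: {big:.1f} GB -> {smallest_single_gpu(big)}")
print(f"8B  @ 8K:  {small:.1f} GB -> {smallest_single_gpu(small)}")
```

Treat a result that lands within a few GB of either ceiling as a miss, since the estimate leaves no margin for runtime overhead.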

◆ WHERE TO RENT

Where to rent H200 SXM clusters in 2026

H200 SXM availability is split between wholesale GPU providers and hyperscalers. Wholesale gives you lower cost and bare-metal access. Hyperscalers give you managed infrastructure and native cloud integrations at a 3 to 4x price premium.

GPUaaS.com
Wholesale GPU marketplace, no broker markup

On-demand: $3.50/GPU/hr
Reserved (1-year): ~$2.10/GPU/hr

✓ 8x H200 SXM HGX clusters, US (Dallas, Ashburn) and EU (Frankfurt)
✓ Bare-metal SSH access, no hypervisor overhead
✓ Quote turnaround under 24 hours
✓ No minimum commitment on on-demand


For a full comparison of H200 against the B200, NVIDIA's Blackwell-generation GPU, see the H200 vs B200 cluster comparison. For a breakdown of how wholesale pricing compares to hyperscaler rates across all GPU models, see the wholesale vs hyperscale GPU pricing guide.

◆ FAQ

Frequently asked questions

Last reviewed: May 27, 2026. Pricing sourced from GPUaaS.com live cluster data and published hyperscaler rate cards. Benchmark data from MLCommons MLPerf Inference v4.1. NVIDIA H200 specs from official NVIDIA datasheet.
