H200 SXM: Full Spec, Pricing, and Where to Rent in 2026

GPU Infrastructure

The H200 SXM packs 141 GB HBM3e and 4.8 TB/s memory bandwidth into a form factor that fits your existing H100 SXM infrastructure. Here's the full spec breakdown, current rental pricing, and which workloads actually justify the upgrade.

GPUaaS.com Team
Infrastructure Research
May 26, 2026

The NVIDIA H200 SXM delivers 141 GB of HBM3e memory at 4.8 TB/s bandwidth: 76% more VRAM and 43% more bandwidth than the H100 SXM5. On memory-bound inference workloads like Llama 2 70B, that single hardware difference translates to a 45% throughput gain in MLPerf Inference v4.1. Rental pricing starts at $3.50/GPU/hr on wholesale providers like GPUaaS.com in 2026.

Key takeaways
  • H200 SXM carries 141 GB HBM3e at 4.8 TB/s, compared to H100 SXM5's 80 GB HBM3 at 3.35 TB/s. Identical Hopper compute engine, FP8 Transformer Engine, and NVLink 4.0 interconnect
  • MLPerf Inference v4.1: H200 SXM achieves 31,712 tokens/sec on Llama 2 70B vs 21,806 for H100 SXM, a 45% throughput gain driven entirely by memory bandwidth
  • H200 SXM rental starts at $3.50/GPU/hr on-demand at GPUaaS.com, roughly 2.3x the cost of H100 SXM5 at $1.49/GPU/hr
  • The H200 SXM is the right GPU for 70B+ parameter inference, 32K+ context windows, multi-modal workloads exceeding 80 GB VRAM, and production serving of quantised 405B models
  • For sub-70B models at standard context lengths, H100 SXM5 at $1.49/hr delivers near-identical throughput at less than half the cost. The H200 premium isn't justified there
◆ FULL SPEC SHEET
Full H200 SXM spec sheet
| Spec | H200 SXM | H100 SXM5 |
| --- | --- | --- |
| Architecture | Hopper (GH100) | Hopper (GH100) |
| GPU Memory | 141 GB HBM3e | 80 GB HBM3 |
| Memory Bandwidth | 4.8 TB/s | 3.35 TB/s |
| FP8 Tensor Core TFLOPS | 3,958 TFLOPS | 3,958 TFLOPS |
| BF16 Tensor Core TFLOPS | 1,979 TFLOPS | 1,979 TFLOPS |
| FP32 (non-Tensor) | 67 TFLOPS | 67 TFLOPS |
| Interconnect | NVLink 4.0 (900 GB/s) | NVLink 4.0 (900 GB/s) |
| Form Factor | SXM5 | SXM5 |
| TDP | 700W | 700W |
| NVLink Bandwidth (8-GPU) | 7.2 TB/s | 7.2 TB/s |
| PCIe Generation | PCIe 5.0 | PCIe 5.0 |
| Launch | Q4 2023 (GA 2024) | Q2 2022 (GA 2023) |

Source: NVIDIA H200 datasheet. FP8 and BF16 Tensor Core TFLOPS figures are with sparsity.

◆ H200 VS H100
H200 SXM vs H100 SXM: what actually changed

NVIDIA didn't redesign the GPU. They upgraded the memory. The H200 uses the same GH100 die as the H100 SXM5, with identical Tensor Core counts, the same FP8 Transformer Engine, and the same NVLink 4.0 interconnect at 900 GB/s. The only hardware change is swapping HBM3 for HBM3e, which adds 61 GB of capacity and 1.45 TB/s of bandwidth.

That sounds incremental, and for compute-bound workloads it is. Dense matrix multiplications in training are bottlenecked by FLOPS, not memory. But for LLM inference, the token-by-token decode phase spends most of its time streaming model weights and KV cache from HBM into the compute units, so throughput tracks memory bandwidth. A 43% bandwidth increase translates almost linearly to throughput, which is why the H200 delivers a 45% tokens-per-second improvement on Llama 2 70B without touching a single compute spec.
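As a rough sanity check on that near-linear claim, the sketch below compares the bandwidth ratio from the spec sheet with the throughput ratio from the MLPerf figures quoted in this article. It assumes decoding is purely bandwidth-bound, which is a simplification, and the variable names are illustrative.

```python
# Back-of-the-envelope check: if decoding is bandwidth-bound, the expected
# speedup is simply the ratio of the two memory bandwidths.

H200_BANDWIDTH_TBS = 4.8    # TB/s, from the spec sheet
H100_BANDWIDTH_TBS = 3.35   # TB/s, from the spec sheet

H200_MLPERF_TOKS = 31_712   # tokens/sec, MLPerf Inference v4.1, Llama 2 70B
H100_MLPERF_TOKS = 21_806   # tokens/sec, same benchmark

predicted_speedup = H200_BANDWIDTH_TBS / H100_BANDWIDTH_TBS
measured_speedup = H200_MLPERF_TOKS / H100_MLPERF_TOKS

print(f"Bandwidth ratio (predicted speedup): {predicted_speedup:.2f}x")
print(f"MLPerf throughput ratio (measured):  {measured_speedup:.2f}x")
```

The measured ratio comes out slightly above the bandwidth ratio. One plausible reason is that the larger memory allows bigger batches, though the figures here do not isolate that effect.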

  • VRAM: +76% (141 GB vs 80 GB)
  • Memory bandwidth: +43% (4.8 vs 3.35 TB/s)
  • Compute TFLOPS: +0% (identical Tensor Cores)

⚡ The infrastructure advantage

Because H200 uses the same SXM5 form factor and NVLink topology as H100, data centres running H100 SXM5 systems can upgrade to H200 with minimal infrastructure change. Same HGX baseboard, same power delivery, same cooling. That's why H200 SXM clusters came online quickly after NVIDIA's GA announcement. It wasn't a ground-up redesign.

According to MLPerf Inference v4.1 results, the H200 SXM achieves 31,712 tokens per second on Llama 2 70B, a 45% improvement over the H100 SXM's 21,806 tokens per second on the same benchmark.

◆ BENCHMARKS
Benchmark performance: MLPerf and inference throughput

MLPerf Inference v4.1 is the most directly comparable public benchmark for H200 vs H100 on LLM workloads. Both GPUs were tested on Llama 2 70B in server mode, which is the metric most relevant to production inference. The H200 SXM result came from NVIDIA's reference submission using TensorRT-LLM with FP8 quantisation.

MLPerf v4.1, Llama 2 70B inference throughput (tokens/sec, server mode)

  • H200 SXM: 31,712 tok/sec
  • H100 SXM5: 21,806 tok/sec

Single GPU, server mode, FP8. Source: MLCommons MLPerf Inference v4.1.

The benchmark reflects a specific workload: long context, large model, production-style batching. For smaller models (7B to 13B), the H200's bandwidth advantage narrows because those weights fit comfortably in H100's 80 GB with room to spare for KV cache. The gap widens as model size and context length grow, both of which are trending upward in 2026 production deployments.

On FP8 fine-tuning workloads for 70B+ models, the H200 typically runs 30 to 40% faster than H100 SXM5. The bandwidth difference matters for gradient checkpointing and optimiser state management at large model scale. For training runs on models under 30B, the gap shrinks to single digits and H100 is usually the more cost-efficient choice. See the H200 vs H100 rental decision guide for the full workload framework.

According to GPUaaS.com infrastructure data, H200 SXM clusters running Llama 3 70B at FP8 with continuous batching achieve roughly 24,000 tokens/sec per GPU at 80% utilisation, about 2x the throughput of a single A100 80GB on the same workload.
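To turn a throughput figure like that into a serving cost, divide the hourly GPU rate by the tokens generated per hour. The sketch below pairs the 24,000 tokens/sec figure with the on-demand and reserved rates quoted in this article. It assumes that throughput is sustained for the whole billed hour, and the function name is illustrative.

```python
def cost_per_million_tokens(rate_usd_per_gpu_hr: float,
                            tokens_per_sec: float) -> float:
    """Serving cost in USD per one million generated tokens on one GPU."""
    tokens_per_hour = tokens_per_sec * 3600
    return rate_usd_per_gpu_hr / tokens_per_hour * 1_000_000


# Assumption: 24,000 tokens/sec is the sustained per-GPU rate.
THROUGHPUT_TOKS = 24_000

on_demand = cost_per_million_tokens(3.50, THROUGHPUT_TOKS)
reserved = cost_per_million_tokens(2.10, THROUGHPUT_TOKS)

print(f"On-demand: ${on_demand:.4f} per million tokens")
print(f"Reserved:  ${reserved:.4f} per million tokens")
```

Real deployments rarely hold peak throughput around the clock, so treat the result as a lower bound on cost per token.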

◆ PRICING
H200 SXM rental pricing in 2026

H200 SXM pricing varies significantly by provider type. Hyperscalers like AWS, Azure, and GCP layer management fees and reserved capacity premiums on top of raw compute cost. Wholesale providers like GPUaaS.com connect buyers directly to data centre capacity without the broker markup, so the same silicon costs materially less.

Wholesale (GPUaaS.com): $3.50/GPU/hr
On-demand, 8xH200 SXM cluster, US regions
  • No minimum commitment
  • Bare-metal, dedicated GPUs
  • SSH access within 24 hours
  • Reserved 1-yr: ~$2.10/GPU/hr

Hyperscaler (AWS p5.48xlarge): $98.32/hr
On-demand, 8xH200, ~$12.29/GPU/hr equivalent
  • 3.5x wholesale GPU rate
  • Egress fees on top
  • Managed ecosystem
  • Native cloud integrations

Reserved pricing cuts the per-GPU rate significantly. A 1-year reserved H200 SXM contract at GPUaaS.com runs roughly $2.10/GPU/hr, about 40% below on-demand. For teams running persistent production inference, the commitment works out cheaper than on-demand once each GPU is in use for more than about 60% of the contract hours ($2.10 / $3.50), assuming the reserved rate is billed for every hour of the term. See the reserved vs on-demand GPU guide for the break-even framework.

According to GPUaaS.com pricing data, an 8xH200 SXM cluster at wholesale on-demand rates costs $28/hr, compared to roughly $98/hr on AWS p5.48xlarge for equivalent H200 hardware, a 71% cost reduction for the same silicon.
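The pricing comparisons above are simple arithmetic on the quoted rates, sketched here so the figures can be rechecked when rates change. The break-even part assumes a reserved contract is billed for every hour of the year whether or not the GPU is busy. That billing model is an assumption, not something stated in the rate cards.

```python
GPUS_PER_NODE = 8
HOURS_PER_YEAR = 8760

wholesale_on_demand = 3.50   # $/GPU/hr, quoted above
wholesale_reserved = 2.10    # $/GPU/hr, 1-year reserved, quoted above
aws_node_rate = 98.32        # $/hr for the 8-GPU instance, quoted above

# Node-level and per-GPU comparisons
wholesale_node_rate = wholesale_on_demand * GPUS_PER_NODE
aws_per_gpu = aws_node_rate / GPUS_PER_NODE
price_multiple = aws_per_gpu / wholesale_on_demand
node_saving = 1 - wholesale_node_rate / aws_node_rate

# Reserved vs on-demand: fraction of the year a GPU must be busy before
# the always-billed reserved rate beats paying on-demand only when busy.
break_even_utilisation = wholesale_reserved / wholesale_on_demand
break_even_hours = break_even_utilisation * HOURS_PER_YEAR

print(f"Wholesale 8-GPU node: ${wholesale_node_rate:.2f}/hr")
print(f"AWS per-GPU equivalent: ${aws_per_gpu:.2f}/hr ({price_multiple:.1f}x)")
print(f"Saving vs AWS node rate: {node_saving:.1%}")
print(f"Reserved break-even: {break_even_utilisation:.0%} utilisation "
      f"(~{break_even_hours:.0f} hours/year)")
```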

◆ WORKLOADS
Which workloads actually need the H200 SXM

The H200's value comes entirely from its memory advantage. If your workload doesn't stress memory capacity or bandwidth, you're paying a 2.3x hourly premium for headroom you won't use. Here's the framework.

✓ H200 SXM is the right call

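70B+ parameter inference. Llama 3 70B, Mixtral 8x22B MoE, or any model whose weights and KV cache exceed 80 GB at your target batch size
Long context windows (32K+). For a 70B model with grouped-query attention, FP16 KV cache at 32K context runs to roughly 10 GB per sequence, so around six concurrent sequences exceed 60 GB. On top of the model weights, that pushes you past H100's 80 GB ceiling
Quantised 405B model serving. Llama 3.1 405B needs roughly 405 GB for weights at FP8 and roughly 203 GB at INT4. At INT4 the weights span two H200s (282 GB) but need three H100s (240 GB). At FP8 they span three H200s (423 GB) but need six H100s (480 GB), before any KV cache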
Multi-modal workloads. Vision-language models with large image encoders like LLaVA-Next and InternVL2-76B frequently exceed 80 GB for inference at batch size greater than 1
Full fine-tuning of 70B models. AdamW's two FP32 moment tensors alone take roughly 560 GB for 70B parameters (8 bytes per parameter), before weights, gradients, and activations. Even with FSDP sharding across 8x H200 you need the extra VRAM headroom
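How much KV cache a long-context workload needs depends on the attention layout and on how many sequences are in flight. The sketch below estimates it for a Llama 3 70B-shaped model (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and counts how many 32K-token sequences fit beside the weights. The 70 GB FP8 weight figure and the helper names are assumptions, and activations and framework overhead are ignored.

```python
GB = 1e9  # decimal gigabytes, matching GPU spec sheets


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """KV cache size in bytes; the leading 2 covers keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def max_sequences(gpu_mem_gb: float, weights_gb: float,
                  per_seq_gb: float) -> int:
    """Whole sequences that fit after the weights are loaded."""
    free_gb = gpu_mem_gb - weights_gb
    return max(0, int(free_gb // per_seq_gb))


# Assumed model shape: Llama 3 70B with FP16 KV cache at 32K context.
per_seq_bytes = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                               seq_len=32_768)
per_seq_gb = per_seq_bytes / GB

WEIGHTS_FP8_GB = 70.0  # assumption: 70B parameters at 1 byte each

h100_fit = max_sequences(80, WEIGHTS_FP8_GB, per_seq_gb)
h200_fit = max_sequences(141, WEIGHTS_FP8_GB, per_seq_gb)

print(f"KV cache per 32K sequence: {per_seq_gb:.2f} GB")
print(f"H100 80 GB:  {h100_fit} sequences alongside FP8 weights")
print(f"H200 141 GB: {h200_fit} sequences alongside FP8 weights")
```

Under these assumptions a single H100 cannot hold even one 32K sequence next to FP8 weights, while a single H200 holds six.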

✗ H100 SXM5 is the better value

Sub-70B inference at standard context. Llama 3 8B, Mistral 7B, and Gemma 27B fit comfortably in 80 GB. H100 delivers near-identical throughput
QLoRA fine-tuning on sub-70B models. 4-bit quantised training keeps VRAM well under 80 GB. The bandwidth premium doesn't meaningfully change wallclock time
Embedding generation and reranking. These are lightweight relative to generation. H100 handles them fine and costs 57% less per hour
Development and experimentation. Prototyping, CI/CD pipelines, and short batch jobs don't need the throughput headroom the H200 provides
◆ WHERE TO RENT
Where to rent H200 SXM clusters in 2026

H200 SXM availability is split between wholesale GPU providers and hyperscalers. Wholesale gives you lower cost and bare-metal access. Hyperscalers give you managed infrastructure and native cloud integrations at a 3 to 4x price premium.

GPUaaS.com
Wholesale GPU marketplace, no broker markup
  • On-demand: $3.50/GPU/hr
  • Reserved (1-year): ~$2.10/GPU/hr
  • 8xH200 SXM HGX clusters, US (Dallas, Ashburn) and EU (Frankfurt)
  • Bare-metal SSH access, no hypervisor overhead
  • Quote turnaround under 24 hours
  • No minimum commitment on on-demand

For a full comparison of H200 against the B200, NVIDIA's Blackwell-generation GPU, see the H200 vs B200 cluster comparison. For a breakdown of how wholesale pricing compares to hyperscaler rates across all GPU models, see the wholesale vs hyperscale GPU pricing guide.

◆ FAQ
Frequently asked questions

H200 SXM on-demand rental starts at $3.50/GPU/hr at wholesale providers like GPUaaS.com. An 8xH200 SXM cluster runs $28/hr on-demand. Reserved 1-year contracts bring the rate down to roughly $2.10/GPU/hr. AWS p5.48xlarge runs roughly $12.29/GPU/hr equivalent, about 3.5x the wholesale rate for the same hardware.

The H200 SXM uses the same GH100 compute die as the H100 SXM5, with identical Tensor Cores and FP8 TFLOPS (3,958) and the same NVLink 4.0 interconnect. The only difference is memory: H200 carries 141 GB HBM3e at 4.8 TB/s vs H100's 80 GB HBM3 at 3.35 TB/s. That 43% bandwidth increase drives a 45% throughput improvement on memory-bound LLM inference workloads.

Not on a single GPU. Llama 3.1 405B in FP16 requires roughly 810 GB, well past a single H200's 141 GB. Quantisation brings the weights down to roughly 405 GB at FP8 or roughly 203 GB at INT4, which still exceeds one H200 SXM. Most production teams use 2 to 4x H200 SXM GPUs for serving at reasonable batch sizes: two cover the INT4 weights, and four leave room for FP8 weights plus KV cache. On H100 SXM5 at 80 GB per GPU, the same weights need at least three GPUs at INT4 and six at FP8.
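The weight footprint follows directly from parameter count times bytes per parameter. This minimal sketch computes it for a 405B-parameter model and the minimum number of GPUs needed to hold the weights alone. It ignores KV cache, activations, and the power-of-two GPU counts that tensor parallelism often prefers, and the helper names are illustrative.

```python
import math

# Bytes per parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB: billions of parameters times bytes each."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus(footprint_gb: float, gpu_mem_gb: float) -> int:
    """Smallest GPU count whose combined memory holds the weights."""
    return math.ceil(footprint_gb / gpu_mem_gb)


results = {}
for precision in ("fp16", "fp8", "int4"):
    footprint = weights_gb(405, precision)
    results[precision] = (footprint,
                          min_gpus(footprint, 141),   # H200 SXM
                          min_gpus(footprint, 80))    # H100 SXM5
    print(f"{precision}: {footprint:.1f} GB weights, "
          f"{results[precision][1]}x H200 or {results[precision][2]}x H100")
```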

For full fine-tuning of 70B+ models, the H200 SXM is worth the premium. The extra VRAM allows larger batch sizes and reduces gradient checkpointing, which meaningfully cuts wallclock time. For QLoRA or LoRA on sub-70B models, H100 SXM5 is usually sufficient and costs 57% less. The practical rule: if your fine-tuning job runs out of VRAM on H100, move to H200. If it fits, stay on H100.
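A rough memory budget shows where the "runs out of VRAM on H100" line falls for a 70B full fine-tune. The sketch assumes BF16 weights and gradients (2 bytes each per parameter), two FP32 AdamW moments (8 bytes per parameter, matching the 560 GB figure in this article), no separate FP32 master weights, and state sharded evenly across 8 GPUs. Activations and temporary buffers are excluded, so real usage is higher.

```python
def training_state_gb(params_billion: float,
                      weight_bytes: int = 2,
                      grad_bytes: int = 2,
                      optimizer_bytes: int = 8) -> tuple:
    """Return (weights, gradients, optimiser state) in GB."""
    return (params_billion * weight_bytes,
            params_billion * grad_bytes,
            params_billion * optimizer_bytes)


NUM_GPUS = 8  # assumption: one 8-GPU node with fully sharded state

weights, grads, optimizer = training_state_gb(70)
total_gb = weights + grads + optimizer
per_gpu_gb = total_gb / NUM_GPUS

print(f"Optimiser state: {optimizer} GB, total state: {total_gb} GB")
print(f"Per GPU when sharded across {NUM_GPUS}: {per_gpu_gb:.1f} GB")
for name, mem_gb in (("H100 SXM5", 80), ("H200 SXM", 141)):
    headroom = mem_gb - per_gpu_gb
    print(f"{name}: {headroom:+.1f} GB left for activations")
```

Under these assumptions the sharded state alone exceeds an H100's 80 GB, while an H200 keeps about 36 GB free for activations.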

The B200 (Blackwell architecture) delivers roughly 2.2x the FP8 inference throughput of an H200 and carries 192 GB HBM3e. It's a ground-up redesign rather than a memory upgrade. That said, B200 pricing in 2026 runs $6 to $10/GPU/hr with constrained availability outside a handful of data centres. For most teams, H200 SXM offers better availability and more predictable pricing right now. See the H200 vs B200 comparison for the full breakdown.

GPUaaS.com offers 8xH200 SXM HGX clusters from $3.50/GPU/hr on-demand in US (Dallas, Ashburn) and EU (Frankfurt) regions, with SSH access within 24 hours. AWS (p5.48xlarge), Azure (ND H200 v5), and GCP also offer H200 capacity at higher rates with managed infrastructure. For current availability and pricing, visit gpuaas.com/clusters-h200.

Last reviewed: May 27, 2026. Pricing sourced from GPUaaS.com live cluster data and published hyperscaler rate cards. Benchmark data from MLCommons MLPerf Inference v4.1. NVIDIA H200 specs from official NVIDIA datasheet.
