The NVIDIA B200 SXM delivers 2.2x the FP8 inference throughput of an H200, carries 192 GB of HBM3e memory, and runs at 4.0 TB/s of memory bandwidth. It's the fastest GPU NVIDIA has shipped for inference workloads. Enterprise pricing in 2026 runs $6 to $10/GPU/hr at wholesale providers, with availability concentrated at a small number of data centres and wait times measured in weeks rather than hours.
- B200 SXM carries 192 GB HBM3e at 4.0 TB/s and delivers roughly 2.2x the FP8 inference throughput of an H200 SXM on Llama 3 70B serving in MLPerf server mode
- Wholesale pricing in 2026 runs $6 to $10/GPU/hr on-demand. An 8xB200 SXM cluster costs $48 to $80/hr, compared to $28/hr for the equivalent 8xH200
- B200 is a new architecture (Blackwell), not a memory upgrade. NVLink 5.0 delivers 1.8 TB/s of per-GPU bandwidth, double the 900 GB/s of the H200's NVLink 4.0
- Availability is constrained to a handful of US and EU data centres in 2026. Lead times for new clusters run 4 to 8 weeks. H200 SXM is available in under 24 hours
- B200 makes sense for: 405B+ model inference at scale, MoE architectures exceeding 192 GB, and teams where throughput-per-dollar at high concurrency justifies the premium
| Spec | B200 SXM | H200 SXM | H100 SXM5 |
|---|---|---|---|
| Architecture | Blackwell (GB100) | Hopper (GH100) | Hopper (GH100) |
| GPU Memory | 192 GB HBM3e | 141 GB HBM3e | 80 GB HBM3 |
| Memory Bandwidth | 4.0 TB/s | 4.8 TB/s | 3.35 TB/s |
| FP8 Tensor Core TFLOPS | 9,000 TFLOPS | 3,958 TFLOPS | 3,958 TFLOPS |
| BF16 Tensor Core TFLOPS | 4,500 TFLOPS | 1,979 TFLOPS | 1,979 TFLOPS |
| FP4 Tensor Core TFLOPS | 18,000 TFLOPS | N/A | N/A |
| NVLink Generation | NVLink 5.0 | NVLink 4.0 | NVLink 4.0 |
| NVLink Bandwidth (GPU-to-GPU) | 1.8 TB/s | 900 GB/s | 900 GB/s |
| NVLink Bandwidth (8-GPU node) | 14.4 TB/s | 7.2 TB/s | 7.2 TB/s |
| TDP | 1,000W | 700W | 700W |
| Form Factor | SXM6 | SXM5 | SXM5 |
| Launch | Q1 2025 (GA) | Q4 2023 (GA 2024) | Q2 2022 (GA 2023) |
Source: NVIDIA B200 datasheet. FP8, BF16, and FP4 TFLOPS figures are with sparsity.
The B200 is a full architecture redesign, not a memory upgrade. NVIDIA moved from Hopper (GH100) to Blackwell (GB100) with the B200, which means new compute units, a new NVLink generation, new precision support (FP4), and a new SXM form factor (SXM6). You can't drop a B200 into an existing H100 or H200 HGX chassis. Data centres running B200 need new infrastructure.
The headline compute improvement is substantial. FP8 TFLOPS jump from 3,958 on the H200 to 9,000 on the B200, a 2.27x improvement. FP4 support (a new precision class introduced with Blackwell) reaches 18,000 TFLOPS. For inference workloads where models can be quantised to FP4 without significant accuracy loss, that's a step change in tokens-per-dollar.
One notable spec inversion in the table above: the B200 is listed with less memory bandwidth than the H200. The H200 SXM runs at 4.8 TB/s, the B200 at 4.0 TB/s, about 17% lower. NVIDIA compensated for this with higher VRAM capacity (192 GB vs 141 GB) and much faster inter-GPU bandwidth via NVLink 5.0. For single-GPU workloads limited by memory bandwidth, the H200 is actually faster on these figures. The B200's advantage is at the cluster level, not per-GPU bandwidth.
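A back-of-the-envelope way to see the per-GPU bandwidth effect: in batch-1 decode, each generated token has to stream roughly all of the model weights from HBM once, so bandwidth divided by weight size gives a ceiling on tokens per second. The sketch below uses the bandwidth figures from the table above. The 70 GB weight footprint (70B parameters at one byte each in FP8) and the function name are illustrative assumptions, and the estimate ignores KV-cache traffic, batching, and kernel efficiency.

```python
def decode_ceiling_tokens_per_s(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Upper bound on batch-1 decode speed if every token reads all weights once."""
    bandwidth_gb_s = bandwidth_tb_s * 1000.0
    return bandwidth_gb_s / weight_gb


WEIGHT_GB = 70.0  # assumption: 70B parameters x 1 byte each (FP8)

# Bandwidth figures as listed in the spec table above.
h200_ceiling = decode_ceiling_tokens_per_s(4.8, WEIGHT_GB)
b200_ceiling = decode_ceiling_tokens_per_s(4.0, WEIGHT_GB)

print(f"H200 ceiling: {h200_ceiling:.1f} tokens/s")
print(f"B200 ceiling: {b200_ceiling:.1f} tokens/s")
print(f"B200 relative to H200: {b200_ceiling / h200_ceiling:.2f}x")
```

This only describes the bandwidth-limited regime. At high batch sizes the workload shifts towards compute, which is where the TFLOPS gap matters.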
⚡ The infrastructure cost buyers miss
B200 uses the SXM6 form factor and runs at 1,000W TDP per GPU, 43% higher than the H200's 700W. An 8xB200 HGX node draws up to 8,000W under full load for the GPUs alone, before CPUs, networking, and cooling overhead. Data centres need upgraded power delivery and cooling infrastructure. If you're co-locating or buying dedicated infrastructure, that's a real cost that doesn't show up in the per-GPU/hr rental rate.
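The power gap can be turned into a rough monthly energy figure. This is a minimal sketch using the TDP values from the spec table. The electricity price and the PUE (cooling and facility overhead multiplier) are placeholder assumptions, not quoted rates, and CPU, network, and storage power are left out.

```python
# Per-GPU TDP in watts, from the spec table above.
GPU_TDP_W = {"B200 SXM": 1000, "H200 SXM": 700}
GPUS_PER_NODE = 8
HOURS_PER_MONTH = 730  # average month


def node_gpu_power_kw(gpu: str) -> float:
    """GPU-only power draw of an 8-GPU node at full TDP, in kW."""
    return GPU_TDP_W[gpu] * GPUS_PER_NODE / 1000.0


def monthly_energy_cost(gpu: str, price_per_kwh: float, pue: float = 1.0) -> float:
    """Monthly energy cost at sustained full load, scaled by facility PUE."""
    return node_gpu_power_kw(gpu) * pue * HOURS_PER_MONTH * price_per_kwh


PRICE_PER_KWH = 0.10  # assumption: substitute your contracted rate
PUE = 1.3             # assumption: substitute your facility's figure

for gpu in GPU_TDP_W:
    kw = node_gpu_power_kw(gpu)
    cost = monthly_energy_cost(gpu, PRICE_PER_KWH, PUE)
    print(f"{gpu}: {kw:.1f} kW GPU load, about ${cost:,.0f}/month in energy")
```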
According to NVIDIA's published specifications, the B200 SXM delivers 9,000 FP8 TFLOPS with sparsity, compared to 3,958 TFLOPS for the H200 SXM, a 2.27x improvement in raw compute throughput at the same precision level.
MLPerf Inference v5.0 is the first round to include B200 SXM results at scale. The headline Llama 3 70B server-mode result puts the B200 at roughly 2.2x the throughput of a single H200 SXM on the same workload. That matches NVIDIA's pre-launch claims, which is worth noting since marketing TFLOPS figures often don't translate to real-world inference gains.
The throughput gap widens further on larger models. Llama 3 405B at FP8 needs roughly 405 GB for weights alone, so it still has to be sharded on either GPU, but the B200's 192 GB of VRAM lets the model fit across fewer GPUs than the 141 GB H200 and leaves more room for KV cache on each one. Less inter-GPU communication means higher effective utilisation. For 405B serving at scale, the B200's advantage over H200 in practice is closer to 2.5 to 3x rather than the 2.2x seen on 70B workloads.
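A rough sizing check helps when reasoning about how many GPUs a large model needs. The sketch below counts weight bytes only and reserves a fraction of each GPU's memory for KV cache and activations. The 20% reserve and the helper names are assumptions for illustration. Real deployments also depend on context length, batch size, and the tensor-parallel sizes the serving framework supports.

```python
import math

# Bytes per parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1B params at 1 byte is about 1 GB)."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus(params_billion: float, precision: str, gpu_mem_gb: float,
             reserve_fraction: float = 0.2) -> int:
    """Smallest GPU count whose usable memory holds the weights.

    reserve_fraction is an assumed share of VRAM kept free for KV cache
    and activations; tune it for your workload.
    """
    usable_gb = gpu_mem_gb * (1.0 - reserve_fraction)
    return math.ceil(weight_gb(params_billion, precision) / usable_gb)


for precision in ("fp8", "fp4"):
    b200 = min_gpus(405, precision, gpu_mem_gb=192)
    h200 = min_gpus(405, precision, gpu_mem_gb=141)
    print(f"Llama 3 405B at {precision}: {weight_gb(405, precision):.1f} GB weights, "
          f"B200 x{b200}, H200 x{h200}")
```

Serving frameworks often prefer tensor-parallel sizes that divide the attention head count evenly, so the practical GPU count may round up from these minimums.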
FP4 inference support changes the picture further for teams willing to accept the quantisation tradeoff. At FP4 precision, the B200 reaches 18,000 TFLOPS, and for models where FP4 quantisation doesn't materially degrade output quality (primarily smaller, distilled models), the per-token cost drops dramatically. Most frontier-scale production models aren't FP4-deployable yet, but enterprise fine-tunes on smaller base models often are.
Figure: Relative inference throughput, Llama 3 70B, single GPU, server mode, FP8. Source: MLCommons MLPerf Inference v5.0 and GPUaaS.com cluster telemetry.
The throughput-per-dollar question
2.2x the throughput at roughly 1.7 to 2.9x the hourly cost ($6 to $10 versus $3.50 per GPU-hour) is roughly breakeven on throughput-per-dollar around the middle of the current wholesale range. The B200 starts to win on cost-per-token at high sustained concurrency, where its higher batch throughput and larger VRAM mean fewer GPUs needed to hit a given requests-per-second target. For teams running below 60% GPU utilisation, H200 is usually more cost-efficient.
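The breakeven point can be worked out directly from the quoted rates. This is a small sketch using the 2.2x speedup and the $3.50 H200 on-demand rate from this article. The function names are illustrative, and the speedup should be replaced with your own measured figure.

```python
def relative_throughput_per_dollar(speedup: float, b200_rate: float,
                                   h200_rate: float) -> float:
    """B200 throughput-per-dollar relative to H200 (above 1.0 means B200 wins)."""
    return speedup / (b200_rate / h200_rate)


def breakeven_b200_rate(speedup: float, h200_rate: float) -> float:
    """B200 hourly rate at which throughput-per-dollar equals H200."""
    return speedup * h200_rate


SPEEDUP = 2.2     # headline MLPerf figure quoted in this article
H200_RATE = 3.50  # $/GPU/hr on-demand, from the pricing section

print(f"Breakeven B200 rate: ${breakeven_b200_rate(SPEEDUP, H200_RATE):.2f}/GPU/hr")
for b200_rate in (6.0, 8.0, 10.0):
    ratio = relative_throughput_per_dollar(SPEEDUP, b200_rate, H200_RATE)
    print(f"B200 at ${b200_rate:.2f}/hr: {ratio:.2f}x H200 throughput-per-dollar")
```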
According to GPUaaS.com cluster data, enterprise teams migrating 405B model serving from 4xH200 to 2xB200 configurations see GPU count reductions of 40 to 50%, which offsets the higher per-GPU rate and reduces total cluster cost by 15 to 25% at sustained utilisation above 70%.
B200 SXM wholesale pricing in 2026 runs $6 to $10/GPU/hr on-demand depending on region and cluster configuration. Reserved 1-year contracts bring this down to roughly $4 to $5/GPU/hr. That compares to H200 SXM on-demand at $3.50/GPU/hr and reserved at roughly $2.10/GPU/hr at GPUaaS.com.
| GPU | On-demand (wholesale) | Reserved 1-yr | 8-GPU cluster/hr | Availability |
|---|---|---|---|---|
| B200 SXM | $6.00 to $10.00/hr | ~$4.00 to $5.00/hr | $48 to $80/hr | 4 to 8 week lead time |
| H200 SXM | $3.50/hr | ~$2.10/hr | $28/hr | Under 24 hours |
| H100 SXM5 | $1.49 to $2.49/hr | ~$0.90 to $1.50/hr | $12 to $20/hr | Under 24 hours |
Pricing sourced from GPUaaS.com wholesale provider network, May 2026. On-demand rates. Reserved contracts and bulk pricing available on request.
Availability is the harder constraint for most enterprise buyers in 2026. H200 SXM clusters are available within 24 hours at GPUaaS.com in US (Dallas, Ashburn) and EU (Frankfurt) regions. B200 SXM clusters have 4 to 8 week lead times in most configurations, and spot availability is rare. If you need compute next week, H200 is the practical choice regardless of performance preference.
⚠ Hyperscaler B200 pricing
AWS, Azure, and GCP B200 instances aren't broadly available to enterprise customers at time of writing. Where they are available (primarily through committed use agreements with hyperscaler enterprise accounts), pricing runs $15 to $25/GPU/hr equivalent. Wholesale providers offer the same silicon at roughly 40 to 60% of hyperscaler rates, though B200 lead times remain longer than for H200.
The B200's cost premium is real. Whether it's justified depends entirely on whether your workload can actually use the additional TFLOPS and VRAM. Most production workloads in 2026 don't saturate a single H200. The teams that get value from B200 are running at a scale where the hardware is genuinely the bottleneck.
✓ B200 SXM justifies the premium
✗ H200 or H100 is the better call
B200 procurement in 2026 is materially different from H100 or H200 procurement. The supply dynamics, lead times, and contract structures are all more complex. Here's what enterprise buyers typically discover after the sales call rather than during it.
Verify actual availability before entering procurement discussions
Many GPU providers list B200 capacity on their websites but don't actually have it available on short notice. Ask specifically: how many B200 GPUs are physically in-rack right now, what's the current queue length, and what's the realistic provisioning timeline for your cluster size. The honest answer is often "4 to 8 weeks" rather than the "available now" language in the marketing copy.
Benchmark your specific workload before committing to reserved pricing
The 2.2x throughput figure comes from a specific benchmark profile (Llama 3 70B in server mode at FP8). Your workload may see more or less. Teams serving multiple smaller models, running batch jobs with variable context lengths, or operating at sub-70% utilisation often see real-world gains of 1.4 to 1.7x rather than the 2.2x headline number. Benchmark before signing a 1-year contract.
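A workload-specific benchmark does not need to be elaborate. The harness below is a minimal sketch that replays your own prompts at a fixed concurrency and reports aggregate tokens per second. `generate` is a hypothetical callable that you would wire to your inference endpoint and that returns the number of tokens produced. The `fake_generate` stand-in exists only so the sketch runs on its own.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def measure_tokens_per_s(generate: Callable[[str], int],
                         prompts: Sequence[str],
                         concurrency: int = 8) -> float:
    """Replay prompts with a fixed number of in-flight requests.

    Returns total generated tokens divided by wall-clock time.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(generate, prompts))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed


def fake_generate(prompt: str) -> int:
    """Stand-in for a real client call: sleeps briefly, reports 32 tokens."""
    time.sleep(0.01)
    return 32


prompts = [f"sample prompt {i}" for i in range(64)]
tps = measure_tokens_per_s(fake_generate, prompts, concurrency=8)
print(f"Aggregate throughput: {tps:.0f} tokens/s")
```

Running the same prompt set and concurrency against an H200 and a B200 endpoint gives a speedup figure for your own traffic profile rather than the benchmark's.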
Model your cost-per-token, not cost-per-GPU-hour
GPU-hr pricing comparisons are misleading for B200 vs H200 decisions. What matters is cost-per-million-tokens at your target throughput and concurrency. If 2 B200s at $8/hr deliver the same output as 4 H200s at $3.50/hr, the B200 cluster costs $16/hr against $14/hr for H200, so H200 still wins. In that scenario B200 only comes out ahead below $7/GPU/hr, or if each B200 replaces more than two H200s. The maths only works in B200's favour at high sustained throughput. Build the model before buying.
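A cost-per-token model can start as a few lines. The sketch below compares two clusters that deliver the same aggregate throughput. The 10,000 tokens/s figure is a placeholder assumption, and the rates are the ones used in the example above.

```python
def cost_per_million_tokens(gpu_count: int, rate_per_gpu_hr: float,
                            cluster_tokens_per_s: float) -> float:
    """Dollars per million generated tokens for a cluster at sustained load."""
    cluster_cost_per_hr = gpu_count * rate_per_gpu_hr
    tokens_per_hr = cluster_tokens_per_s * 3600
    return cluster_cost_per_hr / tokens_per_hr * 1_000_000


CLUSTER_TOKENS_PER_S = 10_000  # assumption: replace with your measured throughput

b200_cost = cost_per_million_tokens(2, 8.00, CLUSTER_TOKENS_PER_S)
h200_cost = cost_per_million_tokens(4, 3.50, CLUSTER_TOKENS_PER_S)

print(f"2x B200 at $8.00/hr: ${b200_cost:.3f} per million tokens")
print(f"4x H200 at $3.50/hr: ${h200_cost:.3f} per million tokens")

# B200 rate at which the two clusters cost the same per token.
breakeven_rate = 4 * 3.50 / 2
print(f"Breakeven B200 rate for a 2-for-4 swap: ${breakeven_rate:.2f}/GPU/hr")
```

Feeding in measured throughput for each cluster, rather than assuming equal output, is what makes the comparison meaningful.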
Confirm software stack compatibility
vLLM, TensorRT-LLM, and Triton all support B200, but CUDA driver requirements and library versions differ from H100/H200 deployments. If you're running custom CUDA kernels, check Blackwell compatibility before provisioning. Teams migrating existing H200 inference stacks should budget 2 to 4 weeks for testing and optimisation, especially to take advantage of FP4 precision.
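A quick pre-flight check is to read the compute capability and driver version that each node reports. The sketch below parses `nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv,noheader` output. It assumes a driver recent enough to expose the `compute_cap` field and assumes data-centre Blackwell reports compute capability 10.0, both of which should be checked against NVIDIA's documentation. The sample text, including the driver versions in it, is made up for illustration.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=name,compute_cap,driver_version",
         "--format=csv,noheader"]


def parse_gpu_rows(csv_text: str) -> list[dict]:
    """Parse 'name, compute_cap, driver_version' rows into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        name, cap, driver = [field.strip() for field in line.split(",")]
        major, minor = cap.split(".")
        rows.append({"name": name,
                     "compute_cap": (int(major), int(minor)),
                     "driver": driver})
    return rows


def is_blackwell_or_newer(row: dict, min_cap: tuple = (10, 0)) -> bool:
    """min_cap is an assumption: 10.0 for data-centre Blackwell parts."""
    return row["compute_cap"] >= min_cap


def query_gpus() -> list[dict]:
    """Run nvidia-smi on the local node (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)


# Illustrative sample only; real output comes from query_gpus().
SAMPLE = "NVIDIA B200, 10.0, 570.00\nNVIDIA H200, 9.0, 550.00"
rows = parse_gpu_rows(SAMPLE)
for row in rows:
    print(row["name"], row["compute_cap"], is_blackwell_or_newer(row))
```

Custom kernels compiled only for Hopper targets will generally need rebuilding for the newer architecture, so this check is a starting point rather than a compatibility guarantee.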
Last reviewed: May 28, 2026. Pricing sourced from GPUaaS.com wholesale provider network and published hyperscaler rate cards. Benchmark data from MLCommons MLPerf Inference v5.0 and GPUaaS.com cluster telemetry. NVIDIA B200 specs from official NVIDIA Blackwell architecture datasheet.



