How much does B200 SXM cost to rent per hour in 2026?

B200 SXM wholesale on-demand pricing runs $6 to $10/GPU/hr in 2026. An 8xB200 SXM cluster costs $48 to $80/hr. Reserved 1-year contracts bring the rate to roughly $4 to $5/GPU/hr. H200 SXM is available at $3.50/GPU/hr on-demand for comparison.

Is B200 SXM better than H200 for LLM inference?

For 70B+ model inference at high concurrency, yes. B200 delivers roughly 2.2x the throughput of H200 on Llama 3 70B. For sub-70B models at standard context lengths, H200 delivers near-identical throughput at about half the cost.

How available is the B200 SXM in 2026?

B200 SXM has 4 to 8 week lead times for new cluster provisioning at most wholesale providers. H200 SXM is available within 24 hours. For teams needing compute in weeks, H200 is the practical choice.

Does B200 SXM support FP4 inference?

Yes. FP4 is a new precision class introduced with Blackwell architecture, reaching 18,000 TFLOPS on B200 SXM. H100 and H200 don't support FP4. For models where FP4 quantisation is acceptable, cost-per-token can be significantly lower than H200 at FP8.

B200 SXM: What Enterprise Buyers Need to Know (2026)

The NVIDIA B200 SXM delivers 2.2x the FP8 inference throughput of an H200, carries 192 GB of HBM3e memory, and runs at 4.0 TB/s of memory bandwidth. It's the fastest GPU NVIDIA has shipped for inference workloads. Enterprise pricing in 2026 runs $6 to $10/GPU/hr at wholesale providers, with availability concentrated at a small number of data centres and wait times measured in weeks rather than hours.

Key takeaways

B200 SXM carries 192 GB HBM3e at 4.0 TB/s and delivers 2.2x the FP8 inference throughput of an H200 SXM on memory-bound workloads like Llama 3 70B serving
Wholesale pricing in 2026 runs $6 to $10/GPU/hr on-demand. An 8xB200 SXM cluster costs $48 to $80/hr, compared to $28/hr for the equivalent 8xH200
B200 is Blackwell architecture, not a memory upgrade. NVLink 5.0 delivers 1.8 TB/s chip-to-chip bandwidth, nearly double the H200's NVLink 4.0 at 900 GB/s
Availability is constrained to a handful of US and EU data centres in 2026. Lead times for new clusters run 4 to 8 weeks. H200 SXM is available in under 24 hours
B200 makes sense for: 405B+ model inference at scale, MoE models too large for the H200's 141 GB, and teams where throughput-per-dollar at high concurrency justifies the premium

In this article

01 Full B200 SXM spec sheet
02 B200 vs H200: what Blackwell actually changed
03 Performance: what the benchmarks show
04 B200 SXM pricing and availability in 2026
05 Which workloads justify the B200 premium
06 Enterprise procurement: what to know before you commit
07 Frequently asked questions

◆ FULL SPEC SHEET

Full B200 SXM spec sheet

Spec	B200 SXM	H200 SXM	H100 SXM5
Architecture	Blackwell (GB100)	Hopper (GH100)	Hopper (GH100)
GPU Memory	192 GB HBM3e	141 GB HBM3e	80 GB HBM3
Memory Bandwidth	4.0 TB/s	4.8 TB/s	3.35 TB/s
FP8 Tensor Core TFLOPS	9,000 TFLOPS	3,958 TFLOPS	3,958 TFLOPS
BF16 Tensor Core TFLOPS	4,500 TFLOPS	1,979 TFLOPS	1,979 TFLOPS
FP4 Tensor Core TFLOPS	18,000 TFLOPS	N/A	N/A
NVLink Generation	NVLink 5.0	NVLink 4.0	NVLink 4.0
NVLink Bandwidth (GPU-to-GPU)	1.8 TB/s	900 GB/s	900 GB/s
NVLink Bandwidth (8-GPU node)	14.4 TB/s	7.2 TB/s	7.2 TB/s
TDP	1,000W	700W	700W
Form Factor	SXM6	SXM5	SXM5
Launch	Q1 2025 (GA)	Q4 2023 (GA 2024)	Q2 2022 (GA 2023)

Source: NVIDIA B200 datasheet. FP8, BF16, and FP4 TFLOPS figures are quoted with sparsity.

◆ B200 VS H200

B200 vs H200: what Blackwell actually changed

The B200 is a full architecture redesign, not a memory upgrade. NVIDIA moved from Hopper (GH100) to Blackwell (GB100) with the B200, which means new compute units, a new NVLink generation, new precision support (FP4), and a new SXM form factor (SXM6). You can't drop a B200 into an existing H100 or H200 HGX chassis. Data centres running B200 need new infrastructure.

The headline compute improvement is substantial. FP8 TFLOPS jump from 3,958 on the H200 to 9,000 on the B200, a 2.27x improvement. FP4 support (a new precision class introduced with Blackwell) reaches 18,000 TFLOPS. For inference workloads where models can be quantised to FP4 without significant accuracy loss, that's a step change in tokens-per-dollar.

One notable spec inversion: the B200 has slightly less memory bandwidth than the H200. The H200 SXM runs at 4.8 TB/s, the B200 at 4.0 TB/s. NVIDIA compensated for this with higher VRAM capacity (192 GB vs 141 GB) and much faster inter-GPU bandwidth via NVLink 5.0. For single-GPU memory-bound workloads, the H200 is actually faster per byte. The B200's advantage is at the cluster level, not per-GPU bandwidth.

VRAM: +36% (192 vs 141 GB)
FP8 TFLOPS: +127% (9,000 vs 3,958)
NVLink bandwidth: +100% (1.8 vs 0.9 TB/s)
Memory bandwidth: -17% (4.0 vs 4.8 TB/s)

⚡ The infrastructure cost buyers miss

B200 uses the SXM6 form factor and runs at 1,000W TDP per GPU, 43% higher than the H200's 700W. An 8xB200 HGX node draws up to 8,000W under full load for the GPUs alone, before CPUs, networking, and cooling are counted. Data centres need upgraded power delivery and cooling infrastructure. If you're co-locating or buying dedicated infrastructure, that's a real cost that doesn't show up in the per-GPU/hr rental rate.

According to NVIDIA's published specifications, the B200 SXM delivers 9,000 FP8 TFLOPS with sparsity, compared to 3,958 TFLOPS for the H200 SXM, a 2.27x improvement in raw compute throughput at the same precision level.

◆ PERFORMANCE

Performance: what the benchmarks show

MLPerf Inference v5.0 is the first round to include B200 SXM results at scale. The headline Llama 3 70B server-mode result puts the B200 at roughly 2.2x the throughput of a single H200 SXM on the same workload. That matches NVIDIA's pre-launch claims, which is worth noting since marketing TFLOPS figures often don't translate to real-world inference gains.

The throughput gap widens further on larger models. Llama 3 405B at FP8 needs roughly 405 GB for weights alone, so it has to be sharded on either GPU, but the B200's 192 GB of VRAM means fewer GPUs per replica than the H200's 141 GB allows. Fewer shards means less inter-GPU communication and higher effective utilisation. For 405B serving at scale, the B200's advantage over H200 in practice is closer to 2.5 to 3x rather than the 2.2x seen on 70B workloads.
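As a sanity check on which models fit on which GPU, a weights-only estimate is simply parameter count times bytes per parameter. The sketch below is a rough sizing aid, not a capacity planner: the `usable_fraction` allowance for KV cache and runtime overhead is a hypothetical default, and the function names are illustrative.

```python
import math

# Bytes per parameter for common inference precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus(params_billion: float, precision: str, vram_gb: float,
             usable_fraction: float = 0.8) -> int:
    """Smallest GPU count whose usable VRAM holds the weights.

    usable_fraction is an assumed allowance: the remainder is left
    for KV cache, activations, and runtime overhead.
    """
    usable_gb = vram_gb * usable_fraction
    return math.ceil(weight_memory_gb(params_billion, precision) / usable_gb)


if __name__ == "__main__":
    # VRAM figures are taken from the spec sheet in this article.
    for gpu_name, vram_gb in (("B200 SXM", 192), ("H200 SXM", 141)):
        for precision in ("fp8", "fp4"):
            count = min_gpus(405, precision, vram_gb)
            print(f"Llama 3 405B {precision} on {gpu_name}: at least {count} GPUs")
```

Under these assumptions, 405B at FP8 needs at least three B200s or four H200s. Real deployments usually round up further to a tensor-parallel degree the serving framework supports, so treat the output as a lower bound.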

FP4 inference support changes the picture further for teams willing to accept the quantisation tradeoff. At FP4 precision, the B200 reaches 18,000 TFLOPS, and for models where FP4 quantisation doesn't materially degrade output quality (primarily smaller, distilled models), the per-token cost drops dramatically. Most frontier-scale production models aren't FP4-deployable yet, but enterprise fine-tunes on smaller base models often are.

Relative inference throughput, Llama 3 70B server mode (FP8)

B200 SXM: ~2.2x H200
H200 SXM: 1.0x baseline
H100 SXM5: ~0.69x

Relative throughput, single GPU, server mode, FP8. Source: MLCommons MLPerf Inference v5.0 and GPUaaS.com cluster telemetry.

The throughput-per-dollar question

At current wholesale pricing, the B200 costs roughly 1.7 to 2.9x the H200's hourly rate ($6 to $10 versus $3.50/GPU/hr), so 2.2x the throughput works out to roughly breakeven on throughput-per-dollar, with the crossover at about $7.70/GPU/hr. The B200 starts to win on cost-per-token at high sustained concurrency, where its higher batch throughput and larger VRAM mean fewer GPUs are needed to hit a given requests-per-second target. For teams running below 60% GPU utilisation, H200 is usually more cost-efficient.

According to GPUaaS.com cluster data, enterprise teams migrating 405B model serving from 4xH200 to 2xB200 configurations see GPU count reductions of 40 to 50%, which offsets the higher per-GPU rate and reduces total cluster cost by 15 to 25% at sustained utilisation above 70%.
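The breakeven point can be worked out directly from the figures quoted in this section. The following is a minimal sketch that assumes the 2.2x speedup holds for your workload and that both GPUs run at the same utilisation; the function names are illustrative.

```python
H200_RATE = 3.50      # $/GPU/hr on-demand, from this article
B200_SPEEDUP = 2.2    # relative throughput vs H200, from this article


def breakeven_rate(baseline_rate: float, speedup: float) -> float:
    """Hourly rate at which the faster GPU matches the baseline's
    throughput-per-dollar."""
    return baseline_rate * speedup


def relative_value(b200_rate: float,
                   h200_rate: float = H200_RATE,
                   speedup: float = B200_SPEEDUP) -> float:
    """B200 throughput-per-dollar divided by H200 throughput-per-dollar.

    Values above 1.0 favour the B200, values below 1.0 favour the H200.
    """
    return (speedup / b200_rate) / (1.0 / h200_rate)


if __name__ == "__main__":
    print(f"Breakeven B200 rate: ${breakeven_rate(H200_RATE, B200_SPEEDUP):.2f}/GPU/hr")
    for rate in (6.00, 8.00, 10.00):
        print(f"B200 at ${rate:.2f}/hr: {relative_value(rate):.2f}x H200 value")
```

At the bottom of the quoted price range the B200 comes out ahead, and at the top it does not, which is why the per-workload speedup and the negotiated rate matter more than the headline numbers.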

◆ PRICING AND AVAILABILITY

B200 SXM pricing and availability in 2026

B200 SXM wholesale pricing in 2026 runs $6 to $10/GPU/hr on-demand depending on region and cluster configuration. Reserved 1-year contracts bring this down to roughly $4 to $5/GPU/hr. That compares to H200 SXM on-demand at $3.50/GPU/hr and reserved at roughly $2.10/GPU/hr at GPUaaS.com.

GPU	On-demand (wholesale)	Reserved 1-yr	8-GPU cluster/hr	Availability
B200 SXM	$6.00 to $10.00/hr	~$4.00 to $5.00/hr	$48 to $80/hr	4 to 8 week lead time
H200 SXM	$3.50/hr	~$2.10/hr	$28/hr	Under 24 hours
H100 SXM5	$1.49 to $2.49/hr	~$0.90 to $1.50/hr	$12 to $20/hr	Under 24 hours

Pricing sourced from GPUaaS.com wholesale provider network, May 2026. On-demand rates. Reserved contracts and bulk pricing available on request.

Availability is the harder constraint for most enterprise buyers in 2026. H200 SXM clusters are available within 24 hours at GPUaaS.com in US (Dallas, Ashburn) and EU (Frankfurt) regions. B200 SXM clusters have 4 to 8 week lead times in most configurations, and spot availability is rare. If you need compute next week, H200 is the practical choice regardless of performance preference.
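One way to compare on-demand and reserved pricing is to ask how busy the cluster has to be before a reservation pays off. The sketch below assumes on-demand capacity is billed only for hours actually used while a reserved contract is billed for every hour of the term; real contracts may differ, so treat this as a first-pass estimate.

```python
def breakeven_utilisation(reserved_rate: float, on_demand_rate: float) -> float:
    """Fraction of hours a GPU must be in use before a reserved
    contract costs less than paying on-demand for the used hours.

    Assumes reserved is billed for all hours and on-demand only for
    hours in use (an assumption, not a contract term from the article).
    """
    return reserved_rate / on_demand_rate


# (reserved $/GPU/hr, on-demand $/GPU/hr), taken from the pricing table.
RATES = {
    "B200 SXM, cheapest reserved vs priciest on-demand": (4.00, 10.00),
    "B200 SXM, priciest reserved vs cheapest on-demand": (5.00, 6.00),
    "H200 SXM": (2.10, 3.50),
}

if __name__ == "__main__":
    for label, (reserved, on_demand) in RATES.items():
        share = breakeven_utilisation(reserved, on_demand)
        print(f"{label}: reserved wins above {share:.0%} utilisation")
```

With the table's rates, the B200 breakeven lands anywhere between about 40% and 83% utilisation depending on which end of each range you are quoted, and the H200 breakeven sits at about 60%.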

⚠ Hyperscaler B200 pricing

AWS, Azure, and GCP B200 instances aren't broadly available to enterprise customers at time of writing. Where they are available (primarily through committed use agreements with hyperscaler enterprise accounts), pricing runs $15 to $25/GPU/hr equivalent. Wholesale providers offer the same silicon at roughly 40 to 60% of hyperscaler rates, though with longer lead times than H200.

◆ WORKLOAD FIT

Which workloads justify the B200 premium

The B200's cost premium is real. Whether it's justified depends entirely on whether your workload can actually use the additional TFLOPS and VRAM. Most production workloads in 2026 don't saturate a single H200. The teams that get value from B200 are running at a scale where the hardware is genuinely the bottleneck.

✓ B200 SXM justifies the premium

●405B+ model inference at high concurrency. Llama 3 405B at FP8 needs roughly 405 GB for weights alone, so it is sharded on either GPU. The B200's 192 GB per GPU cuts the number of shards and leaves more headroom for KV cache. On H200, the same workload spans more GPUs, with inter-GPU communication overhead reducing effective throughput

●Large MoE architectures. Sparse mixture-of-experts models like Mixtral 8x22B activate a subset of parameters per token but still load the full model into VRAM. At 141B total parameters, Mixtral 8x22B needs roughly 141 GB for weights at FP8, which overflows a single H200's 141 GB once KV cache at production batch sizes is included, while a B200 holds the same weights with about 50 GB to spare

●Multi-model serving from a single GPU. 192 GB VRAM allows running multiple smaller models simultaneously. A team serving Llama 3 8B, a 13B reranker, and an embedding model from a single B200 avoids the GPU fragmentation problem that makes multi-model deployments expensive on H200

●Pre-training runs requiring maximum throughput. For teams doing full pre-training on models above 30B parameters where cluster cost is dominated by wallclock time rather than GPU-hours, B200's compute density per node reduces training time meaningfully

●FP4 inference on quantisation-tolerant models. The B200's 18,000 FP4 TFLOPS is genuinely new capability. Enterprise fine-tunes of smaller models (7B to 30B) where FP4 quantisation is acceptable can achieve cost-per-token significantly below H200 at FP8

✗ H200 or H100 is the better call

●Sub-70B inference at standard context. H200 delivers near-identical throughput to B200 on models below 70B at standard context lengths. The B200 premium buys nothing you can use

●Teams where time-to-compute matters more than throughput. If you need a cluster this week rather than in 6 weeks, H200 wins by default

●GPU utilisation below 60%. B200's throughput advantage only materialises at high sustained utilisation. At 40 to 50% utilisation, you're paying the premium without capturing the benefit

●Fine-tuning and experimentation workloads. QLoRA and LoRA fine-tuning on 70B models fits comfortably in H200's 141 GB. Development cycles and iteration don't justify B200 pricing

◆ ENTERPRISE PROCUREMENT

Enterprise procurement: what to know before you commit

B200 procurement in 2026 is materially different from H100 or H200 procurement. The supply dynamics, lead times, and contract structures are all more complex. Here's what enterprise buyers typically discover after the sales call rather than during it.

Verify actual availability before entering procurement discussions

Many GPU providers list B200 capacity on their websites but don't actually have it available on short notice. Ask specifically: how many B200 GPUs are physically in-rack right now, what's the current queue length, and what's the realistic provisioning timeline for your cluster size. The honest answer is often "4 to 8 weeks" rather than the "available now" language in the marketing copy.

Benchmark your specific workload before committing to reserved pricing

The 2.2x throughput figure comes from a specific benchmark profile (Llama 3 70B in server mode at FP8). Your workload may see more or less. Teams serving multiple smaller models, running batch jobs with variable context lengths, or operating at sub-70% utilisation often see real-world gains of 1.4 to 1.7x rather than the 2.2x headline number. Benchmark before signing a 1-year contract.

Model your cost-per-token, not cost-per-GPU-hour

GPU-hr pricing comparisons are misleading for B200 vs H200 decisions. What matters is cost-per-million-tokens at your target throughput and concurrency. If 2 B200s at $8/hr deliver the same output as 4 H200s at $3.50/hr, the B200 cluster costs $16/hr against $14/hr for the H200s, so H200 still wins. B200 only comes out ahead once its rate drops below $7/GPU/hr or each B200 replaces more than about 2.3 H200s. The maths only works in B200's favour at high sustained throughput. Build the model before buying.
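A cost-per-token model can start as small as the sketch below. The throughput value is a hypothetical placeholder to be replaced with your own benchmark results; only the GPU counts and hourly rates come from the example in this section.

```python
def cost_per_million_tokens(gpu_count: int, rate_per_gpu_hr: float,
                            tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens on a cluster.

    tokens_per_second is the sustained throughput of the whole
    cluster, not of a single GPU.
    """
    cluster_cost_per_hr = gpu_count * rate_per_gpu_hr
    tokens_per_hr = tokens_per_second * 3600
    return cluster_cost_per_hr / tokens_per_hr * 1_000_000


# Hypothetical: both clusters sustain the same aggregate throughput.
ASSUMED_TOKENS_PER_SECOND = 10_000

if __name__ == "__main__":
    scenarios = {
        "2x B200 at $8.00/hr": (2, 8.00),
        "2x B200 at $6.00/hr": (2, 6.00),
        "4x H200 at $3.50/hr": (4, 3.50),
    }
    for label, (count, rate) in scenarios.items():
        cost = cost_per_million_tokens(count, rate, ASSUMED_TOKENS_PER_SECOND)
        print(f"{label}: ${cost:.3f} per million tokens")
```

Because throughput is held equal across the scenarios, the ranking depends only on cluster cost per hour. Substituting measured throughput for each configuration is what turns this into a usable comparison.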

Confirm software stack compatibility

vLLM, TensorRT-LLM, and Triton all support B200, but CUDA driver requirements and library versions differ from H100/H200 deployments. If you're running custom CUDA kernels, check Blackwell compatibility before provisioning. Teams migrating existing H200 inference stacks should budget 2 to 4 weeks for testing and optimisation, especially to take advantage of FP4 precision.
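A quick pre-flight check on a provisioned node can catch the most common mismatch, a framework build that predates Blackwell. The sketch below uses PyTorch's device query functions when PyTorch is installed. The minimum CUDA version is an assumption to verify against NVIDIA's release notes, and `check_stack` is an illustrative helper, not part of any library.

```python
from typing import List, Optional, Tuple

BLACKWELL_MAJOR = 10        # B200 reports compute capability 10.0
ASSUMED_MIN_CUDA = (12, 8)  # assumption: confirm in NVIDIA's CUDA release notes


def parse_version(text: str) -> Tuple[int, int]:
    """Turn a version string such as '12.8' into (12, 8)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def check_stack(capability: Tuple[int, int],
                cuda_version: Optional[str]) -> List[str]:
    """Return a list of problems found; an empty list means none."""
    problems = []
    if capability[0] < BLACKWELL_MAJOR:
        problems.append(
            f"compute capability {capability[0]}.{capability[1]} is pre-Blackwell")
    if cuda_version is None:
        problems.append("framework build has no CUDA support")
    elif parse_version(cuda_version) < ASSUMED_MIN_CUDA:
        problems.append(
            f"CUDA {cuda_version} is older than the assumed minimum "
            f"{ASSUMED_MIN_CUDA[0]}.{ASSUMED_MIN_CUDA[1]}")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch not installed; pass values to check_stack() manually")
    else:
        if torch.cuda.is_available():
            found = check_stack(torch.cuda.get_device_capability(0),
                                torch.version.cuda)
            print(found or "no problems found")
        else:
            print("no CUDA device visible")
```

This only covers the framework and driver layer. Custom CUDA kernels still need to be rebuilt and tested for the new architecture separately.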

◆ FAQ

Frequently asked questions

Last reviewed: May 28, 2026. Pricing sourced from GPUaaS.com wholesale provider network and published hyperscaler rate cards. Benchmark data from MLCommons MLPerf Inference v5.0 and GPUaaS.com cluster telemetry. NVIDIA B200 specs from official NVIDIA Blackwell architecture datasheet.
