What's the realistic throughput for B200 on a production LLM workload?

B200 ranges from ~6,972 tok/s/GPU at FP8 (Llama 2 70B offline) to 60,000 tok/s/GPU at FP4 with disaggregated serving (GPT-OSS-120B). The 8.6x spread depends on model, precision, framework, batch size, and serving topology.

Should I deploy B200 at FP4 or FP8 in production?

FP8 is the safer default for production with ~2x speedup over FP16 and minimal accuracy regression. FP4 can deliver up to 4x latency reduction but degrades accuracy on reasoning, math, and chain-of-thought tasks. Always run task-specific accuracy evals before moving to FP4.

What's the cost per million tokens on B200?

NVIDIA reports $0.02 per million tokens on GPT-OSS-120B with TensorRT-LLM and FP4 as of April 2026, down from $0.11 at launch. The drop took two months and came from software updates alone. Actual cost depends on utilisation, model, precision, and framework.

Is B200 worth it over H200?

B200 delivers ~4x H200 throughput on GPT-OSS-120B at FP4 with TensorRT-LLM. On smaller models or batches the gap narrows. With B200 from ~$4.50/GPU/hr and H200 from ~$3.00/GPU/hr, B200 needs to deliver at least 50% more throughput on your workload to break even on per-token cost.
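The break-even arithmetic in this answer can be sketched in a few lines. This is a minimal illustration using the indicative hourly rates quoted in this article; the function names are hypothetical, and `measured_speedup` stands for whatever throughput ratio your own benchmark produces.

```python
def breakeven_speedup(new_price_per_hr: float, old_price_per_hr: float) -> float:
    """Throughput multiple the pricier GPU needs to match per-token cost."""
    return new_price_per_hr / old_price_per_hr


def relative_cost_per_token(new_price_per_hr: float,
                            old_price_per_hr: float,
                            measured_speedup: float) -> float:
    """Per-token cost of the new GPU relative to the old one (<1.0 is cheaper)."""
    return breakeven_speedup(new_price_per_hr, old_price_per_hr) / measured_speedup


# Indicative rates from this article: B200 ~$4.50/GPU/hr, H200 ~$3.00/GPU/hr.
print(breakeven_speedup(4.50, 3.00))             # 1.5
print(relative_cost_per_token(4.50, 3.00, 4.0))  # 0.375
print(relative_cost_per_token(4.50, 3.00, 1.2))  # above 1.0: B200 costs more per token
```

The second call uses the ~4x headline ratio; the third uses a hypothetical 1.2x speedup to show a case where the cheaper GPU still wins on per-token cost.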

Should I commit to a long-term B200 contract before benchmarking?

No. Software stacks move quarterly and accuracy-throughput tradeoffs are workload-specific. Use short-term contracts to benchmark on real hardware before signing a 24- or 36-month commitment.

How to Benchmark Your Workload Before Committing to B200 (2026)

NVIDIA's B200 hits 60,000 tokens per second per GPU on GPT-OSS-120B with the right software stack, per SemiAnalysis InferenceX data from April 2026. That's a 4x throughput improvement over H200 and pushes cost per million tokens down to about $0.02. Those are the headline numbers. They are also not your numbers. Your workload's tokens per second, time to first token, and tail latency will not match MLPerf's, and that gap is where most B200 procurement decisions go wrong. This is how to benchmark your actual workload before committing to a B200 cluster.

Key takeaways

Public MLPerf and vendor numbers are directional, not predictive. B200 ranges from ~6,972 tok/s/GPU at FP8 (Llama 2 70B offline) to 60,000 tok/s/GPU at FP4 with disaggregated serving (GPT-OSS-120B). The 8.6x spread is the workload, not the GPU (SemiAnalysis InferenceX, April 2026; MLPerf Inference v5.1)
Four metrics determine whether B200 wins for your workload: TTFT (time to first token), TPOT/ITL (per-token latency), throughput (tokens/sec per GPU), and tail latency (P95/P99). Pick the wrong one and the procurement decision optimises the wrong cost
FP4 on Blackwell delivers roughly 2x the TFLOPS of FP8, but accuracy can degrade on reasoning, math, and chain-of-thought tasks. FP8 is the safe default; FP4 requires task-specific accuracy evals before production
B200 cost per million tokens dropped from $0.11 to $0.02 in two months on GPT-OSS-120B from TensorRT-LLM updates alone. Benchmark the software stack you'll actually run, not the one that shipped with the GPU (NVIDIA Developer, April 2026)
Pre-commitment checklist: measure on the model and sequence lengths you'll serve, benchmark at production-like concurrency, include P95/P99, run with the quantisation you'll deploy, and validate on the inference framework you'll use (vLLM, SGLang, TensorRT-LLM)
GPUaaS.com offers short-term B200 and B300 contracts so the benchmarking step doesn't require a multi-year commitment. H100 from ~$2.50/GPU/hr, H200 from ~$3.00/GPU/hr, B200 and B300 from ~$4.50/GPU/hr

B200 procurement conversations break the same way every time. Someone reads that B200 hits 60,000 tokens per second per GPU and starts modelling the cost case off that number. Then the workload lands, and the team is getting 4,000 tokens per second per GPU and a P99 latency that's three times worse than the H100 cluster they were going to replace. The 60,000 number was real. It was also for GPT-OSS-120B at FP4 with disaggregated serving on the latest TensorRT-LLM stack. Three of those four conditions don't apply to most production workloads. This guide walks through what to benchmark, how to benchmark it, and the procurement questions to ask before signing for a B200 cluster. For B200 spec context, see the B200 SXM enterprise buyer's guide. For the broader pricing structure, see the GPU pricing guide.

In this article

01 Why MLPerf numbers aren't your numbers
02 The four metrics that actually matter
03 A benchmarking methodology for B200 procurement
04 FP4 vs FP8: the accuracy-throughput tradeoff
05 The software stack moves the numbers more than you think
06 Pre-commitment procurement checklist
07 Frequently asked questions

◆ HEADLINE NUMBERS

Why MLPerf numbers aren't your numbers

Here's the spread of public B200 per-GPU throughput numbers, all published by or derived from credible sources in the last 12 months:

Source	Model	B200 tok/s/GPU	Conditions
MLPerf Inference v5.1	Llama 2 70B (offline)	~12,841	FP4, 8-GPU system result divided by 8
Derived from H100 ratio	Llama 2 70B (offline)	~6,972	FP8, derived from TFLOPS ratio
SemiAnalysis InferenceX	GPT-OSS-120B	60,000	FP4, TensorRT-LLM, disaggregated
arxiv MR-MXFP4 paper	Llama-3.3-70B (large batch)	~15,000	MR-MXFP4, vLLM, single GPU

Sources: MLPerf Inference v5.1 (September 2025); SemiAnalysis InferenceX via NVIDIA Developer (April 2026); arxiv 2509.23202 (March 2026). All numbers per-GPU, derived from multi-GPU runs where noted.

The throughput ratio between the lowest and highest credible B200 number on this table is roughly 8.6x. The hardware did not change. What changed: the model, the quantisation (FP8 vs FP4 vs MR-MXFP4), the inference framework (TensorRT-LLM vs vLLM), the serving strategy (disaggregated vs co-located), and the workload pattern (offline batch vs online streaming).
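As a quick check on the 8.6x figure, the spread can be recomputed from the table rows. This is only a sketch: the dictionary keys are shorthand labels invented here, and the values are copied from the table above.

```python
# Per-GPU throughput figures copied from the table in this article (tok/s/GPU).
published = {
    "MLPerf v5.1, Llama 2 70B offline, FP4": 12_841,
    "Derived from H100 ratio, Llama 2 70B offline, FP8": 6_972,
    "InferenceX, GPT-OSS-120B, FP4 disaggregated": 60_000,
    "MR-MXFP4 paper, Llama-3.3-70B large batch": 15_000,
}


def spread(figures: dict[str, float]) -> float:
    """Ratio of the highest to the lowest figure."""
    return max(figures.values()) / min(figures.values())


print(f"{spread(published):.1f}x")  # 8.6x
```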

Inference now accounts for roughly two-thirds of all AI compute in 2026, per Spheron's April 2026 analysis. Models are trained once; they serve requests millions of times. That shift means the GPU procurement decision for inference deserves the same rigor that training procurement has always gotten. The published numbers tell you what's possible on B200. They tell you nothing about what you'll get on your workload.

According to NVIDIA Developer's MLPerf Inference v6.0 results from April 2026, GB300 NVL72 delivered 2.5 million tokens per second on DeepSeek-R1, a 2.7x improvement over the GB300 NVL72 debut submission six months earlier. The hardware was unchanged. The 2.7x came from TensorRT-LLM software updates alone.

If you're modelling B200 procurement off a number that came from a different model, different precision, different framework, or different serving strategy than yours, you're not modelling B200 procurement. You're modelling someone else's workload. For a procurement-side TCO breakdown of the surrounding costs, see the real TCO of a GPU cluster in 2026.

◆ FOUR METRICS

The four metrics that actually matter

A single throughput number cannot describe LLM inference performance. The BentoML LLM Inference Handbook, Anyscale's documentation, and Databricks's benchmarking guidance all converge on the same four metrics. These are the ones to model.

Metric	What it measures	Why it matters for procurement
TTFT	Time to first token. Dominated by prefill, scales with input length.	Interactive workloads (chat, code assist) live or die on this. Target: sub-200ms for production.
TPOT / ITL	Time per output token / inter-token latency. Dominated by decode, scales with model size.	100ms TPOT = ~10 tok/s = faster than reading speed. Below 50ms feels instant.
Throughput	Tokens/second per GPU (or per system). Sensitive to batch size.	Batch workloads (summarisation, agents) optimise here. Drives cost per million tokens.
P95 / P99 latency	Tail latency under load. The worst 5% (or 1%) of requests.	Mean throughput hides tail spikes that break SLAs. P99 / mean ratio under load tells you stability.
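The four metrics in the table above can all be derived from per-token arrival timestamps. The sketch below assumes your load generator records, for each request, the send time and the arrival time of every output token; the `RequestTrace` type and `summarise` function are hypothetical names, throughput counts output tokens only, and tail latency is reported on end-to-end request time.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    sent_at: float            # wall-clock seconds when the request was issued
    token_times: list[float]  # wall-clock arrival time of each output token


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def summarise(traces: list[RequestTrace], n_gpus: int) -> dict[str, float]:
    """Assumes every trace has at least one token and some have two or more."""
    ttft = [t.token_times[0] - t.sent_at for t in traces]
    # Per-request TPOT: decode time divided by the number of inter-token gaps.
    tpot = [(t.token_times[-1] - t.token_times[0]) / (len(t.token_times) - 1)
            for t in traces if len(t.token_times) > 1]
    e2e = [t.token_times[-1] - t.sent_at for t in traces]
    wall = max(t.token_times[-1] for t in traces) - min(t.sent_at for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return {
        "ttft_mean_s": mean(ttft),
        "tpot_mean_s": mean(tpot),
        "tok_per_s_per_gpu": total_tokens / wall / n_gpus,
        "e2e_p95_s": percentile(e2e, 95),
        "e2e_p99_s": percentile(e2e, 99),
    }
```

Recording all four from the same traces keeps them comparable: a throughput figure from one run and a P99 from another describe two different load conditions.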

Infercom's analysis puts the tradeoff plainly: a provider that delivers 400 tok/s at 3 AM but 150 tok/s during business hours is not a 400 tok/s provider for your production workload. That same logic applies to B200 procurement. A B200 cluster that hits 60,000 tok/s/GPU in an offline batch benchmark is not necessarily a 60,000 tok/s/GPU cluster for your streaming production traffic.

Match the metric to the workload

Chat / code assist: TTFT first, then TPOT, then P99. Throughput is a distant fourth.
Agentic workflows: Throughput and TPOT. Latency for the first token matters less than total task completion time.
Batch processing: Throughput dominates. Per-token latency only matters for total job time.
Real-time voice / video: TTFT, TPOT, P95. Throughput is hardware planning, not user experience.

For a deeper look at how these metrics interact with KV cache and decode-side optimisations, see the KV cache and inference cost guide.

◆ METHODOLOGY

A benchmarking methodology for B200 procurement

A pre-commitment benchmark is not a microbenchmark and it is not an MLPerf submission. It's an attempt to predict what a production B200 deployment will actually deliver. Six steps that hold up:

1. Use your real model. If you serve Llama-3.3-70B in production, benchmark Llama-3.3-70B. Different model architectures hit different bottlenecks on B200. A Mixture-of-Experts model with sparse activation behaves nothing like a dense transformer at the same parameter count. The MLPerf Llama 2 70B numbers tell you almost nothing about a DeepSeek-R1 deployment.

2. Use your real sequence lengths. Anyscale's benchmarking guidance is explicit: input length drives prefill cost, output length drives decode cost. A workload with 4,096-token inputs and 256-token outputs is prefill-dominated and largely compute-bound. A workload with 256-token inputs and 4,096-token outputs is decode-dominated and bound by memory bandwidth, because weights and KV cache are re-read for every generated token. B200's 8 TB/s HBM3e bandwidth helps most on the decode side, so the relative gain over H100 is workload-dependent. Test with the actual distribution.
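One way to "test with the actual distribution" is to replay length pairs drawn from production logs rather than fixed synthetic lengths. A minimal sketch, assuming you can export (input tokens, output tokens) per request; the function name and the sample log values are hypothetical.

```python
import random


def sample_length_pairs(logged_pairs: list[tuple[int, int]],
                        n: int, seed: int = 0) -> list[tuple[int, int]]:
    """Bootstrap n (input_tokens, output_tokens) pairs from production logs.

    Sampling whole pairs, rather than inputs and outputs separately, keeps
    any correlation between prompt length and response length.
    """
    rng = random.Random(seed)  # fixed seed so benchmark runs are repeatable
    return rng.choices(logged_pairs, k=n)


# Hypothetical log extract: mostly long-prompt/short-answer, some the reverse.
logged = [(4096, 256), (3800, 190), (4100, 310), (256, 4096), (512, 2048)]
workload = sample_length_pairs(logged, n=1000)
prefill_share = sum(i for i, _ in workload) / sum(i + o for i, o in workload)
print(f"{prefill_share:.0%} of tokens are prompt tokens")
```

The prompt-token share gives a rough sense of whether the replayed workload leans toward prefill or decode.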

3. Use your real concurrency. A single-stream benchmark tells you peak per-request performance. A 64-concurrent benchmark tells you what happens when the queue fills. Production concurrency is rarely 1 and rarely 1,000. Find your real number from your existing load and test at 0.5x, 1x, and 2x. The P99 / mean ratio at 2x is the number that matters for capacity planning.
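The 0.5x / 1x / 2x sweep can be sketched as a closed-loop load generator. This is an illustration under stated assumptions: `send_request` is a placeholder for your own client call, `fake_request` only sleeps, and latency here is measured from the moment a request takes a concurrency slot.

```python
import asyncio
import time
from statistics import mean
from typing import Awaitable, Callable


def sweep_levels(baseline: int) -> list[int]:
    """Concurrency levels at 0.5x, 1x, and 2x the production baseline."""
    return [max(1, round(baseline * f)) for f in (0.5, 1.0, 2.0)]


async def run_at_concurrency(send_request: Callable[[], Awaitable[None]],
                             concurrency: int, n_requests: int) -> list[float]:
    """Closed-loop load: at most `concurrency` requests in flight at once."""
    gate = asyncio.Semaphore(concurrency)
    latencies: list[float] = []

    async def one_request() -> None:
        async with gate:
            start = time.perf_counter()
            await send_request()
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(one_request() for _ in range(n_requests)))
    return latencies


def p99_over_mean(latencies: list[float]) -> float:
    ordered = sorted(latencies)
    rank = (99 * len(ordered) + 99) // 100  # nearest-rank P99
    return ordered[rank - 1] / mean(ordered)


async def fake_request() -> None:
    await asyncio.sleep(0.001)  # stand-in for a real streaming call


for level in sweep_levels(64):
    samples = asyncio.run(run_at_concurrency(fake_request, level, 200))
    print(level, round(p99_over_mean(samples), 2))
```

Swap `fake_request` for a coroutine that calls your serving endpoint and the same loop yields the P99 / mean ratio at each level.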

4. Use your real inference framework. vLLM, SGLang, and TensorRT-LLM are not interchangeable. They have different scheduling strategies, different KV cache management, and different quantisation kernel support. Benchmark the one you'll actually deploy. According to the arxiv MR-MXFP4 study (March 2026), small-batch workloads on vLLM saw negligible MR-MXFP4 speedup over BF16, while large-batch workloads saw a 2.2x increase. Same hardware, same framework, same quantisation scheme, and the conclusion flips with batch size.

5. Use your real quantisation. If your production model runs FP8, benchmark FP8. If you're considering FP4, benchmark FP4 alongside an accuracy eval on your task. Mify-Coder's quantisation study found FP8 cost a 1.35% accuracy regression versus the FP16 baseline (acceptable for most production); FP4 accuracy varies widely by model and calibration. Don't benchmark FP4 throughput without measuring the corresponding accuracy hit on tasks you care about.

6. Measure all four metrics with statistical rigor. Run for long enough that warmup effects don't dominate (typically 10+ minutes at steady-state concurrency). Record TTFT, TPOT, throughput, and P95/P99 in the same run. Repeat at least 3 times. Run-to-run variation above ~5% means something is unstable; investigate before procuring.
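The repeat-and-compare rule in step 6 can be checked mechanically. The sketch below reads the ~5% threshold as a coefficient of variation (standard deviation divided by mean) across runs, which is an assumption, not something the guidance specifies; the function names are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values: list[float]) -> float:
    """Sample standard deviation as a fraction of the mean."""
    return stdev(values) / mean(values)


def is_stable(run_results: list[float], max_cv: float = 0.05) -> bool:
    """True when repeated runs of one metric agree to within max_cv."""
    if len(run_results) < 3:
        raise ValueError("repeat the benchmark at least 3 times")
    return coefficient_of_variation(run_results) <= max_cv


# Hypothetical tok/s/GPU from three repeats of the same configuration.
print(is_stable([1000.0, 1010.0, 990.0]))   # True  (CV = 1%)
print(is_stable([1000.0, 1200.0, 800.0]))   # False (CV = 20%)
```

Apply the same check to each metric separately; throughput can be stable while P99 is not.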

⚠ Watch out

Averaging per-request ITL first and then averaging across requests is mathematically different from computing ITL across all token timestamps in aggregate. The first method (used in the archived ray-project/llmperf) overweights short requests and can hide tail issues. Verify how your benchmarking tool computes ITL before trusting the number.
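A small worked example makes the difference concrete. This is a sketch with made-up timestamps: one short, slow request and one long, fast request. The function names are hypothetical and do not correspond to any particular benchmarking tool.

```python
from statistics import mean


def gaps(token_times: list[float]) -> list[float]:
    """Inter-token gaps for one request."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


def itl_mean_of_request_means(requests: list[list[float]]) -> float:
    """Average each request first, then average the per-request averages."""
    return mean(mean(gaps(r)) for r in requests if len(r) > 1)


def itl_pooled(requests: list[list[float]]) -> float:
    """Average over every inter-token gap from every request."""
    return mean(g for r in requests for g in gaps(r))


short_slow = [0.0, 0.1, 0.2]                 # 2 gaps of 100 ms
long_fast = [i * 0.02 for i in range(101)]   # 100 gaps of 20 ms
requests = [short_slow, long_fast]
print(round(itl_mean_of_request_means(requests) * 1000, 1))  # 60.0 ms
print(round(itl_pooled(requests) * 1000, 1))                 # 21.6 ms
```

The two-gap request carries half the weight in the first figure but under 2% of the weight in the second, which is why the same traces produce nearly a 3x difference.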

◆ PRECISION

FP4 vs FP8: the accuracy-throughput tradeoff

FP4 is the Blackwell-specific lever. B200 delivers approximately 18,000 sparse FP4 TFLOPS versus approximately 9,000 sparse FP8 TFLOPS on the same chip, a clean 2x ratio at the kernel level per Spheron's March 2026 analysis. End-to-end inference speedup is typically 1.3x-2.2x depending on batch size and whether the workload is memory-bound or compute-bound.

The catch is accuracy. The arxiv "Win Fast or Lose Slow" study (May 2025) puts it bluntly: FP8 typically delivers up to 2x latency speedup while maintaining near-lossless output quality and is widely adopted in production. FP4 can deliver up to 4x latency reduction but often causes severe degradation in model performance, limiting its standalone use. NVIDIA's NVFP4 format with micro-block scaling (FP8 E4M3 scale per 16 FP4 values) closes part of this gap, but accuracy still depends on whether the model was calibrated for FP4 post-training.
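To see why block scaling helps but does not remove the accuracy risk, here is a simplified sketch of block-scaled FP4 quantisation. It uses the FP4 E2M1 magnitude grid and one scale per 16 values as described above, but it is an illustration only: the scale is kept as a plain Python float rather than rounded to FP8 E4M3, no per-tensor scale is applied, and the function names are hypothetical.

```python
import math

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # FP4 E2M1 grid
BLOCK_SIZE = 16


def quantise_block(block: list[float]) -> tuple[float, list[float]]:
    """Return (scale, FP4 grid values) for one block of up to 16 numbers."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_MAGNITUDES[-1]  # map the block's largest value to 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda g: abs(g - target))
        codes.append(math.copysign(nearest, x))
    return scale, codes


def roundtrip(values: list[float]) -> list[float]:
    """Quantise then dequantise, block by block."""
    out: list[float] = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantise_block(values[start:start + BLOCK_SIZE])
        out.extend(scale * c for c in codes)
    return out


# One outlier sets the scale for its block, so small neighbours round to zero.
block = [0.1] * 15 + [6.0]
restored = roundtrip(block)
print(restored[0], restored[-1])  # 0.0 6.0
```

With only eight magnitudes per sign, a single outlier in a block decides how much resolution the other 15 values get.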

Where FP4 helps less than the spec suggests:

Reasoning and chain-of-thought. Errors compound across reasoning chains. FP4 quantisation noise on a 50-step reasoning chain can flip the answer. Run task-specific accuracy evals (GSM8K, MATH, MMLU-Pro) before deploying.
Math and scientific tasks. Same compounding-error problem.
Small-batch inference. Memory-bound workloads benefit from weight-only quantisation, but FP4 activation quantisation brings little speedup at batch sizes below ~8.
Uncalibrated models. Dynamic FP4 quantisation applied without calibration produces significantly larger accuracy gaps than PTQ-calibrated weights. If pre-calibrated FP4 weights for your model variant don't exist, FP8 is the safer default.

Where FP4 wins:

Large-batch generation. Batch sizes of 8 or higher are where FP4 throughput improvements materialise.
Memory-bound workloads. A 4x reduction in weight memory versus FP16 means fewer bytes read per generated token, which compounds the kernel speedup.
Energy and inference economics at scale. H100 at FP16 consumes ~10 joules per token; B200 at FP4 drops to ~0.2-0.4 joules per token per ifactoryapp's May 2026 analysis. That's a 25-50x energy efficiency improvement per token; at 100M tokens/day it translates directly to power and cooling savings in the cluster TCO.
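The energy figures in the last bullet convert to daily kWh as follows. This is plain arithmetic on the per-token numbers quoted above; it covers only the energy those figures account for and ignores cooling overhead, which is an assumption of this sketch.

```python
JOULES_PER_KWH = 3.6e6


def daily_energy_kwh(tokens_per_day: float, joules_per_token: float) -> float:
    """Energy per day for a given token volume and per-token energy cost."""
    return tokens_per_day * joules_per_token / JOULES_PER_KWH


TOKENS_PER_DAY = 100e6
h100_fp16 = daily_energy_kwh(TOKENS_PER_DAY, 10.0)
b200_fp4_low = daily_energy_kwh(TOKENS_PER_DAY, 0.2)
b200_fp4_high = daily_energy_kwh(TOKENS_PER_DAY, 0.4)

print(f"H100 FP16: {h100_fp16:.0f} kWh/day")                         # 278
print(f"B200 FP4:  {b200_fp4_low:.1f}-{b200_fp4_high:.1f} kWh/day")  # 5.6-11.1
print(f"ratio: {h100_fp16 / b200_fp4_high:.0f}-{h100_fp16 / b200_fp4_low:.0f}x")  # 25-50x
```

The ratio does not depend on volume; the token count only sets the absolute kWh saved.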

According to NVIDIA Developer's April 2026 data, B200 cost per million tokens on GPT-OSS-120B dropped from $0.11 at launch to $0.02 within two months, a roughly 5.5x improvement from software alone. That's TensorRT-LLM updates compounding kernel fusion, quantisation, and scheduling improvements on the same silicon.

The procurement implication is straightforward: benchmark at the precision you'll deploy at, and run the accuracy eval at the same time. A throughput win that costs you 3 points on MMLU-Pro might not be a win. For more on inference economics and the token-cost framing, see tokenmaxxing and exploding enterprise AI bills.

◆ SOFTWARE

The software stack moves the numbers more than you think

The GB300 NVL72 result jumped 2.7x in six months from software updates alone, per NVIDIA Developer's MLPerf Inference v6.0 analysis. Same hardware, same model, different TensorRT-LLM version. That's not edge-case behaviour; it's the central fact of B200 procurement in 2026.

Three framework choices to benchmark explicitly:

Framework	Strengths	Tradeoffs
TensorRT-LLM	Highest published throughput on B200. NVFP4 support. Disaggregated serving.	Compilation step, less flexible. NVIDIA-only.
vLLM	Wide model support, paged attention, fast iteration. MR-MXFP4 support landing in 2026.	Throughput typically below TensorRT-LLM peak on B200.
SGLang	Structured generation, RadixAttention prefix caching, strong for agentic and tool-use workloads.	Smaller ecosystem than vLLM.

Disaggregated serving (splitting prefill and decode across separate GPU pools) is one of the biggest B200 wins, but it requires infrastructure work most teams haven't done. Smaller stack updates add up too: Lambda's MLPerf Inference v5.1 submission on 8x HGX B200 saw 15.4% gains over v5.0 from TensorRT 10.11 + CUDA 12.9 + Ubuntu 24.04 alone. The software stack moves quarterly. Whatever you benchmark this month is likely a lower bound for what production will deliver in three months.

⚡ Software stack reality check

The B200 number to commit to is not the highest published. It's the one your team can actually reproduce on your infrastructure within 60 days of cluster delivery, with your model, framework, and serving topology. Anything above that is upside.

◆ CHECKLIST

Pre-commitment procurement checklist

A short list to run before signing for B200 capacity. Each item should be answered with a number from your own benchmark, not a spec sheet.

What is our TTFT, TPOT, throughput, and P99 on the model we'll deploy, at production sequence lengths and concurrency, on the framework we'll use? If any of these four are unknown, the procurement model is incomplete.
What is the FP4 accuracy delta versus FP8 on the tasks we care about? Run an eval, not a guess. GSM8K, MATH, MMLU-Pro, or domain-specific evals.
What is the P99 / mean ratio at 2x our projected concurrency? A ratio above ~3-4x means the cluster is underprovisioned or the framework is mismanaging the queue.
What's the cost per million tokens at production utilisation? Use real numbers: (GPU-hours × $/GPU-hr) / (millions of tokens served). Not the MLPerf-derived number.
What's the upgrade path if the workload changes? B200 today, but what if next year's model needs B300 or GB300 NVL72? A short-term contract preserves optionality; a 36-month commitment doesn't.
What's the actual utilisation we expect? Per Cast AI's 2026 data, average GPU utilisation across 23,000 measured clusters is 5%. Above 70% utilisation, ownership starts to make sense. Below, contracts win. Most teams sit far below.
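The cost-per-million-tokens and utilisation items in this checklist combine into one calculation. A minimal sketch, assuming the indicative $4.50/GPU/hr B200 rate from this article and treating utilisation as the share of paid GPU-hours spent serving at the measured rate; the function names are hypothetical.

```python
def cost_per_million_tokens(gpu_hours: float, price_per_gpu_hr: float,
                            tokens_served: float) -> float:
    """(GPU-hours x $/GPU-hr) divided by millions of tokens served."""
    return gpu_hours * price_per_gpu_hr / (tokens_served / 1e6)


def cost_per_million_at_rate(price_per_gpu_hr: float, tok_per_s_per_gpu: float,
                             utilisation: float) -> float:
    """Same figure from a sustained per-GPU rate and a utilisation fraction."""
    tokens_per_paid_hour = tok_per_s_per_gpu * 3600 * utilisation
    return cost_per_million_tokens(1.0, price_per_gpu_hr, tokens_per_paid_hour)


PRICE = 4.50  # indicative B200 rate from this article, $/GPU/hr
print(f"{cost_per_million_at_rate(PRICE, 60_000, 1.0):.3f}")  # 0.021
print(f"{cost_per_million_at_rate(PRICE, 4_000, 1.0):.2f}")   # 0.31
print(f"{cost_per_million_at_rate(PRICE, 4_000, 0.5):.2f}")   # 0.62
```

At the 60,000 tok/s/GPU headline rate and full utilisation, the result lands near the $0.02 headline; at 4,000 tok/s/GPU, or at partial utilisation, the same GPU-hour buys far fewer tokens.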

For the GPU model decision that sits one step above the benchmark, see H100 vs H200 vs B200: which GPU to rent in 2026.


◆ FAQ

Frequently asked questions

Last reviewed: June 9, 2026. B200 MLPerf and SemiAnalysis InferenceX figures from NVIDIA Developer (April 2026), Spheron AI inference guide (April 2026), MLPerf Inference v5.1 (September 2025), MLPerf Inference v6.0 (April 2026), Nebius blog (April 2026), and Lambda's v5.1 submission. Benchmarking methodology from BentoML LLM Inference Handbook, Anyscale docs, Databricks Foundation Model APIs guide, and llmperf-rs (April 2026). FP4/FP8 tradeoff analysis from Spheron FP4 quantisation guide (March 2026), arxiv 2509.23202 (March 2026), arxiv 2411.02355, ifactoryapp (May 2026), and Spheron 2025 quantisation tradeoffs analysis. GPUaaS.com rates are indicative, contract-based, and quote-dependent.
