How to Benchmark Your Workload Before Committing to B200

AI & ML

B200 throughput varies 8.6x across workloads, from 6,972 to 60,000 tok/s/GPU. Here's the four-metric methodology to benchmark before you commit.

GPUaaS.com Team
Technical Evaluator
June 8, 2026

NVIDIA's B200 hits 60,000 tokens per second per GPU on GPT-OSS-120B with the right software stack, per SemiAnalysis InferenceX data from April 2026. That's a 4x throughput improvement over H200 and pushes cost per million tokens down to about $0.02. Those are the headline numbers. They are also not your numbers. Your workload's tokens per second, time to first token, and tail latency will not match MLPerf's, and that gap is where most B200 procurement decisions go wrong. This is how to benchmark your actual workload before committing to a B200 cluster.

Key takeaways
  • Public MLPerf and vendor numbers are directional, not predictive. B200 ranges from ~6,972 tok/s/GPU at FP8 (Llama 2 70B offline) to 60,000 tok/s/GPU at FP4 with disaggregated serving (GPT-OSS-120B). The 8.6x spread is the workload, not the GPU (SemiAnalysis InferenceX, April 2026; MLPerf Inference v5.1)
  • Four metrics determine whether B200 wins for your workload: TTFT (time to first token), TPOT/ITL (per-token latency), throughput (tokens/sec per GPU), and tail latency (P95/P99). Pick the wrong one and the procurement decision optimises the wrong cost
  • FP4 on Blackwell delivers roughly 2x the TFLOPS of FP8, but accuracy can degrade on reasoning, math, and chain-of-thought tasks. FP8 is the safe default; FP4 requires task-specific accuracy evals before production
  • B200 cost per million tokens dropped from $0.11 to $0.02 in two months on GPT-OSS-120B from TensorRT-LLM updates alone. Benchmark the software stack you'll actually run, not the one that shipped with the GPU (NVIDIA Developer, April 2026)
  • Pre-commitment checklist: measure on the model and sequence lengths you'll serve, benchmark at production-like concurrency, include P95/P99, run with the quantisation you'll deploy, and validate on the inference framework you'll use (vLLM, SGLang, TensorRT-LLM)
  • GPUaaS.com offers short-term B200 and B300 contracts so the benchmarking step doesn't require a multi-year commitment. H100 from ~$2.50/GPU/hr, H200 from ~$3.00/GPU/hr, B200 and B300 from ~$4.50/GPU/hr

B200 procurement conversations break the same way every time. Someone reads that B200 hits 60,000 tokens per second per GPU and starts modelling the cost case off that number. Then the workload lands, and the team is getting 4,000 tokens per second per GPU and a P99 latency that's three times worse than the H100 cluster they were going to replace. The 60,000 number was real. It was also for GPT-OSS-120B at FP4 with disaggregated serving on the latest TensorRT-LLM stack. Three of those four conditions don't apply to most production workloads. This guide walks through what to benchmark, how to benchmark it, and the procurement questions to ask before signing for a B200 cluster. For B200 spec context, see the B200 SXM enterprise buyer's guide. For the broader pricing structure, see the GPU pricing guide.

◆ HEADLINE NUMBERS
Why MLPerf numbers aren't your numbers

Here's the spread of public B200 per-GPU throughput numbers, all real, all from credible sources, all from the last 12 months:

| Source | Model | B200 tok/s/GPU | Conditions |
| --- | --- | --- | --- |
| MLPerf Inference v5.1 | Llama 2 70B (offline) | ~12,841 | FP4, 8-GPU result divided by 8 |
| Derived from H100 ratio | Llama 2 70B (offline) | ~6,972 | FP8, derived from TFLOPS ratio |
| SemiAnalysis InferenceX | GPT-OSS-120B | 60,000 | FP4, TensorRT-LLM, disaggregated |
| arxiv MR-MXFP4 paper | Llama-3.3-70B (large batch) | ~15,000 | MR-MXFP4, vLLM, single GPU |

Sources: MLPerf Inference v5.1 (September 2025); SemiAnalysis InferenceX via NVIDIA Developer (April 2026); arxiv 2509.23202 (March 2026). All numbers per-GPU, derived from multi-GPU runs where noted.

The throughput ratio between the lowest and highest credible B200 number on this table is roughly 8.6x. The hardware did not change. What changed: the model, the quantisation (FP8 vs FP4 vs MR-MXFP4), the inference framework (TensorRT-LLM vs vLLM), the serving strategy (disaggregated vs co-located), and the workload pattern (offline batch vs online streaming).

Inference now accounts for roughly two-thirds of all AI compute in 2026, per Spheron's April 2026 analysis. Models are trained once; they serve requests millions of times. That shift means the GPU procurement decision for inference deserves the same rigor that training procurement has always gotten. The published numbers tell you what's possible on B200. They tell you nothing about what you'll get on your workload.

According to NVIDIA Developer's MLPerf Inference v6.0 results from April 2026, GB300 NVL72 delivered 2.5 million tokens per second on DeepSeek-R1, a 2.7x improvement over the previous GB300 NVL72 debut submission six months prior. The hardware was unchanged. The 2.7x came from TensorRT-LLM software updates alone.

If you're modelling B200 procurement off a number that came from a different model, different precision, different framework, or different serving strategy than yours, you're not modelling B200 procurement. You're modelling someone else's workload. For a procurement-side TCO breakdown of the surrounding costs, see the real TCO of a GPU cluster in 2026.

◆ FOUR METRICS
The four metrics that actually matter

A single throughput number cannot describe LLM inference performance. The BentoML LLM Inference Handbook, Anyscale's documentation, and Databricks's benchmarking guidance all converge on the same four metrics. These are the ones to model.

| Metric | What it measures | Why it matters for procurement |
| --- | --- | --- |
| TTFT | Time to first token. Dominated by prefill, scales with input length. | Interactive workloads (chat, code assist) live or die on this. Target: sub-200ms for production. |
| TPOT / ITL | Time per output token / inter-token latency. Dominated by decode, scales with model size. | 100ms TPOT = ~10 tok/s = faster than reading speed. Below 50ms feels instant. |
| Throughput | Tokens/second per GPU (or per system). Sensitive to batch size. | Batch workloads (summarisation, agents) optimise here. Drives cost per million tokens. |
| P95 / P99 latency | Tail latency under load. The worst 5% (or 1%) of requests. | Mean throughput hides tail spikes that break SLAs. P99 / mean ratio under load tells you stability. |
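
As a rough illustration of how the four metrics fall out of a single run, the sketch below derives them from per-request token timestamps. The `RequestTrace` record, the `summarise` helper, and the nearest-rank percentile are assumptions made for this example, not the output format of any particular benchmarking tool. Tail latency is taken here on end-to-end request time, and the code assumes at least one request produced two or more tokens.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical record format)."""
    sent_at: float
    token_times: list[float]  # arrival time of each output token


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarise(traces: list[RequestTrace], num_gpus: int) -> dict[str, float]:
    # TTFT: delay between sending the request and receiving the first token.
    ttft = [t.token_times[0] - t.sent_at for t in traces]
    # TPOT: decode time after the first token, spread over the remaining tokens.
    tpot = [
        (t.token_times[-1] - t.token_times[0]) / (len(t.token_times) - 1)
        for t in traces
        if len(t.token_times) > 1
    ]
    # End-to-end latency per request, used for the tail percentiles.
    e2e = [t.token_times[-1] - t.sent_at for t in traces]
    # Throughput: all output tokens over the wall-clock span of the run.
    wall = max(t.token_times[-1] for t in traces) - min(t.sent_at for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return {
        "ttft_mean_s": mean(ttft),
        "tpot_mean_s": mean(tpot),
        "tok_per_s_per_gpu": total_tokens / wall / num_gpus,
        "e2e_p95_s": percentile(e2e, 95),
        "e2e_p99_s": percentile(e2e, 99),
        "p99_over_mean": percentile(e2e, 99) / mean(e2e),
    }
```

Computing all four from the same traces keeps them comparable: the throughput figure and the tail figure describe the same load, not two separate runs.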

Infercom's analysis puts the tradeoff plainly: a provider that delivers 400 tok/s at 3 AM but 150 tok/s during business hours is not a 400 tok/s provider for your production workload. That same logic applies to B200 procurement. A B200 cluster that hits 60,000 tok/s/GPU in an offline batch benchmark is not necessarily a 60,000 tok/s/GPU cluster for your streaming production traffic.

Match the metric to the workload

  • Chat / code assist: TTFT first, then TPOT, then P99. Throughput is a distant fourth.
  • Agentic workflows: Throughput and TPOT. Latency for the first token matters less than total task completion time.
  • Batch processing: Throughput dominates. Per-token latency only matters for total job time.
  • Real-time voice / video: TTFT, TPOT, P95. Throughput is hardware planning, not user experience.

For a deeper look at how these metrics interact with KV cache and decode-side optimisations, see the KV cache and inference cost guide.

◆ METHODOLOGY
A benchmarking methodology for B200 procurement

A pre-commitment benchmark is not a microbenchmark and it is not an MLPerf submission. It's an attempt to predict what a production B200 deployment will actually deliver. Six steps that hold up:

1. Use your real model. If you serve Llama-3.3-70B in production, benchmark Llama-3.3-70B. Different model architectures hit different bottlenecks on B200. A Mixture-of-Experts model with sparse activation behaves nothing like a dense transformer at the same parameter count. The MLPerf Llama 2 70B numbers tell you almost nothing about a DeepSeek-R1 deployment.

2. Use your real sequence lengths. Anyscale's benchmarking guidance is explicit: input length drives prefill cost, output length drives decode cost. A workload with 4,096-token inputs and 256-token outputs is prefill-dominated and compute-bound. A workload with 256-token inputs and 4,096-token outputs is decode-dominated and bound by memory bandwidth, because every generated token re-reads the weights and a growing KV cache. B200's 8 TB/s HBM3e bandwidth mainly helps the decode side and its higher compute throughput helps prefill, so the relative gain over H100 is workload-dependent. Test with the actual distribution.

3. Use your real concurrency. A single-stream benchmark tells you peak per-request performance. A 64-concurrent benchmark tells you what happens when the queue fills. Production concurrency is rarely 1 and rarely 1,000. Find your real number from your existing load and test at 0.5x, 1x, and 2x. The P99 / mean ratio at 2x is the number that matters for capacity planning.

4. Use your real inference framework. vLLM, SGLang, and TensorRT-LLM are not interchangeable. They have different scheduling strategies, different KV cache management, and different quantisation kernel support. Benchmark the one you'll actually deploy. According to the arxiv MR-MXFP4 study (March 2026), small-batch workloads on vLLM saw negligible MR-MXFP4 speedup over BF16, while large-batch workloads saw a 2.2x increase. Same hardware, same framework, same quantisation scheme, and the conclusion still flips with batch size.

5. Use your real quantisation. If your production model runs FP8, benchmark FP8. If you're considering FP4, benchmark FP4 alongside an accuracy eval on your task. Mify-Coder's quantisation study found that FP8 cost 1.35% accuracy versus the FP16 baseline (acceptable for most production use); FP4 accuracy varies wildly by model and calibration. Don't benchmark FP4 throughput without measuring the corresponding accuracy hit on the tasks you care about.

6. Measure all four metrics with statistical rigor. Run for long enough that warmup effects don't dominate (typically 10+ minutes at steady-state concurrency). Record TTFT, TPOT, throughput, and P95/P99 in the same run. Repeat at least 3 times. Variance across runs above ~5% means something is unstable; investigate before procuring.
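
The stability rules in steps 3 and 6 can be turned into a small review script. This is a minimal sketch under stated assumptions: `review_runs` and the result keys (`tok_per_s_per_gpu`, `p99_s`, `mean_s`) are hypothetical names, the "~5%" run-to-run variance is read as a coefficient of variation on throughput, and the tail threshold uses the upper end of a 3-4x P99 / mean range. Adjust both thresholds to your own SLA.

```python
from statistics import mean, stdev

MAX_RUN_CV = 0.05     # "~5%" run-to-run threshold, interpreted as coefficient of variation
MAX_TAIL_RATIO = 4.0  # assumed ceiling for P99 / mean latency under load


def review_runs(runs_by_load: dict[float, list[dict[str, float]]]) -> list[str]:
    """Return findings per load level (e.g. 0.5x, 1x, 2x of production concurrency)."""
    findings = []
    for load, runs in sorted(runs_by_load.items()):
        if len(runs) < 3:
            findings.append(f"{load}x: only {len(runs)} run(s); repeat at least 3 times")
            continue
        # Run-to-run stability of throughput at this load level.
        throughputs = [r["tok_per_s_per_gpu"] for r in runs]
        cv = stdev(throughputs) / mean(throughputs)
        if cv > MAX_RUN_CV:
            findings.append(f"{load}x: throughput varies {cv:.1%} across runs; investigate")
        # Worst tail-to-mean latency ratio seen in any run at this load level.
        worst_ratio = max(r["p99_s"] / r["mean_s"] for r in runs)
        if worst_ratio > MAX_TAIL_RATIO:
            findings.append(f"{load}x: P99/mean = {worst_ratio:.1f}; queueing or underprovisioning")
    return findings
```

An empty list from `review_runs` at 0.5x, 1x, and 2x is the signal that the numbers are stable enough to feed into a cost model.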

⚠ Watch out

Averaging per-request ITL first and then averaging across requests is mathematically different from computing ITL across all token timestamps in aggregate. The first method (used in the archived ray-project/llmperf) overweights short requests and can hide tail issues. Verify how your benchmarking tool computes ITL before trusting the number.
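
To see how far the two ITL definitions can diverge, here is a small sketch with hypothetical helper names (`itl_mean_of_means`, `itl_pooled`) and made-up timestamps. It is not the code of any specific benchmarking tool, only the two averaging orders described above.

```python
from statistics import mean


def itl_gaps(token_times: list[float]) -> list[float]:
    """Gaps between consecutive token arrival times for one request."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


def itl_mean_of_means(requests: list[list[float]]) -> float:
    """Average ITL per request first, then average those per-request values."""
    return mean(mean(itl_gaps(r)) for r in requests if len(r) > 1)


def itl_pooled(requests: list[list[float]]) -> float:
    """Average over every inter-token gap in the run, so each token counts once."""
    return mean(gap for r in requests for gap in itl_gaps(r))


# One short, fast request (2 gaps of 10 ms) and one long, slow request (100 gaps of 50 ms).
short_request = [0.0, 0.01, 0.02]
long_request = [i * 0.05 for i in range(101)]
requests = [short_request, long_request]

print(f"mean of per-request means: {itl_mean_of_means(requests) * 1000:.1f} ms")
print(f"pooled over all tokens:    {itl_pooled(requests) * 1000:.1f} ms")
```

With these inputs the per-request method reports about 30 ms while the pooled method reports about 49 ms, because the short request gets the same weight as the long one despite contributing only 2 of the 102 gaps.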

◆ PRECISION
FP4 vs FP8: the accuracy-throughput tradeoff

FP4 is the Blackwell-specific lever. B200 delivers approximately 18,000 sparse FP4 TFLOPS versus approximately 9,000 sparse FP8 TFLOPS on the same chip, a clean 2x ratio at the kernel level per Spheron's March 2026 analysis. End-to-end inference speedup is typically 1.3x-2.2x depending on batch size and whether the workload is memory-bound or compute-bound.

The catch is accuracy. The arxiv "Win Fast or Lose Slow" study (May 2025) puts it bluntly: FP8 typically delivers up to 2x latency speedup while maintaining near-lossless output quality and is widely adopted in production. FP4 can deliver up to 4x latency reduction but often causes severe degradation in model performance, limiting its standalone use. NVIDIA's NVFP4 format with micro-block scaling (FP8 E4M3 scale per 16 FP4 values) closes part of this gap, but accuracy still depends on whether the model was calibrated for FP4 post-training.
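
To make the micro-block idea concrete, the sketch below quantises values in blocks of 16 that share one scale, then dequantises them so the rounding error is visible. It is a simplified illustration, not NVIDIA's implementation: the scale is kept as an ordinary Python float instead of being rounded to FP8 E4M3, ties are broken naively, and the helper names are hypothetical. The grid is the set of magnitudes a 4-bit E2M1 float can represent.

```python
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes representable in FP4 (E2M1)
BLOCK = 16  # number of values sharing one scale


def quantise_block(block: list[float]) -> tuple[float, list[float]]:
    """Return (scale, codes), where codes are signed grid values and code * scale approximates the input."""
    peak = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the top of the grid.
    # Simplification: real NVFP4 stores this scale in FP8 E4M3; here it stays a float.
    scale = peak / E2M1_GRID[-1] if peak else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def fake_quantise(values: list[float]) -> list[float]:
    """Quantise then dequantise block by block to expose the rounding error."""
    out = []
    for start in range(0, len(values), BLOCK):
        scale, codes = quantise_block(values[start:start + BLOCK])
        out.extend(code * scale for code in codes)
    return out
```

Feeding in one outlier of 600 followed by values of 20 shows why small blocks matter: the 20s that share a block with the outlier are flushed to zero, while the 20s in the next block survive almost exactly.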

Where FP4 helps less than the spec suggests:

  • Reasoning and chain-of-thought. Errors compound across reasoning chains. FP4 quantisation noise on a 50-step reasoning chain can flip the answer. Run task-specific accuracy evals (GSM8K, MATH, MMLU-Pro) before deploying.
  • Math and scientific tasks. Same compounding-error problem.
  • Small-batch inference. Memory-bound workloads benefit from weight-only quantisation, but FP4 activation quantisation brings little speedup at batch sizes below ~8.
  • Uncalibrated models. Dynamic FP4 quantisation applied without calibration produces significantly larger accuracy gaps than PTQ-calibrated weights. If pre-calibrated FP4 weights for your model variant don't exist, FP8 is the safer default.

Where FP4 wins:

  • Large-batch generation. Batch sizes of roughly 8 and above are where FP4 throughput improvements materialise.
  • Memory-bound workloads. A 4x reduction in weight memory versus FP16 means each byte of memory bandwidth moves more weights, which compounds the kernel speedup.
  • Energy and inference economics at scale. H100 at FP16 consumes ~10 joules per token; B200 at FP4 drops to ~0.2-0.4 joules per token per ifactoryapp's May 2026 analysis, a 25-50x energy efficiency improvement. At 100M tokens/day, that difference translates directly to power and cooling savings in the cluster TCO.
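
As a quick check on the energy figures, the arithmetic below converts joules per token into daily kilowatt-hours at 100M tokens/day. The `daily_kwh` helper is a hypothetical name, and the result covers GPU energy only; it assumes nothing about cooling overhead or idle draw.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def daily_kwh(tokens_per_day: float, joules_per_token: float) -> float:
    """Energy per day, in kWh, spent generating tokens at a given per-token energy cost."""
    return tokens_per_day * joules_per_token / JOULES_PER_KWH


tokens_per_day = 100e6
h100_fp16 = daily_kwh(tokens_per_day, 10.0)
b200_fp4_low = daily_kwh(tokens_per_day, 0.2)
b200_fp4_high = daily_kwh(tokens_per_day, 0.4)

print(f"H100 FP16: {h100_fp16:.0f} kWh/day")
print(f"B200 FP4:  {b200_fp4_low:.1f} to {b200_fp4_high:.1f} kWh/day")
print(f"Ratio:     {h100_fp16 / b200_fp4_high:.0f}x to {h100_fp16 / b200_fp4_low:.0f}x")
```

That works out to roughly 278 kWh/day versus 6 to 11 kWh/day, which is the 25-50x range quoted above.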

According to NVIDIA Developer's April 2026 data, B200 cost per million tokens on GPT-OSS-120B dropped from $0.11 at launch to $0.02 within two months, a roughly 5.5x improvement from software alone. That's TensorRT-LLM updates compounding kernel fusion, quantisation, and scheduling improvements on the same silicon.

The procurement implication is straightforward: benchmark at the precision you'll deploy at, and run the accuracy eval at the same time. A throughput win that costs you 3 points on MMLU-Pro might not be a win. For more on inference economics and the token-cost framing, see tokenmaxxing and exploding enterprise AI bills.

◆ SOFTWARE
The software stack moves the numbers more than you think

The GB300 NVL72 result jumped 2.7x in six months from software updates alone, per NVIDIA Developer's MLPerf Inference v6.0 analysis. The same hardware. The same model. Different TensorRT-LLM version. That's not edge-case behaviour, that's the central fact of B200 procurement in 2026.

Three framework choices to benchmark explicitly:

| Framework | Strengths | Tradeoffs |
| --- | --- | --- |
| TensorRT-LLM | Highest published throughput on B200. NVFP4 support. Disaggregated serving. | Compilation step, less flexible. NVIDIA-only. |
| vLLM | Wide model support, paged attention, fast iteration. MR-MXFP4 support landing in 2026. | Throughput typically below TensorRT-LLM peak on B200. |
| SGLang | Structured generation, RadixAttention prefix caching, strong for agentic and tool-use workloads. | Smaller ecosystem than vLLM. |

Disaggregated serving — splitting prefill and decode across separate GPU pools — is one of the biggest B200 wins, but it requires infrastructure work most teams haven't done. Lambda's MLPerf Inference v5.1 submission on 8x HGX B200 saw 15.4% gains over v5.0 from TensorRT 10.11 + CUDA 12.9 + Ubuntu 24.04 alone. The software stack moves quarterly. Whatever you benchmark this month is a lower bound for what production will deliver in three months.

⚡ Software stack reality check

The B200 number to commit to is not the highest published. It's the one your team can actually reproduce on your infrastructure within 60 days of cluster delivery, with your model, framework, and serving topology. Anything above that is upside.

◆ CHECKLIST
Pre-commitment procurement checklist

A short list to run before signing for B200 capacity. Each item should be answered with a number from your own benchmark, not a spec sheet.

  • What are our TTFT, TPOT, throughput, and P99 on the model we'll deploy, at production sequence lengths and concurrency, on the framework we'll use? If any of these four is unknown, the procurement model is incomplete.
  • What is the FP4 accuracy delta versus FP8 on the tasks we care about? Run an eval, not a guess. GSM8K, MATH, MMLU-Pro, or domain-specific evals.
  • What is the P99 / mean ratio at 2x our projected concurrency? A ratio above ~3-4x means the cluster is underprovisioned or the framework is mismanaging the queue.
  • What's the cost per million tokens at production utilisation? Use real numbers: (GPU-hours × $/GPU-hr) / (tokens served / 1,000,000). Not the MLPerf-derived number.
  • What's the upgrade path if the workload changes? B200 today, but what if next year's model needs B300 or GB300 NVL72? A short-term contract preserves optionality; a 36-month commitment doesn't.
  • What's the actual utilisation we expect? Per Cast AI's 2026 data, average GPU utilisation across 23,000 measured clusters is 5%. Above 70% utilisation, ownership starts to make sense. Below, contracts win. Most teams sit far below.
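
The cost-per-million-tokens item can be sketched as below. Function names are hypothetical, and "utilisation" is assumed here to mean the fraction of paid GPU-hours spent serving at the measured throughput. The $4.50/GPU/hr rate is the indicative B200 figure quoted in this article.

```python
def cost_per_million_tokens(gpu_count: int, hours: float,
                            usd_per_gpu_hour: float, tokens_served: float) -> float:
    """Cost per million tokens from a real billing period: spend divided by millions of tokens served."""
    gpu_hours = gpu_count * hours
    return gpu_hours * usd_per_gpu_hour / (tokens_served / 1_000_000)


def cost_from_throughput(usd_per_gpu_hour: float, tok_per_s_per_gpu: float,
                         utilisation: float) -> float:
    """Projected cost per million tokens from measured throughput and expected utilisation."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_gpu_hour = tok_per_s_per_gpu * 3600 * utilisation
    return usd_per_gpu_hour / tokens_per_gpu_hour * 1_000_000


# Headline throughput at full utilisation versus two less flattering cases.
print(f"{cost_from_throughput(4.50, 60_000, 1.0):.4f}")   # published peak, always busy
print(f"{cost_from_throughput(4.50, 4_000, 1.0):.4f}")    # measured 4,000 tok/s/GPU
print(f"{cost_from_throughput(4.50, 60_000, 0.05):.4f}")  # peak throughput, 5% utilisation
```

At the published peak and full utilisation this lands near the $0.02 headline. At 4,000 tok/s/GPU, or at 5% utilisation, the same GPU comes out at roughly $0.31 and $0.42 per million tokens respectively, which is why both measured throughput and expected utilisation belong in the model.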

For the GPU model decision that sits one step above the benchmark, see H100 vs H200 vs B200: which GPU to rent in 2026.

◆ FAQ
Frequently asked questions

B200 throughput on production workloads ranges from roughly 6,972 tok/s/GPU at FP8 (Llama 2 70B offline, MLPerf v5.1 derived) to 60,000 tok/s/GPU at FP4 with disaggregated serving (GPT-OSS-120B, SemiAnalysis InferenceX April 2026). The 8.6x spread depends on model, precision, framework, batch size, and serving topology. The realistic number for your workload is the one you measure on your model with your framework, not the public peak. See the B200 buyer's guide for spec context.

FP8 is the safer default for production. It delivers ~2x speedup over FP16 with minimal accuracy degradation (~0.3-0.5 point regression across most benchmarks). FP4 can deliver up to 4x latency reduction but accuracy degrades unpredictably, especially on reasoning, math, and chain-of-thought tasks. Use FP4 when the workload is large-batch or memory-bound and PTQ-calibrated weights are available. Always run a task-specific accuracy eval before moving from FP8 to FP4 in production.

TensorRT-LLM delivers the highest published B200 throughput numbers, with NVFP4 support and disaggregated serving as key advantages. vLLM has wider model support and faster iteration but typically lower peak throughput. SGLang is strong for agentic and tool-use workloads via RadixAttention prefix caching. Benchmark all three on your workload before committing — the right answer depends on your model, batch size, and operational requirements, not the published peak.

NVIDIA Developer's April 2026 data shows B200 reaching $0.02 per million tokens on GPT-OSS-120B with TensorRT-LLM and FP4. This dropped from $0.11 at launch within two months from software updates alone. Your actual cost per million tokens depends on utilisation, model, precision, and framework. Compute it from your benchmarks as (GPU-hours × $/GPU-hr) / (tokens served / 1,000,000), not from public numbers. GPUaaS.com B200 starts from ~$4.50/GPU/hr on contract.

Minimum 10 minutes at steady-state concurrency, repeated at least 3 times. Anything shorter is dominated by warmup effects, KV cache cold-start behaviour, and statistical noise. Variance across runs above ~5% indicates instability that needs to be investigated before procuring. Measure TTFT, TPOT, throughput, and P95/P99 simultaneously in the same run — separate runs introduce confounds.

It depends on workload, not spec sheet. B200 delivers ~4x H200 throughput on GPT-OSS-120B at FP4 with TensorRT-LLM (SemiAnalysis InferenceX, April 2026). On smaller models, smaller batches, or workloads where H200's 141GB HBM3e already fits the model comfortably, the gap narrows. Benchmark both. H200 from ~$3.00/GPU/hr versus B200 from ~$4.50/GPU/hr means B200 needs to deliver at least 50% more throughput on your workload to break even on per-token cost. See H100 vs H200 vs B200 for the decision framework.
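
The break-even arithmetic in that comparison can be checked in a few lines. The helper names are hypothetical, the rates are the indicative ones quoted above, and the throughput inputs are placeholders for your own measurements.

```python
def breakeven_speedup(candidate_usd_per_gpu_hour: float,
                      baseline_usd_per_gpu_hour: float) -> float:
    """Throughput multiple the pricier GPU needs to match the baseline's per-token cost."""
    return candidate_usd_per_gpu_hour / baseline_usd_per_gpu_hour


def is_cheaper_per_token(candidate_rate: float, candidate_tok_per_s: float,
                         baseline_rate: float, baseline_tok_per_s: float) -> bool:
    """Compare cost per token, i.e. hourly rate divided by measured throughput."""
    return candidate_rate / candidate_tok_per_s < baseline_rate / baseline_tok_per_s


print(breakeven_speedup(4.50, 3.00))  # 1.5, i.e. B200 needs 50% more throughput than H200
```

Any measured B200 / H200 throughput ratio above 1.5 on your workload favours B200 on per-token cost at these rates; below it, H200 is cheaper per token.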

No. Software stacks move quarterly and accuracy-throughput tradeoffs are workload-specific. A short-term contract preserves the option to validate the GPU on your actual workload, with your actual framework, at your actual concurrency, before signing a 24- or 36-month commitment. GPUaaS.com offers both short-term and long-term B200 contracts with no multi-year lock-in, so benchmarking can happen on real hardware without writing off the spend if the workload changes.

Last reviewed: June 9, 2026. B200 MLPerf and SemiAnalysis InferenceX figures from NVIDIA Developer (April 2026), Spheron AI inference guide (April 2026), MLPerf Inference v5.1 (September 2025), MLPerf Inference v6.0 (April 2026), Nebius blog (April 2026), and Lambda's v5.1 submission. Benchmarking methodology from BentoML LLM Inference Handbook, Anyscale docs, Databricks Foundation Model APIs guide, and llmperf-rs (April 2026). FP4/FP8 tradeoff analysis from Spheron FP4 quantisation guide (March 2026), arxiv 2509.23202 (March 2026), arxiv 2411.02355, ifactoryapp (May 2026), and Spheron 2025 quantisation tradeoffs analysis. GPUaaS.com rates are indicative, contract-based, and quote-dependent.
