NVIDIA quotes B200 at $0.02/M tokens. Real workloads land anywhere from $0.02 to $0.31/M, a 15x spread on the same hardware. Here is the measured math, with hyperscaler stacking.

B200 Cost per Million Tokens, Measured (2026)

GPUaaS.com Team
Inference Economics
June 10, 2026

NVIDIA quotes B200 inference at $0.02 per million tokens on GPT-OSS-120B. That number is real, recent (April 2026 SemiAnalysis InferenceX data via NVIDIA Developer), and almost certainly not what your workload will cost. The same B200 hardware lands anywhere from $0.02 to $0.31 per million tokens depending on model, precision, framework, batch size, and interactivity target. SemiAnalysis InferenceX data from May 2026 puts GLM-5 on B200 at $0.30/M tokens with MTP at 18 TPS/user. Layer in hyperscaler egress, storage, and support fees and an advertised $6 GPU-hour can become $8-12 effective. This is how the cost-per-million-tokens math actually works on B200 in 2026, and what to lock in before signing for capacity.

Key takeaways
  • NVIDIA's $0.02/M tokens headline on B200 is GPT-OSS-120B, FP4, TensorRT-LLM, 55 TPS/user. SemiAnalysis InferenceX puts GLM-5 on B200 at $0.30/M at 18 TPS/user. That 15x spread comes from model, precision, framework, and interactivity target, not from the hardware (NVIDIA Developer, April 2026; SemiAnalysis InferenceX, May 2026)
  • The "Great Token Deflation" of 2026 is real: GPT-4-class quality dropped from $60/M tokens in early 2024 to $0.30-$0.75/M by early 2026, a 98%+ collapse driven by Blackwell hardware plus software stack maturity. Inference is now 80-90% of AI compute consumption
  • The cost-per-token formula is simple: ($/GPU-hr × GPU-hours) ÷ tokens served. The five variables that move the number 10-100x: model architecture, precision (FP4 vs FP8 vs BF16), inference framework (TensorRT-LLM vs vLLM vs SGLang), batch size, and target interactivity (TPS/user)
  • Hyperscaler "all-in" inference cost is 20-40% higher than the advertised GPU rate once egress ($0.08-$0.12/GB), storage, support tiers, and cross-zone networking stack. A 10 TB/month egress workload adds $800-$1,200 alone (GMI Cloud, 2026)
  • Software movement matters more than hardware for cost-per-token in 2026: B200 dropped from $0.11 to $0.02 per million tokens in two months on the same hardware from TensorRT-LLM updates alone (NVIDIA Developer, April 2026)
  • GPUaaS.com offers short-term and long-term B200 and B300 contracts with no multi-year lock-in and no egress markup. H100 from ~$2.50/GPU/hr, H200 from ~$3.00/GPU/hr, B200 and B300 from ~$4.50/GPU/hr

Cost per million tokens is the metric that decides whether an inference workload is a profitable product or a money pit. It's also the metric most procurement decisions get wrong. The published numbers (NVIDIA's $0.02/M, Anthropic's $3/M Claude rate, OpenAI's $2.50/M for GPT-4 Turbo equivalent) all measure something different, on different hardware, with different precision and serving topology. None of them tell you what your workload will cost to serve on B200. This guide walks through the cost-per-million-tokens math from first principles, shows the realistic 2026 range for B200 across workload types, layers in the hyperscaler hidden-cost stacking, and closes with the procurement framing for locking the right target. For the B200 benchmarking methodology, see how to benchmark your workload before committing to B200.

◆ THE MATH
The cost-per-million-tokens math from first principles

Cost per million tokens (CPM) is a derived metric. The formula:

CPM = ($/GPU-hr × GPU-hours used) ÷ (millions of tokens generated)

Expand that and the five variables that actually drive CPM become visible:

  • GPU rate ($/GPU-hr). The advertised number. On contract, GPUaaS.com B200 runs from ~$4.50/GPU/hr. Spheron's spot B200 sits at $2.07/GPU/hr. Hyperscaler on-demand B200 typically lands at $6-$8/GPU/hr, with reserved discounts of 30-50% for multi-year commitments.
  • Throughput (tokens/sec/GPU). This is where the 10-100x variance lives. B200 ranges from ~6,972 tok/s/GPU at FP8 (Llama 2 70B offline) to 60,000 tok/s/GPU at FP4 with TensorRT-LLM disaggregated serving (GPT-OSS-120B) per SemiAnalysis InferenceX, April 2026. An 8.6x spread on the same hardware.
  • Utilisation. A B200 cluster billed 24/7 but only serving traffic 30% of the time has 3.3x the effective CPM of one running near full utilisation. Per Cast AI's 2026 data, average GPU utilisation across 23,000 measured clusters is 5%.
  • Batch size. Per Spheron's April 2026 cost-per-token benchmark, batch size is the biggest single lever on CPM. At batch size 1, GPU utilisation is very low and CPM can be 50-100x higher than at batch size 256.
  • Interactivity target (TPS/user). The tightest constraint in production. Per AMD's GTC 2026 analysis, the right cost-per-token question is always paired with an interactivity target. The same B200 cluster serving at 100 TPS/user costs roughly 2-3x more per token than at 18 TPS/user, because you can pack fewer concurrent users into the same compute.

Per AMD's "Many Aspects of Inference Performance" analysis (March 2026): "Every one of these is a software optimization point. Vendors can find a configuration that shows a large advantage. The right question is not which configuration makes a GPU look best, but rather what the cost per token is for a given workload and interactivity target."

In other words, a published CPM number without the interactivity target is incomplete. A B200 at $0.02/M at 55 TPS/user and a B200 at $0.30/M at 18 TPS/user are both correct numbers. They're just measuring different operating points on the same hardware.
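The formula and its variables can be sketched in a few lines of Python. This is a minimal illustration rather than a pricing tool: the function name `cost_per_million_tokens` and its parameters are assumptions made for this example, and the inputs ($4.50/GPU/hr, 3,000 tok/s/GPU) are illustrative. Batch size and interactivity target are not separate inputs; they show up in whatever sustained throughput you measure.

```python
def cost_per_million_tokens(gpu_rate_per_hr, tokens_per_sec_per_gpu, utilisation=1.0):
    """Return $ per million tokens for one GPU billed by the hour.

    Batch size and interactivity target are not separate inputs here: they
    are reflected in the sustained tokens/sec/GPU you actually measure.
    """
    if tokens_per_sec_per_gpu <= 0:
        raise ValueError("throughput must be positive")
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    millions_of_tokens_per_hr = tokens_per_sec_per_gpu * 3600 * utilisation / 1_000_000
    return gpu_rate_per_hr / millions_of_tokens_per_hr


# Illustrative inputs only: $4.50/GPU/hr and 3,000 tok/s/GPU.
full = cost_per_million_tokens(4.50, 3000, utilisation=1.0)
partial = cost_per_million_tokens(4.50, 3000, utilisation=0.3)
print(f"100% utilised: ${full:.2f}/M")
print(f" 30% utilised: ${partial:.2f}/M ({partial / full:.1f}x)")
```

With rate and throughput held constant, dropping utilisation from 100% to 30% multiplies CPM by 1/0.3, which is the 3.3x figure quoted in the utilisation bullet.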

◆ MEASURED RANGE
The measured B200 cost-per-token range in 2026

Here are the public B200 cost-per-million-tokens numbers from credible 2026 sources, sorted by configuration:

| Source | Model | $/M tokens | Conditions |
|---|---|---|---|
| NVIDIA Developer (Apr 2026) | GPT-OSS-120B | $0.02 | FP4, TensorRT-LLM, 55 TPS/user |
| NVIDIA DGX B200 page (Q1 2026) | GPT-OSS-120B | $0.09 (H200 vLLM) | vLLM on H200 baseline (~4.5x cheaper on B200) |
| SemiAnalysis InferenceX (May 2026) | GLM-5 | $0.30 | FP8, SGLang, MTP, 18 TPS/user |
| SemiAnalysis InferenceX (May 2026) | GLM-5 (no MTP) | $0.31 | FP8, SGLang, 10 TPS/user |
| Inworld (Apr 2026) | Generic LLM | $0.02 (B200) vs $0.14 (H100) | 7x reduction, configuration not specified |
| Spheron (Apr 2026) | Llama 3.3 70B | ~$0.10-$0.20 est. | B200 SXM6 spot $2.07/hr, FP4 via TRT-LLM |

Sources: NVIDIA Developer Deep Learning Performance Hub (April 2026); NVIDIA DGX B200 product page; SemiAnalysis InferenceX (May 2026); Inworld B200 GPU guide (April 2026); Spheron GPU cost per token benchmark (April 2026).

The 15x spread between $0.02/M and $0.31/M on the same B200 silicon is the central fact of inference economics in 2026. The hardware did not change. What changed: model (GPT-OSS-120B vs GLM-5), precision (FP4 vs FP8), framework (TensorRT-LLM vs SGLang), interactivity target (55 vs 18 TPS/user), and serving topology.

Realistic operating range for production B200 in 2026

  • Best case: ~$0.02-$0.05/M for highly-optimised, large-batch, low-interactivity inference on FP4 with TensorRT-LLM
  • Typical production: ~$0.10-$0.50/M for FP8 production inference at moderate interactivity (50-100 TPS/user) on vLLM or SGLang
  • Latency-sensitive: ~$0.50-$2.00/M for high interactivity (200+ TPS/user) or low-utilisation deployments
  • Procurement modelling rule: assume 5-10x the headline NVIDIA number for typical first-deployment production workloads. Drive it down from there with software stack iteration.
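As a rough illustration of the procurement modelling rule, the sketch below turns a headline CPM into a planning range. The helper name `planning_cpm_range` is hypothetical; the 5x and 10x multiples are the ones stated in the rule, applied to NVIDIA's $0.02/M headline.

```python
def planning_cpm_range(headline_cpm, low_multiple=5, high_multiple=10):
    """Return a (low, high) planning range in $ per million tokens."""
    if headline_cpm <= 0:
        raise ValueError("headline CPM must be positive")
    return headline_cpm * low_multiple, headline_cpm * high_multiple


low, high = planning_cpm_range(0.02)
print(f"Plan for ${low:.2f}-${high:.2f}/M until your own benchmark says otherwise")
```

The result is a starting assumption for a first deployment, to be replaced by measured numbers as soon as they exist.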

For the framework around picking the right benchmark configuration before measuring CPM, see how to benchmark your workload before committing to B200.

◆ WORKED EXAMPLES
Worked examples: chat, batch, and agentic workloads

Three worked examples on a single 8-GPU HGX B200 node at GPUaaS.com contract rate of ~$4.50/GPU/hr ($36/hr for the 8-GPU node):

Example 1: Customer support chat (Llama 3.3 70B FP8, vLLM, 80 TPS/user).

  • Realistic throughput: ~3,000 tokens/sec/GPU sustained at this interactivity
  • Node throughput: 24,000 tokens/sec total (3,000 × 8)
  • Tokens per hour at 70% utilisation: 24,000 × 3,600 × 0.7 = 60.5M tokens/hr
  • CPM = $36 / 60.5 = ~$0.60 per million tokens

Example 2: Batch summarisation (Llama 3.3 70B FP4, TensorRT-LLM, batch 256).

  • Realistic throughput: ~20,000 tokens/sec/GPU at large batch size
  • Node throughput: 160,000 tokens/sec total
  • Tokens per hour at 90% utilisation: 160,000 × 3,600 × 0.9 = 518M tokens/hr
  • CPM = $36 / 518 = ~$0.07 per million tokens

Example 3: Agentic workflow (DeepSeek-R1 671B FP8, SGLang, 30 TPS/user).

  • At FP8, 671B parameters is roughly 671 GB of weights, which exceeds single-GPU memory; needs a full 8-GPU node minimum or an NVL72 rack
  • Per SemiAnalysis InferenceX, GLM-5-class models land around $0.30/M at 18 TPS/user, the closest published operating point
  • CPM = ~$0.30 per million tokens on HGX B200; lower on GB200 NVL72 rack with disaggregated serving

Same hardware, same $/GPU-hr rate, three workloads, 9x CPM spread ($0.07 to $0.60). The procurement implication: a finance team modelling B200 inference cost without specifying the workload, precision, and interactivity target is modelling air. For the rack-scale option when the model demands it, see the GB200 NVL72 enterprise buyer's guide.
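The arithmetic in Examples 1 and 2 can be reproduced with a short script. This is a sketch under the examples' own assumptions (8 GPUs, ~$4.50/GPU/hr, the stated throughput and utilisation); the function names are illustrative. Example 3 is left out because its CPM is taken from a published figure rather than derived from throughput.

```python
GPUS_PER_NODE = 8
NODE_RATE_PER_HR = GPUS_PER_NODE * 4.50  # $36/hr, the contract rate used in the examples


def millions_of_tokens_per_hr(tokens_per_sec_per_gpu, utilisation, gpus=GPUS_PER_NODE):
    """Millions of tokens the whole node serves per billed hour."""
    return tokens_per_sec_per_gpu * gpus * 3600 * utilisation / 1_000_000


def node_cpm(tokens_per_sec_per_gpu, utilisation, node_rate=NODE_RATE_PER_HR):
    """$ per million tokens for the node at the given throughput and utilisation."""
    return node_rate / millions_of_tokens_per_hr(tokens_per_sec_per_gpu, utilisation)


# (tokens/sec/GPU, utilisation) as stated in Examples 1 and 2.
examples = {
    "chat (FP8, vLLM, 80 TPS/user)": (3_000, 0.70),
    "batch (FP4, TensorRT-LLM, batch 256)": (20_000, 0.90),
}
for name, (throughput, utilisation) in examples.items():
    volume = millions_of_tokens_per_hr(throughput, utilisation)
    print(f"{name}: {volume:.1f}M tokens/hr, ${node_cpm(throughput, utilisation):.2f}/M")
```

Swapping in your own measured throughput and utilisation is the quickest way to see where a workload sits in the range.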

⚠ Watch out

The "$/GPU-hr × hours" half of the CPM formula gets all the procurement attention. The "tokens served" half, driven by software stack maturity and workload tuning, moves the number 10-100x more than the GPU rate does. A team that signs a 36-month B200 contract at a 30% discount but never optimises their inference stack often ends up with worse CPM than a team paying full rate on the latest TensorRT-LLM.
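To make the trade-off concrete, here is a hedged comparison of a discounted contract on an untuned stack against full rate on a tuned one. Every number below is hypothetical: the $6.00 list rate, the 3,000 and 6,000 tok/s/GPU throughputs, and the assumption that tuning doubles throughput are chosen only to illustrate the mechanism.

```python
def cpm(gpu_rate_per_hr, tokens_per_sec_per_gpu):
    """$ per million tokens at full utilisation."""
    return gpu_rate_per_hr / (tokens_per_sec_per_gpu * 3600 / 1_000_000)


LIST_RATE = 6.00                          # hypothetical on-demand $/GPU-hr
DISCOUNTED_RATE = LIST_RATE * (1 - 0.30)  # 30% discount for a long commitment

# Hypothetical throughputs: the tuned stack serves 2x the tokens per GPU.
untuned_cpm = cpm(DISCOUNTED_RATE, 3_000)
tuned_cpm = cpm(LIST_RATE, 6_000)

print(f"30% discount, untuned stack: ${untuned_cpm:.2f}/M")
print(f"full rate, tuned stack:      ${tuned_cpm:.2f}/M")
```

Under these assumptions the full-rate team still pays less per token, because a 2x throughput gain outweighs a 30% rate discount.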

◆ HYPERSCALER STACKING
The hyperscaler stacking that doubles the bill

The GPU rate is the first line on the invoice, not the last. On hyperscalers, the additional line items below routinely add 20-40% to the inference bill per GMI Cloud's 2026 analysis, with worst-case stacking pushing the effective rate 50-100% above the advertised GPU rate per Ace Cloud's December 2025 hidden-cost breakdown.

| Line item | Typical hyperscaler cost | Impact on CPM |
|---|---|---|
| Egress (data leaving cloud) | $0.08-$0.12/GB | 10-20% of inference bill for high-traffic APIs; $800-$1,200/month at 10 TB |
| High-performance storage | Per-GB pricing for model weights, KV cache, checkpoints | Meaningful for frequent checkpointing; quiet at steady-state inference |
| Cross-zone / inter-region networking | ~$0.01-$0.02/GB intra-region; $0.05+/GB inter-region | Multi-AZ deployments accrue per-GB charges on every replication hop |
| Support tier | ~10% of monthly spend | Effectively a 10% surcharge on the GPU rate for production-grade support |
| Reserved-but-idle | 100% of reserved rate × idle hours | A 50% utilised reserved cluster has 2x the effective CPM of a 100% utilised one |

Egress and storage figures from GMI Cloud (2026), Spheron (April 2026), Ace Cloud (December 2025). Support tier from typical hyperscaler enterprise pricing.

Per Spheron's April 2026 egress analysis, a model server that restarts once per day with 140 GB of weights and standard S3 GET egress accrues $378/month in egress alone, for a workload that hasn't served a single inference request yet. The "GPU rate" never captures this.
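The restart figure can be checked with simple arithmetic. In this sketch the $0.09/GB rate and the 30-day month are assumptions: the source quoted here gives only the $378 total, and these inputs (which sit inside the $0.08-$0.12/GB range cited above) happen to reproduce it. The function name is illustrative.

```python
def monthly_restart_egress_cost(weights_gb, restarts_per_day, egress_rate_per_gb, days=30):
    """Egress cost of re-pulling model weights on every restart, per month."""
    return weights_gb * restarts_per_day * days * egress_rate_per_gb


# Assumed inputs: 140 GB of weights, one restart per day, $0.09/GB, 30 days.
cost = monthly_restart_egress_cost(weights_gb=140, restarts_per_day=1, egress_rate_per_gb=0.09)
print(f"${cost:.0f}/month before serving a single request")
```

Caching weights on local or same-zone storage removes most of this line item, which is why restart frequency is worth tracking alongside the GPU rate.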

The compounding effect at scale: a hyperscaler B200 advertised at $6/hr that becomes $8-12 effective at production load raises CPM by a third at best and doubles it at worst. A $0.30/M token workload running entirely within a hyperscaler ecosystem (with cross-zone traffic for HA, customer-facing egress, and an enterprise support tier) routinely lands at $0.45-$0.60/M effective. Lyceum Technology's April 2026 analysis frames it bluntly: "Egress fees are the 'hidden tax' of AI infrastructure. Moving large datasets or model weights between regions can cost thousands of dollars on US-based clouds."
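A minimal sketch of how stacking flows through to CPM, assuming throughput is unchanged so that CPM scales linearly with the effective rate. The `stacked_cpm` helper is hypothetical; the uplift fractions are the 20-40% typical and 50-100% worst-case figures quoted in this section.

```python
def stacked_cpm(base_cpm, uplift):
    """Scale CPM by the fractional uplift that stacking adds to the GPU rate."""
    if uplift < 0:
        raise ValueError("uplift must be non-negative")
    return base_cpm * (1 + uplift)


scenarios = [
    ("typical, low end", 0.20),
    ("typical, high end", 0.40),
    ("worst case, low end", 0.50),
    ("worst case, high end", 1.00),
]
for label, uplift in scenarios:
    print(f"{label}: ${stacked_cpm(0.30, uplift):.2f}/M")
```

For a $0.30/M base, the worst-case rows give the $0.45-$0.60/M effective range described above.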

For the broader hyperscaler-vs-contract pricing breakdown, see the GPUaaS.com GPU pricing guide and hyperscale comparison.

◆ DEFLATION CURVE
The software deflation curve still has a long way to run

Per the 2026 Unit Economics Reckoning analysis, GPT-4-level inference cost dropped from approximately $60 per million tokens in early 2024 to $0.30-$0.75 per million tokens by early 2026. A 98%+ collapse in two years. The hardware accounts for some of it (Hopper to Blackwell, FP16 to FP4). The software accounts for most of it.

Three software-driven CPM moves from 2026 alone:

  • B200 GPT-OSS-120B: $0.11 → $0.02 in two months. NVIDIA Developer's April 2026 data: a 5x cost reduction on identical hardware, from TensorRT-LLM updates (kernel fusion, quantisation, scheduling).
  • GB300 NVL72 DeepSeek-R1: 2.7x throughput in 6 months. Per MLPerf Inference v6.0 (April 2026), GB300 NVL72 hit 2.5M tokens/sec on DeepSeek-R1, 2.7x higher than its debut submission six months prior, entirely from software.
  • Lambda HGX B200 GPT-OSS-120B: 60,220 tok/s offline, 53,463 tok/s server (v6.0). Software stack maturity (CUDA 12.9 → 13.1) delivered 9% gain on Llama 3.1 8B alone over six months on the same hardware.

AMD's March 2026 analysis frames the procurement implication for buyers: "Since February, MI355X GPU cost per token has dropped significantly, while GB300 NVL72 remains higher and unchanged." In other words, software optimisation work landing on competing platforms can shift the CPM Pareto frontier inside a procurement cycle. SemiAnalysis InferenceX (May 2026) put AMD MI355X with SGLang FP8 at $0.22/M on GLM-5 with MTP versus B200's $0.30/M at the same operating point, a 27% gap that didn't exist three months earlier.

⚡ The deflation rule

Whatever CPM number anchors your B200 procurement model today is a ceiling, not a floor. Software updates on the same hardware have moved CPM by 2-5x in 60-90 day windows throughout 2026. A multi-year commitment locks the GPU rate, not the throughput. Sign accordingly.
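The rule follows directly from the CPM formula: with the $/GPU-hr rate locked, CPM falls in proportion to any throughput gain. The sketch below assumes a $0.30/M starting point and applies the 2x and 5x gains mentioned above; the function name is illustrative.

```python
def cpm_after_software_gain(cpm_today, throughput_gain):
    """CPM at a fixed $/GPU-hr once throughput improves by `throughput_gain`x."""
    if throughput_gain <= 0:
        raise ValueError("throughput gain must be positive")
    return cpm_today / throughput_gain


for gain in (1, 2, 5):
    print(f"{gain}x throughput: ${cpm_after_software_gain(0.30, gain):.2f}/M")
```

The gain only materialises if you actually adopt the updated stack, which is why contract length and optimisation roadmap belong in the same conversation.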

◆ PROCUREMENT
Procurement framing: lock the target, not the rate

The cost-per-token procurement conversation usually starts with "what's the $/GPU-hr rate?" That's the wrong opening question. The right one: "what cost-per-million-tokens target do we need to hit for this workload to be a profitable product?"

Five questions that move the procurement decision toward defensible numbers:

  • What's our CPM target, and what assumption sits underneath it? A SaaS product charging $0.50/M for AI features against an underlying $0.30/M inference cost has a 40% gross margin. A product charging $0.10/M against $0.30/M cost is losing money on every call. Model both before signing.
  • What interactivity target does our workload actually require? 18 TPS/user vs 100 TPS/user changes the CPM 2-3x on the same B200. Most chat workloads don't need 100 TPS/user; most agentic workloads don't need 18.
  • Have we measured CPM on our model, our framework, our batch size? Not derived from NVIDIA's GPT-OSS-120B headline. B200 CPM varies 15x across configurations; the procurement assumption needs to come from a real benchmark.
  • Have we factored in the hyperscaler stacking? Egress, storage, cross-zone, and support typically add 20-40% to the GPU rate. A contract-based alternative without these stacking line items often beats a discounted hyperscaler reserved rate at production volume.
  • Does the contract length match our software optimisation roadmap? A 12-month contract that lets you reprice as TensorRT-LLM and inference framework maturity drives CPM down often beats a 36-month commitment locking today's throughput assumptions.
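The margin check in the first question is easy to automate. A minimal sketch, using the prices and cost from that bullet; `gross_margin` is an illustrative helper, and a real model would also include non-inference costs.

```python
def gross_margin(price_per_m, cost_per_m):
    """Gross margin as a fraction of price, both in $ per million tokens."""
    if price_per_m <= 0:
        raise ValueError("price must be positive")
    return (price_per_m - cost_per_m) / price_per_m


for price in (0.50, 0.10):
    margin = gross_margin(price, 0.30)
    print(f"price ${price:.2f}/M, cost $0.30/M: {margin:.0%} gross margin")
```

A negative result means every call loses money, no matter how low the GPU rate is.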

GPUaaS.com offers short-term and long-term B200 and B300 contracts with no multi-year lock-in and no egress markup. The CPM math works in the customer's favour at every contract length because the variables that drive it (software stack maturity, batch size, precision) are entirely under the customer's control. For the broader procurement framing, see the real TCO of a GPU cluster in 2026 and tokenmaxxing and exploding enterprise AI bills.

◆ FAQ
Frequently asked questions

B200 cost per million tokens ranges from $0.02 (GPT-OSS-120B, FP4, TensorRT-LLM, 55 TPS/user per NVIDIA Developer April 2026) to $0.31 (GLM-5, FP8, SGLang, 10 TPS/user per SemiAnalysis InferenceX May 2026). A 15x spread on the same hardware. Realistic production CPM for FP8 chat workloads at 50-100 TPS/user typically lands in the $0.10-$0.50/M range. Your CPM depends on model, precision, framework, batch size, and interactivity target. Measure on your workload before modelling.

CPM = ($/GPU-hr × GPU-hours used) ÷ (millions of tokens generated). The headline GPU rate ($/GPU-hr) is the first variable; the throughput (tokens/sec/GPU), utilisation rate, batch size, and interactivity target (TPS/user) drive 10-100x more variance than the rate itself. Always compute CPM from real measured throughput on your model and framework, not from public peak numbers.

NVIDIA's $0.02/M figure is GPT-OSS-120B specifically with FP4 precision, TensorRT-LLM stack, and 55 TPS/user interactivity per the April 2026 SemiAnalysis InferenceX data. Other published numbers (like SemiAnalysis InferenceX's $0.30/M on GLM-5 at 18 TPS/user) measure different models, different precision, and different interactivity targets. Both are correct for their stated configurations; neither tells you what your workload will cost without a matching benchmark.

Hyperscaler hidden costs typically add 20-40% on top of the advertised GPU rate, per GMI Cloud's 2026 analysis. Worst-case stacking (egress at $0.08-$0.12/GB, high-performance storage, cross-zone replication, and enterprise support tier) can push effective costs 50-100% above the GPU rate, per Ace Cloud's December 2025 breakdown. A high-traffic chatbot routinely sees egress alone add 10-20% to the monthly bill. Move 10 TB/month out of the cloud and that's $800-$1,200 in egress charges on top of the GPU bill.

The "Great Token Deflation" is the 98%+ collapse in cost per million tokens for GPT-4-class inference between early 2024 and early 2026, from roughly $60/M tokens to $0.30-$0.75/M. The drivers: Blackwell hardware (3-4x H100 inference per GPU), FP4 native support, and inference framework maturity (TensorRT-LLM, vLLM, SGLang all shipping 2-5x improvements per quarter). Inference now accounts for 80-90% of AI compute consumption, making CPM the central unit economics metric for AI products in 2026.

Optimise for the lowest CPM, not the lowest $/GPU-hr rate. The rate is a single variable; CPM captures the full economics: rate, throughput, utilisation, and stacking costs combined. A $4/hr B200 contract with strong software stack support often beats a $3/hr B200 contract with worse throughput tooling. CPM is the only number that maps directly to product gross margin, so it's the only one worth optimising for at the procurement stage.

For large models (70B+) and FP4-friendly workloads, B200 typically delivers 4-7x lower CPM than H100. Inworld's April 2026 analysis cites ~$0.02/M on B200 vs ~$0.14/M on H100, a 7x reduction, though it does not specify the configuration. For smaller models (under ~30B parameters) or workloads where H200's 141GB HBM3e already fits the model comfortably with mature FP8 tooling, the gap narrows substantially. Run the math on your specific model before assuming B200 wins on CPM. For some workloads, H200 contract rates of ~$3.00/GPU/hr deliver lower CPM than B200 at ~$4.50/GPU/hr. See H100 vs H200 vs B200 for the decision framework.
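One way to frame the H200-versus-B200 question is as a break-even throughput ratio. The sketch below assumes equal utilisation on both GPUs and uses the indicative contract rates quoted above; `breakeven_throughput_ratio` is a hypothetical helper name.

```python
def breakeven_throughput_ratio(rate_a, rate_b):
    """Throughput multiple GPU A needs over GPU B to match B's CPM.

    Assumes both GPUs run at the same utilisation.
    """
    if rate_a <= 0 or rate_b <= 0:
        raise ValueError("rates must be positive")
    return rate_a / rate_b


ratio = breakeven_throughput_ratio(4.50, 3.00)  # B200 vs H200, $/GPU/hr
print(f"B200 must sustain at least {ratio:.1f}x H200 throughput to match its CPM")
```

If your measured B200 throughput on a given model is less than that multiple of your measured H200 throughput, H200 is the cheaper option per token for that workload.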

Last reviewed: June 11, 2026. B200 cost-per-million-tokens figures from NVIDIA Developer Deep Learning Performance Hub (April 2026), NVIDIA DGX B200 product page (Q1 2026), SemiAnalysis InferenceX (May 2026), Inworld B200 GPU guide (April 2026), Spheron GPU cost-per-token benchmark (April 2026), AMD "Many Aspects of Inference Performance" (March 2026), and MLPerf Inference v6.0 results via Lambda and Nebius (April 2026). Hyperscaler hidden-cost figures from GMI Cloud (2026), Spheron GPU cloud egress costs (May 2026), Ace Cloud hidden cloud GPU costs (December 2025), Lyceum Technology hyperscaler GPU pricing alternatives (April 2026), and GPUPerHour data egress comparison (April 2026). Token deflation framing from 2026 Unit Economics Reckoning analysis. GPUaaS.com rates are indicative, contract-based, and quote-dependent.
