What's the cost per million tokens on B200 in 2026?

B200 CPM ranges from $0.02 (GPT-OSS-120B FP4 TensorRT-LLM 55 TPS/user) to $0.31 (GLM-5 FP8 SGLang 10 TPS/user), a roughly 15x spread on the same hardware. Realistic production for FP8 chat at 50-100 TPS/user typically lands in the $0.10-$0.50/M range.

How is cost per million tokens calculated?

CPM = ($/GPU-hr × GPU-hours used) ÷ (millions of tokens generated). The throughput, utilisation, batch size, and interactivity target drive 10-100x more variance than the GPU rate itself.

How much do hyperscaler hidden costs add to inference CPM?

20-40% typically, with worst-case stacking (egress, storage, cross-zone, support tier) pushing effective costs 50-100% above the advertised GPU rate.

What is the Great Token Deflation of 2026?

The 98%+ collapse in CPM for GPT-4-class inference between early 2024 and early 2026, from ~$60/M to $0.30-$0.75/M. It was driven by Blackwell hardware, FP4 native support, and inference framework maturity.

Should I optimise for lowest GPU rate or lowest cost per million tokens?

Lowest CPM, always. The $/GPU-hr rate is one variable; CPM captures the full economics, including throughput, utilisation, and stacking costs.

B200 Cost per Million Tokens, Measured in 2026

NVIDIA quotes B200 inference at $0.02 per million tokens on GPT-OSS-120B. That number is real, recent (April 2026 SemiAnalysis InferenceX data via NVIDIA Developer), and almost certainly not what your workload will cost. The same B200 hardware lands anywhere from $0.02 to $0.31 per million tokens depending on model, precision, framework, batch size, and interactivity target. SemiAnalysis InferenceX data from May 2026 puts GLM-5 on B200 at $0.30/M tokens with MTP at 18 TPS/user. Layer in hyperscaler egress, storage, and support fees and the advertised $6 GPU-hour can become $8-12 effective. This is how the cost-per-million-tokens math actually works on B200 in 2026, and what to lock in before signing for capacity.

Key takeaways

NVIDIA's $0.02/M tokens headline on B200 is GPT-OSS-120B, FP4, TensorRT-LLM, 55 TPS/user. SemiAnalysis InferenceX puts GLM-5 on B200 at $0.30/M at 18 TPS/user, a 15x spread driven by model, precision, framework, and interactivity target (NVIDIA Developer, April 2026; SemiAnalysis InferenceX, May 2026)
The "Great Token Deflation" of 2026 is real: GPT-4-class inference dropped from $60/M tokens in early 2024 to $0.30-$0.75/M by early 2026, a 98%+ collapse that came largely from Blackwell hardware plus software stack maturity. Inference is now 80-90% of AI compute consumption
The cost-per-token formula is simple: ($/GPU-hr × GPU-hours) ÷ tokens served. The five variables that move the number 10-100x: model architecture, precision (FP4 vs FP8 vs BF16), inference framework (TensorRT-LLM vs vLLM vs SGLang), batch size, and target interactivity (TPS/user)
Hyperscaler "all-in" inference cost is 20-40% higher than the advertised GPU rate once egress ($0.08-$0.12/GB), storage, support tiers, and cross-zone networking stack. A 10 TB/month egress workload adds $800-$1,200 alone (GMI Cloud, 2026)
Software movement matters more than hardware for cost-per-token in 2026: B200 dropped from $0.11 to $0.02 per million tokens in two months on the same hardware from TensorRT-LLM updates alone (NVIDIA Developer, April 2026)
GPUaaS.com offers short-term and long-term B200 and B300 contracts with no multi-year lock-in and no egress markup. H100 from ~$2.50/GPU/hr, H200 from ~$3.00/GPU/hr, B200 and B300 from ~$4.50/GPU/hr

Cost per million tokens is the metric that decides whether an inference workload is a profitable product or a money pit. It's also the metric most procurement decisions get wrong. The published numbers (NVIDIA's $0.02/M, Anthropic's $3/M Claude rate, OpenAI's $2.50/M for GPT-4 Turbo equivalent) all measure something different, on different hardware, with different precision and serving topology. None of them tell you what your workload will cost to serve on B200. This guide walks through the cost-per-million-tokens math from first principles, shows the realistic 2026 range for B200 across workload types, layers in the hyperscaler hidden-cost stacking, and closes with the procurement framing for locking the right target. For the B200 benchmarking methodology, see how to benchmark your workload before committing to B200.

In this article

01 The cost-per-million-tokens math from first principles
02 The measured B200 cost-per-token range in 2026
03 Worked examples: chat, batch, and agentic workloads
04 The hyperscaler stacking that doubles the bill
05 The software deflation curve still has a long way to run
06 Procurement framing: lock the target, not the rate
07 Frequently asked questions

◆ THE MATH

The cost-per-million-tokens math from first principles

Cost per million tokens (CPM) is a derived metric. The formula:

CPM = ($/GPU-hr × GPU-hours used) ÷ (millions of tokens generated)

Expand that and the five variables that actually drive CPM become visible:

GPU rate ($/GPU-hr). The advertised number. On contract, GPUaaS.com B200 runs from ~$4.50/GPU/hr. Spheron's spot B200 sits at $2.07/GPU/hr. Hyperscaler on-demand B200 typically lands at $6-$8/GPU/hr, with reserved discounts of 30-50% for multi-year commitments.
Throughput (tokens/sec/GPU). This is where the 10-100x variance lives. B200 ranges from ~6,972 tok/s/GPU at FP8 (Llama 2 70B offline) to 60,000 tok/s/GPU at FP4 with TensorRT-LLM disaggregated serving (GPT-OSS-120B) per SemiAnalysis InferenceX, April 2026. An 8.6x spread on the same hardware.
Utilisation. A B200 cluster billed 24/7 but only serving traffic 30% of the time has 3.3x the effective CPM of one running near full utilisation. Per Cast AI's 2026 data, average GPU utilisation across 23,000 measured clusters is 5%.
Batch size. Per Spheron's April 2026 cost-per-token benchmark, batch size is the biggest single lever on CPM. At batch size 1, GPU utilisation is very low and CPM can be 50-100x higher than at batch size 256.
Interactivity target (TPS/user). The tightest constraint in production. Per AMD's GTC 2026 analysis, the right cost-per-token question is always paired with an interactivity target. The same B200 cluster serving at 100 TPS/user costs roughly 2-3x more per token than at 18 TPS/user, because you can pack fewer concurrent users into the same compute.
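The variables above can be folded into one small function. This is a minimal sketch, not a vendor formula: the name `cost_per_million_tokens` and its parameters are assumptions made for illustration, and the 3,000 tok/s/GPU input is a hypothetical throughput. Batch size and interactivity target do not appear as separate arguments because they act through the sustained throughput measured at a given operating point.

```python
def cost_per_million_tokens(gpu_rate_per_hr, tokens_per_sec_per_gpu, utilisation=1.0):
    """Dollars per million tokens for one billed GPU (illustrative helper)."""
    if tokens_per_sec_per_gpu <= 0:
        raise ValueError("throughput must be positive")
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    # Tokens actually served per billed hour, expressed in millions.
    million_tokens_per_hr = tokens_per_sec_per_gpu * 3600 * utilisation / 1e6
    return gpu_rate_per_hr / million_tokens_per_hr


# Assumed inputs: the ~$4.50/GPU/hr contract rate and a hypothetical 3,000 tok/s/GPU.
busy = cost_per_million_tokens(4.50, 3000, utilisation=1.0)
idle_heavy = cost_per_million_tokens(4.50, 3000, utilisation=0.3)

print(f"100% utilised: ${busy:.3f}/M")
print(f"30% utilised:  ${idle_heavy:.3f}/M ({idle_heavy / busy:.1f}x)")
```

With the rate and throughput held fixed, dropping utilisation to 30% multiplies CPM by 1/0.3, which is the 3.3x figure quoted in the utilisation bullet.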

Per AMD's "Many Aspects of Inference Performance" analysis (March 2026): "Every one of these is a software optimization point. Vendors can find a configuration that shows a large advantage. The right question is not which configuration makes a GPU look best, but rather what the cost per token is for a given workload and interactivity target."

In other words, a published CPM number without the interactivity target is incomplete. A B200 at $0.02/M at 55 TPS/user and a B200 at $0.30/M at 18 TPS/user are both correct numbers. They're just measuring different operating points on the same hardware.

◆ MEASURED RANGE

The measured B200 cost-per-token range in 2026

Here are the public B200 cost-per-million-tokens numbers from credible 2026 sources, sorted by configuration:

Source	Model	$/M tokens	Conditions
NVIDIA Developer (Apr 2026)	GPT-OSS-120B	$0.02	FP4, TensorRT-LLM, 55 TPS/user
NVIDIA DGX B200 page (Q1 2026)	GPT-OSS-120B	$0.09 (H200 vLLM)	vLLM on H200 baseline (~4.5x cheaper on B200)
SemiAnalysis InferenceX (May 2026)	GLM-5	$0.30	FP8, SGLang, MTP, 18 TPS/user
SemiAnalysis InferenceX (May 2026)	GLM-5 (no MTP)	$0.31	FP8, SGLang, 10 TPS/user
Inworld (Apr 2026)	Generic LLM	$0.02 (B200) vs $0.14 (H100)	7x reduction, configuration not specified
Spheron (Apr 2026)	Llama 3.3 70B	~$0.10-$0.20 est.	B200 SXM6 spot $2.07/hr, FP4 via TRT-LLM

Sources: NVIDIA Developer Deep Learning Performance Hub (April 2026); NVIDIA DGX B200 product page; SemiAnalysis InferenceX (May 2026); Inworld B200 GPU guide (April 2026); Spheron GPU cost per token benchmark (April 2026).

The 15x spread between $0.02/M and $0.31/M on the same B200 silicon is the central fact of inference economics in 2026. The hardware did not change. What changed: model (GPT-OSS-120B vs GLM-5), precision (FP4 vs FP8), framework (TensorRT-LLM vs SGLang), interactivity target (55 vs 18 TPS/user), and serving topology.
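The ratios quoted around the table can be checked directly from its figures. The snippet below is only a sanity check on the published numbers; the dictionary keys and the `ratio_to_headline` helper are illustrative names, not part of any cited source.

```python
# $/M tokens, copied from the table above (B200 rows plus the H100/H200 baselines).
HEADLINE = 0.02  # B200, GPT-OSS-120B, FP4, TensorRT-LLM, 55 TPS/user

comparisons = {
    "B200 GLM-5 FP8 SGLang, MTP, 18 TPS/user": 0.30,
    "B200 GLM-5 FP8 SGLang, no MTP, 10 TPS/user": 0.31,
    "H200 vLLM baseline (GPT-OSS-120B)": 0.09,
    "H100 (Inworld, configuration unspecified)": 0.14,
}


def ratio_to_headline(cpm, headline=HEADLINE):
    """How many times more expensive a figure is than the $0.02/M headline."""
    return cpm / headline


ratios = {name: ratio_to_headline(cpm) for name, cpm in comparisons.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the headline")
```

The two GLM-5 rows come out at 15.0x and 15.5x, which is where the "15x" shorthand comes from; the H200 and H100 rows reproduce the 4.5x and 7x figures in the table.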

Realistic operating range for production B200 in 2026

Best case: ~$0.02-$0.05/M for highly-optimised, large-batch, low-interactivity inference on FP4 with TensorRT-LLM
Typical production: ~$0.10-$0.50/M for FP8 production inference at moderate interactivity (50-100 TPS/user) on vLLM or SGLang
Latency-sensitive: ~$0.50-$2.00/M for high interactivity (200+ TPS/user) or low-utilisation deployments
Procurement modelling rule: assume 5-10x the headline NVIDIA number for typical first-deployment production workloads. Drive it down from there with software stack iteration.

For the framework around picking the right benchmark configuration before measuring CPM, see how to benchmark your workload before committing to B200.

◆ WORKED EXAMPLES

Worked examples: chat, batch, and agentic workloads

Three worked examples on a single 8-GPU HGX B200 node at a GPUaaS.com contract rate of ~$4.50/GPU/hr ($36/hr for the 8-GPU node):

Example 1: Customer support chat (Llama 3.3 70B FP8, vLLM, 80 TPS/user).

Realistic throughput: ~3,000 tokens/sec/GPU sustained at this interactivity
Node throughput: 24,000 tokens/sec total (3,000 × 8)
Tokens per hour at 70% utilisation: 24,000 × 3,600 × 0.7 = 60.5M tokens/hr
CPM = $36 / 60.5 = ~$0.60 per million tokens

Example 2: Batch summarisation (Llama 3.3 70B FP4, TensorRT-LLM, batch 256).

Realistic throughput: ~20,000 tokens/sec/GPU at large batch size
Node throughput: 160,000 tokens/sec total
Tokens per hour at 90% utilisation: 160,000 × 3,600 × 0.9 = 518M tokens/hr
CPM = $36 / 518 = ~$0.07 per million tokens

Example 3: Agentic workflow (DeepSeek-R1 671B FP8, SGLang, 30 TPS/user).

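671B parameters exceed single-GPU memory; the model needs a full 8-GPU node at minimum, or an NVL72 rack
Per SemiAnalysis InferenceX, GLM-5-class models land around $0.30/M at comparable interactivity (the published GLM-5 point is 18 TPS/user with MTP)
CPM = ~$0.30 per million tokens on HGX B200, taken from that GLM-5-class figure rather than a DeepSeek-R1 measurement; lower on a GB200 NVL72 rack with disaggregated serving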

Same hardware, same $/GPU-hr rate, three workloads, 9x CPM spread ($0.07 to $0.60). The procurement implication: a finance team modelling B200 inference cost without specifying the workload, precision, and interactivity target is modelling air. For the rack-scale option when the model demands it, see the GB200 NVL72 enterprise buyer's guide.
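Examples 1 and 2 can be reproduced in a few lines. This is a sketch using only the inputs stated above (8 GPUs, ~$4.50/GPU/hr, the quoted throughput and utilisation figures); the `node_cpm` helper is a name chosen for this example. Example 3 is left out because the article takes its CPM from a published figure rather than from a throughput number.

```python
NODE_GPUS = 8
GPU_RATE_PER_HR = 4.50  # indicative contract rate from the article


def node_cpm(tokens_per_sec_per_gpu, utilisation, gpus=NODE_GPUS, gpu_rate=GPU_RATE_PER_HR):
    """Dollars per million tokens for a whole node at a given utilisation."""
    node_cost_per_hr = gpus * gpu_rate
    million_tokens_per_hr = tokens_per_sec_per_gpu * gpus * 3600 * utilisation / 1e6
    return node_cost_per_hr / million_tokens_per_hr


chat = node_cpm(3_000, 0.70)    # Example 1: Llama 3.3 70B FP8, vLLM, 80 TPS/user
batch = node_cpm(20_000, 0.90)  # Example 2: Llama 3.3 70B FP4, TensorRT-LLM, batch 256

print(f"chat:   ${chat:.2f}/M")
print(f"batch:  ${batch:.2f}/M")
print(f"spread: {chat / batch:.1f}x")
```

Before rounding, the chat-to-batch ratio is about 8.6x, which the text rounds to 9x.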

⚠ Watch out

The "$/GPU-hr × hours" half of the CPM formula gets all the procurement attention. The "tokens served" half, driven by software stack maturity and workload tuning, moves the number 10-100x more than the GPU rate does. A team that signs a 36-month B200 contract at a 30% discount but never optimises their inference stack often ends up with worse CPM than a team paying full rate on the latest TensorRT-LLM.
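A quick way to see why the discount can lose: a rate discount of d is matched by a throughput gain of 1 / (1 - d). The sketch below illustrates that break-even under stated assumptions. The 3,000 tok/s/GPU baseline and the 1.5x gain from stack updates are hypothetical inputs, and both helper names are invented for this example.

```python
def cpm(gpu_rate_per_hr, tokens_per_sec_per_gpu):
    """Dollars per million tokens at full utilisation."""
    return gpu_rate_per_hr / (tokens_per_sec_per_gpu * 3600 / 1e6)


def breakeven_throughput_gain(discount):
    """Throughput multiplier that offsets a fractional rate discount."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 / (1 - discount)


FULL_RATE = 4.50  # indicative contract rate from the article
BASE_TPS = 3_000  # hypothetical baseline throughput per GPU

# Team A: 30% discount, never re-tunes the inference stack.
discounted_unoptimised = cpm(FULL_RATE * 0.70, BASE_TPS)
# Team B: full rate, assumed 1.5x throughput gain from software updates.
full_rate_optimised = cpm(FULL_RATE, BASE_TPS * 1.5)

print(f"break-even gain for a 30% discount: {breakeven_throughput_gain(0.30):.2f}x")
print(f"discounted, unoptimised: ${discounted_unoptimised:.3f}/M")
print(f"full rate, optimised:    ${full_rate_optimised:.3f}/M")
```

Any sustained throughput gain above roughly 1.43x outweighs a 30% rate discount, which is small next to the 2-5x software-driven moves described later in the article.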

◆ HYPERSCALER STACKING

The hyperscaler stacking that doubles the bill

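The GPU rate is the first line on the invoice, not the last. On hyperscalers, four additional line items (egress, storage, cross-zone networking, and support tier) routinely add 20-40% to the inference bill per GMI Cloud's 2026 analysis, with worst-case stacking pushing the effective rate 50-100% above the advertised GPU rate per Ace Cloud's December 2025 hidden-cost breakdown. The table below also lists reserved-but-idle capacity, which raises effective CPM through utilisation rather than through an extra invoice line.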

Line item	Typical hyperscaler cost	Impact on CPM
Egress (data leaving cloud)	$0.08-$0.12/GB	10-20% of inference bill for high-traffic APIs; $800-$1,200/month at 10 TB
High-performance storage	Per-GB pricing for model weights, KV cache, checkpoints	Meaningful for frequent checkpointing; quiet at steady-state inference
Cross-zone / inter-region networking	~$0.01-$0.02/GB intra-region; $0.05+/GB inter-region	Multi-AZ deployments accrue per-GB charges on every replication hop
Support tier	~10% of monthly spend	Effectively a 10% surcharge on the GPU rate for production-grade support
Reserved-but-idle	100% of reserved rate × idle hours	A 50% utilised reserved cluster has 2x the effective CPM of a 100% utilised one

Egress and storage figures from GMI Cloud (2026), Spheron (April 2026), Ace Cloud (December 2025). Support tier from typical hyperscaler enterprise pricing.

Per Spheron's April 2026 egress analysis, a model server that restarts once per day with 140 GB of weights and standard S3 GET egress accrues $378/month in egress alone, for a workload that hasn't served a single inference request yet. The "GPU rate" never captures this.
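The egress figures are simple per-GB arithmetic. The sketch below reproduces them under two assumptions that are not stated in the sources: a 30-day month, and a $0.09/GB rate for the restart case (the rate that makes 140 GB per day come out at $378). The 10 TB case uses decimal terabytes and the $0.08-$0.12/GB range from the table.

```python
def monthly_egress_cost(gb_per_month, rate_per_gb):
    """Monthly egress charge in dollars for a given transfer volume."""
    return gb_per_month * rate_per_gb


# Daily restart pulling 140 GB of weights, assumed 30-day month and $0.09/GB.
restart_cost = monthly_egress_cost(140 * 30, 0.09)

# 10 TB/month of customer-facing egress at the table's $0.08-$0.12/GB range.
api_low = monthly_egress_cost(10_000, 0.08)
api_high = monthly_egress_cost(10_000, 0.12)

print(f"weight reloads: ${restart_cost:,.0f}/month")
print(f"10 TB egress:   ${api_low:,.0f}-${api_high:,.0f}/month")
```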

The compounding effect at scale: a hyperscaler B200 advertised at $6/hr that becomes $8-12 effective at production load raises CPM by roughly a third at the low end and doubles it at the high end. A $0.30/M token workload running entirely within a hyperscaler ecosystem (with cross-zone traffic for HA, customer-facing egress, and an enterprise support tier) routinely lands at $0.45-$0.60/M effective. Lyceum Technology's April 2026 analysis frames it bluntly: "Egress fees are the 'hidden tax' of AI infrastructure. Moving large datasets or model weights between regions can cost thousands of dollars on US-based clouds."
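To see how the overhead percentages map onto CPM, the sketch below applies them to the $0.30/M baseline. It assumes the overhead scales in proportion to GPU spend and leaves throughput untouched, which is a simplification; `effective_cpm` is a name chosen for this example.

```python
def effective_cpm(base_cpm, overhead_fraction):
    """CPM after stacking costs expressed as a fraction of the GPU bill."""
    if overhead_fraction < 0:
        raise ValueError("overhead_fraction must be non-negative")
    return base_cpm * (1 + overhead_fraction)


BASE_CPM = 0.30  # $/M tokens before stacking

scenarios = [
    ("typical, low end", 0.20),
    ("typical, high end", 0.40),
    ("worst case, low end", 0.50),
    ("worst case, high end", 1.00),
]
for label, overhead in scenarios:
    print(f"{label}: ${effective_cpm(BASE_CPM, overhead):.2f}/M")
```

The typical 20-40% band gives $0.36-$0.42/M, and the 50-100% worst-case band gives the $0.45-$0.60/M range quoted above.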

For the broader hyperscaler-vs-contract pricing breakdown, see the GPUaaS.com GPU pricing guide and hyperscale comparison.

◆ DEFLATION CURVE

The software deflation curve still has a long way to run

Per the 2026 Unit Economics Reckoning analysis, GPT-4-level inference cost dropped from approximately $60 per million tokens in early 2024 to $0.30-$0.75 per million tokens by early 2026. A 98%+ collapse in two years. The hardware accounts for some of it (Hopper to Blackwell, FP16 to FP4). The software accounts for most of it.

Three software-driven CPM moves from 2026 alone:

B200 GPT-OSS-120B: $0.11 → $0.02 in two months. NVIDIA Developer's April 2026 data: a 5x cost reduction on identical hardware, from TensorRT-LLM updates (kernel fusion, quantisation, scheduling).
GB300 NVL72 DeepSeek-R1: 2.7x throughput in 6 months. Per MLPerf Inference v6.0 (April 2026), GB300 NVL72 hit 2.5M tokens/sec on DeepSeek-R1, 2.7x higher than its debut submission six months prior, entirely from software.
Lambda HGX B200 GPT-OSS-120B: 60,220 tok/s offline, 53,463 tok/s server (MLPerf Inference v6.0). Software stack maturity (CUDA 12.9 → 13.1) delivered a 9% gain on Llama 3.1 8B alone over six months on the same hardware.

AMD's March 2026 analysis frames the procurement implication for buyers: "Since February, MI355X GPU cost per token has dropped significantly, while GB300 NVL72 remains higher and unchanged." In other words, the software optimisation work landing on competing platforms can shift the CPM Pareto frontier inside a procurement cycle. SemiAnalysis InferenceX (May 2026) confirmed AMD MI355X with SGLang FP8 landed at $0.22/M on GLM-5 with MTP versus B200's $0.30/M at the same operating point, a 27% gap that didn't exist three months earlier.
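The headline percentages in this section follow from the quoted prices. The check below uses only figures stated in the text; the helper names `pct_drop` and `fold_reduction` are illustrative.

```python
def pct_drop(old, new):
    """Percentage decrease from old to new."""
    return (old - new) / old * 100


def fold_reduction(old, new):
    """How many times smaller new is than old."""
    return old / new


deflation_low = pct_drop(60.0, 0.75)          # GPT-4-class, upper end of the 2026 range
deflation_high = pct_drop(60.0, 0.30)         # GPT-4-class, lower end of the 2026 range
trtllm_gain = fold_reduction(0.11, 0.02)      # B200 GPT-OSS-120B over two months
mi355x_gap = pct_drop(0.30, 0.22)             # B200 vs MI355X on GLM-5 with MTP

print(f"token deflation: {deflation_low:.2f}%-{deflation_high:.2f}%")
print(f"TensorRT-LLM updates: {trtllm_gain:.1f}x cheaper")
print(f"MI355X vs B200 gap: {mi355x_gap:.1f}%")
```

The deflation works out to 98.75-99.5%, the TensorRT-LLM move to 5.5x (quoted as 5x), and the MI355X gap to 26.7% (quoted as 27%).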

⚡ The deflation rule

Whatever CPM number anchors your B200 procurement model today is a ceiling, not a floor. Software updates on the same hardware have moved CPM by 2-5x in 60-90 day windows throughout 2026. A multi-year commitment locks the GPU rate, not the throughput. Sign accordingly.

◆ PROCUREMENT

Procurement framing: lock the target, not the rate

The cost-per-token procurement conversation usually starts with "what's the $/GPU-hr rate?" That's the wrong opening question. The right one: "what cost-per-million-tokens target do we need to hit for this workload to be a profitable product?"

Five questions that move the procurement decision toward defensible numbers:

What's our CPM target, and what assumption sits underneath it? A SaaS product charging $0.50/M for AI features against an underlying $0.30/M inference cost has a 40% gross margin. A product charging $0.10/M against $0.30/M cost is losing money on every call. Model both before signing.
What interactivity target does our workload actually require? 18 TPS/user vs 100 TPS/user changes the CPM 2-3x on the same B200. Most chat workloads don't need 100 TPS/user; most agentic workloads don't need 18.
Have we measured CPM on our model, our framework, our batch size? Not derived from NVIDIA's GPT-OSS-120B headline. B200 CPM varies 15x across configurations; the procurement assumption needs to come from a real benchmark.
Have we factored in the hyperscaler stacking? Egress, storage, cross-zone, and support typically add 20-40% to the GPU rate. A contract-based alternative without these stacking line items often beats a discounted hyperscaler reserved rate at production volume.
Does the contract length match our software optimisation roadmap? A 12-month contract that lets you reprice as TensorRT-LLM and inference framework maturity drives CPM down often beats a 36-month commitment locking today's throughput assumptions.
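The first question reduces to a gross-margin calculation that can be run in both directions: from price and cost to margin, or from price and target margin to the CPM ceiling a benchmark has to beat. This is a minimal sketch with illustrative function names; it ignores every cost other than inference.

```python
def gross_margin(price_per_m, cost_per_m):
    """Gross margin as a fraction of price, per million tokens."""
    if price_per_m <= 0:
        raise ValueError("price must be positive")
    return (price_per_m - cost_per_m) / price_per_m


def max_cpm_for_margin(price_per_m, target_margin):
    """Highest inference CPM that still meets a target gross margin."""
    return price_per_m * (1 - target_margin)


healthy = gross_margin(0.50, 0.30)          # charging $0.50/M against $0.30/M cost
underwater = gross_margin(0.10, 0.30)       # charging $0.10/M against $0.30/M cost
ceiling = max_cpm_for_margin(0.50, 0.40)    # CPM target implied by a 40% margin

print(f"$0.50/M price: {healthy:.0%} margin")
print(f"$0.10/M price: {underwater:.0%} margin")
print(f"CPM ceiling for 40% margin at $0.50/M: ${ceiling:.2f}/M")
```

The second case comes out at -200%: every dollar of revenue costs three dollars of inference to serve.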

GPUaaS.com offers short-term and long-term B200 and B300 contracts with no multi-year lock-in and no egress markup. The CPM math works in the customer's favour at every contract length because the variables that drive it (software stack maturity, batch size, precision) are entirely under the customer's control. For the broader procurement framing, see the real TCO of a GPU cluster in 2026 and tokenmaxxing and exploding enterprise AI bills.


◆ FAQ

Frequently asked questions

Last reviewed: June 11, 2026. B200 cost-per-million-tokens figures from NVIDIA Developer Deep Learning Performance Hub (April 2026), NVIDIA DGX B200 product page (Q1 2026), SemiAnalysis InferenceX (May 2026), Inworld B200 GPU guide (April 2026), Spheron GPU cost-per-token benchmark (April 2026), AMD "Many Aspects of Inference Performance" (March 2026), and MLPerf Inference v6.0 results via Lambda and Nebius (April 2026). Hyperscaler hidden-cost figures from GMI Cloud (2026), Spheron GPU cloud egress costs (May 2026), Ace Cloud hidden cloud GPU costs (December 2025), Lyceum Technology hyperscaler GPU pricing alternatives (April 2026), and GPUPerHour data egress comparison (April 2026). Token deflation framing from 2026 Unit Economics Reckoning analysis. GPUaaS.com rates are indicative, contract-based, and quote-dependent.
