NVIDIA quotes B200 inference at $0.02 per million tokens on GPT-OSS-120B. That number is real, recent (April 2026 SemiAnalysis InferenceX data via NVIDIA Developer), and almost certainly not what your workload will cost. The same B200 hardware lands anywhere from $0.02 to $0.31 per million tokens depending on model, precision, framework, batch size, and interactivity target. SemiAnalysis InferenceX data from May 2026 puts GLM-5 on B200 at $0.30/M tokens with MTP at 18 TPS/user. Layer in hyperscaler egress, storage, and support fees and the advertised $4 GPU-hour can become $8-12 effective. This is how the cost-per-million-tokens math actually works on B200 in 2026, and what to lock in before signing for capacity.
- NVIDIA's $0.02/M tokens headline on B200 is GPT-OSS-120B, FP4, TensorRT-LLM, 55 TPS/user. SemiAnalysis InferenceX puts GLM-5 on B200 at $0.30/M at 18 TPS/user. A 15x spread driven entirely by model, precision, and interactivity target (NVIDIA Developer, April 2026; SemiAnalysis InferenceX, May 2026)
- The "Great Token Deflation" of 2026 is real: GPT-4-class quality dropped from $60/M tokens in early 2024 to $0.30-$0.75/M by early 2026. A 98%+ collapse, largely from Blackwell hardware plus software stack maturity. Inference is now 80-90% of AI compute consumption
- The cost-per-token formula is simple: ($/GPU-hr × GPU-hours) ÷ tokens served. The five variables that move the number 10-100x: model architecture, precision (FP4 vs FP8 vs BF16), inference framework (TensorRT-LLM vs vLLM vs SGLang), batch size, and target interactivity (TPS/user)
- Hyperscaler "all-in" inference cost is 20-40% higher than the advertised GPU rate once egress ($0.08-$0.12/GB), storage, support tiers, and cross-zone networking stack. A 10 TB/month egress workload adds $800-$1,200 alone (GMI Cloud, 2026)
- Software movement matters more than hardware for cost-per-token in 2026: B200 dropped from $0.11 to $0.02 per million tokens in two months on the same hardware from TensorRT-LLM updates alone (NVIDIA Developer, April 2026)
- GPUaaS.com offers short-term and long-term B200 and B300 contracts with no multi-year lock-in and no egress markup. H100 from ~$2.50/GPU/hr, H200 from ~$3.00/GPU/hr, B200 and B300 from ~$4.50/GPU/hr
Cost per million tokens is the metric that decides whether an inference workload is a profitable product or a money pit. It's also the metric most procurement decisions get wrong. The published numbers (NVIDIA's $0.02/M, Anthropic's $3/M Claude rate, OpenAI's $2.50/M for GPT-4 Turbo equivalent) all measure something different, on different hardware, with different precision and serving topology. None of them tell you what your workload will cost to serve on B200. This guide walks through the cost-per-million-tokens math from first principles, shows the realistic 2026 range for B200 across workload types, layers in the hyperscaler hidden-cost stacking, and closes with the procurement framing for locking the right target. For the B200 benchmarking methodology, see how to benchmark your workload before committing to B200.
Cost per million tokens (CPM) is a derived metric. The formula:
Expand that and the five variables that actually drive CPM become visible:
- GPU rate ($/GPU-hr). The advertised number. On contract, GPUaaS.com B200 runs from ~$4.50/GPU/hr. Spheron's spot B200 sits at $2.07/GPU/hr. Hyperscaler on-demand B200 typically lands at $6-$8/GPU/hr, with reserved discounts of 30-50% for multi-year commitments.
- Throughput (tokens/sec/GPU). This is where the 10-100x variance lives. B200 ranges from ~6,972 tok/s/GPU at FP8 (Llama 2 70B offline) to 60,000 tok/s/GPU at FP4 with TensorRT-LLM disaggregated serving (GPT-OSS-120B) per SemiAnalysis InferenceX, April 2026. An 8.6x spread on the same hardware.
- Utilisation. A B200 cluster billed 24/7 but only serving traffic 30% of the time has 3.3x the effective CPM of one running near full utilisation. Per Cast AI's 2026 data, average GPU utilisation across 23,000 measured clusters is 5%.
- Batch size. Per Spheron's April 2026 cost-per-token benchmark, batch size is the biggest single lever on CPM. At batch size 1, GPU utilisation is very low and CPM can be 50-100x higher than at batch size 256.
- Interactivity target (TPS/user). The tightest constraint in production. Per AMD's GTC 2026 analysis, the right cost-per-token question is always paired with an interactivity target. The same B200 cluster serving at 100 TPS/user costs roughly 2-3x more per token than at 18 TPS/user, because you can pack fewer concurrent users into the same compute.
Per AMD's "Many Aspects of Inference Performance" analysis (March 2026): "Every one of these is a software optimization point. Vendors can find a configuration that shows a large advantage. The right question is not which configuration makes a GPU look best, but rather what the cost per token is for a given workload and interactivity target."
In other words, a published CPM number without the interactivity target is incomplete. A B200 at $0.02/M at 55 TPS/user and a B200 at $0.30/M at 18 TPS/user are both correct numbers. They're just measuring different operating points on the same hardware.
Here are the public B200 cost-per-million-tokens numbers from credible 2026 sources, sorted by configuration:
| Source | Model | $/M tokens | Conditions |
|---|---|---|---|
| NVIDIA Developer (Apr 2026) | GPT-OSS-120B | $0.02 | FP4, TensorRT-LLM, 55 TPS/user |
| NVIDIA DGX B200 page (Q1 2026) | GPT-OSS-120B | $0.09 (H200 vLLM) | vLLM on H200 baseline (~4.5x cheaper on B200) |
| SemiAnalysis InferenceX (May 2026) | GLM-5 | $0.30 | FP8, SGLang, MTP, 18 TPS/user |
| SemiAnalysis InferenceX (May 2026) | GLM-5 (no MTP) | $0.31 | FP8, SGLang, 10 TPS/user |
| Inworld (Apr 2026) | Generic LLM | $0.02 (B200) vs $0.14 (H100) | 7x reduction, configuration not specified |
| Spheron (Apr 2026) | Llama 3.3 70B | ~$0.10-$0.20 est. | B200 SXM6 spot $2.07/hr, FP4 via TRT-LLM |
Sources: NVIDIA Developer Deep Learning Performance Hub (April 2026); NVIDIA DGX B200 product page; SemiAnalysis InferenceX (May 2026); Inworld B200 GPU guide (April 2026); Spheron GPU cost per token benchmark (April 2026).
The 15x spread between $0.02/M and $0.31/M on the same B200 silicon is the central fact of inference economics in 2026. The hardware did not change. What changed: model (GPT-OSS-120B vs GLM-5), precision (FP4 vs FP8), framework (TensorRT-LLM vs SGLang), interactivity target (55 vs 18 TPS/user), and serving topology.
Realistic operating range for production B200 in 2026
- Best case: ~$0.02-$0.05/M for highly-optimised, large-batch, low-interactivity inference on FP4 with TensorRT-LLM
- Typical production: ~$0.10-$0.50/M for FP8 production inference at moderate interactivity (50-100 TPS/user) on vLLM or SGLang
- Latency-sensitive: ~$0.50-$2.00/M for high interactivity (200+ TPS/user) or low-utilisation deployments
- Procurement modelling rule: assume 5-10x the headline NVIDIA number for typical first-deployment production workloads. Drive it down from there with software stack iteration.
For the framework around picking the right benchmark configuration before measuring CPM, see how to benchmark your workload before committing to B200.
Three worked examples on a single 8-GPU HGX B200 node at GPUaaS.com contract rate of ~$4.50/GPU/hr ($36/hr for the 8-GPU node):
Example 1: Customer support chat (Llama 3.3 70B FP8, vLLM, 80 TPS/user).
- Realistic throughput: ~3,000 tokens/sec/GPU sustained at this interactivity
- Node throughput: 24,000 tokens/sec total (3,000 × 8)
- Tokens per hour at 70% utilisation: 24,000 × 3,600 × 0.7 = 60.5M tokens/hr
- CPM = $36 / 60.5 = ~$0.60 per million tokens
Example 2: Batch summarisation (Llama 3.3 70B FP4, TensorRT-LLM, batch 256).
- Realistic throughput: ~20,000 tokens/sec/GPU at large batch size
- Node throughput: 160,000 tokens/sec total
- Tokens per hour at 90% utilisation: 160,000 × 3,600 × 0.9 = 518M tokens/hr
- CPM = $36 / 518 = ~$0.07 per million tokens
Example 3: Agentic workflow (DeepSeek-R1 671B FP8, SGLang, 30 TPS/user).
- 671B parameters exceeds single-node memory; needs 8-GPU node minimum or NVL72 rack
- Per SemiAnalysis InferenceX, GLM-5-class models land around ~$0.30/M at this interactivity
- CPM = ~$0.30 per million tokens on HGX B200; lower on GB200 NVL72 rack with disaggregated serving
Same hardware, same $/GPU-hr rate, three workloads, 9x CPM spread ($0.07 to $0.60). The procurement implication: a finance team modelling B200 inference cost without specifying the workload, precision, and interactivity target is modelling air. For the rack-scale option when the model demands it, see the GB200 NVL72 enterprise buyer's guide.
⚠ Watch out
The "$/GPU-hr × hours" half of the CPM formula gets all the procurement attention. The "tokens served" half, driven by software stack maturity and workload tuning, moves the number 10-100x more than the GPU rate does. A team that signs a 36-month B200 contract at a 30% discount but never optimises their inference stack often ends up with worse CPM than a team paying full rate on the latest TensorRT-LLM.
Per the 2026 Unit Economics Reckoning analysis, GPT-4-level inference cost dropped from approximately $60 per million tokens in early 2024 to $0.30-$0.75 per million tokens by early 2026. A 98%+ collapse in two years. The hardware accounts for some of it (Hopper to Blackwell, FP16 to FP4). The software accounts for most of it.
Three software-driven CPM moves from 2026 alone:
- B200 GPT-OSS-120B: $0.11 → $0.02 in two months. NVIDIA Developer's April 2026 data: a 5x cost reduction on identical hardware, from TensorRT-LLM updates (kernel fusion, quantisation, scheduling).
- GB300 NVL72 DeepSeek-R1: 2.7x throughput in 6 months. Per MLPerf Inference v6.0 (April 2026), GB300 NVL72 hit 2.5M tokens/sec on DeepSeek-R1. 2.7x higher than its debut submission six months prior, entirely from software.
- Lambda HGX B200 GPT-OSS-120B: 60,220 tok/s offline, 53,463 tok/s server (v6.0). Software stack maturity (CUDA 12.9 → 13.1) delivered 9% gain on Llama 3.1 8B alone over six months on the same hardware.
AMD's March 2026 analysis frames the procurement implication for buyers: "Since February, MI355X GPU cost per token has dropped significantly, while GB300 NVL72 remains higher and unchanged". Meaning the software optimisation work landing on competing platforms can shift the CPM Pareto frontier inside a procurement cycle. SemiAnalysis InferenceX (May 2026) confirmed AMD MI355X with SGLang FP8 landed at $0.22/M on GLM-5 with MTP versus B200's $0.30/M at the same operating point. A 27% gap that didn't exist three months earlier.
⚡ The deflation rule
Whatever CPM number anchors your B200 procurement model today is a ceiling, not a floor. Software updates on the same hardware have moved CPM by 2-5x in 60-90 day windows throughout 2026. A multi-year commitment locks the GPU rate, not the throughput. Sign accordingly.
The cost-per-token procurement conversation usually starts with "what's the $/GPU-hr rate?" That's the wrong opening question. The right one: "what cost-per-million-tokens target do we need to hit for this workload to be a profitable product?"
Five questions that move the procurement decision toward defensible numbers:
- What's our CPM target, and what assumption sits underneath it? A SaaS product charging $0.50/M for AI features against an underlying $0.30/M inference cost has a 40% gross margin. A product charging $0.10/M against $0.30/M cost is losing money on every call. Model both before signing.
- What interactivity target does our workload actually require? 18 TPS/user vs 100 TPS/user changes the CPM 2-3x on the same B200. Most chat workloads don't need 100 TPS/user; most agentic workloads don't need 18.
- Have we measured CPM on our model, our framework, our batch size? Not derived from NVIDIA's GPT-OSS-120B headline. B200 CPM varies 15x across configurations; the procurement assumption needs to come from a real benchmark.
- Have we factored in the hyperscaler stacking? Egress, storage, cross-zone, and support typically add 20-40% to the GPU rate. A contract-based alternative without these stacking line items often beats a discounted hyperscaler reserved rate at production volume.
- Does the contract length match our software optimisation roadmap? A 12-month contract that lets you reprice as TensorRT-LLM and inference framework maturity drives CPM down often beats a 36-month commitment locking today's throughput assumptions.
GPUaaS.com offers short-term and long-term B200 and B300 contracts with no multi-year lock-in and no egress markup. The CPM math works in the customer's favour at every contract length because the variables that drive it (software stack maturity, batch size, precision) are entirely under the customer's control. For the broader procurement framing, see the real TCO of a GPU cluster in 2026 and tokenmaxxing and exploding enterprise AI bills.
Your search for enterprise GPU compute ends here.
NVIDIA infrastructure at rates hyperscalers won't offer you. H100, H200, B200, B300 clusters. Short-term and long-term contracts. No egress markup. Quotes within 24 hours.
Get a quote on your clusterLast reviewed: June 11, 2026. B200 cost-per-million-tokens figures from NVIDIA Developer Deep Learning Performance Hub (April 2026), NVIDIA DGX B200 product page (Q1 2026), SemiAnalysis InferenceX (May 2026), Inworld B200 GPU guide (April 2026), Spheron GPU cost-per-token benchmark (April 2026), AMD "Many Aspects of Inference Performance" (March 2026), and MLPerf Inference v6.0 results via Lambda and Nebius (April 2026). Hyperscaler hidden-cost figures from GMI Cloud (2026), Spheron GPU cloud egress costs (May 2026), Ace Cloud hidden cloud GPU costs (December 2025), Lyceum Technology hyperscaler GPU pricing alternatives (April 2026), and GPUPerHour data egress comparison (April 2026). Token deflation framing from 2026 Unit Economics Reckoning analysis. GPUaaS.com rates are indicative, contract-based, and quote-dependent.



