What is tokenmaxxing?

Tokenmaxxing is the enterprise behaviour of maximising AI token consumption because internal incentives reward visible AI usage over outcomes. At Uber, engineers were ranked by token usage, and the company burned through its full annual AI budget in four months.

Why did Uber burn through its AI budget so fast?

Widespread deployment (5,000 engineers), internal leaderboards rewarding usage, and a shift to agentic workflows consuming 5-30x more tokens per task than basic chat. Monthly per-engineer costs ran $500-$2,000.

Why did Microsoft cancel its Claude Code licences?

Token-based billing made internal costs unmanageable at enterprise scale. Microsoft cancelled most licences across Windows, Teams, and Surface in May 2026, six months after rollout, directing engineers to GitHub Copilot CLI instead.

What changed with GitHub Copilot billing in June 2026?

All plans moved to usage-based billing via GitHub AI Credits from June 1, 2026. GitHub admitted the flat premium request model was no longer sustainable as Copilot evolved into an agentic platform.

How does running your own GPU cluster change the AI cost equation?

Dedicated GPU compute is priced per GPU-hour, not per token. Agentic workflows don't multiply your bill. For teams above ~500M tokens/month, dedicated clusters typically deliver 3-10x lower cost-per-token than API pricing.

What is AI FinOps?

AI FinOps applies cloud financial operations discipline to AI inference spend: token budget allocation by team, cost chargebacks, inference optimisation, and outcome-based ROI measurement. The FinOps Foundation named AI the fastest-growing new spend category in 2026.

The Tokenmaxxing Problem: Why Enterprise AI Bills Are Exploding in 2026

Enterprise AI bills are exploding in 2026 not because GPU prices are rising, but because token-based pricing scales non-linearly with adoption — and most companies never modelled for it. Uber burned through its entire 2026 AI tools budget in four months. Microsoft cancelled its internal Claude Code licences six months after rolling them out. GitHub just announced it can no longer absorb the inference cost of agentic usage at flat rates. The pattern is the same everywhere: the tools got used, the bills arrived, and nobody had a framework to manage them.

Key takeaways

Uber burned through its entire 2026 AI tools budget by April — four months after rollout. Monthly per-engineer API costs hit $500–$2,000. The budget was supposed to last twelve months
Microsoft cancelled internal Claude Code licences across its Experiences & Devices division (Windows, Teams, Surface) on May 14, 2026. Engineers redirected to GitHub Copilot CLI
GitHub moved all Copilot plans to usage-based billing from June 1, 2026 — admitting it “can no longer absorb the escalating inference cost” of agentic usage at flat subscription rates
The root cause is tokenmaxxing: employees maximising token consumption because tools are priced per seat but billed per token at the infrastructure layer. Agentic workflows consume 5–30x more tokens than basic chat
The structural fix is owning your inference stack. Teams running models on dedicated GPU clusters at $1.20–$3.50/GPU/hr have deterministic, predictable costs — the opposite of API-layer token billing

This isn't about AI being too expensive. Per-token prices have fallen 280x over the past two years. The problem is volume — specifically what happens when agentic workflows replace single-query chatbots at scale. Every loop, every retry, every background agent running 24/7 compounds the bill in ways that seat-based procurement models simply don't predict.

If your team runs production inference on dedicated GPU clusters rather than API endpoints, this cost spiral hits you differently — and the maths are worth understanding. See the reserved vs on-demand GPU guide for the cost structure comparison, or the GPUaaS.com cluster catalogue for current pricing.

In this article

01 What happened: Uber, Microsoft, GitHub in four months
02 Tokenmaxxing: why this keeps happening
03 The agentic multiplier: why the shift to agents changes everything
04 The ROI gap: spending is up 108%, outcomes are unclear
05 Two AI cost models: API tokens vs dedicated GPU compute
06 What teams running their own inference stack should do now
07 Frequently asked questions

◆ WHAT HAPPENED

What happened: Uber, Microsoft, GitHub in four months

Three of the most prominent AI adopters in enterprise software hit the same wall between December 2025 and May 2026. The sequence matters because each story reveals a different layer of the same structural problem.

🚗

Uber — budget gone by April

December 2025 → April 2026

Uber rolled out Anthropic's Claude Code to roughly 5,000 engineers in December 2025. Adoption was immediate and explosive: agentic usage jumped from 32% of engineers in February to 84% by March. By April 2026, the full-year AI tools budget was gone — in four months.

Monthly API costs per engineer ran $500–$2,000. 95% of engineers used AI tools every month. 70% of code commits were AI-generated. 11% of live backend updates were executed by AI agents with zero human oversight.

The kicker: leadership couldn't draw a line from any of those numbers to better products. COO Andrew Macdonald said publicly in May 2026: “That link is not there yet. It's very hard to draw a line between one of those stats and 'Okay, now we're actually producing 25% more useful consumer features.'” CTO Praveen Neppalli Naga went back to the drawing board. Uber is now directly comparing AI token costs against the cost of hiring engineers.

💻

Microsoft — Claude Code cancelled six months in

December 2025 → June 30, 2026

In December 2025, Microsoft made Claude Code available to thousands of engineers across Windows, Microsoft 365, Outlook, Teams, and Surface — actively encouraging them to reshape their workflows with vibe coding. Six months later, on May 14, 2026, Microsoft cancelled most of those licences. Deadline: June 30, the end of Microsoft's fiscal year.

The reason was straightforward: token-based billing made costs unmanageable once engineers used the tool at full velocity. Engineers are being redirected to GitHub Copilot CLI — Microsoft's own tool, where the cost structure is internal. Microsoft still accesses Claude through Microsoft Foundry and Microsoft 365 Copilot for external products. The tool wasn't cancelled because it was bad. It was cancelled because it was too good to afford internally at scale.

🐈

GitHub — the flat rate is no longer sustainable

April 28, 2026 announcement → June 1, 2026 effective

GitHub announced on April 28 that all Copilot plans would switch to usage-based billing from June 1, 2026. The announcement was unusually direct: “GitHub has absorbed much of the escalating inference cost behind that usage, but the current premium request model is no longer sustainable.”

The core problem GitHub named: a quick chat question and a multi-hour autonomous coding session both cost the user the same flat rate. GitHub was absorbing the difference. Once Copilot became an agentic platform — running long, multi-step sessions across entire codebases — the gap between what users paid and what GitHub actually spent on inference became indefensible. The free lunch ended on June 1.

⚡ This isn't isolated

Deloitte's April 2026 CFO guide on AI token economics cited a healthcare enterprise that consumed 1 trillion tokens over six months, generating $6 million in unplanned costs before finance understood what was driving the bill. A separate report documented a software company that racked up $150,000 in AI token spend in a single billing cycle with no measurable business outcome. In the FinOps Foundation's 2026 State of FinOps report, 73% of respondents said AI costs exceeded original budget projections.

Enterprise AI spending jumped 108% year-over-year in 2026, hitting an average of $1.2 million per organisation. 78% of IT leaders reported unexpected AI charges they had never budgeted for. Source: Zylo 2026 SaaS Management Index.

◆ TOKENMAXXING

Tokenmaxxing: why this keeps happening

Tokenmaxxing is the term now circulating in enterprise AI circles for what happened at Uber: employees maximise token consumption — either because they genuinely find the tools useful, or because internal leaderboards and incentive structures reward visible AI usage over outcomes. Uber's engineers were literally ranked on a dashboard by how much AI they used. The budget didn't have a chance.

The mechanics are simple. Enterprise software licences work on a per-seat model: pay for 5,000 seats, get 5,000 seats. Every seat costs the same whether its user sends one email a day or 10,000. AI tools work differently. The licence is per seat. The actual cost is per token. When every query, every debugging session, every multi-step agent loop, and every background process generates billable compute, usage-based billing exposes the true cost that seat pricing was masking.

Vendors subsidised this gap during the adoption phase. Claude Code, GitHub Copilot, and similar tools were priced to win market share, not to reflect actual inference costs. That phase is over. GitHub's admission that its pricing is “no longer sustainable” is the clearest public statement that the subsidy era has ended.

How token costs accumulate

Each autocomplete suggestion: ~200–1,000 tokens
A debugging session with context: ~5,000–20,000 tokens
A multi-step agentic coding loop: ~50,000–500,000 tokens
A background RAG pipeline running hourly: millions of tokens per day
5,000 engineers × 8 hours × constant agentic use: budget collapse
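
To make the accumulation concrete, here is a rough sketch that turns per-activity token counts into a monthly bill for one engineer. The token counts are midpoints of the ranges listed above. The daily activity counts, the 21 working days, and the blended price of $5 per million tokens are illustrative assumptions, not figures from Uber or any vendor.

```python
# Rough monthly token bill for one engineer.
# Token counts are midpoints of the ranges in the list above.
# Daily activity counts, working days, and price are ASSUMPTIONS for illustration.

TOKENS_PER_ACTIVITY = {
    "autocomplete": 600,       # ~200-1,000 tokens per suggestion
    "debug_session": 12_500,   # ~5,000-20,000 tokens per session
    "agent_loop": 275_000,     # ~50,000-500,000 tokens per loop
}


def monthly_tokens(daily_counts: dict[str, int], working_days: int = 21) -> int:
    """Total tokens per month for a given number of daily activities."""
    per_day = sum(
        TOKENS_PER_ACTIVITY[name] * count for name, count in daily_counts.items()
    )
    return per_day * working_days


def monthly_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Convert a token volume into dollars at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


# Hypothetical usage profiles: identical except for the agent loops.
chat_era = {"autocomplete": 100, "debug_session": 4, "agent_loop": 0}
agentic = {"autocomplete": 100, "debug_session": 4, "agent_loop": 20}

for label, profile in (("chat-era", chat_era), ("agentic", agentic)):
    tokens = monthly_tokens(profile)
    cost = monthly_cost_usd(tokens, usd_per_million=5.0)
    print(f"{label}: {tokens:,} tokens, ${cost:,.2f}/month")
```

Under these assumptions the agent loops account for almost the entire bill: the chat-era profile costs about $12 a month, while the agentic profile lands near $589, inside the $500–$2,000 range reported for Uber.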

What nobody modelled for

Seat pricing doesn't reflect token volume
Adoption is faster than procurement budgets can track
Internal leaderboards reward usage, not outcomes
Agentic features trigger 5–30x more tokens per task than chat
Finance teams have no framework for token-based cost forecasting

Jensen Huang's framing

At GTC 2026, NVIDIA CEO Jensen Huang said he'd be “deeply alarmed” if a $500,000 engineer didn't consume at least $250,000 worth of AI tokens per year — and confirmed NVIDIA is targeting $2 billion in annual token spend for its engineering team. He's pitching token budgets as a fourth component of compensation. That framing tells you exactly where this is headed: AI compute is not reducing costs. It's creating a new cost centre that scales with usage, not headcount.

◆ THE AGENTIC MULTIPLIER

The agentic multiplier: why the shift to agents changes everything

The root cause of the enterprise AI cost crisis isn't AI adoption. It's the specific shift from single-query chatbots to agentic workflows. Gartner's March 2026 analysis quantified this: agentic AI models require 5–30x more tokens per task than standard chatbots. Teams that sized their AI budgets based on chatbot-era consumption, then deployed multi-step agentic systems, hit cost multiplications they'd never modelled.

Token consumption: chatbot vs agentic workflow

Single chat query: ~1x tokens
Code assist + debug: ~5x tokens
Multi-step agent loop: ~15x tokens
Always-on RAG pipeline: ~30x tokens

Relative token consumption. Source: Gartner March 2026 agentic AI analysis.

The cost paradox Gartner identified makes this worse: token prices have fallen 280x over two years, yet total enterprise AI spend has risen 320% in the same period. The per-unit price drop doesn't matter when usage volume grows faster. A 10x increase in usage at a 5x lower price still multiplies your bill by 2x.
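
The arithmetic behind that paradox fits in two small functions. This is a sketch only: it assumes the bill is simply volume times unit price, and it reads "risen 320%" as spend growing to 4.2x its starting level, which is one interpretation of the quoted figure.

```python
def bill_multiplier(usage_growth: float, price_drop: float) -> float:
    """Factor by which the bill changes when usage grows `usage_growth`-fold
    while the per-token price falls `price_drop`-fold."""
    return usage_growth / price_drop


def implied_usage_growth(spend_multiplier: float, price_drop: float) -> float:
    """Usage growth needed to reach `spend_multiplier` despite a
    `price_drop`-fold price cut."""
    return spend_multiplier * price_drop


# The example from the text: 10x the usage at one fifth of the price.
print(bill_multiplier(usage_growth=10, price_drop=5))

# ASSUMPTION: a 320% rise means spend is 4.2x the starting level.
print(implied_usage_growth(spend_multiplier=4.2, price_drop=280))
```

On that reading, the quoted figures imply token volume grew roughly 1,176-fold over the period, which is why the per-unit price drop barely registers on the invoice.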

⚠ The GPU utilisation irony

Cast AI's 2026 State of Kubernetes Optimisation Report measured actual production telemetry across 23,000 clusters and found average GPU utilisation at 5%. For every dollar spent on GPU infrastructure, 95 cents is producing no useful output. Companies are simultaneously burning through API token budgets and running their own GPU fleets at near-idle. The waste is happening at both layers.

Token prices fell 280x over two years, yet total enterprise AI spend rose 320% in the same period. The volume of agentic workloads is dramatically outpacing per-unit price reductions. Source: Gartner 2026 AI infrastructure analysis.

◆ THE ROI GAP

The ROI gap: spending is up 108%, outcomes are unclear

Enterprise AI spending jumped 108% year-over-year in 2026, averaging $1.2 million per organisation. McKinsey's 2026 Global AI Survey puts the ROI failure rate at 73%: roughly three out of four AI deployments fail to hit projected returns. It is unusual for a COO to frame the problem as publicly and candidly as Uber's did, but the underlying challenge isn't unusual at all.

73% of enterprise AI deployments fail to achieve projected ROI (McKinsey Global AI Survey 2026)
78% of IT leaders hit unexpected AI charges in 2026 (Zylo 2026 SaaS Management Index)
5% average GPU utilisation across enterprise Kubernetes clusters (Cast AI, 23,000 clusters measured)

The structural challenge is that AI tools proved too successful to afford. Once engineers have access to agentic coding assistants, usage doesn't stay modest. It expands to fill every available workflow. That's not waste in the traditional sense — engineers genuinely find these tools useful. The problem is that “useful” and “producing measurable business outcomes” aren't the same thing, and the budgets were sized for the latter while the usage patterns reflect the former.

The AI FinOps emergence

A new discipline called AI FinOps is emerging in direct response to this problem: applying cloud financial operations discipline to AI inference spend. Token budget allocation by business unit, model cost chargebacks, inference optimisation teams, outcome-based ROI measurement. If your organisation's AI spend exceeds $500,000 per year and is growing faster than planned, AI FinOps is no longer optional. The FinOps Foundation identified AI as the fastest-growing new spend category in their 2026 State of FinOps Report.
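
As a minimal sketch of what token budget allocation and chargeback can look like in practice, the snippet below sums spend per team and flags overruns. The record shape, team names, budgets, and the $5 per million token price are all hypothetical. Real usage data would come from your provider's usage export, whose format will differ.

```python
from collections import defaultdict

# Hypothetical usage records: (team, tokens consumed).
usage = [
    ("payments", 40_000_000),
    ("search", 150_000_000),
    ("payments", 10_000_000),
]
USD_PER_MILLION = 5.0                                # assumed blended price
budgets_usd = {"payments": 300.0, "search": 500.0}   # assumed monthly budgets


def chargeback(records, usd_per_million):
    """Sum token spend in USD per team."""
    totals = defaultdict(float)
    for team, tokens in records:
        totals[team] += tokens / 1_000_000 * usd_per_million
    return dict(totals)


def over_budget(costs, budgets):
    """Teams whose spend exceeds their budget, mapped to the overage in USD.
    A team with no budget entry is treated as having a budget of zero."""
    return {
        team: cost - budgets.get(team, 0.0)
        for team, cost in costs.items()
        if cost > budgets.get(team, 0.0)
    }


costs = chargeback(usage, USD_PER_MILLION)
print(costs)
print(over_budget(costs, budgets_usd))
```

The point of the design is attribution: once spend is tied to a team, it can be compared against whatever outcome that team is accountable for.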

◆ TWO COST MODELS

Two AI cost models: API tokens vs dedicated GPU compute

If your team is running production inference on your own GPU clusters rather than calling third-party API endpoints, you're operating under a fundamentally different cost model. Understanding the difference is what determines whether the Uber problem can happen to you.

API token model (the Uber problem)

✗ Cost scales directly with usage: every token generates a bill
✗ Agentic workflows multiply token consumption 5–30x
✗ Finance teams can't forecast token consumption reliably
✗ Budget can disappear in weeks once adoption reaches critical mass
✗ Vendor controls pricing, availability, and model changes
✓ Zero upfront infrastructure cost; fastest path to start

Dedicated GPU compute model

✓ Cost is fixed per GPU-hour regardless of token volume
✓ Agentic workflows don't multiply your bill; only GPU time does
✓ Finance can budget predictably: cluster size × hourly rate × hours
✓ Heavy agentic use doesn't change the monthly invoice
✓ You control model choice, inference optimisation, and cost-per-token
✗ Requires inference engineering to deploy and operate models

The calculation that makes dedicated GPU compute compelling for production inference: an H100 SXM5 at $1.49/GPU/hr on GPUaaS.com running at 85% utilisation delivers roughly 21,000 tokens per second (Llama 3 70B at FP8). That is about 75.6 million tokens per GPU-hour, or approximately $0.020 per million tokens, a fixed cost that doesn't change whether your engineers run one agentic session or a thousand. API pricing of $3–10 per million tokens sits far above that figure and has no ceiling. Peak-throughput numbers are a best case, though: once real utilisation and operating overhead are counted, the working estimate used in this article is a 3–10x lower cost per token on dedicated compute.
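
Unit conversions between per-token, per-1,000 and per-million prices are easy to get wrong, so here is a sketch of the arithmetic. It takes the hourly rate and the throughput quoted above as given inputs. The lower throughput values in the loop are hypothetical, included only to show how sensitive the result is to sustained throughput.

```python
def usd_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Raw GPU cost per million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Inputs quoted in the text: $1.49 per GPU-hour, ~21,000 tokens per second.
cost = usd_per_million_tokens(1.49, 21_000)
print(f"${cost:.4f} per million tokens")
print(f"${cost / 1000:.7f} per 1,000 tokens")

# Hypothetical lower sustained throughputs, for sensitivity only.
for tps in (21_000, 5_000, 1_000):
    per_million = usd_per_million_tokens(1.49, tps)
    print(f"{tps:>6} tok/s -> ${per_million:.3f} per million tokens")
```

Because cost per token is inversely proportional to throughput, any comparison against API pricing is only as good as the sustained throughput figure you plug in. Measure it on your own workload before relying on it.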

When dedicated compute makes sense

The crossover point is roughly 500 million tokens per month. Below that, API pricing is usually simpler and more cost-effective. Above it — especially for teams running persistent agentic workflows, internal coding assistants, or high-traffic inference APIs — dedicated GPU clusters on wholesale providers like GPUaaS.com deliver materially lower and predictable cost. See the wholesale vs hyperscale cost breakdown for the full comparison.

GPUaaS.com infrastructure data: teams migrating from third-party API endpoints to dedicated H100 or H200 clusters for production inference at volumes above 500M tokens/month see cost-per-token reductions of 60–80%, with fully predictable monthly invoices.

◆ WHAT TO DO NOW

What teams running their own inference stack should do now

Whether you're on API pricing today or already running your own clusters, the Uber and Microsoft stories surface a set of questions worth answering before your next budget cycle hits the same wall.

Audit your current token consumption and cost-per-useful-output

The question Uber's COO couldn't answer is the one to start with: can you draw a line between your AI token spend and a measurable business outcome? If not, you have the same problem Uber has, and you'll hit it at the same scale they did. Token dashboards aren't enough — you need outcome attribution.

Calculate your crossover point for dedicated compute

Take your current monthly token volume. Multiply by your API price per token. Compare against an H100 cluster at $1.49/GPU/hr or an H200 cluster at $3.50/GPU/hr running at 80% utilisation. At volumes above ~500M tokens/month, dedicated compute almost always wins on cost. At lower volumes, API pricing has lower operational overhead.
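
A minimal sketch of that comparison follows. The 730 hours per month, the two-GPU cluster, and the $5 per million token API price are assumptions chosen for illustration. The sketch also ignores engineering overhead and does not check whether the cluster can actually serve the volume, both of which matter in a real decision.

```python
HOURS_PER_MONTH = 730  # ASSUMPTION: 365 * 24 / 12, cluster reserved around the clock


def cluster_monthly_usd(gpus: int, hourly_usd: float,
                        hours: float = HOURS_PER_MONTH) -> float:
    """Fixed monthly cost: cluster size x hourly rate x hours."""
    return gpus * hourly_usd * hours


def breakeven_tokens(cluster_usd: float, api_usd_per_million: float) -> float:
    """Monthly token volume at which API spend equals the cluster cost."""
    return cluster_usd / api_usd_per_million * 1_000_000


def cheaper_option(monthly_tokens: float, api_usd_per_million: float,
                   gpus: int, hourly_usd: float) -> str:
    """Return 'dedicated' or 'api', comparing raw monthly cost only."""
    api_cost = monthly_tokens / 1_000_000 * api_usd_per_million
    gpu_cost = cluster_monthly_usd(gpus, hourly_usd)
    return "dedicated" if gpu_cost < api_cost else "api"


# Hypothetical setup: two H100s at $1.49/GPU/hr versus an API at $5 per million tokens.
cluster = cluster_monthly_usd(gpus=2, hourly_usd=1.49)
print(f"cluster: ${cluster:,.2f}/month")
print(f"break-even: {breakeven_tokens(cluster, 5.0):,.0f} tokens/month")
print(cheaper_option(500_000_000, 5.0, gpus=2, hourly_usd=1.49))
print(cheaper_option(100_000_000, 5.0, gpus=2, hourly_usd=1.49))
```

With these inputs the break-even lands at roughly 435 million tokens a month. A larger cluster or a cheaper API pushes the crossover higher, so run the numbers with your own cluster size and price.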

If you're on API pricing, add hard spend caps immediately

GitHub now lets admins set user-level token budgets that hard-block access when exhausted. Most API providers have similar spend alert and cap mechanisms. Uber's problem wasn't that the tools existed — it was that nobody set a ceiling. The leaderboard incentivised usage without a corresponding cost limit.
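
If your provider's built-in caps don't fit, the same idea can be enforced in your own gateway. The sketch below is a generic hard cap, not GitHub's or any vendor's API. The class names, the price, and the idea of charging an estimated token count before the request is sent are all assumptions.

```python
class BudgetExceeded(Exception):
    """Raised when a request would push spend past the hard cap."""


class TokenBudget:
    """Hard per-user cap: requests are refused once the monthly budget is spent."""

    def __init__(self, monthly_limit_usd: float, usd_per_million: float):
        self.limit = monthly_limit_usd
        self.price = usd_per_million
        self.spent = 0.0

    def charge(self, tokens: int) -> float:
        """Record `tokens` of usage and return the remaining budget in USD.
        Raises BudgetExceeded, without recording anything, if the cap would be passed."""
        cost = tokens / 1_000_000 * self.price
        if self.spent + cost > self.limit:
            raise BudgetExceeded(
                f"request costs ${cost:.2f}, only ${self.limit - self.spent:.2f} left"
            )
        self.spent += cost
        return self.limit - self.spent


# Hypothetical user with a $10 monthly cap at $5 per million tokens.
budget = TokenBudget(monthly_limit_usd=10.0, usd_per_million=5.0)
print(budget.charge(1_000_000))   # remaining budget after the first request
try:
    budget.charge(2_000_000)      # would exceed the cap, so it is refused
except BudgetExceeded as exc:
    print(f"blocked: {exc}")
```

A hard block is deliberately blunt. Many teams would pair it with a softer alert threshold so engineers get a warning before they are cut off.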

If you're on dedicated GPU clusters, fix your utilisation

5% GPU utilisation is the enterprise average, which means most clusters are burning 95% of their hourly rate producing nothing. Continuous batching, FP8 quantisation, and proper vLLM configuration can bring this to 70–85% utilisation without hardware changes. For context on KV cache and memory efficiency, see the KV cache inference cost guide.
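
The cost of idle capacity is easiest to see as an effective price per useful GPU-hour. The sketch below assumes utilisation can be treated as the fraction of paid GPU time doing useful work, which is a simplification of how utilisation is measured in practice. The $1.49 rate is the H100 price quoted in this article.

```python
def cost_per_useful_gpu_hour(hourly_usd: float, utilisation: float) -> float:
    """Effective price of one hour of useful work when only a fraction
    `utilisation` of paid GPU time is productive."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return hourly_usd / utilisation


for u in (0.05, 0.70, 0.85):
    effective = cost_per_useful_gpu_hour(1.49, u)
    print(f"{u:.0%} utilisation: ${effective:.2f} per useful GPU-hour")
```

At 5% utilisation the $1.49 GPU effectively costs $29.80 per useful hour, a 20x markup that no hardware change will fix.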

Right-size your GPU for your actual workload

The most expensive version of the tokenmaxxing problem on the GPU layer is buying H200 clusters for workloads that run equally well on H100 or A100. At 5% utilisation, a B200 is more expensive per useful token than an A100 — not because B200 is bad, but because the premium chip compounds the waste. See the H100 vs A100 cost guide for the break-even maths.


◆ FAQ

Frequently asked questions

Last reviewed: May 27, 2026. Sources: Fortune (May 26, 2026), The Verge, TechRadar, Zylo 2026 SaaS Management Index, Cast AI 2026 State of Kubernetes Optimisation Report, Gartner March 2026, McKinsey Global AI Survey 2026, GitHub Blog (April 28, 2026).
