KV Cache Explained: Why It's Killing Your Inference Costs (And How to Fix It)

Your inference bill isn't scaling with requests — it's scaling with context length. Here's why the KV cache is the culprit, and the four techniques that fix it.

GPUaaS.com Team
Infrastructure Research
May 26, 2026

The KV cache is what makes autoregressive inference fast enough to use in production, and it's also the reason your GPU bill climbs every time you increase context length, batch size, or concurrent users. At 128K context, the KV cache for a single Llama 3 70B request hits ~42.9 GB at FP16, and four concurrent requests at that length (~172 GB) need more memory than the ~140 GB of model weights.

Key takeaways
  • KV cache memory scales linearly with context length and batch size. At 128K tokens, four concurrent Llama 3 70B requests at FP16 need more cache memory (~172 GB) than the model weights (~140 GB)
  • Before PagedAttention, contiguous KV cache allocation wasted 60-80% of VRAM through fragmentation. vLLM's PagedAttention drops that waste to under 4%
  • Prefix caching (SGLang RadixAttention) hits 85-95% cache reuse on few-shot workloads and delivers up to 6.4x throughput gains on prefix-heavy traffic
  • FP8 KV cache quantisation cuts per-token cache memory by 50% with accuracy loss within measurement noise on most benchmarks, per the vLLM team (April 2026)
  • Combining PagedAttention, prefix caching, GQA, and FP8 KV quantisation collapses long-context inference cost by 4-40x vs an unoptimised FP16 baseline

Most teams budget for model weights when they size GPU clusters. They forget the KV cache. That's where the surprise bills come from. A 70B model's weights are fixed at ~140 GB at FP16. The KV cache grows with every token you process, every user you add, and every context window you extend. Get the cache wrong and you're either OOMing in production or running at a fraction of the GPU utilisation you're paying for.

This post covers how the KV cache works, why it's expensive, and the four techniques that cut costs without touching your model. For the VRAM sizing behind all of this, see the VRAM guide. For cluster sizing, see the GPUaaS.com cluster catalogue.

◆ WHAT IS KV CACHE
What the KV cache actually is and why it exists

Transformer models generate tokens one at a time. To produce token N, the model runs attention over every token from 1 to N-1. Without any caching, that means re-computing the attention keys and values for all previous tokens on every single step. For a 1,000-token sequence, you'd run the full attention computation 1,000 times, each one getting longer than the last.

The KV cache fixes this by storing the key (K) and value (V) tensors computed for each previous token so the model can reuse them instead of recomputing. On step N, only the new token needs fresh K and V computation. Everything before it is read from cache. That's the tradeoff: you spend memory to save compute time.
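Here's a toy sketch of that reuse, in pure Python with a single attention head. The 2x2 weight matrices, the 2-dimensional embeddings, and the helper names are all made up for illustration; a real model does this per layer and per head on the GPU. It checks that decoding with a cache produces the same attention outputs while running far fewer K/V projections.

```python
import math

# Hypothetical 2x2 projection weights for a single attention head.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.7]]
W_V = [[0.3, 0.9], [0.8, 0.4]]

def project(x, w):
    """Multiply vector x by weight matrix w (one row of w per output dim)."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all given positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def decode_without_cache(embeddings):
    """Recompute K and V for every earlier token at every step."""
    outputs, kv_projections = [], 0
    for step in range(1, len(embeddings) + 1):
        seen = embeddings[:step]
        keys = [project(x, W_K) for x in seen]
        values = [project(x, W_V) for x in seen]
        kv_projections += len(seen)
        outputs.append(attend(project(seen[-1], W_Q), keys, values))
    return outputs, kv_projections

def decode_with_cache(embeddings):
    """Compute K and V once per token and append them to the cache."""
    outputs, kv_projections = [], 0
    k_cache, v_cache = [], []
    for x in embeddings:
        k_cache.append(project(x, W_K))
        v_cache.append(project(x, W_V))
        kv_projections += 1
        outputs.append(attend(project(x, W_Q), k_cache, v_cache))
    return outputs, kv_projections

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_out, slow_count = decode_without_cache(tokens)
fast_out, fast_count = decode_with_cache(tokens)
print(slow_count, fast_count)  # token positions projected: 10 without cache, 4 with
```

Without the cache the projection count grows as N(N+1)/2; with it, the count is just N, at the price of keeping every K and V vector in memory.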

The formula for how much VRAM the cache uses: 2 x num_layers x num_kv_heads x head_dim x seq_len x batch_size x bytes_per_element. Every variable in that formula except num_layers, num_kv_heads, and head_dim is something you control at serving time.
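The formula is easy to turn into a calculator. This sketch assumes Llama 3 70B's attention shape is 80 layers, 8 KV heads, and a head dimension of 128 (those values aren't stated above), and it reports decimal gigabytes. The function name is just a placeholder.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """KV cache size in bytes; the leading 2 covers the K and V tensors."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

# Assumed Llama 3 70B attention shape: 80 layers, 8 KV heads, head_dim 128.
LLAMA3_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

for label, seq_len in [("4K", 4 * 1024), ("32K", 32 * 1024), ("128K", 128 * 1024)]:
    gb = kv_cache_bytes(seq_len=seq_len, **LLAMA3_70B) / 1e9
    print(f"{label}: {gb:.1f} GB")
# 4K: 1.3 GB
# 32K: 10.7 GB
# 128K: 42.9 GB
```

Under those assumptions each token costs 327,680 bytes (320 KiB) of cache at FP16, which is the number everything else in this post scales from.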

◆ THE COST PROBLEM
Why it gets expensive: the memory scaling problem

The cost problem has two dimensions. First, cache size scales linearly with context length. Double the context, double the cache. Second, it scales linearly with batch size. Serve 16 concurrent users and you've got 16 independent caches running simultaneously.

  • 4K context (70B FP16): ~1.3 GB
  • 32K context (70B FP16): ~10.7 GB
  • 128K context (70B FP16): ~42.9 GB
  • 1M context (70B FP16): ~344 GB

At 128K context, a single Llama 3 70B request's KV cache (~42.9 GB) already consumes 30% of an H200's 141 GB VRAM. The model weights (~140 GB) nearly fill that same card on their own, so one FP16 request at 128K already forces a multi-GPU deployment before you serve a second user. At 1M context, the cache outweighs the model entirely.
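A rough capacity check makes the budget problem concrete. This sketch uses the figures quoted above (140 GB of weights, 42.9 GB of cache per 128K request, 141 GB per H200) and assumes weights are sharded across GPUs rather than replicated. It ignores activations and runtime overhead, so treat the results as upper bounds. The function name is hypothetical.

```python
def max_concurrent_requests(total_vram_gb, weights_gb, cache_per_request_gb):
    """Whole requests that fit once the weights are loaded."""
    free_gb = total_vram_gb - weights_gb
    if free_gb <= 0:
        return 0
    return int(free_gb // cache_per_request_gb)

H200_GB = 141
WEIGHTS_GB = 140        # Llama 3 70B at FP16
CACHE_128K_GB = 42.9    # per request at FP16

for gpus in (1, 2, 4, 8):
    fits = max_concurrent_requests(gpus * H200_GB, WEIGHTS_GB, CACHE_128K_GB)
    print(f"{gpus} x H200: {fits} concurrent 128K requests")
```

On these numbers a single H200 fits zero 128K requests at FP16, and even an 8-GPU node tops out at 23.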

The second cost driver is fragmentation. Pre-PagedAttention, inference engines reserved a contiguous block of VRAM per request upfront, sized for the longest sequence the request might reach. Most requests never use their full reservation, and requests finish at different times, leaving gaps that can't be filled by new requests of different sizes. The vLLM team's measurements found 60-80% of KV cache memory was wasted this way on typical serving workloads.
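To get a feel for the scale of that waste, here's a minimal sketch that models only the upfront-reservation part of the problem (not the gaps between blocks). The sequence lengths and the 2,048-token reservation are invented for illustration.

```python
def contiguous_waste(actual_lengths, max_seq_len):
    """Fraction of reserved KV slots that never hold a token when every
    request reserves max_seq_len slots up front."""
    reserved = max_seq_len * len(actual_lengths)
    used = sum(min(n, max_seq_len) for n in actual_lengths)
    return 1 - used / reserved

# Hypothetical mix of final sequence lengths against a 2,048-token reservation.
lengths = [200, 350, 512, 900, 1500]
print(f"{contiguous_waste(lengths, 2048):.0%} of reserved slots unused")
# 66% of reserved slots unused
```

Even this small, made-up batch lands inside the 60-80% range, because the reservation has to cover the worst case while most requests stop well short of it.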

GPUaaS.com infrastructure data: teams that move from an unoptimised FP16 KV cache baseline to PagedAttention plus FP8 quantisation see GPU utilisation jump from under 30% to above 85% on the same hardware, at the same model size and context length.

◆ FIX 1: PAGEDATTENTION
Fix 1: PagedAttention -- stop wasting VRAM on fragmentation

PagedAttention, introduced by the vLLM team and now shipped by every major inference engine, borrows the virtual memory paging concept from operating systems. Instead of reserving one contiguous VRAM block per request, it breaks the KV cache into fixed-size pages (typically 16 or 32 tokens) and scatters them anywhere in physical memory. The attention kernel reads through a page-table indirection.

The result: VRAM fragmentation drops from 60-80% to under 4%. That freed memory directly increases the batch sizes you can run. More requests per GPU, lower cost per token.
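The page-table idea fits in a few dozen lines. This is a toy allocator, not vLLM's implementation: the class and method names are hypothetical, and it tracks block ids only, with no actual tensors. It shows requests growing one block at a time and reusing blocks freed by finished requests, wherever those blocks happen to sit.

```python
class PagedKVAllocator:
    """Toy block allocator: each request maps logical blocks to any free physical block."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.lengths = {}        # request id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve space for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # current blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def locate(self, request_id, position):
        """Page-table lookup: logical token position -> (physical block, offset)."""
        block = self.block_tables[request_id][position // self.block_size]
        return block, position % self.block_size

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")      # 20 tokens -> 2 blocks
for _ in range(5):
    alloc.append_token("b")      # 5 tokens -> 1 block
alloc.release("a")               # both blocks are immediately reusable
for _ in range(40):
    alloc.append_token("c")      # 40 tokens -> 3 blocks, not necessarily adjacent
print(alloc.block_tables)
```

The only memory a request can waste here is the unused tail of its last block, which is why the waste figure drops so sharply.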

How to enable it

PagedAttention ships by default in vLLM, SGLang, and TensorRT-LLM. You don't need to turn it on. The only tuning decision is block size: 16 tokens works well for most workloads, while 32 tokens can improve throughput on long-context serving at the cost of coarser allocation granularity, meaning slightly more unused space in each request's last block.
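The granularity cost of a larger block is easy to quantify. A small sketch, with a hypothetical helper name and a made-up 1,000-token request:

```python
import math

def tail_waste(seq_len, block_size):
    """Allocated-but-unused token slots in a sequence's last block."""
    blocks = math.ceil(seq_len / block_size)
    return blocks * block_size - seq_len

for block_size in (16, 32):
    example = tail_waste(1000, block_size)
    worst = block_size - 1
    print(f"block_size={block_size}: 1,000-token request wastes {example} slots, worst case {worst}")
```

For the 1,000-token request that's 8 wasted slots at block size 16 and 24 at block size 32. It's small per request, but it adds up across thousands of short requests.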

⚡ Watch out

Some inference benchmarks disable PagedAttention to hit single-request throughput numbers. Don't use those numbers for capacity planning. Real production workloads are multi-request and the comparison isn't valid without it.

◆ FIX 2: PREFIX CACHING
Fix 2: Prefix caching -- stop recomputing shared context

PagedAttention fixes fragmentation within a single request. Prefix caching fixes the waste across requests. If 100 users all send the same 2,000-token system prompt, a naive engine computes fresh K and V tensors for those 2,000 tokens 100 separate times. With prefix caching, it computes them once and reuses the result for every subsequent request that shares that prefix.

SGLang's RadixAttention stores cached prefixes in a radix tree, matching new requests against all cached prefixes and serving the longest match. Cache hit rates on common workloads: 85-95% for few-shot prompting, 75-90% for multi-turn chat. On prefix-heavy traffic, SGLang delivers up to 6.4x higher throughput than vLLM's standard PagedAttention alone.
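Longest-prefix matching can be sketched with a plain token trie. This is a simplification, not SGLang's code: a real radix tree compresses runs of tokens into single edges and also handles eviction, and the class name and token ids here are invented. The loop replays the 100-users, 2,000-token-system-prompt scenario.

```python
class PrefixCache:
    """Token-level trie: each node stands for one cached token position."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record that K/V tensors for this token sequence are now cached."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def longest_prefix(self, tokens):
        """Number of leading tokens whose K/V tensors can be reused."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

system_prompt = list(range(2000))              # stand-in for a 2,000-token system prompt
cache = PrefixCache()
computed = reused = 0
for user in range(100):
    request = system_prompt + [10_000 + user]  # shared prefix plus one unique token
    hit = cache.longest_prefix(request)
    reused += hit
    computed += len(request) - hit
    cache.insert(request)
print(computed, reused)  # 2100 198000
```

Only the first request pays for the system prompt. The other 99 compute one new token each and reuse 2,000 cached positions.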

Prefix cache hit rates by workload type

  • Few-shot prompting: 85-95%
  • Multi-turn chat: 75-90%
  • RAG pipelines: 60-80%
  • Unique prompts: ~0%

⚠ When prefix caching hurts

If your workload is mostly unique prompts with no shared prefix, RadixAttention's LRU cache consumes VRAM without returning any benefit. In that case, stick with standard PagedAttention in vLLM. Prefix caching pays off when you have recurring system prompts, shared context, or multi-turn sessions.
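A minimal sketch of why unique prompts get nothing from an LRU prefix cache. It matches whole prefixes exactly, which is cruder than RadixAttention's longest-match lookup, and the class name, capacity, and token ids are all made up.

```python
from collections import OrderedDict

class LRUPrefixCache:
    """Keeps the most recently used prompt prefixes, up to a fixed entry budget."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # prefix (tuple of tokens) -> placeholder for cached K/V
        self.hits = self.misses = 0

    def lookup_or_store(self, prefix):
        key = tuple(prefix)
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.entries[key] = "kv-blocks"
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used prefix
        return False

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

shared = LRUPrefixCache(capacity=4)
for i in range(100):
    shared.lookup_or_store([1, 2, 3])            # same system prompt every time

unique = LRUPrefixCache(capacity=4)
for i in range(100):
    unique.lookup_or_store([i, i + 1, i + 2])    # no two prompts share a prefix

print(f"shared: {shared.hit_rate():.0%}, unique: {unique.hit_rate():.0%}")
# shared: 99%, unique: 0%
```

In the unique-prompt case the cache stays full the whole time and never returns a hit, so that memory would be better spent on larger batches.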

SGLang delivers 29% higher throughput than vLLM on standard H100 benchmarks (16,200 vs 12,500 tokens/sec), and up to 6.4x higher throughput on prefix-heavy traffic via RadixAttention, per Morph LLM inference benchmarks (April 2026).

◆ FIX 3: GQA AND MLA
Fix 3: GQA and MLA -- reduce the cache at the architecture level

The previous two fixes are serving-layer optimisations. GQA (Grouped Query Attention) and MLA (Multi-head Latent Attention) reduce KV cache size at the model architecture level. They ship baked in, so you don't configure them -- you just pick models that use them.

GQA (Grouped Query Attention)

Multiple query heads share a single set of K and V heads, shrinking the KV cache in proportion to the group size. Llama 3 uses GQA across all sizes (8B and 70B); the 70B model has 64 query heads sharing 8 KV heads, an 8x reduction versus full multi-head attention. Mistral used it from its first 7B release. It is now the baseline expectation for any new open-weight model.
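The saving falls straight out of the cache formula, since only the number of KV heads changes. This sketch assumes a Llama 3 70B-like shape of 80 layers, 64 query heads, 8 KV heads, and head dimension 128.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of K and V stored per token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

# Assumed Llama 3 70B shape: 80 layers, 64 query heads, 8 KV heads, head_dim 128.
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM = 80, 64, 8, 128

mha = kv_bytes_per_token(LAYERS, QUERY_HEADS, HEAD_DIM)  # one KV head per query head
gqa = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)     # 8 query heads share each KV head
print(f"MHA: {mha / 1024:.0f} KiB/token, GQA: {gqa / 1024:.0f} KiB/token, {mha // gqa}x smaller")
# MHA: 2560 KiB/token, GQA: 320 KiB/token, 8x smaller
```

Without GQA, the same model would need roughly 343 GB of cache for one 128K request instead of ~42.9 GB.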

MLA (Multi-head Latent Attention)

DeepSeek's architecture, used in DeepSeek-V2, V3, and R1. It projects K and V into a compressed latent space before caching. The DeepSeek-V2 paper reports a 93.3% KV cache reduction versus the earlier DeepSeek 67B model. That's why DeepSeek providers can quote lower per-token rates at the same context length as Llama-based providers.

Practical implication

If you're choosing between a Llama 3 70B deployment and a DeepSeek-V3 deployment for a long-context use case on GPUaaS.com H200 clusters, MLA is a significant factor in the cost comparison, not just model accuracy. DeepSeek's KV cache is 5-10x smaller at the same context length.

◆ FIX 4: KV CACHE QUANTISATION
Fix 4: KV cache quantisation -- halve the memory, keep the accuracy

KV cache quantisation stores the cached K and V tensors at lower precision than the model's compute precision. FP16 is the typical baseline. Dropping to FP8 or INT8 cuts cache memory in half. Dropping to INT4 cuts it by 75%.
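To show what "lower precision" means in practice, here's a minimal sketch of symmetric per-tensor INT8 quantisation on a made-up slice of a key vector. It's the simplest scheme, not what any particular engine ships; production kernels typically use finer-grained scales, and FP8 is a floating-point format that works differently.

```python
def quantise_int8(values):
    """Symmetric per-tensor INT8 quantisation: one float scale plus one byte per value."""
    scale = max(abs(v) for v in values) / 127 or 1.0   # fall back to 1.0 for all-zero input
    return [max(-127, min(127, round(v / scale))) for v in values], scale

def dequantise_int8(codes, scale):
    """Recover approximate floats from the INT8 codes."""
    return [c * scale for c in codes]

# Hypothetical slice of a cached key vector.
key_slice = [0.12, -0.87, 0.45, 1.30, -0.02, 0.66]
codes, scale = quantise_int8(key_slice)
restored = dequantise_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key_slice, restored))
print(codes, f"max abs error {max_error:.4f}")
```

Each value now takes one byte instead of two, and the round-trip error is bounded by half the scale. Whether that error matters depends on your model and workload.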

Format | Memory vs FP16 | Accuracy impact | How to enable (vLLM)
FP8 | 50% reduction | Within measurement noise on most benchmarks | --kv-cache-dtype fp8
INT8 | 50% reduction | Slightly larger gap; depends on calibration | --kv-cache-dtype int8
INT4 (TurboQuant) | 75% reduction | ~7-8% end-to-end overhead (LMDeploy) | LMDeploy with TurboQuant

FP8 KV cache is the right default for most production deployments on H100 or H200 hardware. The vLLM team's April 2026 benchmarks show per-token cache memory drops to 54% of the BF16 baseline in the best cases, with accuracy on long-context needle-in-haystack tasks recovering to 89% vs 91% at BF16 when using the two-level accumulation strategy.

⚠ Watch out

FP8 KV cache quantisation and FP8 model weight quantisation are two separate things. You can run a BF16 model with an FP8 KV cache, or an FP8 model with an FP16 KV cache. They're independent flags. Most teams want both enabled in production.
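Because the two settings are independent, it helps to lay out all four combinations. This sketch uses the 70B parameter count and the 42.9 GB FP16 cache figure for one 128K request from this post. The function name and the dtype table are placeholders, and it ignores quantisation scale overhead.

```python
BYTES = {"bf16": 2, "fp16": 2, "fp8": 1}

def serving_memory_gb(params_billion, weight_dtype, kv_fp16_gb, kv_dtype, batch_size=1):
    """Weights plus KV cache in GB; the two dtypes are chosen independently."""
    weights_gb = params_billion * BYTES[weight_dtype]
    kv_gb = kv_fp16_gb * BYTES[kv_dtype] / BYTES["fp16"] * batch_size
    return weights_gb + kv_gb

# 70B parameters and 42.9 GB of FP16 cache per 128K request.
for weight_dtype in ("bf16", "fp8"):
    for kv_dtype in ("fp16", "fp8"):
        total = serving_memory_gb(70, weight_dtype, 42.9, kv_dtype)
        print(f"weights={weight_dtype}, kv={kv_dtype}: {total:.2f} GB")
```

On these numbers, only the FP8-weights rows come in under a single H200's 141 GB. Quantising the cache alone still leaves BF16 weights plus cache at about 161 GB.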

Llama 3.1 70B on H100 SXM5 costs roughly $0.71 per million tokens on-demand at FP8 vs $1.19/M at BF16, a 40% cost reduction on the same hardware, per Spheron FP8 inference benchmarks (May 2026).

◆ FAQ
Frequently asked questions

What is the KV cache and why does it matter?

The KV cache stores the key and value attention tensors computed for each previous token so the model doesn't have to recompute them on every generation step. Without it, generating token N would require running attention over all N-1 prior tokens from scratch, making inference quadratically slower as context grows. The tradeoff is memory: KV cache VRAM grows linearly with context length and batch size. For a full breakdown of VRAM sizing, see the VRAM guide.

How much VRAM does the KV cache use?

For Llama 3 70B at FP16 with batch=1: ~1.3 GB at 4K context, ~10.7 GB at 32K, ~42.9 GB at 128K, and ~344 GB at 1M tokens. Multiply all figures by your batch size for concurrent users. At 128K context with 4 concurrent users, you're looking at ~172 GB of KV cache alone; add ~140 GB of model weights and the ~312 GB total far exceeds a single B200 SXM's 192 GB.

What is the difference between PagedAttention and prefix caching?

PagedAttention eliminates VRAM fragmentation within each request by breaking the KV cache into non-contiguous memory pages. Prefix caching eliminates redundant computation across requests by storing and reusing KV tensors for shared prompt prefixes. They solve different problems. You want both: PagedAttention for memory efficiency, prefix caching for compute reuse on workloads with shared context.

Should I use vLLM or SGLang?

Use SGLang if you have prefix-heavy workloads: multi-turn chat, RAG, few-shot prompting, or coding agents with shared system prompts. RadixAttention's prefix reuse delivers up to 6.4x throughput gains in those scenarios. Use vLLM if you need broad model support, stable production behaviour, or mostly unique prompts where prefix caching adds overhead without benefit. Both run on GPUaaS.com H100 and H200 clusters.

Is FP8 KV cache quantisation safe for production?

For most workloads, yes. The vLLM team's April 2026 benchmarks show accuracy recovering to 89% on long-context tasks (vs 91% at BF16) when using the two-level accumulation strategy. On standard benchmarks at shorter contexts, the gap is within measurement noise. Enable with --kv-cache-dtype fp8 in vLLM and test on your specific workload before shipping to production.

What GPU do I need for long-context inference?

For Llama 3 70B at 128K context, a single H200 SXM (141 GB) only works if both the weights and the KV cache are quantised to FP8: roughly 70 GB of weights plus ~21.5 GB of cache per user. With FP16 weights (~140 GB), the weights alone fill the card and each user adds ~42.9 GB of FP16 cache on top, which needs a multi-GPU setup immediately. For million-token context windows or high-concurrency long-context serving, plan for 8xH200 NVLink clusters or B200 clusters with 192 GB per GPU. Not sure which fits your workload? See how GPUaaS.com works.

Last reviewed: May 27, 2026. For a full GPU comparison by VRAM capacity, see the VRAM guide. To get a cluster quote, see how GPUaaS.com works. For GPU clusters sized for long-context inference, visit the GPUaaS.com cluster catalogue.
