What is KV cache in LLMs?

The KV cache stores key and value attention tensors from previous tokens so the model doesn't recompute them on every generation step. Without it, inference slows quadratically as context grows.

How much VRAM does the KV cache use?

For Llama 3 70B at FP16 with batch=1: ~1.3 GB at 4K context, ~10.7 GB at 32K, ~42.9 GB at 128K, ~135 GB at 1M tokens. Multiply by batch size for concurrent users.

What's the difference between PagedAttention and prefix caching?

PagedAttention eliminates VRAM fragmentation within requests. Prefix caching eliminates redundant computation across requests with shared prefixes. You want both.

Should I use vLLM or SGLang for KV cache optimisation?

SGLang for prefix-heavy workloads (multi-turn chat, RAG, few-shot). vLLM for broad model support or unique-prompt workloads where prefix caching adds overhead.

Is FP8 KV cache quantisation safe to use in production?

For most workloads, yes. vLLM April 2026 benchmarks show 89% accuracy on long-context tasks vs 91% at BF16. Enable with --kv-cache-dtype fp8 and test first.

What GPU do I need for long-context KV cache serving?

For Llama 3 70B at 128K context, at minimum an H200 SXM (141 GB) with FP8 KV cache enabled. For million-token context or high concurrency, plan for 8xH200 NVLink or B200 clusters with 192 GB per GPU.

KV Cache Explained: Cut LLM Inference Costs 2026

The KV cache is what makes autoregressive inference fast enough to use in production, and it's also the reason your GPU bill climbs every time you increase context length, batch size, or concurrent users. At 128K context, the KV cache for a single Llama 3 70B request hits ~42.9 GB at FP16, more than the model weights themselves.

Key takeaways

KV cache memory scales linearly with context length and batch size. At 128K tokens, it outweighs the model weights for Llama 3 70B at FP16
Before PagedAttention, contiguous KV cache allocation wasted 60-80% of VRAM through fragmentation. vLLM's PagedAttention drops that waste to under 4%
Prefix caching (SGLang RadixAttention) hits 85-95% cache reuse on few-shot workloads and delivers up to 6.4x throughput gains on prefix-heavy traffic
FP8 KV cache quantisation cuts per-token cache memory by 50% with accuracy loss within measurement noise on most benchmarks, per the vLLM team (April 2026)
Combining PagedAttention, prefix caching, GQA, and FP8 KV quantisation collapses long-context inference cost by 4-40x vs an unoptimised FP16 baseline

Most teams budget for model weights when they size GPU clusters. They forget the KV cache. That's where the surprise bills come from. A 70B model's weights are fixed at ~140 GB. The KV cache grows with every token you process, every user you add, and every context window you extend. Get the cache wrong and you're either OOMing in production or running at a fraction of the GPU utilisation you're paying for.

This post covers how the KV cache works, why it's expensive, and the four techniques that cut costs without touching your model. For the VRAM sizing behind all of this, see the VRAM guide. For cluster sizing, see the GPUaaS.com cluster catalogue.

In this article

01What the KV cache actually is and why it exists 02Why it gets expensive: the memory scaling problem 03Fix 1: PagedAttention -- stop wasting VRAM on fragmentation 04Fix 2: Prefix caching -- stop recomputing shared context 05Fix 3: GQA and MLA -- reduce the cache at the architecture level 06Fix 4: KV cache quantisation -- halve the memory, keep the accuracy 07Frequently asked questions

◆ WHAT IS KV CACHE

What the KV cache actually is and why it exists

Transformer models generate tokens one at a time. To produce token N, the model runs attention over every token from 1 to N-1. Without any caching, that means re-computing the attention keys and values for all previous tokens on every single step. For a 1,000-token sequence, you'd run the full attention computation 1,000 times, each one getting longer than the last.

The KV cache fixes this by storing the key (K) and value (V) tensors computed for each previous token so the model can reuse them instead of recomputing. On step N, only the new token needs fresh K and V computation. Everything before it is read from cache. That's the tradeoff: compute time for memory.

The formula for how much VRAM the cache uses: 2 x num_layers x num_kv_heads x head_dim x seq_len x batch_size x bytes_per_element. Every variable in that formula except num_layers, num_kv_heads, and head_dim is something you control at serving time.

◆ THE COST PROBLEM

Why it gets expensive: the memory scaling problem

The cost problem has two dimensions. First, cache size scales linearly with context length. Double the context, double the cache. Second, it scales linearly with batch size. Serve 16 concurrent users and you've got 16 independent caches running simultaneously.

~1.3 GB

4K context (70B FP16)

~10.7 GB

32K context (70B FP16)

~42.9 GB

128K context (70B FP16)

~135 GB

1M context (70B FP16)

At 128K context, a single Llama 3 70B request's KV cache (~42.9 GB) already consumes 30% of an H200's 141 GB VRAM. Add model weights (~140 GB) and you're over budget before serving a second user. At 1M context, the cache outweighs the model entirely.

The second cost driver is fragmentation. Pre-PagedAttention, every inference engine reserved a contiguous block of VRAM per request upfront. Requests finish at different times, leaving gaps that can't be filled by new requests of different sizes. Studies found 60-80% of VRAM was wasted this way on typical serving workloads.

GPUaaS.com infrastructure data: teams that move from an unoptimised FP16 KV cache baseline to PagedAttention plus FP8 quantisation see GPU utilisation jump from under 30% to above 85% on the same hardware, at the same model size and context length.

◆ FIX 1: PAGEDATTENTION

Fix 1: PagedAttention -- stop wasting VRAM on fragmentation

PagedAttention, introduced by the vLLM team and now shipped by every major inference engine, borrows the virtual memory paging concept from operating systems. Instead of reserving one contiguous VRAM block per request, it breaks the KV cache into fixed-size pages (typically 16 or 32 tokens) and scatters them anywhere in physical memory. The attention kernel reads through a page-table indirection.

The result: VRAM fragmentation drops from 60-80% to under 4%. That freed memory directly increases the batch sizes you can run. More requests per GPU, lower cost per token.

How to enable it

PagedAttention ships by default in vLLM, SGLang, and TensorRT-LLM. You don't need to turn it on. The only tuning decision is block size: 16 tokens works well for most workloads, 32 tokens can improve throughput on long-context serving at the cost of slightly higher memory granularity.

⚡ Watch out

Some inference benchmarks disable PagedAttention to hit single-request throughput numbers. Don't use those numbers for capacity planning. Real production workloads are multi-request and the comparison isn't valid without it.

◆ FIX 2: PREFIX CACHING

Fix 2: Prefix caching -- stop recomputing shared context

PagedAttention fixes fragmentation within a single request. Prefix caching fixes the waste across requests. If 100 users all send the same 2,000-token system prompt, a naive engine computes fresh K and V tensors for those 2,000 tokens 100 separate times. With prefix caching, it computes them once and reuses the result for every subsequent request that shares that prefix.

SGLang's RadixAttention stores cached prefixes in a radix tree, matching new requests against all cached prefixes and serving the longest match. Cache hit rates on common workloads: 85-95% for few-shot prompting, 75-90% for multi-turn chat. On prefix-heavy traffic, SGLang delivers up to 6.4x higher throughput than vLLM's standard PagedAttention alone.

Prefix cache hit rates by workload type

Few-shot prompting

85-95%

Multi-turn chat

75-90%

RAG pipelines

60-80%

Unique prompts

~0%

⚠ When prefix caching hurts

If your workload is mostly unique prompts with no shared prefix, RadixAttention's LRU cache consumes VRAM without returning any benefit. In that case, stick with standard PagedAttention in vLLM. Prefix caching pays off when you have recurring system prompts, shared context, or multi-turn sessions.

SGLang delivers 29% higher throughput than vLLM on standard H100 benchmarks (16,200 vs 12,500 tokens/sec), and up to 6.4x higher throughput on prefix-heavy traffic via RadixAttention, per Morph LLM inference benchmarks (April 2026).

◆ FIX 3: GQA AND MLA

Fix 3: GQA and MLA -- reduce the cache at the architecture level

The previous two fixes are serving-layer optimisations. GQA (Grouped Query Attention) and MLA (Multi-head Latent Attention) reduce KV cache size at the model architecture level. They ship baked in, so you don't configure them -- you just pick models that use them.

GQA (Grouped Query Attention)

Multiple query heads share a single set of K and V heads, shrinking the KV cache in proportion to the group size. Llama 3 uses GQA across all sizes (8B and 70B). Mistral used it from its first 7B release. Now the baseline expectation for any new open-weight model.

MLA (Multi-head Latent Attention)

DeepSeek's architecture, used in DeepSeek-V2, V3, and R1. Projects K and V into a compressed latent space before caching. Achieves 93.3% KV cache reduction vs standard multi-head attention. That's why DeepSeek providers can quote lower per-token rates at the same context length as Llama-based providers.

Practical implication

If you're choosing between a Llama 3 70B deployment and a DeepSeek-V3 deployment for a long-context use case on GPUaaS.com H200 clusters, MLA is a significant factor in the cost comparison, not just model accuracy. DeepSeek's KV cache is 5-10x smaller at the same context length.

◆ FIX 4: KV CACHE QUANTISATION

Fix 4: KV cache quantisation -- halve the memory, keep the accuracy

KV cache quantisation stores the cached K and V tensors at lower precision than the model's compute precision. FP16 is the typical baseline. Dropping to FP8 or INT8 cuts cache memory in half. Dropping to INT4 cuts it by 75%.

Format	Memory vs FP16	Accuracy impact	How to enable (vLLM)
FP8	50% reduction	Within measurement noise on most benchmarks	--kv-cache-dtype fp8
INT8	50% reduction	Slightly larger gap; depends on calibration	--kv-cache-dtype int8
INT4 (TurboQuant)	75% reduction	~7-8% end-to-end overhead (LMDeploy)	LMDeploy with TurboQuant

FP8 KV cache is the right default for most production deployments on H100 or H200 hardware. The vLLM team's April 2026 benchmarks show per-token cache memory drops to 54% of the BF16 baseline in the best cases, with accuracy on long-context needle-in-haystack tasks recovering to 89% vs 91% at BF16 when using the two-level accumulation strategy.

⚠ Watch out

FP8 KV cache quantisation and FP8 model weight quantisation are two separate things. You can run a BF16 model with an FP8 KV cache, or an FP8 model with an FP16 KV cache. They're independent flags. Most teams want both enabled in production.

Llama 3.1 70B on H100 SXM5 costs roughly $0.71 per million tokens on-demand at FP8 vs $1.19/M at BF16, a 40% cost reduction on the same hardware, per Spheron FP8 inference benchmarks (May 2026).

◆ FAQ

Frequently asked questions

Last reviewed: May 27, 2026. For a full GPU comparison by VRAM capacity, see the VRAM guide. To get a cluster quote, see how GPUaaS.com works. For GPU clusters sized for long-context inference, visit the GPUaaS.com cluster catalogue.

KV Cache Explained: Why It's Killing Your Inference Costs (And How to Fix It)

Get a wholesale GPU quote in a few hours

Related articles

You Wouldn't Buy a Car From One Dealer Without Checking Prices Elsewhere. Most Teams Buy GPUs That Way.

Everyone Is Waiting 36 Weeks for GPUs. Some Teams Are Getting Them in 24 Hours. Here's the Difference.

Your Idle H100s Are Losing $15,000 a Month. Here's What Enterprises Are Doing About It.