What is VRAM and Why It Matters for LLM Workloads

AI & ML

Your cluster OOM'd and it wasn't a code bug. Here's the exact VRAM maths — model weights, KV cache, quantisation, and what it means for your GPU choice.

GPUaaS.com Team
Infrastructure Research
May 25, 2026

VRAM is the hard limit on every LLM workload. A 70B parameter model at BF16 needs ~140 GB of GPU memory, and no amount of compute fixes a capacity shortfall.

Key takeaways
  • A 70B model at BF16 needs ~140 GB VRAM for weights alone, which fills one H200 SXM (141 GB HBM3e) with almost no headroom to spare
  • KV cache is the hidden cost: at 128K context, the cache for a 70B model at FP16 hits ~42.9 GB per concurrent user, on top of model weights (NVIDIA, 2025)
  • The H200 SXM has 76% more VRAM than the H100 SXM (141 GB vs 80 GB) at 4.8 TB/s bandwidth and the same 700W TDP (Tom's Hardware, 2023)
  • FP8 quantisation cuts VRAM in half with minimal accuracy loss; INT4 cuts by 75% but hurts reasoning tasks
  • An 8xH200 SXM NVLink cluster gives you 1,128 GB combined VRAM, enough for Llama 3.1 405B at BF16 with a 32K context window. H200 on GPUaaS.com starts at $3.50/GPU/hr (GPUaaS.com, May 2026)

When an ML engineer says their job OOM'd, one thing caused it: the model, its activations, and the KV cache together blew past available VRAM. That single number -- how many gigabytes your GPU has -- decides which models load, at what batch size, and at what context length. Read on for the maths and what it means for your hardware choice.

For a live view of clusters sorted by VRAM capacity, see the GPUaaS.com cluster catalogue.

◆ WHAT IS VRAM
What is VRAM and how does it differ from system RAM

VRAM is dedicated high-bandwidth memory packaged alongside the GPU die. System RAM (DDR5 on modern servers) runs at roughly 50-100 GB/s. Data-centre GPUs use stacked HBM, mounted on the same package as the compute die and connected through a very wide interface, so it runs at a completely different level. The H200 SXM reads memory at 4.8 TB/s, around 60x faster than server DDR5. That matters because token generation is memory-bandwidth bound during inference. Every generation step has to stream the full set of model weights through the GPU's compute units, so bandwidth is what determines speed.

You can technically hold model weights in system RAM using CPU offloading (llama.cpp does this), but bandwidth drops to ~50 GB/s and inference slows by 30-100x. For any production workload, all model weights need to be in VRAM.
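As a rough illustration of why bandwidth sets the pace, the sketch below divides memory bandwidth by the weight footprint to get a ceiling on single-stream decode speed. It is a back-of-envelope bound only: it assumes each generated token reads every weight exactly once and ignores KV cache reads, batching, compute time, and kernel overhead. The function name is made up for this example.

```python
def decode_tokens_per_s_ceiling(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every token streams all weights once."""
    return bandwidth_gb_s / weight_gb


WEIGHTS_GB = 140.0  # 70B parameters at BF16 (2 bytes each)

# Bandwidth figures quoted in the text above.
hbm = decode_tokens_per_s_ceiling(WEIGHTS_GB, 4800.0)  # H200 SXM, 4.8 TB/s
ddr = decode_tokens_per_s_ceiling(WEIGHTS_GB, 50.0)    # CPU offload, ~50 GB/s

print(f"H200 HBM3e ceiling: {hbm:.1f} tokens/s")
print(f"~50 GB/s ceiling:   {ddr:.2f} tokens/s")
print(f"Gap: {hbm / ddr:.0f}x")
```

The gap lands at the top of the 30-100x range quoted above. Real systems fall short of both ceilings.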

  • H200 HBM3e: 141 GB
  • H200 bandwidth: 4.8 TB/s
  • H100 HBM3: 80 GB
  • HBM vs DDR5 bandwidth: ~60x

◆ MODEL WEIGHTS
How much VRAM does an LLM need: the weight calculation

Model weight memory is the floor -- the VRAM you need just to load the model, before a single token is generated. The formula is simple: parameters multiplied by bytes per parameter for your chosen precision. The baseline is ~2 GB per 1B parameters at FP16 or BF16, with FP8 cutting that in half and INT4 cutting it by 75%.

Formula

VRAM for weights = parameters x bytes per parameter
FP32 = 4 B/param  ·  BF16/FP16 = 2 B/param  ·  INT8/FP8 = 1 B/param  ·  INT4/FP4 = 0.5 B/param

VRAM required for model weights only (BF16 precision)

  • Llama 3.1 405B: 810 GB
  • Llama 3.1 70B: 140 GB
  • Mistral 22B: 44 GB
  • Llama 3 8B: 16 GB
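The weight formula is easy to script. This is a minimal sketch that uses the bytes-per-parameter values listed above and treats 1 GB as 10^9 bytes, so billions of parameters multiplied by bytes per parameter gives GB directly. The function and dictionary names are illustrative, not from any library.

```python
# Bytes per parameter for each precision, as listed in the formula above.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "BF16": 2.0,
    "FP16": 2.0,
    "INT8": 1.0,
    "FP8": 1.0,
    "INT4": 0.5,
    "FP4": 0.5,
}


def weight_vram_gb(params_billion: float, precision: str = "BF16") -> float:
    """Weights-only VRAM in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


MODELS = {
    "Llama 3.1 405B": 405,
    "Llama 3.1 70B": 70,
    "Mistral 22B": 22,
    "Llama 3 8B": 8,
}

for name, size in MODELS.items():
    row = ", ".join(
        f"{p}: {weight_vram_gb(size, p):g} GB" for p in ("BF16", "FP8", "INT4")
    )
    print(f"{name:15s} {row}")
```

These are still weights-only figures, so the overheads described next apply on top.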

These are weights-only numbers. In a real deployment, activations, PyTorch overhead (~10-20%), and KV cache headroom add another 20-30% on top. Use the weight figure as your floor, not your allocation target.

GPUaaS.com infrastructure data: Llama 3.1 70B at BF16 with a 4,096-token context window and batch size of 8 needs about 165 GB total VRAM, which is 18% above the weight-only figure.

◆ KV CACHE
KV cache: the hidden VRAM cost that grows with context length

The KV (key-value) cache stores the attention keys and values computed for earlier tokens so the model doesn't recompute them on every generation step. That's what makes autoregressive inference fast enough to use in production. The tradeoff is memory: KV cache consumption scales with batch size x context length x model depth. At 4K tokens and a single user it's close to a rounding error. At 128K tokens with a handful of concurrent users it can outweigh the model weights themselves.

~42.9 GB: KV cache for Llama 3.1 70B at BF16, 128K context, 1 concurrent user (NVIDIA inference optimisation guide, 2025)

The formula, per NVIDIA's inference optimisation guide: 2 x num_layers x num_kv_heads x head_dim x seq_len x batch_size x bytes_per_element. For Llama 3.1 70B (80 layers, 8 KV heads, 128 head dim) at BF16 with 128K context and 1 user: 2 x 80 x 8 x 128 x 131072 x 1 x 2 bytes = ~42.9 GB. Multiply by concurrent users for your total KV cache budget. For a full breakdown of how to optimise that budget, see the KV cache inference cost guide.
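Here is the same formula as a small function, as a sketch for checking your own numbers. The parameter names mirror the formula above, the Llama 3.1 70B values are the ones quoted in the text, and 1 GB is taken as 10^9 bytes.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # BF16 / FP16
) -> int:
    """KV cache size in bytes; the leading 2 covers keys plus values."""
    return (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_element
    )


# Llama 3.1 70B as described above: 80 layers, 8 KV heads, head dim 128.
for label, seq_len in (("4K", 4096), ("32K", 32768), ("128K", 131072)):
    size_gb = kv_cache_bytes(80, 8, 128, seq_len) / 1e9
    print(f"{label:>4} context, batch=1: {size_gb:.1f} GB")
```

Pass `batch_size` to see how quickly concurrent users multiply the total.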

⚠ Watch out

KV cache grows linearly with batch size. A batch of 16 at 4,096 tokens uses 16x the single-user figure. Most production OOMs come from underestimating KV cache at real batch sizes, not from the weight calculation.

Context length | KV cache (70B BF16, batch=1) | Fits on single H200 with BF16 weights?
4K tokens | ~1.3 GB | ✗ Marginal, 140 + 1.3 = 141.3 GB vs 141 GB
32K tokens | ~10.7 GB | ✗ No, 140 + 10.7 = 150.7 GB
128K tokens | ~42.9 GB | ✗ No, 140 + 43 = 183 GB

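To turn the KV cache figures into a capacity check, the sketch below counts how many full-length sequences fit in whatever VRAM is left after the weights. It assumes the Llama 3.1 70B shape quoted above, a BF16 KV cache even when the weights are FP8, and no allowance for activations or framework overhead unless you pass `reserve_gb`. All names are hypothetical.

```python
# Llama 3.1 70B figures from the text: 80 layers, 8 KV heads, head dim 128.
# Leading 2 = keys and values, trailing 2 = bytes per BF16 element.
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2


def max_sequences(
    gpu_gb: float, weight_gb: float, seq_len: int, reserve_gb: float = 0.0
) -> int:
    """Whole sequences of seq_len tokens whose KV cache fits beside the weights."""
    free_bytes = (gpu_gb - weight_gb - reserve_gb) * 1e9
    if free_bytes <= 0:
        return 0
    return int(free_bytes // (KV_BYTES_PER_TOKEN * seq_len))


SCENARIOS = [
    ("H200, BF16 weights", 141, 140),
    ("H200, FP8 weights", 141, 70),
    ("B200, BF16 weights", 192, 140),
]

for label, gpu_gb, weight_gb in SCENARIOS:
    counts = {n: max_sequences(gpu_gb, weight_gb, n) for n in (4096, 32768, 131072)}
    print(f"{label}: {counts}")
```

A result of zero means the weights load but there is no room left to serve a request at that context length.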
◆ GPU COMPARISON
VRAM across H100, H200, B200, and A100: a capacity comparison

NVIDIA's data-centre lineup runs from 80 GB to 192 GB per chip, with memory bandwidth between 2 TB/s and 8 TB/s. The right pick depends on model size, context length, and whether you need everything on one GPU or can spread across multiple. All specs below are SXM form factor, the interconnected variant used in 8-GPU cluster nodes.

Spec | A100 SXM | H100 SXM | H200 SXM | B200 SXM
VRAM | 80 GB HBM2e | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e
Memory BW | 2.0 TB/s | 3.35 TB/s | 4.8 TB/s | 8.0 TB/s
FP8 TFLOPS | N/A | 1,979 | 1,979 | 4,500
70B fits (BF16)? | ✗ 2x needed | ✗ 2x needed | ✓ Yes (tight) | ✓ Yes + headroom
Price/GPU/hr | from $1.20 | from $1.49 | from $3.50 | from $4.99

The H200's 76% VRAM bump over the H100 at the same 700W TDP is the most practical single-GPU upgrade for 70B+ serving. You drop the multi-GPU overhead without changing your rack or cooling setup. See H200 clusters on GPUaaS.com from $3.50/GPU/hr.

The H200 SXM has 76% more VRAM than the H100 (141 GB vs 80 GB) and 43% more memory bandwidth (4.8 TB/s vs 3.35 TB/s) at the same 700W TDP. For 70B+ model serving, that's the difference between one GPU and two. Source: Tom's Hardware H200 announcement.

◆ QUANTISATION
Quantisation and its effect on VRAM requirements

Quantisation lowers parameter precision and cuts weight VRAM in proportion. A 70B model at BF16 (~140 GB) drops to ~70 GB at INT8, which fits a single H100 SXM (80 GB) with ~10 GB left for activations and short-context KV cache. 4-bit quantisation cuts weight VRAM by 75%, to ~35 GB for a 70B model. That is still more than a single 24 GB consumer GPU holds, but it brings 70B within reach of a pair of them, at the cost of accuracy.

FP8 -- 50% VRAM reduction

Native on H100, H200, B200. Minimal accuracy loss on standard LLM benchmarks. The default starting point for inference on data-centre GPUs.

INT8 (GPTQ / AWQ) -- 50% reduction

Works on A100 hardware too. Slightly wider accuracy gap than FP8, depending on calibration data. Widely supported in vLLM and TGI.

INT4 (GGUF / AWQ) -- 75% reduction

Shrinks 70B weights to ~35 GB, leaving ample room on a single H100 (80 GB). There's a real accuracy drop on reasoning tasks, so it's not the right call for production API serving.

FP4 (B200 native) -- 75% reduction

B200 Blackwell only. NVIDIA reports less than 2% accuracy loss vs BF16 on Llama 3.1 70B using the B200 FP4 inference engine.
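One way to use these options is to pick the widest precision whose weights still fit on a given GPU with some headroom. The sketch below does that. The 10 GB default headroom echoes the H100 example above and is an assumption, not a recommendation, and the function does not check whether the GPU supports the format natively (FP8 needs H100 or newer, FP4 needs B200).

```python
# Precisions ordered from widest to narrowest, with bytes per parameter.
CANDIDATES = [("BF16", 2.0), ("FP8/INT8", 1.0), ("INT4/FP4", 0.5)]


def widest_precision_that_fits(
    params_billion: float, gpu_gb: float, headroom_gb: float = 10.0
):
    """Return the first precision whose weights plus headroom fit, else None."""
    for name, bytes_per_param in CANDIDATES:
        if params_billion * bytes_per_param + headroom_gb <= gpu_gb:
            return name
    return None


for gpu, capacity_gb in (("H100 SXM", 80), ("H200 SXM", 141), ("B200 SXM", 192)):
    print(f"70B on {gpu}: {widest_precision_that_fits(70, capacity_gb)}")
```

Raise `headroom_gb` to match your real KV cache budget before trusting the answer.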

⚡ Note

With weight-only quantisation, activations don't shrink. They stay in higher precision (typically BF16 or FP16) during compute. For QLoRA fine-tuning, activations and gradient accumulators add another 15-40% on top of the quantised weight footprint.

◆ MULTI-GPU STRATEGY
Multi-GPU strategies: tensor parallelism and pipeline parallelism

When a model won't fit on one GPU, you have two main options. Tensor parallelism splits weight matrices horizontally across GPUs, handled by vLLM, TensorRT-LLM, or DeepSpeed. Pipeline parallelism puts different model layers on different devices. Both approaches need fast interconnect or they become communication-bound.

NVLink on H200 SXM delivers 900 GB/s per GPU. PCIe Gen5 delivers ~128 GB/s. That 7x gap is why PCIe-only multi-GPU nodes rarely make sense for LLM tensor parallelism past 2 GPUs.

Rule of thumb

On an 8xH200 SXM NVLink cluster, a 70B BF16 model scales to about 7.5x single-GPU throughput under tensor parallelism. On PCIe-only nodes, you're looking at 3-4x. Use SXM for tensor-parallel workloads above 2 GPUs.

An 8xH200 SXM cluster gives you 1,128 GB combined VRAM, enough to serve Llama 3.1 405B at BF16 with room for a 32K context window. Check H200 cluster options on GPUaaS.com from $3.50/GPU/hr, or the B200 cluster page for 192 GB per GPU.
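For a first-pass GPU count, you can divide the weight footprint by the usable VRAM per GPU and round up. The sketch below also rounds to the next power of two, which is a common choice because the tensor-parallel degree has to divide the attention head count evenly. Both that rounding and the 90% usable fraction are assumptions for illustration.

```python
import math


def min_tensor_parallel_gpus(
    weight_gb: float, per_gpu_gb: float, usable_fraction: float = 0.9
) -> int:
    """Smallest power-of-two GPU count whose usable VRAM holds the weights."""
    needed = math.ceil(weight_gb / (per_gpu_gb * usable_fraction))
    # Round up to the next power of two (1, 2, 4, 8, ...).
    return 1 << (needed - 1).bit_length()


def per_gpu_shard_gb(weight_gb: float, num_gpus: int) -> float:
    """Weights held by each GPU when sharded evenly."""
    return weight_gb / num_gpus


gpus = min_tensor_parallel_gpus(810, 141)  # 405B at BF16 on H200 SXM
print(f"405B BF16 on H200: {gpus} GPUs, "
      f"{per_gpu_shard_gb(810, gpus):.2f} GB of weights per GPU")
print(f"70B BF16 on H100: {min_tensor_parallel_gpus(140, 80)} GPUs")
```

Whatever each GPU has left after its weight shard is the budget for KV cache and activations.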

GPUaaS.com infrastructure data: teams running Llama 3.1 405B in production on 8xH200 NVLink clusters see 85-92% VRAM utilisation efficiency under tensor parallelism with vLLM 0.4.x.

◆ FAQ
Frequently asked questions

A 70B model at BF16 needs ~140 GB for weights. A single H200 SXM (141 GB) holds those weights but leaves only ~1 GB of headroom, less than the ~1.3 GB KV cache of a single 4K-token sequence. For real batch sizes or contexts at BF16, plan on 2xH200 or an 8xH100 setup. At INT8, weights drop to ~70 GB, which fits a single H100 SXM (80 GB) with ~10 GB left for activations and KV cache at short contexts.

HBM2e, HBM3, and HBM3e are successive generations of High Bandwidth Memory. HBM2e (A100, 80 GB) runs at ~2 TB/s. HBM3 (H100, 80 GB) hits 3.35 TB/s. HBM3e pushes both capacity and bandwidth up: H200 at 4.8 TB/s and 141 GB, B200 at 8 TB/s and 192 GB via advanced die-stacking. For memory-bandwidth-bound inference, each generation directly adds to tokens-per-second throughput.

Yes, you can run a model that is larger than one GPU's VRAM, through CPU offloading (llama.cpp) or tensor parallelism across multiple GPUs. CPU offloading is 30-100x slower because offloaded weights are read at system-memory or PCIe speeds (~50-64 GB/s) rather than the 4.8 TB/s of HBM3e. For anything beyond local testing, quantise the model down to fit, or move to a multi-GPU cluster via GPUaaS.com.

Fine-tuning needs far more VRAM than inference. Full fine-tuning of a 70B model at BF16 means storing weights (~140 GB) plus gradients (~140 GB) plus Adam optimiser states (~280 GB, two BF16 values per parameter), which adds up to ~560 GB before activations. That makes 4xH200 (564 GB) the bare minimum, and FP32 optimiser states would double the ~280 GB figure. QLoRA cuts that down a lot: only the LoRA adapter trains in higher precision while the base model stays in 4-bit, so 70B fine-tuning is doable on a single 80 GB H100.
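The full fine-tuning arithmetic is worth scripting so you can vary the optimiser precision. This sketch counts weights, gradients, and two Adam states per parameter, and deliberately leaves out activations, which depend on batch size and sequence length. The function name and defaults are illustrative.

```python
import math


def full_finetune_gb(
    params_billion: float,
    weight_bytes: float = 2.0,       # BF16 weights
    grad_bytes: float = 2.0,         # BF16 gradients
    optim_state_bytes: float = 2.0,  # per Adam state; use 4.0 for FP32 states
) -> dict:
    """Static memory for full fine-tuning with Adam, in GB, excluding activations."""
    weights = params_billion * weight_bytes
    grads = params_billion * grad_bytes
    adam = params_billion * optim_state_bytes * 2  # first and second moments
    return {"weights": weights, "gradients": grads, "adam": adam,
            "total": weights + grads + adam}


bf16 = full_finetune_gb(70)
fp32_states = full_finetune_gb(70, optim_state_bytes=4.0)

print(f"BF16 Adam states: {bf16['total']:g} GB "
      f"-> at least {math.ceil(bf16['total'] / 141)} x H200")
print(f"FP32 Adam states: {fp32_states['total']:g} GB "
      f"-> at least {math.ceil(fp32_states['total'] / 141)} x H200")
```

Treat the GPU counts as lower bounds, since activations and framework overhead come on top.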

KV cache scales linearly with context. For Llama 3.1 70B at BF16 with batch=1: ~1.3 GB at 4K tokens, ~10.7 GB at 32K, and ~42.9 GB at 128K, all on top of the 140 GB weight footprint. At 128K that totals ~183 GB, which exceeds a single H200 (141 GB) and only just fits a B200 (192 GB), so long-context BF16 deployments need a B200, multiple GPUs, or quantised weights. PagedAttention in vLLM helps by allocating memory on demand instead of pre-reserving it.

The highest-VRAM GPU covered here is the B200 SXM at 192 GB HBM3e. In an 8-GPU B200 config, combined VRAM reaches 1,536 GB, enough for a 405B model at BF16 with plenty of KV cache headroom. B200 starts at $4.99/GPU/hr on-demand. Check the B200 cluster page for availability -- stock is tighter than H200 in most regions.

When a job OOMs even though the weights fit, three things are usually responsible, in rough order of how often they come up: KV cache blowing up at production batch sizes, PyTorch reserving 10-20% above active usage as a memory pool, and activations staying in higher precision (typically BF16 or FP16) when weights are quantised. Run torch.cuda.memory_summary() to see which allocation is actually maxing out your VRAM.
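A quick way to separate live tensors from the caching allocator's pool is to compare allocated and reserved memory. The sketch below wraps the standard `torch.cuda` counters and returns `None` when PyTorch or a CUDA device is not available. The helper names are made up for this example, and the numbers in the last lines are illustrative inputs, not measurements.

```python
def summarise(allocated: int, reserved: int, total: int) -> dict:
    """Split GPU memory (bytes) into live tensors and allocator pool slack, in GB."""
    return {
        "allocated_gb": allocated / 1e9,
        "pool_slack_gb": (reserved - allocated) / 1e9,
        "reserved_pct_of_total": 100 * reserved / total,
    }


def read_cuda_memory(device: int = 0):
    """Read the counters from PyTorch, or return None without PyTorch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    _free, total = torch.cuda.mem_get_info(device)
    return summarise(
        torch.cuda.memory_allocated(device),
        torch.cuda.memory_reserved(device),
        total,
    )


stats = read_cuda_memory()
print(stats if stats is not None else "No CUDA device visible")

# Illustrative numbers only: 100 GB live, 115 GB reserved, 141 GB card.
print(summarise(100_000_000_000, 115_000_000_000, 141_000_000_000))
```

If the pool slack is large relative to allocated memory, the allocator rather than your model is the first thing to look at.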

Last reviewed: May 26, 2026. Ready to size a cluster? Here's how GPUaaS.com works. For GPU clusters by VRAM capacity, visit the GPUaaS.com cluster catalogue.
