The most important optimisation in LLM inference
Without the KV cache, generating a 200-token response would require recomputing attention from scratch 200 times — and each recomputation grows more expensive as the sequence gets longer. The KV cache eliminates that entirely. It is not optional, it is foundational.
Every LLM inference engine — vLLM, TGI, TensorRT-LLM, and nano-vLLM — has a KV cache. Its design determines how many concurrent requests you can serve, how fast decode is, and how much GPU memory you need. The chapters on PagedAttention → Ch.04 and Prefix Caching → Ch.07 are both direct extensions of this one concept.
What happens without a KV cache
To understand why the KV cache exists, you first need to feel the pain of not having it. Let's trace through what naive inference would look like.
Attention needs every previous token — always
Recall from Chapter 01 → Ch.01 that each new token's attention mechanism computes a Query vector and compares it against the Key vectors of every previous token, then retrieves a weighted sum of their Value vectors. This is how the model understands context — token 50 can "attend to" token 3 directly, regardless of distance.
The critical question is: where do those Key and Value vectors come from? They are computed by running each token through the model's projection matrices. Without caching, you'd have to recompute them from scratch every single decode step.
Generating token N requires computing K and V for all N-1 previous tokens. Cost grows quadratically with sequence length:
Total compute to generate N tokens without caching: O(N²). For a 1,000-token response, that's 500,000 redundant K/V recomputations. The whiteboard analogy: you erase and redo every previous step before writing the next one.
With a KV cache — O(N) total, O(1) per step
The fix is simple: during prefill → Ch.06, compute K and V for all input tokens once and save them. During each decode step, compute K and V only for the one new token, write it to the cache, then read all previous K and V directly. Each decode step costs constant time regardless of sequence length.
Exactly what lives in the KV cache
The KV cache stores one entry per token, per layer, per attention head. Understanding each of those dimensions is key to understanding why the cache can get so large — and why GQA (Grouped Query Attention) was invented to shrink it.
Per token
Every token that has been processed — whether from the prompt during prefill or generated during decode — writes its K and V vectors to the cache. A 500-token prompt + 100 generated tokens = 600 entries in the cache by the time generation ends.
Per layer
The transformer has multiple layers — Qwen3-1.7B has 28, LLaMA-3 70B has 80. Each layer runs its own attention computation with its own K and V projections. So every token produces 28 separate K vectors and 28 separate V vectors (one per layer). The cache must store all of them, because each layer reads from its own slice during attention.
Per attention head — what is an attention head?
This is the part that trips most beginners up, so let's go slow. Recall from Chapter 01 → Ch.01 that attention works by comparing a Query vector against Key vectors and retrieving weighted Values. But a single round of Q/K/V comparison can only "look" for one kind of relationship at a time.
Concretely: within each transformer layer, instead of computing attention once, the model splits the computation into num_heads parallel "heads". Each head has its own learned Q, K, and V projection matrices — so the same input token produces a different Q, K, V for each head. Each head attends to the sequence independently, then the outputs are concatenated.
Every head writes its own K and V to the KV cache independently. So one token in one layer writes num_heads × 2 vectors. A 28-layer model with 16 heads caches 28 × 16 × 2 = 896 vectors per token. Each vector is head_dim = 64 floats. That's 896 × 64 × 2 bytes = ~112 KB per token at fp16 for standard MHA.
Grouped Query Attention — shrinking the cache
Grouped Query Attention (GQA) keeps the multiple-head benefit for Queries (each head still looks for something different) but shares K and V heads across groups of Q heads. If 2 Q heads share 1 KV head, the cache is half the size. If 4 Q heads share 1 KV head, it's a quarter.
K₁
V₁
K₂
V₂
K₃
V₃
K₄
V₄
4 Q heads → 4 KV head pairs. Cache stores 4 K + 4 V per token per layer. Full expressiveness, maximum memory.
shared by Q₁+Q₂
shared by Q₃+Q₄
4 Q heads → only 2 KV head pairs. Cache stores 2 K + 2 V per token per layer. Half the memory, nearly identical quality.
How nano-vLLM stores the KV cache
nano-vLLM pre-allocates a single large tensor in GPU HBM → Ch.01 at startup and uses it for every request, for the entire lifetime of the engine. This is a deliberate design choice — allocation is expensive, so do it once.
The tensor shape — six dimensions
Click each dimension below to understand what it represents and why it exists:
Concrete numbers — Qwen3-1.7B on a 16 GB GPU
torch.zeros() or torch.empty()) is not free — it requires CUDA memory management and can cause fragmentation over time. By allocating one giant tensor at startup and slicing into it using index arithmetic, nano-vLLM eliminates all runtime allocation overhead. Every request gets pre-existing memory slots, not freshly allocated ones. This also makes the memory footprint perfectly predictable: you always know exactly how much GPU memory the KV cache will use.
How K and V get written — the Triton kernel
Writing K and V vectors to the right locations in the cache needs to be fast — it happens on every token, in every layer, on every step. nano-vLLM uses a custom Triton kernel for this. Let's understand what that means.
What is a Triton kernel?
Triton is a Python-based language (from OpenAI) for writing GPU kernels. A kernel is a function that runs directly on the GPU — not through PyTorch's standard operators, but as raw parallel GPU code. Writing a custom kernel lets you do exactly what you need, nothing more, with maximum control over memory access patterns.
The slot mapping — how the kernel knows where to write
Before the kernel runs, the CPU block manager → Ch.02 computes a slot mapping: a list of integers, one per token being processed. Each integer is the physical slot index in the KV cache tensor where that token's K and V should be written.
For example, if three tokens are being processed and they belong to blocks 47, 47, and 12 with offsets 0, 1, and 0 respectively, the slot mapping would be [47×256+0, 47×256+1, 12×256+0] = [12032, 12033, 3072]. The kernel reads this list and writes each token's data to the specified slot — no Python-level indexing on the hot path.
import triton import triton.language as tl @triton.jit def store_kvcache_kernel( keys_ptr, values_ptr, # pointers to the new K, V tensors from this forward pass kv_cache_ptr, # pointer to the base of the pre-allocated KV cache tensor slot_mapping_ptr, # pointer to the slot mapping list computed by the CPU block manager head_size: tl.constexpr, num_heads: tl.constexpr, ): # Each GPU thread handles one token — massively parallel token_idx = tl.program_id(0) # Look up which physical slot this token maps to slot = tl.load(slot_mapping_ptr + token_idx) # Compute byte offsets into the KV cache for this slot cache_offset = slot * num_heads * head_size # Write K and V for this token's slot — no Python overhead here for i in range(num_heads * head_size): tl.store(kv_cache_ptr + cache_offset + i, tl.load(keys_ptr + token_idx * num_heads * head_size + i)) tl.store(kv_cache_ptr + cache_offset + i + /* V offset */, tl.load(values_ptr + token_idx * num_heads * head_size + i))
The key insight: this kernel launches one GPU thread per token. All tokens are written simultaneously in parallel. The CPU's slot mapping list is just integers — cheap to compute — but it lets the GPU write to non-contiguous memory locations correctly without any Python logic running during the write itself.
The KV cache block pool — interactive simulator
The cache is divided into fixed-size blocks. Each block holds block_size tokens (256 by default). Requests are allocated blocks from a free pool; when they grow they get more; when they finish their blocks return instantly. Experiment below — add requests, grow them, finish them, and watch what happens when memory fills up.
The full picture in code
Here is how the KV cache is initialised, and how it flows between the CPU block manager and the GPU attention layer during a forward pass.
class KVCache: def __init__(self, config: ModelConfig, num_blocks: int): # One big tensor — allocated ONCE at startup, never again # Shape: [2, layers, blocks, block_size, kv_heads, head_dim] # dim 0 = 0 for Keys, 1 for Values self.data = torch.zeros( 2, config.num_hidden_layers, # 28 for Qwen3-1.7B num_blocks, # total blocks available (e.g. 512) config.block_size, # 256 tokens per block config.num_key_value_heads, # 8 (GQA — not 16!) config.head_dim, # 64 dtype=torch.float16, device="cuda", )
def forward(self, q, k, v, kv_cache, slot_mapping, block_table, is_prefill): # Write new K and V into the cache at the correct slots # slot_mapping comes from the CPU block manager store_kvcache_kernel[num_tokens,](k, v, kv_cache, slot_mapping, ...) if is_prefill: # Prefill: standard flash attention over the full input sequence # All tokens available as contiguous tensors out = flash_attn_varlen_func(q, k, v, ...) else: # Decode: flash attention that reads K, V from the paged cache # block_table maps logical blocks to physical slots in kv_cache out = flash_attn_with_kvcache(q, kv_cache[0], kv_cache[1], block_table=block_table, ...)
kv_cache.data tensor. flash_attn_varlen_func processes variable-length batches of prompt tokens all at once. flash_attn_with_kvcache handles one new decode token per sequence but needs to read K and V from scattered, non-contiguous blocks — which is exactly what the block_table enables. Both paths are covered in depth in the optimisations chapter → Ch.10.
What the KV cache enables at scale
Constant-time decode
Without the cache, decode step N costs O(N). With it, every decode step costs the same — one Triton write + one flash attention read. Generating token 500 is just as fast as generating token 1.
Enables PagedAttention
The block-based structure of nano-vLLM's KV cache is not incidental — it's the foundation PagedAttention is built on. Blocks are the unit of allocation, sharing, and eviction. → Ch.04
Enables prefix caching
Because the cache is block-structured and blocks are identified by their token content, blocks with the same tokens (system prompts, templates) can be shared across requests without recomputation. → Ch.07
The memory bottleneck
KV cache is why GPU memory is the primary constraint in LLM serving, not GPU compute. A single 70B model request with 128k context needs ~100 GB of KV cache alone. This is the bottleneck PagedAttention was designed to solve. → Ch.04
Things beginners get wrong about the KV cache
Quiz
Three questions to test your understanding of the KV cache. Wrong answers explain exactly why they're wrong.
1. Without a KV cache, generating a 100-token response requires how many K/V computations relative to generating a 10-token response?
2. Qwen3-1.7B uses 16 query heads but only 8 KV heads. What technique is this, and what does it achieve?
3. Why does nano-vLLM use a Triton kernel to write to the KV cache instead of standard PyTorch indexing?
What you now know
KV cache trades memory for compute. Store K and V once during prefill, read them for free on every decode step. Eliminates O(N²) recomputation, makes every decode step O(1).
Stored per token, per layer, per KV head. Every processed token writes one K and one V vector per layer. A 28-layer model with 500 tokens = 28,000 K vectors and 28,000 V vectors in the cache.
GQA shrinks the cache. Grouped Query Attention uses fewer KV heads than Q heads. Qwen3-1.7B's 8 KV heads vs 16 Q heads halves the cache size compared to standard MHA.
Pre-allocated once at startup. nano-vLLM allocates one large tensor in GPU HBM at startup. No runtime allocation, no fragmentation, perfectly predictable memory footprint.
Written via Triton kernel. A custom GPU kernel writes K and V using a slot mapping from the CPU — one thread per token in parallel, zero Python overhead on the hot path.
Foundation for everything ahead. The block structure of the KV cache enables PagedAttention (Ch.04), prefix caching (Ch.07), and is the primary reason GPU memory is the binding constraint in LLM serving.