Chapter 03 of 11 · nano-vLLM Deep Dive
03

The KV Cache

Why storing Keys and Values turns an O(n²) recomputation problem into O(1) lookup — and how nano-vLLM implements it as a single pre-allocated GPU tensor.

← Ch02: Architecture Next: PagedAttention →

The most important optimisation in LLM inference

Without the KV cache, generating a 200-token response would require recomputing attention from scratch 200 times — and each recomputation grows more expensive as the sequence gets longer. The KV cache eliminates that entirely. It is not optional, it is foundational.

The Whiteboard Analogy Imagine you're solving a long maths problem on a whiteboard. Each step requires looking back at every previous step to figure out what comes next. Without a whiteboard, you'd have to redo every prior step from memory each time. The whiteboard is your KV cache — a permanent record of what was computed, available for instant lookup. You write to it as you go, and read from it whenever you need context from earlier in the problem. Erasing it (which happens when a request finishes) is fast and free. The compute saved by not redoing the work is enormous.

Every LLM inference engine — vLLM, TGI, TensorRT-LLM, and nano-vLLM — has a KV cache. Its design determines how many concurrent requests you can serve, how fast decode is, and how much GPU memory you need. The chapters on PagedAttention → Ch.04 and Prefix Caching → Ch.07 are both direct extensions of this one concept.

What happens without a KV cache

To understand why the KV cache exists, you first need to feel the pain of not having it. Let's trace through what naive inference would look like.

Attention needs every previous token — always

Recall from Chapter 01 → Ch.01 that each new token's attention mechanism computes a Query vector and compares it against the Key vectors of every previous token, then retrieves a weighted sum of their Value vectors. This is how the model understands context — token 50 can "attend to" token 3 directly, regardless of distance.

The critical question is: where do those Key and Value vectors come from? They are computed by running each token through the model's projection matrices. Without caching, you'd have to recompute them from scratch every single decode step.

Cost without KV cache

Generating token N requires computing K and V for all N-1 previous tokens. Cost grows quadratically with sequence length:

Token 1 (1 compute)
Token 10 (10 compute)
10×
Token 50 (50 compute)
50×
Token 100 (100 compute)
100× — every step gets more expensive

Total compute to generate N tokens without caching: O(N²). For a 1,000-token response, that's 500,000 redundant K/V recomputations. The whiteboard analogy: you erase and redo every previous step before writing the next one.

With a KV cache — O(N) total, O(1) per step

The fix is simple: during prefill → Ch.06, compute K and V for all input tokens once and save them. During each decode step, compute K and V only for the one new token, write it to the cache, then read all previous K and V directly. Each decode step costs constant time regardless of sequence length.

The fundamental trade-off The KV cache trades memory for compute. You spend GPU RAM to store K and V vectors permanently, in exchange for never recomputing them. For typical LLM workloads — where responses are tens to hundreds of tokens — this is an overwhelmingly good trade. Compute saved >> memory spent. The challenge (which PagedAttention solves) is managing that memory efficiently across many concurrent requests.

Exactly what lives in the KV cache

The KV cache stores one entry per token, per layer, per attention head. Understanding each of those dimensions is key to understanding why the cache can get so large — and why GQA (Grouped Query Attention) was invented to shrink it.

Per token

Every token that has been processed — whether from the prompt during prefill or generated during decode — writes its K and V vectors to the cache. A 500-token prompt + 100 generated tokens = 600 entries in the cache by the time generation ends.

Per layer

The transformer has multiple layers — Qwen3-1.7B has 28, LLaMA-3 70B has 80. Each layer runs its own attention computation with its own K and V projections. So every token produces 28 separate K vectors and 28 separate V vectors (one per layer). The cache must store all of them, because each layer reads from its own slice during attention.

Why each layer needs its own K and V Different layers in a transformer learn to attend to different kinds of patterns. Early layers might focus on syntax and nearby tokens. Later layers might focus on long-range semantic relationships. Each layer's K and V vectors encode that layer's specific "view" of a token — they're not interchangeable. Layer 3's K for the word "bank" is different from Layer 15's K for the same word.

Per attention head — what is an attention head?

This is the part that trips most beginners up, so let's go slow. Recall from Chapter 01 → Ch.01 that attention works by comparing a Query vector against Key vectors and retrieving weighted Values. But a single round of Q/K/V comparison can only "look" for one kind of relationship at a time.

The Committee of Experts Analogy Imagine you're trying to understand a sentence by asking a committee of experts to analyse it simultaneously — each expert looks for something different. Expert 1 focuses on grammar and word order. Expert 2 focuses on which nouns the pronouns refer to. Expert 3 focuses on sentiment and emotional tone. Expert 4 focuses on long-range topic connections. Each expert runs their own Q/K/V analysis independently, in parallel. When they're done, their findings are combined. That committee is Multi-Head Attention. Each expert is one attention head.

Concretely: within each transformer layer, instead of computing attention once, the model splits the computation into num_heads parallel "heads". Each head has its own learned Q, K, and V projection matrices — so the same input token produces a different Q, K, V for each head. Each head attends to the sequence independently, then the outputs are concatenated.

Multi-head attention — one token, one layer, 4 heads
INPUT TOKEN
"bank"
one embedding vector
HEAD 1
grammar & syntax
K₁, V₁ → cache
HEAD 2
co-reference
K₂, V₂ → cache
HEAD 3
semantic meaning
K₃, V₃ → cache
HEAD 4
long-range topic
K₄, V₄ → cache

Every head writes its own K and V to the KV cache independently. So one token in one layer writes num_heads × 2 vectors. A 28-layer model with 16 heads caches 28 × 16 × 2 = 896 vectors per token. Each vector is head_dim = 64 floats. That's 896 × 64 × 2 bytes = ~112 KB per token at fp16 for standard MHA.

This is why the KV cache gets so large Each token must store K and V for every head in every layer. A 70B model with 80 layers and 64 heads per layer (MHA) caches 80 × 64 × 2 = 10,240 vectors per token. At head_dim=128 and fp16: 10,240 × 128 × 2 = 2.6 MB per token. A 4,000-token conversation needs 10.4 GB just for the KV cache — before the model weights even come into the picture.

Grouped Query Attention — shrinking the cache

Grouped Query Attention (GQA) keeps the multiple-head benefit for Queries (each head still looks for something different) but shares K and V heads across groups of Q heads. If 2 Q heads share 1 KV head, the cache is half the size. If 4 Q heads share 1 KV head, it's a quarter.

MHA — Multi-Head Attention
Q₁
K₁
V₁
Q₂
K₂
V₂
Q₃
K₃
V₃
Q₄
K₄
V₄

4 Q heads → 4 KV head pairs. Cache stores 4 K + 4 V per token per layer. Full expressiveness, maximum memory.

GQA — Grouped Query Attention (Qwen3)
Q₁
Q₂
Q₃
Q₄
K₁ V₁
shared by Q₁+Q₂
K₂ V₂
shared by Q₃+Q₄

4 Q heads → only 2 KV head pairs. Cache stores 2 K + 2 V per token per layer. Half the memory, nearly identical quality.

How nano-vLLM stores the KV cache

nano-vLLM pre-allocates a single large tensor in GPU HBM → Ch.01 at startup and uses it for every request, for the entire lifetime of the engine. This is a deliberate design choice — allocation is expensive, so do it once.

The tensor shape — six dimensions

Click each dimension below to understand what it represents and why it exists:

cache.py — kv_cache tensor
kv_cache = torch.zeros(
2
,
num_layers
,
num_blocks
,
block_size
,
num_kv_heads
,
head_dim
)
← Click any dimension to understand what it represents

Concrete numbers — Qwen3-1.7B on a 16 GB GPU

28
layers
256
tokens per block
8
KV heads (GQA)
64
head dimension
fp16
2 bytes per value
~0.45 MB
per block (K+V)
Why pre-allocate everything upfront? PyTorch memory allocation (calling torch.zeros() or torch.empty()) is not free — it requires CUDA memory management and can cause fragmentation over time. By allocating one giant tensor at startup and slicing into it using index arithmetic, nano-vLLM eliminates all runtime allocation overhead. Every request gets pre-existing memory slots, not freshly allocated ones. This also makes the memory footprint perfectly predictable: you always know exactly how much GPU memory the KV cache will use.

How K and V get written — the Triton kernel

Writing K and V vectors to the right locations in the cache needs to be fast — it happens on every token, in every layer, on every step. nano-vLLM uses a custom Triton kernel for this. Let's understand what that means.

What is a Triton kernel?

Triton is a Python-based language (from OpenAI) for writing GPU kernels. A kernel is a function that runs directly on the GPU — not through PyTorch's standard operators, but as raw parallel GPU code. Writing a custom kernel lets you do exactly what you need, nothing more, with maximum control over memory access patterns.

The Post Office Sorting Analogy Imagine a post office where 10,000 letters arrive simultaneously (one per token). Each letter has an address (its slot in the KV cache). A naive Python loop would process them one at a time. A Triton kernel is like having 10,000 sorters working in parallel — each one grabs exactly one letter, reads its destination, and drops it in the right slot simultaneously. No coordination needed, no waiting.

The slot mapping — how the kernel knows where to write

Before the kernel runs, the CPU block manager → Ch.02 computes a slot mapping: a list of integers, one per token being processed. Each integer is the physical slot index in the KV cache tensor where that token's K and V should be written.

For example, if three tokens are being processed and they belong to blocks 47, 47, and 12 with offsets 0, 1, and 0 respectively, the slot mapping would be [47×256+0, 47×256+1, 12×256+0] = [12032, 12033, 3072]. The kernel reads this list and writes each token's data to the specified slot — no Python-level indexing on the hot path.

cache.py — store_kvcache_kernel (Triton)
import triton
import triton.language as tl

@triton.jit
def store_kvcache_kernel(
    keys_ptr, values_ptr,   # pointers to the new K, V tensors from this forward pass
    kv_cache_ptr,           # pointer to the base of the pre-allocated KV cache tensor
    slot_mapping_ptr,       # pointer to the slot mapping list computed by the CPU block manager
    head_size: tl.constexpr,
    num_heads: tl.constexpr,
):
    # Each GPU thread handles one token — massively parallel
    token_idx = tl.program_id(0)

    # Look up which physical slot this token maps to
    slot = tl.load(slot_mapping_ptr + token_idx)

    # Compute byte offsets into the KV cache for this slot
    cache_offset = slot * num_heads * head_size

    # Write K and V for this token's slot — no Python overhead here
    for i in range(num_heads * head_size):
        tl.store(kv_cache_ptr + cache_offset + i,
                tl.load(keys_ptr + token_idx * num_heads * head_size + i))
        tl.store(kv_cache_ptr + cache_offset + i + /* V offset */,
                tl.load(values_ptr + token_idx * num_heads * head_size + i))

The key insight: this kernel launches one GPU thread per token. All tokens are written simultaneously in parallel. The CPU's slot mapping list is just integers — cheap to compute — but it lets the GPU write to non-contiguous memory locations correctly without any Python logic running during the write itself.

Why non-contiguous writes matter Tokens from different requests don't sit in adjacent slots — they may be scattered across many different blocks anywhere in the cache tensor (this is the whole point of PagedAttention → Ch.04). Without the slot mapping + Triton kernel approach, you'd need Python indexing to write each token — thousands of individual Python operations on the hot path. The Triton kernel collapses all of that into a single GPU kernel launch with zero Python overhead per token.

The KV cache block pool — interactive simulator

The cache is divided into fixed-size blocks. Each block holds block_size tokens (256 by default). Requests are allocated blocks from a free pool; when they grow they get more; when they finish their blocks return instantly. Experiment below — add requests, grow them, finish them, and watch what happens when memory fills up.

Block pool — 32 blocks, 256 tokens each
32
Blocks free
0
Blocks in use
0
Active requests
0%
Memory used
Free
Request A
Request B
Request C
Request D
Just freed
REQUEST A
not started
REQUEST B
not started
REQUEST C
not started
REQUEST D
not started
Event log

The full picture in code

Here is how the KV cache is initialised, and how it flows between the CPU block manager and the GPU attention layer during a forward pass.

cache.py — initialisation at engine startup
class KVCache:
    def __init__(self, config: ModelConfig, num_blocks: int):
        # One big tensor — allocated ONCE at startup, never again
        # Shape: [2, layers, blocks, block_size, kv_heads, head_dim]
        # dim 0 = 0 for Keys, 1 for Values
        self.data = torch.zeros(
            2,
            config.num_hidden_layers,   # 28 for Qwen3-1.7B
            num_blocks,                 # total blocks available (e.g. 512)
            config.block_size,          # 256 tokens per block
            config.num_key_value_heads, # 8 (GQA — not 16!)
            config.head_dim,            # 64
            dtype=torch.float16,
            device="cuda",
        )
layers/attention.py — reading during the forward pass
def forward(self, q, k, v, kv_cache, slot_mapping, block_table, is_prefill):
    # Write new K and V into the cache at the correct slots
    # slot_mapping comes from the CPU block manager
    store_kvcache_kernel[num_tokens,](k, v, kv_cache, slot_mapping, ...)

    if is_prefill:
        # Prefill: standard flash attention over the full input sequence
        # All tokens available as contiguous tensors
        out = flash_attn_varlen_func(q, k, v, ...)
    else:
        # Decode: flash attention that reads K, V from the paged cache
        # block_table maps logical blocks to physical slots in kv_cache
        out = flash_attn_with_kvcache(q, kv_cache[0], kv_cache[1],
                                       block_table=block_table, ...)
Two attention paths — same cache Prefill and decode use different flash attention functions, but they both write to and read from the same kv_cache.data tensor. flash_attn_varlen_func processes variable-length batches of prompt tokens all at once. flash_attn_with_kvcache handles one new decode token per sequence but needs to read K and V from scattered, non-contiguous blocks — which is exactly what the block_table enables. Both paths are covered in depth in the optimisations chapter → Ch.10.

What the KV cache enables at scale

Constant-time decode

Without the cache, decode step N costs O(N). With it, every decode step costs the same — one Triton write + one flash attention read. Generating token 500 is just as fast as generating token 1.

Enables PagedAttention

The block-based structure of nano-vLLM's KV cache is not incidental — it's the foundation PagedAttention is built on. Blocks are the unit of allocation, sharing, and eviction. → Ch.04

Enables prefix caching

Because the cache is block-structured and blocks are identified by their token content, blocks with the same tokens (system prompts, templates) can be shared across requests without recomputation. → Ch.07

The memory bottleneck

KV cache is why GPU memory is the primary constraint in LLM serving, not GPU compute. A single 70B model request with 128k context needs ~100 GB of KV cache alone. This is the bottleneck PagedAttention was designed to solve. → Ch.04

Things beginners get wrong about the KV cache

✗ Myth 1 — "The KV cache caches the model's weights"
Reality: The model weights are a completely separate concern — they live in GPU memory always and are never "cached" in the inference sense. The KV cache stores the intermediate activation vectors — specifically the Key and Value projections computed during each token's attention layer. If you cleared the KV cache the weights would be unaffected; you'd just have to recompute K and V from scratch again.
✗ Myth 2 — "A bigger KV cache always means better performance"
Reality: A larger KV cache means more blocks, which means you can serve more concurrent requests or handle longer contexts before running out of memory. But it doesn't speed up individual token generation — that's limited by memory bandwidth (HBM read speed → Ch.01). More cache = more capacity, not more speed per token.
✗ Myth 3 — "The KV cache grows unboundedly during a session"
Reality: The KV cache is bounded by the pre-allocated tensor size. When a request finishes, its blocks are immediately returned to the free pool and can be reused by the next request. The total GPU memory used by the KV cache is fixed at startup — it never grows beyond what was allocated. What grows is the number of slots used within that fixed tensor, up to the max.

Quiz

Three questions to test your understanding of the KV cache. Wrong answers explain exactly why they're wrong.

1. Without a KV cache, generating a 100-token response requires how many K/V computations relative to generating a 10-token response?

2. Qwen3-1.7B uses 16 query heads but only 8 KV heads. What technique is this, and what does it achieve?

3. Why does nano-vLLM use a Triton kernel to write to the KV cache instead of standard PyTorch indexing?

What you now know

Chapter 03 — Summary

KV cache trades memory for compute. Store K and V once during prefill, read them for free on every decode step. Eliminates O(N²) recomputation, makes every decode step O(1).

Stored per token, per layer, per KV head. Every processed token writes one K and one V vector per layer. A 28-layer model with 500 tokens = 28,000 K vectors and 28,000 V vectors in the cache.

GQA shrinks the cache. Grouped Query Attention uses fewer KV heads than Q heads. Qwen3-1.7B's 8 KV heads vs 16 Q heads halves the cache size compared to standard MHA.

Pre-allocated once at startup. nano-vLLM allocates one large tensor in GPU HBM at startup. No runtime allocation, no fragmentation, perfectly predictable memory footprint.

Written via Triton kernel. A custom GPU kernel writes K and V using a slot mapping from the CPU — one thread per token in parallel, zero Python overhead on the hot path.

Foundation for everything ahead. The block structure of the KV cache enables PagedAttention (Ch.04), prefix caching (Ch.07), and is the primary reason GPU memory is the binding constraint in LLM serving.