Chapter 07 of 11 · nano-vLLM Deep Dive
07

Prefix Caching

Why run prefill twice for the same tokens? Prefix caching identifies identical KV cache blocks by their content fingerprint and reuses them instantly — cutting TTFT to near-zero for repeated prompts.

← Ch06: Prefill vs Decode Next: Sampling →

The problem: redoing work you've already done

Every time a request arrives, the prefill phase → Ch.06 processes its input tokens — computing Key and Value vectors for each one and writing them to the KV cache → Ch.03. This is expensive. But what if two consecutive requests share the exact same first 2,000 tokens — a system prompt, say? Without prefix caching, you'd pay the full prefill cost twice, writing identical data into the cache twice. That's pure waste.

The Printed Exam Analogy Imagine a teacher who prints a 10-page exam for every student in a class of 100. The first 8 pages are identical for everyone — the instructions and the questions. Only the last 2 pages differ (different student names, different essay prompts). Without prefix caching, the teacher prints all 10 pages fresh for every student — 1,000 pages total, 800 of which are identical copies. With prefix caching, the teacher prints the first 8 pages once, photocopies them for everyone, and only prints the unique last 2 pages fresh per student. 800 pages of printing eliminated. That's prefix caching. The 8 shared pages are the system prompt. The unique 2 pages are each user's specific query. The photocopier is the KV cache block reuse mechanism.

Prefix caching is especially powerful in real deployments because shared content is extremely common. Every request in a chatbot shares the same system prompt. Every request in a RAG pipeline shares the same retrieved documents. Every turn in a multi-turn conversation shares all the prior turns. Across thousands of requests per second, the compute savings are enormous.

How prefix caching works — three building blocks

Prefix caching is built on top of PagedAttention's block structure → Ch.04. Because the KV cache is already divided into fixed-size blocks, and because each block holds exactly block_size tokens, it's natural to ask: "have I already computed the KV data for these exact tokens?" The answer is a hash lookup.

Building block 1 — Hash functions: turning content into a fingerprint

A hash function is a mathematical recipe that takes any input — a block of tokens, a document, a file — and produces a short, fixed-length string called a hash or fingerprint. Two critical properties make hashes useful for caching:

PROPERTY 1 — DETERMINISTIC

The same input always produces the same hash. Token sequence [849, 1495, 3283] will always hash to the same fingerprint — today, tomorrow, on any machine. This means if two requests contain identical token blocks, their hashes will be identical too, making the match detectable.

PROPERTY 2 — SENSITIVE

Change even one token and the hash changes completely. [849, 1495, 3284] produces an entirely different fingerprint from [849, 1495, 3283]. This means there are no false matches — if two blocks have the same hash, their token content is identical.

The Barcode Analogy A hash is like a barcode. The barcode for "Campbell's Tomato Soup 400g" is always the same barcode. Two cans with the same barcode are the same product — guaranteed. Change any character in the product name and the barcode changes completely. The supermarket's inventory system uses barcodes to avoid re-reading product descriptions manually. nano-vLLM uses hashes to avoid re-running prefill for token blocks it has already processed.

nano-vLLM uses a specific hash function called xxhash — chosen because it's extremely fast (computing a hash takes microseconds, far less than the prefill time it might save) and produces a 64-bit integer output that is compact enough to use as a dictionary key.

xxhash in action

The same token sequence always produces the same hash. Even a tiny change produces a completely different hash:

[849, 1495, 3283, 498, 30]
xxh64: 0xA3F2B8C1D4E7F091
[849, 1495, 3283, 498, 30]
xxh64: 0xA3F2B8C1D4E7F091 ✓ identical
[849, 1495, 3284, 498, 30]
xxh64: 0x1B9E4C2F7A053D86 ✗ completely different

Token 3283 changed to 3284 — one integer different — and the hash is completely unrecognisable. This sensitivity is what makes hash-based matching reliable. Any different content → different hash → no false cache hit.

Building block 2 — The hash-to-block map

The block manager → Ch.04 maintains a Python dictionary: hash_to_block. The key is the xxhash of a block's token content. The value is the physical block ID where those tokens' KV data already lives in the cache. When a new request arrives, the block manager hashes each of its first N blocks and checks this dictionary. If the hash is found, that's a cache hit — the KV data already exists and can be reused instantly. If not, that's a cache miss — prefill must compute that block fresh.

Building block 3 — Reference counting for shared blocks

When a block is shared between multiple requests via the cache, we can't just free it when one request finishes — other requests might still need it. The block manager uses a reference count (already introduced in Ch.04 → Ch.04 for copy-on-write) — an integer tracking how many requests are pointing to a block. The block is only returned to the free list when its reference count drops to zero. Until then, it stays in the cache, ready to serve future cache hits.

What happens when a request arrives — the full algorithm

When the block manager's allocate_with_prefix() is called for a new sequence, it runs through the request's token blocks one by one from the start, checking for cache hits before allocating any new blocks.

1

Divide the sequence into blocks

The sequence's tokens are divided into chunks of exactly block_size tokens each (256 by default). A 1,000-token prompt produces 3 full blocks (256 tokens each = 768 tokens) plus one partial block (232 remaining tokens). Only full blocks are eligible for prefix caching — partial blocks at the end are always computed fresh, because they'll grow as more tokens are generated.

2

Hash each full block and check the map — find the longest matching prefix

For each full block in order: compute the xxhash of its 256 token IDs. Look up the hash in hash_to_block. If it's there — cache hit! Reuse that block, skip prefill for these tokens, increment the block's reference count. Continue to the next block. Stop at the first cache miss — everything after the first miss is a fresh allocation. This stopping rule is important: block N can only be a hit if block N-1 was also a hit (because the KV computation for block N depended on block N-1 being present).

3

Allocate fresh blocks for the cache miss region

From the first cache miss block onwards, pop fresh blocks from the free list as normal. These blocks will be filled during prefill. After prefill completes and writes K/V data to these blocks, their hashes are registered in hash_to_block — making them available for future requests that share the same content.

4

Run prefill only for the cache miss tokens

The scheduler and model runner receive the sequence with seq.num_cached_tokens already set. The prefill pass skips the cached blocks entirely — only the miss region gets processed. From the model's perspective, it starts processing from token N (where N is the number of cached tokens) rather than token 0. The TTFT reduction is proportional to how many tokens were already cached.

Why the match must be a continuous prefix — not scattered blocks Transformer attention is causal — each token attends to all previous tokens in order. The KV cache for token 500 was computed with knowledge of tokens 0–499. If tokens 0–255 are cached but tokens 256–511 are not, you can use the cache for block 0, but you must recompute block 1 from scratch — because block 1's KV data was computed with the actual context of tokens 0–255. You can reuse a cached prefix from the start, but you can't skip a block in the middle and cache blocks after it.

See prefix caching in action — send multiple requests

The simulation below shows how the hash-to-block map fills up as requests arrive. Each request is divided into blocks. Green blocks are cache hits (reused instantly). Cyan blocks are cache misses (must run prefill). Yellow blocks are unique to this request. Watch how the second and third requests with the same system prompt pay progressively less prefill cost.

Prefix cache simulator
0
Requests sent
0
Cache hits
0
Cache misses
0%
Prefill saved
hash_to_block map (grows as new content is processed)
empty — no requests yet
Send a request
Cumulative prefill blocks saved vs always re-running from scratch
Cache HIT — prefill skipped
Cache MISS — prefill ran, now cached
Unique to this request

When prefix caching wins — and by how much

The savings from prefix caching depend on one thing: how many tokens are shared between requests. Here are the scenarios where it matters most.

TTFT reduction by use case (7B model, H100)

TTFT without prefix caching vs with it, for common deployment patterns:

Chatbot (2k sys prompt)
~700ms → ~90ms TTFT
RAG (4k context)
~1.4s → ~110ms TTFT
Multi-turn chat (10 turns)
~3.5s → ~750ms TTFT
Fresh request (no overlap)
0% — no shared tokens

Estimates based on ~0.35ms/token prefill on H100 with 7B model. The RAG example assumes the same retrieved documents are reused across many queries, which is common when the same knowledge base is queried repeatedly within a short window.

The hit rate question — when does the cache actually have what you need? Prefix caching only helps if the matching blocks are still in the cache when the next request arrives. If the cache is full and blocks have been evicted to make room for other requests, there's nothing to hit. nano-vLLM uses an LRU (Least Recently Used) eviction policy — when the free list is exhausted and no new blocks can be allocated, the block whose hash was looked up least recently gets evicted. This prioritises keeping recently-used shared content (like an active system prompt) over stale content.

LRU eviction — keeping the most useful blocks alive

LRU (Least Recently Used) is a simple but effective eviction strategy. The idea: if a block hasn't been accessed recently, it's unlikely to be accessed soon. Evict the block that was last used the longest time ago. For prefix caching, this means a system prompt that's being actively used stays in the cache. A system prompt from a conversation that ended an hour ago gets evicted first. The cache naturally prioritises the content that's most likely to produce hits.

The prefix cache in code

Prefix caching lives in core/block_manager.py → Ch.02. The key addition over basic PagedAttention → Ch.04 is the hash_to_block dictionary and the hash-check loop inside allocation.

core/block_manager.py — allocate_with_prefix()
import xxhash

def allocate_with_prefix(self, seq: Sequence) -> bool:
    seq.block_table = []
    seq.num_cached_tokens = 0  # how many tokens we can skip in prefill

    # How many FULL blocks does this sequence have?
    # Only full blocks are eligible for prefix caching.
    num_full_blocks = len(seq.tokens) // self.block_size

    # ── PHASE 1: check for prefix cache hits ──────────────────
    for i in range(num_full_blocks):
        # Extract the token IDs for this block
        start = i * self.block_size
        block_tokens = seq.tokens[start : start + self.block_size]

        # Hash the token content → fingerprint for this block
        # bytes() converts the list of ints to a byte string for hashing
        h = xxhash.xxh64(bytes(block_tokens)).hexdigest()

        if h in self.hash_to_block:
            # ✓ CACHE HIT — this block's KV data already exists
            blk = self.hash_to_block[h]
            self.ref_counts[blk] += 1    # one more request using this block
            seq.block_table.append(blk)
            seq.num_cached_tokens += self.block_size
        else:
            # ✗ CACHE MISS — stop checking prefix; allocate fresh from here
            break

    # ── PHASE 2: allocate fresh blocks for the miss region ────
    remaining_tokens = len(seq.tokens) - seq.num_cached_tokens
    blocks_needed = math.ceil(remaining_tokens / self.block_size)

    if len(self.free_blocks) < blocks_needed:
        return False  # signal OOM to scheduler

    for _ in range(blocks_needed):
        blk = self.free_blocks.pop()
        self.ref_counts[blk] = 1
        seq.block_table.append(blk)

    return True
core/block_manager.py — register_prefix() — called after prefill completes
def register_prefix(self, seq: Sequence) -> None:
    # After prefill writes KV data to the new blocks,
    # register their hashes so future requests can hit them.
    num_full_blocks = len(seq.tokens) // self.block_size

    for i in range(seq.num_cached_tokens // self.block_size, num_full_blocks):
        # Only register blocks that were freshly computed (not already cached)
        start = i * self.block_size
        block_tokens = seq.tokens[start : start + self.block_size]
        h = xxhash.xxh64(bytes(block_tokens)).hexdigest()

        blk = seq.block_table[i]
        self.hash_to_block[h] = blk   # register: this hash lives at this block
        # ref_count stays at 1 — owned by this sequence for now
        # If another request hits this hash later, ref_count will increment
The two-phase design is important allocate_with_prefix() runs before prefill — it figures out what's cached and sets up the block table. register_prefix() runs after prefill — it records the freshly computed blocks so future requests can reuse them. This clean separation means the model runner doesn't need to know anything about prefix caching. It just runs prefill on the tokens it receives, and the block manager handles the caching logic entirely on the CPU.

What prefix caching enables at scale

Near-zero TTFT for repeat prompts

A 2,000-token system prompt that would take ~700ms to prefill takes <5ms with a cache hit — the blocks are already in GPU memory, no compute needed. Users experience instant responses after the first request "warms" the cache.

Massive GPU compute savings

In production chatbot deployments, 60–80% of all input tokens are shared prefixes (system prompts, conversation history). Prefix caching eliminates prefill for all of them. On a busy server, this can reduce total GPU compute by half.

Higher throughput at no hardware cost

Fewer prefill operations per request means the GPU spends more time on decode — actually generating tokens. The same hardware serves more requests per second. This is free throughput improvement through smarter memory management.

Enables longer contexts affordably

Without prefix caching, long system prompts and RAG contexts make every request expensive. With it, the cost is paid once and amortised across all requests that share that context. 10,000-token system prompts become practically free after the first hit.

Things beginners get wrong about prefix caching

✗ Myth 1 — "Prefix caching works for any matching tokens, anywhere in the prompt"
Reality: Prefix caching only works for a continuous prefix from the very start of the prompt. If request A starts with "You are a helpful assistant. [2000 tokens]" and request B starts with the same string, the prefix cache hits. But if request B has the same 2,000 tokens buried in the middle of a different prompt, there is no cache hit. The match must start at token 0 and proceed continuously — because transformer attention is causal and the KV data for each block depends on all previous blocks being identical.
✗ Myth 2 — "A cache hit means the response is pre-computed"
Reality: A cache hit means the prefill phase for the matching tokens is skipped — the KV data for those tokens already exists. The model still has to run the full decode phase to generate the response. The cache hit saves the TTFT cost of processing the shared prefix, but the response itself is generated fresh every time. No response is ever "pre-computed" — only the context processing is cached.
✗ Myth 3 — "Prefix caching uses a separate cache from the KV cache"
Reality: Prefix caching is not a separate cache — it's a smarter allocation strategy on top of the same pre-allocated KV cache tensor → Ch.03. The physical blocks live in exactly the same GPU memory. The only addition is the hash_to_block Python dictionary on the CPU that maps token fingerprints to block IDs. There is no additional GPU memory usage — just a small CPU-side metadata structure that makes existing blocks reusable.

Quiz

Three questions on prefix caching. Wrong answers explain exactly where the reasoning broke down.

1. Request A has a 512-token system prompt followed by a 50-token user query. Request B arrives with the same system prompt but a completely different 50-token query. How many tokens does Request B need to prefill?

2. Why does nano-vLLM use xxhash specifically, rather than a simpler method like comparing token arrays directly?

3. A block's reference count is currently 3. Two of the three requests that shared it have now finished. What happens to the block?

What you now know

Chapter 07 — Summary

Prefix caching skips prefill for shared tokens. Identical token blocks across requests are identified by their xxhash fingerprint. A cache hit means the KV data already exists — no prefill needed for those tokens.

A hash function is a content fingerprint. Same tokens → same hash, always. One token different → completely different hash. xxhash makes block lookup O(1) instead of O(256) per block.

Only continuous prefixes hit the cache. Matching must start at token 0 and proceed without gaps. A shared middle section with different beginnings produces no cache hit, because the KV data depends on the full prior context.

Reference counting keeps shared blocks alive. A block used by 3 requests has ref_count=3. Only when all 3 finish and ref_count=0 is the block freed — preventing a finished request from breaking others still using that block.

TTFT improvement can be dramatic. A 2,000-token system prompt goes from ~700ms TTFT to ~90ms on a cache hit. In production deployments where 60–80% of tokens are shared prefixes, prefix caching can halve total GPU compute.

No extra GPU memory required. Prefix caching reuses existing KV cache blocks with a small CPU-side dictionary. The physical storage is the same pre-allocated tensor — only the allocation strategy is smarter.