The problem: redoing work you've already done
Every time a request arrives, the prefill phase → Ch.06 processes its input tokens — computing Key and Value vectors for each one and writing them to the KV cache → Ch.03. This is expensive. But what if two consecutive requests share the exact same first 2,000 tokens — a system prompt, say? Without prefix caching, you'd pay the full prefill cost twice, writing identical data into the cache twice. That's pure waste.
Prefix caching is especially powerful in real deployments because shared content is extremely common. Every request in a chatbot shares the same system prompt. Every request in a RAG pipeline shares the same retrieved documents. Every turn in a multi-turn conversation shares all the prior turns. Across thousands of requests per second, the compute savings are enormous.
How prefix caching works — three building blocks
Prefix caching is built on top of PagedAttention's block structure → Ch.04. Because the KV cache is already divided into fixed-size blocks, and because each block holds exactly block_size tokens, it's natural to ask: "have I already computed the KV data for these exact tokens?" The answer is a hash lookup.
Building block 1 — Hash functions: turning content into a fingerprint
A hash function is a mathematical recipe that takes any input — a block of tokens, a document, a file — and produces a short, fixed-length string called a hash or fingerprint. Two critical properties make hashes useful for caching:
The same input always produces the same hash. Token sequence [849, 1495, 3283] will always hash to the same fingerprint — today, tomorrow, on any machine. This means if two requests contain identical token blocks, their hashes will be identical too, making the match detectable.
Change even one token and the hash changes completely. [849, 1495, 3284] produces an entirely different fingerprint from [849, 1495, 3283]. This means there are no false matches — if two blocks have the same hash, their token content is identical.
nano-vLLM uses a specific hash function called xxhash — chosen because it's extremely fast (computing a hash takes microseconds, far less than the prefill time it might save) and produces a 64-bit integer output that is compact enough to use as a dictionary key.
The same token sequence always produces the same hash. Even a tiny change produces a completely different hash:
Token 3283 changed to 3284 — one integer different — and the hash is completely unrecognisable. This sensitivity is what makes hash-based matching reliable. Any different content → different hash → no false cache hit.
Building block 2 — The hash-to-block map
The block manager → Ch.04 maintains a Python dictionary: hash_to_block. The key is the xxhash of a block's token content. The value is the physical block ID where those tokens' KV data already lives in the cache. When a new request arrives, the block manager hashes each of its first N blocks and checks this dictionary. If the hash is found, that's a cache hit — the KV data already exists and can be reused instantly. If not, that's a cache miss — prefill must compute that block fresh.
Building block 3 — Reference counting for shared blocks
When a block is shared between multiple requests via the cache, we can't just free it when one request finishes — other requests might still need it. The block manager uses a reference count (already introduced in Ch.04 → Ch.04 for copy-on-write) — an integer tracking how many requests are pointing to a block. The block is only returned to the free list when its reference count drops to zero. Until then, it stays in the cache, ready to serve future cache hits.
What happens when a request arrives — the full algorithm
When the block manager's allocate_with_prefix() is called for a new sequence, it runs through the request's token blocks one by one from the start, checking for cache hits before allocating any new blocks.
Divide the sequence into blocks
The sequence's tokens are divided into chunks of exactly block_size tokens each (256 by default). A 1,000-token prompt produces 3 full blocks (256 tokens each = 768 tokens) plus one partial block (232 remaining tokens). Only full blocks are eligible for prefix caching — partial blocks at the end are always computed fresh, because they'll grow as more tokens are generated.
Hash each full block and check the map — find the longest matching prefix
For each full block in order: compute the xxhash of its 256 token IDs. Look up the hash in hash_to_block. If it's there — cache hit! Reuse that block, skip prefill for these tokens, increment the block's reference count. Continue to the next block. Stop at the first cache miss — everything after the first miss is a fresh allocation. This stopping rule is important: block N can only be a hit if block N-1 was also a hit (because the KV computation for block N depended on block N-1 being present).
Allocate fresh blocks for the cache miss region
From the first cache miss block onwards, pop fresh blocks from the free list as normal. These blocks will be filled during prefill. After prefill completes and writes K/V data to these blocks, their hashes are registered in hash_to_block — making them available for future requests that share the same content.
Run prefill only for the cache miss tokens
The scheduler and model runner receive the sequence with seq.num_cached_tokens already set. The prefill pass skips the cached blocks entirely — only the miss region gets processed. From the model's perspective, it starts processing from token N (where N is the number of cached tokens) rather than token 0. The TTFT reduction is proportional to how many tokens were already cached.
See prefix caching in action — send multiple requests
The simulation below shows how the hash-to-block map fills up as requests arrive. Each request is divided into blocks. Green blocks are cache hits (reused instantly). Cyan blocks are cache misses (must run prefill). Yellow blocks are unique to this request. Watch how the second and third requests with the same system prompt pay progressively less prefill cost.
When prefix caching wins — and by how much
The savings from prefix caching depend on one thing: how many tokens are shared between requests. Here are the scenarios where it matters most.
TTFT without prefix caching vs with it, for common deployment patterns:
Estimates based on ~0.35ms/token prefill on H100 with 7B model. The RAG example assumes the same retrieved documents are reused across many queries, which is common when the same knowledge base is queried repeatedly within a short window.
LRU eviction — keeping the most useful blocks alive
LRU (Least Recently Used) is a simple but effective eviction strategy. The idea: if a block hasn't been accessed recently, it's unlikely to be accessed soon. Evict the block that was last used the longest time ago. For prefix caching, this means a system prompt that's being actively used stays in the cache. A system prompt from a conversation that ended an hour ago gets evicted first. The cache naturally prioritises the content that's most likely to produce hits.
The prefix cache in code
Prefix caching lives in core/block_manager.py → Ch.02. The key addition over basic PagedAttention → Ch.04 is the hash_to_block dictionary and the hash-check loop inside allocation.
import xxhash def allocate_with_prefix(self, seq: Sequence) -> bool: seq.block_table = [] seq.num_cached_tokens = 0 # how many tokens we can skip in prefill # How many FULL blocks does this sequence have? # Only full blocks are eligible for prefix caching. num_full_blocks = len(seq.tokens) // self.block_size # ── PHASE 1: check for prefix cache hits ────────────────── for i in range(num_full_blocks): # Extract the token IDs for this block start = i * self.block_size block_tokens = seq.tokens[start : start + self.block_size] # Hash the token content → fingerprint for this block # bytes() converts the list of ints to a byte string for hashing h = xxhash.xxh64(bytes(block_tokens)).hexdigest() if h in self.hash_to_block: # ✓ CACHE HIT — this block's KV data already exists blk = self.hash_to_block[h] self.ref_counts[blk] += 1 # one more request using this block seq.block_table.append(blk) seq.num_cached_tokens += self.block_size else: # ✗ CACHE MISS — stop checking prefix; allocate fresh from here break # ── PHASE 2: allocate fresh blocks for the miss region ──── remaining_tokens = len(seq.tokens) - seq.num_cached_tokens blocks_needed = math.ceil(remaining_tokens / self.block_size) if len(self.free_blocks) < blocks_needed: return False # signal OOM to scheduler for _ in range(blocks_needed): blk = self.free_blocks.pop() self.ref_counts[blk] = 1 seq.block_table.append(blk) return True
def register_prefix(self, seq: Sequence) -> None: # After prefill writes KV data to the new blocks, # register their hashes so future requests can hit them. num_full_blocks = len(seq.tokens) // self.block_size for i in range(seq.num_cached_tokens // self.block_size, num_full_blocks): # Only register blocks that were freshly computed (not already cached) start = i * self.block_size block_tokens = seq.tokens[start : start + self.block_size] h = xxhash.xxh64(bytes(block_tokens)).hexdigest() blk = seq.block_table[i] self.hash_to_block[h] = blk # register: this hash lives at this block # ref_count stays at 1 — owned by this sequence for now # If another request hits this hash later, ref_count will increment
allocate_with_prefix() runs before prefill — it figures out what's cached and sets up the block table. register_prefix() runs after prefill — it records the freshly computed blocks so future requests can reuse them. This clean separation means the model runner doesn't need to know anything about prefix caching. It just runs prefill on the tokens it receives, and the block manager handles the caching logic entirely on the CPU.
What prefix caching enables at scale
Near-zero TTFT for repeat prompts
A 2,000-token system prompt that would take ~700ms to prefill takes <5ms with a cache hit — the blocks are already in GPU memory, no compute needed. Users experience instant responses after the first request "warms" the cache.
Massive GPU compute savings
In production chatbot deployments, 60–80% of all input tokens are shared prefixes (system prompts, conversation history). Prefix caching eliminates prefill for all of them. On a busy server, this can reduce total GPU compute by half.
Higher throughput at no hardware cost
Fewer prefill operations per request means the GPU spends more time on decode — actually generating tokens. The same hardware serves more requests per second. This is free throughput improvement through smarter memory management.
Enables longer contexts affordably
Without prefix caching, long system prompts and RAG contexts make every request expensive. With it, the cost is paid once and amortised across all requests that share that context. 10,000-token system prompts become practically free after the first hit.
Things beginners get wrong about prefix caching
hash_to_block Python dictionary on the CPU that maps token fingerprints to block IDs. There is no additional GPU memory usage — just a small CPU-side metadata structure that makes existing blocks reusable.Quiz
Three questions on prefix caching. Wrong answers explain exactly where the reasoning broke down.
1. Request A has a 512-token system prompt followed by a 50-token user query. Request B arrives with the same system prompt but a completely different 50-token query. How many tokens does Request B need to prefill?
2. Why does nano-vLLM use xxhash specifically, rather than a simpler method like comparing token arrays directly?
3. A block's reference count is currently 3. Two of the three requests that shared it have now finished. What happens to the block?
What you now know
Prefix caching skips prefill for shared tokens. Identical token blocks across requests are identified by their xxhash fingerprint. A cache hit means the KV data already exists — no prefill needed for those tokens.
A hash function is a content fingerprint. Same tokens → same hash, always. One token different → completely different hash. xxhash makes block lookup O(1) instead of O(256) per block.
Only continuous prefixes hit the cache. Matching must start at token 0 and proceed without gaps. A shared middle section with different beginnings produces no cache hit, because the KV data depends on the full prior context.
Reference counting keeps shared blocks alive. A block used by 3 requests has ref_count=3. Only when all 3 finish and ref_count=0 is the block freed — preventing a finished request from breaking others still using that block.
TTFT improvement can be dramatic. A 2,000-token system prompt goes from ~700ms TTFT to ~90ms on a cache hit. In production deployments where 60–80% of tokens are shared prefixes, prefix caching can halve total GPU compute.
No extra GPU memory required. Prefix caching reuses existing KV cache blocks with a small CPU-side dictionary. The physical storage is the same pre-allocated tensor — only the allocation strategy is smarter.