The problem that PagedAttention solves
You now know that the KV cache stores Key and Value vectors for every token a model processes → Ch.03. But there's a major problem with how that memory has traditionally been managed — and it causes GPUs to waste up to 80% of available memory before running out of space. PagedAttention is the solution.
Why the naive approach wastes 60–80% of GPU memory
Before PagedAttention, every LLM inference engine allocated KV cache memory the same way: reserve a contiguous (all-in-one-block) chunk of GPU memory for each request, sized to the maximum possible length that request could reach. This approach has three critical problems.
What does "contiguous" mean?
Think of GPU memory as a long row of numbered storage lockers. Contiguous allocation means a request must get a sequence of lockers with consecutive numbers — locker 100, 101, 102, 103, and so on, all in a row, no gaps. The request cannot use locker 50, locker 200, and locker 307 — they must all be adjacent.
This seems harmless, but it creates three serious problems at scale:
Reserved-but-empty memory (internal fragmentation)
A request allocated 2,048 token slots might only generate 47 tokens before finishing. The remaining 2,001 slots were reserved but never written to. That's 98% of the allocation sitting empty — but still held, preventing other requests from using it. Across a batch of requests with varying lengths, this waste compounds massively.
Gaps between allocations (external fragmentation)
As requests finish and their memory is released, the free space becomes scattered across many small gaps between other active allocations. A new large request might need 1,000 contiguous locker slots, but only 200-slot gaps are available — even if the total free memory is 5,000 slots. The memory exists but is too fragmented to use. The engine must reject the new request even though the GPU is far from full.
Unknown length problem
When a request arrives, you don't know how long the response will be. You must either over-allocate (wasteful) or under-allocate and reallocate later (slow and complex). There is no good answer — the contiguous model fundamentally can't handle variable-length responses efficiently.
The bar below shows how GPU memory looks after several requests have run with contiguous allocation. Notice how much space is wasted:
This is not a worst case — this is a typical snapshot during real inference. The vLLM paper (Kwon et al., 2023) measured that existing systems waste 60–80% of KV cache memory due to fragmentation before any request even runs out of space.
The insight: borrow from operating systems
This problem is not new. Operating systems solved it decades ago — not for GPU memory, but for regular computer RAM. The solution is called virtual memory with paging. PagedAttention applies exactly this idea to the KV cache.
How virtual memory works in an OS — the full analogy
Your computer runs many programs simultaneously — a browser, a music app, a code editor. Each program believes it has a large, contiguous block of RAM all to itself. But that's an illusion — a virtual address space. Behind the scenes, the OS breaks RAM into small fixed-size chunks called pages (typically 4 KB), and allocates pages from anywhere in physical RAM, not necessarily adjacent. A program that "sees" addresses 1000–2000 might actually be using physical RAM at locations 5000, 12000, and 200 — scattered across physical memory.
The bridge between the program's virtual view and the physical reality is a page table — a lookup map maintained by the OS. For each virtual page, the page table says: "virtual page 3 is actually stored at physical frame 47". The program never sees physical addresses — it only sees virtual ones, and the page table translates on every access.
PagedAttention — the same idea, applied to KV cache blocks
PagedAttention replaces the large contiguous KV cache reservation per request with a pool of small, fixed-size blocks. Each block holds exactly block_size tokens (256 in nano-vLLM). Instead of reserving 2,048 contiguous token slots upfront, a request gets blocks one at a time — only when it actually needs them. The blocks can be anywhere in GPU memory; they don't need to be adjacent. A block table (nano-vLLM's version of the OS page table) maps each request's logical token positions to the physical blocks where those tokens' K and V vectors are actually stored.
Request A has 768 tokens (3 blocks of 256). Its block table says: logical block 0 → physical block 47, logical block 1 → physical block 12, logical block 2 → physical block 83. The blocks are scattered — but the request doesn't care.
table
How PagedAttention works — three interlocking pieces
PagedAttention has three components that work together. Understanding each one separately makes the whole picture clear.
Mechanism 1 — The free list
At startup, the engine pre-allocates the entire KV cache tensor in GPU memory → Ch.03 and divides it into N equal-sized blocks. All N block IDs (integers 0 to N-1) are placed in a free list — a simple Python list on the CPU. That's it. No complex data structures. The free list is the inventory of available blocks.
free_blocks = [0, 1, 2, 3, ..., 511] — that's the entire free list at startup for a 512-block pool. When a request needs a block, the block manager calls free_blocks.pop() and gets block 511. When the request finishes, it calls free_blocks.append(511) and the block is back. The entire "memory allocator" is two Python list operations. No GPU code involved.
Mechanism 2 — The block table
Every request has a block table — a Python list that maps logical block positions to physical block IDs. When a request is first scheduled, the block manager pops one or more blocks from the free list and assigns them to the request's block table. As the request generates more tokens and fills up a block, a new block is popped from the free list and appended to the block table. The block table grows one entry at a time, on demand.
For example, after a request has processed 600 tokens: block_table = [47, 12, 83]. Logical block 0 (tokens 0–255) lives at physical block 47. Logical block 1 (tokens 256–511) lives at physical block 12. Logical block 2 (tokens 512–600, partially filled) lives at physical block 83.
Mechanism 3 — The slot mapping
The GPU needs to know exactly which slot in the KV cache tensor to write each token's K and V vectors to. The slot mapping is a list of integers — one per token being processed — that gives this exact physical slot location. Each slot is computed as: physical_block_id × block_size + offset_within_block.
For example: token 257 belongs to logical block 1, offset 1 within that block. From the block table, logical block 1 → physical block 12. So the slot = 12 × 256 + 1 = 3073. The Triton write kernel → Ch.03 receives this slot mapping from the CPU and uses it to write directly to the right location in the pre-allocated tensor — no Python logic on the GPU hot path.
This computation happens on the CPU for every token in the batch, producing the slot_mapping list. The GPU kernel receives this list and writes K and V data at slot 3073 in the pre-allocated cache tensor — without any address calculation on the GPU side.
Contiguous vs paged — interactive simulation
Use the buttons below to add requests to both systems simultaneously and watch how differently they use GPU memory. The paged system should serve far more requests with the same total memory.
Copy-on-write block sharing
Because blocks are identified by integer IDs and assigned via a block table, PagedAttention unlocks a powerful bonus feature for free: copy-on-write block sharing. Two requests can literally point their block tables to the same physical block — sharing the data without copying it. The block manager tracks this with a reference count.
When this is useful — parallel decoding
Imagine you want to generate 4 different responses to the same prompt (to pick the best one). Before PagedAttention, you'd need 4 separate full copies of the prompt's KV cache — 4× the memory. With block sharing, all 4 requests share the exact same physical blocks for the prompt tokens. Their block tables all point to the same block IDs. The only new memory needed is for the tokens each response generates uniquely.
The block manager implements this with a simple reference count per block — an integer that tracks how many requests are pointing to a block. When a request wants to write to a shared block, the block manager checks: is ref_count > 1? If yes — copy the block first, decrement the old block's ref_count, assign the new copy exclusively, then write. If ref_count == 1 — write directly. This is called copy-on-write.
Block 12 ref_count = 3 ✓
Memory used: 2 blocks
New blocks 91, 34, 67 private
Memory used: 5 blocks (not 6)
The BlockManager in code
The entire PagedAttention memory management system in nano-vLLM is implemented in core/block_manager.py → Ch.02. It runs entirely on the CPU — no GPU code. Here are the key methods:
class BlockManager: def __init__(self, num_blocks: int, block_size: int = 256): self.block_size = block_size # The free list — all block IDs available at startup # This is the ENTIRE memory allocator: one Python list self.free_blocks: list[int] = list(range(num_blocks)) # Reference counts — tracks how many requests share each block # Used for copy-on-write (prefix caching) self.ref_counts: dict[int, int] = defaultdict(int) # Hash → block ID map for prefix caching (Ch.07) self.hash_to_block: dict[str, int] = {}
def allocate(self, seq: Sequence) -> bool: # How many blocks does this sequence need right now? # (ceil division: 600 tokens / 256 = 3 blocks needed) num_blocks = math.ceil(len(seq.tokens) / self.block_size) # If not enough blocks in the free list, signal OOM to the scheduler if len(self.free_blocks) < num_blocks: return False # scheduler will handle this (preempt or wait) # Pop blocks from the free list and assign to the sequence's block_table # Note: blocks may be from anywhere in the pool — NOT contiguous seq.block_table = [self.free_blocks.pop() for _ in range(num_blocks)] # Set reference count to 1 — this request owns these blocks exclusively for blk in seq.block_table: self.ref_counts[blk] = 1 return True # allocation succeeded def free(self, seq: Sequence) -> None: for blk in seq.block_table: self.ref_counts[blk] -= 1 # Only return to the free list when NO request references this block # (ref_count > 0 means another request is still sharing it) if self.ref_counts[blk] == 0: self.free_blocks.append(blk) # immediately available for reuse seq.block_table = [] # clear the sequence's block table def append_slot(self, seq: Sequence) -> None: # Called when a decode step fills the current last block # and the sequence needs one more block for the next token last_token_idx = len(seq.tokens) - 1 if last_token_idx % self.block_size == 0: # block boundary crossed new_block = self.free_blocks.pop() seq.block_table.append(new_block) self.ref_counts[new_block] = 1
allocate() returns False instead of crashing. This signals to the scheduler → Ch.05 that there isn't enough memory for this request right now. The scheduler can then choose to either keep the request waiting, or preempt (pause and evict) an existing lower-priority request to free up blocks. This graceful failure is what allows the scheduler to handle memory pressure without crashing the engine.
What PagedAttention enables
| Metric | Contiguous Allocation | PagedAttention |
|---|---|---|
| Memory wasted | 60–80% wasted — reserved but unused slots | <4% wasted — only the last partial block per request |
| Concurrent requests | Limited by worst-case allocation — 10–20 requests | Limited by actual tokens used — 3–5× more requests |
| Fragmentation | Grows over time as requests finish and leave holes | Zero external fragmentation — any free block fits any request |
| Prefix sharing | Impossible — each request needs its own contiguous copy | Free — block tables can point to the same physical blocks |
| Allocation complexity | O(N) search for free contiguous region | O(1) — pop from free list |
Things beginners get wrong about PagedAttention
block_table[1] is a nanosecond operation. The slot mapping is computed once per batch on the CPU, then passed to the GPU as a tensor. The GPU kernel reads the slot directly — there is no runtime Python during the GPU's hot path. The overhead is negligible compared to the memory savings.Quiz
Three questions to verify you've understood the core ideas. Wrong answers explain exactly where the misunderstanding is.
1. A request has block_table = [47, 12, 83] and block_size = 256. Which physical block contains the KV data for token number 300?
2. Three requests share the same prompt. In PagedAttention with copy-on-write, how many physical blocks are needed for the prompt's KV data?
3. What happens when the free list is empty and a new request arrives?
What you now know
Contiguous allocation wastes 60–80% of GPU memory. Reserving max-length slots upfront for uncertain-length responses causes massive internal and external fragmentation before the GPU is genuinely full.
PagedAttention is virtual memory for the KV cache. The same OS idea — fixed pages, a page table, a free pool — applied to GPU memory. Blocks go anywhere; the block table provides the address translation.
Three mechanisms work together. Free list (O(1) allocation from a Python list), block table (logical → physical mapping per request), slot mapping (exact write address for each token's K/V data).
Copy-on-write enables free prefix sharing. Multiple requests can share identical physical blocks (same block table entries, ref_count > 1) until the moment one of them writes unique data — then and only then is a private copy made.
The BlockManager is pure Python on the CPU. It only manipulates integer lists — no GPU tensors, no CUDA. The entire "allocator" is list.pop() and list.append(). Simplicity is a feature.
2–4× throughput improvement in practice. Not faster per-token, but far more requests served concurrently on the same hardware. The same GPU that served 12 requests now serves 50 — without any new hardware.