nano-vLLM · Ch04 · PagedAttention

Section 1 — The Big Picture

The problem that PagedAttention solves

You now know that the KV cache stores Key and Value vectors for every token a model processes → Ch.03. But there's a major problem with how that memory has traditionally been managed — and it causes GPUs to waste up to 80% of available memory before running out of space. PagedAttention is the solution.

The Hotel Room Analogy Imagine a hotel that books rooms the old-fashioned way: when a guest checks in, the manager holds a block of rooms for the entire length of their expected stay — even if the guest ends up leaving early, or staying longer than expected. A guest who might stay 1–10 nights gets 10 rooms reserved upfront "just in case". Most of those rooms sit empty while other guests are turned away at the door because the hotel appears full. That wasted space is fragmentation. PagedAttention runs the hotel like a modern system: guests get one room at a time, only when they actually need it. The moment a guest checks out, their room goes back into the pool. No held space, no waste.

Section 2 — The Problem: Contiguous Allocation

Why the naive approach wastes 60–80% of GPU memory

Before PagedAttention, every LLM inference engine allocated KV cache memory the same way: reserve a contiguous (all-in-one-block) chunk of GPU memory for each request, sized to the maximum possible length that request could reach. This approach has three critical problems.

What does "contiguous" mean?

Think of GPU memory as a long row of numbered storage lockers. Contiguous allocation means a request must get a sequence of lockers with consecutive numbers — locker 100, 101, 102, 103, and so on, all in a row, no gaps. The request cannot use locker 50, locker 200, and locker 307 — they must all be adjacent.

This seems harmless, but it creates three serious problems at scale:

1

Reserved-but-empty memory (internal fragmentation)

A request allocated 2,048 token slots might only generate 47 tokens before finishing. The remaining 2,001 slots were reserved but never written to. That's 98% of the allocation sitting empty — but still held, preventing other requests from using it. Across a batch of requests with varying lengths, this waste compounds massively.

2

Gaps between allocations (external fragmentation)

As requests finish and their memory is released, the free space becomes scattered across many small gaps between other active allocations. A new large request might need 1,000 contiguous locker slots, but only 200-slot gaps are available — even if the total free memory is 5,000 slots. The memory exists but is too fragmented to use. The engine must reject the new request even though the GPU is far from full.

3

Unknown length problem

When a request arrives, you don't know how long the response will be. You must either over-allocate (wasteful) or under-allocate and reallocate later (slow and complex). There is no good answer — the contiguous model fundamentally can't handle variable-length responses efficiently.

Memory fragmentation — visualised

The bar below shows how GPU memory looks after several requests have run with contiguous allocation. Notice how much space is wasted:

CONTIGUOUS ALLOCATION — GPU memory (100 slots)

0

slots actually used

0

slots reserved but empty

0%

memory wasted

This is not a worst case — this is a typical snapshot during real inference. The vLLM paper (Kwon et al., 2023) measured that existing systems waste 60–80% of KV cache memory due to fragmentation before any request even runs out of space.

Section 3 — The Solution: Virtual Memory

The insight: borrow from operating systems

This problem is not new. Operating systems solved it decades ago — not for GPU memory, but for regular computer RAM. The solution is called virtual memory with paging. PagedAttention applies exactly this idea to the KV cache.

How virtual memory works in an OS — the full analogy

Your computer runs many programs simultaneously — a browser, a music app, a code editor. Each program believes it has a large, contiguous block of RAM all to itself. But that's an illusion — a virtual address space. Behind the scenes, the OS breaks RAM into small fixed-size chunks called pages (typically 4 KB), and allocates pages from anywhere in physical RAM, not necessarily adjacent. A program that "sees" addresses 1000–2000 might actually be using physical RAM at locations 5000, 12000, and 200 — scattered across physical memory.

The bridge between the program's virtual view and the physical reality is a page table — a lookup map maintained by the OS. For each virtual page, the page table says: "virtual page 3 is actually stored at physical frame 47". The program never sees physical addresses — it only sees virtual ones, and the page table translates on every access.

The Virtual Memory Analogy — apartment building Imagine a city (GPU memory) divided into small, identical apartment units (pages/blocks). Each tenant (request) is given a list of unit numbers — their "address book" (page table / block table). The units might be scattered across the city — unit 3 is in the north, unit 7 is downtown, unit 12 is in the east — but the tenant doesn't care. They just look up their address book and go directly to each unit as needed. New units are assigned from a central pool as needed, and returned immediately when the tenant moves out. No tenant ever holds a block of units "just in case" — they only hold what they're currently using.

PagedAttention — the same idea, applied to KV cache blocks

PagedAttention replaces the large contiguous KV cache reservation per request with a pool of small, fixed-size blocks. Each block holds exactly block_size tokens (256 in nano-vLLM). Instead of reserving 2,048 contiguous token slots upfront, a request gets blocks one at a time — only when it actually needs them. The blocks can be anywhere in GPU memory; they don't need to be adjacent. A block table (nano-vLLM's version of the OS page table) maps each request's logical token positions to the physical blocks where those tokens' K and V vectors are actually stored.

Block table — logical to physical mapping

Request A has 768 tokens (3 blocks of 256). Its block table says: logical block 0 → physical block 47, logical block 1 → physical block 12, logical block 2 → physical block 83. The blocks are scattered — but the request doesn't care.

Logical view (Request A)

0Tokens 0–255 (256 tokens)

1Tokens 256–511 (256 tokens)

2Tokens 512–767 (256 tokens)

→

block
table

Physical GPU memory blocks

Block 47 K+V data, tokens 0–255

Block 12 K+V data, tokens 256–511

Block 83 K+V data, tokens 512–767

Not adjacent in physical memory — scattered anywhere in the 512-block pool

The key insight — indirection eliminates waste By adding one layer of indirection (the block table lookup), we completely decouple the request's logical view from physical memory layout. Requests no longer compete for contiguous space. Any combination of free blocks can satisfy any request, regardless of where those blocks sit in GPU memory. This single change — borrowed directly from OS virtual memory — is what PagedAttention is.

Section 4 — The Three Mechanisms

How PagedAttention works — three interlocking pieces

PagedAttention has three components that work together. Understanding each one separately makes the whole picture clear.

Mechanism 1 — The free list

At startup, the engine pre-allocates the entire KV cache tensor in GPU memory → Ch.03 and divides it into N equal-sized blocks. All N block IDs (integers 0 to N-1) are placed in a free list — a simple Python list on the CPU. That's it. No complex data structures. The free list is the inventory of available blocks.

The free list is just a Python list of numbers free_blocks = [0, 1, 2, 3, ..., 511] — that's the entire free list at startup for a 512-block pool. When a request needs a block, the block manager calls free_blocks.pop() and gets block 511. When the request finishes, it calls free_blocks.append(511) and the block is back. The entire "memory allocator" is two Python list operations. No GPU code involved.

Mechanism 2 — The block table

Every request has a block table — a Python list that maps logical block positions to physical block IDs. When a request is first scheduled, the block manager pops one or more blocks from the free list and assigns them to the request's block table. As the request generates more tokens and fills up a block, a new block is popped from the free list and appended to the block table. The block table grows one entry at a time, on demand.

For example, after a request has processed 600 tokens: block_table = [47, 12, 83]. Logical block 0 (tokens 0–255) lives at physical block 47. Logical block 1 (tokens 256–511) lives at physical block 12. Logical block 2 (tokens 512–600, partially filled) lives at physical block 83.

Mechanism 3 — The slot mapping

The GPU needs to know exactly which slot in the KV cache tensor to write each token's K and V vectors to. The slot mapping is a list of integers — one per token being processed — that gives this exact physical slot location. Each slot is computed as: physical_block_id × block_size + offset_within_block.

For example: token 257 belongs to logical block 1, offset 1 within that block. From the block table, logical block 1 → physical block 12. So the slot = 12 × 256 + 1 = 3073. The Triton write kernel → Ch.03 receives this slot mapping from the CPU and uses it to write directly to the right location in the pre-allocated tensor — no Python logic on the GPU hot path.

Slot mapping computation

Token position

257

Logical block

1

⌊257 / 256⌋

Physical block

12

block_table[1]

Final slot

3073

12 × 256 + 1

This computation happens on the CPU for every token in the batch, producing the slot_mapping list. The GPU kernel receives this list and writes K and V data at slot 3073 in the pre-allocated cache tensor — without any address calculation on the GPU side.

Section 5 — Side-by-Side Comparison

Contiguous vs paged — interactive simulation

Use the buttons below to add requests to both systems simultaneously and watch how differently they use GPU memory. The paged system should serve far more requests with the same total memory.

❌ Contiguous Allocation (before PagedAttention)

✓ Paged Allocation (PagedAttention)

Section 6 — Bonus Feature

Copy-on-write block sharing

Because blocks are identified by integer IDs and assigned via a block table, PagedAttention unlocks a powerful bonus feature for free: copy-on-write block sharing. Two requests can literally point their block tables to the same physical block — sharing the data without copying it. The block manager tracks this with a reference count.

When this is useful — parallel decoding

Imagine you want to generate 4 different responses to the same prompt (to pick the best one). Before PagedAttention, you'd need 4 separate full copies of the prompt's KV cache — 4× the memory. With block sharing, all 4 requests share the exact same physical blocks for the prompt tokens. Their block tables all point to the same block IDs. The only new memory needed is for the tokens each response generates uniquely.

The photocopier analogy Copy-on-write is like sharing a document instead of photocopying it. If you and three colleagues all need to read the same 100-page report, you share one physical copy — as long as no one needs to write on it. The moment someone picks up a pen to annotate their copy, then you make them a personal copy. Before that moment, one copy serves everyone. In PagedAttention: the blocks a request only reads (prompt tokens) are shared. The blocks it writes to (newly generated tokens) get their own private copy.

The block manager implements this with a simple reference count per block — an integer that tracks how many requests are pointing to a block. When a request wants to write to a shared block, the block manager checks: is ref_count > 1? If yes — copy the block first, decrement the old block's ref_count, assign the new copy exclusively, then write. If ref_count == 1 — write directly. This is called copy-on-write.

Copy-on-write in action

BEFORE FIRST UNIQUE TOKEN

Req A block_table: [47, 12]

Req B block_table: [47, 12]

Req C block_table: [47, 12]

Block 47 ref_count = 3 ✓
Block 12 ref_count = 3 ✓
Memory used: 2 blocks

AFTER EACH GENERATES UNIQUE TOKEN

Req A block_table: [47, 12, 91]

Req B block_table: [47, 12, 34]

Req C block_table: [47, 12, 67]

Blocks 47, 12 still shared ✓
New blocks 91, 34, 67 private
Memory used: 5 blocks (not 6)

Section 7 — In nano-vLLM

The BlockManager in code

The entire PagedAttention memory management system in nano-vLLM is implemented in core/block_manager.py → Ch.02. It runs entirely on the CPU — no GPU code. Here are the key methods:

core/block_manager.py — initialisation

class BlockManager:
    def __init__(self, num_blocks: int, block_size: int = 256):
        self.block_size = block_size

        # The free list — all block IDs available at startup
        # This is the ENTIRE memory allocator: one Python list
        self.free_blocks: list[int] = list(range(num_blocks))

        # Reference counts — tracks how many requests share each block
        # Used for copy-on-write (prefix caching)
        self.ref_counts: dict[int, int] = defaultdict(int)

        # Hash → block ID map for prefix caching (Ch.07)
        self.hash_to_block: dict[str, int] = {}

core/block_manager.py — allocating blocks for a new request

def allocate(self, seq: Sequence) -> bool:
    # How many blocks does this sequence need right now?
    # (ceil division: 600 tokens / 256 = 3 blocks needed)
    num_blocks = math.ceil(len(seq.tokens) / self.block_size)

    # If not enough blocks in the free list, signal OOM to the scheduler
    if len(self.free_blocks) < num_blocks:
        return False  # scheduler will handle this (preempt or wait)

    # Pop blocks from the free list and assign to the sequence's block_table
    # Note: blocks may be from anywhere in the pool — NOT contiguous
    seq.block_table = [self.free_blocks.pop() for _ in range(num_blocks)]

    # Set reference count to 1 — this request owns these blocks exclusively
    for blk in seq.block_table:
        self.ref_counts[blk] = 1

    return True  # allocation succeeded

def free(self, seq: Sequence) -> None:
    for blk in seq.block_table:
        self.ref_counts[blk] -= 1
        # Only return to the free list when NO request references this block
        # (ref_count > 0 means another request is still sharing it)
        if self.ref_counts[blk] == 0:
            self.free_blocks.append(blk)  # immediately available for reuse
    seq.block_table = []  # clear the sequence's block table

def append_slot(self, seq: Sequence) -> None:
    # Called when a decode step fills the current last block
    # and the sequence needs one more block for the next token
    last_token_idx = len(seq.tokens) - 1
    if last_token_idx % self.block_size == 0:  # block boundary crossed
        new_block = self.free_blocks.pop()
        seq.block_table.append(new_block)
        self.ref_counts[new_block] = 1

Why allocate() returns bool instead of raising an exception When the free list is empty, allocate() returns False instead of crashing. This signals to the scheduler → Ch.05 that there isn't enough memory for this request right now. The scheduler can then choose to either keep the request waiting, or preempt (pause and evict) an existing lower-priority request to free up blocks. This graceful failure is what allows the scheduler to handle memory pressure without crashing the engine.

Section 8 — Why It Matters

What PagedAttention enables

Metric	Contiguous Allocation	PagedAttention
Memory wasted	60–80% wasted — reserved but unused slots	<4% wasted — only the last partial block per request
Concurrent requests	Limited by worst-case allocation — 10–20 requests	Limited by actual tokens used — 3–5× more requests
Fragmentation	Grows over time as requests finish and leave holes	Zero external fragmentation — any free block fits any request
Prefix sharing	Impossible — each request needs its own contiguous copy	Free — block tables can point to the same physical blocks
Allocation complexity	O(N) search for free contiguous region	O(1) — pop from free list

The real-world impact The original vLLM paper (Kwon et al., 2023) showed that PagedAttention increased LLM serving throughput by 2–4× compared to prior systems on identical hardware — not because it made individual tokens faster, but because it allowed far more requests to run concurrently on the same GPU memory. This is the difference between a GPU that serves 12 requests at once vs one that serves 50. Same hardware, same speed per token, 4× more users served.

Section 9 — Common Mistakes

Things beginners get wrong about PagedAttention

✗ Myth 1 — "PagedAttention makes individual token generation faster"

Reality: PagedAttention does not speed up the GPU computation per token. The attention kernel still does the same amount of floating-point math to generate each token — it just reads from scattered blocks instead of contiguous memory. What PagedAttention improves is throughput: how many requests you can serve simultaneously. If you're running a single request alone, PagedAttention provides essentially no speed improvement. The win is at scale — many concurrent users on the same GPU.

✗ Myth 2 — "The block table lookup adds significant overhead"

Reality: The block table is a tiny Python list of integers — looking up block_table[1] is a nanosecond operation. The slot mapping is computed once per batch on the CPU, then passed to the GPU as a tensor. The GPU kernel reads the slot directly — there is no runtime Python during the GPU's hot path. The overhead is negligible compared to the memory savings.

✗ Myth 3 — "Pages and blocks are the same as CUDA memory blocks"

Reality: PagedAttention blocks are a logical concept managed by nano-vLLM's Python block manager — they have nothing to do with CUDA thread blocks, GPU memory pages, or hardware-level memory management. A "block" in PagedAttention is simply a fixed-size slice of the pre-allocated KV cache tensor, tracked by an integer ID in a Python list. It exists entirely in software, not in hardware.

Section 10 — Check Your Understanding

Quiz

Three questions to verify you've understood the core ideas. Wrong answers explain exactly where the misunderstanding is.

1. A request has block_table = [47, 12, 83] and block_size = 256. Which physical block contains the KV data for token number 300?

2. Three requests share the same prompt. In PagedAttention with copy-on-write, how many physical blocks are needed for the prompt's KV data?

3. What happens when the free list is empty and a new request arrives?

Section 11 — Key Takeaways

What you now know

Chapter 04 — Summary

✓

Contiguous allocation wastes 60–80% of GPU memory. Reserving max-length slots upfront for uncertain-length responses causes massive internal and external fragmentation before the GPU is genuinely full.

✓

PagedAttention is virtual memory for the KV cache. The same OS idea — fixed pages, a page table, a free pool — applied to GPU memory. Blocks go anywhere; the block table provides the address translation.

✓

Three mechanisms work together. Free list (O(1) allocation from a Python list), block table (logical → physical mapping per request), slot mapping (exact write address for each token's K/V data).

✓

Copy-on-write enables free prefix sharing. Multiple requests can share identical physical blocks (same block table entries, ref_count > 1) until the moment one of them writes unique data — then and only then is a private copy made.

✓

The BlockManager is pure Python on the CPU. It only manipulates integer lists — no GPU tensors, no CUDA. The entire "allocator" is list.pop() and list.append(). Simplicity is a feature.

✓

2–4× throughput improvement in practice. Not faster per-token, but far more requests served concurrently on the same hardware. The same GPU that served 12 requests now serves 50 — without any new hardware.