Why architecture matters
Before touching any concept — KV cache, PagedAttention, scheduling — you need to know where those things live in the codebase. Architecture is the map. Without it, you're reading code in the dark.
The codebase separates a CPU control plane from a GPU data plane. This separation is not cosmetic: it's the key design decision that makes the entire engine fast. The CPU can make scheduling decisions without blocking the GPU, and the GPU can run compute without waiting for Python-level bookkeeping.
Every file, explained
The entire nano-vLLM codebase is ~1,200 lines across 11 files.
CPU control plane vs GPU data plane
The single most important architectural decision in nano-vLLM — and in production vLLM — is the strict separation between what runs on the CPU and what runs on the GPU. Understanding this division makes everything else in the codebase make sense.
CPU control plane: manages metadata only
- Runs the scheduler — decides which requests get processed each step
- Runs the block manager — tracks which GPU memory blocks are free, in use, or shared
- Maintains the sequence state machine — tracks where each request is in its lifecycle
- Computes slot mappings — tells the GPU exactly where to write each token's KV data
- Never touches GPU tensors directly — only works with integers, lists, and dictionaries
- Runs in regular Python — no CUDA, no Triton
GPU data plane: executes all computation
- Runs the model — all transformer layers, attention, feed-forward
- Writes K and V vectors to the KV cache using a Triton kernel → Ch.03
- Reads the KV cache during attention via flash attention kernels
- Applies sampling to logits to pick the next token → Ch.08
- Never makes scheduling decisions — just executes what the CPU tells it
- Runs in PyTorch + Triton — all tensor operations
What "metadata" actually means
When we say the CPU only manages metadata, we mean small, cheap Python data structures — not tensors:
Block IDs
Plain Python integers. Block 47, block 12, block 83. The CPU tracks which physical GPU memory blocks belong to which request — but never touches the actual memory at those locations.
Slot mappings
A list of integers: [slot_0, slot_1, ...]. Each entry says "token N goes into slot M of the KV cache". The CPU computes this mapping; the GPU executes the write.
Sequence state
Python enums: WAITING, PREFILL, DECODING, FINISHED. The scheduler reads and writes these states to decide what runs next. No tensors involved — just state machine transitions.
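The slot mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a fixed block size and a layout where a slot index is block_id * block_size + offset within the block. The BLOCK_SIZE value and the compute_slot_mapping helper are hypothetical names for this example, not taken from the nano-vLLM source.

```python
BLOCK_SIZE = 4  # assumed number of tokens per KV-cache block (illustrative value)


def compute_slot_mapping(block_table: list[int], num_tokens: int,
                         start: int = 0) -> list[int]:
    """Map token positions start..num_tokens-1 to flat KV-cache slots.

    Hypothetical helper: assumes slot = block_id * BLOCK_SIZE + offset.
    """
    slots = []
    for pos in range(start, num_tokens):
        block_id = block_table[pos // BLOCK_SIZE]  # which physical block holds this token
        offset = pos % BLOCK_SIZE                  # position inside that block
        slots.append(block_id * BLOCK_SIZE + offset)
    return slots


# Prefill: a 6-token prompt spread over physical blocks 47 and 12.
prefill_slots = compute_slot_mapping([47, 12], num_tokens=6)

# Decode: only the newest token (position 6) needs a slot.
decode_slots = compute_slot_mapping([47, 12], num_tokens=7, start=6)
```

Everything here is integer arithmetic on Python lists. Under these assumptions, the only thing the GPU needs from the CPU for the write is the resulting list of slot indices.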
Four layers, one request
When you call llm.generate(), your request passes through four distinct architectural layers: API, Engine, Control Plane, and Data Plane. Each layer has exactly one responsibility and hands off to the next.
One request's journey through the architecture
Let's trace a single generate() call through every layer from the moment you call it to the moment you receive text back. Each hand-off between layers is where the CPU/GPU boundary matters most.
It starts with llm.generate(["Hello world"], params). The LLM class tokenises the prompt string into a list of integer token IDs → Ch.01, wraps it in a Sequence object with an initial state of WAITING, and adds it to the engine's request queue.
The engine then loops: call scheduler.schedule() to get this step's batch, call model_runner.run() with that batch, collect output token IDs, update sequences. Repeat until all sequences are FINISHED. This loop is the heartbeat of the entire system.
On the CPU, the control plane assigns each scheduled sequence its physical blocks, giving it a block table such as [47, 12, 83]. It also computes the slot mapping: for each token position, the exact slot in the physical KV cache where that token's K and V should be written. This integer list is handed to the GPU.
The engine loop in code
The engine's step() method is the core of the architecture — it calls the scheduler, runs the model, and updates sequences. Here it is, annotated:
def step(self) -> list[SequenceOutput]:
    # 1. Ask the CPU scheduler what to run this step.
    #    Returns prefill sequences + decode sequences as a batch.
    scheduler_output = self.scheduler.schedule()

    # 2. If nothing to run (all requests waiting for memory), return empty.
    if not scheduler_output.has_work():
        return []

    # 3. Hand the batch to the GPU model runner.
    #    This is where the transformer forward pass happens.
    #    Returns one sampled token ID per sequence in the batch.
    sampled_tokens = self.model_runner.run(scheduler_output)

    # 4. Update each sequence with its new token.
    #    The scheduler handles state transitions (prefill → decode,
    #    decode → finished) and block deallocation on completion.
    outputs = self.scheduler.update(sampled_tokens)
    return outputs  # list of (sequence_id, token_id, is_finished)
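To watch the schedule, run, update cycle execute without a GPU, here is a toy sketch. ToySeq, ToyScheduler, ToyModelRunner, and ToyEngine are hypothetical stand-ins: the runner "samples" the last token plus one instead of running a transformer, a sequence finishes after a fixed number of generated tokens, and update() takes the batch explicitly for simplicity. Only the overall shape of step() mirrors the method above.

```python
from dataclasses import dataclass, field


@dataclass
class ToySeq:
    seq_id: int
    tokens: list[int]                      # prompt plus generated token IDs
    max_new_tokens: int
    output_tokens: list[int] = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return len(self.output_tokens) >= self.max_new_tokens


class ToyScheduler:
    """CPU side: picks the batch and applies results. Metadata only."""

    def __init__(self) -> None:
        self.running: list[ToySeq] = []

    def add(self, seq: ToySeq) -> None:
        self.running.append(seq)

    def schedule(self) -> list[ToySeq]:
        return list(self.running)

    def update(self, batch: list[ToySeq], sampled: list[int]):
        outputs = []
        for seq, token in zip(batch, sampled):
            seq.tokens.append(token)
            seq.output_tokens.append(token)
            outputs.append((seq.seq_id, token, seq.finished))
        # Finished sequences leave the batch before the next step.
        self.running = [s for s in self.running if not s.finished]
        return outputs


class ToyModelRunner:
    """Stand-in for the GPU side: returns last token + 1 per sequence."""

    def run(self, batch: list[ToySeq]) -> list[int]:
        return [seq.tokens[-1] + 1 for seq in batch]


class ToyEngine:
    def __init__(self) -> None:
        self.scheduler = ToyScheduler()
        self.model_runner = ToyModelRunner()

    def step(self):
        batch = self.scheduler.schedule()
        if not batch:
            return []
        sampled = self.model_runner.run(batch)
        return self.scheduler.update(batch, sampled)


engine = ToyEngine()
engine.scheduler.add(ToySeq(seq_id=0, tokens=[10, 11], max_new_tokens=2))
engine.scheduler.add(ToySeq(seq_id=1, tokens=[20], max_new_tokens=3))

steps = []
while engine.scheduler.running:
    steps.append(engine.step())
```

Each entry in steps is the list of (sequence_id, token_id, is_finished) tuples for one call to step(). The batch shrinks from two sequences to one as soon as the first sequence finishes.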
The Sequence object — the unit of work
Every request is represented as a Sequence object. It's the single piece of state that flows between the CPU control plane and GPU data plane:
class Sequence:
    def __init__(self, prompt_tokens: list[int], params: SamplingParams):
        self.tokens = prompt_tokens  # all token IDs so far (grows with each decode step)
        self.params = params         # temperature, top_k, max_tokens etc.

        # CPU metadata, managed by the block manager
        self.block_table: list[int] = []  # [physical_block_id, ...] for this sequence

        # State machine, managed by the scheduler
        self.status = SequenceStatus.WAITING  # WAITING → PREFILL → DECODING → FINISHED
        self.is_prefill = True                # True until first token is generated

        # Output accumulator
        self.output_tokens: list[int] = []  # generated token IDs (not including prompt)
block_table is just a Python list of integers: CPU-side metadata. But it controls exactly where on the GPU the KV cache data lives. The CPU computes the block table; the GPU reads it through the slot mapping. This is the interface between the two planes, a tiny list of numbers that carries enormous semantic weight.
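The Sequence class above refers to SequenceStatus and SamplingParams without showing them. Here is a minimal sketch of what they might look like. The state names come from the state list earlier in this chapter; the default field values, the ALLOWED transition table, and the transition helper are assumptions for illustration, and a real scheduler may permit other transitions.

```python
from dataclasses import dataclass
from enum import Enum, auto


class SequenceStatus(Enum):
    WAITING = auto()
    PREFILL = auto()
    DECODING = auto()
    FINISHED = auto()


@dataclass
class SamplingParams:
    # Field names follow the comment in Sequence; defaults are illustrative.
    temperature: float = 1.0
    top_k: int = 50
    max_tokens: int = 16


# Assumed legal transitions for this sketch.
ALLOWED = {
    SequenceStatus.WAITING: {SequenceStatus.PREFILL},
    SequenceStatus.PREFILL: {SequenceStatus.DECODING, SequenceStatus.FINISHED},
    SequenceStatus.DECODING: {SequenceStatus.FINISHED},
    SequenceStatus.FINISHED: set(),
}


def transition(current: SequenceStatus, new: SequenceStatus) -> SequenceStatus:
    """Return the new status, or raise if the move is not in ALLOWED."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


status = SequenceStatus.WAITING
status = transition(status, SequenceStatus.PREFILL)
status = transition(status, SequenceStatus.DECODING)
status = transition(status, SequenceStatus.FINISHED)
```

Keeping the transition rules in one table makes the state machine easy to audit, and the whole thing is ordinary Python with no tensors involved.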
What this architecture enables
The CPU/GPU separation and the four-layer design aren't just clean code — they're what makes the performance features in later chapters possible.
Enables PagedAttention
Because the block manager runs entirely on the CPU with plain Python data structures, it can make complex allocation decisions (free lists, reference counting, prefix caching) without ever stalling the GPU. → Ch.04
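The free-list and reference-counting bookkeeping mentioned above fits in a few plain Python containers. The sketch below is a hypothetical ToyBlockManager, not the nano-vLLM implementation: it assumes blocks are identified by integers 0..num_blocks-1 and that a shared block is returned to the free list only when its last user releases it.

```python
from collections import deque


class ToyBlockManager:
    """Hypothetical CPU-side allocator: a free list plus reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free = deque(range(num_blocks))   # IDs of unused physical blocks
        self.ref_count: dict[int, int] = {}    # block ID -> number of users

    def allocate(self) -> int:
        if not self.free:
            raise RuntimeError("out of KV-cache blocks")
        block_id = self.free.popleft()
        self.ref_count[block_id] = 1
        return block_id

    def share(self, block_id: int) -> int:
        """Let another sequence reuse an existing block, e.g. a common prefix."""
        self.ref_count[block_id] += 1
        return block_id

    def release(self, block_id: int) -> None:
        self.ref_count[block_id] -= 1
        if self.ref_count[block_id] == 0:
            del self.ref_count[block_id]
            self.free.append(block_id)


manager = ToyBlockManager(num_blocks=4)
table_a = [manager.allocate(), manager.allocate()]
table_b = [manager.share(table_a[0]), manager.allocate()]

# Sequence A finishes: its private block is freed, the shared one survives.
for block_id in table_a:
    manager.release(block_id)
```

No GPU memory is touched at any point. The manager only moves integers between a deque and a dictionary, which is why these decisions are cheap to make on the CPU.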
Enables continuous batching
The scheduler can add or remove requests from the batch every single step — because it only manipulates metadata. Swapping a sequence in or out is just a list operation on the CPU, not a GPU reallocation. → Ch.05
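A small simulation shows how batch membership can change every step using nothing but list and dictionary operations. This is a sketch under stated assumptions: MAX_BATCH and run_continuous_batching are hypothetical names, each request is reduced to a count of decode steps it still needs (at least one), and memory limits are ignored.

```python
from collections import deque

MAX_BATCH = 2  # assumed cap on concurrently running sequences (illustrative)


def run_continuous_batching(requests: dict[str, int]) -> list[list[str]]:
    """Simulate per-step batch membership.

    `requests` maps a request name to the number of decode steps it needs.
    Returns the batch (list of request names) executed at each step.
    """
    waiting = deque(requests.items())
    running: dict[str, int] = {}
    history = []
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < MAX_BATCH:
            name, steps_left = waiting.popleft()
            running[name] = steps_left
        history.append(list(running))
        # "Run" one step, then drop finished requests immediately.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]
    return history


history = run_continuous_batching({"a": 1, "b": 3, "c": 2})
```

In this run, request "c" joins the batch on the step right after "a" finishes, without waiting for "b" to complete.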
Enables CUDA Graphs
The GPU data plane makes no decisions of its own from step to step: the KV cache persists, but what to run is always dictated by the CPU. This predictability lets the decode step be captured as a CUDA Graph and replayed with almost no Python overhead. → Ch.10
Enables tensor parallelism
Because the GPU data plane is cleanly isolated in models/ and layers/, adding tensor parallelism only required modifying the linear layers and attention. The CPU control plane needed zero changes. → Ch.09
Things beginners get wrong about LLM architecture
step() always processes a batch of sequences: some in prefill, some in decode, potentially dozens simultaneously. The scheduler builds this batch every step. This concurrent processing of multiple requests is what makes throughput high. A single request running alone is the worst case, not the design intent.
Quiz
Three questions on the architecture.
1. The block manager tracks which GPU memory blocks are free. Where does it run?
2. What does the engine's step() function return?
3. Why does nano-vLLM use a Sequence object to represent each request?
What you now know
CPU controls, GPU computes. The CPU control plane manages scheduling and memory metadata in pure Python. The GPU data plane executes the transformer. Neither does the other's job.
Four layers, one request. API → Engine → Control Plane → Data Plane. Each layer has one responsibility and hands off to the next through a well-defined interface.
Metadata stays on the CPU. Block IDs, slot mappings, and sequence state are plain Python integers and lists — cheap to compute, never touching GPU tensors.
The Sequence object is the bridge. Its block_table and tokens list connect CPU scheduling decisions to GPU execution — a tiny list of integers with huge semantic weight.
step() is the heartbeat. Every token generation is one call to step() — schedule, run, sample, update. The engine loop runs this until all sequences finish.
The architecture enables everything ahead. PagedAttention, continuous batching, CUDA Graphs, tensor parallelism — all are possible because the CPU/GPU boundary is clean and explicit.