Chapter 02 of 11 · nano-vLLM Deep Dive

nano-vLLM Architecture

How 1,200 lines of Python are organized to run a complete LLM inference engine — every file, every responsibility, and how they all connect.

Why architecture matters

Before touching any concept — KV cache, PagedAttention, scheduling — you need to know where those things live in the codebase. Architecture is the map. Without it, you're reading code in the dark.

The Restaurant Kitchen Analogy

Think of nano-vLLM as a professional kitchen. There's a front-of-house (the waiter who takes orders, manages the queue, decides which table gets served next) and a back-of-house (the chefs doing the actual cooking on expensive equipment). The front-of-house never touches the stove. The back-of-house never talks to diners. Each has a single responsibility. nano-vLLM is designed the same way: a CPU control plane that manages requests and memory metadata, and a GPU data plane that executes the actual model computation. They communicate through a thin, well-defined interface. Neither one does the other's job.

This separation is not cosmetic — it's the key design decision that makes the entire engine fast. The CPU can make scheduling decisions without blocking the GPU, and the GPU can run compute without waiting for Python-level bookkeeping.

Every file, explained

The entire nano-vLLM codebase is ~1,200 lines across 11 files. The tree below lists every file, where it runs, and what it is responsible for.

nanovllm/
  llm.py              PUBLIC API     Entry point, the LLM class
  engine.py           ORCHESTRATOR   LLMEngine, ties everything together
  config.py           CPU            ModelConfig dataclass
  sampler.py          GPU            Token sampling logic
  cache.py            GPU            KV cache tensor + Triton kernel
  core/
    scheduler.py      CPU            Continuous batching scheduler
    block_manager.py  CPU            PagedAttention block allocation
    sequence.py       CPU            Sequence state machine
  models/
    qwen3.py          GPU            Full Qwen3 transformer
  layers/
    attention.py      GPU            Paged flash attention
    linear.py         GPU            Parallel linear layers

CPU control plane vs GPU data plane

The single most important architectural decision in nano-vLLM — and in production vLLM — is the strict separation between what runs on the CPU and what runs on the GPU. Understanding this division makes everything else in the codebase make sense.

CPU — control plane

Manages metadata only

  • Runs the scheduler — decides which requests get processed each step
  • Runs the block manager — tracks which GPU memory blocks are free, in use, or shared
  • Maintains the sequence state machine — tracks where each request is in its lifecycle
  • Computes slot mappings — tells the GPU exactly where to write each token's KV data
  • Never touches GPU tensors directly — only works with integers, lists, and dictionaries
  • Runs in regular Python, with no CUDA and no Triton

GPU — data plane

Executes all computation

  • Runs the model — all transformer layers, attention, feed-forward
  • Writes K and V vectors to the KV cache using a Triton kernel → Ch.03
  • Reads the KV cache during attention via flash attention kernels
  • Applies sampling to logits to pick the next token → Ch.08
  • Never makes scheduling decisions — just executes what the CPU tells it
  • Runs in PyTorch + Triton: all tensor operations

Why this separation is fast

If the CPU and GPU were tangled (if the scheduler needed to inspect GPU tensors, or if the model runner needed to make allocation decisions) they'd constantly block each other. Synchronisation is expensive: every time Python waits for a GPU kernel to finish, that's dead time. By keeping the CPU on metadata and the GPU on tensors, both can do their work with minimal synchronisation. The CPU prepares the next batch's metadata while the GPU is still executing the current batch.

What "metadata" actually means

When we say the CPU only manages metadata, we mean small, cheap Python data structures — not tensors:

Block IDs

Plain Python integers. Block 47, block 12, block 83. The CPU tracks which physical GPU memory blocks belong to which request — but never touches the actual memory at those locations.

Slot mappings

A list of integers: [slot_0, slot_1, ...]. Each entry says "token N goes into slot M of the KV cache". The CPU computes this mapping; the GPU executes the write.

Sequence state

Python enums: WAITING, PREFILL, DECODING, FINISHED. The scheduler reads and writes these states to decide what runs next. No tensors involved — just state machine transitions.
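
To make the slot mapping concrete, here is a minimal sketch of how a block table could be turned into flat KV-cache slot indices. The function name compute_slot_mapping, the block size of 4, and the flat "block_id * block_size + offset" layout are assumptions made for illustration; the real block size and layout are configuration and implementation choices.

```python
BLOCK_SIZE = 4  # assumed tokens per KV-cache block (illustrative value)


def compute_slot_mapping(block_table: list[int], num_tokens: int,
                         block_size: int = BLOCK_SIZE) -> list[int]:
    """Map each token position to a flat slot index in the KV cache."""
    slots = []
    for pos in range(num_tokens):
        block_id = block_table[pos // block_size]  # which physical block holds this token
        offset = pos % block_size                  # position inside that block
        slots.append(block_id * block_size + offset)
    return slots


block_table = [47, 12, 83]  # plain integers, exactly as the CPU stores them
slot_mapping = compute_slot_mapping(block_table, num_tokens=10)
print(slot_mapping)  # [188, 189, 190, 191, 48, 49, 50, 51, 332, 333]
```

Everything here is integer arithmetic on Python lists. No tensor is touched until the finished list is handed to the GPU.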

Four layers, one request

When you call llm.generate(), your request passes through four distinct architectural layers. Each layer has exactly one responsibility and hands off to the next.

L1  Public API Layer      llm.py
L2  Engine Layer          engine.py
L3  Control Plane (CPU)   scheduler.py · block_manager.py · sequence.py
L4  Data Plane (GPU)      qwen3.py · attention.py · linear.py · cache.py · sampler.py

One request's journey through the architecture

Let's trace a single generate() call through every layer from the moment you call it to the moment you receive text back. Each hand-off between layers is where the CPU/GPU boundary matters most.

Step 1 — API Layer (llm.py)
You call llm.generate(["Hello world"], params). The LLM class tokenises the prompt string into a list of integer token IDs → Ch.01, wraps it in a Sequence object with an initial state of WAITING, and adds it to the engine's request queue.
Step 2 — Engine Loop (engine.py)
The engine runs a loop: call scheduler.schedule() to get this step's batch, call model_runner.run() with that batch, collect output token IDs, update sequences. Repeat until all sequences are FINISHED. This loop is the heartbeat of the entire system.
Step 3 — Scheduler (core/scheduler.py)
The scheduler checks: is our sequence in WAITING? Does the block manager have enough free GPU memory blocks to fit its tokens? If yes, it transitions the sequence to PREFILL, allocates blocks via the block manager, and returns it as part of this step's batch. If there is not enough free memory, the sequence stays in WAITING.
Step 4 — Block Manager (core/block_manager.py)
The block manager pops blocks from its free list and assigns them to the sequence's block table, a Python list like [47, 12, 83]. It also computes the slot mapping: for each token position, the exact slot in the physical KV cache where that token's K and V should be written. This integer list is handed to the GPU.
Step 5 — Model Runner / GPU (qwen3.py + cache.py + attention.py)
The model runner takes the token IDs and slot mapping from the CPU. It runs the Qwen3 transformer: embedding → 28 attention+FFN layers → language model head → logits. During each attention layer, K and V for new tokens are written to the KV cache using the Triton kernel. Attention reads the paged KV cache using the block table.
Step 6 — Sampler (sampler.py)
The raw logits (one score per vocabulary token, roughly 150,000 for Qwen3) are passed to the sampler. Temperature scaling, top-k filtering, softmax, multinomial sampling: one token ID is selected. This ID goes back to the CPU.
Step 7 — Back to CPU → Loop or Finish
The engine appends the new token ID to the sequence. If it's an EOS token or we've hit max_tokens, the sequence transitions to FINISHED and the block manager returns its blocks to the free list. Otherwise the sequence transitions to DECODING and the loop runs again from Step 2 — this time processing only the one new token.
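
The sampling in Step 6 can be sketched in pure Python. This is a stand-in under stated assumptions, not the code in sampler.py: the real sampler runs on GPU tensors, while this version uses plain lists so it runs anywhere. The function name sample_token, its parameters, and the "temperature 0 means greedy" convention are hypothetical.

```python
import math
import random
from typing import Optional


def sample_token(logits: list[float], temperature: float = 1.0,
                 top_k: int = 0, rng: Optional[random.Random] = None) -> int:
    """Pick one token ID from raw logits (CPU stand-in for the GPU sampler)."""
    rng = rng or random.Random()
    if temperature <= 0:
        # Convention assumed here: temperature 0 means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # 1. Temperature scaling: below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]

    # 2. Top-k filtering: keep only the k highest-scoring token IDs.
    candidates = list(range(len(scaled)))
    if 0 < top_k < len(scaled):
        candidates = sorted(candidates, key=lambda i: scaled[i], reverse=True)[:top_k]

    # 3. Softmax over the survivors (subtract the max for numerical stability).
    peak = max(scaled[i] for i in candidates)
    exps = [math.exp(scaled[i] - peak) for i in candidates]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 4. Multinomial sampling: draw one token ID according to probs.
    return rng.choices(candidates, weights=probs, k=1)[0]


logits = [1.0, 5.0, 2.0, 0.5]
print(sample_token(logits, temperature=0.0))  # greedy: always index 1
print(sample_token(logits, temperature=0.8, top_k=2, rng=random.Random(0)))
```

With top_k=2 only token IDs 1 and 2 can ever be returned, because the other two are filtered out before the softmax.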

The engine loop in code

The engine's step() method is the core of the architecture — it calls the scheduler, runs the model, and updates sequences. Here it is, annotated:

engine.py — the core step() loop
def step(self) -> list[SequenceOutput]:
    # 1. Ask the CPU scheduler what to run this step.
    #    Returns prefill sequences + decode sequences as a batch.
    scheduler_output = self.scheduler.schedule()

    # 2. If nothing to run (all requests waiting for memory), return empty.
    if not scheduler_output.has_work():
        return []

    # 3. Hand the batch to the GPU model runner.
    #    This is where the transformer forward pass happens.
    #    Returns one sampled token ID per sequence in the batch.
    sampled_tokens = self.model_runner.run(scheduler_output)

    # 4. Update each sequence with its new token.
    #    The scheduler handles state transitions (prefill → decode,
    #    decode → finished) and block deallocation on completion.
    outputs = self.scheduler.update(sampled_tokens)

    return outputs  # list of (sequence_id, token_id, is_finished)

Why step() is so small

The engine's step function is intentionally thin: just four steps of real logic. All the complexity lives in the modules it delegates to: the scheduler handles batching and state, the model runner handles GPU compute, the block manager handles memory. This layering means you can read and understand each piece independently. That's the entire point of the architecture.
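
To watch the loop run without a GPU, here is a toy sketch that keeps the same schedule, run, update shape but replaces the scheduler and model runner with stubs. ToyEngine, EOS, MAX_TOKENS, and the fake run_model are all hypothetical stand-ins; only the control flow mirrors step().

```python
EOS = 0          # hypothetical end-of-sequence token ID
MAX_TOKENS = 8   # hypothetical per-request generation cap


class ToyEngine:
    def __init__(self):
        self.sequences = {}  # seq_id -> {"prompt_len", "output", "finished"}

    def add_request(self, prompt_tokens: list[int]) -> int:
        seq_id = len(self.sequences)
        self.sequences[seq_id] = {"prompt_len": len(prompt_tokens),
                                  "output": [], "finished": False}
        return seq_id

    def schedule(self) -> list[int]:
        # Stub scheduler: batch every unfinished sequence (no memory checks).
        return [sid for sid, seq in self.sequences.items() if not seq["finished"]]

    def run_model(self, batch: list[int]) -> list[int]:
        # Stub model runner: emits as many tokens as the prompt had, then EOS.
        sampled = []
        for sid in batch:
            seq = self.sequences[sid]
            n = len(seq["output"])
            sampled.append(EOS if n >= seq["prompt_len"] else 100 + n)
        return sampled

    def step(self) -> list[tuple[int, int, bool]]:
        batch = self.schedule()
        if not batch:
            return []
        sampled = self.run_model(batch)
        outputs = []
        for sid, token in zip(batch, sampled):
            seq = self.sequences[sid]
            seq["output"].append(token)
            seq["finished"] = token == EOS or len(seq["output"]) >= MAX_TOKENS
            outputs.append((sid, token, seq["finished"]))
        return outputs


engine = ToyEngine()
a = engine.add_request([11, 12])
b = engine.add_request([21, 22, 23, 24])

batch_sizes = []
while True:
    outputs = engine.step()
    if not outputs:
        break
    batch_sizes.append(len(outputs))

print(batch_sizes)                    # [2, 2, 2, 1, 1]
print(engine.sequences[a]["output"])  # [100, 101, 0]
```

The shrinking batch sizes show the short request leaving the batch as soon as it finishes while the longer one keeps decoding.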

The Sequence object — the unit of work

Every request is represented as a Sequence object. It's the single piece of state that flows between the CPU control plane and GPU data plane:

core/sequence.py — what a Sequence holds
class Sequence:
    def __init__(self, prompt_tokens: list[int], params: SamplingParams):
        self.tokens = prompt_tokens    # all token IDs so far (grows with each decode step)
        self.params = params           # temperature, top_k, max_tokens etc.

        # CPU metadata — managed by the block manager
        self.block_table: list[int] = []   # [physical_block_id, ...] for this sequence

        # State machine — managed by the scheduler
        self.status = SequenceStatus.WAITING   # WAITING → PREFILL → DECODING → FINISHED
        self.is_prefill = True               # True until first token is generated

        # Output accumulator
        self.output_tokens: list[int] = []  # generated token IDs (not including prompt)

The block_table is the bridge

Notice that block_table is just a Python list of integers: CPU-side metadata. But it controls exactly where on the GPU the KV cache data lives. The CPU computes the block table and, from it, the slot mapping. The GPU uses the slot mapping to write new K and V entries and the block table to read them back during attention. This is the interface between the two planes: a tiny list of numbers that carries enormous semantic weight.
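
The Sequence excerpt relies on two names it does not define, SequenceStatus and SamplingParams. A minimal sketch of what they could look like follows. The field defaults, the ALLOWED_TRANSITIONS table, and the transition helper are assumptions for illustration; only the four state names and the temperature, top_k, and max_tokens fields come from this chapter.

```python
from dataclasses import dataclass
from enum import Enum, auto


class SequenceStatus(Enum):
    WAITING = auto()
    PREFILL = auto()
    DECODING = auto()
    FINISHED = auto()


@dataclass
class SamplingParams:
    temperature: float = 1.0  # assumed default
    top_k: int = 0            # 0 means "no top-k filtering" in this sketch
    max_tokens: int = 64      # assumed default


# Transitions described in this chapter. PREFILL may go straight to FINISHED
# when the first generated token is already EOS.
ALLOWED_TRANSITIONS = {
    SequenceStatus.WAITING: {SequenceStatus.PREFILL},
    SequenceStatus.PREFILL: {SequenceStatus.DECODING, SequenceStatus.FINISHED},
    SequenceStatus.DECODING: {SequenceStatus.FINISHED},
    SequenceStatus.FINISHED: set(),
}


def transition(current: SequenceStatus, new: SequenceStatus) -> SequenceStatus:
    """Return the new status, or raise if the state machine forbids the move."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


status = SequenceStatus.WAITING
for nxt in (SequenceStatus.PREFILL, SequenceStatus.DECODING, SequenceStatus.FINISHED):
    status = transition(status, nxt)
print(status.name)  # FINISHED
```

Encoding the legal moves in one table keeps the scheduler honest: a sequence cannot jump from WAITING to DECODING without having had blocks allocated for prefill.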

What this architecture enables

The CPU/GPU separation and the four-layer design aren't just clean code — they're what makes the performance features in later chapters possible.

Enables PagedAttention

Because the block manager runs entirely on the CPU with plain Python data structures, it can make complex allocation decisions (free lists, reference counting, prefix caching) without ever stalling the GPU. → Ch.04
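
A free list with reference counting fits in a few lines of plain Python, which is why these decisions are cheap on the CPU. The sketch below is illustrative only: ToyBlockManager and its methods allocate, share, and release are hypothetical names, and the real block manager also handles details such as prefix hashing that are left out here.

```python
from collections import deque


class ToyBlockManager:
    def __init__(self, num_blocks: int):
        self.free = deque(range(num_blocks))  # IDs of unused physical blocks
        self.ref_count = {}                   # block_id -> number of sequences using it

    def can_allocate(self, n: int) -> bool:
        return len(self.free) >= n

    def allocate(self, n: int) -> list[int]:
        if not self.can_allocate(n):
            raise RuntimeError("not enough free blocks")
        blocks = [self.free.popleft() for _ in range(n)]
        for block_id in blocks:
            self.ref_count[block_id] = 1
        return blocks

    def share(self, blocks: list[int]) -> None:
        # A second sequence reuses existing blocks, e.g. a common prompt prefix.
        for block_id in blocks:
            self.ref_count[block_id] += 1

    def release(self, blocks: list[int]) -> None:
        # A block returns to the free list only when its last user lets go.
        for block_id in blocks:
            self.ref_count[block_id] -= 1
            if self.ref_count[block_id] == 0:
                del self.ref_count[block_id]
                self.free.append(block_id)


manager = ToyBlockManager(num_blocks=4)
table_a = manager.allocate(3)   # [0, 1, 2]
manager.share(table_a[:2])      # sequence B reuses blocks 0 and 1
manager.release(table_a)        # sequence A finishes
print(list(manager.free))       # [3, 2]  (blocks 0 and 1 are still held by B)
manager.release(table_a[:2])    # sequence B finishes
print(list(manager.free))       # [3, 2, 0, 1]
```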

Enables continuous batching

The scheduler can add or remove requests from the batch every single step — because it only manipulates metadata. Swapping a sequence in or out is just a list operation on the CPU, not a GPU reallocation. → Ch.05
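
The admission logic can be sketched as list operations gated by a free-block counter. This is a simplified assumption of how such a scheduler might look: ToyScheduler, its FIFO admission rule, and the block size of 4 are hypothetical, and the sketch ignores the extra blocks a sequence needs as it grows during decode.

```python
import math
from collections import deque

BLOCK_SIZE = 4  # assumed tokens per KV-cache block (illustrative value)


class ToyScheduler:
    def __init__(self, num_free_blocks: int):
        self.free_blocks = num_free_blocks
        self.waiting = deque()  # (seq_id, num_prompt_tokens), in arrival order
        self.running = {}       # seq_id -> number of blocks it holds

    def add(self, seq_id: str, num_prompt_tokens: int) -> None:
        self.waiting.append((seq_id, num_prompt_tokens))

    def schedule(self) -> tuple[list[str], list[str]]:
        decode = list(self.running)  # sequences admitted in earlier steps
        prefill = []
        while self.waiting:
            seq_id, num_tokens = self.waiting[0]
            needed = math.ceil(num_tokens / BLOCK_SIZE)
            if needed > self.free_blocks:
                break  # FIFO: if the head does not fit, nothing behind it is admitted
            self.waiting.popleft()
            self.free_blocks -= needed
            self.running[seq_id] = needed
            prefill.append(seq_id)
        return prefill, decode

    def finish(self, seq_id: str) -> None:
        self.free_blocks += self.running.pop(seq_id)  # blocks go back to the pool


sched = ToyScheduler(num_free_blocks=3)
sched.add("a", 8)   # needs 2 blocks
sched.add("b", 8)   # needs 2 blocks
sched.add("c", 4)   # needs 1 block

step1 = sched.schedule()
print(step1)        # (['a'], [])  b does not fit yet
step2 = sched.schedule()
print(step2)        # ([], ['a'])  a decodes, b still waits
sched.finish("a")
step3 = sched.schedule()
print(step3)        # (['b', 'c'], [])  freed blocks admit both
```

Admitting or retiring a request changes only a deque, a dict, and an integer, which is what lets the batch change at every step.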

Enables CUDA Graphs

The GPU data plane makes no decisions of its own from step to step: it just executes whatever the CPU tells it. This predictability lets the decode step be captured as a CUDA Graph and replayed with almost no Python overhead. → Ch.10

Enables tensor parallelism

Because the GPU data plane is cleanly isolated in models/ and layers/, adding tensor parallelism only required modifying the linear layers and attention. The CPU control plane needed zero changes. → Ch.09
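
The idea behind a parallel linear layer can be shown without any GPU: split the output features of a weight matrix across ranks, let each rank compute its slice, and concatenate the results. The sketch below simulates the ranks in a loop on plain Python lists. The function names and the list-of-rows weight layout are assumptions, and the concatenation stands in for the collective communication a real multi-GPU setup would need.

```python
def matvec(weight: list[list[int]], x: list[int]) -> list[int]:
    """Compute weight @ x, where weight is a list of output rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def parallel_linear(weight: list[list[int]], x: list[int], world_size: int) -> list[int]:
    """Shard output features across ranks, compute each shard, concatenate."""
    if len(weight) % world_size != 0:
        raise ValueError("output features must divide evenly across ranks")
    shard = len(weight) // world_size
    outputs = []
    for rank in range(world_size):
        local_weight = weight[rank * shard:(rank + 1) * shard]  # this rank's slice
        outputs.extend(matvec(local_weight, x))                 # stands in for a gather
    return outputs


weight = [[1, 2], [3, 4], [5, 6], [7, 8]]  # 4 output features, 2 input features
x = [1, 1]
print(matvec(weight, x))                          # [3, 7, 11, 15]
print(parallel_linear(weight, x, world_size=2))   # [3, 7, 11, 15]
```

The sharded result matches the single-device result, and nothing in the scheduler, block manager, or Sequence needs to know the split happened.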

Things beginners get wrong about LLM architecture

✗ Myth 1 — "The GPU manages its own memory"
Reality: The GPU has no concept of "requests" or "blocks". It just executes kernel operations on tensors. All memory management logic — the free list, block tables, slot mappings — runs on the CPU in pure Python. The GPU is told exactly where to read and write; it never decides on its own.
✗ Myth 2 — "More code = more features"
Reality: nano-vLLM deliberately stays at ~1,200 lines. Every feature in production vLLM — speculative decoding, quantization, beam search, multi-modal inputs — adds thousands of lines that obscure the core algorithms. nano-vLLM's constraint is a feature: the algorithmic skeleton is always visible, never buried. Less code teaches more.
✗ Myth 3 — "The engine runs one request at a time"
Reality: The engine's step() always processes a batch of sequences — some in prefill, some in decode, potentially dozens simultaneously. The scheduler builds this batch every step. This concurrent processing of multiple requests is what makes throughput high. A single request running alone is the worst case, not the design intent.

Quiz

Three questions on the architecture.

1. The block manager tracks which GPU memory blocks are free. Where does it run?

2. What does the engine's step() function return?

3. Why does nano-vLLM use a Sequence object to represent each request?

What you now know

Chapter 02 — Summary

CPU controls, GPU computes. The CPU control plane manages scheduling and memory metadata in pure Python. The GPU data plane executes the transformer. Neither does the other's job.

Four layers, one request. API → Engine → Control Plane → Data Plane. Each layer has one responsibility and hands off to the next through a well-defined interface.

Metadata stays on the CPU. Block IDs, slot mappings, and sequence state are plain Python integers and lists — cheap to compute, never touching GPU tensors.

The Sequence object is the bridge. Its block_table and tokens list connect CPU scheduling decisions to GPU execution — a tiny list of integers with huge semantic weight.

step() is the heartbeat. Every token generation is one call to step() — schedule, run, sample, update. The engine loop runs this until all sequences finish.

The architecture enables everything ahead. PagedAttention, continuous batching, CUDA Graphs, tensor parallelism — all are possible because the CPU/GPU boundary is clean and explicit.