LLM Inference Engineering · Study Series

nano-vLLM
Deep Dive

An 11-chapter series on how large language model inference actually works — built around the nano-vLLM codebase, explained from zero for beginners and engineers alike.

11 Chapters
~1,200 Lines of code
0 Prerequisites
~4 hrs Total read time
The Series

11 chapters — start anywhere

Each chapter is self-contained. Read in order for the full picture, or jump straight to any concept you want to understand. Cross-references keep everything connected.

Chapter 01 · ~22 min
What Is LLM Inference?
Tokens, autoregressive generation, training vs inference, Q/K/V attention, HBM vs SRAM, and the three core bottlenecks — built from scratch.
Chapter 02 · ~18 min
nano-vLLM Architecture
The full file structure in ~1,200 lines. The CPU control plane vs GPU data plane design, and how each module connects to the others.
Chapter 03 · ~20 min
KV Cache
Why K and V are cached, the physical tensor layout in GPU memory, and the Triton kernel that writes to it with zero Python overhead on the hot path.
Chapter 04 · ~22 min
PagedAttention
Virtual memory for the KV cache. How fixed-size blocks, block tables, and a free list eliminate fragmentation and multiply concurrent request capacity.
Chapter 05 · ~20 min
The Scheduler
Continuous batching from first principles. The waiting → running state machine, preemption, and why static batching wastes the GPU.
Chapter 06 · ~18 min
Prefill vs Decode
The two phases of inference side by side. Why prefill is compute-bound and decode is memory-bound, and what that means for batching strategy.
Chapter 07 · ~18 min
Prefix Caching
Skip redundant prefill for shared system prompts. xxhash content-addressable blocks, reference counting, and when this wins big.
Chapter 08 · ~16 min
Sampling Strategies
Greedy, temperature, top-k, top-p — what each does to the logit distribution, when to use each, and the sampler code in nano-vLLM.
Chapter 09 · ~20 min
Tensor Parallelism
Splitting model weights across multiple GPUs. ColumnParallel and RowParallel linear layers, all-reduce, and when to use tensor parallelism vs pipeline parallelism.
Chapter 10 · ~24 min
Optimization Stack
FlashAttention, CUDA Graphs, torch.compile, Triton KV kernels, GQA. Each technique, what bottleneck it targets, and how they layer together.
Chapter 11 · ~14 min
Benchmarks
nano-vLLM vs vLLM — throughput numbers, what nano-vLLM includes vs omits, and how to read inference benchmarks without being misled.
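
To give a flavor of what the chapters unpack, here is a minimal sketch of the four sampling strategies named in Chapter 08. It uses plain Python lists, and it is not the nano-vLLM sampler. Every function name below is an assumption made up for this illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Temperature divides the logits: below 1 sharpens, above 1 flattens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def _renormalize(probs, keep):
    # Zero out every token not in `keep`, then rescale so the rest sum to 1.
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    # Keep only the k most likely tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])

def top_p_filter(probs, p):
    # Keep the smallest set of most likely tokens whose total mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)

def greedy(logits):
    # Greedy decoding: always pick the highest-scoring token.
    return max(range(len(logits)), key=logits.__getitem__)

def sample(probs, rng=random):
    # Draw one token id according to the (possibly filtered) distribution.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [1.0, 3.0, 2.0]
probs = softmax(logits, temperature=0.8)
print(greedy(logits), sample(top_k_filter(probs, k=2)))
```

Greedy is deterministic, while the other three reshape the distribution before a random draw. Chapter 08 covers when each one is the right choice.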
What You'll Learn

Five systems, one inference engine

Understanding LLM inference means understanding five interacting systems at once. This series builds each one from scratch, in the order listed below.

01–02 · Inference fundamentals: tokens, autoregressive generation, Q/K/V, HBM, the 6-step pipeline
03–04 · KV cache & memory: physical layout, Triton kernels, PagedAttention, block tables
05–06 · Scheduling & batching: continuous batching, prefill vs decode, preemption
07–08 · Caching & sampling: prefix caching, xxhash blocks, greedy/top-k/temperature/top-p
09–11 · Distributed compute & performance: tensor parallelism, FlashAttention, CUDA Graphs, benchmarks
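
As a taste of the memory chapters, the block-table idea behind PagedAttention can be sketched in a few lines: each request holds a list of fixed-size physical blocks, and a free list hands blocks out on demand. This is a toy under stated assumptions, not nano-vLLM's block manager. `ToyBlockManager`, `BLOCK_SIZE`, and the method names are hypothetical.

```python
from collections import deque

BLOCK_SIZE = 4  # tokens per block; an illustrative value, not nano-vLLM's setting

class ToyBlockManager:
    def __init__(self, num_blocks):
        self.free_blocks = deque(range(num_blocks))  # the free list
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve a slot for one more token; return (physical block, slot)."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count % BLOCK_SIZE == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.popleft())
        self.num_tokens[request_id] = count + 1
        return table[-1], count % BLOCK_SIZE

    def free(self, request_id):
        """Return all of a finished request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

mgr = ToyBlockManager(num_blocks=4)
for _ in range(5):
    mgr.append_token("a")  # 5 tokens need 2 blocks
for _ in range(3):
    mgr.append_token("b")  # 3 tokens need 1 block
print(mgr.block_tables)    # {'a': [0, 1], 'b': [2]}
mgr.free("a")
print(list(mgr.free_blocks))  # [3, 0, 1]
```

A request never reserves more than one partly filled block, so the memory wasted per request is bounded by `BLOCK_SIZE - 1` token slots. Chapters 03 and 04 cover the real tensor layout and allocator.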
About This Series

Why nano-vLLM?

Production vLLM is 100,000+ lines of C++, CUDA, and Python. Understanding it directly is hard. nano-vLLM reimplements the core ideas in ~1,200 lines of pure Python and Triton — readable in an afternoon, yet covering all the fundamental algorithms.

The Philosophy

nano-vLLM was built by Xingkai Yu, a DeepSeek engineer whose name appears on the DeepSeek-V3 and DeepSeek-R1 papers. His goal was the same as this series: expose the algorithmic skeleton of production LLM inference, without the engineering complexity that hides it. 12,700+ GitHub stars and a growing ecosystem of forks and extensions suggest it worked.

Who this is for

Engineers curious about how LLMs work under the hood. ML practitioners who use LLM APIs but want to understand what's happening beneath generate(). Anyone who's heard "KV cache" or "PagedAttention" and wants a real explanation, not a handwave.

Software engineers · ML practitioners · Curious learners

What you actually need

Basic Python literacy helps but isn't required — all code is annotated line by line. No prior ML, no linear algebra, no CUDA knowledge assumed. New terms are always defined before they're used, with a real-world analogy first.

Basic Python (helpful) · No ML required · No CUDA required

Each chapter includes

An opening real-world analogy. Full concept explanation from first principles. At least one interactive visual. Annotated nano-vLLM source code. A "why it matters" callout. Common misconceptions. A 3-question quiz with full feedback. Key takeaways grid.

Analogies · Interactive demos · Quizzes · Real code

The codebase

All code examples come directly from GeeeekExplorer/nano-vllm on GitHub. The repo is MIT-licensed. We follow the actual implementation — not pseudocode, not simplified stand-ins. Every snippet you see runs.

MIT License · Real code, not pseudocode
Get started

Ready? Chapter 01 takes about 22 minutes.

It covers everything that happens between you typing a message and seeing a response — from raw characters to autoregressive token generation.
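
If you want a preview, the loop at the heart of autoregressive generation can be sketched with a toy stand-in for the model. Everything here is hypothetical and chosen for illustration: the six-word vocabulary, the hand-written score table, and the function names. A real model scores the next token from the whole sequence so far, not just the last token.

```python
# Toy illustration only: a hand-written lookup table stands in for a real model.
VOCAB = ["<eos>", "hello", "world", "how", "are", "you"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

# NEXT_TOKEN_SCORES[last_id] -> one score (logit) per vocabulary entry
NEXT_TOKEN_SCORES = {
    TOKEN_ID["hello"]: [0.1, 0.0, 2.0, 0.5, 0.0, 0.0],  # favors "world"
    TOKEN_ID["world"]: [0.2, 0.0, 0.0, 1.5, 0.0, 0.0],  # favors "how"
    TOKEN_ID["how"]:   [0.0, 0.0, 0.0, 0.0, 3.0, 0.1],  # favors "are"
    TOKEN_ID["are"]:   [0.0, 0.0, 0.0, 0.0, 0.0, 2.5],  # favors "you"
    TOKEN_ID["you"]:   [4.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # favors "<eos>"
}

def tokenize(text):
    # Raw characters -> token ids (whitespace splitting, for simplicity).
    return [TOKEN_ID[word] for word in text.lower().split()]

def detokenize(ids):
    return " ".join(VOCAB[i] for i in ids)

def generate(prompt, max_new_tokens=10):
    ids = tokenize(prompt)
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES[ids[-1]]  # "forward pass" of the toy model
        next_id = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
        if next_id == TOKEN_ID["<eos>"]:  # the model signals it is done
            break
        ids.append(next_id)  # feed the new token back in and repeat
    return detokenize(ids)

print(generate("hello"))  # hello world how are you
```

Each pass produces exactly one new token, which is appended and fed back in. That one-token-at-a-time loop is what the rest of the series works to make fast.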

Begin Chapter 01 →