An 11-chapter series on how large language model inference actually works — built around the nano-vLLM codebase, explained from zero for beginners and engineers alike.
Each chapter is self-contained. Read in order for the full picture, or jump straight to any concept you want to understand. Cross-references keep everything connected.
Understanding LLM inference means understanding five interacting systems simultaneously. This series builds each one from scratch, in the order they were designed.
Production vLLM is 100,000+ lines of C++, CUDA, and Python. Understanding it directly is hard. nano-vLLM reimplements the core ideas in ~1,200 lines of pure Python and Triton — readable in an afternoon, yet covering all the fundamental algorithms.
This series is for engineers curious about how LLMs work under the hood, ML practitioners who use LLM APIs but want to understand what's happening beneath generate(), and anyone who's heard "KV cache" or "PagedAttention" and wants a real explanation, not a handwave.
Basic Python literacy helps but isn't required — all code is annotated line by line. No prior ML, no linear algebra, no CUDA knowledge assumed. New terms are always defined before they're used, with a real-world analogy first.
Every chapter includes an opening real-world analogy, a full concept explanation from first principles, at least one interactive visual, annotated nano-vLLM source code, a "why it matters" callout, common misconceptions, a 3-question quiz with full feedback, and a key-takeaways grid.
All code examples come directly from GeeeekExplorer/nano-vllm on GitHub. The repo is MIT-licensed. We follow the actual implementation — not pseudocode, not simplified stand-ins. Every snippet you see runs.
The series covers everything that happens between you typing a message and seeing a response, from raw characters to autoregressive token generation.
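To make "from raw characters to autoregressive token generation" concrete, here is a toy sketch of that loop. It was written for this overview and is not nano-vLLM code: the character-level tokenizer, the `toy_next_token` rule, and the `EOS_ID` stop token are all hypothetical stand-ins for a real tokenizer, a real model forward pass, and a real end-of-sequence token.

```python
# Toy illustration of: text -> token ids -> generate one token at a time -> text.
# Every name here is hypothetical; a real engine replaces each piece.

EOS_ID = ord("z")  # assumed stop token for this toy example


def tokenize(text: str) -> list[int]:
    """Character-level stand-in for a tokenizer: one id per character."""
    return [ord(ch) for ch in text]


def detokenize(token_ids: list[int]) -> str:
    """Inverse of tokenize: map each id back to its character."""
    return "".join(chr(t) for t in token_ids)


def toy_next_token(token_ids: list[int]) -> int:
    """Stand-in for a model forward pass plus greedy sampling.

    A real model scores every vocabulary entry given the whole sequence.
    This toy rule only looks at the last token and returns the next
    lowercase letter, wrapping around after 'z'.
    """
    last = token_ids[-1]
    return ord("a") + (last - ord("a") + 1) % 26


def generate(prompt: str, max_new_tokens: int) -> str:
    """Autoregressive loop: each new token is appended and fed back in."""
    if not prompt:
        raise ValueError("prompt must be non-empty")
    token_ids = tokenize(prompt)
    for _ in range(max_new_tokens):
        next_id = toy_next_token(token_ids)
        token_ids.append(next_id)  # the output becomes part of the next input
        if next_id == EOS_ID:      # stop early at the end-of-sequence token
            break
    return detokenize(token_ids)


print(generate("ab", 3))  # abcde
print(generate("x", 5))   # xyz (stops early at the toy EOS token)
```

The shape to notice is the feedback loop: every generated token is appended to the sequence before the next step runs. In a real model each step depends on the whole sequence so far, which is why it is worth caching the work already done for earlier tokens instead of recomputing it.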