nano-vLLM · LLM Inference Deep Dive

An 11-chapter series on how large language model inference actually works — built around the nano-vLLM codebase, explained from zero for beginners and engineers alike.

11 chapters · ~1,200 lines of code · 0 prerequisites · ~4 hrs total read time

What You'll Learn

Five systems, one inference engine

Understanding LLM inference means understanding five interacting systems simultaneously. This series builds each one from scratch, in the order they were designed.

01–02 Inference fundamentals: tokens, autoregressive generation, Q/K/V, HBM, the 6-step pipeline

03–04 KV cache & memory: physical layout, Triton kernels, PagedAttention, block tables

05–06 Scheduling & batching: continuous batching, prefill vs decode, preemption

07–08 Caching & sampling: prefix caching, xxhash blocks, greedy/top-k/temperature/top-p

09–11 Distributed compute & performance: tensor parallelism, FlashAttention, CUDA Graphs, benchmarks
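As a small taste of the sampling material in chapters 07–08, here is a minimal pure-Python sketch of how greedy, temperature, top-k, and top-p selection can fit together. It is an illustration written for this overview, not code from the nano-vLLM repository: the function name `sample_next_token` and its parameters are assumptions, and it works on plain lists rather than GPU tensors.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Pick one token id from raw scores (hypothetical helper, not nano-vLLM's API)."""
    rng = rng or random.Random()

    # Greedy decoding: temperature 0 means "always take the highest score".
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling, then a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose probability mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the surviving tokens; random.choices renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

For example, `sample_next_token([1.0, 3.0, 2.0], temperature=0)` returns `1`, the index of the largest score. Raising the temperature flattens the distribution, while `top_k` and `top_p` trim its unlikely tail before the random draw.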

About This Series

Why nano-vLLM?

Production vLLM is 100,000+ lines of C++, CUDA, and Python. Understanding it directly is hard. nano-vLLM reimplements the core ideas in ~1,200 lines of pure Python and Triton — readable in an afternoon, yet covering all the fundamental algorithms.

The Philosophy

nano-vLLM was built by Xingkai Yu, a DeepSeek engineer whose name appears on the DeepSeek-V3 and DeepSeek-R1 papers. His goal was the same as this series: expose the algorithmic skeleton of production LLM inference, without the engineering complexity that hides it. 12,700+ GitHub stars and a growing ecosystem of forks and extensions suggest it worked.

Who this is for

Engineers curious about how LLMs work under the hood. ML practitioners who use LLM APIs but want to understand what's happening beneath generate(). Anyone who's heard "KV cache" or "PagedAttention" and wants a real explanation, not a handwave.

Software engineers · ML practitioners · Curious learners

What you actually need

Basic Python literacy helps but isn't required — all code is annotated line by line. No prior ML, no linear algebra, no CUDA knowledge assumed. New terms are always defined before they're used, with a real-world analogy first.

Basic Python (helpful) · No ML required · No CUDA required

Each chapter includes

An opening real-world analogy. A full concept explanation from first principles. At least one interactive visual. Annotated nano-vLLM source code. A "why it matters" callout. A list of common misconceptions. A 3-question quiz with full feedback. A key-takeaways grid.

Analogies · Interactive demos · Quizzes · Real code

The codebase

All code examples come directly from GeeeekExplorer/nano-vllm on GitHub. The repo is MIT-licensed. We follow the actual implementation — not pseudocode, not simplified stand-ins. Every snippet you see runs.

MIT License · Real code, not pseudocode

Ready? Chapter 01 takes about 22 minutes.