Numbers lie — unless you know what they measure
"Our engine does 5,000 tokens per second!" — but at what batch size, on what hardware, with what prompt lengths, measured how? A benchmark number without context is meaningless, and the LLM serving world is full of misleading comparisons. This final chapter teaches you to measure performance honestly and read others' numbers critically.
The four numbers worth measuring
There is no single "speed" of an inference engine. There are several metrics, each answering a different question. A serious benchmark reports all of them.
Throughput — total tokens per second across all requests
The total number of output tokens the engine produces per second, summed across every concurrent request. This is the metric that matters for cost — higher throughput means more users served per GPU, lower cost per token. A batch server optimises for this. Measured in tokens/second (sometimes requests/second).
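To make the link between throughput and cost concrete, here is a minimal sketch of the arithmetic. The function name and the $2.00/hour GPU price are hypothetical, chosen only for illustration; substitute your own numbers.

```python
def cost_per_million_tokens(throughput_tok_s: float, gpu_cost_per_hour: float) -> float:
    """Dollars per one million output tokens at a sustained throughput."""
    if throughput_tok_s <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = throughput_tok_s * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical GPU price of $2.00/hour (an assumption, not a quoted rate)
for tput in (1_000, 5_000):
    cost = cost_per_million_tokens(tput, 2.0)
    print(f"{tput} tok/s -> ${cost:.3f} per 1M tokens")
```

Under this simple model, cost per token is inversely proportional to throughput: five times the throughput on the same GPU means one fifth of the cost per token.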
TTFT — Time To First Token
How long from sending a request until the first token appears (see Ch.06). Determined by prefill speed and queue waiting time. This is what users feel as "responsiveness". For interactive chat, low TTFT matters more than raw throughput.
TPOT — Time Per Output Token
The average time between successive generated tokens after the first (see Ch.06). Determines streaming speed: how fast text flows once it starts. A TPOT of 25 ms corresponds to 40 tokens/second, comfortably faster than human reading speed.
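Both latency metrics can be derived from the arrival timestamps of one request's tokens. The sketch below assumes you have recorded those timestamps yourself (for example from a streaming client); the function name and the sample trace are hypothetical.

```python
def ttft_and_tpot(request_sent: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, TPOT) in seconds from one request's token arrival times."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, 0.0  # TPOT is undefined for a single token; 0.0 is a placeholder
    # Average gap between successive tokens after the first
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Hypothetical trace: request sent at t=0, first token at 0.30 s, then one every 25 ms
times = [0.30 + 0.025 * i for i in range(5)]
ttft, tpot = ttft_and_tpot(0.0, times)
print(f"TTFT = {ttft * 1000:.0f} ms, TPOT = {tpot * 1000:.0f} ms, {1 / tpot:.0f} tok/s")
```

Note that TPOT deliberately excludes the first token, so a slow prefill shows up in TTFT and does not distort the streaming rate.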
Percentiles — p50, p95, p99 latency
Averages hide bad experiences. Percentiles tell the full story: p50 (median) is the typical case; p99 is the latency that 99% of requests stay under, so it marks off the slowest 1%, the unlucky requests that waited longest. A system with great average latency but terrible p99 means 1 in 100 users has an awful experience. Production systems are judged on p95 and p99, not just averages.
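A short sketch shows how an average can hide the tail. It uses the nearest-rank definition of a percentile (one of several common definitions; libraries that interpolate may give slightly different values), and the latency samples are invented for illustration.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones
latencies = [100.0] * 98 + [5000.0] * 2
mean = sum(latencies) / len(latencies)
print(f"mean={mean:.0f}  p50={percentile(latencies, 50):.0f}  "
      f"p95={percentile(latencies, 95):.0f}  p99={percentile(latencies, 99):.0f}")
```

In this made-up sample the mean, p50 and p95 all look acceptable, and only p99 exposes the two requests that took five seconds.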
Throughput vs latency — you can't max both
The single most important concept in inference benchmarking is the tension between throughput and latency, and it's controlled mainly by one knob: batch size — how many requests the engine processes simultaneously.
Large batch size → high throughput, worse latency
Processing many requests together makes excellent use of the GPU, especially in decode, where the weight read is shared across the whole batch (see Ch.06). Total tokens/second soars. But each individual request shares GPU time with many others, so any single request's tokens arrive a little slower. Great for cost, worse for the individual user's experience.
Small batch size → low latency, worse throughput
With few requests in flight, each gets a large share of the GPU — fast individual responses, low TPOT. But the GPU is underutilised: you're paying for a whole GPU to serve just a few users. Total throughput is low, cost per token is high. Great for a premium low-latency experience, expensive at scale.
As batch size grows, throughput and per-request latency move in opposite directions. There is no setting that maximises both; the right choice depends on what you're optimising for.
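The trade-off can be sketched with a toy cost model of a decode step. The assumption is that each step pays a fixed cost to read the weights once, plus a small cost per sequence in the batch. The constants (20 ms and 0.2 ms) are invented for illustration and are not measurements of any real GPU or engine.

```python
def decode_step_model(batch_size: int,
                      weight_read_ms: float = 20.0,
                      per_seq_ms: float = 0.2) -> tuple[float, float]:
    """Toy model: one decode step = one shared weight read + a small per-sequence cost."""
    step_ms = weight_read_ms + per_seq_ms * batch_size  # also each request's TPOT
    throughput = batch_size / (step_ms / 1000)          # tokens/second across the batch
    return step_ms, throughput


for b in (1, 8, 64, 256):
    tpot_ms, tput = decode_step_model(b)
    print(f"batch={b:>3}  TPOT={tpot_ms:6.1f} ms  throughput={tput:8.0f} tok/s")
```

In this model every step emits one token per sequence, so throughput climbs steeply with batch size while each request's TPOT creeps upward. Real engines deviate from the straight line, but the direction of both curves is the same.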
How does nano-vLLM actually perform?
The remarkable headline: nano-vLLM, at ~1,200 lines, achieves throughput comparable to production vLLM on offline batch inference — sometimes matching or slightly exceeding it on simple benchmarks. This is genuinely impressive and worth understanding precisely, because the comparison is more nuanced than "nano-vLLM is as fast as vLLM".
On a clean offline benchmark — fixed set of prompts, processed in large batches — the two are close:
Numbers are illustrative of the published ballpark (exact figures depend on GPU, model, and prompt mix). The key takeaway: on offline batch throughput, a 1,200-line implementation is competitive with a 100,000-line production system. That's a testament to how much of the performance comes from a few core ideas — PagedAttention, continuous batching, and FlashAttention — all of which nano-vLLM implements.
The honest list of what's missing
A benchmark number never tells you what an engine can't do. nano-vLLM's competitive throughput comes partly from its simplicity — it skips entire categories of production features. Understanding these omissions is essential to reading the comparison honestly, and it's a fitting way to close the series: knowing the boundaries of what you've learned.
| Feature | nano-vLLM | Production vLLM |
|---|---|---|
| Core inference loop (PagedAttention, continuous batching, FlashAttention) | ✓ Full | ✓ Full |
| Quantization (int8, fp8, AWQ, GPTQ; smaller/faster models) | ✗ None (fp16 only) | ✓ Many formats |
| Speculative decoding (draft-and-verify to speed up decode) | ✗ None | ✓ Yes |
| Beam search (explore multiple generation paths) | ✗ Not implemented | ✓ Yes |
| Model coverage (supported architectures) | ~ Qwen3 focus | ✓ Dozens |
| LoRA adapters (serve fine-tuned variants efficiently) | ✗ None | ✓ Yes |
| Structured output (guaranteed JSON / grammar constraints) | ✗ None | ✓ Yes |
| Production serving (OpenAI-compatible API, metrics, multi-node) | ~ Minimal | ✓ Full stack |
How to benchmark it yourself
nano-vLLM ships with a bench.py script. Reading it shows exactly what a clean throughput benchmark looks like — and how to avoid the common measurement mistakes.
```python
import time
from nanovllm import LLM, SamplingParams

# 1. Load the model once, outside the timed region
llm = LLM("Qwen/Qwen3-0.6B", enforce_eager=False)  # graphs ON for real perf

# 2. Build a fixed, reproducible set of prompts
#    Same prompts every run = comparable numbers
prompts = ["Explain quantum computing"] * 256
params = SamplingParams(temperature=0.6, max_tokens=256)

# 3. WARMUP: run once untimed.
#    The first run pays CUDA graph capture + compile costs (Ch.10).
#    Timing it would unfairly penalise the engine.
llm.generate(prompts[:8], params)

# 4. TIMED REGION: measure only steady-state generation
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# 5. Report throughput as TOTAL output tokens / wall-clock time
total_tokens = sum(len(o.token_ids) for o in outputs)
print(f"Throughput: {total_tokens / elapsed:.0f} tok/s")
print(f"Requests: {len(prompts)}, total tokens: {total_tokens}")
```
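The same pattern (warm up, then time only steady-state work) can be wrapped in a small engine-agnostic helper. This is a sketch, not part of nano-vLLM: `measure_throughput` and `fake_generate` are hypothetical names, and the stand-in engine exists only so the example runs without a GPU. Repeating the timed run and taking the median is an extra step that reduces run-to-run noise.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_throughput(generate: Callable[[Sequence[str]], list],
                       prompts: Sequence[str],
                       warmup_prompts: int = 8,
                       repeats: int = 3) -> float:
    """Median output tokens/second over several timed runs, after one untimed warmup."""
    generate(prompts[:warmup_prompts])  # warmup: deliberately excluded from timing
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        outputs = generate(prompts)     # expected: one list of token ids per prompt
        elapsed = time.perf_counter() - start
        rates.append(sum(len(ids) for ids in outputs) / elapsed)
    return statistics.median(rates)


def fake_generate(prompts: Sequence[str]) -> list:
    """Stand-in engine (hypothetical): sleeps 10 ms, returns 16 token ids per prompt."""
    time.sleep(0.01)
    return [[0] * 16 for _ in prompts]


print(f"{measure_throughput(fake_generate, ['hi'] * 32):.0f} tok/s")
```

To benchmark a real engine, pass a function that calls it and returns the generated token ids for each prompt.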
Reading benchmarks like an engineer
Always ask "at what latency?"
A throughput number alone is incomplete. The honest question is "what throughput at a p99 latency of X ms?" — because throughput at unbounded latency is easy and useless. Real SLAs bound latency.
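One way to act on this question, as a minimal sketch: sweep the batch size, record throughput and p99 latency at each point, then pick the best point that still meets the latency bound. The sweep rows below are invented for illustration and are not measurements; `best_under_sla` is a hypothetical helper.

```python
from typing import Optional


def best_under_sla(sweep: list[dict], p99_limit_ms: float) -> Optional[dict]:
    """Highest-throughput operating point whose p99 latency meets the limit, or None."""
    ok = [pt for pt in sweep if pt["p99_ms"] <= p99_limit_ms]
    return max(ok, key=lambda pt: pt["tok_s"]) if ok else None


# Hypothetical sweep, one row per batch size (illustrative numbers only)
sweep = [
    {"batch": 1,   "tok_s": 45,   "p99_ms": 25},
    {"batch": 8,   "tok_s": 340,  "p99_ms": 30},
    {"batch": 64,  "tok_s": 2000, "p99_ms": 60},
    {"batch": 256, "tok_s": 4500, "p99_ms": 180},
]

for limit in (100, 40):
    print(f"p99 <= {limit} ms ->", best_under_sla(sweep, limit))
```

With a 100 ms bound, the largest batch is ruled out even though it has the highest raw throughput; tightening the bound to 40 ms pushes the choice down to a smaller batch still.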
Watch for cherry-picked conditions
Short prompts flatter prefill. Long generations flatter decode optimizations. Big batches flatter throughput. A benchmark that uses only favourable conditions tells you nothing about your workload.
Match the benchmark to your use case
Building interactive chat? TTFT and p99 latency matter most. Running offline batch jobs? Pure throughput. The "best" engine depends entirely on which metric maps to your actual need.
Features vs speed is a real trade
nano-vLLM's competitive throughput partly reflects what it omits. When comparing engines, account for what each does — raw speed on a simple benchmark isn't the whole picture if you need quantization or structured output.
Things beginners get wrong about benchmarks
Quiz
Three final questions. Wrong answers explain exactly where the reasoning broke down.
1. An engine reports 8,000 tok/s throughput. A competitor reports 2,000 tok/s but advertises "10× lower latency". How can both be true?
2. Why must a throughput benchmark include an untimed "warmup" run before measuring?
3. nano-vLLM matches vLLM's throughput on an offline benchmark despite being 1% of the code. What does this most accurately tell us?
What you now know
There is no single "speed". Throughput, TTFT, TPOT, and percentile latencies each answer a different question. A serious benchmark reports all of them, not one cherry-picked number.
Throughput and latency trade off. Batch size is the knob. Large batches maximise throughput but raise per-request latency; small batches minimise latency but waste GPU capacity. You cannot maximise both.
Percentiles reveal the tail. Averages hide bad experiences. p95 and p99 latency — the unlucky requests — are what production SLAs are built on, not the median.
nano-vLLM matches vLLM on core throughput. ~1,200 lines competes with ~100,000 on offline batch benchmarks — because throughput comes from a few core ideas this series covered, faithfully implemented.
The omissions are the point. No quantization, beam search, speculative decoding, or broad model support. nano-vLLM trades production features for legibility — that's what makes it teachable.
Benchmark honestly. Warm up before timing, fix the workload, measure total tokens over wall-clock time, and report the curve — not a single point chosen to flatter.
You've reached the end
Eleven chapters ago, "LLM inference" might have been a black box. Now you understand what happens from the moment a prompt arrives to the moment the final token streams back — and why every design decision was made the way it was.
🎓 The whole picture, in one breath
A prompt arrives and is tokenised (Ch.01). The engine wraps it in a Sequence and hands it to the scheduler (Ch.02, Ch.05), which uses continuous batching to keep the GPU full. The block manager allocates KV cache blocks via PagedAttention (Ch.03, Ch.04), reusing shared prefixes through prefix caching (Ch.07). The model runs prefill then decode (Ch.06), split across GPUs by tensor parallelism (Ch.09), accelerated by FlashAttention and CUDA Graphs (Ch.10). The sampler picks each token (Ch.08), and you measure it all with honest benchmarks (Ch.11). That's a complete LLM inference engine.
The best next step: clone nano-vLLM, open the source, and read it. With this series behind you, every line will make sense.