nano-vLLM · Ch11 · Benchmarks

Section 1 — The Big Picture

Numbers lie — unless you know what they measure

"Our engine does 5,000 tokens per second!" — but at what batch size, on what hardware, with what prompt lengths, measured how? A benchmark number without context is meaningless, and the LLM serving world is full of misleading comparisons. This final chapter teaches you to measure performance honestly and read others' numbers critically.

The Highway Analogy

Imagine measuring a highway's performance. You could measure throughput: how many cars pass a point per hour. Or latency: how long one specific car takes to drive the route. These are different and often in tension. Packing the highway with cars maximises throughput (cars/hour), but each individual car moves slower in the congestion (worse latency). An empty highway gives any single car the fastest trip (best latency) but moves very few cars in total (poor throughput). LLM inference has exactly this tension. A benchmark that reports only one number is hiding the trade-off. You must always ask: throughput or latency, and at what load?

Section 2 — The Metrics That Matter

The four numbers worth measuring

There is no single "speed" of an inference engine. There are several metrics, each answering a different question. A serious benchmark reports all of them.

1. Throughput: total tokens per second across all requests

The total number of output tokens the engine produces per second, summed across every concurrent request. This is the metric that matters for cost — higher throughput means more users served per GPU, lower cost per token. A batch server optimises for this. Measured in tokens/second (sometimes requests/second).

2. TTFT: Time To First Token

How long from sending a request until the first token appears → Ch.06. Determined by prefill speed and queue waiting time. This is what users feel as "responsiveness". For interactive chat, low TTFT matters more than raw throughput.

3. TPOT: Time Per Output Token

The average time between successive generated tokens after the first → Ch.06. Determines streaming speed: how fast text flows once it starts. A TPOT of 25 ms corresponds to 1000 / 25 = 40 tokens/second for that request, comfortably faster than human reading speed.

4. Percentiles: p50, p95, p99 latency

Averages hide bad experiences. Percentiles tell the full story: p50 (the median) is the typical case; p99 is the latency that 99% of requests come in under, so the slowest 1% of requests, the unlucky ones that waited longest, sit at or beyond it. If a system has great average latency but a terrible p99, 1 in 100 requests has an awful experience. Production systems are judged on p95 and p99, not just averages.
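The first three metrics can all be derived from per-request timestamps. The sketch below assumes a hypothetical `RequestTrace` record (it is not part of nano-vLLM) that stores when a request was sent and when each output token arrived; the timings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request record; all times are seconds on one shared clock."""
    sent_at: float
    token_times: list[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time To First Token: from sending the request to the first token's arrival."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time Per Output Token: mean gap between successive tokens after the first."""
    gaps = len(trace.token_times) - 1
    if gaps == 0:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / gaps


def throughput(traces: list[RequestTrace]) -> float:
    """Total output tokens divided by the wall-clock span covering all requests."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


# Two concurrent requests with made-up timings.
traces = [
    RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5]),
    RequestTrace(sent_at=0.0, token_times=[1.0, 1.25, 1.5, 1.75, 2.0]),
]

print("TTFT per request:", [ttft(t) for t in traces])
print("TPOT per request:", [tpot(t) for t in traces])
print("Throughput (tok/s):", throughput(traces))
```

Note that throughput is computed over the whole set of requests, while TTFT and TPOT are per-request values that would then be summarised with percentiles.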

The "worst day" analogy for percentiles

Imagine describing your commute. The average is 30 minutes, but that hides the reality. p50 (median) says "half my commutes are under 28 minutes". p99 says "1 in 100 commutes takes over 90 minutes" (the day of the accident). If you promised your boss you'd always arrive by a certain time, the average is useless; you'd plan around p99, the worst realistic case. Production LLM services make latency promises (SLAs) based on p95/p99 for exactly this reason: the typical case isn't what hurts users, the tail is.
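To make the gap between the average and the tail concrete, here is a minimal sketch using the nearest-rank percentile definition (one of several common definitions; libraries may interpolate instead). The commute times are invented for illustration.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up commute times in minutes: 98 ordinary days and 2 bad ones.
commutes = [28.0] * 98 + [90.0, 120.0]

avg = mean(commutes)
p50 = percentile(commutes, 50)
p99 = percentile(commutes, 99)

print(f"mean={avg:.2f}  p50={p50:.0f}  p99={p99:.0f}")
```

In this data the mean and the median sit within two minutes of each other, while p99 is more than three times larger. The same pattern applies to request latencies: the mean barely moves when a few requests are very slow.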

Section 3 — The Fundamental Trade-Off

Throughput vs latency — you can't max both

The single most important concept in inference benchmarking is the tension between throughput and latency, and it's controlled mainly by one knob: batch size — how many requests the engine processes simultaneously.

↑ Large batch size → high throughput, worse latency

Processing many requests together makes excellent use of the GPU — especially in decode, where the weight read is shared across the whole batch → Ch.06. Total tokens/second soars. But each individual request shares GPU time with many others, so any single request's tokens arrive a little slower. Great for cost, worse for the individual user's experience.

↓ Small batch size → low latency, worse throughput

With few requests in flight, each gets a large share of the GPU — fast individual responses, low TPOT. But the GPU is underutilised: you're paying for a whole GPU to serve just a few users. Total throughput is low, cost per token is high. Great for a premium low-latency experience, expensive at scale.
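A toy cost model shows why both effects come from the same knob. The sketch assumes each decode step costs a fixed time (the shared weight read) plus a small per-sequence cost; the constants `fixed_ms` and `per_seq_ms` are invented for illustration and are not measurements of any engine.

```python
def decode_step_ms(batch_size: int, fixed_ms: float = 20.0, per_seq_ms: float = 0.5) -> float:
    """Toy model: one decode step = fixed weight-read time + a small cost per sequence."""
    return fixed_ms + per_seq_ms * batch_size


def tradeoff(batch_size: int) -> tuple[float, float]:
    """Return (total throughput in tok/s, per-request TPOT in ms) under the toy model."""
    step_ms = decode_step_ms(batch_size)
    # Each decode step emits one token for every sequence in the batch.
    throughput = batch_size * 1000.0 / step_ms
    return throughput, step_ms


for bs in (1, 16, 64, 256):
    tput, tpot_ms = tradeoff(bs)
    print(f"batch={bs:>3}  throughput={tput:7.1f} tok/s  TPOT={tpot_ms:6.1f} ms")
```

Under these assumptions, growing the batch amortises the fixed cost over more tokens, so throughput climbs, while every request waits for a longer step, so TPOT climbs too. Real engines have more complicated curves, but the direction of both effects is the same.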

Interactive: the batch-size trade-off

[Interactive figure: a batch-size slider (shown at 16 concurrent requests) with two readouts, total throughput (tok/s) and per-request TPOT (ms), where a higher latency bar means slower for each user.]

As the batch size grows, total throughput and per-request latency both rise, so the number you want high and the number you want low improve and degrade together. There is no setting that maximises throughput and minimises latency at once; the right choice depends on what you're optimising for.

This is why "fastest engine" is a meaningless claim

An engine tuned for maximum throughput (huge batches) and one tuned for minimum latency (tiny batches) are optimising opposite ends of this curve. When someone says their engine is "fastest", always ask: throughput-fastest or latency-fastest? At what batch size? They're usually quoting whichever number flatters them. An honest benchmark shows the whole curve, meaning throughput at a range of latency targets, not a single cherry-picked point.
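Reporting "throughput at a range of latency targets" can be sketched as a simple selection over sweep results. The `Point` type and the numbers in `curve` are hypothetical; a real sweep would measure them at each batch size.

```python
from typing import NamedTuple, Optional


class Point(NamedTuple):
    batch_size: int
    throughput_tok_s: float
    p99_tpot_ms: float


def best_under_target(points: list[Point], max_p99_tpot_ms: float) -> Optional[Point]:
    """Highest-throughput point whose p99 TPOT meets the target, or None if none does."""
    eligible = [p for p in points if p.p99_tpot_ms <= max_p99_tpot_ms]
    return max(eligible, key=lambda p: p.throughput_tok_s, default=None)


# Made-up sweep results for one engine.
curve = [
    Point(1, 50.0, 21.0),
    Point(16, 570.0, 30.0),
    Point(64, 1230.0, 55.0),
    Point(256, 1730.0, 160.0),
]

for target_ms in (25.0, 60.0, 200.0):
    print(f"p99 TPOT <= {target_ms:5.1f} ms ->", best_under_target(curve, target_ms))
```

The same engine yields a different "best" number for each latency target, which is why a single headline figure says little without the target attached.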

Section 4 — nano-vLLM vs vLLM

How does nano-vLLM actually perform?

The remarkable headline: nano-vLLM, at ~1,200 lines, achieves throughput comparable to production vLLM on offline batch inference — sometimes matching or slightly exceeding it on simple benchmarks. This is genuinely impressive and worth understanding precisely, because the comparison is more nuanced than "nano-vLLM is as fast as vLLM".

Offline batched throughput (illustrative, single GPU)

On a clean offline benchmark — fixed set of prompts, processed in large batches — the two are close:

vLLM: ~1,320 tok/s

nano-vLLM: ~1,370 tok/s

Numbers are illustrative of the published ballpark (exact figures depend on GPU, model, and prompt mix). The key takeaway: on offline batch throughput, a 1,200-line implementation is competitive with a 100,000-line production system. That's a testament to how much of the performance comes from a few core ideas — PagedAttention, continuous batching, and FlashAttention — all of which nano-vLLM implements.
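As a quick sanity check on how close "close" is, the arithmetic below uses only the illustrative figures quoted above. Whether a gap of this size exceeds run-to-run variation is not something these two numbers can show; that would need repeated measurements.

```python
vllm_tok_s = 1320.0       # illustrative figure from the text
nano_vllm_tok_s = 1370.0  # illustrative figure from the text

# Relative throughput difference, with vLLM as the baseline.
relative_gap = (nano_vllm_tok_s - vllm_tok_s) / vllm_tok_s

# Approximate code-size ratio from the line counts quoted in the text.
code_ratio = 1_200 / 100_000

print(f"nano-vLLM vs vLLM throughput: {relative_gap:+.1%}")  # +3.8%
print(f"nano-vLLM code size relative to vLLM: {code_ratio:.1%}")  # 1.2%
```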

Why nano-vLLM can match vLLM despite being 1% of the code

Most of vLLM's 100,000+ lines aren't about raw throughput on simple workloads. They handle the long tail of production requirements: dozens of model architectures, quantization formats, structured output, LoRA adapters, multi-modal inputs, distributed serving across nodes, extensive APIs, and edge cases. The core inference loop, the part that determines throughput on a standard benchmark, comes down to a handful of ideas this series has covered. nano-vLLM implements those core ideas faithfully, so it competes on the core benchmark. It just doesn't do the other 95% of what production vLLM does.

Section 5 — What nano-vLLM Leaves Out

The honest list of what's missing

A benchmark number never tells you what an engine can't do. nano-vLLM's competitive throughput comes partly from its simplicity — it skips entire categories of production features. Understanding these omissions is essential to reading the comparison honestly, and it's a fitting way to close the series: knowing the boundaries of what you've learned.

Feature	nano-vLLM	Production vLLM
Core inference loop (PagedAttention, continuous batching, FlashAttention)	✓ Full	✓ Full
Quantization (int8, fp8, AWQ, GPTQ; smaller/faster models)	✗ None (fp16 only)	✓ Many formats
Speculative decoding (draft-and-verify to speed up decode)	✗ None	✓ Yes
Beam search (explore multiple generation paths)	✗ Not implemented	✓ Yes
Model coverage (supported architectures)	~ Qwen3 focus	✓ Dozens
LoRA adapters (serve fine-tuned variants efficiently)	✗ None	✓ Yes
Structured output (guaranteed JSON / grammar constraints)	✗ None	✓ Yes
Production serving (OpenAI-compatible API, metrics, multi-node)	~ Minimal	✓ Full stack

nano-vLLM is a teaching tool, not a production server

This is not a criticism; it's the entire point. nano-vLLM exists to make the core ideas legible. By omitting the production long tail, it keeps the codebase small enough to read in an afternoon. If you need to serve real traffic, use production vLLM. If you need to understand how production vLLM works, nano-vLLM, together with this series, is the clearest path there. The omissions are what make it teachable.

Section 6 — In nano-vLLM

How to benchmark it yourself

nano-vLLM ships with a bench.py script. Reading it shows exactly what a clean throughput benchmark looks like — and how to avoid the common measurement mistakes.

bench.py — the structure of an honest throughput benchmark

import os
import time
from nanovllm import LLM, SamplingParams

# 1. Load the model once, outside the timed region.
#    Assumption: the weights are already downloaded to this local directory
#    (nano-vLLM expects a local model path); adjust it to your own copy.
model_path = os.path.expanduser("~/huggingface/Qwen3-0.6B/")
llm = LLM(model_path, enforce_eager=False)  # graphs ON for real perf

# 2. Build a fixed, reproducible set of prompts.
#    Same prompts every run = comparable numbers.
prompts = ["Explain quantum computing"] * 256
#    ignore_eos=True makes every request generate exactly max_tokens,
#    so the token count does not vary with sampling from run to run.
params = SamplingParams(temperature=0.6, max_tokens=256, ignore_eos=True)

# 3. WARMUP: run once untimed.
#    The first run pays one-time setup costs such as graph capture and
#    kernel compilation (Ch.10). Timing it would unfairly penalise the engine.
llm.generate(prompts[:8], params)

# 4. TIMED REGION: measure only steady-state generation.
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# 5. Report throughput as TOTAL output tokens / wall-clock time.
#    generate() returns one dict per request with "text" and "token_ids".
total_tokens = sum(len(o["token_ids"]) for o in outputs)
print(f"Throughput: {total_tokens / elapsed:.0f} tok/s")
print(f"Requests: {len(prompts)}, total tokens: {total_tokens}")

The three benchmark mistakes this code avoids

(1) Timing the warmup: the first run includes one-time graph capture and compilation → Ch.10; including it understates real performance. (2) Varying the workload: using different prompts each run makes numbers incomparable; fixed prompts are essential. (3) Measuring the wrong thing: throughput must be total output tokens over wall-clock time, not a per-request average that hides batching effects. Get any of these wrong and your benchmark is fiction.
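The same structure can be factored into a small engine-agnostic harness. This is a sketch, not part of nano-vLLM: `measure_throughput` and `fake_generate` are hypothetical names, and repeating the timed run several times is an addition that the script above does not do. The stand-in engine lets the harness run without a GPU.

```python
import time
from typing import Callable, Sequence


def measure_throughput(
    generate: Callable[[Sequence[str]], list[list[int]]],
    prompts: Sequence[str],
    warmup_prompts: Sequence[str],
    repeats: int = 3,
) -> list[float]:
    """Return tok/s for each timed run; `generate` returns one token-id list per prompt."""
    generate(warmup_prompts)  # untimed: absorbs one-time setup costs
    results = []
    for _ in range(repeats):
        start = time.perf_counter()
        outputs = generate(prompts)  # same fixed workload on every run
        elapsed = time.perf_counter() - start
        total_tokens = sum(len(ids) for ids in outputs)
        results.append(total_tokens / elapsed)  # total tokens over wall-clock time
    return results


def fake_generate(prompts: Sequence[str]) -> list[list[int]]:
    """Stand-in engine for illustration: sleeps briefly, then returns 4 tokens per prompt."""
    time.sleep(0.01)
    return [[0, 1, 2, 3] for _ in prompts]


runs = measure_throughput(fake_generate, ["a fixed prompt"] * 8, ["warmup"])
print([f"{r:.0f} tok/s" for r in runs])
```

Reporting every run rather than only the best one makes the spread between runs visible, which helps when judging whether a small difference between two engines is meaningful.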

Section 7 — Why It Matters

Reading benchmarks like an engineer

Always ask "at what latency?"

A throughput number alone is incomplete. The honest question is "what throughput at a p99 latency of X ms?" — because throughput at unbounded latency is easy and useless. Real SLAs bound latency.

Watch for cherry-picked conditions

Short prompts flatter prefill. Long generations flatter decode optimizations. Big batches flatter throughput. A benchmark that uses only favourable conditions tells you nothing about your workload.
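One way to avoid a workload that only exercises favourable conditions is to draw prompt and output lengths from a range while keeping the random seed fixed, so the mix is varied but identical on every run. The sketch below is an assumption about how such a workload could be built; `WorkItem`, `make_workload`, and the length range are illustrative choices, not taken from any particular benchmark.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    prompt_token_ids: tuple[int, ...]
    max_tokens: int


def make_workload(num_requests: int, seed: int = 0, min_len: int = 100,
                  max_len: int = 1024, vocab_size: int = 10_000) -> list[WorkItem]:
    """Build a reproducible workload with varied prompt and output lengths."""
    rng = random.Random(seed)  # private RNG: the same seed gives the same workload
    items = []
    for _ in range(num_requests):
        prompt_len = rng.randint(min_len, max_len)
        prompt = tuple(rng.randrange(vocab_size) for _ in range(prompt_len))
        items.append(WorkItem(prompt, max_tokens=rng.randint(min_len, max_len)))
    return items


workload = make_workload(256)
prompt_lens = [len(w.prompt_token_ids) for w in workload]
print(f"{len(workload)} requests, prompt lengths {min(prompt_lens)}..{max(prompt_lens)}")
```

Ideally the length distribution would be modelled on your own traffic rather than a uniform range; the uniform choice here is only a placeholder.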

Match the benchmark to your use case

Building interactive chat? TTFT and p99 latency matter most. Running offline batch jobs? Pure throughput. The "best" engine depends entirely on which metric maps to your actual need.

Features vs speed is a real trade

nano-vLLM's competitive throughput partly reflects what it omits. When comparing engines, account for what each does — raw speed on a simple benchmark isn't the whole picture if you need quantization or structured output.

Section 8 — Common Mistakes

Things beginners get wrong about benchmarks

✗ Myth 1 — "Higher throughput always means a better engine"

Reality: Throughput and latency trade off against each other via batch size. An engine reporting enormous throughput is likely running huge batches — which means high per-request latency. For an interactive chat application, that "high throughput" engine might deliver a worse user experience than a "lower throughput" one tuned for latency. The right metric depends entirely on your use case; there is no universal "better".
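A worked example makes the arithmetic behind this concrete. The throughput figures and batch sizes below are hypothetical, and the calculation assumes decode throughput is split evenly across the requests in a batch.

```python
def per_request_view(total_tok_s: float, batch_size: int) -> tuple[float, float]:
    """Return (tok/s seen by one request, TPOT in ms), assuming an even split across the batch."""
    per_request_tok_s = total_tok_s / batch_size
    return per_request_tok_s, 1000.0 / per_request_tok_s


# Hypothetical engines: one tuned for throughput, one tuned for latency.
big_batch = per_request_view(total_tok_s=8000.0, batch_size=320)
small_batch = per_request_view(total_tok_s=2000.0, batch_size=8)

print(f"big batch:   {big_batch[0]:.0f} tok/s per request, TPOT {big_batch[1]:.0f} ms")
print(f"small batch: {small_batch[0]:.0f} tok/s per request, TPOT {small_batch[1]:.0f} ms")
```

With these numbers the first engine has 4× the total throughput, yet each of its users sees tokens arrive 10× more slowly (40 ms versus 4 ms per token). Both headline claims can be true at once.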

✗ Myth 2 — "nano-vLLM matching vLLM means it's production-ready"

Reality: Matching vLLM on an offline throughput benchmark only means the core inference loop is efficient. Production readiness requires the entire long tail nano-vLLM omits: quantization, broad model support, structured output, robust serving infrastructure, multi-node scaling, and battle-testing under real traffic. nano-vLLM is an exceptional learning tool that happens to be fast on simple benchmarks — that is very different from being a production server.

✗ Myth 3 — "A single benchmark number captures performance"

Reality: Performance is a curve, not a point. The same engine produces wildly different numbers depending on batch size, prompt length, generation length, and hardware. A single number is a single point on a multi-dimensional surface. Honest benchmarking reports the relevant curve — throughput across latency targets, performance across prompt-length distributions — so you can find the point that matches your workload.

Section 9 — Check Your Understanding

Quiz

Three final questions.

1. An engine reports 8,000 tok/s throughput. A competitor reports 2,000 tok/s but advertises "10× lower latency". How can both be true?

2. Why must a throughput benchmark include an untimed "warmup" run before measuring?

3. nano-vLLM matches vLLM's throughput on an offline benchmark despite being 1% of the code. What does this most accurately tell us?

Section 10 — Key Takeaways

What you now know

Chapter 11 — Summary

✓ There is no single "speed". Throughput, TTFT, TPOT, and percentile latencies each answer a different question. A serious benchmark reports all of them, not one cherry-picked number.

✓ Throughput and latency trade off. Batch size is the knob. Large batches maximise throughput but raise per-request latency; small batches minimise latency but waste GPU capacity. You cannot maximise both.

✓ Percentiles reveal the tail. Averages hide bad experiences. p95 and p99 latency (the unlucky requests) are what production SLAs are built on, not the median.

✓ nano-vLLM matches vLLM on core throughput. ~1,200 lines competes with ~100,000 on offline batch benchmarks, because throughput comes from a few core ideas this series covered, faithfully implemented.

✓ The omissions are the point. No quantization, beam search, speculative decoding, or broad model support. nano-vLLM trades production features for legibility, and that's what makes it teachable.

✓ Benchmark honestly. Warm up before timing, fix the workload, measure total tokens over wall-clock time, and report the curve, not a single point chosen to flatter.

The End — Series Complete

You've reached the end

Eleven chapters ago, "LLM inference" might have been a black box. Now you understand what happens from the moment a prompt arrives to the moment the final token streams back — and why every design decision was made the way it was.

🎓 The whole picture, in one breath

A prompt arrives and is tokenised (Ch.01). The engine wraps it in a Sequence and hands it to the scheduler (Ch.02, Ch.05), which uses continuous batching to keep the GPU full. The block manager allocates KV cache blocks via PagedAttention (Ch.03, Ch.04), reusing shared prefixes through prefix caching (Ch.07). The model runs prefill then decode (Ch.06), split across GPUs by tensor parallelism (Ch.09), accelerated by FlashAttention and CUDA Graphs (Ch.10). The sampler picks each token (Ch.08), and you measure it all with honest benchmarks (Ch.11). That's a complete LLM inference engine.


The best next step: clone nano-vLLM, open the source, and read it. With this series behind you, every line will make sense.