Chapter 01 of 11 · nano-vLLM Deep Dive

What Is LLM Inference?

Everything that happens between you typing a question and seeing the answer — explained from zero, one step at a time.


What even is inference?

Every time you send a message to an AI and it responds, that's inference. But what's happening inside? Let's build the mental model from scratch before touching any code.

The Chef Analogy

Imagine a master chef who spent years reading every cookbook ever written. That long study period is training: it shaped who she is, it's done, the books are closed. Now when you hand her a half-written recipe and ask "what comes next?", she doesn't re-read all those books. She just thinks and writes the next ingredient. Then the next. One at a time. That's inference. The LLM is the chef. Your prompt is the half-written recipe. Each generated word is the next ingredient she writes down.

Training vs Inference — not the same thing

These describe two completely separate phases of an AI system's life. Most beginners conflate them. Here's the precise distinction:

Training — happens once

Learning from data

  • Runs once, before deployment, on thousands of GPUs for weeks
  • The model reads billions of text documents and finds patterns
  • Adjusts billions of internal numbers called weights or parameters
  • Goal: get very good at predicting "what word comes next?"
  • When done, the weights are frozen — the model stops learning
  • Example: GPT-4 was trained once, then deployed to millions of users

Inference — happens constantly

Using the trained model

  • Happens every time you send a message — thousands per second at scale
  • The frozen model uses its learned weights to generate a response
  • No learning happens — weights don't change at all
  • Takes milliseconds to seconds per response
  • This is the only phase nano-vLLM deals with
  • Example: every reply Claude gives you is one inference run

What is a "token"?

LLMs don't read words; they read tokens. A token is a chunk of text, roughly 3–4 characters of English on average. Common words are usually one token. Rare or long words split into multiple tokens. Punctuation usually gets its own token.

Tokenization demo

The sentence "Hello, how are you?" splits into 6 tokens. The model processes each one as an integer ID:

Hello
,
how
are
you
?

The model never sees the letters "H-e-l-l-o". It sees an integer such as 9906; the exact ID depends on the tokenizer. The mapping is fixed when the tokenizer is built and never changes afterwards.

Why tokens instead of whole words?

Fixed vocabulary

~50,000 tokens can represent any human language. A word-based vocabulary would need millions of entries and still fail on new words.

Handles unknowns

"nano-vLLM" splits into ["nano", "-", "v", "LLM"]. New technical terms, names, code — everything decomposes into known sub-pieces. Nothing is ever truly "unknown".

Math-friendly

Neural networks work with numbers, not text. Tokens are the bridge: text → integers → math → integers → text.
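
The text → integers → text round trip can be sketched with a toy tokenizer. This is a simplification under stated assumptions: the vocabulary and every ID below are made up for illustration, and the greedy longest-match rule stands in for the learned merge rules that real subword tokenizers use.

```python
# Toy vocabulary: every entry and ID here is invented for illustration.
VOCAB = {"Hello": 0, ",": 1, " how": 2, " are": 3, " you": 4, "?": 5,
         "nano": 6, "-": 7, "v": 8, "LLM": 9}
ID_TO_TOKEN = {token_id: piece for piece, token_id in VOCAB.items()}


def encode(text):
    """Greedy longest-match: at each position take the longest known piece."""
    ids, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            # A real vocabulary falls back to bytes or characters instead.
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def decode(ids):
    """Map IDs back to their text pieces and join them."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


ids = encode("Hello, how are you?")
print(ids)                   # [0, 1, 2, 3, 4, 5]
print(decode(ids))           # Hello, how are you?
print(encode("nano-vLLM"))   # [6, 7, 8, 9]
```

Everything between `encode` and `decode` is the part the model handles, and it only ever touches the integers.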

Token count matters for cost and speed

When an API charges "per token" or says "128k context limit", they mean this. 1 page of text ≈ 500 tokens. A full novel ≈ 100,000 tokens. Every generated token requires a full forward pass during decode. More tokens = more computation = more time and money. nano-vLLM's entire job is to process as many tokens as possible, as fast as possible, on as little GPU memory as possible.

The 6 steps of inference

When you send a prompt, it passes through six distinct stages before you see the response: tokenise, schedule, prefill, decode, sample, and detokenise.

Keys, Queries, and Values

In the pipeline you saw that the Prefill step produces Q, K, and V vectors, and that K and V get saved to the "KV cache". This section explains what those actually are — because they appear in every chapter that follows.

The Library Search Analogy

You're in a library looking for information about aerodynamics. You write your question on a slip of paper: that's your Query (Q). Every book has a summary card on its spine describing what it covers: those are the Keys (K). The actual text inside each book, the information you'd read once you find the right one, is the Value (V). You compare your Query slip against every Key card to decide relevance, then read the Value content from the most relevant books in proportion to how relevant each was.

Q — QUERY VECTOR

What the current token is asking. When the model processes the word "fly", its Query says something like "I need context about motion and physics".

Q is used immediately to compute attention scores against all past tokens' Keys. Once the scores are computed, Q is discarded.

Used now. Never saved. Not in the KV cache.

K — KEY VECTOR

What a token advertises about itself. The token "airplane" has a Key that encodes something like "I am a flying machine, relevant to queries about flight or physics".

Every future token will compute its Q against this K to decide how much attention to give "airplane". So K must be kept for as long as the sequence is still being generated.

Saved to KV cache. Reused by every future token.

V — VALUE VECTOR

The actual content retrieved. Once a token's Key wins high relevance, its Value is what gets mixed into the output. K decides if you're selected; V is what you contribute.

A high Q·K dot-product score means more of that token's V is weighted into the final attention output.

Saved to KV cache.
Retrieved weighted by Q·K scores.

How Q, K, V produce the attention output — 4 steps

This runs in every transformer layer, for every token, on every forward pass.

1. Compute relevance scores: Q · K (dot product)

The current token's Q vector is dot-producted with every past token's K vector. A high score means "this past token is very relevant to what I'm looking for right now". For a 1,000-token context, this produces 1,000 scores, one per past token.

2. Scale and normalise: softmax

Scores are divided by √(head_dim) to keep them numerically stable, then passed through softmax to convert them into probabilities summing to 1.0. These are the attention weights. Most are near zero; a handful spike high. This is literally "how much to attend to each past token".

3. Retrieve content: weighted sum of V

Each past token's V vector is multiplied by its attention weight, then all are summed. Tokens with high weight contribute more of their V to the output. The result is a rich, context-aware vector: the "answer" to the current token's Query.

4. Pass through feed-forward layer

The attention output feeds into a feed-forward network (two linear layers with a non-linearity) that further transforms it. Result: the token's updated representation, now enriched with context from the whole sequence. This flows into the next transformer layer and the process repeats.
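
Steps 1 to 3 above can be sketched in a few lines of plain Python for a single query and a single attention head. This is a minimal illustration, not nano-vLLM's kernel: the vectors are made-up numbers, and the function name `attend` is hypothetical.

```python
import math


def attend(q, keys, values):
    """Single-head attention for one query over all past tokens."""
    head_dim = len(q)
    # Step 1: one dot-product score per past token.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    # Step 2: scale by sqrt(head_dim), then softmax into weights summing to 1.
    scaled = [s / math.sqrt(head_dim) for s in scores]
    peak = max(scaled)                      # subtract the max before exp for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Step 3: weighted sum of the value vectors.
    out_dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(out_dim)]
    return weights, output


# Made-up example: three past tokens, head_dim = 4, value dim = 2.
q = [1.0, 0.0, 1.0, 0.0]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attend(q, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The second key is orthogonal to the query, so it receives the smallest weight and its value contributes least to the output.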

Why only K and V are saved, not Q

The Query is ephemeral: token 50 asks its question and immediately gets its answer. But the Keys and Values of tokens 1–49 will be needed again when token 51 runs, then 52, then 53. Without caching, you'd recompute K and V for every previous token on every single decode step, which adds up to O(n²) work as the sequence grows. By computing each token's K and V once (during prefill for the prompt, then one token at a time during decode) and reusing them, each decode step only computes Q, K, and V for the new token. That's the entire reason the KV cache exists. Chapters 3 and 4 are entirely about how to store and manage it efficiently.
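
To make the saving concrete, the sketch below counts how many K/V projections are computed with and without a cache. It is an illustration only: `project_kv` is a hypothetical stand-in for one layer's K and V projections, and the token numbers follow the example above.

```python
def project_kv(token_id):
    """Hypothetical stand-in for one layer's K and V projections."""
    return [float(token_id)], [2.0 * token_id]


def projections_without_cache(prompt, decode_inputs):
    """Recompute K/V for the whole sequence on every decode step."""
    count = 0
    sequence = list(prompt)
    for token in decode_inputs:
        sequence.append(token)
        for t in sequence:              # every token, every step
            project_kv(t)
            count += 1
    return count


def projections_with_cache(prompt, decode_inputs):
    """Compute each token's K/V once, append to the cache, and reuse it."""
    count = 0
    k_cache, v_cache = [], []
    for token in list(prompt) + list(decode_inputs):
        k, v = project_kv(token)        # one projection per token, ever
        k_cache.append(k)
        v_cache.append(v)
        count += 1
    return count, len(k_cache)


prompt = list(range(1, 50))             # tokens 1-49
decode_inputs = [50, 51, 52, 53]        # tokens fed in during decode

print(projections_without_cache(prompt, decode_inputs))   # 206
print(projections_with_cache(prompt, decode_inputs))      # (53, 53)
```

The gap widens with every extra decode step, because the uncached count grows with the square of the sequence length while the cached count grows linearly.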

Autoregressive generation

The single most important thing to internalise: the model cannot see the future. It generates one token, appends it to the context, then generates the next — based on everything including what it just wrote. This is called autoregressive generation.

[Animation: starting from the prompt "The cat sat on the", the model appends one generated token per step.]

Why this loop is expensive

Each token generation runs a full forward pass through the entire model: all transformer layers, all attention heads. For a 7B parameter model that's roughly 7 billion multiply-accumulate operations per token. A 200-word response is ~260 tokens = 260 full forward passes. At scale, with thousands of users, this is an enormous amount of compute, which is exactly why every optimisation in the chapters ahead matters.
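
The loop itself is short. The sketch below shows its shape under a loud assumption: `next_token` is a hypothetical lookup table standing in for a full forward pass plus sampling, and the `"<eos>"` marker and the continuation it produces are invented for the example.

```python
# Hypothetical "model": maps the last token to the next one.
NEXT = {"the": "mat", "mat": ".", ".": "<eos>"}


def next_token(context):
    """Stand-in for one full forward pass plus sampling."""
    return NEXT.get(context[-1], "<eos>")


def generate(prompt, max_tokens):
    """Autoregressive loop: predict, append, repeat."""
    context = list(prompt)
    generated = []
    for _ in range(max_tokens):
        token = next_token(context)     # one forward pass per generated token
        if token == "<eos>":            # stop at the end-of-sequence marker
            break
        generated.append(token)
        context.append(token)           # the output becomes part of the input
    return generated


print(generate(["The", "cat", "sat", "on", "the"], max_tokens=10))  # ['mat', '.']
```

Each iteration depends on the token appended by the previous one, which is why the steps cannot run in parallel.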

HBM and SRAM — the GPU's two memories

When inference is described as "memory-bandwidth bound" or you see "80 GB HBM3" in a GPU spec sheet, this is what it means. Understanding this memory hierarchy explains why most kernel-level optimisations exist.

GPU Memory Hierarchy

HBM (High Bandwidth Memory): 80 GB capacity, 3.35 TB/s bandwidth (H100)

  • Stacked memory chips physically on the GPU package, separate from the compute die
  • Stores everything: model weights, KV cache, and activations (plus gradients during training)
  • The "80 GB" in an "H100 80 GB" spec refers to this
  • Decode is bottlenecked here: for a 7B model at fp16, every step reads 14 GB+ of weights from HBM

SRAM (on-chip cache): 50 MB capacity, ~19 TB/s bandwidth (H100)

  • Built directly into the compute die, physically next to the CUDA cores
  • ~6× faster than HBM, but tiny: only ~50 MB total
  • Used as a high-speed scratchpad: data arrives from HBM, is processed here, result written back
  • FlashAttention (Ch.10) keeps attention tiles in SRAM to avoid HBM round-trips

Bandwidth comparison (H100)

  • SRAM (on-chip): ~19 TB/s, right next to compute
  • HBM (GPU memory): 3.35 TB/s
  • PCIe 5.0 (CPU ↔ GPU): 0.064 TB/s, about 300× slower than SRAM

The HBM → SRAM → compute → SRAM → HBM round-trip is the critical path for every GPU operation. Minimising unnecessary HBM traffic is the goal of FlashAttention and fused Triton kernels, while CUDA Graphs remove kernel-launch overhead. All three are covered in Ch.10.
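
The figures in this section give a quick lower bound on decode latency. The sketch below is back-of-the-envelope arithmetic, assuming a batch of one sequence, a 7B model at fp16, and that only weight reads count (KV cache reads and kernel overhead are ignored). The function name is hypothetical.

```python
def min_decode_ms(weight_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step: time to stream all weights from HBM once."""
    return weight_bytes / bandwidth_bytes_per_s * 1e3


weight_bytes = 7e9 * 2          # 7B parameters x 2 bytes (fp16) = 14 GB
hbm_bandwidth = 3.35e12         # H100 HBM bandwidth in bytes per second

step_ms = min_decode_ms(weight_bytes, hbm_bandwidth)
print(f"{step_ms:.2f} ms per token")                    # 4.18 ms per token
print(f"at most {1000 / step_ms:.0f} tokens/s")         # at most 239 tokens/s
```

No amount of extra compute improves this bound for a single sequence. Only reading fewer bytes per token, or sharing each read across a batch of sequences, does.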

How context length eats GPU memory

KV cache memory grows linearly with context length, and it comes on top of the model weights. This is the problem that PagedAttention (Ch.04) is designed to solve.
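
A rough estimate of KV cache size follows from counting one K and one V vector per token, per layer, per KV head. The sketch below assumes a 7B-class configuration (32 layers, 32 KV heads, head_dim 128, fp16, no grouped-query attention). These numbers are illustrative assumptions and not values read from nano-vLLM, and the function name is hypothetical.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold K and V for `tokens` tokens of one sequence."""
    # The leading 2 accounts for storing both a K and a V vector.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128    # assumed 7B-class configuration

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
print(f"{per_token // 1024} KiB per token")             # 512 KiB per token

for context in (2_048, 8_192, 32_768, 131_072):
    gib = kv_cache_bytes(context, LAYERS, KV_HEADS, HEAD_DIM) / GIB
    print(f"{context:>7} tokens -> {gib:5.1f} GiB")
```

Under these assumptions a single 131,072-token sequence needs 64 GiB of cache, before counting the weights or any other concurrent request.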


How inference looks in code

Here's the complete public API for running inference with nano-vLLM. Every concept from this chapter — tokens, sampling, generate loop — is represented in just these few lines.

example.py — the complete inference call
from nanovllm import LLM, SamplingParams

# Step 1 — Load the model (weights go into GPU HBM)
llm = LLM(
    "./Qwen3-0.6B",          # path to HuggingFace weights directory
    enforce_eager=True,      # disable CUDA graphs — simpler for learning
    tensor_parallel_size=1   # single GPU — see Ch.09 for multi-GPU
)

# Step 2 — Define sampling behaviour (how to pick each next token)
params = SamplingParams(
    temperature=0.7,          # randomness: 0 = greedy, 1 = more creative
    top_k=50,                 # only consider the top 50 candidates; assumes your nano-vLLM version accepts top_k
    max_tokens=256            # stop after 256 generated tokens
)

# Step 3 — Run inference (tokenise → prefill → decode loop → detokenise)
outputs = llm.generate(
    ["Explain how airplanes fly."],  # list of prompt strings
    params
)

# Step 4 — outputs[0]['text'] is the detokenised string response
print(outputs[0]['text'])

What happens inside llm.generate()

The call to generate() runs the entire 6-step pipeline you walked through in Section 3: (1) tokenise the prompt strings, (2) schedule them via the Scheduler (Ch.05), (3) prefill: process all input tokens and build the KV cache (Ch.03), (4) decode loop: generate one token per step until done, (5) sample each token via SamplingParams (Ch.08), (6) detokenise the output token IDs back into text. All of that is abstracted behind one method call.
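
The sampling step (5) is where temperature and top_k take effect. The sketch below is a generic illustration of those two knobs, not nano-vLLM's sampler: the function name `sample_next_token` and the logits are made up for the example.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=50, rng=random):
    """Pick one token index from raw logits (generic sketch)."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if temperature == 0:
        return order[0]                      # greedy: always the top-scoring token
    candidates = order[:top_k]               # keep only the k highest-scoring tokens
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)                       # subtract the max before exp for stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]               # made-up scores for a 4-token vocabulary
rng = random.Random(0)                       # seeded so runs are repeatable

print(sample_next_token(logits, temperature=0))                       # 0
print(sample_next_token(logits, temperature=0.7, top_k=2, rng=rng))   # always 0 or 1
```

Dividing by a temperature below 1 sharpens the distribution toward the top token, while a temperature above 1 flattens it. The top_k cut removes unlikely tokens before any randomness is applied.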

The four core bottlenecks

Running inference for one user is manageable. Running it for thousands of concurrent users with long prompts and strict latency requirements introduces four fundamental constraints. Every remaining chapter addresses one of these.

① Memory Bandwidth

During decode, the GPU reads the entire model's weights from HBM once per generated token. For a 7B model at fp16, that's 14 GB transferred every single step. The compute cores sit idle waiting for data. This is why decode is said to be "memory-bandwidth bound" rather than "compute bound".

② Compute Throughput

Prefill — processing the full input prompt — requires massive parallel matrix multiplications across all tokens simultaneously. A 10,000-token prompt can take seconds even on powerful hardware. This phase is "compute bound": the floating-point units, not the memory bus, are the bottleneck.

③ GPU Memory Capacity

Every processed token writes K and V vectors to the KV cache, which must stay in HBM for the duration of the request. A 70B model with 128k context can need 100+ GB just for the cache, depending on how many KV heads it uses, which exceeds a single GPU. Managing this memory is what most of nano-vLLM's ~1,200 lines of code handle.

④ GPU Utilisation

GPUs are efficient when processing large batches. But requests arrive at random times with different lengths. Without careful scheduling, the GPU idles between requests or wastes cycles waiting for a slow request to finish. Continuous batching → Ch.05 solves this.
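
The cost of waiting for a slow request can be sketched with a toy simulation. It assumes each request only needs a fixed number of decode steps, ignores prefill entirely, and uses hypothetical function names. It is not how the nano-vLLM Scheduler is implemented.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """A finished request's slot is refilled before the next step."""
    waiting = list(lengths)
    running = []                        # remaining decode steps per active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1                      # one decode step for every active request
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [2, 8, 3, 5]                  # decode steps needed by four requests
print(static_batching_steps(lengths, batch_size=2))       # 13
print(continuous_batching_steps(lengths, batch_size=2))   # 10
```

With static batching the 2-step request holds its slot idle for 6 steps while its 8-step neighbour finishes. Refilling the slot immediately recovers that time.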

Things beginners get wrong

✗ Myth 1: "The model learns from my messages"
Reality: During inference, the model's weights are completely frozen. Your conversation doesn't update them. The model isn't "learning" anything. It's applying patterns baked in during training to generate a contextually appropriate response. It's sophisticated pattern completion, not real-time learning.

✗ Myth 2: "The model generates the full response at once"
Reality: Generation is strictly sequential, one token at a time. Token N+1 cannot be predicted before token N exists. This is why you see streaming responses appear word by word, not all at once. There's no shortcut: each token requires a full forward pass through the model.

✗ Myth 3: "More GPU memory = faster generation"
Reality: More GPU memory increases the number of requests you can serve simultaneously (bigger KV cache, more requests fit), but doesn't directly speed up individual token generation. Speed is limited by memory bandwidth (GB/s), not memory capacity (GB). Two GPUs with the same bandwidth generate tokens at about the same speed regardless of capacity. The 80 GB A100 is somewhat faster than the 40 GB A100 only because it also ships with higher-bandwidth memory (roughly 2.0 TB/s versus 1.6 TB/s); the extra capacity itself just fits more concurrent requests.

Quiz

Three questions to cement the key ideas.

1. When does an LLM update its weights — during training, or during inference?

2. Why are K and V vectors cached, but Q vectors are not?

3. What does "HBM" refer to in GPU specs?

What you now know

Chapter 01 — Summary

Training ≠ Inference. Training shapes weights once, on huge compute. Inference uses those frozen weights to generate text — no learning during inference.

Tokens are the atomic unit. Text → integers via tokenizer → model math → integers → text. The model only ever sees numbers, never characters.

Q asks, K advertises, V delivers. Query is used once and discarded. Key and Value are saved to the KV cache and reused by every future token — that's why caching them saves enormous compute.

Autoregressive = one token at a time, always. Each new token requires a full forward pass. The model cannot predict token N+1 without first generating token N.

HBM is the GPU's large memory. The "80 GB" on an H100. Decode is slow because reading the weights from HBM on every step (14+ GB for a 7B model at fp16) is the bottleneck; compute cores sit idle waiting for data.

Four bottlenecks drive everything ahead: memory bandwidth (decode), compute throughput (prefill), GPU memory capacity (KV cache size), and GPU utilisation (batching). Every chapter solves one of these.