Two phases, two completely different problems
You've seen prefill and decode mentioned throughout this series. Now it's time to understand them deeply — not just what they do, but why they behave so differently on the same GPU hardware. This distinction shapes every design decision in nano-vLLM and every optimisation in production inference engines.
Think of a factory and a paint shop sharing one building. The factory is compute-bound; the paint shop is memory-bandwidth bound. Same building, same power supply, completely different bottlenecks. This is exactly what happens with prefill and decode on a GPU.
Prefill — processing the input prompt
Prefill is the first phase of handling any request. When your prompt arrives — say, 500 tokens of text — the model processes all 500 tokens simultaneously in a single forward pass. All tokens go in together, all layers process them in parallel, and at the end the model has computed the Key and Value vectors for every input token and written them to the KV cache → Ch.03.
What makes prefill compute-bound?
To understand "compute-bound", you first need to understand the concept of arithmetic intensity — the ratio of math operations performed to memory bytes read. High arithmetic intensity = more computation per byte of data = the compute units (CUDA cores) are the bottleneck, not the memory bus.
During prefill, for each of the 500 input tokens, the model performs large matrix multiplications — multiplying the token's embedding vector against the model's weight matrices in every layer. Each weight matrix is read once but produces many multiply-accumulate operations. This is high arithmetic intensity. The GPU's thousands of floating-point units are all saturated. Adding more CUDA cores would make prefill faster. The memory bus is not the limit — the math units are.
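To make the ratio concrete, here is a minimal sketch that estimates arithmetic intensity for one weight-matrix multiplication. The 4096 × 4096 layer size and the fp16 element size are illustrative assumptions, not figures from any particular model, and the function name is made up for this example.

```python
def matmul_arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for activations [num_tokens, d_in] @ weights [d_in, d_out]."""
    # One multiply plus one add per (token, input feature, output feature) triple.
    flops = 2 * num_tokens * d_in * d_out
    # The weight matrix is read once, regardless of how many tokens use it.
    weight_bytes = d_in * d_out * bytes_per_elem
    # Input activations are read and output activations are written.
    activation_bytes = num_tokens * (d_in + d_out) * bytes_per_elem
    return flops / (weight_bytes + activation_bytes)


# Assumed layer shape: 4096 x 4096 in fp16 (2 bytes per element).
prefill_intensity = matmul_arithmetic_intensity(num_tokens=500, d_in=4096, d_out=4096)
single_token_intensity = matmul_arithmetic_intensity(num_tokens=1, d_in=4096, d_out=4096)

print(f"500 tokens: {prefill_intensity:.0f} FLOPs per byte")
print(f"1 token:    {single_token_intensity:.2f} FLOPs per byte")
```

Under these assumptions the 500-token case does several hundred times more math per byte than the single-token case, because the same weight read is shared by every token in the pass.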
Prefill: processes all input tokens at once
- All N input tokens processed in one parallel forward pass
- Produces K and V for every input token — writes full KV cache
- Outputs logits only for the last token (to sample first output)
- High arithmetic intensity — GPU compute units are saturated
- Scales well with batch size — more sequences = better utilisation
- A 1,000-token prompt may take 100–500ms on a single GPU
- Uses flash_attn_varlen_func for variable-length batches
Decode: generates tokens one at a time
- Exactly one new token processed per forward pass, per sequence
- Reads full KV cache for all prior tokens — writes only 1 new entry
- Outputs logits for the next token to sample
- Low arithmetic intensity — memory bandwidth is the bottleneck
- Per-step latency barely grows with batch size: the weight read dominates, so batching raises throughput without speeding up any single sequence
- A single decode step takes 20–50ms on a 7B model
- Uses flash_attn_with_kvcache for paged KV reads
Decode — generating tokens one at a time
After prefill, the model enters the decode phase — the autoregressive loop → Ch.01 where it generates one token per step. Each step processes only the single most-recently-generated token, reads the full KV cache for context, and samples the next token. This repeats until the sequence finishes.
Why decode is memory-bandwidth bound
Here's the key insight: to generate a single token, the GPU must read the entire model's weights from HBM → Ch.01 — all 7 billion parameters of a 7B model, or 14 GB at fp16. But it only uses those weights to perform computation for one token's worth of input. That's a tiny amount of math relative to the data movement. The arithmetic intensity is very low.
Think of it this way: you're reading 14 GB of data from HBM every decode step, but doing only a fraction of the math you'd do if you were processing a large batch of tokens. The memory bus is running flat out pumping weights from HBM to the compute cores, but the compute cores are only lightly loaded. Adding more CUDA cores would not help at all — they'd just sit idle waiting for data. The only thing that would help decode is faster HBM bandwidth.
Prefill: compute units saturated, so adding more CUDA cores would directly improve speed.
Decode: memory bus saturated, so extra CUDA cores are wasted and only higher HBM bandwidth helps.
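The weight-read argument gives a simple lower bound on decode step time: bytes of weights divided by memory bandwidth. The sketch below works that out for the 7B fp16 example. The 1,000 GB/s bandwidth figure is a hypothetical round number, not a measurement of any specific GPU, and the bound ignores KV cache reads and kernel overheads.

```python
def min_decode_step_ms(num_params, bytes_per_param, hbm_bandwidth_gb_s):
    """Lower bound on one decode step: time to stream every weight from HBM once."""
    weight_bytes = num_params * bytes_per_param
    seconds = weight_bytes / (hbm_bandwidth_gb_s * 1e9)
    return seconds * 1e3


# 7B parameters at fp16 (2 bytes each) is the 14 GB figure from the text.
weight_gb = 7e9 * 2 / 1e9

# Assumed bandwidth: 1,000 GB/s (hypothetical round number).
step_ms = min_decode_step_ms(num_params=7e9, bytes_per_param=2, hbm_bandwidth_gb_s=1000)

print(f"weights: {weight_gb:.0f} GB, minimum step time: {step_ms:.1f} ms")
```

Doubling the assumed bandwidth halves the bound, while the number of compute units does not appear in the formula at all.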
TTFT and TPOT — the two latency metrics
Because prefill and decode behave so differently, inference engineers use two separate metrics to measure latency. Using just one metric hides the full picture.
TTFT — Time To First Token
TTFT (Time To First Token) measures how long it takes from sending a request until the very first token of the response appears. Apart from any time the request spends waiting in a queue, this is determined by prefill duration: the model must finish processing your entire prompt before it can generate token 1.
TTFT is what users feel as "responsiveness". A system with slow prefill feels sluggish even if it generates subsequent tokens quickly. A 5-second TTFT feels like the system is hanging, even if tokens stream out rapidly after that. For interactive applications, TTFT is often the most important metric.
TPOT — Time Per Output Token
TPOT (Time Per Output Token) measures the average time between each successive generated token after the first one. This is entirely determined by decode speed — specifically by HBM bandwidth, batch size, and model size.
TPOT is what users feel as "streaming speed": how fast the text appears to flow after it starts. For long responses (essays, code, analyses), TPOT dominates the total wait time. A TPOT of 25ms means 40 tokens per second, which is several times faster than most people read, so the stream feels smooth.
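The two metrics combine into a rough end-to-end estimate: the first token arrives after TTFT, and each later token adds one TPOT. A minimal sketch follows; the function names and the 200 ms TTFT in the example are illustrative assumptions.

```python
def tokens_per_second(tpot_ms):
    """Streaming rate implied by a given time per output token."""
    return 1000.0 / tpot_ms


def total_latency_ms(ttft_ms, tpot_ms, num_output_tokens):
    """Approximate time to receive a full response.

    The first token arrives after TTFT; each of the remaining
    (num_output_tokens - 1) tokens adds one TPOT.
    """
    if num_output_tokens < 1:
        raise ValueError("num_output_tokens must be at least 1")
    return ttft_ms + (num_output_tokens - 1) * tpot_ms


rate = tokens_per_second(25)
# Hypothetical request: 200 ms TTFT, 25 ms TPOT, 101 output tokens.
total = total_latency_ms(ttft_ms=200, tpot_ms=25, num_output_tokens=101)

print(f"{rate:.0f} tokens/s, {total:.0f} ms total")
```

For short answers the TTFT term dominates; for long answers the TPOT term does, which is why the two are tracked separately.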
Chunked prefill — the best of both worlds
Prefill and decode are in tension: a prefill pass handles many tokens at once but blocks decode while it runs (no new tokens until prefill finishes), whereas decode steps are individually short but must keep running for responses to stream. When a very long prompt arrives, every sequence that is already generating stalls for the entire prefill. Chunked prefill resolves this tension.
What is chunked prefill?
Instead of processing the entire prompt in one massive forward pass, chunked prefill splits the prompt into smaller chunks — say, 512 tokens at a time. After each chunk is processed, decode steps for existing sequences can be interleaved. The prompt is still fully processed before the new request generates any tokens, but other requests already in DECODING state keep generating throughout.
Without chunked prefill
A 4,000-token prompt takes ~800ms to prefill on a 7B model. During those 800ms, every other sequence in the batch is blocked and no decode steps run. Users already waiting for their responses experience an 800ms pause. TTFT for the new request is 800ms. This is called prefill starvation.
With chunked prefill (chunk size = 512 tokens)
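The 4,000-token prompt is split into 8 chunks of at most 512 tokens (seven full chunks plus a final 416-token chunk). After each chunk (≈100ms), the scheduler runs a decode step for all waiting sequences. Existing responses continue streaming. The new request's TTFT is still roughly 800ms in total, but other users see gaps of about 100ms between tokens instead of one 800ms stall. GPU utilisation stays high throughout. In vLLM this behaviour is enabled with the enable_chunked_prefill option; note that enforce_eager is a separate setting that controls CUDA graph capture, not chunking.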
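The chunking arithmetic and the interleaving pattern described above can be sketched in a few lines. This is a toy model of the schedule, not the scheduler from nano-vLLM or vLLM; the function names and the step tuples are hypothetical.

```python
def split_into_chunks(prompt_len, chunk_size):
    """Return (start, end) token ranges covering the prompt, each at most chunk_size long."""
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


def interleaved_schedule(prompt_len, chunk_size):
    """Alternate one prefill chunk with one decode step for the sequences already running."""
    steps = []
    for start, end in split_into_chunks(prompt_len, chunk_size):
        steps.append(("prefill_chunk", start, end))
        steps.append(("decode_step", None, None))
    return steps


chunks = split_into_chunks(prompt_len=4000, chunk_size=512)
schedule = interleaved_schedule(prompt_len=4000, chunk_size=512)

print(f"{len(chunks)} chunks, last chunk covers tokens {chunks[-1][0]}..{chunks[-1][1]}")
for step in schedule[:4]:
    print(step)
```

Production schedulers typically go one step further and pack a prefill chunk and the pending decode tokens into the same forward pass, but the alternating view is enough to see why existing sequences never wait longer than one chunk.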
How the two phases appear in code
The clearest place to see the prefill/decode split is in the attention layer — the same function handles both phases, but dispatches to a completely different underlying kernel based on seq.is_prefill.
    from flash_attn import flash_attn_varlen_func, flash_attn_with_kvcache

    # NOTE: num_tokens, cu_seqlens, max_seqlen and seqlens are per-step metadata;
    # how they reach this function is not shown in this excerpt.
    def forward(self, q, k, v, kv_cache, slot_mapping, block_table, is_prefill):
        # Step 1: write this step's new K and V to the cache (both phases do this)
        store_kvcache_kernel[num_tokens,](k, v, kv_cache, slot_mapping, ...)

        if is_prefill:
            # ── PREFILL PATH ──────────────────────────────────────────────
            # All input tokens are in q, k, v as contiguous tensors.
            # flash_attn_varlen_func handles variable-length sequences
            # in a single batch, masking future tokens (causal mask).
            # This is compute-intensive: full attention over N tokens.
            out = flash_attn_varlen_func(
                q, k, v,
                cu_seqlens_q=cu_seqlens,   # cumulative sequence lengths for batching
                cu_seqlens_k=cu_seqlens,
                max_seqlen_q=max_seqlen,
                max_seqlen_k=max_seqlen,   # required argument; missing in the original call
                causal=True,               # don't attend to future tokens
            )
        else:
            # ── DECODE PATH ───────────────────────────────────────────────
            # Only the new token is in q. K and V for all prior tokens
            # are read from the paged KV cache via block_table.
            # flash_attn_with_kvcache handles non-contiguous block reads.
            # This is memory-bandwidth intensive: reads entire KV cache.
            # q must be shaped (batch, 1, num_heads, head_dim) for this call.
            out = flash_attn_with_kvcache(
                q,
                kv_cache[0],               # the full K cache tensor (all blocks)
                kv_cache[1],               # the full V cache tensor (all blocks)
                block_table=block_table,   # logical → physical block mapping
                cache_seqlens=seqlens,     # how many tokens are in cache per sequence
            )
        return out
In prefill, q, k, and v are all dense tensors in contiguous memory, which is fast to access. In decode, q is tiny (one token per sequence), but kv_cache[0] and kv_cache[1] are the entire paged cache tensors, and each sequence's blocks are scattered across them non-contiguously → Ch.04. Reading those scattered blocks adds KV-cache traffic on top of the weight reads, which deepens the memory-bandwidth bottleneck. flash_attn_with_kvcache is highly optimised to do this paged read as efficiently as possible; it is one of the most performance-critical kernels in the entire system.
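The cu_seqlens argument in the prefill path deserves a closer look. Variable-length attention packs the tokens of all sequences into one flat tensor and uses cumulative offsets to mark where each sequence starts and ends. The sketch below builds those offsets with plain Python lists so it runs anywhere; real code would pass them as int32 tensors on the GPU, and the helper name is hypothetical.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Cumulative offsets: sequence i occupies packed positions [cu[i], cu[i + 1])."""
    return [0] + list(accumulate(seq_lens))


# Three prompts of different lengths packed into one batch of 10 tokens.
seq_lens = [3, 5, 2]
cu_seqlens = build_cu_seqlens(seq_lens)
longest = max(seq_lens)

for i in range(len(seq_lens)):
    start, end = cu_seqlens[i], cu_seqlens[i + 1]
    print(f"sequence {i}: packed tokens {start}..{end - 1}")
```

Packing this way avoids padding every prompt to the longest one, so no compute is spent on tokens that do not exist.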
    def run(self, scheduler_output: SchedulerOutput):
        # Build a mixed batch: prefill sequences + decode sequences
        # Each sequence knows its phase via seq.is_prefill
        all_seqs = scheduler_output.prefill + scheduler_output.decode

        # Collect token IDs for the forward pass
        # Prefill sequences contribute ALL their tokens
        # Decode sequences contribute only their LATEST token
        input_ids = []
        for seq in all_seqs:
            if seq.is_prefill:
                input_ids.extend(seq.tokens)       # all N prompt tokens
            else:
                input_ids.append(seq.tokens[-1])   # just the last generated token

        # Single forward pass handles both prefill and decode sequences
        # The attention layer dispatches to the right kernel per sequence
        logits = self.model.forward(input_ids, ...)
Why understanding this changes how you think about LLM performance
Prefill determines TTFT
Long system prompts, RAG contexts, and long chat histories all increase prefill time. Prefix caching → Ch.07 is the primary way to reduce TTFT — by skipping prefill entirely for tokens already in the cache.
Decode determines throughput
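Generating 1,000 tokens takes 1,000 decode steps regardless of batch size. The latency of each step is set by HBM bandwidth and model size, so a single sequence only gets faster with higher HBM bandwidth (better hardware) or a smaller model (fewer weights to read per step). Aggregate throughput is a different matter: batching more sequences into each step lets one weight read serve the whole batch.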
Flash attention solves both differently
FlashAttention-2 uses tiling to reduce HBM traffic during prefill → Ch.10. flash_attn_with_kvcache handles the paged, scattered reads during decode. Two different problems, two different kernel strategies.
Speculative decoding attacks decode cost
Speculative decoding (not in nano-vLLM, but used in production vLLM) has a cheap drafter propose several tokens ahead, then verifies all of them in a single forward pass of the full model. It exploits the fact that decode is memory-bound and compute-underutilised: the otherwise idle compute checks several positions for roughly the cost of one weight read.
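The verification step can be illustrated with a simplified greedy variant: accept draft tokens as long as they match what the full model would have picked, and replace the first mismatch with the full model's choice. This is a toy sketch over plain token lists, not vLLM's implementation; production systems typically use a probabilistic acceptance rule instead of exact matching, and the function name is hypothetical.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Return the tokens kept after one verification pass.

    target_tokens[i] is the full model's greedy choice at draft position i,
    with one extra entry for the position after the last draft token.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")

    accepted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            # First mismatch: keep the full model's token and discard the rest of the draft.
            accepted.append(target)
            return accepted
        accepted.append(draft)

    # Every draft token matched, so the extra target token comes for free.
    accepted.append(target_tokens[-1])
    return accepted


partial = verify_greedy([5, 6, 7, 8], [5, 6, 9, 1, 2])
full = verify_greedy([5, 6], [5, 6, 7])
print(partial, full)
```

Each verification pass yields at least one token, so the scheme never produces fewer tokens per full-model pass than ordinary decode; the gain depends on how often the drafter guesses right.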
Quiz
Three questions on prefill, decode, and their bottlenecks.
1. A user complains that responses from your LLM service "feel slow to start but stream quickly once they begin." Which metric is the problem, and which phase is causing it?
2. During decode, the GPU's CUDA cores are only 18% utilised but HBM bandwidth is at 95%. What would most improve decode speed?
3. Why does chunked prefill improve the experience for users whose requests are already being decoded — even though their requests are not the ones being prefilled?
What you now know
Prefill is compute-bound. All input tokens processed in parallel, high arithmetic intensity, CUDA cores saturated. Bottleneck: how many compute units you have. Scales well with batch size.
Decode is memory-bandwidth bound. One token per step, 14 GB of weights read from HBM per step (7B model), CUDA cores underutilised. Bottleneck: HBM bandwidth. Adding more CUDA cores does not help.
TTFT measures prefill. TPOT measures decode. High TTFT = slow to start. High TPOT = slow streaming. They're independent — optimise them separately with different techniques.
Same weights, different bottlenecks. Prefill and decode use identical model weights. The difference is arithmetic intensity — how much math is done per byte of data read from HBM.
Chunked prefill bridges the gap. Splitting long prompts into chunks and interleaving decode steps prevents prefill starvation — existing sequences keep streaming without pause during long prompt processing.
Two phases, two attention kernels. Prefill uses flash_attn_varlen_func (dense, contiguous). Decode uses flash_attn_with_kvcache (paged, scattered). Same layer, different dispatch based on seq.is_prefill.