Chapter 06 of 11 · nano-vLLM Deep Dive

Prefill vs Decode

Same model. Two completely different bottlenecks. Understanding why prefill and decode behave so differently is the master key to every LLM performance optimisation.

Two phases, two completely different problems

You've seen prefill and decode mentioned throughout this series. Now it's time to understand them deeply — not just what they do, but why they behave so differently on the same GPU hardware. This distinction shapes every design decision in nano-vLLM and every optimisation in production inference engines.

The Factory vs Paint Shop Analogy

Imagine a factory that builds cars. Prefill is like the initial assembly stage — dozens of workers all attack the car simultaneously, welding, wiring, and bolting in parallel. It's chaotic, power-intensive, and fast. Every worker is busy at once. The bottleneck is how many workers (compute units) you have. Decode is like the paint shop at the end — one robot arm applies paint to one car panel at a time, carefully, with a huge tank of paint (GPU memory) that it dips into on every stroke. The bottleneck isn't the robot arm's speed — it's how fast paint can be pumped from the tank. More robot arms don't help if the pump is already at capacity.

The factory is compute-bound. The paint shop is memory-bandwidth bound. Same building, same power supply, completely different bottlenecks. This is exactly what happens with prefill and decode on a GPU.

Prefill — processing the input prompt

Prefill is the first phase of handling any request. When your prompt arrives — say, 500 tokens of text — the model processes all 500 tokens simultaneously in a single forward pass. All tokens go in together, all layers process them in parallel, and at the end the model has computed the Key and Value vectors for every input token and written them to the KV cache → Ch.03.

What makes prefill compute-bound?

To understand "compute-bound", you first need to understand the concept of arithmetic intensity — the ratio of math operations performed to memory bytes read. High arithmetic intensity = more computation per byte of data = the compute units (CUDA cores) are the bottleneck, not the memory bus.

The Long Division Analogy

Imagine two tasks: (A) solving 1,000 long-division problems on paper — you write a number, do a lot of arithmetic, write the answer; (B) copying 1,000 numbers from one notebook to another — you read a number, write it down, move on. Task A is compute-intensive — you're doing lots of work per number you read. Task B is memory-bandwidth intensive — you're barely doing any work, you're just moving data. Prefill is task A. Decode is task B.

During prefill, for each of the 500 input tokens, the model performs large matrix multiplications — multiplying the token's embedding vector against the model's weight matrices in every layer. Each weight matrix is read from memory once and reused for all 500 tokens, so every byte loaded feeds hundreds of multiply-accumulate operations. This is high arithmetic intensity. The GPU's thousands of floating-point units are all saturated. Adding more CUDA cores would make prefill faster. The memory bus is not the limit — the math units are.
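To put rough numbers on the arithmetic-intensity argument, here is a minimal sketch that counts FLOPs and bytes for a single weight-matrix multiply. The 4096 × 4096 matrix size, fp16 storage, and the function name are illustrative assumptions, not values taken from nano-vLLM.

```python
def arithmetic_intensity(n_tokens: int, d_in: int = 4096, d_out: int = 4096,
                         bytes_per_value: int = 2) -> float:
    """FLOPs per byte moved for one (n_tokens x d_in) @ (d_in x d_out) matmul."""
    # One multiply and one add per weight, per token.
    flops = 2 * n_tokens * d_in * d_out
    # The weights are read once, no matter how many tokens share them.
    weight_bytes = d_in * d_out * bytes_per_value
    # Activations: read the inputs, write the outputs.
    act_bytes = n_tokens * (d_in + d_out) * bytes_per_value
    return flops / (weight_bytes + act_bytes)


prefill = arithmetic_intensity(n_tokens=500)  # whole prompt in one pass
decode = arithmetic_intensity(n_tokens=1)     # a single new token
print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.2f} FLOPs/byte")
```

With these assumed sizes the prefill case lands near 400 FLOPs per byte and the decode case near 1. The weights are the same in both cases; only the amount of work done per byte read changes.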

Prefill — key properties

Processes all input tokens at once

  • All N input tokens processed in one parallel forward pass
  • Produces K and V for every input token — writes full KV cache
  • Outputs logits only for the last token (to sample first output)
  • High arithmetic intensity — GPU compute units are saturated
  • Scales well with batch size — more sequences = better utilisation
  • A 1,000-token prompt may take 100–500ms on a single GPU
  • Uses flash_attn_varlen_func for variable-length batches

Decode — key properties

Generates tokens one at a time

  • Exactly one new token processed per forward pass, per sequence
  • Reads full KV cache for all prior tokens — writes only 1 new entry
  • Outputs logits for the next token to sample
  • Low arithmetic intensity — memory bandwidth is the bottleneck
  • Per-step time barely grows with batch size, because the weight read is shared: batching raises total throughput, not per-sequence speed
  • A single decode step may take 20–50ms on a 7B model, depending on the GPU
  • Uses flash_attn_with_kvcache for paged KV reads

Decode — generating tokens one at a time

After prefill, the model enters the decode phase — the autoregressive loop → Ch.01 where it generates one token per step. Each step processes only the single most-recently-generated token, reads the full KV cache for context, and samples the next token. This repeats until the sequence finishes.

Why decode is memory-bandwidth bound

Here's the key insight: to generate a single token, the GPU must read the entire model's weights from HBM → Ch.01 — all 7 billion parameters of a 7B model, or 14 GB at fp16. But it only uses those weights to perform computation for one token's worth of input. That's a tiny amount of math relative to the data movement. The arithmetic intensity is very low.

Think of it this way: you're reading 14 GB of data from HBM every decode step, but doing only a fraction of the math you'd do if you were processing a large batch of tokens. The memory bus is running flat out pumping weights from HBM to the compute cores, but the compute cores are only lightly loaded. Adding more CUDA cores would not help at all — they'd just sit idle waiting for data. For a single sequence, the only things that help are higher HBM bandwidth or fewer bytes to read per step.
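A back-of-the-envelope floor for decode latency follows directly from this: a step cannot finish faster than the time it takes to stream the weights out of HBM once. The sketch below is a simplified estimate with a hypothetical helper name. It ignores KV cache reads, kernel launch overhead, and imperfect bandwidth utilisation, so real steps are slower.

```python
def min_decode_step_ms(n_params: float, bytes_per_param: float,
                       hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step: stream every weight from HBM once."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / hbm_bytes_per_s * 1e3


# 7B parameters at fp16 (2 bytes each) = 14 GB of weights, as in the text.
slow_hbm_ms = min_decode_step_ms(7e9, 2, 2.0e12)    # assumed 2.0 TB/s of HBM bandwidth
fast_hbm_ms = min_decode_step_ms(7e9, 2, 3.35e12)   # assumed 3.35 TB/s of HBM bandwidth
print(f"{slow_hbm_ms:.1f} ms vs {fast_hbm_ms:.1f} ms per step (floor)")
```

In this model the floor scales only with bytes read and bandwidth. Compute throughput does not appear in the formula at all, which is the point of the paragraph above.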

Resource utilisation — prefill vs decode

PREFILL — compute-bound
  • CUDA Cores: 96%
  • HBM Bandwidth: 45%
  • SRAM Usage: 82%

Compute units saturated — adding more CUDA cores would directly improve speed.

DECODE — memory-bound
  • CUDA Cores: 18%
  • HBM Bandwidth: 95%
  • SRAM Usage: 30%

Memory bus saturated — more CUDA cores are wasted. Only faster HBM bandwidth helps.

Batching helps decode more than you'd think

While decode is memory-bound per-token, running many sequences in the same decode batch actually improves efficiency. Why? Because all sequences in the batch share the same weight matrix reads — you pay the HBM read cost once, then use those weights for all N sequences. A decode batch of 32 sequences reads model weights once and computes 32 tokens simultaneously: nearly the same weight traffic, 32× the output (each sequence still reads its own KV cache). This is why the scheduler → Ch.05 tries to keep the batch as full as possible even during decode.
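The batching effect can be sketched with a simple roofline estimate: a step takes as long as the slower of "read the weights" and "do the math". The hardware numbers below (2.0 TB/s of bandwidth, 300 TFLOP/s of compute) and the function name are assumptions for illustration, and KV cache traffic is ignored.

```python
def decode_step_ms(batch_size: int, n_params: float = 7e9, bytes_per_param: int = 2,
                   hbm_bytes_per_s: float = 2.0e12, flops_per_s: float = 300e12) -> float:
    """Roofline estimate of one decode step for a batch of sequences."""
    mem_s = n_params * bytes_per_param / hbm_bytes_per_s   # weights read once per step
    compute_s = 2 * n_params * batch_size / flops_per_s    # ~2 FLOPs per parameter per token
    return max(mem_s, compute_s) * 1e3


for bs in (1, 32, 256):
    step = decode_step_ms(bs)
    print(f"batch={bs:>3}  step={step:5.1f} ms  throughput={bs / step * 1e3:,.0f} tok/s")
```

Under these assumptions a batch of 32 costs the same step time as a batch of 1, so throughput rises 32×. Only at much larger batches does the compute term overtake the memory term and the step start to slow down.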

TTFT and TPOT — the two latency metrics

Because prefill and decode behave so differently, inference engineers use two separate metrics to measure latency. Using just one metric hides the full picture.

TTFT — Time To First Token

TTFT (Time To First Token) measures how long after you send a request until you see the very first token of the response appear. This is dominated by prefill duration, plus any time the request spends queued before it is scheduled: the model must finish processing your entire prompt before it can generate token 1.

TTFT is what users feel as "responsiveness". A system with slow prefill feels sluggish even if it generates subsequent tokens quickly. A 5-second TTFT feels like the system is hanging, even if tokens stream out rapidly after that. For interactive applications, TTFT is often the most important metric.

TPOT — Time Per Output Token

TPOT (Time Per Output Token) measures the average time between each successive generated token after the first one. This is entirely determined by decode speed — specifically by HBM bandwidth, batch size, and model size.

TPOT is what users feel as "streaming speed" — how fast the text appears to flow after it starts. For long responses (essays, code, analyses), TPOT dominates the total wait time. A TPOT of 25ms means 40 tokens per second, several times faster than most people read, so the stream comfortably keeps ahead of the reader.
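Both metrics fall out of the arrival timestamps of a streamed response. The following is a minimal sketch; the function name and the synthetic trace are hypothetical, not part of nano-vLLM.

```python
def latency_metrics(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, mean TPOT, and total time (seconds) from token arrival times."""
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_sent
    # Gaps between successive tokens after the first one.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "tpot": tpot, "total": token_times[-1] - request_sent}


# Synthetic trace: first token after 0.4 s, then one token every 25 ms, 100 tokens total.
times = [0.4 + 0.025 * i for i in range(100)]
m = latency_metrics(0.0, times)
print(f"TTFT={m['ttft']:.3f}s  TPOT={m['tpot'] * 1e3:.1f}ms  total={m['total']:.3f}s")
```

For a response of n tokens the total time is TTFT + (n − 1) × TPOT, so a long prompt with a short answer is dominated by TTFT, and a short prompt with a long answer is dominated by TPOT.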

Chunked prefill — the best of both worlds

Prefill and decode are in tension: a prefill pass is one large, long-running step that blocks decode (no sequence gets a new token until it finishes), while decode steps are short but there are many of them. When a very long prompt arrives, every request that is already streaming freezes for the whole prefill. Chunked prefill resolves this tension.

What is chunked prefill?

Instead of processing the entire prompt in one massive forward pass, chunked prefill splits the prompt into smaller chunks — say, 512 tokens at a time. After each chunk is processed, decode steps for existing sequences can be interleaved. The prompt is still fully processed before the new request generates any tokens, but other requests already in DECODING state keep generating throughout.

Without chunked prefill

A 4,000-token prompt takes ~800ms to prefill on a 7B model. During those 800ms, every other sequence in the batch is blocked — no decode steps run. Users already waiting for their responses experience an 800ms pause. TTFT for the new request is 800ms. This is called prefill starvation.

With chunked prefill (chunk size = 512 tokens)

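The 4,000-token prompt is split into 8 chunks of up to 512 tokens (the last one is shorter). After each chunk (≈100ms), the scheduler runs a decode step for all running sequences. Existing responses continue streaming. The new request's TTFT is still roughly 800ms in total (slightly more once the interleaved decode steps are counted), but other users see no long pause. GPU utilisation stays high throughout. Note that enforce_eager is not the switch for this: in vLLM-style engines that flag controls CUDA graph capture, so check the scheduler in your nano-vLLM version to confirm whether prefill is chunked.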

Why not just make chunks tiny?

Smaller chunks mean more frequent interleaving — better for existing sequences, but worse for the new request's TTFT (more overhead per chunk). Larger chunks process more tokens per GPU step but block decode longer. The optimal chunk size is a tuning parameter: typically 512–2048 tokens, balancing TTFT against decode-step interruption frequency. nano-vLLM uses the scheduler's → Ch.05 continuous batching loop to mix prefill chunks and decode steps naturally in the same batch.
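The chunking arithmetic from the example can be sketched in a few lines. This is a simplified model, not nano-vLLM's scheduler: the helper name is hypothetical, and prefill cost is assumed to be linear in token count, using the 4,000 tokens in ~800 ms figure from the example.

```python
def split_into_chunks(prompt_len: int, chunk_size: int) -> list[int]:
    """Token count of each prefill chunk; the last chunk may be shorter."""
    return [min(chunk_size, prompt_len - start)
            for start in range(0, prompt_len, chunk_size)]


# Assumption from the example above: 4,000 prompt tokens prefill in ~800 ms.
MS_PER_PREFILL_TOKEN = 800 / 4000

chunks = split_into_chunks(prompt_len=4000, chunk_size=512)
stall_unchunked_ms = 4000 * MS_PER_PREFILL_TOKEN       # decode waits for the whole prompt
stall_chunked_ms = max(chunks) * MS_PER_PREFILL_TOKEN  # decode waits for one chunk at most
print(f"{len(chunks)} chunks, last has {chunks[-1]} tokens")
print(f"longest decode stall: {stall_unchunked_ms:.0f} ms -> {stall_chunked_ms:.0f} ms")
```

The total prefill work is unchanged; what shrinks is the longest stretch during which running sequences receive no decode step. Changing chunk_size in this sketch shows the trade-off directly: the stall scales with the chunk size, while the number of scheduling steps scales inversely.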

How the two phases appear in code

The clearest place to see the prefill/decode split is in the attention layer — the same function handles both phases, but dispatches to a completely different underlying kernel based on seq.is_prefill.

layers/attention.py — phase-aware forward pass
def forward(self, q, k, v, kv_cache, slot_mapping, block_table, is_prefill):
    # Step 1: Write this step's new K and V to the cache (both phases do this)
    store_kvcache_kernel[num_tokens,](k, v, kv_cache, slot_mapping, ...)

    if is_prefill:
        # ── PREFILL PATH ──────────────────────────────────────────────
        # All input tokens are in q, k, v as contiguous tensors.
        # flash_attn_varlen_func handles variable-length sequences
        # in a single batch, masking future tokens (causal mask).
        # This is compute-intensive: full attention over N tokens.
        out = flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens,   # cumulative sequence lengths for batching
            cu_seqlens_k=cu_seqlens,
            max_seqlen_q=max_seqlen,
            max_seqlen_k=max_seqlen,   # required argument; equals max_seqlen_q when q and k cover the same tokens
            causal=True,               # don't attend to future tokens
        )
    else:
        # ── DECODE PATH ───────────────────────────────────────────────
        # Only the new token is in q. K and V for all prior tokens
        # are read from the paged KV cache via block_table.
        # flash_attn_with_kvcache handles non-contiguous block reads.
        # This is memory-bandwidth intensive: reads entire KV cache.
        out = flash_attn_with_kvcache(
            q,
            kv_cache[0],              # the full K cache tensor (all blocks)
            kv_cache[1],              # the full V cache tensor (all blocks)
            block_table=block_table,   # logical → physical block mapping
            cache_seqlens=seqlens,     # how many tokens are in cache per sequence
        )
    return out

Same function, two completely different memory access patterns

In prefill, q, k, v are all dense tensors in contiguous memory — fast to access. In decode, q is tiny (one token per sequence), while kv_cache[0] and kv_cache[1] are the entire paged cache tensors; block_table tells the kernel which non-contiguous blocks → Ch.04 belong to each sequence. Reading those scattered blocks adds to the memory traffic that already dominates decode. flash_attn_with_kvcache is highly optimised to do this paged read as efficiently as possible — it's one of the most performance-critical kernels in the entire system.

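The two index structures passed to these kernels are easier to picture with small numbers. The sketch below uses plain Python lists and hypothetical helper names; in the real engine cu_seqlens and block_table are integer tensors on the GPU, and the block size of 4 is chosen only to keep the example readable.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens: list[int]) -> list[int]:
    """Cumulative offsets marking where each sequence starts in the packed batch."""
    return [0, *accumulate(seq_lens)]


def physical_slot(block_table: list[int], position: int, block_size: int) -> int:
    """Map a token's logical position to its slot in the paged KV cache."""
    block_id = block_table[position // block_size]   # physical block holding this token
    return block_id * block_size + position % block_size


# Prefill: three prompts of 5, 3 and 4 tokens packed into one 12-token batch.
cu_seqlens = build_cu_seqlens([5, 3, 4])
print(cu_seqlens)   # sequence i occupies rows cu_seqlens[i]:cu_seqlens[i + 1]

# Decode: one sequence whose logical blocks 0, 1, 2 live in physical blocks 7, 2, 9.
slots = [physical_slot([7, 2, 9], pos, block_size=4) for pos in range(10)]
print(slots)        # neighbouring tokens can sit far apart in the cache tensor
```

The prefill path only needs the offsets, because the tokens themselves are contiguous. The decode path needs a lookup per block, which is why the reads are scattered.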
engine.py — model_runner separates prefill and decode sequences
def run(self, scheduler_output: SchedulerOutput):
    # Build a mixed batch: prefill sequences + decode sequences
    # Each sequence knows its phase via seq.is_prefill
    all_seqs = scheduler_output.prefill + scheduler_output.decode

    # Collect token IDs for the forward pass
    # Prefill sequences contribute ALL their tokens
    # Decode sequences contribute only their LATEST token
    input_ids = []
    for seq in all_seqs:
        if seq.is_prefill:
            input_ids.extend(seq.tokens)       # all N prompt tokens
        else:
            input_ids.append(seq.tokens[-1])   # just the last generated token

    # Single forward pass handles both prefill and decode sequences
    # The attention layer dispatches to the right kernel per sequence
    logits = self.model.forward(input_ids, ...)
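
To see what the collection loop above produces, here is a self-contained toy version. The Seq dataclass is a hypothetical stand-in for the engine's sequence object, carrying only the two fields the loop reads.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    """Hypothetical stand-in for the engine's sequence object."""
    tokens: list[int]
    is_prefill: bool


def collect_input_ids(seqs: list[Seq]) -> list[int]:
    """Flatten a mixed batch into the token IDs fed to one forward pass."""
    input_ids: list[int] = []
    for seq in seqs:
        if seq.is_prefill:
            input_ids.extend(seq.tokens)      # every prompt token
        else:
            input_ids.append(seq.tokens[-1])  # only the newest token
    return input_ids


batch = [
    Seq(tokens=[11, 12, 13, 14], is_prefill=True),   # new request, 4-token prompt
    Seq(tokens=[21, 22, 23], is_prefill=False),      # already decoding
    Seq(tokens=[31, 32], is_prefill=False),          # already decoding
]
ids = collect_input_ids(batch)
print(ids)
```

One prefill sequence contributes four tokens while each decode sequence contributes one, so a single long prompt can dominate the token count of a mixed batch.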

Why understanding this changes how you think about LLM performance

Prefill determines TTFT

Long system prompts, RAG contexts, and long chat histories all increase prefill time. Prefix caching → Ch.07 is the primary way to reduce TTFT — by skipping prefill entirely for tokens already in the cache.

Decode determines throughput

Generating 1,000 tokens for one sequence takes 1,000 decode steps regardless of batch size. Per-sequence speed improves only with faster HBM bandwidth (better hardware) or fewer bytes to read per step (a smaller or quantised model); aggregate throughput improves by batching more sequences into each step.

Flash attention solves both differently

FlashAttention-2 uses tiling to reduce HBM traffic during prefill → Ch.10. flash_attn_with_kvcache handles the paged, scattered reads during decode. Two different problems, two different kernel strategies.

Speculative decoding attacks decode cost

Speculative decoding (not in nano-vLLM, but used in production vLLM) has a small draft model propose several tokens, then verifies them all in a single forward pass of the large model. It exploits the fact that decode is memory-bound and compute-underutilised: checking several tokens costs about the same weight read as generating one.
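
The verification step can be illustrated with a toy greedy variant. This is a simplified sketch with a hypothetical function name: production systems verify against sampled distributions rather than exact greedy matches, and the token lists here stand in for real model outputs.

```python
def accept_draft(draft: list[int], target: list[int]) -> list[int]:
    """Greedy verification of draft tokens against the large model's choices.

    draft[i]  : token proposed by the draft model at position i
    target[i] : token the large model picks at position i, all computed in one
                forward pass; it has one more entry than draft
    """
    accepted: list[int] = []
    for proposed, verified in zip(draft, target):
        if proposed != verified:
            accepted.append(verified)   # first disagreement: keep the large model's token
            return accepted
        accepted.append(proposed)
    accepted.append(target[-1])         # all drafts matched: one bonus token for free
    return accepted


partial = accept_draft(draft=[5, 8, 3, 9], target=[5, 8, 4, 1, 7])
full = accept_draft(draft=[5, 8], target=[5, 8, 6])
print(partial, full)
```

Each verification pass yields at least one token and up to len(draft) + 1, for roughly the weight-read cost of a single decode step.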

Things beginners get wrong about the two phases

✗ Myth 1 — "A faster GPU always means faster token generation"
Reality: "Faster GPU" can mean more compute throughput or faster HBM bandwidth, and these help different phases. More compute helps prefill (compute-bound). Faster HBM bandwidth helps decode (memory-bound). An H100 improves on an A100 in both: roughly 3× the dense FP16 tensor throughput, and about 1.7× the HBM bandwidth (3.35 TB/s vs 2.0 TB/s). Decode speed-ups track the smaller bandwidth ratio, so don't expect token generation to scale with the headline compute figure.
✗ Myth 2 — "Prefill and decode use different model weights"
Reality: Both phases use exactly the same model weights — the same 7 billion parameters, the same weight matrices, the same transformer architecture. The difference is not what is computed, but how much is computed relative to how much data is moved. Prefill does more math per weight read; decode does less math per weight read. Same weights, different utilisation profiles.
✗ Myth 3 — "TTFT and total response time are the same metric"
Reality: TTFT measures only prefill duration — the wait until token 1 appears. Total response time includes all decode steps too. A system with fast prefill but slow decode has great TTFT but poor total time. A system with slow prefill but very fast decode feels unresponsive (long TTFT) but finishes quickly once started. These are orthogonal — you need to optimise them separately with different techniques.

Quiz

Three questions on prefill, decode, and their bottlenecks.

1. A user complains that responses from your LLM service "feel slow to start but stream quickly once they begin." Which metric is the problem, and which phase is causing it?

2. During decode, the GPU's CUDA cores are only 18% utilised but HBM bandwidth is at 95%. What would most improve decode speed?

3. Why does chunked prefill improve the experience for users whose requests are already being decoded — even though their requests are not the ones being prefilled?

What you now know

Chapter 06 — Summary

Prefill is compute-bound. All input tokens processed in parallel, high arithmetic intensity, CUDA cores saturated. Bottleneck: how many compute units you have. Scales well with batch size.

Decode is memory-bandwidth bound. One token per step, 14 GB of weights read from HBM per step (7B model), CUDA cores underutilised. Bottleneck: HBM bandwidth. Adding more CUDA cores does not help.

TTFT measures prefill. TPOT measures decode. High TTFT = slow to start. High TPOT = slow streaming. They're independent — optimise them separately with different techniques.

Same weights, different bottlenecks. Prefill and decode use identical model weights. The difference is arithmetic intensity — how much math is done per byte of data read from HBM.

Chunked prefill bridges the gap. Splitting long prompts into chunks and interleaving decode steps prevents prefill starvation — existing sequences keep streaming without pause during long prompt processing.

Two phases, two attention kernels. Prefill uses flash_attn_varlen_func (dense, contiguous). Decode uses flash_attn_with_kvcache (paged, scattered). Same layer, different dispatch based on seq.is_prefill.