Chapter 05 of 11 · nano-vLLM Deep Dive
05

The Scheduler

How nano-vLLM keeps the GPU busy every single step — the waiting queue, the running set, state transitions, and what happens when memory runs out.

← Ch04: PagedAttention Next: Prefill vs Decode →

What is a scheduler, and why does it matter?

The GPU is enormously powerful, but expensive and finite. Left unmanaged, it would sit idle between requests, or waste cycles waiting for a slow request while others queue up. The scheduler's one job is to make sure the GPU is always working — processing the right set of requests at every moment, no idle time, no wasted capacity.

The Subway Train Analogy Imagine two ways to run a subway line. The old way (static batching): the train waits at the station until every seat is filled, then departs — and doesn't open its doors again until it returns to the station, no matter how many people are waiting on the platform. If three passengers get off halfway, those seats sit empty for the rest of the journey. The new way (continuous batching): the train runs on a loop with rolling doors — at every station it lets off passengers who've arrived at their destination, and immediately lets on new passengers to fill those exact seats. The train is always full and always moving. The scheduler is the train controller. Requests are passengers. The GPU is the train.

This analogy captures the core insight: the old way (static batching) wastes capacity by holding fixed seats for the full journey. The new way (continuous batching) recycles seats the moment they free up. The rest of this chapter explains exactly how nano-vLLM implements the new way.

The problem with waiting for everyone

What is a "batch"?

A batch is a group of requests processed together in a single GPU forward pass. GPUs are designed for parallel computation — running 10 requests simultaneously is barely slower than running 1, up to the hardware limit. So the goal is always to process as many requests as possible at once. The question is: which requests, and when?

Static batching — the old approach

In static batching, a fixed batch of requests is assembled once and processed together from start to finish. Every request in the batch must complete before any new requests can join. If you start with 8 requests and 3 finish early, the GPU processes a batch of 5 for the rest — with 3 slots sitting empty.

Why is this bad? Consider a real scenario: Request A generates 500 tokens (a long essay). Requests B–H each generate 10 tokens (short answers). B–H finish quickly, but the entire batch is held open until A finishes its 500th token. The GPU runs at partial capacity — maybe 15% utilisation — for 490 token steps while waiting for A.

GPU utilisation over time

8 requests: 1 long (500 tokens) + 7 short (10 tokens each). Watch how utilisation collapses with static batching but stays high with continuous batching.

Static batching — GPU utilisation over 500 steps~23% average
Steps 1–10
IDLE — 7 empty slots waiting for Request A (steps 11–500)
Continuous batching — GPU utilisation over 500 steps~95% average
All 8 requests
New requests fill freed slots instantly — stays near full

With continuous batching, the moment B–H finish at step 10, 7 new requests from the waiting queue fill those slots. The GPU never runs at partial capacity. The throughput difference is not marginal — it can be 5–10× more requests served per second on identical hardware.

Continuous batching — the key idea

Continuous batching treats the set of running requests as a living, breathing pool. After every single decode step, the scheduler checks: did anyone finish? If yes — free their blocks immediately and pull in the next waiting request. The batch composition changes at every step. No request holds a "seat" beyond what it needs right now.

Why "after every step" and not "after every N steps"? Because tokens complete at unpredictable times. You don't know when a request will emit an end-of-sequence token. If you checked only every 10 steps, a request that finished at step 3 would hold a GPU slot idle for 7 unnecessary steps. Checking every step means the minimum possible idle time — at the cost of a tiny scheduling overhead per step, which is negligible compared to the GPU compute saved.

A request's lifecycle — four states

Every request in nano-vLLM is a Sequence object → Ch.02 with a status field. This status is a state machine — a formal way of saying the request can only be in one of four defined states, and transitions between them follow strict rules. The scheduler reads and writes this status on every step.

A state machine is like a traffic light: it can only be red, yellow, or green — never two at once — and it can only change in specific ways (red → green, not red → yellow). State machines make complex behaviour predictable and debuggable. Here are the four states a request moves through:

Click a state to understand it
WAITING In queue
blocks available
PREFILL Processing prompt
prompt done
DECODING Generating tokens
EOS or max_tokens
FINISHED Done
← Click any state above to understand what happens in it and what triggers the transition.

What the scheduler does on every single step

The scheduler's schedule() method is called once per engine step → Ch.02 — which means once per generated token across the entire running batch. It executes five actions in a fixed order. Understanding this order is the key to understanding the whole system.

1

Collect finished sequences — free their blocks immediately

First, the scheduler scans the running set for any sequence whose last generated token was an end-of-sequence (EOS) token, or that has reached max_tokens. For each finished sequence: call block_manager.free(seq) → Ch.04 to return its blocks to the free list, then move the sequence to FINISHED. This happens before anything else — freed blocks can immediately be used by new requests in the same step.

2

Append new slots for decoding sequences

Every sequence in DECODING is about to generate one more token. That new token needs a slot in the KV cache → Ch.03. The scheduler calls block_manager.append_slot(seq) for each decoding sequence — if the sequence has just crossed a block boundary (filled its last block), a new block is allocated from the free list. If no free blocks are available, this triggers preemption (step 4b below).

3

Promote waiting → running (prefill)

With freed and newly available blocks accounted for, the scheduler tries to promote requests from the waiting queue into the running set. For each waiting sequence: call block_manager.allocate(seq). If it returns True, the sequence moves to PREFILL and joins this step's batch. If it returns False (not enough free blocks), stop trying — all remaining waiting sequences stay queued.

4

Preempt if memory is critically low (optional)

Preemption is the scheduler's emergency valve. If step 2 fails because there are no free blocks for a decoding sequence, the scheduler must act. It selects a lower-priority running sequence, calls block_manager.free() on it, and moves it back to WAITING. Its KV cache data is lost — it will have to redo its prefill step when re-scheduled. This is expensive, but it prevents the engine from crashing with an out-of-memory error.

5

Return the batch — prefill sequences + decode sequences

Finally, the scheduler returns a SchedulerOutput object to the engine with two lists: sequences in PREFILL (processing their prompt for the first time) and sequences in DECODING (each generating their next token). The model runner receives this and executes one GPU forward pass covering all of them simultaneously.

Prefill and decode run in the same forward pass A single call to the transformer model can process prefill sequences and decode sequences in the same batch. They use different attention variants internally (flash_attn_varlen_func for prefill, flash_attn_with_kvcache for decode → Ch.10), but from the scheduler's perspective they're all just "sequences to process this step." This is called chunked prefill or mixed batching, and it's what makes continuous batching so effective — you're never waiting for prefill to finish before decode can start.

Step through the scheduler — see every decision

Click Run One Step to advance the scheduler one step at a time. Watch requests move between the waiting queue and running set, blocks allocate and free, and states transition. The event log explains every decision the scheduler makes.

Scheduler simulator
0
Engine step
20
Free blocks
0
Running
0
Finished
⏳ WAITING QUEUE 0 requests
▶ RUNNING SET 0 requests
Scheduler event log

Preemption — when memory runs out

What happens if the GPU runs out of free KV cache blocks while sequences are still generating? The scheduler can't crash — that would mean every user's request fails. Instead, it uses preemption to recover gracefully.

What preemption means

Preemption means pausing a currently-running sequence mid-generation, freeing its KV cache blocks back to the pool, and moving it back to the WAITING queue. Its blocks are gone — the KV data is lost. When the sequence gets re-scheduled later, it must re-run its prefill phase from scratch to rebuild its KV cache.

The Overbooked Flight Analogy Preemption is like an airline bumping a standby passenger off a flight that's overbooked. The passenger (sequence) was already seated (running), but a higher-priority passenger needs the seat (memory blocks). The bumped passenger gets a voucher (is moved back to waiting) and will board the next available flight (get re-scheduled when blocks free up). It's disruptive and costly — they have to re-board — but it keeps the airline (engine) operating rather than cancelling all flights.

The preemption cost — why it's a last resort

Preemption is expensive for two reasons. First, the evicted sequence must redo its entire prefill phase when re-scheduled — which might be hundreds of tokens of GPU compute. Second, it adds latency for the affected user. The scheduler tries to avoid preemption by promoting waiting requests conservatively — only adding new requests if there's clearly enough memory for them to run to completion without triggering preemption.

Preemption priority

nano-vLLM preempts the sequence that arrived most recently (last-in, first-out). The intuition: requests that just started have used the least compute. Evicting them wastes less work than evicting a sequence that has already generated 200 tokens.

Swap vs recompute

Production vLLM can swap KV cache blocks to CPU RAM instead of discarding them — expensive but avoids re-prefill. nano-vLLM takes the simpler path: discard and recompute. Sufficient for its scope, and far easier to reason about.

Prevention is better

The best preemption is the one that never happens. nano-vLLM's scheduler is conservative: it won't promote a waiting request if doing so would leave fewer free blocks than the expected decode length. This is imperfect (decode length is unknown) but keeps preemption rare.

The scheduler in code

Here is nano-vLLM's schedule() method, annotated step by step. Notice how short it is — the complexity lives in the BlockManager, not here.

core/scheduler.py — schedule() method
def schedule(self) -> SchedulerOutput:
    # ── STEP 1: collect finished sequences and free their blocks ──
    finished = [s for s in self.running if s.is_finished()]
    for s in finished:
        self.block_mgr.free(s)      # return blocks to free list immediately
        self.running.remove(s)

    # ── STEP 2: append a new slot for every decoding sequence ──
    # Each decode step needs space for 1 new token's K and V vectors
    for s in self.running:
        if not s.is_prefill:
            ok = self.block_mgr.append_slot(s)
            if not ok:
                self._preempt(s)    # no free blocks → preempt this sequence

    # ── STEP 3: promote waiting → running while blocks are available ──
    promoted = []
    while self.waiting:
        seq = self.waiting[0]         # peek at the front of the queue
        if self.block_mgr.allocate(seq):
            self.waiting.pop(0)       # remove from waiting queue
            self.running.append(seq)  # add to running set
            promoted.append(seq)
        else:
            break                      # no memory for this request — stop trying

    # ── STEP 5: return split batch to the engine ──
    return SchedulerOutput(
        prefill=[s for s in self.running if s.is_prefill],
        decode =[s for s in self.running if not s.is_prefill],
    )
core/scheduler.py — _preempt() helper
def _preempt(self, seq: Sequence) -> None:
    # Emergency valve: free this sequence's blocks and send it back to waiting
    # It will re-run prefill from scratch when re-scheduled
    self.block_mgr.free(seq)          # return all blocks to free list
    seq.output_tokens = []             # discard generated tokens so far
    seq.status = SequenceStatus.WAITING
    self.running.remove(seq)
    self.waiting.appendleft(seq)     # push to front of queue (high priority)
Why waiting is a deque, not a list The waiting queue uses Python's deque (double-ended queue) rather than a regular list. This allows O(1) appendleft() — pushing a preempted sequence back to the front of the queue so it gets re-scheduled first. A regular list's insert(0, item) is O(N), which becomes slow with hundreds of waiting requests. Small detail, real performance impact.

What continuous batching enables

5–10× throughput improvement

Real-world benchmarks show continuous batching delivers 5–10× higher request throughput over static batching on the same hardware, for workloads with mixed response lengths — which is every production workload.

Lower average latency

Short requests no longer wait for long ones. A 10-token response that arrives behind a 500-token request gets a GPU slot the moment one frees up — often within milliseconds, not seconds.

Works hand-in-hand with PagedAttention

Continuous batching only works because PagedAttention → Ch.04 makes block allocation O(1) and fragmentation-free. Without PagedAttention, adding/removing requests mid-batch would require expensive memory reorganisation at every step.

Graceful degradation under load

When the system is overloaded, preemption prevents crashes. Requests wait or retry rather than failing. The engine degrades gracefully: slower responses, never broken ones.

Things beginners get wrong about the scheduler

✗ Myth 1 — "The scheduler runs once at the start and builds a plan"
Reality: The scheduler's schedule() runs on every single engine step — which means once per decode token across the batch. There is no upfront plan. Every step is a fresh decision based on the current state of the queues and the free block count. This is what makes it "continuous" — the batch composition can change at any step, not just at the start.
✗ Myth 2 — "Preemption means the user's request fails"
Reality: Preemption is transparent to the user. Their request is moved back to the waiting queue and re-scheduled when memory is available. They experience a longer wait time — but their request eventually completes. The output is identical to a non-preempted run because the model deterministically regenerates the same tokens (with temperature=0) or equivalent ones (with sampling). The user never sees an error.
✗ Myth 3 — "Continuous batching means all requests start and finish together"
Reality: That is the definition of static batching — the thing continuous batching replaces. In continuous batching, every request starts and finishes independently. A batch at step 47 might contain Request A (on token 200), Request B (on token 5), and Request C (just starting its prefill). They have nothing in common except that they're all being processed in the same GPU forward pass right now.

Quiz

Three questions to test your understanding of the scheduler. Wrong answers explain exactly why they're wrong.

1. In static batching, 8 requests start together. 7 finish at step 10, but 1 takes 500 steps. What happens to GPU utilisation between steps 11 and 500?

2. The scheduler checks for finished sequences as its FIRST action every step, before allocating blocks to new requests. Why does this order matter?

3. A preempted sequence is moved back to the front of the waiting queue (not the back). Why?

What you now know

Chapter 05 — Summary

Static batching wastes the GPU. A fixed batch locks slots until the slowest request finishes. Mixed-length workloads cause utilisation to collapse to single digits within a few steps.

Continuous batching recycles slots instantly. After every single step, finished sequences free their blocks and new requests fill the vacated slots. Utilisation stays near 100%.

Four states, strict transitions. Every request passes through WAITING → PREFILL → DECODING → FINISHED. The scheduler reads and writes these states on every step.

schedule() runs five steps every time. Free finished → append decode slots → promote waiting → preempt if OOM → return batch. Order matters: free first so freed blocks can be reused immediately.

Preemption is the graceful OOM handler. When blocks run out, the most recent running sequence is evicted back to waiting. Expensive (re-prefill required) but prevents crashes. The waiting deque's front placement minimises re-schedule latency.

Continuous batching needs PagedAttention. O(1) block allocation and zero fragmentation make per-step scheduling decisions cheap. Without PagedAttention, changing the batch every step would be too expensive to be practical.