What is a scheduler, and why does it matter?
The GPU is enormously powerful, but expensive and finite. Left unmanaged, it would sit idle between requests, or waste cycles waiting for a slow request while others queue up. The scheduler's one job is to make sure the GPU is always working — processing the right set of requests at every moment, no idle time, no wasted capacity.
This analogy captures the core insight: the old way (static batching) wastes capacity by holding fixed seats for the full journey. The new way (continuous batching) recycles seats the moment they free up. The rest of this chapter explains exactly how nano-vLLM implements the new way.
The problem with waiting for everyone
What is a "batch"?
A batch is a group of requests processed together in a single GPU forward pass. GPUs are designed for parallel computation — running 10 requests simultaneously is barely slower than running 1, up to the hardware limit. So the goal is always to process as many requests as possible at once. The question is: which requests, and when?
Static batching — the old approach
In static batching, a fixed batch of requests is assembled once and processed together from start to finish. Every request in the batch must complete before any new requests can join. If you start with 8 requests and 3 finish early, the GPU processes a batch of 5 for the rest — with 3 slots sitting empty.
Why is this bad? Consider a real scenario: Request A generates 500 tokens (a long essay). Requests B–H each generate 10 tokens (short answers). B–H finish quickly, but the entire batch is held open until A finishes its 500th token. The GPU runs at partial capacity — maybe 15% utilisation — for 490 token steps while waiting for A.
8 requests: 1 long (500 tokens) + 7 short (10 tokens each). Watch how utilisation collapses with static batching but stays high with continuous batching.
With continuous batching, the moment B–H finish at step 10, 7 new requests from the waiting queue fill those slots. The GPU never runs at partial capacity. The throughput difference is not marginal — it can be 5–10× more requests served per second on identical hardware.
Continuous batching — the key idea
Continuous batching treats the set of running requests as a living, breathing pool. After every single decode step, the scheduler checks: did anyone finish? If yes — free their blocks immediately and pull in the next waiting request. The batch composition changes at every step. No request holds a "seat" beyond what it needs right now.
A request's lifecycle — four states
Every request in nano-vLLM is a Sequence object → Ch.02 with a status field. This status is a state machine — a formal way of saying the request can only be in one of four defined states, and transitions between them follow strict rules. The scheduler reads and writes this status on every step.
A state machine is like a traffic light: it can only be red, yellow, or green — never two at once — and it can only change in specific ways (red → green, not red → yellow). State machines make complex behaviour predictable and debuggable. Here are the four states a request moves through:
What the scheduler does on every single step
The scheduler's schedule() method is called once per engine step → Ch.02 — which means once per generated token across the entire running batch. It executes five actions in a fixed order. Understanding this order is the key to understanding the whole system.
Collect finished sequences — free their blocks immediately
First, the scheduler scans the running set for any sequence whose last generated token was an end-of-sequence (EOS) token, or that has reached max_tokens. For each finished sequence: call block_manager.free(seq) → Ch.04 to return its blocks to the free list, then move the sequence to FINISHED. This happens before anything else — freed blocks can immediately be used by new requests in the same step.
Append new slots for decoding sequences
Every sequence in DECODING is about to generate one more token. That new token needs a slot in the KV cache → Ch.03. The scheduler calls block_manager.append_slot(seq) for each decoding sequence — if the sequence has just crossed a block boundary (filled its last block), a new block is allocated from the free list. If no free blocks are available, this triggers preemption (step 4b below).
Promote waiting → running (prefill)
With freed and newly available blocks accounted for, the scheduler tries to promote requests from the waiting queue into the running set. For each waiting sequence: call block_manager.allocate(seq). If it returns True, the sequence moves to PREFILL and joins this step's batch. If it returns False (not enough free blocks), stop trying — all remaining waiting sequences stay queued.
Preempt if memory is critically low (optional)
Preemption is the scheduler's emergency valve. If step 2 fails because there are no free blocks for a decoding sequence, the scheduler must act. It selects a lower-priority running sequence, calls block_manager.free() on it, and moves it back to WAITING. Its KV cache data is lost — it will have to redo its prefill step when re-scheduled. This is expensive, but it prevents the engine from crashing with an out-of-memory error.
Return the batch — prefill sequences + decode sequences
Finally, the scheduler returns a SchedulerOutput object to the engine with two lists: sequences in PREFILL (processing their prompt for the first time) and sequences in DECODING (each generating their next token). The model runner receives this and executes one GPU forward pass covering all of them simultaneously.
Step through the scheduler — see every decision
Click Run One Step to advance the scheduler one step at a time. Watch requests move between the waiting queue and running set, blocks allocate and free, and states transition. The event log explains every decision the scheduler makes.
Preemption — when memory runs out
What happens if the GPU runs out of free KV cache blocks while sequences are still generating? The scheduler can't crash — that would mean every user's request fails. Instead, it uses preemption to recover gracefully.
What preemption means
Preemption means pausing a currently-running sequence mid-generation, freeing its KV cache blocks back to the pool, and moving it back to the WAITING queue. Its blocks are gone — the KV data is lost. When the sequence gets re-scheduled later, it must re-run its prefill phase from scratch to rebuild its KV cache.
The preemption cost — why it's a last resort
Preemption is expensive for two reasons. First, the evicted sequence must redo its entire prefill phase when re-scheduled — which might be hundreds of tokens of GPU compute. Second, it adds latency for the affected user. The scheduler tries to avoid preemption by promoting waiting requests conservatively — only adding new requests if there's clearly enough memory for them to run to completion without triggering preemption.
Preemption priority
nano-vLLM preempts the sequence that arrived most recently (last-in, first-out). The intuition: requests that just started have used the least compute. Evicting them wastes less work than evicting a sequence that has already generated 200 tokens.
Swap vs recompute
Production vLLM can swap KV cache blocks to CPU RAM instead of discarding them — expensive but avoids re-prefill. nano-vLLM takes the simpler path: discard and recompute. Sufficient for its scope, and far easier to reason about.
Prevention is better
The best preemption is the one that never happens. nano-vLLM's scheduler is conservative: it won't promote a waiting request if doing so would leave fewer free blocks than the expected decode length. This is imperfect (decode length is unknown) but keeps preemption rare.
The scheduler in code
Here is nano-vLLM's schedule() method, annotated step by step. Notice how short it is — the complexity lives in the BlockManager, not here.
def schedule(self) -> SchedulerOutput: # ── STEP 1: collect finished sequences and free their blocks ── finished = [s for s in self.running if s.is_finished()] for s in finished: self.block_mgr.free(s) # return blocks to free list immediately self.running.remove(s) # ── STEP 2: append a new slot for every decoding sequence ── # Each decode step needs space for 1 new token's K and V vectors for s in self.running: if not s.is_prefill: ok = self.block_mgr.append_slot(s) if not ok: self._preempt(s) # no free blocks → preempt this sequence # ── STEP 3: promote waiting → running while blocks are available ── promoted = [] while self.waiting: seq = self.waiting[0] # peek at the front of the queue if self.block_mgr.allocate(seq): self.waiting.pop(0) # remove from waiting queue self.running.append(seq) # add to running set promoted.append(seq) else: break # no memory for this request — stop trying # ── STEP 5: return split batch to the engine ── return SchedulerOutput( prefill=[s for s in self.running if s.is_prefill], decode =[s for s in self.running if not s.is_prefill], )
def _preempt(self, seq: Sequence) -> None: # Emergency valve: free this sequence's blocks and send it back to waiting # It will re-run prefill from scratch when re-scheduled self.block_mgr.free(seq) # return all blocks to free list seq.output_tokens = [] # discard generated tokens so far seq.status = SequenceStatus.WAITING self.running.remove(seq) self.waiting.appendleft(seq) # push to front of queue (high priority)
deque (double-ended queue) rather than a regular list. This allows O(1) appendleft() — pushing a preempted sequence back to the front of the queue so it gets re-scheduled first. A regular list's insert(0, item) is O(N), which becomes slow with hundreds of waiting requests. Small detail, real performance impact.
What continuous batching enables
5–10× throughput improvement
Real-world benchmarks show continuous batching delivers 5–10× higher request throughput over static batching on the same hardware, for workloads with mixed response lengths — which is every production workload.
Lower average latency
Short requests no longer wait for long ones. A 10-token response that arrives behind a 500-token request gets a GPU slot the moment one frees up — often within milliseconds, not seconds.
Works hand-in-hand with PagedAttention
Continuous batching only works because PagedAttention → Ch.04 makes block allocation O(1) and fragmentation-free. Without PagedAttention, adding/removing requests mid-batch would require expensive memory reorganisation at every step.
Graceful degradation under load
When the system is overloaded, preemption prevents crashes. Requests wait or retry rather than failing. The engine degrades gracefully: slower responses, never broken ones.
Things beginners get wrong about the scheduler
schedule() runs on every single engine step — which means once per decode token across the batch. There is no upfront plan. Every step is a fresh decision based on the current state of the queues and the free block count. This is what makes it "continuous" — the batch composition can change at any step, not just at the start.Quiz
Three questions to test your understanding of the scheduler. Wrong answers explain exactly why they're wrong.
1. In static batching, 8 requests start together. 7 finish at step 10, but 1 takes 500 steps. What happens to GPU utilisation between steps 11 and 500?
2. The scheduler checks for finished sequences as its FIRST action every step, before allocating blocks to new requests. Why does this order matter?
3. A preempted sequence is moved back to the front of the waiting queue (not the back). Why?
What you now know
Static batching wastes the GPU. A fixed batch locks slots until the slowest request finishes. Mixed-length workloads cause utilisation to collapse to single digits within a few steps.
Continuous batching recycles slots instantly. After every single step, finished sequences free their blocks and new requests fill the vacated slots. Utilisation stays near 100%.
Four states, strict transitions. Every request passes through WAITING → PREFILL → DECODING → FINISHED. The scheduler reads and writes these states on every step.
schedule() runs five steps every time. Free finished → append decode slots → promote waiting → preempt if OOM → return batch. Order matters: free first so freed blocks can be reused immediately.
Preemption is the graceful OOM handler. When blocks run out, the most recent running sequence is evicted back to waiting. Expensive (re-prefill required) but prevents crashes. The waiting deque's front placement minimises re-schedule latency.
Continuous batching needs PagedAttention. O(1) block allocation and zero fragmentation make per-step scheduling decisions cheap. Without PagedAttention, changing the batch every step would be too expensive to be practical.