The model proposes. The sampler decides.
In every decode step → Ch.06, the model produces one number for every token in its vocabulary: about 50,000 numbers, called logits. These are raw scores, not a decision. The sampler's job is to turn those 50,000 scores into a single chosen token, and how it makes that choice shapes the character of the output.
This chapter walks through each "judging style" — what it does mathematically, when to use it, and how it appears in nano-vLLM's sampler. By the end, you'll understand exactly what happens when you set temperature=0.7 in an API call.
Logits and softmax — the foundation
What is a logit?
A logit is a raw, unbounded score the model assigns to each possible next token. It can be any number — positive, negative, large, small. A logit of 9.7 means the model thinks this token is very likely; a logit of -3.2 means it thinks it's very unlikely. But logits are not probabilities — they don't add up to 100%, and a logit of 9.7 doesn't directly tell you "97% likely". They're just relative scores.
What is softmax?
To make a decision, we need to convert those raw logits into proper probabilities — numbers between 0 and 1 that sum to exactly 100%. The function that does this is called softmax. It takes the whole list of logits and produces a probability for each, where higher logits get higher probabilities, and everything adds up to 1.
A simplified example with 5 candidate tokens for the sentence "The weather today is ___":
Logits: raw scores that don't sum to anything meaningful. "purple" scores negative because the model knows it's unlikely.
Probabilities after softmax: the values now sum to ~100%. "sunny" is most likely but not guaranteed. "purple" is nearly 0%.
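The conversion from logits to probabilities can be sketched in a few lines of plain Python. This is a minimal illustration, not nano-vLLM's code: the logit values and the token names "cloudy" and "rainy" are made up for the example.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the largest logit keeps exp() from overflowing
    # and does not change the resulting probabilities.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for five candidate tokens (illustrative values only).
tokens = ["sunny", "cloudy", "rainy", "cold", "purple"]
logits = [2.0, 1.6, 0.9, 0.5, -3.0]

probs = softmax(logits)
for token, p in zip(tokens, probs):
    print(f"{token:>7}: {p:.3f}")
print("sum:", round(sum(probs), 6))
```

The exponential is what turns a modest gap between logits into a large gap between probabilities, and it is also why a negative logit ends up close to 0% rather than below it.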
Once we have probabilities, the sampler can make its choice. The different sampling strategies are all about how to use these probabilities — and how to reshape them before choosing.
Greedy decoding — always pick the top
Greedy decoding is the simplest strategy: always pick the token with the highest probability. No randomness, no chance for second place. In our weather example, greedy always picks "sunny" because it has the highest score (40%). The technical name for "pick the highest" is argmax — the argument (position) of the maximum value.
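As a rough sketch, argmax is a one-liner in plain Python. The probabilities below reuse the 40% for "sunny" from the example; the other token names and values are assumptions added for illustration.

```python
def greedy_pick(scores):
    """Return the index of the highest score (argmax)."""
    return max(range(len(scores)), key=lambda i: scores[i])

tokens = ["sunny", "cloudy", "rainy", "cold", "windy"]
probs = [0.40, 0.27, 0.13, 0.12, 0.08]  # hypothetical values after softmax

choice = greedy_pick(probs)
print(tokens[choice])  # sunny

# No randomness is involved: repeating the call gives the same answer.
assert all(greedy_pick(probs) == choice for _ in range(5))
```

Because softmax preserves the ordering of the scores, the same function gives the same answer when applied to the raw logits, so a greedy sampler can skip softmax altogether.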
✓ When greedy is good
Factual Q&A, code generation, math, translation, structured extraction — anywhere there's a single correct answer and you want it deterministically. Greedy is also reproducible: same prompt always gives the same output, which matters for testing and debugging.
✗ When greedy fails
Creative writing, brainstorming, dialogue — anywhere you want variety. Greedy produces repetitive, predictable text. Ask it to "write a poem" twice and you get the identical poem. It also tends to get stuck in loops, repeating the same phrase, because the highest-probability continuation of a repeated phrase is often to repeat it again.
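temperature=0 activates greedy decoding. Dividing logits by zero is undefined, so samplers treat 0 as a special case and take the argmax directly; it is also the limit the distribution approaches as temperature shrinks toward 0. When you want deterministic, reproducible output, set temperature to 0 and you get pure argmax, every time.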
Temperature — the creativity dial
Temperature controls how much randomness the sampler allows. It works by scaling the logits before softmax: each logit is divided by the temperature value. This simple division has a powerful effect on the shape of the probability distribution.
Low temperature (0.1 – 0.5) — sharper, more confident
Dividing by a small number (like 0.2) makes large logits even larger relative to small ones — exaggerating the differences. After softmax, the top token's probability shoots up toward 100%, and everything else collapses toward 0%. The model becomes very confident and focused. Output is consistent and safe, close to greedy.
Temperature 1.0 — the model's natural distribution
Dividing by 1.0 changes nothing — the logits pass through unchanged. This is the model's "true" probability distribution as it learned during training. A balanced default for general use: some variety, but still grounded in what the model considers likely.
High temperature (1.2 – 2.0) — flatter, more random
Dividing by a large number (like 1.8) shrinks the gaps between logits — making all options more equal. After softmax, the distribution flattens: unlikely tokens get a real chance of being picked. Output becomes creative and surprising — but push too high and it degrades into incoherent nonsense, because genuinely bad tokens start getting selected.
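To see the three regimes side by side, here is a small sketch that divides the same logits by different temperatures before softmax. The logit values are hypothetical, and the function name is just for this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply softmax."""
    if temperature <= 0:
        # Temperature 0 is handled as greedy argmax, not by dividing by zero.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.6, 0.9, 0.5, -3.0]  # hypothetical scores, best token first

for t in (0.2, 1.0, 1.8):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top={probs[0]:.3f}  worst={probs[-1]:.6f}")
```

As the temperature rises, the top token's share falls and the worst token's share grows, which is the flattening described above.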
Top-k and top-p — trimming the candidates
Temperature reshapes the entire distribution, but it never fully removes bad options: at any temperature above 0, a terrible token retains a small chance, and raising the temperature makes that chance larger. Top-k and top-p solve this by cutting off the unlikely tokens entirely before sampling.
Top-k sampling — keep the best K
Top-k keeps only the K highest-probability tokens and discards everything else. With top-k=50, only the 50 most likely tokens can be selected; the other ~49,950 are removed (their probability set to zero). Then sampling happens among those 50. This guarantees the model never picks something wildly unlikely, while still allowing variety among reasonable options.
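A minimal sketch of top-k in plain Python, assuming a small hypothetical distribution. Unlike a threshold-based filter, this version keeps exactly K tokens even when several share the same probability.

```python
def top_k_filter(probs, k):
    """Zero out everything except the k most likely tokens, then renormalise."""
    # Indices of the k largest probabilities.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(ranked[:k])
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

probs = [0.40, 0.27, 0.13, 0.12, 0.08]  # hypothetical values
filtered = top_k_filter(probs, k=3)
print([round(p, 2) for p in filtered])
```

The survivors are rescaled so they sum to 1 again, which is why each of the three kept tokens ends up more likely than it was before filtering.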
Top-p (nucleus) sampling — keep the smallest group covering P%
Top-p, also called nucleus sampling, fixes top-k's rigidity. Instead of a fixed count, it keeps the smallest set of tokens whose probabilities add up to at least P. With top-p=0.9, you sort tokens by probability and keep adding them to the shortlist until their combined probability reaches 90% — then stop. The shortlist size adapts automatically to the model's confidence.
Top-k = 3: always keeps exactly 3, regardless of their probabilities. "cold" and "windy" are cut.
Top-p = 0.8: keeps tokens until cumulative ≥ 80%: 40+27+13 = 80%. Here it also lands on 3, but if "sunny" were 85%, it would keep just 1.
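The adaptive shortlist can be sketched like this. Both distributions are hypothetical, and the first call uses top_p = 0.75 rather than 0.8 so the example stays clear of floating-point rounding right at the threshold.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose total reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in ranked:
        kept.add(i)                # the token that crosses the threshold is kept
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

def survivors(probs, top_p):
    """Count how many tokens remain selectable after filtering."""
    return sum(1 for p in top_p_filter(probs, top_p) if p > 0)

spread = [0.40, 0.27, 0.13, 0.12, 0.08]        # model is unsure
confident = [0.96, 0.02, 0.01, 0.005, 0.005]   # model is nearly certain

print(survivors(spread, 0.75))     # 3
print(survivors(confident, 0.9))   # 1
```

When one token already covers more than P on its own, the shortlist shrinks to that single token and sampling behaves like greedy for that step.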
The sampler in code
nano-vLLM's sampler lives in sampler.py → Ch.02. It runs on the GPU, taking the raw logits from the model's final layer and producing one token ID per sequence. Here it is, annotated:
import torch
from dataclasses import dataclass
from torch import Tensor


@dataclass
class SamplingParams:
    # Minimal stand-in so this listing runs on its own: just the fields sample() reads
    temperature: float
    top_k: int
    top_p: float


def sample(logits: Tensor, params: SamplingParams) -> Tensor:
    # logits shape: [batch_size, vocab_size], one row per sequence,
    # 50,000 raw scores per row

    # ── GREEDY SHORTCUT: temperature 0 means pure argmax ──
    if params.temperature == 0:
        return logits.argmax(dim=-1)  # pick the single highest-scoring token

    # ── STEP 1: temperature scaling ──
    # Divide all logits by temperature. Smaller temp = sharper distribution.
    logits = logits / params.temperature

    # ── STEP 2: top-k filtering ──
    # Keep only the K highest logits; set the rest to -infinity
    # (-inf becomes 0 after softmax, removing those tokens entirely)
    if params.top_k > 0:
        # Find the K-th largest value as a threshold (K capped at vocab size)
        k = min(params.top_k, logits.size(-1))
        kth_value = torch.topk(logits, k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float('-inf'))

    # ── STEP 3: convert to probabilities via softmax ──
    probs = torch.softmax(logits, dim=-1)

    # ── STEP 4: top-p (nucleus) filtering ──
    # Sort by probability, keep the smallest set summing to >= top_p
    if params.top_p < 1.0:
        probs = _apply_top_p(probs, params.top_p)

    # ── STEP 5: sample one token from the final distribution ──
    # multinomial picks an index weighted by probability; this is
    # where the actual randomness happens
    return torch.multinomial(probs, num_samples=1).squeeze(-1)


def _apply_top_p(probs: Tensor, top_p: float) -> Tensor:
    # Sort probabilities descending, track the cumulative sum
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)

    # Mark tokens to remove: those whose higher-ranked tokens already reach
    # top_p. The token that crosses the threshold is kept.
    remove = cumulative - sorted_probs >= top_p
    sorted_probs[remove] = 0.0

    # Restore original token order and renormalise so probs sum to 1 again
    probs = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
    return probs / probs.sum(dim=-1, keepdim=True)
torch.multinomial is the one place where chance enters. Given a probability distribution, it rolls a weighted die and returns one index. A token with 40% probability gets picked roughly 40% of the time across many runs. This is why, with temperature > 0, the same prompt can produce different outputs each time — multinomial makes a fresh random draw every decode step. With temperature = 0, we skip multinomial entirely and use argmax, which is deterministic.
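You can check the "weighted die" claim with a quick experiment. This sketch uses Python's random.choices as a stand-in for torch.multinomial, with hypothetical probabilities and a fixed seed so the run is repeatable.

```python
import random

def sample_index(probs, rng):
    """Draw one index, weighted by probability (the role multinomial plays)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.40, 0.27, 0.13, 0.12, 0.08]  # hypothetical final distribution
rng = random.Random(0)                  # fixed seed: same draws on every run

draws = [sample_index(probs, rng) for _ in range(10_000)]
share_of_first = draws.count(0) / len(draws)
print(f"token 0 picked {share_of_first:.1%} of the time")
```

Over many draws the share of token 0 settles near 40%, but any single draw can land on any token with nonzero probability, which is exactly why two runs of the same prompt can diverge.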
Choosing the right strategy for the job
Factual / code → temperature 0
Q&A, code generation, math, data extraction, classification. You want the single most likely answer, deterministically reproducible. Greedy decoding. No surprises.
Balanced chat → temp 0.7, top-p 0.9
General conversation, explanations, assistance. A common starting point for chat-style deployments: enough variety to feel natural, enough grounding to stay coherent.
Creative → temp 1.0+, top-p 0.95
Story writing, brainstorming, poetry, marketing copy. Higher temperature and a generous top-p let the model explore unusual word choices and surprising directions. More variety, occasionally more risk.
Avoid → temp 2.0, no filtering
A very high temperature with no top-k or top-p produces near-random gibberish, because genuinely bad tokens get selected. Almost never useful in practice; it is included here so you recognise the failure mode when you see incoherent output.
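One way to keep these recommendations in one place is a small preset table. The names SAMPLING_PRESETS and settings_for are hypothetical, as are two choices the text does not specify: top_p = 1.0 (no filtering) for the factual preset, and falling back to the chat preset for unknown tasks.

```python
# Hypothetical preset table mirroring the recommendations above.
SAMPLING_PRESETS = {
    "factual":  {"temperature": 0.0, "top_p": 1.0},   # greedy; top_p has no effect
    "chat":     {"temperature": 0.7, "top_p": 0.9},
    "creative": {"temperature": 1.0, "top_p": 0.95},
}

def settings_for(task):
    """Return a copy of the preset for a task, defaulting to the chat preset."""
    return dict(SAMPLING_PRESETS.get(task, SAMPLING_PRESETS["chat"]))

print(settings_for("factual"))
print(settings_for("poetry"))  # unknown task, so the chat preset is used
```

Returning a copy means a caller can tweak one request's settings without changing the shared presets.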
Quiz
Three questions on sampling.
1. You're building a system that extracts structured data (names, dates, amounts) from documents. Which sampling setting should you use?
2. The model is very confident — token "Paris" has 96% probability for "The capital of France is ___". With top-p = 0.9, how many tokens survive the filter?
3. Where in the sampling pipeline does the actual randomness happen?
What you now know
Logits are raw scores; softmax makes probabilities. The model outputs ~50,000 unbounded logits per step. Softmax converts them into probabilities that sum to 100%, amplifying the gaps via an exponential curve.
Greedy = always pick the top (argmax). Deterministic and reproducible. Best for factual answers, code, and extraction. Equivalent to temperature 0. Repetitive and loop-prone for creative work.
Temperature is the randomness dial. Logits divided by temperature before softmax. Low = sharp and focused. 1.0 = the model's natural distribution. High = flat and creative, but eventually incoherent.
Top-k keeps a fixed count; top-p adapts. Top-k keeps the best K tokens regardless of confidence. Top-p (nucleus) keeps the smallest set covering P% of probability — growing and shrinking with the model's certainty.
Strategies apply in sequence. Temperature → top-k → softmax → top-p → multinomial. Each step reshapes the distribution; the final multinomial draw is the single source of randomness.
Match the strategy to the task. temp 0 for facts/code, temp 0.7 + top-p 0.9 for balanced chat, temp 1.0+ for creative. Stacking aggressive filters over-constrains output — usually pick one truncation method.