Chapter 08 of 11 · nano-vLLM Deep Dive

Sampling Strategies

The model outputs 50,000 scores. The sampler picks one. How that choice is made — greedy, temperature, top-k, top-p — shapes whether your AI sounds robotic, creative, or unhinged.

The model proposes. The sampler decides.

In every decode step (see Ch.06), the model produces one number for every token in its vocabulary: about 50,000 numbers, called logits (the exact count depends on the model's tokenizer). These are raw scores, not a decision. The sampler's job is to turn those scores into a single chosen token for each sequence. How it makes that choice shapes the character of the output.

The Talent Show Judge Analogy

Imagine 50,000 contestants each get a score from the model: say 9.7, 9.6, 8.1, 2.3, and so on down to near zero. The sampler is the judge who picks the winner. A strict judge (greedy) always picks the single highest score: same winner every time, predictable but boring. A lenient judge (high temperature) gives lower-scoring contestants a real chance: more surprising, sometimes brilliant, sometimes nonsense. A judge who only considers the top 50 (top-k) ignores everyone below rank 50 entirely. A judge who considers the smallest group that covers 90% of the votes (top-p) adapts the shortlist size to how confident the scores are. Each judging style produces a different kind of winner.

This chapter walks through each "judging style" — what it does mathematically, when to use it, and how it appears in nano-vLLM's sampler. By the end, you'll understand exactly what happens when you set temperature=0.7 in an API call.

Logits and softmax — the foundation

What is a logit?

A logit is a raw, unbounded score the model assigns to each possible next token. It can be any number — positive, negative, large, small. A logit of 9.7 means the model thinks this token is very likely; a logit of -3.2 means it thinks it's very unlikely. But logits are not probabilities — they don't add up to 100%, and a logit of 9.7 doesn't directly tell you "97% likely". They're just relative scores.

What is softmax?

To make a decision, we need to convert those raw logits into proper probabilities — numbers between 0 and 1 that sum to exactly 100%. The function that does this is called softmax. It takes the whole list of logits and produces a probability for each, where higher logits get higher probabilities, and everything adds up to 1.

The Vote-Counting Analogy

Softmax is like converting raw vote counts into percentages. If candidate A got 9,700 votes, B got 9,600, and C got 8,100, the raw counts are hard to compare with other elections. Converted to percentages (A: 35.4%, B: 35.0%, C: 29.6%), they sum to 100% and you can reason about them as probabilities. Softmax does the same job with one twist: it exponentiates every logit before normalising, so each probability is exp(logit) divided by the sum of exp(logit) over all tokens. The exponential exaggerates the gap between the top score and the rest, which makes the resulting distribution more decisive than a simple percentage would be.

Logits → softmax → probabilities

A simplified example with 5 candidate tokens for the sentence "The weather today is ___":

Raw logits (model output)

"sunny": 3.2
"cloudy": 2.8
"rainy": 2.1
"cold": 1.5
"purple": -2.0

Raw scores. They don't sum to anything meaningful. "purple" scores negative: the model knows it's unlikely.

After softmax (probabilities)

sunny: 46%
cloudy: 31%
rainy: 15%
cold: 8%
purple: under 1%

Now they sum to ~100%. "sunny" is most likely but not guaranteed. "purple" is nearly 0%.
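The conversion from logits to probabilities can be sketched in a few lines of plain Python. This is an illustrative stand-in that uses only the standard library, not the PyTorch call nano-vLLM runs on the GPU; the helper name `softmax` and the variable names are chosen here for the example.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the largest logit leaves the result unchanged
    # but keeps exp() from overflowing on large inputs.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The five candidate tokens for "The weather today is ___"
tokens = ["sunny", "cloudy", "rainy", "cold", "purple"]
logits = [3.2, 2.8, 2.1, 1.5, -2.0]
probs = softmax(logits)

for token, p in zip(tokens, probs):
    print(f"{token:>7}: {p:6.1%}")
```

Because the exponential is monotonic, the ranking of the tokens never changes under softmax; only the spacing between them does.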

Once we have probabilities, the sampler can make its choice. The different sampling strategies are all about how to use these probabilities — and how to reshape them before choosing.

Greedy decoding — always pick the top

Greedy decoding is the simplest strategy: always pick the token with the highest probability. No randomness, no chance for second place. In our weather example, greedy always picks "sunny" because it has the highest probability. Softmax preserves the ordering of the logits, so this is the same as picking the highest logit. The technical name for "pick the highest" is argmax: the argument (position) of the maximum value.

✓ When greedy is good

Factual Q&A, code generation, math, translation, structured extraction — anywhere there's a single correct answer and you want it deterministically. Greedy is also reproducible: same prompt always gives the same output, which matters for testing and debugging.

✗ When greedy fails

Creative writing, brainstorming, dialogue — anywhere you want variety. Greedy produces repetitive, predictable text. Ask it to "write a poem" twice and you get the identical poem. It also tends to get stuck in loops, repeating the same phrase, because the highest-probability continuation of a repeated phrase is often to repeat it again.

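Greedy = temperature 0

In most LLM APIs, setting temperature=0 selects greedy decoding. Dividing logits by zero is undefined, so implementations treat 0 as a special case and take the argmax directly instead of sampling. When you want a deterministic sampling step, set temperature to 0 and the sampler picks the top token at every step.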
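As a minimal sketch of greedy selection, the snippet below picks the position of the largest logit using only the standard library. The function name `greedy_pick` is hypothetical and the logits are the weather example's; nano-vLLM does the equivalent with a tensor argmax.

```python
def greedy_pick(logits):
    """Return the index of the largest logit (argmax)."""
    # max() with a key returns the first index on ties, so the result
    # is fully determined by the input list.
    return max(range(len(logits)), key=lambda i: logits[i])

tokens = ["sunny", "cloudy", "rainy", "cold", "purple"]
logits = [3.2, 2.8, 2.1, 1.5, -2.0]

# Repeating the call never changes the answer: there is no random draw.
picks = {tokens[greedy_pick(logits)] for _ in range(5)}
print(picks)  # {'sunny'}
```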

Temperature — the creativity dial

Temperature controls how much randomness the sampler allows. It works by scaling the logits before softmax: each logit is divided by the temperature value. This simple division has a powerful effect on the shape of the probability distribution.

Low temperature (0.1 – 0.5) — sharper, more confident

Dividing by a small number (like 0.2) makes large logits even larger relative to small ones — exaggerating the differences. After softmax, the top token's probability shoots up toward 100%, and everything else collapses toward 0%. The model becomes very confident and focused. Output is consistent and safe, close to greedy.

Temperature 1.0 — the model's natural distribution

Dividing by 1.0 changes nothing — the logits pass through unchanged. This is the model's "true" probability distribution as it learned during training. A balanced default for general use: some variety, but still grounded in what the model considers likely.

High temperature (1.2 – 2.0) — flatter, more random

Dividing by a large number (like 1.8) shrinks the gaps between logits — making all options more equal. After softmax, the distribution flattens: unlikely tokens get a real chance of being picked. Output becomes creative and surprising — but push too high and it degrades into incoherent nonsense, because genuinely bad tokens start getting selected.

The Weighted Dice Analogy

Imagine a die where each face is a candidate token, weighted by probability. Low temperature is a heavily loaded die: it almost always lands on the favourite. Temperature 1.0 is the die weighted exactly as the model intends. High temperature sands down the weighting toward a fair die: every face becomes closer to equally likely, even the bad ones. Temperature controls how "loaded" the die is.
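The effect of dividing by temperature can be checked numerically. The sketch below is a standard-library illustration, assuming the weather logits from earlier in the chapter; `softmax_with_temperature` is a name made up for this example.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_with_temperature(logits, temperature):
    """Divide every logit by the temperature, then apply softmax."""
    if temperature <= 0:
        # Temperature 0 cannot be divided by; treat it as greedy (argmax) instead.
        raise ValueError("temperature must be positive")
    return softmax([x / temperature for x in logits])

logits = [3.2, 2.8, 2.1, 1.5, -2.0]  # sunny, cloudy, rainy, cold, purple

for t in (0.2, 1.0, 1.8):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top token {probs[0]:.1%}, least likely token {probs[-1]:.2%}")
```

Running it shows the top token's share shrinking and the least likely token's share growing as the temperature rises, which is the flattening described above.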

How the strategies combine

In practice these aren't either/or: they're applied in sequence. The order used by the sampler code later in this chapter is: (1) temperature scaling reshapes the logits, (2) top-k filtering masks all but the top K logits, (3) softmax converts the survivors to probabilities, (4) top-p filtering removes the long tail and renormalises, (5) one token is sampled from the result. Setting top-k=0 and top-p=1.0 disables those filters, leaving just temperature.

Top-k and top-p — trimming the candidates

Temperature reshapes the entire distribution, but it never fully removes bad options — even at high temperature, a terrible token retains a small chance. Top-k and top-p solve this by cutting off the unlikely tokens entirely before sampling.

Top-k sampling — keep the best K

Top-k keeps only the K highest-probability tokens and discards everything else. With top-k=50, only the 50 most likely tokens can be selected; the other ~49,950 are removed (their probability set to zero). Then sampling happens among those 50. This guarantees the model never picks something wildly unlikely, while still allowing variety among reasonable options.

The Shortlist Analogy

Top-k is like a hiring manager who says "I'll only consider the top 50 candidates by score; everyone below rank 50 is automatically rejected, no matter what." It's a fixed-size shortlist. The downside: 50 is arbitrary. Sometimes only 3 tokens are reasonable (and 50 lets in 47 bad ones); sometimes 200 are reasonable (and 50 cuts off good options). The shortlist size doesn't adapt to the situation.
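A fixed-size shortlist is easy to sketch without PyTorch. The version below masks everything under the K-th largest logit with negative infinity, so those tokens come out of softmax with probability 0. The function names are illustrative, and treating `k = 0` as "off" follows this chapter's convention.

```python
import math

def top_k_filter(logits, k):
    """Keep the k largest logits; replace the rest with -inf."""
    if k <= 0 or k >= len(logits):
        return list(logits)  # filter disabled, or nothing to cut
    threshold = sorted(logits, reverse=True)[k - 1]
    # Logits tied with the threshold survive, so ties can keep more than k.
    return [x if x >= threshold else -math.inf for x in logits]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["sunny", "cloudy", "rainy", "cold", "purple"]
logits = [3.2, 2.8, 2.1, 1.5, -2.0]

probs = softmax(top_k_filter(logits, k=3))
for token, p in zip(tokens, probs):
    print(f"{token:>7}: {p:6.1%}")
```

The three survivors are renormalised by softmax, so their probabilities still sum to 1 after the cut.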

Top-p (nucleus) sampling — keep the smallest group covering P%

Top-p, also called nucleus sampling, fixes top-k's rigidity. Instead of a fixed count, it keeps the smallest set of tokens whose probabilities add up to at least P. With top-p=0.9, you sort tokens by probability and keep adding them to the shortlist until their combined probability reaches 90% — then stop. The shortlist size adapts automatically to the model's confidence.

The Adaptive Shortlist Analogy

Top-p is like a hiring manager who says "I'll consider however many candidates it takes to cover 90% of the total quality: could be 3 people, could be 200." When the model is confident (one token has 95% probability), top-p=0.9 keeps just that one token, which is effectively greedy. When the model is uncertain (probability spread across many tokens), top-p=0.9 keeps a wide set, allowing variety. The shortlist grows and shrinks with the model's confidence.

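The adaptive shortlist can be sketched as a loop that walks tokens from most to least likely and stops once the running total reaches the threshold. This is a plain-Python illustration with made-up probabilities; `top_p_filter` is a hypothetical name, and the tensor version appears later in the chapter.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)        # the token that crosses the threshold is kept
        mass += probs[i]
        if mass >= top_p:
            break
    # Zero out everything else and renormalise the survivors to sum to 1.
    return [p / mass if i in kept else 0.0 for i, p in enumerate(probs)]

confident = [0.96, 0.02, 0.01, 0.01]
uncertain = [0.35, 0.25, 0.20, 0.12, 0.08]

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    filtered = top_p_filter(dist, top_p=0.9)
    survivors = sum(1 for p in filtered if p > 0)
    print(f"{name}: {survivors} token(s) survive")
```

With the same threshold, the confident distribution keeps a single token while the uncertain one keeps four, which is the adaptivity that a fixed K lacks.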
Top-k vs Top-p on the same distribution

TOP-K = 3 (fixed count)

sunny: 40% (kept)
cloudy: 27% (kept)
rainy: 13% (kept)
cold: cut
windy: cut

Always keeps exactly 3, regardless of their probabilities. "cold" and "windy" are cut.

TOP-P = 0.80 (adaptive)

sunny: 40% (kept)
cloudy: 27% (kept)
rainy: 13% (kept)
cold: cut
windy: cut

Keeps tokens until cumulative ≥ 80%: 40+27+13 = 80%. Here it also lands on 3, but if "sunny" were 85%, it would keep just 1.

The sampler in code

nano-vLLM's sampler lives in sampler.py (see Ch.02). It runs on the GPU, taking the raw logits from the model's final layer and producing one token ID per sequence. Here it is, annotated:

sampler.py — the sampling pipeline
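import torch
from torch import Tensor

# SamplingParams is the per-request settings object; this function reads its
# temperature, top_k and top_p fields. Import it from the module that defines it.
def sample(logits: Tensor, params: SamplingParams) -> Tensor:
    # logits shape: [batch_size, vocab_size], one row per sequence,
    # 50,000 raw scores per row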

    # ── GREEDY SHORTCUT: temperature 0 means pure argmax ──
    if params.temperature == 0:
        return logits.argmax(dim=-1)   # pick the single highest-scoring token

    # ── STEP 1: temperature scaling ──
    # Divide all logits by temperature. Smaller temp = sharper distribution.
    logits = logits / params.temperature

    # ── STEP 2: top-k filtering ──
    # Keep only the K highest logits; set the rest to -infinity
    # (-inf becomes 0 after softmax, removing those tokens entirely)
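    if params.top_k > 0:
        # Clamp K to the vocabulary size, since torch.topk raises if K is larger
        k = min(params.top_k, logits.size(-1))
        # Find the K-th largest value in each row as a threshold
        kth_value = torch.topk(logits, k, dim=-1).values[..., -1, None]
        # Ties with the K-th value survive, so slightly more than K tokens can remain
        logits = torch.where(logits < kth_value, float('-inf'), logits)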

    # ── STEP 3: convert to probabilities via softmax ──
    probs = torch.softmax(logits, dim=-1)

    # ── STEP 4: top-p (nucleus) filtering ──
    # Sort by probability, keep the smallest set summing to >= top_p
    if params.top_p < 1.0:
        probs = _apply_top_p(probs, params.top_p)

    # ── STEP 5: sample one token from the final distribution ──
    # multinomial picks an index weighted by probability — this is
    # where the actual randomness happens
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
sampler.py — top-p helper
def _apply_top_p(probs: Tensor, top_p: float) -> Tensor:
    # Sort probabilities descending, track the cumulative sum
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)

    # Mark tokens to remove: those beyond the point where cumulative >= top_p
    remove = cumulative - sorted_probs > top_p   # keep the token that crosses the threshold
    sorted_probs[remove] = 0.0

    # Restore original token order and renormalise so probs sum to 1 again
    probs = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
    return probs / probs.sum(dim=-1, keepdim=True)
Why multinomial is the source of randomness

torch.multinomial is the one place where chance enters. Given a probability distribution, it rolls a weighted die and returns one index. A token with 40% probability gets picked roughly 40% of the time across many runs. This is why, with temperature > 0, the same prompt can produce different outputs each time: multinomial makes a fresh random draw every decode step. With temperature = 0, we skip multinomial entirely and use argmax, which is deterministic.
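The "weighted die" can be imitated with the standard library to see the long-run frequencies. This sketch swaps in `random.choices` for `torch.multinomial`; it is an analogue for illustration, not the call nano-vLLM makes, and the four-token distribution is invented for the demo.

```python
import random
from collections import Counter

def sample_index(probs, rng):
    """Draw one index with chance proportional to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

tokens = ["sunny", "cloudy", "rainy", "cold"]
probs = [0.40, 0.30, 0.20, 0.10]

rng = random.Random(0)  # fixed seed only so the demo is repeatable
draws = 10_000
counts = Counter(tokens[sample_index(probs, rng)] for _ in range(draws))

for token in tokens:
    print(f"{token:>7}: picked {counts[token] / draws:.1%} of the time")
```

Over many draws the observed shares settle close to the input probabilities, while any single draw remains unpredictable.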

Choosing the right strategy for the job

Factual / code → temperature 0

Q&A, code generation, math, data extraction, classification. You want the single most likely answer, deterministically reproducible. Greedy decoding. No surprises.

Balanced chat → temp 0.7, top-p 0.9

General conversation, explanations, assistance. A common production starting point: enough variety to feel natural, enough grounding to stay coherent. Many chat deployments use settings close to these, though defaults vary by provider and model.

Creative → temp 1.0+, top-p 0.95

Story writing, brainstorming, poetry, marketing copy. Higher temperature and a generous top-p let the model explore unusual word choices and surprising directions. More variety, occasionally more risk.

Avoid → temp 2.0, no filtering

Maximum temperature with no top-k or top-p produces near-random gibberish. Genuinely bad tokens get selected. Almost never useful in practice — included here so you recognise the failure mode when you see incoherent output.
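The three recommended settings above could be kept in a small lookup table so that callers name a task instead of repeating numbers. Everything in this sketch is hypothetical: `SamplingPreset`, `PRESETS`, and `preset_for` are illustrative names, not part of nano-vLLM, and "creative" uses 1.0 as the low end of the "1.0+" range.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingPreset:  # hypothetical container, not a nano-vLLM class
    temperature: float
    top_k: int = 0      # 0 disables top-k
    top_p: float = 1.0  # 1.0 disables top-p

PRESETS = {
    "factual": SamplingPreset(temperature=0.0),               # greedy
    "chat": SamplingPreset(temperature=0.7, top_p=0.9),       # balanced
    "creative": SamplingPreset(temperature=1.0, top_p=0.95),  # more variety
}

def preset_for(task):
    """Look up a preset, falling back to the balanced chat settings."""
    return PRESETS.get(task, PRESETS["chat"])

print(preset_for("factual"))
print(preset_for("something-else"))
```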

Things beginners get wrong about sampling

✗ Myth 1: "Higher temperature makes the model smarter or its output higher quality"
Reality: Temperature doesn't change what the model knows or how good its ideas are. It only changes how much randomness is allowed when selecting from what the model already considers possible. High temperature produces more varied output, not better output. Past a certain point, higher temperature just means picking tokens the model itself rated as unlikely more often, which degrades quality into incoherence. Creativity from sampling is variety, not intelligence.
✗ Myth 2: "Temperature 0 means no randomness anywhere in the system"
Reality: Temperature 0 (greedy) makes the sampling step deterministic: argmax always picks the same top token for the same logits. But the same prompt can still occasionally produce slightly different outputs across runs, because the logits themselves can vary. Floating-point operations on GPUs can produce tiny numerical differences depending on batch composition and parallel execution order, and a tiny difference can flip a near tie. For most purposes temperature 0 is "deterministic enough", but bit-exact reproducibility across different hardware or batch sizes isn't guaranteed by temperature alone.
✗ Myth 3: "You should always use top-k and top-p together at aggressive settings"
Reality: Stacking aggressive top-k and top-p can over-constrain the output, making it repetitive and bland: you've removed so many options that only the safest tokens survive. A common choice is to rely on one truncation method, typically top-p 0.9–0.95 alone with top-k disabled (set to 0). Top-p's adaptive nature usually makes it the better single choice. Use top-k mainly when you want a hard cap on the number of candidates.
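The floating-point point in Myth 2 can be reproduced on a CPU with three numbers. The sketch below is a deliberately contrived case: it adds the same values in two different orders, as a parallel reduction might, and shows that the tiny difference is enough to change which of two nearly tied scores wins an argmax.

```python
def argmax(scores):
    # First index wins on an exact tie
    return max(range(len(scores)), key=lambda i: scores[i])

values = [0.1, 0.2, 0.3]

# Same numbers, two summation orders
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])
print(left_to_right == right_to_left)  # False

# Token 0 scores exactly 0.6; token 1's score is the sum above.
run_a = [0.6, right_to_left]
run_b = [0.6, left_to_right]
print(argmax(run_a), argmax(run_b))  # 0 1
```

Real logits rarely tie this closely, which is why greedy output is usually stable, but the example shows why summation order alone can change a result.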

Quiz

Three questions on sampling.

1. You're building a system that extracts structured data (names, dates, amounts) from documents. Which sampling setting should you use?

2. The model is very confident — token "Paris" has 96% probability for "The capital of France is ___". With top-p = 0.9, how many tokens survive the filter?

3. Where in the sampling pipeline does the actual randomness happen?

What you now know

Chapter 08 — Summary

Logits are raw scores; softmax makes probabilities. The model outputs ~50,000 unbounded logits per step. Softmax converts them into probabilities that sum to 100%, amplifying the gaps via an exponential curve.

Greedy = always pick the top (argmax). Deterministic and reproducible. Best for factual answers, code, and extraction. Equivalent to temperature 0. Repetitive and loop-prone for creative work.

Temperature is the randomness dial. Logits divided by temperature before softmax. Low = sharp and focused. 1.0 = the model's natural distribution. High = flat and creative, but eventually incoherent.

Top-k keeps a fixed count; top-p adapts. Top-k keeps the best K tokens regardless of confidence. Top-p (nucleus) keeps the smallest set covering P% of probability — growing and shrinking with the model's certainty.

Strategies apply in sequence. Temperature → top-k → softmax → top-p → multinomial. Each step reshapes the distribution; the final multinomial draw is the single source of randomness.

Match the strategy to the task. temp 0 for facts/code, temp 0.7 + top-p 0.9 for balanced chat, temp 1.0+ for creative. Stacking aggressive filters over-constrains output — usually pick one truncation method.