Chapter 09 of 11 · nano-vLLM Deep Dive

Tensor Parallelism

When a model is too big for one GPU, split it across several. How column-parallel and row-parallel layers divide the work — and pair up to keep communication to an absolute minimum.


The problem: the model doesn't fit

A 70-billion-parameter model needs about 140 GB of memory just to hold its weights at 16-bit precision (70 billion parameters × 2 bytes each). Widely deployed data-center GPUs such as the A100 and H100 top out at 80 GB, so the model does not fit on one of them. And even when a model does fit, running it on a single GPU may be too slow. Tensor parallelism addresses both problems: split the model's weights across multiple GPUs that work together as one.

The Shared Spreadsheet Analogy

Imagine a single spreadsheet so enormous it won't fit in one person's computer memory, and so full of calculations that one person would take hours to finish it. The solution: a team of four. You split the spreadsheet's columns among them: person 1 takes columns A–F, person 2 takes G–L, and so on. Each person loads only their slice into memory (so it fits) and computes their columns in parallel (so it's fast). At the end, they combine their partial results into the final answer. That combining step requires them to talk to each other, and minimising that conversation is the whole art of tensor parallelism. Each team member is a GPU. The spreadsheet is the model's weight matrices.

The term tensor parallelism comes from the fact that we're splitting the tensors — the weight matrices — across devices. (A tensor is just the mathematical name for a multi-dimensional array of numbers; a weight matrix is a 2D tensor.) This chapter shows exactly how that split works, and why nano-vLLM uses two complementary types of split that fit together perfectly.

Two reasons to use multiple GPUs

A 70B model across 1 vs 4 GPUs

A 70B model needs ~140 GB for weights (fp16). One 80 GB GPU can't hold it. Split across 4 GPUs, each holds ~35 GB — comfortably fits, with room for the KV cache.

[Figure: one 80 GB GPU cannot hold the 140 GB of weights (60 GB short). Split across four GPUs, GPU 0 to GPU 3 each hold 35 GB, a quarter of the weights.]

Reason 1 — Memory capacity

The model's weights are too large for one GPU's memory. Splitting them across N GPUs means each holds only 1/N of the weights. This is the difference between "can run this model at all" and "cannot". A 405B model essentially requires tensor parallelism — there's no single GPU on earth that fits it.
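The arithmetic behind these figures can be sketched in a few lines. This is a rough estimate that counts weights only (no KV cache or activations) and assumes 1 GB = 10^9 bytes; the function names are illustrative and not part of nano-vLLM.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (fp16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9


def per_gpu_gb(num_params: float, tp_size: int, bytes_per_param: int = 2) -> float:
    """Each of tp_size GPUs holds an equal 1/tp_size share of the weights."""
    return weight_memory_gb(num_params, bytes_per_param) / tp_size


def min_gpus(num_params: float, gpu_gb: float, bytes_per_param: int = 2) -> int:
    """Smallest GPU count whose combined memory holds the weights (nothing else)."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_gb)


print(weight_memory_gb(70e9))   # 140.0 GB for a 70B model
print(per_gpu_gb(70e9, 4))      # 35.0 GB per GPU when split four ways
print(min_gpus(405e9, 80))      # 11 GPUs of 80 GB just for 405B weights
```

In practice you would pick a GPU count above this minimum, because the KV cache and activations also need room on every GPU.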

Reason 2 — Speed

Even if a model fits, splitting its matrix multiplications across N GPUs means each GPU does 1/N of the math, in parallel. For the memory-bound decode phase (→ Ch.06), each GPU also reads only its own 1/N of the weights, and all N read simultaneously: up to N× the aggregate memory bandwidth, which is exactly what decode needs.
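To see why the bandwidth argument matters, here is a back-of-envelope sketch. It assumes decode time is dominated by reading every weight once per token and ignores communication entirely; the 2000 GB/s bandwidth figure is a made-up round number, not a measurement of any particular GPU.

```python
def decode_ms_per_token(weight_gb: float, bandwidth_gb_s: float, tp_size: int = 1) -> float:
    """Lower bound on per-token decode time if weight reads are the only cost."""
    per_gpu_gb = weight_gb / tp_size          # each GPU reads only its own shard
    return per_gpu_gb / bandwidth_gb_s * 1000.0


# Hypothetical numbers: 140 GB of weights, 2000 GB/s of memory bandwidth per GPU.
for tp in (1, 2, 4, 8):
    print(f"tp_size={tp}: at least {decode_ms_per_token(140, 2000, tp):.1f} ms per token")
```

The real speedup is smaller than this bound suggests, because every layer-pair also pays for an all-reduce.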

Tensor parallelism vs other kinds of parallelism

There are several ways to use multiple GPUs. Data parallelism runs a full copy of the model on each GPU, each handling different requests: good for throughput, but the model must fit on one GPU. Pipeline parallelism puts different layers on different GPUs, like an assembly line. Tensor parallelism (this chapter) splits each individual layer's weights across GPUs. nano-vLLM implements tensor parallelism specifically, because it's the most effective for low-latency inference of large models.

The core operation: splitting one matrix multiplication

Almost all of a transformer's compute is matrix multiplication — the input vector multiplied by a weight matrix to produce an output vector. To parallelise the model, we need to split these matrix multiplications across GPUs. There are exactly two ways to cut a weight matrix, and tensor parallelism uses both.

A quick refresher: what a linear layer does

A linear layer (→ Ch.02) takes an input vector X and multiplies it by a weight matrix W to produce an output: Y = X · W. If X has 4 numbers and W is a 4×6 matrix, the output Y has 6 numbers. Every transformer is built from these multiplications: the attention projections (Q, K, V) and the feed-forward layers are all X · W operations. Splitting these is the heart of tensor parallelism.
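As a concrete instance of Y = X · W, here is the 4-in, 6-out case in plain Python. The `matmul` helper and the weight values are made up for illustration; real code would use a tensor library.

```python
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n); both are lists of rows."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*b)] for row in a]


X = [[1, 2, 3, 4]]                                       # one input vector: 4 numbers
W = [[r * 6 + c for c in range(6)] for r in range(4)]    # 4 x 6 weights, values 0..23

Y = matmul(X, W)
print(len(Y[0]))   # 6 output numbers
print(Y)           # [[120, 130, 140, 150, 160, 170]]
```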

Way 1 — Column parallelism (split the output)

Column-parallel splitting divides the weight matrix by its columns. Each GPU gets a vertical slice of the matrix — a subset of the columns. Since each column of W produces one number in the output Y, splitting columns means each GPU computes a different part of the output. GPU 0 produces output numbers 1–3, GPU 1 produces output numbers 4–6, and so on.

[Figure: column-parallel split of a 4×6 weight matrix W. GPU 0 owns columns 1–3, GPU 1 owns columns 4–6.]

Each GPU multiplies the full input by its slice of columns, producing part of the output. GPU 0 → outputs [y1,y2,y3]. GPU 1 → outputs [y4,y5,y6]. No communication needed yet — each just holds a partial output.
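The column split can be checked numerically. The sketch below simulates two GPUs inside one Python process (no real devices involved), using illustrative helper names: each "GPU" multiplies the full input by its column slice, and gluing the slices side by side reproduces the unsplit result.

```python
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n); both are lists of rows."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, tp_size):
    """Cut w into tp_size vertical slices, one per simulated GPU."""
    cols = len(w[0]) // tp_size
    return [[row[r * cols:(r + 1) * cols] for row in w] for r in range(tp_size)]


X = [[1, 2, 3, 4]]
W = [[r * 6 + c for c in range(6)] for r in range(4)]    # 4 x 6

shards = split_columns(W, tp_size=2)             # two 4 x 3 slices
partials = [matmul(X, shard) for shard in shards]  # full input, local slice

# Concatenate each GPU's output columns side by side.
Y_parallel = [sum((p[i] for p in partials), []) for i in range(len(X))]

print(partials)                      # [[[120, 130, 140]], [[150, 160, 170]]]
print(Y_parallel == matmul(X, W))    # True
```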

Way 2 — Row parallelism (split the input)

Row-parallel splitting divides the weight matrix by its rows. Each GPU gets a horizontal slice. Since each row of W corresponds to one number in the input X, row splitting means each GPU handles a different part of the input — and each produces a partial version of the full output that must be summed together.

The key difference: what comes out

Column-parallel produces pieces of the output that need to be concatenated (stuck side by side) to form the full output. Row-parallel produces full-width partial outputs that need to be summed (added together element by element) to form the correct output. This summing step is the communication operation, and it's called all-reduce.
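The row split can be checked the same way. In this single-process sketch (helper names are illustrative), each simulated GPU sees only its part of the input and its rows of W, produces a full-width partial output, and the element-wise sum of the partials equals the unsplit result.

```python
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n); both are lists of rows."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*b)] for row in a]


def split_rows(w, tp_size):
    """Cut w into tp_size horizontal slices, one per simulated GPU."""
    rows = len(w) // tp_size
    return [w[r * rows:(r + 1) * rows] for r in range(tp_size)]


def split_input(x, tp_size):
    """Cut the input's features the same way the weight rows are cut."""
    cols = len(x[0]) // tp_size
    return [[row[r * cols:(r + 1) * cols] for row in x] for r in range(tp_size)]


X = [[1, 2, 3, 4]]
W = [[r * 6 + c for c in range(6)] for r in range(4)]    # 4 x 6

partials = [matmul(xs, ws) for xs, ws in zip(split_input(X, 2), split_rows(W, 2))]

# Element-wise sum of the full-width partials (what all-reduce computes).
Y_sum = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

print(partials[0])              # [[12, 15, 18, 21, 24, 27]]
print(partials[1])              # [[108, 115, 122, 129, 136, 143]]
print(Y_sum == matmul(X, W))    # True
```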

All-reduce — how GPUs combine their results

When GPUs each compute a partial result, they need to combine them. The operation that does this is called all-reduce — and understanding it is essential, because it's the one moment where the GPUs must stop computing and talk to each other.

What "all-reduce" means

Break the name in two. Reduce means combining many values into one — here, summing. All means every GPU ends up with the final combined result, not just one of them. So all-reduce means: "every GPU shares its partial result, all partials are summed, and every GPU receives the complete sum." After an all-reduce, all GPUs hold the same, correct, full output.

The Group Tally Analogy

Four people each counted attendees at different doors of an event. Each knows only their own door's count. To get the total, they do an all-reduce: each announces their number, everyone adds them up, and everyone writes down the same grand total. The "reduce" is the addition; the "all" is that everyone, not just the organiser, ends up knowing the total. In tensor parallelism, the partial sums are partial output vectors, and after all-reduce every GPU has the identical complete output, ready for the next layer.
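The semantics of all-reduce fit in a few lines. This is a single-process model of what the operation computes, not of how a library such as NCCL moves the data; `all_reduce_sum` is a hypothetical name, and the door counts are invented.

```python
def all_reduce_sum(per_gpu):
    """per_gpu holds one equal-length vector per GPU.

    Returns what each GPU holds afterwards: its own copy of the element-wise sum.
    """
    total = [sum(vals) for vals in zip(*per_gpu)]
    return [list(total) for _ in per_gpu]


# The group tally: four doors, four local counts.
door_counts = [[120], [85], [230], [65]]
print(all_reduce_sum(door_counts))             # [[500], [500], [500], [500]]

# The same operation on two partial output vectors.
print(all_reduce_sum([[1, 2, 3], [10, 20, 30]]))   # [[11, 22, 33], [11, 22, 33]]
```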

On NVIDIA GPUs, all-reduce is implemented by a library called NCCL (NVIDIA Collective Communications Library, pronounced "nickel"). Within a server it uses the fastest GPU-to-GPU link available, typically NVLink, which moves data far faster than a round trip through the CPU or system memory. It is still much slower than on-GPU compute, and every all-reduce is a synchronisation point where GPUs wait for each other. This is why minimising all-reduces matters so much.

Why communication is the enemy of parallelism

Splitting work across GPUs gives you more compute, but every all-reduce forces all GPUs to stop and synchronise. If you communicate after every operation, the GPUs spend more time waiting and exchanging data than computing. The entire design goal of tensor parallelism is to do as much independent computation as possible between communication steps. This is exactly what the column-then-row pairing achieves.

Why column-parallel and row-parallel pair perfectly

Here's the insight that makes tensor parallelism efficient. Transformer layers come in pairs of matrix multiplications — for example, the feed-forward network is two linear layers back to back, and attention has the QKV projection followed by the output projection. nano-vLLM makes the first layer column-parallel and the second layer row-parallel. When you chain them this way, the intermediate result never needs communication — only one all-reduce is needed at the very end.

1. First layer: column-parallel → each GPU has a slice of the intermediate

The input is fed to all GPUs. Each computes its column-slice of the first layer's output. GPU 0 holds intermediate columns 1–3, GPU 1 holds columns 4–6. Any activation function between the two layers works element by element, so each GPU can apply it to its own slice. Crucially, each GPU's slice is exactly what it needs to feed into the next layer, so no communication is required to move from layer 1 to layer 2.

2. Second layer: row-parallel → consumes the slice directly

The second layer is row-parallel, which means it expects its input to be split across GPUs, and that is exactly the form the column-parallel output is already in. GPU 0 multiplies its intermediate slice by its row-slice of the second weight matrix. Each GPU produces a full-width partial output.

3. One all-reduce at the end → combine partial outputs

Now each GPU has a partial version of the full output. A single all-reduce sums them, and every GPU has the complete, correct result, ready for the next pair of layers. One communication step for two matrix multiplications. That's the efficiency win.

The relay race analogy

Column-then-row is like a relay race where the baton hand-off requires no slowing down. The column-parallel layer hands its output slice directly to the row-parallel layer on the same GPU, with no passing between runners (GPUs) needed in the middle. Only at the finish line (after the row-parallel layer) do all runners' results get combined in a single step. If instead you used column-parallel twice, you'd need to regroup the runners between every leg, which means far more communication.
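The whole column-then-row pairing can be simulated in one process. The sketch below is an illustration, not nano-vLLM code: the weights are arbitrary small integers, ReLU stands in for whatever element-wise activation the real feed-forward block uses, and the final sum plays the role of the single all-reduce.

```python
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n); both are lists of rows."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise activation, so it can be applied to each slice locally."""
    return [[max(0, v) for v in row] for row in m]


def split_columns(w, tp_size):
    cols = len(w[0]) // tp_size
    return [[row[r * cols:(r + 1) * cols] for row in w] for r in range(tp_size)]


def split_rows(w, tp_size):
    rows = len(w) // tp_size
    return [w[r * rows:(r + 1) * rows] for r in range(tp_size)]


tp_size = 2
X = [[1, -2, 3, -4]]                                             # 1 x 4 input
W1 = [[(r * c) % 5 - 2 for c in range(8)] for r in range(4)]     # 4 x 8, first layer
W2 = [[r + c for c in range(4)] for r in range(8)]               # 8 x 4, second layer

W1_shards = split_columns(W1, tp_size)   # first layer: column-parallel
W2_shards = split_rows(W2, tp_size)      # second layer: row-parallel

partials = []
for rank in range(tp_size):
    hidden_slice = relu(matmul(X, W1_shards[rank]))           # local, no communication
    partials.append(matmul(hidden_slice, W2_shards[rank]))    # local, full-width partial

# The one communication step: sum the partials element-wise.
Y_parallel = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

Y_reference = matmul(relu(matmul(X, W1)), W2)                 # unsharded computation
print(Y_parallel == Y_reference)   # True
```

The intermediate `hidden_slice` never leaves its rank, which is the point of the pairing: two matrix multiplies, one combining step.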

Walk through a tensor-parallel forward pass

A forward pass through a feed-forward layer split across GPUs moves through four stages: the input is broadcast to all GPUs, the column-parallel layer runs, the row-parallel layer runs, and a final all-reduce combines the results. The GPUs work independently through the first three stages; communication happens only in the all-reduce.

The parallel layers in code

nano-vLLM implements tensor parallelism in layers/linear.py (→ Ch.02) with two classes: ColumnParallelLinear and RowParallelLinear. Each shards its weight matrix at load time and handles its part of the computation.

layers/linear.py — ColumnParallelLinear
import torch
from torch import nn

class ColumnParallelLinear(nn.Module):
    def __init__(self, in_features, out_features, tp_size, tp_rank):
        super().__init__()  # must run before any nn.Parameter is assigned
        # tp_size = number of GPUs; tp_rank = which GPU this is (0, 1, 2...)
        assert out_features % tp_size == 0, "out_features must divide evenly across GPUs"
        self.tp_rank = tp_rank  # selects which block of columns is copied in at load time
        # Each GPU holds only out_features / tp_size columns of the weight
        self.out_per_gpu = out_features // tp_size

        # This GPU's slice of the weight matrix: only its columns
        # Shape: [in_features, out_features / tp_size]
        # torch.empty leaves the values uninitialised; they are filled at load time
        self.weight = nn.Parameter(torch.empty(in_features, self.out_per_gpu))

    def forward(self, x):
        # Full input x, this GPU's column slice → partial output
        # No communication needed: each GPU independently produces
        # its slice of the output columns
        return x @ self.weight   # shape: [..., out_features / tp_size]
layers/linear.py — RowParallelLinear
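import torch
import torch.distributed as dist
from torch import nn

class RowParallelLinear(nn.Module):
    def __init__(self, in_features, out_features, tp_size, tp_rank):
        super().__init__()  # must run before any nn.Parameter is assigned
        assert in_features % tp_size == 0, "in_features must divide evenly across GPUs"
        self.tp_size = tp_size
        self.tp_rank = tp_rank  # selects which block of rows is copied in at load time
        # Each GPU holds only in_features / tp_size rows of the weight
        self.in_per_gpu = in_features // tp_size

        # This GPU's slice: only its rows
        # Shape: [in_features / tp_size, out_features]
        self.weight = nn.Parameter(torch.empty(self.in_per_gpu, out_features))

    def forward(self, x):
        # x is this GPU's slice of the input (from the previous
        # column-parallel layer). Produces a FULL-WIDTH partial output.
        partial = x @ self.weight   # shape: [..., out_features], but partial

        # ── THE ONE COMMUNICATION STEP ──
        # all_reduce sums every GPU's partial output element-wise, in place.
        # After this, every GPU holds the complete, correct output.
        if self.tp_size > 1:            # with one GPU the local result is already complete
            dist.all_reduce(partial)    # default op is SUM; runs over NCCL on GPUs
        return partial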
The all_reduce call is the whole story

Notice that ColumnParallelLinear.forward() has no communication; it just does a local matrix multiply. Only RowParallelLinear.forward() calls all_reduce, exactly once, at the end. This is the column-then-row pairing in code: a column-parallel layer feeds its row-parallel partner, and the single all-reduce at the end of the row-parallel layer produces the final result. Two matrix multiplies, one communication step.
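Both classes allocate an empty shard and rely on the right block of the full checkpoint weight being copied in at load time. The chapter does not show that loading code, so the following is only a sketch of the slicing it implies, using the same [in_features, out_features] layout as the classes above; `shard_for_rank` is a hypothetical helper, not a nano-vLLM function.

```python
def shard_for_rank(full_weight, tp_size, tp_rank, parallel):
    """Return the block of full_weight that rank tp_rank should hold.

    full_weight is a list of rows shaped [in_features, out_features].
    parallel is "column" (split out_features) or "row" (split in_features).
    """
    if parallel == "column":
        per_gpu = len(full_weight[0]) // tp_size
        start = tp_rank * per_gpu
        return [row[start:start + per_gpu] for row in full_weight]
    if parallel == "row":
        per_gpu = len(full_weight) // tp_size
        start = tp_rank * per_gpu
        return full_weight[start:start + per_gpu]
    raise ValueError(f"unknown parallel mode: {parallel!r}")


W = [[r * 6 + c for c in range(6)] for r in range(4)]   # 4 x 6 toy weight

print(shard_for_rank(W, tp_size=2, tp_rank=1, parallel="column"))
# [[3, 4, 5], [9, 10, 11], [15, 16, 17], [21, 22, 23]]
print(shard_for_rank(W, tp_size=2, tp_rank=0, parallel="row"))
# [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]
```

Because each rank copies only its own block, no GPU ever needs to materialise the full matrix in its own memory.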
models/qwen3.py — how attention uses both
class Qwen3Attention(nn.Module):
    def __init__(self, config, tp_size, tp_rank):
        super().__init__()  # must run before submodules are assigned
        # QKV projection is COLUMN-parallel: each GPU computes a subset
        # of the attention heads (Ch.03: heads split naturally across GPUs)
        self.qkv_proj = ColumnParallelLinear(config.hidden, config.qkv_dim, tp_size, tp_rank)

        # Output projection is ROW-parallel: consumes the per-GPU head
        # outputs directly, then all-reduces to combine.
        # (Simplified: a real fused QKV output is wider than the attention
        # output that o_proj consumes, so the two widths differ in practice.)
        self.o_proj = RowParallelLinear(config.qkv_dim, config.hidden, tp_size, tp_rank)

    def forward(self, x):
        qkv = self.qkv_proj(x)        # column-parallel: each GPU gets its heads
        # `attention` stands in for the attention computation from Ch.03
        attn_out = attention(qkv)      # each GPU runs attention on ITS heads only
        return self.o_proj(attn_out)   # row-parallel: one all-reduce → done
Attention heads split across GPUs for free

Remember from Chapter 03 that attention has multiple independent heads (→ Ch.03). Tensor parallelism exploits this beautifully: each GPU simply handles a subset of the heads. With 16 heads across 4 GPUs, each GPU computes 4 heads entirely on its own, with no communication during attention itself. The heads are independent by design, so splitting them needs no coordination until the output projection's single all-reduce.
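The head assignment is simple index arithmetic. A minimal sketch, assuming heads are handed out in contiguous blocks and the head count divides evenly by the GPU count; `heads_for_rank` is an illustrative name.

```python
def heads_for_rank(num_heads: int, tp_size: int, tp_rank: int) -> range:
    """Indices of the attention heads that GPU tp_rank computes."""
    if num_heads % tp_size != 0:
        raise ValueError("num_heads must divide evenly across GPUs")
    per_gpu = num_heads // tp_size
    return range(tp_rank * per_gpu, (tp_rank + 1) * per_gpu)


for rank in range(4):
    print(rank, list(heads_for_rank(16, 4, rank)))
# 0 [0, 1, 2, 3]
# 1 [4, 5, 6, 7]
# 2 [8, 9, 10, 11]
# 3 [12, 13, 14, 15]
```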

What tensor parallelism enables

Runs models that don't fit

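The fundamental enabler for large models. A 405B model needs ~810 GB at fp16, more than any single GPU offers. At 80 GB per GPU that is at least 11 GPUs for the weights alone, so the model has to be split across devices to be served at all, and tensor parallelism is the core technique for doing that.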

Faster decode via more bandwidth

Decode is memory-bandwidth bound (→ Ch.06). With N GPUs reading weights in parallel, you get roughly N× the effective HBM bandwidth, directly speeding up the bottleneck phase of generation.

Scales within a node

Tensor parallelism works best across GPUs connected by fast NVLink within a single server (typically 8 GPUs). The all-reduce communication is frequent, so it needs the fastest possible interconnect — which is why TP usually stays within one node.

Combines with other parallelism

For truly massive models, tensor parallelism combines with pipeline parallelism (layers across nodes) and data parallelism (replicas for throughput). Production systems layer all three. nano-vLLM focuses on tensor parallelism as the foundational one.

Things beginners get wrong about tensor parallelism

✗ Myth 1 — "Each GPU runs a full copy of the model"
Reality: That's data parallelism, not tensor parallelism. In tensor parallelism, each GPU holds only 1/N of every sharded layer's weights, so no GPU has a complete copy of those weight matrices (small tensors such as normalisation weights are typically just replicated). The GPUs are not independent replicas; they're more like organs in one body, each indispensable. If one GPU fails, the model cannot run at all, because no single GPU has the full weights.
✗ Myth 2 — "More GPUs always means proportionally faster inference"
Reality: Tensor parallelism doesn't scale linearly because every layer-pair requires an all-reduce, and communication overhead grows with more GPUs. Going from 1 to 2 GPUs might give a 1.8× speedup; from 4 to 8 might give only 1.5×. Past a point, adding GPUs makes things slower because the GPUs spend more time communicating than computing. There's an optimal degree of parallelism for each model and hardware setup — usually capped at the number of GPUs with fast NVLink in one node.
✗ Myth 3 — "All-reduce just sends data to one main GPU to combine"
Reality: All-reduce is not a gather-to-one operation. Every GPU ends up with the complete result, not just a designated leader. NCCL uses clever ring or tree algorithms where GPUs exchange data peer-to-peer simultaneously, so the bandwidth cost is shared and no single GPU becomes a bottleneck. The "all" in all-reduce specifically means all participants get the final answer — which is necessary because every GPU needs the full output to compute its slice of the next layer.
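Myths 2 and 3 can be tied together with a toy cost model. It uses the textbook cost of a ring all-reduce, where each GPU takes 2(N-1) steps and sends 1/N of the payload per step, and adds it to compute time that shrinks as 1/N. All the millisecond values are invented for illustration; this is not a measurement of NCCL or of any real hardware.

```python
def ring_all_reduce_ms(n_gpus: int, payload_ms: float, hop_latency_ms: float) -> float:
    """Toy cost of one ring all-reduce.

    payload_ms: time to push the whole payload over one link.
    hop_latency_ms: fixed cost of each communication step.
    """
    if n_gpus == 1:
        return 0.0                                   # nothing to combine
    steps = 2 * (n_gpus - 1)                         # reduce-scatter, then all-gather
    bandwidth_term = payload_ms * steps / n_gpus     # each step moves 1/n of the payload
    latency_term = hop_latency_ms * steps
    return bandwidth_term + latency_term


def layer_pair_ms(n_gpus, compute_ms=10.0, payload_ms=0.5, hop_latency_ms=0.05):
    """Compute shrinks as 1/n; the single all-reduce per layer-pair does not."""
    return compute_ms / n_gpus + ring_all_reduce_ms(n_gpus, payload_ms, hop_latency_ms)


baseline = layer_pair_ms(1)
for n in (1, 2, 4, 8, 16):
    t = layer_pair_ms(n)
    print(f"{n:2d} GPUs: {t:.3f} ms per layer-pair, speedup {baseline / t:.2f}x")
```

With these made-up numbers the speedup flattens well below N, and 16 GPUs come out slower than 8 because the per-step latency keeps growing while the compute saved keeps shrinking.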

Quiz

Three questions on tensor parallelism.

1. Why does nano-vLLM make the first layer of a pair column-parallel and the second row-parallel, rather than both column-parallel?

2. After a row-parallel layer computes, each GPU holds a "full-width partial output." What does all-reduce do with these?

3. You scale a model from 4 GPUs to 8 GPUs with tensor parallelism but see only a 1.4× speedup, not 2×. Why?

What you now know

Chapter 09 — Summary

Tensor parallelism splits weights across GPUs. Each GPU holds 1/N of every layer's weight matrices. Enables models too big for one GPU, and speeds up compute by parallelising matrix multiplications.

Two ways to cut a matrix. Column-parallel splits the output (results concatenated). Row-parallel splits the input (results summed via all-reduce). They are complementary.

All-reduce is the communication step. Every GPU shares its partial result, all are summed, and every GPU receives the complete sum. Implemented by NCCL over NVLink. It's a synchronisation point — GPUs wait for each other.

Column-then-row minimises communication. The column-parallel output feeds the row-parallel layer with no intermediate communication. One all-reduce per layer-pair instead of one per layer.

Attention heads split for free. Heads are independent (→ Ch.03), so each GPU runs a subset with no coordination until the output projection's single all-reduce.

Scaling is sub-linear. Communication overhead grows with GPU count, so doubling GPUs gives less than 2× speedup. There's an optimal degree of parallelism, usually within one NVLink-connected node.