The problem: the model doesn't fit
A 70-billion-parameter model needs about 140 GB of memory just to hold its weights at 16-bit precision (70 billion parameters × 2 bytes each). A widely deployed data-centre GPU such as the A100 or H100 has 80 GB. The model does not fit on one such GPU, and that is before any memory is set aside for the KV cache. And even when a model does fit, running it on a single GPU may be too slow. Tensor parallelism addresses both problems: split the model's weights across multiple GPUs that work together as one.
The term tensor parallelism comes from the fact that we're splitting the tensors — the weight matrices — across devices. (A tensor is just the mathematical name for a multi-dimensional array of numbers; a weight matrix is a 2D tensor.) This chapter shows exactly how that split works, and why nano-vLLM uses two complementary types of split that fit together perfectly.
Two reasons to use multiple GPUs
A 70B model needs ~140 GB for weights (fp16). One 80 GB GPU can't hold it. Split across 4 GPUs, each holds ~35 GB — comfortably fits, with room for the KV cache.
Reason 1 — Memory capacity
The model's weights are too large for one GPU's memory. Splitting them across N GPUs means each holds only 1/N of the weights. This is the difference between "can run this model at all" and "cannot". A 405B model essentially requires tensor parallelism — there's no single GPU on earth that fits it.
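The arithmetic behind these figures is easy to check. The sketch below assumes 2 bytes per parameter (fp16 or bf16), counts weights only (no KV cache or activations), and uses decimal gigabytes; the function name is illustrative and not part of nano-vLLM.

```python
def weight_gb_per_gpu(num_params: float, tp_size: int, bytes_per_param: int = 2) -> float:
    """Weight memory held by each GPU, in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bytes_per_param / tp_size / 1e9


print(weight_gb_per_gpu(70e9, tp_size=1))   # 140.0 -> too big for an 80 GB GPU
print(weight_gb_per_gpu(70e9, tp_size=4))   # 35.0  -> fits, with room for the KV cache
print(weight_gb_per_gpu(405e9, tp_size=1))  # 810.0
```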
Reason 2 — Speed
Even if a model fits, splitting its matrix multiplications across N GPUs means each GPU does 1/N of the math, in parallel. For the memory-bound decode phase → Ch.06, it also means N GPUs read their weight shards simultaneously, giving up to N× the aggregate memory bandwidth, which is exactly what decode needs.
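A rough way to see the bandwidth argument: in decode, every GPU has to stream its share of the weights from memory once per generated token, so the time per token cannot be lower than shard size divided by memory bandwidth. The sketch below is a lower bound only. It ignores communication, the KV cache, and whether the model actually fits at a given GPU count, and the 2e12 bytes/s bandwidth is an assumed round figure, not a measurement.

```python
def min_decode_ms(weight_bytes: float, tp_size: int, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step: each GPU reads its weight shard once."""
    return weight_bytes / tp_size / bandwidth_bytes_per_s * 1e3


weights = 70e9 * 2   # 70B parameters at 2 bytes each
bandwidth = 2e12     # assumed per-GPU memory bandwidth in bytes/s

for n in (1, 2, 4, 8):
    # Doubling the GPU count halves the bound, because each shard is half the size.
    print(f"{n} GPU(s): at least {min_decode_ms(weights, n, bandwidth):.2f} ms per token")
```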
The core operation: splitting one matrix multiplication
Almost all of a transformer's compute is matrix multiplication — the input vector multiplied by a weight matrix to produce an output vector. To parallelise the model, we need to split these matrix multiplications across GPUs. There are exactly two ways to cut a weight matrix, and tensor parallelism uses both.
Way 1 — Column parallelism (split the output)
Column-parallel splitting divides the weight matrix W by its columns. Each GPU gets a vertical slice of the matrix: a subset of the columns. In the multiplication Y = XW, each column of W produces one number in the output Y, so splitting columns means each GPU computes a different part of the output. GPU 0 produces output numbers 1–3, GPU 1 produces output numbers 4–6, and so on.
Each GPU multiplies the full input by its slice of columns, producing part of the output. GPU 0 → outputs [y1,y2,y3]. GPU 1 → outputs [y4,y5,y6]. No communication needed yet — each just holds a partial output.
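The column split can be checked on a tiny example. This sketch uses plain Python lists in place of tensors so it runs without any dependencies; the two "GPUs" are simply two slices of the same matrix, and the matrix values are arbitrary.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*b)] for row in a]


X = [[1, 2]]                       # one input row with 2 features
W = [[1, 2, 3, 4, 5, 6],           # 2 x 6 weight matrix
     [7, 8, 9, 10, 11, 12]]

tp_size = 2
cols_per_gpu = len(W[0]) // tp_size

# Each "GPU" keeps a vertical slice of W and sees the full input X.
shards = [[row[r * cols_per_gpu:(r + 1) * cols_per_gpu] for row in W]
          for r in range(tp_size)]
partials = [matmul(X, shard) for shard in shards]

# Concatenating the slices along the output dimension rebuilds Y = XW.
Y_parallel = [sum((p[i] for p in partials), []) for i in range(len(X))]
Y_full = matmul(X, W)

print(partials)              # [[[15, 18, 21]], [[24, 27, 30]]]
print(Y_parallel == Y_full)  # True
```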
Way 2 — Row parallelism (split the input)
Row-parallel splitting divides the weight matrix by its rows. Each GPU gets a horizontal slice. Since each row of W corresponds to one number in the input X, row splitting means each GPU handles a different part of the input. Each GPU multiplies its slice of the input by its rows of W, which gives a full-width vector containing only that slice's contribution. The true output is the sum of these contributions, so the partial outputs must be summed together.
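The same kind of check works for the row split. Again this is a dependency-free sketch with plain lists and arbitrary values: each "GPU" sees only part of the input and the matching rows of W, and the partial outputs are added element-wise.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*b)] for row in a]


X = [[1, 2, 3, 4]]                      # one input row with 4 features
W = [[1, 2], [3, 4], [5, 6], [7, 8]]    # 4 x 2 weight matrix

tp_size = 2
rows_per_gpu = len(W) // tp_size

partials = []
for rank in range(tp_size):
    lo, hi = rank * rows_per_gpu, (rank + 1) * rows_per_gpu
    x_slice = [row[lo:hi] for row in X]   # this GPU's part of the input
    w_slice = W[lo:hi]                    # the matching rows of W
    partials.append(matmul(x_slice, w_slice))

# Every partial is full-width; the element-wise sum is the real output.
Y_parallel = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

print(partials)                    # [[[7, 10]], [[43, 50]]]
print(Y_parallel, matmul(X, W))    # [[50, 60]] [[50, 60]]
```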
All-reduce — how GPUs combine their results
When GPUs each compute a partial result, they need to combine them. The operation that does this is called all-reduce — and understanding it is essential, because it's the one moment where the GPUs must stop computing and talk to each other.
What "all-reduce" means
Break the name in two. Reduce means combining many values into one — here, summing. All means every GPU ends up with the final combined result, not just one of them. So all-reduce means: "every GPU shares its partial result, all partials are summed, and every GPU receives the complete sum." After an all-reduce, all GPUs hold the same, correct, full output.
All-reduce is implemented by a library called NCCL (NVIDIA Collective Communications Library, pronounced "nickel"). Where GPUs are linked by NVLink, NCCL uses that high-speed interconnect to exchange data far faster than going through the CPU or system memory. Even so, moving data between GPUs is much slower than computing on data already in a GPU's own memory, and every all-reduce is a synchronisation point where GPUs wait for each other. This is why minimising all-reduces matters so much.
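To make the semantics concrete, here is a single-process simulation of a sum all-reduce. It models only the result (every rank ends up with the same element-wise total), not how NCCL moves the data; `all_reduce_sum` is a hypothetical helper written for this illustration.

```python
def all_reduce_sum(per_rank):
    """Simulate all-reduce(sum): every rank receives the element-wise total."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


# One partial result per GPU (three GPUs here).
partials = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(all_reduce_sum(partials))
# [[111, 222, 333], [111, 222, 333], [111, 222, 333]]
```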
Why column-parallel and row-parallel pair perfectly
Here's the insight that makes tensor parallelism efficient. Transformer layers come in pairs of matrix multiplications — for example, the feed-forward network is two linear layers back to back, and attention has the QKV projection followed by the output projection. nano-vLLM makes the first layer column-parallel and the second layer row-parallel. When you chain them this way, the intermediate result never needs communication — only one all-reduce is needed at the very end.
First layer: column-parallel → each GPU has a slice of the intermediate
The input is fed to all GPUs. Each computes its column-slice of the first layer's output. GPU 0 holds intermediate columns 1–3, GPU 1 holds columns 4–6. Crucially, each GPU's slice is exactly what it needs to feed into the next layer — no communication required to move from layer 1 to layer 2.
Second layer: row-parallel → consumes the slice directly
The second layer is row-parallel, which means it expects its input to be split across GPUs — which is exactly the form the column-parallel output is already in! GPU 0 multiplies its intermediate slice by its row-slice of the second weight matrix. Each GPU produces a full-width partial output.
One all-reduce at the end → combine partial outputs
Now each GPU has a partial version of the full output. A single all-reduce sums them, and every GPU has the complete, correct result — ready for the next pair of layers. One communication step for two matrix multiplications. That's the efficiency win.
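The three steps above can be traced on a toy two-layer network. This sketch assumes an element-wise activation between the two layers (ReLU here, chosen for simplicity); because it acts on each number independently, every GPU can apply it to its own slice without seeing the others. Plain lists stand in for tensors and the weights are arbitrary.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


X = [[1, -2]]
W1 = [[1, -1, 2, 0],
      [0, 1, -1, 3]]                    # 2 x 4, split by columns
W2 = [[1, 0], [2, 1], [0, -1], [1, 1]]  # 4 x 2, split by rows

reference = matmul(relu(matmul(X, W1)), W2)   # single-GPU result

tp_size = 2
width = len(W1[0]) // tp_size
partials = []
for rank in range(tp_size):
    lo, hi = rank * width, (rank + 1) * width
    w1_cols = [row[lo:hi] for row in W1]
    w2_rows = W2[lo:hi]
    hidden = relu(matmul(X, w1_cols))         # local slice, no communication
    partials.append(matmul(hidden, w2_rows))  # full-width partial output

# The only communication: one all-reduce (sum) over the partial outputs.
output = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

print(partials)            # [[[1, 0]], [[0, -4]]]
print(output, reference)   # [[1, -4]] [[1, -4]]
```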
The parallel layers in code
nano-vLLM implements tensor parallelism in layers/linear.py → Ch.02 with two classes: ColumnParallelLinear and RowParallelLinear. Each shards its weight matrix at load time and handles its part of the computation.
    import torch
    import torch.nn as nn


    class ColumnParallelLinear(nn.Module):
        def __init__(self, in_features, out_features, tp_size, tp_rank):
            super().__init__()  # must run before any nn.Parameter is assigned
            # tp_size = number of GPUs; tp_rank = which GPU this is (0, 1, 2...)
            self.tp_size, self.tp_rank = tp_size, tp_rank
            # Each GPU holds only out_features / tp_size columns of the weight
            self.out_per_gpu = out_features // tp_size
            # This GPU's slice of the weight matrix: only its columns
            # Shape: [in_features, out_features / tp_size]
            # Allocated uninitialised; the values are filled in at load time
            self.weight = nn.Parameter(torch.empty(in_features, self.out_per_gpu))

        def forward(self, x):
            # Full input x, this GPU's column slice → partial output
            # No communication needed: each GPU independently produces
            # its slice of the output columns
            return x @ self.weight  # shape: [..., out_features / tp_size]
    class RowParallelLinear(nn.Module):
        def __init__(self, in_features, out_features, tp_size, tp_rank):
            super().__init__()  # must run before any nn.Parameter is assigned
            self.tp_size, self.tp_rank = tp_size, tp_rank
            # Each GPU holds only in_features / tp_size rows of the weight
            self.in_per_gpu = in_features // tp_size
            # This GPU's slice: only its rows
            # Shape: [in_features / tp_size, out_features]
            self.weight = nn.Parameter(torch.empty(self.in_per_gpu, out_features))

        def forward(self, x):
            # x is this GPU's slice of the input (from the previous
            # column-parallel layer). Produces a FULL-WIDTH partial output.
            partial = x @ self.weight  # shape: [..., out_features], but partial

            # ── THE ONE COMMUNICATION STEP ──
            # all_reduce sums every GPU's partial output element-wise, in place.
            # After this, every GPU holds the complete, correct output.
            # With a single GPU there is nothing to combine, so skip the call.
            if self.tp_size > 1:
                torch.distributed.all_reduce(partial)  # NCCL sums across GPUs
            return partial
ColumnParallelLinear.forward() has no communication — it just does a local matrix multiply. Only RowParallelLinear.forward() calls all_reduce, exactly once, at the end. This is the column-then-row pairing in code: a column-parallel layer feeds its row-parallel partner, and the single all-reduce at the end of the row-parallel layer produces the final result. Two matrix multiplies, one communication step.
    class Qwen3Attention(nn.Module):
        def __init__(self, config, tp_size, tp_rank):
            super().__init__()
            # Simplified sketch: the config field names are illustrative.
            # QKV projection is COLUMN-parallel: each GPU computes a subset
            # of the attention heads (Ch.03: heads split naturally across GPUs)
            self.qkv_proj = ColumnParallelLinear(config.hidden, config.qkv_dim, tp_size, tp_rank)
            # Output projection is ROW-parallel: consumes the per-GPU head
            # outputs directly, then all-reduces to combine. Its input width is
            # the attention output (heads × head size), not the fused QKV width.
            self.o_proj = RowParallelLinear(config.num_heads * config.head_dim, config.hidden, tp_size, tp_rank)

        def forward(self, x):
            qkv = self.qkv_proj(x)        # column-parallel: each GPU gets its heads
            # attention() is a placeholder for the attention computation
            attn_out = attention(qkv)     # each GPU runs attention on ITS heads only
            return self.o_proj(attn_out)  # row-parallel: one all-reduce → done
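Which heads end up on which GPU follows directly from the column split. The helper below is a hypothetical illustration that assumes heads are assigned in contiguous blocks and that the head count divides evenly by the number of GPUs; it ignores details such as models that use fewer key/value heads than query heads, and nano-vLLM's actual assignment may differ.

```python
def heads_for_rank(num_heads: int, tp_size: int, tp_rank: int) -> range:
    """Indices of the attention heads owned by one GPU (contiguous blocks)."""
    if num_heads % tp_size != 0:
        raise ValueError("num_heads must be divisible by tp_size")
    per_gpu = num_heads // tp_size
    return range(tp_rank * per_gpu, (tp_rank + 1) * per_gpu)


# 32 heads over 4 GPUs: rank 0 owns heads 0..7, rank 1 owns 8..15, and so on.
for rank in range(4):
    print(rank, list(heads_for_rank(32, 4, rank)))
```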
What tensor parallelism enables
Runs models that don't fit
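The fundamental enabler for large models. A 405B model needs ~810 GB at fp16, which is impossible on any single GPU and takes at least eleven 80 GB GPUs for the weights alone (16 in practice, because the split has to divide the model evenly). Splitting the weights across GPUs is the only way to serve it at that precision.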
Faster decode via more bandwidth
Decode is memory-bandwidth bound → Ch.06. With N GPUs reading weights in parallel, you get roughly N× the effective HBM bandwidth — directly speeding up the bottleneck phase of generation.
Scales within a node
Tensor parallelism works best across GPUs connected by fast NVLink within a single server (typically 8 GPUs). The all-reduce communication is frequent, so it needs the fastest possible interconnect — which is why TP usually stays within one node.
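How frequent is "frequent"? With one all-reduce per column-then-row pair, each transformer layer needs two (one for attention, one for the feed-forward network). The sketch below estimates the traffic per generated token, assuming a ring all-reduce, in which each GPU sends 2(N−1)/N times the payload. The 80 layers and hidden size of 8192 are assumed example values for a 70B-class model, and the function name is illustrative.

```python
def allreduce_bytes_per_token(num_layers: int, hidden_size: int, tp_size: int,
                              bytes_per_value: int = 2) -> float:
    """Bytes each GPU sends per generated token, assuming ring all-reduce."""
    all_reduces = 2 * num_layers                 # attention pair + FFN pair per layer
    payload = hidden_size * bytes_per_value      # one hidden vector per all-reduce
    ring_factor = 2 * (tp_size - 1) / tp_size    # per-GPU traffic of a ring all-reduce
    return all_reduces * payload * ring_factor


print(allreduce_bytes_per_token(num_layers=80, hidden_size=8192, tp_size=8))
# 4587520.0 bytes, about 4.6 MB, spread over 160 synchronisation points
```

The volume per token is small. What hurts is that each of the 160 all-reduces is a point where every GPU waits, so the latency of the interconnect matters more than its raw throughput.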
Combines with other parallelism
For truly massive models, tensor parallelism combines with pipeline parallelism (layers across nodes) and data parallelism (replicas for throughput). Large production deployments often layer all three. nano-vLLM focuses on tensor parallelism as the foundational one.
Quiz
Three questions on tensor parallelism.
1. Why does nano-vLLM make the first layer of a pair column-parallel and the second row-parallel, rather than both column-parallel?
2. After a row-parallel layer computes, each GPU holds a "full-width partial output." What does all-reduce do with these?
3. You scale a model from 4 GPUs to 8 GPUs with tensor parallelism but see only a 1.4× speedup, not 2×. Why?
What you now know
Tensor parallelism splits weights across GPUs. Each GPU holds 1/N of every layer's weight matrices. Enables models too big for one GPU, and speeds up compute by parallelising matrix multiplications.
Two ways to cut a matrix. Column-parallel splits the output (results concatenated). Row-parallel splits the input (results summed via all-reduce). They are complementary.
All-reduce is the communication step. Every GPU shares its partial result, all are summed, and every GPU receives the complete sum. Implemented by NCCL over NVLink. It's a synchronisation point — GPUs wait for each other.
Column-then-row minimises communication. The column-parallel output feeds the row-parallel layer with no intermediate communication. One all-reduce per layer-pair instead of one per layer.
Attention heads split for free. Heads are independent → Ch.03, so each GPU runs a subset with no coordination until the output projection's single all-reduce.
Scaling is sub-linear. Communication overhead grows with GPU count, so doubling GPUs gives less than 2× speedup. There's an optimal degree of parallelism, usually within one NVLink-connected node.
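A toy model shows why the speedup falls short of 2×. It assumes the compute part of each step divides perfectly by the GPU count while the communication cost per step stays fixed. That is a simplification (real overhead tends to grow with GPU count), and the millisecond figures are made up for illustration, not measured.

```python
def step_time_ms(tp_size: int, compute_ms: float, comm_ms: float) -> float:
    """Toy model: compute divides by N; communication cost does not shrink."""
    return compute_ms / tp_size + (comm_ms if tp_size > 1 else 0.0)


def speedup(n_from: int, n_to: int, compute_ms: float, comm_ms: float) -> float:
    return step_time_ms(n_from, compute_ms, comm_ms) / step_time_ms(n_to, compute_ms, comm_ms)


# 4 GPUs: 80/4 + 15 = 35 ms.  8 GPUs: 80/8 + 15 = 25 ms.
print(round(speedup(4, 8, compute_ms=80.0, comm_ms=15.0), 2))  # 1.4, not 2.0
```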