Example 4: Linear layers and backprop are linear maps + adjoints

Chapter: 4 linear_maps_matrices

Purpose

Make shapes and transposes feel inevitable—so you can reason about forward/backward passes and attention without memorizing formulas.

What you’ll learn:

Adjoint (transpose) as gradient flow: Understand that backpropagation through a linear map $Y = XW$ uses the adjoint (transpose) $W^\top$ to propagate gradients backward. This is not coincidence—it’s fundamental to how matrix calculus works: the gradient with respect to parameters is $X^\top \frac{\partial L}{\partial Y}$, with respect to inputs is $\frac{\partial L}{\partial Y} W^\top$. Knowing this pattern lets you reason about gradient flow in any architecture without memorizing formulas.
Shapes constrain gradients uniquely: Once you fix the forward pass shapes—input $(n, d_1)$, weight $(d_1, d_2)$, output $(n, d_2)$—the backward pass shapes are determined. Parameter gradients must have shape $(d_1, d_2)$ (same as weights), and this forces the use of transposes. Shape discipline is the tool that makes gradient flow inevitable rather than mysterious.
Batch aggregation via summation: The bias gradient requires summing over the batch dimension because the bias is shared across all examples. This pattern—summing gradients across a batch—appears everywhere: parameter updates average over examples, normalization layers average statistics over batches, and all collective operations pool information from multiple samples.
Transpose chaining in deep networks: In a multi-layer network, each layer’s backward pass computes three gradients (parameter, input, bias), and the input gradient becomes the upstream gradient for the previous layer’s backward pass. Understanding the shape transformations through transpose chaining is the key to implementing, debugging, and optimizing neural networks.

This example is the foundation for understanding how neural networks train: the forward pass computes predictions, the backward pass uses adjoints to compute parameter updates, and the entire training loop is a loop of forward-backward-update cycles chaining these matrix operations.

Problem

Compute Y=XW+b for a tiny batch and verify backprop identities for L=sum(Y).

Solution (math)

For $Y=XW+b$, gradients satisfy:

\[ \frac{\partial L}{\partial W}=X^T\frac{\partial L}{\partial Y},\quad \frac{\partial L}{\partial X}=\frac{\partial L}{\partial Y}W^T,\quad \frac{\partial L}{\partial b}=\mathbf{1}^T\frac{\partial L}{\partial Y}. \]

This uses the adjoint (transpose) of the linear map.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Background

These concepts are foundational in modern linear algebra and appear throughout statistical learning theory and deep learning practice.

Linear maps and matrix-vector products are the computational substrate of neural networks. A linear layer computes $Y = XW + b$ where $X \in \mathbb{R}^{n \times d_1}$ is a batch of $n$ examples, $W \in \mathbb{R}^{d_1 \times d_2}$ is the learned weight matrix, and $b \in \mathbb{R}^{d_2}$ is the learned bias. Each row of $X$ (one example) is transformed via the same weight matrix: row $i$ of $Y$ is $X_i W + b$, a linear map from $d_1$ dimensions to $d_2$ dimensions. The columns of $W$ are learned basis vectors: each column is the set of weights connecting input features to one output neuron. The ability to stack these layers—each layer’s output becomes the next layer’s input—is what enables deep networks to learn hierarchical representations. Modern GPUs/TPUs are optimized for matrix multiplication (GEMM—general matrix-matrix multiplication), so the entire neural network can be expressed as a sequence of matrix products with element-wise nonlinearities in between.

Backpropagation computes gradients via the chain rule and matrix transposes. Given a loss $L$ that depends on the output $Y$, backpropagation computes $\frac{\partial L}{\partial W}$, $\frac{\partial L}{\partial X}$, and $\frac{\partial L}{\partial b}$ by chaining the adjoint (transpose) of the forward pass. The key insight: if the forward pass is $Y = XW$, then $\frac{\partial Y_{ij}}{\partial W_{kj}} = X_{ik}$, which means $\frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Y}$ (summing over examples). For inputs, since $\frac{\partial Y_{ij}}{\partial X_{ik}} = W_{kj}$, we get $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^\top$. The transpose appears because the matrix dimensions must match for multiplication. This pattern generalizes: any linear transformation has an adjoint (its transpose), and the backward pass uses adjoints to propagate gradients. In attention mechanisms, convolutions, and all neural network layers, this transpose-chaining pattern appears identically, making it the universal language of deep learning optimization.

Batch processing and gradient accumulation are essential for efficient training. When you process $n$ examples together (a batch), the forward pass computes all $n$ predictions in parallel via matrix multiplication. The backward pass computes gradients for all examples simultaneously, then aggregates them: weight gradients are summed over the batch (via $X^\top$), bias gradients are summed over all examples, and input gradients are kept separate (one per example). This aggregation is why larger batches lead to faster training (more gradient information per forward-backward pass) but may reduce generalization (noisier gradient estimates). Understanding how gradients aggregate across batches is critical for tuning learning rates, batch sizes, and convergence behavior.

What this example demonstrates

Forward pass as linear map: Computing $Y = XW + b$ applies a learned linear transformation to each input example. The code constructs a tiny batch ($n=3$, $d_1=2$), two-dimensional output ($d_2=2$), and computes the predictions. This is the core operation of neural network inference: take an input, multiply by learned weights (stored in columns of $W$), add learned biases, and produce an output. For a single example $x_i$, the output $y_i = x_i W + b$ is a linear combination of the columns of $W$, with coefficients determined by $x_i$. Stacking multiple examples into a batch $X$ lets GPUs compute all predictions simultaneously via a single large matrix product, achieving massive parallelism.

Backward pass via transpose chaining: Given upstream gradients $\frac{\partial L}{\partial Y}$ (how much the loss changes with respect to each output), the code computes three gradient flows: (1) dL_dW = X.T @ dL_dY for parameters, (2) dL_dX = dL_dY @ W.T for inputs, (3) dL_db = dL_dY.sum(axis=0) for bias. Each uses a specific pattern: parameter gradients accumulate input-gradient products (via $X^\top$), input gradients use the parameter transpose ($W^\top$), and bias gradients sum over the batch. These three operations collectively constitute the “adjoint” of the linear map, representing how perturbations to outputs affect parameters and inputs. The fact that all three can be computed from a single upstream gradient dL_dY is the power of automatic differentiation: once you have the upstream gradient, all downstream gradients are cheap.

Shapes determine uniqueness of gradients: The forward pass shapes ($3 \times 2$, $2 \times 2$, etc.) completely determine the backward pass. Parameter gradients must be $2 \times 2$ (matching $W$), so X.T @ dL_dY is the unique way to get a $2 \times 2$ result from $X$ (shape $3 \times 2$, transposed to $2 \times 3$) and dL_dY (shape $3 \times 2$). Input gradients must be $3 \times 2$ (matching $X$), so dL_dY @ W.T is the unique way to get that shape. This inevitability of shapes makes backprop formula-free: you can derive the correct gradient computation from shapes alone, without memorizing formulas. This discipline extends to all neural network architectures: convolutional layers, attention mechanisms, and normalization layers all follow the same principle—shapes determine gradients uniquely.

Numerical Implementation Details

The code demonstrates the complete forward-backward cycle of a linear layer: one forward pass, one backward pass, all shapes determined by the dimensions specified.

For the forward pass, Y = X @ W + b computes predictions via optimized GEMM kernels in $O(ndd_1d_2)$ time (or $O(ndd)$ for square dimensions). Broadcasting adds the bias $b$ to all $n$ rows. The intermediate output Y has shape $(3, 2)$—3 examples, 2 output dimensions. In practice, this forward pass is implemented as a single fused operation on GPUs for memory efficiency.

For the backward pass, three matrix operations propagate gradients:

Weight gradient dL_dW = X.T @ dL_dY computes parameter updates. Shape: $(2, 3) @ (3, 2) = (2, 2)$, matching $W$. This operation sums gradient contributions from all 3 examples: each row $i$ of dL_dY (gradient for example $i$) is weighted by the corresponding row of $X^\top$ (input features for example $i$). The result accumulates how much each weight should change based on the total loss gradient. In a training loop, this is used to update $W \leftarrow W - \alpha \cdot dL\_dW$ for learning rate $\alpha$.
Input gradient dL_dX = dL_dY @ W.T propagates error back to the input. Shape: $(3, 2) @ (2, 2) = (3, 2)$, matching $X$. This is the upstream gradient for the previous layer (if this is not the first layer). Each row of dL_dY is multiplied by $W^\top$ to produce the corresponding row of gradient for input example. Note that dL_dX has the same shape as $X$ but contains gradients, not data.
Bias gradient dL_db = dL_dY.sum(axis=0) computes the bias update. Shape: sum over axis 0 reduces $(3, 2)$ to $(2,)$, matching $b$. Since the bias is added identically to all examples, changing $b$ by $\delta$ changes all 3 outputs by $\delta$, so gradients sum across the batch. This is the only operation where we reduce dimensions—all examples contribute equally to the bias gradient.

Shapes as computation rules: The three operations follow a simple recipe: (1) Parameter gradients use the transpose of inputs; (2) Input gradients use the transpose of weights; (3) Bias gradients sum over the batch dimension. These rules come from the matrix calculus of the forward pass: the Jacobian of $Y = XW + b$ with respect to each variable determines the transpose patterns. Implementing backpropagation for convolutional layers, attention, normalization, etc. follows the identical principle—derive the Jacobian, use transposes to compute gradients, aggregate across batches.

Numerical stability: The code uses standard matrix operations without explicit inverses, which is stable. In practice, implementations include gradient clipping (capping large gradients to prevent explosions), numerical precision handling (float32 vs float64), and careful initialization of $W$ to prevent vanishing/exploding gradients in deep networks. The fact that all three gradients come from dL_dY means a single pass of error computation supports the entire backward pass—this efficiency is why backpropagation is so practical.

Connection to the Broader Examples

This example establishes linear maps and their adjoints as the fundamental building blocks of neural networks. Every neural network layer is a linear transformation (matrix product) followed by optional nonlinearity, and every backward pass uses this transpose-chaining pattern.

Throughout the remaining 96 examples, adjoint-based gradient computation appears constantly:

Convolutional layers (related examples) perform local linear maps (via convolution kernels), and their backward passes compute gradients via transposed convolutions—applying the adjoint of the convolution operator.
Attention mechanisms (Examples 2, and related) compute $O = \text{softmax}(QK^\top / \sqrt{d_k})V$ as linear maps (in the value space), and gradients flow backward through these matrix products via transposes.
Normalization layers (future examples) compute statistics (mean, variance) across batches or spatial dimensions, and backward passes propagate gradients using the chain rule with transpose operations for the normalization transform.
Regularization and weight decay modify the loss $L$ to include penalty terms like $\|W\|^2$, which change the gradient $\frac{\partial L}{\partial W}$ by adding regularization terms.
Optimizers (SGD, Adam, etc.) use these gradients to update weights: they accumulate or transform gradients (via moving averages, second moments) before applying updates.
Loss functions (cross-entropy, MSE, etc.) produce upstream gradients $\frac{\partial L}{\partial Y}$ that feed into the backward pass; different losses have different gradient forms.
Batch normalization and layer normalization apply learned linear transforms after normalization, requiring careful gradient chaining through the normalization and linear operations.

The unifying pattern: every neural network layer can be expressed as a composition of linear maps and element-wise nonlinearities. The forward pass chains these maps left-to-right, and the backward pass chains their adjoints right-to-left. Shapes determine operations uniquely, making it impossible to get the formula wrong if you respect shape discipline. This is the foundation of modern deep learning: it’s “just” matrix multiplication and its transpose.

Notes

Shape discipline: For forward pass $Y = XW + b$ with $X \in \mathbb{R}^{n \times d_1}$, $W \in \mathbb{R}^{d_1 \times d_2}$, $b \in \mathbb{R}^{d_2}$, output is $Y \in \mathbb{R}^{n \times d_2}$. For backward pass with upstream gradient $\frac{\partial L}{\partial Y} \in \mathbb{R}^{n \times d_2}$: parameter gradients are $X^\top \frac{\partial L}{\partial Y} \in \mathbb{R}^{d_1 \times d_2}$ (shape of $W$), input gradients are $\frac{\partial L}{\partial Y}W^\top \in \mathbb{R}^{n \times d_1}$ (shape of $X$), bias gradients are sum over batch $\in \mathbb{R}^{d_2}$ (shape of $b$). Shape mismatch immediately reveals errors in gradient computation.
Adjoint as transpose: The mathematical adjoint of a linear map $T: \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$ represented by matrix $W \in \mathbb{R}^{d_2 \times d_1}$ is the transpose $W^\top: \mathbb{R}^{d_2} \to \mathbb{R}^{d_1}$. This is not just a convenient identity—it’s fundamental: $\langle Tx, y \rangle = \langle x, T^\top y \rangle$ (duality). Backpropagation is the computational manifestation of this adjoint: upstream gradients are vectors in the output space; applying $W^\top$ maps them back to the input space.
Batch aggregation and averaging: Weight gradients are computed as $X^\top \frac{\partial L}{\partial Y}$, which sums contributions from all $n$ examples. Larger batches accumulate larger weight gradients (summed over more examples), so learning rates must be scaled with batch size to maintain consistent training dynamics. Conversely, bias gradients sum over examples, making bias updates batch-size-dependent in a way that weight updates are not (unless explicitly normalized).
Gradient flow and initialization: For networks with many layers, gradients can vanish (become extremely small) or explode (become extremely large) as they propagate backward through successive transposes. Xavier/He initialization sets weights to have variance proportional to $1/\sqrt{d_1}$ or $1/\sqrt{d_1 + d_2}$ to keep gradients at a stable scale. Skip connections in ResNets bypass transpose chains to preserve gradient magnitudes. Understanding gradient flow through transpose chaining is essential for designing networks that train smoothly.
Transpose patterns in specialized architectures: Convolutional layers use transposed convolutions for gradients (padding adjustments for shape compatibility), attention layers use transpose patterns for query-key-value projections and output computation, recurrent layers backpropagate through time using the same transpose chaining principles. The universality of transpose-based backpropagation means the same debugging principles (shape checking, gradient clipping, initialization) apply across all architectures.
Part 1: Forward Pass — Computing the Linear Map The first part of the code computes $Y = XW + b$ for a batch of 3 examples ($n=3$), 2 input features ($d_1=2$), and 2 output dimensions ($d_2=2$). Each example is transformed by the same weight matrix: $y_i = x_i W + b$ where $x_i$ is row $i$ of $X$. The output $Y$ has shape $(3, 2)$—3 examples, 2 predictions each. This is the complete forward pass of a neural network layer: input features are linearly combined (via columns of $W$) and shifted (via bias $b$). In inference, the network predicts by computing this forward pass for each input batch. In training, this forward pass is computed for all examples simultaneously, enabling GPU parallelism.
Part 2: Backward Pass — Computing Gradients via Transpose Chaining The second part computes three gradient flows from upstream gradients $\frac{\partial L}{\partial Y}$ (assumed to be all ones). (1) dL_dW = X.T @ dL_dY computes parameter gradients with shape $(2, 2)$ matching $W$: this operation accumulates how each weight should change by summing input-gradient products over the batch. (2) dL_dX = dL_dY @ W.T computes input gradients with shape $(3, 2)$ matching $X$: this is the upstream gradient for the previous layer, showing how much the loss changes with respect to inputs. (3) dL_db = dL_dY.sum(axis=0) computes bias gradients with shape $(2,)$ matching $b$: bias gradients sum over the batch because the bias is shared across all examples. These three operations constitute the adjoint (transpose) of the forward pass, and they can all be computed from a single upstream gradient. This is the complete backward pass: parameter updates are derived, inputs gradients flow to the previous layer, and all gradient operations are matrix products involving transposes.

References

See Chapter 4.
G. Strang. Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press, 2016. Chapter 4 (Linear Maps and Matrix Multiplication) covers adjoints and transposes.
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. Chapter 6 (Deep Feedforward Networks) details backpropagation and gradient computation.
D. Rumelhart, G. Hinton, and R. Williams. “Learning Representations by Back-Propagating Errors.” Nature 1986. Original backpropagation paper establishing the gradient computation via chain rule.
Y. LeCun, Y. Bengio, and G. Hinton. “Deep Learning.” Nature 2015. Review article explaining gradient-based learning and the role of matrix operations in deep networks.
A. Karpathy. “A Hacker’s Guide to Neural Networks.” Blog post, 2015. Intuitive explanations of forward/backward passes and gradient shapes.
C. Olah. “Calculus on Computational Graphs: Backpropagation.” Blog post, 2015. Visual explanation of backpropagation as chain rule on computation graphs, with transpose patterns.
J. Martens and I. Sutskever. “Training Deep and Recurrent Networks with Hessian-Free Optimization.” ICML 2011. Advanced treatment of gradient dynamics, conditioning, and stability in deep networks.

Code Introduction

This code demonstrates the forward and backward passes of a linear transformation—the computational backbone of neural network training. It shows how gradients flow through a simple affine map $Y = XW + b$, revealing that backpropagation is fundamentally about chaining matrix transposes to compute sensitivities with respect to parameters and inputs.

Solution (Python)


import numpy as np

X = np.array([[1.,0.],[0.,1.],[1.,1.]])
W = np.array([[2.,1.],[-1.,3.]])
b = np.array([0.5,-2.])

Y = X@W + b
dL_dY = np.ones_like(Y)

dL_dW = X.T@dL_dY
dL_dX = dL_dY@W.T
dL_db = dL_dY.sum(axis=0)

print("Y:
", Y)
print("dL/dW:
", dL_dW)
print("dL/dX:
", dL_dX)
print("dL/db:", dL_db)

History

Matrices as linear transformations were formalized by Cayley (1858), who recognized that square arrays of numbers could represent linear functions. This abstraction enabled the treatment of compositions of transformations as matrix multiplication, unifying geometry and algebra.

Matrix representations in neural networks: While backpropagation (Rumelhart, Hinton, Williams 1986) is often credited as a “deep learning breakthrough,” it is fundamentally the chain rule applied to matrix operations. Each layer is a linear (or affine) map; composing layers composes matrices. The gradient of loss with respect to weights is computed by chaining transposes: $\nabla_W L = X^\top \nabla_Y L$.

Sparsity and efficiency: Not all learned transformations are dense matrices. Sparse transformations (low-rank, banded, structured) reduce memory and computation. Modern efficient deep learning exploits structure: attention uses low-rank patterns, CNNs use local weight sharing, RNNs reuse parameters across time. Understanding matrices as geometric transformations motivates these structural choices.