ex1.ai

Example number

100

Example slug

example_100_linear_maps_forward_backward_as_matrix_products

Background

Historical arc:

The connection between calculus and linear algebra dates to Newton and Leibniz (1600s), who formalized differentiation. By the 1800s, Cayley and Sylvester developed matrix algebra, and Jacobi introduced the Jacobian (matrix of partial derivatives). However, computing gradients in high dimensions remained impractical.

The breakthrough came in the 1960sâ1970s with automatic differentiation (AD). Wengert (1964) and Linnainmaa (1970) showed that reverse-mode AD (backpropagation) computes gradients in $O(n)$ time, not $O(n^2)$. The key insight: compose adjoints (transposes) instead of forming full Jacobians.

Backpropagation in neural networks was popularized by Rumelhart, Hinton, and Williams (1986), who showed that multilayer perceptrons could learn via gradient descent. Their paper, âLearning Representations by Back-Propagating Errors,â demonstrated that the chain rule applies layer-by-layer: gradients flow backward through weight matrices via transposes.

The 1990sâ2000s saw theoretical development (Bengio, LeCun) but limited practical success due to computational constraints. Then the 2010s brought GPU acceleration. NVIDIAâs CUDA (2007) and cuBLAS library enabled fast matrix multiplies (GEMM kernels). Suddenly, backpropâwhich is just chained matrix multipliesâbecame 10â100Ã faster on GPUs than CPUs.

Modern deep learning (2012+): - AlexNet (Krizhevsky et al., 2012): First CNN trained on GPUs via backprop; won ImageNet by large margin. - Automatic differentiation frameworks: Theano (2010), TensorFlow (2015), PyTorch (2016), JAX (2018) implement backprop automatically via computational graphs. - Transformers (Vaswani et al., 2017): Attention mechanism is matrix multiplies; backprop through attention uses transposed projections. - Large models (GPT-3, 2020; GPT-4, 2023): 175Bâ1.7T parameters trained via backprop on thousands of GPUs; efficient gradients are essential.

Core insight: Matrix products became the computational substrate of deep learning because GPUs/TPUs are engineered around GEMM (General Matrix Multiply) kernels. Backprop inherits this efficiency: every gradient computation is a matrix multiply, making deep learning tractable at scale.

Purpose

ML-first framing: Linear maps $Y = XW + b$ are the computational substrate of deep learning. The forward pass computes outputs via matrix multiplication; the backward pass (backpropagation) computes gradients via adjoints (transposes). Understanding that $\frac{\partial L}{\partial W} = X^\top (dL\_dY)$ and $\frac{\partial L}{\partial X} = (dL\_dY) W^\top$ reveals why deep learning is efficient: gradients flow through layers via chained matrix multiplies, not element-wise loops.

Core concepts: - Forward pass: $Y = XW + b$ computes affine transformation (linear map + translation) - Backward pass: Gradients computed via adjoints: $X^\top$ for weights, $W^\top$ for inputs, summation for bias - Chain rule in matrix form: Backprop composes adjoints: $(f \circ g)' = g' \circ f'$ becomes transpose reversal - Shape discipline: Gradient has same shape as variable; transpose preserves dimensions correctly

Why this matters: - Backpropagation is automatic differentiation via composed adjoints (PyTorch, JAX, TensorFlow) - Matrix operations are GPU-accelerated (GEMM kernels); backprop inherits this efficiency - Transpose patterns generalize: attention, convolutions, recurrence all use adjoint structure - Understanding adjoints unifies forward/backward passes across all neural architectures

Learning objective: Master the forward-backward pattern for linear maps. Once you internalize that âbackward is transpose,â you can derive backprop rules for any differentiable operationânonlinear activations, convolutions, attention, batch normâby applying the chain rule mechanically.

Problem

Compute forward + backward passes for a linear layer Y=XW+b. Verify: dL/dW = X^T dL/dY and dL/dX = dL/dY W^T for loss L = sum(Y).

Solution (Math)

With $Y=XW+b$ and scalar loss $L(Y)$, reverse-mode differentiation gives

\[ \frac{\partial L}{\partial W} = X^T\frac{\partial L}{\partial Y},\qquad \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}W^T,\qquad \frac{\partial L}{\partial b} = \mathbf{1}^T\frac{\partial L}{\partial Y}. \]

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


import numpy as np

X = np.array([[1.,0.],
              [0.,1.],
              [1.,1.]])
W = np.array([[2.,1.],
              [-1.,3.]])
b = np.array([0.5,-2.0])

Y = X@W + b
dL_dY = np.ones_like(Y)

dL_dW = X.T @ dL_dY
dL_dX = dL_dY @ W.T
dL_db = dL_dY.sum(axis=0)

print("Y:
", Y)
print("dL/dW:
", dL_dW)
print("dL/dX:
", dL_dX)
print("dL/db:", dL_db)

Code Introduction

This snippet demonstrates backpropagation through a linear transformationâthe forward pass computes an affine map $Y = XW + b$, and the backward pass uses the chain rule to propagate gradients from loss to weights, inputs, and biases. This is the foundational pattern of automatic differentiation in neural networks, where each operation has an adjoint (transpose) that flows gradients backward.

Numerical Implementation Details

Forward pass cost: $Y = X @ W + b$ costs $O(ndm)$ for matrix multiply ($X \in \mathbb{R}^{n \times d}, W \in \mathbb{R}^{d \times m}$), plus $O(nm)$ for bias addition (broadcast). For typical neural network layers ($n = 32$ batch, $d = 512$ input, $m = 256$ output), this is ~4M FLOPsâfast on GPUs.
Backward pass cost: $\frac{\partial L}{\partial W} = X^\top (dL\_dY)$ costs $O(ndm)$; $\frac{\partial L}{\partial X} = (dL\_dY) @ W^\top$ costs $O(ndm)$; $\frac{\partial L}{\partial b}$ summation costs $O(nm)$. Total: $O(ndm)$ (same order as forward pass). Backprop is not more expensive than forwardâboth are dominated by matrix multiplies.
Memory layout and cache efficiency: Store matrices in row-major (C) or column-major (Fortran) order. NumPy uses row-major by default; BLAS/LAPACK prefer column-major. Transpose operations can be implicit (swap strides) or explicit (copy data). For backprop, implicit transpose is free (pointer arithmetic); explicit transpose costs $O(dm)$ (data copy).
GEMM kernels dominate: Matrix multiply Y = X @ W calls GEMM (General Matrix Multiply) from BLAS (Basic Linear Algebra Subprograms). Modern GEMM implementations (OpenBLAS, MKL, cuBLAS) use cache blocking, vectorization (AVX-512), and GPU parallelism. On GPU, GEMM achieves 10â50 TFLOP/s (teraflops per second); on CPU, 0.1â1 TFLOP/s.
Accumulation and numerical stability: Gradient $\frac{\partial L}{\partial W} = X^\top (dL\_dY)$ accumulates contributions from all $n$ examples. For large $n$ or many batches, accumulated gradient can lose precision (catastrophic cancellation). Use Kahan summation or mixed precision (accumulate in FP32, compute in FP16) to mitigate.
Batch size affects efficiency: For batch size $n$, cost is $O(ndm)$. Larger $n$ amortizes fixed costs (kernel launch, memory transfers). On GPUs, batch sizes 32â256 are typical; too small ($n < 8$) underutilizes GPU; too large ($n > 1024$) exhausts memory. Optimal $n$ depends on hardware and model size.
Gradient clipping and normalization: Gradients $\frac{\partial L}{\partial W}$ can explode (large values) or vanish (near-zero). Gradient clipping (cap $\|\nabla L\|$ at threshold) and gradient normalization ($\nabla L \leftarrow \nabla L / \|\nabla L\|$) stabilize training. Applied after backprop, before optimizer step.
Checkpointing for memory: In very deep networks (100+ layers), storing all activations for backprop exhausts memory. Gradient checkpointing saves only some activations (e.g., every 10 layers), recomputing others during backprop. Trades computation (2Ã forward passes) for memory (10Ã reduction).

What This Example Demonstrates

Forward pass as affine transformation: $Y = XW + b$ computes a linear map (matrix multiply) plus translation (bias). Input $X \in \mathbb{R}^{n \times d}$ is transformed to output $Y \in \mathbb{R}^{n \times m}$ via weights $W \in \mathbb{R}^{d \times m}$ and bias $b \in \mathbb{R}^m$. This is the fundamental operation of neural networks.
Backward pass via adjoints: Given upstream gradient $\frac{\partial L}{\partial Y}$, compute parameter gradients via transposed operations: $\frac{\partial L}{\partial W} = X^\top (dL\_dY)$, $\frac{\partial L}{\partial X} = (dL\_dY) W^\top$. The transpose is the adjointâit reverses the direction of information flow.
Bias gradient via summation: Bias is broadcast to all examples in the forward pass ($Y = XW + b$ adds $b$ to each row). In the backward pass, gradients sum across examples: $\frac{\partial L}{\partial b} = \mathbf{1}^\top (dL\_dY)$ (sum over axis 0). Broadcasting forward = summation backward.
Chain rule in matrix form: For composed maps $Z = f(Y), Y = g(X)$, the chain rule $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial X}$ becomes matrix multiplication. For linear $Y = XW$, the Jacobian $\frac{\partial Y}{\partial X}$ is implicitly $W^\top$, so $\frac{\partial L}{\partial X} = (dL\_dY) W^\top$.
Shape discipline: Gradients match variable shapes. $\frac{\partial L}{\partial W}$ has shape $(d, m)$ (same as $W$), $\frac{\partial L}{\partial X}$ has shape $(n, d)$ (same as $X$), $\frac{\partial L}{\partial b}$ has shape $(m,)$ (same as $b$). Transposes preserve dimensional correctness.
Numerical verification via finite differences: Analytical gradients (backprop) should match numerical gradients (finite differences): $\frac{\partial L}{\partial W_{ij}} \approx \frac{L(W + \epsilon e_{ij}) - L(W - \epsilon e_{ij})}{2\epsilon}$. Agreement to $10^{-8}$ confirms correctness.
Efficiency of matrix operations: Computing $\frac{\partial L}{\partial W} = X^\top (dL\_dY)$ is $O(ndm)$ (one matrix multiply). Element-wise loop over all $d \times m$ parameters would be $O(ndm \cdot n) = O(n^2 dm)$ (slower). Matrix formulation is essential for efficiency.
Generalizes to all layers: This forward-backward pattern applies to all differentiable operations: convolutions (transpose = flipped kernel), pooling (transpose = unpooling with routing), attention (transpose of Q/K/V projections). Master linear maps, and nonlinear operations are straightforward extensions.

Notes

Part 1: Forward Pass and Affine Transformations

The forward pass $Y = XW + b$ is an affine transformation: a linear map ($XW$) plus a translation ($+b$). Linear maps preserve vector space structure (addition and scalar multiplication); translation shifts the origin. In neural networks, each layer applies an affine transformation followed by a nonlinear activation (e.g., ReLU, sigmoid). The nonlinearity is essentialâstacking linear layers without activation is equivalent to a single linear layer (composition of linear maps is linear).

Part 2: Adjoints and Transpose Operations

The adjoint of a linear operator $T$ is $T^*$ such that $\langle Tx, y \rangle = \langle x, T^* y \rangle$ (inner product definition). For matrices, the adjoint is the transpose: $(XW)^* = W^\top X^\top$. In backpropagation, the adjoint flows gradients backward. For $Y = XW$, forward multiplies by $W$; backward multiplies by $W^\top$. This symmetryâforward is $W$, backward is $W^\top$âis the organizing principle of automatic differentiation.

Part 3: Broadcasting and Bias Gradients

Bias addition $Y = XW + b$ broadcasts $b \in \mathbb{R}^m$ across all $n$ examples (adds $b$ to each row of $XW$). In the backward pass, this becomes summation: $\frac{\partial L}{\partial b} = \sum_{i=1}^n \frac{\partial L}{\partial Y_i}$ (sum over examples). Broadcasting in forward pass corresponds to reduction (summation) in backward pass. This pattern generalizes: any operation that replicates data (broadcast, repeat) has adjoint that aggregates data (sum, mean).

Why This Matters for ML

Backpropagation is the engine of deep learning. Without efficient gradient computation, neural networks couldnât scale beyond toy problems. The key insight: gradients are computed via reverse-mode automatic differentiation, which applies the chain rule backward through the computation graph. Each operation has a forward pass (compute output) and a backward pass (compute gradient via adjoint). For linear maps, the adjoint is the transpose; for nonlinear operations (ReLU, softmax), the adjoint is the Jacobian-vector product. Understanding linear map backprop is the foundationâonce mastered, all other operations follow the same pattern.

ML Examples and Patterns

Multilayer perceptron (MLP): Stack linear layers with nonlinear activations: \[ h_1 = \sigma(X W_1 + b_1), \quad h_2 = \sigma(h_1 W_2 + b_2), \quad Y = h_2 W_3 + b_3. \] Backprop flows gradients backward: $\frac{\partial L}{\partial W_3} = h_2^\top (dL\_dY)$, $\frac{\partial L}{\partial h_2} = (dL\_dY) W_3^\top$, then through $\sigma'(\cdot)$, then $\frac{\partial L}{\partial W_2} = h_1^\top (dL\_dh_2 \odot \sigma'(\cdot))$, and so on.

Convolutional layers: Convolution $Y = X * W$ (where $*$ is convolution operator) has adjoint $\frac{\partial L}{\partial X} = (dL\_dY) * \text{flip}(W)$ (convolution with flipped kernel). Transpose convolution (deconvolution) is the adjoint of forward convolution.

Recurrent neural networks (RNNs): $h_t = \sigma(h_{t-1} W_h + x_t W_x + b)$. Backprop through time (BPTT) unrolls the recurrence and applies backprop through the sequence. Gradients $\frac{\partial L}{\partial W_h}$ accumulate contributions from all time steps: $\sum_t h_{t-1}^\top (dL\_dh_t)$.

Attention mechanisms: $O = \text{softmax}(QK^\top) V$. Backprop through softmax is nontrivial (Jacobian is dense), but backprop through $QK^\top$ and $V$ follows the linear map pattern: $\frac{\partial L}{\partial Q} = (dL\_dO \cdot \text{softmax}(QK^\top)) K$, $\frac{\partial L}{\partial V} = (\text{softmax}(QK^\top))^\top (dL\_dO)$.

Batch normalization: $\hat{x} = \frac{x - \mu}{\sigma}$ (normalize), $y = \gamma \hat{x} + \beta$ (scale and shift). Backprop computes $\frac{\partial L}{\partial x}$ via chain rule through normalization and affine transformation. The adjoint of mean subtraction is summation; adjoint of division by $\sigma$ is multiplication by $-\sigma^{-2}$.

Connection to Linear Algebra Theory

Chain rule and composition: For $f \circ g$, the derivative is $(f \circ g)' = f' \circ g'$ (function composition). For matrices, this becomes $(AB)^\top = B^\top A^\top$ (transpose reverses order). Backprop chains adjoints in reverse order: gradient flows right-to-left through layers.

Jacobian-vector products (JVPs): Forward-mode autodiff computes $\frac{\partial Y}{\partial X} \cdot v$ (Jacobian times vector). Reverse-mode autodiff (backprop) computes $u^\top \frac{\partial Y}{\partial X}$ (vector times Jacobian). For $Y = XW$, JVP is $(vW)$; VJP is $(uW^\top)$. Both are matrix multiplies; reverse-mode is faster when $\dim(Y) \ll \dim(X)$ (typical in deep learning: scalar loss, many parameters).

Gradient as covector: Gradients are elements of the dual space (covectors, not vectors). For $f: \mathbb{R}^n \to \mathbb{R}$, $\nabla f \in (\mathbb{R}^n)^*$ (dual space). In coordinates, covectors are row vectors; vectors are column vectors. Transpose converts between them. Backprop naturally produces covectors (row vectors); transposing gives column vectors.

Pushforward and pullback: Forward pass is pushforward (map vectors forward through $f$); backward pass is pullback (pull covectors back through $f^*$). For linear $f(x) = Wx$, pushforward is $v \mapsto Wv$; pullback is $\alpha \mapsto W^\top \alpha$. These are adjoint operations.

Numerical and Implementation Notes

Avoid forming full Jacobians: For $Y = XW \in \mathbb{R}^{n \times m}$ with $X \in \mathbb{R}^{n \times d}$, the Jacobian $\frac{\partial Y}{\partial X} \in \mathbb{R}^{nm \times nd}$ has $n^2 dm$ entries (huge). Never form it explicitly. Instead, compute Jacobian-vector products: $\frac{\partial L}{\partial X} = (dL\_dY) W^\top$ is $O(ndm)$ (one matrix multiply), not $O(n^2 dm)$.

In-place operations and memory: In PyTorch, operations like x += y modify x in-place. This breaks backprop (gradient graph needs original x). Use x = x + y (out-of-place) unless you explicitly detach from gradient graph. Similarly, views (e.g., x.view()) share memory; modifying one affects the other.

Gradient accumulation: In mini-batch training, gradients are averaged over batch: $\frac{\partial L}{\partial W} = \frac{1}{n} X^\top (dL\_dY)$. For multiple batches, accumulate gradients: $\nabla W \leftarrow \nabla W + \frac{\partial L_i}{\partial W}$ (sum over batches), then divide by total examples. Used when batch size exceeds GPU memory.

Mixed precision training: Compute forward/backward in FP16 (half precision, 2 bytes per float); accumulate gradients in FP32 (single precision, 4 bytes). FP16 is 2Ã faster on modern GPUs (Tensor Cores) but less precise. Loss scaling (multiply loss by large constant before backprop, divide gradients after) prevents underflow.

Numerical and Shape Notes

Forward shapes: $X \in \mathbb{R}^{n \times d}$ (input), $W \in \mathbb{R}^{d \times m}$ (weights), $b \in \mathbb{R}^m$ (bias), $Y \in \mathbb{R}^{n \times m}$ (output).
Backward shapes: $\frac{\partial L}{\partial Y} \in \mathbb{R}^{n \times m}$ (upstream gradient), $\frac{\partial L}{\partial W} \in \mathbb{R}^{d \times m}$ (weight gradient), $\frac{\partial L}{\partial X} \in \mathbb{R}^{n \times d}$ (input gradient), $\frac{\partial L}{\partial b} \in \mathbb{R}^m$ (bias gradient).
Transpose shapes: $X^\top \in \mathbb{R}^{d \times n}$, $W^\top \in \mathbb{R}^{m \times d}$. Check: $X^\top (dL\_dY) \in \mathbb{R}^{d \times n} \times \mathbb{R}^{n \times m} = \mathbb{R}^{d \times m}$ (matches $W$ shape â).
Bias broadcast: Y = X @ W + b broadcasts b (shape (m,)) to (n, m) (adds to each row). NumPy/PyTorch handle this automatically.
Reduction: dL_db = dL_dY.sum(axis=0) sums over examples (axis 0), producing shape (m,) (matches b â).

Pedagogical Significance

This example is the Rosetta Stone of deep learning. Master it, and you unlock:

Automatic differentiation: PyTorch, TensorFlow, JAX implement backprop automatically. Understanding how it works (compose adjoints) demystifies âautograd.â
Debugging gradients: When training fails, check gradient norms, verify shapes, compare analytical vs.Â numerical gradients. This example shows how.
Custom layers: To implement a new operation (e.g., custom attention, novel activation), define forward pass and backward pass (adjoint). Follow the linear map pattern.
Optimization intuition: Gradients point toward steepest ascent. Negative gradient (descent direction) reduces loss. Transpose structure ensures gradients flow correctly.
Efficiency mindset: Backprop is $O(\text{forward cost})$ (same order as forward pass). Never compute full Jacobians; always use matrix multiplies.

Every deep learning practitioner should: - Run this code (forward + backward + finite difference verification). - Verify shapes (print all tensor dimensions; check transpose correctness). - Modify it (change $X, W, b$ sizes; add nonlinearity; chain multiple layers). - Profile it (time forward vs.Â backward; compare CPU vs.Â GPU; measure memory).

This is the pattern that scales to GPT-4, Stable Diffusion, AlphaFold. Internalize it, and you have the foundation for all of modern ML.

Connection to Broader Examples

Example 86 (Orthogonality & Projections): Transpose $W^\top$ is the adjoint of $W$. For orthogonal $W$ ($W^\top W = I$), forward and backward passes have the same condition number (numerically stable). Non-orthogonal $W$ can amplify or dampen gradients.
Example 87 (Rank & Nullspace): If $W$ is rank-deficient ($\text{rank}(W) < \min(d, m)$), some input directions map to zero (nullspace). Gradients cannot flow through nullspace: $\frac{\partial L}{\partial X} = (dL\_dY) W^\top$ loses information if $W$ has nullspace. Low-rank $W$ acts as bottleneck.
Example 88 (Eigenvalues): Spectral norm $\|W\|_2 = \sigma_1(W)$ (largest singular value) controls gradient scaling. If $\|W\|_2 \gg 1$, gradients explode; if $\|W\|_2 \ll 1$, gradients vanish. Spectral normalization (divide $W$ by $\|W\|_2$) stabilizes training.
Example 90 (SVD & Low-Rank): $W = U \Sigma V^\top$ (SVD). Backprop gradient $\frac{\partial L}{\partial W}$ can be low-rank (rank $\ll \min(d,m)$). Low-rank gradient updates (e.g., LoRA in fine-tuning) reduce parameters while preserving expressiveness.
Example 91 (PCA): PCA projection $Z = X V_k$ is forward pass with orthonormal $V_k$; backprop is $\frac{\partial L}{\partial X} = (dL\_dZ) V_k^\top$. Orthonormality ($V_k^\top V_k = I$) ensures gradient norm is preserved.
Example 92 (Least Squares): Normal equations $X^\top X \beta = X^\top y$ arise from gradient $\frac{\partial L}{\partial \beta} = X^\top (X\beta - y) = 0$. Setting backprop gradient to zero gives least squares solution. Regularization (ridge, LASSO) modifies gradient.
Example 96 (Attention & Transformers): Attention computes $O = \text{softmax}(QK^\top) V$. Backprop through softmax uses chain rule; backprop through $QK^\top$ and $V$ uses transposed projections ($Q^\top, K^\top, V^\top$). Same adjoint pattern as linear layers.
Example 97 (Vector Spaces): Linear maps preserve vector space structure: $f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)$. Backprop also linear: $\nabla (\alpha L_1 + \beta L_2) = \alpha \nabla L_1 + \beta \nabla L_2$. Linearity enables efficient gradient computation.
Example 98 (Span): Outputs $Y = XW$ lie in span of columns of $W$ (column space). If $W$ is low-rank, output space is constrained. Gradients $\frac{\partial L}{\partial X}$ lie in span of rows of $W$ (row space).