Example number
52
Example slug
example_52_linear_maps_forward_backward_as_matrix_products
Background

Historical context: Backpropagation was independently discovered multiple times (Werbos 1974, Rumelhart et al. 1986), but the mathematical foundation is the chain rule for multivariate calculus. The key insight: derivatives of compositions of linear maps involve transposes (adjoints). Matrix products became the computational substrate of deep learning because GPUs/TPUs are engineered around GEMM (General Matrix Multiply) kernels—highly optimized routines for computing $C = \alpha AB + \beta C$. Modern deep learning frameworks (PyTorch, TensorFlow, JAX) leverage automatic differentiation (autograd) to compute gradients, but the underlying operations are the manual transpose-based formulas shown here.

Mathematical characterization: For an affine transformation $f(X) = XW + b$ where $X \in \mathbb{R}^{n \times d_{\text{in}}}$, $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$, and $b \in \mathbb{R}^{d_{\text{out}}}$, the Jacobian (matrix of partial derivatives) is $\partial f / \partial X = W$ (treating $X$ as vectorized). Backpropagation applies the chain rule: if $L$ is a scalar loss depending on $Y = f(X)$, then $\partial L / \partial X = (\partial L / \partial Y)(\partial Y / \partial X) = (\partial L / \partial Y) W^\top$. The transpose arises from the adjoint of the linear map $x \mapsto Wx$.

Prevalence in ML: Every dense layer, embedding layer, and attention projection uses affine transformations. Gradient descent requires computing $\partial L / \partial W$ for parameter updates, and $\partial L / \partial X$ to propagate gradients to earlier layers. Understanding transpose chaining is essential for implementing custom layers, debugging gradient issues, and reasoning about deep architectures.

Purpose

Show how backpropagation through an affine layer $Y = XW + b$ reduces to simple matrix transposes: forward uses $W$, backward uses $W^\top$. Demonstrate that the chain rule for matrix-valued functions decomposes into three gradient computations: $\partial L / \partial W = X^\top (\partial L / \partial Y)$ (weight gradient), $\partial L / \partial X = (\partial L / \partial Y) W^\top$ (input gradient), and $\partial L / \partial b = \sum_i (\partial L / \partial Y)_i$ (bias gradient). Build intuition for transpose chaining—the universal pattern powering backpropagation in neural networks—and emphasize that understanding shapes and transposes makes gradient computation feel inevitable rather than magical.

Problem

Compute forward + backward passes for a linear layer Y=XW+b. Verify: dL/dW = X^T dL/dY and dL/dX = dL/dY W^T for loss L = sum(Y).

Solution (Math)

With $Y=XW+b$ and scalar loss $L(Y)$, reverse-mode differentiation gives

\[ \frac{\partial L}{\partial W} = X^T\frac{\partial L}{\partial Y},\qquad \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}W^T,\qquad \frac{\partial L}{\partial b} = \mathbf{1}^T\frac{\partial L}{\partial Y}. \]

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np

X = np.array([[1.,0.],
              [0.,1.],
              [1.,1.]])
W = np.array([[2.,1.],
              [-1.,3.]])
b = np.array([0.5,-2.0])

Y = X@W + b
dL_dY = np.ones_like(Y)

dL_dW = X.T @ dL_dY
dL_dX = dL_dY @ W.T
dL_db = dL_dY.sum(axis=0)

print("Y:
", Y)
print("dL/dW:
", dL_dW)
print("dL/dX:
", dL_dX)
print("dL/db:", dL_db)
Code Introduction

This code demonstrates backpropagation through an affine transformation $Y = XW + b$, the fundamental building block of neural network layers. It computes the forward pass (predictions $Y$) and backward pass (gradients with respect to all parameters and inputs), illustrating how the chain rule decomposes into simple matrix transpositions.

Part 1: Forward Pass as Affine Map. The code defines a mini-batch affine transformation $Y = XW + b \in \mathbb{R}^{3 \times 2}$, where $X \in \mathbb{R}^{3 \times 2}$ (3 examples, 2 input features), $W \in \mathbb{R}^{2 \times 2}$ (weight matrix), and $b \in \mathbb{R}^2$ (bias vector broadcast across all 3 examples). The matrix product X @ W computes a linear map $f_W : \mathbb{R}^2 \to \mathbb{R}^2$ applied row-wise to each example. Adding bias $b$ (broadcast to shape (3, 2)) shifts all outputs, making this an affine map. This is the operation performed by dense/fully-connected layers before applying nonlinearities.

Part 2: Backward Pass via Chain Rule. The gradient of a scalar loss $L$ with respect to the output $Y$ is given as dL_dY = np.ones_like(Y) (simulating $L = \sum_{ij} Y_{ij}$, so $\partial L / \partial Y_{ij} = 1$). Three gradient computations follow: (1) Weight gradient: $\partial L / \partial W = X^\top (\partial L / \partial Y)$ computed as dL_dW = X.T @ dL_dY (shape $(2, 2)$); (2) Input gradient: $\partial L / \partial X = (\partial L / \partial Y) W^\top$ computed as dL_dX = dL_dY @ W.T (shape $(3, 2)$); (3) Bias gradient: $\partial L / \partial b = \sum_i (\partial L / \partial Y)_i$ computed as dL_db = dL_dY.sum(axis=0) (shape $(2,)$). The transpose $W^\top$ “reverses” the forward linear map—this is the transpose chaining pattern that powers backpropagation. All gradients match the shapes of their corresponding parameters, and the computation leverages matrix products for efficiency (no explicit loops).

Why This Matters for ML: Backpropagation is transpose chaining—forward uses $W$, backward uses $W^\top$. This pattern repeats at every layer in a deep network. Understanding the manual computation clarifies what automatic differentiation frameworks do under the hood, essential for debugging gradient issues, implementing custom layers, and reasoning about numerical stability. Shape verification (gradients match parameter shapes) catches transpose errors and missing reductions. All operations are matrix products or simple reductions, enabling efficient vectorization on GPUs/TPUs.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Define input batch: $X \in \mathbb{R}^{3 \times 2}$ with 3 examples, 2 input features.
  2. Define parameters: Weight matrix $W \in \mathbb{R}^{2 \times 2}$ and bias $b \in \mathbb{R}^2$.
  3. Forward pass: Compute $Y = XW + b$ via matrix product and broadcasting.
  4. Simulate upstream gradient: Set $\partial L / \partial Y = \mathbf{1}$ (all ones) as placeholder for loss gradient.
  5. Compute weight gradient: $\partial L / \partial W = X^\top (\partial L / \partial Y)$ via transpose and matrix product.
  6. Compute input gradient: $\partial L / \partial X = (\partial L / \partial Y) W^\top$ for backprop to earlier layers.
  7. Compute bias gradient: $\partial L / \partial b = \sum_{i=1}^n (\partial L / \partial Y)_i$ via sum(axis=0).
  8. Verify shapes: Check that gradient shapes match parameter shapes for correctness.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Forward pass as affine map: $Y = XW + b$ computes a linear transformation followed by translation (bias).
  • Transpose reverses linear maps: Gradient w.r.t. inputs uses $W^\top$, reversing the forward multiplication by $W$.
  • Gradient accumulation over batch: Weight gradient $X^\top (\partial L / \partial Y)$ accumulates contributions from all examples.
  • Bias gradient via reduction: $\partial L / \partial b$ sums across the batch dimension because all examples share the same bias.
  • Shape verification: Gradients must match parameter shapes: $\partial L / \partial W$ has shape of $W$, $\partial L / \partial X$ has shape of $X$, $\partial L / \partial b$ has shape of $b$.
  • Matrix products vectorize computation: No explicit loops—efficient and hardware-friendly.
  • Chain rule for matrix functions: Backpropagation decomposes into simple transpose and matrix multiplication operations.
  • Foundation for deep learning: This pattern repeats at every layer in neural networks.
Notes

Part 1: Forward Pass as Affine Map - $Y = XW + b$ computes a linear transformation $x \mapsto Wx$ followed by translation $x \mapsto x + b$. - Matrix product X @ W: $(3, 2) \times (2, 2) \to (3, 2)$ applies $W$ row-wise to each example. - Bias broadcasting: $b \in \mathbb{R}^2$ is added to all 3 rows, yielding $Y \in \mathbb{R}^{3 \times 2}$. - This is the operation performed by dense/fully-connected layers before applying nonlinearities.

Part 2: Backward Pass via Chain Rule - Weight gradient: $\partial L / \partial W = X^\top (\partial L / \partial Y)$ accumulates contributions from all examples. - Input gradient: $\partial L / \partial X = (\partial L / \partial Y) W^\top$ propagates gradients to earlier layers. - Bias gradient: $\partial L / \partial b = \sum_i (\partial L / \partial Y)_i$ sums across batch (all examples share $b$). - Transpose $W^\top$ “reverses” the forward linear map—fundamental to chain rule for linear transformations.

Why This Matters for ML - Backpropagation is transpose chaining: Forward uses $W$, backward uses $W^\top$—this pattern repeats at every layer. - Batch accumulation: Gradients accumulate over examples via matrix products, providing stable estimates. - Efficient implementation: All gradients computed via matrix ops—no loops, leverages hardware acceleration. - Shape verification catches bugs: Mismatched gradient shapes signal transpose errors or missing reductions.

ML Examples and Patterns - Multi-layer perceptron: Chain multiple affine layers $Y_1 = X W_1 + b_1$, $Y_2 = Y_1 W_2 + b_2$; backprop uses transpose chaining. - Gradient descent updates: $W \leftarrow W - \eta (\partial L / \partial W)$ directly uses the computed weight gradient. - Convolutional layers: Structured linear maps where backward pass uses transposed convolution (adjoint). - Attention projections: $Q = XW_Q$, $K = XW_K$, $V = XW_V$ are affine layers; gradients use the same transpose pattern.

Connection to Linear Algebra Theory - Affine maps are not linear: $f(x) = Wx + b$ fails $f(0) = 0$, but derivative $\partial f / \partial x = W$ is linear. - Adjoint interpretation: Gradient $\partial L / \partial X = (\partial L / \partial Y) W^\top$ uses the adjoint (transpose) of forward map. - Matrix chain rule: Derivatives of matrix-valued functions compose via transpose and matrix multiplication. - Bias as rank-1 perturbation: Adding $b$ is $Y = XW + \mathbf{1} b^\top$; gradient $\partial L / \partial b = \mathbf{1}^\top (\partial L / \partial Y)$ projects onto constant direction.

Numerical and Implementation Notes - Transpose direction: Weight gradient uses $X^\top$, input gradient uses $W^\top$—swapping these breaks backprop. - Bias axis: sum(axis=0) sums over batch (rows); axis=1 would give wrong shape. - Gradient initialization: In practice, $\partial L / \partial Y$ comes from loss function or next layer, not all ones. - No nonlinearity here: Adding ReLU/sigmoid requires element-wise derivative multiplication before gradient computation. - Numerical gradient check: Verify analytical gradients via finite differences for debugging.

Numerical and Shape Notes - $X \in \mathbb{R}^{3 \times 2}$, $W \in \mathbb{R}^{2 \times 2}$, $b \in \mathbb{R}^2$, $Y \in \mathbb{R}^{3 \times 2}$. - $\partial L / \partial Y \in \mathbb{R}^{3 \times 2}$, $\partial L / \partial W \in \mathbb{R}^{2 \times 2}$, $\partial L / \partial X \in \mathbb{R}^{3 \times 2}$, $\partial L / \partial b \in \mathbb{R}^2$. - Verify: dL_dW.shape == W.shape, dL_dX.shape == X.shape, dL_db.shape == b.shape.

ML Context: From Attention to Transformers - Attention uses affine projections: $Q = XW_Q + b_Q$, $K = XW_K + b_K$, $V = XW_V + b_V$. - Backprop through attention requires gradients w.r.t. $W_Q, W_K, W_V$ using the same transpose pattern. - Multi-head attention runs multiple affine layers in parallel; gradients accumulate across heads. - Transformer blocks chain attention + feedforward (affine layers); backprop uses repeated transpose chaining.

Pedagogical Significance - Transposes reverse linear maps: Forward $Y = XW$ uses $W$; backward uses $W^\top$. - Reductions accumulate shared parameters: Bias gradients sum across batch because all examples share $b$. - Shape verification catches bugs: Mismatched shapes signal transpose errors or missing reductions. - Matrix products vectorize: No explicit loops—efficient and hardware-friendly. - Foundation for deep learning: This pattern underlies every neural network operation.

History and Applications

Backpropagation was independently discovered multiple times: Werbos (1974 thesis, published 1994), Parker (1985), and Rumelhart, Hinton, Williams (1986 Nature paper that popularized it). The core insight—derivatives of compositions of linear maps involve transposes (adjoints)—was known in calculus of variations and optimal control, but its application to neural networks enabled the deep learning revolution. The key enabler was recognizing that the chain rule decomposes gradient computation into local operations: each layer computes gradients w.r.t. its parameters and propagates gradients backward via the transpose. Modern automatic differentiation frameworks (PyTorch 2016, TensorFlow 2015, JAX 2018) automate this process via computational graphs and reverse-mode AD, but the underlying operations remain the transpose-based formulas shown here. Hardware co-evolved: GPUs (NVIDIA CUDA 2007) and TPUs (Google 2016) are optimized for dense matrix multiplication (GEMM), making affine layers $Y = XW + b$ the computational substrate of deep learning. Applications span computer vision (CNNs), NLP (transformers), reinforcement learning (policy gradients), and scientific ML (PINNs, neural ODEs). Understanding manual backpropagation remains essential for custom layer design, gradient debugging, and numerical optimization—skills that distinguish ML engineers from library users.

Connection to Broader Examples
  • Chapter 2 (Span/Linear combinations): Forward pass $Y = XW$ is a linear combination of weight columns.
  • Chapter 3 (Basis/Dimension): Weight matrix $W$ defines a linear map between input/output spaces.
  • Chapter 4 (Linear maps): This is the canonical example of forward/backward through a linear map.
  • Chapter 5 (Inner products): Gradient computation involves inner products between input/output gradients.
  • Chapter 6 (Projections): Backward pass can be viewed as projecting gradients onto input space.
  • Chapter 10 (SVD): Analyzing $W$ via SVD reveals conditioning and gradient flow properties.
  • Chapter 12 (Least-squares): Linear regression is a special case: $\hat{y} = Xw$ with loss $\|y - \hat{y}\|^2$.
  • Chapter 14 (Conditioning): Ill-conditioned $W$ causes gradient instability (vanishing/exploding).
  • Chapter 16 (Matrix products): Forward and backward are matrix products; efficient on modern hardware.
  • Deep learning: Every dense layer, attention projection, and embedding uses this pattern.

Comments