Deep learning rides on GEMM: GPUs/TPUs are optimized for batched matrix multiplications. Linear layers ($XW+b$) are the canonical GEMM use case. Backprop through linear maps uses adjoints (transposes) because inner products define how gradients pull back: the adjoint of $x \mapsto XW$ with respect to the standard inner product is $g \mapsto g W^\top$ for inputs and $g \mapsto X^\top g$ for weights. Bias gradients sum over the batch because $b$ broadcasts across rows. These rules underlie automatic differentiation systems, which repeatedly apply transposes of Jacobians for composed operations.
- Log in to post comments
Make shapes, transposes, and adjoints feel inevitable so forward/backward passes become mechanical. A linear layer is just a matrix multiplication plus a bias; its backward pass is the adjoint (transpose) applied to upstream gradients. If you internalize that backprop through linear maps is applying the transpose, you can derive gradients on the fly without memorizing. This is the core pattern reused in attention, convolutions (with reshaping), and any batched affine transform.
Compute Y=XW+b for a tiny batch and verify backprop identities for L=sum(Y).
For $Y=XW+b$, gradients satisfy:
\[ \frac{\partial L}{\partial W}=X^T\frac{\partial L}{\partial Y},\quad \frac{\partial L}{\partial X}=\frac{\partial L}{\partial Y}W^T,\quad \frac{\partial L}{\partial b}=\mathbf{1}^T\frac{\partial L}{\partial Y}. \]This uses the adjoint (transpose) of the linear map.
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
import numpy as np
X = np.array([[1.,0.],[0.,1.],[1.,1.]])
W = np.array([[2.,1.],[-1.,3.]])
b = np.array([0.5,-2.])
Y = X@W + b
dL_dY = np.ones_like(Y)
dL_dW = X.T@dL_dY
dL_dX = dL_dY@W.T
dL_db = dL_dY.sum(axis=0)
print("Y:
", Y)
print("dL/dW:
", dL_dW)
print("dL/dX:
", dL_dX)
print("dL/db:", dL_db)This snippet performs a linear map followed by a backward pass for its parameters and inputs. It computes $Y = XW + b$ for a tiny batch, sets an upstream gradient of ones, and then applies the adjoint rules: $\partial L/\partial W = X^\top (\partial L/\partial Y)$, $\partial L/\partial X = (\partial L/\partial Y) W^\top$, and $\partial L/\partial b$ as a row-sum reduction. The printed outputs verify forward values and gradient shapes, illustrating the standard affine layer backward formulas used in neural network training.
Numerical and Shape Notes
- Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
- Axis discipline: Be explicit with
axisin reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1. - Broadcasting: Check that broadcasts are intended (e.g.,
(n,1)with(n,d)). Prefer reshape/expand-dims to make semantics clear. - Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
- Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
- Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
- Sanity checks: Print shapes and residuals (e.g.,
||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.
Numerical and Implementation Notes
- Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use
solve,lstsq, Cholesky/QR/SVD. - Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (
axis) and ensure broadcasts are intended. - Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
- Conditioning: Inspect
np.linalg.cond(A)when solutions look unstable; regularize (ridge) or rescale features to improve conditioning. - Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g.,
||Ax-b||, reconstruction errors) and assert finiteness. - Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
- Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
- Sanity checks: Compare against references (e.g.,
lstsqvs.solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
- Define a minibatch $X \in \mathbb{R}^{n\times d}$, weights $W \in \mathbb{R}^{d\times k}$, bias $b \in \mathbb{R}^k$.
- Forward pass: compute $Y = XW + b$ (bias broadcasts across rows).
- Set upstream gradient
dL_dY(here all ones) matching the shape of $Y$. - Weight gradient:
dL_dW = X.T @ dL_dY, contracting batch/examples dimension to accumulate contributions. - Input gradient:
dL_dX = dL_dY @ W.T, applying the adjoint of the linear map with respect to inputs. - Bias gradient:
dL_db = dL_dY.sum(axis=0), summing over batch rows because $b$ is shared across them. - Sanity checks: print shapes and values; confirm
dL_dWhas shape $(d, k)$,dL_dXhas shape $(n, d)$, anddL_dbhas shape $(k,)`. - Extend to loss scaling: if $L = \frac{1}{n} \mathbf{1}^\top Y \mathbf{1}$, scale upstream gradients by $1/n$ before backprop to keep gradients unbiased.
Pedagogical Significance
- Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
- ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
- Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
- Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
- Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
- Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.
ML Examples and Patterns
- Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
- Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
- Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
- Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
- Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
- Forward affine map: $Y = XW + b$ as a batched matrix multiply plus broadcasted bias.
- Backward via adjoints: $\partial L / \partial W = X^\top (\partial L / \partial Y)$ and $\partial L / \partial X = (\partial L / \partial Y) W^\top$âtransposes appear because gradients flow through adjoints.
- Bias gradient as reduction: $\partial L / \partial b$ is the row-wise sum of upstream gradients since $b$ is added to every row.
- Shape tracing: $X \in \mathbb{R}^{n\times d}$, $W \in \mathbb{R}^{d\times k}$, $b \in \mathbb{R}^k$, $Y \in \mathbb{R}^{n\times k}$, $\partial L / \partial Y$ matches $Y$.
- Verifying with code: explicit small matrices make the algebra checkable; printing confirms the formulas.
Part 1: Forward linear map
Compute $Y = XW + b$. Bias broadcasts across rows. Shape: $(n, d) @ (d, k) + (k,) \to (n, k)$.
Part 2: Backward for weights and inputs
Weight gradient $\partial L/\partial W = X^\top (\partial L/\partial Y)$ accumulates over the batch. Input gradient $\partial L/\partial X = (\partial L/\partial Y) W^\top$ applies the adjoint of the forward map with respect to $X$.
Part 3: Bias gradient as sum
$\partial L/\partial b$ sums upstream gradients across the batch dimension because $b$ affects every row equally.
Connection to ML Primitives
This pattern underlies dense layers, attention score/value projections, and convolution (when reshaped to matrix multiplication). Backprop is repeatedly applying adjoints of each linear piece in the computation graph.
Numerical and Implementation Notes
Use fused GEMM where possible; ensure dL_dY shape matches Y. If using reduction losses (mean vs. sum), scale dL_dY accordingly. Monitor gradient norms when $W$ is ill-conditioned.
Numerical and Shape Notes
Shapes: $X(n\times d)$, $W(d\times k)$, $b(k)$, $Y(n\times k)$, $dL_dY(n\times k)$, $dL_dW(d\times k)$, $dL_dX(n\times d)$, $dL_db(k)$. Broadcasting $b$ is along rows. For batches in frameworks, ensure contiguous memory/layout or use proper BLAS strides.
Backpropagation of errors was popularized for multilayer neural nets by Rumelhart, Hinton, and Williams (1986), though the underlying adjoint/chain rule idea dates to control theory and calculus of variations. Efficient implementations relied on matrix multiplications; the rise of GPUs (mid-2000s) and CUDA made GEMM the workhorse for both forward and backward passes. Linear layers ($XW+b$) are the simplest case: their Jacobians are constant, and their adjoints are transposes, making gradients easy to derive and stable to compute. Conditioning of $W$ has long been known to affect optimization (ill-conditioned $W$ amplifies or damps gradients), motivating practices like orthogonal initialization and normalization layers (batch norm in 2015) to stabilize training. Today, every deep modelâtransformers, CNNs, RNNsâuses these same affine/adjoint rules within their building blocks (attention projections, 1Ã1 convolutions, feedforward layers). Understanding this example removes mystery from backprop: it is just linear algebra with transposes and reductions applied repeatedly through the computation graph.
- Vector spaces (Ch. 1): Linear layers are linear maps between finite-dimensional vector spaces.
- Linear maps (Ch. 4): $XW$ is a matrix representation of a linear map; backprop applies the adjoint (transpose) map.
- Inner products (Ch. 5): Gradients use adjoints defined by the chosen inner product; here the standard Euclidean inner product yields transposes.
- Orthogonality & projections (Ch. 6): Backprop projects upstream gradients through adjoints, analogous to projecting onto subspaces.
- SVD (Ch. 10): Conditioning of $W$ (via its singular values) affects gradient magnitudesâill-conditioned $W$ amplifies or shrinks $dL_dX$.
- Least squares (Ch. 12): Solving for $W$ in linear regression uses the same $X^\top X$ and $X^\top y$ structures as appear in $dL_dW$.
- Conditioning (Ch. 14): Poorly conditioned weight matrices lead to unstable gradients; monitoring norms and singular values is essential.
Comments