ex1.ai

Example slug

example_4_linear_layers_and_backprop_are_linear_maps_adjoints

Background

These concepts are foundational in modern linear algebra and appear throughout statistical learning theory and deep learning practice.

Linear maps and matrix-vector products are the computational substrate of neural networks. A linear layer computes $Y = XW + b$ where $X \in \mathbb{R}^{n \times d_1}$ is a batch of $n$ examples, $W \in \mathbb{R}^{d_1 \times d_2}$ is the learned weight matrix, and $b \in \mathbb{R}^{d_2}$ is the learned bias. Each row of $X$ (one example) is transformed by the same weight matrix: row $i$ of $Y$ is $X_i W + b$, an affine map (a linear map plus a shift) from $d_1$ dimensions to $d_2$ dimensions. Each column of $W$ holds the weights connecting all input features to one output neuron. The ability to stack these layers, with each layer's output becoming the next layer's input, is what enables deep networks to learn hierarchical representations. Modern GPUs/TPUs are optimized for matrix multiplication (GEMM, general matrix-matrix multiplication), so most of a neural network's computation can be expressed as a sequence of matrix products with element-wise nonlinearities in between.

Backpropagation computes gradients via the chain rule and matrix transposes. Given a loss $L$ that depends on the output $Y$, backpropagation computes $\frac{\partial L}{\partial W}$, $\frac{\partial L}{\partial X}$, and $\frac{\partial L}{\partial b}$ by applying the adjoint (transpose) of the forward pass. The key insight: if the forward pass is $Y = XW$, then $\frac{\partial Y_{ij}}{\partial W_{kj}} = X_{ik}$, so the chain rule gives $\frac{\partial L}{\partial W_{kj}} = \sum_i X_{ik} \frac{\partial L}{\partial Y_{ij}}$, which in matrix form is $\frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Y}$ (the sum over $i$ runs over examples). For inputs, since $\frac{\partial Y_{ij}}{\partial X_{ik}} = W_{kj}$, we get $\frac{\partial L}{\partial X_{ik}} = \sum_j \frac{\partial L}{\partial Y_{ij}} W_{kj}$, that is, $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^\top$. The transposes are exactly what these index sums require, and they also make the matrix dimensions match. This pattern generalizes: any linear transformation has an adjoint (for real matrices, the transpose), and the backward pass uses adjoints to propagate gradients. The same pattern recurs in the linear parts of attention mechanisms, convolutions, and other neural network layers, which makes it a common language for reasoning about gradient computation in deep learning.
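The two transpose formulas above can be checked numerically. The following is a minimal sketch, assuming a hypothetical quadratic loss $L = \tfrac{1}{2}\sum_{ij} Y_{ij}^2$ (chosen here only so that $\frac{\partial L}{\partial Y} = Y$ is not constant); the sizes, the seed, and the helper name numerical_grad are illustrative assumptions, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed for reproducibility
n, d1, d2 = 4, 3, 2                     # illustrative sizes (assumed)
X = rng.normal(size=(n, d1))
W = rng.normal(size=(d1, d2))

def loss(X, W):
    """Hypothetical scalar loss L = 0.5 * sum(Y**2) with Y = X @ W."""
    Y = X @ W
    return 0.5 * np.sum(Y ** 2)

# Analytic gradients: for this loss, dL/dY = Y.
G = X @ W
dL_dW = X.T @ G          # (d1, n) @ (n, d2) -> (d1, d2)
dL_dX = G @ W.T          # (n, d2) @ (d2, d1) -> (n, d1)

def numerical_grad(f, A, eps=1e-6):
    """Central finite differences, perturbing one entry of A at a time."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        A_plus, A_minus = A.copy(), A.copy()
        A_plus[idx] += eps
        A_minus[idx] -= eps
        grad[idx] = (f(A_plus) - f(A_minus)) / (2 * eps)
    return grad

num_dW = numerical_grad(lambda W_: loss(X, W_), W)
num_dX = numerical_grad(lambda X_: loss(X_, W), X)

print("max |dW error|:", np.max(np.abs(dL_dW - num_dW)))
print("max |dX error|:", np.max(np.abs(dL_dX - num_dX)))
```

Both errors should be at the level of floating-point noise; a large discrepancy usually points to a missing or misplaced transpose.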

Batch processing and gradient accumulation are essential for efficient training. When you process $n$ examples together (a batch), the forward pass computes all $n$ predictions in parallel via matrix multiplication. The backward pass computes gradients for all examples simultaneously, then aggregates them: weight gradients are summed over the batch (via $X^\top$), bias gradients are summed over all examples, and input gradients are kept separate (one per example). This aggregation is why larger batches give lower-variance gradient estimates and better hardware utilization per forward-backward pass; they may, however, generalize worse in some settings, an effect often attributed to the reduced gradient noise. Understanding how gradients aggregate across batches is critical for tuning learning rates, batch sizes, and convergence behavior.
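To make the aggregation concrete, here is a small sketch that accumulates per-example contributions in a loop and compares them with the batched matrix expressions. The random data and the array G (a stand-in for the rows of $\frac{\partial L}{\partial Y}$) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d1, d2 = 5, 3, 2                     # illustrative sizes (assumed)
X = rng.normal(size=(n, d1))
G = rng.normal(size=(n, d2))            # stand-in for per-example rows of dL/dY

# Batched gradients.
dL_dW = X.T @ G
dL_db = G.sum(axis=0)

# The same quantities accumulated one example at a time.
dW_loop = np.zeros((d1, d2))
db_loop = np.zeros(d2)
for i in range(n):
    dW_loop += np.outer(X[i], G[i])     # (d1,) outer (d2,) -> (d1, d2)
    db_loop += G[i]

# With a mean-reduced loss the upstream gradient carries a 1/n factor,
# so both parameter gradients become batch averages.
dL_dW_mean = X.T @ (G / n)
dL_db_mean = (G / n).sum(axis=0)

print(np.allclose(dL_dW, dW_loop), np.allclose(dL_db, db_loop))
```

The loop is only for exposition; the matrix product performs the same sum in a single call.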

Purpose

Make shapes and transposes feel inevitable, so you can reason about forward/backward passes and attention without memorizing formulas.

What you'll learn:

Adjoint (transpose) as gradient flow: Understand that backpropagation through a linear map $Y = XW$ uses the adjoint (transpose) $W^\top$ to propagate gradients backward. This is not coincidence; it's fundamental to how matrix calculus works: the gradient with respect to parameters is $X^\top \frac{\partial L}{\partial Y}$, with respect to inputs is $\frac{\partial L}{\partial Y} W^\top$. Knowing this pattern lets you reason about gradient flow in any architecture without memorizing formulas.
Shapes constrain gradients uniquely: Once you fix the forward pass shapes (input $(n, d_1)$, weight $(d_1, d_2)$, output $(n, d_2)$), the backward pass shapes are determined. Parameter gradients must have shape $(d_1, d_2)$ (same as weights), and this forces the use of transposes. Shape discipline is the tool that makes gradient flow inevitable rather than mysterious.
Batch aggregation via summation: The bias gradient requires summing over the batch dimension because the bias is shared across all examples. This pattern of summing gradients across a batch appears everywhere: parameter updates average over examples, normalization layers average statistics over batches, and all collective operations pool information from multiple samples.
Transpose chaining in deep networks: In a multi-layer network, each layer's backward pass computes three gradients (parameter, input, bias), and the input gradient becomes the upstream gradient for the previous layer's backward pass. Understanding the shape transformations through transpose chaining is the key to implementing, debugging, and optimizing neural networks.
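The chaining described in the last item can be sketched for a two-layer network. This is an illustrative example under stated assumptions: the ReLU nonlinearity, the layer sizes, and the loss $L = \sum Y$ are choices made here, not taken from the original text.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d0, d1, d2 = 4, 3, 5, 2              # illustrative sizes (assumed)
X = rng.normal(size=(n, d0))
W1 = rng.normal(size=(d0, d1))
b1 = rng.normal(size=d1)
W2 = rng.normal(size=(d1, d2))
b2 = rng.normal(size=d2)

def forward(W1, b1, W2, b2):
    Z1 = X @ W1 + b1                    # (n, d1)
    H = np.maximum(Z1, 0.0)             # ReLU, element-wise
    Y = H @ W2 + b2                     # (n, d2)
    return Z1, H, Y

Z1, H, Y = forward(W1, b1, W2, b2)

# Backward pass for L = sum(Y), chained right-to-left.
dY = np.ones_like(Y)
dW2 = H.T @ dY                          # (d1, n) @ (n, d2) -> (d1, d2)
db2 = dY.sum(axis=0)                    # (d2,)
dH = dY @ W2.T                          # input gradient of layer 2 ...
dZ1 = dH * (Z1 > 0)                     # ... passed through the ReLU derivative
dW1 = X.T @ dZ1                         # (d0, n) @ (n, d1) -> (d0, d1)
db1 = dZ1.sum(axis=0)                   # (d1,)
dX = dZ1 @ W1.T                         # (n, d0)

for name, g, p in [("W1", dW1, W1), ("b1", db1, b1), ("W2", dW2, W2), ("b2", db2, b2)]:
    print(name, g.shape, p.shape)
```

Each layer repeats the same three operations; the only new ingredient is the element-wise mask contributed by the nonlinearity.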

This example is the foundation for understanding how neural networks train: the forward pass computes predictions, the backward pass uses adjoints to compute parameter updates, and the entire training loop is a loop of forward-backward-update cycles chaining these matrix operations.

Problem

Compute Y=XW+b for a tiny batch and verify backprop identities for L=sum(Y).

Solution (Math)

For $Y=XW+b$, gradients satisfy:

\[ \frac{\partial L}{\partial W}=X^T\frac{\partial L}{\partial Y},\quad \frac{\partial L}{\partial X}=\frac{\partial L}{\partial Y}W^T,\quad \frac{\partial L}{\partial b}=\mathbf{1}^T\frac{\partial L}{\partial Y}. \]

This uses the adjoint (transpose) of the linear map.
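For the specific loss in the problem statement, $L=\sum_{i,j}Y_{ij}$, the upstream gradient is a matrix of ones, and the three identities specialize as follows. The symbols $\mathbf{1}_n$ and $\mathbf{1}_{d_2}$ (all-ones column vectors of length $n$ and $d_2$) are notation introduced here for the worked step.

```latex
\[
L=\sum_{i,j}Y_{ij}
\;\Longrightarrow\;
\frac{\partial L}{\partial Y}=\mathbf{1}_n\mathbf{1}_{d_2}^T,
\]
\[
\frac{\partial L}{\partial W}=X^T\mathbf{1}_n\mathbf{1}_{d_2}^T,\qquad
\frac{\partial L}{\partial X}=\mathbf{1}_n\mathbf{1}_{d_2}^T W^T=\mathbf{1}_n\,(W\mathbf{1}_{d_2})^T,\qquad
\frac{\partial L}{\partial b}=\mathbf{1}_n^T\mathbf{1}_n\mathbf{1}_{d_2}^T=n\,\mathbf{1}_{d_2}^T.
\]
```

In words: every column of $\frac{\partial L}{\partial W}$ holds the column sums of $X$, every row of $\frac{\partial L}{\partial X}$ holds the row sums of $W$, and every entry of $\frac{\partial L}{\partial b}$ equals the batch size $n$.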

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


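import numpy as np

# Tiny batch: n=3 examples, d1=2 input features, d2=2 outputs.
X = np.array([[1., 0.], [0., 1.], [1., 1.]])   # shape (3, 2)
W = np.array([[2., 1.], [-1., 3.]])            # shape (2, 2)
b = np.array([0.5, -2.])                       # shape (2,)

# Forward pass: b broadcasts across the 3 rows.
Y = X @ W + b                                  # shape (3, 2)

# Upstream gradient for L = sum(Y): every entry of Y contributes 1.
dL_dY = np.ones_like(Y)

# Backward pass.
dL_dW = X.T @ dL_dY          # (2, 3) @ (3, 2) -> (2, 2), matches W
dL_dX = dL_dY @ W.T          # (3, 2) @ (2, 2) -> (3, 2), matches X
dL_db = dL_dY.sum(axis=0)    # sum over the batch -> (2,), matches b

print("Y:\n", Y)
print("dL/dW:\n", dL_dW)
print("dL/dX:\n", dL_dX)
print("dL/db:", dL_db)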
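As a cross-check on the script above, the values for this tiny batch can be worked out by hand and compared against the computed arrays. The sketch below repeats the same data so that it runs on its own; the expected_* names are introduced here for illustration.

```python
import numpy as np

X = np.array([[1., 0.], [0., 1.], [1., 1.]])
W = np.array([[2., 1.], [-1., 3.]])
b = np.array([0.5, -2.])

Y = X @ W + b
dL_dY = np.ones_like(Y)                 # upstream gradient for L = sum(Y)
dL_dW = X.T @ dL_dY
dL_dX = dL_dY @ W.T
dL_db = dL_dY.sum(axis=0)

# Hand-computed values for this batch.
expected_Y = np.array([[2.5, -1.0], [-0.5, 1.0], [1.5, 2.0]])
expected_dW = np.array([[2., 2.], [2., 2.]])             # column sums of X, one copy per output
expected_dX = np.array([[3., 2.], [3., 2.], [3., 2.]])   # row sums of W, one copy per example
expected_db = np.array([3., 3.])                         # batch size n = 3 for each output

checks = {
    "Y": np.allclose(Y, expected_Y),
    "dL/dW": np.allclose(dL_dW, expected_dW),
    "dL/dX": np.allclose(dL_dX, expected_dX),
    "dL/db": np.allclose(dL_db, expected_db),
}
print(checks)
```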

Code Introduction

This code demonstrates the forward and backward passes of a linear transformation, the computational backbone of neural network training. It shows how gradients flow through a simple affine map $Y = XW + b$, revealing that backpropagation is fundamentally about chaining matrix transposes to compute sensitivities with respect to parameters and inputs.

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n \times d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to approximately 1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $\varepsilon$ for divisions/logs and $\varepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.
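The axis and broadcasting notes above can be illustrated with a row normalization that goes wrong silently on a square matrix. This is a small sketch with made-up data; the variable names are assumptions for the example.

```python
import numpy as np

X = np.arange(1, 10, dtype=np.float64).reshape(3, 3)   # square, so shape errors will not raise

row_sums_flat = X.sum(axis=1)                  # shape (3,)
row_sums_col = X.sum(axis=1, keepdims=True)    # shape (3, 1)

# Unintended: a (3,) vector broadcasts along the last axis,
# so column j is divided by the sum of row j.
wrong = X / row_sums_flat

# Intended: keepdims=True keeps the reduced axis, so each row is divided by its own sum.
right = X / row_sums_col

print("wrong row sums:", wrong.sum(axis=1))
print("right row sums:", right.sum(axis=1))
```

Both results have shape (3, 3), which is why a row-sum check catches what a shape check alone would miss.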

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n \times d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $\varepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Dense matmul and factorizations cost ~ $O(n^3)$ for $n \times n$ matrices; triangular solves and matrix-vector products cost $O(n^2)$. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ to masked scores before softmax (or zero out masked entries after softmax and renormalize), then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x >= 0 for all x), and residual norms within tolerance (~1e-12 for float64).
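The stability and masking advice above can be sketched as a row-wise masked softmax that subtracts the row maximum before exponentiating. The function name masked_softmax and the test data are illustrative assumptions; the sketch also assumes every row keeps at least one unmasked entry.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Row-wise softmax over the entries where mask is True.

    scores: float array of shape (n, m).
    mask:   boolean array broadcastable to (n, m); True means "keep".
    Assumes each row has at least one True entry.
    """
    masked = np.where(mask, scores, -np.inf)           # excluded entries become -inf
    row_max = masked.max(axis=-1, keepdims=True)       # subtract the max for stability
    exp = np.exp(masked - row_max)                     # exp(-inf) is exactly 0
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.array([[1000.0, 1001.0, 999.0],            # large values would overflow a naive exp
                   [0.5, -0.5, 2.0]])
mask = np.array([[True, True, False],
                 [True, True, True]])

P = masked_softmax(scores, mask)
print("row sums:", P.sum(axis=1))
print("all finite:", np.isfinite(P).all())
```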

The code demonstrates the complete forward-backward cycle of a linear layer: one forward pass, one backward pass, all shapes determined by the dimensions specified.

For the forward pass, Y = X @ W + b computes predictions via optimized GEMM kernels in $O(n d_1 d_2)$ time (or $O(n d^2)$ when $d_1 = d_2 = d$). Broadcasting adds the bias $b$ to all $n$ rows. The output Y has shape $(3, 2)$: 3 examples, 2 output dimensions. In practice, GPU libraries often fuse the matrix product and the bias addition into a single kernel to reduce memory traffic.
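One way to see what the bias broadcast does is to write it as an explicit product with a column of ones; the adjoint of that product is then a sum over rows. The sketch below uses the example's X, W, and b; the array G is an arbitrary upstream gradient made up for illustration.

```python
import numpy as np

X = np.array([[1., 0.], [0., 1.], [1., 1.]])
W = np.array([[2., 1.], [-1., 3.]])
b = np.array([0.5, -2.])
n = X.shape[0]

# Broadcasting b over rows equals adding the outer product ones(n, 1) @ b[None, :].
Y_broadcast = X @ W + b
Y_explicit = X @ W + np.ones((n, 1)) @ b[np.newaxis, :]

# Adjoint of "multiply by a ones column" is "multiply by a ones row", i.e. a sum over rows.
G = np.arange(6, dtype=np.float64).reshape(n, 2)   # arbitrary upstream gradient (assumed)
db_via_ones = (np.ones((1, n)) @ G).ravel()
db_via_sum = G.sum(axis=0)

print(np.allclose(Y_broadcast, Y_explicit), db_via_ones, db_via_sum)
```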

For the backward pass, three matrix operations propagate gradients:

Weight gradient dL_dW = X.T @ dL_dY computes the gradient used for parameter updates. Shape: $(2, 3) @ (3, 2) = (2, 2)$, matching $W$. This operation sums gradient contributions from all 3 examples: example $i$ contributes the outer product of row $i$ of $X$ (its input features, which is column $i$ of $X^\top$) with row $i$ of dL_dY (its output gradient). The result accumulates how much each weight should change based on the total loss gradient. In a training loop, this is used to update $W \leftarrow W - \alpha \cdot dL\_dW$ for learning rate $\alpha$.
Input gradient dL_dX = dL_dY @ W.T propagates error back to the input. Shape: $(3, 2) @ (2, 2) = (3, 2)$, matching $X$. If this is not the first layer, dL_dX is passed to the previous layer as its upstream gradient. Each row of dL_dY is multiplied by $W^\top$ to produce the gradient row for the corresponding input example. Note that dL_dX has the same shape as $X$ but contains gradients, not data.
Bias gradient dL_db = dL_dY.sum(axis=0) computes the bias gradient. Shape: the sum over axis 0 reduces $(3, 2)$ to $(2,)$, matching $b$. Since the bias is added identically to all examples, changing $b_j$ by $\delta$ changes output $j$ of all 3 examples by $\delta$, so gradients sum across the batch. This is the only step written as an explicit reduction; the weight gradient also sums over the batch, but that sum is hidden inside the matrix product.
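The three operations above are enough to train a linear layer. The following is a minimal sketch of a gradient-descent loop, assuming a mean-squared-error loss, synthetic noiseless targets, and a hand-picked learning rate; none of these choices come from the original example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d1, d2 = 64, 3, 2                    # illustrative sizes (assumed)
X = rng.normal(size=(n, d1))
W_true = rng.normal(size=(d1, d2))      # hypothetical ground-truth parameters
b_true = rng.normal(size=d2)
T = X @ W_true + b_true                 # noiseless targets

W = np.zeros((d1, d2))
b = np.zeros(d2)
alpha = 0.1                             # learning rate (assumed)
losses = []

for step in range(200):
    Y = X @ W + b                       # forward pass
    R = Y - T                           # residuals, shape (n, d2)
    losses.append(0.5 * np.mean(np.sum(R ** 2, axis=1)))

    dL_dY = R / n                       # mean reduction puts 1/n into the upstream gradient
    dL_dW = X.T @ dL_dY                 # (d1, n) @ (n, d2) -> (d1, d2)
    dL_db = dL_dY.sum(axis=0)           # (d2,)

    W -= alpha * dL_dW                  # W <- W - alpha * dL/dW
    b -= alpha * dL_db

print("first loss:", losses[0], "last loss:", losses[-1])
```

Because the loss is averaged over the batch, the same learning rate remains reasonable if n changes; with a summed loss the step size would need to shrink as the batch grows.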

Shapes as computation rules: The three operations follow a simple recipe: (1) Parameter gradients use the transpose of inputs; (2) Input gradients use the transpose of weights; (3) Bias gradients sum over the batch dimension. These rules come from the matrix calculus of the forward pass: the Jacobian of $Y = XW + b$ with respect to each variable determines the transpose patterns. Implementing backpropagation for convolutional layers, attention, normalization, etc. follows the same principle: derive the Jacobian, apply its transpose to the upstream gradient, and aggregate over any dimension along which a parameter is shared.

Numerical stability: The code uses standard matrix operations without explicit inverses, which is stable. In practice, implementations include gradient clipping (capping large gradients to prevent explosions), numerical precision handling (float32 vs float64), and careful initialization of $W$ to prevent vanishing/exploding gradients in deep networks. The fact that all three gradients come from dL_dY means a single pass of error computation supports the entire backward pass, and this efficiency is part of why backpropagation is so practical.

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

Forward pass as linear map: Computing $Y = XW + b$ applies a learned affine transformation to each input example. The code constructs a tiny batch ($n=3$, $d_1=2$) with two-dimensional output ($d_2=2$) and computes the predictions. This is the core operation of neural network inference: take an input, multiply by learned weights (one column of $W$ per output), add learned biases, and produce an output. For a single example $x_i$ (a row of $X$), the product $x_i W$ is a linear combination of the rows of $W$ with coefficients given by the entries of $x_i$; equivalently, output $j$ is the dot product of $x_i$ with column $j$ of $W$. Stacking multiple examples into a batch $X$ lets GPUs compute all predictions simultaneously via a single large matrix product, achieving massive parallelism.
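The per-example product can be read in two equivalent ways, which the short sketch below checks on the third example of the batch: the entries of $x_i$ weight the rows of $W$, and each output coordinate is a dot product of $x_i$ with one column of $W$. The variable names y_rows and y_cols are introduced here for illustration.

```python
import numpy as np

W = np.array([[2., 1.], [-1., 3.]])
x = np.array([1., 1.])                  # third row of X in the example

y = x @ W                               # row vector times matrix

# View 1: a weighted sum of the rows of W, with weights x[0], x[1].
y_rows = x[0] * W[0, :] + x[1] * W[1, :]

# View 2: output j is the dot product of x with column j of W.
y_cols = np.array([x @ W[:, 0], x @ W[:, 1]])

print(y, y_rows, y_cols)
```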

Backward pass via transpose chaining: Given upstream gradients $\frac{\partial L}{\partial Y}$ (how much the loss changes with respect to each output), the code computes three gradient flows: (1) dL_dW = X.T @ dL_dY for parameters, (2) dL_dX = dL_dY @ W.T for inputs, (3) dL_db = dL_dY.sum(axis=0) for bias. Each uses a specific pattern: parameter gradients accumulate products of inputs and output gradients (via $X^\top$), input gradients use the parameter transpose ($W^\top$), and bias gradients sum over the batch. These three operations collectively constitute the "adjoint" of the affine map, representing how sensitivities of the loss with respect to outputs translate into sensitivities with respect to parameters and inputs. The fact that all three can be computed from a single upstream gradient dL_dY is what makes reverse-mode automatic differentiation efficient: once you have the upstream gradient, each of the layer's gradients costs one matrix product or one reduction.

Shapes constrain gradients: The forward pass shapes ($3 \times 2$, $2 \times 2$, etc.) strongly constrain the backward pass. Parameter gradients must be $2 \times 2$ (matching $W$), and X.T @ dL_dY is the natural way to get a $2 \times 2$ result from $X$ (shape $3 \times 2$, transposed to $2 \times 3$) and dL_dY (shape $3 \times 2$). Input gradients must be $3 \times 2$ (matching $X$), and dL_dY @ W.T produces that shape. When $n$, $d_1$, and $d_2$ are all distinct, these are the only products of the two factors (each used once, possibly transposed) with the required shape; when $d_1 = d_2$, as in this example, other products such as dL_dY.T @ X or dL_dY @ W also have the right shape, so the index derivation or a numerical check is still needed to confirm the formula. With that caveat, shape reasoning lets you reconstruct the gradient formulas instead of memorizing them. The same discipline extends to other architectures: for convolutional layers, attention mechanisms, and normalization layers, checking that each gradient has the shape of the quantity it differentiates catches many errors early.
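A caveat on shape-only reasoning is worth checking numerically: in this example $d_1 = d_2 = 2$, so a product with the transposes misplaced can still have the right shape. The sketch below uses a non-uniform upstream gradient G (made up for illustration) so that the difference is visible.

```python
import numpy as np

X = np.array([[1., 0.], [0., 1.], [1., 1.]])
W = np.array([[2., 1.], [-1., 3.]])
G = np.array([[1., 2.], [0., 4.], [5., 6.]])   # non-uniform upstream gradient (assumed)

dW_correct = X.T @ G        # (2, 3) @ (3, 2) -> (2, 2)
dW_wrong = G.T @ X          # also (2, 2), but it is the transpose of the correct result

dX_correct = G @ W.T        # (3, 2) @ (2, 2) -> (3, 2)
dX_wrong = G @ W            # also (3, 2) because W is square here

print("dW shapes agree:", dW_correct.shape == dW_wrong.shape,
      "| values agree:", np.allclose(dW_correct, dW_wrong))
print("dX shapes agree:", dX_correct.shape == dX_wrong.shape,
      "| values agree:", np.allclose(dX_correct, dX_wrong))
```

With rectangular $W$ (for instance $d_1 = 3$, $d_2 = 2$) the wrong variants would fail on shape alone, which is one reason to test layers with unequal dimensions.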

Notes

Shape discipline: For forward pass $Y = XW + b$ with $X \in \mathbb{R}^{n \times d_1}$, $W \in \mathbb{R}^{d_1 \times d_2}$, $b \in \mathbb{R}^{d_2}$, output is $Y \in \mathbb{R}^{n \times d_2}$. For backward pass with upstream gradient $\frac{\partial L}{\partial Y} \in \mathbb{R}^{n \times d_2}$: parameter gradients are $X^\top \frac{\partial L}{\partial Y} \in \mathbb{R}^{d_1 \times d_2}$ (shape of $W$), input gradients are $\frac{\partial L}{\partial Y}W^\top \in \mathbb{R}^{n \times d_1}$ (shape of $X$), bias gradients are sum over batch $\in \mathbb{R}^{d_2}$ (shape of $b$). Shape mismatch immediately reveals errors in gradient computation.
Adjoint as transpose: The mathematical adjoint of a linear map $T: \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$ represented by matrix $W \in \mathbb{R}^{d_2 \times d_1}$ (acting on column vectors) is the transpose $W^\top: \mathbb{R}^{d_2} \to \mathbb{R}^{d_1}$. This is not just a convenient identity; it is the defining property of the adjoint: $\langle Tx, y \rangle = \langle x, T^\top y \rangle$ (duality). Backpropagation is the computational manifestation of this adjoint: upstream gradients are vectors in the output space; applying $W^\top$ maps them back to the input space.
Batch aggregation and averaging: Weight gradients are computed as $X^\top \frac{\partial L}{\partial Y}$ and bias gradients as the column sums of $\frac{\partial L}{\partial Y}$; both sum contributions from all $n$ examples. If the loss is a sum over examples, both gradients grow roughly in proportion to the batch size, so the learning rate must be adjusted with batch size to maintain consistent training dynamics. If the loss is a mean over examples (a common default), the factor $1/n$ is already contained in $\frac{\partial L}{\partial Y}$, and both gradients are batch averages whose scale does not grow with $n$.
Gradient flow and initialization: For networks with many layers, gradients can vanish (become extremely small) or explode (become extremely large) as they propagate backward through successive transposes. Xavier (Glorot) initialization uses weight variance $2/(d_1 + d_2)$ and He initialization uses $2/d_1$, so the standard deviation scales like $1/\sqrt{d_1 + d_2}$ or $1/\sqrt{d_1}$; the aim is to keep activations and gradients at a stable scale. Skip connections in ResNets add an identity path alongside the transpose chain, which helps preserve gradient magnitudes. Understanding gradient flow through transpose chaining is essential for designing networks that train smoothly.
Transpose patterns in specialized architectures: Convolutional layers use transposed convolutions for gradients (padding adjustments for shape compatibility), attention layers use transpose patterns for query-key-value projections and output computation, recurrent layers backpropagate through time using the same transpose chaining principles. The universality of transpose-based backpropagation means the same debugging principles (shape checking, gradient clipping, initialization) apply across all architectures.
Part 1: Forward Pass (Computing the Linear Map). The first part of the code computes $Y = XW + b$ for a batch of 3 examples ($n=3$), 2 input features ($d_1=2$), and 2 output dimensions ($d_2=2$). Each example is transformed by the same weight matrix: $y_i = x_i W + b$ where $x_i$ is row $i$ of $X$. The output $Y$ has shape $(3, 2)$: 3 examples, 2 predictions each. This is the complete forward pass of a neural network layer: input features are linearly combined (via columns of $W$) and shifted (via bias $b$). In inference, the network predicts by computing this forward pass for each input batch. In training, this forward pass is computed for all examples in the batch simultaneously, enabling GPU parallelism.
Part 2: Backward Pass (Computing Gradients via Transpose Chaining). The second part computes three gradient flows from upstream gradients $\frac{\partial L}{\partial Y}$ (all ones, because $L$ is the sum of the entries of $Y$). (1) dL_dW = X.T @ dL_dY computes parameter gradients with shape $(2, 2)$ matching $W$: this operation accumulates how each weight should change by summing products of inputs and output gradients over the batch. (2) dL_dX = dL_dY @ W.T computes input gradients with shape $(3, 2)$ matching $X$: this becomes the upstream gradient for the previous layer, showing how much the loss changes with respect to inputs. (3) dL_db = dL_dY.sum(axis=0) computes bias gradients with shape $(2,)$ matching $b$: bias gradients sum over the batch because the bias is shared across all examples. These three operations constitute the adjoint (transpose) of the forward pass, and they can all be computed from a single upstream gradient. This is the complete backward pass: parameter gradients are derived, input gradients flow to the previous layer, and all gradient operations are matrix products involving transposes or sums over the batch.
Part 3: Shape-Driven Derivation and Generalization. The transpose pattern can be recovered from dimensional constraints, so you follow shapes rather than memorize formulas. For any forward operation $Y = f(X, W)$, each gradient must match the shape of the quantity it differentiates: $\frac{\partial L}{\partial W}$ has the same shape as $W$, $\frac{\partial L}{\partial X}$ matches $X$. Given $Y = XW + b$ with shapes $(n, d_1) \times (d_1, d_2) \to (n, d_2)$, the product $X^\top \frac{\partial L}{\partial Y}$ has shape $(d_1, n) \times (n, d_2) = (d_1, d_2)$ matching $W$; similarly, $\frac{\partial L}{\partial Y} W^\top$ has shape $(n, d_2) \times (d_2, d_1) = (n, d_1)$ matching $X$. When $n$, $d_1$, and $d_2$ are all distinct, these are the only such products with the right shapes. The same principle carries over to other differentiable operations: for convolutions the input gradient is a transposed convolution (a correlation with flipped kernels), for attention the transposes route gradients back to queries, keys, and values, and for batch normalization the backward pass must also account for the batch statistics. In frameworks like PyTorch or JAX, automatic differentiation implements these transpose patterns automatically, but understanding the underlying shape logic is essential for debugging gradient flow, designing custom layers, implementing efficient backward passes, and reasoning about memory layouts in distributed training. The shapes of the arrays printed by this code, (3,2) for Y, (2,2) for dL_dW, (2,) for dL_db, (3,2) for dL_dX, are the first verification tool: if shapes don't match expectations, the backward pass is wrong before you even check numerical values.
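The adjoint identity $\langle Tx, y \rangle = \langle x, T^\top y \rangle$ from the notes above, and its batched counterpart with the Frobenius inner product, can be verified directly. The sketch below uses random data; the names Wr, lhs_batch, via_dX, and via_dW are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
d1, d2, n = 3, 2, 5                     # illustrative sizes (assumed)

# Column-vector convention: T(x) = W @ x with W of shape (d2, d1).
W = rng.normal(size=(d2, d1))
x = rng.normal(size=d1)
y = rng.normal(size=d2)
lhs = np.dot(W @ x, y)                  # <T x, y>
rhs = np.dot(x, W.T @ y)                # <x, T^T y>

# Batched row convention used by the layer: Y = X @ Wr with Wr of shape (d1, d2).
X = rng.normal(size=(n, d1))
Wr = rng.normal(size=(d1, d2))
G = rng.normal(size=(n, d2))            # stand-in for dL/dY
lhs_batch = np.sum((X @ Wr) * G)        # <X Wr, G>
via_dX = np.sum(X * (G @ Wr.T))         # <X, G Wr^T>  -> the input-gradient formula
via_dW = np.sum(Wr * (X.T @ G))         # <Wr, X^T G>  -> the weight-gradient formula

print(lhs, rhs)
print(lhs_batch, via_dX, via_dW)
```

The batched identities show that the two backward formulas are the adjoints of the forward map with respect to $X$ and with respect to $W$.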

History and Applications

Matrices as linear transformations were formalized by Cayley (1858), who recognized that square arrays of numbers could represent linear functions. This abstraction enabled the treatment of compositions of transformations as matrix multiplication, unifying geometry and algebra.

Matrix representations in neural networks: While backpropagation (Rumelhart, Hinton, Williams 1986) is often credited as a "deep learning breakthrough," it is fundamentally the chain rule applied to matrix operations. Each layer is a linear (or affine) map; composing layers composes matrices. The gradient of loss with respect to weights is computed by chaining transposes: $\nabla_W L = X^\top \nabla_Y L$.

Sparsity and efficiency: Not all learned transformations are dense matrices. Structured transformations (sparse, low-rank, banded) reduce memory and computation. Modern efficient deep learning exploits structure: efficient attention variants use sparse or low-rank approximations, CNNs use local weight sharing, RNNs reuse parameters across time. Understanding matrices as geometric transformations motivates these structural choices.

Connection to Broader Examples

This example establishes linear maps and their adjoints as the fundamental building blocks of neural networks. Most neural network layers are a linear transformation (matrix product) followed by an optional nonlinearity, and the backward pass through the linear part uses this transpose-chaining pattern.

Throughout the remaining 96 examples, adjoint-based gradient computation appears constantly:

Convolutional layers (related examples) perform local linear maps (via convolution kernels), and their backward passes compute gradients via transposed convolutions, that is, by applying the adjoint of the convolution operator.
Attention mechanisms (Example 2 and related examples) compute $O = \text{softmax}(QK^\top / \sqrt{d_k})V$; for fixed attention weights the output is a linear map of $V$, and gradients flow backward through these matrix products via transposes.
Normalization layers (future examples) compute statistics (mean, variance) across batches or spatial dimensions, and backward passes propagate gradients using the chain rule with transpose operations for the normalization transform.
Regularization and weight decay modify the loss $L$ to include penalty terms like $\|W\|^2$, which change the gradient $\frac{\partial L}{\partial W}$ by adding regularization terms.
Optimizers (SGD, Adam, etc.) use these gradients to update weights: they accumulate or transform gradients (via moving averages, second moments) before applying updates.
Loss functions (cross-entropy, MSE, etc.) produce upstream gradients $\frac{\partial L}{\partial Y}$ that feed into the backward pass; different losses have different gradient forms.
Batch normalization and layer normalization apply learned linear transforms after normalization, requiring careful gradient chaining through the normalization and linear operations.
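For the attention item above, the transpose pattern is easiest to see in the final product $O = AV$, where $A$ is the matrix of softmax weights. The sketch below covers only that linear step (gradients with respect to $V$ and $A$); the backward pass through the softmax to $Q$ and $K$ is omitted, and the sizes and random data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n_q, n_k, d_k, d_v = 3, 4, 2, 5         # illustrative sizes (assumed)
Q = rng.normal(size=(n_q, d_k))
K = rng.normal(size=(n_k, d_k))
V = rng.normal(size=(n_k, d_v))

# Forward: row-wise softmax over keys, then a weighted sum of value rows.
S = Q @ K.T / np.sqrt(d_k)              # (n_q, n_k)
S = S - S.max(axis=1, keepdims=True)    # subtract row max for stability
A = np.exp(S)
A /= A.sum(axis=1, keepdims=True)       # rows of A sum to 1
O = A @ V                               # (n_q, d_v)

# Backward through O = A @ V, same pattern as Y = X @ W.
dL_dO = rng.normal(size=O.shape)        # stand-in upstream gradient
dL_dV = A.T @ dL_dO                     # (n_k, n_q) @ (n_q, d_v) -> (n_k, d_v)
dL_dA = dL_dO @ V.T                     # (n_q, d_v) @ (d_v, n_k) -> (n_q, n_k)

print(dL_dV.shape == V.shape, dL_dA.shape == A.shape)
```

Here $A$ plays the role of $X$ and $V$ the role of $W$, so the two gradients are the weight-gradient and input-gradient formulas under different names.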

The unifying pattern: most neural network layers can be expressed as a composition of linear maps and element-wise nonlinearities. The forward pass chains these maps left-to-right, and the backward pass chains their adjoints right-to-left. Shapes strongly constrain the operations, so respecting shape discipline rules out most formula errors. This is the foundation of modern deep learning: it's "just" matrix multiplication and its transpose.
