Part 1: Forward Pass as Affine Map
- $Y = XW + b$ computes a linear transformation (each row $x$ of $X$ maps to $xW$) followed by a translation $y \mapsto y + b$.
- Matrix product X @ W: $(3, 2) \times (2, 2) \to (3, 2)$ applies $W$ row-wise to each example.
- Bias broadcasting: $b \in \mathbb{R}^2$ is added to all 3 rows, yielding $Y \in \mathbb{R}^{3 \times 2}$.
- This is the operation performed by dense/fully-connected layers before applying nonlinearities.
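
The forward pass described above can be sketched in NumPy as follows. The numeric values of `X`, `W`, and `b` are illustrative assumptions chosen only to match the stated shapes; `Y_rows` is a hypothetical name for the same result computed one example at a time.

```python
import numpy as np

# Illustrative values (assumed); only the shapes come from the text.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # (3, 2): three examples, two features
W = np.array([[1.0, 0.0],
              [0.0, 2.0]])      # (2, 2)
b = np.array([0.5, -0.5])       # (2,)

# Affine map: matrix product, then b is broadcast across the 3 rows.
Y = X @ W + b

# The same computation one example at a time: each row x maps to x @ W + b.
Y_rows = np.stack([x @ W + b for x in X])

print(Y.shape)                  # (3, 2)
print(np.allclose(Y, Y_rows))   # True
```

The row-by-row version exists only to show what "row-wise" means; the single matrix product is the form one would normally use.
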
Part 2: Backward Pass via Chain Rule
- Weight gradient: $\partial L / \partial W = X^\top (\partial L / \partial Y)$ accumulates contributions from all examples.
- Input gradient: $\partial L / \partial X = (\partial L / \partial Y) W^\top$ propagates gradients to earlier layers.
- Bias gradient: $\partial L / \partial b = \sum_i (\partial L / \partial Y)_i$ sums across the batch (all examples share $b$).
- Transpose: $W^\top$ "reverses" the forward linear map, which is fundamental to the chain rule for linear transformations.
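
A minimal sketch of the three gradient formulas above, assuming random inputs and a random stand-in for the upstream gradient `dL_dY` (in a real network it would come from the loss or the next layer). The variable names follow the `dL_dW`, `dL_dX`, `dL_db` convention used in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))        # batch of 3 examples
W = rng.normal(size=(2, 2))
b = rng.normal(size=2)

Y = X @ W + b
dL_dY = rng.normal(size=Y.shape)   # assumed upstream gradient, same shape as Y

dL_dW = X.T @ dL_dY                # (2, 3) @ (3, 2) -> (2, 2)
dL_dX = dL_dY @ W.T                # (3, 2) @ (2, 2) -> (3, 2)
dL_db = dL_dY.sum(axis=0)          # reduce over the batch axis -> (2,)

# Each gradient has the same shape as the quantity it differentiates.
assert dL_dW.shape == W.shape
assert dL_dX.shape == X.shape
assert dL_db.shape == b.shape
```
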
Why This Matters for ML
- Backpropagation is transpose chaining: Forward uses $W$, backward uses $W^\top$; this pattern repeats at every layer.
- Batch accumulation: The product $X^\top (\partial L / \partial Y)$ sums the per-example contributions, so a gradient averaged over a batch is a less noisy estimate than a single-example gradient.
- Efficient implementation: All gradients are computed via matrix ops, with no explicit loops, which leverages hardware acceleration.
- Shape verification catches bugs: Mismatched gradient shapes signal transpose errors or missing reductions.
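
To make the batch-accumulation point concrete, the sketch below compares an explicit per-example loop with the single matrix product. `G` is an assumed stand-in for $\partial L / \partial Y$, and `dW_loop` and `dW_vec` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 2))
G = rng.normal(size=(3, 2))        # assumed stand-in for dL/dY

# Explicit loop: each example contributes the outer product x_i g_i^T.
dW_loop = np.zeros((2, 2))
for x_i, g_i in zip(X, G):
    dW_loop += np.outer(x_i, g_i)

# Vectorized: one matrix product performs the same accumulation.
dW_vec = X.T @ G

print(np.allclose(dW_loop, dW_vec))   # True
```
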
ML Examples and Patterns
- Multi-layer perceptron: Chain multiple affine layers $Y_1 = X W_1 + b_1$, $Y_2 = Y_1 W_2 + b_2$; backprop uses transpose chaining.
- Gradient descent updates: $W \leftarrow W - \eta (\partial L / \partial W)$ directly uses the computed weight gradient.
- Convolutional layers: Structured linear maps whose input gradient is computed by a transposed convolution (the adjoint).
- Attention projections: $Q = XW_Q$, $K = XW_K$, $V = XW_V$ are linear layers (affine once biases are added); gradients use the same transpose pattern.
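
The two-layer chain and the gradient descent update can be sketched together. This is a sketch under stated assumptions: the squared-error loss, the targets `T`, the step size `eta`, and the helper `loss_and_grads` are all hypothetical, and no nonlinearity is placed between the layers so that the example matches the formulas above exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 2))
T = rng.normal(size=(3, 2))                  # hypothetical targets
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 2)), np.zeros(2)

def loss_and_grads(W1, b1, W2, b2):
    # Forward: two chained affine layers.
    Y1 = X @ W1 + b1
    Y2 = Y1 @ W2 + b2
    loss = 0.5 * np.sum((Y2 - T) ** 2)       # assumed loss
    # Backward: the same three formulas applied at each layer, last layer first.
    dY2 = Y2 - T
    dW2 = Y1.T @ dY2
    db2 = dY2.sum(axis=0)
    dY1 = dY2 @ W2.T                         # gradient passed to the earlier layer
    dW1 = X.T @ dY1
    db1 = dY1.sum(axis=0)
    return loss, (dW1, db1, dW2, db2)

eta = 1e-3                                   # assumed small step size
loss_before, (dW1, db1, dW2, db2) = loss_and_grads(W1, b1, W2, b2)

# Gradient descent update: parameter <- parameter - eta * gradient.
W1, b1 = W1 - eta * dW1, b1 - eta * db1
W2, b2 = W2 - eta * dW2, b2 - eta * db2

loss_after, _ = loss_and_grads(W1, b1, W2, b2)
```

With a sufficiently small step, `loss_after` should come out below `loss_before`; a larger `eta` can overshoot.
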
Connection to Linear Algebra Theory
- Affine maps are not linear: $f(x) = Wx + b$ fails $f(0) = 0$ whenever $b \neq 0$, but the derivative $\partial f / \partial x = W$ is linear.
- Adjoint interpretation: Gradient $\partial L / \partial X = (\partial L / \partial Y) W^\top$ uses the adjoint (transpose) of the forward map.
- Matrix chain rule: Derivatives of matrix-valued functions compose via transpose and matrix multiplication.
- Bias as rank-1 perturbation: Adding $b$ is $Y = XW + \mathbf{1} b^\top$; gradient $\partial L / \partial b = \mathbf{1}^\top (\partial L / \partial Y)$ projects onto the constant direction.
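
The adjoint and rank-1 statements can be checked numerically. The sketch below assumes the Frobenius inner product $\langle A, B \rangle = \sum_{ij} A_{ij} B_{ij}$ and a random stand-in `G` for $\partial L / \partial Y$; under that inner product, $\langle XW, G \rangle = \langle X, G W^\top \rangle$ is the defining property of the adjoint.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 2))
W = rng.normal(size=(2, 2))
b = rng.normal(size=2)
G = rng.normal(size=(3, 2))          # assumed stand-in for dL/dY

# Adjoint identity: <X W, G> equals <X, G W^T>.
lhs = np.sum((X @ W) * G)
rhs = np.sum(X * (G @ W.T))

# Bias as a rank-1 term: 1 b^T with 1 a column of ones.
ones = np.ones((3, 1))
Y_rank1 = X @ W + ones @ b[None, :]  # (3, 1) @ (1, 2) -> (3, 2)
Y_bcast = X @ W + b                  # NumPy broadcasting gives the same matrix

# Bias gradient as 1^T G, flattened to the shape of b.
db = (ones.T @ G).ravel()

print(np.isclose(lhs, rhs))                # True
print(np.allclose(Y_rank1, Y_bcast))       # True
print(np.allclose(db, G.sum(axis=0)))      # True
```
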
Numerical and Implementation Notes
- Transpose direction: Weight gradient uses $X^\top$, input gradient uses $W^\top$; swapping these breaks backprop.
- Bias axis: sum(axis=0) sums over the batch (rows); axis=1 would give the wrong shape.
- Gradient initialization: In practice, $\partial L / \partial Y$ comes from the loss function or the next layer, not all ones.
- No nonlinearity here: Adding ReLU/sigmoid requires element-wise derivative multiplication before gradient computation.
- Numerical gradient check: Verify analytical gradients via finite differences for debugging.
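
A finite-difference check along the lines of the last bullet might look like the sketch below. The loss $L = \tfrac{1}{2} \sum Y^2$ is an assumption chosen so that $\partial L / \partial Y = Y$; `numerical_grad` is a hypothetical helper that uses central differences with an assumed step `eps`.

```python
import numpy as np

def loss(X, W, b):
    Y = X @ W + b
    return 0.5 * np.sum(Y ** 2)          # assumed loss, so dL/dY = Y

def numerical_grad(f, A, eps=1e-6):
    """Central-difference gradient of the scalar f() with respect to array A.

    A is perturbed in place one entry at a time and restored afterwards.
    """
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + eps
        f_plus = f()
        A[idx] = old - eps
        f_minus = f()
        A[idx] = old                     # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 2))
W = rng.normal(size=(2, 2))
b = rng.normal(size=2)

# Analytical gradients from the formulas in these notes.
dL_dY = X @ W + b
dL_dW = X.T @ dL_dY
dL_dX = dL_dY @ W.T
dL_db = dL_dY.sum(axis=0)

# Numerical gradients for comparison.
f = lambda: loss(X, W, b)
num_dW = numerical_grad(f, W)
num_dX = numerical_grad(f, X)
num_db = numerical_grad(f, b)

assert np.allclose(dL_dW, num_dW, atol=1e-5)
assert np.allclose(dL_dX, num_dX, atol=1e-5)
assert np.allclose(dL_db, num_db, atol=1e-5)
```

Swapping `X.T @ dL_dY` for `dL_dY @ X.T`, or summing over `axis=1`, makes these assertions fail or raises a shape error, which is the kind of bug this check is meant to catch.
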
Numerical and Shape Notes
- $X \in \mathbb{R}^{3 \times 2}$, $W \in \mathbb{R}^{2 \times 2}$, $b \in \mathbb{R}^2$, $Y \in \mathbb{R}^{3 \times 2}$.
- $\partial L / \partial Y \in \mathbb{R}^{3 \times 2}$, $\partial L / \partial W \in \mathbb{R}^{2 \times 2}$, $\partial L / \partial X \in \mathbb{R}^{3 \times 2}$, $\partial L / \partial b \in \mathbb{R}^2$.
- Verify: dL_dW.shape == W.shape, dL_dX.shape == X.shape, dL_db.shape == b.shape.
ML Context: From Attention to Transformers
- Attention uses affine projections: $Q = XW_Q + b_Q$, $K = XW_K + b_K$, $V = XW_V + b_V$.
- Backprop through attention requires gradients w.r.t. $W_Q, W_K, W_V$ using the same transpose pattern.
- Multi-head attention runs multiple affine layers in parallel; gradients accumulate across heads.
- Transformer blocks chain attention + feedforward (affine layers); backprop uses repeated transpose chaining.
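
Because $Q$, $K$, and $V$ are all computed from the same $X$, the input gradient is the sum of the three projection contributions. The sketch below illustrates only the projection step, not the attention scores or softmax; `dQ`, `dK`, `dV` are assumed random stand-ins for the gradients arriving from the rest of the attention computation, and the fused `W_QKV` form is shown as one possible equivalent layout.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 2))                       # 3 tokens, model width 2
W_Q, W_K, W_V = (rng.normal(size=(2, 2)) for _ in range(3))
b_Q, b_K, b_V = (rng.normal(size=2) for _ in range(3))

# Forward: three affine projections of the same input.
Q = X @ W_Q + b_Q
K = X @ W_K + b_K
V = X @ W_V + b_V

# Assumed upstream gradients with the shapes of Q, K, V.
dQ, dK, dV = (rng.normal(size=(3, 2)) for _ in range(3))

# Parameter gradients: the same transpose pattern, once per projection.
dW_Q, db_Q = X.T @ dQ, dQ.sum(axis=0)
dW_K, db_K = X.T @ dK, dK.sum(axis=0)
dW_V, db_V = X.T @ dV, dV.sum(axis=0)

# Input gradient: contributions from all three branches add up.
dX = dQ @ W_Q.T + dK @ W_K.T + dV @ W_V.T

# Equivalent fused form: stack the projections column-wise.
W_QKV = np.concatenate([W_Q, W_K, W_V], axis=1)   # (2, 6)
dQKV = np.concatenate([dQ, dK, dV], axis=1)       # (3, 6)
dX_fused = dQKV @ W_QKV.T                         # (3, 2)

print(np.allclose(dX, dX_fused))                  # True
```
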
Pedagogical Significance
- Transposes reverse linear maps: Forward $Y = XW$ uses $W$; backward uses $W^\top$.
- Reductions accumulate shared parameters: Bias gradients sum across the batch because all examples share $b$.
- Shape verification catches bugs: Mismatched shapes signal transpose errors or missing reductions.
- Matrix products vectorize: No explicit loops, so the computation is efficient and hardware-friendly.
- Foundation for deep learning: This pattern underlies the dense, convolutional, and attention layers discussed above.