Part 1: Regression Predictions as Column Combinations - Each prediction $\hat{y}_i = \sum_j w_j X_{ij}$ is a weighted sum of feature values for example $i$, weighted by coefficients in $w$. - The weight vector $w$ is fixed after training; applying $Xw$ to new data uses the same learned coefficients. - Predictions lie in $ ext{span}(X_{:,1}, X_{:,2})$, a 2D subspace of $\mathbb{R}^3$ (assuming features are linearly independent). - If the true target $y$ lies outside this span, no choice of $w$ fits perfectlyâthe model has limited capacity.
Part 2: Attention as Weighted Value Combination - Each attention output $o = \sum_i a_i v_i$ is a weighted sum of value vectors, where weights $a$ are computed from input (via query-key similarity). - Weights $a$ are a probability distribution: $a_i \ge 0$, $\sum_i a_i = 1$ (softmax constraint). - Outputs lie in the convex hull of value vectors (a subset of $\text{span}(V)$, restricted by $a \ge 0$). - Different queries produce different attention patternsâdynamic routing enables context-dependent outputs.
Part 3: Span Constraints and Model Capacity - Both operations are fundamentally limited by span: regression can only predict vectors in $\text{span}(X)$; attention can only output vectors in $\text{span}(V)$ (further restricted to the convex hull). - This explains why feature engineering matters: adding new, linearly independent features expands $\text{span}(X)$ and increases model capacity. Similarly, increasing embedding dimensions or using multi-head attention expands the span of possible outputs. - Rank deficiency (when columns are linearly dependent) reduces effective dimensionality: if $\text{rank}(X) < d$, some features are redundant and don't expand the span. Use np.linalg.matrix_rank(X) to diagnose. - In deep networks, nonlinearities expand span beyond what linear combinations can reach: $ = \sigma(Wx)$ where $\sigma$ (ReLU, tanh) enables outputs outside $\text{span}(W)$. Transformers alternate between attention (linear routing within value span) and feed-forward layers (nonlinear expansion) to build hierarchical representations. - Debugging underfitting: If model loss plateaus, check if targets lie outside the span of basis vectors. For regression, try polynomial features or kernels; for attention, verify value vectors aren't nearly collinear (check singular values of $V$).
Connection Between the Two Blocks - Both are linear combinations of column/value vectors, expanded via matrix-vector products: $Xw$ and $aV$. - Difference: $w$ is fixed; $a$ is data-dependent (computed from query-key similarity). - Regression uses affine combinations ($w$ can be negative, unbounded); attention uses convex combinations ($a \ge 0$, sum to 1). - Model expressiveness: regression can produce any vector in $\text{span}(X)$; attention restricted to convex hull of $V$. - Training: regression adjusts $w$ to minimize loss; attention adjusts $Q, K, V$ projections to make good routing decisions.
ML Examples - Linear regression: $\hat{y} = Xeta$ predicts continuous targets as weighted sums of feature columns. - Neural network hidden layers: $h = \sigma(Wx)$ computes activations as weighted sums of input features (then apply nonlinearity). - Self-attention in transformers: Each tokenâs output is a weighted sum of all token values, where weights depend on query-key similarities. - Graph neural networks: Aggregation is a weighted sum of neighbor features, where weights may be learned (GCN, GAT) or fixed (GraphSAGE).
Pedagogical Notes - Why two blocks? Regression (fixed $w$) teaches that predictions lie in a fixed subspace; attention (dynamic $a$) shows context-dependent routing. - Minimal example: 2 features in regression, 3 keys in attentionâsmall enough to verify explicitly, large enough to be non-trivial. - Verification by expansion: Expanding @ into element-wise sums reveals the geometric structure and confirms equivalence.
ML Examples and Patterns - Generalization via fixed weights: Linear models train once (find $w$), then apply to infinitely many test examples. The learned $w$ must capture patterns from training data that generalize to unseen examples. - Adaptive routing via attention: Transformers recompute $a$ for each input, enabling different attention patterns. A single set of learned projections ($Q, K, V$ weights) generates context-specific linear combinations. - Gradient flow: For regression, $ abla_w = X^ op(Xw - y)$ depends on features and residuals. For attention, gradients flow through softmax (smooth but can saturate) and then to $Q, K, V$ projections. - Sparse attention: Setting $a_i = 0$ (via masking) prunes terms from the sum, reducing computation from $O(n^2)$ to $O(n)$ or $O(n \log n)$.
Connection to Linear Algebra Theory - Span as column space: The set of all possible predictions $\{Xw : w \in \mathbb{R}^d\} = ext{col}(X)$ (column space of $X$). - Rank and expressiveness: $ ext{rank}(X) = d$ means features are linearly independent and $ ext{span}(X)$ is $d$-dimensional; if $ ext{rank}(X) < d$, some features are redundant. - Projection theorem: The best regression fit (least-squares solution) is $\hat{w} = rg\min_w \|y - Xw\|_2$, which projects $y$ onto $ ext{col}(X)$. - Convex combinations and convex sets: Attention with $a \ge 0$, $\sum_i a_i = 1$ produces outputs in the convex hull of valuesâa convex set invariant under convex combinations. - Basis and change of basis: Feature columns form a basis if theyâre linearly independent; changing the basis changes $w$ but leaves $\hat{y} = Xw$ unchanged.
Numerical and Implementation Notes - Feature scaling: Large features can dominate weights $w$; standardization $(x - \mu)/\sigma$ improves numerical stability and prevents weight explosion. - Softmax numerical stability: Use log-sum-exp trick: $\log ext{softmax}(x) = x - \log \sum_j \exp(x_j)$ to prevent overflow. - Avoid explicit matrix inverse: Instead of $w = (X^ op X)^{-1} X^ op y$, use np.linalg.lstsq(X, y) which employs numerically stable QR or SVD. - Check rank: If features are collinear, $ ext{rank}(X) < d$ and $X^ op X$ is singular; regularization ($w$-decay, ridge regression) adds $\lambda I$ to improve conditioning.
Numerical and Shape Notes - Feature matrix: $X \in \mathbb{R}^{3 imes 2}$ means 3 examples, 2 features; columns are basis vectors, rows are example vectors. - Weight vector: $w \in \mathbb{R}^2$ has one entry per feature. - Prediction vector: $\hat{y} = Xw \in \mathbb{R}^3$ has one entry per example. - Attention shapes: $Q \in \mathbb{R}^{n_q imes d_k}$, $K \in \mathbb{R}^{n_k imes d_k}$, $V \in \mathbb{R}^{n_k imes d_v}$ â $ ext{scores} ^{n_q imes n_k}$ â $a \in \mathbb{R}^{n_q imes n_k}$ â $o \in \mathbb{R}^{n_q imes d_v}$. - Flatten for 1D: softmax(...).reshape(-1) converts shape (1, 3) to (3,) for broadcasting in a @ V.
ML Context: From Attention to Transformers - Self-attention: Input sequence $X \in \mathbb{R}^{n imes d}$ â project to $Q = XW_Q, K = XW_K, V = XW_V$ â attention â output. - Multi-head attention: Run $h$ heads in parallel, each computing a different linear combination; concatenate and project again. - Cross-attention: Decoder queries attend to encoder keys/values, enabling source-target alignment in seq2seq tasks. - Causal masking: Set attention weights to zero for future positions, enforcing autoregressive constraints (GPT, causal models). - Positional encoding: Add learnable or fixed positional embeddings to input to encode token positions (attention is permutation-equivariant without position information). - Layer stacking: Stack many attention and feedforward layers, each computing linear combinations and nonlinear transformations, building hierarchical representations.
Pedagogical Significance - Core bridge concept: Span connects abstract linear algebra to concrete ML operations. Understanding that $Xw$ and $aV$ are linear combinations reveals: - Model limitations: Canât fit targets outside $ ext{span}(X)$; thatâs why feature engineering and architecture design matter. - Gradient interpretation: Gradients $ abla_w $ update coefficients of linear combinations; larger gradients push weights toward better combinations. - Expressiveness growth: Adding more features (larger $d$) or deeper networks (more layers) increases the dimensionality of reachable output spaces. - Why this is foundational: Linear combinations are the universal pattern in ML. Every matrix multiply is a weighted sum; understanding this pattern unlocks intuition for regression, neural networks, attention, convolution, and graph neural networks. - From theory to practice: This example is the bridge from Chapter 2 (Span/Linear Combinations as theory) to Chapter 16 (Matrix Products as implementation), showing that theory directly applies to state-of-the-art architectures.
Comments