Example number
50
Example slug
example_50_span_in_ml_xw_and_attention_as_weighted_sums
Background

Historical context: The notion of “span” emerged from 19th-century work on systems of linear equations and vector spaces (Grassmann’s Ausdehnungslehre, 1844). Early statisticians (Gauss, Legendre) implicitly used span when solving least-squares problems: predictions $\hat{y} = Xw$ lie in the column space of $X$ (the span of feature vectors). The formal definition—span as the set of all linear combinations—was codified in early 20th-century abstract algebra (Weyl, van der Waerden). In ML, span determines model capacity: linear models can only fit targets lying in $\text{span}(X)$, and attention can only produce outputs in $\text{span}(V)$. Modern architectures (transformers, graph neural networks) are carefully designed to ensure their spanning sets capture the right inductive biases.

Mathematical characterization: Given vectors $v_1, \ldots, v_k \in \mathbb{R}^d$, their span is: \[ \text{span}(v_1, \ldots, v_k) = \left\{ \sum_{i=1}^k c_i v_i : c_i \in \mathbb{R} \right\}, \] the set of all linear combinations. For regression, $\hat{y} = Xw = \sum_{j=1}^d w_j X_{:,j}$ shows predictions lie in $\text{span}(X_{:,1}, \ldots, X_{:,d})$ (column space). For attention, $o = \sum_{i=1}^n a_i v_i$ with $a_i \ge 0$, $\sum_i a_i = 1$ (convex combination) shows outputs lie in the convex hull of value vectors (a subset of their span).

Prevalence in ML: Every feedforward layer in a neural network computes $h = \sigma(Wx + b)$ where $Wx$ is a linear combination of input features weighted by rows of $W$. Attention mechanisms dynamically compute linear combinations via data-dependent weights. Graph neural networks aggregate neighbor features via weighted sums. Understanding these as span operations clarifies expressiveness limits and guides architectural design.

Purpose

Show that matrix-vector products are linear combinations, revealing the geometric meaning behind two ubiquitous ML operations: regression predictions ($\hat{y} = Xw$ as weighted sums of feature columns) and attention outputs ($o = aV$ as weighted sums of value vectors). Demonstrate that both operations produce vectors lying in the span of their basis (feature columns for regression, value vectors for attention), making the abstract notion of “span” concrete and operational. Build intuition for why model expressiveness depends on the span of basis vectors, why coefficients (weights $w$ or attention scores $a$) determine which linear combinations we compute, and how this pattern—linear combination = matrix-vector product—underlies virtually all ML architectures from linear models to transformers.

Problem

Show that (a) linear predictions lie in the span of feature columns and (b) attention outputs lie in the span of value vectors. Compute both explicitly and verify the span claim by expressing outputs as linear combinations.

Solution (Math)

For linear predictions $\hat y=Xw$,

\[ Xw=\sum_{j=1}^d w_j X_{:j}, \]

so $\hat y\in \mathrm{span}\{X_{:1},\dots,X_{:d}\}$.

For attention with one query, output is $o=a^T V=\sum_i a_i v_i$, a linear combination of values, hence $o\in\mathrm{span}\{v_i\}$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import softmax

X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
w = np.array([2., -1.])
yhat = X @ w
print("yhat:", yhat)
print("check combo:", 2*X[:,0] - 1*X[:,1])

Q = np.array([[1., 0.]])
K = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
V = np.array([[1., 0.],
              [0., 2.],
              [1., 1.]])

scores = (Q @ K.T) / np.sqrt(2)
a = softmax(scores, axis=1).reshape(-1)
o = a @ V
print("a:", a, "sum:", a.sum())
print("o:", o)
print("o combo:", a[0]*V[0] + a[1]*V[1] + a[2]*V[2])
Code Introduction

This example has two short blocks that both demonstrate “matrix–vector product = linear combination.” The first computes regression predictions yhat = X @ w and verifies they equal an explicit weighted sum of feature columns. The second computes single-query attention, showing o = a @ V equals an explicit weighted sum of value vectors.

  • Regression block: Define X and w, compute yhat = X @ w, then check yhat == 2*X[:,0] - 1*X[:,1] for the chosen weights.
  • Attention block: Define Q, K, V, compute scores = Q @ K.T / sqrt(d_k), a = softmax(scores, axis=1), then o = a @ V; verify by expanding a[0]*V[0] + a[1]*V[1] + a[2]*V[2].
  • Shapes: X \in R^{n\times d}, w \in R^d → yhat \in R^n; Q \in R^{1\times d_k}, K \in R^{m\times d_k}, V \in R^{m\times d_v} → o \in R^{d_v}.
  • Takeaway: Both outputs live in the span of their columns: yhat \in span(X_{:,j}) and o \in span(V_i). Attention additionally yields a convex combination (nonnegative weights summing to 1).
Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Define design matrix and weight vector: Create $X \in \mathbb{R}^{3 imes 2}$ (3 examples, 2 features) and $w \in \mathbb{R}^2$ (one weight per feature).
  2. Compute regression predictions: yhat = X @ w computes $\hat{y} = Xw \in \mathbb{R}^3$ via matrix-vector product.
  3. Extract feature columns: X[:,0] and X[:,1] are the two feature column vectors (shape (3,)).
  4. Verify linear combination: Explicitly compute 2*X[:,0] - 1*X[:,1] and confirm it matches yhat.
  5. Define attention inputs: Create $Q \in \mathbb{R}^{1 imes 2}$ (one query), $K \in \mathbb{R}^{3 imes 2}$ (three keys), $V \in \mathbb{R}^{3 imes 2}$ (three values).
  6. Compute scaled scores: scores = Q @ K.T / np.sqrt(d_k) yields $ ext{scores} ^{1 imes 3}$ (query-key similarities, scaled).
  7. Apply softmax: a = softmax(scores, axis=1).reshape(-1) normalizes to attention weights $a \in \mathbb{R}^3$ (probability distribution).
  8. Compute attention output: o = a @ V aggregates values via weighted sum; verify by explicit enumeration a[0]*V[0] + a[1]*V[1] + a[2]*V[2].
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Matrix-vector product is shorthand for linear combination: $Xw = \sum_j w_j X_{:,j}$ expands multiplication into explicit weighted sums.
  • Regression predictions lie in column space: $\hat{y} = Xw \in ext{span}(X_{:,1}, \ldots, X_{:,d})$ (span of feature columns).
  • Attention outputs lie in value span: $o = aV = \sum_i a_i v_i \in ext{span}(v_1, \ldots, v_n)$ (span of value vectors).
  • Coefficients determine expressiveness: Weights $w$ (regression) and scores $a$ (attention) determine which linear combination we compute.
  • Fixed vs. dynamic coefficients: Regression uses learned fixed $w$; attention computes $a$ from input (data-dependent routing).
  • Convex combinations are bounded: Attention outputs are convex combinations ($a_i \ge 0$, $\sum_i a_i = 1$), lying in the convex hull of values.
  • Model capacity is limited by span: Linear models cannot fit targets outside the column space of $X$; architecture design ensures relevant spanning sets.
  • Verification by explicit enumeration: Expanding $@$ into element-wise sums confirms mathematical equivalence and builds geometric intuition.
Notes

Part 1: Regression Predictions as Column Combinations - Each prediction $\hat{y}_i = \sum_j w_j X_{ij}$ is a weighted sum of feature values for example $i$, weighted by coefficients in $w$. - The weight vector $w$ is fixed after training; applying $Xw$ to new data uses the same learned coefficients. - Predictions lie in $ ext{span}(X_{:,1}, X_{:,2})$, a 2D subspace of $\mathbb{R}^3$ (assuming features are linearly independent). - If the true target $y$ lies outside this span, no choice of $w$ fits perfectly—the model has limited capacity.

Part 2: Attention as Weighted Value Combination - Each attention output $o = \sum_i a_i v_i$ is a weighted sum of value vectors, where weights $a$ are computed from input (via query-key similarity). - Weights $a$ are a probability distribution: $a_i \ge 0$, $\sum_i a_i = 1$ (softmax constraint). - Outputs lie in the convex hull of value vectors (a subset of $\text{span}(V)$, restricted by $a \ge 0$). - Different queries produce different attention patterns—dynamic routing enables context-dependent outputs.

Part 3: Span Constraints and Model Capacity - Both operations are fundamentally limited by span: regression can only predict vectors in $\text{span}(X)$; attention can only output vectors in $\text{span}(V)$ (further restricted to the convex hull). - This explains why feature engineering matters: adding new, linearly independent features expands $\text{span}(X)$ and increases model capacity. Similarly, increasing embedding dimensions or using multi-head attention expands the span of possible outputs. - Rank deficiency (when columns are linearly dependent) reduces effective dimensionality: if $\text{rank}(X) < d$, some features are redundant and don't expand the span. Use np.linalg.matrix_rank(X) to diagnose. - In deep networks, nonlinearities expand span beyond what linear combinations can reach: $ = \sigma(Wx)$ where $\sigma$ (ReLU, tanh) enables outputs outside $\text{span}(W)$. Transformers alternate between attention (linear routing within value span) and feed-forward layers (nonlinear expansion) to build hierarchical representations. - Debugging underfitting: If model loss plateaus, check if targets lie outside the span of basis vectors. For regression, try polynomial features or kernels; for attention, verify value vectors aren't nearly collinear (check singular values of $V$).

Connection Between the Two Blocks - Both are linear combinations of column/value vectors, expanded via matrix-vector products: $Xw$ and $aV$. - Difference: $w$ is fixed; $a$ is data-dependent (computed from query-key similarity). - Regression uses affine combinations ($w$ can be negative, unbounded); attention uses convex combinations ($a \ge 0$, sum to 1). - Model expressiveness: regression can produce any vector in $\text{span}(X)$; attention restricted to convex hull of $V$. - Training: regression adjusts $w$ to minimize loss; attention adjusts $Q, K, V$ projections to make good routing decisions.

ML Examples - Linear regression: $\hat{y} = Xeta$ predicts continuous targets as weighted sums of feature columns. - Neural network hidden layers: $h = \sigma(Wx)$ computes activations as weighted sums of input features (then apply nonlinearity). - Self-attention in transformers: Each token’s output is a weighted sum of all token values, where weights depend on query-key similarities. - Graph neural networks: Aggregation is a weighted sum of neighbor features, where weights may be learned (GCN, GAT) or fixed (GraphSAGE).

Pedagogical Notes - Why two blocks? Regression (fixed $w$) teaches that predictions lie in a fixed subspace; attention (dynamic $a$) shows context-dependent routing. - Minimal example: 2 features in regression, 3 keys in attention—small enough to verify explicitly, large enough to be non-trivial. - Verification by expansion: Expanding @ into element-wise sums reveals the geometric structure and confirms equivalence.

ML Examples and Patterns - Generalization via fixed weights: Linear models train once (find $w$), then apply to infinitely many test examples. The learned $w$ must capture patterns from training data that generalize to unseen examples. - Adaptive routing via attention: Transformers recompute $a$ for each input, enabling different attention patterns. A single set of learned projections ($Q, K, V$ weights) generates context-specific linear combinations. - Gradient flow: For regression, $ abla_w = X^ op(Xw - y)$ depends on features and residuals. For attention, gradients flow through softmax (smooth but can saturate) and then to $Q, K, V$ projections. - Sparse attention: Setting $a_i = 0$ (via masking) prunes terms from the sum, reducing computation from $O(n^2)$ to $O(n)$ or $O(n \log n)$.

Connection to Linear Algebra Theory - Span as column space: The set of all possible predictions $\{Xw : w \in \mathbb{R}^d\} = ext{col}(X)$ (column space of $X$). - Rank and expressiveness: $ ext{rank}(X) = d$ means features are linearly independent and $ ext{span}(X)$ is $d$-dimensional; if $ ext{rank}(X) < d$, some features are redundant. - Projection theorem: The best regression fit (least-squares solution) is $\hat{w} = rg\min_w \|y - Xw\|_2$, which projects $y$ onto $ ext{col}(X)$. - Convex combinations and convex sets: Attention with $a \ge 0$, $\sum_i a_i = 1$ produces outputs in the convex hull of values—a convex set invariant under convex combinations. - Basis and change of basis: Feature columns form a basis if they’re linearly independent; changing the basis changes $w$ but leaves $\hat{y} = Xw$ unchanged.

Numerical and Implementation Notes - Feature scaling: Large features can dominate weights $w$; standardization $(x - \mu)/\sigma$ improves numerical stability and prevents weight explosion. - Softmax numerical stability: Use log-sum-exp trick: $\log ext{softmax}(x) = x - \log \sum_j \exp(x_j)$ to prevent overflow. - Avoid explicit matrix inverse: Instead of $w = (X^ op X)^{-1} X^ op y$, use np.linalg.lstsq(X, y) which employs numerically stable QR or SVD. - Check rank: If features are collinear, $ ext{rank}(X) < d$ and $X^ op X$ is singular; regularization ($w$-decay, ridge regression) adds $\lambda I$ to improve conditioning.

Numerical and Shape Notes - Feature matrix: $X \in \mathbb{R}^{3 imes 2}$ means 3 examples, 2 features; columns are basis vectors, rows are example vectors. - Weight vector: $w \in \mathbb{R}^2$ has one entry per feature. - Prediction vector: $\hat{y} = Xw \in \mathbb{R}^3$ has one entry per example. - Attention shapes: $Q \in \mathbb{R}^{n_q imes d_k}$, $K \in \mathbb{R}^{n_k imes d_k}$, $V \in \mathbb{R}^{n_k imes d_v}$ → $ ext{scores} ^{n_q imes n_k}$ → $a \in \mathbb{R}^{n_q imes n_k}$ → $o \in \mathbb{R}^{n_q imes d_v}$. - Flatten for 1D: softmax(...).reshape(-1) converts shape (1, 3) to (3,) for broadcasting in a @ V.

ML Context: From Attention to Transformers - Self-attention: Input sequence $X \in \mathbb{R}^{n imes d}$ → project to $Q = XW_Q, K = XW_K, V = XW_V$ → attention → output. - Multi-head attention: Run $h$ heads in parallel, each computing a different linear combination; concatenate and project again. - Cross-attention: Decoder queries attend to encoder keys/values, enabling source-target alignment in seq2seq tasks. - Causal masking: Set attention weights to zero for future positions, enforcing autoregressive constraints (GPT, causal models). - Positional encoding: Add learnable or fixed positional embeddings to input to encode token positions (attention is permutation-equivariant without position information). - Layer stacking: Stack many attention and feedforward layers, each computing linear combinations and nonlinear transformations, building hierarchical representations.

Pedagogical Significance - Core bridge concept: Span connects abstract linear algebra to concrete ML operations. Understanding that $Xw$ and $aV$ are linear combinations reveals: - Model limitations: Can’t fit targets outside $ ext{span}(X)$; that’s why feature engineering and architecture design matter. - Gradient interpretation: Gradients $ abla_w $ update coefficients of linear combinations; larger gradients push weights toward better combinations. - Expressiveness growth: Adding more features (larger $d$) or deeper networks (more layers) increases the dimensionality of reachable output spaces. - Why this is foundational: Linear combinations are the universal pattern in ML. Every matrix multiply is a weighted sum; understanding this pattern unlocks intuition for regression, neural networks, attention, convolution, and graph neural networks. - From theory to practice: This example is the bridge from Chapter 2 (Span/Linear Combinations as theory) to Chapter 16 (Matrix Products as implementation), showing that theory directly applies to state-of-the-art architectures.

History and Applications

History: The concept of span traces back to Grassmann (1844) and the formalization of vector spaces. Earlier, Gauss and Legendre implicitly used column spaces in least squares. The recognition that matrix–vector products are linear combinations sits at the heart of linear algebra pedagogy and numerical methods.

Applications: In ML, $Xw$ expresses predictions as combinations of feature columns; capacity is limited by $\mathrm{span}(X)$. Attention computes $o = aV$, a convex combination of values, so outputs lie in the convex hull of $V$. Neural layers ($Wx$), graph message passing, and embedding lookups are all structured linear combinations—understanding span clarifies expressiveness, identifiability, and the effect of regularization.

Connection to Broader Examples
  • Chapter 1 (Vector spaces): Span is a vector space; closure under addition and scalar multiplication ensures linear combinations stay in the span.
  • Chapter 2 (Span & linear combinations): Core chapter; this is the canonical application of span to ML.
  • Chapter 3 (Basis/Dimension): Feature columns may form a basis of their span; if $d < n$ and features are independent, they span an $d$-dimensional subspace.
  • Chapter 4 (Linear maps): Matrix $X$ represents a linear map $\mathbb{R}^d o \mathbb{R}^n$; predictions are outputs of this map.
  • Chapter 5 (Inner products): Attention scores are inner products $\langle q, k angle$ between queries and keys.
  • Chapter 6 (Orthogonality & projections): Regression finds $\hat{w}$ that minimizes distance from $y$ to $ ext{span}(X)$—the orthogonal projection of $y$ onto the column space.
  • Chapter 12 (Least-squares): Regression with squared loss $\min_w \|y - Xw\|^2$ projects $y$ onto $ ext{span}(X)$ and solves for coefficients.
  • Chapter 14 (Conditioning): Ill-conditioned $X^ op X$ makes weights $w$ sensitive to perturbations; conditioning relates to linear independence of feature columns.
  • Chapter 16 (Matrix products): Core chapter; all matrix products reduce to linear combinations.
  • Transformers: Attention is the flagship multi-head linear combination mechanism in modern ML.

Comments