Example number
2
Example slug
example_2_predictions_and_attention_are_in_spans
Background

These concepts are foundational in modern linear algebra and appear throughout statistical learning theory and deep learning practice.

Linear combinations as predictions form the basis of all linear models. When you compute $\hat{y} = Xw$, the prediction vector lives in $\text{span}\{X_{:,1}, \ldots, X_{:,d}\}$—the set of all linear combinations of feature columns. This immediately reveals a fundamental limitation: if the true relationship requires outputs outside this span (e.g., nonlinear decision boundaries), no choice of $w$ can represent it. Feature engineering (adding polynomial features, interactions, embeddings) works by expanding the column span to include richer representations. Kernel methods implicitly map to infinite-dimensional spans. Neural networks learn nonlinear feature maps that adaptively expand the span layer by layer.

Attention mechanisms as weighted aggregation implement the core operation in transformers: $o = \sum_i a_i v_i$ where attention weights $a_i$ come from $\text{softmax}(QK^\top/\sqrt{d_k})$. This is a convex combination (weights non-negative, sum to 1), meaning the output $o$ is a weighted average of value vectors $\{v_i\}$ and must lie in their convex hull. The query-key similarity determines how to mix the values, but the output is geometrically constrained to $\text{span}\{v_i\}$. Multi-head attention runs this operation in parallel across different learned subspaces, expanding the effective span. Positional encodings add new basis vectors to distinguish token positions. This geometric view explains why transformers need deep stacks—each layer expands the representational span available to subsequent layers.

Purpose

Give you a crisp, ML-first mental model you can reuse across models and datasets.

What you’ll learn:

  • Span as the output space: Understand that linear model predictions $\hat{y} = Xw$ always live in the column span of $X$, which determines the expressive power of the model. If your features don’t span the true decision boundary, no weight vector can achieve it.
  • Attention as convex combination: See that attention mechanisms compute weighted averages of value vectors—a convex combination where weights sum to 1. This geometric view explains why attention can’t extrapolate beyond the span of its value vectors and why positional encodings expand the representational span.
  • Matrix-vector multiplication as aggregation: Train yourself to see $Xw$ and $a^\top V$ as selecting coefficients for basis vectors (columns of $X$ or rows of $V$). This perspective unifies linear regression, fully connected layers, and attention as “coefficient selection + linear combination.”
  • Computation pattern: Recognize linear combination as one of five core patterns (fit, project, decompose, solve, measure) that recur throughout ML. The span of your basis vectors determines what outputs are reachable.

This example establishes the algebraic foundation for understanding why feature engineering matters (expanding the span), why transformers scale (learning rich value representations), and why linear models have inherent limitations (constrained by column span).

Problem
  1. Verify ŷ = Xw is a linear combination of the columns of X.
  2. Verify attention output is a linear combination of value vectors.
Solution (Math)
  1. $Xw=\sum_j w_j X_{:j}$, so $\hat y$ lies in the column span of $X$.

  2. For a single query, attention produces $o=\sum_i a_i v_i$, so $o$ lies in $\mathrm{span}\{v_i\}$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import softmax

X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
w = np.array([2., -1.])
yhat = X@w
print("yhat:", yhat)
print("as combo:", 2*X[:,0] - 1*X[:,1])

Q = np.array([[1., 0.]])
K = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
V = np.array([[1., 0.],
              [0., 2.],
              [1., 1.]])

scores = (Q@K.T)/np.sqrt(2)
a = softmax(scores, axis=1).reshape(-1)
o = a@V
print("weights:", a, "sum:", a.sum())
print("o:", o)
print("combo:", a[0]*V[0] + a[1]*V[1] + a[2]*V[2])
Code Introduction

This code demonstrates two fundamental ML operations—linear regression prediction and attention mechanism output—and reveals that both are secretly computing the same mathematical object: a linear combination of basis vectors. This connection is profound because it shows that seemingly different ML algorithms (supervised learning, self-attention in transformers) share a common algebraic structure rooted in span and linear combinations.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

The code demonstrates two equivalent views of the same mathematical operation, reinforcing that matrix operations are syntactic sugar for explicit linear combinations.

For linear regression, X @ w computes predictions via optimized BLAS routines in $O(nd)$ time. The manual computation 2*X[:,0] - 1*X[:,1] performs the same operation by explicitly scaling and adding column vectors. NumPy’s broadcasting handles the scalar multiplication element-wise, producing [2., 6., 10.] from 2*X[:,0] and [2., 4., 6.] from 1*X[:,1], then subtracting to get [0., 2., 4.]. This dual representation is pedagogical—in production, always use the matrix form (X @ w) for efficiency and numerical stability. The equivalence confirms that predictions are geometrically constrained to the column span of $X$.

For attention, the code implements scaled dot-product attention step-by-step. First, Q @ K.T computes query-key similarities as a $1 \times 3$ vector of dot products. Scaling by $1/\sqrt{2}$ (where $d_k = 2$) prevents the magnitudes from growing large, which would push softmax into saturation regions with near-zero gradients. The softmax function converts raw scores to normalized weights that sum to 1.0—this is what makes attention a convex combination rather than just any linear combination. Finally, a @ V computes the weighted sum of value vectors. The verification a[0]*V[0] + a[1]*V[1] + a[2]*V[2] explicitly shows the output is a point in the convex hull of $\{V_0, V_1, V_2\}$.

Memory consideration: For batch attention with $n$ queries and $m$ keys, the score matrix is $n \times m$. For self-attention ($n = m = $ sequence length), this is $O(n^2)$ memory, which is why transformers struggle with long sequences. Efficient attention variants (linear attention, sparse attention) exploit low-rank structure or sparsity patterns to reduce this cost.

What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

Linear regression predictions live in column span: When you compute $\hat{y} = Xw$, the result is $w_1 X_{:,1} + w_2 X_{:,2}$, a linear combination of feature columns. This means predictions are constrained to $\text{span}\{X_{:,j}\}$. If your features are linearly dependent (collinear), the span has lower dimension than the number of features, causing rank deficiency. If the true targets require outputs outside this span, the model is fundamentally limited—no amount of training will fix it. This is why feature selection matters: removing redundant features doesn’t hurt expressiveness (they’re already in the span), but adding informative features can expand the span.

Attention output as value span: The attention mechanism computes $o = a_1 v_1 + a_2 v_2 + a_3 v_3$ where $a_i \geq 0$ and $\sum a_i = 1$. This is a convex combination—a weighted average—so $o$ lies in the convex hull of $\{v_1, v_2, v_3\}$ and cannot extrapolate beyond the span of value vectors. The query-key similarity (via softmax) determines the mixing weights, implementing a learned “routing” mechanism. Unlike regression where weights can be negative or unbounded, attention weights are always non-negative and normalized, making attention outputs geometrically interpretable as points “inside” the value set.

Computation pattern: This example demonstrates linear combination—selecting coefficients for basis vectors and summing. It appears in fully connected layers ($Wx$), attention ($a^\top V$), mixture models ($\sum_k \pi_k \mathcal{N}(\mu_k, \Sigma_k)$), and ensemble methods ($\sum_m \alpha_m h_m(x)$). The span of your basis determines expressiveness; the coefficients determine which point in that span you select.

Notes
  • Shape discipline: For $\hat{y} = Xw$, ensure $X \in \mathbb{R}^{n \times d}$ and $w \in \mathbb{R}^d$ so the result is $\hat{y} \in \mathbb{R}^n$. Each prediction $\hat{y}_i$ is the inner product of row $i$ with $w$. For attention, $Q \in \mathbb{R}^{1 \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$ produces scores $\in \mathbb{R}^{1 \times m}$, weights $a \in \mathbb{R}^m$, and output $o \in \mathbb{R}^{d_v}$. Batch processing extends to $Q \in \mathbb{R}^{n \times d_k}$ producing $O \in \mathbb{R}^{n \times d_v}$.

  • Verification strategy: The code explicitly computes both the matrix operation ($X @ w$, $a @ V$) and the manual linear combination (summing scaled columns/rows) to verify they’re identical. This “dual representation” is pedagogical—in production, use the matrix form for efficiency. The equivalence confirms your understanding: matrix-vector multiplication is linear combination of columns.

  • Softmax numerical stability: The implementation uses softmax from shared utilities, which likely includes the standard exp(x - x.max()) trick for numerical stability. When implementing attention, always subtract the max before exponentiating to prevent overflow. The scaling by $1/\sqrt{d_k}$ prevents dot products from growing large as dimensionality increases, keeping softmax gradients well-behaved.

  • Span interpretation: The column span of $X$ is the set of all vectors reachable by varying $w$. If $X$ has rank $r < d$, the span is $r$-dimensional even though you have $d$ features—some features are redundant. Use np.linalg.matrix_rank(X) to check. For attention, if all value vectors are parallel (rank 1), the output is always a scalar multiple of that direction regardless of attention weights.

  • Convex combination property: Attention weights $a_i \in [0,1]$ with $\sum a_i = 1$ guarantee the output is a convex combination. This means attention cannot extrapolate—it can only interpolate between value vectors. If you need outputs outside the value span, you must apply a post-attention transformation (e.g., feed-forward layer). This is why transformer blocks alternate attention (routes within value span) with MLPs (expand the span nonlinearly).

  • Part 1: Linear Regression as Linear Combination The first section demonstrates that predictions $\hat{y} = Xw$ are always linear combinations of feature columns. The $2 \times 2$ design matrix $X$ has 2 columns (features), and the weight vector $w = [2, -1]$ specifies how much of each column to include. The result [0., 2., 4.] is a weighted combination $2 \cdot X_{:,0} - 1 \cdot X_{:,1}$. This illustrates a fundamental principle: all linear model outputs live in the column span of $X$. If you add a third feature but it’s a linear combination of the first two (e.g., $X_{:,2} = X_{:,0} + X_{:,1}$), rank deficiency occurs and no amount of training data changes the span. Feature engineering works by adding new linearly independent columns to expand this span.

  • Part 2: Attention as Convex Linear Combination The second section implements scaled dot-product attention, showing that outputs are always convex combinations of value vectors. Unlike regression where weights can be negative or unbounded (any $w \in \mathbb{R}^d$), attention enforces $a_i \geq 0$ and $\sum a_i = 1$ via softmax. This makes the output geometrically interpretable: it’s a weighted average of values, constrained to their convex hull. The query-key similarity determines the routing—which values to emphasize—but the fundamental constraint remains: all attention outputs live in the value span. Multi-head attention and positional encodings expand this span by learning richer value representations across layers.

  • Part 3: Practical Implications and Debugging Understanding span constraints enables effective model debugging and architecture design. When linear models underfit, the issue is often that true targets lie outside $\text{span}(X)$—adding nonlinear features (polynomials, interactions) or kernel methods expands the span. For attention, if all value vectors are nearly collinear (rank-deficient $V$), the model cannot produce diverse outputs regardless of attention weights; use np.linalg.matrix_rank(V) to diagnose. In transformers, residual connections and layer normalization preserve span properties across layers while enabling gradient flow. When attention scores collapse to uniform distribution (all $a_i \approx 1/m$), the output becomes a simple average—check query-key similarity patterns and consider temperature scaling or learned position embeddings. The dual verification (matrix product vs. manual sum) in this code is not just pedagogical: it's a production debugging technique to catch shape errors, numerical instability, or incorrect broadcast semantics before they propagate through deep networks.

History and Applications

Span and linear combinations have been central to mathematics since Descartes’ coordinate geometry (17th century). The abstract notion of “span as the set of all linear combinations” emerged during the 20th century through the development of vector spaces and linear algebra.

Linear regression roots: The method of least squares, formalized by Gauss (1809) and Legendre (1805), is fundamentally about expressing a target as a linear combination of features. The solution finds optimal coefficients such that $y \approx w_1 x_1 + w_2 x_2$—the best-fitting element in $\text{span}(X)$.

Attention and convex combinations: The softmax attention mechanism (Vaswani et al., 2017) computes normalized weights that sum to 1, forming a convex combination of value vectors. This embeds the span and linear combination concept—values are averaged along directions chosen by learned queries and keys, making attention a learnable weighted sum over the span of value embeddings.

Connection to Broader Examples

This example establishes span and linear combination as the unifying algebraic structure underlying diverse ML algorithms. The key insight is that both supervised learning (regression) and self-supervised learning (attention) work by selecting coefficients for basis vectors, differing only in how those coefficients are determined.

Throughout the remaining 98 examples, this pattern recurs constantly:

  • Neural network layers (Examples 4, 16) compute $f(Wx + b)$ where $Wx$ is a linear combination of weight matrix columns, followed by nonlinear activation that enables multi-layer networks to escape the span constraints of single layers
  • PCA (Examples 11-20) projects data onto the span of top eigenvectors: $X_{\text{reduced}} = XW_k$ where $W_k$ contains principal components—you’re expressing data as combinations of these learned basis vectors
  • SVD (Example 10) factorizes $X = U\Sigma V^\top$, revealing the optimal low-rank basis: the columns of $U$ span the column space, $V^\top$ rows span the row space, and $\Sigma$ gives the importance (singular values) of each basis vector
  • Mixture models (future examples) represent distributions as $p(x) = \sum_k \pi_k p_k(x)$—a convex combination of component distributions where $\pi_k$ are mixing weights (analogous to attention weights $a_i$)

The transformer architecture exploits this structure deliberately: attention layers route information via convex combinations (staying within the value span), then MLP layers apply nonlinear transformations (expanding the span for the next layer). This alternating pattern—linear combination within a span, then nonlinear expansion to a richer span—is how deep networks build hierarchical representations.

Why span matters for ML practitioners: When debugging poor model performance, ask “Is the issue the span (do my features/values cover the needed representations?) or the coefficients (am I selecting the wrong linear combination)?” Feature engineering expands the span. Regularization (L1, L2) constrains coefficient magnitudes. Dropout randomly removes basis vectors. Understanding this distinction guides your debugging strategy—if the span is insufficient, add features or layers; if coefficients are wrong, adjust optimization or regularization.

Comments