ex1.ai

Example number

Example slug

example_18_predictions_and_attention_are_in_spans

Background

Every machine learning prediction is a linear combination of basis elements: in regression, basis = features (columns of $X$); in attention, basis = value vectors; in kernel methods, basis = kernel evaluations. The column span of a matrix is the set of all possible linear combinations of its columnsâa subspace of the ambient space. Any output that can be expressed as $Xw$ lies in $\text{span}(X)$, no matter how large or small the weights $w$ are. In contrast, convex combinations (where weights are non-negative and sum to 1) lie in the convex hullâa geometric shape bounded by the data points, never extrapolating beyond them. This distinction is crucial: regression can make predictions far outside the convex hull (which can be good for extrapolation but risks overfit); attention stays convex (which is interpretable as a weighted average). Both operations are computationally dominated by matrix products, making them efficient on GPUs/TPUs.

Purpose

Give you a unifying view of supervised learning and attention through the lens of linear combinations and spans:

Predictions live in spans: Linear regression outputs $\hat{y} = Xw$ are linear combinations of feature columns.
Attention outputs live in convex hulls: Attention computes $o = \sum_i a_i V_i$ where weights sum to 1 (convex combination).
Two sides of the same coin: Both operations answer âwhich point in my span/convex hull best represents the data?â
Coefficient constraints matter: Regression allows arbitrary $w \in \mathbb{R}^d$; attention constrains $a$ to be non-negative and sum to 1.
Unifying theme: Ensemble methods, mixtures, neural network layersâall use linear combinations with different constraint sets.

Problem

Verify yÌ = Xw is a linear combination of the columns of X.
Verify attention output is a linear combination of value vectors.

Solution (Math)

$Xw=\sum_j w_j X_{:j}$, so $\hat y$ lies in the column span of $X$.
For a single query, attention produces $o=\sum_i a_i v_i$, so $o$ lies in $\mathrm{span}\{v_i\}$.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


import numpy as np
from scripts.toy_data import softmax

X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
w = np.array([2., -1.])
yhat = X@w
print("yhat:", yhat)
print("as combo:", 2*X[:,0] - 1*X[:,1])

Q = np.array([[1., 0.]])
K = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
V = np.array([[1., 0.],
              [0., 2.],
              [1., 1.]])

scores = (Q@K.T)/np.sqrt(2)
a = softmax(scores, axis=1).reshape(-1)
o = a@V
print("weights:", a, "sum:", a.sum())
print("o:", o)
print("combo:", a[0]*V[0] + a[1]*V[1] + a[2]*V[2])

Code Introduction

This code demonstrates linear combinations in two canonical ML operations: linear regression prediction and scaled dot-product attention. Both reduce to the same fundamental patternâselecting coefficients for basis vectors and summingâbut differ in how those coefficients are determined and constrained.

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

Regression: yhat = X @ w computes the linear combination in one matrix-vector product; equivalent to looping over columns and summing.
Feature extraction: X[:,0] and X[:,1] extract individual feature vectors; manual combo 2*X[:,0] - 1*X[:,1] shows the weights.
Attention scores: scores = (Q @ K.T) / np.sqrt(d_k) computes scaled dot products; shape $(1, 3)$ for 1 query and 3 keys.
Softmax normalization: a = softmax(scores, axis=1) converts scores to weights; each row sums to 1.0 (verified by a.sum()).
Reshape for clarity: .reshape(-1) flattens $(1, 3)$ to shape $(3,)$ for easier handling in subsequent operations.
Value aggregation: o = a @ V computes the weighted sum; equivalent to explicit loop a[0]*V[0] + a[1]*V[1] + a[2]*V[2].
Stable softmax: softmax from scripts/toy_data uses log-sum-exp trick to prevent overflow when scores are large.
Shapes: regression has $X:(3, 2)$, $w:(2)$, $\hat{y}:(3)$; attention has $Q:(1, 2)$, $K:(3, 2)$, $V:(3, 2)$, $a:(3)$, $o:(2)$.

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

Linear regression $\hat{y} = Xw$ is a linear combination: $\hat{y} = \sum_j w_j \mathbf{x}_j$ where $\mathbf{x}_j$ are columns of $X$.
Coefficients can be arbitrary: weights $w$ are unconstrained (can be negative, large, anything).
Attention output $o = \sum_i a_i V_i$ is a convex combination: weights are non-negative and sum to 1.
Verification: explicit reconstruction (loop or formula) should match the matrix product result.
Both outputs live in their respective spans; geometry determines feasible region (span vs convex hull).

Notes

Part 1: Core setup - Predictions and attention are in spans

State the objects, shapes, and target question for Predictions and attention are in spans. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - Predictions and attention are in spans

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make Predictions and attention are in spans work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - Predictions and attention are in spans

Give the computational recipe for Predictions and attention are in spans, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

Shape discipline: check dimensions before manipulating formulas.
Numerical note: prefer stable primitives (lstsq, QR/SVD, Cholesky for SPD) over explicit inverses.
Interpretation: relate algebraic steps to geometry (subspaces, projections) and to ML behavior (generalization, stability).
Linear Regression as an Unconstrained Linear Combination The first section computes predictions for a simple linear model. Given a design matrix $X \in \mathbb{R}^{3 \times 2}$ with 3 examples and 2 features, and weights $w = [2, -1]^\top$, the matrix-vector product yhat = X @ w produces predictions $\hat{y} = [0., 2., 4.]^\top$ by computing the inner product of each row with $w$. The key insight emerges from the second print statement: 2*X[:,0] - 1*X[:,1] produces the identical result by explicitly forming a linear combination of $X$âs columns. This reveals that $Xw = w_1 \mathbf{x}_1 + w_2 \mathbf{x}_2$, where $\mathbf{x}_1$ and $\mathbf{x}_2$ are the column vectors of $X$. The coefficients $w$ can be any real numbers (positive, negative, or unbounded in magnitude)âthere are no constraints. The output lives in the column span of $X$: any vector that is a linear combination of the columns. This unconstrained linear combination is the essence of supervised learning: find weights that best fit the data, constrained only by the loss function and regularization.
Attention as a Constrained Convex Combination The second section implements scaled dot-product attention, which appears throughout transformers and modern deep learning. A query $Q \in \mathbb{R}^{1 \times 2}$ matches against keys $K \in \mathbb{R}^{3 \times 2}$ to produce raw similarity scores via scores = (Q @ K.T) / np.sqrt(2). The $1/\sqrt{d_k}$ scaling (with $d_k = 2$) prevents saturation by controlling the magnitude of dot products. These raw scores are passed through softmax(scores, axis=1), which converts them to normalized attention weights $a \in \mathbb{R}^3$ that sum to 1.0 (verified by printing a.sum()). This is the crucial constraint: softmax ensures non-negative coefficients that form a probability distribution, making the output a convex combination rather than an arbitrary linear combination. The final step o = a @ V computes the weighted sum of value vectors $V \in \mathbb{R}^{3 \times 2}$, producing output $o \in \mathbb{R}^2$. The verification a[0]*V[0] + a[1]*V[1] + a[2]*V[2] explicitly shows this is a convex combinationâ$o = \sum_i a_i V_i$ where $a_i \geq 0$ and $\sum_i a_i = 1$. The output lives in the convex hull of the value vectors, a geometric shape bounded by interpolating between them.
The Deep Connection: Both Are Linear Combinations The profound insight is that regression and attention are two points on a spectrum of linear combinations. Regression selects arbitrary coefficients $w$ to minimize prediction error; attention selects non-negative coefficients (via softmax) that sum to 1. Regression projects onto the column span of $X$; attention projects onto the convex hull of value vectors. Despite this difference, both operations answer the same question: âWhich point in my span/convex hull best represents the data?â The span is determined by the data (features in regression, value embeddings in attention), while the coefficients emerge from the learning signal (loss gradient in regression, query-key similarity in attention). This unifying view appears throughout ML: ensemble methods average models via convex combinations, mixture models interpolate distributions via softmax weights, and neural networks alternate between linear combinations (matrix products) and nonlinear expansions (activations) to build hierarchical representations.
Part 4: Numerical and Shape Notes Shape discipline is essential here. For regression: $X \in \mathbb{R}^{3 \times 2}$, $w \in \mathbb{R}^2$ produce $\hat{y} \in \mathbb{R}^3$. For attention: $Q \in \mathbb{R}^{1 \times 2}$, $K \in \mathbb{R}^{3 \times 2}$ produce scores $\in \mathbb{R}^{1 \times 3}$, then softmax across axis=1 normalizes each row independently to sum to 1.0. The .reshape(-1) flattens the $1 \times 3$ weight matrix to a $3$-dimensional vector for clarity. Printing intermediate results (yhat, a, o) and verifying constraints (a.sum(), explicit combo) are best practices for debugging: they confirm that shapes align and operations behave as expected. The softmax import from scripts/toy_data provides numerically stable softmax (using the log-sum-exp trick to prevent overflow), which is essential when $d_k$ is large and raw scores can exceed floating-point range.

History and Applications

Linear regression and least squares: Linear regression emerged in the early 19th century (Gauss, Legendre) as a tool for fitting astronomical data. The insight that predictions are linear combinations of features made the method both interpretable and computationally tractable. For nearly 200 years, regression was the dominant ML method because it was theoretically sound and numerically stable.

Convex combinations and probability: Softmax and probability simplex constraints emerged from information theory and statistical mechanics (Boltzmann distribution). In the 1970sâ80s, researchers recognized that constraining linear combinations to be convex (non-negative, summing to 1) preserves probabilistic interpretabilityâa key property for building generative models, mixture models, and latent variable models.

Attention and the transformer revolution: The scaled dot-product attention mechanism (Vaswani et al., 2017) unified two ideas: (1) learning query-key similarities (a learned basis for selecting which values to attend to) and (2) convex combination constraints (softmax to form normalized weights). This design proved revolutionary because it enabled learning long-range dependencies more efficiently than RNNs and discovered that pure attentionâwithout convolution or recurrenceâcould scale to billions of parameters. Modern transformers (BERT, GPT, etc.) are fundamentally stacks of attention and feed-forward layers, each using linear combinations with different constraint sets.

Unifying ML through spans: The realization that predictions, attention, ensembles, and mixture models all boil down to linear combinations reveals a deep structure in ML. Spanning sets differ (features, values, models, distributions), and constraint sets differ (unbounded, convex, positive), but the fundamental operation is the same. This unified view helps practitioners reason about new architectures and design choicesâby asking âwhat is the spanning set?â and âwhat constraints make sense?â one can design methods with transparent behavior and clear optimization objectives. Understanding spans and linear combinations is essential for debugging models, designing efficient implementations, and extending methods to new domains.

Connection to Broader Examples

Vector spaces and closure (Ch. 1): Convex combinations are a special case of linear combinations with bounded coefficients.
Projections (Ch. 6): Least squares finds the projection of $y$ onto $\text{span}(X)$âthe closest point in the span.
Orthogonality (Ch. 6): The residual $y - \hat{y}$ is orthogonal to all columns of $X$, a geometric optimality condition.
Attention (Ch. 16): Attention as a weighted combination is fundamental; constraining weights to be convex makes outputs interpretable.
Transformers: Stacking attention and feed-forward (which apply linear combinations via $W_1$ and $W_2$ matrices) builds deep models.