ex1.ai

Example number

Example slug

example_34_predictions_and_attention_are_in_spans

Background

Column spaces and spans are the core language for expressing linear models: any matrixâvector product $Xw$ is a linear combination of columns. This viewpoint clarifies identifiability (rank), generalization (bias towards the column space), and constraints (regularization chooses among many $w$ with the same $\hat y$ when $X$ is rank-deficient). Attention tells a parallel story: dot products create similarity scores, softmax normalizes them to probabilities, and the output is a probability-weighted sum of value vectors. As such, attention lives in the span of $\{v_i\}$; with softmax it is even a convex combination. These span facts connect linear regression, kernel methods (representer theorem), and transformers under one algebraic roof.

Purpose

Build a transferable intuition that predictions and attention outputs live in spans. In linear prediction, $\hat y = Xw$ is a weighted sum of feature columns, so $\hat y$ sits inside the column space of $X$. In attention, for a single query, the output is a softmax-weighted sum of value vectors, so the result lies in the span (indeed convex hull) of the values. Recognizing these as span operations turns black-box pipelines into concrete linear combinations you can inspect, debug, and control.

Problem

Verify yÌ = Xw is a linear combination of the columns of X.
Verify attention output is a linear combination of value vectors.

Solution (Math)

$Xw=\sum_j w_j X_{:j}$, so $\hat y$ lies in the column span of $X$.
For a single query, attention produces $o=\sum_i a_i v_i$, so $o$ lies in $\mathrm{span}\{v_i\}$.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


import numpy as np
from scripts.toy_data import softmax

X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
w = np.array([2., -1.])
yhat = X@w
print("yhat:", yhat)
print("as combo:", 2*X[:,0] - 1*X[:,1])

Q = np.array([[1., 0.]])
K = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
V = np.array([[1., 0.],
              [0., 2.],
              [1., 1.]])

scores = (Q@K.T)/np.sqrt(2)
a = softmax(scores, axis=1).reshape(-1)
o = a@V
print("weights:", a, "sum:", a.sum())
print("o:", o)
print("combo:", a[0]*V[0] + a[1]*V[1] + a[2]*V[2])

Code Introduction

This snippet shows two manifestations of linear combinations. First, with $X \in \mathbb{R}^{3\times 2}$ and $w \in \mathbb{R}^2$, the product $yhat = Xw$ forms each prediction as a linear combination of the columns of $X$: $yhat = 2\,X_{:,0} - 1\,X_{:,1}$, which the code prints explicitly to confirm equality. This illustrates that matrixâvector multiplication is just a weighted sum of columns by the coefficients in $w$.

Second, it computes a single-query scaled dot-product attention. Scores $QK^\top / \sqrt{2} \in \mathbb{R}^{1\times 3}$ are converted to attention weights $a$ via softmax, ensuring $a \ge 0$ and $a_0 + a_1 + a_2 = 1$. The output $o = a V$ is a convex combination of the three value vectors. The final print shows $o$ matches the explicit sum $a_0 V_0 + a_1 V_1 + a_2 V_2$, reinforcing that attention aggregates values through probability-weighted linear combinations.

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

Define $X \in \mathbb{R}^{n\times d}$ and $w \in \mathbb{R}^d$. Compute $\hat y = Xw$ and verify by explicitly forming $\sum_j w_j X_{:,j}$.
Define a single-query attention setup with $Q \in \mathbb{R}^{1\times d_k}$, $K \in \mathbb{R}^{n_v\times d_k}$, $V \in \mathbb{R}^{n_v\times d_v}$.
Compute scaled scores: scores = (Q @ K.T) / sqrt(d_k).
Apply row-wise softmax to obtain attention weights a with nonnegative entries summing to 1.
Aggregate values: o = a @ V. Verify equivalence to explicit sum a[0]*V[0] + ....
Sanity checks: print a.sum() (should be 1 up to numerical tolerance) and check shapes throughout.
Numerical stability: use a stable softmax implementation (subtract row max); for batched multi-head shapes, maintain (batch, heads, seq, dim) conventions.

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

$\hat y = Xw$ lies in the column space of $X$ and can be written explicitly as a sum of columns scaled by $w$.
Attention output $o$ for a single query is $\sum_i a_i v_i$; with softmax, $a_i \ge 0$ and $\sum_i a_i = 1$ so $o$ is a convex combination of the $v_i$.
Spans unify linear prediction and attention aggregation as the same algebraic pattern: weighted sums of basis vectors.
Rank/identifiability intuition: the richness of the span (rank of $X$ or diversity of $v_i$) bounds what outputs are representable.
Interpretability: weights $w$ or attention $a$ are directly inspectable coefficients in a basis; visualizing them explains model behavior.
Computational shape discipline: $X \in \mathbb{R}^{n\times d}$, $w \in \mathbb{R}^d$, $\hat y \in \mathbb{R}^{n}$; $a \in \mathbb{R}^{n_v}$, $V \in \mathbb{R}^{n_v\times d_v}$, $o \in \mathbb{R}^{d_v}$.

Notes

Part 1: Column-span predictions
Matrixâvector multiply $Xw$ equals a weighted sum of columns of $X$. The prediction space is exactly the column space of $X$; changing $w$ only moves within that span.

Part 2: Attention as convex combination
With softmax weights $a$, attention outputs are convex combinations of $\{v_i\}$, hence inside their convex hull; without softmax (unnormalized) they remain in the linear span.

Part 3: Basis, rank, and expressivity
The ability to match targets depends on basis richness: full column rank for $X$, and value diversity for $V$. Low rank restricts expressible outputs.

Why This Matters for ML
Spans expose model bias and interpretability: coefficients $w$ and attention weights $a$ are explanations in a chosen basis. They clarify generalization (regularization shrinks coefficients), guide architecture (choose informative bases), and inform efficiency (low-rank approximations).

Numerical and Implementation Notes
Use np.linalg.lstsq or QR for regression rather than explicit inverses; use stable softmax for attention. Check sums and shapes to catch errors early.

Numerical and Shape Notes
Shapes: $X(n\times d)$, $w(d)$, $\hat y(n)$; $Q(1\times d_k)$, $K(n_v\times d_k)$, $V(n_v\times d_v)$, $a(n_v)$, $o(d_v)$. For batched multi-head attention, track (B,H,N,dk) and (B,H,N,dv) and reduce along the correct axes.

History and Applications

The span viewpoint traces back to early linear algebra: predictions as $Xw$ have been interpreted as column-space combinations since Gaussâs least squares (1809), where solutions were understood in terms of projecting observations onto spans of regressors. In statistics and signal processing, basis expansions (Fourier, wavelets, splines) make the span explicitâlearning amounts to choosing coefficients in a fixed basis. Modern kernel methods formalize this with the representer theorem: solutions live in the span of training-point features. Attention continues this story in deep learning: starting with Bahdanau et al.Â (2015) and crystallized by Vaswani et al.Â (2017), attention outputs are convex combinations of value vectors, enabling dynamic, content-based selection. This unifies linear regression, kernel machines, and transformers as span-based models with different ways of setting coefficients (solving least squares, maximizing margins, or softmaxing similarities). Practically, understanding spans informs feature engineering (choose expressive bases), regularization (control coefficient norms), interpretability (inspect weights/attention), and efficiency (low-rank approximations and pruning are span restrictions).

Connection to Broader Examples

Vector spaces (Ch. 1): Spans and linear combinations define the structure of representable outputs.
Linear maps (Ch. 4): $Xw$ and $aV$ are linear maps applied to coefficient vectors.
Inner products (Ch. 5): Attention weights originate from dot-product similarities (queries vs.Â keys).
Orthogonality & projections (Ch. 6): Column space vs.Â residual interpretation in least squares is a projection story.
SVD (Ch. 10) & PCA (Ch. 11): Bases and low-rank spans govern approximation power and compression.
Least squares (Ch. 12): Fitting $w$ chooses coefficients in the column-space basis; attention learns to set $a$ via softmaxed scores.
Conditioning (Ch. 14): Poorly conditioned spans amplify coefficient variance; monitor norms of $w$ and entropy of $a$.

Numerical and Shape Notes

Numerical and Implementation Notes

Pedagogical Significance

ML Examples and Patterns

Comments