ex1.ai

Example 2: Predictions and attention are in spans

Chapter: 2 span_linear_combination

Purpose

Give you a crisp, ML-first mental model you can reuse across models and datasets.

What you’ll learn:

Span as the output space: Understand that linear model predictions $\hat{y} = Xw$ always live in the column span of $X$, which determines the expressive power of the model. If your features don’t span the true decision boundary, no weight vector can achieve it.
Attention as convex combination: See that attention mechanisms compute weighted averages of value vectors—a convex combination where weights sum to 1. This geometric view explains why attention can’t extrapolate beyond the span of its value vectors and why positional encodings expand the representational span.
Matrix-vector multiplication as aggregation: Train yourself to see $Xw$ and $a^\top V$ as selecting coefficients for basis vectors (columns of $X$ or rows of $V$). This perspective unifies linear regression, fully connected layers, and attention as “coefficient selection + linear combination.”
Computation pattern: Recognize linear combination as one of five core patterns (fit, project, decompose, solve, measure) that recur throughout ML. The span of your basis vectors determines what outputs are reachable.

This example establishes the algebraic foundation for understanding why feature engineering matters (expanding the span), why transformers scale (learning rich value representations), and why linear models have inherent limitations (constrained by column span).

Problem

Verify ŷ = Xw is a linear combination of the columns of X.
Verify attention output is a linear combination of value vectors.

Solution (math)

$Xw=\sum_j w_j X_{:j}$, so $\hat y$ lies in the column span of $X$.
For a single query, attention produces $o=\sum_i a_i v_i$, so $o$ lies in $\mathrm{span}\{v_i\}$.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Background

These concepts are foundational in modern linear algebra and appear throughout statistical learning theory and deep learning practice.

Linear combinations as predictions form the basis of all linear models. When you compute $\hat{y} = Xw$, the prediction vector lives in $\text{span}\{X_{:,1}, \ldots, X_{:,d}\}$—the set of all linear combinations of feature columns. This immediately reveals a fundamental limitation: if the true relationship requires outputs outside this span (e.g., nonlinear decision boundaries), no choice of $w$ can represent it. Feature engineering (adding polynomial features, interactions, embeddings) works by expanding the column span to include richer representations. Kernel methods implicitly map to infinite-dimensional spans. Neural networks learn nonlinear feature maps that adaptively expand the span layer by layer.

Attention mechanisms as weighted aggregation implement the core operation in transformers: $o = \sum_i a_i v_i$ where attention weights $a_i$ come from $\text{softmax}(QK^\top/\sqrt{d_k})$. This is a convex combination (weights non-negative, sum to 1), meaning the output $o$ is a weighted average of value vectors $\{v_i\}$ and must lie in their convex hull. The query-key similarity determines how to mix the values, but the output is geometrically constrained to $\text{span}\{v_i\}$. Multi-head attention runs this operation in parallel across different learned subspaces, expanding the effective span. Positional encodings add new basis vectors to distinguish token positions. This geometric view explains why transformers need deep stacks—each layer expands the representational span available to subsequent layers.

What this example demonstrates

Linear regression predictions live in column span: When you compute $\hat{y} = Xw$, the result is $w_1 X_{:,1} + w_2 X_{:,2}$, a linear combination of feature columns. This means predictions are constrained to $\text{span}\{X_{:,j}\}$. If your features are linearly dependent (collinear), the span has lower dimension than the number of features, causing rank deficiency. If the true targets require outputs outside this span, the model is fundamentally limited—no amount of training will fix it. This is why feature selection matters: removing redundant features doesn’t hurt expressiveness (they’re already in the span), but adding informative features can expand the span.

Attention output as value span: The attention mechanism computes $o = a_1 v_1 + a_2 v_2 + a_3 v_3$ where $a_i \geq 0$ and $\sum a_i = 1$. This is a convex combination—a weighted average—so $o$ lies in the convex hull of $\{v_1, v_2, v_3\}$ and cannot extrapolate beyond the span of value vectors. The query-key similarity (via softmax) determines the mixing weights, implementing a learned “routing” mechanism. Unlike regression where weights can be negative or unbounded, attention weights are always non-negative and normalized, making attention outputs geometrically interpretable as points “inside” the value set.

Computation pattern: This example demonstrates linear combination—selecting coefficients for basis vectors and summing. It appears in fully connected layers ($Wx$), attention ($a^\top V$), mixture models ($\sum_k \pi_k \mathcal{N}(\mu_k, \Sigma_k)$), and ensemble methods ($\sum_m \alpha_m h_m(x)$). The span of your basis determines expressiveness; the coefficients determine which point in that span you select.

Numerical Implementation Details

The code demonstrates two equivalent views of the same mathematical operation, reinforcing that matrix operations are syntactic sugar for explicit linear combinations.

For linear regression, X @ w computes predictions via optimized BLAS routines in $O(nd)$ time. The manual computation 2*X[:,0] - 1*X[:,1] performs the same operation by explicitly scaling and adding column vectors. NumPy’s broadcasting handles the scalar multiplication element-wise, producing [2., 6., 10.] from 2*X[:,0] and [2., 4., 6.] from 1*X[:,1], then subtracting to get [0., 2., 4.]. This dual representation is pedagogical—in production, always use the matrix form (X @ w) for efficiency and numerical stability. The equivalence confirms that predictions are geometrically constrained to the column span of $X$.

For attention, the code implements scaled dot-product attention step-by-step. First, Q @ K.T computes query-key similarities as a $1 \times 3$ vector of dot products. Scaling by $1/\sqrt{2}$ (where $d_k = 2$) prevents the magnitudes from growing large, which would push softmax into saturation regions with near-zero gradients. The softmax function converts raw scores to normalized weights that sum to 1.0—this is what makes attention a convex combination rather than just any linear combination. Finally, a @ V computes the weighted sum of value vectors. The verification a[0]*V[0] + a[1]*V[1] + a[2]*V[2] explicitly shows the output is a point in the convex hull of $\{V_0, V_1, V_2\}$.

Memory consideration: For batch attention with $n$ queries and $m$ keys, the score matrix is $n \times m$. For self-attention ($n = m = $ sequence length), this is $O(n^2)$ memory, which is why transformers struggle with long sequences. Efficient attention variants (linear attention, sparse attention) exploit low-rank structure or sparsity patterns to reduce this cost.

Connection to the Broader Example

This example establishes span and linear combination as the unifying algebraic structure underlying diverse ML algorithms. The key insight is that both supervised learning (regression) and self-supervised learning (attention) work by selecting coefficients for basis vectors, differing only in how those coefficients are determined.

Throughout the remaining 98 examples, this pattern recurs constantly:

Neural network layers (Examples 4, 16) compute $f(Wx + b)$ where $Wx$ is a linear combination of weight matrix columns, followed by nonlinear activation that enables multi-layer networks to escape the span constraints of single layers
PCA (Examples 11-20) projects data onto the span of top eigenvectors: $X_{\text{reduced}} = XW_k$ where $W_k$ contains principal components—you’re expressing data as combinations of these learned basis vectors
SVD (Example 10) factorizes $X = U\Sigma V^\top$, revealing the optimal low-rank basis: the columns of $U$ span the column space, $V^\top$ rows span the row space, and $\Sigma$ gives the importance (singular values) of each basis vector
Mixture models (future examples) represent distributions as $p(x) = \sum_k \pi_k p_k(x)$—a convex combination of component distributions where $\pi_k$ are mixing weights (analogous to attention weights $a_i$)

The transformer architecture exploits this structure deliberately: attention layers route information via convex combinations (staying within the value span), then MLP layers apply nonlinear transformations (expanding the span for the next layer). This alternating pattern—linear combination within a span, then nonlinear expansion to a richer span—is how deep networks build hierarchical representations.

Why span matters for ML practitioners: When debugging poor model performance, ask “Is the issue the span (do my features/values cover the needed representations?) or the coefficients (am I selecting the wrong linear combination)?” Feature engineering expands the span. Regularization (L1, L2) constrains coefficient magnitudes. Dropout randomly removes basis vectors. Understanding this distinction guides your debugging strategy—if the span is insufficient, add features or layers; if coefficients are wrong, adjust optimization or regularization.

Notes

Shape discipline: For $\hat{y} = Xw$, ensure $X \in \mathbb{R}^{n \times d}$ and $w \in \mathbb{R}^d$ so the result is $\hat{y} \in \mathbb{R}^n$. Each prediction $\hat{y}_i$ is the inner product of row $i$ with $w$. For attention, $Q \in \mathbb{R}^{1 \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$ produces scores $\in \mathbb{R}^{1 \times m}$, weights $a \in \mathbb{R}^m$, and output $o \in \mathbb{R}^{d_v}$. Batch processing extends to $Q \in \mathbb{R}^{n \times d_k}$ producing $O \in \mathbb{R}^{n \times d_v}$.
Verification strategy: The code explicitly computes both the matrix operation ($X @ w$, $a @ V$) and the manual linear combination (summing scaled columns/rows) to verify they’re identical. This “dual representation” is pedagogical—in production, use the matrix form for efficiency. The equivalence confirms your understanding: matrix-vector multiplication is linear combination of columns.
Softmax numerical stability: The implementation uses softmax from shared utilities, which likely includes the standard exp(x - x.max()) trick for numerical stability. When implementing attention, always subtract the max before exponentiating to prevent overflow. The scaling by $1/\sqrt{d_k}$ prevents dot products from growing large as dimensionality increases, keeping softmax gradients well-behaved.
Span interpretation: The column span of $X$ is the set of all vectors reachable by varying $w$. If $X$ has rank $r < d$, the span is $r$-dimensional even though you have $d$ features—some features are redundant. Use np.linalg.matrix_rank(X) to check. For attention, if all value vectors are parallel (rank 1), the output is always a scalar multiple of that direction regardless of attention weights.
Convex combination property: Attention weights $a_i \in [0,1]$ with $\sum a_i = 1$ guarantee the output is a convex combination. This means attention cannot extrapolate—it can only interpolate between value vectors. If you need outputs outside the value span, you must apply a post-attention transformation (e.g., feed-forward layer). This is why transformer blocks alternate attention (routes within value span) with MLPs (expand the span nonlinearly).
Part 1: Linear Regression as Linear Combination The first section demonstrates that predictions $\hat{y} = Xw$ are always linear combinations of feature columns. The $2 \times 2$ design matrix $X$ has 2 columns (features), and the weight vector $w = [2, -1]$ specifies how much of each column to include. The result [0., 2., 4.] is a weighted combination $2 \cdot X_{:,0} - 1 \cdot X_{:,1}$. This illustrates a fundamental principle: all linear model outputs live in the column span of $X$. If you add a third feature but it’s a linear combination of the first two (e.g., $X_{:,2} = X_{:,0} + X_{:,1}$), rank deficiency occurs and no amount of training data changes the span. Feature engineering works by adding new linearly independent columns to expand this span.
Part 2: Attention as Convex Linear Combination The second section implements scaled dot-product attention, showing that outputs are always convex combinations of value vectors. Unlike regression where weights can be negative or unbounded (any $w \in \mathbb{R}^d$), attention enforces $a_i \geq 0$ and $\sum a_i = 1$ via softmax. This makes the output geometrically interpretable: it’s a weighted average of values, constrained to their convex hull. The query-key similarity determines the routing—which values to emphasize—but the fundamental constraint remains: all attention outputs live in the value span. Multi-head attention and positional encodings expand this span by learning richer value representations across layers.

References

See Chapter 2.
G. Strang. Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press, 2016. Chapter 3 (Vector Spaces and Subspaces) covers span and linear independence.
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. Section 2.2 (Multiplying Matrices and Vectors) explains matrix-vector products as linear combinations.
A. Vaswani et al. “Attention Is All You Need.” NeurIPS 2017. Original transformer paper introducing scaled dot-product attention.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning (2nd ed.). Springer, 2009. Section 3.2 discusses the column space of the design matrix in linear regression.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Section 1.2.5 covers decision regions and their relationship to feature space span.
S. Wiegreffe and Y. Pinter. “Attention is not not Explanation.” EMNLP 2019. Discusses attention as weighted combination and interpretability.
A. Rogers, O. Kovaleva, and A. Rumshisky. “A Primer in BERTology: What We Know About How BERT Works.” TACL 2020. Section 3.1 analyzes attention patterns as span selection mechanisms.

Code Introduction

This code demonstrates two fundamental ML operations—linear regression prediction and attention mechanism output—and reveals that both are secretly computing the same mathematical object: a linear combination of basis vectors. This connection is profound because it shows that seemingly different ML algorithms (supervised learning, self-attention in transformers) share a common algebraic structure rooted in span and linear combinations.

Solution (Python)


import numpy as np
from scripts.toy_data import softmax

X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
w = np.array([2., -1.])
yhat = X@w
print("yhat:", yhat)
print("as combo:", 2*X[:,0] - 1*X[:,1])

Q = np.array([[1., 0.]])
K = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
V = np.array([[1., 0.],
              [0., 2.],
              [1., 1.]])

scores = (Q@K.T)/np.sqrt(2)
a = softmax(scores, axis=1).reshape(-1)
o = a@V
print("weights:", a, "sum:", a.sum())
print("o:", o)
print("combo:", a[0]*V[0] + a[1]*V[1] + a[2]*V[2])

History

Span and linear combinations have been central to mathematics since Descartes’ coordinate geometry (17th century). The abstract notion of “span as the set of all linear combinations” emerged during the 20th century through the development of vector spaces and linear algebra.

Linear regression roots: The method of least squares, formalized by Gauss (1809) and Legendre (1805), is fundamentally about expressing a target as a linear combination of features. The solution finds optimal coefficients such that $y \approx w_1 x_1 + w_2 x_2$—the best-fitting element in $\text{span}(X)$.

Attention and convex combinations: The softmax attention mechanism (Vaswani et al., 2017) computes normalized weights that sum to 1, forming a convex combination of value vectors. This embeds the span and linear combination concept—values are averaged along directions chosen by learned queries and keys, making attention a learnable weighted sum over the span of value embeddings.