ex1.ai

Example number

Example slug

example_82_span_in_ml_xw_and_attention_as_weighted_sums

Background

In linear algebra, the span of a set of vectors is the collection of all linear combinations of those vectors. Matrixâvector products like $Xw$ are exactly such combinations: the result lies in the span of the matrixâs columns. In machine learning, this shows up everywhere: linear regression predictions, logistic regression logits, and the outputs of fully-connected layers are all in spans of learned feature directions. Attention mechanisms perform weighted sums of value vectors using non-negative weights that sum to one (a convex combination), also producing outputs in the span of values. Understanding spans clarifies expressivity (which outputs are possible), interpretability (which directions contribute), and numerical stability (how weights scale outputs).

Purpose

Build an ML-first intuition for the concept of a span: outputs formed by weighted sums of basis vectors. Show that linear predictions $\hat y = Xw$ live in the span of the feature columns of $X$, and that attention outputs live in the span of the value vectors. This connects two ubiquitous operationsâmatrixâvector products and attentionâto the same geometric idea: combining a small set of directions with data-dependent weights.

Problem

Show that (a) linear predictions lie in the span of feature columns and (b) attention outputs lie in the span of value vectors. Compute both explicitly and verify the span claim by expressing outputs as linear combinations.

Solution (Math)

For linear predictions $\hat y=Xw$,

\[ Xw=\sum_{j=1}^d w_j X_{:j}, \]

so $\hat y\in \mathrm{span}\{X_{:1},\dots,X_{:d}\}$.

For attention with one query, output is $o=a^T V=\sum_i a_i v_i$, a linear combination of values, hence $o\in\mathrm{span}\{v_i\}$.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


import numpy as np
from scripts.toy_data import softmax

X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
w = np.array([2., -1.])
yhat = X @ w
print("yhat:", yhat)
print("check combo:", 2*X[:,0] - 1*X[:,1])

Q = np.array([[1., 0.]])
K = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
V = np.array([[1., 0.],
              [0., 2.],
              [1., 1.]])

scores = (Q @ K.T) / np.sqrt(2)
a = softmax(scores, axis=1).reshape(-1)
o = a @ V
print("a:", a, "sum:", a.sum())
print("o:", o)
print("o combo:", a[0]*V[0] + a[1]*V[1] + a[2]*V[2])

Code Introduction

This snippet illustrates the same linear-algebra idea in two places: a prediction as a linear combination of feature columns, and an attention output as a convex combination of value vectors.

First, the regression-style part computes $y = Xw$ with $X \in \mathbb{R}^{3\times 2}$ and $w \in \mathbb{R}^2$. Writing the columns of $X$ as $X_{:,1}$ and $X_{:,2}$, we have \[ y = Xw = w_1 X_{:,1} + w_2 X_{:,2}. \] With $X = \begin{bmatrix}1&2\\3&4\\5&6\end{bmatrix}$ and $w = [2,-1]^\top$, this yields $y = [0,2,4]^\top$. The print of âcheck comboâ verifies the same result by explicitly forming $2\,X_{:,1} - 1\,X_{:,2}$, emphasizing that $y$ lies in the span of $X$âs columns.

Second, the attention-style part forms a single-query scaled dot-product attention with $Q \in \mathbb{R}^{1\times 2}$, $K \in \mathbb{R}^{3\times 2}$, $V \in \mathbb{R}^{3\times 2}$. Scores are computed as \[ ext{scores} = \frac{QK^\top}{\sqrt{d_k}} \in \mathbb{R}^{1\times 3}, \quad d_k=2, \] then normalized row-wise to attention weights $a = \text{softmax}(\text{scores}) \in \mathbb{R}^{3}$, which satisfy $a_j \ge 0$ and $\sum_j a_j = 1$. For the given $Q=[1,0]$, $K=\{[1,0],[0,1],[1,1]\}$, the unnormalized scores are proportional to $[1,0,1]$, so $a \approx [0.401, 0.197, 0.401]$ and the row sum prints 1. The output is the convex combination \[ o \;=\; a^\top V \;=\; \sum_{j=1}^3 a_j V_j \in \mathbb{R}^2, \] which the code also verifies explicitly via âo combo.â Numerically this gives $o \approx [0.803,\;0.795]$. Conceptually, the two halves mirror each other: $Xw$ is a linear combination of columns (span), while $o$ is a convex combination of rows of $V$ (weights nonnegative and sum to 1), a key property of attention outputs. The $1/\sqrt{d_k}$ scaling keeps scores in a numerically stable range before softmax.

Numerical Implementation Details

Step-by-step verification of both spans:

Linear predictions as column-span

Define $X=\begin{bmatrix}1&2\\3&4\\5&6\end{bmatrix}\in\mathbb{R}^{3\times 2}$ and $w=[2,-1]^\top$.
Compute $\hat y = Xw = [0,\,2,\,4]^\top$.
Express explicitly via columns: $\hat y = 2\,X_{:,1} - 1\,X_{:,2}$ and confirm equality numerically.
Shapes: $X:(3,2)$, $w:(2,) \Rightarrow Xw:(3,)$. The span $\mathrm{span}\{X_{:,1},X_{:,2}\}$ is at most 2-dimensional inside $\mathbb{R}^3$.

Attention output as value-span

Define $Q=[1,0]$, $K=\{[1,0],[0,1],[1,1]\}$, $V=\{[1,0],[0,2],[1,1]\}$.
Compute scaled scores: $\text{scores} = QK^\top/\sqrt{2} = [1,\,0,\,1]/\sqrt{2}$.
Row-wise softmax to weights $a$: non-negative and sum to 1. Verify with a.sum().
Output: $o = a^\top V = \sum_i a_i V_i$. Verify by explicit sum of rows weighted by $a$.
Shapes: $Q:(1,2)$, $K:(3,2)$ $\Rightarrow$ scores $(1,3)$; $a:(3,)$; $V:(3,2)$ $\Rightarrow$ $o:(2,)$. Thus $o\in\mathrm{span}\{V_1,V_2,V_3\}$.

What This Example Demonstrates

Span of feature columns: $\hat y = Xw$ can be written $\hat y = \sum_{j=1}^d w_j X_{:j}$, so $\hat y\in\mathrm{span}\{X_{:1},\ldots,X_{:d}\}$. Changing $w$ slides $\hat y$ within this subspace.
Attention as span of values: For one query, $o = a^\top V = \sum_i a_i v_i$ lies in $\mathrm{span}\{v_1,\ldots,v_n\}$. With softmax, $a_i\ge 0$ and $\sum_i a_i=1$, so $o$ is a convex combination.
Coefficient constraints differ: Linear predictions allow arbitrary real weights; attention uses probability-like weights (non-negative, sum to one). Both are spans; attention adds convexity for stability/interpretability.
Shape discipline: $X\in\mathbb{R}^{n\times d}$, $w\in\mathbb{R}^d$, so $Xw\in\mathbb{R}^n$. For attention, $Q\in\mathbb{R}^{1\times d_k}$, $K\in\mathbb{R}^{n\times d_k}$, $V\in\mathbb{R}^{n\times d_v}$, producing $o\in\mathbb{R}^{d_v}$.
Numerical checks: Verify $Xw$ equals the explicit column combination and $o$ equals the explicit weighted sum of rows of $V$; confirm attention weights sum to 1.
ML mapping: Fit (learn directions), combine (span), then decompose/interpret (which vectors carry signal).

Notes

Part 1: Span of Feature Columns (Xw)

Any matrixâvector product $Xw$ is a linear combination of the columns of $X$. Geometrically, $Xw$ lies in the column-space (range) of $X$. If columns are linearly independent, the span has dimension $d$; if not, the span is lower-dimensional and certain outputs are unattainable.

Part 2: Span of Value Vectors (Attention)

With one query, attention output $o=a^\top V$ lies in the span of the value vectors. Softmax constraints ($a_j\ge 0$, $\sum_j a_j=1$) make this a convex combination, placing $o$ in the convex hull of the valuesâinterpretable and stable.

Part 3: Coefficient Regimes and Expressivity

Linear predictions allow weights $w_j\in\mathbb{R}$ (unconstrained), enabling extrapolation beyond convex hulls. Attention restricts to convex mixing; it cannot produce outputs outside the hull, which can be a feature (regularization and interpretability) or a limitation (needs multi-heads or deeper stacks for expressivity).

Why This Matters for ML

Clarifies what outputs are possible (expressivity) given a chosen basis.
Highlights interpretability: outputs decomposed into known directions.
Encourages shape discipline and sanity checks (sum-to-one for attention weights; column-combination equality for $Xw$).
Connects linear predictors and attention under one geometric lens.

ML Examples and Patterns

Linear/logistic regression: logits $Xw$ in column-span; coefficients quantify feature contribution.
Fully-connected layers: $y=W\,h$ is a span of columns of $W$; activations pick directions.
Self-attention: outputs are convex combinations of values; multi-heads mix multiple spans.
Graph neural networks: neighborhood aggregation is a weighted sum (span) of neighbor embeddings.

Connection to Linear Algebra Theory

Column-space, rank, and basis determine possible outputs; $\dim(\mathrm{span})=\text{rank}$ for linearly independent bases.
Convex combinations live in convex hulls; attention realizes probabilistic weights over a finite set.
Dual view: $Xw$ can also be seen as $\sum_i y_i e_i$ in the standard basis; changing basis reveals structure.

Numerical and Implementation Notes

Verify $Xw$ via explicit column combination to catch shape/axis errors.
For attention, confirm a.sum() == 1 within tolerance and all entries non-negative; use stable softmax (log-sum-exp).
Scaling scores by $1/\sqrt{d_k}$ keeps softmax in a non-saturated regime.
Avoid explicit inverses; for fitting $w$, use lstsq, QR, or regularized solvers.

Numerical and Shape Notes

Shapes: $X:(n,d)$, $w:(d,) \Rightarrow Xw:(n,)$. Attention: $Q:(1,d_k)$, $K:(n,d_k)$, scores $(1,n)$, $a:(n,)$, $V:(n,d_v)$, $o:(d_v,)\!$.
Tolerances: use np.allclose for equality checks; softmax sums may differ from 1 at $<10^{-12}$.
Cost: column-combo verification $O(nd)$; attention with one query $O(nd_k + nd_v)$.

Pedagogical Significance

Seeing both $Xw$ and attention as spans unifies linear predictors and modern architectures under the same geometry. It demystifies attention (itâs just a weighted sum) and clarifies linear models (they pick directions and combine them). This example builds reusable intuition for basis selection, coefficient effects, and output constraints.

History and Applications

Linear models and spans (1800sâ1900s): The method of least squares (Gauss, Legendre) solves for weights $w$ that best combine columns of $X$ to approximate targetsâchoosing a point in the span of feature columns. This viewpoint underpins linear regression, ridge/lasso regularization, and generalized linear models. The geometry (column-space, rank, basis) dictates expressivity: low-rank $X$ limits the set of reachable predictions.

Basis learning and PCA (1901â1933): Pearson (1901) and Hotelling (1933) formalized principal components: learned orthonormal bases onto which data is projected. Outputs in PCA lie in the span of top components, trading dimensionality for interpretability and stability. This âfit basis â combine â interpretâ pattern remains core to modern ML.

Attention as convex span (2015â2017): Bahdanau et al.Â (2015) introduced attention for sequence-to-sequence models; Vaswani et al.Â (2017) made attention the core computational primitive in Transformers. Attention outputs are convex combinations of value vectors, with weights learned from queries and keys. Convexity improves stability and interpretability (probability-like weights), while multi-head attention expands expressivity by mixing multiple spans.

Modern practice: - Regression/classification heads: $Xw$ or $Wh$ combine learned features (span) to produce logits. - Vision transformers: patch embeddings aggregated via attention (convex span) over spatial tokens. - Graph learning: neighborhood aggregation as weighted sums (span) of neighbor features. - Ensembles and mixtures: convex spans of multiple model outputs improve robustness.

The span viewpoint unifies classic linear models and modern attention architectures: both are weighted sums over learned directions, differing mainly in coefficient constraints (unconstrained vs convex) and how directions are obtained (fixed features vs data-dependent values).

Connection to Broader Examples

Example 70 (Projection): Spans and projections are dual ideas; projecting onto a column-space produces the best approximation within that span.
Example 75 (PCA/SVD): Principal components are basis vectors; projecting data onto top PCs forms outputs in a lower-dimensional span.
Example 76 (Least Squares): Solutions minimize residual by choosing weights $w$ that best combine columns of $X$âchoosing a point in the span.
Example 80 (Attention): Attention outputs are convex combinations; this example ties attention geometry to spans.
Example 81 (Zero-mean subspace): Centering projects onto a subspace; subsequent combinations occur within that subspaceâs span.
Chapter 4 (Linear maps): Matrixâvector products as linear combinations of columns is the core operator viewpoint.
Chapter 11 (PCA): Spans built from orthonormal bases aid interpretation and stability.
Chapter 16 (Matrix products): Composition of products builds complex spans (multi-layer networks).