Example number
98
Example slug
example_98_span_in_ml_xw_and_attention_as_weighted_sums
Background

Span is one of the first ideas in linear algebra: given vectors (v_1,,v_k), their span is all vectors you can form with linear combinations (_i _i v_i). In ML, this surfaces everywhere. Classical linear regression (Gauss, Legendre) predicts (y = Xw), which is a weighted sum of feature columns—hence (y) lies in the span of those columns. In dimensionality reduction (Pearson 1901; Hotelling 1933), PCA finds principal directions and projects data onto the span of those directions. In modern deep learning, attention (Vaswani et al. 2017) produces outputs (o = A V) that are explicit weighted sums of values, so (o) lies in the span of the value vectors. Thinking in spans provides intuition: capacity is governed by rank; redundancy appears as collinearity; regularization reduces effective span; projections keep us inside constrained subspaces (e.g., zero-mean hyperplane). Span unifies fit, project, decompose, and aggregate.

Purpose

Build a crisp, ML-first mental model of span and linear combinations, then connect it directly to two ubiquitous computations: linear predictions Xw and attention outputs as weighted sums of value vectors. Understand that “span” means all linear combinations of a set of vectors, and that many ML outputs are literally in the span of some basis: predictions lie in the span of feature columns; attention outputs lie in the span of values. Learn shape discipline and why expressing outputs as explicit linear combinations clarifies geometry, stability, and interpretability. See how span connects to projections (least squares), decompositions (PCA/SVD), and constraints (zero-mean subspace). The goal: when you see Xw or softmax(QK^T)V, you immediately recognize a weighted sum over columns/rows and reason about capacity, rank, and robustness.

Problem

Show that (a) linear predictions lie in the span of feature columns and (b) attention outputs lie in the span of value vectors. Compute both explicitly and verify the span claim by expressing outputs as linear combinations.

Solution (Math)

For linear predictions $\hat y=Xw$,

\[ Xw=\sum_{j=1}^d w_j X_{:j}, \]

so $\hat y\in \mathrm{span}\{X_{:1},\dots,X_{:d}\}$.

For attention with one query, output is $o=a^T V=\sum_i a_i v_i$, a linear combination of values, hence $o\in\mathrm{span}\{v_i\}$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import softmax

X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
w = np.array([2., -1.])
yhat = X @ w
print("yhat:", yhat)
print("check combo:", 2*X[:,0] - 1*X[:,1])

Q = np.array([[1., 0.]])
K = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
V = np.array([[1., 0.],
              [0., 2.],
              [1., 1.]])

scores = (Q @ K.T) / np.sqrt(2)
a = softmax(scores, axis=1).reshape(-1)
o = a @ V
print("a:", a, "sum:", a.sum())
print("o:", o)
print("o combo:", a[0]*V[0] + a[1]*V[1] + a[2]*V[2])
Code Introduction

This script demonstrates span via two computations:

  1. Linear predictions are in the span of feature columns. With (X^{nd}) and (w^d), (y = Xw) equals a weighted sum of columns: (y = {j=1}^d w_j X{:j}). The code prints yhat and separately constructs the same vector via 2*X[:,0] - 1*X[:,1] (since w = [2, -1]), verifying equality. This makes explicit that predictions live inside ({X_{:j}}).

  2. Attention output is in the span of values. With a single query (Q^{1d_k}), keys (K^{md_k}), and values (V^{md_v}), we compute scores (QK^/), weights (a=()^m), then the output (o = a^V). The script prints the weights a (summing to 1) and reconstructs o both via a @ V and an explicit sum a[0]*V[0] + a[1]*V[1] + a[2]*V[2]. This confirms that attention is a weighted sum, hence (o{v_i}).

Shapes and checks: - X: 3×2; w: 2; yhat: 3. The explicit combination matches X @ w exactly. - Q: 1×2; K: 3×2; V: 3×2; scores: 1×3; a: 3, sums to 1; o: 2. Both constructions of o agree.

Why this matters: Thinking in spans clarifies what your model can express (capacity), how coefficients/weights explain outputs (interpretability), and how rank/conditioning affects stability (numerics).

Numerical Implementation Details
  • Linear combination cost: Forming (y = Xw) is a matrix-vector product costing (O(nd)). Writing as (j w_j X{:j}) clarifies it’s a weighted sum over columns.
  • Attention weights: With one query, compute (a = (QK^/) ^m). Softmax ensures non-negative weights summing to 1; division by () stabilizes entropy.
  • Attention output cost: (o = a^V) costs (O(md_v)). Over batches, attention is implemented with GEMM for efficiency.
  • Verification checks: For regression, compare yhat to linear combination of columns: sum_j w[j] * X[:, j]. For attention, verify a.sum() ≈ 1 and reconstruct o via explicit sum of a[i] * V[i].
  • Conditioning: If columns of (X) are nearly collinear, the span is thin and solutions may be unstable; same for value vectors in (V). Orthogonalization (QR/SVD) reveals effective rank.
  • Precision: Use double precision for large dynamic ranges; softmax computed with log-sum-exp for stability.
  • Shapes: Ensure (X) and (w) dimensions align; for attention, (Q ^{1d_k}), (K ^{md_k}), (V ^{md_v}) producing (a ^m) and (o ^{d_v}).
What This Example Demonstrates
  • Span definition in action: (y = Xw) is a linear combination of feature columns; attention output (o = A V) is a linear combination of value vectors.
  • Explicit linear combination forms: (Xw = {j=1}^d w_j X{:j}); with one query, (o = _i a_i v_i) where (a = (QK^/)).
  • Shape discipline: Identify shapes: (X ^{nd}), (w ^d), (y ^n); (V ^{md_v}), (a ^m), (o ^{d_v}).
  • Rank and capacity: The span of ({X_{:j}}) has dimension at most ((X)); predictions cannot leave that subspace. Similarly, (o) lives in (({v_i})).
  • Interpretability via coefficients: Coefficients (w) and attention weights (a) show “how much” each basis vector contributes.
  • Connection to projections: Least squares finds the projection of (b) onto (({X_{:j}})); attention can be viewed as data-dependent weighted projection/aggregation.
  • Numerical stability intuition: Collinear features/value vectors reduce effective span and can cause instability; orthogonal bases improve conditioning.
  • Unified ML pattern: Many ML computations are “weighted sums over a learned basis,” i.e., operating inside a span.
Notes

Part 1: Span and Explicit Linear Combinations

Span means “all weighted sums.” Writing (Xw) as (j w_j X{:j}) makes the computation transparent: the prediction is a mixture of feature columns. For attention, (o = _i a_i v_i) shows the output is a mixture of value vectors. This explicit view clarifies capacity (which directions can we express?) and interpretability (which basis vectors contribute?).

Part 2: Rank, Capacity, and Redundancy

The dimension of a span equals the rank of its basis matrix. Redundant (collinear) vectors don’t increase span dimension. In regression, collinear features reduce effective capacity and harm stability. In attention, redundant values reduce diversity of outputs. Rank-aware design (orthogonalization, PCA) improves robustness.

Part 3: Projections Into Spans

Least squares projects (b) into (({X_{:j}})); PCA projects data into principal spans; attention aggregates within the value span. Projection residuals are orthogonal to the span (key geometric fact), enabling diagnostics and error decomposition.

Why This Matters for ML

  • Many ML computations are weighted sums over learned bases (features, components, values).
  • Understanding span highlights limits: models cannot output directions outside the span without changing the basis.
  • Regularization, dimensionality reduction, and orthogonalization are span management tools.

ML Examples and Patterns

  1. Linear regression: (y = Xw) lives in the feature-column span.
  2. Attention: (o = A V) lives in the value span; (A) provides data-dependent weights.
  3. PCA: Data projected onto top-(k) components lives in a (k)-dimensional span.
  4. Mixup augmentation: Convex combinations of examples stay inside the data span.
  5. Residual analysis: (r = b - Xw) is orthogonal to the span, aiding diagnostics.

Connection to Linear Algebra Theory

  • Span, basis, and dimension: fundamental notions explaining representational capacity.
  • Rank-nullity theorem: connects span dimension to nullspace size.
  • Orthogonal projections: symmetric idempotent operators map into spans with perpendicular residuals.

Numerical and Implementation Notes

  • Use QR/SVD to assess rank and orthogonalize bases.
  • Compute softmax with log-sum-exp for stability.
  • Verify shapes and sums: a.sum() ≈ 1; reconstruct outputs via explicit weighted sums.

Numerical and Shape Notes

  • Shapes: (X^{nd}), (w^d), (y^n); (V^{md_v}), (a^m), (o^{d_v}).
  • Broadcasting rules: ensure consistent dimensions when forming sums.
  • Precision: prefer float64 for large ranges and accumulation.

Pedagogical Significance

Seeing outputs as explicit linear combinations builds intuition that many ML systems operate “inside a span.” This perspective unifies regression, PCA, attention, and more, emphasizing shapes, rank, and stability. It’s a simple lens with wide reach.

History and Applications

Classical roots: Linear combinations and span date to the origins of linear algebra (Gauss, Legendre) via least squares—expressing observations as combinations of basis functions. PCA (Pearson 1901; Hotelling 1933) projects data into the span of principal directions, reducing dimensionality while preserving variance.

Modern ML: Linear models (ridge/LASSO) operate in the span of features; kernel methods expand spans via feature maps; attention produces outputs in the span of values with data-dependent weights. Mixup and convex augmentation rely on span/convexity: interpolations stay inside feasible sets, improving robustness.

Practical implications: - Capacity is limited by span dimension (rank). Increase basis diversity to widen span; reduce via regularization for stability. - Interpretability arises from coefficients/weights indicating basis contributions. - Efficiency comes from operating on compact bases (low-rank spans) via SVD/PCA.

Span is a simple concept with wide reach: seeing ML computations as weighted sums over bases (i.e., operating inside a span) unifies regression, attention, and dimensionality reduction.

Connection to Broader Examples
  • Vector spaces & subspaces (Ex 97): Span is the smallest subspace containing a set of vectors; projections keep outputs inside that subspace (e.g., zero-mean hyperplane).
  • Least squares (Ex 92): The solution projects (b) onto (({X_{:j}})). Residuals are orthogonal to the span.
  • PCA (Ex 91): PCA finds a low-dimensional basis; projecting data onto principal components constrains outputs to the PCA span.
  • SVD (Ex 90): SVD exposes bases (left/right singular vectors) whose spans structure the data; truncation reduces span dimension (low-rank).
  • Orthogonality & projections (Ex 86): Orthogonal bases make spans numerically stable; projections onto orthogonal subspaces are well-conditioned.
  • Rank & nullspace (Ex 87): Rank determines span dimension; nullspace captures directions not represented in the span.
  • Attention (Ex 96): Attention outputs are explicitly in the span of values; sparse/low-rank approximations reduce effective span for efficiency.
  • Conditioning (Ex 94): Poorly conditioned spans (nearly collinear bases) amplify errors; regularization improves effective conditioning.

Comments