Example number
49
Example slug
example_49_vector_spaces_subspace_projection_embeddings_zeromean
Background

Historical context: Vector spaces emerged in the 19th century (Grassmann, Hamilton) as abstractions of geometric spaces; the formal axiomatization (Peano, Weyl) came in the early 20th century. Subspaces as kernels and images of linear maps became central in functional analysis and quantum mechanics. In statistics, centering (removing the mean) dates to Gauss’s method of least squares (1809), while projection operators were formalized in Hilbert space theory (von Neumann, 1930s). Modern ML inherited these tools wholesale: embeddings are vectors, centering is PCA preprocessing, and residuals are projections.

Mathematical characterization: A vector space $V$ over $\mathbb{R}$ satisfies closure under addition and scalar multiplication: $u, v \in V \Rightarrow u + v \in V$ and $\alpha v \in V$ for all $\alpha \in \mathbb{R}$. A subspace $S \subseteq V$ is a vector space itself (inherits operations). The zero-mean subspace $S = \{x : \mathbf{1}^\top x = 0\}$ is the kernel of the linear functional $x \mapsto \mathbf{1}^\top x$ (the sum). Any $x \in \mathbb{R}^n$ decomposes uniquely as $x = \bar{x} \mathbf{1} + (x - \bar{x} \mathbf{1})$ where $\bar{x} \mathbf{1} \in \text{span}(\mathbf{1})$ (constant vectors) and $x - \bar{x} \mathbf{1} \in S$ (zero-mean vectors), an orthogonal decomposition.

Prevalence in ML: Every preprocessing step (standardization, whitening), dimensionality reduction (PCA, LDA), and residual computation involves projections. Convex combinations power ensembles (Stochastic Weight Averaging), data augmentation (Mixup), and embedding arithmetic (“king” - “man” + “woman” ≈ “queen”). Understanding these as vector space operations clarifies when and why they work.

Purpose

Show how vector spaces and subspaces provide the formal foundation for two ubiquitous ML operations: linear combinations (convex interpolation for model ensembles, data augmentation, embedding arithmetic) and projection onto subspaces (centering/normalization, PCA preprocessing, residual computation). Demonstrate that embeddings live in $\mathbb{R}^d$ (a vector space) where closure under addition and scalar multiplication enables interpolation, and that centering is projection onto the zero-mean subspace $S = \{x : \mathbf{1}^\top x = 0\}$ via the explicit projector $P = I - (1/n) \mathbf{1} \mathbf{1}^\top$. Build intuition for construction (building vectors via linear combinations) and decomposition (splitting vectors into orthogonal components), the dual perspectives that underpin all linear operations in ML.

Problem

Using embeddings as vectors, show closure properties. 1) Given embeddings e(a), e(b) ∈ R^d, compute v = α e(a) + (1-α) e(b) and confirm it stays in R^d. 2) Project a length-7 vector onto the zero-mean subspace S = {x: 1^T x = 0} and verify the projected vector sums to ~0.

Solution (Math)

Embeddings live in $\mathbb{R}^d$, a vector space, so any linear combination

\[ v = \alpha e(a) + $1-\alpha$e(b) \]

remains in $\mathbb{R}^d$.

The zero-mean subspace $S=\{x:\mathbf{1}^T x=0\}$ is a subspace (kernel of $\mathbf{1}^T$). The orthogonal projector onto $S$ is

\[ P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T,\qquad x_{\text{proj}} = Px, \]

and $\mathbf{1}^T x_{\text{proj}}=0$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np

E = {
  "a": np.array([1.0, 0.0, 2.0, -1.0]),
  "b": np.array([-1.0, 3.0, 0.0,  2.0]),
}
alpha = 0.35
v = alpha*E["a"] + (1-alpha)*E["b"]
print("interpolated v:", v, "shape:", v.shape)

n = 7
x = np.linspace(-2, 2, n)
one = np.ones((n, 1))
P = np.eye(n) - (1/n) * (one @ one.T)
xp = P @ x
print("sum(xp) (should be ~0):", float(xp.sum()))
Code Introduction

This code demonstrates two fundamental vector space operations: linear combinations (convex combinations as interpolation) and centering via projection (removing the mean component). It illustrates that vector spaces are closed under addition and scalar multiplication, and that subspaces like the zero-mean subspace can be characterized via projection operators.

Part 1: Convex Combination as Interpolation. The code defines two 4-dimensional vectors $a = [1, 0, 2, -1]^\top$ and $b = [-1, 3, 0, 2]^\top$ stored in a dictionary. The convex combination with $\alpha = 0.35$ computes $v = \alpha a + (1-\alpha) b = 0.35 a + 0.65 b$, a weighted average lying on the line segment between $a$ and $b$. Since $\alpha \in [0,1]$ and the coefficients sum to 1, $v$ is guaranteed to be “between” $a$ and $b$ geometrically—it lies in their convex hull. The printed shape (4,) confirms $v \in \mathbb{R}^4$, demonstrating closure under addition and scalar multiplication: combining vectors in $\mathbb{R}^4$ stays in $\mathbb{R}^4$. ML applications include model interpolation (ensemble predictions), parameter averaging (Stochastic Weight Averaging), and data augmentation (Mixup blending).

Part 2: Centering via Projection Matrix. The second block demonstrates mean-centering using a projection operator. Given a vector $x \in \mathbb{R}^7$ with entries linspace(-2, 2, n) (7 evenly spaced points from -2 to 2), the centering projection is $P = I_n - (1/n) \mathbf{1} \mathbf{1}^\top \in \mathbb{R}^{7 \times 7}$, where $\mathbf{1} \in \mathbb{R}^{7 \times 1}$ is the all-ones column vector and $I_n$ is the $7 \times 7$ identity. The product $\mathbf{1} \mathbf{1}^\top$ is a rank-1 matrix of all ones, and dividing by $n$ scales it to the mean projection matrix: $(1/n) \mathbf{1} \mathbf{1}^\top$ projects any vector onto the constant subspace (the span of $\mathbf{1}$). Subtracting this from the identity yields $Px = x - \bar{x} \mathbf{1}$ where $\bar{x} = (1/n) \sum_i x_i$ is the mean. This removes the mean component, producing a centered vector $x_p$ with $\sum_i (x_p)_i = 0$. The printed sum verifies this property: sum(xp) should be near machine precision zero (e.g., $10^{-15}$). ML applications include PCA preprocessing (requires centered data), batch normalization (centers activations per mini-batch), and residual computation (residuals for models with intercept automatically have zero mean). The matrix $P$ is idempotent ($P^2 = P$), symmetric ($P^\top = P$), and an orthogonal projector onto the complement of $\text{span}(\mathbf{1})$.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Define embeddings: Store two 4D vectors a and b in a dictionary; this pattern extends to large embedding tables.
  2. Choose interpolation weight: Set $\alpha = 0.35 \in [0,1]$ for a convex combination (35% of a, 65% of b).
  3. Compute linear combination: v = alpha * E["a"] + (1 - alpha) * E["b"] performs element-wise weighted sum.
  4. Verify closure: Check v.shape == (4,) confirms $v \in \mathbb{R}^4$ (same space as inputs).
  5. Create test vector: Use np.linspace(-2, 2, n) for a simple, interpretable signal with non-zero mean.
  6. Build projection matrix: Form $P = I_n - (1/n) \mathbf{1} \mathbf{1}^\top$ where $\mathbf{1}$ is a column vector (shape (n, 1)).
  7. Apply projection: xp = P @ x computes the centered vector via matrix-vector product.
  8. Verify zero-mean: xp.sum() should be $\approx 10^{-15}$ (machine epsilon), confirming $\mathbf{1}^\top (Px) = 0$.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Closure under linear combinations: $v = \alpha a + (1-\alpha) b \in \mathbb{R}^d$ for any $a, b \in \mathbb{R}^d$ and $\alpha \in [0,1]$ (convex combination).
  • Convex interpolation: For $\alpha \in [0,1]$, $v$ lies on the line segment between $a$ and $b$; geometrically, it’s a weighted average.
  • Zero-mean subspace as kernel: $S = \{x : \mathbf{1}^\top x = 0\}$ is an $(n-1)$-dimensional subspace (codimension 1).
  • Projection via explicit formula: $P = I - (1/n) \mathbf{1} \mathbf{1}^\top$ projects onto $S$; $Px$ removes the mean component.
  • Idempotent and symmetric: $P^2 = P$ (projecting twice = projecting once) and $P^\top = P$ (self-adjoint).
  • Orthogonal decomposition: $x = \bar{x} \mathbf{1} + (x - \bar{x} \mathbf{1})$ splits any vector into constant + zero-mean components.
  • Numerical verification: Sum of centered vector is $\approx 0$ (up to floating-point precision).
  • Shape discipline: Linear combinations preserve dimension; projection preserves vector length $n$.
Notes

Part 1: Convex Combination as Interpolation - Convex combination: $v = \alpha a + (1-\alpha) b$ with $\alpha \in [0,1]$ ensures $v$ lies on the line segment $[a, b]$. - For $\alpha = 0$, $v = b$; for $\alpha = 1$, $v = a$; for $\alpha = 0.5$, $v$ is the midpoint. - Affine combinations allow $\alpha \notin [0,1]$ (extrapolation beyond the segment), but may leave the convex hull. - ML applications: Model ensembles (average predictions), embedding arithmetic (analogy queries), data augmentation (Mixup). - Closure property: Any linear combination $\sum_i c_i v_i$ with $v_i \in \mathbb{R}^d$ yields a vector in $\mathbb{R}^d$ (vector space axiom).

Part 2: Centering via Projection Matrix - Projection matrix: $P = I - (1/n) \mathbf{1} \mathbf{1}^\top$ where $(1/n) \mathbf{1} \mathbf{1}^\top$ is the mean projector. - Action: $Px = x - \bar{x} \mathbf{1}$ subtracts the mean $\bar{x} = (1/n) \sum_i x_i$ from each entry. - Result: $\mathbf{1}^\top (Px) = 0$ (sum equals zero) and $\|Px\| \le \|x\|$ (projection contracts or preserves norm). - Idempotent: $P^2 = P$ because projecting a centered vector does nothing. - Symmetric: $P^\top = P$ because $(\mathbf{1} \mathbf{1}^\top)^\top = \mathbf{1} \mathbf{1}^\top$ (outer product is symmetric). - Computational cost: Forming $P$ explicitly is $O(n^2)$ memory; in practice, compute x - x.mean() directly ($O(n)$).

Part 3: Orthogonal Decomposition - Any $x \in \mathbb{R}^n$ decomposes uniquely as $x = \underbrace{\bar{x} \mathbf{1}}_{\text{constant}} + \underbrace{(x - \bar{x} \mathbf{1})}_{\text{zero-mean}}$. - The two components are orthogonal: $\langle \mathbf{1}, x - \bar{x} \mathbf{1} \rangle = 0$. - Direct sum: $\mathbb{R}^n = \text{span}(\mathbf{1}) \oplus S$ where $S = \ker(\mathbf{1}^\top)$. - Dimension: $\dim(\text{span}(\mathbf{1})) = 1$, $\dim(S) = n-1$, and $1 + (n-1) = n$. - Geometric intuition: In $\mathbb{R}^3$, $S$ is a plane through the origin perpendicular to $\mathbf{1} = [1,1,1]^\top$.

ML Examples - Word embeddings: Convex combinations interpolate semantics; $0.5 \cdot \text{king} + 0.5 \cdot \text{queen}$ yields a “royal” concept. - Mixup augmentation: $x' = \lambda x_i + (1-\lambda) x_j$ with $\lambda \sim \text{Beta}(\alpha, \alpha)$ creates synthetic training examples. - Stochastic Weight Averaging: Average model parameters over trajectory: $\theta_{\text{avg}} = (1/T) \sum_t \theta_t$ improves generalization. - PCA preprocessing: Center data via $X_c = X - \mu$ (equivalently, $Px$ per column) before computing eigenvectors of covariance. - Batch normalization: Normalize activations $\hat{x} = (x - \mu_B) / \sigma_B$ where $\mu_B$ is the batch mean (centering step). - Residual connections: ResNet blocks compute $y = x + F(x)$, an affine combination (not convex unless normalized).

Pedagogical Notes - Dual perspectives: Construction (build vectors via combinations) vs. decomposition (split into orthogonal components). - Why two unrelated blocks? Both are vector space operations, but highlight different aspects: closure under operations (Part 1) and subspace characterization (Part 2). - Minimal example: 4D embeddings and 7D signal are small enough to inspect manually but large enough to be non-trivial. - Verification: Printing shapes and sums provides sanity checks that operations succeeded. - Generalization: The pattern extends to $\mathbb{R}^{1000}$ embeddings, $n=10^6$ time series, and batch/channel dimensions in CNNs.

Connection to ML Applications - Embedding arithmetic: Analogy queries like “king” - “man” + “woman” ≈ “queen” rely on vector space structure. - Model interpolation: Linear mode connectivity (Frankle et al.) shows that interpolating between trained models often yields low-loss paths. - Data centering: Standardization $(x - \mu) / \sigma$ combines centering (projection) and scaling (normalization). - Low-rank approximation: Centered data enables PCA/SVD to capture variance directions rather than the mean offset. - Gradient descent: Parameter updates $\theta \leftarrow \theta - \eta \nabla L$ are affine combinations (weight $1$, gradient weight $-\eta$).

Connection to Linear Algebra Theory - Vector space axioms: Closure under addition/scalar multiplication, associativity, commutativity, identity, inverses, distributivity. - Subspace criteria: $S \subseteq V$ is a subspace iff $u, v \in S \Rightarrow u+v \in S$ and $\alpha v \in S$ for all $\alpha$. - Kernel characterization: $S = \ker(\mathbf{1}^\top)$ defines $S$ via a linear functional; kernels are always subspaces. - Projection theorem: For any $x \in V$ and subspace $S$, there exists a unique $x_S \in S$ minimizing $\|x - x_S\|$ (the projection). - Direct sum: $V = S \oplus S^\perp$ decomposes $V$ into orthogonal subspaces (here, $S^\perp = \text{span}(\mathbf{1})$). - Gram-Schmidt preview: Orthogonalizing vectors starts by projecting each against the span of previous ones (centering is the first step).

Numerical and Implementation Notes - Avoid forming $P$ explicitly: For large $n$, compute x - x.mean() instead of P @ x to save $O(n^2)$ memory. - Column vector for $\mathbf{1}$: Use np.ones((n, 1)) (shape (n, 1)) to enable outer product one @ one.T → (n, n). - Broadcasting: P @ x works when x has shape (n,) or (n, 1); NumPy broadcasts appropriately. - Floating-point sum: Centered vector sum is $\approx 10^{-15}$, not exactly zero, due to rounding errors. - Numerical stability: Projection $P$ is numerically stable (subtracting mean doesn’t amplify errors); explicit matrix formation is the only concern.

Numerical and Shape Notes - Embedding shapes: a, b are shape (4,); convex combination yields shape (4,) (preserved). - Signal shape: x is shape (7,); projection yields shape (7,) (preserved). - Projection matrix: $P \in \mathbb{R}^{7 \times 7}$ (shape (7, 7)); expensive to store for large $n$. - Outer product: one @ one.T with one shape (7, 1) yields shape (7, 7) (rank-1 matrix). - Identity matrix: np.eye(n) yields shape (n, n); subtracting outer product yields $P$.

ML Context: From Vector Spaces to Deep Learning - Embeddings as vectors: Word2Vec, GloVe, BERT embeddings all live in $\mathbb{R}^d$; distances and angles have semantic meaning. - Layer activations: Each hidden layer in a neural network maps $\mathbb{R}^{d_{in}} \to \mathbb{R}^{d_{out}}$; activations are vectors. - Parameter space: Model parameters $\theta \in \mathbb{R}^p$ form a high-dimensional vector space; optimization navigates this space. - Gradient flow: Backpropagation computes gradients $\nabla_{\theta} L \in \mathbb{R}^p$; updates are vector additions. - Normalization layers: Batch norm, layer norm, group norm all center activations (projection onto zero-mean subspace). - Residual connections: $y = x + F(x)$ exploits vector addition to enable gradient flow through 100+ layers. - Attention mechanisms: Queries, keys, values are vectors; scaled dot-product attention computes inner products and linear combinations.

History and Applications

Vector spaces originated in 19th-century efforts to abstract geometric reasoning (Grassmann’s Ausdehnungslehre, 1844; Hamilton’s quaternions, 1843), culminating in the axiomatic treatment by Peano (1888) and Weyl (1918). Subspaces as kernels and images emerged from linear algebra’s formalization in the early 20th century, driven by quantum mechanics (Hilbert spaces, von Neumann) and functional analysis. Projection operators became central in statistical theory: Gauss’s least squares (1809) implicitly projects observations onto the column space of the design matrix, and Fisher’s linear discriminant analysis (1936) projects data onto discriminant directions. Centering—removing the mean—is a special case of projection onto the zero-mean subspace, rediscovered independently in PCA (Pearson, 1901; Hotelling, 1933), factor analysis (Spearman, 1904), and time series detrending. Modern ML repurposed these classical tools at scale: word embeddings (Word2Vec, 2013) treat words as vectors in $\mathbb{R}^{300}$ where arithmetic captures semantic relations; Mixup (2018) uses convex combinations for data augmentation; Stochastic Weight Averaging (2018) averages model parameters along the training trajectory. Projection matrices underpin batch normalization (Ioffe & Szegedy, 2015), layer normalization (Ba et al., 2016), and whitening transformations (ZCA, Kessy et al., 2018). The abstraction from geometric vectors to arbitrary function spaces enabled kernel methods, infinite-dimensional optimization, and reproducing kernel Hilbert spaces—foundations of modern statistical learning.

Connection to Broader Examples
  • Chapter 2 (Span/Linear combinations): Convex combinations are special cases of spanning sets; any vector in the convex hull is a positive-weighted sum.
  • Chapter 3 (Basis/Dimension): The zero-mean subspace $S$ has dimension $n-1$; a basis is any $(n-1)$ linearly independent centered vectors.
  • Chapter 4 (Linear maps): Projection $P$ is a linear map $\mathbb{R}^n \to \mathbb{R}^n$ with $\text{rank}(P) = n-1$ and $\ker(P) = \text{span}(\mathbf{1})$.
  • Chapter 5 (Inner products): Projection decomposes $x = x_{\parallel} + x_{\perp}$ where $x_{\parallel} \perp x_{\perp}$ (orthogonal to constant subspace).
  • Chapter 6 (Orthogonality/Projections): This is the canonical example of orthogonal projection; $P$ is the closest point in $S$ to $x$.
  • Chapter 11 (PCA): PCA requires centered data $X_c = (I - (1/n) \mathbf{1} \mathbf{1}^\top) X$ where $P$ is applied to each column.
  • Chapter 12 (Least-squares): Residuals $r = y - \hat{y}$ for models with intercept satisfy $\mathbf{1}^\top r = 0$ (automatically centered).
  • Chapter 14 (Conditioning): Centering improves conditioning of $X^\top X$ by removing the dominant constant direction.
  • Chapter 16 (Matrix products): Projection is matrix-vector product $Px$; outer product $\mathbf{1} \mathbf{1}^\top$ is a rank-1 matrix.
  • Embeddings: Word2Vec, GloVe, and transformer embeddings live in $\mathbb{R}^d$; interpolation explores the semantic manifold.

Comments