Example number
65
Example slug
example_65_vector_spaces_subspace_projection_embeddings_zeromean
Background

Vector spaces provide the language for embeddings: addition and scalar multiplication encode interpolation, extrapolation, and mixing of features. Subspaces capture structure (constraints, invariances) and projections implement common preprocessing steps such as centering, de-meaning, and removing nuisance components.

Centering via $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ appears in classical statistics (sample mean removal), PCA (covariance computation and principal directions of centered data), and modern deep learning (batch normalization conceptually centers activations). The closure of embeddings under linear combination underlies convex combinations, attention-weighted sums, and residual connections.

Purpose

Build an ML-first intuition for two ubiquitous operations: forming linear combinations in a vector space and projecting onto a subspace. You should come away comfortable that embeddings live in $\mathbb{R}^d$ and are closed under linear combination, and that centering a vector is an orthogonal projection onto the zero-mean subspace. These ideas recur across preprocessing, optimization, and representation learning. We emphasize shape discipline and numerically stable, interpretation-friendly formulations you can reuse across models and datasets.

Problem

Using embeddings as vectors, show closure properties. 1) Given embeddings e(a), e(b) ∈ R^d, compute v = α e(a) + (1-α) e(b) and confirm it stays in R^d. 2) Project a length-8 vector onto the zero-mean subspace S = {x: 1^T x = 0} and verify the projected vector sums to ~0.

Solution (Math)

Embeddings live in $\mathbb{R}^d$ (columns, $e(a), e(b) \in \mathbb{R}^d$), a vector space, so any linear combination remains in $\mathbb{R}^d$:

\[ v = \alpha\, e(a) + (1-\alpha)\, e(b), \quad v \in \mathbb{R}^d. \]

The zero-mean subspace is

\[ S = \{x \in \mathbb{R}^n : \mathbf{1}^\top x = 0\}, \]

which is a subspace (it is the kernel of the linear functional $x \mapsto \mathbf{1}^\top x$). The orthogonal projector onto $S$ is

\[ P = I - \frac{1}{n}\, \mathbf{1}\,\mathbf{1}^\top, \qquad x_{\text{proj}} = P x, \]

and $\mathbf{1}^\top x_{\text{proj}} = 0$ because $\mathbf{1}^\top P = \mathbf{1}^\top - \tfrac{1}{n} \mathbf{1}^\top \mathbf{1} \mathbf{1}^\top = \mathbf{1}^\top - \mathbf{1}^\top = 0$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np

E = {
  "a": np.array([1.0, 0.0, 2.0, -1.0]),
  "b": np.array([-1.0, 3.0, 0.0,  2.0]),
}
alpha = 0.35
v = alpha*E["a"] + (1-alpha)*E["b"]
print("interpolated v:", v, "shape:", v.shape)

n = 8
x = np.linspace(-2, 2, n)
one = np.ones((n, 1))
P = np.eye(n) - (1/n) * (one @ one.T)
xp = P @ x
print("sum(xp) (should be ~0):", float(xp.sum()))
Code Introduction

This snippet shows two linear algebra primitives. First, it forms a convex combination of two vectors E["a"] and E["b"] using alpha=0.35: v = alpha*a + (1-alpha)*b. This produces an interpolated vector v that lies in the span of a and b, with the print confirming its values and shape (4,).

Second, it builds and applies a centering projection matrix. For $n=8$, $x$ is a linearly spaced vector from $-2$ to $2$. The matrix $P = I - (1/n)\, (\mathbf{1}\mathbf{1}^\top)$ projects any vector onto the subspace orthogonal to the all-ones vector, effectively removing its mean. Multiplying $x_{\text{proj}} = P x$ subtracts the average of $x$ from each entry, and $\sum_i (x_{\text{proj}})_i$ prints near zero, verifying the projection annihilates the constant component.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Build two embeddings $e(a), e(b) \in \mathbb{R}^d$ and choose $\alpha \in [0,1]$ to form $v = \alpha e(a) + (1-\alpha) e(b)$.
  • For centering, construct $\mathbf{1} \in \mathbb{R}^n$ as a column vector and form $P = I - \frac{1}{n}\, \mathbf{1}\mathbf{1}^\top$.
  • Apply $x_{\text{proj}} = Px$; verify $\mathbf{1}^\top x_{\text{proj}} \approx 0$.
  • Prefer matrix–vector application over explicit inverse; $P$ is symmetric idempotent ($P^2 = P$), so repeated application leaves $x_{\text{proj}}$ unchanged.
  • Check shapes explicitly before multiplies: $I, P \in \mathbb{R}^{n\times n}$, $\mathbf{1} \in \mathbb{R}^n$, $x \in \mathbb{R}^n$.
  • For batched data, center rows by subtracting row means or use $X_c = X - \frac{1}{n}\mathbf{1}\mathbf{1}^\top X$.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Closure: any linear combination of embeddings remains an embedding in $\mathbb{R}^d$.
  • Subspace projection: centering is the orthogonal projection onto the hyperplane $\mathbf{1}^\top x=0$.
  • Shape discipline: annotate $x \in \mathbb{R}^n$, $\mathbf{1} \in \mathbb{R}^n$, $P \in \mathbb{R}^{n\times n}$, $v \in \mathbb{R}^d$.
  • Numerical sanity: $\sum_i (Px)_i = 0$ within floating-point tolerance.
  • ML mapping: preprocess (center), then decompose (PCA), then fit (regression/classification) with improved conditioning.
  • Reusability: the same projector works for any $n$ and is stable, sparse-friendly, and geometry-consistent.
Notes
  • Part 1: Constructing the operations. Form $v = \alpha e(a) + (1-\alpha)e(b)$ to illustrate closure in $\mathbb{R}^d$. Build $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ and apply $x_{\text{proj}} = Px$ to illustrate orthogonal projection onto the zero-mean subspace.
  • Part 2: Properties and proofs. $P$ is symmetric and idempotent: $P^\top = P$, $P^2 = P$. The range of $P$ is $S$, and the nullspace is $\operatorname{span}\{\mathbf{1}\}$. For any $x$, $x_{\text{proj}}$ minimizes $\|x - y\|_2$ over all $y \in S$.
  • Part 3: Numerical checks. Validate $\mathbf{1}^\top (Px) = 0$ within tolerance, and observe that applying $P$ twice does not change the result. Inspect shapes to avoid broadcast mistakes.
  • Why This Matters for ML. Centering reduces bias from offsets, improves conditioning of least squares, and is a prerequisite for PCA. Linear combinations of embeddings underpin attention, residual connections, and interpolation in representation spaces.
  • ML Examples and Patterns. Fit: regress on centered features to improve interpretability; Project: remove mean/nuisance components; Decompose: PCA/SVD on centered data; Solve: better-conditioned systems accelerate convergence.
  • Connection to Linear Algebra Theory. Projections arise from orthogonal decompositions $\mathbb{R}^n = S \oplus S^\perp$; $P$ projects onto $S$ along $S^\perp$. Rank-1 update $\frac{1}{n}\mathbf{1}\mathbf{1}^\top$ removes the mean direction.
  • Numerical and Implementation Notes. Avoid explicit inverses; apply $P$ via matrix–vector multiplies or subtract means directly. Keep $\mathbf{1}$ as a column to match $\mathbf{1}\mathbf{1}^\top$ shapes. Beware integer division when computing $\frac{1}{n}$.
  • Numerical and Shape Notes. Annotate $x \in \mathbb{R}^n$, $\mathbf{1} \in \mathbb{R}^n$, $P \in \mathbb{R}^{n\times n}$. In NumPy, ensure one has shape (n, 1) so one @ one.T yields (n, n).
  • Pedagogical Significance. This is the smallest end-to-end example that ties together closure, projection, and shape discipline. It sets up later chapters on PCA and least squares where centering is non-optional.
History and Applications

Centering and projection onto the zero-mean subspace trace back to classical statistics, where de-meaning precedes covariance estimation and principal component analysis. In numerical linear algebra, orthogonal projectors are core tools for least squares and Krylov methods. In modern ML, centering underlies feature standardization, batch normalization, and stable optimization; closure under linear combination supports interpolation in embedding spaces, residual connections, and attention-weighted sums. These simple operations are the gateway to well-conditioned pipelines and interpretable decompositions.

Connection to Broader Examples
  • Links to Chapter 11 (PCA): centering precedes covariance estimation and principal directions.
  • Connects to Chapter 12 (least squares): centering features improves conditioning and interpretability.
  • Relates to Chapter 5/6 (inner products/projections): $P$ is an orthogonal projector with $P = P^\top$ and $P^2=P$.
  • Bridges to Chapter 16 (matrix products): $\mathbf{1}\mathbf{1}^\top$ is a rank-1 matrix; subtracting its scaled form removes the mean component.
  • Complements Chapter 15 (sparse): $P$ can be applied efficiently without dense inversion; projection kernels generalize to sparse settings.
  • Reinforces early chapters: vectors as columns, rows as examples, consistent shape annotations.

Comments