Example number
17
Example slug
example_17_vector_space_closure_centering_as_a_subspace_projection
Background

Vector spaces form the foundational structure of linear algebra: any two vectors can be added, any vector can be scaled, and the result stays in the space. A subspace is a subset that inherits this closure property. The set of zero-mean vectors $S = \{x \in \mathbb{R}^n : \mathbf{1}^\top x = 0\}$ is the kernel of the linear functional $x \mapsto \mathbf{1}^\top x$, hence a subspace. The orthogonal projector onto $S$ is $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, and applying $P$ to any vector gives the unique point in $S$ closest to it (in the Euclidean sense). In ML, mean-centering is the standard preprocessing step before PCA, regression, and clustering; understanding it as projection reveals why it’s optimal—it removes the constant offset without losing information, minimizing reconstruction error.

These concepts appear throughout modern deep learning and statistics: regularization methods project onto constraint sets, attention mechanisms compute weighted sums in value span, and optimization algorithms iterate within feasible parameter spaces. The interplay between closure (ensuring algorithms stay valid) and projection (finding optimal points in subspaces) is the thread connecting linear algebra to ML.

Purpose

Give you a crisp, ML-first mental model that unifies closure (algebraic safety) and projection (geometric optimality):

  • Closure: Vector spaces are closed under addition and scalar multiplication; convex combinations stay in the space.
  • Subspaces: The zero-mean vectors $\{x \in \mathbb{R}^n : \mathbf{1}^\top x = 0\}$ form a subspace (preserved under all operations).
  • Projection: Mean-centering is orthogonal projection onto the zero-mean subspace; it’s both algebraically valid and geometrically optimal.
  • ML reuse: These concepts recur in PCA (centering before eigendecomposition), batch normalization (zero-mean preprocessing), and gradient descent (staying in parameter space).
Problem
  1. Show closure: form a convex combination of two vectors in R^d.
  2. Show that mean-centering projects onto the subspace of zero-mean vectors.
Solution (Math)
  1. $\mathbb{R}^d$ is closed under addition and scalar multiplication, so for any $a,b\in\mathbb{R}^d$ and $\alpha\in\mathbb{R}$,
\[ v=\alpha a + (1-\alpha)b \in \mathbb{R}^d. \]
  1. The set $S=\{x\in\mathbb{R}^n:\mathbf{1}^Tx=0\}$ is a subspace (kernel of $\mathbf{1}^T$). The orthogonal projector is $P=I-\frac{1}{n}\mathbf{1}\mathbf{1}^T$. Applying $P$ is exactly “subtract the mean”.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np

a = np.array([1., 2., -1.])
b = np.array([0., -2., 3.])
alpha = 0.2
v = alpha*a + (1-alpha)*b
print("v:", v)

# Centering as projection onto zero-mean subspace

x = np.array([2., 0., -1., 5.])
n = x.size
one = np.ones((n,1))
P = np.eye(n) - (1/n)*(one@one.T)
xc = P@x
print("xc:", xc, "sum(xc):", xc.sum())
print("xc == x - mean(x):", np.allclose(xc, x - x.mean()))
Code Introduction

This code demonstrates two foundational concepts of linear algebra in machine learning: closure under convex combinations (the algebraic property that makes vector spaces valid) and mean-centering as orthogonal projection (the geometric operation underpinning PCA, data preprocessing, and normalization).

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Convex combination: v = alpha*a + (1-alpha)*b with $\alpha \in [0, 1]$ stays in $\mathbb{R}^d$; result lies on the line segment from $b$ to $a$.
  • Projection matrix construction: one = np.ones((n,1)) creates the all-ones column vector; one @ one.T is rank-1 outer product (all entries 1); scaling by $1/n$ and subtracting from identity gives the projector.
  • Application: xc = P @ x applies the projection; equivalent to xc = x - x.mean() * np.ones(n).
  • Verification: xc.sum() should be near zero (within machine precision); np.allclose(xc, x - x.mean()) confirms equivalence to direct centering.
  • Idempotence check: P @ P @ x should equal P @ x (though not shown, always true for projectors).
  • Shapes: for $x \in \mathbb{R}^n$, $P \in \mathbb{R}^{n \times n}$ is rank $(n-1)$ (one null direction: the all-ones vector).
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Closure under convex combinations: for $a,b\in\mathbb{R}^d$ and $\alpha\in[0,1]$, $v = \alpha a + (1-\alpha)b \in \mathbb{R}^d$.
  • The zero-mean subspace $S = \{x \in \mathbb{R}^n : \mathbf{1}^\top x = 0\}$ is closed under addition and scaling.
  • Orthogonal projection: $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ projects any vector onto $S$; equivalently, subtracts the mean.
  • Idempotence: $P^2 = P$; projecting twice yields the same result as projecting once.
  • ML practice: mean-centering via projection is the foundation for PCA, regression preprocessing, and batch normalization.
Notes

Part 1: Core setup - Vector space closure + centering as a subspace projection

State the objects, shapes, and target question for Vector space closure + centering as a subspace projection. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - Vector space closure + centering as a subspace projection

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make Vector space closure + centering as a subspace projection work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - Vector space closure + centering as a subspace projection

Give the computational recipe for Vector space closure + centering as a subspace projection, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: check dimensions before manipulating formulas.
  • Numerical note: prefer stable primitives (lstsq, QR/SVD, Cholesky for SPD) over explicit inverses.
  • Interpretation: relate algebraic steps to geometry (subspaces, projections) and to ML behavior (generalization, stability).
  • Convex Combination and Closure in $\mathbb{R}^d$ The first section constructs two 3-dimensional vectors $a = [1., 2., -1.]$ and $b = [0., -2., 3.]$, then forms their convex combination with parameter $\alpha = 0.2$. The operation $v = \alpha a + (1-\alpha)b$ computes $v = 0.2 \cdot a + 0.8 \cdot b = 0.2[1, 2, -1] + 0.8[0, -2, 3] = [0.2, -1.2, 2.2]$. Since $\alpha \in [0, 1]$, this is a convex combination—a weighted average where coefficients are non-negative and sum to 1. The result $v$ lies on the line segment connecting $a$ and $b$, positioned 20% of the way from $b$ toward $a$. Crucially, $v \in \mathbb{R}^3$ because $\mathbb{R}^3$ is closed under addition and scalar multiplication: no exotic coordinates appear, and all arithmetic operations are valid. This closure property is essential for optimization algorithms like gradient descent, which iteratively form convex combinations of parameter updates while remaining in valid parameter space.
  • Mean-Centering as Orthogonal Projection The second section reveals that centering a vector is equivalent to projecting it onto a subspace. Given a vector $x = [2, 0, -1, 5] \in \mathbb{R}^4$, the code constructs the projection matrix $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, where $\mathbf{1} = [1, 1, 1, 1]^\top$ is the all-ones vector and $n=4$. The outer product $\mathbf{1}\mathbf{1}^\top$ is a $4 \times 4$ matrix of ones (each entry is 1), and scaling by $1/n$ produces the rank-1 averaging operator. Subtracting from the identity gives a rank-3 projection onto the zero-mean subspace $S = \{x \in \mathbb{R}^4 : \mathbf{1}^\top x = 0\}$—the set of all vectors whose entries sum to zero. Applying $P$ to $x$ yields $Px = x - \frac{1}{n}(\mathbf{1}^\top x)\mathbf{1} = x - \bar{x}\mathbf{1}$, where $\bar{x} = \frac{1}{4}(2+0-1+5) = 1.5$ is the mean. The result is $x_c = [0.5, -1.5, -2.5, 3.5]$—each entry decreased by the mean. The sanity checks verify this: $\sum x_c$ is approximately zero (within floating-point precision), and np.allclose(xc, x - x.mean()) confirms that matrix projection equals element-wise subtraction of the mean.
  • ML Connection These two operations—convex combinations and mean-centering—are ubiquitous in machine learning. Convex combinations appear in ensemble methods (averaging model predictions), optimization (interpolating between gradient steps), and attention mechanisms (softmax weights form a convex combination of value vectors). Mean-centering via projection is the first step in PCA (removes translational bias before eigendecomposition), data standardization (shifts origin to center of mass for numerical stability), and batch normalization (centers features before scaling). Understanding centering as projection onto a subspace reveals why it’s geometrically optimal: among all rank-3 subspaces of $\mathbb{R}^4$, the zero-mean subspace best approximates the data in a least-squares sense, and $P$ is the orthogonal projector onto it. The matrix $P$ is idempotent ($P^2 = P$) and symmetric ($P^\top = P$), which are hallmarks of orthogonal projections and ensure that applying $P$ repeatedly has no additional effect—a centered vector remains centered. This distinction—between algebraic closure (vectors stay in the space) and geometric projection (finding the best point in a subspace)—unifies linear algebra and machine learning: the former guarantees operations are valid, the latter explains why they’re optimal.
History and Applications

Foundations of vector spaces: The modern axiomatization of vector spaces emerged in the early 20th century (Peano, Banach, Hilbert), unifying diverse objects (Euclidean spaces, function spaces, polynomial rings) under a single algebraic structure. The requirement of closure under addition and scalar multiplication ensures that operations stay within the space—a critical property for reasoning about algorithms and numerical stability.

Projections and least squares: The connection between orthogonal projections and least-squares solutions was formalized through Gram-Schmidt orthogonalization (1880s–1900s) and later through SVD and QR decomposition (Golub, Kahan, 1960s). Understanding mean-centering as orthogonal projection onto the zero-mean subspace makes PCA and related methods transparent: they exploit the geometry of data to find low-dimensional approximations.

Modern ML practice: Data preprocessing always begins with centering: in classical statistics, this removes location parameters and focuses on covariance structure; in deep learning, batch normalization centers activations per batch to improve training stability and convergence. The convex combination structure appears everywhere—softmax produces convex combinations (attention), ensemble methods average models as convex combinations, and constrained optimization projects onto convex feasible sets. Understanding closure and subspace projections as fundamental operations makes it clear why these methods work: they leverage the algebraic and geometric structure of spaces.

Connection to optimization: Gradient descent iterates within parameter space by taking steps that are affine combinations of historical gradients (in momentum and adaptive methods); the space remains valid throughout because of closure. Constrained optimization projects iterates onto constraint sets (e.g., $\ell_2$ balls, hyperplanes) using projectors like $P$. Federated learning and distributed training benefit from understanding how convex combinations of models (averaging) preserve model validity and enable communication-efficient training.

Connection to Broader Examples
  • Convex combinations (Ch. 2): Closure under convex combinations is fundamental to span and linear combinations.
  • Orthogonal projections (Ch. 6): Mean-centering is the prototype orthogonal projection; understanding it geometrically prepares for least-squares projections.
  • PCA (Ch. 11): PCA always centers data first, projecting onto the zero-mean subspace before eigendecomposition.
  • Optimization: Gradient descent iterates within parameter space (closure), and constrained optimization projects onto feasible regions.
  • Normalization and preprocessing: Batch normalization, layer normalization, standardization—all exploit the subspace structure of zero-mean or zero-sum vectors.

Comments