Example number
1
Example slug
example_1_vector_space_closure_centering_as_a_subspace_projection
Background

These concepts are foundational in modern linear algebra and appear throughout statistical learning theory and deep learning practice.

Closure under linear combinations guarantees that optimization algorithms never produce invalid parameter configurations. When gradient descent computes $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$, the result stays in the same parameter space. Momentum methods form convex combinations of velocity vectors across iterations. Ensemble methods like model averaging compute weighted sums of predictions. In every case, closure ensures these operations produce valid outputs in $\mathbb{R}^d$—you never “fall off the edge” by performing legal vector arithmetic.

Mean-centering as projection is the core operation in Principal Component Analysis (PCA), batch normalization, and any algorithm that requires zero-mean data. The projection matrix $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ decomposes vectors into orthogonal components: one aligned with the constant vector $\mathbf{1}$ (the mean), and one perpendicular to it (the centered data). Understanding this as a subspace projection—not just “subtract the mean”—reveals why PCA eigenvectors are orthogonal and why centered data has lower numerical condition numbers.

Purpose

Give you a crisp, ML-first mental model you can reuse across models and datasets.

What you’ll learn:

  • Closure as optimization foundation: Understand why parameter updates in gradient descent, momentum, and ensemble methods always produce valid outputs—because $\mathbb{R}^d$ is closed under linear combinations.
  • Mean-centering as geometry: See “subtract the mean” as an orthogonal projection onto a subspace, not just arithmetic. This geometric view explains why PCA requires centered data and why batch normalization stabilizes training.
  • Dual perspectives: Train yourself to switch between algebraic operations (addition, scaling) and geometric interpretations (subspaces, projections). This flexibility is essential for understanding why algorithms work and diagnosing when they fail.
  • Computational patterns: Recognize projection as one of five core patterns (fit, project, decompose, solve, measure) that recur throughout ML. Once you see it here, you’ll spot it everywhere—from least squares to attention mechanisms.

This example establishes the mental framework for all subsequent examples: every linear algebra operation has both an algebraic formula and a geometric meaning, and ML algorithms exploit both.

Problem
  1. Show closure: form a convex combination of two vectors in R^d.
  2. Show that mean-centering projects onto the subspace of zero-mean vectors.
Solution (Math)
  1. $\mathbb{R}^d$ is closed under addition and scalar multiplication, so for any $a,b\in\mathbb{R}^d$ and $\alpha\in\mathbb{R}$,
\[ v=\alpha a + (1-\alpha)b \in \mathbb{R}^d. \]
  1. The set $S=\{x\in\mathbb{R}^n:\mathbf{1}^Tx=0\}$ is a subspace (kernel of $\mathbf{1}^T$). The orthogonal projector is $P=I-\frac{1}{n}\mathbf{1}\mathbf{1}^T$. Applying $P$ is exactly “subtract the mean”.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np

a = np.array([1., 2., -1.])
b = np.array([0., -2., 3.])
alpha = 0.2
v = alpha*a + (1-alpha)*b
print("v:", v)

# Centering as projection onto zero-mean subspace

x = np.array([2., 0., -1., 5.])
n = x.size
one = np.ones((n,1))
P = np.eye(n) - (1/n)*(one@one.T)
xc = P@x
print("xc:", xc, "sum(xc):", xc.sum())
print("xc == x - mean(x):", np.allclose(xc, x - x.mean()))
Code Introduction

This Python code demonstrates closure under convex combinations in $\mathbb{R}^d$one of the fundamental properties that makes Euclidean space a valid vector space. The computation shows that when you take a weighted average of two vectors with coefficients that sum to 1, the result stays within the same space.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

The code demonstrates closure under convex combinations through explicit scalar multiplication and vector addition: alpha * a + (1 - alpha) * b. NumPy’s broadcasting handles this element-wise computation efficiently—multiply each component of a by 0.2, multiply each component of b by 0.8, then add corresponding entries. The result is computed in $O(d)$ time and demonstrates Python’s array-oriented programming style, which matches the mathematical notation directly—no index gymnastics required.

This clarity is crucial in ML codebases where readable implementations prevent bugs in gradient computations. The convex combination operation shows that $v$ remains in $\mathbb{R}^3$—no exotic coordinates appear, no dimension changes occur, and all arithmetic operations are well-defined. The result v = [0.2, -1.2, 2.2] lies on the line segment connecting a and b, positioned 20% of the way from b toward a.

For the mean-centering projection, the code constructs the projection matrix explicitly: P = np.eye(n) - (1/n)*(one@one.T). This is fine for small examples and pedagogical purposes. In production code operating on large datasets ($n \gg 1000$), prefer computing xc = x - x.mean() directly rather than materializing $P$—this avoids the $O(n^2)$ memory cost of storing the full projection matrix. The verification np.allclose(xc, x - x.mean()) confirms that the geometric projection view and the arithmetic “subtract mean” view are equivalent.

What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

Closure under convex combinations: When you compute $v = \alpha a + (1-\alpha)b$ with $\alpha \in [0,1]$, the result stays in $\mathbb{R}^d$. This seemingly simple property is why gradient descent works—parameter updates are linear combinations, and closure guarantees they remain valid. No exotic coordinates appear, no dimension changes occur, and all arithmetic operations stay well-defined.

Mean-centering as orthogonal projection: Subtracting the mean is equivalent to applying the projection matrix $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$, which maps data onto the subspace of zero-mean vectors. This geometric perspective explains why PCA operates on centered data (eigenvectors must be orthogonal to $\mathbf{1}$) and why batch normalization improves training stability (removes the mean component before scaling).

Computation pattern: This example demonstrates projection—decomposing vectors into orthogonal subspace components. It’s one of five core patterns (fit, project, decompose, solve, measure) that recur throughout ML linear algebra.

Notes

Part 1: Core setup - Vector space closure + centering as a subspace projection

State the objects, shapes, and target question for Vector space closure + centering as a subspace projection. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - Vector space closure + centering as a subspace projection

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make Vector space closure + centering as a subspace projection work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - Vector space closure + centering as a subspace projection

Give the computational recipe for Vector space closure + centering as a subspace projection, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: Check dimensions before manipulating formulas. The data matrix $X \in \mathbb{R}^{n \times d}$ has rows = examples, columns = features. The projection matrix $P \in \mathbb{R}^{n \times n}$ acts on the left to center rows: $X_c = PX$. Always verify matrix dimensions match before multiplication.

  • Array-oriented programming: The code uses alpha*a + (1-alpha)*b rather than looping over components. NumPy’s broadcasting handles element-wise operations efficiently, computing the result in $O(d)$ time. This clarity matches mathematical notation directly—no index gymnastics required—which prevents bugs in gradient computations.

  • Projection matrix structure: $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is symmetric ($P = P^T$) and idempotent ($P^2 = P$), the defining properties of orthogonal projectors. The rank-one matrix $\frac{1}{n}\mathbf{1}\mathbf{1}^T$ projects onto the constant vector subspace; subtracting it from $I$ projects onto the orthogonal complement (zero-mean vectors).

  • Numerical note: For projection, explicit matrix multiplication P @ x is fine. For least squares or solving systems, prefer stable primitives (lstsq, QR/SVD, Cholesky for SPD) over explicit inverses.

  • Algebraic vs geometric perspective: Closure demonstrates vector spaces support algebraic operations (addition, scaling). Projection demonstrates they support geometric operations (orthogonal decomposition, distances). Both perspectives are essential—PCA uses eigenvectors (algebraic) to find principal directions (geometric), and attention mechanisms compute weighted sums (algebraic) to focus on relevant features (geometric).

History and Applications

Vector spaces and closure emerged from abstract algebra in the 19th century, formalizing the intuition that certain sets of objects behave consistently under addition and scaling. The modern axiomatization solidified through works by Hilbert, Banach, and others, providing the rigorous foundation for functional analysis and quantum mechanics.

Closure in optimization: Closure under linear combinations is not just algebraic curiosity—it is the defining property that enables optimization algorithms. Early machine learning methods implicitly relied on closure: adding gradients and scaling parameters always produce valid predictions. The formalization of optimization in convex analysis (Boyd & Vandenberghe, 2004) made explicit use of convex combinations, where closure guarantees line searches and convex sets remain well-defined.

Mean-centering as projection: Understanding “subtract the mean” as a geometric projection onto a subspace proved crucial for explaining why PCA works (eigenvectors must be orthogonal to the constant vector) and why batch normalization stabilizes neural networks.

Connection to Broader Examples

This example establishes the fundamental duality between algebraic operations and geometric interpretations that permeates all of linear algebra in ML. The convex combination demonstrates algebraic closure (addition and scaling stay in the space), while the projection matrix demonstrates geometric decomposition (splitting vectors into orthogonal components).

Throughout the remaining 99 examples, you’ll see this same pattern repeated:

  • PCA (Examples 11-20) uses eigenvectors (algebraic decomposition) to find principal directions (geometric subspaces)—and requires centered data because eigenvectors must be orthogonal to $\mathbf{1}$
  • Least squares (Examples 12-20) solves $X^\top X w = X^\top y$ (algebraic manipulation) by projecting $y$ onto the column space of $X$ (geometric projection)
  • Attention mechanisms (Example 16) compute $\text{softmax}(QK^\top)V$ (weighted sum, algebraic) to focus on relevant features (geometric selection)

The projection matrix $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ also reveals why algorithms prefer zero-mean data: the rank-one matrix $\frac{1}{n}\mathbf{1}\mathbf{1}^\top$ captures the “constant” component, and subtracting it from $I$ removes that component. This decomposition reduces numerical condition numbers (fewer large eigenvalues spanning multiple scales), which is why batch normalization improves training stability—it applies this same projection idea before scaling activations.

Comments