ex1.ai

Example number

Example slug

example_1_vector_space_closure_centering_as_a_subspace_projection

Background

These concepts are foundational in modern linear algebra and appear throughout statistical learning theory and deep learning practice.

Closure under linear combinations guarantees that optimization algorithms never produce invalid parameter configurations. When gradient descent computes $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$, the result stays in the same parameter space. Momentum methods form convex combinations of velocity vectors across iterations. Ensemble methods like model averaging compute weighted sums of predictions. In every case, closure ensures these operations produce valid outputs in $\mathbb{R}^d$âyou never âfall off the edgeâ by performing legal vector arithmetic.

Mean-centering as projection is the core operation in Principal Component Analysis (PCA), batch normalization, and any algorithm that requires zero-mean data. The projection matrix $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ decomposes vectors into orthogonal components: one aligned with the constant vector $\mathbf{1}$ (the mean), and one perpendicular to it (the centered data). Understanding this as a subspace projectionânot just âsubtract the meanââreveals why PCA eigenvectors are orthogonal and why centered data has lower numerical condition numbers.

Purpose

Give you a crisp, ML-first mental model you can reuse across models and datasets.

What youâll learn:

Closure as optimization foundation: Understand why parameter updates in gradient descent, momentum, and ensemble methods always produce valid outputsâbecause $\mathbb{R}^d$ is closed under linear combinations.
Mean-centering as geometry: See âsubtract the meanâ as an orthogonal projection onto a subspace, not just arithmetic. This geometric view explains why PCA requires centered data and why batch normalization stabilizes training.
Dual perspectives: Train yourself to switch between algebraic operations (addition, scaling) and geometric interpretations (subspaces, projections). This flexibility is essential for understanding why algorithms work and diagnosing when they fail.
Computational patterns: Recognize projection as one of five core patterns (fit, project, decompose, solve, measure) that recur throughout ML. Once you see it here, youâll spot it everywhereâfrom least squares to attention mechanisms.

This example establishes the mental framework for all subsequent examples: every linear algebra operation has both an algebraic formula and a geometric meaning, and ML algorithms exploit both.

Problem

Show closure: form a convex combination of two vectors in R^d.
Show that mean-centering projects onto the subspace of zero-mean vectors.

Solution (Math)

$\mathbb{R}^d$ is closed under addition and scalar multiplication, so for any $a,b\in\mathbb{R}^d$ and $\alpha\in\mathbb{R}$,

\[ v=\alpha a + (1-\alpha)b \in \mathbb{R}^d. \]

The set $S=\{x\in\mathbb{R}^n:\mathbf{1}^Tx=0\}$ is a subspace (kernel of $\mathbf{1}^T$). The orthogonal projector is $P=I-\frac{1}{n}\mathbf{1}\mathbf{1}^T$. Applying $P$ is exactly âsubtract the meanâ.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


import numpy as np

a = np.array([1., 2., -1.])
b = np.array([0., -2., 3.])
alpha = 0.2
v = alpha*a + (1-alpha)*b
print("v:", v)

# Centering as projection onto zero-mean subspace

x = np.array([2., 0., -1., 5.])
n = x.size
one = np.ones((n,1))
P = np.eye(n) - (1/n)*(one@one.T)
xc = P@x
print("xc:", xc, "sum(xc):", xc.sum())
print("xc == x - mean(x):", np.allclose(xc, x - x.mean()))

Code Introduction

This Python code demonstrates closure under convex combinations in $\mathbb{R}^d$one of the fundamental properties that makes Euclidean space a valid vector space. The computation shows that when you take a weighted average of two vectors with coefficients that sum to 1, the result stays within the same space.

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

The code demonstrates closure under convex combinations through explicit scalar multiplication and vector addition: alpha * a + (1 - alpha) * b. NumPyâs broadcasting handles this element-wise computation efficientlyâmultiply each component of a by 0.2, multiply each component of b by 0.8, then add corresponding entries. The result is computed in $O(d)$ time and demonstrates Pythonâs array-oriented programming style, which matches the mathematical notation directlyâno index gymnastics required.

This clarity is crucial in ML codebases where readable implementations prevent bugs in gradient computations. The convex combination operation shows that $v$ remains in $\mathbb{R}^3$âno exotic coordinates appear, no dimension changes occur, and all arithmetic operations are well-defined. The result v = [0.2, -1.2, 2.2] lies on the line segment connecting a and b, positioned 20% of the way from b toward a.

For the mean-centering projection, the code constructs the projection matrix explicitly: P = np.eye(n) - (1/n)*(one@one.T). This is fine for small examples and pedagogical purposes. In production code operating on large datasets ($n \gg 1000$), prefer computing xc = x - x.mean() directly rather than materializing $P$âthis avoids the $O(n^2)$ memory cost of storing the full projection matrix. The verification np.allclose(xc, x - x.mean()) confirms that the geometric projection view and the arithmetic âsubtract meanâ view are equivalent.

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

Closure under convex combinations: When you compute $v = \alpha a + (1-\alpha)b$ with $\alpha \in [0,1]$, the result stays in $\mathbb{R}^d$. This seemingly simple property is why gradient descent worksâparameter updates are linear combinations, and closure guarantees they remain valid. No exotic coordinates appear, no dimension changes occur, and all arithmetic operations stay well-defined.

Mean-centering as orthogonal projection: Subtracting the mean is equivalent to applying the projection matrix $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$, which maps data onto the subspace of zero-mean vectors. This geometric perspective explains why PCA operates on centered data (eigenvectors must be orthogonal to $\mathbf{1}$) and why batch normalization improves training stability (removes the mean component before scaling).

Computation pattern: This example demonstrates projectionâdecomposing vectors into orthogonal subspace components. Itâs one of five core patterns (fit, project, decompose, solve, measure) that recur throughout ML linear algebra.

Notes

Part 1: Core setup - Vector space closure + centering as a subspace projection

State the objects, shapes, and target question for Vector space closure + centering as a subspace projection. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - Vector space closure + centering as a subspace projection

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make Vector space closure + centering as a subspace projection work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - Vector space closure + centering as a subspace projection

Give the computational recipe for Vector space closure + centering as a subspace projection, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

Shape discipline: Check dimensions before manipulating formulas. The data matrix $X \in \mathbb{R}^{n \times d}$ has rows = examples, columns = features. The projection matrix $P \in \mathbb{R}^{n \times n}$ acts on the left to center rows: $X_c = PX$. Always verify matrix dimensions match before multiplication.
Array-oriented programming: The code uses alpha*a + (1-alpha)*b rather than looping over components. NumPyâs broadcasting handles element-wise operations efficiently, computing the result in $O(d)$ time. This clarity matches mathematical notation directlyâno index gymnastics requiredâwhich prevents bugs in gradient computations.
Projection matrix structure: $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is symmetric ($P = P^T$) and idempotent ($P^2 = P$), the defining properties of orthogonal projectors. The rank-one matrix $\frac{1}{n}\mathbf{1}\mathbf{1}^T$ projects onto the constant vector subspace; subtracting it from $I$ projects onto the orthogonal complement (zero-mean vectors).
Numerical note: For projection, explicit matrix multiplication P @ x is fine. For least squares or solving systems, prefer stable primitives (lstsq, QR/SVD, Cholesky for SPD) over explicit inverses.
Algebraic vs geometric perspective: Closure demonstrates vector spaces support algebraic operations (addition, scaling). Projection demonstrates they support geometric operations (orthogonal decomposition, distances). Both perspectives are essentialâPCA uses eigenvectors (algebraic) to find principal directions (geometric), and attention mechanisms compute weighted sums (algebraic) to focus on relevant features (geometric).

History and Applications

Vector spaces and closure emerged from abstract algebra in the 19th century, formalizing the intuition that certain sets of objects behave consistently under addition and scaling. The modern axiomatization solidified through works by Hilbert, Banach, and others, providing the rigorous foundation for functional analysis and quantum mechanics.

Closure in optimization: Closure under linear combinations is not just algebraic curiosityâit is the defining property that enables optimization algorithms. Early machine learning methods implicitly relied on closure: adding gradients and scaling parameters always produce valid predictions. The formalization of optimization in convex analysis (Boyd & Vandenberghe, 2004) made explicit use of convex combinations, where closure guarantees line searches and convex sets remain well-defined.

Mean-centering as projection: Understanding âsubtract the meanâ as a geometric projection onto a subspace proved crucial for explaining why PCA works (eigenvectors must be orthogonal to the constant vector) and why batch normalization stabilizes neural networks.

Connection to Broader Examples

This example establishes the fundamental duality between algebraic operations and geometric interpretations that permeates all of linear algebra in ML. The convex combination demonstrates algebraic closure (addition and scaling stay in the space), while the projection matrix demonstrates geometric decomposition (splitting vectors into orthogonal components).

Throughout the remaining 99 examples, youâll see this same pattern repeated:

PCA (Examples 11-20) uses eigenvectors (algebraic decomposition) to find principal directions (geometric subspaces)âand requires centered data because eigenvectors must be orthogonal to $\mathbf{1}$
Least squares (Examples 12-20) solves $X^\top X w = X^\top y$ (algebraic manipulation) by projecting $y$ onto the column space of $X$ (geometric projection)
Attention mechanisms (Example 16) compute $\text{softmax}(QK^\top)V$ (weighted sum, algebraic) to focus on relevant features (geometric selection)

The projection matrix $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ also reveals why algorithms prefer zero-mean data: the rank-one matrix $\frac{1}{n}\mathbf{1}\mathbf{1}^\top$ captures the âconstantâ component, and subtracting it from $I$ removes that component. This decomposition reduces numerical condition numbers (fewer large eigenvalues spanning multiple scales), which is why batch normalization improves training stabilityâit applies this same projection idea before scaling activations.