Example number
33
Example slug
example_33_vector_space_closure_centering_as_a_subspace_projection
Background

Closure under addition and scalar multiplication is among the defining properties of a vector space. Convex combinations ($0 \le \alpha \le 1$) are a special case used for interpolation, mixtures, and ensembling. Mean-centering is a classic preprocessing step: PCA assumes zero-mean data, so that the covariance matrix equals the matrix of second moments; many optimization routines benefit from centered features, which reduce coupling between parameters; normalization layers (batch/layer norm) center activations before applying a learned scale and shift. Geometrically, centering is an orthogonal projection onto the hyperplane orthogonal to the all-ones vector $\mathbf{1}$, with projector $P = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$. Because $P$ is symmetric and idempotent ($P^2 = P$), it cleanly splits any vector into a mean component and a zero-mean component.

Purpose

Build an ML-first intuition that every linear operation lives inside a vector space and that common preprocessing steps are just projections. Two recurring themes: (1) closure says linear combinations stay inside the space you started with, so spans are stable under mixing; (2) mean-centering is an orthogonal projection onto the zero-mean subspace, so removing the mean is not a hack—it is a precise linear operator. These two ingredients show up everywhere: feature scaling, batch/layer normalization (zero-mean activations), PCA (center first, then decompose), and regression with an intercept (centering decouples slope from bias). The goal: make “subtract the mean” and “take convex/linear combinations” feel like deliberate linear maps you can reason about and verify with shapes and simple matrix identities.

Problem
  1. Show closure: form a convex combination of two vectors in $\mathbb{R}^d$.
  2. Show that mean-centering projects onto the subspace of zero-mean vectors.
Solution (Math)
  1. $\mathbb{R}^d$ is closed under addition and scalar multiplication, so for any $a,b\in\mathbb{R}^d$ and $\alpha\in\mathbb{R}$,
\[ v=\alpha a + (1-\alpha)b \in \mathbb{R}^d. \]
  2. The set $S=\{x\in\mathbb{R}^n:\mathbf{1}^Tx=0\}$ is a subspace (kernel of $\mathbf{1}^T$). The orthogonal projector is $P=I-\frac{1}{n}\mathbf{1}\mathbf{1}^T$. Applying $P$ is exactly “subtract the mean”.
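
The claim that applying $P$ is mean subtraction follows in a few lines. A short derivation, writing $\bar{x}=\tfrac{1}{n}\mathbf{1}^Tx$ for the mean (the symbol $\bar{x}$ is introduced here for convenience) and using $\mathbf{1}^T\mathbf{1}=n$:
\[ Px = x - \tfrac{1}{n}\mathbf{1}(\mathbf{1}^Tx) = x - \bar{x}\,\mathbf{1}, \qquad \mathbf{1}^T(Px) = \mathbf{1}^Tx - \tfrac{1}{n}(\mathbf{1}^T\mathbf{1})(\mathbf{1}^Tx) = 0, \]
\[ P^2 = I - \tfrac{2}{n}\mathbf{1}\mathbf{1}^T + \tfrac{1}{n^2}\mathbf{1}(\mathbf{1}^T\mathbf{1})\mathbf{1}^T = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T = P. \]
The first line shows that $Px$ is $x$ with its mean removed and that the result lies in $S$; the second shows idempotence, which together with $P^T=P$ makes $P$ an orthogonal projector.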

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np

# Closure: a convex combination of two vectors in R^3 stays in R^3
a = np.array([1., 2., -1.])
b = np.array([0., -2., 3.])
alpha = 0.2
v = alpha*a + (1-alpha)*b
print("v:", v)

# Centering as projection onto zero-mean subspace

x = np.array([2., 0., -1., 5.])
n = x.size
one = np.ones((n,1))               # all-ones column vector, shape (n, 1)
P = np.eye(n) - (1/n)*(one@one.T)  # centering projector, shape (n, n)
xc = P@x                           # projected vector, shape (n,)
print("xc:", xc, "sum(xc):", xc.sum())
print("xc == x - mean(x):", np.allclose(xc, x - x.mean()))
Code Introduction

This snippet shows two linear-algebra primitives: (1) forming a convex/linear combination $v = \alpha a + (1-\alpha)b$ to illustrate closure in $\mathbb{R}^d$, and (2) constructing the centering projector $P = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$ to map any vector to the zero-mean subspace. The code prints the mixed vector v, applies the projector to x as xc = P@x, and verifies two things: that xc.sum() is (numerically) zero, and that xc equals x - x.mean(). Because $P$ is symmetric and idempotent, it is an orthogonal projector; because $P\mathbf{1} = 0$, the component it removes is exactly the mean component, leaving a zero-mean vector.
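
The projector properties named above can be checked directly. The following is a minimal sketch, assuming a small $n$ chosen for illustration; the variable names (is_symmetric, is_idempotent, kills_ones) are hypothetical and not part of the original snippet.

```python
import numpy as np

n = 4
one = np.ones((n, 1))                      # all-ones column vector, shape (n, 1)
P = np.eye(n) - (one @ one.T) / n          # centering projector, shape (n, n)

is_symmetric = np.allclose(P, P.T)         # P = P^T
is_idempotent = np.allclose(P @ P, P)      # P^2 = P
kills_ones = np.allclose(P @ one, 0.0)     # P 1 = 0: constant vectors are removed

print(is_symmetric, is_idempotent, kills_ones)
```

All three checks use np.allclose rather than exact equality, since the entries of P @ P carry floating-point rounding.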

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n \times d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $\varepsilon$ for divisions/logs and $\varepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n \times d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $\varepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: For $n \times n$ matrices, dense matmul and factorizations cost ~$O(n^3)$; triangular solves and matrix-vector products cost $O(n^2)$. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x ≥ 0 for test vectors x), and residual norms within tolerance (~1e-12 for float64).
  1. Define two vectors $a,b \in \mathbb{R}^d$ and a mixing coefficient $\alpha$. Compute $v = \alpha a + (1-\alpha)b$ to illustrate closure (for $\alpha \in [0,1]$ this is a convex combination).
  2. Define a sample vector $x \in \mathbb{R}^n$. Compute its length n = x.size.
  3. Construct the all-ones column vector one = np.ones((n,1)) and the centering matrix $P = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$ using np.eye(n).
  4. Apply the projection: xc = P @ x. Because $P$ is symmetric idempotent, this removes exactly the mean component and leaves a zero-sum vector.
  5. Sanity checks: xc.sum() should be (near) zero; np.allclose(xc, x - x.mean()) confirms the projection equals mean subtraction.
  6. Shape check: $P$ is $n\times n$, $x$ is length $n$, output $xc$ is length $n$. The intermediate $\mathbf{1}\mathbf{1}^\top$ is $n\times n$.
  7. Cost: for large $n$, forming $\mathbf{1}\mathbf{1}^\top$ (and hence $P$) takes $O(n^2)$ time and memory; in code you can apply centering as x - x.mean() in $O(n)$ without forming $P$, but $P$ makes the projector explicit for pedagogy.
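
The same idea extends from a single vector to a data matrix $X \in \mathbb{R}^{n\times d}$ whose rows are examples: left-multiplying by $P$ centers every column. The sketch below compares the explicit projector with column-mean subtraction; the random data, seed, and sizes are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed for reproducibility
n, d = 6, 3
X = rng.normal(size=(n, d))                # rows are examples, columns are features

# Explicit projector: O(n^2) memory, formed here only for comparison
P = np.eye(n) - np.ones((n, n)) / n
Xc_explicit = P @ X

# Implicit centering: subtract each column's mean, O(n d) work and memory
Xc_implicit = X - X.mean(axis=0, keepdims=True)

same = np.allclose(Xc_explicit, Xc_implicit)
col_means = Xc_implicit.mean(axis=0)
print(same, np.allclose(col_means, 0.0))
```

Using keepdims=True keeps the mean as shape (1, d), so the broadcast against (n, d) is explicit rather than accidental.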
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Closure in $\mathbb{R}^d$: any linear (or convex) combination of vectors stays in the space and in their span.
  • Convex combination as interpolation: $v = \alpha a + (1-\alpha)b$ traces the line segment between $a$ and $b$ when $\alpha \in [0,1]$.
  • Mean-centering as projection: $P = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the orthogonal projector onto the zero-mean subspace (kernel of $\mathbf{1}^\top$).
  • Properties of projectors: $P = P^2 = P^\top$ (idempotent and symmetric) so it is an orthogonal projection.
  • Verifying with code: matrix multiplication implements projection; np.allclose(xc, x - x.mean()) confirms the algebra numerically.
  • Shape discipline: $P \in \mathbb{R}^{n\times n}$, $x \in \mathbb{R}^n$, and $Px$ preserves dimension while enforcing zero sum.
Notes

Part 1: Closure and convex combinations
Any $\alpha a + (1-\alpha)b$ stays in the span of $\{a,b\}$. When $\alpha \in [0,1]$, $v$ lies on the line segment between $a$ and $b$, illustrating interpolation. For arbitrary $\alpha$, it is still inside $\mathbb{R}^d$ by closure.
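
To make the interpolation picture concrete, the sketch below sweeps $\alpha$ over $[0,1]$ and checks that every mixed vector lies on the segment between $a$ and $b$. It reuses the vectors from the solution code; the grid of five $\alpha$ values and the names V, dist_sum, and on_segment are illustrative assumptions.

```python
import numpy as np

a = np.array([1., 2., -1.])
b = np.array([0., -2., 3.])

alphas = np.linspace(0.0, 1.0, 5)          # 0, 0.25, 0.5, 0.75, 1
# Row i is alphas[i]*a + (1 - alphas[i])*b, shape (5, 3)
V = alphas[:, None] * a + (1.0 - alphas[:, None]) * b

# Endpoints: alpha = 0 gives b, alpha = 1 gives a
print(np.allclose(V[0], b), np.allclose(V[-1], a))

# A point lies on the segment when its distances to a and b add up to |a - b|
dist_sum = np.linalg.norm(V - a, axis=1) + np.linalg.norm(V - b, axis=1)
on_segment = np.allclose(dist_sum, np.linalg.norm(a - b))
print(on_segment)
```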

Part 2: Zero-mean subspace
The subspace $S = \{x : \mathbf{1}^\top x = 0\}$ is the kernel of $\mathbf{1}^\top$. Its dimension is $n-1$, a hyperplane orthogonal to $\mathbf{1}$. The projector $P = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$ maps any vector to its zero-mean component.
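
The dimension count can be checked numerically: for an orthogonal projector, the rank, the trace, and the number of unit eigenvalues all equal the dimension of the subspace it projects onto. A minimal sketch, assuming $n = 5$ for illustration:

```python
import numpy as np

n = 5
P = np.eye(n) - np.ones((n, n)) / n

rank = np.linalg.matrix_rank(P)            # dimension of the range of P
trace = np.trace(P)                        # for a projector, trace equals rank
eigvals = np.linalg.eigvalsh(P)            # symmetric input: real eigenvalues, ascending

print(rank, trace)
print(np.round(eigvals, 12))
```

One eigenvalue is (numerically) zero, with eigenvector along $\mathbf{1}$, and the remaining $n-1$ are one, matching the hyperplane description.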

Part 3: Orthogonal projector properties
$P$ is symmetric and idempotent ($P = P^2 = P^\top$), so it performs an orthogonal projection. It decomposes $x$ into mean component ($\tfrac{1}{n}\mathbf{1}\mathbf{1}^\top x$) plus zero-mean component ($Px$).
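
The decomposition can be verified on the sample vector from the solution code. This sketch checks that the two components add back to $x$, are orthogonal, and satisfy the Pythagorean identity; the names mean_part and zero_mean_part are hypothetical.

```python
import numpy as np

x = np.array([2., 0., -1., 5.])
n = x.size
mean_part = np.full(n, x.mean())           # (1/n) 1 1^T x: constant vector of the mean
zero_mean_part = x - mean_part             # P x

reconstructs = np.allclose(mean_part + zero_mean_part, x)
inner = mean_part @ zero_mean_part         # orthogonality: should be ~0
pythagoras = np.isclose(
    x @ x, mean_part @ mean_part + zero_mean_part @ zero_mean_part
)
print(reconstructs, inner, pythagoras)
```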

Why This Matters for ML
Centering is the first step in PCA and whitening; it reduces parameter coupling in regressions with intercepts; it helps stabilize training in deep nets (batch/layer norm center activations before applying a learned scale and shift). Convex/linear combinations underpin ensembling, mixing embeddings, and interpolation in latent spaces.
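
For the PCA case, the point of centering is that the second-moment matrix of the centered data is the sample covariance. The sketch below checks this against np.cov on synthetic data with a deliberately non-zero mean; the data, seed, and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(loc=5.0, scale=2.0, size=(n, d))   # deliberately non-zero mean

Xc = X - X.mean(axis=0, keepdims=True)            # center each feature
cov_from_centered = Xc.T @ Xc / (n - 1)           # second moment of centered data
cov_reference = np.cov(X, rowvar=False)           # NumPy's sample covariance (ddof=1)

second_moment_raw = X.T @ X / (n - 1)             # skips centering; the mean leaks in

matches_centered = np.allclose(cov_from_centered, cov_reference)
matches_raw = np.allclose(second_moment_raw, cov_reference)
print(matches_centered, matches_raw)
```

The uncentered second moment differs from the covariance by a term built from the feature means, which is why the second check fails on data with a large offset.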

Numerical and Implementation Notes
For large $n$, avoid materializing $\mathbf{1}\mathbf{1}^\top$; subtract x.mean() instead. $P$ is helpful conceptually and for proofs. Use np.allclose tolerances to handle floating-point noise when checking zero-sum conditions.

Numerical and Shape Notes
Shapes: $P \in \mathbb{R}^{n\times n}$, $\mathbf{1} \in \mathbb{R}^{n\times 1}$, $x \in \mathbb{R}^n$, and $Px \in \mathbb{R}^n$. Row/column orientation matters: using column-vector convention makes $\mathbf{1}\mathbf{1}^\top$ an $n\times n$ matrix. The projection preserves dimension while enforcing $\mathbf{1}^\top (Px) = 0$.
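
The orientation point is easy to trip over in NumPy, where a 1-D array of shape (n,) and a column of shape (n, 1) behave differently under broadcasting. A short sketch of the shapes involved, using the sample vector from the solution code (the names x_flat, x_col, and accidental are illustrative):

```python
import numpy as np

x_flat = np.array([2., 0., -1., 5.])       # shape (n,)
x_col = x_flat.reshape(-1, 1)              # shape (n, 1), column-vector convention
n = x_flat.size

one = np.ones((n, 1))
P = np.eye(n) - (one @ one.T) / n          # (n, 1) @ (1, n) -> (n, n)

print((P @ x_flat).shape)                  # 1-D in, 1-D out: (4,)
print((P @ x_col).shape)                   # column in, column out: (4, 1)

# Pitfall: mixing (n, 1) with (n,) broadcasts to (n, n) instead of raising an error
accidental = x_col - x_flat
print(accidental.shape)                    # (4, 4)
```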

History and Applications

Closure is part of the axiomatic definition of a vector space, which was formalized in the late 19th century, and it underpins every linear model: linear regression, linear classifiers, and linear layers in neural networks are all maps between vector spaces. Centering as an operator dates to classical statistics: Karl Pearson’s 1901 formulation of PCA works with deviations from the mean, so that covariance equals second moments. In numerical linear algebra, orthogonal projectors like $P = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$ are fundamental to orthogonal decompositions (QR, SVD) and to least-squares residual analysis. In modern ML, centering is everywhere: PCA/whitening pipelines, batch and layer normalization in deep nets, intercept handling in regression, and feature standardization for optimization stability. The projector viewpoint clarifies why these steps work: they are not ad-hoc tricks, but exact orthogonal projections onto meaningful subspaces.

Connection to Broader Examples
  • Vector spaces (Chapter 1): Closure illustrates the defining axioms; projection shows how subspaces structure data operations.
  • Linear maps (Chapter 4): $P$ is a linear operator with clear algebraic properties (symmetric, idempotent).
  • Inner products (Chapter 5): The projector formula comes from orthogonality to $\mathbf{1}$ under the standard inner product.
  • Orthogonality and projections (Chapter 6): $P$ is a textbook orthogonal projector onto the kernel of $\mathbf{1}^\top$.
  • PCA (Chapter 11): Center before computing covariance; $P$ is the centering step that precedes SVD/eigendecomposition.
  • Least squares (Chapter 12): Centering features often reduces multicollinearity and stabilizes regression with an intercept.
  • Conditioning (Chapter 14): Centering can improve conditioning of design matrices by decoupling mean offsets from variation.
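
The least-squares and conditioning items can be illustrated on a two-column design matrix with an intercept. This is a minimal sketch, assuming a single synthetic feature with a large offset; the feature values and the names A_raw and A_ctr are made up for illustration.

```python
import numpy as np

x = np.linspace(100.0, 110.0, 50)                          # feature with a large offset
A_raw = np.column_stack([np.ones_like(x), x])              # design matrix [1, x]
A_ctr = np.column_stack([np.ones_like(x), x - x.mean()])   # design matrix [1, x - mean(x)]

cond_raw = np.linalg.cond(A_raw)
cond_ctr = np.linalg.cond(A_ctr)

# After centering, the intercept and slope columns are orthogonal
gram_offdiag = A_ctr[:, 0] @ A_ctr[:, 1]
print(cond_raw, cond_ctr, gram_offdiag)
```

In the raw matrix the two columns are nearly parallel, so the condition number is large; after centering the columns are orthogonal and the condition number is simply the ratio of the two column norms.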

Comments