ex1.ai

Example number

Example slug

example_81_vector_spaces_subspace_projection_embeddings_zeromean

Background

Vector spaces emerged in the 19th century as an abstraction for geometric objects (vectors, matrices) with addition and scalar multiplication. Subspaces (subsets closed under these operations) appeared naturally in linear algebra as spans, kernels, and images. Orthogonal projection, the operation of "dropping" a vector onto a lower-dimensional subspace along a perpendicular direction, underlies the orthogonalization procedure of Gram (1883) and Schmidt (1907) for constructing orthonormal bases. In the 20th century, projection matrices became central to numerical linear algebra: least squares (project data onto the column space of the design matrix), PCA (project onto the principal component subspace), and statistical analysis (project onto the mean-zero subspace). In modern ML, centering via projection is the first preprocessing step of PCA (Pearson, 1901; Hotelling, 1933), standardization (z-score normalization), batch normalization (Ioffe & Szegedy, 2015), and many statistical models. Convex combinations (weighted averages with non-negative weights summing to 1) appear throughout probability theory, linear programming (simplex method), and modern deep learning: attention mechanisms compute convex combinations of value vectors (Vaswani et al., 2017), mixing layers combine features, and ensemble methods average predictions. The interplay between vector spaces, subspaces, and projections is the geometric foundation of machine learning.

Purpose

Introduce vector space structure and orthogonal projection as foundational ML operations. Show that embeddings in $\mathbb{R}^d$ are closed under linear combinations (you stay in the space) and that projecting onto the zero-mean subspace is equivalent to centering data, a preprocessing step ubiquitous in PCA, regression, and neural networks. This example builds intuition for why subspaces matter: they encode geometric constraints ("zero-sum", "orthogonal to noise") that we exploit to structure learning algorithms. You'll see convex combinations (soft interpolation between vectors) and projection matrices (enforcing linear constraints) in attention mechanisms, standardization, PCA, and batch normalization.

Problem

Using embeddings as vectors, show closure properties. 1) Given embeddings e(a), e(b) ∈ R^d, compute v = α e(a) + (1-α) e(b) and confirm it stays in R^d. 2) Project a length-9 vector onto the zero-mean subspace S = {x: 1^T x = 0} and verify the projected vector sums to ~0.

Solution (Math)

Embeddings live in $\mathbb{R}^d$, a vector space, so any linear combination

\[ v = \alpha e(a) + (1-\alpha) e(b) \]

remains in $\mathbb{R}^d$.

The zero-mean subspace $S=\{x:\mathbf{1}^T x=0\}$ is a subspace (kernel of $\mathbf{1}^T$). The orthogonal projector onto $S$ is

\[ P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T,\qquad x_{\text{proj}} = Px, \]

and $\mathbf{1}^T x_{\text{proj}}=0$.
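
The last identity follows in one line from the definition of $P$, using only $\mathbf{1}^T\mathbf{1} = n$:

\[ \mathbf{1}^T x_{\text{proj}} = \mathbf{1}^T x - \frac{1}{n}(\mathbf{1}^T\mathbf{1})(\mathbf{1}^T x) = \mathbf{1}^T x - \mathbf{1}^T x = 0. \]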

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)

import numpy as np

E = {
    "a": np.array([1.0, 0.0, 2.0, -1.0]),
    "b": np.array([-1.0, 3.0, 0.0,  2.0]),
}
alpha = 0.35
v = alpha*E["a"] + (1-alpha)*E["b"]
print("interpolated v:", v, "shape:", v.shape)

n = 9
x = np.linspace(-2, 2, n)
one = np.ones((n, 1))
P = np.eye(n) - (1/n) * (one @ one.T)
xp = P @ x
print("sum(xp) (should be ~0):", float(xp.sum()))

Historical foundations (1890s-1930s): Vector spaces emerged as an algebraic abstraction in the late 19th century. Gram (1883) and Schmidt (1907) formalized orthogonalization procedures, enabling construction of orthonormal bases, which are essential for projection onto subspaces. Pearson (1901) introduced principal component analysis (PCA) by analyzing variance of multivariate data, naturally requiring centering as the first step. Hotelling (1933) developed the method rigorously, showing that principal components are eigenvectors of the covariance matrix of the centered data. These pioneers recognized that subspaces encode geometry: the zero-mean subspace is $(n-1)$-dimensional, principal component subspaces capture maximum variance, and orthogonal projections are the canonical operation to move data into these geometries.

20th century numerical linear algebra (1950s-1990s): As computing scaled, projection matrices became central to stable numerical algorithms. Least squares via QR decomposition projects data onto the column space of the design matrix. Cholesky decomposition for covariance matrices assumes centered data. Iterative solvers (CG, GMRES) use projections to orthogonalize residuals. The SVD reveals all subspaces of a matrix (range, null space, rank). These developments made the operations in Example 81 practical: projection is not a theoretical ideal but an efficient $O(n^2)$ or $O(n)$ operation (depending on structure).

Modern ML applications (1990s-2020s): The deep learning boom relies heavily on vector space structure and centering:

Standardization and batch normalization (2010s): AlexNet (Krizhevsky, Sutskever, and Hinton, 2012) predates batch normalization. Ioffe & Szegedy's batch normalization (2015) standardizes each mini-batch to zero mean and unit variance during training. It became standard in convolutional architectures such as ResNet, while Transformer models (BERT, GPT) use layer normalization instead. In both cases centering is the first step; for batch normalization it enforces $\mathbb{E}[x] = 0$ within each batch.
Attention mechanisms (2014-2017): Bahdanau et al. (2014) introduced attention for seq2seq, using convex combinations of encoder states. Vaswani et al.'s Transformer (2017) uses scaled dot-product attention with softmax-normalized weights (row-stochastic, so each output is a convex combination of value vectors). This paradigm dominates NLP and vision.
Embeddings and interpolation (2010s-2020s): Word embeddings (Word2Vec, GloVe) live in $\mathbb{R}^d$, and convex combinations of vectors also appear in mixup data augmentation and ensemble averaging. VAEs and GANs interpolate latent codes: $z = \alpha z_1 + (1-\alpha) z_2$ explores the latent space.
Principal component analysis (1990s-2020s): PCA remains fundamental for dimensionality reduction, preprocessing, and interpretation. Standard implementations start with centering: $X_c = X - \bar{X}$, then the SVD of $X_c$ yields the principal components. Modern variants (sparse PCA, kernel PCA) extend this core operation.
Standardization in practice (2000s-2020s): Feature standardization $(x - \mu) / \sigma$ is standard preprocessing for linear models (linear regression, logistic regression, SVM) and for neural networks, which typically train faster with standardized inputs. Tree-based models, by contrast, are largely insensitive to monotone rescaling of individual features. The centering half of this recipe is Example 81's projection onto the zero-mean subspace: data centered at the origin has better geometric properties.

Current frontiers (2020s-present):

Layer normalization vs. batch normalization: Transformers use layer norm (normalize across features for each sample), which applies the same centering operation along a different axis. The principle remains: subtract the mean (and divide by the standard deviation) to enforce zero-mean structure.
Whitening and decorrelation: Whitening (e.g., ZCA whitening, PCA whitening) rotates centered data onto its principal axes and rescales each axis to unit variance. With the covariance eigendecomposition $C = U \Sigma U^\top$ and samples stored as columns of $X_c$, ZCA whitening is $X_w = U \Sigma^{-1/2} U^\top X_c$: a change of basis to the principal axes, a scaling by the inverse square roots of the eigenvalues, and a rotation back.
Graph neural networks (2016-2020s): GNNs aggregate neighbor information via convex combinations: $h_v' = \sum_{u \in N(v)} a_{vu} h_u$. Attention weights $a_{vu}$ are non-negative and sum to 1 (convex). This is Example 81 applied to graph structure.
Federated learning and distributed ML: Centering data locally before aggregating can matter for both privacy and convergence. Example 81's projection operation is then applied per client.
Adversarial robustness: Centering and standardization can harden models against adversarial examples. The geometric constraint of zero-mean subspace limits the perturbation budget.
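
The whitening item above can be sketched in a few lines. This is an illustration under assumptions: it reads $\Sigma$ as the eigenvalues of the sample covariance, stores examples as rows (so the whitening matrix multiplies on the right), and uses made-up toy data; the names `A`, `W`, and `X_w` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 500, 3
# Toy mixing matrix (made up) that introduces correlation between features.
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.5]])
X = rng.standard_normal((n, d)) @ A + 1.0   # rows are examples

X_c = X - X.mean(axis=0)                    # step 1: center each feature
C = (X_c.T @ X_c) / n                       # sample covariance, shape (d, d)
eigvals, U = np.linalg.eigh(C)              # C = U diag(eigvals) U^T

W = U @ np.diag(eigvals ** -0.5) @ U.T      # ZCA whitening matrix, C^{-1/2}
X_w = X_c @ W                               # rows are examples, so W acts on the right

C_w = (X_w.T @ X_w) / n                     # covariance after whitening
print(np.round(C_w, 6))                     # close to the identity matrix
```

Centering comes first because the covariance is defined on mean-removed data; skipping it would mix the offset into the eigendecomposition.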

Example 81 is a gateway to understanding ML preprocessing, numerical stability, and modern architecture design. Vector space closure ensures algorithms are well-defined. Centering via projection is a near-universal preprocessing step. Convex combinations power attention, ensembles, and interpolation. These three ideas, rooted in 19th-century geometry, remain the foundation of 2024's largest language models and vision systems.

Numerical Implementation Details

The following step-by-step walkthrough demonstrates both linear combinations and projection:

Define embeddings: Create two 4-dimensional vectors $E_a = [1, 0, 2, -1]^\top$ and $E_b = [-1, 3, 0, 2]^\top$. These represent learned embeddings (e.g., word vectors, latent representations).
Compute convex combination: For weight $\alpha = 0.35$, compute $v = \alpha E_a + (1-\alpha) E_b = 0.35 E_a + 0.65 E_b$. The result $v = [-0.30, 1.95, 0.70, 0.95]^\top$ lies on the line segment connecting $E_a$ and $E_b$, so it is a convex combination.
Verify closure: Check that $v \in \mathbb{R}^4$ (it does). Any weight $\alpha \in [0, 1]$ produces a valid convex combination; weights outside this range extrapolate beyond the segment.
Create zero-mean data: Generate 9 equally-spaced points $x = [-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2]^\top \in \mathbb{R}^9$. The mean is $\bar{x} = 0$.
Construct centering matrix: Form $\mathbf{1} = [1, 1, \ldots, 1]^\top \in \mathbb{R}^{9 \times 1}$, then compute $P = I_9 - \frac{1}{9} \mathbf{1} \mathbf{1}^\top$. This is a $9 \times 9$ matrix with rank 8 (one zero eigenvalue).
Apply projection: Compute $x_p = P x$. For already zero-mean data, $x_p \approx x$. For data with non-zero mean, $x_p$ subtracts the mean: $x_p = x - \bar{x} \mathbf{1}$. Verify $\mathbf{1}^\top x_p = 0$ (sum is zero within machine precision).
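
The steps above can be replayed as a short script with explicit checks. The shifted vector `x_shifted` (offset 3.0) is an addition for illustration, included so the projection has a non-zero mean to remove.

```python
import numpy as np

# Steps 1-3: convex combination of the two embeddings from the walkthrough.
E_a = np.array([1.0, 0.0, 2.0, -1.0])
E_b = np.array([-1.0, 3.0, 0.0, 2.0])
alpha = 0.35
v = alpha * E_a + (1 - alpha) * E_b
assert v.shape == (4,)
assert np.allclose(v, [-0.30, 1.95, 0.70, 0.95])

# Steps 4-5: the 9-point grid and the centering matrix.
n = 9
x = np.linspace(-2, 2, n)
one = np.ones((n, 1))
P = np.eye(n) - (one @ one.T) / n
assert np.linalg.matrix_rank(P) == n - 1

# Step 6: x already has zero mean, so P leaves it numerically unchanged.
assert np.allclose(P @ x, x)

# A shifted copy has mean 3.0; P removes exactly that mean.
x_shifted = x + 3.0
xp = P @ x_shifted
assert np.allclose(xp, x_shifted - x_shifted.mean())
assert abs(xp.sum()) < 1e-12
print("all walkthrough checks passed")
```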

What This Example Demonstrates

Closure under linear combinations: Any convex combination $v = \alpha E_a + (1-\alpha) E_b$ of vectors in $\mathbb{R}^d$ remains in $\mathbb{R}^d$ (vector spaces are "closed"). This means embeddings, hidden states, and feature vectors can be safely interpolated without leaving the ambient space.
Convex interpolation for soft selection: Weights $\alpha \in [0, 1]$ constrain the output to lie on the line segment between $E_a$ and $E_b$. This "soft" combination (vs. hard selection of one or the other) is why attention mechanisms are powerful: they blend information from multiple sources smoothly.
Subspaces encode geometric constraints: The zero-mean subspace $S = \{x : \mathbf{1}^T x = 0\}$ is a proper subspace (closed under addition and scalar multiplication). Projecting onto $S$ enforces the constraint $\sum_i x_i = 0$ without explicitly checking; the projection matrix automatically handles it.
Orthogonal projection via matrices: The centering matrix $P = I - \frac{1}{n} \mathbf{1} \mathbf{1}^T$ is an orthogonal projector (symmetric, idempotent, eigenvalues 0 or 1). Applying $P$ to any vector produces its orthogonal projection onto $S$; this is far more efficient and numerically stable than explicitly solving linear systems.
Projection matrix rank reveals dimension: $\text{rank}(P) = n - 1$ (one zero eigenvalue) tells us the zero-mean subspace is $(n-1)$-dimensional. We lose one degree of freedom by fixing the sum; this is why a centered data matrix with $n$ samples has at most $n-1$ non-zero singular values in PCA.
Centering simplifies downstream algorithms: PCA, regression, and statistical tests all assume zero-mean data. Centering via projection is the canonical preprocessing step; it appears before SVD in PCA, before solving normal equations in least squares, and before computing sample covariance in statistics.
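
The rank statement in the list above can be checked on a small random data matrix. This is a minimal sketch: the sizes (`n = 5`, `d = 10`), the seed, and the tolerance `tol` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 10                       # fewer samples than features
X = rng.standard_normal((n, d))    # rows are examples
X_c = X - X.mean(axis=0)           # center each column (feature)

s_raw = np.linalg.svd(X, compute_uv=False)
s_cen = np.linalg.svd(X_c, compute_uv=False)

tol = 1e-10                        # threshold for "numerically non-zero"
rank_raw = int((s_raw > tol).sum())
rank_centered = int((s_cen > tol).sum())
print("rank before centering:", rank_raw)
print("rank after centering: ", rank_centered)
```

Centering multiplies the data by $P$ on the left, so the rank can be at most $\text{rank}(P) = n - 1$; one singular value collapses to (numerical) zero.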

Notes

Part 1: Linear Combinations and Convex Interpolation

Linear combinations: In a vector space, any combination $v = c_1 e_1 + c_2 e_2 + \cdots + c_d e_d$ of vectors with scalars $c_i \in \mathbb{R}$ remains in the space. The weights can be any real numbers (positive, negative, or zero). For embeddings $E_a, E_b \in \mathbb{R}^d$: \[ v = \alpha E_a + \beta E_b, \quad \alpha, \beta \in \mathbb{R}. \]

Convex combinations: A convex combination constrains weights to be non-negative and sum to 1: \[ v = \alpha E_a + (1-\alpha) E_b, \quad \alpha \in [0, 1]. \]

The set of all such combinations forms the line segment from $E_a$ to $E_b$. For $\alpha = 0$, $v = E_b$; for $\alpha = 1$, $v = E_a$; for $\alpha = 0.5$, $v$ is the midpoint. In machine learning, convex combinations ensure output vectors remain "within bounds" of the input vectors: no wild extrapolations.

Soft interpolation vs. hard selection: Without convexity constraints, we might select one vector (hard choice, $\alpha \in \{0, 1\}$) or extrapolate far beyond (overshoot, $\alpha > 1$). Convex combinations provide smooth, bounded blending, which is essential in attention mechanisms where we interpolate across many value vectors based on continuous attention weights.
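
A small sketch of the difference between interpolation and extrapolation, using the two embeddings from this example. The helper `interpolate` and the coordinate-wise box test are illustrative choices, not part of the original solution.

```python
import numpy as np

E_a = np.array([1.0, 0.0, 2.0, -1.0])
E_b = np.array([-1.0, 3.0, 0.0, 2.0])

def interpolate(alpha):
    """Return alpha * E_a + (1 - alpha) * E_b."""
    return alpha * E_a + (1 - alpha) * E_b

lo = np.minimum(E_a, E_b)   # coordinate-wise lower bounds of the two endpoints
hi = np.maximum(E_a, E_b)   # coordinate-wise upper bounds

# Convex weights: every coordinate stays between the two endpoints.
inside = all(
    np.all((interpolate(a) >= lo - 1e-12) & (interpolate(a) <= hi + 1e-12))
    for a in np.linspace(0.0, 1.0, 11)
)

# alpha = 1.5 is an affine (not convex) weight: it overshoots past E_a.
v_out = interpolate(1.5)
outside = bool(np.any((v_out < lo) | (v_out > hi)))

print("alpha in [0, 1] stays in the box:", inside)
print("alpha = 1.5 leaves the box:     ", outside)
```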

Part 2: Centering via Orthogonal Projection

The zero-mean subspace: Define $S = \{x \in \mathbb{R}^n : \mathbf{1}^\top x = 0\}$, where $\mathbf{1}^\top$ is the sum operator. This is a linear subspace (closed under addition and scalar multiplication):
- If $\mathbf{1}^\top x = 0$ and $\mathbf{1}^\top y = 0$, then $\mathbf{1}^\top (x + y) = 0$ (closure under addition).
- If $\mathbf{1}^\top x = 0$, then $\mathbf{1}^\top (\alpha x) = 0$ (closure under scaling).
- $S$ is the kernel of the non-zero linear functional $\mathbf{1}^\top$, hence has dimension $n-1$.

The centering matrix: The orthogonal projector onto $S$ is: \[ P = I - \frac{1}{n} \mathbf{1} \mathbf{1}^\top. \]

This matrix is:
- Symmetric: $P^\top = P$
- Idempotent: $P^2 = P$ (applying twice equals applying once)
- Orthogonal projector: Eigenvalues are 0 or 1; range is $S$, null space is $\text{span}(\mathbf{1})$
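
These properties are easy to confirm numerically. A minimal sketch for $n = 6$ (an arbitrary size); the zero-sum test vector `z` is made up for illustration.

```python
import numpy as np

n = 6
one = np.ones((n, 1))
P = np.eye(n) - (one @ one.T) / n

is_symmetric = np.allclose(P, P.T)
is_idempotent = np.allclose(P @ P, P)

# eigvalsh is meant for symmetric matrices; eigenvalues come back in ascending order.
eigvals = np.linalg.eigvalsh(P)

# Null space: the constant vector is mapped to zero.
kills_constant = np.allclose(P @ one, 0.0)

# Range: a vector that already sums to zero is left unchanged.
z = np.array([1.0, -1.0, 2.0, -2.0, 0.5, -0.5])
keeps_zero_sum = np.allclose(P @ z, z)

print(is_symmetric, is_idempotent, kills_constant, keeps_zero_sum)
print(np.round(eigvals, 12))   # one eigenvalue near 0, the rest near 1
```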

Centering operation: For any vector $x$: \[ x_p = Px = x - \frac{1}{n} \mathbf{1}(\mathbf{1}^\top x) = x - \bar{x} \mathbf{1}, \] where $\bar{x} = \frac{1}{n} \sum_i x_i$ is the mean. Each component is shifted by the mean: $x_{p,i} = x_i - \bar{x}$. Result: $\sum_i x_{p,i} = 0$ (zero sum, zero mean).

Part 3: Why This Matters for ML

Vector space closure ensures stability: When you interpolate between embeddings (or any learned representations), staying within $\mathbb{R}^d$ ensures consistency with downstream layers. Any linear combination stays in $\mathbb{R}^d$; convex combinations additionally stay within the convex hull of the inputs. This is the foundation of attention, which safely mixes information from multiple sources.

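Centering removes bias: In regression, PCA, and statistical analysis, centering removes the mean from the data. In least squares this is equivalent to fitting an intercept: the slope fitted on centered data matches the slope of a model that includes an intercept term. Without centering (and without an intercept), estimates are distorted by the data offset. Applying $P$ performs this preprocessing as a single linear map; the mean is computed implicitly inside $Px$.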
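
One concrete reading of this point, for simple linear regression: the slope from centered data equals the slope from a fit with an explicit intercept. The sketch below uses synthetic data (true slope 2.0, offset 5.0, small noise), all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 * x + 5.0 + 0.1 * rng.standard_normal(n)   # synthetic data

# Fit 1: slope and intercept on the raw data.
A = np.column_stack([x, np.ones(n)])
(slope_raw, intercept_raw), *_ = np.linalg.lstsq(A, y, rcond=None)

# Fit 2: slope only, on centered x and y (no intercept column).
x_c = x - x.mean()
y_c = y - y.mean()
slope_centered = (x_c @ y_c) / (x_c @ x_c)

# Fit 3: slope only on raw data, no centering and no intercept.
slope_no_center = (x @ y) / (x @ x)

print("with intercept:        ", slope_raw)
print("centered, no intercept:", slope_centered)
print("raw, no intercept:     ", slope_no_center)   # pulled away from 2.0 by the offset
```

The intercept can be recovered afterwards as `y.mean() - slope_centered * x.mean()`.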

Subspaces structure learning: By projecting onto subspaces (zero-mean, orthogonal to noise, low-rank), we encode inductive bias: geometric constraints that guide learning. PCA projects onto the principal subspace (max variance); regression projects onto the column space (linear constraint). These projections are not approximations; they're structural choices that enable efficient algorithms.

Numerical efficiency: Constructing and applying projection matrices like $P$ is $O(n^2)$ dense, but structure can be exploited. For instance, $P = I - \frac{1}{n} \mathbf{1} \mathbf{1}^\top$ is a rank-1 update, so computing $Px$ requires only $O(n)$ operations (not $O(n^2)$) via direct centering: $Px = x - \bar{x} \mathbf{1}$.
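
The same shortcut applies to a data matrix with examples as rows: left-multiplying by $P$ centers every column. A minimal sketch comparing the two routes; the function names `center_explicit` and `center_direct` are hypothetical.

```python
import numpy as np

def center_explicit(X):
    """Center columns by building P = I - (1/n) 1 1^T. Needs O(n^2) memory."""
    n = X.shape[0]
    one = np.ones((n, 1))
    P = np.eye(n) - (one @ one.T) / n
    return P @ X

def center_direct(X):
    """Center columns by subtracting column means. No n-by-n matrix is formed."""
    return X - X.mean(axis=0, keepdims=True)

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 3)) + 4.0   # rows are examples, columns are features

X1 = center_explicit(X)
X2 = center_direct(X)
print("max difference:", np.abs(X1 - X2).max())
```

For `n = 200` the explicit route already stores a 200-by-200 matrix to do the work of three column means.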

ML Examples and Patterns

Centering in PCA:

import numpy as np
from numpy.linalg import svd

X = np.random.randn(100, 10)  # 100 samples, 10 features
X_centered = X - X.mean(axis=0)  # Center each column
U, s, Vt = svd(X_centered, full_matrices=False)
# Principal components: Vt[0], Vt[1], ... (rows of Vt)

Convex combinations in attention:

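def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))  # shift by the max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, scale=1.0):
    scores = (Q @ K.T) * scale  # scaled dot-product attention uses scale = 1/sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row: non-negative weights summing to 1
    output = weights @ V  # each output row is a convex combination of the rows of V
    return output, weights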
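
A self-contained usage sketch that checks the convexity claim. The shapes (2 queries, 5 keys) and the random inputs are toy values chosen for illustration, and the `softmax` here is a plain NumPy version written for this sketch.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
Q = rng.standard_normal((2, 4))   # 2 queries, dimension 4
K = rng.standard_normal((5, 4))   # 5 keys
V = rng.standard_normal((5, 3))   # 5 values, dimension 3

weights = softmax((Q @ K.T) / np.sqrt(K.shape[1]))
output = weights @ V

rows_sum_to_one = np.allclose(weights.sum(axis=1), 1.0)
nonnegative = bool(np.all(weights >= 0))

# Convexity consequence: each output coordinate lies between the
# smallest and largest value of that coordinate across the rows of V.
bounded = bool(np.all(output >= V.min(axis=0) - 1e-12) and
               np.all(output <= V.max(axis=0) + 1e-12))

print(rows_sum_to_one, nonnegative, bounded)
```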

Standardization for neural networks:

def standardize(X):
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-8
    return (X - mean) / std

Batch normalization (mini-batch approximation):

def batch_norm(X, gamma, beta, epsilon=1e-5):
    mean = X.mean(axis=0)
    var = X.var(axis=0)
    X_norm = (X - mean) / np.sqrt(var + epsilon)
    return gamma * X_norm + beta  # Learned scale/shift
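
A quick check that the normalization step does what the comment says. This sketch repeats the function so it runs on its own; the mini-batch (64 examples, 8 features, mean 10, standard deviation 3) is made up for illustration.

```python
import numpy as np

def batch_norm(X, gamma, beta, epsilon=1e-5):
    mean = X.mean(axis=0)               # per-feature mean over the batch
    var = X.var(axis=0)                 # per-feature variance over the batch
    return gamma * (X - mean) / np.sqrt(var + epsilon) + beta

rng = np.random.default_rng(4)
X = 3.0 * rng.standard_normal((64, 8)) + 10.0

# With gamma = 1 and beta = 0 the output is just the standardized batch.
Y = batch_norm(X, gamma=np.ones(8), beta=np.zeros(8))
col_means = Y.mean(axis=0)
col_stds = Y.std(axis=0)

print("max |mean|:   ", np.abs(col_means).max())
print("max |std - 1|:", np.abs(col_stds - 1.0).max())   # slightly off 1 because of epsilon
```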

Connection to Linear Algebra Theory

Affine and convex geometry: Points in affine spaces (lines, planes, hyperplanes) are expressed via affine combinations. Convex combinations constrain us to convex setsâline segments, polytopes, convex hulls. In machine learning, convex sets encode feasible regions (e.g., simplex for probabilities, unit ball for regularization).

Orthogonal complements: The zero-mean subspace $S$ is the orthogonal complement of $\text{span}(\mathbf{1})$. Any vector $x$ decomposes uniquely as: \[ x = \bar{x} \mathbf{1} + (x - \bar{x} \mathbf{1}) = \text{const} + \text{centered}, \] where the constant part is parallel to $\mathbf{1}$ and the centered part is orthogonal. This orthogonal decomposition is the foundation of ANOVA (variance partitioning) and factor analysis.
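
A numerical check of this decomposition on a small made-up vector. Because the two parts are orthogonal, their squared norms add up to the squared norm of $x$ (Pythagoras), which is the identity behind variance partitioning.

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])   # arbitrary example vector
n = x.size
one = np.ones(n)

const_part = x.mean() * one      # component along the constant vector 1
centered_part = x - const_part   # component in the zero-mean subspace

inner = const_part @ centered_part    # should vanish: the parts are orthogonal
lhs = x @ x                           # squared norm of x
rhs = const_part @ const_part + centered_part @ centered_part

print("inner product:", inner)
print("||x||^2 =", lhs, " sum of parts =", rhs)
```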

Spectral properties of projection matrices: The eigendecomposition of $P = I - \frac{1}{n} \mathbf{1} \mathbf{1}^\top$ reveals:
- Eigenvalue 0 with eigenvector $\mathbf{1}$ (the constant direction, which spans the null space)
- Eigenvalue 1 with eigenvectors orthogonal to $\mathbf{1}$ (the range, which is the zero-mean subspace)
- $\text{rank}(P) = n - 1$, $\text{trace}(P) = n - 1$

This spectral structure is key to understanding PCA, where non-zero singular values correspond to directions of maximum variance.

Gram-Schmidt orthogonalization: Centering is the first Gram-Schmidt step when the constant vector $\mathbf{1}$ is taken as the first basis direction. Orthogonalizing a vector $e_1$ against $\mathbf{1}$ gives $e_1 - \bar{e}_1 \mathbf{1}$ (the centered vector), which is then normalized. Subsequent vectors are orthogonalized against both the constant direction and the previous directions.
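
A sketch of this connection, assuming the constant vector $\mathbf{1}$ is placed first in the list handed to Gram-Schmidt. The helper `gram_schmidt` is a plain classical implementation written for illustration, and the vector `e` is arbitrary.

```python
import numpy as np

def gram_schmidt(vectors):
    """Classical Gram-Schmidt; returns the orthonormal vectors as rows."""
    basis = []
    for v in vectors:
        w = v.astype(float).copy()
        for q in basis:
            w = w - (q @ v) * q          # remove the component along q
        basis.append(w / np.linalg.norm(w))
    return np.array(basis)

n = 5
one = np.ones(n)
e = np.array([2.0, 7.0, 1.0, 8.0, 2.0])   # arbitrary non-constant vector

Q = gram_schmidt([one, e])

# Direct centering followed by normalization.
centered = e - e.mean()
centered_unit = centered / np.linalg.norm(centered)

print("second Gram-Schmidt vector equals normalized centered e:",
      np.allclose(Q[1], centered_unit))
```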

Numerical and Implementation Notes

Shape discipline: Always check:
- $E_a, E_b \in \mathbb{R}^d$: Linear combination $v = \alpha E_a + (1-\alpha) E_b$ has shape $(d,)$
- $\mathbf{1} \in \mathbb{R}^{n \times 1}$: Outer product $\mathbf{1} \mathbf{1}^\top$ has shape $(n, n)$
- $P \in \mathbb{R}^{n \times n}$: Projection $Px$ has shape $(n,)$ for $x \in \mathbb{R}^n$

Gotcha 1: Affine vs. convex vs. linear combinations. Linear allows any weights; affine requires weights to sum to 1 (but they can be negative); convex requires non-negative weights summing to 1. Mixing them up leads to invalid interpretations (e.g., claiming $v = 2 E_a - E_b$ is a convex combination; it is affine, not convex).
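
A small helper makes the three cases concrete. The function name `combination_type` and the tolerance are hypothetical choices for this sketch; it reports the most specific label that applies.

```python
import numpy as np

def combination_type(weights, tol=1e-12):
    """Classify weights as 'convex', 'affine', or 'linear' (most specific label wins)."""
    w = np.asarray(weights, dtype=float)
    sums_to_one = abs(w.sum() - 1.0) <= tol
    if sums_to_one and np.all(w >= -tol):
        return "convex"    # non-negative and sums to 1
    if sums_to_one:
        return "affine"    # sums to 1, but some weight is negative
    return "linear"        # no constraint satisfied beyond being real numbers

print(combination_type([0.35, 0.65]))   # convex
print(combination_type([2.0, -1.0]))    # affine
print(combination_type([1.0, 1.0]))     # linear
```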

Gotcha 2: Centering matrix rank. $P = I - \frac{1}{n} \mathbf{1} \mathbf{1}^\top$ has rank $n-1$, not $n$. This means $P$ has a non-trivial null space (the constant vector $\mathbf{1}$). Rank-deficiency must be respected in numerical algorithms (e.g., avoid inverting $P$; use pseudoinverse or projection directly).

Gotcha 3: Numerical stability of centering. When values are large relative to their spread, accumulating $\bar{x} = \frac{1}{n} \sum_i x_i$ in floating point loses precision, and the rounding error in the mean carries over to every centered value. Use Welford's algorithm for streaming data, or two-pass centering:

# Pass 1: subtract the mean (NumPy typically sums arrays pairwise, which limits rounding error)
x_centered = x - np.mean(x)
# Pass 2: subtract the small residual mean left over by rounding in pass 1
x_centered = x_centered - np.mean(x_centered)
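
For data that arrives one value at a time, the running-mean update from Welford's algorithm avoids ever forming a large sum. This is a minimal sketch of the mean update only (the variance part is omitted); the function name and the test data (unit-scale noise on an offset of 1e6) are made up for illustration.

```python
import numpy as np

def welford_mean(values):
    """Running mean via the update mean += (x - mean) / k."""
    mean = 0.0
    for k, value in enumerate(values, start=1):
        mean += (value - mean) / k
    return mean

# Small fluctuations on top of a large offset.
rng = np.random.default_rng(5)
x = 1e6 + rng.standard_normal(1000)

m = welford_mean(x)
x_centered = x - m

print("running mean - np.mean:", m - np.mean(x))
print("mean after centering:  ", x_centered.mean())
```

The Python loop is slow for large arrays; it is shown for the update rule, not as a replacement for `np.mean` on data that fits in memory.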

Gotcha 4: Outer product vs. matrix multiply. one @ one.T computes the full $n \times n$ outer product (requires $O(n^2)$ memory). For large $n$, avoid constructing $P$ explicitly; instead, apply centering directly via x - x.mean(). This is $O(n)$ memory and just as accurate.

Gotcha 5: Symmetry and stability of $P$. The centering matrix is symmetric, so its eigenvalues are real, and idempotent ($P^2 = P$), so they are 0 or 1. Consequently $\|Px\|_2 \le \|x\|_2$: applying $P$ never amplifies a vector or the rounding errors it carries.

Numerical and Shape Notes

Verification checks:

Convex combinations: For $\alpha \in [0, 1]$, plot $v(\alpha)$ on the line segment between $E_a$ and $E_b$. Verify $v(0) = E_b$, $v(1) = E_a$, $v(0.5)$ is midpoint.
Centering properties:
- $\mathbf{1}^\top P x = 0$ (sum is zero)
- $\text{mean}(Px) = 0$ (mean is zero)
- $P^2 = P$ (idempotency)
- $P^\top = P$ (symmetry)
Rank deficiency: $\text{rank}(P) = n - 1$. Verify via SVD: U, s, Vt = svd(P), count non-zero singular values.

Cost analysis:

Convex combination: $O(d)$ operations for vectors in $\mathbb{R}^d$
Centering matrix construction: $O(n^2)$ memory for full matrix storage
Centering application (direct): $O(n)$ operations via $x - \bar{x} \mathbf{1}$
Centering application (via $P$ matrix): $O(n^2)$ operations if materializing $P$

For large $n$, prefer direct centering over explicit $P$.

Memory efficiency:

Full matrix $P$: $O(n^2)$ storage
Implicit representation: $O(n)$ storage (just $n$ values)
Use: x_centered = x - x.mean() instead of x_centered = P @ x

Pedagogical Significance

This example is the foundational demonstration of vector space and subspace structure:

Key takeaways:

Linear combinations are fundamental: Any point in a vector space can be expressed as a weighted sum of basis vectors.
Convexity constrains combinations: Weights summing to 1 (with non-negativity) ensure convex sets (line segments, polytopes).
Projection enforces constraints: Orthogonal projection onto subspaces (like mean-zero) is efficient and numerically stable.
Centering is ubiquitous in ML: Nearly all preprocessing involves centering (PCA, normalization, regression).
Projection matrices are elegant: Symmetric, idempotent matrices elegantly encode geometric operations.

Common misconceptions:

"Convex combinations are just linear combinations": Not quite. Every convex combination is a linear combination, but convexity additionally requires non-negative weights summing to 1, which keeps the result inside the convex hull of the inputs.
"Centering is optional": No. It's essential for PCA, standardization, and proper statistical inference.
"Projection matrices are slow": No. A dense projection costs $O(n^2)$, but structure enables efficient algorithms (e.g., applying the centering matrix as $x - \bar{x} \mathbf{1}$ in $O(n)$).
"Interpolation is approximate": No. Convex combinations are exact (for finite-dimensional spaces).

Connection to other examples: Example 70 (Projection), Example 73 (Eigenvectors), Example 75 (SVD/PCA), Example 76 (Least Squares), Example 80 (Attention).

Why pedagogically powerful: It isolates two core geometric operations, linear combinations and projection, in minimal code. The convex interpolation shows intuition (moving between points on a line segment); the centering projection shows power (enforcing a zero-sum constraint via matrix multiplication). These operations appear throughout ML: attention (convex combinations), PCA (projection + decomposition), regression (centering + solving). Students see that vector spaces are not abstract; they're the foundation of modern ML's geometric operations. This is the gateway to understanding the subspaces, projections, and decompositions that dominate all subsequent chapters.

Shape discipline: check dimensions before manipulating formulas.
Numerical note: prefer stable primitives (lstsq, QR/SVD, Cholesky for SPD) over explicit inverses.
Interpretation: relate algebraic steps to geometry (subspaces, projections) and to ML behavior (generalization, stability).

Connection to Broader Examples

This example connects to many other examples in the book:

Example 70 (Projection): Orthogonal projection generalizes centering to arbitrary subspaces; $P$ is a special case of projecting onto $\ker(\mathbf{1}^\top)$.
Example 73 (Eigenvectors): Eigenvectors of $P$ reveal subspace structure; the zero eigenvalue corresponds to the constant direction $\mathbf{1}$.
Example 75 (PCA/SVD): PCA starts by centering data: $X_c = X - \bar{X}$, then computes SVD of $X_c$. The centering matrix $P$ formalizes this preprocessing.
Example 76 (Least Squares): Normal equations assume centered data for simpler form; explicit centering via $P$ avoids fitting an intercept.
Example 77 (Cholesky): Covariance matrix $\frac{1}{n} X_c^\top X_c$ requires centered data; Cholesky decomposition assumes SPD matrix from centered data.
Example 79 (Sparse): Sparse centering (for large $n$) exploits low-rank structure of $\frac{1}{n} \mathbf{1} \mathbf{1}^\top$.
Example 80 (Attention): Attention weights are convex combinations ($\sum_j A_{ij} = 1$, $A_{ij} \ge 0$) of value vectors, generalizing soft interpolation to multi-way mixtures.
Chapter 5 (Inner Products): Orthogonality and projections are defined via inner products; $\mathbf{1}^\top x = 0$ expresses orthogonality to the constant vector.
Chapter 6 (PCA): The first PC captures maximum variance; centering ensures zero-mean PCs.
Chapter 12 (Least Squares): Centering simplifies regression: no intercept is needed if the data is centered.