ex1.ai

Example number

Example slug

example_97_vector_spaces_subspace_projection_embeddings_zeromean

Background

Vector spaces are abstract mathematical structures, but they became concrete in machine learning when embeddings emerged as learned representations. In the 1990sâ2000s, latent semantic indexing (LSI) and topic models (LDA) treated documents as vectors in latent spaces. By 2010s, neural network embeddings (Word2Vec, GloVe for NLP; deep CNNs for vision) showed that vector representations learned by gradient descent captured semantic meaning. Vectors interpolate: the path from âcatâ embeddings to âdogâ embeddings passes through âanimal-likeâ concepts. This motivated understanding embeddings as points in vector spaces. Subspace projection has roots in classical statistics: centering data (zero-mean projection) dates to Fisher and Principal Component Analysis (Pearson 1901, Hotelling 1933). Modern ML heavily uses projections: PCA projects onto principal subspace; linear regression projects responses onto predictor column space; matrix factorization projects onto low-rank subspaces. Orthogonal projectorsâmatrices $P$ with $P^2 = P$ and $P^T = P$âare ubiquitous. Understanding them geometrically (projecting perpendicular to a surface) and algebraically (kernels and column spaces) is essential to modern linear algebra.

Purpose

Master the vector space abstraction and its most practical manifestation: subspace projection. Understand that embeddings (learned representations of text, images, users) live in vector spaces where linear combinations stay closedâinterpolating between embeddings produces valid embeddings. Learn to project vectors onto subspaces, a fundamental operation in machine learning: centering data (zero-mean subspace), extracting latent factors (PCA subspace), and constraining solutions (least squares projects onto column space). Build intuition for the geometry underlying all ML: high-dimensional data lives in lower-dimensional subspaces, and projections extract the relevant structure. Gain hands-on experience with the orthogonal projector $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$, a matrix youâll encounter repeatedly (centering data, computing residuals, regularization). Connect linear algebra theory to concrete ML patterns: embeddings are vectors, combinations are interpolations, projections are filters.

Problem

Using embeddings as vectors, show closure properties. 1) Given embeddings e(a), e(b) â R^d, compute v = Î± e(a) + (1-Î±) e(b) and confirm it stays in R^d. 2) Project a length-10 vector onto the zero-mean subspace S = {x: 1^T x = 0} and verify the projected vector sums to ~0.

Solution (Math)

Embeddings live in $\mathbb{R}^d$, a vector space, so any linear combination

\[ v = \alpha e(a) + $1-\alpha$e(b) \]

remains in $\mathbb{R}^d$.

The zero-mean subspace $S=\{x:\mathbf{1}^T x=0\}$ is a subspace (kernel of $\mathbf{1}^T$). The orthogonal projector onto $S$ is

\[ P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T,\qquad x_{\text{proj}} = Px, \]

and $\mathbf{1}^T x_{\text{proj}}=0$.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Numerical Implementation Details

Linear combination cost: $v = \alpha e_a + (1-\alpha) e_b$ requires $O(d)$ operations (two scalar multiplies, one addition per component). For embeddings in millions of dimensions, this is still cheap.
Centering matrix construction: $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$. Explicit outer product takes $O(n^2)$ time/space. For large $n$ (billions of rows), explicit construction infeasible. Implicit: $Px = x - \frac{1}{n}\mathbf{1}(\mathbf{1}^T x)$ costs $O(n)$ (compute mean, broadcast subtract).
Orthogonal projection verification: $\mathbf{1}^T Px = 0$ checks projection result in subspace. Numerically: compute sum (should be near-zero). Use relative error: $|\mathbf{1}^T Px| / (|\mathbf{1}^T x| + \epsilon) < 10^{-10}$.
Subspace dimension: Zero-mean subspace in $\mathbb{R}^n$ has dimension $n - 1$ (one constraint removes one degree of freedom). Rank of $P$ is $n - 1$.
Eigenstructure: $P$ has eigenvalue 1 (multiplicity $n-1$, eigenvectors orthogonal to $\mathbf{1}$) and eigenvalue 0 (multiplicity 1, eigenvector $\mathbf{1}$). Eigenvalues determine projection behavior.
Projection geometry: Distance from $x$ to subspace $S$ is $\|(I - P)x\| = \frac{1}{\sqrt{n}} |\mathbf{1}^T x|$. Measures how âoff-meanâ $x$ is.
Idempotence and stability: $P^2 = P$ (projecting twice is same as once). Numerically stable: $P^2 x - Px = O(\epsilon \|x\|)$ (machine epsilon). Repeated projections donât amplify errors.
Computational complexity at scale: For $X \in \mathbb{R}^{n \times d}$, centering all rows costs $O(nd)$. For $n = 1\text{M}, d = 1\text{k}$: $10^9$ operations (< 1 sec). For $n = 1\text{B}, d = 1\text{M}$: $10^{15}$ operations (infeasible without distributed computing).

What This Example Demonstrates

Vector space closure under linear combination: Given embeddings $e_a, e_b \in \mathbb{R}^d$, any linear combination $v = \alpha e_a + (1-\alpha) e_b$ remains in $\mathbb{R}^d$. Vectors form a closed structure under addition and scalar multiplication. No âmagicââjust algebra.
Subspace as kernel: The zero-mean subspace $S = \{x : \mathbf{1}^T x = 0\}$ is the kernel (null space) of the linear functional $\mathbf{1}^T$. Understanding subspaces as kernels of linear maps is foundational.
Orthogonal projector formula: $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ projects any vector orthogonally onto $S$. Orthogonal means: projection residual $(I - P)x$ is perpendicular to $S$. Useful in least squares, residual analysis, regularization.
Projector properties: $P^2 = P$ (idempotent: applying twice is same as once). $P^T = P$ (symmetric: orthogonal projection). $\text{rank}(P) = \dim(S) = n - 1$ (for zero-mean subspace in $\mathbb{R}^n$). Rank determines effective dimensionality.
Geometry and algebra unified: Geometrically, projection is perpendicular drop onto a surface. Algebraically, itâs matrix-vector multiplication. Both perspectives are essential.
Verification via residual: $\mathbf{1}^T P x = \mathbf{1}^T (I - \frac{1}{n}\mathbf{1}\mathbf{1}^T) x = \mathbf{1}^T x - \frac{1}{n}\mathbf{1}^T \mathbf{1} \mathbf{1}^T x = \mathbf{1}^T x - n \cdot \frac{1}{n} \mathbf{1}^T x = 0$. Projection always satisfies the constraint.
ML pattern: center before analyzing: Centering (projecting onto zero-mean subspace) is preprocessing step in PCA, regression, clustering. Removes trivial variance from location; isolates structure.
Interpolation and convexity: Linear combination with $\alpha \in [0, 1]$ interpolates between vectors. Convex combinations preserve structureâcrucial for regularization, mixup data augmentation, adversarial robustness.

Notes

Numerical and Shape Notes

Shapes (vectors vs matrices): For a length-$n$ vector, use a 1D array of shape $(n,)$. The all-ones vector $\mathbf{1} \in \mathbb{R}^n$ has shape $(n,)$. The projector $P = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top \in \mathbb{R}^{n\times n}$ is $(n, n)$. For a data matrix $X \in \mathbb{R}^{n\times d}$ (rows = examples), centering acts along columns: Xc = X - X.mean(axis=0).
Broadcasting: Prefer implicit centering over explicit outer products. Instead of forming $\mathbf{1}\mathbf{1}^\top$, compute mu = X.mean(axis=0) and do X - mu. For vectors, compute x - x.mean() rather than (I - (1/n)11^T) @ x.
Column vs row conventions: NumPy treats 1D arrays as neither column nor row. When you need a column, reshape to $(n, 1)$; for a row, $(1, n)$. Be explicit with axis in reductions (e.g., mean(axis=0) for column-wise means on $X \in \mathbb{R}^{n\times d}$).
Stability: If $|x_i| \gg |x_i - \bar{x}|$, subtraction can lose precision. Use two-pass mean or Welfordâs algorithm for very large $n$. Prefer float64 for reproducibility of centering and projection checks.
Sanity checks: After projection, verify abs(x_proj.sum()) is near 0; for matrices, check $\|P^2 - P\|_F$ and $\|P^\top - P\|_F$ are ~ machine epsilon on small $n$.
Complexity and memory: Explicit $P$ costs $O(n^2)$ memory/time. Implicit application $Px = x - \tfrac{1}{n}\mathbf{1}(\mathbf{1}^\top x)$ costs $O(n)$ time and $O(1)$ extra memory.
Edge cases: For $n=1$, the zero-mean subspace is trivial and $P=0$. Constant columns in $X$ become zero after centering; downstream algorithms should handle zero-variance features.

Connection to Broader Examples

PCA (Ex 11, Ex 91): PCA first centers data (projects onto zero-mean subspace), then eigendecomposes covariance. Centering essential because covariance is defined w.r.t. mean.
Least squares projections (Ex 86, Ex 92): Least squares projects response $b$ onto column space of $X$. Residual orthogonal to column space. Framework: projection onto constraint subspace.
Rank and nullspace (Ex 87): Zero-mean subspace is the nullspace (kernel) of $\mathbf{1}^T$. Rank of projector $P$ is $\dim(S) = n - 1$.
Orthogonality and projectors (Ex 86): Orthogonal projectors satisfy $P^T = P$ and $P^2 = P$. Residuals orthogonal to subspace.
SVD and low-rank (Ex 10, Ex 90): SVD reveals all subspaces (column, row, null). Truncated SVD projects onto low-rank subspace. Centering interacts with SVD (aligns with covariance structure).
Conditioning (Ex 94): Condition number of $X^T X$ affects regression stability. Centering $X$ affects conditioning. Ill-conditioned means nearly collinear columns; projection is numerically unstable.
Regularization (Ex 93): Ridge regression projects least-squares solution onto small-norm ball (constrained subspace). Cholesky factors solution to regularized system.
Sparse matrices (Ex 95): Centering sparse data adds dense column (mean vector). Workaround: store mean separately, center implicitly.
Attention (Ex 96): Scaled dot-product attention is bilinear form on queries/keys. Projecting onto constrained subspace (softmax normalization) is analogous to centering.

Foundational Era (1700sâ1800s): From Geometry to Abstraction

Vector spaces emerged from geometry. In 1637, Descartes introduced coordinate geometry, representing points as ordered pairs (later triples, n-tuples). This bridged algebra and geometry: geometric shapes became algebraic equations. By 1800s, mathematicians recognized the algebraic structure: sets closed under addition and scalar multiplication. Grassmann (1844) formalized the concept of linear combinations and subspaces in his âAusdehnungslehreâ (Calculus of Extensions). Grassmann showed that any element of a vector space can be expressed as a linear combination of basis elements. His work was ahead of its timeâlargely ignored until the 1900s.

Vector Spaces Formalized (Early 1900s)

Peano (1888) axiomatically defined vector spaces (though not called that yet). The modern axiomsâclosure, associativity, commutativity, identity, inverse, distributivityâcrystallized by 1920. Banach and Hilbert developed functional analysis (1920sâ1930s), extending vector space concepts from finite to infinite dimensions. This was profound: functions themselves could be treated as vectors (function spaces).

Subspaces and Projections (1920sâ1950s)

Subspaces (subsets closed under linear combinations) naturally emerged as the fundamental building block. Hilbert spaces (complete inner product spaces) enabled rigorous treatment of orthogonal complements and orthogonal projections. The formula for orthogonal projection onto a subspace: \[ ext{proj}_S(x) = \frac{\langle x, u_1 \rangle}{\|u_1\|^2} u_1 + \cdots + \frac{\langle x, u_k \rangle}{\|u_k\|^2} u_k \] (where $u_1, \ldots, u_k$ are basis vectors) became standard by 1950. Gram-Schmidt orthogonalization (1907, formalized by Schmidt 1908) provided an algorithm to construct orthonormal bases, enabling stable projection computations.

Statistical Applications: Centering and Covariance (1900â1950)

In statistics, centering (subtracting the mean) has deep roots. Karl Pearson (1901) introduced Principal Component Analysis (PCA), which inherently requires centering: the covariance matrix is defined w.r.t. the mean. Centering is projection onto the zero-mean subspace. By projecting data, PCA computes the principal subspace (spanned by eigenvectors of the covariance matrix). This connectionâcentering as projection onto a constraint subspaceâunified statistical and geometric thinking.

Least Squares and Regression (Gauss 1809 â 1950s)

Gauss (1809) developed least squares for astronomy, minimizing $\|b - X\beta\|^2$ (though not phrased that way). The geometric insightâthe least squares solution is the orthogonal projection of $b$ onto the column space of $X$âemerged gradually. By the 1950s, Householder and Givens developed numerically stable methods (Householder reflections, Givens rotations) for computing projections via QR decomposition. This made least squares practical and stable for large matrices.

Matrix Computations and Numerical Stability (1950â1980)

Wilkinson (1961) and the LAPACK project (1989) systematized numerically stable algorithms for projections, QR, SVD, and eigendecomposition. A key insight: explicit inversion is unstable; use stable primitives (QR for least squares, SVD for low-rank approximations). By 1980, projection-based methods were cornerstones of numerical linear algebra.

Machine Learning and Embeddings (1990sâPresent)

Neural networks (1980sâ1990s) learned to map data into low-dimensional representations (embeddings). The breakthrough insight: learned embeddings lie in lower-dimensional subspaces, and these subspaces capture semantic structure. Word2Vec (Mikolov et al.Â 2013) and GloVe (Pennington et al.Â 2014) produced word embeddings where semantic relationships (king - man + woman â queen) became linear operations. This vindicated linear combinations of embeddings as a fundamental operation.

Deep Learning Scaling: Modern deep learning routinely projects data: - Batch normalization (Ioffe & Szegedy 2015): Normalizes to zero-mean, unit variance (projection onto constrained subspace). Dramatically stabilized training of deep networks. - Layer normalization: Per-layer centering and scaling. - Residual connections: Projecting intermediate representations. - Attention mechanisms: Projections via learned transformations.

By 2020, projection-based thinking is ubiquitous in ML. Every preprocessing pipeline centers data. Every regularization method projects onto a constraint subspace.

Real-World Applications of Vector Spaces and Projections

Data Preprocessing: Centering and scaling are the first step in 99% of ML pipelines (scikit-learnâs StandardScaler applies these projections). Essential for numerical stability and generalization.
PCA and Dimensionality Reduction: Project high-dimensional data onto lower-dimensional subspace (principal subspace). Used in exploratory data analysis, noise reduction, compression. Ubiquitous in computer vision, NLP, genomics.
Regression and Least Squares: Linear regression projects response onto predictor column space. Regularized regression (ridge, LASSO, elastic net) projects onto constrained subspaces. Applied in finance, healthcare, social sciences.
Embeddings and Representation Learning: Neural networks learn embeddings by implicitly projecting onto learned subspaces. Word embeddings (BERT, GPT), image embeddings (ResNet), user embeddings (collaborative filtering) all rely on subspace structure.
Clustering: K-means, hierarchical clustering, and Gaussian mixture models implicitly assume data lives in lower-dimensional subspaces. Clustering algorithms project data onto subspaces defined by cluster centers or mixture components.
Matrix Factorization and Recommender Systems: Factorize user-item matrix as $X \approx UV^T$. Users and items projected onto low-rank latent subspace. Applied in Netflix, Spotify, Amazon recommendations.
Signal Processing and Compression: JPEG compression: project image onto subspace of low-frequency components (via discrete cosine transformâessentially a projection). MP3 audio: project onto psychoacoustic subspace.
Computer Vision: Convolutional neural networks learn to project images onto feature subspaces (edges, textures, objects). Object detection, segmentation, and face recognition all exploit subspace structure.
NLP and Language Models: Transformer models (BERT, GPT) project text into subspaces where semantically similar tokens cluster. Attention mechanism projects queries onto key subspaces. Enabling billion-parameter language models.
Optimization and Constrained Learning: Interior-point methods (convex optimization) use projections onto feasible regions. Projected gradient descent: gradient step, then project onto constraint set. Applied in finance (portfolio optimization), operations research, machine learning (constrained classification, fairness constraints).
Quantum Computing: Quantum states are vectors in Hilbert spaces. Quantum algorithms (Grover, Shor) manipulate superpositions (linear combinations) and projections (measurements).
Multimodal Learning: Vision-language models (CLIP, GPT-4V) project images and text into a shared embedding space. Cross-modal retrieval via projections.

Challenges and Open Questions

Curse of dimensionality: High-dimensional subspaces are sparse; data points become nearly equidistant. Projections can lose information. Remedies: regularization, feature selection, inductive biases.
Nonlinear structure: Real data often lies in nonlinear (curved) manifolds, not linear subspaces. Linear projections can be suboptimal. Remedy: nonlinear dimensionality reduction (autoencoders, t-SNE, UMAP).
Computational scalability: Explicit projection (constructing projection matrix) is $O(n^2)$ time and space. For billions of examples, infeasible. Remedy: implicit projection (e.g., centering via mean subtraction), sketching, streaming algorithms.
Statistical-computational tradeoff: Projecting onto very low-rank subspace loses information. Remedy: regularization (constrain rank while minimizing loss), cross-validation to select rank.
Interpretability: Projections reduce dimensionality, but why do certain directions matter? Interpretability of principal components, embeddings, remains open.

Why Projections Matter for Linear Algebra Education

Projections are the unifying concept in linear algebra. They appear in: - Least squares: Project $b$ onto column space of $X$. - PCA: Project onto principal subspace. - Regression residuals: $(I - P)x$ is orthogonal to column space. - Regularization: Project onto small-norm ball. - SVD: Truncated SVD projects onto rank-$k$ subspace. - QR: Gram-Schmidt projects iteratively.

Understanding projections geometrically (perpendicular drop), algebraically (idempotent, symmetric matrices), and computationally ($O(n)$ for implicit centering) is essential for applied mathematics and ML. Projections encode structure: what directions matter, what constraints apply, what information is preserved. This exampleâsimple projection onto zero-mean subspaceâis the gateway to understanding how all of linear algebra serves machine learning.

Connection to Broader Mathematics and Physics

Functional analysis: Projection onto Banach/Hilbert spaces; weak and strong convergence.
Optimization: Projected gradient descent; proximal methods project onto constraint sets.
Statistics: Maximum likelihood estimation; constraints (e.g., probability simplex) are projections.
Physics: Quantum mechanics: projectors are observables; measurement projects state onto eigenspace.
Differential geometry: Geodesic projections; submanifolds; intrinsic coordinates.

Projections are fundamental across mathematics, physics, and computer science. Mastering them in the linear algebra context provides intuition for their applications everywhere.

Numerical and Shape Notes

Comments