Example number
99
Example slug
example_99_basis_coordinates_onehot_lookup_and_pca_changeofbasis
Background

Historical arc:

The concept of basis vectors dates to Euler and Lagrange (1700s), who formalized coordinate systems in analytic geometry. By the early 1800s, Gauss and Cauchy understood that changing from one basis to another was just a linear transformation—the foundation of linear algebra.

In the 1920s–1930s, Fischer, Hotelling, and Eckart connected basis changes to statistics: Fisher’s Linear Discriminant Analysis (FDA) found optimal bases for classification, and Hotelling’s Principal Component Analysis (PCA) automatically extracted the basis that maximizes variance. This was revolutionary—instead of choosing bases by hand, you could let the data tell you which basis was most informative.

The 1960s–1970s brought Golub and Kahan’s SVD algorithm, making basis extraction (and low-rank approximation) computationally practical. By the 1980s–2000s, PCA became ubiquitous in statistics, signal processing, and machine learning.

Modern ML connection: - Neural embeddings (Word2Vec, 2013; GloVe, 2014) learn bases as row vectors of embedding matrices. Looking up word $j$ is $E e_j$—projecting onto the $j$-th canonical basis direction. - Autoencoders learn nonlinear bases via neural networks; the latent code $z$ is a coordinate in a learned basis. - Transformers (Vaswani et al., 2017) use learned query/key/value projections to compute soft bases and attend to relevant directions dynamically. - Vision Transformers and multimodal models (CLIP, 2021) apply learned basis extraction across image and text modalities simultaneously.

Core insight: Basis selection is a fundamental design choice in ML. The wrong basis leads to irrelevant features and poor generalization; the right basis (either fixed like PCA or learned like neural embeddings) exposes the true low-dimensional structure in high-dimensional data.

Purpose

ML-first framing: Basis vectors are coordinate systems. The canonical basis $e_j$ (one-hot vectors) reveals that embedding lookup is just selecting a column—a linear combination where only one coefficient is nonzero. PCA basis vectors are the principal directions extracted via SVD; projecting onto them reduces data to its most informative degrees of freedom.

Core concepts: - Canonical basis: $E e_j = E_{:,j}$ (select column $j$) - Change of basis: Apply SVD to get orthonormal $V$; coordinates are $z = V^\top (x - \mu)$ - Projection: Rank-$k$ reconstruction $\hat{x} = \mu + V_k V_k^\top (x - \mu)$ discards low-variance directions - Shape discipline: Track dimensions throughout: $X \in \mathbb{R}^{n \times d}$, $V_k \in \mathbb{R}^{d \times k}$, $z \in \mathbb{R}^k$

Why this matters: - Embedding lookup as matrix multiplication (connects discrete indexing to linear algebra) - PCA changes of basis (enables dimensionality reduction, visualization, compression) - Orthonormal bases (preserve norms; simplify numerical algorithms; enable stable computations) - Centering as projection (removes location degree of freedom; isolates intrinsic structure)

Learning objective: Understand that basis selection determines which directions are “visible” in data, and that SVD automatically finds the most informative basis for any dataset. Projecting onto a rank-$k$ basis gives the best $k$-dimensional approximation in squared error.

Problem

Use the standard basis (one-hot vectors) to show embedding lookup is matrix multiplication. Then compute PCA coordinates of one point in an orthonormal basis and interpret dimension as degrees of freedom captured.

Solution (Math)

If $E\in\mathbb{R}^{d\times |\mathcal{V}|}$ is an embedding matrix and $x=e_j$ is a one-hot basis vector, then

\[ Ex = E e_j \]

selects the $j$-th column of $E$.

For PCA with orthonormal basis $V_k$, coordinates are $z = V_k^T$x-$$ and reconstruction is $\hat x=\mu+V_k z$. The dimension $k$ counts retained degrees of freedom.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points

E = np.array([[ 1.,  2.,  3.],
              [ 0., -1.,  1.]])
x = np.array([0., 1., 0.])
print("E@x:", E@x, "expected:", E[:,1])

X = toy_pca_points(n=15, seed=2)
mu = X.mean(0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1
Vk = Vt[:k].T
x0 = X[0]
z = Vk.T @ (x0 - mu)
x_hat = mu + Vk @ z
print("z:", z)
print("x_hat:", x_hat, "x0:", x0)
Code Introduction

The core idea: Basis vectors define a coordinate system. The canonical basis $\{e_j\}$ is the simplest—$e_j$ selects column $j$. PCA basis is learned—SVD extracts principal directions from data. We illustrate both with a concrete example.

Numerical Implementation Details
  1. Canonical basis lookup: $E e_j$ costs $O(d)$ (copy one column). In practice, this is a single memory access $E[j]$ or E[:, j] in code—no arithmetic needed.

  2. Centering cost: $X_c = X - \mathbf{1} \mu^\top$ costs $O(nd)$ (broadcast subtract mean from each row). NumPy’s vectorization makes this one operation; naive loops are 10–100× slower.

  3. SVD cost: Thin SVD on $X_c \in \mathbb{R}^{n \times d}$ costs $O(\min(n,d)^2 \max(n,d))$ using randomized or Golub-Kahan algorithms. For $n \gg d$ (typical in ML), this is $O(n d^2)$; NumPy’s linalg.svd uses LAPACK DGESDD (divide-and-conquer), which is near-optimal.

  4. Coordinate transform: Computing $z = V_k^\top (x - \mu)$ costs $O(dk)$ for a single point, or $O(ndk)$ for $n$ points. Use (X_c @ V_k) (batch) instead of looping; NumPy broadcasts efficiently.

  5. Projection onto rank-$k$ subspace: $\hat{X} = X_c V_k V_k^\top$ costs $O(nkd)$. First multiply $X_c V_k$ ($O(ndk)$), then multiply by $V_k^\top$ ($O(ndk)$). Use dense BLAS (NumPy/PyTorch) for small $k$; for $k \gg d$, low-rank structure is lost.

  6. Orthonormality preservation: For orthonormal $V_k$ ($V_k^\top V_k = I_k$), the projection $\hat{X} = X_c V_k V_k^\top$ is numerically stable (condition number 1). Contrast with non-orthonormal bases, where repeated projections accumulate round-off error.

  7. Reconstruction error computation: $\text{error}^2 = \|X_c - \hat{X}\|_F^2 = \sum_{j > k} \sigma_j^2$ (sum of squares of discarded singular values). Compute from SVD without reconstructing $\hat{X}$: saves memory and time.

  8. Batch projection (typical ML): For dataset $X \in \mathbb{R}^{n \times d}$ and rank-$k$ basis, compute $Z = (X - \mu) V_k$ in one operation ($O(ndk)$). Use GPU acceleration (PyTorch, TensorFlow) for large $n, d$; even $n = 1M, d = 1k, k = 50$ takes milliseconds on GPU.

What This Example Demonstrates
  1. Canonical basis as selection: The one-hot vector $e_j$ (all zeros except $j$-th entry = 1) selects the $j$-th column of any matrix. $E e_j = E_{:,j}$ is the simplest basis, requiring zero arithmetic; embedding lookup $e_{\text{word}} = E e_j$ is just indexing.

  2. Basis as coordinate system: Any basis $V = [v_1, \ldots, v_d]$ transforms between data space and coordinate space. If $V$ is orthonormal, coordinates $z = V^\top x$ preserve inner products and norms.

  3. PCA basis extraction: SVD on centered data $X_c = U \Sigma V^\top$ yields orthonormal basis $V$ ordered by variance. First $k$ columns $V_{:,1:k}$ form the rank-$k$ PCA basis; remaining columns span the low-variance “noise” directions.

  4. Change of basis via SVD: For data $X \in \mathbb{R}^{n \times d}$, compute $X_c = X - \mu$ (center), then $[U, \Sigma, V^\top] = \text{SVD}(X_c)$. Coordinates in the PC basis are $Z = X_c V$; they decorrelate the data and sort by variance.

  5. Rank-$k$ projection as best approximation: Projecting onto the first $k$ PCs, $\hat{X} = X_c V_k V_k^\top$, minimizes $\|X_c - \hat{X}\|_F$ (Frobenius norm) over all rank-$k$ matrices. This is the Eckart-Young theorem—the geometric foundation of PCA.

  6. Centering as zero-mean projection: The centering step $X_c = X - \mathbf{1} \mu^\top$ projects data onto the subspace where coordinates sum to zero. It removes the trivial “location” degree of freedom, letting PCA focus on shape.

  7. Orthonormality preservation: For orthonormal basis $V$ ($V^\top V = I$), the transformation $z = V^\top x$ is an isometry—it preserves Euclidean distance and inner products. This makes orthonormal bases numerically stable and interpretable.

  8. Reconstruction from coordinates: Given coordinates $z \in \mathbb{R}^k$ in the rank-$k$ basis, reconstruction is $\hat{x} = \mu + V_k z$. If all data actually lies in a $k$-dimensional subspace, this is lossless; otherwise, the truncation error measures unexplained variance.

Notes

Part 1: Canonical Basis and One-Hot Indexing

The canonical basis $\{e_1, \ldots, e_d\}$ (where $e_j$ has a 1 in position $j$ and 0s elsewhere) has the remarkable property that $E e_j$ simply selects the $j$-th column of any matrix $E$. In embedding layers, this means word lookup is just matrix-vector multiplication—the discrete “find word $j$ in the vocabulary” operation is a linear map $e_{\text{word}} = E e_j$. This unifies symbolic indexing with numerical linear algebra, letting GPUs efficiently compute millions of lookups in parallel via batched matrix multiplication.

Part 3: Rank-$k$ Projection and Low-Rank Approximation

Projecting data onto the rank-$k$ basis, $\hat{X} = X_c V_k V_k^\top$, discards directions with small singular values (low variance). The reconstruction error is $\|X_c - \hat{X}\|_F^2 = \sum_{j > k} \sigma_j^2$. For $k = 1$, we retain only the top principal component (direction of maximum spread). For small $k \ll d$, we get aggressive compression, useful for visualization (2D/3D plot) and noise reduction; large $k$ near $d$ is lossless but offers no compression.

Why This Matters for ML

Basis selection is a critical inductive bias. Fixed bases (e.g., polynomial features, Fourier basis) commit to domain knowledge upfront; learned bases (PCA, neural embeddings) adapt to data. The right basis makes learning efficient (fewer parameters, faster convergence); the wrong basis forces the model to waste capacity on uninformative directions. Understanding basis selection deepens your intuition for feature engineering, dimensionality reduction, and model design.

ML Examples and Patterns

Embedding lookup: Word2Vec, GloVe, FastText: $e_{\text{word}} = E e_j$. The embedding matrix $E$ is learned to minimize loss (e.g., skip-gram); its columns form a semantic basis.

Autoencoders: Encoder learns basis $V$ (via neural network); $z = f_{\text{enc}}(x)$ are coordinates. Decoder reconstructs $\hat{x} = f_{\text{dec}}(z)$. Bottleneck width controls basis dimension.

PCA preprocessing: Reduce $d$-dimensional features to $k < d$ PCA coordinates, then pass to downstream model. Common in classical ML (SVM, k-NN); less common in deep learning (which learns its own basis).

Attention mechanism: Query/Key/Value projections compute soft bases dynamically. $Q, K \in \mathbb{R}^{n \times d_k}$ define a basis; attention weights select which basis directions are relevant for each position.

Transfer learning: Pre-trained models (BERT, ResNet, GPT-3) learn rich bases from large datasets. Fine-tuning on small downstream tasks reuses this basis, avoiding learning from scratch.

Connection to Linear Algebra Theory

Basis theory is central to linear algebra: - Rank-nullity theorem: For matrix $A \in \mathbb{R}^{m \times n}$, $\text{rank}(A) + \text{nullity}(A) = n$. Basis of column space has $\text{rank}(A)$ vectors; basis of null space has $\text{nullity}(A)$ vectors. - Fundamental Theorem of Linear Algebra: Four fundamental subspaces (column space, row space, null space, left null space) are partitioned by basis selection. - Spectral Theorem: Symmetric matrices have orthonormal eigenbasis. Covariance matrices $X^\top X$ have eigenbasis = PCA basis. - SVD as universal basis: Every matrix has SVD; the right singular vectors are an orthonormal basis for the column space (data basis), left vectors for the row space (dual basis).

Numerical and Implementation Notes

Centering is mandatory: Omitting centering gives wrong PCs. $X_c = X - \mathbf{1} \mu^\top$ (subtract mean) is a rank-1 modification; it projects onto the zero-mean hyperplane. Algebraically essential; numerically cheap ($O(nd)$).

SVD algorithm choice: np.linalg.svd uses LAPACK DGESDD (divide-and-conquer), which is fast and stable for $n, d$ up to ~10K. For larger problems, use randomized SVD (e.g., sklearn.decomposition.TruncatedSVD or scipy.sparse.linalg.svds).

Memory efficiency: Store $V_k$ (not $V$) and $\Sigma_k$ (not $\Sigma$); recompute coordinates on demand. For $k = 50, d = 1000, n = 1M$, storing all $U$ costs 40 GB; storing only $V_k$ costs 400 MB.

GPU acceleration: torch.linalg.svd and tf.linalg.svd are GPU-accelerated. Batch PCA on millions of points is practical; on CPU it can be slow.

Numerical and Shape Notes

  • Shapes to track: $X \in \mathbb{R}^{n \times d}$ (data), $\mu \in \mathbb{R}^d$ (mean), $X_c = X - \mathbf{1}\mu^\top \in \mathbb{R}^{n \times d}$ (centered), $U \in \mathbb{R}^{n \times d}, \Sigma \in \mathbb{R}^{d \times d}, V \in \mathbb{R}^{d \times d}$ (thin SVD).
  • Truncated shapes: $V_k \in \mathbb{R}^{d \times k}$ (first $k$ PCs), $Z = X_c V_k \in \mathbb{R}^{n \times k}$ (coordinates), $\hat{X} = Z V_k^\top \in \mathbb{R}^{n \times d}$ (reconstruction).
  • Single-point projection: $z = V_k^\top (x - \mu) \in \mathbb{R}^k$ (coordinates), $\hat{x} = \mu + V_k z \in \mathbb{R}^d$ (reconstructed point).
  • Orthonormality verification: Check V.T @ V ≈ I numerically; if not, SVD failed or precision was lost.

Pedagogical Significance

Understanding basis selection is a gateway to: 1. Dimensionality reduction: Why PCA works; how to choose $k$. 2. Least squares: Column space of $X$ is spanned by basis $V$ (if $X$ is tall). Normal equations project onto this basis. 3. Covariance structure: Basis of eigenvectors decorrelates covariance matrix (diagonalization). 4. Neural networks: Hidden layers learn basis transformations; deeper networks compose bases. 5. Geometric intuition: Basis changes correspond to rotations and scalings; they reveal hidden structure (clusters, manifolds, trends).

This example is a capstone to Chapters 1–3 (vector spaces, span, basis, dimension). It ties algebra (matrix operations) to geometry (subspaces, projections) to ML (feature extraction, compression, learning).

Connection to Broader Examples
  • Example 86 (Orthogonality): Orthogonal bases preserve norms and inner products. PCA basis $V$ is orthonormal by SVD construction; this orthogonality makes computations numerically stable and coordinates interpretable (uncorrelated components).

  • Example 87 (Rank & Nullspace): Basis dimension equals rank. A $d$-dimensional basis spans the full column space; restricting to rank-$k$ basis spans a $k$-dimensional subspace. The orthogonal complement (nullspace of $V_k^\top$) has dimension $d - k$.

  • Example 88 (Eigenvalues & Eigenvectors): SVD and eigendecomposition are closely related. For $X^\top X$ (covariance matrix), eigenvectors are the same as right singular vectors of $X$; eigenvalues are squares of singular values. PCA extracts eigenvectors as basis.

  • Example 90 (SVD & Conditioning): SVD directly computes orthonormal bases. The condition number $\kappa = \sigma_1 / \sigma_d$ measures how stretched the data ellipsoid is; basis selection (choosing which $k$ singular vectors) depends on the singular value spectrum.

  • Example 91 (PCA): PCA is basis selection. This example implements the core PCA operation: SVD finds the basis; coordinates measure each point’s projection onto that basis.

  • Example 92 (Least Squares): Normal equations $X^\top X \beta = X^\top y$ project $y$ onto the column space of $X$. The basis of this column space determines which solutions are possible; adding regularization (ridge) shrinks solutions along low-variance basis directions.

  • Example 97 (Vector Spaces & Subspaces): Basis vectors span a subspace; the zero-mean subspace (Example 97) is a hyperplane in $\mathbb{R}^d$. Centering projects onto this subspace; PCA basis gives coordinates within it.

  • Example 98 (Span in ML): Linear predictions $\hat{y} = Xw$ lie in the span of feature columns; these columns form a basis (if full rank). Attention outputs $o = a^\top V$ lie in span of value vectors. Basis selection determines expressiveness.

  • Example 16 (Matrix Products & Kernels): Kernel methods implicitly map data to a high-dimensional basis via inner product in original space. PCA basis selection differs: explicit projection onto a low-dimensional basis extracted from data.

Comments