Example number
51
Example slug
example_51_basis_coordinates_onehot_lookup_and_pca_changeofbasis
Background

Historical context: Bases and coordinates formalize how vectors are represented; the standard basis corresponds to one-hot coordinates that select columns. Orthogonal bases (Fourier, PCA) emerged to express signals/data in directions with clear semantics (frequencies, variance). PCA dates back to Pearson (1901) and Hotelling (1933); modern SVD-based PCA computes orthonormal components robustly.

Mathematical characterization: Given a matrix of basis vectors $E = [e_1\ \cdots\ e_m] \in \mathbb{R}^{d\times m}$ and a coordinate vector $x \in \mathbb{R}^m$, the reconstruction is $Ex = \sum_j x_j e_j$. For PCA, with mean $\mu$ and right singular vectors $V$, the top-$k$ basis $V_k \in \mathbb{R}^{d\times k}$ is orthonormal ($V_k^\top V_k = I$). Coordinates are $z = V_k^\top (x - \mu)$ and reconstruction is $\hat{x} = \mu + V_k z$. The dimension $k$ is the number of retained degrees of freedom.

Prevalence in ML: Embedding lookup, spectral methods, dimensionality reduction, whitening, and visualization all rely on basis/coordinates. PCA change-of-basis clarifies variance structure, improves conditioning, and reduces compute.

Purpose

Show how basis and coordinates unify two core ML operations: (1) one-hot lookup as selecting a column from an embedding matrix (coordinates in the standard basis), and (2) PCA change-of-basis projecting data into an orthonormal basis of principal directions to control dimensionality. Emphasize that choosing a basis determines how we express vectors (coordinates), and changing basis (e.g., to PCA components) reveals structure, compresses degrees of freedom, and improves downstream tasks (denoising, visualization, regression).

Problem

Use the standard basis (one-hot vectors) to show embedding lookup is matrix multiplication. Then compute PCA coordinates of one point in an orthonormal basis and interpret dimension as degrees of freedom captured.

Solution (Math)

If $E\in\mathbb{R}^{d\times |\mathcal{V}|}$ is an embedding matrix and $x=e_j$ is a one-hot basis vector, then

\[ Ex = E e_j \]

selects the $j$-th column of $E$.

For PCA with orthonormal basis $V_k$, coordinates are $z = V_k^T$x-$$ and reconstruction is $\hat x=\mu+V_k z$. The dimension $k$ counts retained degrees of freedom.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points

E = np.array([[ 1.,  2.,  3.],
              [ 0., -1.,  1.]])
x = np.array([0., 1., 0.])
print("E@x:", E@x, "expected:", E[:,1])

X = toy_pca_points(n=15, seed=2)
mu = X.mean(0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1
Vk = Vt[:k].T
x0 = X[0]
z = Vk.T @ (x0 - mu)
x_hat = mu + Vk @ z
print("z:", z)
print("x_hat:", x_hat, "x0:", x0)
Code Introduction

This snippet ties basis/coordinates to PCA projection.

First, $E \in \mathbb{R}^{2\times 3}$ has columns in $\mathbb{R}^2$. The coordinate vector $x = [0, 1, 0]^\top$ is one-hot for the second column, so the linear combination $Ex$ selects that column: $Ex = 0\cdot E_{:,0} + 1\cdot E_{:,1} + 0\cdot E_{:,2} = E_{:,1}$. This is the coordinate view of a basis: columns of $E$ are basis vectors for their span, and $x$ provides coefficients.

Second, $X \in \mathbb{R}^{15\times 2}$ is centered as $X_c = X - \mu$ and factored via SVD $X_c = U \Sigma V^\top$. The rows of Vt are $V^\top$; the first principal component is $v_1 = V^\top_{0,:}^\top \in \mathbb{R}^2$. With $k=1$, $V_k = v_1$ gives a 1D orthonormal basis of the top-variance subspace. A point $x_0$ is projected onto this line by its principal coordinate $z = v_1^\top (x_0 - \mu)$ and reconstructed by $\hat{x} = \mu + v_1 z$. Here $z$ is the scalar coefficient in the PCA basis, and $\hat{x}$ is the rank-1 reconstruction (orthogonal projection onto the $k=1$ subspace through the mean).

Notes and gotchas: Centering is mandatory for PCA; without $X_c$ the components point toward the mean. NumPy returns $V^\top$ as Vt; use Vk = Vt[:k].T to get the first $k$ principal components as columns. Shapes: $X(15,2)$, Vt(2,2), $V_k(2,1)$, $z(1,)$, $(2,)`.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Embedding selection via one-hot: Form $E \in \mathbb{R}^{2\times 3}$; set $x=e_2$ to select column 2 by computing $Ex$.
  2. Generate data: X = toy_pca_points(n=15, seed=2) to get a 2D dataset.
  3. Center data: Compute $\mu = X.\text{mean}(0)$ and $X_c = X - \mu$; centering is required for PCA.
  4. Compute SVD: U, S, Vt = np.linalg.svd(Xc, full_matrices=False); Vt is $V^\top$.
  5. Select top-$k$ basis: Vk = Vt[:k].T yields $V_k$ as $k$ columns.
  6. Project a point: For $x_0$, compute coordinates z = Vk.T @ (x0 - mu).
  7. Reconstruct: x_hat = mu + Vk @ z gives the $k$-dimensional reconstruction.
  8. Verify shapes and properties: Check $V_k^\top V_k = I_k$, and that $z \in \mathbb{R}^k$, $\hat{x} \in \mathbb{R}^d$.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • One-hot lookup is matrix multiplication: $Ex$ with $x=e_j$ selects the $j$-th column (standard basis coordinates).
  • Coordinates in an orthonormal basis: PCA gives $z = V_k^\top (x-\mu)$, the coordinates of $x$ in the principal basis.
  • Orthogonal projection and reconstruction: $\hat{x} = \mu + V_k z$ is the projection of $x$ onto the $k$-dimensional principal subspace.
  • Degrees of freedom: $k$ counts retained dimensions; smaller $k$ compresses data while preserving dominant variance.
  • Shape discipline: Track $E \in \mathbb{R}^{d\times m}$, $x \in \mathbb{R}^m$, $V_k \in \mathbb{R}^{d\times k}$, $z \in \mathbb{R}^k$, $\hat{x} \in \mathbb{R}^d$.
Notes

Part 1: One-hot lookup and basis columns - One-hot $x=e_j$ selects column $j$ from $E$: $Ex = E e_j = E_{:,j}$. - Coordinates encode coefficients in the chosen basis; standard basis uses one-hot vectors.

Part 2: PCA coordinates and projection - Center data: $X_c = X - \mu$; then $X_c = U \Sigma V^\top$. - $V_k$ (first $k$ columns of $V$) is an orthonormal basis of the principal subspace. - Coordinates: $z = V_k^\top (x - \mu)$; reconstruction: $\hat{x} = \mu + V_k z$.

Part 3: Degrees of freedom and geometry - $k$ controls information retained; smaller $k$ compresses, larger $k$ preserves more variance. - Orthogonal projection minimizes reconstruction error $\|x - \hat{x}\|_2$ over all vectors in the subspace.

Pedagogical Notes - Basis choice affects interpretability; PCA basis aligns with variance directions. - One-hot coordinates in embedding tables show that a lookup is a matrix product. - Verify shapes to prevent silent broadcasting errors; check orthonormality $V_k^\top V_k = I$.

ML Examples and Patterns - Embedding lookup: NLP uses $Ex$ to retrieve token embeddings. - PCA preprocessing: reduce dimension for visualization/regression while preserving variance. - Whitening: change-of-basis followed by scaling to unit variance. - Autoencoders: learn a data-adaptive basis similar to PCA.

Connection to Linear Algebra Theory - Vector spaces: coordinates express vectors in a basis; change-of-basis is a linear isomorphism. - Orthonormal bases: preserve lengths/angles; convenient for projection. - SVD: guarantees orthonormal bases for data subspaces.

Numerical and Implementation Notes - Always center before PCA; otherwise components point toward the mean. - Use SVD (np.linalg.svd) for stability; avoid forming covariance for very ill-conditioned data without care. - Confirm $V_k^\top V_k = I$ numerically; tolerances account for floating point.

Numerical and Shape Notes - Shapes: $E(2,3)$, $x(3,)$, $Ex(2,)$; $X(15,2)$, $Vt(2,2)$, $V_k(2,1)$, $z(1,)$, $\hat{x}(2,)$. - Distinguish row/column vectors: Vt[:k].T yields columns for $V_k$. - Broadcasting: ensure matrix-vector products use matching shapes.

ML Context: From Attention to Transformers - Attention uses basis-like projections: $Q = XW_Q$, $K = XW_K$, $V = XW_V$ define learned bases for queries/keys/values. - Coordinates (attention weights) are data-dependent; outputs are weighted sums in the value basis.

Pedagogical Significance - This example connects discrete lookup (one-hot basis) with continuous change-of-basis (PCA), building intuition for coordinates and projections used across ML.

History and Applications

The concept of a basis dates to 19th-century algebra and geometry; coordinates provide the numeric representation of vectors relative to chosen basis vectors. Orthogonal bases (e.g., Fourier) enabled efficient analysis of signals. PCA originated with Pearson (1901) and Hotelling (1933) to discover directions of maximal variance, later unified with SVD for robust computation. In modern ML: embedding lookup via one-hot coordinates powers NLP; PCA and SVD serve as preprocessing for visualization, compression, and conditioning improvement; autoencoders learn data-adaptive bases; whitening and decorrelation are change-of-basis operations followed by scaling. Recognizing basis and coordinates clarifies how models store and manipulate information: choosing a basis (learned or fixed) determines interpretability and numerical properties.

Connection to Broader Examples
  • Chapter 2 (Span/Linear combinations): Coordinates express vectors as linear combinations of basis vectors.
  • Chapter 3 (Basis/Dimension): $k$ is the number of basis vectors retained; dimension counts degrees of freedom.
  • Chapter 4 (Linear maps): Change-of-basis is a linear map from coordinates in one basis to another.
  • Chapter 5 (Inner products): Orthonormal bases rely on inner products; $V_k^\top V_k = I$.
  • Chapter 6 (Projections): PCA reconstruction is an orthogonal projection onto a subspace.
  • Chapter 10 (SVD): PCA basis arises from SVD; $X_c = U \Sigma V^\top$.
  • Chapter 11 (PCA): This is the essential PCA workflow: center, SVD, project, reconstruct.
  • Chapter 12 (Least-squares): Working in principal coordinates improves conditioning of regression.
  • Chapter 14 (Conditioning): Change-of-basis can reduce ill-conditioning by aligning with dominant variance.
  • Chapter 16 (Matrix products): Coordinates/reconstruction are matrix products: $z = V_k^\top(x-\mu)$, $\hat{x} = \mu + V_k z$.

Comments