Example number
67
Example slug
example_67_basis_coordinates_onehot_lookup_and_pca_changeofbasis
Background

Bases and coordinates are the grammar for representation: one-hot vectors index and select columns from an embedding table; coordinates in an orthonormal basis express a vector as a weighted sum of basis directions. PCA supplies data-driven bases (principal directions) that capture variance efficiently; choosing $k$ trades reconstruction fidelity against dimensionality and conditioning. Orthonormality ensures numerical stability: $V_k^\top V_k = I_k$ and projections are length-preserving along the subspace.

Purpose

Build an ML-first intuition for bases, coordinates, and dimension: one-hot lookup as selecting a basis vector via matrix multiplication, and PCA as an orthonormal change of basis with coordinates and reconstruction. The goal is to reason about how many degrees of freedom you retain ($k$) and how coordinates in a basis map back to the original space stably and interpretably.

Problem

Use the standard basis (one-hot vectors) to show embedding lookup is matrix multiplication. Then compute PCA coordinates of one point in an orthonormal basis and interpret dimension as degrees of freedom captured.

Solution (Math)

If $E\in\mathbb{R}^{d\times |\mathcal{V}|}$ is an embedding matrix and $x=e_j$ is a one-hot basis vector, then

\[ Ex = E e_j, \]

which selects the $j$-th column of $E$.

For PCA with orthonormal basis $V_k \in \mathbb{R}^{d\times k}$, coordinates and reconstruction are

\[ z = V_k^\top (x - \mu),\qquad \hat x = \mu + V_k z = \mu + V_k V_k^\top (x - \mu), \]

and the dimension $k$ counts retained degrees of freedom.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points

E = np.array([[ 1.,  2.,  3.],
              [ 0., -1.,  1.]])
x = np.array([0., 1., 0.])
print("E@x:", E@x, "expected:", E[:,1])

X = toy_pca_points(n=15, seed=2)
mu = X.mean(0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1
Vk = Vt[:k].T
x0 = X[0]
z = Vk.T @ (x0 - mu)
x_hat = mu + Vk @ z
print("z:", z)
print("x_hat:", x_hat, "x0:", x0)
Code Introduction

This snippet highlights two ideas: selecting a basis vector via a linear combination, and projecting a point onto a low-dimensional PCA subspace.

First, $E \in \mathbb{R}^{2\times 3}$ is treated as a set of basis columns in $\mathbb{R}^2$. The vector $x = [0,\,1,\,0]^\top$ is the second standard basis vector, so $E x$ extracts the second column $E_{:,2}$. This shows how linear combinations select or mix basis elements; using a one-hot $x$ picks a single column, while general $x$ mixes columns to produce any vector in the column span.

Second, PCA projection is performed on $X \in \mathbb{R}^{n\times d}$ by centering ($X_c = X - \mu$), computing SVD $X_c = U\Sigma V^\top$, and taking the leading principal direction $V_k \in \mathbb{R}^{d\times k}$ (here $k=1$ via Vk = Vt[:k].T). A point $x_0$ is mapped to its coordinate $z = V_k^\top (x_0 - \mu)$, then reconstructed by the rank-1 projection $\hat x = \mu + V_k z$, equivalently $\hat x = \mu + V_k V_k^\top (x_0 - \mu)$. This yields the orthogonal projection onto the 1D principal subspace, minimizing the squared reconstruction error within that subspace.

Gotchas and shapes: SVD returns $V^\top$ (Vt), so transpose to get $V$. Ensure data are centered before PCA; otherwise, the first component may capture the mean direction. With $d=2$, Vk has shape $(2,1)$, $z$ is $(1,)$, and $\hat x$ is $(2,)$. A quick check is that $(x_0 - \hat x)$ is orthogonal to $V_k$, i.e., $V_k^\top (x_0 - \hat x) \approx 0$, and that projecting twice is idempotent.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Embedding lookup: construct $E \in \mathbb{R}^{d\times m}$ and a one-hot $e_j \in \mathbb{R}^m$; compute $E e_j$ and verify it equals column $j$.
  • PCA basis: center data $X_c = X - \mu$; compute SVD $X_c = U\Sigma V^\top$; set $V_k = V_{:k}$ (columns) or Vk = Vt[:k].T in NumPy.
  • Coordinates: compute $z = V_k^\top (x_0 - \mu)$; reconstruction $\hat x = \mu + V_k z$.
  • Checks: confirm $\|x_0 - \hat x\|_2$ decreases with larger $k$; verify $V_k^\top (x_0 - \hat x) \approx 0$.
  • Shapes: $X \in \mathbb{R}^{n\times d}$, $\mu \in \mathbb{R}^d$, $V_k \in \mathbb{R}^{d\times k}$, $z \in \mathbb{R}^k$, $\hat x \in \mathbb{R}^d$.
  • Use orthonormal bases for stability; avoid explicit inverses and rely on SVD.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • One-hot lookup is matrix multiplication: $E e_j$ returns the $j$-th embedding.
  • PCA change-of-basis: $z = V_k^\top (x - \mu)$, reconstruction $\hat x = \mu + V_k z$.
  • Dimension $k$ as degrees of freedom retained; idempotence of projection ($V_k V_k^\top$).
  • Shape discipline: $E \in \mathbb{R}^{d\times |\mathcal{V}|}$, $e_j \in \mathbb{R}^{|\mathcal{V}|}$, $V_k \in \mathbb{R}^{d\times k}$, $z \in \mathbb{R}^k$.
  • Numerical sanity checks: orthogonality of residual $(x - \hat x)$ to $\operatorname{span}(V_k)$.
Notes
  • Part 1: One-hot and selection. $E e_j$ selects column $j$; general coordinates weigh columns to reconstruct vectors in $\operatorname{span}(E)$.
  • Part 2: PCA coordinates and projection. $z = V_k^\top (x - \mu)$, $\hat x = \mu + V_k z$; projection matrix $P_k = V_k V_k^\top$ is symmetric idempotent.
  • Part 3: Numerical checks. Verify $V_k^\top (x - \hat x) \approx 0$, and that projecting twice leaves $\hat x$ unchanged; inspect shapes rigorously.
  • Why This Matters for ML. Embedding lookups and PCA are everywhere: retrieval, compression, denoising, and visualization; $k$ controls capacity and overfitting.
  • ML Examples and Patterns. Fit: regress in reduced coordinates; Project: denoise by rank-$k$ projection; Decompose: learn orthonormal bases via SVD/PCA; Solve: use orthonormal transforms for stable optimization.
  • Connection to Linear Algebra Theory. Change-of-basis with orthonormal columns preserves inner products in the subspace; projectors split $\mathbb{R}^d = \operatorname{span}(V_k) \oplus \operatorname{span}(V_k)^\perp$.
  • Numerical and Implementation Notes. Always center before PCA; use SVD (np.linalg.svd) to obtain $V^\top$ and transpose to get $V$; avoid explicit covariance eigen-decompositions on ill-conditioned data.
  • Numerical and Shape Notes. Validate $V_k \in \mathbb{R}^{d\times k}$, $z \in \mathbb{R}^k$, $\hat x \in \mathbb{R}^d$; check broadcasting in NumPy for $\mu$ subtraction.
  • Pedagogical Significance. The smallest working example that ties embeddings-as-basis selection to PCA change-of-basis and dimensionality.
History and Applications

Basis selection and coordinates underpin embedding lookups, feature engineering, and dimensionality reduction throughout ML. One-hot indexing connects categorical variables to continuous embeddings; PCA traces back to Pearson (1901) and Hotelling (1933) and remains the standard for low-dimensional structure discovery, denoising, and visualization. Orthonormal change-of-basis keeps computations stable and interpretable, with $k$ directly controlling capacity and bias–variance trade-offs.

Connection to Broader Examples
  • Links to Chapter 3 (basis/dimension): coordinates and reconstruction embody change-of-basis.
  • Connects to Chapter 11 (PCA): centering and orthonormal principal directions; rank-$k$ approximation $V_k V_k^\top$.
  • Complements Chapter 12 (least squares): projections onto subspaces and idempotent projectors.
  • Relates to Chapter 16 (matrix products): embedding lookup, projections, and basis transforms are matrix multiplies.
  • Bridges to Chapter 14 (conditioning): orthonormal bases improve numeric stability versus arbitrary bases.

Comments