Example number
83
Example slug
example_83_basis_coordinates_onehot_lookup_and_pca_changeofbasis
Background

Embedding tables implement $E e_j$ in deep learning; PCA provides an orthonormal basis capturing variance for dimensionality reduction and reconstruction. Both are basis/coordinate stories central to ML.

Purpose

Build a reusable ML-first model of bases and coordinates: show that one-hot lookup is just matrix multiplication (selecting a column) and that PCA provides an orthonormal basis for expressing data with fewer degrees of freedom. Emphasize shapes, how coordinates become coefficients, and how dimension (k) trades fidelity for simplicity.

Problem

Use the standard basis (one-hot vectors) to show embedding lookup is matrix multiplication. Then compute PCA coordinates of one point in an orthonormal basis and interpret dimension as degrees of freedom captured.

Solution (Math)
  • One-hot lookup: Let $E \in \mathbb{R}^{d\times |\mathcal{V}|}$ and $e_j \in \mathbb{R}^{|\mathcal{V}|}$. Then $E e_j = E_{:,j}$ selects the $j$-th column of $E$. Embedding lookup is matrix multiplication.

  • PCA coordinates and reconstruction: With $X \in \mathbb{R}^{n\times d}$ and mean $\mu \in \mathbb{R}^d$, center $X_c = X - \mu$. Take SVD $X_c = U \Sigma V^\top$. For $k$ components, $V_k \in \mathbb{R}^{d\times k}$. For a point $x \in \mathbb{R}^d$: \[ z = V_k^\top (x - \mu) \in \mathbb{R}^k, \qquad \hat x = \mu + V_k z \in \mathbb{R}^d. \] Here $k$ is the dimension (degrees of freedom) retained.

We use: - $X \in \mathbb{R}^{n\times d}$ (rows = examples). - Column vectors by default. - Inner product $\langle x,y\rangle = x^\top y$; norm $\|x\|_2$.

Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points

E = np.array([[ 1.,  2.,  3.],
              [ 0., -1.,  1.]])
x = np.array([0., 1., 0.])
print("E@x:", E@x, "expected:", E[:,1])

X = toy_pca_points(n=15, seed=2)
mu = X.mean(0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1
Vk = Vt[:k].T
x0 = X[0]
z = Vk.T @ (x0 - mu)
x_hat = mu + Vk @ z
print("z:", z)
print("x_hat:", x_hat, "x0:", x0)
Code Introduction
  • Demonstrate $E e_j$ as column selection.
  • Compute PCA via SVD, take $k=1$, get coordinates $z$ and reconstruction $\hat x$; print results to verify.
Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Center data: mu = X.mean(0), Xc = X - mu.
  • SVD: U, S, Vt = np.linalg.svd(Xc, full_matrices=False); set Vk = Vt[:k].T.
  • Coordinates: z = Vk.T @ (x0 - mu); reconstruction: x_hat = mu + Vk @ z.
  • Checks: np.allclose(E @ x, E[:, j]); reconstruction error np.linalg.norm(x_hat - x0).
  • Explained variance: (S[:k]**2).sum() / (S**2).sum().
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • One-hot lookup: $E e_j$ selects column $j$ of $E$.
  • Change of basis: PCA yields coordinates $z$ and reconstruction $\hat x$.
  • Degrees of freedom: $k$ controls retained variability and fidelity.
  • Shape thinking: verify dimensions at every step.
Notes

Part 1: Core setup - Basis + coordinates: one-hot lookup and PCA change-of-basis

State the objects, shapes, and target question for Basis + coordinates: one-hot lookup and PCA change-of-basis. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - Basis + coordinates: one-hot lookup and PCA change-of-basis

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make Basis + coordinates: one-hot lookup and PCA change-of-basis work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - Basis + coordinates: one-hot lookup and PCA change-of-basis

Give the computational recipe for Basis + coordinates: one-hot lookup and PCA change-of-basis, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shapes: $E \in \mathbb{R}^{d\times |\mathcal{V}|}$, $e_j \in \mathbb{R}^{|\mathcal{V}|}$, $X \in \mathbb{R}^{n\times d}$, $V_k \in \mathbb{R}^{d\times k}$.
  • Centering: Always subtract $\mu$ before computing SVD/PCA.
  • Orthonormality: Columns of $V$ satisfy $V^\top V = I$; enables clean coordinates.
  • Coordinates: $z = V_k^\top (x-\mu)$ are basis coefficients; order by variance.
  • Reconstruction: $\hat x = \mu + V_k z$ lies in the span of $V_k$.
  • Explained variance: Choose $k$ by cumulative ratio of $\sigma_i^2$.
  • Conditioning: Use SVD rather than explicit inverse; avoid unstable computations.
  • Basis choice: Standard basis for lookup; data-driven basis for PCA.
  • Validation: Check $\|x-\hat x\|$ and that $\sum_i \sigma_i^2$ matches total variance.
History and Applications
  • One-hot basis underpins indexing and embedding lookups in NLP and recommender systems.
  • PCA dates back to Hotelling (1933) and remains a core tool for dimensionality reduction, denoising, visualization, and pre-processing in ML.
  • Modern variants include randomized SVD for large-scale data and whitening for decorrelation prior to downstream tasks.
Connection to Broader Examples
  • Span and linear combinations (Xw lies in the column span).
  • Orthogonal projection and least squares share PCA geometry.
  • SVD underpins PCA and numerically stable solves throughout the book.
  • Embedding lookup mirrors attention’s weighted sum when weights are one-hot.

Comments