Bases and coordinates are the grammar for representation: one-hot vectors index and select columns from an embedding table; coordinates in an orthonormal basis express a vector as a weighted sum of basis directions. PCA supplies data-driven bases (principal directions) that capture variance efficiently; choosing $k$ trades reconstruction fidelity against dimensionality and conditioning. Orthonormality ensures numerical stability: $V_k^\top V_k = I_k$ and projections are length-preserving along the subspace.
- Log in to post comments
Build an ML-first intuition for bases, coordinates, and dimension: one-hot lookup as selecting a basis vector via matrix multiplication, and PCA as an orthonormal change of basis with coordinates and reconstruction. The goal is to reason about how many degrees of freedom you retain ($k$) and how coordinates in a basis map back to the original space stably and interpretably.
Use the standard basis (one-hot vectors) to show embedding lookup is matrix multiplication. Then compute PCA coordinates of one point in an orthonormal basis and interpret dimension as degrees of freedom captured.
If $E\in\mathbb{R}^{d\times |\mathcal{V}|}$ is an embedding matrix and $x=e_j$ is a one-hot basis vector, then
\[ Ex = E e_j, \]
which selects the $j$-th column of $E$.
For PCA with orthonormal basis $V_k \in \mathbb{R}^{d\times k}$, coordinates and reconstruction are
\[ z = V_k^\top (x - \mu),\qquad \hat x = \mu + V_k z = \mu + V_k V_k^\top (x - \mu), \]
and the dimension $k$ counts retained degrees of freedom.
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
import numpy as np
from scripts.toy_data import toy_pca_points
E = np.array([[ 1., 2., 3.],
[ 0., -1., 1.]])
x = np.array([0., 1., 0.])
print("E@x:", E@x, "expected:", E[:,1])
X = toy_pca_points(n=15, seed=2)
mu = X.mean(0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 1
Vk = Vt[:k].T
x0 = X[0]
z = Vk.T @ (x0 - mu)
x_hat = mu + Vk @ z
print("z:", z)
print("x_hat:", x_hat, "x0:", x0)This snippet highlights two ideas: selecting a basis vector via a linear combination, and projecting a point onto a low-dimensional PCA subspace.
First, $E \in \mathbb{R}^{2\times 3}$ is treated as a set of basis columns in $\mathbb{R}^2$. The vector $x = [0,\,1,\,0]^\top$ is the second standard basis vector, so $E x$ extracts the second column $E_{:,2}$. This shows how linear combinations select or mix basis elements; using a one-hot $x$ picks a single column, while general $x$ mixes columns to produce any vector in the column span.
Second, PCA projection is performed on $X \in \mathbb{R}^{n\times d}$ by centering ($X_c = X - \mu$), computing SVD $X_c = U\Sigma V^\top$, and taking the leading principal direction $V_k \in \mathbb{R}^{d\times k}$ (here $k=1$ via Vk = Vt[:k].T). A point $x_0$ is mapped to its coordinate $z = V_k^\top (x_0 - \mu)$, then reconstructed by the rank-1 projection $\hat x = \mu + V_k z$, equivalently $\hat x = \mu + V_k V_k^\top (x_0 - \mu)$. This yields the orthogonal projection onto the 1D principal subspace, minimizing the squared reconstruction error within that subspace.
Gotchas and shapes: SVD returns $V^\top$ (Vt), so transpose to get $V$. Ensure data are centered before PCA; otherwise, the first component may capture the mean direction. With $d=2$, Vk has shape $(2,1)$, $z$ is $(1,)$, and $\hat x$ is $(2,)$. A quick check is that $(x_0 - \hat x)$ is orthogonal to $V_k$, i.e., $V_k^\top (x_0 - \hat x) \approx 0$, and that projecting twice is idempotent.
Numerical and Shape Notes
- Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
- Axis discipline: Be explicit with
axisin reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1. - Broadcasting: Check that broadcasts are intended (e.g.,
(n,1)with(n,d)). Prefer reshape/expand-dims to make semantics clear. - Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
- Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
- Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
- Sanity checks: Print shapes and residuals (e.g.,
||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.
Numerical and Implementation Notes
- Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use
solve,lstsq, Cholesky/QR/SVD. - Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (
axis) and ensure broadcasts are intended. - Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
- Conditioning: Inspect
np.linalg.cond(A)when solutions look unstable; regularize (ridge) or rescale features to improve conditioning. - Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g.,
||Ax-b||, reconstruction errors) and assert finiteness. - Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
- Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
- Sanity checks: Compare against references (e.g.,
lstsqvs.solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
- Embedding lookup: construct $E \in \mathbb{R}^{d\times m}$ and a one-hot $e_j \in \mathbb{R}^m$; compute $E e_j$ and verify it equals column $j$.
- PCA basis: center data $X_c = X - \mu$; compute SVD $X_c = U\Sigma V^\top$; set $V_k = V_{:k}$ (columns) or
Vk = Vt[:k].Tin NumPy. - Coordinates: compute $z = V_k^\top (x_0 - \mu)$; reconstruction $\hat x = \mu + V_k z$.
- Checks: confirm $\|x_0 - \hat x\|_2$ decreases with larger $k$; verify $V_k^\top (x_0 - \hat x) \approx 0$.
- Shapes: $X \in \mathbb{R}^{n\times d}$, $\mu \in \mathbb{R}^d$, $V_k \in \mathbb{R}^{d\times k}$, $z \in \mathbb{R}^k$, $\hat x \in \mathbb{R}^d$.
- Use orthonormal bases for stability; avoid explicit inverses and rely on SVD.
Pedagogical Significance
- Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
- ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
- Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
- Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
- Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
- Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.
ML Examples and Patterns
- Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
- Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
- Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
- Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
- Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
- One-hot lookup is matrix multiplication: $E e_j$ returns the $j$-th embedding.
- PCA change-of-basis: $z = V_k^\top (x - \mu)$, reconstruction $\hat x = \mu + V_k z$.
- Dimension $k$ as degrees of freedom retained; idempotence of projection ($V_k V_k^\top$).
- Shape discipline: $E \in \mathbb{R}^{d\times |\mathcal{V}|}$, $e_j \in \mathbb{R}^{|\mathcal{V}|}$, $V_k \in \mathbb{R}^{d\times k}$, $z \in \mathbb{R}^k$.
- Numerical sanity checks: orthogonality of residual $(x - \hat x)$ to $\operatorname{span}(V_k)$.
- Part 1: One-hot and selection. $E e_j$ selects column $j$; general coordinates weigh columns to reconstruct vectors in $\operatorname{span}(E)$.
- Part 2: PCA coordinates and projection. $z = V_k^\top (x - \mu)$, $\hat x = \mu + V_k z$; projection matrix $P_k = V_k V_k^\top$ is symmetric idempotent.
- Part 3: Numerical checks. Verify $V_k^\top (x - \hat x) \approx 0$, and that projecting twice leaves $\hat x$ unchanged; inspect shapes rigorously.
- Why This Matters for ML. Embedding lookups and PCA are everywhere: retrieval, compression, denoising, and visualization; $k$ controls capacity and overfitting.
- ML Examples and Patterns. Fit: regress in reduced coordinates; Project: denoise by rank-$k$ projection; Decompose: learn orthonormal bases via SVD/PCA; Solve: use orthonormal transforms for stable optimization.
- Connection to Linear Algebra Theory. Change-of-basis with orthonormal columns preserves inner products in the subspace; projectors split $\mathbb{R}^d = \operatorname{span}(V_k) \oplus \operatorname{span}(V_k)^\perp$.
- Numerical and Implementation Notes. Always center before PCA; use SVD (
np.linalg.svd) to obtain $V^\top$ and transpose to get $V$; avoid explicit covariance eigen-decompositions on ill-conditioned data. - Numerical and Shape Notes. Validate $V_k \in \mathbb{R}^{d\times k}$, $z \in \mathbb{R}^k$, $\hat x \in \mathbb{R}^d$; check broadcasting in NumPy for $\mu$ subtraction.
- Pedagogical Significance. The smallest working example that ties embeddings-as-basis selection to PCA change-of-basis and dimensionality.
Basis selection and coordinates underpin embedding lookups, feature engineering, and dimensionality reduction throughout ML. One-hot indexing connects categorical variables to continuous embeddings; PCA traces back to Pearson (1901) and Hotelling (1933) and remains the standard for low-dimensional structure discovery, denoising, and visualization. Orthonormal change-of-basis keeps computations stable and interpretable, with $k$ directly controlling capacity and biasâvariance trade-offs.
- Links to Chapter 3 (basis/dimension): coordinates and reconstruction embody change-of-basis.
- Connects to Chapter 11 (PCA): centering and orthonormal principal directions; rank-$k$ approximation $V_k V_k^\top$.
- Complements Chapter 12 (least squares): projections onto subspaces and idempotent projectors.
- Relates to Chapter 16 (matrix products): embedding lookup, projections, and basis transforms are matrix multiplies.
- Bridges to Chapter 14 (conditioning): orthonormal bases improve numeric stability versus arbitrary bases.
Comments