Bases provide coordinate systems for vector spaces. The standard basis (one-hot vectors) is discrete and orthonormal; multiplying an embedding matrix by a one-hot vector selects a single columnâthe mathematical operation behind embedding layers in neural networks (word embeddings, positional encodings, graph node features). PCA constructs continuous orthonormal bases aligned with data variance: the principal components form a new coordinate system where axes point along directions of maximum spread. Projecting onto the top-$k$ components compresses data while preserving variance, providing a linear autoencoder: encoder $z = V^\top (x - \mu)$, decoder $\hat{x} = \mu + Vz$, loss $\|x - \hat{x}\|_2^2$. This change-of-basis perspective clarifies embeddings, dimensionality reduction, autoencoders, and latent representations as instances of the same pattern: represent data with respect to an informative basis.
- Log in to post comments
Build intuition that bases are both discrete (embeddings) and continuous (PCA). One-hot vectors act as selectors in the standard basisâmultiplying an embedding matrix by a one-hot extracts a column, which is how transformers retrieve token embeddings. Orthonormal bases (e.g., principal components) provide coordinate systems for projection and reconstructionâmapping high-dimensional data to low-dimensional representations and back. Mastering these two operations unifies embedding lookups, dimensionality reduction, autoencoders, and latent representations under one algebraic framework: change of basis.
- Use one-hot vectors to perform embedding lookup as a matrix product.
- Compute coordinates of a point in an orthonormal basis (PCA basis).
With embedding matrix $E=[e_1,\dots,e_{|V|}]$, a one-hot $x=e_j$ selects column $e_j$: $Ex=e_j$.
If $V$ has orthonormal columns (a basis for a subspace), coordinates are $z=V^T$x-$$, reconstruction is $\hat x=\mu+Vz$.
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
import numpy as np
from scripts.toy_data import toy_pca_points
E = np.array([[1., 2., 3.],
[0.,-1., 1.]])
one_hot = np.array([0., 1., 0.])
print("lookup:", E@one_hot, "expected:", E[:,1])
X = toy_pca_points(n=20, seed=1)
mu = X.mean(0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T[:, :1] # top-1 basis
x = X[0]
z = V.T @ (x-mu)
xhat = mu + V@z
print("z:", z, "xhat:", xhat, "x:", x)This code demonstrates two fundamental uses of basis vectors: lookup via one-hot encoding and low-dimensional representation via PCA projection. It shows how bases enable both discrete indexing (embedding tables) and continuous compression (dimensionality reduction).
Numerical and Shape Notes
- Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
- Axis discipline: Be explicit with
axisin reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1. - Broadcasting: Check that broadcasts are intended (e.g.,
(n,1)with(n,d)). Prefer reshape/expand-dims to make semantics clear. - Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
- Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
- Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
- Sanity checks: Print shapes and residuals (e.g.,
||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.
Numerical and Implementation Notes
- Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use
solve,lstsq, Cholesky/QR/SVD. - Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (
axis) and ensure broadcasts are intended. - Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
- Conditioning: Inspect
np.linalg.cond(A)when solutions look unstable; regularize (ridge) or rescale features to improve conditioning. - Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g.,
||Ax-b||, reconstruction errors) and assert finiteness. - Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
- Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
- Sanity checks: Compare against references (e.g.,
lstsqvs.solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
- One-hot embedding lookup: Define embedding matrix $E \in \mathbb{R}^{d \times |V|}$ and one-hot vector $e_j \in \mathbb{R}^{|V|}$ with entry $j$ equal to 1. Compute $E e_j$ and verify it equals $E[:, j]$ (the $j$-th column).
- Load data: Use
toy_pca_points(n=20, seed=1)to generate toy data $X \in \mathbb{R}^{20 \times 2}$. - Center data: Compute mean $\mu = X.\text{mean}(0)$ and centered data $X_c = X - \mu$. Centering is mandatory for PCAâit ensures the first principal component points along maximum variance, not toward the mean.
- SVD decomposition: Run
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)to get $X_c = U \Sigma V^\top$. The rows of $V^\top$ (stored asVt) are the principal components. - Extract top-$k$ basis: Transpose
V = Vt.T[:, :k]to get a $d \times k$ matrix whose columns are the top-$k$ principal components (for $k=1$, a single direction). - Project to PCA coordinates: For a data point $x$, compute $z = V^\top (x - \mu)$. This gives the coordinates of $x$ in the PCA basisâscalar(s) measuring alignment with principal components.
- Reconstruct in original space: Compute $\hat{x} = \mu + Vz$. This places the reconstruction on the $k$-dimensional subspace spanned by the top-$k$ components, shifted back to the original centroid.
- Measure reconstruction error: Compute $\|x - \hat{x}\|_2$ to quantify information loss. Small error means data is well-approximated by the low-dimensional subspace.
Pedagogical Significance
- Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
- ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
- Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
- Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
- Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
- Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.
ML Examples and Patterns
- Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
- Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
- Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
- Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
- Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
- One-hot vectors as discrete basis selectors: $E e_j$ extracts the $j$-th column of the embedding matrix $E$, implementing token/item lookups as matrix multiplications.
- Embedding layers as linear maps: an embedding matrix $E \in \mathbb{R}^{d \times |V|}$ stores learned representations for a vocabulary of size $|V|$ with dimension $d$.
- PCA as change of basis: orthonormal principal components form a basis; coordinates $z = V^\top (x - \mu)$ represent data in the new system.
- Projection and reconstruction: $\hat{x} = \mu + Vz$ reconstructs the point in the original space, with error $\|x - \hat{x}\|_2$ quantifying information loss.
- Dimensionality reduction via basis truncation: keeping top-$k$ of $d$ principal components compresses data while retaining maximum variance.
- Shape discipline: embedding lookup changes $(|V|,) \to (d,)$; PCA projection changes $(d,) \to (k,)$, reconstruction changes $(k,) \to (d,)$.
Part 1: One-Hot Encoding as Basis Selection
One-hot vectors are the standard basis: $e_1 = [1, 0, 0, \ldots]$, $e_2 = [0, 1, 0, \ldots]$, etc. Multiplying an embedding matrix $E = [e_1, e_2, \ldots, e_{|V|}]$ by $e_j$ selects the $j$-th column: $E e_j = e_j$ (the embedding for token/item $j$). This is how embedding layers work in transformers, recommenders, and graph neural networks. The operation is mathematically $O(d)$ (extracting a column), but frameworks optimize it to direct indexing E[:, j] for speed. The algebraic interpretation: one-hot vectors act as projection operators onto single basis elements, making them discrete versions of continuous projection matrices.
Part 2: PCA Projection and Reconstruction
PCA constructs an orthonormal basis $V = [v_1, v_2, \ldots, v_d]$ where $v_i$ are the principal components, ordered by explained variance (singular values squared). Projecting a centered point $x - \mu$ onto the top-$k$ components gives coordinates $z = V^\top (x - \mu) \in \mathbb{R}^k$, where $V \in \mathbb{R}^{d \times k}$ contains the first $k$ columns. Reconstruction $\hat{x} = \mu + Vz$ maps back to the original space, placing $\hat{x}$ on the $k$-dimensional hyperplane through $\mu$ spanned by the top-$k$ components. The reconstruction minimizes $\|x - \hat{x}\|_2^2$ over all $k$-dimensional subspacesâPCA is the optimal linear dimensionality reduction under the $L^2$ loss.
Part 3: Basis Change and Dimensionality Reduction
Change of basis transforms coordinates from one system to another. In the standard basis, a point $x \in \mathbb{R}^d$ has coordinates $[x_1, \ldots, x_d]$. In the PCA basis, the same point has coordinates $z = V^\top (x - \mu) \in \mathbb{R}^k$ (for top-$k$ components). The transformation $z = V^\top (x - \mu)$ is the encoder, $\hat{x} = \mu + Vz$ is the decoder. For $k < d$, this is lossy compression: $z$ has fewer entries than $x$, storing only the $k$ most important coordinates. The reconstruction error $\|x - \hat{x}\|_2$ measures the price of compression. Applications: visualizing high-dimensional embeddings (project to 2D/3D), denoising (discard low-variance components), feature extraction (use $z$ as input to downstream models), and data compression (store $z$ instead of $x$).
Connection to ML Primitives
One-hot embeddings and PCA unify discrete and continuous representations. Embeddings are discrete bases: each column of $E$ is a learned vector for a discrete entity (token, user, item). PCA constructs continuous bases: each column of $V$ is a data-driven direction of variation. Both enable change of basis: embeddings map categorical indices to dense vectors, PCA maps high-dimensional data to low-dimensional coordinates. Autoencoders generalize PCA: replace linear encoder/decoder with neural networks, but the geometric intuition (compress to a bottleneck, then reconstruct) is the same. Variational autoencoders (VAEs) add probabilistic structure to the latent space $z$, and diffusion models iteratively denoise in latent spaces learned via similar compression principles.
Numerical and Implementation Notes
For embeddings, avoid materializing one-hot vectorsâuse direct indexing E[:, i] for efficiency. For PCA, always center data first; use SVD (np.linalg.svd) over eigendecomposition of the covariance matrix (more numerically stable, avoids forming $X^\top X$). When extracting principal components, remember NumPy returns $V^\top$ (rows are components), so transpose to get $V$ (columns are components). For large $d$, consider randomized SVD (e.g., sklearn.decomposition.PCA with svd_solver='randomized') for faster approximation. Check orthonormality: $V^\top V$ should be a $k \times k$ identity matrix.
Numerical and Shape Notes
Embedding shapes: $E \in \mathbb{R}^{d \times |V|}$ (embedding dim à vocabulary size), one-hot $e_j \in \mathbb{R}^{|V|}$ (column vector or 1D array), output $E e_j \in \mathbb{R}^d$. For batched lookups, use indices array $[i_1, \ldots, i_B]$ and index E[:, indices] to get $d \times B$ matrix of embeddings.
PCA shapes: $X \in \mathbb{R}^{n \times d}$ (rows are examples), $X_c \in \mathbb{R}^{n \times d}$ (centered), SVD gives $V^\top \in \mathbb{R}^{d \times d}$, extract $V \in \mathbb{R}^{d \times k}$ (top-$k$ columns). For a single point $x \in \mathbb{R}^d$, projection $z = V^\top (x - \mu)$ has shape $(k,)$, reconstruction $\hat{x} = \mu + Vz$ has shape $(d,)$. For a batch $X_{\text{new}} \in \mathbb{R}^{m \times d}$, project as $Z = (X_{\text{new}} - \mu) V \in \mathbb{R}^{m \times k}$ (matrix product), reconstruct as $\hat{X} = \mu + Z V^\top \in \mathbb{R}^{m \times d}$.
Gotchas: Forgetting to center before PCA (first component points toward the mean, not maximum variance). Confusing $V$ and $V^\top$ from NumPyâs SVD (transpose needed). Not checking orthonormality (numerical errors can accumulate). Using eigendecomposition of $X^\top X$ for large $d$ (forms $d \times d$ matrix, expensive and unstable).
One-hot encoding traces back to early computing and statistics, where categorical variables were represented as indicator vectors (dummy variables in regression, dating to the 1950s). Modern embedding layers emerged with neural language models: Bengio et al. (2003) learned distributed word representations as rows of a matrix, and Word2Vec (Mikolov et al., 2013) popularized efficient embedding training via skip-gram and CBOW. Transformers (Vaswani et al., 2017) use embeddings for tokens and positions, while graph neural networks use embeddings for nodes/edges. The key insight: replacing sparse one-hot vectors with dense learned embeddings captures semantic similarityâwords with similar meanings have similar embedding vectors.
Principal Component Analysis was invented by Karl Pearson (1901) for fitting lines/planes to data and independently by Harold Hotelling (1933) in the context of multivariate statistics. PCA became foundational for dimensionality reduction, visualization, and feature extraction across statistics, signal processing (Karhunen-Loève transform), and computer vision (eigenfaces for face recognition, Turk & Pentland 1991). In machine learning, PCA is the linear autoencoder: minimizing reconstruction error $\sum_i \|x_i - \hat{x}_i\|^2$ under orthonormality constraints yields the principal components. Modern deep learning extends this with nonlinear autoencoders (Hinton & Salakhutdinov, 2006), variational autoencoders (Kingma & Welling, 2014), and diffusion models (Ho et al., 2020), but PCA remains the interpretable baseline for understanding latent representations, explaining variance, and preprocessing high-dimensional data (gene expression, images, time series). The change-of-basis perspectiveâgoing from standard coordinates to PCA coordinatesâclarifies compression, denoising, and the bias-variance tradeoff in dimensionality reduction.
- Vector spaces (Ch. 1): Bases are minimal spanning sets; one-hot vectors span $\mathbb{R}^{|V|}$, principal components span the variance-aligned subspace.
- Span and combinations (Ch. 2): Embeddings and reconstructions are linear combinations: $E e_j$ and $\mu + Vz$ are weighted sums of basis vectors.
- Linear maps (Ch. 4): Embedding lookup and PCA projection are linear maps; change of basis is matrix multiplication by $V$ or $V^\top$.
- Inner products (Ch. 5): PCA coordinates are inner products $z_i = \langle v_i, x - \mu \rangle$, measuring similarity to principal components.
- Orthogonality & projections (Ch. 6): PCA projects onto the column space of $V$; reconstruction $\hat{x}$ is the orthogonal projection of $x$ onto that subspace.
- SVD (Ch. 10): Principal components are the right singular vectors of centered data; PCA decomposes $X_c = U \Sigma V^\top$.
- PCA (Ch. 11): This example is PCA in miniatureâcentering, SVD, projection, reconstruction, and variance retention.
Comments