Example number
19
Example slug
example_19_onehot_basis_vectors_and_coordinate_representations
Background

A basis is a minimal spanning set: every vector in the space can be uniquely expressed as a linear combination of basis vectors. The standard basis in $\mathbb{R}^d$ consists of one-hot vectors $e_1, \dots, e_d$; any vector $x$ is trivially $x = \sum_i x_i e_i$. Orthonormal bases satisfy $v_i^\top v_j = \delta_{ij}$ (inner products are 0 or 1), making coordinate computation especially simple: $z_i = v_i^\top x$. PCA discovers an orthonormal basis aligned with data variance: the first principal component points in the direction of maximum variance, the second in the direction of maximum remaining variance orthogonal to the first, and so on.

In ML, basis choice is representation choice. One-hot vectors are a sparse basis (one nonzero entry) for categorical data but become inefficient when vocabulary size is large (millions of words/tokens). Learned embeddings compress one-hot into dense low-dimensional vectors by learning a better basis $E$ where semantically similar items have similar coordinates. PCA extends this to continuous data: instead of storing $d$ coordinates per point, store $k \ll d$ principal coordinates and reconstruct approximately. This encode-decode pattern (project to low dimension, then reconstruct) underpins autoencoders, variational methods, and modern representation learning.

Purpose

Build intuition that choosing a basis—hand-crafted (one-hot) or data-driven (PCA)—determines representation efficiency:

  • One-hot embeddings: Use discrete indicator vectors as a basis for categorical data; matrix multiply becomes column lookup.
  • PCA bases: Learn orthonormal directions that capture variance; encode/decode with minimal information loss.
  • Coordinate representation: Any vector can be expressed as coefficients in a chosen basis; $x = \sum_i z_i v_i$.
  • Dimensionality reduction: Choosing fewer basis vectors (lower rank) trades fidelity for compactness.
  • ML pattern: Embeddings, autoencoders, and feature learning all revolve around learning good bases.
Problem
  1. Use one-hot vectors to perform embedding lookup as a matrix product.
  2. Compute coordinates of a point in an orthonormal basis (PCA basis).
Solution (Math)
  1. With embedding matrix $E=[e_1,\dots,e_{|V|}]$, a one-hot $x=e_j$ selects column $e_j$: $Ex=e_j$.

  2. If $V$ has orthonormal columns (a basis for a subspace), coordinates are $z=V^T$x-$$, reconstruction is $\hat x=\mu+Vz$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points

E = np.array([[1., 2., 3.],
              [0.,-1., 1.]])
one_hot = np.array([0., 1., 0.])
print("lookup:", E@one_hot, "expected:", E[:,1])

X = toy_pca_points(n=20, seed=1)
mu = X.mean(0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T[:, :1]  # top-1 basis
x = X[0]
z = V.T @ (x-mu)
xhat = mu + V@z
print("z:", z, "xhat:", xhat, "x:", x)
Code Introduction

This code demonstrates two fundamental representations of data: discrete indexing via one-hot vectors and continuous projection via learned PCA bases. Together, they reveal that choosing a basis—whether hand-crafted (one-hot) or data-driven (PCA)—determines how efficiently you can represent structure.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • One-hot embedding: E @ one_hot performs column lookup; shape $E:(d_{\text{emb}}, |V|)$, one_hot$:(|V|)$ yields $(d_{\text{emb}})$.
  • Verification: E[:,1] directly indexes column 1; should match E @ [0, 1, 0].
  • Centering: mu = X.mean(0) computes the mean; Xc = X - mu shifts origin to data center.
  • SVD: np.linalg.svd(Xc, full_matrices=False) factorizes; Vt.T gives right singular vectors (principal components).
  • Top-k selection: V = Vt.T[:, :k] extracts the first $k$ principal components as a $(d, k)$ basis matrix.
  • Encoding: z = V.T @ (x - mu) projects centered point onto basis; shape $(k,)$.
  • Decoding: xhat = mu + V @ z reconstructs; should approximate $x$ with error determined by discarded components.
  • Shapes: for 2D data with 1D basis, $X:(n, 2)$, $V:(2, 1)$, $z:(1,)$, $\hat{x}:(2)$.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • One-hot lookup: $E \cdot \text{one-hot}(j)$ extracts column $j$ of embedding matrix $E$; this is basis indexing.
  • PCA encoding: $z = V^\top(x - \mu)$ computes coordinates in the learned orthonormal basis $V$.
  • PCA decoding: $\hat{x} = \mu + Vz$ reconstructs by summing scaled basis vectors.
  • Reconstruction error: $\|x - \hat{x}\|$ quantifies information loss from dimensionality reduction.
  • Basis universality: one-hot (discrete, sparse) and PCA (continuous, dense) are two endpoints of a spectrum.
Notes

Part 1: Core setup - One-hot basis vectors and coordinate representations

State the objects, shapes, and target question for One-hot basis vectors and coordinate representations. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - One-hot basis vectors and coordinate representations

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make One-hot basis vectors and coordinate representations work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - One-hot basis vectors and coordinate representations

Give the computational recipe for One-hot basis vectors and coordinate representations, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: check dimensions before manipulating formulas.
  • Numerical note: prefer stable primitives (lstsq, QR/SVD, Cholesky for SPD) over explicit inverses.
  • Interpretation: relate algebraic steps to geometry (subspaces, projections) and to ML behavior (generalization, stability).
  • One-Hot Embedding Lookup as Basis Selection The first section shows the simplest possible basis operation: selecting a column from a matrix. With $E \in \mathbb{R}^{2\times 3}$ containing 2 basis vectors (rows) spanning a 2D subspace of the 3-category space, multiplying by a one-hot vector [0., 1., 0.] extracts the second column $E[:,1] = [2., -1.]$. This is basis indexing in disguise: the one-hot vector specifies coordinates $(0, 1, 0)$ in the standard basis of $\mathbb{R}^3$, and $E \cdot \text{one-hot}$ maps those coordinates into the 2D embedding space spanned by $E$’s columns. The beauty of one-hot encoding is computational: while matrix-vector multiplication is generally $O(d_{\text{emb}} \times |V|)$, the one-hot structure makes this a simple table lookup—just fetch column 1. This pattern appears throughout ML: embedding layers in neural networks use one-hot vectors to index into learned weight matrices, where each column is a learned representation for a discrete token (word, category, user ID). The embedding matrix $E$ is the learned basis for the categorical space.
  • PCA as Optimal Low-Rank Basis Discovery The second section transitions from hand-crafted bases to data-driven basis learning. Loading 20 points from toy_pca_points(n=20, seed=1) produces $X \in \mathbb{R}^{20\times 2}$ with inherent directional structure. Centering via mu = X.mean(0) and Xc = X - mu shifts the origin to the data’s center of mass, removing translational bias so SVD discovers variance directions relative to the mean. The SVD U, S, Vt = svd(Xc) factorizes centered data as $X_c = U \Sigma V^\top$, where Vt (which is $V^\top$) contains the principal directions. Extracting V = Vt.T[:, :1] selects the top-1 principal component—the single direction in $\mathbb{R}^2$ that captures maximum variance. This $V \in \mathbb{R}^{2\times 1}$ matrix is the learned basis for 1D representation: instead of representing points with 2 coordinates, we can approximate them with 1 coordinate along the principal axis.
  • Encoding and Decoding in the Learned Basis Once the basis $V$ is learned, the code demonstrates the encode-decode cycle that underpins all representation learning. For the first data point x = X[0], computing z = V.T @ (x - mu) projects the centered point onto the principal direction, producing a scalar coefficient (the “coordinate” in the 1D basis). This is the encoding step: map high-dimensional data into a low-dimensional representation. To decode, the operation xhat = mu + V @ z reverses the process: scale the basis vector $V$ by the coefficient $z$, then translate back by adding the mean $\mu$. The reconstructed point xhat lies in the 1D subspace defined by $V$, approximating the original $x$. The reconstruction error $\|x - \hat{x}\|$ (not printed but implicit) measures information loss—how much structure was discarded by projecting from 2D to 1D.
  • Why This Matters: Basis Choice Determines Representation Efficiency The profound connection between these two examples is that both operations select a basis, then express data as coordinates in that basis. One-hot encoding uses a trivial basis (indicator vectors for categories), while PCA learns an optimal basis (directions of maximum variance). The one-hot basis is sparse (one nonzero entry per vector) but high-dimensional (dimension equals vocabulary size); the PCA basis is dense (all entries contribute) but low-dimensional (dimension chosen to retain desired variance). This trade-off appears throughout ML: word embeddings compress sparse one-hot vectors into dense low-dimensional representations; autoencoders learn nonlinear generalizations of PCA’s linear encoding; transformers learn context-dependent bases via attention. Understanding basis selection—discrete vs. continuous, fixed vs. learned, sparse vs. dense—is the key to designing representation learning systems. When debugging model performance, ask: “Is my basis capturing the right structure?” Too few dimensions (under-parameterized basis) loses signal; too many dimensions (over-parameterized basis) overfits noise. PCA provides a principled answer via explained variance ratio, guiding the choice of dimensionality $k$ for the basis $V[:, :k]$.
History and Applications

Basis and dimension: The formal notion of a basis emerged in the 19th century through Grassmann’s work on linear algebra and was refined by mathematicians studying vector spaces in the early 20th century. The key insight was that every vector space has a basis, and all bases of the same space have the same cardinality—the dimension. This invariance is powerful: no matter how you choose coordinates, the intrinsic complexity (dimension) of the space remains constant.

Principal Component Analysis: Karl Pearson (1901) introduced PCA as a method for finding the “best-fitting” line through data in the least-squares sense. Harold Hotelling (1933) generalized this to multiple dimensions and connected it to eigenvalue decomposition of the covariance matrix. The modern computational approach via SVD (Golub & Reinsch, 1970) made PCA practical for large datasets, and it became a standard preprocessing step in statistics and ML.

Embedding layers and representation learning: Word embeddings revolutionized NLP by replacing sparse one-hot vectors (dimension = vocabulary size, often $10^5$–$10^6$) with dense low-dimensional vectors (typically 50–300 dimensions). Word2vec (Mikolov et al., 2013) learned embeddings by predicting context words, discovering that semantic relationships (“king” - “man” + “woman” ≈ “queen”) emerged from the learned basis. This basis-learning pattern—replacing a naive high-dimensional representation with a learned compact basis—now appears in entity embeddings (recommender systems), graph embeddings (node2vec), and contextualized embeddings (BERT, GPT).

Modern applications: Autoencoders generalize PCA to nonlinear dimensionality reduction: the encoder learns a basis transformation $z = f(x)$, the decoder reconstructs $\hat{x} = g(z)$. Variational autoencoders (VAEs) add probabilistic structure, learning a latent basis where interpolation is smooth. Transformers use attention to compute context-dependent bases: each token’s representation is a weighted sum of value vectors, where the weights (and hence the effective basis) change per input. Understanding basis selection—hand-crafted vs. learned, linear vs. nonlinear, fixed vs. adaptive—is the foundation for modern representation learning and dimensionality reduction.

Connection to Broader Examples
  • Span and linear combinations (Ch. 2): Any vector in $\text{span}(V)$ is a linear combination $Vz$; basis choice determines span.
  • Orthogonality and projections (Ch. 6): PCA finds orthogonal directions; encoding is projection, decoding is reconstruction.
  • PCA (Ch. 11): Full PCA workflow—center, SVD, select top-$k$ components, encode/decode.
  • Embeddings: One-hot to dense embeddings (word2vec, entity embeddings) is basis change from standard to learned.
  • Autoencoders: Nonlinear generalization of PCA; encoder learns basis transformation, decoder reconstructs.

Comments