Example number
3
Example slug
example_3_onehot_basis_vectors_and_coordinate_representations
Background

These concepts are foundational in modern linear algebra and appear throughout statistical learning theory and deep learning practice.

One-hot encoding and embedding lookup are ubiquitous in modern deep learning. When you use a categorical feature (e.g., word in NLP, item in recommendation systems), you first convert it to a one-hot vector—a coordinate vector with a single 1 and zeros elsewhere. This one-hot vector then indexes into a learned embedding matrix via matrix-vector multiplication. Embedding lookup is $E @ x$ where $x$ is one-hot and $E \in \mathbb{R}^{d_{emb} \times |V|}$ is the embedding matrix with one column per category. This algebraic view reveals that embeddings are learned basis vectors: each column of $E$ is a continuous representation of a category, and the one-hot vector selects which basis vector (which category’s representation) to use. Modern transformers, BERT, and GPT all use this mechanism. The embedding matrix is learnable, so the network learns representations that maximize prediction accuracy—discovering basis vectors aligned with the task structure.

Principal Component Analysis (PCA) discovers an optimal low-dimensional basis for your data by solving an eigenvalue problem. Given centered data $X_c$, PCA finds the directions of maximum variance (the eigenvectors of the covariance matrix $X_c^T X_c$), which are exactly the columns of $V$ from the SVD factorization. These directions form an orthonormal basis for a lower-dimensional subspace that captures as much data variance as possible. PCA answers the question: “What is the intrinsic dimensionality of my data?” by revealing how much variance is explained by each principal component (given by the singular values). Data that appears high-dimensional may actually have most of its structure along a few principal directions. This insight drives dimensionality reduction, compression, noise filtering, and visualization (projecting high-dimensional data onto 2D or 3D principal subspaces for plotting).

The encode-decode paradigm central to representation learning works as follows: given a basis $V$ (learned via PCA, or trained via a neural network), you encode a data point $x$ into low-dimensional coordinates $z = V^T(x - \mu)$, then decode back via $\hat{x} = \mu + Vz$. The reconstruction error $\|x - \hat{x}\|_2$ measures how much information is lost by using only the basis $V$. Autoencoders generalize this by learning the encoding and decoding functions (via neural networks) jointly: $z = f_{\text{enc}}(x)$ and $\hat{x} = f_{\text{dec}}(z)$. This pattern enables unsupervised feature learning: the network discovers a basis that preserves information about the data distribution, and downstream tasks can use the learned coordinates $z$ instead of raw features. Variational autoencoders add probabilistic structure, and the same principle underlies representation learning in all modern deep networks.

Purpose

Give you a crisp, ML-first mental model you can reuse across models and datasets.

What you’ll learn:

  • One-hot encoding as basis selection: Understand that embedding lookup is not magic—it’s matrix-vector multiplication where a one-hot vector selects a column (a basis vector) from an embedding matrix. This perspective unifies lookup tables, neural embeddings, and learnable basis selection under one algebraic operation.
  • Dimension as a choice, not an intrinsic property: See that PCA discovers a low-dimensional basis for your data by finding directions of maximum variance. The “true” dimensionality of your data is determined by how many basis vectors you need to represent it with acceptable error—not by how many features your raw data has.
  • Encode-decode duality: Learn that dimensional reduction works via projection onto a learned basis: encode by computing coordinates z = V.T @ (x - μ), decode by reconstructing x̂ = μ + V @ z. This pattern appears in autoencoders, VAEs, neural network hidden layers, and all representation learning.
  • Basis vectors as learnable structure: Recognize that basis matrices (embedding tables, PCA components, neural network weight matrices) are the fundamental objects that learning algorithms discover. Understanding how to choose, learn, and validate basis vectors is central to model design.

This example bridges discrete (one-hot indexing) and continuous (PCA projection) representations, showing that both are instances of expressing data in a basis. It’s the foundation for understanding feature engineering, neural embeddings, autoencoders, and why deep networks work.

Problem
  1. Use one-hot vectors to perform embedding lookup as a matrix product.
  2. Compute coordinates of a point in an orthonormal basis (PCA basis).
Solution (Math)
  1. With embedding matrix $E=[e_1,\dots,e_{|V|}]$, a one-hot $x=e_j$ selects column $e_j$: $Ex=e_j$.

  2. If $V$ has orthonormal columns (a basis for a subspace), coordinates are $z=V^T$x-$$, reconstruction is $\hat x=\mu+Vz$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points

E = np.array([[1., 2., 3.],
              [0.,-1., 1.]])
one_hot = np.array([0., 1., 0.])
print("lookup:", E@one_hot, "expected:", E[:,1])

X = toy_pca_points(n=20, seed=1)
mu = X.mean(0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T[:, :1]  # top-1 basis
x = X[0]
z = V.T @ (x-mu)
xhat = mu + V@z
print("z:", z, "xhat:", xhat, "x:", x)
Code Introduction

This code demonstrates two complementary concepts that establish basis and dimension as the organizing principles of vector spaces: (1) how one-hot encoding selects a basis vector (a dimension), and (2) how PCA discovers an optimal low-dimensional basis for representing high-dimensional data. Together, they show that dimension reduction is fundamentally about choosing a subspace basis that captures the essential structure of your data.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

The code demonstrates two distinct representations of the same data: one-hot indexing (discrete selection) and PCA projection (continuous basis decomposition).

For one-hot embedding lookup, E @ one_hot is a simple matrix-vector product where $E \in \mathbb{R}^{2 \times 3}$ has 2 basis vectors (rows) and 3 possible categories (columns). Multiplying by a one-hot vector with a single 1 in position 1 extracts column 1, producing [2., -1.]. This operation takes $O(d_{emb} \cdot |V|)$ time in general (matrix-vector product), but one-hot encoding makes it $O(d_{emb})$: you only need to fetch the corresponding column, not iterate over all columns. This efficiency is why embedding lookup is so fast—it’s essentially a table lookup disguised as linear algebra. In modern deep learning frameworks (PyTorch, TensorFlow), embedding layers exploit this to achieve highly optimized lookups with hardware acceleration.

For PCA, the code first centers the data with mu = X.mean(0) and Xc = X - mu. Centering is essential: it shifts the origin to the data’s center of mass, so SVD discovers variance directions relative to the mean rather than relative to the coordinate origin. SVD via np.linalg.svd(Xc, full_matrices=False) factorizes $X_c = U \Sigma V^T$ in $O(\min(n, d)^2 \max(n, d))$ time. The matrix $V \in \mathbb{R}^{d \times d}$ contains all principal directions (basis vectors); V = Vt.T[:, :1] selects only the top-1 basis (the direction of maximum variance). In practice, you choose $k$ based on how much variance you want to retain: V[:, :k] keeps the top $k$ components. Computing coordinates z = V.T @ (x - mu) projects centered data onto the basis, producing a scalar for top-1 PCA. Reconstruction xhat = mu + V @ z reverses this, mapping back to the original space. The reconstruction error ||x - xhat||_2 measures information loss; singular values from SVD tell you the variance explained by each component, letting you decide how many to keep.

Numerical stability and interpretation: Centering is crucial for correct variance calculation—omitting it leads to PCA that confuses translation with true variance structure. The SVD approach is numerically stable because it avoids computing the covariance matrix $X_c^T X_c$ explicitly (which can suffer from illconditioning). For very high-dimensional data, you may use randomized SVD or truncated SVD (compute only top $k$ singular values/vectors) for speed. The singular values $\sigma_i$ from SVD are proportional to the variance along each principal direction: variance $= \sigma_i^2 / (n-1)$. This lets you choose $k$ by examining the “scree plot” (singular values ranked) and selecting the elbow point where variance plateaus.

What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

One-hot vectors index basis columns: The first part shows that embedding lookup is matrix-vector multiplication. A one-hot vector like [0., 1., 0.] has a single 1 in position $j$, so multiplying E @ one_hot extracts column $j$ of $E$. This is not indexing in the Python sense (E[:, j])—it’s an algebraic operation. The key insight: one-hot encoding converts a discrete choice (which category?) into a continuous linear combination (a weighted sum where only one weight is nonzero). This unifies symbolic/discrete representations with continuous linear algebra, enabling gradient-based learning of discrete choices via embeddings.

PCA discovers optimal low-dimensional basis: The second part shows that you can take high-dimensional data and find a small set of basis vectors that capture its essential structure. SVD on centered data gives you the principal directions. Projecting data onto the top $k$ principal components gives you a $k$-dimensional representation that retains maximum variance—the best possible $k$-dimensional approximation under the Euclidean norm. This solves the dimensionality reduction problem: “Which basis vectors should I keep to represent my data efficiently?”

Encode-decode for representation learning: The code shows that once you have a basis $V$, you can represent any point $x$ by its coordinates $z = V^T(x - \mu)$ in that basis, then reconstruct it as $\hat{x} = \mu + Vz$. The reconstruction error measures how much information is lost. This pattern appears in all representation learning: learn a basis (via PCA, neural networks, or other methods), encode data into that basis, use the low-dimensional representation for downstream tasks. The basis $V$ is the learned structure—it encodes what features matter for the data distribution.

Notes

Part 1: Core setup - One-hot basis vectors and coordinate representations

State the objects, shapes, and target question for One-hot basis vectors and coordinate representations. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - One-hot basis vectors and coordinate representations

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make One-hot basis vectors and coordinate representations work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - One-hot basis vectors and coordinate representations

Give the computational recipe for One-hot basis vectors and coordinate representations, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: For embedding lookup, ensure one-hot vector has shape $(|V|,)$ and embedding matrix $E$ has shape $(d_{emb}, |V|)$, producing output shape $(d_{emb},)$. For PCA with top-$k$ basis, $V \in \mathbb{R}^{d \times k}$, so V.T @ (x - mu) with $x \in \mathbb{R}^d$ produces $z \in \mathbb{R}^k$, and mu + V @ z reconstructs back to $\mathbb{R}^d$. Batch processing: $X_c \in \mathbb{R}^{n \times d}$ gives coordinates $Z = X_c @ V \in \mathbb{R}^{n \times k}$ (all points projected at once).

  • One-hot as sparse representation: One-hot vectors are maximally sparse—they have exactly one nonzero element. This makes them numerically trivial to multiply (just indexing), but they also lose information: a one-hot vector is orthogonal to all other one-hot vectors, so learned embeddings must do all the work of capturing similarity. This is why embedding matrices are learned, not fixed.

  • Orthonormality and reconstruction: SVD guarantees that columns of $V$ are orthonormal: $V^T V = I_k$ (assuming we keep top-$k$ columns). This orthonormality means the basis vectors are perpendicular, which makes coordinate computation and reconstruction highly stable. Inner products V.T @ x are exact dot products (no numerical error from non-orthogonal bases). If you use a non-orthonormal basis (e.g., from linear regression), reconstruction becomes ill-conditioned.

  • Singular values and variance explained: SVD produces singular values $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_d$. The variance explained by component $i$ is proportional to $\sigma_i^2$. The “cumulative variance explained” by top-$k$ components is $\sum_{i=1}^k \sigma_i^2 / \sum_{i=1}^d \sigma_i^2$. You typically choose $k$ to retain 95% or 99% of variance, deciding on the dimension reduction trade-off: fewer features enable faster computation and reduce overfitting (fewer parameters to learn), but more information loss.

  • Reconstruction error and information loss: For a point $x$, the reconstruction error is $\|x - \hat{x}\|_2 = \|x - (\mu + V@z)\|_2$. This error is guaranteed to be minimized (for a fixed number of basis vectors $k$) by the PCA basis—no other choice of $k$ orthonormal basis vectors can do better. However, the minimum error is still nonzero if $k < d$: you lose information by projecting to lower dimensions. This trade-off is fundamental: fewer features are cheaper to compute but less expressive.

History and Applications

Basis and dimension were formalized by Grassmann (1844) and later refined through linear algebra. Grassmann showed that any vector space has a basis—a minimal spanning set of independent vectors—and that all bases of the same space have the same cardinality: the dimension.

Dimension invariance: This fundamental insight is powerful: no matter how you choose a basis, the number of basis vectors is always the same. Sylvester and others built on this to develop rank-nullity theorem, connecting the dimension of the null space to the rank of a linear map.

Change of basis in ML: When we fit PCA (Example 11) or autoencoders, we are learning a new basis $V$ for data. The latent coordinate $z = V^\top(x - \mu)$ expresses the data point in this learned basis. Different applications demand different bases: PCA for variance, ICA for independence, sparse autoencoders for interpretability. Understanding basis changes is essential for feature learning in deep networks, where hidden layers learn successive basis transformations.

Connection to Broader Examples

This example establishes basis and dimension as the core organizational principle of representation learning. The key insight is that most ML algorithms work by choosing (or learning) a basis, then expressing data as coordinates in that basis.

Throughout the remaining 97 examples, basis selection appears constantly:

  • Neural network hidden layers (Examples 4, 16) learn basis matrices $W \in \mathbb{R}^{d_{in} \times d_{hidden}}$ where each column is a learned basis vector. Computing $f(Wx)$ projects input onto these learned basis vectors, then applies nonlinearity. Deeper networks stack basis transformations: each layer learns a basis for the next layer’s computation.
  • Autoencoders and VAEs (future examples) generalize PCA by learning encoding/decoding functions: instead of linear projection z = V.T @ x, they use neural networks z = f_{enc}(x). The learned basis is implicit in the network’s weight matrices. VAEs add probabilistic structure, learning a basis that represents a learned latent distribution.
  • Attention mechanisms (Examples 2, more later) compute weighted combinations of value vectors: the basis is the set of learned value representations, and attention learns the weights (coordinates) that select how to combine them.
  • Word embeddings in NLP (related to transformer layers) embed each word as a dense vector—a column of a learned embedding matrix. The basis is all possible word embeddings; a one-hot word selects one basis vector, which is then passed through transformer layers that apply successive basis changes.
  • Recommender systems use a similar embedding approach: user embeddings and item embeddings form basis vectors for representing preferences; recommendations are computed via inner products in the embedding space.
  • Principal Component Regression and Reduced-Rank Regression (Examples 12+) use the top PCA basis as features for regression, exploiting dimensionality reduction to improve generalization.
  • Spectral clustering (future examples) uses the eigenvectors of a similarity matrix as a basis for clustering—projecting data onto the spectral basis often reveals cluster structure more clearly than the original space.

The unifying pattern: (1) choose or learn a basis, (2) compute coordinates in that basis, (3) use those low-dimensional coordinates for downstream tasks (prediction, clustering, generation). Dimension reduction via basis selection is how you bridge the curse of dimensionality—fewer features mean more data per feature, reducing overfitting and enabling faster computation, at the cost of information loss. The art of ML is choosing bases that capture task-relevant structure while discarding noise.

Comments