ex1.ai

Chapter: 3 basis_dimension

Purpose

Give you a crisp, ML-first mental model you can reuse across models and datasets.

What you’ll learn:

One-hot encoding as basis selection: Understand that embedding lookup is not magic—it’s matrix-vector multiplication where a one-hot vector selects a column (a basis vector) from an embedding matrix. This perspective unifies lookup tables, neural embeddings, and learnable basis selection under one algebraic operation.
Dimension as a choice, not an intrinsic property: See that PCA discovers a low-dimensional basis for your data by finding directions of maximum variance. The “true” dimensionality of your data is determined by how many basis vectors you need to represent it with acceptable error—not by how many features your raw data has.
Encode-decode duality: Learn that dimensional reduction works via projection onto a learned basis: encode by computing coordinates z = V.T @ (x - μ), decode by reconstructing x̂ = μ + V @ z. This pattern appears in autoencoders, VAEs, neural network hidden layers, and all representation learning.
Basis vectors as learnable structure: Recognize that basis matrices (embedding tables, PCA components, neural network weight matrices) are the fundamental objects that learning algorithms discover. Understanding how to choose, learn, and validate basis vectors is central to model design.
Optimization: Gradient descent descent descent descent descent descent descent descent descent descent updates $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$ are linear combinations, so parameters remain in $\mathbb{R}^d$.

This example bridges discrete (one-hot indexing) and continuous (PCA projection) representations, showing that both are instances of expressing data in a basis. It’s the foundation for understanding feature engineering, neural embeddings, autoencoders, and why deep networks work.

Problem

Use one-hot vectors to perform embedding lookup as a matrix product.
Compute coordinates of a point in an orthonormal basis (PCA basis).

Solution (math)

With embedding matrix $E=[e_1,\dots,e_{|V|}]$, a one-hot $x=e_j$ selects column $e_j$: $Ex=e_j$.
If $V$ has orthonormal columns (a basis for a subspace), coordinates are $z=V^T$x-$$, reconstruction is $\hat x=\mu+Vz$.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Background

These concepts are foundational in modern linear algebra and appear throughout statistical learning theory and deep learning practice.

One-hot encoding and embedding lookup are ubiquitous in modern deep learning. When you use a categorical feature (e.g., word in NLP, item in recommendation systems), you first convert it to a one-hot vector—a coordinate vector with a single 1 and zeros elsewhere. This one-hot vector then indexes into a learned embedding matrix via matrix-vector multiplication. Embedding lookup is $E @ x$ where $x$ is one-hot and $E \in \mathbb{R}^{d_{emb} \times |V|}$ is the embedding matrix with one column per category. This algebraic view reveals that embeddings are learned basis vectors: each column of $E$ is a continuous representation of a category, and the one-hot vector selects which basis vector (which category’s representation) to use. Modern transformers, BERT, and GPT all use this mechanism. The embedding matrix is learnable, so the network learns representations that maximize prediction accuracy—discovering basis vectors aligned with the task structure.

Principal Component Analysis (PCA) discovers an optimal low-dimensional basis for your data by solving an eigenvalue problem. Given centered data $X_c$, PCA finds the directions of maximum variance (the eigenvectors of the covariance matrix $X_c^T X_c$), which are exactly the columns of $V$ from the SVD factorization. These directions form an orthonormal basis for a lower-dimensional subspace that captures as much data variance as possible. PCA answers the question: “What is the intrinsic dimensionality of my data?” by revealing how much variance is explained by each principal component (given by the singular values). Data that appears high-dimensional may actually have most of its structure along a few principal directions. This insight drives dimensionality reduction, compression, noise filtering, and visualization (projecting high-dimensional data onto 2D or 3D principal subspaces for plotting).

The encode-decode paradigm central to representation learning works as follows: given a basis $V$ (learned via PCA, or trained via a neural network), you encode a data point $x$ into low-dimensional coordinates $z = V^T(x - \mu)$, then decode back via $\hat{x} = \mu + Vz$. The reconstruction error $\|x - \hat{x}\|_2$ measures how much information is lost by using only the basis $V$. Autoencoders generalize this by learning the encoding and decoding functions (via neural networks) jointly: $z = f_{\text{enc}}(x)$ and $\hat{x} = f_{\text{dec}}(z)$. This pattern enables unsupervised feature learning: the network discovers a basis that preserves information about the data distribution, and downstream tasks can use the learned coordinates $z$ instead of raw features. Variational autoencoders add probabilistic structure, and the same principle underlies representation learning in all modern deep networks.

What this example demonstrates

One-hot vectors index basis columns: The first part shows that embedding lookup is matrix-vector multiplication. A one-hot vector like [0., 1., 0.] has a single 1 in position $j$, so multiplying E @ one_hot extracts column $j$ of $E$. This is not indexing in the Python sense (E[:, j])—it’s an algebraic operation. The key insight: one-hot encoding converts a discrete choice (which category?) into a continuous linear combination (a weighted sum where only one weight is nonzero). This unifies symbolic/discrete representations with continuous linear algebra, enabling gradient-based learning of discrete choices via embeddings.

PCA discovers optimal low-dimensional basis: The second part shows that you can take high-dimensional data and find a small set of basis vectors that capture its essential structure. SVD on centered data gives you the principal directions. Projecting data onto the top $k$ principal components gives you a $k$-dimensional representation that retains maximum variance—the best possible $k$-dimensional approximation under the Euclidean norm. This solves the dimensionality reduction problem: “Which basis vectors should I keep to represent my data efficiently?”

Encode-decode for representation learning: The code shows that once you have a basis $V$, you can represent any point $x$ by its coordinates $z = V^T(x - \mu)$ in that basis, then reconstruct it as $\hat{x} = \mu + Vz$. The reconstruction error measures how much information is lost. This pattern appears in all representation learning: learn a basis (via PCA, neural networks, or other methods), encode data into that basis, use the low-dimensional representation for downstream tasks. The basis $V$ is the learned structure—it encodes what features matter for the data distribution.

Numerical Implementation Details

The code demonstrates two distinct representations of the same data: one-hot indexing (discrete selection) and PCA projection (continuous basis decomposition).

For one-hot embedding lookup, E @ one_hot is a simple matrix-vector product where $E \in \mathbb{R}^{2 \times 3}$ has 2 basis vectors (rows) and 3 possible categories (columns). Multiplying by a one-hot vector with a single 1 in position 1 extracts column 1, producing [2., -1.]. This operation takes $O(d_{emb} \cdot |V|)$ time in general (matrix-vector product), but one-hot encoding makes it $O(d_{emb})$: you only need to fetch the corresponding column, not iterate over all columns. This efficiency is why embedding lookup is so fast—it’s essentially a table lookup disguised as linear algebra. In modern deep learning frameworks (PyTorch, TensorFlow), embedding layers exploit this to achieve highly optimized lookups with hardware acceleration.

For PCA, the code first centers the data with mu = X.mean(0) and Xc = X - mu. Centering is essential: it shifts the origin to the data’s center of mass, so SVD discovers variance directions relative to the mean rather than relative to the coordinate origin. SVD via np.linalg.svd(Xc, full_matrices=False) factorizes $X_c = U \Sigma V^T$ in $O(\min(n, d)^2 \max(n, d))$ time. The matrix $V \in \mathbb{R}^{d \times d}$ contains all principal directions (basis vectors); V = Vt.T[:, :1] selects only the top-1 basis (the direction of maximum variance). In practice, you choose $k$ based on how much variance you want to retain: V[:, :k] keeps the top $k$ components. Computing coordinates z = V.T @ (x - mu) projects centered data onto the basis, producing a scalar for top-1 PCA. Reconstruction xhat = mu + V @ z reverses this, mapping back to the original space. The reconstruction error ||x - xhat||_2 measures information loss; singular values from SVD tell you the variance explained by each component, letting you decide how many to keep.

Numerical stability and interpretation: Centering is crucial for correct variance calculation—omitting it leads to PCA that confuses translation with true variance structure. The SVD approach is numerically stable because it avoids computing the covariance matrix $X_c^T X_c$ explicitly (which can suffer from illconditioning). For very high-dimensional data, you may use randomized SVD or truncated SVD (compute only top $k$ singular values/vectors) for speed. The singular values $\sigma_i$ from SVD are proportional to the variance along each principal direction: variance $= \sigma_i^2 / (n-1)$. This lets you choose $k$ by examining the “scree plot” (singular values ranked) and selecting the elbow point where variance plateaus.

Connection to the Broader Examples

This example establishes basis and dimension as the core organizational principle of representation learning. The key insight is that most ML algorithms work by choosing (or learning) a basis, then expressing data as coordinates in that basis.

Throughout the remaining 97 examples, basis selection appears constantly:

Neural network hidden layers (Examples 4, 16) learn basis matrices $W \in \mathbb{R}^{d_{in} \times d_{hidden}}$ where each column is a learned basis vector. Computing $f(Wx)$ projects input onto these learned basis vectors, then applies nonlinearity. Deeper networks stack basis transformations: each layer learns a basis for the next layer’s computation.
Autoencoders and VAEs (future examples) generalize PCA by learning encoding/decoding functions: instead of linear projection z = V.T @ x, they use neural networks z = f_{enc}(x). The learned basis is implicit in the network’s weight matrices. VAEs add probabilistic structure, learning a basis that represents a learned latent distribution.
Attention mechanisms (Examples 2, more later) compute weighted combinations of value vectors: the basis is the set of learned value representations, and attention learns the weights (coordinates) that select how to combine them.
Word embeddings in NLP (related to transformer layers) embed each word as a dense vector—a column of a learned embedding matrix. The basis is all possible word embeddings; a one-hot word selects one basis vector, which is then passed through transformer layers that apply successive basis changes.
Recommender systems use a similar embedding approach: user embeddings and item embeddings form basis vectors for representing preferences; recommendations are computed via inner products in the embedding space.
Principal Component Regression and Reduced-Rank Regression (Examples 12+) use the top PCA basis as features for regression, exploiting dimensionality reduction to improve generalization.
Spectral clustering (future examples) uses the eigenvectors of a similarity matrix as a basis for clustering—projecting data onto the spectral basis often reveals cluster structure more clearly than the original space.

The unifying pattern: (1) choose or learn a basis, (2) compute coordinates in that basis, (3) use those low-dimensional coordinates for downstream tasks (prediction, clustering, generation). Dimension reduction via basis selection is how you bridge the curse of dimensionality—fewer features mean more data per feature, reducing overfitting and enabling faster computation, at the cost of information loss. The art of ML is choosing bases that capture task-relevant structure while discarding noise.

Notes

Shape discipline: For embedding lookup, ensure one-hot vector has shape $(|V|,)$ and embedding matrix $E$ has shape $(d_{emb}, |V|)$, producing output shape $(d_{emb},)$. For PCA with top-$k$ basis, $V \in \mathbb{R}^{d \times k}$, so V.T @ (x - mu) with $x \in \mathbb{R}^d$ produces $z \in \mathbb{R}^k$, and mu + V @ z reconstructs back to $\mathbb{R}^d$. Batch processing: $X_c \in \mathbb{R}^{n \times d}$ gives coordinates $Z = X_c @ V \in \mathbb{R}^{n \times k}$ (all points projected at once).
One-hot as sparse representation: One-hot vectors are maximally sparse—they have exactly one nonzero element. This makes them numerically trivial to multiply (just indexing), but they also lose information: a one-hot vector is orthogonal to all other one-hot vectors, so learned embeddings must do all the work of capturing similarity. This is why embedding matrices are learned, not fixed.
Orthonormality and reconstruction: SVD guarantees that columns of $V$ are orthonormal: $V^T V = I_k$ (assuming we keep top-$k$ columns). This orthonormality means the basis vectors are perpendicular, which makes coordinate computation and reconstruction highly stable. Inner products V.T @ x are exact dot products (no numerical error from non-orthogonal bases). If you use a non-orthonormal basis (e.g., from linear regression), reconstruction becomes ill-conditioned.
Singular values and variance explained: SVD produces singular values $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_d$. The variance explained by component $i$ is proportional to $\sigma_i^2$. The “cumulative variance explained” by top-$k$ components is $\sum_{i=1}^k \sigma_i^2 / \sum_{i=1}^d \sigma_i^2$. You typically choose $k$ to retain 95% or 99% of variance, deciding on the dimension reduction trade-off: fewer features enable faster computation and reduce overfitting (fewer parameters to learn), but more information loss.
Reconstruction error and information loss: For a point $x$, the reconstruction error is $\|x - \hat{x}\|_2 = \|x - (\mu + V@z)\|_2$. This error is guaranteed to be minimized (for a fixed number of basis vectors $k$) by the PCA basis—no other choice of $k$ orthonormal basis vectors can do better. However, the minimum error is still nonzero if $k < d$: you lose information by projecting to lower dimensions. This trade-off is fundamental: fewer features are cheaper to compute but less expressive.

References

See Chapter 3.
G. Strang. Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press, 2016. Chapter 6 (Eigenvalues and Eigenvectors) covers PCA and spectral methods.
K. P. Murphy. Probabilistic Machine Learning: An Introduction. MIT Press, 2022. Section 20.1 (PCA) and Section 20.2 (Variants of PCA).
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. Section 2.7 (PCA and Whitening) explains PCA as dimensionality reduction and whitening for neural networks.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning (2nd ed.). Springer, 2009. Chapter 10 (Unsupervised Learning) covers PCA and its relationship to autoencoders.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Section 12.1 (Principal Component Analysis) provides thorough treatment of PCA, variance explained, and scree plots.
Y. Bengio, A. Courville, and P. Vincent. “Representation Learning: A Review and New Perspectives.” IEEE TPAMI 2013. Foundational review on learning basis vectors (representations) for deep networks.
L. McInnes, J. Healy, and J. Melville. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” arXiv:1802.03426, 2018. Modern nonlinear dimensionality reduction method extending PCA ideas.

Code Introduction

This code demonstrates two complementary concepts that establish basis and dimension as the organizing principles of vector spaces: (1) how one-hot encoding selects a basis vector (a dimension), and (2) how PCA discovers an optimal low-dimensional basis for representing high-dimensional data. Together, they show that dimension reduction is fundamentally about choosing a subspace basis that captures the essential structure of your data.

Solution (Python)


import numpy as np
from scripts.toy_data import toy_pca_points

E = np.array([[1., 2., 3.],
              [0.,-1., 1.]])
one_hot = np.array([0., 1., 0.])
print("lookup:", E@one_hot, "expected:", E[:,1])

X = toy_pca_points(n=20, seed=1)
mu = X.mean(0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T[:, :1]  # top-1 basis
x = X[0]
z = V.T @ (x-mu)
xhat = mu + V@z
print("z:", z, "xhat:", xhat, "x:", x)

History

Basis and dimension were formalized by Grassmann (1844) and later refined through linear algebra. Grassmann showed that any vector space has a basis—a minimal spanning set of independent vectors—and that all bases of the same space have the same cardinality: the dimension.

Dimension invariance: This fundamental insight is powerful: no matter how you choose a basis, the number of basis vectors is always the same. Sylvester and others built on this to develop rank-nullity theorem, connecting the dimension of the null space to the rank of a linear map.

Change of basis in ML: When we fit PCA (Example 11) or autoencoders, we are learning a new basis $V$ for data. The latent coordinate $z = V^\top(x - \mu)$ expresses the data point in this learned basis. Different applications demand different bases: PCA for variance, ICA for independence, sparse autoencoders for interpretability. Understanding basis changes is essential for feature learning in deep networks, where hidden layers learn successive basis transformations.