Part 1: Canonical Basis and One-Hot Indexing
The canonical basis $\{e_1, \ldots, e_d\}$ (where $e_j$ has a 1 in position $j$ and 0s elsewhere) has the remarkable property that $E e_j$ simply selects the $j$-th column of any matrix $E$. In embedding layers, this means word lookup is just matrix-vector multiplicationâthe discrete âfind word $j$ in the vocabularyâ operation is a linear map $e_{\text{word}} = E e_j$. This unifies symbolic indexing with numerical linear algebra, letting GPUs efficiently compute millions of lookups in parallel via batched matrix multiplication.
Part 3: Rank-$k$ Projection and Low-Rank Approximation
Projecting data onto the rank-$k$ basis, $\hat{X} = X_c V_k V_k^\top$, discards directions with small singular values (low variance). The reconstruction error is $\|X_c - \hat{X}\|_F^2 = \sum_{j > k} \sigma_j^2$. For $k = 1$, we retain only the top principal component (direction of maximum spread). For small $k \ll d$, we get aggressive compression, useful for visualization (2D/3D plot) and noise reduction; large $k$ near $d$ is lossless but offers no compression.
Why This Matters for ML
Basis selection is a critical inductive bias. Fixed bases (e.g., polynomial features, Fourier basis) commit to domain knowledge upfront; learned bases (PCA, neural embeddings) adapt to data. The right basis makes learning efficient (fewer parameters, faster convergence); the wrong basis forces the model to waste capacity on uninformative directions. Understanding basis selection deepens your intuition for feature engineering, dimensionality reduction, and model design.
ML Examples and Patterns
Embedding lookup: Word2Vec, GloVe, FastText: $e_{\text{word}} = E e_j$. The embedding matrix $E$ is learned to minimize loss (e.g., skip-gram); its columns form a semantic basis.
Autoencoders: Encoder learns basis $V$ (via neural network); $z = f_{\text{enc}}(x)$ are coordinates. Decoder reconstructs $\hat{x} = f_{\text{dec}}(z)$. Bottleneck width controls basis dimension.
PCA preprocessing: Reduce $d$-dimensional features to $k < d$ PCA coordinates, then pass to downstream model. Common in classical ML (SVM, k-NN); less common in deep learning (which learns its own basis).
Attention mechanism: Query/Key/Value projections compute soft bases dynamically. $Q, K \in \mathbb{R}^{n \times d_k}$ define a basis; attention weights select which basis directions are relevant for each position.
Transfer learning: Pre-trained models (BERT, ResNet, GPT-3) learn rich bases from large datasets. Fine-tuning on small downstream tasks reuses this basis, avoiding learning from scratch.
Connection to Linear Algebra Theory
Basis theory is central to linear algebra: - Rank-nullity theorem: For matrix $A \in \mathbb{R}^{m \times n}$, $\text{rank}(A) + \text{nullity}(A) = n$. Basis of column space has $\text{rank}(A)$ vectors; basis of null space has $\text{nullity}(A)$ vectors. - Fundamental Theorem of Linear Algebra: Four fundamental subspaces (column space, row space, null space, left null space) are partitioned by basis selection. - Spectral Theorem: Symmetric matrices have orthonormal eigenbasis. Covariance matrices $X^\top X$ have eigenbasis = PCA basis. - SVD as universal basis: Every matrix has SVD; the right singular vectors are an orthonormal basis for the column space (data basis), left vectors for the row space (dual basis).
Numerical and Implementation Notes
Centering is mandatory: Omitting centering gives wrong PCs. $X_c = X - \mathbf{1} \mu^\top$ (subtract mean) is a rank-1 modification; it projects onto the zero-mean hyperplane. Algebraically essential; numerically cheap ($O(nd)$).
SVD algorithm choice: np.linalg.svd uses LAPACK DGESDD (divide-and-conquer), which is fast and stable for $n, d$ up to ~10K. For larger problems, use randomized SVD (e.g., sklearn.decomposition.TruncatedSVD or scipy.sparse.linalg.svds).
Memory efficiency: Store $V_k$ (not $V$) and $\Sigma_k$ (not $\Sigma$); recompute coordinates on demand. For $k = 50, d = 1000, n = 1M$, storing all $U$ costs 40 GB; storing only $V_k$ costs 400 MB.
GPU acceleration: torch.linalg.svd and tf.linalg.svd are GPU-accelerated. Batch PCA on millions of points is practical; on CPU it can be slow.
Numerical and Shape Notes
- Shapes to track: $X \in \mathbb{R}^{n \times d}$ (data), $\mu \in \mathbb{R}^d$ (mean), $X_c = X - \mathbf{1}\mu^\top \in \mathbb{R}^{n \times d}$ (centered), $U \in \mathbb{R}^{n \times d}, \Sigma \in \mathbb{R}^{d \times d}, V \in \mathbb{R}^{d \times d}$ (thin SVD).
- Truncated shapes: $V_k \in \mathbb{R}^{d \times k}$ (first $k$ PCs), $Z = X_c V_k \in \mathbb{R}^{n \times k}$ (coordinates), $\hat{X} = Z V_k^\top \in \mathbb{R}^{n \times d}$ (reconstruction).
- Single-point projection: $z = V_k^\top (x - \mu) \in \mathbb{R}^k$ (coordinates), $\hat{x} = \mu + V_k z \in \mathbb{R}^d$ (reconstructed point).
- Orthonormality verification: Check
V.T @ V â I numerically; if not, SVD failed or precision was lost.
Pedagogical Significance
Understanding basis selection is a gateway to: 1. Dimensionality reduction: Why PCA works; how to choose $k$. 2. Least squares: Column space of $X$ is spanned by basis $V$ (if $X$ is tall). Normal equations project onto this basis. 3. Covariance structure: Basis of eigenvectors decorrelates covariance matrix (diagonalization). 4. Neural networks: Hidden layers learn basis transformations; deeper networks compose bases. 5. Geometric intuition: Basis changes correspond to rotations and scalings; they reveal hidden structure (clusters, manifolds, trends).
This example is a capstone to Chapters 1â3 (vector spaces, span, basis, dimension). It ties algebra (matrix operations) to geometry (subspaces, projections) to ML (feature extraction, compression, learning).
Comments