These concepts are foundational in modern linear algebra and appear throughout statistical learning theory and deep learning practice.
One-hot encoding and embedding lookup are ubiquitous in modern deep learning. When you use a categorical feature (e.g., a word in NLP or an item in a recommendation system), you first convert it to a one-hot vector: a coordinate vector with a single 1 and zeros elsewhere. This one-hot vector then indexes into a learned embedding matrix via matrix-vector multiplication. Embedding lookup is $Ex$, where $x \in \mathbb{R}^{|V|}$ is one-hot and $E \in \mathbb{R}^{d_{emb} \times |V|}$ is the embedding matrix with one column per category. This algebraic view reveals that embeddings act as learned basis vectors: each column of $E$ is a continuous representation of a category, and the one-hot vector selects which column (which category's representation) to use. The term "basis" is used loosely here, since the columns cannot be linearly independent when $|V| > d_{emb}$. Transformer models such as BERT and GPT use this mechanism, although in practice the product is implemented as a direct index lookup rather than an explicit multiplication. The embedding matrix is learnable, so the network learns representations that minimize the training loss, discovering vectors aligned with the task structure.
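The equivalence between multiplying by a one-hot vector and selecting a column can be checked numerically. The following is a minimal NumPy sketch, not a production embedding layer; the sizes, the random matrix standing in for a trained $E$, and the helper `one_hot` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4-dimensional embeddings, 6 categories.
d_emb, vocab_size = 4, 6

# Random stand-in for a learned embedding matrix, one column per category.
E = rng.normal(size=(d_emb, vocab_size))


def one_hot(index, size):
    """Return a length-`size` vector with a single 1 at position `index`."""
    x = np.zeros(size)
    x[index] = 1.0
    return x


category = 3
x = one_hot(category, vocab_size)

via_matmul = E @ x            # algebraic view: matrix-vector product
via_index = E[:, category]    # practical view: pick out one column

lookup_matches = np.allclose(via_matmul, via_index)
print(lookup_matches)
```

Both routes give the same vector, which is why frameworks can skip the multiplication and index the matrix directly.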
Principal Component Analysis (PCA) discovers an optimal low-dimensional basis for your data by solving an eigenvalue problem. Given centered data $X_c$ with one sample per row, PCA finds the directions of maximum variance: the eigenvectors of the covariance matrix $\frac{1}{n-1} X_c^T X_c$, which are exactly the columns of $V$ in the SVD factorization $X_c = U \Sigma V^T$. The leading columns of $V$ form an orthonormal basis for a lower-dimensional subspace that captures as much data variance as possible. PCA helps answer the question "What is the intrinsic dimensionality of my data?" by revealing how much variance is explained by each principal component. That variance is $\sigma_i^2 / (n-1)$, where $\sigma_i$ is the $i$-th singular value, so it is proportional to the squared singular value rather than the singular value itself. Data that appears high-dimensional may actually have most of its structure along a few principal directions. This insight drives dimensionality reduction, compression, noise filtering, and visualization (projecting high-dimensional data onto 2D or 3D principal subspaces for plotting).
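The relationship between the SVD and the covariance eigenvalues can be sketched in a few lines of NumPy. This is an illustrative example on synthetic data, assuming samples are stored as rows; the data-generating recipe (two strong latent directions plus small noise) and all variable names are assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5

# Synthetic data: two strong latent directions embedded in 5 dimensions,
# plus a small amount of isotropic noise.
latent = rng.normal(size=(n, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, d))
X = latent @ mixing + 0.1 * rng.normal(size=(n, d))

# Center the data, then factor X_c = U diag(s) V^T.
mu = X.mean(axis=0)
X_c = X - mu
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)

# Rows of Vt (columns of V) are the principal directions.
# Variance along each direction is the squared singular value over (n - 1).
explained_variance = s**2 / (n - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Cross-check against the eigenvalues of the sample covariance matrix.
cov = X_c.T @ X_c / (n - 1)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order

print(np.round(explained_ratio, 4))
```

Because the data was built from two latent directions, the first two entries of `explained_ratio` should account for nearly all of the variance, which is the "few principal directions" effect described above.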
The encode-decode paradigm central to representation learning works as follows: given a basis $V$ with orthonormal columns (learned via PCA, or trained via a neural network) and the data mean $\mu$, you encode a data point $x$ into low-dimensional coordinates $z = V^T(x - \mu)$, then decode back via $\hat{x} = \mu + Vz$. The reconstruction error $\|x - \hat{x}\|_2$ measures how much information is lost by using only the basis $V$. Autoencoders generalize this by learning the encoding and decoding functions (via neural networks) jointly: $z = f_{\text{enc}}(x)$ and $\hat{x} = f_{\text{dec}}(z)$. This pattern enables unsupervised feature learning: the network discovers a representation that preserves information about the data distribution, and downstream tasks can use the learned coordinates $z$ instead of raw features. Variational autoencoders add probabilistic structure, and the same principle underlies much of representation learning in modern deep networks.
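The linear encode-decode round trip can be written out directly. The sketch below assumes a PCA basis obtained from an SVD of synthetic data; the helper names `encode` and `decode`, the sizes, and the random data are hypothetical choices for illustration, and it covers only the linear case rather than a neural autoencoder.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 100, 6, 2  # hypothetical: 100 samples, 6 features, 2 kept directions

# Synthetic correlated data, one sample per row.
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
mu = X.mean(axis=0)

# PCA basis: top-k right singular vectors of the centered data.
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
V = Vt[:k].T  # shape (d, k), orthonormal columns


def encode(x, V, mu):
    """Coordinates of x in the basis V: z = V^T (x - mu)."""
    return V.T @ (x - mu)


def decode(z, V, mu):
    """Map coordinates back to the original space: x_hat = mu + V z."""
    return mu + V @ z


x = X[0]
z = encode(x, V, mu)
x_hat = decode(z, V, mu)
error = np.linalg.norm(x - x_hat)  # information lost by keeping k directions

# Keeping all d directions makes the round trip lossless (up to rounding).
V_full = Vt.T
error_full = np.linalg.norm(x - decode(encode(x, V_full, mu), V_full, mu))

print(z.shape, error, error_full)
```

Encoding `x_hat` again returns the same `z`, since `x_hat` already lies in the subspace spanned by `V`; an autoencoder replaces the two matrix products with learned nonlinear functions and generally has no such exact guarantee.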