A basis is a minimal spanning set, or equivalently a linearly independent set that spans the space: every vector in the space can be expressed uniquely as a linear combination of basis vectors. The standard basis of $\mathbb{R}^d$ consists of the one-hot vectors $e_1, \dots, e_d$; any vector $x$ is trivially $x = \sum_i x_i e_i$. An orthonormal basis $v_1, \dots, v_d$ satisfies $v_i^\top v_j = \delta_{ij}$ (the inner product is 1 when $i = j$ and 0 otherwise), which makes computing coordinates especially simple: $z_i = v_i^\top x$, so that $x = \sum_i z_i v_i$. PCA finds an orthonormal basis aligned with the variance of the mean-centered data: the first principal component points in the direction of maximum variance, the second in the direction of maximum remaining variance orthogonal to the first, and so on.
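To make the coordinate formula and the PCA description concrete, here is a minimal NumPy sketch. The random data, the per-axis scales, and all variable names (`V`, `z`, `components`, `explained_variance`) are illustrative assumptions; PCA is computed here through an SVD of the centered data, which is one common way to do it rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary orthonormal basis of R^3: the Q factor of a random matrix.
# Columns of V play the role of v_1, v_2, v_3.
V, _ = np.linalg.qr(rng.normal(size=(3, 3)))

x = np.array([1.0, -2.0, 0.5])
z = V.T @ x        # coordinates z_i = v_i^T x
x_rebuilt = V @ z  # x = sum_i z_i v_i

# PCA on synthetic data whose axes have very different spreads (assumed scales).
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])
Xc = X - X.mean(axis=0)  # center the data first

# Rows of Vt are orthonormal principal directions, ordered by decreasing variance.
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt
explained_variance = S**2 / (len(X) - 1)

print("V is orthonormal:", np.allclose(V.T @ V, np.eye(3)))
print("x recovered from z:", np.allclose(x_rebuilt, x))
print("variance per component:", explained_variance)
```

Because `V` is orthonormal, no linear system has to be solved to obtain `z`; a single matrix-vector product suffices, which is the simplification the text refers to.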
In ML, basis choice is representation choice. One-hot vectors give a sparse representation of categorical data (one nonzero entry per item) but become inefficient when the vocabulary is large (millions of words or tokens). Learned embeddings replace each one-hot vector with a dense low-dimensional vector: multiplying a one-hot vector by an embedding matrix $E$ selects one row of $E$, and training arranges those rows so that semantically similar items have similar coordinates. Strictly speaking, $E$ is a linear map into a lower-dimensional space rather than a basis of the original one. PCA applies the same idea to continuous data: instead of storing $d$ coordinates per point, store $k \ll d$ principal coordinates and reconstruct approximately. This encode-decode pattern (project to low dimension, then reconstruct) underpins autoencoders, variational methods, and modern representation learning.
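The two ideas in this paragraph can be sketched in a few lines of NumPy. Everything here is an assumption made for illustration: the matrix `E` is random rather than learned, the sizes (`vocab_size`, `embed_dim`, `d`, `k`, `n`) are arbitrary, and the data is synthetic, generated to lie close to a `k`-dimensional subspace so that PCA reconstruction works well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Embedding lookup as a matrix product (E is a random stand-in for a learned matrix).
vocab_size, embed_dim = 10, 4
E = rng.normal(size=(vocab_size, embed_dim))

token_id = 7
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0
dense = one_hot @ E  # selects row token_id of E

# PCA encode/decode on synthetic data that is nearly k-dimensional.
d, k, n = 20, 3, 500
latent = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
X = latent @ mixing + 0.01 * rng.normal(size=(n, d))

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
W = Vt[:k].T  # d x k matrix whose columns are the top-k principal directions

Z = (X - mean) @ W      # encode: k coordinates per point instead of d
X_hat = Z @ W.T + mean  # decode: approximate reconstruction

rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X - mean)
print("lookup matches row of E:", np.allclose(dense, E[token_id]))
print("encoded shape:", Z.shape)
print("relative reconstruction error:", rel_error)
```

In practice an embedding layer indexes `E[token_id]` directly instead of forming the one-hot vector; the product is shown only to make the linear-algebra view explicit. The reconstruction error is small here because the data was built to be nearly low-rank, and real data will generally lose more.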