Part 1: One-hot lookup and basis columns
- A one-hot vector $x = e_j$ selects column $j$ of $E$: $Ex = E e_j = E_{:,j}$.
- Coordinates are the coefficients of a vector in the chosen basis; in the standard basis, the basis vectors are the one-hot vectors $e_1, \dots, e_n$.
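
The lookup identity $Ex = E_{:,j}$ can be checked directly. The following is a minimal sketch in NumPy; the matrix values and the index `j` are made up for illustration, and only the $(2,3)$ shape follows the notes.

```python
import numpy as np

# Hypothetical 2x3 embedding matrix: 3 tokens, each stored as a
# 2-dimensional column (values are invented for illustration).
E = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

j = 1                      # index of the token to look up (assumed)
x = np.zeros(E.shape[1])   # one-hot vector e_j, shape (3,)
x[j] = 1.0

via_matmul = E @ x         # matrix-vector product, shape (2,)
via_index = E[:, j]        # direct column selection, shape (2,)

print(via_matmul)          # [2. 5.]
print(np.array_equal(via_matmul, via_index))  # True
```

In practice libraries index the column directly rather than multiplying by a one-hot vector, since the product does the same work with many multiplications by zero.
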
Part 2: PCA coordinates and projection
- Center the data: $X_c = X - \mu$, where $\mu$ is the mean of the rows of $X$; then take the SVD $X_c = U \Sigma V^\top$.
- $V_k$ (the first $k$ columns of $V$) is an orthonormal basis of the principal subspace.
- Coordinates: $z = V_k^\top (x - \mu)$; reconstruction: $\hat{x} = \mu + V_k z$.
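
The three steps (center, decompose, project) might look like the sketch below. The data are randomly generated and purely hypothetical; the shapes $(15,2)$ and $k=1$ are chosen to match the shape notes, and the mixing matrix and offset are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 15 samples in 2 dimensions, one sample per row.
X = rng.normal(size=(15, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])
X = X + np.array([3.0, -1.0])          # shift so the mean is not zero

mu = X.mean(axis=0)                    # per-feature mean, shape (2,)
Xc = X - mu                            # centered data, shape (15, 2)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1
Vk = Vt[:k].T                          # first k principal directions as columns, (2, 1)

x = X[0]                               # one data point, shape (2,)
z = Vk.T @ (x - mu)                    # coordinates in the PCA basis, shape (1,)
x_hat = mu + Vk @ z                    # reconstruction, shape (2,)

# The residual is orthogonal to the principal subspace.
residual = x - x_hat
print(np.allclose(Vk.T @ residual, 0.0))
```
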
Part 3: Degrees of freedom and geometry
- $k$ controls how much information is retained: a smaller $k$ compresses more, a larger $k$ preserves more variance.
- Orthogonal projection minimizes the reconstruction error $\|x - \hat{x}\|_2$ over all $\hat{x}$ in the affine subspace $\mu + \operatorname{span}(V_k)$.
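
To make the trade-off concrete, the sketch below computes the fraction of variance retained and the reconstruction error for each $k$, then checks the optimality claim for one point. The data, the helper `reconstruction_error`, and the perturbation size are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 200 samples in 5 dimensions with decreasing spread per axis.
X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

mu = X.mean(axis=0)
Xc = X - mu
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of total variance retained by the first k components, k = 1..5.
retained = np.cumsum(S**2) / np.sum(S**2)

def reconstruction_error(k):
    """Frobenius norm of X minus its k-component PCA reconstruction."""
    Vk = Vt[:k].T
    X_hat = mu + (Xc @ Vk) @ Vk.T
    return np.linalg.norm(X - X_hat)

errors = np.array([reconstruction_error(k) for k in range(1, 6)])

# Optimality check for a single point: perturbing the coordinates z moves
# x_hat within the subspace and can only increase the error.
k = 2
Vk = Vt[:k].T
x = X[0]
z = Vk.T @ (x - mu)
err_projection = np.linalg.norm(x - (mu + Vk @ z))
err_perturbed = np.linalg.norm(x - (mu + Vk @ (z + 0.1)))

print(retained)
print(errors)
print(err_projection < err_perturbed)
```

The retained fraction grows with $k$ while the error shrinks, which is the compression-versus-fidelity trade-off described above.
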
Pedagogical Notes
- Basis choice affects interpretability; PCA basis aligns with variance directions.
- One-hot coordinates in embedding tables show that a lookup is a matrix product.
- Verify shapes to prevent silent broadcasting errors; check orthonormality $V_k^\top V_k = I$.
ML Examples and Patterns
- Embedding lookup: NLP models use $Ex$ to retrieve token embeddings.
- PCA preprocessing: reduce dimension for visualization or regression while preserving as much variance as possible.
- Whitening: a change of basis followed by rescaling each coordinate to unit variance.
- Autoencoders: learn a data-adaptive basis similar to PCA; a linear autoencoder trained with squared error spans, at its optimum, the same subspace as PCA.
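
Of these patterns, whitening is the easiest to verify numerically. The sketch below uses PCA whitening (rotate into the PCA basis, then divide each coordinate by its standard deviation); this is one common variant, chosen here as an assumption, and the data and mixing matrix `A` are hypothetical. It assumes the centered data have full column rank so that no singular value is zero.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical correlated data: 100 samples in 3 dimensions.
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.3, 0.2]])
X = rng.normal(size=(100, 3)) @ A

mu = X.mean(axis=0)
Xc = X - mu
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)

n = X.shape[0]
std = S / np.sqrt(n - 1)        # standard deviation along each principal direction

# Change of basis (Xc @ Vt.T), then rescale each coordinate to unit variance.
Z = (Xc @ Vt.T) / std

cov_Z = np.cov(Z, rowvar=False) # sample covariance of the whitened data
print(np.allclose(cov_Z, np.eye(3)))
```
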
Connection to Linear Algebra Theory
- Vector spaces: coordinates express vectors in a basis; a change of basis is a linear isomorphism.
- Orthonormal bases: changing between them preserves lengths and angles, and projection reduces to inner products with the basis vectors.
- SVD: guarantees orthonormal bases for the row and column spaces of the data matrix.
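Numerical and Implementation Notes
- Always center before PCA; otherwise the leading component tends to point toward the mean rather than along the direction of greatest variance.
- Use SVD (np.linalg.svd) for stability; forming the covariance matrix $X_c^\top X_c$ squares the condition number, so avoid it for very ill-conditioned data unless handled with care.
- Confirm $V_k^\top V_k = I$ numerically, using a tolerance to account for floating-point error.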
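
The effect of skipping the centering step can be seen on a small synthetic example. In the sketch below the data are hypothetical, built so that the variance lies mostly along the first axis while the mean sits far out along the second; the tolerance `1e-10` is an assumed value, not a recommendation from the notes.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: high variance along axis 0, large mean offset along axis 1.
X = rng.normal(size=(500, 2)) * np.array([1.0, 0.2]) + np.array([0.0, 10.0])
mu = X.mean(axis=0)

_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)        # no centering
_, _, Vt_cen = np.linalg.svd(X - mu, full_matrices=False)   # centered

v_raw = Vt_raw[0]                     # leading direction without centering
v_cen = Vt_cen[0]                     # leading direction with centering
mean_dir = mu / np.linalg.norm(mu)    # unit vector toward the mean

# Absolute cosines, since singular vectors are defined only up to sign.
cos_raw_mean = abs(v_raw @ mean_dir)  # alignment of uncentered direction with the mean
cos_cen_axis = abs(v_cen[0])          # alignment of centered direction with axis 0

# Orthonormality check with an explicit floating-point tolerance.
k = 1
Vk = Vt_cen[:k].T
is_orthonormal = np.allclose(Vk.T @ Vk, np.eye(k), atol=1e-10)

print(cos_raw_mean, cos_cen_axis, is_orthonormal)
```

For this data both cosines should be close to 1: without centering the leading direction follows the mean, and with centering it follows the high-variance axis.
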
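Numerical and Shape Notes
- Shapes: $E$ is $(2,3)$, $x$ is $(3,)$, $Ex$ is $(2,)$; $X$ is $(15,2)$, Vt is $(2,2)$, $V_k$ is $(2,1)$, $z$ is $(1,)$, $\hat{x}$ is $(2,)$.
- Distinguish row and column vectors: the rows of Vt are the principal directions, so Vt[:k].T yields them as the columns of $V_k$.
- Broadcasting: ensure matrix-vector products use matching shapes; elementwise operations such as subtraction broadcast silently when an $(n,)$ array meets an $(n,1)$ array.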
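
Two common ways for shapes to go wrong without an error are sketched below. The arrays are made up, and `check_shape` is a hypothetical helper introduced only for this illustration, not a NumPy function.

```python
import numpy as np

E = np.arange(6.0).reshape(2, 3)   # hypothetical embedding matrix, shape (2, 3)
x = np.array([0.0, 1.0, 0.0])      # one-hot vector, shape (3,)

matvec = E @ x        # true matrix-vector product, shape (2,)
elementwise = E * x   # broadcasts x across the rows instead, shape (2, 3)

v = np.array([1.0, 2.0])           # a data point, shape (2,)
mu_flat = np.array([0.5, 0.5])     # mean as a flat vector, shape (2,)
mu_col = mu_flat.reshape(2, 1)     # same mean stored as a column, shape (2, 1)

good = v - mu_flat    # shape (2,), as intended
bad = v - mu_col      # silently broadcasts to shape (2, 2)

def check_shape(a, expected):
    """Hypothetical helper: raise a clear error on an unexpected shape."""
    if a.shape != expected:
        raise ValueError(f"expected shape {expected}, got {a.shape}")
    return a

check_shape(matvec, (2,))
check_shape(good, (2,))
print(elementwise.shape, bad.shape)   # (2, 3) (2, 2)
```

Calling `check_shape(bad, (2,))` would raise a `ValueError`, turning the silent broadcast into a visible failure.
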
ML Context: From Attention to Transformers
- Attention uses basis-like projections: $Q = XW_Q$, $K = XW_K$, $V = XW_V$ define learned bases for queries, keys, and values. Here $V$ denotes the value matrix, not the right singular vectors of the SVD.
- Coordinates (attention weights) are data-dependent; each output is a weighted sum of value vectors.
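
A single attention head can be sketched in a few lines. The notes give only the projections, so the softmax and the $1/\sqrt{d_k}$ scaling below are assumptions taken from the standard scaled dot-product formulation; the sizes and random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_model, d_k = 4, 8, 3               # hypothetical sequence length and widths

X = rng.normal(size=(n, d_model))       # one row per token
W_Q = rng.normal(size=(d_model, d_k))   # random stand-ins for learned projections
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V     # each has shape (n, d_k)

# Scaled dot-product scores, then a row-wise softmax (assumed formulation).
scores = Q @ K.T / np.sqrt(d_k)                  # shape (n, n)
scores -= scores.max(axis=1, keepdims=True)      # subtract row max for stability
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)    # each row sums to 1

out = weights @ V                       # each output row is a weighted sum of rows of V
print(out.shape)                        # (4, 3)
```

Because each row of `weights` is non-negative and sums to 1, every output row is a convex combination of the value vectors, with coefficients that depend on the input.
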
Pedagogical Significance
- This example connects discrete lookup (one-hot basis) with continuous change-of-basis (PCA), building intuition for coordinates and projections used across ML.