PCA traces to Pearson (1901) and Hotelling (1933), formalized as variance-maximizing projections. The Eckart-Young theorem (1936) established that PCAâs low-rank approximation is optimal in the Frobenius normâno other rank-$k$ matrix is closer to $X_c$. In ML, PCA serves multiple roles: dimensionality reduction (compress high-$d$ data to interpretable features), denoising (discard low-variance noise), visualization (2D/3D projection), and variance diagnostics (interpret which directions matter). Modern variants include kernel PCA (nonlinear), robust PCA (low-rank + sparse), and online/incremental PCA (streaming data).
- Log in to post comments
Build deep intuition for PCA as both a variance-maximizing projection and an optimal low-rank approximation. You will understand how explained variance ratios (EVR) quantify information capture, how rank-$k$ reconstruction minimizes MSE, and why PCA is the Eckart-Young solution. Come away understanding the dual interpretation: (1) find directions of maximum variance, (2) find the best low-rank approximation to the data.
Compute explained variance ratios for k=1 and k=2 and rank-1 reconstruction MSE on toy dataset.
Given centered data $X_c \in \mathbb{R}^{n\times d}$ with SVD $X_c = U\Sigma V^\top$, the explained variance ratio for the top-$k$ principal components is:
\[ \text{EVR}_k = \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{d} \sigma_i^2} = \frac{\sum_{i=1}^{k} \sigma_i^2}{\|X_c\|_F^2}, \]
where the denominator is the Frobenius norm squared (total variance). The rank-$k$ reconstruction is:
\[ \hat{X}_{c,k} = U_k \Sigma_k V_k^\top \in \mathbb{R}^{n \times d}, \]
where $U_k \in \mathbb{R}^{n\times k}$, $\Sigma_k \in \mathbb{R}^{k\times k}$, $V_k^\top \in \mathbb{R}^{k\times d}$. The reconstruction error is:
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left\|x_i - \hat{x}_i\right\|_2^2, \]
which equals $\left(1 - \text{EVR}_k\right) \|X_c\|_F^2 / n$ (Eckart-Young optimality).
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
import numpy as np
from scripts.toy_data import toy_pca_points
X = toy_pca_points(n=60, seed=3)
mu = X.mean(0)
Xc = X - mu
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
evr1 = (S[0]**2) / np.sum(S**2)
evr2 = (S[0]**2 + S[1]**2) / np.sum(S**2)
Xc1 = np.outer(U[:,0], Vt[0]) * S[0]
X1 = mu + Xc1
mse1 = np.mean(np.sum((X - X1)**2, axis=1))
print("EVR1:", evr1)
print("EVR2 (should be ~1):", evr2)
print("MSE rank-1:", mse1)This snippet performs a PCA via SVD on centered data to measure variance capture and reconstruction error. The data matrix $X \in \mathbb{R}^{60 \times 2}$ is centered as $X_c = X - \mu$, then factored as $X_c = U \Sigma V^\top$ with full_matrices=False, so $U \in \mathbb{R}^{60 \times 2}$, $S \in \mathbb{R}^2$, $V^\top \in \mathbb{R}^{2 \times 2}$.
Explained variance ratios use the singular values: for 2D data, the eigenvalues of the covariance are proportional to $S_i^2$. The code computes $\text{EVR}_1 = S_1^2 / \sum_i S_i^2$ (variance captured by the first principal component) and $\text{EVR}_2 = (S_1^2 + S_2^2) / \sum_i S_i^2 \approx 1$ (both components capture all variance).
A rank-1 reconstruction keeps only the top component: $\hat{X}_{c,1} = S_1 \, u_1 v_1^\top$ via np.outer(U[:,0], Vt[0]) * S[0], then shifts back by the mean $X_1 = \mu + \hat{X}_{c,1}$. The mean squared reconstruction error is $\text{MSE} = \frac{1}{n} \sum_i \|x_i - \hat{x}_i\|_2^2$, reported as mse1. A low mse1 indicates that one principal direction captures most of the structure.
Key insights: - $\text{EVR}_k$ measures information loss when truncating to rank-$k$. - The reconstruction $\hat{X}_{c,k}$ is the Eckart-Young optimal rank-$k$ approximation. - Centering is critical: non-centered data gives misleading principal directions.
Numerical and Shape Notes
- Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
- Axis discipline: Be explicit with
axisin reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1. - Broadcasting: Check that broadcasts are intended (e.g.,
(n,1)with(n,d)). Prefer reshape/expand-dims to make semantics clear. - Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
- Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
- Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
- Sanity checks: Print shapes and residuals (e.g.,
||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.
Numerical and Implementation Notes
- Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use
solve,lstsq, Cholesky/QR/SVD. - Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (
axis) and ensure broadcasts are intended. - Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
- Conditioning: Inspect
np.linalg.cond(A)when solutions look unstable; regularize (ridge) or rescale features to improve conditioning. - Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g.,
||Ax-b||, reconstruction errors) and assert finiteness. - Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
- Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
- Sanity checks: Compare against references (e.g.,
lstsqvs.solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
- Generate 2D data:
X = toy_pca_points(n=60, seed=3)creates $X \in \mathbb{R}^{60 \times 2}$. - Center data:
mu = X.mean(0),Xc = X - mugives $X_c \in \mathbb{R}^{60 \times 2}$ with zero mean. - SVD:
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)yields $U \in \mathbb{R}^{60 \times 2}$, $S \in \mathbb{R}^2$, $V^\top \in \mathbb{R}^{2 \times 2}$. - EVR computation:
S**2 / np.sum(S**2)gives variance fractions; cumsum for cumulative EVR. - Rank-1 reconstruction:
np.outer(U[:,0], Vt[0]) * S[0]forms $\sigma_1 u_1 v_1^\top$ efficiently. - Shift back:
X1 = mu + Xc1recovers original space coordinates. - MSE: average per-row reconstruction error across all $n$ samples.
Pedagogical Significance
- Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
- ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
- Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
- Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
- Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
- Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.
ML Examples and Patterns
- Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
- Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
- Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
- Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
- Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
- Explained variance ratio (EVR) for $k=1$ and $k=2$: $\text{EVR}_k = \sum_{i=1}^{k} \sigma_i^2 / \sum_i \sigma_i^2$.
- EVR2 $\approx 1$ for 2D data: the two principal components capture all variance.
- Rank-1 reconstruction: $\hat{X}_{c,1} = \sigma_1 u_1 v_1^\top$, then shift back by mean to get $X_1 \in \mathbb{R}^{n \times d}$ in original space.
- Reconstruction MSE: $\text{MSE} = \frac{1}{n} \sum_i \|x_i - \hat{x}_i\|_2^2$ measures average per-sample error.
- Eckart-Young optimality: rank-1 reconstruction minimizes MSE among all rank-1 matrices (canonical result).
- Dimensionality reduction and visualization: keeping top-$k$ directions enables compression and interpretable plots.
- Part 1: Explained Variance Ratios and Dimensionality. For 2D data, $\text{EVR}_1 \in (0,1)$ measures the fraction of variance in the top direction; $\text{EVR}_2$ adds the second direction. For $d$-dimensional data, $\sum_{i=1}^d \text{EVR}_i = 1$ (all variance accounted for). Small EVR for high-$i$ indicate noisy/redundant dimensions.
- Part 2: Rank-$k$ Reconstruction and Eckart-Young. The rank-$k$ matrix closest to $X_c$ in Frobenius norm is $\hat{X}_{c,k} = U_k \Sigma_k V_k^\top$ (PCAâs reconstruction). The optimal reconstruction error is the sum of squared singular values beyond rank-$k$: $\sum_{i=k+1}^d \sigma_i^2$, hence MSE $= (1 - \text{EVR}_k) \|X_c\|_F^2 / n$.
- Part 3: Centering and Variance Interpretation. Centering ensures $X_c$ has zero mean, so variance = total squared norm / $n$. Forgetting to center changes the principal directions (they align with the mean, not variance). Always center before PCA.
- Why This Matters for ML. PCA enables interpretable dimensionality reduction (e.g., 784-D MNIST digits to 50-D features), denoising (discard low-variance directions), and visualization (project to 2D/3D). EVR tells practitioners how much variance is lost; reconstruction error guides choice of $k$.
- ML Examples and Patterns. Face recognition: PCA on face images gives âeigenfacesâ (interpretable directions). Anomaly detection: points far from the subspace spanned by top-$k$ PCs are outliers. Denoising autoencoders: learn similar bottleneck by truncating small singular values. Whitening: divide by $\sigma_i$ to equalize variance across directions.
- Connection to Linear Algebra Theory. SVD directly yields PCA: principal directions are columns of $V$ (right singular vectors). Covariance eigendecomposition: $\text{Cov}(X_c) = \frac{1}{n-1} X_c^\top X_c = V \frac{\Sigma^2}{n-1} V^\top$, so eigenvalues are $\sigma_i^2 / (n-1)$. Eckart-Young theorem (1936): $\hat{X}_{c,k} = \arg\min_{\text{rank}(M) \le k} \|X_c - M\|_F$.
- Numerical and Implementation Notes. Use SVD-based PCA (via
np.linalg.svdorsklearn.decomposition.PCA) not eigendecomposition of $X_c^\top X_c$ (squares the condition number). For sparse $X$, randomized SVD (sklearn.utils.extmath.randomized_svd) approximates top-$k$ components efficiently. Online PCA (Ojaâs rule) updates PCs incrementally from streaming data. - Numerical and Shape Notes. $X \in \mathbb{R}^{n\times d}$, $X_c \in \mathbb{R}^{n\times d}$, $U \in \mathbb{R}^{n\times d}$ (with
full_matrices=False), $S \in \mathbb{R}^d$, $V^\top \in \mathbb{R}^{d \times d}$. Reconstruction: $\hat{X}_{c,k} \in \mathbb{R}^{n\times d}$ (same shape as $X_c$). - Pedagogical Significance. This example grounds the abstract PCA formula in concrete computation and interpretation. EVR answers âhow much variance?â while MSE answers âhow close is the approximation?â. Understanding the duality (variance maximization = optimal low-rank approximation) is fundamental to all dimensionality reduction.
Historical Foundations
Principal Component Analysis emerged from the foundational work of Karl Pearson in 1901, who proposed finding lines of best fit through data that maximize variance. Harold Hotelling refined this concept in 1933, developing the modern PCA algorithm that rotates coordinates to align with maximum-variance directions. The Eckart-Young theorem of 1936 provided the crucial optimality result that PCAâs rank-$k$ approximation is the best low-rank reconstruction in the Frobenius normâa fact that remained central to ML dimensionality reduction for decades. Turk and Pentlandâs 1991 work on eigenfaces brought PCA into computer vision and established it as a canonical dimension-reduction tool. Modern randomized SVD methods (Halko, Martinsson, Tropp 2011) extended PCA to petabyte-scale datasets, while kernel PCA (Schölkopf, Smola, Müller 1998) and online/incremental PCA variants (Oja 1982, Ross et al. 2008) adapted the method to nonlinear and streaming settings.
Modern Applications in ML
PCA remains one of the most widely deployed algorithms in machine learning and data science. In face recognition, eigenfaces (Turk-Pentland 1991) compress facial images to 50â200 principal components, enabling fast recognition with minimal storage. In anomaly detection, normal data is projected onto PCA subspace and points far from reconstruction are flagged as anomaliesâa technique used in fraud detection and system monitoring. Denoising autoencoders are neural-network generalizations of PCA that learn nonlinear low-rank representations; their linear counterpart is still the standard baseline. In feature preprocessing, PCA whitening (scaling by $\Sigma^{-1/2}$) decorrelates features and stabilizes training in deep networks and ridge regression. Visualization relies on 2D PCA (or nonlinear t-SNE/UMAP variants) to explore structure in high-dimensional data. In matrix completion and recommender systems, PCA approximations initialize collaborative filtering algorithms. Modern variants like incremental PCA (adding new examples without recomputing from scratch) and kernel PCA (capturing nonlinear manifolds via Mercer kernels) extend the basic method to challenging settings. EVR remains the standard diagnostic for choosing $k$: practitioners typically retain 85â95% of variance, balancing reconstruction quality against model simplicity and overfitting risk.
- Links to Chapter 11 (PCA): this example is the canonical PCA workflowâcenter, SVD, compute EVR, reconstruct.
- Connects to Chapter 10 (SVD): SVD factorization enables fast PCA computation and Eckart-Young reconstruction.
- Relates to Chapter 8 (eigenvalues): eigenvalues of covariance $X_c^\top X_c / (n-1)$ equal $\sigma_i^2 / (n-1)$.
- Bridges to Chapter 12 (least squares): PCA via normal equations (not recommended; SVD is stable).
- Complements Chapter 14 (conditioning): small singular values indicate near-redundancy; PCA truncation avoids overfitting from noise dimensions.
Comments