Example number
59
Example slug
example_59_pca_bookkeeping_evr_and_reconstruction_error
Background

Historical context: Principal Component Analysis traces to Pearson (1901), who framed it as finding lines of closest fit to point clouds. Hotelling (1933) recast PCA as variance maximization and introduced the term “principal components.” The connection to SVD crystallized with Eckart & Young (1936), who proved that the rank-$k$ SVD truncation is the optimal low-rank approximation in Frobenius norm. Modern PCA is ubiquitous: dimensionality reduction (feature extraction), visualization (2D/3D projections), denoising (signal vs. noise separation), and compression (image/video encoding).

Mathematical characterization: The SVD $X_c = U \Sigma V^\top$ (where $X_c = X - \mu$ is centered) yields principal components as right singular vectors $V$. The variance along the $i$-th component is $\frac{\sigma_i^2}{n-1}$ (singular value squared, scaled by sample size). The explained variance ratio is $\text{EVR}_k = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^d \sigma_i^2}$, measuring what fraction of total variance is retained by top-$k$ components. The rank-$k$ reconstruction $\hat{X}_c^{(k)} = \sum_{i=1}^k \sigma_i u_i v_i^\top$ is the best least-squares fit to $X_c$ by a $k$-dimensional subspace.

Prevalence in ML: PCA is the default preprocessing step for high-dimensional data: reduces computational cost, removes collinearity, and improves visualization. Used in face recognition (Eigenfaces), genomics (gene expression analysis), recommender systems (collaborative filtering via matrix factorization), and neural network initialization (whitening inputs). EVR guides hyperparameter choice: keep components explaining 95% variance (common heuristic). Reconstruction error diagnoses overfitting: if test reconstruction error is much higher than train, the low-rank structure doesn’t generalize.

Purpose

Build intuition for PCA as low-rank approximation via SVD: how explained variance ratio (EVR) quantifies information retention, how rank-$k$ reconstruction works via singular vector outer products, and how mean squared error (MSE) measures reconstruction quality. Understand the trade-off between dimensionality reduction (keeping fewer components) and information loss (reconstruction error), and learn to diagnose when low-rank structure exists in data. Master the mechanics of centering, SVD decomposition, variance accounting, and reconstruction—the foundation of dimensionality reduction, denoising, and compression in ML.

Problem

Compute explained variance ratios for k=1 and k=2 and rank-1 reconstruction MSE on toy dataset.

Solution (Math)

Using SVD $X_c=U\Sigma V^T$, EVR is based on $\sigma_i^2$. Rank-1 reconstruction uses $\hat X_c=\sigma_1 u_1 v_1^T$, and MSE is the mean squared per-row reconstruction error.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points

X = toy_pca_points(n=60, seed=3)
mu = X.mean(0)
Xc = X - mu
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

evr1 = (S[0]**2) / np.sum(S**2)
evr2 = (S[0]**2 + S[1]**2) / np.sum(S**2)

Xc1 = np.outer(U[:,0], Vt[0]) * S[0]
X1 = mu + Xc1
mse1 = np.mean(np.sum((X - X1)**2, axis=1))

print("EVR1:", evr1)
print("EVR2 (should be ~1):", evr2)
print("MSE rank-1:", mse1)
Code Introduction

This code demonstrates PCA as low-rank approximation via SVD: it reconstructs data using only the top principal component and measures how much variance is retained. The explained variance ratio (EVR) quantifies the information loss, while the mean squared error (MSE) measures reconstruction quality.

Part 1: Data Centering and SVD. The code loads 60 2D points via toy_pca_points(n=60, seed=3), yielding $X \in \mathbb{R}^{60 \times 2}$ (60 examples, 2 features). Centering removes the mean: $\mu = \frac{1}{n}\sum_{i=1}^n X_i, \quad X_c = X - \mu \in \mathbb{R}^{60 \times 2}$. The SVD factorization $X_c = U \Sigma V^\top$ decomposes the centered data as: $U \in \mathbb{R}^{60 \times 2}$ (left singular vectors, principal directions in data space), $\Sigma = \text{diag}(\sigma_1, \sigma_2)$ (singular values, importance/variance scales), and $V^\top \in \mathbb{R}^{2 \times 2}$ (right singular vectors, principal axes in feature space). With full_matrices=False, the SVD returns only the first $\min(60, 2) = 2$ columns of $U$ and rows of $V^\top$, matching the data dimensionality.

Part 2: Explained Variance Ratio. The explained variance ratio (EVR) measures what fraction of total variance is captured by the top-$k$ components. For variance computed from singular values: $\text{Total variance} = \sum_{i=1}^d \sigma_i^2 = \|X_c\|_F^2$, the EVR for the first component is: $\text{EVR}_1 = \frac{\sigma_1^2}{\sum_{i=1}^d \sigma_i^2}$. The code computes this as evr1 = (S[0]**2) / np.sum(S**2). For 2D data, cumulative EVR for both components is: $\text{EVR}_{1:2} = \frac{\sigma_1^2 + \sigma_2^2}{\sigma_1^2 + \sigma_2^2} = 1.0$, computed as evr2 = (S[0]**2 + S[1]**2) / np.sum(S**2), which should equal 1.0 (all variance is explained by all components). Interpretation: If $\text{EVR}_1 = 0.95$, the first component captures 95% of the variation in the data—most information fits along a single direction. If $\text{EVR}_1 = 0.5$, the data are more isotropic (variance spread equally across directions).

Part 3: Rank-1 Reconstruction. The rank-1 approximation of centered data uses only the dominant singular vector: $\hat{X}_c^{(1)} = \sigma_1 u_1 v_1^\top$, where $u_1 = U_{:,0}$ (first column of $U$) and $v_1 = V_{0,:}$ (first row of $V^\top$). The code constructs this via Xc1 = np.outer(U[:,0], Vt[0]) * S[0]. Here, np.outer(U[:,0], Vt[0]) computes $u_1 v_1^\top$ (outer product, shape $60 \times 2$), and multiplying by S[0] gives $\sigma_1 u_1 v_1^\top$. To reconstruct the original data scale, the mean is added back: $\hat{X}^{(1)} = \mu + \hat{X}_c^{(1)} \in \mathbb{R}^{60 \times 2}$. The code computes this as X1 = mu + Xc1. Geometric interpretation: $\hat{X}^{(1)}$ projects all data points onto a line passing through the mean $\mu$ in the direction $v_1$. The rank-1 approximation is the best least-squares fit to $X$ by a 1D affine subspace (line through $\mu$).

Part 4: Reconstruction Error (MSE). The mean squared error (MSE) per example measures reconstruction quality: $\text{MSE} = \frac{1}{n} \sum_{i=1}^n \|X_i - \hat{X}_i^{(1)}\|_2^2 = \frac{1}{n} \|X - \hat{X}^{(1)}\|_F^2$. The code computes this as mse1 = np.mean(np.sum((X - X1)**2, axis=1)). (X - X1)**2 squares element-wise differences, np.sum(..., axis=1) sums squared errors per example, and np.mean(...) averages over all examples. Relationship to EVR: The MSE is inversely related to explained variance. If $\text{EVR}_1$ is high (most variance captured), MSE is low (good reconstruction). The variance lost is $\sum_{i=2}^d \sigma_i^2$, which accounts for the reconstruction error—the remaining singular values determine how far points deviate from the rank-1 subspace.

Why This Matters for ML. PCA with $k$ components projects high-dimensional data onto a $k$-dimensional subspace. Choosing $k$ via EVR (e.g., retain components capturing 95% variance) balances information retention and computational efficiency. For $d = 1000$ features, keeping top-10 components might capture 90% variance, reducing dimensionality 100-fold. The rank-$k$ approximation is optimal (Eckart–Young theorem): no other rank-$k$ matrix fits the data better in least-squares sense. Reconstruction error guides model selection: if test MSE is much higher than train MSE, the low-rank structure doesn’t generalize (overfitting). EVR is used in: dimensionality reduction (feature extraction for downstream tasks), visualization (2D/3D projections), denoising (truncate small singular values), compression (image/video encoding), and anomaly detection (high MSE indicates outliers).

Numerical Gotchas and Verification. - Shape discipline: $X \in \mathbb{R}^{60 \times 2}$, $X_c \in \mathbb{R}^{60 \times 2}$, $U \in \mathbb{R}^{60 \times 2}$, $S \in \mathbb{R}^2$, $V^\top \in \mathbb{R}^{2 \times 2}$. np.outer(u, v) with $u \in \mathbb{R}^{60}, v \in \mathbb{R}^2$ yields $(60, 2)$. - Centering is mandatory: Without centering, SVD captures direction toward the mean (not variance structure). Always compute $X_c = X - \mu$ before SVD for PCA. - Singular values ≠ eigenvalues of covariance: Singular values of $X_c$ relate to covariance eigenvalues via $\lambda_i = \frac{\sigma_i^2}{n-1}$ (variance per component). EVR uses $\sigma_i^2$ directly (sum of squared centered values); dividing by $n-1$ shifts the baseline but doesn’t change the ratio. - Rank-1 reconstruction via outer product: np.outer(U[:,0], Vt[0]) * S[0] is equivalent to U[:, [0]] @ (S[0:1, np.newaxis] * Vt[0:1, :]) but clearer. Always verify shape: outer product yields $(60, 2)$. - MSE vs. Frobenius norm: The computed MSE is equivalent to np.linalg.norm(X - X1, 'fro')**2 / 60 (squared Frobenius norm divided by $n$). For comparison with variance, note that Frobenius norm sums squared elements; dividing by $n$ gives average squared error per example. - 2D data edge case: For 2D data, EVR_2 is always 1.0 (both components explain all variance). The code’s print statement “should be ~1” reflects this special case. For higher-dimensional data, EVR_2 would be less than 1.

Verification checks:

# Verify reconstruction is rank-1
rank_Xc1 = np.linalg.matrix_rank(Xc1)
print(f"Rank of rank-1 approximation: {rank_Xc1}")  # Should be 1

# Verify EVR relationship to singular values
total_variance = np.sum(S**2)
print(f"Total variance (sum S^2): {total_variance:.6f}")
print(f"EVR1 + (1 - EVR1): {evr1 + (1 - evr1):.10f}")  # Should be 1.0

# Verify MSE via Frobenius norm
mse1_alt = np.linalg.norm(X - X1, 'fro')**2 / 60
print(f"MSE (direct): {mse1:.6f}, MSE (via Frobenius): {mse1_alt:.6f}")

# Verify reconstruction error relates to lost variance
lost_variance = S[1]**2  # Variance in 2nd component
mse_from_variance = lost_variance / 60
print(f"MSE: {mse1:.6f}, Lost variance / n: {mse_from_variance:.6f}")

Linear Algebra Theory Connection. The condition number bounds relative error: for perturbation $\Delta b$ in $Ax = b$, $\frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\Delta b\|}{\|b\|}$. For PCA, the SVD $X_c = U \Sigma V^\top$ reveals conditioning via singular value ratio. The Eckart–Young theorem states that for any matrix $A$, the best rank-$k$ approximation in Frobenius norm is $A_k = \sum_{i=1}^k \sigma_i u_i v_i^\top = U_k \Sigma_k V_k^\top$. The approximation error is $\|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2$. For 2D data with $k=1$, the error is exactly $\sigma_2^2$ (the missing singular value squared). Explained variance decomposition: total variance in centered data is $\text{Var}(X_c) = \frac{1}{n-1} \sum_{i=1}^d \sigma_i^2$. Principal components are ordered by decreasing variance: $\text{Var}(PC_1) \ge \text{Var}(PC_2) \ge \ldots$. The EVR measures what fraction of total variance is captured: $\text{EVR}_k = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^d \sigma_i^2}$.

Pedagogical Power. This example is the canonical demonstration of PCA as dimensionality reduction. It distills PCA into three core insights—centering, SVD computation, and rank-$k$ reconstruction—showing how linear algebra enables dimensionality reduction. The EVR and MSE metrics concretely measure the trade-off: higher EVR means better reconstruction (lower MSE), but keeping more components increases computational cost. This pattern repeats throughout ML: compression via low-rank structure, guided by variance/energy measures. Understanding this one relationship deepens intuition for matrix factorization, autoencoders, and modern deep learning.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Load data: Generate $X \in \mathbb{R}^{n \times d}$ via toy_pca_points(n=60, seed=3), yielding 60 examples and 2 features.
  2. Compute mean: Calculate $\mu = \frac{1}{n}\sum_{i=1}^n X_i$ via X.mean(0) (column-wise mean, shape $(d,)$).
  3. Center data: Subtract mean $X_c = X - \mu$ via broadcasting; each row of $X$ is shifted by $\mu$.
  4. Compute SVD: Use U, S, Vt = np.linalg.svd(Xc, full_matrices=False) to extract left singular vectors $U \in \mathbb{R}^{60 \times 2}$, singular values $S \in \mathbb{R}^2$, and right singular vectors $V^\top \in \mathbb{R}^{2 \times 2}$.
  5. Explained variance ratio (EVR): Compute $\text{EVR}_1 = \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}$ as (S[0]**2) / np.sum(S**2); similarly for $\text{EVR}_2 = 1.0$ (all variance, 2D edge case).
  6. Rank-1 reconstruction (centered): Compute $\hat{X}_c^{(1)} = \sigma_1 u_1 v_1^\top$ via np.outer(U[:,0], Vt[0]) * S[0] (outer product scaled by singular value).
  7. Add mean back: Reconstruct original scale $\hat{X}^{(1)} = \mu + \hat{X}_c^{(1)}$ via mu + Xc1.
  8. Compute MSE: Calculate $\text{MSE} = \frac{1}{n} \sum_{i=1}^n \|X_i - \hat{X}_i^{(1)}\|_2^2$ via np.mean(np.sum((X - X1)**2, axis=1)) (squared error per example, then averaged).
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Centering is mandatory for PCA: Without $X_c = X - \mu$, SVD captures direction toward the mean (not variance structure).
  • EVR quantifies information retention: Explained variance ratio $\text{EVR}_k$ measures what fraction of total variance is captured by top-$k$ components.
  • Rank-$k$ reconstruction via outer products: $\hat{X}_c^{(k)} = \sum_{i=1}^k \sigma_i u_i v_i^\top$ projects data onto $k$-dimensional subspace.
  • MSE measures reconstruction quality: Mean squared error $\text{MSE} = \frac{1}{n} \|X - \hat{X}^{(k)}\|_F^2$ quantifies information loss.
  • EVR and MSE are inversely related: High EVR means low MSE (good reconstruction); low EVR means high MSE (poor approximation).
  • 2D data edge case: For $d=2$, EVR_2 is always 1.0 (both components explain all variance); useful sanity check.
  • Shape discipline in reconstruction: Outer product np.outer(u, v) yields rank-1 matrix; scale by $\sigma_i$ and sum for rank-$k$.
  • Geometric interpretation: Rank-$k$ approximation projects all points onto a $k$-dimensional affine subspace (hyperplane through $\mu$).
Notes

Part 1: Data Centering and SVD - Centering removes the mean: $X_c = X - \mu$, where $\mu = \frac{1}{n}\sum_{i=1}^n X_i \in \mathbb{R}^d$. - Without centering, SVD captures direction toward the mean (not variance structure). - Centered data have zero mean: $\frac{1}{n}\sum_{i=1}^n (X_c)_i = 0$. - SVD factorization $X_c = U \Sigma V^\top$ yields: - $U \in \mathbb{R}^{n \times r}$: left singular vectors (principal directions in data space) - $\Sigma \in \mathbb{R}^{r \times r}$: diagonal matrix of singular values $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r > 0$ - $V^\top \in \mathbb{R}^{r \times d}$: right singular vectors (principal axes in feature space) - With full_matrices=False, $r = \min(n, d)$ (reduced form, saves memory).

Part 2: Explained Variance Ratio - Total variance in centered data: $\text{Var}_{\text{total}} = \sum_{i=1}^d \sigma_i^2 = \|X_c\|_F^2$ (Frobenius norm squared). - Variance along $i$-th component: $\text{Var}_i = \frac{\sigma_i^2}{n-1}$ (per-sample variance). - EVR for top-$k$ components: $\text{EVR}_k = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^d \sigma_i^2}$ (fraction of total variance). - For 2D data, $\text{EVR}_2 = \frac{\sigma_1^2 + \sigma_2^2}{\sigma_1^2 + \sigma_2^2} = 1.0$ (all variance explained). - Interpretation: $\text{EVR}_1 = 0.95$ means 95% of variance lies along the first principal axis; data are nearly 1D.

Part 3: Rank-1 Reconstruction - Rank-1 approximation of centered data: $\hat{X}_c^{(1)} = \sigma_1 u_1 v_1^\top$, where $u_1 = U_{:,0}$ and $v_1 = V_{0,:}$. - Computed via outer product: np.outer(U[:,0], Vt[0]) * S[0] yields matrix of shape $(n, d)$. - Geometric interpretation: $\hat{X}_c^{(1)}$ projects all points onto 1D subspace spanned by $v_1$. - Add mean back to original scale: $\hat{X}^{(1)} = \mu + \hat{X}_c^{(1)}$. - Rank-1 approximation is the best least-squares fit to $X$ by a 1D affine subspace (line through $\mu$). - Generalizes to rank-$k$: $\hat{X}_c^{(k)} = \sum_{i=1}^k \sigma_i u_i v_i^\top$ (sum of $k$ rank-1 terms).

Part 4: Reconstruction Error (MSE) - Mean squared error per example: $\text{MSE} = \frac{1}{n} \sum_{i=1}^n \|X_i - \hat{X}_i^{(1)}\|_2^2$. - Computed as np.mean(np.sum((X - X1)**2, axis=1)): square differences, sum per row, average over rows. - Equivalent to Frobenius norm: $\text{MSE} = \frac{\|X - \hat{X}^{(1)}\|_F^2}{n}$. - Relationship to lost variance: $\text{MSE} \propto \sum_{i=k+1}^d \sigma_i^2$ (variance in discarded components). - High EVR → low MSE (good reconstruction); low EVR → high MSE (poor approximation). - For 2D data with rank-1 approximation, $\text{MSE} = \frac{\sigma_2^2}{n}$ (exactly the lost variance).

Why Low-Rank Structure Matters in ML - Computational efficiency: Representing $X \approx U_k \Sigma_k V_k^\top$ reduces storage from $O(nd)$ to $O((n+d)k)$ when $k \ll \min(n, d)$. - Overfitting mitigation: Keeping top-$k$ components removes noise (small singular values), improving generalization. - Interpretability: Principal components are linear combinations of original features; high-variance directions often correspond to meaningful concepts. - Convexity and stability: PCA is convex (unique global solution) and numerically stable (via SVD).

ML Examples and Patterns - Eigenfaces (face recognition): PCA on face images yields principal “eigenfaces”; new faces are linear combinations of eigenfaces. Keep top 50–100 components (EVR ≈ 0.95), reducing dimensionality from $\sim 10^4$ pixels. - Genomics (gene expression): PCA on gene expression matrices reveals tissue types or disease states. First 2 PCs often separate known clusters (e.g., cancer vs. healthy). - Recommender systems: Matrix factorization (PCA-like) decomposes user-item rating matrix into low-rank factors, enabling collaborative filtering. - Neural network initialization: Whitening via PCA (rotate to principal axes, scale by $\sigma_i^{-1}$) decorrelates inputs, improving optimization convergence. - Anomaly detection: Reconstruction error is high for outliers (data far from low-rank subspace); threshold MSE to detect anomalies. - Visualization: Project high-dimensional data onto top 2–3 PCs for scatter plots; EVR_2 or EVR_3 indicates how representative the plot is.

Connection to Linear Algebra Theory - Eckart–Young theorem: The rank-$k$ SVD truncation $\hat{X}_c^{(k)} = U_k \Sigma_k V_k^\top$ minimizes $\|X_c - \hat{X}_c^{(k)}\|_F$ over all rank-$k$ matrices. - Projection onto subspace: Rank-$k$ approximation is orthogonal projection of $X_c$ onto subspace spanned by $\{v_1, \ldots, v_k\}$. - Covariance eigendecomposition: Covariance $\Sigma = \frac{1}{n-1} X_c^\top X_c = V \Lambda V^\top$, where $\Lambda = \text{diag}(\frac{\sigma_1^2}{n-1}, \ldots, \frac{\sigma_d^2}{n-1})$. - Variance decomposition: Total variance $\text{Var}_{\text{total}} = \text{trace}(\Sigma) = \sum_{i=1}^d \lambda_i = \frac{1}{n-1}\sum_{i=1}^d \sigma_i^2$. - Rayleigh quotient: First principal component $v_1$ is the maximizer of $\frac{v^\top \Sigma v}{v^\top v}$ (Rayleigh quotient).

Numerical and Implementation Notes - Centering is mandatory: Always compute $X_c = X - X.mean(0)$ before SVD; uncentered PCA is meaningless. - Singular values vs. eigenvalues: Singular values of $X_c$ relate to covariance eigenvalues via $\lambda_i = \frac{\sigma_i^2}{n-1}$. EVR uses $\sigma_i^2$ directly (no $n-1$ factor). - Outer product for rank-1: np.outer(u, v) computes $u v^\top$ (shape $(n, d)$ if $u \in \mathbb{R}^n, v \in \mathbb{R}^d$). Always scale by singular value. - MSE via Frobenius norm: np.linalg.norm(X - X1, 'fro')**2 / n is equivalent to np.mean(np.sum((X - X1)**2, axis=1)). - EVR sum check: For full-rank data, $\sum_{i=1}^d \text{EVR}_i = 1.0$ (all variance accounted for). - Reconstruction rank check: np.linalg.matrix_rank(Xc1) should be 1 for rank-1 approximation.

Numerical and Shape Notes - $X \in \mathbb{R}^{60 \times 2}$: data matrix (60 examples, 2 features). - $\mu \in \mathbb{R}^{2}$: mean vector (column-wise mean). - $X_c \in \mathbb{R}^{60 \times 2}$: centered data. - $U \in \mathbb{R}^{60 \times 2}$: left singular vectors (reduced form). - $S \in \mathbb{R}^{2}$: singular values (descending order). - $V^\top \in \mathbb{R}^{2 \times 2}$: right singular vectors (full, since $d=2$). - np.outer(U[:,0], Vt[0]) yields shape $(60, 2)$ (rank-1 matrix). - Xc1, X1 have shape $(60, 2)$ (same as original data). - EVR values are scalars in $[0, 1]$. - MSE is a scalar (average squared error per example).

Pedagogical Significance - Central insight: PCA via SVD is the optimal low-rank approximation to data—minimizes reconstruction error while maximizing variance retention. - Builds dimensionality reduction intuition: EVR quantifies information loss; reconstruction error measures approximation quality. - Connects theory to practice: Eckart–Young theorem (theory) is verified empirically via MSE computation (practice). - Demystifies “explained variance”: EVR is just the fraction of total variance (sum of squared singular values) captured by top-$k$ components. - Prepares for advanced topics: Matrix factorization (NMF, ICA), autoencoders (nonlinear PCA), kernel PCA (implicit feature spaces). - Highlights centering importance: Without centering, PCA fails—mean dominates variance structure.

History and Applications

Origins of PCA (1901–1933): Pearson (1901) introduced PCA geometrically as finding lines of closest fit to point clouds in high-dimensional space. Hotelling (1933) recast it as variance maximization, introducing the term “principal components” and connecting it to eigenvalue decomposition. The method remained niche until digital computing enabled large-scale matrix operations. Eckart & Young (1936) proved the optimality of rank-$k$ SVD truncation, establishing PCA as the fundamental low-rank approximation.

Golden age (1960s–1980s): Numerical algorithms for SVD (Golub & Kahan, 1965) made PCA practical for large datasets. Applications emerged in psychology (factor analysis), geology (seismic data), and econometrics (multivariate time series). Jolliffe’s textbook (1986, revised 2002) became the standard reference, unifying theory and applications.

Modern era (1990s–present): Turk & Pentland (1991) popularized PCA with “Eigenfaces” for face recognition, demonstrating that 50–100 principal components could represent thousands of pixels. Recommender systems (Netflix Prize, 2006–2009) used matrix factorization (PCA-like) for collaborative filtering, winning with low-rank models. Deep learning resurgence: autoencoders are nonlinear PCA (neural networks learn nonlinear dimensionality reduction). Kernel PCA (Schölkopf et al., 1998) extends PCA to nonlinear manifolds via kernel trick. Modern genomics: PCA on single-cell RNA-seq data reveals cell types and developmental trajectories (10x Genomics, Broad Institute).

Contemporary applications: Image compression (JPEG 2000 uses wavelet transforms, conceptually similar to PCA). Neural network initialization: whitening via PCA decorrelates inputs, improving optimization convergence (LeCun et al., 2012). Anomaly detection: reconstruction error is high for outliers (fraud detection, network intrusion). Differential privacy: add noise to top-$k$ principal components, preserving privacy while retaining variance. Quantum computing: variational quantum eigensolver (VQE) for chemistry uses PCA-like structure to reduce Hilbert space dimension.

Future directions: High-dimensional statistics: sparse PCA (Zou et al., 2006) handles $d \gg n$ via $\ell_1$ regularization. Streaming PCA: incremental algorithms for online learning (Oja’s algorithm, matrix sketching). Tensor PCA: extends PCA to higher-order tensors (multi-way data). Explainable AI: interpret deep networks via PCA on hidden layer activations (what features do neurons encode?). Understanding PCA—the optimal low-rank approximation—remains foundational for modern data science, and its principles scale from classical statistics to cutting-edge ML.

Connection to Broader Examples
  • Chapter 6 (Projections): Rank-$k$ approximation is orthogonal projection onto $k$-dimensional subspace spanned by top-$k$ right singular vectors.
  • Chapter 8 (Eigenvalues): Principal components are eigenvectors of covariance matrix $\Sigma = \frac{1}{n-1} X_c^\top X_c$; eigenvalues are variances.
  • Chapter 9 (PSD): Covariance $\Sigma$ is PSD (Gram matrix form); ensures real, nonnegative eigenvalues (variances).
  • Chapter 10 (SVD): SVD directly yields PCA: right singular vectors are principal axes, singular values relate to variances as $\lambda_i = \frac{\sigma_i^2}{n-1}$.
  • Chapter 11 (PCA): This example is the canonical implementation of PCA via SVD.
  • Chapter 12 (Least-squares): Rank-$k$ approximation solves least-squares problem: minimize $\|X_c - \hat{X}_c^{(k)}\|_F$ over all rank-$k$ matrices.
  • Chapter 14 (Conditioning): Condition number of covariance relates to EVR spread; large gaps in EVR indicate ill-conditioning.
  • Dimensionality reduction: PCA reduces $d$ features to $k$ principal components ($k \ll d$), improving computational efficiency.

Comments