Example number
91
Example slug
example_91_pca_bookkeeping_evr_and_reconstruction_error
Background

PCA originated with Karl Pearson (1901) as lines/planes of closest fit to data, and was extended by Harold Hotelling (1933) to principal components maximizing variance. The Karhunen–Loève transform formalized PCA in stochastic processes. The Eckart–Young theorem (1936) proved that truncating the SVD gives the best low-rank approximation in Frobenius norm. Modern treatments (Jolliffe, Bishop, Hastie–Tibshirani–Friedman) emphasize PCA as both an optimization problem (variance maximization) and a geometric projection onto orthogonal directions of maximal spread.

Computationally, PCA is implemented stably via SVD of centered data $X_c$: $X_c = U \Sigma V^\top$. The covariance $\Sigma_x = \frac{1}{n-1} X_c^\top X_c$ is positive semidefinite, and its eigen-decomposition aligns with SVD via $\lambda_i = \sigma_i^2/(n-1)$. For large datasets, randomized SVD and incremental PCA compute top components efficiently.

Purpose

Build intuition for PCA as low-rank structure discovery: how the singular values of centered data quantify variance, how explained variance ratio (EVR) guides the choice of components to keep, and how rank-k reconstruction trades off compression against fidelity. EVR is computed from squared singular values $\sigma_i^2$ and tells us what fraction of total variance each principal component preserves. The rank-1 reconstruction error (MSE) measures information lost when keeping only the dominant component. Together, EVR and MSE are the bookkeeping tools that make PCA practical for denoising, visualization, and dimensionality reduction in ML.

This example emphasizes three pillars: - Variance accounting via EVR: $\text{EVR}_i = \sigma_i^2 / \sum_j \sigma_j^2$ and cumulative EVR $\sum_{i=1}^k \text{EVR}_i$. - Optimal low-rank approximation (Eckart–Young): $X_c^{(k)} = U_k \Sigma_k V_k^\top$ minimizes $\|X_c - A\|_F$ over rank-$k$ matrices. - Reconstruction error as discarded variance: $\|X_c - X_c^{(k)}\|_F^2 = \sum_{i=k+1}^d \sigma_i^2$.

Problem

Compute explained variance ratios for k=1 and k=2 and rank-1 reconstruction MSE on toy dataset.

Solution (Math)

Using SVD $X_c=U\Sigma V^T$, EVR is based on $\sigma_i^2$. Rank-1 reconstruction uses $\hat X_c=\sigma_1 u_1 v_1^T$, and MSE is the mean squared per-row reconstruction error.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points

X = toy_pca_points(n=60, seed=3)
mu = X.mean(0)
Xc = X - mu
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

evr1 = (S[0]**2) / np.sum(S**2)
evr2 = (S[0]**2 + S[1]**2) / np.sum(S**2)

Xc1 = np.outer(U[:,0], Vt[0]) * S[0]
X1 = mu + Xc1
mse1 = np.mean(np.sum((X - X1)**2, axis=1))

print("EVR1:", evr1)
print("EVR2 (should be ~1):", evr2)
print("MSE rank-1:", mse1)
Code Introduction

This snippet demonstrates PCA via SVD of centered data, computing explained variance ratios (EVR) and a rank-1 reconstruction to quantify information retention and loss.

We center the data $X$ to obtain $X_c = X - \mu$ and compute U, S, Vt = np.linalg.svd(Xc, full_matrices=False), yielding $X_c = U \Sigma V^\top$. EVR follows directly from squared singular values: $\text{EVR}_i = \sigma_i^2 / \sum_j \sigma_j^2$; cumulative EVR for the first two components in 2D should be $\approx 1$.

The rank-1 reconstruction uses the outer-product form of the leading singular triplet: \[ X_{c,1} = \sigma_1\, u_1 v_1^\top = \big(\text{np.outer}(U[:,0], Vt[0])\big)\, S[0],\quad X_1 = \mu + X_{c,1}. \] The mean squared reconstruction error per example is \[ ext{MSE}_1 = \frac{1}{n}\sum_{i=1}^n \|x_i - x_{1,i}\|_2^2,\quad \text{and}\quad \|X_c - X_{c,1}\|_F^2 = \sum_{j=2}^d \sigma_j^2. \]

Interpretation: - EVR1 is the fraction of variance captured by PC1. - EVR2 (PC1+PC2) in 2D should be $\approx 1.0$. - MSE rank-1 reflects discarded variance (PC2 in 2D). A small value indicates most structure lies along PC1.

Shapes and numerics: - $X \in \mathbb{R}^{n\times d}$, $\mu \in \mathbb{R}^d$, $X_c \in \mathbb{R}^{n\times d}$. - full_matrices=False returns thin SVD; EVR uses S**2. - Singular vector signs may flip; EVR/MSE are invariant.

Numerical Implementation Details
  • Centering: Compute $\mu = \text{mean}(X, \text{axis}=0)$ and $X_c = X - \mu$. PCA on uncentered data is incorrect.
  • Thin SVD: Use np.linalg.svd(Xc, full_matrices=False) to obtain $U \in \mathbb{R}^{n\times d}$, $S \in \mathbb{R}^d$, $V^\top \in \mathbb{R}^{d\times d}$ efficiently.
  • EVR computation: evr = S**2 / np.sum(S**2); cumulative EVR via np.cumsum(evr) to select $k$.
  • Rank-$k$ reconstruction: Xc_k = U[:,:k] @ np.diag(S[:k]) @ Vt[:k]; back to original space: X_k = mu + Xc_k.
  • Complexity: Full SVD on $n\times d$ costs $O(nd\min(n,d))$. For large $n,d$, use randomized SVD (Halko et al.) or incremental PCA.
  • Numerical pitfalls: Sign ambiguity of singular vectors (does not affect EVR/MSE), near-ties in singular values (components may rotate), forgetting full_matrices=False (wastes memory), skipping centering (wrong PCs), mixing singular values instead of squares for EVR.
  • Whitening: For decorrelation and unit variance, use $X_{\text{white}} = X_c V_k \Sigma_k^{-1}$ (add $\epsilon$ for stability).
What This Example Demonstrates
  • Explained variance ratio (EVR): $\text{EVR}_i = \sigma_i^2 / \sum_j \sigma_j^2$; cumulative EVR quantifies how much information is retained by top-$k$ components.
  • Rank-1 and rank-$k$ reconstruction: $X_c^{(k)} = U_k \Sigma_k V_k^\top$; the reconstruction error equals the sum of discarded variances $\sum_{i=k+1}^d \sigma_i^2$.
  • Eckart–Young optimality: Truncated SVD is the best low-rank approximation in Frobenius norm; no other rank-$k$ method can achieve lower error.
  • Centering is essential: PCA must be applied to $X_c = X - \mu$; otherwise, directions point toward the mean rather than variance.
  • SVD–covariance equivalence: Eigenvectors of $X_c^\top X_c$ equal right singular vectors of $X_c$; eigenvalues scale with $\sigma_i^2$.
  • Bookkeeping link between EVR and MSE: $\text{MSE}_k = \frac{1}{n} \sum_{i=k+1}^d \sigma_i^2$ relates reconstruction error to discarded variance per example.
  • Practical guidance: Choose $k$ so cumulative EVR $\ge 95\%$ for compression/denoising; visualize top 2–3 components for structure.
Notes

Part 1: EVR from Singular Values

With $X_c = U \Sigma V^\top$, the total (centered) variance is proportional to $\sum_{i=1}^d \sigma_i^2$. The explained variance ratio for component $i$ is \[ ext{EVR}_i = \frac{\sigma_i^2}{\sum_{j=1}^d \sigma_j^2}, \quad \text{and} \quad \text{EVR}_{1:k} = \sum_{i=1}^k \text{EVR}_i. \] This provides a principled criterion for choosing $k$ (e.g., EVR $\ge 0.95$).

Part 2: Rank-k Reconstruction and Error

The optimal rank-$k$ reconstruction (Eckart–Young) is $X_c^{(k)} = U_k \Sigma_k V_k^\top$. Its Frobenius error satisfies \[ \|X_c - X_c^{(k)}\|_F^2 = \sum_{i=k+1}^d \sigma_i^2, \] which directly links reconstruction error to discarded singular values. The mean per-example squared error is this quantity divided by $n$.

Part 3: Geometry and Projections

PCA projects onto orthonormal directions (columns of $V$) that maximize variance. $U$ contains the coordinates (scores) of examples in this basis. Truncation removes directions of low variance (often noise), yielding a denoised, low-dimensional representation.

Why This Matters for ML

  • Compression: Fewer components reduce storage and training time.
  • Denoising: Discarding small-$\sigma$ components removes high-frequency noise.
  • Visualization: Top 2–3 PCs reveal structure (clusters, trends).
  • Preprocessing: PCA before regression/classification improves stability and can reduce overfitting.

ML Examples and Patterns

  1. Image datasets: PCA compresses images to a few components (eigenfaces), enabling fast search and denoising.
  2. Genomics: PCA reveals population structure; EVR guides how many components capture meaningful variation.
  3. Recommendation systems: PCA uncovers latent factors; rank-$k$ approximations drive collaborative filtering.
  4. NLP embeddings: PCA reduces dimensionality of word vectors for visualization and clustering.

Connection to Linear Algebra Theory

  • PSD covariance: $X_c^\top X_c$ is PSD; eigenvalues $\lambda_i = \sigma_i^2$ (up to scaling) quantify variance.
  • Orthogonality: Columns of $V$ form an orthonormal basis; projections preserve Pythagorean sums (variance additivity).
  • Eckart–Young: Truncated SVD is globally optimal in $\|\cdot\|_F$.

Numerical and Implementation Notes

  • Use full_matrices=False to obtain thin SVD (efficiency and memory).
  • Always center the data; otherwise EVR/MSE are misleading.
  • Singular vector signs are arbitrary; EVR/MSE are invariant.
  • For large $n,d$, prefer randomized SVD or power iteration.

Numerical and Shape Notes

  • Shapes: $X \in \mathbb{R}^{n\times d}$, $\mu \in \mathbb{R}^d$, $X_c \in \mathbb{R}^{n\times d}$; $U \in \mathbb{R}^{n\times d}$ (thin), $S \in \mathbb{R}^d$, $V^\top \in \mathbb{R}^{d\times d}$.
  • EVR uses squares of singular values; cumulative EVR sums to $\approx 1$.
  • Rank-$k$ reconstruction uses top $k$ singular triplets only.

Pedagogical Significance

EVR and reconstruction error make PCA tangible: students see how many components matter and why truncation denoises while preserving structure. The explicit link $\text{MSE}_k \propto$ discarded variance clarifies the trade-off.

History and Applications

Origins (1901–1933): Karl Pearson introduced principal axes as lines/planes of closest fit (1901). Harold Hotelling (1933) formalized principal components as directions maximizing variance, establishing PCA as a cornerstone of multivariate statistics.

Theory and computation (1936–1970s): The Eckart–Young theorem (1936) proved truncated SVD gives the best low-rank approximation in Frobenius norm. The Karhunen–Loève transform connected PCA to stochastic processes. With modern numerical linear algebra, PCA is computed via SVD of centered data, avoiding explicit covariance eigendecomposition for stability.

Modern ML applications: - Denoising and compression: Eigenfaces in computer vision, latent factors in recommender systems, dimensionality reduction in genomics. - Visualization: Top 2–3 PCs provide 2D/3D embeddings revealing clusters and trends. - Preprocessing: PCA reduces collinearity, stabilizes downstream regression/classification, and accelerates training. - Whitening: $X_{\text{white}} = X_c V_k \Sigma_k^{-1}$ decorrelates features and sets unit variance (used in ICA and some deep learning pipelines). - Scalable PCA: Randomized SVD and incremental PCA enable PCA for massive datasets and streaming settings.

Key lesson: EVR and reconstruction error are practical diagnostics: they quantify how many components to keep and what information is lost when truncating. Together they make PCA an engineer’s tool for principled dimensionality reduction.

Connection to Broader Examples
  • SVD (Ch. 10): PCA is computed stably via SVD; avoid explicit covariance eigendecomposition for numerical robustness.
  • PCA (Ch. 11): EVR and reconstruction error are core diagnostics for choosing $k$ and evaluating denoising/compression.
  • Conditioning (Ex 90): Covariance inherits conditioning from $X_c$; small singular values imply fragile directions—randomized SVD and regularization help.
  • PSD matrices (Ex 89): Covariance $X_c^\top X_c$ is PSD; eigenvalues are nonnegative and equal to $\sigma_i^2$ up to scaling.
  • Eigenvectors and power iteration (Ex 88): Top PC(s) can be computed iteratively via matrix–vector products when full SVD is too expensive.
  • Least squares (Ch. 12): Low-rank structure connects to regularization and projection; PCA-based preprocessing improves stability and generalization.

Comments