ex1.ai

Example number

Example slug

example_43_pca_bookkeeping_evr_and_reconstruction_error

Background

For centered data $X_c = X - \bar{x}$, the SVD $X_c = U\Sigma V^\top$ gives principal components as right singular vectors $V$ and variance along each direction proportional to $\sigma_i^2$. The explained variance ratio for the first $k$ components is: \[ \text{EVR}_k = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^d \sigma_i^2}. \] By the Eckart-Young theorem, the rank-$k$ truncation $\hat{X}_c = \sum_{i=1}^k \sigma_i u_i v_i^\top$ minimizes $\|X_c - \hat{X}_c\|_F$ over all rank-$k$ matrices. The MSE $\frac{1}{n}\|X - \hat{X}\|_F^2$ measures average squared reconstruction error per example. These metrics guide how many principal components to keep: high EVR with low MSE means effective compression; low EVR or high MSE signals the need for more components or nonlinear methods.

Purpose

Show how to quantify PCAâs information content via explained variance ratios (EVR) and reconstruction error (MSE). Demonstrate that singular values squared measure variance captured by each principal component, and that truncated SVD produces optimal low-rank approximations. Build intuition for the compressionâaccuracy tradeoff: high EVR means few components suffice, while low MSE confirms good reconstruction. Emphasize practical PCA bookkeeping for deciding how many components to retain in dimensionality reduction.

Problem

Compute explained variance ratios for k=1 and k=2 and rank-1 reconstruction MSE on toy dataset.

Solution (Math)

Using SVD $X_c=U\Sigma V^T$, EVR is based on $\sigma_i^2$. Rank-1 reconstruction uses $\hat X_c=\sigma_1 u_1 v_1^T$, and MSE is the mean squared per-row reconstruction error.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


import numpy as np
from scripts.toy_data import toy_pca_points

X = toy_pca_points(n=60, seed=3)
mu = X.mean(0)
Xc = X - mu
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

evr1 = (S[0]**2) / np.sum(S**2)
evr2 = (S[0]**2 + S[1]**2) / np.sum(S**2)

Xc1 = np.outer(U[:,0], Vt[0]) * S[0]
X1 = mu + Xc1
mse1 = np.mean(np.sum((X - X1)**2, axis=1))

print("EVR1:", evr1)
print("EVR2 (should be ~1):", evr2)
print("MSE rank-1:", mse1)

Code Introduction

This code demonstrates PCA variance accounting and rank-1 reconstruction, showing how singular values quantify information content and how truncated SVD produces optimal low-rank approximations. It connects the abstract SVD factorization to concrete ML metrics: explained variance ratio (EVR) and mean squared error (MSE). The code loads 60 synthetic 2D points, centers them via Xc = X - mu (mandatory for PCA), and computes the SVD: $X_c = U\Sigma V^\top$. Two EVR metrics quantify information capture: evr1 = (S[0]**2) / np.sum(S**2) computes the fraction of total variance explained by the top principal direction ($\sigma_1^2 / (\sigma_1^2 + \sigma_2^2)$), while evr2 = (S[0]**2 + S[1]**2) / np.sum(S**2) sums the top two components and must equal 1.0 for 2D data (sanity check). The rank-1 reconstruction builds the optimal 1D approximation via $\hat{X}_c = \sigma_1 u_1 v_1^\top$, implemented as np.outer(U[:,0], Vt[0]) * S[0]. Shifting back via X1 = mu + Xc1 recovers the reconstruction in original coordinatesâevery reconstructed point lies on the 1D line through mu along the first principal component. The MSE mse1 = np.mean(np.sum((X - X1)**2, axis=1)) measures the average squared distance from original points to their projections, quantifying information loss. By the Eckart-Young theorem, this is the minimal reconstruction error achievable by any rank-1 matrixâtruncated SVD is provably optimal.

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

Load data: X = toy_pca_points(n=60, seed=3) generates $X \in \mathbb{R}^{60\times 2}$.
Center: mu = X.mean(0); Xc = X - mu shifts origin to data centroid (mandatory for PCA).
SVD: U, S, Vt = np.linalg.svd(Xc, full_matrices=False) computes economy-size decomposition.
EVR for 1st component: evr1 = (S[0]**2) / np.sum(S**2) computes $\sigma_1^2 / (\sigma_1^2 + \sigma_2^2)$.
EVR for both components: evr2 = (S[0]**2 + S[1]**2) / np.sum(S**2) should equal 1.0 for 2D data (sanity check).
Rank-1 reconstruction (centered): Xc1 = np.outer(U[:,0], Vt[0]) * S[0] builds $\sigma_1 u_1 v_1^\top$.
Shift back: X1 = mu + Xc1 recovers reconstruction in original coordinates.
MSE: mse1 = np.mean(np.sum((X - X1)**2, axis=1)) computes per-example squared error, averaged over examples.

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

EVR quantifies information capture: $\text{EVR}_k = \sum_{i=1}^k \sigma_i^2 / \sum_i \sigma_i^2$ is the fraction of total variance.
For 2D data, EVR for both components must equal 1.0 (sanity check that all variance is accounted for).
Rank-1 reconstruction via $\sigma_1 u_1 v_1^\top$ gives the optimal 1D approximation (Eckart-Young).
MSE measures reconstruction error: average squared distance from original to projected points.
Compressionâaccuracy tradeoff: high EVRâ means 1D suffices; low MSE confirms small loss.
Shape discipline: $X \in \mathbb{R}^{n\times d}$, $U \in \mathbb{R}^{n\times d}$, $S \in \mathbb{R}^d$, $V^\top \in \mathbb{R}^{d\times d}$.

Notes

Part 1: Centering and SVD for PCA
PCA requires centered data: Xc = X - mu. Without centering, the first principal component points toward the mean, not the direction of maximal spread. The SVD of $X_c$ gives $U\Sigma V^\top$ where $V$âs columns are principal components (ordered by decreasing variance $\sigma_i^2$).

Part 2: Explained Variance Ratios
For $k=1$, EVRâ = $\sigma_1^2 / \sum_i \sigma_i^2$ measures how â1-dimensionalâ the data is. Values near 1 indicate elongated clouds (nearly 1D embedded in higher $d$); values near $1/d$ indicate isotropic spread. For 2D data, EVRâ must equal 1.0 (two orthogonal directions span the plane).

Part 3: Rank-1 Reconstruction and Optimality
The rank-1 approximation $\hat{X}_c = \sigma_1 u_1 v_1^\top$ is the best 1D fit by Eckart-Young: it minimizes $\|X_c - \hat{X}_c\|_F$ over all rank-1 matrices. Shifting back via $\hat{X} = \mu + \hat{X}_c$ recovers the reconstruction in original coordinates. Every reconstructed point lies on a 1D line through $\mu$ along the first principal component.

Why This Matters for ML
EVR guides dimensionality reduction: choose $k$ such that $\text{EVR}_k \ge 0.9$ (retain 90% of variance). High EVR with few components means linear structure dominates; low EVR means nonlinear methods (kernel PCA, autoencoders) may be needed. MSE quantifies reconstruction lossâcritical for generative models, image compression, and feature extraction.

Numerical and Implementation Notes
Use full_matrices=False for economy SVD; critical when $n \gg d$ to avoid allocating huge matrices. Always center before SVD. Verify EVR for all components sums to 1.0 as a sanity check. For higher dimensions, plot EVR vs.Â $k$ (scree plot) to choose $k$ visually; look for an âelbowâ where marginal variance gain drops off.

Numerical and Shape Notes
Shapes: $X (n\times d)$, $X_c (n\times d)$, $U (n\times d)$, $S (d)$, $V^\top (d\times d)$. The outer product np.outer(U[:,0], Vt[0]) yields $(n, d)$ rank-1 matrix. For reconstruction, X1 = mu + Xc1 broadcasts $\mu \in \mathbb{R}^d$ across $n$ rows. MSE averages squared per-example errors; equivalent to $\frac{1}{n}\|X - X_1\|_F^2$.

History and Applications

PCA traces to Pearson (1901) for geometric interpretation and Hotelling (1933) for variance maximization. The Eckart-Young theorem (1936) established SVD-based low-rank approximation as optimal under Frobenius norm, unifying PCA with matrix approximation theory. In modern ML, EVR and reconstruction error guide hyperparameter choices: image compression (JPEG uses DCT, a variant of PCA; retain 10% of components for 90% quality), genomics (reduce 20,000 genes to 50 principal components for clustering), and neural network initialization (PCA whitening decorrelates inputs, speeding convergence). Scree plots (EVR vs.Â $k$) help choose dimensionality; the âelbowâ indicates diminishing returns. Understanding EVR and MSE is essential for interpreting PCA results and debugging compression pipelines.

Connection to Broader Examples

Vector spaces (Ch. 1): Principal components span orthogonal subspaces of $\mathbb{R}^d$.
Span (Ch. 2): Rank-$k$ reconstruction lives in the span of the first $k$ principal components.
Basis and dimension (Ch. 3): PCA finds an orthonormal basis ordered by variance; $k$ components define a $k$-dimensional subspace.
Linear maps (Ch. 4): Projection onto principal components is a linear map $X \mapsto XV_{:,:k}$.
Projections (Ch. 6): PCA is orthogonal projection onto the subspace maximizing variance.
Eigen (Ch. 8): Principal components are eigenvectors of the covariance matrix $\Sigma = \frac{1}{n-1}X_c^\top X_c$.
SVD (Ch. 10): PCA uses SVD of centered data; right singular vectors are principal components.
Least squares (Ch. 12): Rank-$k$ reconstruction minimizes Frobenius norm (sum of squared residuals).
Conditioning (Ch. 14): Ratio $\sigma_1/\sigma_d$ measures how elongated the data cloud is; large ratios indicate near-low-rank structure.

Numerical and Shape Notes

Numerical and Implementation Notes

Pedagogical Significance

ML Examples and Patterns

Comments