ex1.ai

Example number

Example slug

example_27_pca_via_svd_explained_variance_ratio

Background

This example sits on the classic SVD/PCA lineage: EckartâYoung formalized optimal low-rank approximation; Pearson and Hotelling framed PCA as variance-maximizing projections. Pearson (1901) introduced PCA as finding lines and planes of closest fit to point clouds, minimizing orthogonal distances. Hotelling (1933) reformulated it as maximizing variance along projection directions, revealing the eigenvalue problem at its core. The SVD-based approach (popularized after Golubâs 1965 algorithm) directly computes principal components without forming the covariance matrix, improving numerical stability and efficiency. The explained variance ratio quantifies the fraction of total variance captured by each component, enabling principled dimensionality selection. In modern ML, PCA serves multiple roles: data whitening (decorrelating features), visualization (projecting to 2D/3D), noise reduction (truncating low-EVR components), and initialization (e.g., for autoencoders). The equivalence $\text{Cov}(X_c) = \frac{1}{n-1} X_c^\top X_c = V \Lambda V^\top$ (where $\Lambda = \text{diag}(\sigma_1^2, \ldots, \sigma_d^2) / (n-1)$) bridges the eigendecomposition and SVD perspectives, showing that PCA components are eigenvectors of the covariance matrix and right singular vectors of the centered data matrix.

Purpose

Build intuition for low-rank structure as the organizing principle behind compression, denoising, and dimensionality reduction in ML. Principal Component Analysis (PCA) is the canonical dimensionality reduction technique that transforms correlated features into uncorrelated principal components ordered by explained variance. The explained variance ratio (EVR) quantifies how much information each principal component captures, enabling informed decisions about dimensionality reduction, noise filtering, and feature engineering. Understanding PCA via SVD provides both computational efficiency and deep geometric insight: principal components are right singular vectors, and their importance is proportional to squared singular values. This example demonstrates the fundamental SVD-PCA connection that underpins everything from image compression to neural network initialization, from genomics to recommender systems. Mastering this technique equips you to diagnose overfitting (by identifying low-variance directions), compress representations (by retaining high-EVR components), and interpret latent structure in high-dimensional data.

Problem

Compute PC1 and EVR1 from SVD of centered toy data.

Solution (Math)

For $X_c=U\Sigma V^T$, PC1 is $v_1$ and EVR1 is $\sigma_1^2/\sum_j \sigma_j^2$.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


import numpy as np
from scripts.toy_data import toy_pca_points

X = toy_pca_points(n=40, seed=1)
Xc = X - X.mean(0)
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]
evr1 = (S[0]**2)/np.sum(S**2)
print("pc1:", pc1)
print("evr1:", evr1)

Code Introduction

This code implements Principal Component Analysis (PCA) via Singular Value Decomposition (SVD) on a synthetic 2D dataset. The workflow consists of three steps: (1) load toy data with known correlation structure, (2) center the data by subtracting column means (essential for PCA to correctly identify variance-maximizing directions), and (3) compute the SVD factorization $X_c = U \Sigma V^\top$ to extract the first principal component (PC1) and its explained variance ratio (EVR1). The principal component pc1 = Vt[0] is the direction in feature space along which the data exhibits maximum variance. The explained variance ratio $\text{EVR}_1 = \sigma_1^2 / \sum_j \sigma_j^2$ quantifies the fraction of total variance captured by PC1, guiding dimensionality reduction decisions. This snippet demonstrates the SVD-PCA equivalence: principal components are right singular vectors, and their importance is proportional to squared singular values. The centered data matrix $X_c$ is the critical inputâfailing to center is a common mistake that leads to components dominated by the mean rather than the spread. This approach is numerically stable (avoids squaring condition numbers) and efficient (economic SVD for $n > d$). PCA via SVD is the foundation for dimensionality reduction, data visualization, feature extraction, and understanding latent structure in high-dimensional datasets. It reveals orthogonal directions ordered by variance, enabling principled truncation to retain the most informative components while discarding noise or redundancy. The EVR metric makes PCA interpretable: a high EVR1 (e.g., 0.95) indicates strong low-rank structure, while a flat EVR profile suggests isotropic data with no dominant directions. This code is the computational core of PCA, applicable to any dataset after proper centering and scaling, and generalizes to extracting multiple components to capture maximum variance in the data.

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

Data loading: toy_pca_points(n=40, seed=1) returns $X \in \mathbb{R}^{40 \times 2}$, a synthetic dataset with correlated features designed to exhibit clear principal directions.
Centering: Compute column means $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ and center $X_c = X - \mathbf{1}_n \bar{x}^\top$. In NumPy: Xc = X - X.mean(axis=0) broadcasts the mean across rows.
SVD factorization: np.linalg.svd(Xc, full_matrices=False) returns $U \in \mathbb{R}^{40 \times 2}$, $\Sigma \in \mathbb{R}^{2}$ (as a vector S), and $V^\top \in \mathbb{R}^{2 \times 2}$. The full_matrices=False option computes the reduced (economic) SVD, avoiding unnecessary zeros when $n > d$.
Principal component extraction: PC1 is the first row of $V^\top$: pc1 = Vt[0]. This is the direction of maximum variance in the centered data.
Explained variance computation: Total variance is $\sum_j \sigma_j^2$; the variance along PC1 is $\sigma_1^2$. Thus, $\text{EVR}_1 = \sigma_1^2 / \sum_j \sigma_j^2$.
Interpretation: If EVR1 â 0.95, the first component captures 95% of the variance, and the second component is nearly redundant (low-signal or noise).
Cumulative EVR: For dimensionality selection, compute cumsum(S**2) / sum(S**2) and find the smallest $k$ where cumulative EVR exceeds a threshold (e.g., 0.95).

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

SVD-PCA equivalence: Principal components are right singular vectors $v_i$ from $X_c = U \Sigma V^\top$, and their importance is quantified by squared singular values $\sigma_i^2$.
Explained variance ratio: $\text{EVR}_i = \sigma_i^2 / \sum_j \sigma_j^2$ measures the fraction of total variance captured by component $i$, enabling principled truncation.
Centering criticality: PCA requires mean-centered data $X_c = X - \bar{x}^\top$ to correctly align variance-maximizing directions with the covariance structure.
Numerical stability: Computing PCA via SVD (rather than eigendecomposing $X_c^\top X_c$) avoids squaring condition numbers and handles rank-deficient data gracefully.
Dimensionality selection: Cumulative EVR ($\sum_{i=1}^k \text{EVR}_i$) guides choosing the number of components to retain (e.g., 95% variance threshold).
Computation pattern: This is a decompose operationâfactoring data into orthogonal directions ordered by importanceâthat enables downstream projection (dimensionality reduction), filtering (noise removal), and reconstruction (approximate the original data with fewer components).

Notes

Part 1: Data Loading and Centering
The toy_pca_points function generates synthetic 2D data with known correlation structure, making principal directions visually interpretable. Centering is non-negotiable for PCA: without subtracting the mean, the first principal component would point toward the data centroid rather than the direction of maximum spread. The operation Xc = X - X.mean(axis=0) shifts the data to have zero mean along each feature, ensuring that the covariance matrix correctly captures variance (not a mix of mean and variance). Failing to center is a common bugâe.g., applying PCA to unnormalized pixel values (0-255) will yield components dominated by brightness rather than texture. Always verify Xc.mean(axis=0) â 0 after centering.

Part 2: SVD and Principal Component Extraction
The SVD factorization $X_c = U \Sigma V^\top$ simultaneously diagonalizes $X_c^\top X_c$ and $X_c X_c^\top$, yielding principal components (columns of $V$) and principal scores (columns of $U \Sigma$). The full_matrices=False argument computes the reduced SVD, which is more efficient when $n \gg d$ (many examples, few features). Singular values in S are sorted in descending order by default, so Vt[0] (the first row of $V^\top$) is PC1, the direction of maximum variance. The transformation from $V^\top$ (rows as components) to $V$ (columns as components) can cause confusionâalways check shapes: Vt.shape == (d, d) means each row is a component.

Part 3: Explained Variance Ratio
The EVR formula $\text{EVR}_i = \sigma_i^2 / \sum_j \sigma_j^2$ partitions total variance into orthogonal components. The numerator $\sigma_i^2$ is the variance along PC$i$; the denominator $\sum_j \sigma_j^2 = \text{tr}(X_c^\top X_c) = \sum_{i,j} X_{c,ij}^2$ is the total variance (sum of all eigenvalues of the covariance matrix). Cumulative EVR $\sum_{i=1}^k \text{EVR}_i$ quantifies the fraction of variance retained by the first $k$ components, guiding dimensionality selection (e.g., âkeep components explaining â¥ 95% of varianceâ). High EVR for early components indicates strong low-rank structure; flat EVR suggests isotropic noise or lack of correlation. This metric is interpretable to non-experts, unlike eigenvalues themselves (which depend on data scale).

Why This Matters for ML
PCA is ubiquitous in ML pipelines: (1) Data preprocessing: whitening (standardizing to unit variance per component) improves neural network training by equalizing gradient magnitudes across directions. (2) Feature engineering: retaining top-$k$ components reduces overfitting by discarding low-variance (high-noise) directions; this is implicit regularization. (3) Visualization: projecting high-dimensional embeddings (e.g., word vectors, image features) onto PC1/PC2 reveals clusters and outliers. (4) Compression: storing $U_k \Sigma_k V_k^\top$ (rank-$k$ approximation) requires $O(nk + dk)$ space instead of $O(nd)$, with controlled error $\| X_c - U_k \Sigma_k V_k^\top \|_F^2 = \sum_{i=k+1}^d \sigma_i^2$. (5) Diagnosing ill-conditioning: small trailing singular values indicate near-collinearity, warning that least-squares solutions will be unstable.

Numerical and Shape Notes
Shape discipline: $X \in \mathbb{R}^{n \times d}$ (rows = examples), $U \in \mathbb{R}^{n \times d}$, $\Sigma \in \mathbb{R}^{d}$ (vector of singular values), $V^\top \in \mathbb{R}^{d \times d}$. The principal component pc1 is a vector in $\mathbb{R}^d$ (feature space); projecting data onto PC1 yields scores $X_c v_1 \in \mathbb{R}^n$ (example space). Numerical stability: SVD avoids the $\kappa(X_c)^2$ condition number amplification of eigendecomposing $X_c^\top X_c$, making it the preferred method for PCA in production systems. For very large $d$, randomized SVD (sklearn.decomposition.TruncatedSVD) approximates the top-$k$ components in $O(ndk)$ time instead of $O(\min(n^2 d, nd^2))$ for full SVD. Diagnostics: if EVR drops off sharply (e.g., EVR1=0.9, EVR2=0.05), the data is approximately one-dimensional; if EVR decays slowly, many components are needed to capture variance.

History and Applications

Principal Component Analysis has a rich history spanning over a century. Karl Pearson introduced the method in 1901 as a geometric problem of fitting lines and planes to point clouds, minimizing orthogonal distances (now recognized as total least squares). Harold Hotelling reformulated PCA in 1933 as a statistical problem of maximizing variance along projection directions, revealing the eigenvalue structure at its core. The connection to SVD became computationally practical after Gene Golub developed the Golub-Kahan algorithm for SVD in 1965, enabling PCA on large datasets without explicitly forming the covariance matrix. The Eckart-Young theorem (1936) formalized PCAâs optimality: the rank-$k$ truncated SVD minimizes reconstruction error in the Frobenius norm, making PCA the best linear dimensionality reduction for preserving variance. In modern machine learning, PCA serves multiple roles: data whitening (decorrelating features) improves neural network training by equalizing gradient magnitudes; visualization (projecting to 2D/3D) reveals clusters in embeddings like word2vec or image features; compression (storing $U_k \Sigma_k V_k^\top$) reduces storage and computation for large datasets; and denoising (truncating low-variance components) filters measurement noise. PCA underpins classical techniques like Linear Discriminant Analysis (LDA) and Canonical Correlation Analysis (CCA), and modern methods like autoencoders and variational autoencoders (VAEs) can be viewed as nonlinear generalizations. Applications span genomics (identifying population structure from SNP data), neuroscience (extracting task-related signals from fMRI), finance (reducing portfolios to a few risk factors), and computer vision (eigenfaces for face recognition). Despite its age, PCA remains the interpretable baseline for understanding high-dimensional data and diagnosing whether more complex methods (e.g., t-SNE, UMAP) are warranted.

Connection to Broader Examples

PCA via SVD synthesizes concepts from multiple chapters:

Chapter 6 (Orthogonality & Projections): Principal components form an orthonormal basis; projecting data onto PC1 is $X_c v_1$, the coordinate along the maximum-variance direction.
Chapter 8 (Eigenvalues): PCA components are eigenvectors of the covariance matrix $C = \frac{1}{n-1} X_c^\top X_c$; eigenvalues $\lambda_i = \sigma_i^2 / (n-1)$ quantify variance along each eigenvector.
Chapter 10 (SVD): The SVD factorization directly yields principal components (right singular vectors $V$) without forming the covariance matrix, improving numerical stability.
Chapter 11 (PCA): This example demonstrates the computational core of PCAâSVD of centered dataâand introduces EVR as the interpretability metric.
Chapter 12 (Least Squares): PCA can be viewed as solving a least-squares problem: find the $k$-dimensional subspace that minimizes reconstruction error (Eckart-Young theorem).
Dimensionality reduction beyond PCA: Modern techniques (t-SNE, UMAP, autoencoders) build on PCAâs foundation but relax linearity/orthogonality constraints. PCA remains the interpretable baseline for diagnosing which components matter.

Numerical and Shape Notes

Numerical and Implementation Notes

Pedagogical Significance

ML Examples and Patterns

Comments