Example number
11
Example slug
example_11_pca_via_svd_explained_variance_ratio
Background

Principal Component Analysis (PCA), introduced by Pearson (1901) and Hotelling (1933), finds orthogonal directions in data space ordered by variance. The first principal component points in the direction of maximum variance; the second is orthogonal to the first and captures maximum remaining variance, and so on. This provides a natural coordinate system for data exploration and compression.

In modern ML, PCA serves as a preprocessing step to reduce input dimensionality (e.g., from thousands of pixels to dozens of components), a feature extraction method (learning interpretable axes in gene expression or text data), and a visualization tool (projecting high-dimensional data onto 2D/3D for plotting). The explained variance ratio quantifies how much information each component retains, guiding the choice of $k$ components: if the first $k$ components explain 95% of variance, downstream models trained on $k$-dimensional projections lose only 5% of the original structure. SVD provides a numerically stable way to compute PCA without forming the covariance matrix explicitly.

The explained variance ratio (EVR) computation evr1 = (S[0]**2) / np.sum(S**2) quantifies what fraction of total variance the first component captures. Since the sample covariance of centered data is $\Sigma = \frac{1}{n-1} X_c^\top X_c$, its eigenvalues are $\lambda_i = \sigma_i^2 / (n-1)$, where $\sigma_i$ are the singular values from SVD. The total variance is $\text{tr}(\Sigma) = \sum_i \lambda_i \propto \sum_i \sigma_i^2$, so the ratio $\sigma_1^2 / \sum_i \sigma_i^2$ gives the proportion of variance along the first principal direction. A value near 1.0 indicates the data is approximately 1-dimensional (all variance concentrated in one direction); a value near 0.5 for 2D data suggests roughly equal variance along both axes. This ratio guides dimensionality reduction decisions: if evr1 = 0.95, retaining only pc1 preserves 95% of the dataset’s structure.

Connection to ML workflows: PCA appears throughout machine learning as a preprocessing step (reducing input dimensionality before feeding to classifiers), a feature extraction method (learning interpretable directions in high-dimensional data), and a denoising technique (truncating small singular values to remove noise). The explained variance ratio helps practitioners choose the number of components $k$: plot EVR for each component (the “scree plot”), select the elbow point where additional components add little variance, and project data onto the top-$k$ principal components via $Z = X_c V_k$, where $V_k \in \mathbb{R}^{d \times k}$ contains the first $k$ columns of $V$. Understanding this singular value–variance relationship also clarifies why poorly conditioned data (small $\sigma_{\min}$) causes numerical instability: the condition number $\kappa = \sigma_{\max} / \sigma_{\min}$ quantifies how sensitive inversions and projections are to perturbations, which is why regularization (adding $\lambda I$ to covariance matrices) stabilizes ill-conditioned PCA by bounding the smallest eigenvalue away from zero.

Purpose

Build practical intuition for PCA as a dimensionality reduction and variance analysis tool:

  • Understand that PCA extracts directions of maximum variance via SVD of centered data.
  • See how singular values relate to explained variance: $\text{EVR}_k = \sigma_k^2 / \sum_i \sigma_i^2$.
  • Recognize the importance of centering: PCA measures variance relative to the mean.
  • Learn to choose the number of components by inspecting explained variance ratios.
Problem

Compute PC1 and EVR1 from SVD of centered toy data.

Solution (Math)

For $X_c=U\Sigma V^T$, PC1 is $v_1$ and EVR1 is $\sigma_1^2/\sum_j \sigma_j^2$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points

X = toy_pca_points(n=40, seed=1)
Xc = X - X.mean(0)
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]
evr1 = (S[0]**2)/np.sum(S**2)
print("pc1:", pc1)
print("evr1:", evr1)
Code Introduction

This snippet demonstrates the core computation of Principal Component Analysis (PCA): extracting the direction of maximum variance and quantifying how much total variance it captures. The code loads 2D synthetic data via toy_pca_points(n=40, seed=1), producing $X\in\mathbb{R}^{40\times 2}$ with inherent structure (points aligned along some direction). Centering the data with Xc = X - X.mean(0) shifts the origin to the dataset’s center of mass, which is essential because PCA seeks directions of variance relative to the mean, not relative to the coordinate origin. Omitting this step would conflate translational offset with true variance structure.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Center data: $X_c = X - \bar{X}$ where $\bar{X} = \frac{1}{n}\sum_{i=1}^n x_i$; this shifts the origin to the data centroid.
  • Compute SVD via np.linalg.svd(Xc, full_matrices=False) to get $U\in\mathbb{R}^{n\times r}$, $S\in\mathbb{R}^r$, $V^\top\in\mathbb{R}^{r\times d}$ where $r = \min(n,d)$.
  • Extract PC1: pc1 = Vt[0] is the first row of $V^\top$, a unit vector in $\mathbb{R}^d$.
  • Compute EVR1: evr1 = S[0]**2 / np.sum(S**2) gives the proportion of total variance explained by PC1.
  • For $k$ components, cumulative EVR is $\sum_{i=1}^k \sigma_i^2 / \sum_j \sigma_j^2$; plot this to choose $k$.
  • Shapes: $X:(n\times d)$ (here $40\times 2$), $X_c:(n\times d)$, $S:(2)$, $V^\top:(2\times 2)$, pc1: $(2)$, evr1: scalar.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • PCA via SVD: the first right singular vector $v_1$ is the first principal component (PC1).
  • Explained variance ratio: $\text{EVR}_1 = \sigma_1^2 / \sum_i \sigma_i^2$ measures the fraction of total variance along PC1.
  • Centering is essential: PCA without centering conflates mean offset with variance structure.
  • Singular values encode variance: larger $\sigma_i$ indicates more variance along the $i$-th component.
Notes

Part 1: Core setup - PCA via SVD: explained variance ratio

State the objects, shapes, and target question for PCA via SVD: explained variance ratio. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - PCA via SVD: explained variance ratio

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make PCA via SVD: explained variance ratio work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - PCA via SVD: explained variance ratio

Give the computational recipe for PCA via SVD: explained variance ratio, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: $X\in\mathbb{R}^{n\times d}$ (here $40\times 2$), $X_c$ centered, $U:(n\times 2)$, $S:(2)$, $V^\top:(2\times 2)$.
  • Numerical note: full_matrices=False returns economy SVD; centering via X - X.mean(0) broadcasts the mean across rows.
  • Interpretation: EVR near 1 means data is nearly 1D; EVR near $1/d$ means variance is spread equally across all dimensions.
  • Scree plot: Plot EVR vs component index; select the “elbow” where EVR drops sharply as the cutoff for $k$.
History and Applications

Historical Development: Karl Pearson (1901) introduced PCA as a geometric method to find the best-fitting line or hyperplane through a point cloud, motivated by problems in biology and evolutionary theory. Harold Hotelling (1933) reframed it as a statistical tool for identifying principal axes of variation in multivariate data, establishing the algebraic framework we use today. The connection to low-rank approximation was formalized by Eckart and Young (1936), who proved that truncated SVD minimizes reconstruction error—a result that underpins modern dimensionality reduction.

Key Findings:

  • PCA recovers the eigenvectors of the covariance matrix, ordered by decreasing eigenvalues (variance explained).
  • The first $k$ principal components provide the optimal $k$-dimensional linear subspace for representing data with minimum mean-squared error.
  • Explained variance ratios quantify information retention: cumulative EVR guides the choice of $k$ in practice (e.g., retain 95% of variance).
  • PCA is sensitive to scaling: features with large magnitudes dominate; standardization (zero mean, unit variance) is often essential.

Applications in ML and Data Science:

  • Image compression and facial recognition: Eigenfaces (PCA on face images) reduce dimensionality from thousands of pixels to dozens of components, enabling efficient storage and recognition.
  • Gene expression analysis: PCA visualizes high-dimensional genomic data, revealing clusters of related genes or sample groups (e.g., disease vs. healthy).
  • Natural language processing: Latent Semantic Analysis (LSA) uses SVD/PCA on term-document matrices to uncover semantic structure in text corpora.
  • Anomaly detection: Points with large reconstruction error (distance from the PCA subspace) flag outliers or novelties.
  • Preprocessing for deep learning: Whitening (decorrelating and normalizing via PCA) accelerates training and improves convergence in neural networks.
  • Collaborative filtering: Matrix factorization (related to PCA) powers recommender systems by learning low-rank user-item representations.
  • Signal processing and neuroscience: Independent Component Analysis (ICA) extends PCA to non-Gaussian sources; PCA separates noise from signal in EEG/fMRI data.

Modern Extensions: Kernel PCA generalizes to nonlinear manifolds via kernel methods; sparse PCA enforces sparsity in loadings for interpretability; probabilistic PCA (PPCA) provides a probabilistic generative model with closed-form maximum likelihood estimates. These variants address PCA’s limitations (linearity, lack of sparsity, no uncertainty quantification) while retaining computational efficiency and theoretical foundations.

Connection to Broader Examples
  • Dimensionality reduction: Project data onto top-$k$ PCs via $Z = X_c V_k$ where $V_k\in\mathbb{R}^{d\times k}$; this reduces dimension from $d$ to $k$.
  • Denoising: Truncate small singular values; reconstruct as $\hat{X}_c = U_k S_k V_k^\top$ to remove noise.
  • Feature extraction: PCs are linear combinations of original features; inspect loadings (PC coefficients) to interpret which features drive variance.
  • Preprocessing for ML: Many algorithms (linear regression, logistic regression, neural nets) benefit from PCA-reduced inputs when $d \gg n$.
  • Visualization: Project high-dimensional data onto PC1 and PC2 for 2D scatter plots; these axes maximize displayed variance.

Comments