Example number
26
Example slug
example_26_svd_factorization_and_best_rank1_reconstruction
Background

The singular value decomposition $X = U\Sigma V^\top$ factorizes any $n \times d$ matrix into orthonormal components $U$, $V$ and diagonal singular values $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_r)$ where $r = \text{rank}(X)$. Singular values are ordered $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r \geq 0$, measuring the “energy” or importance of each direction. The Eckart–Young theorem (1936) states that the rank-$k$ matrix closest to $X$ in Frobenius norm is $X_k = \sum_{i=1}^k \sigma_i u_i v_i^\top$, achieved by truncating the SVD.

In ML, low-rank structure is ubiquitous: images have low-rank subspaces (faces in eigenfaces), text has low-rank embeddings (semantic structure), recommendation matrices are low-rank (latent factors), and graphs have low-rank spectra (community structure). SVD is the computational workhorse for PCA, low-rank matrix completion, and spectral clustering.

Purpose

Build a crisp, practical understanding of low-rank structure as the organizing principle behind compression, denoising, dimensionality reduction, and visualization in ML. SVD reveals how data concentrates in a few dominant directions; truncating small singular values yields the best low-rank approximation (Eckart–Young theorem), optimal in Frobenius norm.

You will learn to compute SVD, extract rank-$k$ approximations, measure reconstruction error, and interpret singular values as a spectrum of information importance. The goal is to make you fluent in recognizing when low-rank structure exists, exploiting it for efficiency and noise filtering, and understanding why SVD is the foundational algorithm for PCA, image compression, recommender systems, and spectral methods.

Problem

Compute SVD of centered toy data and rank-1 reconstruction error.

Solution (Math)

SVD: $X_c=U\Sigma V^T$. Best rank-1 approximation is $\sigma_1 u_1 v_1^T$ (Eckart–Young).

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points

X = toy_pca_points(n=40, seed=0)
Xc = X - X.mean(0)
U,S,Vt = np.linalg.svd(Xc, full_matrices=False)

X1 = np.outer(U[:,0], Vt[0]) * S[0]
print("rank-1 Fro error:", np.linalg.norm(Xc-X1, ord="fro"))
Code Introduction

This code demonstrates low-rank approximation via SVD, showing how to reconstruct a matrix using only its dominant singular components and measuring reconstruction error. It illustrates why SVD is fundamental to dimensionality reduction, compression, and noise filtering in machine learning.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Center data: Xc = X - X.mean(0) removes mean, required for variance-aligned decomposition.
  • Economy SVD: U, S, Vt = np.linalg.svd(Xc, full_matrices=False) returns $U \in \mathbb{R}^{n \times r}$, $S \in \mathbb{R}^r$, $V^\top \in \mathbb{R}^{r \times d}$ where $r = \text{rank}(X_c)$.
  • Extract rank-1: X1 = np.outer(U[:,0], Vt[0]) * S[0] reconstructs using the top singular triplet.
  • Extend to rank-$k$: loop or vectorize: Xk = sum(S[i] * np.outer(U[:,i], Vt[i]) for i in range(k)).
  • Frobenius error: np.linalg.norm(Xc - X1, ord="fro") measures total squared reconstruction error.
  • Alternative: error for rank-$k$ is $\sqrt{\sum_{i=k+1}^r \sigma_i^2}$; recompute or cache $S$ for efficiency.
  • Shape verification: $(n,r) \times (r,d)$ for outer product + vector broadcast ensures correct $(n,d)$ output.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • SVD factorization: $X_c = U\Sigma V^\top$ separates signal into orthonormal bases and singular values.
  • Rank-1 reconstruction: the best rank-1 approximation is $\sigma_1 u_1 v_1^\top$, capturing maximum variance.
  • Information hierarchy: singular values order importance; truncating small $\sigma_i$ removes less-important structure.
  • Error quantification: reconstruction error equals the discarded singular values; $\|X_c - X_k\|_F = \sqrt{\sum_{i=k+1}^r \sigma_i^2}$.
  • Optimality: SVD rank-$k$ approximation minimizes Frobenius norm over all rank-$k$ matrices (Eckart–Young).
  • Scalability insight: for $n \times d$ with $d \ll n$, economy SVD via full_matrices=False costs $O(nd^2)$ and avoids null-space padding.
Notes

Part 1: Data Centering and SVD Factorization. The code loads 40 points from toy_pca_points(n=40, seed=0), producing $X \in \mathbb{R}^{40 \times 2}$ with inherent directional structure. Centering via Xc = X - X.mean(0) removes the mean, shifting the origin to the data’s center of mass. The SVD factorization U, S, Vt = np.linalg.svd(Xc, full_matrices=False) decomposes the centered matrix as $X_c = U \Sigma V^\top$, where $U \in \mathbb{R}^{40 \times 2}$ contains left singular vectors (orthonormal), $S \in \mathbb{R}^2$ contains singular values in descending order, and $V^\top \in \mathbb{R}^{2 \times 2}$ contains right singular vectors (rows are orthonormal). With full_matrices=False, we get an “economy” or “thin” SVD: since $X_c$ has only 2 columns, the factorization returns at most 2 singular components, avoiding null space padding.

Part 2: Rank-1 Reconstruction. The rank-1 approximation extracts only the first (dominant) singular component. The operation X1 = np.outer(U[:,0], Vt[0]) * S[0] reconstructs the matrix using only the top singular triplet: $X_1 = \sigma_1 u_1 v_1^\top$, where $u_1 = U[:,0] \in \mathbb{R}^{40}$ is the first left singular vector, $v_1 = V^\top[0] \in \mathbb{R}^2$ is the first right singular vector (equivalently, the first row of $V^\top$), and $\sigma_1 = S[0]$ is the largest singular value. The outer product np.outer(U[:,0], Vt[0]) produces a $40 \times 2$ matrix of rank 1; scaling by S[0] applies the magnitude. This $X_1$ lives entirely in the 1D subspace spanned by $u_1$, capturing the direction of maximum variance in the data.

Part 3: Frobenius Norm Error Quantifies Information Loss. The reconstruction error is measured using the Frobenius norm: $\|X_c - X_1\|_F = \sqrt{\sum_{ij} (X_c - X_1)_{ij}^2}$. Geometrically, this is the Euclidean distance between matrices treated as vectors in $\mathbb{R}^{40 \times 2}$. The printed error np.linalg.norm(Xc-X1, ord="fro") quantifies how much structure is lost by discarding the second singular component. In the SVD framework, this error equals $\sigma_2$ (the second singular value), since $X_c - X_1 = \sigma_2 u_2 v_2^\top + 0 = \sigma_2 u_2 v_2^\top$. The error is minimized by selecting the top-$k$ singular components: among all rank-$k$ matrices, $\sum_{i=1}^k \sigma_i u_i v_i^\top$ minimizes $\|X_c - X_{\text{rank-k}}\|_F$, making SVD the optimal low-rank approximation.

Part 4: Why This Matters for ML. Low-rank approximation is ubiquitous in machine learning. PCA is essentially rank-$k$ SVD followed by projection: retain the top $k$ principal components to reduce dimensionality while preserving maximum variance. Denoising exploits the fact that signal concentrates in large singular values while noise spreads across all components—truncating small singular values filters noise. Compression stores only the top singular triplets $(u_i, \sigma_i, v_i)$, reducing memory from $O(n \times d)$ to $O(k(n+d))$ for rank-$k$ approximation. Matrix completion (used in recommender systems) recovers missing entries by fitting a low-rank model. Kernel PCA and spectral clustering use SVD to find low-dimensional embeddings of kernel/graph matrices. Understanding SVD as an optimal rank-$k$ approximation algorithm—not just an eigenvalue solver—reveals why it’s the go-to workhorse for data reduction, visualization, and structure discovery.

Part 5: Numerical and Shape Notes. Shape discipline is critical. With Xc ∈ ℝ^{40×2}, the SVD produces $U \in \mathbb{R}^{40 \times 2}$, $S \in \mathbb{R}^2$ (as a 1D array), and $Vt \in \mathbb{R}^{2 \times 2}$. The outer product np.outer(U[:,0], Vt[0]) yields $40 \times 2$, matching Xc’s shape. Broadcasting the scalar S[0] multiplies all entries uniformly. For higher-rank approximations, loop over $k$ components: Xk = sum(S[i] * np.outer(U[:,i], Vt[i]) for i in range(k)). The Frobenius norm via ord="fro" is the default behavior of np.linalg.norm for matrices; it’s equivalent to vectorizing and computing the $\ell_2$ norm. Gotcha: if full_matrices=True (not recommended here), U would have 40 columns and Vt would have 2 rows, introducing null-space padding—use full_matrices=False for clean, minimal factorization.

History and Applications

The singular value decomposition originated in the early 20th century (Beltrami, Jordan, Sylvester); it became computationally practical with Golub’s algorithm (1965). The Eckart–Young theorem (1936) established that truncating the SVD yields the best low-rank approximation—a foundational result that motivated most subsequent applications. Pearson (1901) and Hotelling (1933) developed PCA as a variance-maximizing projection, later recognized as equivalent to rank-$k$ SVD truncation.

In modern ML, SVD is indispensable: PCA for dimensionality reduction, matrix completion (Netflix Prize), image compression (JPEG-like methods), spectral clustering, and latent semantic indexing in NLP all rely on low-rank SVD structure. Recent extensions (randomized SVD, streaming PCA) enable efficient computation on massive datasets. Low-rank approximation concepts underpin modern deep learning (autoencoders learn low-rank data representations) and theoretical analysis (matrix concentration inequalities). Understanding SVD as the optimal rank-$k$ approximation tool—computationally efficient and theoretically elegant—remains essential across data science, optimization, and machine learning.

Connection to Broader Examples
  • Chapter 10 (SVD): core factorization, rank definition, and Eckart–Young optimality.
  • Chapter 11 (PCA): rank-$k$ SVD truncation IS PCA; principal components are the top right singular vectors.
  • Chapter 8 (eigenvalues): covariance eigenvalues equal $(S^2/(n-1))$; PSD structure underlies SVD.
  • Chapter 14 (conditioning): large condition number $\kappa = \sigma_1/\sigma_r$ signals ill-posed problems; truncation regularizes.
  • Chapter 12 (least squares): SVD solves $\min_w \|Xw - y\|_2^2$ via pseudo-inverse; handles rank-deficiency gracefully.
  • Modern ML: matrix completion (recommenders), image compression, spectral clustering, and dimensionality reduction all exploit low-rank SVD approximation.

Comments