Example number
10
Example slug
example_10_svd_factorization_and_best_rank1_reconstruction
Background

The SVD $X = U\Sigma V^\top$ factorizes any matrix into orthonormal left/right singular vectors and nonnegative singular values. The Eckart–Young theorem (1936) guarantees that truncating to the top $k$ singular triplets minimizes reconstruction error in both Frobenius and spectral norms. This makes SVD the canonical choice for low-rank approximation, compression, and noise filtering.

In ML, SVD underpins PCA (covariance eigenvectors are right singular vectors of centered data), collaborative filtering (matrix completion via low-rank factorization), and latent semantic analysis. The rank-1 case isolates the single most important pattern—useful for visualization, feature extraction, and understanding data geometry.

Purpose

Build intuition for SVD-based low-rank approximation:

  • Understand that SVD provides the optimal rank-$k$ approximation in Frobenius norm (Eckart–Young theorem).
  • See that the rank-1 approximation captures the dominant pattern in the data.
  • Connect singular values to explained variance and reconstruction error.
  • Recognize SVD as the unifying tool for PCA, matrix completion, and denoising.
Problem

Compute SVD of centered toy data and rank-1 reconstruction error.

Solution (Math)

SVD: $X_c=U\Sigma V^T$. Best rank-1 approximation is $\sigma_1 u_1 v_1^T$ (Eckart–Young).

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points

X = toy_pca_points(n=40, seed=0)
Xc = X - X.mean(0)
U,S,Vt = np.linalg.svd(Xc, full_matrices=False)

X1 = np.outer(U[:,0], Vt[0]) * S[0]
print("rank-1 Fro error:", np.linalg.norm(Xc-X1, ord="fro"))
Code Introduction

This snippet computes the best rank-1 approximation of centered 2D data via SVD and reports its Frobenius reconstruction error. The data $X$ is centered to $X_c$, then factorized as $X_c = U\Sigma V^\top$. Keeping only the top singular triplet $(u_1, \sigma_1, v_1)$, the rank-1 approximation is $X_1 = \sigma_1 u_1 v_1^\top$. The code builds this with np.outer(U[:,0], Vt[0]) * S[0] and measures $\|X_c - X_1\|_F$, the Frobenius norm of the residual. For any matrix, the truncated SVD gives the minimum-error rank-$k$ approximation in Frobenius norm; here $k=1$ shows how much variance a single component captures.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Center data: $X_c = X - \text{mean}(X)$ to remove the mean offset; covariance and PCA require centered data.
  • Compute SVD via np.linalg.svd(Xc, full_matrices=False) to get $U\in\mathbb{R}^{n\times r}$, $S\in\mathbb{R}^r$, $V^\top\in\mathbb{R}^{r\times d}$ where $r = \min(n,d)$.
  • Construct rank-1 approximation: $X_1 = \sigma_1 u_1 v_1^\top$ using np.outer(U[:,0], Vt[0]) * S[0].
  • Compute Frobenius reconstruction error: np.linalg.norm(Xc - X1, ord="fro") measures total residual.
  • For rank-$k$ approximations, sum the first $k$ outer products: $X_k = \sum_{i=1}^k \sigma_i u_i v_i^\top$.
  • Shapes: $X:(n\times d)$ (here $40\times 2$), $X_c:(n\times d)$, $U:(n\times 2)$, $S:(2)$, $V^\top:(2\times 2)$, $X_1:(n\times d)$.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • SVD decomposes $X_c = U\Sigma V^\top$ with orthonormal bases and nonnegative singular values.
  • The best rank-1 approximation is $\sigma_1 u_1 v_1^\top$; its error is $\sqrt{\sigma_2^2 + \sigma_3^2 + \cdots}$.
  • Frobenius norm measures total squared reconstruction error across all entries.
  • Truncated SVD balances model simplicity (low rank) against fidelity (small residual).
Notes

Part 1: Core setup - SVD factorization and best rank-1 reconstruction

State the objects, shapes, and target question for SVD factorization and best rank-1 reconstruction. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - SVD factorization and best rank-1 reconstruction

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make SVD factorization and best rank-1 reconstruction work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - SVD factorization and best rank-1 reconstruction

Give the computational recipe for SVD factorization and best rank-1 reconstruction, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: $X\in\mathbb{R}^{n\times d}$ (here $40\times 2$), $X_c$ centered, $U:(n\times r)$, $S:(r)$, $V^\top:(r\times d)$ with $r=\min(n,d)$.
  • Numerical note: full_matrices=False returns economy SVD, saving memory; use full_matrices=True only if you need complete orthonormal bases.
  • Interpretation: Singular values decay quantifies intrinsic dimensionality; a sharp drop after $k$ values suggests effective rank $k$.
  • Frobenius error: For rank-$k$ truncation, $\|X_c - X_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$ by Eckart–Young.
History and Applications

Singular Value Decomposition (SVD) was developed independently by multiple researchers: Beltrami (1873), Jordan (1874), and Sylvester (1889) in theoretical contexts. The modern computational algorithm is due to Golub and Kahan (1965), making SVD a practical workhorse for numerical computing.

Eckart–Young theorem: Eckart and Young (1936) proved that the truncated SVD—keeping only the largest $k$ singular values—gives the best rank-$k$ approximation in both Frobenius and spectral norms. This geometric result justified low-rank approximation as an optimal strategy and became central to PCA and matrix compression.

Modern applications: SVD powers recommendation systems (Netflix problem: matrix completion), natural language processing (Latent Semantic Analysis), and image compression. Neural network-based approaches (autoencoders, variational methods) have added probabilistic twists, but the underlying geometry—that low-rank structure captures data patterns—remains unchanged. SVD’s stability and optimality guarantees make it irreplaceable in numerical linear algebra.

Connection to Broader Examples
  • PCA: Right singular vectors $V$ are principal components; singular values measure variance along each axis.
  • Matrix completion: Low-rank SVD reconstructs missing entries in recommender systems (collaborative filtering).
  • Denoising: Truncating small singular values removes noise while preserving structure.
  • Least squares: SVD solves underdetermined and overdetermined systems via pseudoinverse.
  • Conditioning: The ratio $\sigma_{\max}/\sigma_{\min}$ (condition number) flags numerical instability.

Comments