Example slug
example_10_svd_factorization_and_best_rank1_reconstruction
Background
The SVD $X = U\Sigma V^\top$ factorizes any matrix into orthonormal left/right singular vectors and nonnegative singular values. The EckartâYoung theorem (1936) guarantees that truncating to the top $k$ singular triplets minimizes reconstruction error in both Frobenius and spectral norms. This makes SVD the canonical choice for low-rank approximation, compression, and noise filtering.
In ML, SVD underpins PCA (covariance eigenvectors are right singular vectors of centered data), collaborative filtering (matrix completion via low-rank factorization), and latent semantic analysis. The rank-1 case isolates the single most important patternâuseful for visualization, feature extraction, and understanding data geometry.
Purpose
Build intuition for SVD-based low-rank approximation:
- Understand that SVD provides the optimal rank-$k$ approximation in Frobenius norm (EckartâYoung theorem).
- See that the rank-1 approximation captures the dominant pattern in the data.
- Connect singular values to explained variance and reconstruction error.
- Recognize SVD as the unifying tool for PCA, matrix completion, and denoising.
Problem
Compute SVD of centered toy data and rank-1 reconstruction error.
Solution (Math)
SVD: $X_c=U\Sigma V^T$. Best rank-1 approximation is $\sigma_1 u_1 v_1^T$ (EckartâYoung).
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)
import numpy as np
from scripts.toy_data import toy_pca_points
X = toy_pca_points(n=40, seed=0)
Xc = X - X.mean(0)
U,S,Vt = np.linalg.svd(Xc, full_matrices=False)
X1 = np.outer(U[:,0], Vt[0]) * S[0]
print("rank-1 Fro error:", np.linalg.norm(Xc-X1, ord="fro"))
Code Introduction
This snippet computes the best rank-1 approximation of centered 2D data via SVD and reports its Frobenius reconstruction error. The data $X$ is centered to $X_c$, then factorized as $X_c = U\Sigma V^\top$. Keeping only the top singular triplet $(u_1, \sigma_1, v_1)$, the rank-1 approximation is $X_1 = \sigma_1 u_1 v_1^\top$. The code builds this with np.outer(U[:,0], Vt[0]) * S[0] and measures $\|X_c - X_1\|_F$, the Frobenius norm of the residual. For any matrix, the truncated SVD gives the minimum-error rank-$k$ approximation in Frobenius norm; here $k=1$ shows how much variance a single component captures.
Numerical Implementation Details
Numerical and Shape Notes
- Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
- Axis discipline: Be explicit with
axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1.
- Broadcasting: Check that broadcasts are intended (e.g.,
(n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
- Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
- Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
- Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
- Sanity checks: Print shapes and residuals (e.g.,
||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.
Numerical and Implementation Notes
- Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use
solve, lstsq, Cholesky/QR/SVD.
- Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (
axis) and ensure broadcasts are intended.
- Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
- Conditioning: Inspect
np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
- Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g.,
||Ax-b||, reconstruction errors) and assert finiteness.
- Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
- Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
- Sanity checks: Compare against references (e.g.,
lstsq vs. solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
- Center data: $X_c = X - \text{mean}(X)$ to remove the mean offset; covariance and PCA require centered data.
- Compute SVD via
np.linalg.svd(Xc, full_matrices=False) to get $U\in\mathbb{R}^{n\times r}$, $S\in\mathbb{R}^r$, $V^\top\in\mathbb{R}^{r\times d}$ where $r = \min(n,d)$.
- Construct rank-1 approximation: $X_1 = \sigma_1 u_1 v_1^\top$ using
np.outer(U[:,0], Vt[0]) * S[0].
- Compute Frobenius reconstruction error:
np.linalg.norm(Xc - X1, ord="fro") measures total residual.
- For rank-$k$ approximations, sum the first $k$ outer products: $X_k = \sum_{i=1}^k \sigma_i u_i v_i^\top$.
- Shapes: $X:(n\times d)$ (here $40\times 2$), $X_c:(n\times d)$, $U:(n\times 2)$, $S:(2)$, $V^\top:(2\times 2)$, $X_1:(n\times d)$.
What This Example Demonstrates
Pedagogical Significance
- Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
- ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
- Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
- Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
- Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
- Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.
ML Examples and Patterns
- Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
- Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
- Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
- Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
- Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
- SVD decomposes $X_c = U\Sigma V^\top$ with orthonormal bases and nonnegative singular values.
- The best rank-1 approximation is $\sigma_1 u_1 v_1^\top$; its error is $\sqrt{\sigma_2^2 + \sigma_3^2 + \cdots}$.
- Frobenius norm measures total squared reconstruction error across all entries.
- Truncated SVD balances model simplicity (low rank) against fidelity (small residual).
Notes
Part 1: Core setup - SVD factorization and best rank-1 reconstruction
State the objects, shapes, and target question for SVD factorization and best rank-1 reconstruction. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.
Part 2: Geometry and algebraic insight - SVD factorization and best rank-1 reconstruction
Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make SVD factorization and best rank-1 reconstruction work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.
Part 3: Numerics and ML practice - SVD factorization and best rank-1 reconstruction
Give the computational recipe for SVD factorization and best rank-1 reconstruction, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.
- Shape discipline: $X\in\mathbb{R}^{n\times d}$ (here $40\times 2$), $X_c$ centered, $U:(n\times r)$, $S:(r)$, $V^\top:(r\times d)$ with $r=\min(n,d)$.
- Numerical note:
full_matrices=False returns economy SVD, saving memory; use full_matrices=True only if you need complete orthonormal bases.
- Interpretation: Singular values decay quantifies intrinsic dimensionality; a sharp drop after $k$ values suggests effective rank $k$.
- Frobenius error: For rank-$k$ truncation, $\|X_c - X_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$ by EckartâYoung.
History and Applications
Singular Value Decomposition (SVD) was developed independently by multiple researchers: Beltrami (1873), Jordan (1874), and Sylvester (1889) in theoretical contexts. The modern computational algorithm is due to Golub and Kahan (1965), making SVD a practical workhorse for numerical computing.
EckartâYoung theorem: Eckart and Young (1936) proved that the truncated SVDâkeeping only the largest $k$ singular valuesâgives the best rank-$k$ approximation in both Frobenius and spectral norms. This geometric result justified low-rank approximation as an optimal strategy and became central to PCA and matrix compression.
Modern applications: SVD powers recommendation systems (Netflix problem: matrix completion), natural language processing (Latent Semantic Analysis), and image compression. Neural network-based approaches (autoencoders, variational methods) have added probabilistic twists, but the underlying geometryâthat low-rank structure captures data patternsâremains unchanged. SVDâs stability and optimality guarantees make it irreplaceable in numerical linear algebra.
Connection to Broader Examples
- PCA: Right singular vectors $V$ are principal components; singular values measure variance along each axis.
- Matrix completion: Low-rank SVD reconstructs missing entries in recommender systems (collaborative filtering).
- Denoising: Truncating small singular values removes noise while preserving structure.
- Least squares: SVD solves underdetermined and overdetermined systems via pseudoinverse.
- Conditioning: The ratio $\sigma_{\max}/\sigma_{\min}$ (condition number) flags numerical instability.
Comments