ex1.ai

Example number

Example slug

example_10_svd_factorization_and_best_rank1_reconstruction

Background

The SVD $X = U\Sigma V^\top$ factorizes any matrix into orthonormal left/right singular vectors and nonnegative singular values. The EckartâYoung theorem (1936) guarantees that truncating to the top $k$ singular triplets minimizes reconstruction error in both Frobenius and spectral norms. This makes SVD the canonical choice for low-rank approximation, compression, and noise filtering.

In ML, SVD underpins PCA (covariance eigenvectors are right singular vectors of centered data), collaborative filtering (matrix completion via low-rank factorization), and latent semantic analysis. The rank-1 case isolates the single most important patternâuseful for visualization, feature extraction, and understanding data geometry.

Purpose

Build intuition for SVD-based low-rank approximation:

Understand that SVD provides the optimal rank-$k$ approximation in Frobenius norm (EckartâYoung theorem).
See that the rank-1 approximation captures the dominant pattern in the data.
Connect singular values to explained variance and reconstruction error.
Recognize SVD as the unifying tool for PCA, matrix completion, and denoising.

Problem

Compute SVD of centered toy data and rank-1 reconstruction error.

Solution (Math)

SVD: $X_c=U\Sigma V^T$. Best rank-1 approximation is $\sigma_1 u_1 v_1^T$ (EckartâYoung).

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


import numpy as np
from scripts.toy_data import toy_pca_points

X = toy_pca_points(n=40, seed=0)
Xc = X - X.mean(0)
U,S,Vt = np.linalg.svd(Xc, full_matrices=False)

X1 = np.outer(U[:,0], Vt[0]) * S[0]
print("rank-1 Fro error:", np.linalg.norm(Xc-X1, ord="fro"))

Code Introduction

This snippet computes the best rank-1 approximation of centered 2D data via SVD and reports its Frobenius reconstruction error. The data $X$ is centered to $X_c$, then factorized as $X_c = U\Sigma V^\top$. Keeping only the top singular triplet $(u_1, \sigma_1, v_1)$, the rank-1 approximation is $X_1 = \sigma_1 u_1 v_1^\top$. The code builds this with np.outer(U[:,0], Vt[0]) * S[0] and measures $\|X_c - X_1\|_F$, the Frobenius norm of the residual. For any matrix, the truncated SVD gives the minimum-error rank-$k$ approximation in Frobenius norm; here $k=1$ shows how much variance a single component captures.

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

Center data: $X_c = X - \text{mean}(X)$ to remove the mean offset; covariance and PCA require centered data.
Compute SVD via np.linalg.svd(Xc, full_matrices=False) to get $U\in\mathbb{R}^{n\times r}$, $S\in\mathbb{R}^r$, $V^\top\in\mathbb{R}^{r\times d}$ where $r = \min(n,d)$.
Construct rank-1 approximation: $X_1 = \sigma_1 u_1 v_1^\top$ using np.outer(U[:,0], Vt[0]) * S[0].
Compute Frobenius reconstruction error: np.linalg.norm(Xc - X1, ord="fro") measures total residual.
For rank-$k$ approximations, sum the first $k$ outer products: $X_k = \sum_{i=1}^k \sigma_i u_i v_i^\top$.
Shapes: $X:(n\times d)$ (here $40\times 2$), $X_c:(n\times d)$, $U:(n\times 2)$, $S:(2)$, $V^\top:(2\times 2)$, $X_1:(n\times d)$.

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

SVD decomposes $X_c = U\Sigma V^\top$ with orthonormal bases and nonnegative singular values.
The best rank-1 approximation is $\sigma_1 u_1 v_1^\top$; its error is $\sqrt{\sigma_2^2 + \sigma_3^2 + \cdots}$.
Frobenius norm measures total squared reconstruction error across all entries.
Truncated SVD balances model simplicity (low rank) against fidelity (small residual).

Notes

Part 1: Core setup - SVD factorization and best rank-1 reconstruction

State the objects, shapes, and target question for SVD factorization and best rank-1 reconstruction. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - SVD factorization and best rank-1 reconstruction

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make SVD factorization and best rank-1 reconstruction work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - SVD factorization and best rank-1 reconstruction

Give the computational recipe for SVD factorization and best rank-1 reconstruction, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

Shape discipline: $X\in\mathbb{R}^{n\times d}$ (here $40\times 2$), $X_c$ centered, $U:(n\times r)$, $S:(r)$, $V^\top:(r\times d)$ with $r=\min(n,d)$.
Numerical note: full_matrices=False returns economy SVD, saving memory; use full_matrices=True only if you need complete orthonormal bases.
Interpretation: Singular values decay quantifies intrinsic dimensionality; a sharp drop after $k$ values suggests effective rank $k$.
Frobenius error: For rank-$k$ truncation, $\|X_c - X_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$ by EckartâYoung.

History and Applications

Singular Value Decomposition (SVD) was developed independently by multiple researchers: Beltrami (1873), Jordan (1874), and Sylvester (1889) in theoretical contexts. The modern computational algorithm is due to Golub and Kahan (1965), making SVD a practical workhorse for numerical computing.

EckartâYoung theorem: Eckart and Young (1936) proved that the truncated SVDâkeeping only the largest $k$ singular valuesâgives the best rank-$k$ approximation in both Frobenius and spectral norms. This geometric result justified low-rank approximation as an optimal strategy and became central to PCA and matrix compression.

Modern applications: SVD powers recommendation systems (Netflix problem: matrix completion), natural language processing (Latent Semantic Analysis), and image compression. Neural network-based approaches (autoencoders, variational methods) have added probabilistic twists, but the underlying geometryâthat low-rank structure captures data patternsâremains unchanged. SVDâs stability and optimality guarantees make it irreplaceable in numerical linear algebra.

Connection to Broader Examples

PCA: Right singular vectors $V$ are principal components; singular values measure variance along each axis.
Matrix completion: Low-rank SVD reconstructs missing entries in recommender systems (collaborative filtering).
Denoising: Truncating small singular values removes noise while preserving structure.
Least squares: SVD solves underdetermined and overdetermined systems via pseudoinverse.
Conditioning: The ratio $\sigma_{\max}/\sigma_{\min}$ (condition number) flags numerical instability.