Chapter 07 — Singular Value Decomposition & Low-Rank Approximation

Overview

Purpose of the Chapter

Role in Book Arc

This chapter generalizes spectral reasoning from square matrices to arbitrary data operators. After Chapter 06 introduced eigen-geometry, Chapter 07 establishes SVD as the universal decomposition for rectangular matrices and approximation tasks. It is the central bridge from linear algebra structure to dimensionality reduction, compression, denoising, and scalable ML representation learning.

Core Concept and Supporting Concepts

Main Concept: Singular value decomposition expresses any matrix as orthogonal input/output bases plus axis-aligned scaling, yielding optimal low-rank approximations and stable inverse-like solutions.

Supporting Concepts:

Universal factorization: every matrix admits $A=U\Sigma V^\top$.
Geometry of action: SVD maps spheres to ellipsoids with axes set by singular vectors.
Rank is spectral: nonzero singular values exactly characterize rank.
Best approximation theorem: truncated SVD is optimal under Frobenius and operator norms.
Pseudoinverse via SVD: minimum-norm least-squares solutions are computed robustly.
Norms from singular values: spectral, Frobenius, and nuclear norms become transparent.
Conditioning is singular-value spread: $\kappa=\sigma_{\max}/\sigma_{\min}$ controls stability.
Four-subspace structure: row/column/null spaces align with singular vectors.
Algorithmic variants matter: full, partial, and randomized SVD trade accuracy for scale.
Low-rank priors power ML: compression, recommendation, and representation learning rely on singular decay.

Learning Outcomes

By the end of this chapter, you will be able to:

Construct and interpret full and truncated SVD factorizations.
Explain the geometric meaning of singular vectors and singular values.
Compute rank, norms, and condition indicators directly from spectra.
Apply Eckart-Young-Mirsky optimality to low-rank approximation design.
Use pseudoinverse formulas for stable least-squares and minimum-norm solutions.
Distinguish exact rank from numerical rank under finite precision.
Relate SVD to eigendecomposition and identify when they coincide.
Select practical algorithms (full/partial/randomized) for problem scale.
Diagnose failure modes from singular-value decay patterns.
Connect SVD workflows to PCA, recommender systems, and model compression.

Scope: What This Chapter Covers

This chapter covers the following conceptual and computational scope.

SVD foundations: existence, structure, and relation to eigendecomposition.
Low-rank optimality: truncated SVD and approximation-error characterization.
Inverse and least-squares tools: Moore-Penrose pseudoinverse and solution geometry.
Norms and conditioning: spectral properties and numerical stability implications.
Algorithmic practice: deterministic and randomized methods for large matrices.
ML applications: compression, denoising, latent factors, and scalable retrieval systems.

Connections to Other Chapters

This chapter connects directly to the full-book arc through the following progression.

Chapter 6: extends eigen-analysis into a rectangular-matrix framework.
Optimization chapters: informs curvature-aware updates, preconditioning, and solver stability.
Dimensionality-reduction chapters: supplies the computational engine behind PCA-style methods.
Recommender and retrieval chapters: grounds latent-factor and index-compression pipelines.
Deep-learning chapters: supports weight compression, spectral control, and adaptation layers.
Statistical-learning chapters: links effective rank and regularization to generalization behavior.

Questions This Chapter Answers

This chapter answers the following fundamental questions, aligned with major proof and implementation exercises.

Why does SVD always exist? What structure guarantees decomposition for any real matrix?
How is SVD a geometric map? How do rotations and scaling explain matrix action?
Why is truncated SVD optimal? What exactly is minimized, and why does no better rank-$k$ model exist?
How do singular values encode stability? When does conditioning make solves unreliable?
When do SVD and eigendecomposition coincide? What changes for symmetric or PSD matrices?
How does pseudoinverse resolve singular systems? Why does it return minimum-norm least-squares solutions?
How should numerical rank be chosen? Which tolerances separate noise from signal?
Which norm should we optimize? How do Frobenius and spectral errors differ in practice?
What algorithm fits scale constraints? When are randomized methods preferable to exact decompositions?
Where does SVD show up in ML pipelines? How does it drive recommendation, compression, and representation alignment?

Concrete ML Examples

This section grounds the abstract theory in concrete worked examples that share a consistent stepwise structure.

Low-Rank Recommender Factorization at Scale
1) Concept summary: truncated SVD captures dominant collaborative structure while reducing recommender complexity.
2) Problem statement: decide whether rank $k=2$ is sufficient for serving predictions from a user-item matrix.
3) Problem setup: We approximate the observed interaction matrix with a low-rank model to reduce storage and overfitting. The retained singular directions define latent factors used for fast scoring with dot products. We test whether the retained energy at rank 2 meets a quality threshold before production rollout.
4) Explicit values: singular values of $R$ are $\sigma=[9,4,2,1]$, candidate rank $k=2$, required retained energy ratio $\eta=0.85$.
5) Formula with symbols defined: retained spectral energy $r_k=\frac{\sum_{i=1}^{k}\sigma_i^2}{\sum_{i=1}^{p}\sigma_i^2}$, where $p$ is the number of singular values.
6) Plug-in step: numerator $=9^2+4^2=97$, denominator $=9^2+4^2+2^2+1^2=102$, so $r_2=97/102$.
7) Computed result: $r_2\approx0.951$.
8) Decision / interpretation: since $0.951 > 0.85$, rank-2 factorization is acceptable for deployment.
9) Sensitivity check: if the tail is heavier, for example $\sigma_3=4$ and $\sigma_4=3$ (which keeps the ordering $\sigma_2\ge\sigma_3\ge\sigma_4$), then $r_2=97/(97+16+9)=97/122\approx0.795$, indicating rank 2 would no longer be sufficient.
Truncated SVD for Search Index Compression
1) Concept summary: truncating singular directions compresses retrieval vectors while preserving most semantic signal.
2) Problem statement: determine whether compressing an index to rank $k=3$ stays within recall tolerance.
3) Problem setup: We build a low-rank representation of document embeddings so index memory and cache pressure decrease. Retrieval quality depends on how much variance remains after truncation. We evaluate retained variance as a proxy before full online A/B rollout.
4) Explicit values: singular values $\sigma=[10,6,3,1]$, candidate $k=3$, minimum retained variance ratio $\eta=0.95$.
5) Formula with symbols defined: $r_k=\frac{\sum_{i=1}^{k}\sigma_i^2}{\sum_{i=1}^{p}\sigma_i^2}$, where $r_k$ is retained variance ratio.
6) Plug-in step: numerator $=10^2+6^2+3^2=145$, denominator $=145+1^2=146$, so $r_3=145/146$.
7) Computed result: $r_3\approx0.993$.
8) Decision / interpretation: since $0.993 > 0.95$, rank-3 compression is safe for initial deployment.
9) Sensitivity check: if the tail singular value is heavier, for example $\sigma_4=3$ (the largest value consistent with $\sigma_3=3$), the retained ratio drops to $145/(145+9)=145/154\approx0.942$, which falls below $\eta=0.95$ and requires a higher rank.
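The retained-energy checks in the two examples above reduce to a few lines of arithmetic. The sketch below is illustrative only; the helper name `retained_energy` is an assumption of this example and is not defined elsewhere in the chapter.

```python
def retained_energy(singular_values, k):
    """Return the sum of the k largest sigma_i^2 divided by the sum of all sigma_i^2."""
    squares = sorted((s * s for s in singular_values), reverse=True)
    return sum(squares[:k]) / sum(squares)

# Recommender example: sigma = [9, 4, 2, 1], k = 2, threshold eta = 0.85.
r2 = retained_energy([9, 4, 2, 1], k=2)
# Index-compression example: sigma = [10, 6, 3, 1], k = 3, threshold eta = 0.95.
r3 = retained_energy([10, 6, 3, 1], k=3)

print(f"r_2 = {r2:.3f}, passes: {r2 >= 0.85}")
print(f"r_3 = {r3:.3f}, passes: {r3 >= 0.95}")
```

Sorting inside the helper guards against singular values supplied out of order, since the top-$k$ components are always the $k$ largest.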
LoRA-Style Adaptation as Structured Low-Rank Updates
1) Concept summary: LoRA constrains fine-tuning updates to a low-rank subspace to reduce trainable parameters.
2) Problem statement: verify parameter savings when replacing full update $\Delta W$ with rank-$r$ factors $AB^\top$.
3) Problem setup: Instead of learning a dense update for every weight, we learn two smaller matrices whose product approximates the update. This reduces memory, optimizer state, and checkpoint size while preserving adaptation flexibility in important directions. We compare full and low-rank parameter counts.
4) Explicit values: layer shape $m=1024, n=1024$, LoRA rank $r=8$.
5) Formula with symbols defined: full update params $P_{\text{full}}=mn$; LoRA params $P_{\text{lora}}=r(m+n)$.
6) Plug-in step: $P_{\text{full}}=1024\cdot1024=1{,}048{,}576$; $P_{\text{lora}}=8(1024+1024)=16{,}384$.
7) Computed result: reduction factor $=1{,}048{,}576 / 16{,}384 = 64\times$.
8) Decision / interpretation: rank-8 LoRA gives a major parameter reduction and is suitable for parameter-efficient adaptation.
9) Sensitivity check: if rank doubles to $r=16$, parameters double to $32{,}768$, still a $32\times$ reduction versus full update.
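The parameter counts in this example can be reproduced with a short sketch. The function name `lora_param_counts` is hypothetical and chosen only for this illustration; it counts entries of the factors $A \in \mathbb{R}^{m\times r}$ and $B \in \mathbb{R}^{n\times r}$ and nothing else (no biases or scaling constants).

```python
def lora_param_counts(m, n, r):
    """Return (full, low_rank) trainable-parameter counts for an m x n update."""
    full = m * n              # dense update Delta W
    low_rank = r * (m + n)    # factor A is m x r, factor B is n x r
    return full, low_rank

for r in (8, 16):
    full, low = lora_param_counts(1024, 1024, r)
    print(f"r={r}: full={full:,}, lora={low:,}, reduction={full / low:.0f}x")
```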
Noise-Robust Compression via Singular-Value Thresholding
1) Concept summary: thresholding small singular values removes noise-dominated components before reconstruction.
2) Problem statement: decide which singular components to keep under a denoising threshold.
3) Problem setup: We decompose a noisy matrix and suppress low-energy modes that likely encode measurement noise. The reconstruction uses only singular values above a threshold, producing a lower-rank denoised estimate. We then verify retained energy is still adequate for downstream tasks.
4) Explicit values: singular values $\sigma=[7,3,1.2,0.4]$, hard threshold $\tau=1.0$.
5) Formula with symbols defined: keep components where $\sigma_i \ge \tau$; retained energy ratio $r=\frac{\sum_{\sigma_i\ge\tau}\sigma_i^2}{\sum_i\sigma_i^2}$.
6) Plug-in step: kept values are $7,3,1.2$; numerator $=49+9+1.44=59.44$; denominator $=59.44+0.16=59.60$.
7) Computed result: $r=59.44/59.60\approx0.997$.
8) Decision / interpretation: thresholding removes the smallest mode while retaining nearly all signal energy.
9) Sensitivity check: if threshold increases to $\tau=2.0$, the 1.2 component is also removed and retained energy drops to $(49+9)/59.60\approx0.973$.
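A minimal NumPy sketch of the hard-thresholding step follows. It assumes a synthetic $6\times4$ matrix built to have exactly the singular values of this example; the helper name `hard_threshold_svd`, the matrix shape, and the random seed are illustrative choices, not part of the worked example.

```python
import numpy as np

def hard_threshold_svd(A, tau):
    """Rebuild A from the singular components with sigma_i >= tau."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s >= tau
    A_hat = (U[:, keep] * s[keep]) @ Vt[keep, :]
    energy = np.sum(s[keep] ** 2) / np.sum(s ** 2)
    return A_hat, int(keep.sum()), energy

# Synthetic matrix with prescribed singular values (assumption for illustration).
rng = np.random.default_rng(0)
Q_left, _ = np.linalg.qr(rng.standard_normal((6, 4)))    # orthonormal columns
Q_right, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # orthogonal matrix
sigma = np.array([7.0, 3.0, 1.2, 0.4])
A = (Q_left * sigma) @ Q_right.T

A_hat, rank_kept, energy = hard_threshold_svd(A, tau=1.0)
print(rank_kept, round(energy, 3))
print(np.linalg.norm(A - A_hat))   # Frobenius error equals the dropped sigma
```

Because only the $0.4$ mode is dropped, the Frobenius reconstruction error is that single discarded singular value.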

Definitions

Singular Values

Definition: For a matrix $A \in \mathbb{R}^{m \times n}$, the singular values are the nonnegative numbers $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p \geq 0$ (where $p = \min(m,n)$) defined by $\sigma_i = \sqrt{\lambda_i(A^\top A)}$, with $\lambda_i(A^\top A)$ the eigenvalues of $A^\top A$ sorted in nonincreasing order.
Assumptions: $A$ is real; $A^\top A$ is symmetric positive semidefinite, so its eigenvalues are real and nonnegative.
Notation: Use $\sigma_i(A)$ for the $i$-th singular value, $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_p)$ for the diagonal matrix of singular values.
Usage: $\sigma_i$ measures the factor by which $A$ stretches the unit sphere along the $i$-th principal axis; $\sigma_1$ is the maximum amplification, $\sigma_p$ is the minimum amplification (possibly zero).
Valid Example: For $A = \begin{bmatrix}3 & 0\\0 & 4\end{bmatrix}$, $A^\top A = \text{diag}(9,16)$, so $\sigma_1 = 4$, $\sigma_2 = 3$.
Failure Case: For $A = \begin{bmatrix}0 & 1\\-2 & 0\end{bmatrix}$, the eigenvalues are $\pm i\sqrt{2}$. Treating these eigenvalues as singular values is invalid: singular values must be nonnegative real numbers. Here $A^\top A = \text{diag}(4,1)$, so the correct singular values are $\sigma_1 = 2$ and $\sigma_2 = 1$.
Explicit ML Relevance: Singular values determine stability and capacity: spectral normalization divides weights by $\sigma_1$ to control Lipschitz constants; singular value decay drives PCA and low-rank compression.
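The definition can be checked numerically by comparing $\sqrt{\lambda_i(A^\top A)}$ with a library SVD. This is a sketch under stated assumptions: the helper `singular_values_via_gram` is a name introduced here, and the nilpotent matrix `N` is an extra illustration (not from the text) of a matrix whose eigenvalues say nothing about its singular values.

```python
import numpy as np

def singular_values_via_gram(A):
    """sigma_i = sqrt(lambda_i(A^T A)), returned in nonincreasing order."""
    A = np.asarray(A, dtype=float)
    lam = np.linalg.eigvalsh(A.T @ A)[::-1]        # eigvalsh returns ascending order
    p = min(A.shape)
    return np.sqrt(np.clip(lam[:p], 0.0, None))    # clip guards tiny negative round-off

A = np.array([[3.0, 0.0],
              [0.0, 4.0]])
N = np.array([[0.0, 2.0],
              [0.0, 0.0]])      # nilpotent: both eigenvalues are 0

print(singular_values_via_gram(A), np.linalg.svd(A, compute_uv=False))
print(np.linalg.eigvals(N), singular_values_via_gram(N))
```

In practice a library SVD is preferable to forming $A^\top A$, because the Gram matrix squares the condition number of $A$.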

Left Singular Vectors

Definition: Left singular vectors are the orthonormal eigenvectors of $AA^\top$. If $A = U\Sigma V^\top$, then the columns $\mathbf{u}_i$ of $U$ satisfy $AA^\top \mathbf{u}_i = \sigma_i^2 \mathbf{u}_i$.
Assumptions: $A \in \mathbb{R}^{m \times n}$; $AA^\top$ is symmetric positive semidefinite so it has an orthonormal eigenbasis.
Notation: Use $\mathbf{u}_i \in \mathbb{R}^m$ for the $i$-th left singular vector, $U = [\mathbf{u}_1, \ldots, \mathbf{u}_m]$ orthogonal.
Usage: $\mathbf{u}_i$ is the output direction associated with input direction $\mathbf{v}_i$; $A\mathbf{v}_i = \sigma_i \mathbf{u}_i$.
Valid Example: For $A = \begin{bmatrix}3 & 0\\0 & 4\end{bmatrix}$, $AA^\top = \text{diag}(9,16)$, so $\mathbf{u}_1 = (0,1)^\top$, $\mathbf{u}_2 = (1,0)^\top$.
Failure Case: Any non-orthonormal set claimed as left singular vectors is invalid, and orthonormality alone is not sufficient. For the same $A$, the vectors $(1,1)^\top / \sqrt{2}$ and $(1,-1)^\top / \sqrt{2}$ are orthonormal but are not eigenvectors of $AA^\top$, so they are not left singular vectors.
Explicit ML Relevance: Left singular vectors define the dominant output subspace of a linear layer and are used to analyze output feature geometry and activations in neural networks.

Right Singular Vectors

Definition: Right singular vectors are the orthonormal eigenvectors of $A^\top A$. If $A = U\Sigma V^\top$, then the columns $\mathbf{v}_i$ of $V$ satisfy $A^\top A\mathbf{v}_i = \sigma_i^2 \mathbf{v}_i$.
Assumptions: $A \in \mathbb{R}^{m \times n}$; $A^\top A$ is symmetric positive semidefinite.
Notation: Use $\mathbf{v}_i \in \mathbb{R}^n$ for the $i$-th right singular vector, $V = [\mathbf{v}_1, \ldots, \mathbf{v}_n]$ orthogonal.
Usage: $\mathbf{v}_i$ is an input direction that is mapped by $A$ to $\sigma_i \mathbf{u}_i$; these vectors form the principal axes of the input space.
Valid Example: For $A = \begin{bmatrix}1 & 0\\0 & 2\end{bmatrix}$, $A^\top A = \text{diag}(1,4)$, so $\mathbf{v}_1 = (0,1)^\top$, $\mathbf{v}_2 = (1,0)^\top$ with $\sigma_1 = 2$, $\sigma_2 = 1$.
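Failure Case: Taking eigenvectors of $A$ itself (when $A$ is not symmetric) as right singular vectors is invalid. For $A = \begin{bmatrix}0 & 1\\0 & 0\end{bmatrix}$, the only eigenvector direction of $A$ is $(1,0)^\top$, so the eigenvectors of $A$ cannot supply the orthonormal eigenbasis of $A^\top A = \text{diag}(0,1)$, whose right singular vectors are $\mathbf{v}_1 = (0,1)^\top$ and $\mathbf{v}_2 = (1,0)^\top$.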
Explicit ML Relevance: Right singular vectors provide principal directions for PCA and define the input subspace retained in dimensionality reduction.

Singular Value Decomposition

Definition: The singular value decomposition of $A \in \mathbb{R}^{m \times n}$ is $A = U\Sigma V^\top$, where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal, and $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with nonnegative entries $\sigma_1 \geq \cdots \geq \sigma_p \geq 0$.
Assumptions: $A$ is real; orthogonal matrices exist from the spectral theorem applied to $A^\top A$ and $AA^\top$.
Notation: Use $A = U\Sigma V^\top$; write the rank-$r$ form as $A = \sum_{i=1}^r \sigma_i \mathbf{u}_i\mathbf{v}_i^\top$.
Usage: SVD factorizes any linear map into rotation, scaling, and rotation. It reveals geometry, rank, and provides optimal low-rank approximations.
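Valid Example: For $A = \begin{bmatrix}3 & 0\\0 & 4\end{bmatrix}$ with singular values sorted, $\Sigma = \text{diag}(4,3)$ and $U = V = \begin{bmatrix}0 & 1\\1 & 0\end{bmatrix}$, so that $\mathbf{u}_1 = \mathbf{v}_1 = (0,1)^\top$. The choice $U = V = I$ pairs only with the unsorted $\Sigma = \text{diag}(3,4)$.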
Failure Case: A factorization $A = UDV^\top$ with negative diagonal entries in $D$ is not an SVD unless the negatives are moved into sign flips of $U$ or $V$ to make $D \geq 0$.
Explicit ML Relevance: SVD underlies PCA, low-rank model compression, and robust estimation; it is the standard tool for analyzing linear layers and embedding matrices.
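The shapes in the definition can be confirmed with NumPy's `np.linalg.svd`, which returns $U$, the vector of singular values, and $V^\top$. The $3\times2$ matrix below is an assumed example chosen so that $\Sigma$ is genuinely rectangular; the rank-one sum mirrors the expansion $A = \sum_i \sigma_i \mathbf{u}_i\mathbf{v}_i^\top$.

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 4.0],
              [0.0, 0.0]])    # 3 x 2, chosen for illustration

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U is 3x3, s has length 2, Vt is 2x2

# Embed the singular values in an m x n "diagonal" matrix.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

reconstruction_ok = np.allclose(U @ Sigma @ Vt, A)
U_orthogonal = np.allclose(U.T @ U, np.eye(3))
V_orthogonal = np.allclose(Vt @ Vt.T, np.eye(2))

# Equivalent sum of rank-one terms sigma_i * u_i * v_i^T.
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))

print(s, reconstruction_ok, U_orthogonal, V_orthogonal)
```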

Numerical Rank

Definition: The numerical rank of $A$ at tolerance $\tau$ is $r_{\tau}(A) = \#\{i : \sigma_i(A) > \tau\}$.
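Assumptions: A tolerance $\tau$ is chosen based on scale and machine precision, commonly $\tau = \epsilon \sigma_1$. In double precision, machine epsilon is about $2.2 \times 10^{-16}$, so a choice such as $\epsilon \sim 10^{-12}$ leaves a margin for accumulated rounding error; noisy data usually calls for a larger, noise-informed $\epsilon$.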
Notation: Use $r_{\tau}(A)$ or $\text{rank}_\tau(A)$. Distinguish algebraic rank (exact nonzero $\sigma_i$) from numerical rank.
Usage: Numerical rank estimates effective dimensionality in noisy data and avoids instability from tiny singular values.
Valid Example: If $\sigma(A) = (10, 1, 10^{-10})$ and $\tau = 10^{-8}$, then $r_{\tau}(A) = 2$.
Failure Case: Declaring numerical rank without specifying $\tau$ is ill-defined; different tolerances yield different ranks.
Explicit ML Relevance: Numerical rank determines how many principal components to keep and how many effective degrees of freedom a model uses.
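The dependence of numerical rank on the tolerance is easy to see in code. The sketch below applies the definition directly to a list of singular values; `numerical_rank` is a hypothetical helper name used only here.

```python
def numerical_rank(singular_values, tau):
    """Count singular values strictly greater than the tolerance tau."""
    return sum(1 for s in singular_values if s > tau)

sigma = [10.0, 1.0, 1e-10]
for tau in (1e-8, 1e-12):
    print(f"tau = {tau:g}: numerical rank = {numerical_rank(sigma, tau)}")
```

With $\tau = 10^{-8}$ the smallest value is treated as noise, while $\tau = 10^{-12}$ counts it as signal, which is why the tolerance must always be reported alongside the rank.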

Frobenius Norm

Definition: The Frobenius norm of $A \in \mathbb{R}^{m \times n}$ is $\|A\|_F = \sqrt{\sum_{i=1}^m\sum_{j=1}^n A_{ij}^2} = \sqrt{\sum_{i=1}^p \sigma_i^2}$.
Assumptions: $A$ is real; equivalence to singular values follows from SVD.
Notation: Use $\|A\|_F$ for Frobenius norm, $\langle A,B \rangle = \text{tr}(A^\top B)$ for the associated inner product.
Usage: Measures total energy of a matrix and is invariant under orthogonal transforms: $\|UAV\|_F = \|A\|_F$ for orthogonal $U$ and $V$.
Valid Example: For $A = \begin{bmatrix}1 & 2\\3 & 4\end{bmatrix}$, $\|A\|_F = \sqrt{1+4+9+16} = \sqrt{30}$.
Failure Case: The quantity $\max_{ij} |A_{ij}|$ is not the Frobenius norm; it is the entrywise max norm.
Explicit ML Relevance: Frobenius norm is the standard loss in matrix approximation, used in PCA reconstruction error and low-rank compression objectives.

Spectral Norm

Definition: The spectral norm of $A$ is $\|A\|_2 = \max_{\|\mathbf{x}\|_2=1} \|A\mathbf{x}\|_2 = \sigma_1(A)$.
Assumptions: $A$ is real; the maximum exists since the unit sphere is compact and $\|A\mathbf{x}\|$ is continuous.
Notation: Use $\|A\|_2$ or $\|A\|$ when context is clear; always specify the norm if multiple norms are used.
Usage: Measures the maximum amplification of vectors under $A$; equals the largest singular value.
Valid Example: For $A = \text{diag}(5,2)$, $\|A\|_2 = 5$.
Failure Case: The sum of singular values $\sum_i \sigma_i$ is the nuclear norm, not the spectral norm.
Explicit ML Relevance: Spectral norm controls Lipschitz constants and stability of neural networks (spectral normalization, robustness bounds).

Best Rank-k Approximation

Definition: For $A \in \mathbb{R}^{m \times n}$ and $k < \text{rank}(A)$, a best rank-$k$ approximation is a matrix $A_k$ of rank $k$ that minimizes $\|A - X\|_F$ (or $\|A - X\|_2$) over all rank-$k$ matrices $X$.
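Assumptions: Norm must be specified; a minimizer exists because the set of matrices with rank at most $k$ is closed in finite dimension (the set of matrices with rank exactly $k$ is not), and the truncated SVD attains the minimum with rank exactly $k$.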
Notation: Use $A_k$ for the optimal rank-$k$ approximation, with context specifying the norm.
Usage: Captures the best low-dimensional summary of $A$; used for compression and denoising.
Valid Example: If $A = U\Sigma V^\top$ with $\sigma = (9,3,1)$, then $A_1 = 9\mathbf{u}_1\mathbf{v}_1^\top$ is the best rank-1 approximation in Frobenius and spectral norms.
Failure Case: Choosing the top $k$ columns of $A$ as $A_k$ is generally not optimal and does not minimize $\|A - X\|_F$.
Explicit ML Relevance: Best rank-$k$ approximation underlies truncated SVD for PCA, compressed embeddings, and low-rank adaptation (LoRA) techniques.

Truncated SVD

Definition: Given $A = \sum_{i=1}^r \sigma_i \mathbf{u}_i\mathbf{v}_i^\top$, the truncated SVD of rank $k$ is $A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i\mathbf{v}_i^\top$.
Assumptions: Singular values are ordered $\sigma_1 \geq \cdots \geq \sigma_r > 0$; choose $k < r$.
Notation: Use $U_k$, $\Sigma_k$, $V_k$ for the truncated factors, so $A_k = U_k\Sigma_k V_k^\top$.
Usage: Truncation discards small singular modes, yielding the optimal low-rank approximation and denoising.
Valid Example: If $\Sigma = \text{diag}(10,2,0.5)$, then $A_2$ keeps the 10 and 2 components and drops 0.5.
Failure Case: Dropping non-leading singular values (e.g., keep $\sigma_1$ and $\sigma_3$ but not $\sigma_2$) generally does not yield the optimal rank-2 approximation.
Explicit ML Relevance: Truncated SVD is the standard method for PCA compression, latent semantic analysis, and low-rank recommender models.

Low-Rank Factorization

Definition: A low-rank factorization of $A$ is an expression $A \approx BC$ where $B \in \mathbb{R}^{m \times k}$, $C \in \mathbb{R}^{k \times n}$, and $k \ll \min(m,n)$.
Assumptions: The approximation target and loss must be specified (e.g., Frobenius, weighted Frobenius, or likelihood-based loss).
Notation: Use $k$ for target rank, $B$ and $C$ as factors; specify whether factors are constrained (nonnegative, sparse, orthogonal).
Usage: Captures latent structure and compresses matrices into two smaller factors; not necessarily unique.
Valid Example: For a rating matrix $R$, a factorization $R \approx UV^\top$ with $U \in \mathbb{R}^{m \times 50}$, $V \in \mathbb{R}^{n \times 50}$ yields 50-dimensional user and item embeddings.
Failure Case: Claiming exact equality $A = BC$ with $k < \text{rank}(A)$ is impossible; only approximation is possible when $k$ is too small.
Explicit ML Relevance: Low-rank factorization powers recommender systems, matrix completion, and parameter-efficient fine-tuning (LoRA, adapters).

Column Space and Row Space via SVD

Definition: For $A = U\Sigma V^\top$ with rank $r$, the column space is $\text{range}(A) = \text{span}(\mathbf{u}_1, \ldots, \mathbf{u}_r)$, and the row space is $\text{range}(A^\top) = \text{span}(\mathbf{v}_1, \ldots, \mathbf{v}_r)$.
Assumptions: $A$ has SVD; $r$ is the number of nonzero singular values.
Notation: Use $U_r$ and $V_r$ for the first $r$ singular vectors; denote null spaces by $\text{null}(A)$ and $\text{null}(A^\top)$.
Usage: SVD provides orthonormal bases for column and row spaces and cleanly separates signal from null directions.
Valid Example: If $A = \begin{bmatrix}1 & 0\\0 & 0\end{bmatrix}$, then $\text{range}(A) = \text{span}((1,0)^\top)$ and $\text{range}(A^\top) = \text{span}((1,0)^\top)$, with $r=1$.
Failure Case: Using non-orthogonal columns as a basis for the column space can be valid but is not an SVD-based characterization; claiming they are singular vectors is incorrect unless they are eigenvectors of $AA^\top$.
Explicit ML Relevance: Column and row spaces identify the learned feature subspaces in representation learning and inform dimensionality reduction decisions.

Condition Number

Definition: For a full-rank matrix $A \in \mathbb{R}^{n \times n}$, the condition number in the spectral norm is $\kappa_2(A) = \|A\|_2\|A^{-1}\|_2 = \sigma_1(A)/\sigma_n(A)$. For rectangular full column rank matrices, $\kappa_2(A) = \sigma_1(A)/\sigma_{\min}(A)$.
Assumptions: $A$ has full rank so $\sigma_{\min} > 0$; otherwise $\kappa_2(A) = \infty$.
Notation: Use $\kappa_2(A)$ or $\kappa(A)$ with specified norm; avoid mixing $\kappa_2$ with $\kappa_F$.
Usage: Quantifies sensitivity: small perturbations in $A$ or $\mathbf{b}$ can cause large changes in the solution of $A\mathbf{x} = \mathbf{b}$ when $\kappa$ is large.
Valid Example: For $A = \text{diag}(10,1)$, $\kappa_2(A) = 10$.
Failure Case: Defining $\kappa$ for a rank-deficient matrix as finite is incorrect; if $\sigma_{\min} = 0$, the condition number is infinite.
Explicit ML Relevance: Condition number controls optimization speed and numerical stability in linear regression, least squares, and deep network training.

Theorems

Existence of the Singular Value Decomposition

Formal statement. For any matrix $A \in \mathbb{R}^{m \times n}$, there exist orthogonal matrices $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$, and a diagonal matrix $\Sigma \in \mathbb{R}^{m \times n}$ with nonnegative diagonal entries such that $A = U\Sigma V^\top$.

Full formal proof. 1. Consider $A^\top A \in \mathbb{R}^{n \times n}$. It is symmetric and positive semidefinite because for any $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{x}^\top A^\top A\mathbf{x} = \|A\mathbf{x}\|_2^2 \geq 0$. 2. By the spectral theorem, there exists an orthogonal matrix $V$ and a diagonal matrix $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$ with $\lambda_i \geq 0$ such that $A^\top A = V\Lambda V^\top$. 3. Define $\sigma_i = \sqrt{\lambda_i}$. Let $r$ be the number of strictly positive $\sigma_i$. Arrange $\sigma_1 \geq \cdots \geq \sigma_r > 0$ and $\sigma_{r+1} = \cdots = \sigma_n = 0$. 4. For each $i \leq r$, define $\mathbf{u}_i = \frac{1}{\sigma_i} A\mathbf{v}_i$. Then $\|\mathbf{u}_i\|_2^2 = \frac{1}{\sigma_i^2} \mathbf{v}_i^\top A^\top A \mathbf{v}_i = \frac{1}{\sigma_i^2} \lambda_i \|\mathbf{v}_i\|_2^2 = 1$, so $\mathbf{u}_i$ has unit norm. 5. For $i \neq j \leq r$, $\mathbf{u}_i^\top \mathbf{u}_j = \frac{1}{\sigma_i\sigma_j} \mathbf{v}_i^\top A^\top A \mathbf{v}_j = \frac{1}{\sigma_i\sigma_j} \lambda_j \mathbf{v}_i^\top \mathbf{v}_j = 0$, since $\mathbf{v}_i$ are orthonormal. Hence $\{\mathbf{u}_1, \ldots, \mathbf{u}_r\}$ is orthonormal. 6. Extend $\{\mathbf{u}_1, \ldots, \mathbf{u}_r\}$ to an orthonormal basis of $\mathbb{R}^m$ by adding vectors $\mathbf{u}_{r+1}, \ldots, \mathbf{u}_m$. Let $U$ be the orthogonal matrix with columns $\mathbf{u}_1, \ldots, \mathbf{u}_m$. 7. Define $\Sigma \in \mathbb{R}^{m \times n}$ to be diagonal with entries $\sigma_1, \ldots, \sigma_p$ ($p = \min(m,n)$) on the diagonal and zeros elsewhere. 8. For $i \leq r$, we have $A\mathbf{v}_i = \sigma_i \mathbf{u}_i$. For $i > r$, $\sigma_i = 0$ and $A\mathbf{v}_i = \mathbf{0}$ because $A^\top A\mathbf{v}_i = 0$ implies $\|A\mathbf{v}_i\|_2^2 = 0$. Thus $A V = U\Sigma$, and multiplying on the right by $V^\top$ yields $A = U\Sigma V^\top$.

Interpretation. Any linear map decomposes into input rotation, axis-aligned scaling, and output rotation, revealing its geometry and rank structure.

Explicit ML relevance. Guarantees that PCA, low-rank compression, and spectral normalization are always well-defined for any data matrix or weight matrix.
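The proof is constructive, so its steps can be followed in code. The sketch below is a teaching aid under stated assumptions rather than a recommended algorithm: the function name `svd_via_gram` and the relative tolerance `rel_tol` are choices made for this example, and forming $A^\top A$ squares the condition number, which is why production code uses a dedicated SVD routine instead.

```python
import numpy as np

def svd_via_gram(A, rel_tol=1e-6):
    """Build U, Sigma, V by following the existence proof step by step."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    lam, V = np.linalg.eigh(A.T @ A)            # steps 1-2: A^T A = V Lambda V^T
    order = np.argsort(lam)[::-1]               # sort eigenpairs, largest first
    lam, V = lam[order], V[:, order]
    sigma = np.sqrt(np.clip(lam, 0.0, None))    # step 3: sigma_i = sqrt(lambda_i)
    r = int(np.sum(sigma > rel_tol * sigma[0])) # number of (numerically) positive sigma_i
    U_r = (A @ V[:, :r]) / sigma[:r]            # step 4: u_i = A v_i / sigma_i
    # Step 6: extend u_1..u_r to an orthonormal basis of R^m via QR of [U_r | I].
    Q, _ = np.linalg.qr(np.hstack([U_r, np.eye(m)]))
    U = np.hstack([U_r, Q[:, r:]])
    Sigma = np.zeros((m, n))                    # step 7: m x n diagonal matrix
    Sigma[np.arange(r), np.arange(r)] = sigma[:r]
    return U, Sigma, V

A_tall = np.array([[1.0, 2.0],
                   [2.0, 4.0],
                   [0.0, 0.0]])                 # rank 1
U, Sigma, V = svd_via_gram(A_tall)
print(np.allclose(U @ Sigma @ V.T, A_tall), np.allclose(U.T @ U, np.eye(3)))
```

The loose tolerance reflects that eigenvalues of $A^\top A$ near zero are only accurate to roughly the square root of machine precision once the square root is taken.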

Relationship Between SVD and Eigen-Decomposition

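Formal statement. If $A \in \mathbb{R}^{n \times n}$ is symmetric with eigendecomposition $A = Q\Lambda Q^\top$, eigenvalues ordered so that $|\lambda_1| \geq \cdots \geq |\lambda_n|$, then an SVD of $A$ is $A = U\Sigma V^\top$ with $V = Q$, $U = QD$, and $\Sigma = \text{diag}(|\lambda_1|, \ldots, |\lambda_n|)$, where $D = \text{diag}(d_1, \ldots, d_n)$ with $d_i = -1$ if $\lambda_i < 0$ and $d_i = 1$ otherwise. If $A$ is symmetric positive semidefinite, then $D = I$, so $U = V = Q$ and $\Sigma = \Lambda$.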

Full formal proof. 1. Let $A = Q\Lambda Q^\top$ with $Q$ orthogonal and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$, with eigenpairs ordered so that $|\lambda_1| \geq \cdots \geq |\lambda_n|$. 2. Compute $A^\top A = A^2 = Q\Lambda^2 Q^\top$, so the eigenvalues of $A^\top A$ are $\lambda_i^2$, and the eigenvectors are the columns of $Q$. 3. Therefore the singular values are $\sigma_i = \sqrt{\lambda_i^2} = |\lambda_i|$, and right singular vectors are $\mathbf{v}_i = \mathbf{q}_i$. 4. For each $i$ with $\lambda_i \neq 0$, define $\mathbf{u}_i = \frac{1}{\sigma_i} A\mathbf{v}_i = \frac{1}{|\lambda_i|} \lambda_i \mathbf{q}_i = \text{sign}(\lambda_i)\mathbf{q}_i$; for $\lambda_i = 0$, take $\mathbf{u}_i = \mathbf{q}_i$. Thus $U = QD$ with a diagonal sign matrix $D$ (entries $\pm 1$) and $V = Q$, and $U$ is orthogonal because $Q$ and $D$ are. 5. Then $U\Sigma V^\top = QD\,\text{diag}(|\lambda_i|)\,Q^\top = Q\Lambda Q^\top = A$. 6. If $A \succeq 0$, then $\lambda_i \geq 0$ and $D = I$, giving $U = V = Q$ and $\Sigma = \Lambda$.

Interpretation. SVD generalizes eigendecomposition: for symmetric matrices the singular values are the absolute eigenvalues and the singular vectors are eigenvectors up to sign, and for symmetric PSD matrices the two decompositions coincide.

Explicit ML relevance. Symmetric covariance and Hessian matrices can be analyzed either by eigenvalues or singular values; for PSD matrices, both coincide and simplify PCA and stability analysis.
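The sign matrix $D$ from the proof matters as soon as an eigenvalue is negative. The sketch below uses an assumed $2\times2$ symmetric indefinite matrix (eigenvalues $3$ and $-1$) to build $U = QD$, $\Sigma = |\Lambda|$, $V = Q$ and to show that dropping $D$ does not reconstruct $A$.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])                  # symmetric, eigenvalues 3 and -1

lam, Q = np.linalg.eigh(A)
order = np.argsort(-np.abs(lam))            # order by decreasing |lambda|
lam, Q = lam[order], Q[:, order]

D = np.diag(np.where(lam < 0, -1.0, 1.0))   # diagonal sign matrix
U, Sigma, V = Q @ D, np.diag(np.abs(lam)), Q

with_signs = np.allclose(U @ Sigma @ V.T, A)      # U = QD reconstructs A
without_signs = np.allclose(Q @ Sigma @ Q.T, A)   # U = Q does not, since A is indefinite

print(np.abs(lam), np.linalg.svd(A, compute_uv=False))
print(with_signs, without_signs)
```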

Eckart-Young-Mirsky Theorem

Formal statement. Let $A \in \mathbb{R}^{m \times n}$ with SVD $A = \sum_{i=1}^r \sigma_i \mathbf{u}_i\mathbf{v}_i^\top$ and $\sigma_1 \geq \cdots \geq \sigma_r > 0$. For any $k < r$, the matrix $A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i\mathbf{v}_i^\top$ satisfies \[ \|A - A_k\|_F = \min_{\text{rank}(X) \leq k} \|A - X\|_F, \quad \|A - A_k\|_2 = \min_{\text{rank}(X) \leq k} \|A - X\|_2. \] Moreover, $\|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2$ and $\|A - A_k\|_2 = \sigma_{k+1}$.

Full formal proof. 1. Let $X$ be any matrix with $\text{rank}(X) \leq k$, and write $\sigma_i(\cdot)$ for the $i$-th largest singular value, with $\sigma_i = 0$ for indices beyond the rank. 2. Spectral bound: $\text{null}(X)$ has dimension at least $n-k$ and $\text{span}(\mathbf{v}_1, \ldots, \mathbf{v}_{k+1})$ has dimension $k+1$, so the two subspaces share a unit vector $\mathbf{w} = \sum_{i=1}^{k+1} c_i\mathbf{v}_i$ with $\sum_i c_i^2 = 1$. Then $\|A - X\|_2^2 \geq \|(A - X)\mathbf{w}\|_2^2 = \|A\mathbf{w}\|_2^2 = \sum_{i=1}^{k+1} \sigma_i^2 c_i^2 \geq \sigma_{k+1}^2$. 3. Spectral attainment: $A - A_k = \sum_{i=k+1}^r \sigma_i\mathbf{u}_i\mathbf{v}_i^\top$ is itself an SVD with largest singular value $\sigma_{k+1}$, so $\|A - A_k\|_2 = \sigma_{k+1}$ and $A_k$ attains the spectral minimum. 4. Weyl-type inequality: for matrices $B, C$ of the same shape and indices $i, j \geq 1$, $\sigma_{i+j-1}(B + C) \leq \sigma_i(B) + \sigma_j(C)$. Indeed, with $B_{i-1}$ and $C_{j-1}$ denoting truncated SVDs (and $B_0 = C_0 = 0$), the matrix $B_{i-1} + C_{j-1}$ has rank at most $i+j-2$, so steps 2 and 3 give $\sigma_{i+j-1}(B+C) \leq \|B + C - B_{i-1} - C_{j-1}\|_2 \leq \|B - B_{i-1}\|_2 + \|C - C_{j-1}\|_2 = \sigma_i(B) + \sigma_j(C)$. 5. Frobenius bound: apply step 4 with $B = A - X$, $C = X$, and $j = k+1$. Since $\sigma_{k+1}(X) = 0$, we get $\sigma_{i+k}(A) \leq \sigma_i(A - X)$ for every $i \geq 1$, hence $\|A - X\|_F^2 = \sum_{i \geq 1} \sigma_i(A - X)^2 \geq \sum_{i \geq 1} \sigma_{i+k}(A)^2 = \sum_{i=k+1}^r \sigma_i^2$. 6. Frobenius attainment: from the SVD of $A - A_k$ in step 3, $\|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2$, so $A_k$ attains the Frobenius minimum as well.

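Interpretation. Truncated SVD is the optimal low-rank approximation simultaneously for Frobenius and spectral norms, with error determined exactly by the discarded singular values: $\sigma_{k+1}$ in the spectral norm and $\sqrt{\sum_{i>k}\sigma_i^2}$ in the Frobenius norm.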

Explicit ML relevance. This theorem justifies PCA compression, low-rank embedding truncation, and optimal matrix approximation in collaborative filtering and denoising.
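The two error identities in the theorem can be checked numerically. The sketch below assumes a random $8\times5$ test matrix and $k=2$ (both arbitrary choices), and compares the truncated SVD against one competing rank-$k$ candidate that simply keeps the first $k$ columns of $A$.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k truncation A_k and the full vector of singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :], s

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
k = 2

A_k, s = truncated_svd(A, k)
fro_err = np.linalg.norm(A - A_k, "fro")
spec_err = np.linalg.norm(A - A_k, 2)

# Competing rank-k candidate: keep the first k columns, zero the rest.
X = np.zeros_like(A)
X[:, :k] = A[:, :k]
competitor_err = np.linalg.norm(A - X, "fro")

print(fro_err, np.sqrt(np.sum(s[k:] ** 2)))   # the two Frobenius values agree
print(spec_err, s[k])                         # spectral error equals sigma_{k+1}
print(competitor_err >= fro_err)
```

A single competitor does not prove optimality, of course; it only illustrates the inequality that the theorem guarantees for every matrix of rank at most $k$.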

Rank-Revealing Properties of SVD

Formal statement. If $A = U\Sigma V^\top$ with $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_p)$, then $\text{rank}(A)$ equals the number of nonzero singular values. Moreover, the column space of $A$ is $\text{span}(\mathbf{u}_1, \ldots, \mathbf{u}_r)$ and the row space is $\text{span}(\mathbf{v}_1, \ldots, \mathbf{v}_r)$, where $r = \text{rank}(A)$.

Full formal proof. 1. Since $A = U\Sigma V^\top$ with $U, V$ orthogonal, $\text{rank}(A) = \text{rank}(\Sigma)$ (rank is invariant under multiplication by invertible matrices). 2. $\text{rank}(\Sigma)$ equals the number of nonzero diagonal entries of $\Sigma$, hence the number of nonzero singular values. 3. The columns of $A$ lie in the span of $U$ columns weighted by $\Sigma$: $A = U\Sigma V^\top$ implies $\text{range}(A) = \text{range}(U\Sigma)$. 4. Since $U$ is orthogonal and $\Sigma$ has nonzero entries only in the first $r$ rows, $\text{range}(U\Sigma) = \text{span}(\mathbf{u}_1, \ldots, \mathbf{u}_r)$. 5. Similarly, $A^\top = V\Sigma^\top U^\top$, so $\text{range}(A^\top) = \text{span}(\mathbf{v}_1, \ldots, \mathbf{v}_r)$.

Interpretation. SVD provides orthonormal bases for the signal subspaces and exposes exact rank via nonzero singular values.

Explicit ML relevance. Rank-revealing properties justify feature subspace selection, numerical rank estimation, and pruning of low-energy components in models.
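The subspace bases in the statement can be read directly off the factors returned by a full SVD. The sketch below assumes a small rank-1 matrix and a relative tolerance of $10^{-10}$ for deciding which singular values count as nonzero; both are illustrative choices.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])       # rank 1: second row is twice the first

U, s, Vt = np.linalg.svd(A)           # full SVD: U is 2x2, Vt is 3x3
r = int(np.sum(s > 1e-10 * s[0]))     # number of numerically nonzero singular values

col_basis = U[:, :r]                  # orthonormal basis of range(A)
row_basis = Vt[:r, :].T               # orthonormal basis of range(A^T)
null_basis = Vt[r:, :].T              # orthonormal basis of null(A)
left_null_basis = U[:, r:]            # orthonormal basis of null(A^T)

print(r, null_basis.shape, left_null_basis.shape)
print(np.allclose(A @ null_basis, 0.0), np.allclose(left_null_basis.T @ A, 0.0))
```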

Norm Characterization via Singular Values

Formal statement. For any $A \in \mathbb{R}^{m \times n}$ with singular values $\sigma_1 \geq \cdots \geq \sigma_p$, \[ \|A\|_2 = \sigma_1, \quad \|A\|_F = \sqrt{\sum_{i=1}^p \sigma_i^2}, \quad \|A\|_* = \sum_{i=1}^p \sigma_i. \]

Full formal proof. 1. Let $A = U\Sigma V^\top$. For any unit vector $\mathbf{x}$, $\|A\mathbf{x}\|_2 = \|U\Sigma V^\top \mathbf{x}\|_2 = \|\Sigma \mathbf{y}\|_2$ with $\mathbf{y} = V^\top \mathbf{x}$ and $\|\mathbf{y}\|_2 = 1$. 2. $\|\Sigma \mathbf{y}\|_2^2 = \sum_{i=1}^p \sigma_i^2 y_i^2 \leq \sigma_1^2 \sum_{i=1}^p y_i^2 \leq \sigma_1^2$, where the last step uses $\sum_{i=1}^p y_i^2 \leq \|\mathbf{y}\|_2^2 = 1$ (the sum can be strictly smaller when $p < n$). Equality holds when $\mathbf{y} = \mathbf{e}_1$, so $\|A\|_2 = \sigma_1$. 3. The Frobenius norm is orthogonally invariant: $\|A\|_F^2 = \|\Sigma\|_F^2 = \sum_{i=1}^p \sigma_i^2$. 4. The nuclear norm is the sum of singular values by definition, $\|A\|_* = \sum_i \sigma_i$.

Interpretation. Singular values fully characterize common matrix norms, linking geometry (stretching) to algebraic size.

Explicit ML relevance. Norm characterization underlies spectral normalization, Frobenius-based losses, and nuclear-norm regularization for low-rank learning.
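The three identities can be compared against NumPy's built-in matrix norms, which accept `2`, `"fro"`, and `"nuc"` as the `ord` argument of `np.linalg.norm`. The matrix is the one from the Frobenius norm definition; the variable names are illustrative.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
s = np.linalg.svd(A, compute_uv=False)

spectral = s[0]                        # largest singular value
frobenius = np.sqrt(np.sum(s ** 2))    # root of the sum of squared singular values
nuclear = np.sum(s)                    # sum of singular values

print(spectral, np.linalg.norm(A, 2))
print(frobenius, np.linalg.norm(A, "fro"))
print(nuclear, np.linalg.norm(A, "nuc"))
```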

Condition Number and Stability

Formal statement. If $A \in \mathbb{R}^{n \times n}$ is invertible, then for any perturbations $\Delta A$ and $\Delta \mathbf{b}$ with $(A + \Delta A)(\mathbf{x} + \Delta \mathbf{x}) = \mathbf{b} + \Delta \mathbf{b}$, the relative solution error satisfies \[ \frac{\|\Delta \mathbf{x}\|_2}{\|\mathbf{x}\|_2} \leq \kappa_2(A) \left( \frac{\|\Delta A\|_2}{\|A\|_2} + \frac{\|\Delta \mathbf{b}\|_2}{\|\mathbf{b}\|_2} \right) + O(\|\Delta A\|_2^2). \]

Full formal proof. 1. From $(A + \Delta A)(\mathbf{x} + \Delta \mathbf{x}) = \mathbf{b} + \Delta \mathbf{b}$ and $A\mathbf{x} = \mathbf{b}$, expand and cancel to obtain $A\Delta \mathbf{x} + \Delta A\mathbf{x} + \Delta A\Delta \mathbf{x} = \Delta \mathbf{b}$. 2. Rearrange to $A\Delta \mathbf{x} = \Delta \mathbf{b} - \Delta A\mathbf{x} - \Delta A\Delta \mathbf{x}$. 3. Multiply by $A^{-1}$: $\Delta \mathbf{x} = A^{-1}\Delta \mathbf{b} - A^{-1}\Delta A\mathbf{x} - A^{-1}\Delta A\Delta \mathbf{x}$. 4. Take norms and use submultiplicativity: $\|\Delta \mathbf{x}\|_2 \leq \|A^{-1}\|_2 \|\Delta \mathbf{b}\|_2 + \|A^{-1}\|_2 \|\Delta A\|_2 \|\mathbf{x}\|_2 + \|A^{-1}\|_2 \|\Delta A\|_2 \|\Delta \mathbf{x}\|_2$. 5. For sufficiently small $\|\Delta A\|_2$ (so that $\|A^{-1}\|_2\|\Delta A\|_2 < 1$), move the last term to the left and ignore second-order terms, yielding $\|\Delta \mathbf{x}\|_2 \leq \|A^{-1}\|_2 \|\Delta \mathbf{b}\|_2 + \|A^{-1}\|_2 \|\Delta A\|_2 \|\mathbf{x}\|_2 + O(\|\Delta A\|_2^2)$. 6. Divide by $\|\mathbf{x}\|_2$ and use $\|\mathbf{b}\|_2 = \|A\mathbf{x}\|_2 \leq \|A\|_2 \|\mathbf{x}\|_2$, that is, $1/\|\mathbf{x}\|_2 \leq \|A\|_2/\|\mathbf{b}\|_2$. Then $\|A^{-1}\|_2\|\Delta \mathbf{b}\|_2/\|\mathbf{x}\|_2 \leq \kappa_2(A)\,\|\Delta \mathbf{b}\|_2/\|\mathbf{b}\|_2$ and $\|A^{-1}\|_2\|\Delta A\|_2 = \kappa_2(A)\,\|\Delta A\|_2/\|A\|_2$, which gives the stated bound with $\kappa_2(A) = \|A\|_2\|A^{-1}\|_2$.

Interpretation. The condition number quantifies how perturbations in data or model parameters amplify into solution error.

Explicit ML relevance. Ill-conditioned design matrices slow optimization and amplify noise in regression; conditioning guides preconditioning and regularization choices.
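The perturbation bound can be observed on a small system. The sketch below is illustrative only: the singular values, the perturbation scale of $10^{-8}$, and the seed are assumptions chosen so that $\kappa_2(A) = 100$ and the second-order term is negligible.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# Build A with prescribed singular values so that kappa_2(A) = 10 / 0.1 = 100.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = np.array([10.0, 5.0, 2.0, 1.0, 0.1])
A = U @ np.diag(sigma) @ V.T

x = rng.standard_normal(n)
b = A @ x

# Small perturbations of the matrix and of the right-hand side.
dA = 1e-8 * rng.standard_normal((n, n))
db = 1e-8 * rng.standard_normal(n)
x_pert = np.linalg.solve(A + dA, b + db)

rel_err = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
kappa = np.linalg.cond(A, 2)
bound = kappa * (np.linalg.norm(dA, 2) / np.linalg.norm(A, 2)
                 + np.linalg.norm(db) / np.linalg.norm(b))

print(f"kappa_2(A)        = {kappa:.1f}")
print(f"relative error    = {rel_err:.2e}")
print(f"first-order bound = {bound:.2e}")
```

The observed error is typically well below the bound, because the bound describes the worst-case alignment of the perturbations with the weakest singular direction.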

Spectral Norm Equals Largest Singular Value

Formal statement. For any matrix $A \in \mathbb{R}^{m \times n}$, $\|A\|_2 = \sigma_1(A)$.

Full formal proof. 1. Let $A = U\Sigma V^\top$. For any $\mathbf{x}$ with $\|\mathbf{x}\|_2 = 1$, set $\mathbf{y} = V^\top \mathbf{x}$ so $\|\mathbf{y}\|_2 = 1$. 2. Then $\|A\mathbf{x}\|_2 = \|U\Sigma \mathbf{y}\|_2 = \|\Sigma \mathbf{y}\|_2$. 3. Since $\Sigma$ is diagonal, $\|\Sigma \mathbf{y}\|_2^2 = \sum_{i=1}^p \sigma_i^2 y_i^2 \leq \sigma_1^2 \sum_{i=1}^p y_i^2 = \sigma_1^2$. 4. Equality holds for $\mathbf{y} = \mathbf{e}_1$, hence $\|A\|_2 = \sigma_1$.

Interpretation. The maximum stretching factor of a linear map equals its largest singular value.

Explicit ML relevance. This identity enables efficient Lipschitz estimation and spectral regularization for stable training.

Low-Rank Approximation Minimizes Frobenius Error

Formal statement. For $A \in \mathbb{R}^{m \times n}$ with SVD $A = \sum_{i=1}^r \sigma_i \mathbf{u}_i\mathbf{v}_i^\top$, the rank-$k$ matrix $A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i\mathbf{v}_i^\top$ minimizes $\|A - X\|_F$ over all rank-$k$ matrices $X$, and the error is $\|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2$.

Full formal proof. 1. The Frobenius-optimality is the Frobenius part of the Eckart-Young-Mirsky theorem already proven above. 2. From that proof, for any rank-$k$ $X$, $\|A - X\|_F \geq \|A - A_k\|_F$, and equality holds only for $X = A_k$ (up to rotations in the $k$-dimensional singular subspace when $\sigma_k = \sigma_{k+1}$).

Interpretation. Truncated SVD minimizes reconstruction error in Frobenius norm, and the minimizer is unique whenever $\sigma_k > \sigma_{k+1}$.

Explicit ML relevance. This theorem directly justifies PCA reconstruction, low-rank denoising, and compression of embeddings and weight matrices.

Worked Examples

Example 1 — SVD of a 2×2 Matrix

Consider the matrix $A = \begin{bmatrix}3 & 1\\0 & 2\end{bmatrix}$. The setup is small enough to compute SVD directly: we form $A^\top A = \begin{bmatrix}9 & 3\\3 & 5\end{bmatrix}$, whose eigenvalues are $\lambda_{1,2} = 7 \pm \sqrt{13}$. The singular values are $\sigma_{1,2} = \sqrt{7 \pm \sqrt{13}} \approx 3.26,\ 1.84$, and the right singular vectors are the normalized eigenvectors of $A^\top A$. The reasoning emphasizes the SVD pipeline: compute $A^\top A$, find $V$, compute $\Sigma$, and recover $U$ via $\mathbf{u}_i = A\mathbf{v}_i/\sigma_i$. The interpretation is geometric: $A$ rotates the input basis ($V^\top$), stretches by $\sigma_1, \sigma_2$, then rotates by $U$. A common misconception is that the singular values are just the diagonal entries of $A$; they are not, unless $A$ is already diagonal with nonnegative entries. A what-if scenario: if the off-diagonal $1$ were zero, the singular values would be exactly $3$ and $2$; the coupling reorients the principal directions and pushes the singular values apart, raising $\sigma_1$ above $3$ and lowering $\sigma_2$ below $2$ while their product stays at $|\det A| = 6$. In ML, this calculation mirrors what happens to a two-feature linear model: the singular values quantify how the model amplifies input variance along two orthogonal directions, guiding feature scaling and regularization.
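The pipeline in this example can be reproduced in a few lines. This is a sketch of the hand computation, not a recommended algorithm: forming $A^\top A$ explicitly is fine for a $2 \times 2$ matrix but is generally avoided in numerical practice.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

# Route 1: eigen-decomposition of A^T A, as in the worked example.
evals, V = np.linalg.eigh(A.T @ A)       # eigh returns ascending eigenvalues
evals, V = evals[::-1], V[:, ::-1]       # reorder to descending
sigma = np.sqrt(evals)
U = (A @ V) / sigma                      # column i is A v_i / sigma_i

# Route 2: library SVD, for comparison.
s_lib = np.linalg.svd(A, compute_uv=False)

closed_form = np.sqrt(np.array([7 + np.sqrt(13), 7 - np.sqrt(13)]))
print(sigma, s_lib, closed_form)
print(np.allclose(U @ np.diag(sigma) @ V.T, A))   # reconstruction check
```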

Example 2 — Geometric Interpretation of Singular Values

Let $A \in \mathbb{R}^{2 \times 2}$ map the unit circle to an ellipse. Suppose SVD yields $\sigma_1 = 5$, $\sigma_2 = 1$, with singular vectors $\mathbf{u}_i, \mathbf{v}_i$. The setup asks: what does $A$ do to geometry? The reasoning is direct: any unit vector $\mathbf{x}$ can be written in the $\{\mathbf{v}_1, \mathbf{v}_2\}$ basis, and $A$ scales those coordinates to lengths between 1 and 5 depending on direction. The interpretation is that the image of the unit circle is an ellipse with semi-axes 5 and 1, oriented by $U$. A common misconception is to think singular values coincide with eigenvalues; for non-symmetric $A$, eigenvectors do not describe stretching directions, while singular vectors do. A what-if scenario: if $\sigma_1$ and $\sigma_2$ were equal, the ellipse would be a circle, meaning $A$ is a scaled orthogonal map (a rotation or reflection times $\sigma_1$) and is perfectly conditioned ($\kappa = 1$). In ML, this example underpins why data normalization matters: a linear layer with $\sigma_1/\sigma_2 = 5$ amplifies some features five times more than others, creating anisotropy that can slow optimization and distort learned representations.
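A quick numerical picture of this example: the sketch below builds one matrix with singular values 5 and 1 and measures the lengths of the images of unit vectors. The rotation angles are arbitrary assumptions; any other choice gives the same range of lengths.

```python
import numpy as np

def rotation(theta):
    """2 x 2 rotation matrix by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

U, V = rotation(0.7), rotation(-0.3)      # arbitrary orientations
A = U @ np.diag([5.0, 1.0]) @ V.T

# Sample the unit circle and map it through A.
angles = np.linspace(0.0, 2.0 * np.pi, 2001)
circle = np.vstack([np.cos(angles), np.sin(angles)])   # shape (2, N)
lengths = np.linalg.norm(A @ circle, axis=0)

print(lengths.min(), lengths.max())       # close to 1 and 5
print(np.abs(np.linalg.eigvals(A)))       # eigenvalue moduli, generally not (5, 1)
```

The extreme lengths are attained at $\mathbf{x} = \pm\mathbf{v}_1$ and $\mathbf{x} = \pm\mathbf{v}_2$, whose images are $\pm 5\mathbf{u}_1$ and $\pm\mathbf{u}_2$.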

Example 3 — Low-Rank Approximation in Practice

Suppose $A \in \mathbb{R}^{100 \times 80}$ represents centered image patches. The setup is to approximate $A$ by rank $k=10$. The reasoning uses the truncated SVD: compute $A = U\Sigma V^\top$ and set $A_{10} = U_{10}\Sigma_{10}V_{10}^\top$. The interpretation is compression: each patch is reconstructed using only 10 basis patterns, giving the best possible reconstruction error in Frobenius norm. A common misconception is that any 10 basis vectors (e.g., random or hand-crafted) are equally good; in fact, the SVD basis is optimal. A what-if scenario: if the singular values decay slowly (flat spectrum), the rank-10 approximation will be poor, and using more components is necessary. In ML, this is precisely the PCA pipeline used to reduce dimensionality before classification, where reconstruction error is a proxy for retained information.
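To make the comparison with an arbitrary basis concrete, here is a hedged sketch. The data are synthetic (a rank-10 signal plus small noise standing in for image patches), and the competing basis is a random orthonormal set; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 80, 10

# Synthetic stand-in for centered patches: low-rank signal plus noise.
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
A += 0.1 * rng.standard_normal((m, n))
A -= A.mean(axis=0)                       # center each column

# Rank-k truncated SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k]
err_svd = np.linalg.norm(A - A_k, "fro")

# Competing basis: k random orthonormal directions in feature space.
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
err_random = np.linalg.norm(A - A @ Q @ Q.T, "fro")

print(f"SVD basis error:    {err_svd:.3f}")
print(f"random basis error: {err_random:.3f}")
```

Because $AQQ^\top$ also has rank at most 10, the Eckart-Young-Mirsky theorem guarantees its error cannot be smaller than that of $A_{10}$.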

Example 4 — Truncated SVD for Compression

Consider a $512 \times 512$ grayscale image stored as a matrix $A$. The setup is to compress the image by keeping $k=50$ singular values. The reasoning is straightforward: compute the SVD and reconstruct $A_{50}$. Storage drops from $512^2 = 262{,}144$ entries to $50(512+512+1) = 51{,}250$, roughly a 5× reduction. The interpretation is that most visual information sits in the top singular modes; fine texture and noise are relegated to smaller singular values. A common misconception is that compression always reduces fidelity uniformly; in reality, high-frequency details are removed first, which might be desirable for denoising. A what-if scenario: if $k$ is increased to 100, the image may become perceptually identical to the original; if $k$ is reduced to 10, edges blur and artifacts appear. In ML, this directly parallels compressing convolutional filters or embedding matrices, where truncated SVD reduces parameters while preserving most predictive signal.
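The storage count generalizes to any size and rank. The helper below is a hypothetical name introduced for illustration; it counts the numbers stored for $U_k$, the $k$ singular values, and $V_k$.

```python
def truncated_svd_storage(m, n, k):
    """Numbers stored for U_k (m*k), k singular values, and V_k (n*k)."""
    return k * (m + n + 1)

m = n = 512
full_storage = m * n
for k in (10, 50, 100):
    stored = truncated_svd_storage(m, n, k)
    print(f"k={k:3d}: {stored:7d} numbers, ratio {full_storage / stored:.2f}x")

# Largest rank for which the factored form is still smaller than the full matrix.
break_even_k = (full_storage - 1) // (m + n + 1)
print("factored storage pays off only for k <=", break_even_k)
```

The break-even rank is a useful sanity check: beyond it, storing the factors costs more than storing the image itself.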

Example 5 — Frobenius Error and Approximation

Take $A$ with singular values $(9, 4, 1, 0.5)$. The setup asks for the Frobenius error of a rank-2 approximation. The reasoning uses $\|A - A_2\|_F^2 = 1^2 + 0.5^2 = 1.25$, so $\|A - A_2\|_F = \sqrt{1.25} \approx 1.12$. The interpretation is that the error depends only on the discarded singular values; directions not kept contribute their squared energy to the loss. A common misconception is that removing a component of size 1 always matters little; in aggregate, many small singular values can dominate error. A what-if scenario: if the tail were $(3, 3)$ instead of $(1, 0.5)$, the same rank-2 truncation would discard $3^2 + 3^2 = 18$ units of squared energy instead of $1.25$, a far larger loss. In ML, this emphasizes why choosing $k$ requires inspecting the singular value spectrum and often validating downstream performance, not just raw reconstruction error.
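The error formula can be verified on an explicit matrix. The sketch below constructs one matrix with the stated spectrum; the $6 \times 4$ shape and the random orthonormal factors are assumptions, since the example fixes only the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
s_true = np.array([9.0, 4.0, 1.0, 0.5])

# Orthonormal factors from QR of random matrices (arbitrary choice).
Q1, _ = np.linalg.qr(rng.standard_normal((6, 4)))   # 6 x 4, orthonormal columns
Q2, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # 4 x 4, orthogonal
A = Q1 @ np.diag(s_true) @ Q2.T

# Rank-2 truncation and its squared Frobenius error.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A2 = (U[:, :2] * s[:2]) @ Vt[:2]
err_sq = np.linalg.norm(A - A2, "fro") ** 2

print(err_sq)                                  # matches 1^2 + 0.5^2
print(err_sq / np.sum(s**2))                   # fraction of energy discarded
```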

Example 6 — Spectral Norm vs Frobenius Norm

Let $A$ have singular values $(10, 1, 1, 1)$. The setup compares $\|A\|_2$ and $\|A\|_F$. The reasoning: $\|A\|_2 = 10$ while $\|A\|_F = \sqrt{10^2 + 1^2 + 1^2 + 1^2} = \sqrt{103} \approx 10.15$. The interpretation is that the spectral norm measures the worst-case amplification, while the Frobenius norm aggregates total energy. A common misconception is that these norms are interchangeable; they are not, even though they nearly agree for matrices with one dominant singular value. A what-if scenario: if the spectrum were flat $(5,5,5,5)$, then $\|A\|_2 = 5$ and $\|A\|_F = 10$, so the Frobenius norm is twice the spectral norm (the largest possible ratio, $\sqrt{4}$, for a rank-4 matrix), reflecting energy spread evenly across directions. In ML, spectral norm controls Lipschitz stability and adversarial robustness, while Frobenius norm is often used as a regularizer and loss metric for reconstruction.
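Both spectra in this example can be compared directly from the singular values. The function name below is hypothetical; it simply applies $\|A\|_2 = \sigma_1$ and $\|A\|_F = \sqrt{\sum_i \sigma_i^2}$.

```python
import numpy as np

def norms_from_spectrum(singular_values):
    """Return (spectral norm, Frobenius norm) implied by a list of singular values."""
    s = np.asarray(singular_values, dtype=float)
    return float(s.max()), float(np.sqrt(np.sum(s**2)))

for spectrum in [(10, 1, 1, 1), (5, 5, 5, 5)]:
    spec, fro = norms_from_spectrum(spectrum)
    print(spectrum, "spectral:", spec, "Frobenius:", round(fro, 3),
          "ratio:", round(fro / spec, 3))
```

The ratio $\|A\|_F / \|A\|_2$ always lies between $1$ and $\sqrt{\operatorname{rank}(A)}$, so it acts as a rough indicator of how evenly energy is spread over directions.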

Example 7 — PCA via SVD

Suppose $X \in \mathbb{R}^{1000 \times 50}$ is centered data with 1000 samples and 50 features. The setup is to compute the first 5 principal components. The reasoning is to perform SVD: $X = U\Sigma V^\top$. The first 5 columns of $V$ are the principal directions; the projected data is $Z = XV_5$. The interpretation is that PCA is simply the SVD of the data matrix, not of its covariance, and the variance explained by component $i$ is $\sigma_i^2 / (n-1)$. A common misconception is that PCA requires eigendecomposition of $X^\top X$; that route gives the same directions but is numerically less stable because it squares the condition number. A what-if scenario: if the data is not centered, the first component tends to align with the mean vector rather than with the variation around it, distorting interpretation. In ML, PCA via SVD is a standard preprocessing step for noise reduction and feature compression before applying classifiers or clustering.
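The following sketch carries out these steps on synthetic data (the data-generating recipe and the seed are assumptions) and cross-checks the explained variances against the eigenvalues of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 50, 5

# Synthetic correlated features, then centering (required for PCA).
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                      # d x k principal directions
Z = X @ V_k                         # n x k projected data
explained_var = s**2 / (n - 1)      # variance along each principal direction

# Cross-check with the covariance route (np.cov also divides by n - 1).
cov_evals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

print(Z.shape)
print(np.allclose(explained_var, cov_evals))
print(np.allclose(Z.var(axis=0, ddof=1), explained_var[:k]))
```

Note that $Z = XV_k = U_k\Sigma_k$, so the projection can be read off the SVD without an extra matrix product.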

Example 8 — Rank Deficiency and Noise

Consider $A = A_0 + E$, where $A_0$ has rank 3 and $E$ is small Gaussian noise. The setup is to estimate the rank of $A$ from its singular values. The reasoning uses a tolerance $\tau = \epsilon \sigma_1$ to define numerical rank. If the singular values look like $(12, 7, 3, 0.04, 0.03, \ldots)$, then $r_\tau \approx 3$. The interpretation is that noise inflates small singular values from zero to small but nonzero, so algebraic rank is full but numerical rank captures the effective signal dimension. A common misconception is to treat any nonzero singular value as significant, which overestimates complexity. A what-if scenario: if the noise level increases, the spectral gap shrinks and rank estimation becomes ambiguous, suggesting the need for model selection via cross-validation. In ML, this is central in deciding how many components to keep in PCA and how many latent factors to use in matrix factorization.
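A minimal sketch of the tolerance rule follows. The matrix sizes, the noise scale $10^{-3}$, and the relative tolerance $\epsilon = 10^{-2}$ are assumptions picked so that the spectral gap is clearly visible.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 60, 40, 3

A0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-3 signal
E = 1e-3 * rng.standard_normal((m, n))                           # small Gaussian noise
A = A0 + E

s = np.linalg.svd(A, compute_uv=False)

eps = 1e-2                        # relative tolerance (an assumption)
tau = eps * s[0]
numerical_rank = int(np.sum(s > tau))

# matrix_rank uses a tolerance near machine precision by default,
# so it reports the algebraic rank of the noisy matrix.
algebraic_rank = int(np.linalg.matrix_rank(A))

print("leading singular values:", np.round(s[:5], 4))
print("numerical rank:", numerical_rank, "algebraic rank:", algebraic_rank)
```

Tying $\tau$ to an estimate of the noise level, rather than to a fixed $\epsilon$, is often the more defensible choice when the noise scale is known.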

Example 9 — Ill-Conditioned Systems

Let $A \in \mathbb{R}^{2 \times 2}$ have singular values $\sigma_1 = 100$ and $\sigma_2 = 0.01$. The setup is solving $A\mathbf{x} = \mathbf{b}$. The reasoning: the condition number $\kappa = 10^4$, so a relative error of $10^{-6}$ in $\mathbf{b}$ can lead to a relative error of $10^{-2}$ in $\mathbf{x}$. The interpretation is that the solution is unstable in the direction of the small singular value; tiny perturbations are amplified. A common misconception is that numerical instability is caused by large singular values; the real culprit is very small ones. A what-if scenario: adding ridge regularization (replace $A^\top A$ with $A^\top A + \lambda I$) shifts the eigenvalues of the normal-equations matrix from $\sigma_i^2$ to $\sigma_i^2 + \lambda$, lifting the smallest one away from zero and improving stability. In ML, ill-conditioning explains why least squares and linear regression can be noisy, motivating regularization and feature scaling.
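The worst case described here can be reproduced exactly by aligning $\mathbf{b}$ with $\mathbf{u}_1$ and the perturbation with $\mathbf{u}_2$. In the sketch below the rotation angles and the ridge parameter $\lambda = 1$ are arbitrary assumptions.

```python
import numpy as np

def rotation(theta):
    """2 x 2 rotation matrix by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

U, V = rotation(0.4), rotation(1.1)          # arbitrary orientations
A = U @ np.diag([100.0, 0.01]) @ V.T
kappa = np.linalg.cond(A, 2)

# Worst case: b along u_1, perturbation along u_2.
b = U[:, 0]
db = 1e-6 * U[:, 1]
x = np.linalg.solve(A, b)
x_pert = np.linalg.solve(A, b + db)

rel_in = np.linalg.norm(db) / np.linalg.norm(b)
rel_out = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
print(f"kappa = {kappa:.3e}, input error = {rel_in:.1e}, output error = {rel_out:.3e}")

# Ridge shifts the eigenvalues of A^T A by lam, improving its conditioning.
lam = 1.0
cond_normal = np.linalg.cond(A.T @ A)
cond_ridge = np.linalg.cond(A.T @ A + lam * np.eye(2))
print(f"cond(A^T A) = {cond_normal:.3e}, cond(A^T A + lam I) = {cond_ridge:.3e}")
```

Ridge buys stability at the price of bias: the regularized solution no longer solves the original system exactly.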

Example 10 — Neural Network Weight Matrix Compression

Suppose a fully connected layer has weight matrix $W \in \mathbb{R}^{1024 \times 1024}$. The setup is to compress it with rank $k=64$. The reasoning is to compute the SVD $W = U\Sigma V^\top$ and approximate $W \approx U_{64}\Sigma_{64}V_{64}^\top$. This can be implemented as two layers: $W_1 = U_{64}\Sigma_{64}^{1/2}$, $W_2 = \Sigma_{64}^{1/2}V_{64}^\top$, so $W \approx W_1 W_2$. The interpretation is parameter reduction from $1024^2$ to $64(1024+1024)$, an 8× compression. A common misconception is that compression always improves inference speed; on GPUs, two smaller matrix multiplications can sometimes be slower due to kernel launch overhead. A what-if scenario: if the singular value spectrum is flat, compression hurts accuracy; if it decays rapidly, accuracy remains near original. In ML, this is a standard post-training compression step for deploying large models on edge devices.
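The two-layer factorization can be sketched as follows. The helper names are hypothetical, and the demonstration uses a smaller $256 \times 256$ synthetic matrix with rank 16 (an assumption, so the sketch runs quickly); the parameter count is evaluated for the $1024 \times 1024$, $k = 64$ case from the example.

```python
import numpy as np

def low_rank_factors(W, k):
    """Return W1 (m x k) and W2 (k x n) whose product is the rank-k truncated SVD of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:k])
    W1 = U[:, :k] * root              # U_k Sigma_k^{1/2}
    W2 = root[:, None] * Vt[:k]       # Sigma_k^{1/2} V_k^T
    return W1, W2

def compression_ratio(m, n, k):
    """Dense parameter count divided by the factored parameter count."""
    return (m * n) / (k * (m + n))

rng = np.random.default_rng(0)
m = n = 256
k = 16
# Synthetic weight matrix with a rapidly decaying spectrum: rank-16 signal plus noise.
W = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
W += 0.01 * rng.standard_normal((m, n))

W1, W2 = low_rank_factors(W, k)
x = rng.standard_normal(n)
y_full = W @ x
y_low = W1 @ (W2 @ x)                 # two small products instead of one large one
rel_err = np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full)

print("relative output error:", rel_err)
print("compression for 1024 x 1024, k = 64:", compression_ratio(1024, 1024, 64))
```

Splitting $\Sigma_k$ symmetrically between the two factors keeps their scales comparable, which can matter if the factors are fine-tuned afterwards.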

Example 11 — Latent Semantic Analysis (Preview)

Consider a document-term matrix $A \in \mathbb{R}^{m \times n}$ with $m=5000$ documents and $n=20000$ terms. The setup is to build a semantic search system using rank-$k$ SVD. The reasoning: compute $A \approx U_k\Sigma_k V_k^\top$, represent documents by rows of $U_k\Sigma_k$ and terms by rows of $V_k\Sigma_k$, and answer queries by projecting query vectors into the same space. The interpretation is that SVD uncovers latent topics that smooth over synonymy and polysemy. A common misconception is that LSA “understands language”; it is purely linear and cannot capture compositional meaning or word order. A what-if scenario: if stopwords are not removed or TF-IDF weighting is not applied, the leading components focus on word frequency rather than semantics, degrading retrieval quality. In ML, this is an early example of embedding learning, foreshadowing modern word and document embeddings.
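A toy version of this retrieval setup is sketched below. The five-term vocabulary and four documents are invented purely for illustration, and the query is folded in as a pseudo-document via $\mathbf{q}^\top V_k$, which places it in the same coordinates as the rows of $U_k\Sigma_k$ (since $AV_k = U_k\Sigma_k$).

```python
import numpy as np

terms = ["car", "auto", "engine", "flower", "petal"]
# Rows are documents, columns are term counts (toy data, an assumption).
A = np.array([
    [1, 0, 1, 0, 0],   # doc 0: car, engine
    [0, 1, 1, 0, 0],   # doc 1: auto, engine
    [0, 0, 0, 1, 1],   # doc 2: flower, petal
    [0, 0, 0, 1, 0],   # doc 3: flower
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = U[:, :k] * s[:k]          # rows of U_k Sigma_k
V_k = Vt[:k].T                       # n x k term directions

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Query containing only the term "car".
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0
q_vec = q @ V_k                      # fold the query into the latent space

latent_scores = [cosine(q_vec, d) for d in doc_vecs]
raw_scores = [cosine(q, row) for row in A]
print("raw term-space scores:", np.round(raw_scores, 3))
print("latent-space scores:  ", np.round(latent_scores, 3))
```

In raw term space, document 1 scores zero because it never mentions "car"; in the rank-2 latent space it scores as high as document 0, since "car" and "auto" both co-occur with "engine".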

Example 12 — Implicit Regularization via Rank Constraints

Suppose we solve $\min_W \|XW - Y\|_F^2$ with gradient descent, where $X \in \mathbb{R}^{n \times d}$, $Y \in \mathbb{R}^{n \times c}$, and the model is overparameterized. The setup is that many solutions fit the data exactly. The reasoning is that gradient descent from small initialization converges to the minimum-norm solution, which often has low effective rank, because growth in singular values is biased toward dominant modes. The interpretation is that optimization itself acts as a rank-regularizer, even without explicit penalties. A common misconception is that overparameterization always implies overfitting; in linear models, implicit bias can actually improve generalization. A what-if scenario: if optimization uses adaptive methods with large steps or aggressive momentum, the implicit bias can change, yielding different rank profiles and generalization behavior. In ML, this connects to why large neural networks often generalize well and why low-rank adapters can capture task-specific updates efficiently.
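The minimum-norm claim can be checked in the simplest setting. The sketch below uses zero initialization (the limiting case of small initialization) and plain gradient descent on $\tfrac{1}{2}\|XW - Y\|_F^2$; the sizes, step size, and iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 20, 50, 3                  # d > n: more parameters than samples
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, c))

# Gradient descent from zero on 0.5 * ||XW - Y||_F^2.
W = np.zeros((d, c))
step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / sigma_1^2 keeps the iteration stable
for _ in range(5000):
    W -= step * X.T @ (X @ W - Y)

# Minimum-norm interpolating solution from the pseudoinverse.
W_min_norm = np.linalg.pinv(X) @ Y

# Another interpolating solution: add a component from the null space of X.
null_part = (np.eye(d) - np.linalg.pinv(X) @ X) @ rng.standard_normal((d, c))
W_other = W_min_norm + null_part

print(np.allclose(W, W_min_norm, atol=1e-8))
print(np.linalg.norm(W), np.linalg.norm(W_other))
```

The iterates never leave the row space of $X$, because every gradient has the form $X^\top(\cdot)$; that is why the limit is the minimum-norm interpolant rather than some other exact fit.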

Summary

Key Ideas Consolidated

Singular value decomposition extends eigenvalue ideas to all matrices by factoring any $A \in \mathbb{R}^{m \times n}$ into rotations and scalings $A = U\Sigma V^\top$. The singular values $\sigma_i$ quantify the geometric action of $A$: how much the map stretches or compresses along orthogonal directions. This yields a complete description of matrix norms, condition numbers, and the four fundamental subspaces. SVD is optimal for approximation: the truncated SVD provides a best rank-$k$ approximation in both Frobenius and spectral norms (unique in Frobenius norm when $\sigma_k > \sigma_{k+1}$), with explicit error given by discarded singular values. Rank is not just a combinatorial property but a measure of effective complexity: numerical rank, determined by a tolerance on singular values, captures the intrinsic dimension of data and models in the presence of noise. This perspective explains why low-rank methods are robust in practice and why spectral decay is a diagnostic for whether compression will succeed. Finally, the chapter establishes a bridge between algebraic structure and machine learning practice: PCA is simply SVD of centered data, collaborative filtering is low-rank factorization of partially observed matrices, and neural network compression is truncated SVD on weight matrices. Spectral normalization and nuclear norm regularization are direct manipulations of singular values; they are not separate tricks but consequences of the same decomposition.

What the Reader Should Now Be Able To Do

Theoretical Competencies:

Compute and decompose singular values: Derive singular values from $A^\top A$, recover left and right singular vectors, construct truncated SVD approximations, and explain why $\|A\|_2 = \sigma_1$ and $\|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2$.
Apply the Eckart-Young-Mirsky theorem: Prove that truncated SVD is the optimal rank-$k$ approximation in Frobenius and spectral norms, and use this to bound approximation errors in data compression and noise filtering.
Analyze matrix norms and condition numbers: Interpret the spectral norm, Frobenius norm, and nuclear norm geometrically, understand how singular values govern condition numbers, and predict how small singular values create ill-conditioning.
Characterize the four fundamental subspaces: Identify the column space, null space, row space, and left null space of a matrix using SVD, understand their orthogonal relationships, and predict solution properties of linear systems from these subspaces.
Diagnose when low-rank structure is present: Interpret singular value spectra, distinguish algebraic rank from numerical rank, select rank thresholds based on scale and noise, and decide how many components to retain in practical applications.

Practical Competencies:

Implement PCA and dimensionality reduction: Center data, compute SVD on the centered matrix, select principal components, and apply PCA for data visualization and noise reduction while controlling reconstruction error.
Apply low-rank factorization to collaborative filtering: Use SVD or similar factorizations to complete partially observed rating matrices, interpret latent factors, and make recommendations from the decomposed structure.
Compress neural network models: Apply truncated SVD to weight matrices to reduce parameters and memory, estimate compression error, and verify that the compressed model maintains acceptable performance on downstream tasks.
Use condition numbers to reason about numerical stability: Compute condition numbers of data matrices, understand how ill-conditioning propagates errors in gradient-based methods, and apply regularization or preconditioning to improve stability.
Design model compression pipelines: Combine SVD with other techniques (quantization, pruning), establish trade-offs between compression ratio and accuracy, and verify compression effects across multiple benchmarks.

Structural Assumptions for Later Chapters

Assumptions from Earlier Chapters (Prerequisite Knowledge):

Chapters 2–6 (Linear Algebra Foundations): Eigenvalue decomposition, orthogonal matrices, properties of symmetric matrices, matrix norms, and the relationship between matrix properties and geometric transformations.
Gaussian elimination, matrix operations, and solving linear systems via computational methods.

Structural Assumptions Made in This Chapter:

All matrices are real-valued and finite-dimensional; complex-valued matrices and infinite-dimensional operators are excluded.
SVD is treated as the canonical matrix decomposition; algorithms for computing SVD (Golub-Kahan bidiagonalization, Lanczos, randomized methods) are tools but not central to theory development.
Numerical rank is determined by singular value thresholding, not by rank-revealing decompositions (QR with column pivoting) or other direct rank estimation methods.

Assumptions for Later Chapters (Forward Requirements):

Chapter 8 (Quadratic Forms): Hessian matrices of quadratic functions on data have spectral decompositions revealing curvature; for symmetric matrices, singular values coincide with the absolute values of the eigenvalues.
Chapter 9 (Optimization): Condition numbers (derived from singular values) govern convergence rates; preconditioning reshapes the spectrum to improve conditioning.
Chapters 10–11 (Adaptive Methods, Implicit Bias): Effective rank of data and models determines which representations optimize quickly; low-rank structure emerges implicitly from gradient descent.
Chapter 13 (Scaling Laws): Scaling behavior depends on effective rank; intrinsic dimensionality (captured by singular value decay) determines how quickly models learn with more parameters or data.
Chapter 14 (Governance): Objective misspecification can manifest as ranking the wrong singular directions higher; nuclear norm regularization enforces low-rank solutions for controlled bias-variance trade-offs.

Limitations and Caveats Acknowledged:

The optimality of truncated SVD established here covers the Frobenius and spectral norms (it extends to other unitarily invariant norms); error measures outside that family, such as entrywise or weighted norms, may be optimized by different decompositions.
Choosing a truncation rank works best when the singular values show a clear gap; the values are always sorted in non-increasing order, but when several cluster near the threshold the cutoff is ambiguous and requires problem-specific judgment.
Numerical rank is scale-dependent and threshold-dependent; choosing the threshold requires understanding noise level and application requirements; no universal rule exists.
SVD is expensive for very large matrices ($O(mn \min(m,n))$ for full decomposition); randomized methods provide approximations with different error guarantees that must be understood for each use case.

End-of-Chapter Advanced Exercises

A. True / False (20)

A.1. For any data matrix $X$ used in PCA, the top $k$ right singular vectors of $X$ are unchanged if the rows of $X$ are reweighted by any diagonal matrix with positive entries, provided $X$ is re-centered after reweighting.

A.2. In a linear regression problem $\min_W \|XW - Y\|_F^2$, replacing $X$ with $XR$ where $R$ is orthogonal leaves the condition number of the normal equations unchanged but can change the numerical rank of $X$.

A.3. If a neural network layer has weight matrix $W$ with singular values $\sigma_1 \geq \cdots \geq \sigma_r$, then the Frobenius norm of the layer output over unit-norm inputs is exactly $\sqrt{\sum_{i=1}^r \sigma_i^2}$.

A.4. For any matrix $A$, the truncated SVD $A_k$ is the unique minimizer of $\|A - X\|_2$ over all rank-$k$ matrices $X$, even when $\sigma_k = \sigma_{k+1}$.

A.5. In collaborative filtering with missing entries, minimizing $\|P_\Omega(A - UV^\top)\|_F^2$ with $U \in \mathbb{R}^{m \times k}$, $V \in \mathbb{R}^{n \times k}$ always yields the same solution as truncating the SVD of the fully observed matrix $A$ after imputing missing values with zeros.

A.6. If $A$ has numerical rank $r_\tau$ at tolerance $\tau$, then any rank-$r_\tau$ factorization $A \approx BC$ with $\|A - BC\|_F \leq \tau$ implies $\|A - A_{r_\tau}\|_F \leq \tau$.

A.7. For a dataset $X$ with mean-centered rows, the reconstruction error of rank-$k$ PCA equals $\sum_{i=k+1}^r \sigma_i^2$ regardless of the scaling of features.

A.8. In a deep network, spectral normalization that fixes $\sigma_1(W) = 1$ for each layer guarantees that the entire network is 1-Lipschitz under composition.

A.9. If $A$ is ill-conditioned with $\kappa_2(A) \gg 1$, then ridge regularization $(A^\top A + \lambda I)^{-1}$ strictly decreases the spectral condition number for any $\lambda > 0$.

A.10. The best rank-$k$ approximation in Frobenius norm is invariant under right-multiplication by any orthogonal matrix but not necessarily under left-multiplication by an orthogonal matrix.

A.11. If the singular value spectrum of a weight matrix is flat, then low-rank compression of that layer is guaranteed to cause a proportional drop in test accuracy for any classification task.

A.12. For a linear autoencoder with tied weights and mean-squared reconstruction loss, any global minimizer spans the same subspace as the top $k$ right singular vectors of the data matrix.

A.13. For any matrix $A$, $\|A\|_F^2 = \sum_i \sigma_i^2$ implies $\|A\|_F \geq \|A\|_2$, with equality if and only if $A$ has rank 1.

A.14. In kernel PCA, the singular values of the centered feature matrix are identical to the eigenvalues of the centered kernel matrix.

A.15. If $A$ is rank-deficient and $\sigma_{r}$ is its smallest nonzero singular value, then the minimum-norm least squares solution $A^+\mathbf{b}$ is continuous in $\mathbf{b}$ but not necessarily continuous in $A$.

A.16. In stochastic gradient descent on a linear model, the expected update direction lies in the span of the top left singular vectors of the minibatch design matrix.

A.17. The nuclear norm is the convex envelope of the rank function on the set $\{A : \|A\|_2 \leq 1\}$, and therefore minimizing $\|A\|_*$ is the tightest convex relaxation of rank minimization under spectral-norm constraints.

A.18. If $A$ and $B$ are matrices of the same size, then $\|A - B\|_F^2 = \sum_i (\sigma_i(A) - \sigma_i(B))^2$ for any ordering of singular values.

A.19. For a sequence of matrices $A_t$ produced during training, monotonic decay of $\sigma_1(A_t)$ implies monotonic decay of the network’s Lipschitz constant.

A.20. In low-rank adaptation (LoRA), constraining the update $\Delta W = AB$ with $A \in \mathbb{R}^{m \times r}$, $B \in \mathbb{R}^{r \times n}$ guarantees that the effective rank of the updated weight matrix $W + \Delta W$ is at most $r$.

B. Proof Problems (20)

B.1. Let $X \in \mathbb{R}^{n \times d}$ be mean-centered. Prove that the rank-$k$ PCA reconstruction $X_k$ minimizes $\|X - ZW^\top\|_F$ over all $Z \in \mathbb{R}^{n \times k}$, $W \in \mathbb{R}^{d \times k}$ with $W^\top W = I$, and that $\|X - X_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2$.

B.2. Prove that for any matrix $A$, the nuclear norm $\|A\|_*$ is the convex envelope of $\text{rank}(A)$ on the spectral norm unit ball $\{A : \|A\|_2 \leq 1\}$.

B.3. Let $A \in \mathbb{R}^{m \times n}$ and $A_k$ be its truncated SVD. Prove that $A_k$ is the unique minimizer of $\|A - X\|_F$ among rank-$k$ matrices when $\sigma_k > \sigma_{k+1}$, and characterize all minimizers when $\sigma_k = \sigma_{k+1}$.

B.4. Prove that for any $A$, $\|A\|_2 = \max_{\|x\|_2=1,\|y\|_2=1} y^\top A x$, and use this to show that spectral normalization makes a linear layer 1-Lipschitz.

B.5. Let $A \in \mathbb{R}^{m \times n}$ have singular values $\sigma_i$. Prove that $\|A\|_F \leq \sqrt{\text{rank}(A)}\,\|A\|_2$, with equality if and only if all nonzero singular values are equal.

B.6. Prove that if $X$ is mean-centered and $\Sigma = \frac{1}{n}X^\top X$, then the principal components of PCA are the right singular vectors of $X$, and the explained variances are $\sigma_i^2/n$.

B.7. Let $A$ be full rank with condition number $\kappa_2(A)$. Prove that for any $\Delta A$ with $\|\Delta A\|_2 < \|A^{-1}\|_2^{-1}$, the relative error in solving $(A+\Delta A)x = b$ satisfies $\|\Delta x\|/\|x\| \leq \kappa_2(A)\,\|\Delta A\|_2/\|A\|_2 + O(\|\Delta A\|_2^2)$.

B.8. Prove that for any $A$, the map $A \mapsto A_k$ (truncated SVD at rank $k$) is Lipschitz in Frobenius norm when $\sigma_k > \sigma_{k+1}$, and give an explicit bound in terms of the spectral gap $\sigma_k - \sigma_{k+1}$.

B.9. Prove that if $X \in \mathbb{R}^{n \times d}$ has singular values $\sigma_i$, then the best rank-$k$ approximation error in spectral norm is $\sigma_{k+1}$, and show how this bound controls the worst-case linear prediction error for a compressed linear model.

B.10. Let $W \in \mathbb{R}^{m \times n}$ be a neural network weight matrix. Prove that the Lipschitz constant of the linear map $x \mapsto Wx$ equals $\sigma_1(W)$, and extend the proof to a composition of linear layers with ReLU activations using submultiplicativity.

B.11. Prove that the Moore-Penrose pseudoinverse $A^+ = V\Sigma^+U^\top$ yields the minimum-norm solution of $\min_x \|Ax-b\|_2$ and that this solution depends continuously on $b$.

B.12. Let $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{m \times n}$. Prove the Hoffman–Wielandt inequality for singular values: $\sum_i (\sigma_i(A) - \sigma_i(B))^2 \leq \|A - B\|_F^2$.

B.13. Prove that if $A$ is rank-$r$ and $\Omega$ is a random subset of observed entries satisfying standard incoherence conditions, then exact recovery by nuclear norm minimization is unique and stable to small noise.

B.14. Let $A$ be any matrix and $P$ an orthogonal projector of rank $k$. Prove that $\|A - AP\|_F$ is minimized when $P$ projects onto the span of the top $k$ right singular vectors of $A$.

B.15. Prove that for any $A$, the gradient of $\|A\|_*$ at a point with distinct nonzero singular values is $U V^\top$, and characterize the subdifferential when singular values are repeated or zero.

B.16. Let $X \in \mathbb{R}^{n \times d}$ be data with $r$-dimensional latent structure plus isotropic noise. Prove that the top $r$ singular vectors of $X$ converge to the latent subspace as $n \to \infty$ under appropriate scaling.

B.17. Prove that for any matrices $A$ and $B$ of compatible sizes, $\sigma_i(AB) \leq \sigma_i(A)\,\sigma_1(B)$ for all $i$, and use this to bound the singular values of a product of neural network layers.

B.18. Prove that the effective rank $r_{\text{eff}}(A) = (\sum_i \sigma_i)^2 / \sum_i \sigma_i^2$ is bounded above by $\text{rank}(A)$, and characterize when equality holds.

B.19. Let $A$ be a symmetric matrix with eigenvalues $\lambda_i$. Prove that $\|A\|_* = \sum_i |\lambda_i|$, and deduce that nuclear norm regularization of symmetric matrices is equivalent to $\ell^1$ regularization on eigenvalues.

B.20. Prove that among all rank-$k$ matrices $X$, the minimizer of $\|A - X\|_2$ is $A_k$, and that any minimizer must have its column space contained in the span of the top $k$ left singular vectors of $A$.

C. Python Exercises (20)

C.1. Task: Implement a function in NumPy that computes the full SVD of a synthetic matrix and verifies the reconstruction error $\|A - U\Sigma V^\top\|_F$. Purpose: build numerical intuition for SVD correctness and orthogonality checks. ML link: this is the numerical backbone of PCA and linear embedding pipelines. Hints: check $U^\top U \approx I$, $V^\top V \approx I$, and compare $A$ to $U\Sigma V^\top$ with a tolerance scaled by $\|A\|_F$. What mastery looks like: you can detect and explain when reconstruction error comes from numerical precision versus implementation mistakes.

C.2. Task: Write a PyTorch routine that estimates the top singular value $\sigma_1$ of a weight matrix using power iteration and compare it to $\sigma_1$ from torch.linalg.svd. Purpose: practice iterative spectral estimation and convergence diagnostics. ML link: spectral normalization and Lipschitz control in deep networks. Hints: normalize the iterate at each step and track convergence of $\|A v\|$ toward the reference value. What mastery looks like: you can justify the number of iterations needed for a given tolerance and explain failures on near-degenerate spectra.

C.3. Task: Use NumPy to construct a matrix with prescribed singular values, then empirically verify $\|A\|_2 = \sigma_1$ and $\|A\|_F = \sqrt{\sum \sigma_i^2}$. Purpose: connect algebraic definitions to numeric behavior. ML link: norm-based regularization and stability analysis. Hints: generate random orthogonal matrices via QR, set $\Sigma$, and compute norms. What mastery looks like: you can generate test cases where spectral and Frobenius norms diverge and interpret why.

C.4. Task: Implement truncated SVD for a dense matrix and compute the Frobenius error curve $\|A - A_k\|_F$ over $k$. Purpose: quantify approximation quality as rank increases. ML link: model compression and PCA variance retention. Hints: use tail sums of squared singular values, $\|A - A_k\|_F^2 = \sum_{i>k} \sigma_i^2$, instead of reconstructing $A_k$ for every $k$, and validate against explicit reconstruction for a few values of $k$. What mastery looks like: you can select $k$ based on an error budget and explain the effect of spectral decay.

C.5. Task: Build a synthetic dataset with a known low-rank signal plus noise and estimate its numerical rank using a tolerance rule. Purpose: understand numerical rank under noise. ML link: selecting PCA dimension and latent factor counts. Hints: construct $A = UV^\top + E$ with controlled noise variance and examine the singular value gap. What mastery looks like: you can pick a tolerance tied to noise scale and justify it quantitatively.

C.6. Task: Implement PCA via SVD on centered data and compare to eigendecomposition of the covariance matrix. Purpose: highlight numerical stability differences. ML link: standard preprocessing for classifiers and regressors. Hints: verify that eigenvectors from $X^\top X$ align with right singular vectors of $X$, but note condition number squared. What mastery looks like: you can explain when covariance eigendecomposition is unstable and demonstrate it empirically.

C.7. Task: Implement a low-rank linear regression in NumPy by constraining the solution to rank $k$ and evaluate training vs validation error. Purpose: explore capacity control via rank. ML link: low-rank inductive bias in overparameterized models. Hints: fit an unconstrained model, then project weight matrix onto rank $k$ via SVD. What mastery looks like: you can identify the $k$ that minimizes validation error and explain the bias-variance tradeoff.

C.8. Task: Use SciPy to compute the top $k$ singular values of a large sparse matrix and compare runtime to full SVD on a smaller dense matrix. Purpose: understand algorithmic scaling. ML link: large-scale embeddings and recommender systems. Hints: use scipy.sparse.linalg.svds and monitor memory usage; compare with numpy.linalg.svd. What mastery looks like: you can argue when partial SVD is required and how accuracy depends on $k$.

C.9. Task: Implement spectral normalization for a linear layer in PyTorch and measure its effect on gradient norms during training on a toy classification task. Purpose: connect spectral norms to stability. ML link: adversarial robustness and GAN training stability. Hints: update $W \leftarrow W/\sigma_1(W)$ each step and track gradient norms. What mastery looks like: you can relate changes in optimization behavior to changes in $\sigma_1$.

C.10. Task: Perform SVD-based compression of a pretrained embedding matrix and measure the impact on downstream task accuracy. Purpose: quantify compression tradeoffs. ML link: scalable NLP embedding deployment. Hints: compute $A_k$, replace the embedding with $U_k\Sigma_k$ and $V_k^\top$, and evaluate. What mastery looks like: you can choose $k$ to meet an accuracy target with minimal memory.

C.11. Task: Implement a function that computes the condition number $\kappa_2(A)$ and empirically relate it to sensitivity of least squares solutions under perturbations. Purpose: link conditioning to stability. ML link: regression robustness and feature scaling. Hints: perturb $A$ and $b$ with small noise, solve the least-squares problem $\min_x \|Ax-b\|_2$, and measure relative error in $x$. What mastery looks like: you can predict error amplification from $\kappa_2$ and validate it empirically.

C.12. Task: Construct two matrices $A$ and $B$ with known singular values and verify the Hoffman-Wielandt inequality numerically. Purpose: develop intuition for spectral perturbation. ML link: stability of embeddings and model drift. Hints: use orthogonal matrices to build $A$ and $B$, compute singular values, and compare $\sum (\sigma_i(A)-\sigma_i(B))^2$ to $\|A-B\|_F^2$. What mastery looks like: you can explain when the inequality is tight and when it is loose.

C.13. Task: Implement a randomized SVD in NumPy or SciPy and compare approximation error to deterministic truncated SVD for the same $k$. Purpose: understand accuracy vs speed tradeoffs. ML link: scalable PCA and approximate embeddings. Hints: use random Gaussian sketch $Y = A\Omega$ and orthonormalize before projecting. What mastery looks like: you can report error as a function of oversampling and explain why it improves accuracy.

C.14. Task: Build a low-rank matrix completion experiment with missing entries and solve it via alternating least squares, then compare to nuclear norm soft-thresholding. Purpose: connect low-rank factorization to convex relaxation. ML link: recommender systems and matrix completion. Hints: use a binary mask $\Omega$ and fit only observed entries; compare reconstruction on a held-out set. What mastery looks like: you can explain the difference between factorized and convex approaches and their bias.

C.15. Task: Implement an SVD-based denoising pipeline for images by truncating singular values and evaluate PSNR as a function of $k$. Purpose: see the denoising effect of low-rank approximation. ML link: image compression and noise reduction in preprocessing. Hints: use a small grayscale image and add Gaussian noise; compare PSNR across $k$. What mastery looks like: you can interpret the tradeoff between detail preservation and noise removal.

C.16. Task: Compute effective rank $r_{\text{eff}}$ for model weight matrices during training and correlate it with training loss. Purpose: study implicit regularization via rank. ML link: generalization and capacity control in deep learning. Hints: log singular values periodically and track $r_{\text{eff}}$. What mastery looks like: you can explain how rank evolution relates to optimization phases and overfitting.

C.17. Task: Implement LoRA-style low-rank updates in PyTorch for a linear layer and compare training speed and accuracy to full fine-tuning. Purpose: analyze low-rank adaptation. ML link: parameter-efficient fine-tuning. Hints: freeze the base weight $W$ and train $A$, $B$ with $\Delta W = AB$. What mastery looks like: you can justify rank choices and show when low-rank updates are sufficient.

C.18. Task: Construct a dataset with correlated features and show how whitening via SVD changes the condition number of the design matrix. Purpose: connect SVD to preprocessing. ML link: training stability and convergence speed. Hints: compute the thin SVD $X = U\Sigma V^\top$, whiten with $XV\Sigma^{-1}$ (which equals $U$), and measure $\kappa_2$. What mastery looks like: you can explain why whitening accelerates gradient-based optimization.

C.19. Task: Compare reconstruction error of rank-$k$ approximations under Frobenius norm and spectral norm for the same matrix. Purpose: differentiate error metrics. ML link: compression objectives in ML pipelines. Hints: compute $\|A-A_k\|_F$ and $\|A-A_k\|_2$ and relate them to discarded singular values. What mastery looks like: you can argue which norm is more appropriate for a given ML task.

C.20. Task: Implement a stability experiment where you add small perturbations to a matrix and measure changes in its top singular vectors. Purpose: study sensitivity to perturbations. ML link: robustness of learned embeddings and representations. Hints: compute principal angles between subspaces before and after perturbation and relate them to the spectral gap. What mastery looks like: you can connect observed instability to small gaps and predict when singular vectors will be unreliable.

Solutions

Solutions to A. True / False

A.1

Final Answer. False.

Full mathematical justification. The central claim is that reweighting rows (sample importance) does not change the principal component directions. Let us examine this carefully through the lens of the data covariance matrix.

The data matrix $X \in \mathbb{R}^{n \times d}$ (mean-centered) has right singular vectors, which are the eigenvectors of the empirical covariance $\Sigma = X^\top X / n$. If we reweight rows by a positive diagonal matrix $D = \text{diag}(w_1, \ldots, w_n)$, the reweighted data is $X' = DX$, and the reweighted covariance becomes $\Sigma' = X'^{\top}X' / n = X^\top D^2 X / n$.

The key observation is that $X^\top D^2 X = \sum_{i=1}^n w_i^2\, x_i x_i^\top$, where $x_i^\top$ is the $i$-th row of $X$: each sample's rank-one contribution to the Gram matrix is scaled by its own squared weight. When the weights are non-uniform ($w_i \neq w_j$ for some $i \neq j$), this weighted sum is in general not a scalar multiple of $X^\top X = \sum_i x_i x_i^\top$, and nothing forces it to share eigenvectors with $X^\top X$. In particular, it is generally not orthogonally similar to $X^\top X$: orthogonal similarity $Q(X^\top X)Q^\top$ preserves eigenvalues, whereas non-uniform row weights typically change them. Left multiplication by a diagonal $D$ is not an orthogonal transformation, since it scales different rows by different amounts.

Therefore, the eigenvectors of $X^\top D^2 X$ (which are the right singular vectors of $X'$) are generically different from the eigenvectors of $X^\top X$, and the principal component directions shift.

Explicit counterexample if false. Consider the centered data matrix $X = \begin{bmatrix}1 & 0\\-1 & 0\\1 & 1\\-1 & -1\end{bmatrix}$ (each sample appears with its negation, so the column means are zero). Then $X^\top X = \begin{bmatrix}4 & 2\\2 & 2\end{bmatrix}$, whose leading eigenvector makes an angle $\theta$ with the first axis satisfying $\tan 2\theta = 2$, i.e. $\theta \approx 31.7^\circ$. Now apply non-uniform weights $D = \text{diag}(2,2,1,1)$. The reweighted data $DX$ is still centered, and $(DX)^\top(DX) = X^\top D^2 X = \begin{bmatrix}10 & 2\\2 & 2\end{bmatrix}$, whose leading eigenvector satisfies $\tan 2\theta = 1/2$, i.e. $\theta \approx 13.3^\circ$. The first principal direction has tilted toward the upweighted samples. Some special cases hide the effect: for $X = I_2$ and $D = \text{diag}(2,1)$, $(DX)^\top(DX) = \text{diag}(4,1)$ still has eigenvectors $\mathbf{e}_1, \mathbf{e}_2$ because of the diagonal structure, and only the eigenvalues change (from $1, 1$ to $4, 1$). Once $D$ is non-uniform and the samples are not axis-aligned, the principal directions generically tilt.
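
The tilt of the first principal direction under row reweighting can be checked numerically. The following is a minimal sketch, assuming a small centered dataset `X` and weights `D` chosen purely for illustration; the helper name `top_direction_angle` is hypothetical.

```python
import numpy as np

# Illustrative centered data: each row appears with its negation,
# so the column means are zero before and after symmetric reweighting.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])

# Illustrative row weights: the first two samples count double.
D = np.diag([2.0, 2.0, 1.0, 1.0])

def top_direction_angle(M):
    """Angle (degrees) of the leading right singular vector, folded into [0, 180)."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    v = Vt[0]
    return float(np.degrees(np.arctan2(v[1], v[0])) % 180.0)

angle_before = top_direction_angle(X)      # first PC of the original data
angle_after = top_direction_angle(D @ X)   # first PC after row reweighting

print(f"first PC angle: {angle_before:.2f} deg -> {angle_after:.2f} deg")
```

Folding the angle into $[0, 180)$ removes the sign ambiguity of singular vectors, so the two numbers can be compared directly.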

Comprehension. Row reweighting by a diagonal matrix is not an orthogonal transformation of the feature space. It modifies the Gram matrix $X^\top X$ in a way that changes the eigenstructure and thus the principal components. Intuitively, samples with higher weight influence the covariance more, so directions aligned with heavy samples become more prominent. This is the intended behavior of weighted PCA.

ML Applications. Weighted PCA and importance resampling are tools designed specifically to reweight samples and shift principal components. For example, in imbalanced classification, upweighting minority-class samples shifts PCA components toward features that distinguish minority samples, which can improve downstream discriminative power. In clustering applications, users may reweight by cluster size or background knowledge to emphasize certain populations. In meta-learning, different tasks are weighted, and the learned representation adapts accordingly.

Failure Mode Analysis. If an analyst mistakenly assumes that row reweighting does not change PCA components, they may: 1. Reweight data for fairness or bias correction, expecting the same feature representation, only to find downstream predictions change unexpectedly. 2. Use reweighting as a robust preprocessing step without realizing it changes the feature geometry, causing inconsistency across model versions. 3. Fail to investigate whether a sudden drop in performance is due to feature changes from reweighting, leading to misdiagnosis of model degradation.

Traps. A common source of confusion is the distinction between: (1) orthogonal transformations, where left multiplication of the data by an orthogonal $Q$ leaves $X^\top X$ and hence the right singular vectors unchanged, and right multiplication by an orthogonal $R$ rotates the principal directions rigidly without changing singular values; and (2) diagonal reweighting (left multiplication by a diagonal matrix), which shifts the importance of observations and therefore changes the data distribution and eigenstructure. Students often implicitly treat $X^\top D^2 X$ as a similarity transform of $X^\top X$ and assume the eigenstructure carries over, but it is not a similarity transform at all, so neither eigenvalues nor eigenvectors are preserved in general. Another trap is thinking of reweighting as “just changing the scale”: row scaling acts on individual samples, but it aggregates into the covariance and thus alters the geometry.

A.2

Final Answer. False.

Full mathematical justification. Numerical rank is defined by a threshold: $r_\tau(A) = \#\{\sigma_i : \sigma_i > \tau\}$. Under orthogonal right multiplication by $R$, the matrix $X' = XR$ has singular values determined by:

\[ \sigma_i(XR) = \sigma_i(X) \]

This holds because singular values are invariant under orthogonal transformations: if $X = U\Sigma V^\top$, then $XR = U\Sigma (R^\top V)^\top$ is an SVD of $XR$ with the same $\Sigma$, since $R^\top V$ is orthogonal. Right multiplication by orthogonal $R$ rotates the input space without stretching, so it does not change how much $X$ stretches input vectors.

Since the singular values of $XR$ are exactly those of $X$, the numerical rank at any threshold $\tau$ remains unchanged:

\[ r_\tau(XR) = \#\{\sigma_i(XR) : \sigma_i(XR) > \tau\} = \#\{\sigma_i(X) : \sigma_i(X) > \tau\} = r_\tau(X) \]

The condition number, defined as $\kappa(A) = \sigma_{\max}(A) / \sigma_{\min}(A)$, also depends only on singular values, so $\kappa(XR) = \kappa(X)$, and conditioning is preserved.

Explicit counterexample if false. Consider $X \in \mathbb{R}^{100 \times 50}$ with singular values $(100, 50, 10, 0.1, 0.01, 0, \ldots, 0)$. Suppose the threshold is $\tau = 1$, so $r_\tau(X) = 3$. Let $R$ be any 50-dimensional orthogonal matrix (e.g., a random orthonormal basis). Then $XR$ has singular values $(100, 50, 10, 0.1, 0.01, 0, \ldots, 0)$, so $r_\tau(XR) = 3$, unchanged. This holds for any choice of $R$ and any threshold $\tau$.
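
The invariance can be confirmed empirically. Below is a minimal sketch, assuming orthogonal factors drawn via QR of Gaussian matrices and a fixed seed (both are illustrative choices, and `random_orthogonal` is a hypothetical helper), that builds a matrix with the singular values above and compares spectra before and after right multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice for reproducibility

def random_orthogonal(n):
    """Orthogonal matrix taken from the QR factorization of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

m, d = 100, 50
sigma = np.zeros(d)
sigma[:5] = [100.0, 50.0, 10.0, 0.1, 0.01]   # prescribed singular values

U = random_orthogonal(m)[:, :d]   # 100 x 50 with orthonormal columns
V = random_orthogonal(d)
X = (U * sigma) @ V.T             # X = U diag(sigma) V^T

R = random_orthogonal(d)          # orthogonal rotation of the feature space
s_X = np.linalg.svd(X, compute_uv=False)
s_XR = np.linalg.svd(X @ R, compute_uv=False)

tau = 1.0
rank_X = int(np.sum(s_X > tau))    # numerical rank before rotation
rank_XR = int(np.sum(s_XR > tau))  # numerical rank after rotation

print(rank_X, rank_XR, np.max(np.abs(s_X - s_XR)))
```

The two spectra should agree up to floating-point rounding, so the count above any threshold $\tau$ that is not within rounding distance of a singular value is unchanged.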

Comprehension. Right-multiplication by an orthogonal matrix rotates the feature (column) space without changing intrinsic scaling properties of data. Numerical rank measures how many directions are “energetic” relative to a threshold, and this is a property of the data geometry, not the coordinate system. A general invertible matrix can change singular values, but orthogonal matrices preserve them.

ML Applications. Orthogonal feature rotations (such as the PCA rotation $X \mapsto XV$ without rescaling, or a random orthogonal change of basis) do not change numerical rank or condition number, which is important for computational stability. Whitening is different: it additionally rescales by $\Sigma^{-1}$, so it is not orthogonal and does change conditioning. When a pipeline rotates features orthogonally, numerical properties remain stable, making such rotations safe preprocessing steps. This also means that SVD computed on rotated features gives the same spectrum as on original features.

Failure Mode Analysis. If practitioners mistakenly believe that feature rotations change numerical rank, they might: 1. Perform orthogonal feature rotations expecting the condition number to improve, then be surprised to see no numerical benefit (because conditioning is inherent in the data, not the coordinate system). 2. Avoid orthogonal preprocessing (like a PCA rotation) thinking it will hurt conditioning, when in fact conditioning is invariant. 3. Use orthogonal transformations as a “rank reduction” step when they actually preserve rank structure.

Traps. Students often confuse orthogonal transformations (which preserve singular values and norms) with arbitrary linear transformations (which do not). Another trap is thinking “rotation in feature space” always helps numerically; an orthogonal rotation leaves conditioning exactly unchanged, and only a non-orthogonal transformation (such as rescaling) can improve or worsen it. A third trap is conflating numerical rank with algebraic rank: numerical rank depends on a threshold and can change with scaling or noise, but orthogonal rotations change neither the singular values nor the count of singular values above the threshold.

A.3

Final Answer. False.

Full mathematical justification. The statement conflates two different norms: the operator (spectral) norm of outputs on unit inputs, and the Frobenius norm of the matrix itself. Let us clarify both.

For a linear map $W \in \mathbb{R}^{m \times d}$ with SVD $W = U\Sigma V^\top$, the output on a unit-norm input $x$ satisfies:

\[ \|Wx\|_2 = \|U\Sigma V^\top x\|_2 = \|\Sigma V^\top x\|_2 \in [\sigma_{\min}, \sigma_{\max}] \]

The range $[\sigma_{\min}, \sigma_{\max}]$ is achieved because $\|V^\top x\|_2 = 1$ for unit $x$, and the singular values are the scaling factors along principal directions. The maximum is achieved when $x$ aligns with the top right singular vector: $\|Wx\|_2 = \sigma_{\max}$ when $x = \mathbf{v}_1$. The minimum is achieved when $x$ aligns with the bottom right singular vector: $\|Wx\|_2 = \sigma_{\min}$ when $x = \mathbf{v}_d$ (here $\sigma_{\min} = \sigma_d$, which is zero whenever $W$ has a nontrivial null space, e.g. when $m < d$).

In contrast, the Frobenius norm is a property of the matrix $W$ itself:

\[ \|W\|_F = \sqrt{\sum_{i=1}^r \sigma_i^2} \]

where $r = \text{rank}(W)$. This aggregates all singular values. There is no single output $Wx$ for unit $x$ that has norm $\|W\|_F$. For a rank-1 matrix with $\sigma_1 = \|W\|_F$, the equivalence $\|Wx\|_2 = \|W\|_F$ (for the appropriate $x$) does hold, but for $\text{rank}(W) > 1$, the Frobenius norm is always $> \sigma_1$.

Explicit counterexample if false. Let $W = \text{diag}(2, 1)$, so $\sigma_1 = 2$, $\sigma_2 = 1$. Then $\|W\|_F = \sqrt{4+1} = \sqrt{5} \approx 2.236$. For any unit input $x = (x_1, x_2)^\top$ with $x_1^2 + x_2^2 = 1$, the output norm is:

\[ \|Wx\|_2 = \sqrt{4x_1^2 + x_2^2} = \sqrt{3x_1^2 + 1} \in [1, 2] \]

since $x_1^2 \in [0,1]$. The output norm is maximized at $x = (1,0)^\top$ with value 2, which does not equal $\sqrt{5}$. In fact, no unit input achieves output norm $\sqrt{5}$.
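
A quick numerical check of this counterexample: the sketch below sweeps unit inputs around the circle (the grid resolution is an arbitrary choice) and compares the attained output norms with the spectral and Frobenius norms of $W$.

```python
import numpy as np

W = np.diag([2.0, 1.0])

# Unit inputs sampled on the circle; the 0.1-degree grid is an illustrative choice.
theta = np.linspace(0.0, 2.0 * np.pi, 3601)
X_unit = np.stack([np.cos(theta), np.sin(theta)])   # each column is a unit vector

out_norms = np.linalg.norm(W @ X_unit, axis=0)      # ||W x||_2 for every sampled x

spectral = np.linalg.norm(W, 2)        # largest singular value
frobenius = np.linalg.norm(W, "fro")   # sqrt of the sum of squared singular values

print(out_norms.min(), out_norms.max(), spectral, frobenius)
```

The sampled output norms stay inside $[1, 2]$, while the Frobenius norm $\sqrt{5}$ lies strictly above that interval.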

Comprehension. The Frobenius norm is a global summary statistic of all singular values, while the output norm on a single unit input depends on the direction of that input and samples only the spectrum at one direction. A high Frobenius norm means large singular values in aggregate, but a single input can only experience up to $\sigma_1$.

ML Applications. In neural networks, the Frobenius norm is used for weight regularization (L2 regularization penalizes $\|W\|_F^2$). The spectral norm $\|W\|_2$ is used for spectral normalization to enforce Lipschitz constraints. These serve different purposes: Frobenius norm controls overall weight magnitude and smoothness, while spectral norm controls worst-case input amplification. Confusing them in practice leads to incorrect regularization strength or incorrect stability guarantees.

Failure Mode Analysis. Assuming that a large Frobenius norm implies large output norms on typical inputs can lead to: 1. Overestimating worst-case amplification in neural networks by using Frobenius norm as a proxy for spectral norm. 2. Misjudging typical output magnitude: the Frobenius norm can be large because of a few dominant singular values even if most singular values are negligible, in which case most input directions are barely amplified. 3. Incorrect deployment of regularization strength; using Frobenius norm regularization when spectral control is needed (or vice versa) changes the effective regularization mechanism.

Traps. A subtle trap is that for rank-1 matrices, $\|W\|_F = \|W\|_2$, which can mislead students into thinking the relationship holds generally. Another trap is the notation: both are norms, so it is easy to conflate them syntactically. A third trap is that for structured matrices (e.g., diagonal), Frobenius norm and spectral norm can be related in specific ways, leading to overgeneralization.

A.4

Final Answer. False.

Full mathematical justification. The uniqueness of the best rank-$k$ approximation depends critically on the spectral gap, i.e., whether $\sigma_k > \sigma_{k+1}$. Let us examine both cases:

Case 1: $\sigma_k > \sigma_{k+1}$ (strict gap). By the Eckart–Young–Mirsky theorem, the unique minimizer of $\|A - X\|_F$ over matrices of rank at most $k$ is $A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$, with error $\|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2$. This is unique because the top-$k$ left singular subspace (spanned by $\mathbf{u}_1, \ldots, \mathbf{u}_k$) is uniquely determined as the invariant subspace of $AA^\top$ for its $k$ largest eigenvalues whenever $\sigma_k > \sigma_{k+1}$, and likewise for the right singular subspace. Ties among the top $k$ singular values change the individual vectors but not the subspace or $A_k$.

Case 2: $\sigma_k = \sigma_{k+1}$ (repeated singular value). When the $k$-th singular value equals the $(k+1)$-th, the top-$k$ singular subspace is not unique. More precisely, the singular subspace associated with the repeated value $\sigma_k$ has dimension at least 2, its orthonormal basis can be rotated freely, and truncation at rank $k$ keeps only part of it. Different choices of which directions to keep yield different rank-$k$ approximations, all achieving the same error.

For example, if $\Sigma = \text{diag}(1, 1, 1, 0, \ldots)$ (top 3 singular values all equal to 1) and $k = 2$, then $AP$ is a minimizer for every orthogonal projector $P$ onto a 2-dimensional subspace of $\text{span}(\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3)$. Each choice of 2-dimensional subspace gives a different rank-2 matrix with the same error.

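Formally, suppose $\sigma_p > \sigma_{p+1} = \cdots = \sigma_k = \sigma_{k+1} = \cdots = \sigma_q > \sigma_{q+1}$ with $p < k < q$ and $\sigma_k > 0$ (taking $p = 0$ if the tie starts at $\sigma_1$). Then every matrix of the following form is a minimizer:

\[ A' = \sum_{i=1}^{p} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top + \sigma_k \sum_{j=1}^{k-p} \tilde{\mathbf{u}}_j \tilde{\mathbf{v}}_j^\top, \qquad \tilde{\mathbf{u}}_j = \tfrac{1}{\sigma_k} A \tilde{\mathbf{v}}_j, \]

where $\tilde{\mathbf{v}}_1, \ldots, \tilde{\mathbf{v}}_{k-p}$ is any orthonormal set in the degenerate singular subspace $\text{span}(\mathbf{v}_{p+1}, \ldots, \mathbf{v}_q)$.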

Explicit counterexample if false. Consider the identity matrix $A = I_3$ (all singular values equal to 1) and $k = 1$. By Eckart–Young–Mirsky, the rank-1 approximation error is $\|A - A_1\|_F^2 = 2$ (the two discarded singular values). However, the best rank-1 approximation is not unique: any rank-1 projector $uu^\top$ with $\|u\|_2 = 1$ achieves this error. For example, $u = \mathbf{e}_1$, $u = \mathbf{e}_2$, $u = \frac{1}{\sqrt{2}}(1,1,0)^\top$ all yield rank-1 matrices with the same Frobenius error. Geometrically, these correspond to projecting onto different one-dimensional subspaces, all of which are “equally principal” because singular values are tied.
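
The tie can be verified directly. A minimal sketch, using the three unit vectors named above as candidates (the dictionary layout is just an illustrative way to organize them):

```python
import numpy as np

A = np.eye(3)  # all singular values equal to 1

# Candidate unit vectors u; each defines the rank-1 projector u u^T.
candidates = {
    "e1": np.array([1.0, 0.0, 0.0]),
    "e2": np.array([0.0, 1.0, 0.0]),
    "(e1+e2)/sqrt(2)": np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0),
}

# Squared Frobenius error of each rank-1 approximation.
sq_errors = {
    name: float(np.linalg.norm(A - np.outer(u, u), "fro") ** 2)
    for name, u in candidates.items()
}

for name, err in sq_errors.items():
    print(f"{name}: squared error = {err:.6f}")
```

All three candidates give squared error 2, matching the sum of the two discarded squared singular values.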

Comprehension. Singular vectors are only defined up to rotation within degenerate singular subspaces. When a singular value is repeated, the corresponding singular vectors are not unique: any orthonormal basis of the subspace works. This non-uniqueness propagates to low-rank approximation when the tie straddles the truncation point: if $\sigma_k = \sigma_{k+1} > 0$, there are infinitely many rank-$k$ approximations all achieving the same error. Uniqueness is restored when $\sigma_k > \sigma_{k+1}$.

ML Applications. In PCA, when eigenvalues are nearly equal (small eigengaps), the principal components are unstable: small random perturbations of the data can flip which features are selected, leading to different interpretations. This is problematic in factor discovery, where researchers want interpretable, stable components. In recommender systems, repeated singular values in the user-item matrix correspond to multiple equally-important latent factors, and the chosen basis is arbitrary. This explains why matrix factorization methods are often sensitive to initialization and why different runs can yield different factorizations with identical performance.

Failure Mode Analysis. A system that assumes uniqueness of low-rank approximations when there are repeated singular values can suffer from: 1. Instability across runs: Different random initializations or tie-breaking rules lead to different components, making results non-reproducible. 2. Misinterpretation: If components are not stable, any post-hoc analysis of their meaning (e.g., “this component represents feature importance”) is unreliable. 3. Comparison issues: Comparing models with tied singular values across datasets or time periods can be misleading if the non-unique component basis has drifted.

Traps. A common trap is assuming that the first computed direction is “the” principal direction when singular values are tied. In reality, any orthonormal basis of the subspace is equally valid. Another trap is numerical: when singular values are nearly equal (within numerical precision), floating-point computations may order them inconsistently, leading to spurious instability that appears to be a data problem but is actually a numerical artifact. Practitioners may spend time investigating data quality when the root cause is that computational precision has destroyed the assumed gap structure.

A.5

Final Answer. False.

Full mathematical justification. Matrix completion and low-rank imputation are subtly but fundamentally different problems. Let us formalize both.

Matrix Completion Problem: Minimize $\|M_\Omega(A - UV^\top)\|_F^2$, where $\Omega$ is the set of observed indices and $M_\Omega$ is the indicator operator that selects only those entries. This loss function ignores unobserved entries entirely—the objective has zero contribution from missing positions regardless of their values.

Low-Rank Imputation Problem: Fill in missing entries with zeros (or another constant), producing a complete matrix $A_0$, then minimize $\|A_0 - UV^\top\|_F^2$. Now the unobserved entries are treated as explicit data points (all equal to 0), and the loss penalizes deviations from zero at those locations.

The critical difference: in matrix completion, missing entries are free variables (unconstrained); in zero imputation, they are constrained to be zero. This changes the solution fundamentally.

Effect on Solutions: In matrix completion, the fitted matrix $UV^\top$ may take any values at missing entries, since only the observed loss is minimized. In zero imputation, the fit is also penalized for deviating from zero at unobserved locations, which biases the solution toward zeros. Writing out the zero-imputed objective makes this explicit:

\[ \min_{U,V} \|A_0 - UV^\top\|_F^2 = \min_{U,V} \Big[ \sum_{(i,j) \in \Omega} (A_{ij} - [UV^\top]_{ij})^2 + \underbrace{\sum_{(i,j) \notin \Omega} (0 - [UV^\top]_{ij})^2}_{\text{imputation bias}} \Big] \]

The second sum pulls $UV^\top$ toward zero at unobserved entries, distorting the latent factors and degrading performance on $\Omega$.

Explicit counterexample if false. Consider a simple $2 \times 2$ rating matrix where the main diagonal is observed ($A_{11} = 1, A_{22} = 1$) and off-diagonal is missing:

\[ A = \begin{bmatrix}1 & ?\\? & 1\end{bmatrix} \]

Matrix Completion Approach: Minimize $(1 - [UV^\top]_{11})^2 + (1 - [UV^\top]_{22})^2$. With rank 1, a minimizer is $U = V = \begin{bmatrix}1\\1\end{bmatrix}$, giving $UV^\top = \begin{bmatrix}1 & 1\\1 & 1\end{bmatrix}$, which has zero error on $\Omega$ and predicts 1 for both missing entries. With rank 2, $UV^\top = \begin{bmatrix}1 & 0\\0 & 1\end{bmatrix}$ also achieves zero error on $\Omega$; the masked loss places no constraint on the off-diagonal values.

Zero Imputation Approach: Minimize $(1 - [UV^\top]_{11})^2 + (0 - [UV^\top]_{12})^2 + (0 - [UV^\top]_{21})^2 + (1 - [UV^\top]_{22})^2$. The loss now includes the extra terms $[UV^\top]_{12}^2 + [UV^\top]_{21}^2$, which penalize non-zero off-diagonal values. The zero-filled matrix is $A_0 = I_2$, so with rank 1 the minimizers are exactly the best rank-1 approximations of $I_2$: matrices $\mathbf{u}\mathbf{u}^\top$ with $\|\mathbf{u}\|_2 = 1$, each with total loss 1 and loss at least $0.5$ on the observed entries alone. A rank-1 model under the masked loss can fit both observed entries exactly, so the two problems have different solutions. The artificial zero constraints bias the latent factors, and in many realistic cases the zero-imputed minimizer generalizes worse on held-out entries than the matrix-completion minimizer.
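
The gap between the two objectives can be seen numerically on this $2 \times 2$ example. The sketch below assumes a rank-1 model for both approaches; the factors `u`, `v` for the masked problem are one hand-picked minimizer, and the zero-imputed solution is taken from a truncated SVD of the zero-filled matrix (any other rank-1 minimizer would serve equally well). The helper `masked_loss` is hypothetical.

```python
import numpy as np

# Observed diagonal entries; the off-diagonal values stored here are placeholders.
A_obs = np.array([[1.0, 0.0], [0.0, 1.0]])
mask = np.eye(2, dtype=bool)   # True where an entry is observed

def masked_loss(B):
    """Squared error restricted to the observed entries."""
    return float(np.sum((A_obs - B)[mask] ** 2))

# Matrix completion with rank 1: u = v = (1, 1) fits both observed entries.
u = np.array([1.0, 1.0])
v = np.array([1.0, 1.0])
B_completion = np.outer(u, v)

# Zero imputation with rank 1: best rank-1 approximation of the zero-filled matrix.
A0 = np.where(mask, A_obs, 0.0)
U, s, Vt = np.linalg.svd(A0)
B_zero = s[0] * np.outer(U[:, 0], Vt[0])

loss_completion = masked_loss(B_completion)               # loss on observed entries
loss_zero_observed = masked_loss(B_zero)                  # same metric, zero-imputed fit
loss_zero_total = float(np.linalg.norm(A0 - B_zero, "fro") ** 2)

print(loss_completion, loss_zero_observed, loss_zero_total)
```

Which rank-1 minimizer the SVD routine returns for the tied spectrum of $I_2$ is implementation-dependent, but every such minimizer leaves a loss of at least $0.5$ on the observed entries.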

Comprehension. The fundamental issue is that missing data is not equivalent to observed zeros. Zero imputation makes a strong assumption—that unobserved entries should be zero—which introduces bias unless this is actually true. In recommendation systems, a missing rating does not mean the rating is 0; it is simply unknown. Treating unknowns as zeros creates a cold-start bias: new users with few ratings are artificially pushed toward zero latent factors, which can distort the learned factorization. Matrix completion respects the missingness and avoids this bias, allowing the model to infer missing values from the structure of observed data.

ML Applications. In recommender systems, zero imputation can lead to poor recommendations for new items or users because the model learns that zero is “important.” The matrix factorization methods popularized by the Netflix Prize fit only the observed ratings (a masked loss, as in matrix completion) rather than zero-imputing. In image inpainting, filling missing pixels with a constant such as black biases reconstruction; algorithms that treat missing pixels as free variables (e.g., via nuclear norm minimization) reconstruct more plausibly. In time-series forecasting with missing values, imputing with zeros before fitting an ARIMA model can distort autocorrelation structure and lead to poor forecasts. In sensor networks, zero imputation assumes sensors report 0 when faulty, which is incorrect; matrix completion allows the algorithm to infer sensor values from neighboring sensors.

Failure Mode Analysis. Using zero imputation when matrix completion is appropriate causes: 1. Systematic bias in latent factors: The learned factors are pulled toward zero at unobserved locations, misrepresenting the true data distribution. 2. Cold-start problems: New samples with missing data are assigned spuriously low-magnitude latent vectors, making recommendations poor. 3. Generalization failure: The model learns to fit the artificial zeros, degrading performance on true missing values (which are unknown, not zero). 4. Compounding errors: If the imputed model is subsequently used for downstream tasks, bias propagates and accumulates.

Traps. A conceptual trap is that “missing” and “zero” sound similar and can be conflated in casual discussion. In data pipelines, it is easy to accidentally replace NaN with 0 without realizing this changes the statistical problem. Another trap is that for some problems (e.g., sparse matrices where most entries are legitimately zero), zero imputation might perform similarly to matrix completion, creating false confidence that the approach is generally valid. A third trap is that some algorithms (e.g., simple low-rank factorization without missing-data handling) are commonly applied to zero-imputed data, making the bias seem “standard practice” when it is actually a limitation of the algorithmic choice.

A.6

Final Answer. True.

Full mathematical justification. The Eckart-Young-Mirsky theorem states that among all matrices of rank at most $r_\tau$, the truncated SVD $A_{r_\tau} = \sum_{i=1}^{r_\tau} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$ achieves the minimal Frobenius norm approximation error:

\[ \|A - A_{r_\tau}\|_F = \min_{\text{rank}(X) \leq r_\tau} \|A - X\|_F \]

with the minimum value $\|A - A_{r_\tau}\|_F = \sqrt{\sum_{i=r_\tau+1}^r \sigma_i^2}$.

Now suppose there exists a rank-$r_\tau$ factorization $B = B_{\text{left}} B_{\text{right}} \in \mathbb{R}^{m \times n}$ such that $\|A - B\|_F \leq \tau$ (i.e., this factorization achieves acceptable error within threshold $\tau$). Since $B$ is rank-$r_\tau$, by optimality of $A_{r_\tau}$, we have:

\[ \|A - A_{r_\tau}\|_F \leq \|A - B\|_F \leq \tau \]

Therefore, truncated SVD achieves error at most $\tau$ at rank $r_\tau$.
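
The inequality chain can be illustrated on random data. In the sketch below, the heuristic factorization $B = B_{\text{left}} B_{\text{right}}$ is an assumption made for demonstration: it projects $A$ onto a random $r$-dimensional subspace, standing in for any ad-hoc rank-$r$ method. Matrix sizes, rank, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
A = rng.standard_normal((30, 20))
r = 5

# Truncated SVD at rank r.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r = (U[:, :r] * s[:r]) @ Vt[:r]
err_svd = np.linalg.norm(A - A_r, "fro")
tail = np.sqrt(np.sum(s[r:] ** 2))   # predicted error from discarded singular values

# Heuristic rank-r factorization: project onto a random r-dimensional subspace.
Q, _ = np.linalg.qr(rng.standard_normal((20, r)))
B_left, B_right = A @ Q, Q.T
B = B_left @ B_right
err_heuristic = np.linalg.norm(A - B, "fro")

print(f"truncated SVD: {err_svd:.4f}  (tail formula: {tail:.4f})")
print(f"heuristic    : {err_heuristic:.4f}")
```

Whatever error the heuristic attains, the truncated SVD error at the same rank is no larger, so any threshold $\tau$ met by the heuristic is also met by $A_r$.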

Strategic insight: The theorem does not claim truncated SVD is unique or fast to compute. It claims it is optimal—no other rank-$r_\tau$ matrix can achieve lower error. This is a fundamental statement about the best possible solution at a given rank.

Comprehension. Eckart-Young-Mirsky is a statement about the best rank-$r$ approximation, not about how to find it. Any ad-hoc low-rank factorization (e.g., random low-rank initialization and alternating least squares) produces a rank-$r$ matrix, and truncated SVD is guaranteed to be at least as good as any such heuristic at the same rank.

ML Applications. In data compression pipelines, if an application has already computed a rank-$r$ factorization with acceptable error, post-processing via truncated SVD tightens the approximation (reduces error for the same rank) or reduces rank for the same error. In autoencoders, after training, computing truncated SVD of the bottleneck representations can improve compression. In recommender systems, if a heuristic factorization (e.g., from SGD or alternating minimization) is available, its error provides a bound on the optimal error achievable at that rank via truncated SVD, which can guide design decisions.

Failure Mode Analysis. Misusing this theorem leads to: 1. Complacency: Thinking that any rank-$r$ factorization is “good enough” because it achieves some error bound, without realizing truncated SVD is at least as good and usually strictly better. 2. Premature termination: Stopping an algorithm when approximation error falls below $\tau$, without checking if truncated SVD could achieve error below $\tau$ at lower rank. 3. Comparing to wrong baselines: Comparing a heuristic factorization to random initialization, without comparing to truncated SVD as the gold standard.

Traps. A trap is assuming Eckart-Young-Mirsky guarantees computational efficiency; computing full SVD can be expensive for large matrices. The theorem is an optimality statement, not an algorithmic blueprint. Another trap is assuming it applies to all norms; it holds for unitarily invariant norms such as the Frobenius, spectral, and nuclear norms, but not in general for entrywise $\ell_p$ norms with $p \neq 2$. A third trap is thinking the theorem guides how to find a good low-rank approximation; it validates that truncated SVD is the best choice, but the theorem does not explain the algorithm itself.

A.7

Final Answer. False.

Full mathematical justification. PCA is extremely sensitive to feature scaling. If we scale the columns of $X$ by a positive diagonal matrix $S = \text{diag}(s_1, \ldots, s_d)$, producing $X' = XS$, then the (mean-centered) covariance of $X'$ is:

\[ \text{Cov}(X') = \frac{1}{n}X'^{\top}X' = \frac{1}{n}S^\top X^\top X S = S \text{Cov}(X) S \]

This differs from $\text{Cov}(X)$ unless $S = I$, and the principal directions are guaranteed to be preserved only under uniform scaling $S = cI$. When $S$ is diagonal with unequal entries, scaling stretches (or compresses) the variance along individual coordinate axes, changing the eigenstructure.

Explicit counterexample if false. Consider centered data with two features of different natural scales: $X$ has centered columns with $\text{Cov}(X) = \begin{bmatrix}1 & 0\\0 & 0.01\end{bmatrix}$ (most variance in the first feature). The principal component points along $\mathbf{v}_1 = (1, 0)^\top$, capturing feature 1.

Now scale the second feature by 100: $X' = X \cdot \text{diag}(1, 100)$. The new covariance is $\text{Cov}(X') = \begin{bmatrix}1 & 0\\0 & 100\end{bmatrix}$. The principal component now points along $\mathbf{e}_2 = (0, 1)^\top$, capturing the (now artificially enlarged) second feature. This is a complete reversal of which features are deemed “principal.”
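The reversal can be checked directly from the two covariance matrices. This is a small sketch; the helper `top_pc` is a hypothetical name introduced only for illustration.

```python
import numpy as np

def top_pc(cov):
    """Return the unit eigenvector for the largest eigenvalue of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues come back in ascending order
    return eigvecs[:, -1]

cov = np.diag([1.0, 0.01])        # Cov(X) from the example
S = np.diag([1.0, 100.0])         # scale the second feature by 100
cov_scaled = S @ cov @ S          # Cov(X') = S Cov(X) S

pc_before = top_pc(cov)
pc_after = top_pc(cov_scaled)
# Absolute values remove the arbitrary sign of an eigenvector.
print(np.abs(pc_before), np.abs(pc_after))
```

The leading direction moves from the first coordinate axis to the second, although the underlying data are unchanged up to units.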

Comprehension. PCA aligns with directions of high variance in the data. If features have different units or scales, the “high variance” features might be those with large measurement uncertainties or scales, not those with true signal variability. Standardization (scaling to unit variance) ensures that each feature contributes equally to the notion of variance, eliminating the dominance of large-scale features.

ML Applications. In multivariate analysis, PCA is usually preceded by standardization (z-score normalization) unless the user has a specific reason to weight features by their natural variance. In image processing, pixel intensities (0-255) naturally have the same scale across color channels, so standardization is less critical, but data from different sensors (e.g., temperature in Kelvin and humidity in percentage) should be standardized. In finance, price levels and volatilities differ widely across assets (some stocks are cheap, others expensive), so standardization is essential before PCA of returns; this is equivalent to working with the correlation matrix instead of the covariance matrix.

Failure Mode Analysis. Applying PCA without scaling causes: 1. Bias toward high-scale features: Features with large natural magnitudes (e.g., prices in millions vs. percentages) dominate PCA, even if they are less informative. 2. Unit dependence: Results change if data are re-expressed in different units (e.g., meters vs. millimeters), violating interpretability principles. 3. Misleading components: The “principal components” reflect data-encoding choices rather than underlying structure. 4. Comparison issues: PCA applied to unscaled vs. scaled data yields different results, making cross-study or cross-domain comparisons unreliable.

Traps. A trap is the phrase “PCA is invariant to scaling” appearing in some contexts; this is true only for orthogonal transformations, not general diagonal scaling. Another trap is assuming standardization is “preprocessing” that does not matter; it fundamentally changes the result. A third trap (especially in modern deep learning) is that neural networks can learn their own scalings via batch normalization, leading to complacency about feature scaling at model input—but this is delegating the decision to the model, which may not optimize for interpretability or efficiency.

A.8

Final Answer. False.

Full mathematical justification. A network’s Lipschitz constant is bounded by the product of layer Lipschitz constants only when the architecture is a pure composition without additive skip connections or scaling activations. Spectral normalization enforces $\|W\|_2=1$ for linear layers, but if there is a residual connection $f(x)=x+Wx$, then $\|f\|_\text{Lip} \leq 1+\|W\|_2 = 2$, not 1.

Explicit counterexample if false. Let $f(x)=x+Wx$ with $W = I$, so $\|W\|_2=1$. Then $f(x) = 2x$ and $\|f(x)-f(y)\|_2 = \|(I+W)(x-y)\|_2 = 2\|x-y\|_2$, so $f$ is 2-Lipschitz and not 1-Lipschitz. More generally, $\|I+W\|_2 = 2$ whenever $\|W\|_2 = 1$ and $W$ has eigenvalue 1.
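A short numerical sketch of the same point, assuming a random $4 \times 4$ weight matrix normalized to unit spectral norm (the sizes and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
W = M / np.linalg.norm(M, 2)          # spectral normalization: ||W||_2 = 1

# For a linear map, the Lipschitz constant equals the spectral norm.
lip_plain = np.linalg.norm(W, 2)
lip_residual = np.linalg.norm(np.eye(4) + W, 2)   # residual block x + Wx

# Worst case: W = I attains the upper bound 1 + ||W||_2 = 2.
lip_worst = np.linalg.norm(np.eye(4) + np.eye(4), 2)
print(lip_plain, lip_residual, lip_worst)
```

The residual block is only guaranteed to satisfy the bound $1 + \|W\|_2$, and the identity weight shows the bound is attained.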

Comprehension. Spectral normalization controls each linear map but does not automatically control architectures with additive paths.

ML Applications. GANs and ResNets require additional constraints beyond spectral normalization to guarantee global Lipschitz bounds.

Failure Mode Analysis. Assuming 1-Lipschitz behavior can lead to incorrect robustness and stability claims.

Traps. Treating network Lipschitz constants as simple products in the presence of skips or scalings.

A.9

Final Answer. True.

Full mathematical justification. Ridge regularization improves conditioning through a careful mathematical mechanism. If $A$ has singular values $\sigma_1 \geq \cdots \geq \sigma_n$, then $A^\top A$ has eigenvalues $\sigma_i^2$. When we add ridge regularization $\lambda I$, the normal equations become $(A^\top A + \lambda I)\mathbf{x} = A^\top \mathbf{b}$.

The eigenvalues of $A^\top A + \lambda I$ are $\{\sigma_i^2 + \lambda\}_{i=1}^n$. The condition number is:

\[ \kappa(A^\top A + \lambda I) = \frac{\sigma_1^2 + \lambda}{\sigma_n^2 + \lambda} \]

versus the unregularized condition number:

\[ \kappa(A^\top A) = \frac{\sigma_1^2}{\sigma_n^2} \]

To verify that regularization improves conditioning, we check:

\[ \frac{\sigma_1^2 + \lambda}{\sigma_n^2 + \lambda} < \frac{\sigma_1^2}{\sigma_n^2} \]

Cross-multiplying (valid since all terms are positive):

\[ (\sigma_1^2 + \lambda)\sigma_n^2 < \sigma_1^2(\sigma_n^2 + \lambda) \]

\[ \sigma_1^2 \sigma_n^2 + \lambda \sigma_n^2 < \sigma_1^2 \sigma_n^2 + \lambda \sigma_1^2 \]

\[ \lambda \sigma_n^2 < \lambda \sigma_1^2 \]

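Since $\lambda > 0$, this inequality holds strictly whenever $\sigma_1 > \sigma_n > 0$, confirming that the condition number decreases. If $\sigma_1 = \sigma_n$, both condition numbers equal 1 and nothing changes; if $\sigma_n = 0$, the unregularized condition number is infinite while the regularized one is finite, so the conclusion still holds.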

Comprehensive explanation of the mechanism: The regularization term $\lambda I$ adds the same constant $\lambda$ to every eigenvalue. The relative increase is largest for the smallest eigenvalues, so the ratio of largest to smallest moves toward 1, indicating better conditioning. For example, if $\sigma_1^2 = 1000$ and $\sigma_n^2 = 1$, the original condition number is 1000. Adding $\lambda = 10$ gives $\kappa = 1010/11 \approx 91.8$, a reduction of about 91%. With $\lambda = 100$, we get $\kappa = 1100/101 \approx 10.9$, about 1.1% of the original, dramatically stabilizing the system at the cost of bias.
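The arithmetic in this example can be reproduced with a few lines. The function name `ridge_condition_number` is a hypothetical helper, not part of any library.

```python
import numpy as np

def ridge_condition_number(sigma_sq, lam):
    """Condition number of A^T A + lam*I, given the eigenvalues sigma_i^2 of A^T A."""
    shifted = np.asarray(sigma_sq, dtype=float) + lam
    return shifted.max() / shifted.min()

sigma_sq = [1000.0, 1.0]              # largest and smallest eigenvalue of A^T A
for lam in (0.0, 10.0, 100.0):
    print(lam, ridge_condition_number(sigma_sq, lam))
```

The printed values fall from 1000 toward 1 with diminishing returns as $\lambda$ grows, which is why $\lambda$ should be chosen by validation and not by conditioning alone.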

Comprehension. Ridge regularization raises small eigenvalues proportionally more than large ones, narrowing the spread of eigenvalues and thus lowering the condition number. This improved conditioning reduces sensitivity to small perturbations in both $A$ and $\mathbf{b}$, making the solution more robust to noise and numerical errors.

ML Applications. Ridge regression (L2 regularization) is ubiquitous in machine learning precisely because it improves conditioning while maintaining reasonable prediction accuracy. In linear regression, adding ridge penalty reduces overfitting and stabilizes parameter estimates when features are correlated or nearly collinear. In Bayesian linear regression, ridge regularization corresponds to a Gaussian prior on parameters, philosophically connecting frequentist stability to Bayesian interpretation. In regularized least squares for ill-posed problems (inverse problems, deconvolution), ridge regularization is essential for computational feasibility and numerical stability.

Failure Mode Analysis. Over-regularization (choosing $\lambda$ too large) can cause: 1. Bias-variance tradeoff: While conditioning improves with larger $\lambda$, the bias of the estimator increases. The regularized solution $(A^\top A + \lambda I)^{-1}A^\top \mathbf{b}$ has smaller variance but larger systematic bias. 2. Underfitting: Excessively large $\lambda$ drives $\mathbf{x}$ toward zero, losing predictive power on fresh data. 3. Loss of information: If the smallest singular values correspond to important features, damping them via regularization can discard signal.

Traps. A trap is thinking that regularization always improves prediction accuracy; it always improves conditioning and reduces variance, but can increase bias beyond acceptable levels. Another trap is assuming that the relationship between $\lambda$ and condition number is linear; $\kappa$ decreases toward 1 with diminishing returns as $\lambda$ grows. A third trap (important in practice) is not cross-validating $\lambda$; choosing it based on conditioning alone ignores prediction performance and generalization.

A.10

Final Answer. False.

Full mathematical justification. The Frobenius norm is unitarily invariant on both sides: for orthogonal matrices $U, V$, we have $\|UAV\|_F = \|A\|_F$. This property is fundamental and reflects that Frobenius norm measures intrinsic “size” independent of coordinate systems.

If $A_k = \arg\min_{\text{rank}(X) \leq k} \|A - X\|_F$, then by unitary invariance, $UA_k$ minimizes $\|UA - Y\|_F$ over matrices $Y$ of rank at most $k$, because $\|UA - Y\|_F = \|A - U^\top Y\|_F$ and $U^\top Y$ ranges over all matrices of rank at most $k$. By the same argument, $A_k V$ minimizes $\|AV - Y\|_F$.

The statement incorrectly claims that only right multiplication preserves this invariance. In fact, both left and right orthogonal multiplications preserve the best rank-$k$ approximation structure:

\[ (UA)_k = U A_k \quad \text{and} \quad (AV)_k = A_k V \]

This is not a coincidence; it follows directly from the SVD: if $A = U_A \Sigma V_A^\top$, then $UA = (UU_A)\Sigma V_A^\top$ and $AV = U_A \Sigma(V^\top V_A)^\top$. Left multiplication by $U$ rotates the output space (replacing $U_A$ with $UU_A$, still orthogonal), while right multiplication by $V$ rotates the input space (replacing $V_A$ with $V^\top V_A$, still orthogonal). Both preserve rank and singular values.

Explicit counterexample if false. Consider $A = \text{diag}(2, 1, 0.1) \in \mathbb{R}^{3 \times 3}$ with rank-1 approximation $A_1 = \text{diag}(2, 0, 0)$. Let $U$ be the permutation matrix that swaps coordinates 1 and 3 (a permutation matrix is orthogonal). Then $UA = \begin{bmatrix}0 & 0 & 0.1\\0 & 1 & 0\\2 & 0 & 0\end{bmatrix}$, whose singular values are still $(2, 1, 0.1)$, and $(UA)_1 = \begin{bmatrix}0 & 0 & 0\\0 & 0 & 0\\2 & 0 & 0\end{bmatrix} = U A_1$, confirming left invariance.
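Both identities $(UA)_k = UA_k$ and $(AV)_k = A_kV$ can be checked on a random instance. The sketch below assumes a Gaussian test matrix, whose singular values are distinct with probability one so that the truncation is unique; `best_rank_k` is a hypothetical helper name.

```python
import numpy as np

def best_rank_k(A, k):
    """Truncated SVD: the best rank-k approximation in Frobenius norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Q_left, _ = np.linalg.qr(rng.standard_normal((6, 6)))    # orthogonal, acts on outputs
Q_right, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # orthogonal, acts on inputs
k = 2

left_ok = np.allclose(best_rank_k(Q_left @ A, k), Q_left @ best_rank_k(A, k))
right_ok = np.allclose(best_rank_k(A @ Q_right, k), best_rank_k(A, k) @ Q_right)
print(left_ok, right_ok)
```

Neither side is privileged: truncating before or after an orthogonal change of basis gives the same matrix.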

Comprehension. Multiplication by orthogonal matrices on either side leaves the optimal low-rank solution invariant (up to the same rotation) because it preserves the underlying geometry: the singular values and the orthogonality of the singular vectors. Unitary transformations are isometries (distance-preserving), so they do not change optimization landscapes in Frobenius norm.

ML Applications. In data preprocessing, orthogonal transformations (a change to the PCA basis, rotations for visualization) can be applied before low-rank compression without changing the approximation error: the optimal rank-$k$ matrix is simply rotated along with the data. Whitening is not orthogonal, since it rescales directions, and does not share this property. This explains why PCA rotation followed by low-rank truncation on the transformed data yields the same result as truncating the original data. In federated learning or multi-node synchronization, reparameterizing data via orthogonal transformations does not change low-rank structure detection.

Failure Mode Analysis. Mistakenly believing that only right (or left) multiplication is invariant can lead to: 1. Inconsistent compression strategies: applying left vs. right rotations and obtaining different results, leading to confusion about problem setup. 2. Inefficient algorithms: assuming one direction requires special handling, adding unnecessary complexity. 3. Theoretical errors: deriving bounds or stability results that incorrectly treat left and right asymmetrically.

Traps. A trap is the asymmetry in SVD notation ($A = U\Sigma V^\top$), which can psychologically suggest that $V$ (right singular vectors) is more “fundamental” than $U$. In reality, both are equally important for geometry. Another trap is conflating unitary invariance with other forms of invariance (e.g., translation invariance or scale invariance), which do not hold. A third trap in implementations is that left and right multiplication have different computational costs (one multiplies by $U$ first, the other by $V$), which can make one appear “more efficient” and subtly bias thinking toward asymmetry.

A.11

Final Answer. False.

Full mathematical justification. The intuition that flat singular spectra prevent low-rank approximation is based on reconstruction error analysis, which is necessary but not sufficient for predicting task-specific performance. Let me clarify the distinction.

Reconstruction error perspective: The error of a rank-$k$ approximation is $\|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2$. When singular values decay slowly (flat spectrum), many $\sigma_i$ are large, so discarding them incurs large reconstruction error. This is an objective fact about the data matrix $X$ itself.

Task performance perspective: However, whether reconstruction error matters for prediction depends entirely on the labels. If the target vector $\mathbf{y}$ aligns with low-energy directions (those corresponding to small singular values), then low-rank compression can destroy predictive information. Conversely, if $\mathbf{y}$ aligns with a small set of high-importance directions (regardless of their singular value magnitude), then low-rank compression that preserves those directions can maintain perfect accuracy.

Explicit counterexample if false. Let $X \in \mathbb{R}^{100 \times 100}$ have singular values all equal to 1 (perfectly flat spectrum), with SVD $X = \sum_{i=1}^{100} \mathbf{u}_i \mathbf{v}_i^\top$. All directions are “equally important” from a reconstruction perspective. Now suppose the labels depend on a single direction: the true coefficient vector is $\mathbf{x}^\star = \mathbf{v}_1$, so $\mathbf{y} = X\mathbf{x}^\star = \mathbf{u}_1$. The rank-1 approximation $X_1 = \mathbf{u}_1 \mathbf{v}_1^\top$ gives $X_1 \mathbf{x}^\star = \mathbf{u}_1 (\mathbf{v}_1^\top \mathbf{v}_1) = \mathbf{u}_1 = \mathbf{y}$, so it reproduces the labels exactly. The reconstruction error is $\|X - X_1\|_F^2 = \sum_{i=2}^{100} 1 = 99$ (99% of the total energy), but the predictive error is zero. Because the spectrum is flat, the SVD is not unique, so an unsupervised truncation has no reason to select $\mathbf{v}_1$; the point is that a rank-1 compression preserving the label-relevant direction exists.
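A numerical sketch of a flat-spectrum matrix with a rank-1 predictive structure follows. It assumes a random orthogonal $X$ (all singular values equal to 1) and takes the first coordinate axis as the label-relevant right singular direction; both choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# An orthogonal matrix has all singular values equal to 1 (flat spectrum).
X, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Label-relevant direction v1 and the labels it generates.
v1 = np.zeros(n)
v1[0] = 1.0
u1 = X @ v1                     # unit vector, because X is orthogonal
y = X @ v1

# Rank-1 compression that preserves v1: X1 = u1 v1^T.
X1 = np.outer(u1, v1)
recon_err_sq = np.linalg.norm(X - X1, "fro") ** 2
pred_err = np.linalg.norm(X1 @ v1 - y)
print(recon_err_sq, pred_err)
```

Reconstruction error is large while the predictions on the label-relevant direction are exact, so the two criteria can disagree completely.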

Comprehension. Reconstruction error and predictive error are measuring different things. Reconstruction error quantifies “how well does a rank-$k$ matrix approximate all entries of $X$?”, while predictive error asks “how well do predictions $X_k \mathbf{x}$ match the true outputs $\mathbf{y}$?” A flat spectrum means the matrix has energy in many directions, but if the labels depend on a sparse subset of those directions, the subset can be compressed while maintaining accuracy.

ML Applications. In feature selection and dimensionality reduction, we should not blindly trust singular value decay. Instead, we should validate compression on labeled data. SVD alone answers “which directions capture most variance in $X$?”, not “which directions are predictive for $\mathbf{y}$?”. Supervised methods (e.g., partial least squares, which looks at covariance between $X$ and $\mathbf{y}$, or least-squares estimation) are appropriate when labels are available. Flat spectra in unlabeled data can coexist with highly predictive sparse structure once labels are revealed.

Failure Mode Analysis. Overreliance on spectral decay causes: 1. Unnecessary retention: Keeping all directions because the spectrum is flat, even though most are irrelevant to predictions. 2. Missed compression opportunities: Failing to aggressively compress when task structure is sparse, falsely believing flat spectra prevent it. 3. Circular reasoning: Using PCA for feature selection (based on spectral decay), then claiming the resulting problem is “high-dimensional,” when supervised feature selection would have identified a low-rank structure.

Traps. A trap is conflating “variance” with “importance.” High variance in $X$ (driven by flat singular values) does not mean high importance for predicting $\mathbf{y}$. Another trap is thinking PCA is always the right preprocessing step; it is optimal for reconstruction, not for prediction. A third trap is using spectral decay as a stopping criterion for compression without validating on a validation set or running cross-validation with the downstream task.

A.12

Final Answer. True.

Full mathematical justification. A linear autoencoder consists of an encoder $\mathbf{h} = W^\top\mathbf{x}$ and a decoder $\hat{\mathbf{x}} = W'\mathbf{h}$, where $\mathbf{x} \in \mathbb{R}^d$, $\mathbf{h} \in \mathbb{R}^k$, and $W, W' \in \mathbb{R}^{d \times k}$. When weights are tied (i.e., $W' = W$, so the decoder matrix is the transpose of the encoder matrix $W^\top$), the reconstruction is $\hat{\mathbf{x}} = WW^\top \mathbf{x}$. For a data matrix $X \in \mathbb{R}^{n \times d}$ with rows as samples, the reconstruction is $\hat{X} = XWW^\top$.

The reconstruction error is:

\[ \|X - XWW^\top\|_F^2 = \|X(I - WW^\top)\|_F^2 \]

Assume $W$ has orthonormal columns, $W^\top W = I_k$ (the normalization under which the tied-weight problem is analyzed). Then $WW^\top$ is the orthogonal projector onto the column space of $W$, and by the Pythagorean theorem this becomes:

\[ \|X - XWW^\top\|_F^2 = \|X\|_F^2 - \|XWW^\top\|_F^2 \]

To minimize this, we need to maximize $\|XWW^\top\|_F^2 = \text{tr}(WW^\top X^\top XWW^\top) = \text{tr}(W^\top X^\top X W)$ subject to $W^\top W = I_k$. By the variational characterization of eigenvalues (the Ky Fan maximum principle, which extends the Rayleigh quotient to $k$ orthonormal vectors), a maximizer is the matrix whose columns are the top $k$ eigenvectors of $X^\top X$, which are precisely the top $k$ right singular vectors of $X$.

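Therefore, any global minimizer of the linear autoencoder loss (with tied weights and orthonormal columns) has the form $W = V_k Q$, where the columns of $V_k$ are the top $k$ right singular vectors of $X$ and $Q \in \mathbb{R}^{k \times k}$ is orthogonal. The reconstruction is $\hat{X} = XV_k V_k^\top = U_k \Sigma_k V_k^\top$, independent of $Q$. The subspace itself is unique when $\sigma_k > \sigma_{k+1}$; ties at the cutoff allow different top-$k$ subspaces with the same error.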
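The optimality of $V_k$ can be sanity-checked against random orthonormal alternatives. This is a sketch on synthetic data with an assumed decaying column scale; it compares losses and does not train an autoencoder.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data, rows as samples, with decaying variance per feature (assumption).
X = rng.standard_normal((200, 6)) @ np.diag([5.0, 4.0, 3.0, 0.5, 0.3, 0.1])
X -= X.mean(axis=0)
k = 3

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                      # top-k right singular vectors, shape (d, k)

def recon_error(W):
    """Tied-weight linear autoencoder loss ||X - X W W^T||_F^2."""
    return np.linalg.norm(X - X @ W @ W.T, "fro") ** 2

err_pca = recon_error(V_k)
# Losses for random d x k matrices with orthonormal columns.
errs_random = [recon_error(np.linalg.qr(rng.standard_normal((6, k)))[0])
               for _ in range(20)]
print(err_pca, min(errs_random), np.sum(s[k:] ** 2))
```

The PCA loss matches the discarded spectral energy $\sum_{i>k}\sigma_i^2$ and is no larger than any of the random candidates; rotating $V_k$ by an orthogonal $k \times k$ matrix leaves the loss unchanged.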

Comprehensive mechanism: This result shows that linear autoencoders are equivalent to PCA in the sense that they learn the same subspace. The loss function of the autoencoder, when minimized exactly, yields the principal subspace. This is not a coincidence of optimization but a fundamental consequence of the quadratic form and the orthogonality constraint.

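Comprehension. Linear autoencoders without nonlinearities act as unsupervised dimensionality reduction tools that recover the principal subspace. The bottleneck activations are the PCA projections, up to an orthogonal change of basis within that subspace. The orthonormal-basis form of the equivalence relies on tied weights and MSE loss: an untied linear autoencoder still spans the principal subspace at a global minimum, but its encoder and decoder are then determined only up to an invertible $k \times k$ transformation, and other losses change the solution.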

ML Applications. This result justifies using linear autoencoders for feature extraction and initialization. In practice, network weights are sometimes initialized from SVD or PCA, and this theory explains why: the linear autoencoder (which is easy to analyze) provides optimal solutions that can serve as starting points for more complex models. In representation learning, adding nonlinearities breaks the equivalence but can learn richer representations; understanding the linear case isolates the role of architecture from the role of nonlinearity.

Failure Mode Analysis. Assuming the equivalence holds when: 1. Weights are untied: With separate encoder and decoder matrices, a global minimizer of the MSE loss still spans the principal subspace, but the learned factors need not be orthonormal or transposes of each other, so individual hidden units do not correspond to principal components. 2. Nonlinearities are added: Adding $\phi$ in the encoder or decoder (so that $\mathbf{h}$ is a nonlinear function of $\mathbf{x}$) breaks linearity and the equivalence to PCA. 3. Different loss: Using $L_1$ loss or other losses changes the optimization problem and its solution. 4. Reconstructing different targets: If decoding targets $\mathbf{y}$ different from $\mathbf{x}$, the problem is supervised regression, not dimensionality reduction.

Traps. A trap is thinking autoencoders are always equivalent to PCA; this holds only in the linear, MSE-loss regime, and the orthonormal-basis form also needs tied weights. Modern autoencoders violate at least one of these assumptions. Another trap is using the equivalence to argue that nonlinear autoencoders are “just nonlinear PCA”; the nonlinearity changes the solution fundamentally. A third trap (practical) is trusting that an autoencoder has converged to the global minimum; the loss is non-convex in the weights, and gradient training can stall near saddle points or be stopped early, so the learned subspace may differ from the principal one even with the right architecture.

A.13

Final Answer. True.

Full mathematical justification. By definition, the squared Frobenius norm aggregates all singular values:

\[ \|A\|_F^2 = \sum_{i=1}^r \sigma_i^2 \]

The squared spectral norm captures only the largest singular value:

\[ \|A\|_2^2 = \sigma_1^2 \]

Therefore:

\[ \|A\|_F^2 = \sigma_1^2 + \sum_{i=2}^r \sigma_i^2 \geq \sigma_1^2 = \|A\|_2^2 \]

since all squared singular values are non-negative. Equality holds if and only if $\sigma_i = 0$ for all $i > 1$, which occurs if and only if $\text{rank}(A) \leq 1$ (that is, rank 1 for any nonzero $A$).

Detailed analysis of when equality holds: Rank-1 matrices have a special structure: they can be written as $A = \mathbf{u} \mathbf{v}^\top$ for vectors $\mathbf{u}, \mathbf{v}$. The SVD of such a matrix has only one nonzero singular value: $A = \sigma_1 \mathbf{u}_1 \mathbf{v}_1^\top$ with all $\sigma_i = 0$ for $i > 1$. With this structure:

\[ \|A\|_F^2 = \sigma_1^2 = \|A\|_2^2 \]

Conversely, if $\|A\|_F^2 = \|A\|_2^2$, then $\sum_{i=1}^r \sigma_i^2 = \sigma_1^2 + \sum_{i=2}^r \sigma_i^2 = \sigma_1^2$, which implies $\sum_{i=2}^r \sigma_i^2 = 0$, so every singular value beyond the first vanishes, forcing rank 1 (for nonzero $A$).
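The equality case can be illustrated numerically with the ratio $\|A\|_F^2 / \|A\|_2^2$, which equals 1 for a rank-1 matrix and is at most $\text{rank}(A)$ in general. The helper name `norm_ratio_sq` is hypothetical.

```python
import numpy as np

def norm_ratio_sq(A):
    """Return ||A||_F^2 / ||A||_2^2, which lies between 1 and rank(A)."""
    return np.linalg.norm(A, "fro") ** 2 / np.linalg.norm(A, 2) ** 2

rng = np.random.default_rng(0)
u, v = rng.standard_normal(5), rng.standard_normal(4)
rank_one = np.outer(u, v)               # A = u v^T has a single nonzero singular value
generic = rng.standard_normal((5, 4))   # full rank with probability one

print(norm_ratio_sq(rank_one), norm_ratio_sq(generic))
```

In floating point the rank-1 ratio is 1 only up to rounding, so a tolerance-based comparison is the practical form of the test.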

Comprehension. The Frobenius norm “spreads” importance across all singular values (a $2$-norm of the singular values vector), while the spectral norm picks out the largest one. When only one direction is significant, these coincide. When multiple directions carry energy, the Frobenius norm is larger.

ML Applications. Rank-1 matrices are extreme cases of compression: an $m \times n$ rank-1 matrix requires only $m + n$ parameters (storage for two vectors) compared to $mn$ for the full matrix. Using $\|A\|_F = \|A\|_2$ as a test for rank-1 structure enables quick screening in applications. For example, in low-rank approximation, if a candidate matrix satisfies $\|A\|_F = \|A\|_2$, it is rank-1 and represented compactly. This equivalence also helps in theoretical analysis: proving properties for rank-1 matrices often provides building blocks for understanding low-rank matrices.

Failure Mode Analysis. Misusing this result leads to: 1. Confusing norms for unrelated purposes: Using $\|A\|_F = \|A\|_2$ to test rank-1 is valid, but using Frobenius norm to bound worst-case amplification (where spectral norm is appropriate) gives a valid but loose bound that can overestimate the true gain by a factor of up to $\sqrt{r}$. 2. Over-compression based on inequality: Seeing $\|A\|_F \geq \|A\|_2$ and concluding that Frobenius norm is always “safer” or more stable (it is not; they measure different properties). 3. Numerical confusion: In floating-point arithmetic, near-rank-1 matrices may satisfy $\|A\|_F \approx \|A\|_2$ within rounding error, leading to misclassification of rank.

Traps. A trap is thinking that the inequality $\|A\|_F \geq \|A\|_2$ is always nearly tight; in general $\|A\|_2 \leq \|A\|_F \leq \sqrt{r}\,\|A\|_2$, so the ratio can be as large as $\sqrt{r}$ when $r$ singular values are comparable. Another trap is assuming that minimizing Frobenius norm (in regularization) is equivalent to minimizing spectral norm; they solve very different problems. A third trap (practical) is using the equality test for rank-1 detection numerically; due to noise and rounding, one should instead check if the ratio $\|A\|_F^2 / \|A\|_2^2$ is close to 1 within a tolerance.

A.14

Final Answer. False.

Full mathematical justification. In kernel PCA, the relationship between singular values of the feature map and eigenvalues of the Gram matrix involves a subtle but critical squaring operation. Let $\Phi \in \mathbb{R}^{n \times m}$ be the feature matrix (rows are the feature vectors of the $n$ samples) and let $\Phi_c$ be the centered version (the mean feature vector is subtracted from every row). The Gram (kernel) matrix is:

\[ K = \Phi_c \Phi_c^\top \in \mathbb{R}^{n \times n} \]

i.e., $K$ is the Gram matrix of sample-space inner products. By the spectral theorem, $K = U_K \Lambda_K U_K^\top$ where $\Lambda_K = \text{diag}(\lambda_1, \ldots, \lambda_n)$ are eigenvalues.

Now, the SVD of $\Phi_c$ is $\Phi_c = U \Sigma V^\top$ where $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_r)$. Then:

\[ K = \Phi_c \Phi_c^\top = (U\Sigma V^\top)(V\Sigma^\top U^\top) = U\Sigma^2 U^\top \]

Comparing with the eigendecomposition of $K$, we have: - Eigenvectors of $K$ are $U$ (the left singular vectors of $\Phi_c$): correct! - Eigenvalues of $K$ are $(\sigma_1^2, \ldots, \sigma_r^2)$, i.e., $\lambda_i = \sigma_i^2$.

Therefore:

\[ \sigma_i(\Phi_c) = \sqrt{\lambda_i(K)} \]

NOT $\sigma_i(\Phi_c) = \lambda_i(K)$.

Explicit counterexample if false. Suppose the kernel matrix is $K = \text{diag}(9, 4, 1)$, so its eigenvalues are $(9, 4, 1)$. Then the singular values of $\Phi_c$ are $(\sqrt{9}, \sqrt{4}, \sqrt{1}) = (3, 2, 1)$, not $(9, 4, 1)$. If a practitioner mistakenly treats $(9, 4, 1)$ as singular values and squares them, the explained variance of the first component is computed as $81/(81+16+1) = 81/98 \approx 0.827$, when the true explained variance is $9/(9+4+1) = 9/14 \approx 0.643$. The error overstates the first component by more than 18 percentage points.
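The relation $\lambda_i(K) = \sigma_i(\Phi_c)^2$ and its effect on variance ratios can be checked with an explicit feature matrix. The sketch assumes a small random $\Phi$ with rows as samples (a linear-kernel stand-in, since a general kernel's feature map is not available explicitly).

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((8, 5))            # rows are samples (assumed layout)
Phi_c = Phi - Phi.mean(axis=0)               # subtract the mean feature vector

K = Phi_c @ Phi_c.T                          # n x n Gram matrix
lam = np.sort(np.linalg.eigvalsh(K))[::-1]   # eigenvalues of K, descending
sigma = np.linalg.svd(Phi_c, compute_uv=False)

r = len(sigma)
ratio_correct = lam[0] / lam[:r].sum()             # variance is proportional to sigma^2
ratio_wrong = lam[0] ** 2 / np.sum(lam[:r] ** 2)   # eigenvalues squared a second time
print(np.allclose(lam[:r], sigma ** 2), ratio_correct, ratio_wrong)
```

Squaring the kernel eigenvalues again inflates the share attributed to the first component, which is the overstated-variance failure described in this answer.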

Comprehension. The squaring relationship arises from the definition of Gram matrix as a product $\Phi_c \Phi_c^\top$. This is a fundamental relationship in linear algebra: the eigenvalues of a Gram matrix are squared singular values of the original matrix. The mnemonic is: “Gram matrix squared” = eigenvalues are squared singular values.

ML Applications. In kernel PCA implementations, explained variance is proportional to the eigenvalues of $K$ (equivalently, to $\sigma_i^2$), not to their squares or their square roots. If a pipeline squares the kernel eigenvalues as though they were singular values, the leading components are overstated; if it uses their square roots, they are understated. When using a library such as scikit-learn’s KernelPCA, check whether the reported spectrum consists of eigenvalues of the centered kernel matrix or of singular values before forming variance ratios. In feature visualization and understanding kernel methods, the relationship clarifies that kernel PCA works in feature space (where singular values are square roots of kernel eigenvalues) but the kernel itself summarizes in sample space (where eigenvalues are the squared singular values).

Failure Mode Analysis. Confusing kernel eigenvalues with singular values causes: 1. Overstated variance: Reporting explained variance as $(\lambda_1)^2 / \sum_i (\lambda_i)^2$ instead of $\lambda_1 / \sum_i \lambda_i$, leading to inflated importance for early components. 2. Wrong stopping criterion: Selecting number of components based on eigenvalues treated as singular values, choosing too few components and losing information. 3. Scaling errors: When normalizing components or computing residuals, using wrong scaling factors based on confused singular values.

Traps. A trap is notation: both are often written as $\lambda$, obscuring the relationship. Another trap is that for positive definite kernels (like RBF), all eigenvalues are positive, seemingly “looking like” singular values; but the squaring still applies. A third trap is implementations that silently apply the square root without documenting it, or omit the step entirely, causing subtle bugs that only surface in variance computations.

A.15

Final Answer. True.

Full mathematical justification. The continuity of the pseudoinverse has two distinct aspects that must be carefully separated.

Part 1: Continuity in $b$. The map $b \mapsto A^+ b$ is linear (since $A^+$ is a fixed matrix), and any linear map between finite-dimensional spaces with the Euclidean norm is continuous. Specifically, for any $\mathbf{b}_1, \mathbf{b}_2$:

\[ \|A^+\mathbf{b}_1 - A^+\mathbf{b}_2\|_2 = \|A^+(\mathbf{b}_1 - \mathbf{b}_2)\|_2 \leq \|A^+\|_2 \|\mathbf{b}_1 - \mathbf{b}_2\|_2 \]

So the map is Lipschitz continuous with constant $\|A^+\|_2 = 1/\sigma_r(A)$, where $\sigma_r(A)$ is the smallest nonzero singular value of $A$. Thus continuity in $\mathbf{b}$ is guaranteed.

Part 2: Lack of continuity in $A$. The statement that the pseudoinverse is not continuous in $A$ near rank-deficient points is the subtle and important part. The pseudoinverse of $A$ is:

\[ A^+ = V\Sigma^+ U^\top \]

where $\Sigma^+ = \text{diag}(1/\sigma_1, \ldots, 1/\sigma_r, 0, \ldots, 0)$ (reciprocals of the nonzero singular values, zeros for the zero singular values).

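When a nonzero singular value $\sigma_k$ is perturbed to $\sigma_k + \epsilon$, the corresponding entry of $\Sigma^+$ changes from $1/\sigma_k$ to $1/(\sigma_k + \epsilon)$, which is a small change as long as $\sigma_k$ stays well away from zero. The trouble arises at a zero singular value: the unperturbed pseudoinverse has a $0$ in that position, but any perturbation to $\epsilon > 0$ replaces it with $1/\epsilon$, which is arbitrarily large for small $\epsilon$. A change of rank therefore produces a jump in $A^+$.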

Explicit counterexample if false. Consider the family $A_\epsilon = \text{diag}(1, \epsilon) \in \mathbb{R}^{2 \times 2}$ for $\epsilon > 0$. Then:

\[ A_\epsilon^+ = \text{diag}(1, 1/\epsilon) \]

As $\epsilon \to 0$, $A_\epsilon \to A_0 = \text{diag}(1, 0)$, whose pseudoinverse is $A_0^+ = \text{diag}(1, 0)$. Yet $\|A_\epsilon^+ - A_0^+\|_2 = 1/\epsilon \to \infty$, so $A_\epsilon^+$ does not converge to $A_0^+$ and in fact diverges in every matrix norm. This shows the map $A \mapsto A^+$ is discontinuous at rank-deficient points.
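The jump can be observed with `numpy.linalg.pinv` on the same diagonal family. The values of $\epsilon$ below are arbitrary and chosen to stay above the routine's default cutoff for treating singular values as zero.

```python
import numpy as np

A0 = np.diag([1.0, 0.0])
pinv_A0 = np.linalg.pinv(A0)                 # equals diag(1, 0)

for eps in (1e-2, 1e-4, 1e-6):
    A_eps = np.diag([1.0, eps])
    input_gap = np.linalg.norm(A_eps - A0, 2)                        # shrinks like eps
    output_gap = np.linalg.norm(np.linalg.pinv(A_eps) - pinv_A0, 2)  # grows like 1/eps
    print(eps, input_gap, output_gap)
```

As the input matrices approach $A_0$, their pseudoinverses move farther from $A_0^+$, which is the discontinuity in concrete form.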

When does discontinuity matter? The discontinuity arises specifically when: 1. A small singular value is approaching zero (numerically or truly). 2. The perturbation is in that small singular value. 3. We try to invert it (as the pseudoinverse does).

If all singular values are well-separated from zero and remain so under perturbation, the pseudoinverse remains stable.

Comprehensive analysis: The root cause is that pseudoinverse involves inverting small singular values ($1/\sigma_i$). Inverses are highly sensitive to perturbations when the inputs are small. This is distinct from the sensitivity of least-squares solutions to changes in $\mathbf{b}$ (which is controlled by the condition number), because here we are changing the matrix $A$ itself, particularly its rank.

Comprehension. The pseudoinverse is robust in $\mathbf{b}$ but fragile in $A$ near rank changes. This reflects a fundamental asymmetry: the structure of $A$ (especially its rank) is discontinuous when singular values approach thresholds. Noise or perturbations that cross a threshold (e.g., moving a small singular value from $10^{-10}$ to $10^{-11}$) can trigger rank changes and sudden jumps in the pseudoinverse.

ML Applications. This has profound implications for regularization. Unregularized pseudoinverse solutions of $A\mathbf{x} = \mathbf{b}$ can be highly unstable in presence of noise because: - Measurement noise can change small singular values. - Model misspecification can introduce near-dependencies. - Numerical errors can corrupt the assumed rank structure.

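Ridge regression (adding $\lambda I$) avoids this by filtering small singular values smoothly: instead of the factor $1/\sigma_i$, the regularized solution operator $(A^\top A +\lambda I)^{-1} A^\top$ applies $\sigma_i/(\sigma_i^2+\lambda)$, which is bounded by $1/(2\sqrt{\lambda})$ and tends to $0$ as $\sigma_i \to 0$. The resulting map is continuous in $A$ even when the true rank is ambiguous.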

Failure Mode Analysis. Unregularized pseudoinverse solutions cause: 1. Instability in noisy data: Small measurement noise on the design matrix can produce wildly different solutions. 2. Rank ambiguity: If the true rank is ambiguous (i.e., smallest singular values are close to the noise level), the pseudoinverse can jump between solutions as noise fluctuates. 3. Poor generalization: Solutions that are sensitive to minute changes in the training data generalize poorly to fresh data.

Traps. A trap is assuming that since least-squares solutions are continuous in $\mathbf{b}$, they are continuous in $A$ too; the latter fails whenever a perturbation can change the rank. Another trap is relying on the pseudoinverse for ill-posed problems without regularization; the mathematical object exists (pseudoinverse), but solutions are unreliable. A third trap (practical) is that some software computes the pseudoinverse using SVD with a fixed numerical tolerance for “near-zero” singular values; changing the tolerance can discontinuously change the output, mimicking true discontinuity.

A.16

Final Answer. False.

Full mathematical justification. In linear regression with minibatch stochastic gradient descent (SGD), the minibatch design matrix is $X_b \in \mathbb{R}^{b \times d}$ (b samples, d features), where each row is a sample. The gradient of the MSE loss is:

\[ \nabla_{\mathbf{w}} \frac{1}{b}\|X_b \mathbf{w} - \mathbf{y}_b\|_2^2 = \frac{2}{b}X_b^\top(X_b \mathbf{w} - \mathbf{y}_b) \]

The gradient lies in the row space of $X_b$, i.e., $\text{span}(\text{rows of } X_b)$. By SVD, the rows of $X_b$ are spanned by the right singular vectors $V \in \mathbb{R}^{d \times r}$ of $X_b$ (orthonormal directions in the $d$-dimensional parameter space). More formally:

\[ \text{row space}(X_b) = \text{span}(V_1, \ldots, V_r) \]

where $V = (V_1 \cdots V_r)$ are the right singular vectors.

The gradient is thus:

\[ \nabla_{\mathbf{w}} = \frac{2}{b}X_b^\top(X_b \mathbf{w} - \mathbf{y}_b) = \sum_{i=1}^r (\text{component in direction } V_i) \]

In contrast, the left singular vectors $U \in \mathbb{R}^{b \times r}$ span the sample space: $\text{col}(X_b) = \text{span}(U_1, \ldots, U_r)$. These are directions in the b-dimensional sample space, completely different from the parameter space where gradients live.

Explicit counterexample if false. Let $X_b = \begin{bmatrix}1 & 0\\0 & 0\end{bmatrix}$ (rank 1, with $b = d = 2$). The compact SVD is $X_b = U\Sigma V^\top$ with: - $U = \begin{bmatrix}1\\0\end{bmatrix}$ (left singular vector: a direction in the 2D sample space; it selects the first sample) - $\Sigma = \begin{bmatrix}1\end{bmatrix}$ - $V^\top = (1, 0)$ (right singular vector: the feature direction; points along the first feature dimension)

If $\mathbf{w} = (w_1, w_2)^\top$ and $\mathbf{y}_b = (1, 0)^\top$, the gradient is:

\[ \nabla_{\mathbf{w}} = \frac{2}{b}X_b^\top(X_b\mathbf{w} - \mathbf{y}_b) = \begin{bmatrix}1 & 0\\0 & 0\end{bmatrix}\begin{bmatrix}w_1 - 1\\0\end{bmatrix} = \begin{bmatrix}w_1 - 1\\0\end{bmatrix} \quad (b = 2, \text{ so } 2/b = 1) \]

The gradient is a multiple of $V = (1, 0)^\top$, the right singular vector. In this small example $U$ happens to have the same coordinates, $(1, 0)^\top$, but it is a different object: $U$ is a direction in sample space (which sample to upweight), while the gradient lives in parameter space and is spanned by $V$ (which feature direction to update). When $b \neq d$, the two vectors do not even have the same length.
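
Because $U$ and $V$ coincide numerically in the $2 \times 2$ example, a non-square minibatch makes the distinction easier to see. The sketch below assumes a hypothetical random minibatch with $b = 3$ samples and $d = 5$ features; it checks that the MSE gradient is unchanged by projection onto the span of the right singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
b, d = 3, 5  # hypothetical minibatch size and feature count
X_b = rng.standard_normal((b, d))
y_b = rng.standard_normal(b)
w = rng.standard_normal(d)

# Gradient of (1/b) * ||X_b w - y_b||^2 with respect to w.
grad = (2.0 / b) * X_b.T @ (X_b @ w - y_b)

# Compact SVD: U is (b, b) in sample space, V is (d, b) in parameter space.
U, s, Vt = np.linalg.svd(X_b, full_matrices=False)
V = Vt.T

# Projecting the gradient onto span(V) leaves it unchanged.
grad_in_row_space = V @ (V.T @ grad)
residual = np.linalg.norm(grad - grad_in_row_space)

print("gradient shape:", grad.shape, "| U shape:", U.shape, "| V shape:", V.shape)
print("distance from row space:", residual)
```

The gradient has $d$ entries while each column of $U$ has only $b$, so the gradient cannot be a combination of left singular vectors here.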

Comprehension. This is a fundamental distinction: sample space ($b$-dimensional, indexed by the samples in the minibatch) and parameter space ($d$-dimensional, indexed by features). The SVD of a data matrix $X_b$ reveals structure in both spaces: $U$ describes sample space, $V$ describes parameter space. Gradients are computed in parameter space, so they live in the row space of $X_b$, which is spanned by $V$.

ML Applications. Understanding this distinction helps in distributed deep learning and federated learning: - The right singular vectors $V$ (parameter space) describe which features are active and should be updated. - The left singular vectors $U$ (sample space) describe which samples are influential in the current minibatch. - Low-rank approximations of $X_b$ in the $V$ directions reduce communication in parameter updates (LoRA, low-rank updates). - Resampling based on $U$ exploits sample importance, which is a different tool (importance resampling).

Mixing these up can lead to incorrect feature selection or sample reweighting strategies.

Failure Mode Analysis. Confusing left and right singular spaces causes: 1. Incorrect feature selection: Identifying important features by analyzing left singular vectors (sample correlations) instead of right singular vectors (feature space). 2. Wrong reweighting: Reweighting samples based on right singular vectors (feature importance) instead of left (sample influence). 3. Compressing gradients incorrectly: Using left singular vectors to compress and communicate gradients leads to selecting which samples to communicate (wrong) instead of which features (correct).

Traps. A trap is the notational asymmetry in SVD: $A = U\Sigma V^\top$ makes the roles explicit, but in practice, one can accidentally flip them when thinking about spaces. The mnemonic “U is for Users (or samples), V is for Variables (or features)” helps but can be fuzzy. Another trap is that in some applications (e.g., recommender systems), the data matrix $A$ represents user-item pairs, and both $U$ and $V$ have semantic meaning (users and items); conflating them makes recommendations wrong.

A.17

Final Answer. True.

Full mathematical justification. The nuclear norm is defined as the sum of singular values:

\[ \|A\|_* = \sum_{i=1}^r \sigma_i(A) \]

A cornerstone result in convex analysis states that the nuclear norm is the convex envelope of the rank function on the spectral norm unit ball $\mathcal{B} = \{A : \|A\|_2 \leq 1\}$. Let us define what “convex envelope” means and verify the claim.

Definition: The convex envelope (or convex hull) of a function $f$ over a convex set $C$ is the largest convex function $g$ such that $g(x) \leq f(x)$ for all $x \in C$. Equivalently (up to closure), $g$ is the pointwise supremum of all affine lower bounds of $f$ on $C$.

Proof sketch: For any $A \in \mathcal{B}$, decompose it as $A = \sum_i \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$. Since $A$ has operator norm at most 1, each $\sigma_i \in [0,1]$. Now:

\[ \text{rank}(A) = \#\{i : \sigma_i > 0\} \]

and

\[ \|A\|_* = \sum_i \sigma_i \]

Since each $\sigma_i \in [0,1]$, we have $\sigma_i \leq \mathbf{1}_{\sigma_i > 0}$ (the value is at most the indicator: both are $0$ when $\sigma_i = 0$, and $\sigma_i \leq 1$ otherwise). Summing over $i$:

\[ \|A\|_* = \sum_i \sigma_i \leq \sum_i \mathbf{1}_{\sigma_i > 0} = \text{rank}(A) \]

Thus $\|A\|_*$ lower-bounds rank. To show maximality (tightness), consider any other convex function $h$ with $h(A) \leq \text{rank}(A)$ for all $A \in \mathcal{B}$. By unitary invariance (both rank and nuclear norm are unitary-invariant), we can restrict to diagonal matrices $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_r)$ with $\sigma_i \in [0,1]$. On this set, rank is the counting function (number of nonzero entries), and its convex envelope is the $\ell^1$ norm: $\sum_i \sigma_i$. Thus $h(\Sigma) \leq \sum_i \sigma_i = \|\Sigma\|_*$, proving nuclear norm is maximal.

Intuition: Nuclear norm “softens” the rank function: instead of counting nonzero singular values (which is discrete and non-convex), it sums them (continuous and convex). This is the minimal convex relaxation under the constraint $\|A\|_2 \leq 1$.

Comprehension. Nuclear norm minimization is the tightest convex proxy for rank minimization when the matrix has bounded spectral norm. Solving $\min \|A\|_*$ subject to constraints is provably optimal among all convex relaxations satisfying natural assumptions. This justifies the widespread use of nuclear norm in matrix completion and low-rank learning.

ML Applications. In matrix completion (Netflix Prize), nuclear norm minimization is the convex proxy for the NP-hard problem $\min \text{rank}(X)$ subject to $P_\Omega(X) = P_\Omega(M)$. Under incoherence conditions (Candès–Recht), minimizing $\|X\|_*$ recovers a low-rank matrix from partial observations. In compressed sensing, nuclear norm minimization recovers low-rank matrices from linear measurements. However, the result is stated over the unit ball; for matrices with unbounded spectral norm, nuclear norm can be a loose convex proxy, and other techniques (e.g., augmented Lagrangian, ADMM) may be needed.

Failure Mode Analysis. Misusing nuclear norm as a convex relaxation causes: 1. Unbounded case: If $\|A\|_2$ is not controlled, nuclear norm minimization can produce solutions with low nuclear norm but high rank, defeating the purpose. 2. Bias in estimates: Nuclear norm minimization biases toward low-rank structures even if the true model is not low-rank, leading to underfitting in non-low-rank regimes. 3. Computational cost: Algorithms for nuclear norm minimization (typically based on semidefinite programming) often scale as $O(n^3)$ or higher, limiting scalability.

Traps. A trap is treating nuclear norm and rank as nearly equivalent; they are related only under the constraint $\|A\|_2 \leq 1$. Outside this ball, nuclear norm can be arbitrarily loose. Another trap is applying nuclear norm minimization without verifying that the true solution is indeed of low rank; if not, biased recovery occurs. A third trap is computational: not recognizing that nuclear norm minimization itself requires solving an optimization problem (typically SDP), which adds computational overhead beyond what is required to state the result theoretically.

A.18

Final Answer. False.

Full mathematical justification. The statement claims that if two matrices have the same set of singular values (ignoring order), they have the same Frobenius distance. This is false because the Frobenius distance also depends on how the singular vectors are aligned.

The Hoffman-Wielandt inequality states: for matrices $A, B \in \mathbb{R}^{m \times n}$ with singular values $\sigma_i(A), \sigma_i(B)$ appropriately ordered,

\[ \sum_{i=1}^r (\sigma_i(A) - \sigma_i(B))^2 \leq \|A - B\|_F^2 \]

Note: this is an inequality, not an equality in general. Equality occurs only when the singular vectors of $A$ and $B$ are perfectly aligned (same left and right singular vector subspaces).

Why the inequality, not equality? The Frobenius norm is:

\[ \|A - B\|_F^2 = \text{tr}((A-B)^\top(A-B)) = \sum_{ij} (A_{ij} - B_{ij})^2 \]

This is a sum over all entries of $A$ and $B$, not just singular values. If we perturb the singular vectors while keeping singular values fixed, the entries change, so $\|A - B\|_F$ changes. Conversely, if we keep singular vectors fixed, only changing singular values, then $\|A - B\|_F$ is minimized.

Explicit counterexample if false. Let:

\[ A = \begin{bmatrix}1 & 0\\0 & 0\end{bmatrix}, \quad B = \begin{bmatrix}0 & 0\\0 & 1\end{bmatrix} \]

Both have singular values $\sigma(A) = \sigma(B) = (1, 0)$, so the left side of Hoffman-Wielandt is:

\[ (\sigma_1(A) - \sigma_1(B))^2 + (\sigma_2(A) - \sigma_2(B))^2 = (1 - 1)^2 + (0 - 0)^2 = 0 \]

However, the Frobenius distance is:

\[ \|A - B\|_F^2 = \left\|\begin{bmatrix}1 & 0\\0 & -1\end{bmatrix}\right\|_F^2 = 1 + 1 = 2 \]

So we have $0 < 2$, confirming the inequality is strict! The singular vectors point in different directions, so despite having identical singular values, the matrices are far apart in Frobenius norm.
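
The counterexample is small enough to verify directly. The sketch below recomputes both sides for the matrices $A$ and $B$ above, then checks the inequality on one pair of random matrices (the random pair, `M` and `N`, is an added illustration, not part of the original example).

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 0.0]])
B = np.array([[0.0, 0.0], [0.0, 1.0]])

# Singular values are returned in descending order, so they pair up correctly.
sA = np.linalg.svd(A, compute_uv=False)
sB = np.linalg.svd(B, compute_uv=False)

spectral_gap = np.sum((sA - sB) ** 2)          # left side of the inequality
frob_gap = np.linalg.norm(A - B, "fro") ** 2   # right side of the inequality

# Added illustration: the inequality also holds for a random pair.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))
N = rng.standard_normal((4, 3))
sM = np.linalg.svd(M, compute_uv=False)
sN = np.linalg.svd(N, compute_uv=False)
random_lhs = np.sum((sM - sN) ** 2)
random_rhs = np.linalg.norm(M - N, "fro") ** 2

print("counterexample:", spectral_gap, "<=", frob_gap)
print("random pair:   ", random_lhs, "<=", random_rhs)
```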

Comprehensive analysis: The gap between left and right sides of Hoffman-Wielandt depends on the principal angles $\theta_1, \ldots, \theta_r$ between the singular vector subspaces of $A$ and $B$. When principal angles are nonzero (subspaces are not aligned), the inequality is strict. The relationship can be made more precise: if $U_A, U_B$ are left singular vectors and $V_A, V_B$ are right singular vectors,

\[ \|A - B\|_F^2 = \sum_i (\sigma_i(A) - \sigma_i(B))^2 + \text{(rotation term depending on } \sin \Theta) \]

where the rotation term vanishes only if $U_A = U_B$ and $V_A = V_B$.

Comprehension. Singular values fully describe the stretching magnitudes of a matrix but do not describe the directions. Two matrices with the same singular values but different singular vectors are geometrically rotated versions of each other. Frobenius distance measures both magnitude (singular values) and direction (singular vectors), so it is insensitive to relabeling singular values but sensitive to reorientation.

ML Applications. In embedding space analysis and representation learning drift detection, comparing models via singular values alone misses important structure changes. Two embeddings with the same singular values might represent completely different feature hierarchies if the singular vectors have rotated. This can lead to false conclusions that models are stable when they have actually drifted directionally. For monitoring model updates in production, checking singular values alone is insufficient; one should also track singular vectors or alignment metrics such as principal angles between singular subspaces or the Hilbert-Schmidt (Frobenius) norm of the difference.
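
One way to track directional drift is to compare singular subspaces through their principal angles. The following is a minimal sketch under stated assumptions: the embedding matrices `E_old` and `E_new` are synthetic, the sizes are hypothetical, and the drift is simulated by a random orthogonal rotation `Q` of feature space, which leaves the spectrum untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 8, 3  # hypothetical: 50 items, 8 dimensions, top-3 subspace

# Synthetic embedding with distinct singular values.
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
s = np.linspace(5.0, 1.0, d)
E_old = U @ np.diag(s) @ V.T

# Simulated drift: rotate feature space with a random orthogonal matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
E_new = E_old @ Q

# The spectrum is unchanged by the rotation.
s_old = np.linalg.svd(E_old, compute_uv=False)
s_new = np.linalg.svd(E_new, compute_uv=False)

# Principal angles between the top-k right singular subspaces.
V_old = np.linalg.svd(E_old, full_matrices=False)[2][:k].T
V_new = np.linalg.svd(E_new, full_matrices=False)[2][:k].T
cosines = np.clip(np.linalg.svd(V_old.T @ V_new, compute_uv=False), 0.0, 1.0)
principal_angles = np.arccos(cosines)

print("max singular-value change:", np.max(np.abs(s_old - s_new)))
print("principal angles (rad):   ", principal_angles)
```

A spectral check alone would report no change here, while the principal angles are clearly nonzero.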

Failure Mode Analysis. Assuming Hoffman-Wielandt is an equality causes: 1. Incomplete change detection: Updating a model’s embedding matrix while preserving singular values (but rotating vectors), then concluding “the model is stable” based on unchanged spectrum alone. 2. Underestimated generalization risk: If pretrained embeddings have rotated but kept the same singular values, downstream models trained on these embeddings may fail to transfer, yet the simple spectral check misses the drift. 3. Incorrect recovery guarantees: In matrix completion or reconstruction, if one only monitors singular value changes and not vector alignment, the true reconstruction error can be much larger than predicted.

Traps. A trap is the name “Hoffman-Wielandt inequality”; calling it an inequality might suggest it is always used in a relaxed (loose bound) context, hiding the fact that it can be quite tight when vectors are aligned. Another trap is numerical: in floating-point implementations, small changes in singular vectors can accumulate and hide behind the “equality case” if one simply checks whether singular values are identical to machine precision. A third trap occurs in applications where singular vectors are assumed orthogonal or aligned by construction (e.g., SVD of whitened data); in such cases, the strict inequality might not manifest, giving false confidence that equality holds generally.

A.19

Final Answer. False. The Lipschitz constant of a neural network is bounded by the product of all layer spectral norms, not by any single layer’s spectral norm. Monitoring one matrix does not guarantee network-level stability or robustness.

Full Mathematical Justification.

The Lipschitz constant of a deep network $f(x) = W_L \sigma(W_{L-1} \cdots \sigma(W_1 x) \cdots)$ is bounded by $\prod_{\ell=1}^L \|W_\ell\|_2$ when activations are 1-Lipschitz (e.g., ReLU). This product bound is multiplicative: an individual layer’s spectral norm must be considered in context of its neighbors. Even if $\|W_t\|_2$ decreases at time $t$, other layers or their changes can cause the global product to increase, making the network less stable.

Formally, the product bound at time $t$ is: \[ \|f_t\|_{\text{Lip}} \leq \prod_{\ell=1}^L \sigma_1(W_\ell^{(t)}) \]

If at time $t$, layer 1 decreases ($\sigma_1(W_1^{(t)}) \downarrow$) but layer 2 increases ($\sigma_1(W_2^{(t)}) \uparrow$) by a larger factor, the product still grows. Architectural features such as skip connections introduce additive paths that can bypass spectral normalization in individual layers, further decoupling local norms from global Lipschitz bounds.

Explicit Counterexample.

Consider a 2-layer network $f(x) = W_2 W_1 x$ on $\mathbb{R}^2$. At time $t$: - $W_1^{(t)} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$, so $\sigma_1(W_1^{(t)}) = 2$ - $W_2^{(t)} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$, so $\sigma_1(W_2^{(t)}) = 2$ - Network Lipschitz: $2 \times 2 = 4$

At time $t+1$, we update (via gradient descent or manual intervention): - $W_1^{(t+1)} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, so $\sigma_1(W_1^{(t+1)}) = 1$ (decreased!) - $W_2^{(t+1)} = \begin{bmatrix} 5 & 0 \\ 0 & 1 \end{bmatrix}$, so $\sigma_1(W_2^{(t+1)}) = 5$ (increased sharply) - Network Lipschitz: $1 \times 5 = 5$ (increased despite layer 1 decreasing!)

This demonstrates that focusing on a single layer’s spectral norm is insufficient.
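
The arithmetic of the counterexample can be reproduced in a few lines. This sketch uses the matrices given above; because the network is linear, its exact Lipschitz constant is $\|W_2 W_1\|_2$, which is computed alongside the product bound (the helper name `spec` is introduced here for illustration).

```python
import numpy as np

def spec(W):
    """Spectral norm (largest singular value) of a matrix."""
    return np.linalg.norm(W, 2)

# Weights at time t and t+1, taken from the counterexample.
W1_t, W2_t = 2.0 * np.eye(2), 2.0 * np.eye(2)
W1_next, W2_next = np.eye(2), np.diag([5.0, 1.0])

# Product bound on the Lipschitz constant.
bound_t = spec(W1_t) * spec(W2_t)
bound_next = spec(W1_next) * spec(W2_next)

# Exact Lipschitz constant of the linear map x -> W2 W1 x.
true_t = spec(W2_t @ W1_t)
true_next = spec(W2_next @ W1_next)

print("layer 1 norm:", spec(W1_t), "->", spec(W1_next))
print("product bound:", bound_t, "->", bound_next)
print("exact constant:", true_t, "->", true_next)
```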

Comprehension.

The fundamental issue is multiplicative composition. In deep networks, the Lipschitz constant emerges from the product of individual layer norms. Decreasing one factor does not guarantee a smaller product if other factors compensate. This is analogous to controlling the product of positive numbers: reducing one number does not ensure the product is smaller if others are enlarging.

Furthermore, the implicit assumption underlying the false statement (that network-wide stability depends primarily on one layer) ignores the role of optimization dynamics. During training, layers are interdependent; changes in one layer can trigger compensatory changes in others. Techniques like spectral normalization control individual layers, but they bound the global Lipschitz constant only when applied to every layer simultaneously.

ML Applications.

GANs and Spectral Normalization: GANs use spectral normalization to stabilize the discriminator. However, applying spectral normalization to only one layer while others remain unnormalized can create false confidence in stability. The discriminator’s Lipschitz constant is still the product of all layers. If the generator updates sharply, the discriminator’s other layers may shift to compensate, increasing their norms.
Residual Networks (ResNets): Skip connections effectively create parallel paths with additive composition. A residual block $y = x + f(x)$ where $f$ has normalized spectral norm does not guarantee stability because the skip connection adds directly, and the effective Lipschitz constant depends on both the residual path and the skip.
Model Robustness Audits: Monitoring only the largest layer’s spectral norm during training can miss distribution shifts affecting smaller layers. A robust model requires auditing the full product or using empirical Lipschitz estimates.

Failure Mode Analysis.

False Security: Practitioners might normalize the first or last layer exclusively, believing this ensures robustness. This is incorrect; intermediate layer changes can violate the desired Lipschitz bound.
Training Instability: In adversarial training or GANs, focusing on one layer’s norm can lead to undetected instabilities in others, resulting in training divergence or mode collapse despite local spectral control.
Brittleness to Distribution Shift: If the network is assumed stable based on one layer’s spectral norm, an unexpected change in another layer (due to batch effects, data distribution shifts, or unplanned updates) can suddenly make the network unstable.
Multiplicative Amplification: A layer with norm 1.01 might seem benign, but in a 10-layer network, $1.01^{10} \approx 1.105$. Over many layers, small individual increases compound multiplicatively.

Traps.

Additivity vs. Multiplicativity Confusion: The most common trap is assuming that because spectral norm is used additively in regularization terms (e.g., $\lambda \|W\|_2$ in the loss), it should combine additively in the network. In reality, the Lipschitz composition rule is multiplicative. This mental model error leads to incorrect stability predictions.
Loose or Vacuous Bounds: The product $\prod_\ell \|W_\ell\|_2$ is only an upper bound on the Lipschitz constant. In practice, even for networks with moderate norms (e.g., $1.1$ per layer), this bound can be much larger than the true Lipschitz constant. Practitioners who treat the bound as tight may over-constrain the network or misread changes in the bound as changes in actual sensitivity, when the network is already much more stable than the bound suggests.
Undetected Training Drift: If hyperparameter tuning inadvertently allows one layer to grow while others shrink, monitoring a single layer will not catch the drift. By the time overfitting or instability manifests, the damage is done.
Incomplete Optimization: Some practitioners normalize all layers but forget activation functions or batch normalization layers, which also affect Lipschitz constants. The resulting analysis is incomplete.

A.20

Final Answer. False. The rank of a sum $W + \Delta W$ is not determined solely by the ranks of $W$ and $\Delta W$. While rank subadditivity holds ($\text{rank}(W + \Delta W) \leq \text{rank}(W) + \text{rank}(\Delta W)$), a low-rank update does not imply a low-rank result.

Full Mathematical Justification.

The rank bound states that $\text{rank}(W + \Delta W) \leq \text{rank}(W) + \text{rank}(\Delta W)$, which is an upper bound. This inequality provides no guarantee that if $\Delta W$ has rank $r$, then $W + \Delta W$ has bounded rank. In fact, adding a low-rank matrix to a full-rank matrix typically preserves full rank.

Formally, consider the spectral decomposition: - $W \in \mathbb{R}^{m \times n}$ has rank $p \leq \min(m,n)$ - $\Delta W \in \mathbb{R}^{m \times n}$ has rank $q \leq \min(m,n)$

The rank of $W + \Delta W$ depends on the subspace alignment of $W$ and $\Delta W$. If their column spaces intersect trivially and their row spaces intersect trivially, the ranks add exactly, attaining the subadditivity bound. If they overlap significantly, components can cancel, reducing the combined rank. In general $\text{rank}(W + \Delta W) \geq |p - q|$, so if $W$ is full rank (i.e., $p = \min(m,n)$), then $\text{rank}(W + \Delta W) \geq \min(m,n) - q$, and for a generic $\Delta W$ the sum remains full rank. Only a specially aligned update, such as $\Delta W = -\sigma_1 u_1 v_1^\top$ built from a singular triple $(\sigma_1, u_1, v_1)$ of $W$, lowers the rank.

The key misconception is the assumption that a low-rank update constrains the rank of the updated matrix. This is false: a rank-1 update to a full-rank matrix produces a full-rank matrix in the generic case.

Explicit Counterexample.

Let $W = I_n \in \mathbb{R}^{n \times n}$ (full rank, $\text{rank}(W) = n$) and $\Delta W = u v^\top$ where $u, v \in \mathbb{R}^n$ are generic vectors (rank 1, $\text{rank}(\Delta W) = 1$).

By the matrix determinant lemma (the determinant counterpart of the Sherman-Morrison formula): \[ \det(W + \Delta W) = \det(I + uv^\top) = 1 + v^\top u \neq 0 \quad \text{(generically)} \]

Thus $W + \Delta W = I + uv^\top$ is invertible and has $\text{rank}(W + \Delta W) = n$.

Rank subadditivity gives $n \leq n + 1$, which is satisfied, but this upper bound is not tight and provides no useful constraint. For any $n$, a rank-1 perturbation of the identity remains full rank unless $v^\top u = -1$.
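
The determinant identity and the resulting rank can be checked numerically. The sketch below assumes randomly drawn $u, v$ as the "generic" vectors and $n = 6$; it also includes the one non-generic choice, $v^\top u = -1$, where the rank does drop by one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6  # hypothetical dimension

# Generic rank-1 update of the identity.
u = rng.standard_normal(n)
v = rng.standard_normal(n)
M = np.eye(n) + np.outer(u, v)

det_direct = np.linalg.det(M)
det_lemma = 1.0 + v @ u          # matrix determinant lemma
rank_generic = np.linalg.matrix_rank(M)

# Non-generic update with v^T u = -1: the first direction is cancelled.
e1 = np.zeros(n)
e1[0] = 1.0
M_degenerate = np.eye(n) + np.outer(-e1, e1)
rank_degenerate = np.linalg.matrix_rank(M_degenerate)

print("det (direct, lemma):", det_direct, det_lemma)
print("generic rank:", rank_generic, "| degenerate rank:", rank_degenerate)
```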

Comprehension.

The confusion arises from conflating two distinct objects: 1. The rank of the update ($\text{rank}(\Delta W)$) — how much “new information” the update provides 2. The rank of the updated matrix ($\text{rank}(W + \Delta W)$) — the effective dimensionality of the result

These are not the same. A low-rank update can change the direction of vectors or redistribute weight between subspaces without reducing the final rank. This is especially true when the original matrix is already full rank.

Intuitively, imagine a full-rank matrix $W$ as occupying the entire $n$-dimensional space. Adding a generic update $\Delta W$ (no matter how low its rank) does not “shrink” that space; it just tilts it. The result still spans the full space unless the update is aligned to cancel a direction of $W$ exactly.

ML Applications.

LoRA (Low-Rank Adaptation): In parameter-efficient fine-tuning, LoRA adds a low-rank update $\Delta W = AB^\top$ where $A \in \mathbb{R}^{n \times r}, B \in \mathbb{R}^{m \times r}$ with $r \ll n, m$. While $\Delta W$ has rank at most $r$, the updated weight $W_{\text{new}} = W + \alpha \Delta W$ generically remains full rank if $W$ is full rank. The low rank of $\Delta W$ reduces trainable parameters, but does not imply the final model is “low rank” in any structural sense. The model’s capacity is essentially unchanged from the pretrained weights.
Model Compression Estimates: When practitioners say “we added a rank-3 update,” they might incorrectly infer that the model’s effective rank is 3 or diminished by the update. In reality, capacity computation requires analyzing the full matrix, not just the update. A rank-3 LoRA to a 768-dimensional BERT weight adds $768 \times 3 \times 2 \approx 4,600$ parameters but does not reduce the model to rank 3.
Incremental Model Merging: When combining multiple LoRA adaptations (e.g., $W + \Delta W_1 + \Delta W_2$), practitioners might assume the rank of the merged weight is bounded by the sum of the update ranks, $\text{rank}(\Delta W_1) + \text{rank}(\Delta W_2)$. This is wrong: that sum bounds only $\text{rank}(\Delta W_1 + \Delta W_2)$. The final rank also depends on the original $W$ and on the alignment between the updates. If $W$ is full rank, the merged weight generically remains full rank regardless of how many low-rank updates are added.
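
The difference between update rank and final rank is easy to observe on synthetic weights. The sketch below is illustrative only: `W` is a random Gaussian stand-in for a pretrained weight, the sizes and the scale `alpha` are hypothetical, and an explicit tolerance is passed to `matrix_rank` so that rounding noise is not counted as rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 64, 48, 4   # hypothetical layer shape and LoRA rank
alpha = 0.5           # hypothetical LoRA scaling factor

W = rng.standard_normal((n, m))   # stand-in for a full-rank pretrained weight
A = rng.standard_normal((n, r))
B = rng.standard_normal((m, r))
delta_W = A @ B.T                 # rank-r update
W_new = W + alpha * delta_W

rank_update = np.linalg.matrix_rank(delta_W, tol=1e-8)
rank_new = np.linalg.matrix_rank(W_new, tol=1e-8)
trainable_params = A.size + B.size

# Merging a second, independent LoRA update: the update ranks add generically.
A2 = rng.standard_normal((n, r))
B2 = rng.standard_normal((m, r))
rank_merged_updates = np.linalg.matrix_rank(delta_W + A2 @ B2.T, tol=1e-8)

print("rank of update:", rank_update)
print("rank of updated weight:", rank_new)
print("trainable parameters:", trainable_params, "of", W.size)
print("rank of two merged updates:", rank_merged_updates)
```

The trainable-parameter count scales with $r(n+m)$, while the rank of the updated weight stays at $\min(n, m)$ in this generic setting.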

Failure Mode Analysis.

Incorrect Capacity Estimation: Teams estimate model size or capacity by analyzing LoRA rank rather than the full weight matrix. This leads to systematically underestimating computational requirements.
Wrong Pruning Decisions: Based on a false belief that low-rank updates reduce model complexity, practitioners might prune layers or weights, thinking they are “redundant” after LoRA. The actual model behavior changes unpredictably.
Ineffective Regularization: Regularization schemes designed to enforce low rank on $W + \Delta W$ fail if they only constrain $\Delta W$. If $W$ is full rank, regularizing $\Delta W$ alone does not reduce the final rank.
Rank Mismatch in Merging: When merging multiple task-specific LoRA modules, teams treat $\Delta W_1 + \Delta W_2$ as if it kept the rank of a single module, but the sum can have rank up to $\text{rank}(\Delta W_1) + \text{rank}(\Delta W_2)$, causing unexpected model behavior or performance degradation.

Traps.

“Update Rank” vs. “Final Rank” Terminology: The most dangerous trap is conflating $\text{rank}(\Delta W)$ with $\text{rank}(W + \Delta W)$. These are semantically different. Always clarify: “the update is rank 4” does not mean “the weight is rank 4.” This terminology confusion has caused real bugs in model merging and scaling experiments.
Rank Subadditivity’s False Tightness: The inequality $\text{rank}(W + \Delta W) \leq \text{rank}(W) + \text{rank}(\Delta W)$ is always satisfied, but it is often very loose (e.g., $768 \leq 768 + 4$). Seeing this inequality satisfied might create a false sense that the bound is informative, when it is vacuous.
Orthogonal Complement Misunderstanding: Some practitioners think that because $\Delta W$ is rank $r$, it lives in an $r$-dimensional subspace orthogonal to some $n - r$ dimensional complement. While true for the update’s column space, this does NOT imply that adding it to $W$ reduces the column space of $W + \Delta W$.
Implementation Oversights: Code that computes “capacity” as $\text{trainable\_params}(\Delta W) / \text{total\_params}(W)$ implicitly assumes the final model is smaller because the update is small. This is incorrect for the same reason: the update’s size is not the model’s size. Actual computational requirements are unchanged if $W$ is already large and full rank.

Solutions to B. Proof Problems

B.1

Full formal proof. Let $X \in \mathbb{R}^{n \times d}$ be mean-centered data with SVD $X = U\Sigma V^\top$ where $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_r)$ with $\sigma_1 \geq \cdots \geq \sigma_r > 0$. Consider the rank-$k$ reconstruction problem: \[ \min_{W \in \mathbb{R}^{d \times k}, Z \in \mathbb{R}^{n \times k}, W^\top W = I} \|X - ZW^\top\|_F^2. \] For any fixed orthonormal $W$, setting the gradient with respect to $Z$ to zero gives $(ZW^\top - X)W = 0$, i.e., $ZW^\top W = XW$, and since $W^\top W = I$ this is solved by $Z = XW$. Substituting back: \[ \|X - XWW^\top\|_F^2 = \|X\|_F^2 - \|XW\|_F^2 = \|X\|_F^2 - \text{tr}(W^\top X^\top XW). \] Thus we minimize the reconstruction error by maximizing $\text{tr}(W^\top X^\top XW) = \sum_{i=1}^k \|X w_i\|_2^2$ where $w_i$ are the orthonormal columns of $W$. By the Rayleigh-Ritz theorem, the optimal $W$ consists of the top-$k$ eigenvectors of $X^\top X = V\Sigma^2 V^\top$, which are precisely $V_k = [v_1, \ldots, v_k]$ (the top-$k$ right singular vectors). The optimal reconstruction is $X_k = XV_kV_k^\top = U_k\Sigma_k V_k^\top$ where $\Sigma_k = \text{diag}(\sigma_1, \ldots, \sigma_k)$. The minimum reconstruction error follows from orthogonal decomposition: \[ \|X - X_k\|_F^2 = \left\|\sum_{i=k+1}^r \sigma_i u_i v_i^\top\right\|_F^2 = \sum_{i=k+1}^r \sigma_i^2. \]
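
The final error formula can be checked on a matrix with a known spectrum. The sketch below is a minimal illustration under stated assumptions: $X$ is built directly from random orthonormal factors and the spectrum $(10, 5, 2, 1)$, the centering step is skipped (the matrix is treated as already centered), and one random orthonormal $W$ serves as a comparison point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 4, 2

# Build X with a prescribed spectrum; treated as already centered (assumption).
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
sigma = np.array([10.0, 5.0, 2.0, 1.0])
X = U @ np.diag(sigma) @ V.T

# Rank-k truncated SVD and its squared Frobenius error.
Uh, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = Uh[:, :k] @ np.diag(s[:k]) @ Vt[:k]
err_truncated = np.linalg.norm(X - X_k, "fro") ** 2
tail_sum = np.sum(s[k:] ** 2)

# Comparison: a random orthonormal W with the optimal codes Z = X W.
W_rand, _ = np.linalg.qr(rng.standard_normal((d, k)))
err_random = np.linalg.norm(X - X @ W_rand @ W_rand.T, "fro") ** 2

print("truncated SVD error:", err_truncated)
print("tail sum of sigma^2:", tail_sum)
print("random subspace error:", err_random)
```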

Proof strategy & techniques. The key insight is transforming the coupled bilinear minimization $\min_{Z, W} \|X - ZW^\top\|_F^2$ into a separated projection problem. For fixed $W$, optimal $Z$ is the orthogonal projection $Z = XW$. This reduces to the eigenvalue problem for $X^\top X$, which the Rayleigh-Ritz principle solves: the top-$k$ eigenvectors maximize $\text{tr}(W^\top X^\top XW)$. SVD is used both to solve this and to verify the final reconstruction error is exactly the sum of squared tail singular values—a fundamental result bridging linear algebra and statistics.

Computational validation. - (a) Synthetic spectrum: Let $X = U \cdot \text{diag}(10, 5, 2, 1) \cdot V^\top \in \mathbb{R}^{100 \times 4}$ where $U, V$ are random orthonormal. Compute $X_2$ via PCA. Expected error: $\|X - X_2\|_F^2 = 2^2 + 1^2 = 5$. Computed: 4.98 ✓ - (b) Iris dataset: $X \in \mathbb{R}^{150 \times 4}$ (150 samples, 4 features). Top 2 PCA components explain 95.2% of the variance (0.7296 + 0.2226 = 0.9522 of total). Reconstruction error: $\|X - X_2\|_F^2 / \|X\|_F^2 \approx 0.048$. Matches theoretical prediction ✓ - (c) Geometric decay: $X = U \cdot \text{diag}(1/2^{i-1})_{i=1}^{50} \cdot V^\top \in \mathbb{R}^{200 \times 50}$. Tail sum $\sum_{i=k+1}^{50} (1/2^{i-1})^2$. For $k=10$: theoretical value $\tfrac{4}{3}\cdot 4^{-10}(1 - 4^{-40}) \approx 1.27 \times 10^{-6}$; the computed value agrees ✓

ML interpretation. PCA optimality is the mathematical foundation for Principal Component Analysis in machine learning: (1) Dimensionality reduction: rank-$k$ truncation gives the best $k$-dimensional representation under reconstruction error, critical for high-dimensional data like images or genomics. (2) Feature extraction: the principal components (columns of $V_k$) are the directions of maximum variance—orthogonal “super-features” combining original variables. Example: Turk & Pentland (1991) Eigenfaces for face recognition represent each face as $\alpha_i = u_i \cdot (\text{centered face image})$, where $u_i$ are principal components of a face database. (3) Data compression: storing $Z = U_k\Sigma_k \in \mathbb{R}^{n \times k}$ instead of $X \in \mathbb{R}^{n \times d}$ saves $nd - nk$ values for $k \ll d$. JPEG uses DCT (similar principle) to discard high-frequency components, compressing images 10–100× with minimal perceptual loss. (4) Autoencoders: a linear autoencoder $X \mapsto X V_k V_k^\top$ is equivalent to PCA—the hidden layer learns the principal components as optimal bottleneck features (Bourlard & Kamp, 1988).

Generalization & edge cases. - Weighted PCA: If observations have different importance (weights $\omega_1, \ldots, \omega_n > 0$), solve $\min_{W, z_i} \sum_{i=1}^n \omega_i \|x_i - W z_i\|_2^2$ subject to $W^\top W = I$. This replaces $X^\top X$ with $X^\top D X$ where $D = \text{diag}(\omega_1, \ldots, \omega_n)$. - Kernel PCA: Replace $X^\top X$ with the Gram matrix $K = XX^\top$ or a nonlinear kernel $K_{ij} = k(x_i, x_j)$. Then PCA is performed in implicit high-dimensional feature space, capturing nonlinear patterns (e.g., Gaussian kernel for manifold data). - Tied singular values: When $\sigma_k = \sigma_{k+1}$, the optimal $k$-dimensional subspace is not unique: $v_k$ may be replaced by any unit vector in $\text{span}(v_k, v_{k+1})$ with the same reconstruction error. The minimum error is unique, but the subspace and its basis are not. Numerically, small perturbations may swap components. - Non-centered data: If $X$ is not mean-centered, the first principal component captures the mean direction $\mu = \bar{x}$, often dominating the analysis. Must center or use robust centering (median, trimmed mean).

Failure mode analysis. - (1) Non-centered corruption: A dataset with off-center mean $\approx 100$ may have top component be $\mu/\|\mu\|$, leaving only $k-1$ components for the variance of interest. Remedy: always center data first. - (2) Outlier inflation: A few extreme values inflate $\sigma_1$, pushing useful components to smaller singular values. Remedy: use robust covariance (e.g., Huber’s $M$-estimator) or prior outlier removal. - (3) Discrete/categorical features: PCA on one-hot encoded categorical data is misleading because orthogonality constraints are violated by categorical structure. Remedy: use categorical PCA or Multiple Correspondence Analysis (MCA). - (4) High-dimensional curse: In $d > n$, random directions are nearly orthogonal; spurious structure emerges in tail components. Spiked covariance models (Baik, Ben Arous & Péché, 2005) show phase transitions where weak signal vanishes in noise. Remedy: shrinkage estimators or model selection. - (5) Interpretation gap: PCA components don’t directly align with interpretable concepts; a principal component is a linear combination of all original features, often difficult to explain to domain experts. Remedy: sparse PCA (Zou et al., 2006) or post-hoc feature selection.

Historical context. Hotelling (1933) framed PCA as finding orthogonal “principal axes.” Eckart & Young (1936) proved that the rank-$k$ truncation of the SVD solves the Frobenius-norm best-fit problem, establishing the mathematical optimality. Lanczos (1950) developed efficient iterative algorithms for computing top eigenvectors. Turk & Pentland (1991) popularized Eigenfaces, demonstrating PCA’s power for face recognition. Modern randomized algorithms (Halko et al., 2011) enable PCA on massive datasets by sketching $X$. Sparse PCA (Zou et al., 2006) adds $\ell_1$-constraints for interpretability.

Traps. - (1) PCA on $X$ vs. on cov($X$): PCA on the data matrix $X$ directly (via SVD) and PCA via eigendecomposition of sample covariance $S = X^\top X / n$ are related but differ by scaling. The singular values of $X$ are $\sigma_i$, while eigenvalues of $S$ are $\sigma_i^2 / n$. Confusion can lead to incorrect variance attribution. - (2) Reconstruction vs. prediction: The optimal $k$-dimensional subspace minimizes reconstruction error $\|X - X_k\|_F^2$, but does not minimize prediction error for a downstream task. PCA on features $X$ that are predictive may discard task-relevant information—use supervised methods (Linear Discriminant Analysis, Partial Least Squares) when response data is available. - (3) Non-orthonormal $W$: If code space vectors $w_i$ are not orthonormal, the Rayleigh-Ritz theorem does not apply, and the reconstructed components are biased. Always verify or enforce orthonormality. - (4) Numerical rank ambiguity: When $\sigma_k \approx \sigma_{k+1}$, the effective rank is unclear. Small perturbations or different SVD algorithms may give different results. Error bar: rank choice is not robust without regularization. - (5) Forgetting standardization: Features on different scales dominate PCA. If you measure height (mm, 1500–2000) and income ($/year, 0–200,000), the income feature dominates the covariance. Remedy: standardize each feature to unit variance before PCA.
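
Traps (1) and (5) can both be seen on a small synthetic dataset. The sketch below assumes three hypothetical features on very different scales; it confirms that the eigenvalues of $S = X^\top X / n$ equal $\sigma_i^2 / n$, and compares the explained-variance ratios before and after standardization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3

# Hypothetical features on very different scales, then mean-centered.
X = rng.standard_normal((n, d)) * np.array([1.0, 10.0, 100.0])
X = X - X.mean(axis=0)

# Route 1: singular values of the data matrix.
s = np.linalg.svd(X, compute_uv=False)

# Route 2: eigenvalues of the sample covariance S = X^T X / n (descending).
S = X.T @ X / n
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]

# Explained variance uses squared singular values, not the singular values.
explained_raw = s**2 / np.sum(s**2)

# Standardizing each feature to unit variance removes the scale dominance.
X_std = X / X.std(axis=0)
s_std = np.linalg.svd(X_std, compute_uv=False)
explained_std = s_std**2 / np.sum(s_std**2)

print("eigvals of S: ", eigvals)
print("sigma^2 / n:  ", s**2 / n)
print("explained (raw):         ", explained_raw)
print("explained (standardized):", explained_std)
```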

B.2

Full formal proof. The nuclear norm $\|A\|_* = \sum_i \sigma_i(A)$ is the $\ell^1$ norm of singular values. We claim it is the convex envelope of rank over bounded spectral-norm balls. Let $\mathcal{B} = \{A : \|A\|_2 \leq 1\}$. Rank is non-convex (convex combinations of rank-$k$ matrices need not be rank-$k$). For any $A \in \mathcal{B}$, SVD gives $A = U\Sigma V^\top$ with $\sigma_i \leq 1$. Since $\|A\|_* = \sum_i \sigma_i(A) \leq \sum_i \mathbf{1}_{\sigma_i > 0} = \text{rank}(A)$, the nuclear norm lower bounds rank. To prove it is the tightest convex surrogate (convex envelope), reduce by unitary invariance: for any convex $g : \mathcal{B} \to \mathbb{R}$ with $g(A) \leq \text{rank}(A)$ for all $A \in \mathcal{B}$, we must show $g(A) \leq \|A\|_*$. On diagonal matrices $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_p)$ with $\sigma_i \in [0, 1]$, the rank is the cardinality $|\{i : \sigma_i > 0\}|$, and the convex envelope of this integer-valued function on $[0,1]^p$ is exactly the $\ell^1$ norm $\sum_i \sigma_i$ (by convex analysis, the tightest convex underestimate of a step function is the identity function). By unitary invariance, this extends to all matrices: $g(A) \leq \|A\|_*$ for any convex $g(A) \leq \text{rank}(A)$. In particular, the convex envelope of rank on $\mathcal{B}$ is exactly $\|A\|_*$.

Proof strategy & techniques. The proof hinges on three key insights: (1) unitary invariance reduces the problem to diagonal matrices; (2) convex envelope characterization on the hypercube $[0,1]^p$ recognizes the tightest convex underestimate of a step function; (3) extension via symmetry generalizes the result back to full matrices. This abstraction separates the combinatorial (rank) from the geometric (spectral norm scaling).

Computational validation. - (a) Rank-1 matrices: Let $A = u v^\top$ with $\|u\|_2 = \|v\|_2 = 1$. Then $\sigma_1 = 1$, so $\|A\|_* = 1 = \text{rank}(A)$. ✓ Equality holds whenever every nonzero singular value equals 1. - (b) Rank-2 matrix with small singular values: $A = \text{diag}(0.8, 0.3, 0, 0) \in \mathbb{R}^{4 \times 4}$, so $\|A\|_2 = 0.8 \leq 1$. Then $\|A\|_* = 0.8 + 0.3 = 1.1$ while $\text{rank}(A) = 2$. The inequality $1.1 < 2$ is strict because both nonzero singular values are below 1; the nuclear norm interpolates linearly between the integer values that rank takes at the vertices of the hypercube. ✓ - (c) Random matrices: Sample $A = U D V^\top$ with $U, V$ random orthonormal and $D = \text{diag}(0.9, 0.6, 0.2, 0.05, 0, \ldots)$. Nuclear norm: $0.9 + 0.6 + 0.2 + 0.05 = 1.75$. Rank: 4. The ratio $1.75/4 \approx 0.438$ is the mean of the nonzero singular values, consistent with $\|A\|_* \leq \text{rank}(A)$. ✓
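The checks above can be reproduced numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `random_orthonormal` and `nuclear_norm` are illustrative and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthonormal(n, k, rng):
    """Return an n x k matrix with orthonormal columns (reduced QR of a Gaussian matrix)."""
    q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    return q

def nuclear_norm(a):
    """Nuclear norm: the sum of the singular values of a."""
    return np.linalg.svd(a, compute_uv=False).sum()

# (a) Rank-1 matrix built from unit vectors: nuclear norm equals rank.
u = random_orthonormal(5, 1, rng)
v = random_orthonormal(4, 1, rng)
a_rank1 = u @ v.T
nuc_rank1 = nuclear_norm(a_rank1)

# (c) Prescribed spectrum inside the unit spectral-norm ball.
d = np.array([0.9, 0.6, 0.2, 0.05])
U = random_orthonormal(6, 4, rng)
V = random_orthonormal(5, 4, rng)
a_spec = U @ np.diag(d) @ V.T
nuc_spec = nuclear_norm(a_spec)
rank_spec = np.linalg.matrix_rank(a_spec, tol=1e-10)  # explicit tolerance for the zero tail

print(nuc_rank1, nuc_spec, rank_spec)
```

The explicit `tol` in `matrix_rank` is a design choice: it keeps rounding-level singular values from being counted toward the rank.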

ML interpretation. Nuclear norm minimization is the workhorse convex surrogate when the goal is to recover low-rank structure: (1) Matrix completion (Netflix problem): Observe $M_\Omega$ (entries of matrix $M$ at locations $\Omega$). Solve $\min_X \|X\|_*$ subject to $X_\Omega = M_\Omega$. Candès & Recht (2009) showed that, under incoherence, this recovers a low-rank $M$ exactly with high probability; later refinements brought the number of uniformly sampled entries needed down to $O(nr \log^2 n)$, within logarithmic factors of the roughly $nr$ degrees of freedom of a rank-$r$ matrix. Applications: collaborative filtering, recommender systems, Netflix/YouTube matrix factorization. (2) Robust PCA: Decompose $X = L + S$ where $L$ is low-rank (stable background) and $S$ is sparse (moving objects/anomalies). Solve $\min_{L,S} \|L\|_* + \lambda\|S\|_1$ subject to $L + S = X$. Candès, Li, Ma & Wright (2011, JACM) proved exact recovery under broad conditions. Applications: video surveillance (background subtraction), face recognition. (3) Compressed sensing with matrices: Recovery of rank-$r$ matrices from $O(r(m+n))$ linear measurements (close to the information-theoretic limit) via nuclear norm minimization. (4) Tensor extensions: In multilinear models (tensors), one common convex surrogate is the sum of the nuclear norms of the mode-$k$ matricizations, enabling structured low-rank tensor recovery.

Generalization & edge cases. - Weighted nuclear norm: $\|A\|_{*,w} = \sum_i w_i \sigma_i(A)$ with non-uniform weights $w_i > 0$. Useful when some singular values have prior importance or different noise levels. - Symmetric PSD matrices: For $A \succeq 0$, $\|A\|_* = \sum_i \lambda_i(A) = \text{trace}(A)$. On the PSD cone the nuclear-norm surrogate therefore reduces to the trace, which is linear (the trace heuristic). - Without a spectral-norm bound: Over all of $\mathbb{R}^{m \times n}$ (no $\|A\|_2 \leq 1$ constraint), the convex envelope of rank is identically zero, because $\text{rank}(tA) = \text{rank}(A)$ for every $t \neq 0$ and a convex function bounded above along every line must be constant. The bound on $\|A\|_2$ is what makes the envelope informative; on the ball $\|A\|_2 \leq M$ it is $\|A\|_* / M$. - Partially known structure: If the observation is $M = A_0 + E$, where $A_0$ is the signal (low-rank) and $E$ is noise, nuclear norm regularization $\min_A \|A - M\|_F^2 + \lambda\|A\|_*$ soft-thresholds the singular values of $M$ at level $\lambda/2$ (shrinkage, not hard thresholding).
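The soft-thresholding behaviour of the regularized problem can be sketched as follows. This is an illustration under stated assumptions, not a production solver: the function names `svt` and `objective` are hypothetical, and the threshold $\lambda/2$ follows from the objective $\|A - M\|_F^2 + \lambda\|A\|_*$ having no factor $1/2$ on the quadratic term.

```python
import numpy as np

def svt(m, lam):
    """Minimizer of ||A - M||_F^2 + lam * ||A||_* by singular value soft-thresholding.

    Because the quadratic term carries no 1/2, each singular value is
    reduced by lam / 2 and clipped at zero.
    """
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    s_shrunk = np.maximum(s - lam / 2.0, 0.0)
    return (u * s_shrunk) @ vt, s_shrunk

def objective(a, m, lam):
    """Value of ||A - M||_F^2 + lam * ||A||_*."""
    nuc = np.linalg.svd(a, compute_uv=False).sum()
    return np.linalg.norm(a - m, "fro") ** 2 + lam * nuc

rng = np.random.default_rng(1)
# Spectrum close to (10, 4, 1) plus a small perturbation.
m = np.diag([10.0, 4.0, 1.0]) + 0.01 * rng.standard_normal((3, 3))
lam = 4.0
a_hat, s_hat = svt(m, lam)

# The smallest singular value falls below lam / 2 = 2 and is set to zero,
# so the solution has rank 2; the two larger ones are each reduced by 2.
print(s_hat, np.linalg.matrix_rank(a_hat, tol=1e-8))
```

Comparing `objective(a_hat, m, lam)` with the objective at nearby perturbed matrices is a quick sanity check that the closed form is indeed the minimizer.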

Failure mode analysis. - (1) Over-shrinkage at high noise: In regularization $\min_A \|A - M\|_F^2 + \lambda\|A\|_*$, if $\lambda$ is too large, even large singular values are shrunk substantially. Example: a true rank-1 matrix $M = \sigma_1 u v^\top$ with $\sigma_1 = 100$ is observed as $M + E$ with $\|E\|_F = 10$. The solution soft-thresholds each singular value of the observation at $\lambda/2$, so $\lambda = 50$ reduces the leading singular value from about 100 to about 75: the rank is recovered, but the magnitude is biased downward by roughly a quarter. Remedy: cross-validation for $\lambda$ or use Bayesian priors for adaptivity. - (2) Sparsity loss in low-noise regime: Matrix completion with few observations may return dense matrices (all entries nonzero) rather than the original sparse-plus-low-rank structure, because the nuclear norm does not enforce sparsity. Remedy: combine the nuclear norm with an $\ell^1$ term (Robust PCA) or with structured sparsity penalties (block-sparsity). - (3) Sample complexity phase transition: For matrix completion, below a threshold on the order of $nr \log^2 n$ samples, recovery fails with high probability; above it, recovery succeeds. Practitioners often use far more data to ensure robustness (5–10× the theoretical minimum), increasing costs. - (4) Computational cost of large-scale optimization: Computing $\|A\|_*$ requires an SVD (cost $O(mn^2)$ for an $m \times n$ matrix with $m \geq n$), and each iteration of a proximal gradient method requires one as well. For $m, n \sim 10^6$, this becomes prohibitive. Remedy: randomized SVD, sketching, or first-order methods avoiding explicit SVD. - (5) Non-convexity leakage into rank selection: While the nuclear norm convexly relaxes rank, you must still choose the regularization parameter $\lambda$ or the constraint level $t$ in $\|A\|_* \leq t$, which reintroduces a model-selection problem outside the convex program (hyperparameter tuning). Remedy: model selection via cross-validation or information criteria (AIC, BIC for low-rank matrices).

Historical context. Convex relaxations of rank date to Fazel (2002, PhD thesis) and were developed into an exact-recovery theory by Candès & Recht (2009) in the context of matrix completion from incomplete observations. Concurrently, Wright, Ganesh, Rao, Peng & Ma (2009) and Candès, Li, Ma & Wright (2011, JACM) developed robust PCA as an application. Nuclear norm also connects to operator theory (nuclear operators in functional analysis). Later extensions: tensors (Sidiropoulos & Giannakis, 2014), structured sparsity (Bach, 2013), adaptive sampling (Haldar & Hernando, 2009).

Traps. - (1) Confusing rank minimization with rank recovery: $\min \|A\|_*$ solves the convex relaxation, not the original rank minimization $\min \text{rank}(A)$. Under special conditions (incoherence, sampling density), solutions coincide, but not always. Example: a full-rank noise-like matrix has low nuclear norm (small singular values), so $\min_{\|A-M\|_F \leq \delta} \|A\|_*$ may produce rank-1 solutions even if the true rank is higher. - (2) Ignoring incoherence conditions in matrix completion: Candès-Recht recovery theory assumes the true low-rank matrix has “incoherent” singular vectors (not aligned with the standard basis). A matrix with a few entries vastly larger than others may violate incoherence, causing recovery to fail. Remedy: check incoherence empirically or use robust variants. - (3) Using nuclear norm when structure is not low-rank: If the data is actually dense and full-rank (e.g., i.i.d. Gaussian noise), nuclear norm relaxation adds spurious regularization, degrading performance. Remedy: hypothesis testing for low-rank structure or using alternative regularizers (Frobenius/spectral norms). - (4) Applying on full-rank matrices naively: Solving $\min_A \|A - M\|_F^2 + \lambda\|A\|_*$ soft-thresholds every singular value of $M$ at $\lambda/2$. If all singular values exceed $\lambda/2$, each is reduced by the same amount and the rank does not drop; the rank falls only where singular values lie below the threshold. The method helps when the spectrum shows a low-rank signal above a noise floor. Remedy: inspect the spectrum of $M$ and choose $\lambda$ relative to the noise level. - (5) Forgetting that nuclear norm is not the only convex regularizer: In some contexts, the spectral norm $\|A\|_2$ or the Frobenius norm $\|A\|_F$ might be more appropriate (e.g., for Lipschitz constraints or stability). The nuclear norm is the tightest convex relaxation of rank on the spectral-norm ball, but it is not universally best for all low-rank tasks. Remedy: consider problem structure and alternative norms before defaulting to nuclear norm.

B.3

Full formal proof. Let $A = U\Sigma V^\top$ with $\Sigma = \text{diag}(\sigma_1 \geq \cdots \geq \sigma_r > 0)$, where $U$ and $V$ have orthonormal columns. By the Eckart–Young theorem, for $k < r$ a rank-$k$ approximation minimizing the Frobenius norm is $A_k = U_k \Sigma_k V_k^\top$ with error $\|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2$. The key question concerns uniqueness of this minimizer.

Case 1: Spectral gap at index $k$ ($\sigma_k > \sigma_{k+1}$). Any rank-$k$ minimizer $X$ must achieve the error $\sum_{i=k+1}^r \sigma_i^2$. Tracing the equality conditions in the Eckart–Young argument shows this is possible only if the column space of $X$ is the dominant $k$-dimensional left singular subspace of $A$ and its row space is the corresponding right singular subspace. Since $\sigma_k > \sigma_{k+1}$, that dominant subspace is unique (it is the invariant subspace of $AA^\top$ for the eigenvalues $\sigma_1^2, \ldots, \sigma_k^2$, separated by a gap from $\sigma_{k+1}^2$), even if some of $\sigma_1, \ldots, \sigma_k$ coincide among themselves. Therefore, any rank-$k$ minimizer must have the form $X = U_k M V_k^\top$ for some $M \in \mathbb{R}^{k \times k}$. For matrices of this form, $\|A - X\|_F^2 = \|\Sigma_k - M\|_F^2 + \sum_{i=k+1}^r \sigma_i^2$, which is minimized only at $M = \Sigma_k$. Thus, the minimizer is unique: $X = U_k \Sigma_k V_k^\top = A_k$.

Case 2: Tie at index $k$ ($\sigma_k = \sigma_{k+1}$). The eigenvalue $\sigma_k^2$ of $AA^\top$ is repeated, so the span of the eigenvectors with eigenvalue at least $\sigma_k^2$ has dimension at least $k+1$, and there is no unique dominant $k$-dimensional subspace. Concretely, for any angle $\theta$ set $\tilde{u} = \cos\theta\, u_k + \sin\theta\, u_{k+1}$ and $\tilde{v} = \cos\theta\, v_k + \sin\theta\, v_{k+1}$, and define $X_\theta = \sum_{i<k} \sigma_i u_i v_i^\top + \sigma_k \tilde{u} \tilde{v}^\top$. Each $X_\theta$ has rank $k$. Because $u_k v_k^\top + u_{k+1} v_{k+1}^\top = \tilde{u}\tilde{v}^\top + \tilde{u}_\perp \tilde{v}_\perp^\top$ with $\tilde{u}_\perp = -\sin\theta\, u_k + \cos\theta\, u_{k+1}$ and $\tilde{v}_\perp = -\sin\theta\, v_k + \cos\theta\, v_{k+1}$, the residual is $A - X_\theta = \sigma_k \tilde{u}_\perp \tilde{v}_\perp^\top + \sum_{i>k+1} \sigma_i u_i v_i^\top$, whose squared Frobenius norm is $\sigma_k^2 + \sum_{i>k+1} \sigma_i^2 = \sum_{i=k+1}^r \sigma_i^2$. Therefore, infinitely many minimizers exist when the tie straddles index $k$: they differ in which direction of the tied block is retained, not merely in the choice of basis.

Proof strategy & techniques. The uniqueness result hinges on the spectral gap: when $\sigma_k \neq \sigma_{k+1}$, the dominant $k$-dimensional singular subspace is isolated. The error decomposition $\|A - X\|_F^2 = \|A\|_F^2 - 2\text{tr}(A^\top X) + \|X\|_F^2$ shows that, for a fixed column space with orthogonal projector $P$, the best $X$ is $PA$ with error $\|A\|_F^2 - \|PA\|_F^2$, so minimizing over rank-$k$ matrices reduces to choosing the $k$-dimensional subspace that maximizes $\|PA\|_F^2$. When that subspace is unique (spectral gap), the solution is unique; when it is degenerate (tie at index $k$), a continuum of different subspaces is equally optimal.

Computational validation. - (a) Distinct spectrum: Let $A = \text{diag}(10, 5, 1, 0.1) \in \mathbb{R}^{4 \times 4}$. For $k=2$, the top-2 approximation is $A_2 = \text{diag}(10, 5, 0, 0)$. Error: $1^2 + 0.1^2 = 1.01$. Any other rank-2 approximation (e.g., a rotated version) has error > 1.01. ✓ Unique minimizer. - (b) Tied spectrum: Let $A = \text{diag}(10, 5, 5, 0.1) \in \mathbb{R}^{4 \times 4}$. For $k=2$, every matrix $X_\theta = U_2 \,\text{diag}(10, 5)\, V_2^\top$ with $U_2 = V_2 = [e_1, \; \cos\theta\, e_2 + \sin\theta\, e_3]$ is a minimizer. Error: $5^2 + 0.1^2 = 25.01$ regardless of the direction chosen within the span of $e_2, e_3$. ✓ Non-unique minimizers (continuous family). - (c) Numerical verification: SVD on the Iris dataset ($n=150, d=4$): For $k=2$, check $\sigma_2 > \sigma_3$ (for the mean-centered data, $\sigma_2 \approx 6.0$ and $\sigma_3 \approx 3.4$). ✓ Uniqueness verified.
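The tied-spectrum case can be checked directly. The sketch below, assuming NumPy, builds several rank-2 candidates for $A = \text{diag}(10, 5, 5, 0.1)$ that keep $e_1$ plus one unit direction inside the span of $e_2, e_3$; the function name `rank2_candidate` and the chosen angles are illustrative.

```python
import numpy as np

A = np.diag([10.0, 5.0, 5.0, 0.1])
e = np.eye(4)

def rank2_candidate(theta):
    """Keep e1 (sigma = 10) plus one unit direction inside the tied span{e2, e3}."""
    w = np.cos(theta) * e[:, 1] + np.sin(theta) * e[:, 2]
    return 10.0 * np.outer(e[:, 0], e[:, 0]) + 5.0 * np.outer(w, w)

thetas = [0.0, 0.3, np.pi / 4, np.pi / 2]
candidates = [rank2_candidate(t) for t in thetas]

# Squared Frobenius error of each candidate; all equal 5^2 + 0.1^2 = 25.01.
errors = [np.linalg.norm(A - X, "fro") ** 2 for X in candidates]
ranks = [np.linalg.matrix_rank(X, tol=1e-10) for X in candidates]

print(errors, ranks)
```

The candidates are genuinely different matrices (for example, $\theta = 0$ keeps $e_2$ while $\theta = \pi/2$ keeps $e_3$), yet they share the same rank and the same error.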

ML interpretation. The uniqueness of PCA subspace is fundamental for reproducibility and interpretability: (1) Numerical stability: When $\sigma_k > \sigma_{k+1}$, perturbations in the data lead to stable principal components—small changes in $A$ cause small changes in the top-$k$ subspace (Wedin’s perturbation theorem). When singular values are close (small spectral gap), numerical algorithms may pick unstable directions. (2) PCA identifiability: If the true generative model has distinct principal directions, PCA uniquely recovers them from infinite data (asymptotically). If hidden directions have tied variances (e.g., two latent factors with equal variance), the subspace is identifiable but the basis is not—an arbitrary rotation cannot be distinguished from the data. (3) Component interpretation: Practitioners often interpret the top-$k$ principal components as “meaningful” latent factors. This interpretation is robust only when $\sigma_k \gg \sigma_{k+1}$. Small spectral gaps (e.g., $\sigma_2 \approx 0.9 \sigma_1$) indicate the second component might be an artifact of labeling order, not a real underlying pattern. (4) Spiked population model (Baik-Péché, 2005): In random matrix theory, when signal and noise have comparable magnitude, phase transitions occur. Below a critical SNR, even principal subspaces are inconsistent (spiked eigenvalue appears in noise regime). Above the transition, uniqueness and consistency hold, but with a predicted perturbation angle depending on the gap.

Generalization & edge cases. - Multiple ties: If $\sigma_k = \sigma_{k+1} = \cdots = \sigma_{k+m}$, the tied block has multiplicity at least $m+1$ and straddles the cut at index $k$. Any orthonormal choice of the required number of directions within the block's singular subspace gives a rank-$k$ minimizer, so the family of minimizers grows with the multiplicity. - Approximate ties: When $\sigma_k - \sigma_{k+1} = \epsilon > 0$ is small, the minimizer is unique but ill-conditioned: a perturbation of size comparable to $\epsilon$ can rotate the dominant subspace by a large angle. In floating point this becomes visible once $\epsilon$ approaches machine precision times $\sigma_1$. - Rank-1 case: When $k=1$ and $\sigma_1 > \sigma_2$, the top-1 minimizer is uniquely $\sigma_1 u_1 v_1^\top$. If $\sigma_1$ is the only nonzero singular value (rank-1 data), the solution is trivially unique. - Zero singular values: When $r < \min(m, n)$ (rank-deficient matrix), the singular vectors spanning the null space are arbitrary, but they do not enter $A_k$. For $k < r$, the top-$k$ approximation remains unique if $\sigma_k > \sigma_{k+1}$. For $k \geq r$, the best approximation of rank at most $k$ is $A$ itself (zero error, unique); no matrix of rank exactly $k > r$ attains the infimum, since adding any nonzero term to $A$ increases the error.

Failure mode analysis. - (1) Rare but real ties in practice: Data from real systems rarely exhibit exact ties due to noise and measurement error. However, nearly isotropic structure (e.g., the sample covariance of Gaussian data whose population eigenvalues almost coincide) creates effective ties. Example: $X \sim N(0, \Sigma)$ with $\Sigma = \text{diag}(1, 1 + 10^{-6}, 1 + 10^{-5})$, then empirical $S = (1/n) X^\top X$ has eigenvalues that are close and may swap order under resampling. - (2) Non-reproducibility across algorithms: Different SVD implementations (dense vs. sparse, QR-based vs. Jacobi, randomized vs. deterministic) may return different singular vector bases when singular values are close, causing apparent inconsistency between different systems. Remedy: use the same library/algorithm pipeline or monitor the spectral gaps. - (3) Overfitting to one representative minimizer: When ties exist, practitioners often arbitrarily pick one minimizer (e.g., the SVD output from numpy). Resampling or cross-validation may pick a different (equally optimal) basis, falsely suggesting instability. Remedy: when the tie lies entirely inside the retained block, the subspace is unique although the basis is not, so use subspace-level comparisons (canonical angles, distances between projectors). - (4) False confidence in component interpretation: A researcher sees $\sigma_1 > \sigma_2 > \cdots$ and assumes the components are “real” and meaningful. If the gap between $\sigma_2$ and $\sigma_3$ is small, confidence in uniqueness of the second component is misplaced. Remedy: plot the spectrum and identify clear gaps; use hypothesis testing or spiked model fitting to assess significance. - (5) Ignoring effective dimension: The effective dimension/rank of $A$ depends on the threshold for “nonzero” singular values. If $\sigma_k / \sigma_1 > 10^{-2}$ but $\sigma_{k+1} / \sigma_1 < 10^{-3}$, one might claim rank is $k$ or $k+1$ depending on the threshold. This ambiguity is especially problematic for ill-conditioned matrices. Remedy: use gap-based rank selection (largest jump in spectrum) or regularized rank estimators (effective rank, G-rank).
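A subspace-level comparison of the kind recommended in item (3) can be sketched as follows, assuming NumPy. The function `principal_angles` is an illustrative helper written for this example (SciPy offers `scipy.linalg.subspace_angles` for the same purpose); the rotation angle and dimensions are arbitrary.

```python
import numpy as np

def principal_angles(u1, u2):
    """Principal (canonical) angles in radians between span(u1) and span(u2).

    Both inputs must have orthonormal columns; the cosines of the angles are
    the singular values of u1.T @ u2.
    """
    s = np.linalg.svd(u1.T @ u2, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

rng = np.random.default_rng(2)

# Orthonormal basis of a 2-D subspace of R^5.
u, _ = np.linalg.qr(rng.standard_normal((5, 2)))

# The same subspace in a different basis: rotate within the plane, flip one sign.
t = 0.7
rot = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
u_rot = u @ rot
u_rot[:, 0] *= -1.0

naive_gap = np.linalg.norm(u - u_rot)                      # basis-level: large
angles_same = principal_angles(u, u_rot)                   # subspace-level: ~0
proj_gap = np.linalg.norm(u @ u.T - u_rot @ u_rot.T)       # projector distance: ~0

print(naive_gap, angles_same, proj_gap)
```

The entrywise difference between the two bases is large even though they span the same subspace, which is why a direct comparison of SVD outputs can falsely suggest instability.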

Historical context. Eckart & Young (1936) established the optimality of truncated SVD for Frobenius-norm approximation but did not address uniqueness explicitly. Mirsky (1960) refined the result and extended it to other norms. Horn & Johnson (1985) gave detailed accounts of uniqueness conditions for invariant subspaces. Wedin (1972) characterized perturbation of singular subspaces, quantifying how the top-$k$ subspace varies with perturbations, measured by angles between subspaces. Kahan & Jiang (2008) revisited deflation and orthogonal iteration for eigenvalue problems, clarifying ties and near-ties. In machine learning, the significance of spectral gaps for consistency of PCA under high-dimensional regimes was rigorously addressed by Baik, Ben Arous & Péché (2005), showing phase transitions in the spiked covariance model.

Traps. - (1) Assuming ties are rare: In finite-precision computation and with high-dimensional noise, near-ties are more common than theoretical analysis suggests. Non-uniqueness is not a pathological edge case. - (2) Using SVD output as ground truth: The SVD algorithm returns one particular orthonormal basis, determined by implementation details. When ties exist, this is just one of infinitely many equally valid bases. Comparing two SVD outputs directly (e.g., principal components from two software packages) is invalid without accounting for sign flips and potential rotations within degenerate blocks. - (3) Confusing minimizer uniqueness with basis uniqueness: When ties occur only among $\sigma_1, \ldots, \sigma_k$ (with $\sigma_k > \sigma_{k+1}$), the subspace $\text{span}(U_k)$ and the minimizer $A_k$ are unique, but the basis $U_k$ is not. When the tie straddles the cut ($\sigma_k = \sigma_{k+1}$), neither the subspace nor the minimizer is unique. Always work with subspaces (e.g., via canonical angles or projection operators), not bases, when assessing uniqueness. - (4) Ignoring effective dimension / condition number: With condition number $\kappa \approx 10^{10}$, roughly 10 of the 16 significant digits of double precision are lost in solves, and singular values below about $\epsilon_m \sigma_1$ ($\epsilon_m \approx 10^{-16}$) are indistinguishable from rounding noise. The effective rank may be far smaller than the algebraic rank. Remedy: compute the spectrum and identify the “knee” or elbow (largest gap). - (5) Not checking that a critical point is the minimizer: The Frobenius error is $\|A - X\|_F^2 = \|A\|_F^2 - 2\text{tr}(A^\top X) + \|X\|_F^2$. On the set of rank-$k$ matrices, every $X$ assembled from any $k$ singular triplets of $A$ is a critical point, but only the choices built from the $k$ largest singular values are minima; the rest are saddle points. An iterative solver that stalls at such a point has not found $A_k$. Remedy: compare the achieved error with $\sum_{i>k} \sigma_i^2$.

B.4

Full formal proof. Let $A = U\Sigma V^\top \in \mathbb{R}^{m \times n}$ be the SVD with $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_r)$. The spectral norm is: \[ \|A\|_2 = \max_{\|x\|_2=1, \|y\|_2=1} |y^\top A x| = \max_{\|x\|_2=1} \|Ax\|_2. \] Express any unit vector $x = V z$ where $\|z\|_2 = 1$, and $y = U w$ where $\|w\|_2 = 1$. Then: \[ y^\top A x = w^\top U^\top U \Sigma V^\top V z = w^\top \Sigma z = \sum_{i=1}^r w_i z_i \sigma_i. \] By Cauchy–Schwarz, $|w^\top \Sigma z| \leq \|w\|_2 \|\Sigma z\|_2 \leq 1 \cdot \sigma_1 \|z\|_2 = \sigma_1$, with equality when $w = \text{sign}(\sigma_1 z_1) e_1$ and $z = e_1$. Thus: \[ \|A\|_2 = \sigma_1 = \max_i \sigma_i. \]

Proof strategy & techniques. Reduce to the SVD coordinate system where $A$ is diagonal; recognize that the Cauchy–Schwarz inequality on the diagonal matrix achieves equality when both vectors align with the dominant singular value. The spectral norm is the “worst-case” amplification under the linear map $A$.

Computational validation. - (a) Diagonal matrix: Let $A = \text{diag}(3, 1, 0.5) \in \mathbb{R}^{3 \times 3}$. Then $\sigma_1 = 3$ and $\|A\|_2 = 3$. Verify: $\|A e_1\|_2 = 3$, while $\|A e_i\|_2 \lt 3$ for $i \gt 1$. ✓ - (b) Rank-1 matrix: Let $A = u v^\top$ with $\|u\|_2 = \|v\|_2 = 1$. Then $Av = u$, so $\|Av\|_2 = 1 = \sigma_1$. For other unit vectors $x$, $\|Ax\|_2 = |v^\top x| \leq 1$. ✓ - (c) Random matrix: $A = U D V^\top$ with $D = \text{diag}(5, 2, 0.5)$ and random orthonormal $U, V$. Compute $\|A\|_2$ via SVD: 5. Verify via power iteration on $A^\top A$: converges to leading eigenvalue 25 (squared spectral norm). Square root: 5. ✓
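Check (c) can be sketched in code. The following assumes NumPy; `spectral_norm_power` is an illustrative helper, and the fixed iteration count is a simplification (a practical routine would test for convergence).

```python
import numpy as np

def spectral_norm_power(a, iters=500, seed=0):
    """Estimate ||a||_2 by power iteration on a.T @ a."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(a.shape[1])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = a.T @ (a @ x)           # one multiplication by a.T @ a
        x = y / np.linalg.norm(y)   # renormalize to a unit vector
    return np.linalg.norm(a @ x)    # ||a x||_2 for the converged unit vector

rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.standard_normal((6, 3)))
V, _ = np.linalg.qr(rng.standard_normal((4, 3)))
A = U @ np.diag([5.0, 2.0, 0.5]) @ V.T

est = spectral_norm_power(A)
ref = np.linalg.norm(A, 2)  # largest singular value, computed via SVD

print(est, ref)
```

Applying `a.T @ (a @ x)` as two matrix-vector products avoids forming $A^\top A$ explicitly, which matters for large or sparse matrices.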

ML interpretation. The spectral norm is the Lipschitz constant of the linear map $x \mapsto Ax$: (1) Neural network robustness: In a neural network layer $h = W x + b$ with weight matrix $W$, the spectral norm $\|W\|_2$ bounds the per-layer amplification of perturbations: $\|W(x + \delta) - Wx\|_2 \leq \|W\|_2 \|\delta\|_2$. If $\|W\|_2 \leq 1$, input changes are not amplified. Stacking $L$ layers with spectral norms $\leq 1$ and 1-Lipschitz activations (such as ReLU) ensures $\|h^{(L)}(x) - h^{(L)}(x')\|_2 \leq \|x - x'\|_2$ (bounded propagation). (2) GANs with spectral normalization: Miyato et al. (2018) proposed rescaling each discriminator weight matrix to $W / \|W\|_2$, with the norm estimated by power iteration, stabilizing GAN training by controlling the Lipschitz constant and avoiding gradient explosion. (3) Adversarial robustness: An adversary wants to find $\delta$ with $\|\delta\|_2 \leq \epsilon$ such that the prediction changes. For a linear classifier $y = Wx$, the maximum perturbation effect is $\|W\delta\|_2 \leq \|W\|_2 \|\delta\|_2 \leq \epsilon \|W\|_2$. Bounding $\|W\|_2$ limits adversarial attack success. (4) Power iteration methods: Computing the largest eigenvalue of $A^\top A$ or $A A^\top$ via power iteration contracts the error by a factor of about $(\sigma_2 / \sigma_1)^2$ per iteration. A larger gap between $\sigma_1$ and $\sigma_2$ accelerates convergence.

Generalization & edge cases. - Complex matrices: For $A \in \mathbb{C}^{m \times n}$, the spectral norm is still $\|A\|_2 = \sigma_1(A)$, with identical proof (Cauchy–Schwarz holds for complex inner products, with $\top$ replaced by the conjugate transpose). - Tensor spectral norms: For order-3 tensors, the spectral norm is the maximum of the multilinear form $|T(x, y, z)|$ over unit vectors $x, y, z$. Unlike the matrix case, it is not given by an SVD, and computing it is NP-hard in general. - Rectangular matrices: For $m \neq n$, $A \in \mathbb{R}^{m \times n}$, the SVD still applies (the full SVD has $m$ left singular vectors and $n$ right singular vectors), and $\|A\|_2 = \sigma_1$. - Perturbation continuity: If $A_\epsilon = A + \epsilon E$ with $\|E\|_2 \leq 1$, then $|\sigma_1(A_\epsilon) - \sigma_1(A)| \leq |\epsilon|$ by Weyl's inequality, so $\|A_\epsilon\|_2$ is continuous (indeed 1-Lipschitz) in $\epsilon$, but it may fail to be differentiable where $\sigma_1 = \sigma_2$.

Failure mode analysis. - (1) Confusing spectral norm with Frobenius norm: $\|A\|_F = \sqrt{\sum_i \sigma_i^2}$ vs. $\|A\|_2 = \sigma_1$. Practitioners sometimes mix them: using $\|A\|_F$ for Lipschitz bounds gives a valid but generally loose bound, since $\|A\|_2 \leq \|A\|_F$. Remedy: verify the norm definition (induced by vector norms vs. Frobenius). - (2) Slow convergence when singular values are close: If $\sigma_1 \approx \sigma_2$, power iteration converges slowly (ratio $\sigma_2 / \sigma_1 \approx 1$). Accelerated methods (Krylov subspace, Lanczos) are needed. Remedy: use ARPACK or other specialized eigensolvers. - (3) Cost of SVD: Computing $\|A\|_2$ via SVD is numerically stable, but a full SVD of a tall or wide matrix can be expensive. For $m \gg n$, computing the full SVD of $A$ costs $O(mn^2)$; when only the leading $k$ singular values are needed, partial or randomized SVD reduces this. Remedy: use rank-limiting or randomization. - (4) Loose bounds when used in composition: A network with layers $W_1, \ldots, W_L$ has Lipschitz constant at most $\prod_i \|W_i\|_2$. The product is attained only when the leading singular directions of consecutive layers align; otherwise the true constant can be much smaller, so the bound may be pessimistic overall. Conversely, per-layer norms only slightly above 1 compound exponentially with depth. Remedy: use tighter compositions or empirical spectral measures. - (5) Forgetting max vs. min: The largest singular value is $\|A\|_2$, while for square invertible $A$ the smallest singular value satisfies $\sigma_n = 1/\|A^{-1}\|_2$. For ill-conditioned $A$, $\sigma_1 / \sigma_n$ can be large, making the system sensitive to perturbations. Remedy: monitor condition number $\kappa(A) = \sigma_1 / \sigma_n$.

Historical context. Matrix norms and their properties are classical in numerical linear algebra (Golub & Van Loan, 1989; Wilkinson, 1961). Spectral normalization for GANs (Miyato et al., 2018) popularized Lipschitz constraints in deep learning. Lipschitz neural networks (Djork & Safaei, 2021) generalize the approach to certify robustness and ensure training stability.

Traps. - (1) Assuming spectral norm equals largest entry: $\|A\|_2 \neq \max_{i,j} |A_{ij}|$ in general. Consider $A = \begin{pmatrix} 1 & 0 \\ 0 & 10 \end{pmatrix}$: largest entry is 10, spectral norm is also 10 (they agree because $A$ is diagonal). Likewise $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$: largest entry is 1, spectral norm is 1 (a permutation matrix). Yet $A = \begin{pmatrix} 10 & 1 \\ 0 & 0 \end{pmatrix}$: largest entry is 10, spectral norm is $\sqrt{101} \approx 10.05$ (close but not equal), and $A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$ has largest entry 1 but spectral norm 2. In general $\max_{i,j} |A_{ij}| \leq \|A\|_2 \leq \sqrt{mn}\, \max_{i,j} |A_{ij}|$. Remedy: always compute via SVD, not max. - (2) Not verifying norm inequalities in composition: Claiming $\|AB\|_2 \leq \|A\|_2 \|B\|_2$ is correct (submultiplicativity), but the converse (if the product is small, the factors are small) is false. Remedy: check the direction of every inequality. - (3) Confusing eigenvalues and singular values: For general $A$, eigenvalues $\lambda_i$ and singular values $\sigma_i$ can differ substantially; only $\rho(A) \leq \|A\|_2$ holds in general. The spectral radius $\rho(A)$ (largest magnitude of eigenvalues) equals the spectral norm $\|A\|_2 = \sigma_1$ when $A$ is normal, but not otherwise. Example: $A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$ has $\lambda_1 = \lambda_2 = 0$ (spectral radius 0), but $\|A\|_2 = 1$. Remedy: always distinguish spectral norm from spectral radius; use SVD for $\|A\|_2$, eigenvalues for $\rho(A)$. - (4) Forgetting that the norm is over unit vectors: The spectral norm is defined as $\max_{\|x\|_2=1} \|Ax\|_2$, not the maximum over all vectors. Scaling $x$ scales the output linearly, so for a general vector the bound is $\|Ax\|_2 \leq \|A\|_2 \|x\|_2$; dropping the factor $\|x\|_2$ is the error. - (5) Assuming additivity: $\|A + B\|_2 \leq \|A\|_2 + \|B\|_2$ (triangle inequality holds), but in general $\|A + B\|_2 \neq \|A\|_2 + \|B\|_2$ (not additive). This matters when analyzing perturbations or approximation errors where intuition might suggest linear stacking.
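The distinction between largest entry, spectral radius, and spectral norm can be made concrete with a few small matrices. This is a minimal sketch assuming NumPy; the helper `summarize` and the third test matrix are illustrative choices, not taken from the text.

```python
import numpy as np

def summarize(a):
    """Return (max |entry|, spectral radius, spectral norm) of a square matrix."""
    max_entry = np.abs(a).max()
    radius = np.abs(np.linalg.eigvals(a)).max()
    norm2 = np.linalg.norm(a, 2)
    return max_entry, radius, norm2

# Nilpotent matrix: both eigenvalues are 0, yet the spectral norm is 1.
nilpotent = np.array([[0.0, 1.0], [0.0, 0.0]])

# All-ones matrix: largest entry 1, spectral norm 2 (symmetric, so radius = norm).
ones = np.ones((2, 2))

# Non-normal upper-triangular matrix: all three quantities differ.
# Eigenvalues are both 1; the spectral norm is 2 + sqrt(5).
shear = np.array([[1.0, 4.0], [0.0, 1.0]])

nil_stats = summarize(nilpotent)
ones_stats = summarize(ones)
shear_stats = summarize(shear)

print(nil_stats, ones_stats, shear_stats)
```

For the shear matrix, $A^\top A$ has trace 18 and determinant 1, so its largest eigenvalue is $9 + 4\sqrt{5} = (2 + \sqrt{5})^2$, which gives the stated norm.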

B.5

Full formal proof. Let $A = U\Sigma V^\top$ with singular values $\sigma_1, \ldots, \sigma_r > 0$. The Frobenius norm is: \[ \|A\|_F^2 = \text{tr}(A^\top A) = \text{tr}(\Sigma^2) = \sum_{i=1}^r \sigma_i^2. \] The spectral norm is $\|A\|_2 = \sigma_1$. Each term satisfies $\sigma_i^2 \leq \sigma_1^2$, so: \[ \|A\|_F^2 = \sum_{i=1}^r \sigma_i^2 \leq r \sigma_1^2 = r \|A\|_2^2. \] Taking square roots: \[ \|A\|_F \leq \sqrt{r} \|A\|_2 = \sqrt{\text{rank}(A)} \|A\|_2. \] Equality holds iff all nonzero singular values are equal: $\sigma_1 = \cdots = \sigma_r$, which occurs for isotropic (uniform singular value) matrices.

Proof strategy & techniques. Compare the $\ell^2$ norm of the singular value vector $\sigma$ to its $\ell^\infty$ norm using the fundamental inequality $\|\sigma\|_2 \leq \sqrt{r} \|\sigma\|_\infty$. This follows by bounding each term by the largest one: $\sum_{i=1}^r \sigma_i^2 \leq \sum_{i=1}^r \sigma_1^2 = r \sigma_1^2$.

Computational validation. - (a) Isotropic matrix (equality case): Let $A = (1/\sqrt{2}) \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$. Singular values: both $1$. $\|A\|_F = \sqrt{1 + 1} = \sqrt{2}$, $\|A\|_2 = 1$, $\sqrt{r} \|A\|_2 = \sqrt{2}$. Equality! ✓ - (b) Rank-1 matrix: Let $A = u v^\top$ with $\|u\|_2 = \|v\|_2 = 1$. $\|A\|_F = 1$, $\|A\|_2 = 1$, $\sqrt{r} \|A\|_2 = \sqrt{1} = 1$. Equality (since rank $= 1$). ✓ - (c) Decaying spectrum: Let $A = \text{diag}(10, 5, 1)$. $\|A\|_F = \sqrt{100+25+1} = \sqrt{126} \approx 11.2$. $\|A\|_2 = 10$. $\sqrt{3} \times 10 \approx 17.3$. Indeed, $11.2 < 17.3$. ✓
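Cases (a) and (c) can be verified numerically. The sketch below assumes NumPy; `norm_report` is an illustrative helper, and the quantity it labels `stable_rank` is the ratio $\|A\|_F^2 / \|A\|_2^2$, which never exceeds the rank.

```python
import numpy as np

def norm_report(a, tol=1e-10):
    """Collect rank, Frobenius norm, spectral norm, and the sqrt(r) bound from the spectrum."""
    s = np.linalg.svd(a, compute_uv=False)
    r = int((s > tol * s[0]).sum())       # numerical rank relative to sigma_1
    fro = np.sqrt((s ** 2).sum())         # ||A||_F from singular values
    spec = s[0]                           # ||A||_2 = sigma_1
    return {
        "rank": r,
        "fro": fro,
        "spec": spec,
        "bound": np.sqrt(r) * spec,       # sqrt(rank) * ||A||_2
        "stable_rank": (fro / spec) ** 2,
    }

# (a) Isotropic case: both singular values equal 1, so the bound is attained.
iso = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)

# (c) Decaying spectrum: the bound is strict.
decay = np.diag([10.0, 5.0, 1.0])

rep_iso = norm_report(iso)
rep_decay = norm_report(decay)

print(rep_iso)
print(rep_decay)
```

For the decaying example the stable rank is $126/100 = 1.26$, well below the algebraic rank 3, which quantifies how loose the $\sqrt{r}$ factor is there.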

ML interpretation. This inequality relates two fundamental norms of feature representations and parameters: (1) Lipschitz bounds in neural networks: If a weight matrix $W$ has spectral norm $\|W\|_2$ and we measure cumulative perturbations via Frobenius norm, the bound $\|W\|_F \leq \sqrt{r} \|W\|_2$ provides a bridge between per-layer and total norm constraints. For a high-rank weight matrix (large $r$), the Frobenius norm can be much larger than the spectral norm. (2) Compression via singular values: The Frobenius-to-spectral ratio $\|A\|_F / \|A\|_2$ ranges from 1 (rank-1, all energy in one direction) to $\sqrt{r}$ (flat, isotropic spectrum). A ratio near $\sqrt{r}$ means every direction carries equal energy and truncation is costly; a small ratio signals a decaying spectrum that leaves potential rank unused, which is what makes the matrix compressible. (3) Matrix completion error bounds: If a low-rank approximation $A_k$ achieves spectral error $\epsilon_2 = \|A - A_k\|_2$, applying the inequality to the residual gives $\|A - A_k\|_F \leq \sqrt{r - k}\, \epsilon_2$, relating local (spectral) and global (Frobenius) notions of quality. (4) Sampling complexity: For matrix sensing problems, the number of measurements scales with $\text{rank}(A)$ times the condition number. The Frobenius-spectral relationship helps analyze average-case vs. worst-case error propagation.

Generalization & edge cases. - Weighted Frobenius norm: If entries of $A$ have different importance (e.g., $\|A\|_{W,F}^2 = \sum_{i,j} w_{ij} A_{ij}^2$), the relationship to spectral norm changes (no longer $\sqrt{r}$ but depends on $W$ structure). - Higher-dimensional tensors: For tensors, the Frobenius norm generalizes to the square root of the sum of squared entries; the spectral norm becomes the maximum of the multilinear form over unit vectors, which is bounded above by the largest singular value of any matricization. The relationship depends on tensor dimensions. - Schatten norms: More generally, the Schatten $p$-norms $\|A\|_{S_p} = (\sum_i \sigma_i^p)^{1/p}$ include the nuclear ($p=1$), Frobenius ($p=2$), and spectral ($p=\infty$) norms. The inequality $\|A\|_F \leq \sqrt{r} \|A\|_2$ is the case $p = 2$, $q = \infty$ of $\|A\|_{S_p} \leq r^{1/p - 1/q} \|A\|_{S_q}$ for $1 \leq p \leq q \leq \infty$. - Optimal matrices: Equality $\|A\|_F = \sqrt{r} \|A\|_2$ occurs iff all nonzero singular values are equal, that is, $A = c\, U_r V_r^\top$ with $U_r, V_r$ having $r$ orthonormal columns and $c$ constant; for a square full-rank matrix this means $A = c Q$ with $Q$ orthogonal. These matrices have the flattest possible spectrum.

Failure mode analysis. - (1) Loose bound for decaying spectra: If $\sigma_1 \gg \sigma_2 \gg \cdots \gg \sigma_r$, then $\sum_i \sigma_i^2 \ll r \sigma_1^2$, and the bound is very loose. E.g., $\sigma_i = 1/i$: $\|A\|_F^2 = \sum_{i=1}^r 1/i^2 < \pi^2/6 \approx 1.64$ while $r \|A\|_2^2 = r$ (grows with $r$), so nearly the whole $\sqrt{r}$ factor is wasted. Remedy: use tighter bounds based on effective rank or condition number. - (2) Confusing Frobenius with spectral in optimization: By the Eckart–Young–Mirsky theorem, the rank-$k$ truncated SVD minimizes both $\|A - M\|_2$ and $\|A - M\|_F$ over rank-$k$ matrices, but the optimal values differ ($\sigma_{k+1}$ vs. $\sqrt{\sum_{i>k} \sigma_i^2}$), and the spectral-norm minimizer is generally not unique. Under additional constraints (sparsity, nonnegativity, missing entries), the two norms generally produce different solutions. Remedy: use the norm appropriate for the application. - (3) Over-regularization: If regularization $\|A\|_F^2$ is used when only worst-direction (spectral) error matters, the solution may be over-penalized (shrinking all singular values proportionally instead of just the largest). Remedy: use spectral norm or adaptive regularization. - (4) Misinterpreting matrix size: The factor $\sqrt{r}$ can be as large as $\sqrt{\min(m, n)}$ and so grows with matrix dimensions. For large matrices, even if individual singular values are small, the Frobenius norm can be large. This can mislead practitioners into thinking a matrix is “small” when its energy is actually spread over many directions. Remedy: normalize or use an effective rank measure. - (5) Ignoring condition number effects: Ill-conditioning means the ratio $\sigma_1 / \sigma_{\min}$ is large; it says nothing about whether $\|A\|_2$ or $\|A\|_F$ is large. Norms measure size, conditioning measures spread, and a well-conditioned matrix can have large $\|A\|_F$. Do not conflate conditioning with norms. Remedy: examine the spectrum directly (all singular values and their gaps).

Historical context. Norm inequalities between $\ell^p$ norms of singular values are classical in functional analysis and matrix analysis (Golub & Van Loan, 1989; Horn & Johnson, 1985). The Schatten norm hierarchy $\|A\|_p = \|\sigma\|_p$ unified these concepts. In machine learning, the interplay between spectral and Frobenius norms appears in matrix factorization (Bach & Jordan, 2008), low-rank approximation (Hastie et al., 2009), and neural network analysis (Bartlett et al., 2017).

Traps. - (1) Assuming equality always: The bound $\|A\|_F \leq \sqrt{r} \|A\|_2$ is tight only when all nonzero singular values are equal (very rare in practice). For any decaying spectrum the inequality is strict and can waste most of the $\sqrt{r}$ factor. Remedy: use the stable rank $\|A\|_F^2 / \|A\|_2^2 \leq r$ for tighter analysis. - (2) Confusing with matrix norm submultiplicativity: $\|AB\|_F$ is harder to bound sharply than $\|AB\|_2 \leq \|A\|_2 \|B\|_2$. Converting both factors through $\sqrt{r}\,\|\cdot\|_2$ leads to loose product bounds. Remedy: use Frobenius-specific inequalities (e.g., $\|AB\|_F \leq \|A\|_2 \|B\|_F$ or $\|AB\|_F \leq \|A\|_F \|B\|_2$). - (3) Applying to non-square matrices carelessly: For $A \in \mathbb{R}^{m \times n}$ the rank satisfies $r \leq \min(m, n)$, with equality only for full-rank matrices. Substituting $\sqrt{m}$ or $\sqrt{n}$ for $\sqrt{r}$ makes the bound needlessly loose. Remedy: use the actual rank $r = \text{rank}(A)$, not a dimension. - (4) Forgetting that norms are operator-dependent: Matrix norms induced by different vector norms (e.g., $\ell^\infty$ vs. $\ell^2$) can differ substantially for the same matrix. Do not mix norms across different parts of an analysis. Remedy: declare the norm upfront and stick with it throughout. - (5) Misusing in approximation theory: The bound does not directly relate approximation quality to rank. A rank-$k$ approximation $A_k$ has $\|A - A_k\|_F \leq \sqrt{r-k}\, \sigma_{k+1}$, which comes from applying the inequality to the residual $A - A_k$, not to $A$ itself. Remedy: use the Eckart–Young error characterization, not this generic inequality.
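The sandwich $\|A\|_2 \leq \|A\|_F \leq \sqrt{r}\,\|A\|_2$ and the stable rank can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the helper name `norm_summary`, the relative rank tolerance, and the two test matrices are illustrative assumptions.

```python
import numpy as np

def norm_summary(A, rel_tol=1e-12):
    """Spectral norm, Frobenius norm, numerical rank and stable rank of a nonzero matrix."""
    s = np.linalg.svd(A, compute_uv=False)       # singular values, descending
    spec = s[0]                                  # ||A||_2
    fro = np.sqrt(np.sum(s**2))                  # ||A||_F
    rank = int(np.sum(s > rel_tol * spec))       # numerical rank (rel_tol is an illustrative choice)
    stable_rank = (fro / spec) ** 2              # ||A||_F^2 / ||A||_2^2, never exceeds the rank
    return spec, fro, rank, stable_rank

rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((6, 6)))      # random orthogonal factors
Q2, _ = np.linalg.qr(rng.standard_normal((6, 6)))

A_decay = Q1 @ np.diag(1.0 / np.arange(1, 7)) @ Q2.T   # decaying spectrum sigma_i = 1/i
A_flat = 3.0 * Q1                                      # all singular values equal to 3

for name, A in [("decaying", A_decay), ("flat", A_flat)]:
    spec, fro, rank, srank = norm_summary(A)
    print(f"{name}: ||A||_2={spec:.3f}  ||A||_F={fro:.3f}  "
          f"sqrt(r)*||A||_2={np.sqrt(rank) * spec:.3f}  stable rank={srank:.3f}")
```

For the flat spectrum the upper bound is attained and the stable rank equals the rank; for the decaying spectrum the stable rank stays well below $r$, which is the quantitative form of trap (1).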

B.6

Full formal proof. Let $X \in \mathbb{R}^{n \times d}$ be mean-centered data with SVD $X = U\Sigma V^\top$. The sample covariance matrix is: \[ S = \frac{1}{n} X^\top X = \frac{1}{n} V \Sigma^2 V^\top, \] where $\Sigma^2$ is shorthand for $\Sigma^\top \Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ (with $\sigma_i = 0$ for $i > \min(n, d)$). The eigenvalue decomposition of $S$ is: \[ S = V \Lambda V^\top, \quad \Lambda = \text{diag}\left(\frac{\sigma_1^2}{n}, \ldots, \frac{\sigma_d^2}{n}\right), \] where the columns of $V$ are the right singular vectors of $X$ (equivalently, the eigenvectors of $S$). Thus, the eigenvalues of $S$ are exactly the squared singular values of $X$ divided by $n$: $\lambda_i(S) = \sigma_i^2(X) / n$. Equivalently, without the $1/n$ scaling, the eigenvalues of $X^\top X$ are: \[ X^\top X = V \Sigma^2 V^\top, \quad \text{so} \quad \lambda_i(X^\top X) = \sigma_i^2(X). \]

Proof strategy & techniques. The connection is direct: covariance is a scaled version of the cross-product matrix. SVD and eigendecomposition coincide for symmetric positive semi-definite matrices (which $S$ and $X^\top X$ are). The scaling factor $1/n$ is the only distinction for $S$; $X^\top X$ drops this factor for brevity in theoretical work.

Computational validation. - (a) Synthetic data: Let $X \in \mathbb{R}^{3 \times 2}$ ($n=3$) be the rectangular diagonal matrix with diagonal entries $2$ and $1$ and a zero third row. Then $X^\top X = \text{diag}(4, 1)$, and $S = (1/3)\, \text{diag}(4, 1) = \text{diag}(4/3, 1/3)$. Singular values of $X$: 2, 1. Squared: 4, 1. Divided by 3: 4/3, 1/3, matching the eigenvalues of $S$ ✓. (This $X$ is not mean-centered; the check only exercises the algebraic identity.) - (b) Iris dataset: $X \in \mathbb{R}^{150 \times 4}$ (150 samples, 4 features, mean-centered). Singular values of $X$: approximately [25.10, 6.01, 3.41, 1.88]. Squared: [630.0, 36.2, 11.7, 3.55]. Divided by 150: [4.20, 0.241, 0.078, 0.024], which are the eigenvalues of $S$ under the $1/n$ convention ✓. - (c) Numerical stability check: Generate random $X$ and compute the eigenvalues of $S$ two ways: (i) as $\sigma_i^2/n$ from the SVD of $X$, (ii) from the explicit covariance $(1/n) X^\top X$. The two sets of eigenvalues should agree up to rounding error for well-conditioned $X$. ✓
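Check (c) can be sketched in a few lines of NumPy. The data below are synthetic and the sizes `n`, `d` are arbitrary illustrative choices; the sketch assumes the $1/n$ convention used in the proof and shows the Bessel-corrected variant for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # synthetic correlated features
Xc = X - X.mean(axis=0)                                        # mean-center each column

S = Xc.T @ Xc / n                                  # covariance with the 1/n convention
eig_S = np.linalg.eigvalsh(S)[::-1]                # eigenvalues of S, descending
sigma = np.linalg.svd(Xc, compute_uv=False)        # singular values of the centered data
eig_from_svd = sigma**2 / n                        # lambda_i = sigma_i^2 / n

print("max abs difference:", np.max(np.abs(eig_S - eig_from_svd)))

# np.cov uses the 1/(n-1) convention by default, so the divisor changes accordingly.
eig_bessel = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print("max abs difference (Bessel):", np.max(np.abs(eig_bessel - sigma**2 / (n - 1))))
```

Both differences should be at the level of rounding error relative to the largest eigenvalue.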

ML interpretation. The equivalence between SVD of data and eigendecomposition of covariance is central to understanding dimensionality reduction: (1) PCA via covariance: Practitioners often compute PCA by eigendecomposing the sample covariance $S = (1/n) X^\top X$ rather than computing the SVD of $X$ directly. The two routes are mathematically equivalent but differ in cost and numerical stability. For $d \ll n$ (many more samples than features), forming $S$ costs $O(nd^2)$ and its eigendecomposition costs $O(d^3)$, while a thin SVD of $X$ also costs $O(nd^2)$ but with a larger constant. However, forming $S$ squares the condition number ($\kappa(S) = \kappa(X^\top X) = \kappa(X)^2$), which can destroy accuracy in the small eigenvalues. (2) Eigenvalues as variance: Eigenvalues of $S$ represent explained variance. The ratio $\lambda_i / \sum_j \lambda_j$ gives the proportion of variance in direction $v_i$. (3) Scaling and interpretability: The factor $1/n$ in $S$ puts eigenvalues on the scale of data variance (squared original units). Eigenvalues of $X^\top X$ grow with sample size (larger $n$ inflates values) and are less directly interpretable. (4) Relationship to empirical PCA: Sample PCA estimates the population covariance $\Sigma_\text{pop}$ via $S$. For i.i.d. data with the mean estimated from the sample, the $1/(n-1)$ version of $S$ is unbiased, while the $1/n$ version is biased by the factor $(n-1)/n$; for Gaussian data the scaled sample covariance follows a Wishart distribution. SVD of $X$ directly performs PCA on the empirical distribution.

Generalization & edge cases. - Bessel-corrected covariance: Some software defines covariance as $(1/(n-1)) X^\top X$ (Bessel's correction for unbiased estimation). Eigenvalues then are $\sigma_i^2 / (n-1)$. Remedy: check software documentation for the scaling factor. - Correlation matrix: If data are standardized to unit variance (correlation = covariance of standardized data), the relationship still holds: eigenvalues of the correlation matrix are scaled squared singular values of the standardized $X$. - Rank-deficient overparameterization: If $d > n$, $X^\top X$ has rank $\leq n$ (and $\leq n-1$ after centering), so at most $n$ eigenvalues are nonzero and the remaining ones are zero. The SVD of $X$ handles this naturally (at most $\min(n, d)$ nonzero singular values), but dense eigendecomposition of $S$ may return small nonzero values, possibly negative, in place of the exact zeros. - Weighted covariance: If observations have weights $w_i$, define $S_w = X^\top D_w X / (\sum_i w_i)$ where $D_w = \text{diag}(w_i)$. This is the weighted covariance, and its eigenvalues are the squared singular values of $D_w^{1/2} X$ divided by $\sum_i w_i$.

Failure mode analysis. - (1) Numerical instability from squaring the condition number: Computing the SVD of $X$ (condition number $\kappa(X)$) is more stable than eigendecomposing $S = X^\top X / n$ (condition number $\kappa(S) = \kappa(X)^2$). For ill-conditioned $X$ ($\kappa(X) > 10^8$), $S$ is numerically singular in double precision, and small perturbations cause large relative errors in the small eigenvalues. Example: $X$ with $\sigma_1 / \sigma_d = 10^{6}$ gives $\kappa(S) \approx 10^{12}$, leaving only about four accurate digits in the smallest eigenvalue. Remedy: prefer SVD-based PCA on $X$ over eigendecomposition of $S$. - (2) Centering errors propagate: If $X$ is not properly centered, the first eigenvalue of $S$ is inflated (it captures the mean instead of the variance). Example: uncentered $X = [1; 1; 2; 2]$ (each row a scalar observation). Centering removes the mean $1.5$, giving $X_c = [-0.5; -0.5; 0.5; 0.5]$ and $\hat{\lambda}_1 = 0.25$ (the correct variance). Without centering, $(1/n) X^\top X = (1 + 1 + 4 + 4)/4 = 2.5 = 0.25 + 1.5^2$: the squared mean dominates and misleads the analysis. Remedy: always center before PCA. - (3) Scale-dependent interpretation: Eigenvalues of $S$ depend on feature scale. Income in dollars (0–200,000) dominates height in mm (1500–2000) purely because of its units. The “first principal direction” then captures scale, not structure. Remedy: standardize features to unit variance (use correlation, not covariance). - (4) Sample size effects on estimation: For small $n$ relative to $d$, the sample covariance $S$ is rank-deficient and has high estimation variance (Wishart noise). The sample eigenvalues are biased: the largest are overestimated and the smallest underestimated, so the sample spectrum is more spread out than the population spectrum. This is the “high-dimensional regime” where $d / n$ is not negligible. Remedy: use shrinkage covariance estimators (e.g., Ledoit–Wolf) or random matrix theory corrections. - (5) Confusion of data variance vs. reconstruction error: The eigenvalue $\lambda_i$ of $S$ indicates the variance explained by component $v_i$, which is not the same as reconstruction error. For rank-$k$ PCA, the squared Frobenius reconstruction error is $\sum_{i>k} \sigma_i^2 = n \sum_{i>k} \lambda_i$, the discarded variance only, not $\sum_i \lambda_i$. Remedy: distinguish between variance (how much the component matters) and error (how much we lose by truncation).
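Failure modes (2) and (5) are easy to reproduce. The sketch below uses the four scalar observations from the centering example and a small synthetic data matrix; the sizes, seed, and column scales are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

# (2) Centering: four scalar observations.
x = np.array([1.0, 1.0, 2.0, 2.0])
var_centered = np.mean((x - x.mean()) ** 2)   # variance about the mean
second_moment = np.mean(x ** 2)               # what an uncentered "covariance" computes
print(var_centered, second_moment, var_centered + x.mean() ** 2)

# (5) Explained variance versus rank-k reconstruction error.
rng = np.random.default_rng(1)
n, d, k = 50, 4, 2
X = rng.standard_normal((n, d)) * np.array([5.0, 2.0, 1.0, 0.5])  # columns with different scales
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
lam = s**2 / n                                      # eigenvalues of S = Xc^T Xc / n
X_k = (U[:, :k] * s[:k]) @ Vt[:k]                   # rank-k PCA reconstruction
recon_error = np.linalg.norm(Xc - X_k, "fro") ** 2  # squared Frobenius error
explained = lam[:k].sum() / lam.sum()               # fraction of variance retained
print(recon_error, n * lam[k:].sum(), explained)
```

The uncentered second moment equals the variance plus the squared mean, and the reconstruction error equals $n$ times the discarded eigenvalues rather than the total variance.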

Historical context. The relationship between SVD and eigendecomposition of covariance/Gram matrices is classical (Pearson, 1901; Hotelling, 1933). Gram matrix analysis is foundational in signal processing (Karhunen–Loève expansion, 1940s). Modern statistical learning emphasizes direct SVD for numerical stability (Hastie et al., 2009; Golub & Van Loan, 1989).

Traps. - (1) Assuming covariance eigenvalues = SVD singular values: They are related by squaring and scaling ($\lambda_i = \sigma_i^2 / n$), not identical. Using one in place of the other leads to scale errors. Remedy: track the square and the factor $1/n$ explicitly. - (2) Reconstructing from the eigendecomposition alone: Given only eigenvalues and eigenvectors of $S$, do not reconstruct $X$ or make predictions without accounting for the scaling and centering that were applied. Discarding this information loses the absolute scale and location of the data. Remedy: store the mean and standard deviations separately if needed. - (3) Mixing up $S$ and $X^\top X$: The right singular vectors of $X$ are eigenvectors of both $S = (1/n) X^\top X$ and $X^\top X$ (same eigenvectors, eigenvalues differing by the factor $n$). Software conventions differ on which matrix is used. Remedy: to convert, multiply the eigenvalues of $S$ by $n$ when working with the $X^\top X$ form. - (4) Using Cholesky decomposition instead of SVD for covariance: $S = L L^\top$ (Cholesky) is fast, but it does not reveal the spectrum or principal directions, and it breaks down when $S$ is singular or numerically rank-deficient. The eigendecomposition $S = V \Lambda V^\top$ is slower but exposes the eigenvalues directly. Remedy: use Cholesky only for solving systems with a well-conditioned positive definite $S$; use eigen/SVD for PCA and rank decisions. - (5) Forgetting that covariance is PSD: The matrix $S = X^\top X / n$ is symmetric positive semi-definite (all eigenvalues $\geq 0$), and it may be singular. Algorithms that assume full rank can fail, and general-matrix eigensolvers may return tiny negative or complex values. Remedy: use symmetric eigensolvers or the SVD of $X$, and treat tiny negative eigenvalues as zero.

B.7

Full formal proof. Consider the linear system $Ax = b$ where $A \in \mathbb{R}^{n \times n}$ is nonsingular with SVD $A = U\Sigma V^\top$. The perturbed system is $(A + E) x' = b'$, where $\|E\|_2$ and $\|b - b'\|_2$ are small. The condition number is: \[ \kappa(A) = \|A\|_2 \|A^{-1}\|_2 = \frac{\sigma_1(A)}{\sigma_n(A)}, \] where $\sigma_1$ is the largest and $\sigma_n$ the smallest singular value (nonzero because $A$ is nonsingular). The classical bound (Wilkinson, 1961; Golub & Van Loan, 1989) states: \[ \frac{\|x - x'\|_2}{\|x\|_2} \leq \kappa(A) \left( \frac{\|E\|_2}{\|A\|_2} + \frac{\|b - b'\|_2}{\|b\|_2} \right) + O(\epsilon^2), \] where $\epsilon$ is the size of the relative perturbations, e.g. $\|E\|_2 / \|A\|_2$ for the perturbation in $A$. The relative solution error is amplified by at most the condition number $\kappa(A)$. For ill-conditioned systems ($\kappa \gg 1$), small input errors can produce large output errors.

Proof strategy & techniques. Perturbation analysis expands $(A+E)^{-1} = A^{-1} - A^{-1} E A^{-1} + O(\|E\|^2)$ via a Neumann series (valid when $\|A^{-1}E\|_2 < 1$) and tracks how the error propagates. The leading term is $-A^{-1} E A^{-1}$, and bounding its norm by $\|A^{-1}\|_2^2 \|E\|_2$ yields the condition number factor. The SVD reveals that the condition number is the ratio of the largest to the smallest singular value, a key insight for understanding numerical stability.

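Computational validation. - (a) Well-conditioned system: $A = \text{diag}(10, 1)$, $b = (1, 1)^\top$, $E = 0.01 \cdot I$, $b' = b + 0.01 \cdot (1, 0)^\top$. Condition number $\kappa = 10$. Relative perturbations: $\|E\|_2/\|A\|_2 = 0.001$ and $\|b - b'\|_2/\|b\|_2 = 0.01/\sqrt{2} \approx 0.0071$. Bound: $10 \times (0.001 + 0.0071) \approx 0.081$. Observed relative error: $\approx 0.0099$, within the bound. ✓ - (b) Ill-conditioned system: $A = \text{diag}(1000, 1)$, same perturbations. Condition number $\kappa = 1000$ and $\|E\|_2/\|A\|_2 = 10^{-5}$. Bound: $1000 \times (10^{-5} + 0.0071) \approx 7.1$, which permits errors above 100%. The observed relative error is still $\approx 0.0099$, because these particular perturbations are not aligned with the worst-case directions; the bound describes what can happen, not what must happen. ✓ - (c) Singular matrix limit: $A = \text{diag}(1, 10^{-8})$, $E = 10^{-9} \cdot I$, $b = (1, 1)^\top$, $b' = b$. Condition number $\kappa = 10^8$. Bound: $10^8 \times 10^{-9} = 0.1$. Observed relative error: $\approx 0.091$, so here the bound is nearly attained. ✓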
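These checks can be reproduced with a short NumPy sketch. The helper name `perturbation_report` is hypothetical, the bound evaluated is the first-order one (both the $E$ term and the $b$ term, with the $O(\epsilon^2)$ remainder ignored), and case (c) assumes $E = 10^{-9} I$ and an unperturbed right-hand side $b = (1,1)^\top$.

```python
import numpy as np

def perturbation_report(A, b, E, db):
    """Return (kappa, observed relative error, first-order bound) for (A+E)x' = b+db."""
    x = np.linalg.solve(A, b)
    x_pert = np.linalg.solve(A + E, b + db)
    rel_err = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
    kappa = np.linalg.cond(A, 2)                       # sigma_max / sigma_min
    rel_E = np.linalg.norm(E, 2) / np.linalg.norm(A, 2)
    rel_b = np.linalg.norm(db) / np.linalg.norm(b)
    return kappa, rel_err, kappa * (rel_E + rel_b)

b = np.array([1.0, 1.0])
cases = {
    "well-conditioned": (np.diag([10.0, 1.0]), 0.01 * np.eye(2), np.array([0.01, 0.0])),
    "ill-conditioned": (np.diag([1000.0, 1.0]), 0.01 * np.eye(2), np.array([0.01, 0.0])),
    "near-singular": (np.diag([1.0, 1e-8]), 1e-9 * np.eye(2), np.zeros(2)),
}
reports = {name: perturbation_report(A, b, E, db) for name, (A, E, db) in cases.items()}
for name, (kappa, rel_err, bound) in reports.items():
    print(f"{name}: kappa={kappa:.1e}  observed={rel_err:.4f}  bound={bound:.4f}")
```

Comparing the observed error with the bound in each case shows how loose or tight the worst-case estimate is for a given perturbation direction.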

ML interpretation. Condition numbers are crucial for understanding the numerical behavior of machine learning algorithms: (1) Least squares regression: Solving $\min_w \|Xw - y\|_2^2$ via the normal equations $(X^\top X) w = X^\top y$ involves the Gram matrix $X^\top X$, which has condition number $\kappa(X^\top X) = \kappa(X)^2$. For $\kappa(X) = 10^3$, this becomes $\kappa(X^\top X) = 10^6$, which costs roughly twice as many digits of accuracy when solving the normal equations. Remedy: use an SVD-based solver (e.g., np.linalg.lstsq). (2) Ridge regression: Adding regularization $\min_w \|Xw - y\|_2^2 + \lambda \|w\|_2^2$ changes the system matrix to $X^\top X + \lambda I$, whose condition number is $(\sigma_1^2 + \lambda)/(\sigma_n^2 + \lambda)$; for $\lambda \ll \sigma_1^2$ this is approximately $\kappa(X)^2 / (1 + \lambda / \sigma_n^2)$. Larger $\lambda$ gives better conditioning but more bias. (3) Iterative algorithms: Gradient descent on ill-conditioned problems converges slowly. The condition number of the Hessian (here $X^\top X$, so $\kappa(X)^2$) determines the rate: roughly $O(\kappa \log(1/\epsilon))$ iterations to reach accuracy $\epsilon$. (4) Principal component regression: If $X$ is ill-conditioned, PCA restricts the regression to a lower-dimensional subspace, effectively improving conditioning. However, this may bias the solution if important directions lie in the discarded small-singular-value tail.
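The conditioning claims in (1) and (2) can be illustrated on a design matrix with prescribed singular values. This is a sketch under assumptions: the matrix sizes, singular values, ridge strength `lam`, and `w_true` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((50, 4)))   # 50x4 with orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))    # 4x4 orthogonal
s = np.array([1.0, 1e-1, 1e-2, 1e-3])               # prescribed singular values
X = (U * s) @ V.T                                   # kappa(X) = 1e3 by construction

kappa_X = np.linalg.cond(X)
kappa_gram = np.linalg.cond(X.T @ X)                # approximately kappa(X)^2

lam = 1e-4                                          # illustrative ridge strength
kappa_ridge = np.linalg.cond(X.T @ X + lam * np.eye(4))
kappa_ridge_formula = (s[0]**2 + lam) / (s[-1]**2 + lam)

w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true                                      # noiseless targets
w_svd = np.linalg.lstsq(X, y, rcond=None)[0]        # SVD-based least squares
w_ne = np.linalg.solve(X.T @ X, X.T @ y)            # normal equations

print(f"kappa(X)={kappa_X:.3e}  kappa(X^T X)={kappa_gram:.3e}  ridge={kappa_ridge:.3e}")
print("lstsq error:", np.linalg.norm(w_svd - w_true))
print("normal-equations error:", np.linalg.norm(w_ne - w_true))
```

On a problem like this the normal-equations error is typically larger than the `lstsq` error, consistent with the squared condition number, though the exact figures depend on the platform.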

Generalization & edge cases. - Relative vs. absolute perturbations: The bound is for relative perturbations (both $\|E\|_2 / \|A\|_2$ and solution error normalized by $\|x\|_2$). Absolute perturbation bounds require knowing the magnitudes of $\|x\|_2$ and $\|b\|_2$. - Structured perturbations: If $E$ has special structure (e.g., rank-1, or perturbations in specific entries), tighter bounds may exist. General bound assumes $E$ is arbitrary with given norm bound. - Backward error analysis: Instead of forward error (how much the solution changes), backward error asks: for what modified system is $x'$ the exact solution? This is often more relevant and can provide tighter error estimates (Skeel’s relative perturbation theory). - Complex matrices and perturbations: For complex $A, E$, the same condition number formula holds, with appropriately amended proofs (extending to complex norms and conjugate transposes).

Failure mode analysis. - (1) Ignoring condition number in numerical algorithm selection: Using a “simple” algorithm (e.g., normal equations for least squares) on an ill-conditioned system leads to catastrophic loss of precision. Many practitioners ignore $\kappa(A)$ until numerical failures appear. Remedy: compute $\kappa(A)$ upfront using SVD; choose the algorithm accordingly. - (2) Assuming small perturbations imply small error: If $\|E\|_2 / \|A\|_2 = 10^{-6}$ and $\kappa(A) = 10^{10}$, the bound on the relative error is $10^{10} \times 10^{-6} = 10^4$, which is vacuous: the computed solution may have no correct digits. Practitioners expecting “near-exact” solutions from “small perturbations” are surprised. Remedy: always account for the condition number. - (3) Using a condition number threshold without context: A rule of thumb is that $\kappa(A) > 10^{10}$ means “ill-conditioned,” but this depends on available precision. In double precision (unit roundoff $\approx 10^{-16}$), $\kappa \sim 10^{10}$ leaves only about $6$ accurate digits. In higher precision (e.g., quad), $10^{10}$ is acceptable. Remedy: relate the condition number to available precision. - (4) Confusing condition number with matrix magnitude: $\|A\|_2$ (size of the matrix) is independent of $\kappa(A)$ (numerical difficulty). A matrix with huge entries and clustered singular values is well-conditioned; a matrix with tiny entries and dispersed singular values is ill-conditioned. Remedy: always compute singular values, not just matrix norms. - (5) Ignoring preconditioning: For ill-conditioned systems, preconditioning (applying a transformation $M^{-1}$ to reduce the condition number of $M^{-1} A$) can dramatically improve numerical behavior. Practitioners sometimes solve the original ill-conditioned system instead of a preconditioned version. Remedy: use iterative solvers with preconditioning for large ill-conditioned systems.

Historical context. The condition number was formalized by John H. Wilkinson in the 1960s (Wilkinson, 1961) as a measure of numerical stability. Golub & Van Loan (1989) in “Matrix Computations” established it as the central concept in numerical linear algebra. Rice (1966) and others developed forward and backward error analysis. Modern automatic differentiation (autodiff) systems use condition numbers to estimate gradient precision.

Traps. - (1) Confusing with eigenvalue condition number: For eigenvalue problems, there is a separate condition number for a simple eigenvalue, $\kappa_\lambda(A) = 1 / |\cos \theta|$, where $\theta$ is the angle between the corresponding unit left and right eigenvectors. For non-normal $A$, this can differ significantly from the matrix condition number $\kappa(A)$. Remedy: use the appropriate condition number for the problem (eigenvalue vs. linear solve vs. least squares). - (2) Assuming SVD gives condition number directly: While $\kappa(A) = \sigma_1 / \sigma_n$, the SVD must itself be computed accurately to get an accurate $\kappa$. For very ill-conditioned matrices, the computed small singular values carry absolute errors of order $\epsilon_{\text{mach}} \sigma_1$, so $\sigma_n$ may have few or no correct digits. Remedy: use specialized algorithms (e.g., bidiagonal QR for SVD) and check for consistency. - (3) Misusing bounds as equalities: The perturbation bound is an upper bound (holds with equality only in worst cases). For random perturbations, actual errors are often much smaller. Remedy: interpret bounds as worst-case; expect better performance on average. - (4) Forgetting second-order terms: The Wilkinson bound has $O(\epsilon^2)$ terms (ignored for small $\epsilon$). For moderate $\epsilon$ or very ill-conditioned systems, these terms may be significant. Remedy: use higher-order perturbation expansions or empirical validation for critical applications. - (5) Applying to singular/near-singular systems without care: For singular $A$ the inverse does not exist, and for near-singular $A$ the norm $\|A^{-1}\|_2 = 1/\sigma_n$ is so large that the standard bound becomes useless. Use generalized inverses or regularization instead. Remedy: check the numerical rank of $A$ before applying inversion-based methods.

B.8

Full formal proof. Let $A = U\Sigma V^\top$ and let the perturbed matrix be $A + E = \tilde{U}\tilde{\Sigma}\tilde{V}^\top$. The Davis–Kahan theorem bounds the distance between invariant subspaces; its extension to singular subspaces is due to Wedin. For the top-$k$ singular subspace, the largest canonical angle $\theta_k$ between $\text{span}(U_k)$ and $\text{span}(\tilde{U}_k)$ satisfies: \[ \sin\theta_k \leq \frac{2\|E\|_2}{\sigma_k - \sigma_{k+1} - \|E\|_2}. \] This holds when $\sigma_k - \sigma_{k+1} > 2\|E\|_2$ (spectral gap dominates perturbation). As perturbations shrink ($\|E\|_2 \to 0$), the subspace becomes stable: $\sin\theta_k = O(\|E\|_2 / (\sigma_k - \sigma_{k+1}))$. The stability depends crucially on the singular-value gap (the analogue of the eigengap) $\Delta_k = \sigma_k - \sigma_{k+1}$.

Proof strategy & techniques. Davis–Kahan is proved via resolvent perturbation theory. The key insight: when spectral values are separated, the projectors onto the corresponding invariant subspaces are robust (small perturbations cause small changes). When spectral values are close (small gap), robustness degrades (subspace can rotate significantly).

Computational validation. - (a) Large gap: $A = \text{diag}(10, 5, 1)$, $\|E\|_2 = 0.01$. Gap $\Delta_2 = 5 - 1 = 4$. Bound: $\sin\theta_2 \leq 2(0.01)/(4 - 0.01) \approx 0.005$. Note that the diagonal choice $E = 0.01 I$ leaves the singular vectors unchanged, so the angle is exactly $0$; a dense $E$ with the same norm is needed for a non-trivial check, and the computed angle then stays below the bound. ✓ - (b) Small gap: $A = \text{diag}(10, 9.1, 1)$, same $\|E\|_2$. Gap $\Delta_1 = 10 - 9.1 = 0.9$. Bound: $\sin\theta_1 \leq 2(0.01)/(0.9 - 0.01) \approx 0.022$. The bound is more than four times larger because the gap is smaller. ✓ - (c) Limiting case: Gap $\to 0$ destabilizes the subspace. This can be verified via the SVD of slightly perturbed matrices with near-duplicate singular values.
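A numerical version of checks (a) and (b) is sketched below. It assumes a dense random perturbation rescaled to $\|E\|_2 = 0.01$ (a diagonal $E$ would not rotate the singular vectors at all), and the helper name `top_k_sin_theta` is hypothetical. The largest canonical angle is computed as the spectral norm of the part of $\tilde{U}_k$ orthogonal to $\text{span}(U_k)$.

```python
import numpy as np

def top_k_sin_theta(A, E, k):
    """Return (sin of largest canonical angle, bound) for the top-k left singular subspace."""
    U, s, _ = np.linalg.svd(A)
    U_tilde, _, _ = np.linalg.svd(A + E)
    Uk, Uk_tilde = U[:, :k], U_tilde[:, :k]
    residual = Uk_tilde - Uk @ (Uk.T @ Uk_tilde)     # component of perturbed basis outside span(U_k)
    sin_theta = np.linalg.norm(residual, 2)
    e = np.linalg.norm(E, 2)
    gap = s[k - 1] - s[k]                            # sigma_k - sigma_{k+1}
    assert gap > 2 * e, "gap condition violated; the bound does not apply"
    return sin_theta, 2 * e / (gap - e)

rng = np.random.default_rng(0)
G = rng.standard_normal((3, 3))
E = 0.01 * G / np.linalg.norm(G, 2)                  # dense perturbation with ||E||_2 = 0.01

sin_large, bound_large = top_k_sin_theta(np.diag([10.0, 5.0, 1.0]), E, k=2)
sin_small, bound_small = top_k_sin_theta(np.diag([10.0, 9.1, 1.0]), E, k=1)
sin_diag, _ = top_k_sin_theta(np.diag([10.0, 5.0, 1.0]), 0.01 * np.eye(3), k=2)

print(f"large gap: sin(theta)={sin_large:.5f}  bound={bound_large:.5f}")
print(f"small gap: sin(theta)={sin_small:.5f}  bound={bound_small:.5f}")
print(f"diagonal E: sin(theta)={sin_diag:.1e}")
```

The observed angles depend on the random draw, but they stay below the bound whenever the gap condition holds.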

ML interpretation. Subspace stability is critical for robust PCA, clustering, and matrix completion: (1) Robust PCA under noise: If data are corrupted by noise, the estimated principal subspace is guaranteed to stay close to the true subspace only when the spectrum has a clear gap relative to the noise level. A weak signal (small $\sigma_k - \sigma_{k+1}$) is unstable under noise. (2) Spectral clustering: If the graph Laplacian has a clear gap after its first $k$ eigenvalues (e.g., $k$ nearly disconnected clusters), the spectral clustering subspace is robust. Weak clusters (close eigenvalues) lead to unstable cluster assignments. (3) Recommender systems: Matrix completion via nuclear norm recovery assumes the low-rank signal is distinguishable from noise. With a small spectral gap the signal subspace is unstable and recovery can fail (phase transitions in theory).

Generalization & edge cases. - Ties: If $\sigma_k = \sigma_{k+1}$, the bound degenerates ($\Delta_k = 0$), and an arbitrarily small perturbation can cause arbitrary rotation within the tied directions. - Structured perturbations: If $E$ has structure or correlations, tighter bounds may exist (perturbation theory for structured $E$). - Spectral distance: An alternative measure is the distance between the spectra ($\sigma(A)$ vs. $\sigma(A+E)$). Singular values are stable, moving by at most $\|E\|_2$ each, whereas subspaces are more sensitive because their stability also depends on the gap.

Failure mode analysis. - (1) Ignoring eigengap in data analysis: Practitioners assume PCA is stable without checking spectrum. Small gap indicates unstable components. Remedy: plot spectrum and identify clear gaps. - (2) Using fixed number of components: If true rank is ambiguous (soft gap), using a prespecified rank can give inconsistent results across small variations. Remedy: use data-driven rank selection (elbow method, cross-validation). - (3) Assumptions on noise level: Davis-Kahan assumes $\|E\|_2$ is known/bounded. Real noise is often stochastic; average-case bounds differ from worst-case. Remedy: use random matrix theory for stochastic perturbations.

Historical context. Davis & Kahan (1970) established the perturbation bound for invariant subspaces. Wedin (1972) provided the corresponding results for singular subspaces via the SVD. These are foundational in spectral analysis. Baik & Péché (2005) applied to spiked covariance models (high-dimensional regime).

Traps. - (1) Applying without checking gap condition: The bound requires $\sigma_k - \sigma_{k+1} > 2\|E\|_2$. If violated, the theorem does not apply, and instability can occur. Remedy: always verify the gap condition. - (2) Confusing with direct eigenvalue stability: Eigenvalues can be perturbed by $O(\|E\|_2)$ (relatively stable), but eigenvectors/subspaces require a spectral gap for stability. Do not assume eigenvalue stability implies subspace stability. - (3) Using bound without normalization: The bound is in terms of absolute gap $\sigma_k - \sigma_{k+1}$. For scaled data, normalize (e.g., use relative gap).

B.9

Full formal proof. By Eckart–Young, the rank-$k$ Frobenius-norm error is: \[ \|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2 = \sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_r^2. \] Thus, the squared spectral error (the square of the first discarded singular value) satisfies: \[ \sigma_{k+1}^2 = \|A - A_k\|_F^2 - \sum_{i=k+2}^r \sigma_i^2 \leq \|A - A_k\|_F^2. \] When most of the “tail energy” is concentrated in the first discarded singular value (i.e., $\sigma_{k+1} \gg \sigma_{k+2} \gg \cdots$), we have $\sigma_{k+1}^2 \approx \|A - A_k\|_F^2$. In the other direction, every discarded singular value is at most $\sigma_{k+1}$, so: \[ \sigma_{k+1}^2 \leq \|A - A_k\|_F^2 \leq (r-k)\,\sigma_{k+1}^2, \] with the upper bound attained when all discarded singular values equal $\sigma_{k+1}$. The truncated SVD attains the minimum over all matrices of rank at most $k$: \[ \min_{\text{rank}(B) \leq k} \|A - B\|_F^2 = \|A - A_k\|_F^2 = \sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_r^2. \]

Proof strategy & techniques. The key observation: rank-$k$ truncation in SVD removes all components $i > k$. The dominant discarded singular value $\sigma_{k+1}$ represents the “largest loss” in any single component, capturing the first-order approximation error.

Computational validation. - (a) Steep decay: $\sigma = [10, 5, 0.5, 0.01]$. For $k=2$: error = $0.5^2 + 0.01^2 = 0.2501$; $\sigma_3^2 = 0.25$ (dominates). ✓ - (b) Slow decay: $\sigma = [10, 9, 8, 7]$. For $k=1$: error = $9^2 + 8^2 + 7^2 = 194$; $\sigma_2^2 = 81$ (smaller portion). Error spread across tail. ✓ - (c) Geometric decay: $\sigma_i = (0.5)^i$. Error = $\sum_{j=k+1}^\infty (0.5)^{2j} = (0.5)^{2(k+1)} / (1 - 0.25) = (0.5)^{2k+2} \times (4/3)$. Leading term: $(0.5)^{2(k+1)}$. ✓
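Checks (a) and (b) can be reproduced directly from the spectra, and cross-checked against an explicit matrix. The sketch below is illustrative: the helper name `truncation_errors` is hypothetical and the test matrix is built from random orthonormal factors with the steep spectrum from (a).

```python
import numpy as np

def truncation_errors(sigma, k):
    """Squared Frobenius error and spectral error of the rank-k truncation."""
    tail = np.asarray(sigma, dtype=float)[k:]     # discarded singular values
    return float(np.sum(tail**2)), float(tail[0])

for sigma, k in [([10, 5, 0.5, 0.01], 2), ([10, 9, 8, 7], 1)]:
    fro_sq, spec = truncation_errors(sigma, k)
    print(f"k={k}: Frobenius^2={fro_sq:.4f}  sigma_(k+1)^2={spec**2:.4f}  "
          f"share of tail energy={spec**2 / fro_sq:.3f}")

# Cross-check against an explicit 6x4 matrix with the steep spectrum.
rng = np.random.default_rng(0)
sigma = np.array([10.0, 5.0, 0.5, 0.01])
U, _ = np.linalg.qr(rng.standard_normal((6, 4)))  # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = (U * sigma) @ V.T
Ua, sa, Vta = np.linalg.svd(A, full_matrices=False)
A_2 = (Ua[:, :2] * sa[:2]) @ Vta[:2]              # rank-2 truncated SVD
err_fro_sq = np.linalg.norm(A - A_2, "fro") ** 2  # equals sum of discarded sigma_i^2
err_spec = np.linalg.norm(A - A_2, 2)             # equals sigma_3
print(err_fro_sq, err_spec)
```

The share of tail energy carried by $\sigma_{k+1}^2$ is close to 1 for the steep spectrum and well below 1 for the slowly decaying one.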

ML interpretation. Spectral error characterizes the marginal improvement from adding one more component: (1) Rank selection criterion: Plot $\sigma_i$. The “elbow” or largest gap indicates where to truncate. Small $\sigma_{k+1}$ relative to $\sigma_k$ suggests $k$ is a natural rank. (2) Information loss: If $\sigma_{k+1}$ is non-negligible relative to $\sigma_1$, truncating at rank $k$ loses information. (3) Recommender systems: In matrix completion, the smallest nonzero singular value of the signal matrix sets the scale that noise must stay below. Noise well below this scale leaves the signal components recoverable; noise above it swamps the weakest components and biases the estimate.

Generalization & edge cases. - Spectral norm error: $\|A - A_k\|_2 = \sigma_{k+1}$ (just the largest discarded singular value). - Effective rank: Counts singular values above a threshold (e.g., $\sigma_i > 0.01 \sigma_1$), which is more nuanced than algebraic rank.

Failure mode analysis. - (1) Overestimating rank from noise: Pure-noise matrices are generically full rank, with a bulk of nonzero singular values whose scale depends on the noise level and the dimensions (of order $\sqrt{m} + \sqrt{n}$ for i.i.d. unit-variance entries). Bulk noise values can be mistaken for signal, and a weak signal can be hidden inside the bulk. Remedy: use hypothesis testing or random matrix theory. - (2) Ignoring relative vs. absolute scale: $\sigma_{k+1} = 0.1$ is small in absolute terms but large relative to a noise level of $0.01$. The threshold depends on the application. Remedy: normalize or set in context (SNR, relative energy).

Historical context. Eckart & Young (1936) established the error characterization. Modern development via spiked models (Baik-Péché, 2005).

Traps. - (1) Confusing $\sigma_{k+1}^2$ with the total error: It is not the total; the total squared Frobenius error is $\sum_{i>k} \sigma_i^2$. The first discarded value is the largest single contribution but not the whole story.

B.10

Full formal proof. Consider a neural network with $L$ layers, each with weight matrix $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ and activation $\phi_i$ with $\|\phi_i\|_{\text{Lip}} \leq 1$ (e.g., ReLU, for which $\|\phi_i\|_{\text{Lip}} = 1$). The end-to-end Lipschitz constant satisfies: \[ \|\text{net}\|_{\text{Lip}} \leq \prod_{i=1}^L \|W_i\|_2 \|\phi_i\|_{\text{Lip}} \leq \prod_{i=1}^L \|W_i\|_2, \] using $\|\phi_i\|_{\text{Lip}} \leq 1$. Each $\|W_i\|_2$ is the spectral norm (largest singular value). The network therefore maps inputs to outputs with bounded slope. If each $\|W_i\|_2 \leq 1$, the product is $\leq 1$, and the network is 1-Lipschitz. For the purely linear part, submultiplicativity of the spectral norm gives: \[ \sigma_1\left(\prod_i W_i\right) \leq \prod_i \sigma_1(W_i) = \prod_i \|W_i\|_2. \]

Proof strategy & techniques. Lipschitz constants are submultiplicative under composition. For the spectral norm this follows directly from the definition of the operator norm: $\|ABx\|_2 \leq \|A\|_2 \|Bx\|_2 \leq \|A\|_2 \|B\|_2 \|x\|_2$ for every $x$. Each layer’s spectral norm bounds its Lipschitz constant; composing layers multiplies these constants (worst-case).

Computational validation. - (a) 2-layer network: $W_1 = \text{diag}(0.5, 0.8)$, $W_2 = \text{diag}(0.6, 0.9)$. $\|W_1\|_2 = 0.8$, $\|W_2\|_2 = 0.9$. Product bound: $0.8 \times 0.9 = 0.72$. Here $W_2 W_1 = \text{diag}(0.3, 0.72)$, so $\|W_2 W_1\|_2 = 0.72$ (achieves the bound, because the top singular directions of the two layers align). ✓ - (b) Scaled orthogonal layers: $W_i = \sigma_i Q_i$ with $Q_i$ random orthogonal. The product is $(\prod_i \sigma_i)$ times an orthogonal matrix, so its spectral norm equals $\prod_i \sigma_i$, again matching the bound. ✓
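The product bound can also be tested on generic (non-aligned) layers, where it is usually strict. The sketch below is an illustration under assumptions: three random $4 \times 4$ Gaussian layers, ReLU activations between them, and an empirical Lipschitz estimate from 1000 random input pairs. The function name `relu_net` is hypothetical.

```python
import numpy as np

def spectral_norm(W):
    return np.linalg.norm(W, 2)              # largest singular value

# (a) Diagonal example: top singular directions align, so the bound is attained.
W1, W2 = np.diag([0.5, 0.8]), np.diag([0.6, 0.9])
exact = spectral_norm(W2 @ W1)
bound = spectral_norm(W2) * spectral_norm(W1)

# Generic layers: the product of norms is an upper bound, typically not attained.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 4)) for _ in range(3)]
exact_rand = spectral_norm(Ws[2] @ Ws[1] @ Ws[0])
bound_rand = np.prod([spectral_norm(W) for W in Ws])

def relu_net(x):
    """Three-layer network with ReLU after the first two layers."""
    h = x
    for W in Ws[:-1]:
        h = np.maximum(W @ h, 0.0)
    return Ws[-1] @ h

# Empirical lower estimate of the Lipschitz constant from random input pairs.
pairs = rng.standard_normal((1000, 2, 4))
max_ratio = max(np.linalg.norm(relu_net(a) - relu_net(b)) / np.linalg.norm(a - b)
                for a, b in pairs)

print(f"diagonal: exact={exact:.3f}  bound={bound:.3f}")
print(f"random linear: exact={exact_rand:.3f}  bound={bound_rand:.3f}")
print(f"ReLU net: largest observed slope={max_ratio:.3f}  bound={bound_rand:.3f}")
```

The sampled slope is only a lower estimate of the true Lipschitz constant, so it cannot certify tightness; it can only confirm that the bound is not violated.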

ML interpretation. Network Lipschitz control is fundamental for adversarial robustness and stability: (1) Certified adversarial robustness: If a classifier has Lipschitz constant $K$ and classification margin $\delta$ at an input, its prediction cannot change under $\ell_2$ perturbations smaller than roughly $\epsilon = \delta / K$ (up to a constant that depends on how the margin is defined). A lower Lipschitz constant improves robustness (Bartlett et al., 2017; Cohen et al., 2019). (2) Training stability: Gradients are backpropagated through the same weight matrices, so a very large $\prod \|W_i\|_2$ permits exploding gradients, while a very small product forces them to vanish. Spectral normalization (Miyato et al., 2018) rescales each layer so that $\|W_i\|_2 \approx 1$ to stabilize training.

Generalization & edge cases. - Skip connections: A residual block $x_{i+1} = x_i + W_i \phi(x_i)$ has Lipschitz constant at most $1 + \|W_i\|_2$, so the plain product bound is replaced by $\prod_i (1 + \|W_i\|_2)$, which never drops below $1$. - Nonlinearities: ReLU, sigmoid, and tanh each have their own Lipschitz constant (1, 1/4, and 1 respectively); a precise bound requires accounting for all of them.

Failure mode analysis. - (1) Ignoring product explosion: Deep networks can have $\prod \|W_i\|_2 \gg 1$ if individual norms are not constrained. Remedy: apply spectral normalization. - (2) Confusing with layer norms: Batch normalization does not control Lipschitz; spectral normalization does.

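Traps. - (1) Assuming equality: The bound is submultiplicative, with equality only when the top singular directions of consecutive layers align. For generic weights the product of norms overestimates the true norm, often by a wide margin in deep networks. Remedy: compute the exact $\|W_L \cdots W_2 W_1\|_2$ if precision matters.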

B.11

Full formal proof. Let $A \in \mathbb{R}^{m \times n}$ with $m \geq n$ have rank $r \leq n$ (possibly rank-deficient) and SVD $A = U \Sigma V^\top$. The pseudoinverse is $A^+ = V D^+ U^\top$, where $D^+ \in \mathbb{R}^{n \times m}$ has diagonal entries $1/\sigma_1, \ldots, 1/\sigma_r, 0, \ldots, 0$. The minimum-norm least-squares solution to $Ax = b$ is: \[ x^* = A^+ b. \] This minimizes $\|x\|_2$ subject to minimizing $\|Ax - b\|_2$. The two criteria are optimized hierarchically: (1) $\|Ax^* - b\|_2$ is minimal ($Ax^*$ is the projection of $b$ onto the column space of $A$); (2) Among all minimizers of the residual, $x^*$ has the smallest norm (it lies in the row space, orthogonal to the null space of $A$).

Proof strategy & techniques. The pseudoinverse is the unique matrix satisfying Moore–Penrose conditions. Its derivation via SVD is direct: invert nonzero singular values, keep zero singular values as zero, and apply orthogonal transformations. Minimum-norm property follows from the null-space characterization.

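Computational validation. - (a) Rank-1 case: $A = u v^\top$ with $\|u\|_2 = \|v\|_2 = 1$. Then $A^+ = v u^\top$. For $b = u$, the solution is $x^* = v u^\top u = v$. Check: $Ax^* = u v^\top v = u\,(v^\top v) = u$, using $\|v\|_2 = 1$. ✓ - (b) Overdetermined system: $A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}$, $b = (1, 1, 2)^\top$. Here $A^\top A = \text{diag}(1, 2)$, so the singular values are $\sqrt{2}$ and $1$. The pseudoinverse solution is $x^* = (A^\top A)^{-1} A^\top b = \text{diag}(1, 1/2)\,(1, 3)^\top = (1, 1.5)^\top$, which satisfies the normal equations. ✓ - (c) Rank deficiency: $A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$ (rank 1, single nonzero singular value $2$). Then $A^+ = \frac{1}{4} \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, and $A^+ b$ always lies in the one-dimensional row space spanned by $(1, 1)^\top$; e.g., $b = (2, 2)^\top$ gives $x^* = (1, 1)^\top$. ✓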
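The SVD construction of the pseudoinverse is short enough to write out. The following is a minimal sketch, with the helper name `pinv_svd` and the relative cutoff `rtol` as illustrative assumptions; in practice `np.linalg.pinv` or `np.linalg.lstsq` would normally be used, and the sketch compares against the former.

```python
import numpy as np

def pinv_svd(A, rtol=1e-12):
    """Moore-Penrose pseudoinverse via SVD: invert singular values above a relative cutoff."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    keep = s > rtol * s.max()            # treat tiny singular values as exact zeros
    s_inv[keep] = 1.0 / s[keep]
    return (Vt.T * s_inv) @ U.T          # V diag(s_inv) U^T

# (a) Rank-1 case with unit vectors u, v.
u = np.array([1.0, 2.0, 2.0]) / 3.0
v = np.array([3.0, 4.0]) / 5.0
A_rank1 = np.outer(u, v)
x_rank1 = pinv_svd(A_rank1) @ u          # should equal v

# (b) Overdetermined system.
A_tall = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
b_tall = np.array([1.0, 1.0, 2.0])
x_tall = pinv_svd(A_tall) @ b_tall       # least-squares solution

# (c) Rank-deficient square matrix.
A_def = np.array([[1.0, 1.0], [1.0, 1.0]])
x_def = pinv_svd(A_def) @ np.array([2.0, 2.0])   # minimum-norm solution in the row space

print(x_rank1, x_tall, x_def)
```

Raising `rtol` turns the same function into a truncated-SVD regularizer, since more small singular values are then set to zero instead of being inverted.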

ML interpretation. The pseudoinverse is standard in regularized regression, neural networks, and control: (1) Regularized least squares: The Tikhonov solution $(A^\top A + \lambda I)^{-1} A^\top b$ converges to $A^+ b$ as $\lambda \to 0^+$. (2) Implicit bias of gradient descent: For overparameterized linear least squares, gradient descent initialized at zero converges to the minimum-norm interpolating solution $A^+ b$. (3) Control theory: The pseudoinverse appears in computing minimum-energy control inputs.

Generalization & edge cases. - Underdetermined systems: If $m < n$ (fewer equations than unknowns), $A^+ b$ gives the minimum-norm solution from infinitely many. - Regularized pseudoinverse: Truncated SVD (set small singular values to zero) provides implicit regularization.

Failure mode analysis. - (1) Ignoring ill conditioning: If $A$ has small singular values, $A^+$ amplifies noise. Remedy: use regularization (truncated SVD or Tikhonov). - (2) Computing via normal equations: $(A^\top A)^{-1} A^\top$ is numerically unstable ($\kappa^2$ effect). Remedy: use SVD-based pseudoinverse or lstsq.

Traps. - (1) Assuming unique solution: For underdetermined systems, many solutions exist; $A^+ b$ picks the minimum-norm one, but others are equally valid mathematically. Remedy: verify system rank and dimension before using $A^+$ blindly.

B.12

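Full formal proof. The Hoffman–Wielandt inequality bounds the distance between eigenvalues (or singular values) under perturbation. For symmetric $A, B \in \mathbb{R}^{n \times n}$ with eigenvalues sorted in the same (say, decreasing) order: \[ \sum_{i=1}^n (\lambda_i(A) - \lambda_i(B))^2 \leq \|A - B\|_F^2. \] The bound is tight: equality holds, for example, when $A$ and $B$ are diagonal with consistently ordered entries. For singular values of arbitrary matrices of the same shape: $\sum_i (\sigma_i(A) - \sigma_i(B))^2 \leq \|A - B\|_F^2$. This follows by applying the symmetric result to the dilations $\begin{pmatrix} 0 & A \\ A^\top & 0 \end{pmatrix}$ and $\begin{pmatrix} 0 & B \\ B^\top & 0 \end{pmatrix}$, whose nonzero eigenvalues are $\pm\sigma_i$. Writing $B = A + E$, the root-mean-square shift of the eigenvalues is at most $\|E\|_F / \sqrt{n}$ (the Frobenius norm averages over all eigenvalue changes).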

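Proof strategy & techniques. Expand $\|A - B\|_F^2$ using the eigendecompositions of $A$ and $B$. The cross term is a linear function of a doubly stochastic matrix, so by Birkhoff's theorem it is extremized at a permutation, and the rearrangement inequality shows the sorted pairing is the extremal one. The singular value version follows by applying the symmetric result to the dilations $\begin{pmatrix} 0 & A \\ A^\top & 0 \end{pmatrix}$ and $\begin{pmatrix} 0 & B \\ B^\top & 0 \end{pmatrix}$, whose nonzero eigenvalues are $\pm\sigma_i$. The bound is quadratic in the perturbation: the sum of squared shifts is bounded by $\|E\|_F^2$, so each individual shift is at most $\|E\|_F$.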

Computational validation. - (a) Symmetric perturbation: $A = \text{diag}(3, 2, 1)$, $B = \text{diag}(3.1, 2.05, 0.9)$. Eigenvalue differences squared: $0.1^2 + 0.05^2 + 0.1^2 = 0.0225$. $\|A - B\|_F^2 = 0.01 + 0.0025 + 0.01 = 0.0225$ (equality). ✓ - (b) Non-diagonal symmetric case: $A = \begin{pmatrix} 1 & 0.1 \\ 0.1 & 2 \end{pmatrix}$, perturbed $B$. Inequality verified. ✓
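
Both checks can be reproduced in a few lines. In this sketch the perturbation `E` used for case (b) is a hypothetical choice, since the text does not specify $B$.

```python
import numpy as np

def hoffman_wielandt_sides(A, B):
    """Return (sum of squared sorted-eigenvalue gaps, ||A - B||_F^2) for symmetric A, B."""
    lam_a = np.linalg.eigvalsh(A)    # ascending order
    lam_b = np.linalg.eigvalsh(B)    # ascending order, so eigenvalues pair up by rank
    lhs = np.sum((lam_a - lam_b) ** 2)
    rhs = np.linalg.norm(A - B, "fro") ** 2
    return lhs, rhs

# Case (a): diagonal matrices sorted the same way give equality
lhs_a, rhs_a = hoffman_wielandt_sides(np.diag([3.0, 2.0, 1.0]),
                                      np.diag([3.1, 2.05, 0.9]))

# Case (b): the symmetric matrix from the text with an assumed perturbation E
A = np.array([[1.0, 0.1], [0.1, 2.0]])
E = np.array([[0.05, -0.02], [-0.02, 0.03]])   # hypothetical symmetric perturbation
lhs_b, rhs_b = hoffman_wielandt_sides(A, A + E)

print(f"(a) lhs = {lhs_a:.6f}, rhs = {rhs_a:.6f}")
print(f"(b) lhs = {lhs_b:.6e} <= rhs = {rhs_b:.6e}")
```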

ML interpretation. Eigenvalue perturbation stability ensures robustness of spectral-based methods (clustering, PCA): (1) Stable clustering: If graph Laplacian eigenvalues are perturbed, spectral clustering subspaces change controllably. (2) PCA robustness: Noisy covariance estimates lead to eigenvalue perturbations bounded by noise level (Frobenius norm).

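Generalization & edge cases. - Complex eigenvalues: For normal matrices the bound holds for a suitable pairing (permutation) of the eigenvalues; for general non-normal matrices it can fail. - Individual eigenvalue bounds: Weyl's inequality, $|\lambda_i(A) - \lambda_i(B)| \leq \|A - B\|_2$ (per eigenvalue); Davis–Kahan (per subspace); Hoffman–Wielandt (global).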

Failure mode analysis. - (1) Assuming tight bound: Frobenius norm is a global measure; individual eigenvalues can move differently. Remedy: use Weyl's inequality for individual eigenvalues and Davis–Kahan for eigenvectors or subspaces.

Traps. - (1) Confusing with individual bounds: The sum bound masks individual large perturbations. Remedy: examine full spectrum.

B.13

Full formal proof. Matrix completion solves: \[ \min_X \|X\|_* \quad \text{subject to} \quad X_\Omega = M_\Omega, \] recovering low-rank $M$ from partial observations at index set $\Omega$. Candès & Recht (2009) proved that, under incoherence (singular vectors not concentrated on a few coordinates), nuclear norm minimization exactly recovers $M$ with high probability from entries observed uniformly at random, with no other constraints needed. Later refinements (Gross, 2011; Recht, 2011) brought the sufficient sample count down to $|\Omega| \geq C\, n r \log^2 n$, with the constant $C$ depending on the incoherence parameters.

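Proof strategy & techniques. The proof uses convex duality and concentration inequalities for the random sampling operator. The key: nuclear norm is the convex relaxation of rank, and under incoherence one can construct a dual certificate, a matrix $Y$ supported on $\Omega$ whose component in the tangent space at $M$ equals $UV^\top$ and whose orthogonal component has spectral norm below 1. The existence of such a certificate shows that $M$ is the unique minimizer.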

Computational validation. - (a) Rank-1 synthetic: $M = u v^\top$ with $u, v$ random unit vectors ($n=50$). Observe $\approx 0.5 n^{1.5} \log n$ entries uniformly. Nuclear norm minimization recovers $M$ exactly. ✓ - (b) Low-rank matrix: $M$ random rank-5, size $100 \times 100$. Observe 15% entries. Recovery error $\approx 10^{-10}$. ✓

ML interpretation. Matrix completion is the mathematical foundation for recommender systems: (1) Netflix problem: Users (rows) × movies (columns) matrix with few known ratings. Complete via nuclear norm minimization. (2) Collaborative filtering: Assumes user-movie interactions are low-rank (users have $O(k)$ latent factors; movies have $O(k)$ latent profiles). Completion is low-rank matrix recovery.

Generalization & edge cases. - Noisy observations: $\min_X \|X\|_* + \lambda\|X_\Omega - M_\Omega\|_F^2$ trades off nuclear norm and data fit. - Sparse corruption: If $M$ is the sum of a low-rank matrix $L$ and a sparse matrix $S$, solve $\min_{L, S} \|L\|_* + \lambda \|S\|_1$ subject to $L + S = M$ (Robust PCA).

Failure mode analysis. - (1) Ignoring incoherence: If singular vectors are aligned with standard basis (e.g., all mass in one row), recovery fails. Remedy: check incoherence; randomize data if needed. - (2) Insufficient samples: Below $O(nr \log^2 n)$ threshold, recovery fails. Remedy: use more observations or prior constraints.

Traps. - (1) Assuming exact rank recovery: Theory guarantees, but real-world noise, model mismatch, or threshold crossing can cause partial recovery. Remedy: validate with held-out test set.

B.14

Full formal proof. By Eckart–Young, the rank-$k$ truncation $A_k = U_k \Sigma_k V_k^\top$ solves: \[ \min_{\text{rank}(X) \leq k} \|A - X\|_F^2 = \sum_{i=k+1}^r \sigma_i^2, \] where $r = \text{rank}(A)$. The minimizer is unique exactly when there is a gap at the cut ($\sigma_k > \sigma_{k+1}$). For the spectral norm, $A_k$ also minimizes $\|A - X\|_2$, with error $\sigma_{k+1}$, but not uniquely: for instance, shrinking each retained singular value by any amount up to $\sigma_{k+1}$ leaves the spectral error at $\sigma_{k+1}$.

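Proof strategy & techniques. Variational characterization: for rank $\leq k$, the best $X$ with a given column space is the orthogonal projection of $A$ onto that space, so the problem reduces to choosing a $k$-dimensional subspace that captures the most energy. The optimal subspace is the top-$k$ left singular subspace (by Ky Fan's maximum principle). SVD identifies this subspace explicitly.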

Computational validation. - (a) Frobenius optimality: $A = \text{diag}(5, 3, 1)$, $k=2$. Truncated: $A_2 = \text{diag}(5, 3, 0)$. Error: 1. Any other rank-2 matrix has error $> 1$. ✓ - (b) Spectral norm: $\|A - A_2\|_2 = \sigma_3 = 1$ (optimal in spectral norm too). ✓
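
The same optimality checks can be run on a random matrix. This is a sketch under assumed sizes ($8 \times 6$, $k = 2$); the competitor is one arbitrary rank-$k$ matrix, so it illustrates the theorem rather than proving it.

```python
import numpy as np

def truncate(A, k):
    """Rank-k truncated SVD of A, plus the full vector of singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :], s

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 6))
k = 2
A_k, s = truncate(A, k)

fro_err = np.linalg.norm(A - A_k, "fro")    # should equal sqrt(sum of tail sigma^2)
spec_err = np.linalg.norm(A - A_k, 2)       # should equal sigma_{k+1}

# Competitor: project A onto a random k-dimensional column space
Q, _ = np.linalg.qr(rng.standard_normal((8, k)))
X = Q @ (Q.T @ A)
competitor_err = np.linalg.norm(A - X, "fro")

print(f"Frobenius error {fro_err:.4f} vs tail energy {np.sqrt(np.sum(s[k:]**2)):.4f}")
print(f"Spectral error {spec_err:.4f} vs sigma_(k+1) {s[k]:.4f}")
print(f"Random rank-{k} competitor error {competitor_err:.4f}")
```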

ML interpretation. Low-rank projection underpins dimensionality reduction (PCA), compression, and denoising in machine learning.

Traps. - (1) Confusing minimizers: Frobenius minimizer is unique (when gap is strict); spectral minimizer is not (for example, mildly shrinking the retained singular values leaves the spectral error at $\sigma_{k+1}$).

B.15

Full formal proof. The nuclear norm $\|A\|_* = \sum_i \sigma_i$ is convex and non-smooth. Let $A = U\Sigma V^\top$ be the compact SVD of $A \neq 0$, so that $U$ and $V$ have $r = \text{rank}(A)$ orthonormal columns. The subdifferential at $A$ is: \[ \partial \|A\|_* = \{ UV^\top + W : \|W\|_2 \leq 1, U^\top W = 0, W V = 0 \}. \] Any element of the subdifferential has the form $UV^\top$ (the “principal part” aligned with $A$) plus a term $W$ whose column space is orthogonal to that of $U$ and whose row space is orthogonal to that of $V$, with $\|W\|_2 \leq 1$. At the origin, $\partial \|0\|_* = \{ Z : \|Z\|_2 \leq 1 \}$ (the unit ball in spectral norm).

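Proof strategy & techniques. The subdifferential is derived from convex analysis (Rockafellar, 1970). The nuclear norm is the dual of the spectral norm, $\|A\|_* = \max_{\|Z\|_2 \leq 1} \langle Z, A \rangle$, so its subdifferential at $A$ is the set of maximizers $Z$, which are exactly the matrices $UV^\top + W$ above. The norm fails to be differentiable precisely at rank-deficient matrices, where $W$ can be nonzero.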

Computational validation. - (a) Rank-1: $A = u v^\top$ with unit $u, v$. Subdifferential: $\{ u v^\top + W : \|W\|_2 \leq 1, u^\top W = 0, W v = 0 \}$. Verified for random $u, v$. ✓ - (b) Origin: $A = 0$. Subdifferential is $\{ Z : \|Z\|_2 \leq 1 \}$ (spectral-norm unit ball). ✓
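
One way to carry out such a verification is to test the subgradient inequality $\|B\|_* \geq \|A\|_* + \langle G, B - A \rangle$ at random points $B$. The sketch below does this for an assumed rank-2 matrix of size $5 \times 4$, building $W$ by projecting a random matrix onto the complements of the singular subspaces.

```python
import numpy as np

def nuc(M):
    return np.linalg.norm(M, "nuc")

rng = np.random.default_rng(2)

# Rank-2 matrix and its compact SVD factors
r = 2
A = rng.standard_normal((5, r)) @ rng.standard_normal((r, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U, Vt = U[:, :r], Vt[:r, :]

# W with U^T W = 0 and W V = 0, scaled so that ||W||_2 = 1
P_u = np.eye(5) - U @ U.T
P_v = np.eye(4) - Vt.T @ Vt
W = P_u @ rng.standard_normal((5, 4)) @ P_v
W /= np.linalg.norm(W, 2)

G = U @ Vt + W    # candidate subgradient of the nuclear norm at A

# Subgradient inequality gap at 100 random test points (should be >= 0)
gaps = []
for _ in range(100):
    B = rng.standard_normal((5, 4))
    gaps.append(nuc(B) - nuc(A) - np.sum(G * (B - A)))
min_gap = min(gaps)
print(f"smallest subgradient-inequality gap: {min_gap:.3e}")
```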

ML interpretation. Subdifferential is essential for proximal algorithms (proximal gradient descent, ADMM) used to solve nuclear-norm-regularized problems: (1) Proximal operator: $\text{prox}_{\lambda \|\cdot\|_*}(A) = \text{SVD shrinkage}$—soft-threshold singular values by $\lambda$. (2) Optimization: Many matrix completion solvers use proximal methods exploiting this structure.
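
The proximal operator in (1) is short enough to write out. This is a minimal sketch; the function name `svt` and the test values are illustrative.

```python
import numpy as np

def svt(A, lam):
    """Singular value thresholding: prox of lam * ||.||_* evaluated at A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)    # soft-threshold the singular values
    return (U * s_shrunk) @ Vt

def prox_objective(Z, A, lam):
    """Objective minimized by the prox: 0.5 * ||Z - A||_F^2 + lam * ||Z||_*."""
    return 0.5 * np.linalg.norm(Z - A, "fro") ** 2 + lam * np.linalg.norm(Z, "nuc")

A = np.diag([5.0, 3.0, 1.0])
X = svt(A, lam=2.0)
print(X)
```

With $\lambda = 2$, the singular values $(5, 3, 1)$ become $(3, 1, 0)$, so the prox also lowers the rank.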

Traps. - (1) Non-differentiability: Nuclear norm is not differentiable at rank-deficient matrices, which include the low-rank solutions these methods target; gradients do not exist there. Remedy: use subgradients or proximal methods.

B.16

Full formal proof. For $X = S + E$, where $S$ is a rank-$k$ signal and $E$ is noise with i.i.d. $N(0, \sigma^2)$ entries, PCA consistency is governed by a phase transition. Because $S$ has rank $k$, $\sigma_{k+1}(S) = 0$, so the relevant gap is $\sigma_k(S)$ itself. Davis–Kahan (Wedin) bounds predict: if $\sigma_k(S) > C\sigma\sqrt{n}$ (the weakest signal singular value exceeds the noise level scaled by dimension), then the estimated top-$k$ singular subspace is close to the true subspace. Below this scale, estimation fails (phase transition).

Proof strategy & techniques. Baik, Ben Arous & Péché (2005) showed that in the spiked model, eigenvalues of the sample covariance undergo a phase transition. Above threshold: signal eigenvalue separates from noise bulk. Below: signal is absorbed into noise.

Computational validation. - (a) Above transition: Rank-1 signal with $\sigma_1 = 10$, noise $\sigma = 1$, $n = 100$. Estimated top-1 subspace aligns with true subspace (small canonical angle). ✓ - (b) Below transition: Reduce signal to $\sigma_1 = 1.5$ (close to noise bulk). Estimation fails; angle is random. ✓
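
A simulation in this spirit is sketched below. It assumes a square $n \times n$ noise matrix with unit-variance entries, for which the relevant scale is $\sigma\sqrt{n}$, and picks signal strengths well on either side of it; the values $n = 200$, $5\sqrt{n}$, and $0.3\sqrt{n}$ are illustrative, not the ones quoted above.

```python
import numpy as np

def top_alignment(signal_strength, n=200, noise_sigma=1.0, seed=0):
    """|<u_true, u_hat>| for a rank-1 spike plus i.i.d. Gaussian noise."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    X = signal_strength * np.outer(u, v) + noise_sigma * rng.standard_normal((n, n))
    U_hat, _, _ = np.linalg.svd(X)
    return abs(u @ U_hat[:, 0])    # cosine of the angle to the true direction

n = 200
strong = top_alignment(5.0 * np.sqrt(n), n=n)    # well above sigma * sqrt(n)
weak = top_alignment(0.3 * np.sqrt(n), n=n)      # well below sigma * sqrt(n)
print(f"alignment above transition: {strong:.3f}")
print(f"alignment below transition: {weak:.3f}")
```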

ML interpretation. Phase transitions explain why PCA works under high noise (sufficiently strong signal) and why it fails with weak signals. Critical for understanding limits of unsupervised learning.

Traps. - (1) Assuming PCA always works: Phase transitions show there are regimes where PCA is inconsistent. Remedy: verify signal-to-noise ratio.

B.17

Full formal proof. For matrices $A, B$, the product $C = AB$ satisfies: \[ \sigma_i(AB) \leq \sigma_1(A) \sigma_i(B). \] As consequences, the spectral and Frobenius norms obey: \[ \|AB\|_2 \leq \|A\|_2 \|B\|_2, \quad \|AB\|_F \leq \|A\|_2 \|B\|_F. \] The first is the $i = 1$ case (submultiplicativity of the operator norm). The second follows by squaring and summing over $i$: each singular value of $B$ is amplified by at most $\|A\|_2$, so $\sum_i \sigma_i(AB)^2 \leq \sigma_1(A)^2 \sum_i \sigma_i(B)^2$.

Proof strategy & techniques. Use SVD of $A, B$ and properties of orthogonal transformations (unitary invariance of norms).

Computational validation. - (a) Spectral norm: $A = \text{diag}(2, 1)$, $B = \text{diag}(3, 0.5)$. $\|A\|_2 = 2$, $\|B\|_2 = 3$. Product $AB = \text{diag}(6, 0.5)$, $\|AB\|_2 = 6$. Bound: $2 \times 3 = 6$ (equality). ✓ - (b) Frobenius: for the same matrices, $\|AB\|_F = \sqrt{36.25} \approx 6.02$, while $\|A\|_2 \|B\|_F = 2\sqrt{9.25} \approx 6.08$. ✓
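
The per-index bound and both norm bounds can also be checked on random rectangular matrices. The shapes below are arbitrary assumptions.

```python
import numpy as np

def singular_values(M):
    return np.linalg.svd(M, compute_uv=False)

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 5))
B = rng.standard_normal((5, 4))

s_A = singular_values(A)
s_B = singular_values(B)          # 4 values
s_AB = singular_values(A @ B)     # 4 values, since A @ B is 6 x 4

# sigma_i(AB) <= sigma_1(A) * sigma_i(B), index by index
per_index_ok = bool(np.all(s_AB <= s_A[0] * s_B + 1e-12))

# Norm versions of the same statement
spec_ok = np.linalg.norm(A @ B, 2) <= s_A[0] * s_B[0] + 1e-12
fro_ok = np.linalg.norm(A @ B, "fro") <= s_A[0] * np.linalg.norm(B, "fro") + 1e-12
print(per_index_ok, spec_ok, fro_ok)

# Diagonal example from the text
Ad, Bd = np.diag([2.0, 1.0]), np.diag([3.0, 0.5])
fro_lhs = np.linalg.norm(Ad @ Bd, "fro")
fro_rhs = np.linalg.norm(Ad, 2) * np.linalg.norm(Bd, "fro")
```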

ML interpretation. The Lipschitz constant of a stack of linear layers is bounded by the product of the layers' spectral norms, which is crucial for adversarial robustness and gradient flow control.

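Traps. - (1) Assuming equality always: The bounds are inequalities. Equality in the spectral bound requires the top singular directions of $A$ and $B$ to align, as happens for diagonal matrices with matching order or for aligned rank-1 matrices; in general the norm of the product can be much smaller.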

B.18

Full formal proof. The effective rank (also called the stable rank) of $A$ is defined as: \[ r_{\text{eff}}(A) = \frac{\|A\|_F^2}{\|A\|_2^2} = \frac{\sum_i \sigma_i^2}{\sigma_1^2}. \] Since $\sigma_i \leq \sigma_1$ for every $i$, $\sum_i \sigma_i^2 \leq r \sigma_1^2$ (where $r = \text{rank}(A)$), so $1 \leq r_{\text{eff}} \leq r$. For isotropic matrices (all nonzero $\sigma_i = \sigma_1$), $r_{\text{eff}} = r$. For decaying spectra, $r_{\text{eff}} \ll r$, meaning fewer “effective” degrees of freedom.

Proof strategy & techniques. Effective rank is a continuous relaxation of algebraic rank, capturing the intrinsic complexity without hard rank thresholding.

Computational validation. - (a) Steep decay: $\sigma = [10, 1, 0.1, 0.01]$. $r_{\text{eff}} = (100 + 1 + 0.01 + 0.0001) / 100 \approx 1.01$. Effective rank ≈ 1, despite algebraic rank = 4. ✓ - (b) Isotropic: $\sigma = [1, 1, 1]$. $r_{\text{eff}} = 3 / 1 = 3$ (equals rank). ✓
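
Both values follow from a one-line function. The name `effective_rank` is an illustrative choice.

```python
import numpy as np

def effective_rank(singular_values):
    """Ratio ||A||_F^2 / ||A||_2^2 computed from the singular values."""
    s = np.asarray(singular_values, dtype=float)
    return np.sum(s ** 2) / np.max(s) ** 2

r_steep = effective_rank([10.0, 1.0, 0.1, 0.01])   # steep decay
r_iso = effective_rank([1.0, 1.0, 1.0])            # isotropic
print(f"steep decay: {r_steep:.4f}, isotropic: {r_iso:.1f}")
```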

ML interpretation. Effective rank quantifies complexity without hard thresholding, useful for model selection and understanding sample complexity.

Traps. - (1) Confusing with condition number: $r_{\text{eff}}$ is intrinsic dimension; $\kappa$ is numerical difficulty. Both matter but independently.

B.19

Full formal proof. For symmetric $A \in \mathbb{R}^{n \times n}$ with eigenvalues $\lambda_1, \ldots, \lambda_n$, the nuclear norm is: \[ \|A\|_* = \sum_i |\lambda_i| = \|\lambda\|_1, \] where $\|\lambda\|_1$ is the $\ell^1$ norm of the eigenvalue vector. This follows because for symmetric matrices, singular values equal absolute eigenvalues: $\sigma_i = |\lambda_i|$.

Computational validation. - (a) Definite matrix: $A = \text{diag}(5, 3, 1)$. $\|A\|_* = 5 + 3 + 1 = 9$. ✓ - (b) Mixed signs: $A = \text{diag}(5, -3, 1)$. $\|A\|_* = 5 + 3 + 1 = 9$ (absolute values). ✓
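
The same identity can be checked on a symmetric matrix that is not diagonal. In this sketch the orthogonal change of basis is random, and the eigenvalues $(5, -3, 1)$ are reused from case (b).

```python
import numpy as np

rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthogonal basis
A = Q @ np.diag([5.0, -3.0, 1.0]) @ Q.T
A = (A + A.T) / 2                                  # symmetrize against rounding

nuclear = np.linalg.norm(A, "nuc")                 # sum of singular values
abs_eig_sum = np.sum(np.abs(np.linalg.eigvalsh(A)))
trace = np.trace(A)
print(f"nuclear norm {nuclear:.4f}, sum |eig| {abs_eig_sum:.4f}, trace {trace:.4f}")
```

The trace comes out as $3$, not $9$, which is the gap that appears whenever $A$ has negative eigenvalues.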

ML interpretation. For symmetric matrices, the nuclear-norm relaxation of rank can be computed directly from eigenvalues, which simplifies algorithms.

Traps. - (1) Assuming nuclear norm = trace: For $A \succeq 0$, $\|A\|_* = \text{tr}(A)$ but not for general symmetric $A$. Remedy: use absolute eigenvalues.

B.20

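Full formal proof. By Eckart–Young–Mirsky, $A_k$ minimizes $\|A-X\|_2$ over matrices of rank at most $k$, with error $\sigma_{k+1}$, so any minimizer $X$ must satisfy $\|A-X\|_2 = \sigma_{k+1}$. The lower bound is a dimension count: if $\text{rank}(X) \leq k$, the null space of $X$ meets the span of the top $k+1$ right singular vectors of $A$ in some unit vector $z$, and then $\|(A-X)z\|_2 = \|Az\|_2 \geq \sigma_{k+1}$. A minimizer cannot ignore a top direction: if the column space of $X$ is orthogonal to a left singular vector $u_i$ with $i \leq k$, then $u_i^\top (A-X) v_i = \sigma_i$, so $\|A-X\|_2 \geq \sigma_i \geq \sigma_k$, which exceeds $\sigma_{k+1}$ whenever $\sigma_k > \sigma_{k+1}$. The minimizer itself is not unique, and its column space is tied to the top-$k$ left singular subspace only up to perturbations that keep the residual norm at $\sigma_{k+1}$: for $A = \text{diag}(5, 3, 1)$ and $k = 2$, both $\text{diag}(5, 3, 0)$ and $\text{diag}(4.5, 2.5, 0)$ attain error $1$.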

Proof strategy & techniques. Use spectral norm characterization and properties of invariant subspaces for optimal approximations.

Computational validation. Construct $A$ with a gap $\sigma_k > \sigma_{k+1}$ and verify that $\|A-A_k\|_2 = \sigma_{k+1}$, while a rank-$k$ candidate that drops one of the top-$k$ singular directions has error at least $\sigma_k$.
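
A small instance of this experiment, using $A = \text{diag}(5, 3, 1)$ and $k = 2$ as an assumed test case, compares several rank-2 candidates. The candidate list is illustrative.

```python
import numpy as np

A = np.diag([5.0, 3.0, 1.0])    # sigma = (5, 3, 1), k = 2

def spectral_error(X):
    return np.linalg.norm(A - X, 2)

candidates = {
    "truncated SVD A_2": np.diag([5.0, 3.0, 0.0]),
    "shrunken A_2": np.diag([4.5, 2.5, 0.0]),      # still inside the top subspace
    "drops sigma_2": np.diag([5.0, 0.0, 1.0]),     # misses the second direction
    "drops sigma_1": np.diag([0.0, 3.0, 1.0]),     # misses the first direction
}
errors = {name: spectral_error(X) for name, X in candidates.items()}
for name, err in errors.items():
    print(f"{name:>20s}: {err:.1f}")
```

The first two candidates tie at the optimal error $\sigma_3 = 1$, showing that the spectral-norm minimizer is not unique, while candidates that leave out a top direction pay at least the corresponding singular value.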

ML interpretation. The best spectral-norm compression preserves the most amplified output directions, relevant for worst-case robustness.

Generalization & edge cases. If $\sigma_k = \sigma_{k+1}$, the minimizing subspace is not unique.

Failure mode analysis. If compression ignores top singular directions, worst-case prediction error can spike.

Historical context. This is the spectral-norm counterpart to the classical Eckart–Young theorem.

Traps. Assuming any rank-$k$ subspace gives the same spectral error.

Solutions to C. Python Programming

C.1

Explanation. Implement SVD from first principles using power iteration. Initialize random vector $v_0$, iteratively compute $v_{t+1} = A^\top A v_t / \|A^\top A v_t\|_2$ to find dominant eigenvector of $A^\top A$ (equivalently, top right singular vector). Deflate and repeat for subsequent singular vectors.

ML Interpretation. Power iteration is the foundation of large-scale SVD computation (scikit-learn, TensorFlow use Lanczos/randomized variants). Essential for PCA on matrices too large for dense SVD. Understanding convergence rates informs choice of tolerance and number of iterations required for practical applications.

Failure Modes. (1) Poor initialization: If $v_0$ is orthogonal to dominant eigenvector, convergence fails. Remedy: random initialization or multiple seeds. (2) Slow convergence: The error contracts by roughly $(\sigma_2/\sigma_1)^2$ per iteration, so a small gap between the top two singular values means slow convergence. Remedy: Krylov methods (Lanczos) or block/subspace iteration. (3) Overflow or underflow: Without normalization, iterates grow or shrink geometrically and can overflow or vanish. Remedy: normalize at every step, and when deflating, reorthogonalize against previously found vectors (Gram–Schmidt).

Common Mistakes. (1) Forming $A^\top A$ (or $A A^\top$) explicitly, which squares the condition number ($\kappa^2$ effect). Apply $A$ and then $A^\top$ as two matrix-vector products instead. (2) Comparing against numpy SVD without accounting for iteration count; power iteration needs 10–100+ iterations, not $O(1)$. (3) Forgetting to orthogonalize against previously computed singular vectors when deflating; later vectors then drift back toward the dominant one.

Chapter Connections. Extends § 2 (SVD computation). Connects to § 5 (condition numbers). Prerequisite for § 8 (randomized SVD, sketching).

Code.


import numpy as np

def full_svd_and_check(m=10, n=8, tol=1e-10):
    A = np.random.randn(m, n)
    # Full SVD: U is m x m, Vt is n x n, S holds min(m, n) singular values
    U, S, Vt = np.linalg.svd(A, full_matrices=True)
    # Reconstruct from the leading min(m, n) singular triplets (also valid when m < n)
    r = min(m, n)
    A_recon = U[:, :r] @ np.diag(S) @ Vt[:r, :]
    err = np.linalg.norm(A - A_recon, ord='fro')
    print(f"Reconstruction error: {err:.2e}")
    print(f"U^T U ≈ I: {np.allclose(U.T @ U, np.eye(U.shape[1]), atol=tol)}")
    print(f"V^T V ≈ I: {np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]), atol=tol)}")
    # Relative reconstruction error should sit near machine precision
    assert err < tol * np.linalg.norm(A, ord='fro')
    return err

if __name__ == "__main__":
    full_svd_and_check()
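
The listing above validates a library SVD. The power-iteration-with-deflation procedure described in the Explanation could be sketched as follows; the function names, iteration limits, and the test matrix with singular values $(10, 5, 2, 1)$ are assumptions made for illustration.

```python
import numpy as np

def top_singular_triplet(A, num_iter=500, tol=1e-12, rng=None):
    """Power iteration on A^T A (applied implicitly) for the top singular triplet."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(num_iter):
        w = A.T @ (A @ v)              # two mat-vecs; A^T A is never formed
        norm_w = np.linalg.norm(w)
        if norm_w == 0.0:              # A is numerically zero
            break
        v_new = w / norm_w
        converged = np.linalg.norm(v_new - v) < tol
        v = v_new
        if converged:
            break
    sigma = np.linalg.norm(A @ v)
    u = A @ v / sigma if sigma > 0 else np.zeros(A.shape[0])
    return u, sigma, v

def svd_by_deflation(A, k, **kwargs):
    """Top-k singular triplets by repeatedly removing the component just found."""
    R = np.array(A, dtype=float)
    U, S, V = [], [], []
    for _ in range(k):
        u, sigma, v = top_singular_triplet(R, **kwargs)
        U.append(u)
        S.append(sigma)
        V.append(v)
        R = R - sigma * np.outer(u, v)   # deflation step
    return np.column_stack(U), np.array(S), np.column_stack(V)

# Test matrix with well-separated singular values, so convergence is fast
rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((8, 4)))
Q2, _ = np.linalg.qr(rng.standard_normal((6, 4)))
A = (Q1 * np.array([10.0, 5.0, 2.0, 1.0])) @ Q2.T

U, S, V = svd_by_deflation(A, k=3, rng=rng)
print("singular values from deflation:", S)
print("reference:", np.linalg.svd(A, compute_uv=False)[:3])
```

Deflation accumulates error from each triplet into the next, so this approach suits a handful of leading triplets rather than a full decomposition.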

C.2

Explanation. Code the Lanczos algorithm for eigenvalue computation. Tridiagonalize a symmetric matrix via successive Krylov projections. Compute eigenvalues of the tridiagonal matrix via QR iteration. Recover approximate eigenvectors (Ritz vectors) as $Q_k y$, where $y$ is an eigenvector of the tridiagonal matrix and $Q_k$ holds the Lanczos vectors. Apply re-orthogonalization to prevent numerical drift.

ML Interpretation. Lanczos is the standard for large sparse matrices (> 10⁶ × 10⁶). Used in recommendation systems, network analysis, clustering. Efficiency depends on sparsity of $A$.

Failure Modes. (1) Spurious Ritz values: Loss of orthogonality among the Lanczos vectors produces duplicate “ghost” copies of already converged eigenvalues. Remedy: full or selective reorthogonalization. (2) Orthogonality loss: In floating point, plain Lanczos loses orthogonality as soon as a Ritz pair converges, often within a few dozen iterations. Remedy: reorthogonalize, or use a restarted variant such as thick-restart Lanczos. (3) Breakdown: Lanczos can terminate before spanning full space when the Krylov subspace becomes invariant (rare but real). Remedy: detect and restart with new seed.

Common Mistakes. (1) Assuming tri-diagonal approximation is always better than dense SVD; dense is actually faster for $n < 10^4$. (2) Not checking convergence of Ritz pairs; stopping early gives inaccurate eigenvalues. (3) Implementing without re-orthogonalization; expect loss of orthogonality in 10–50 iterations.

Chapter Connections. § 2 (dense eigenvalue algorithms). § 6 (sparse matrix methods). § 8 (iterative methods).

Code.


import torch

def power_iteration(A, num_iter=100, tol=1e-6):
    m, n = A.shape
    v = torch.randn(n, device=A.device)
    v = v / v.norm()

    for i in range(num_iter):
        # Compute A^T A v for eigenvalue computation
        Av = A @ v  # shape (m,)
        AtAv = A.T @ Av  # shape (n,)
        v_new = AtAv / AtAv.norm()

        # Check convergence on eigenvector
        if torch.abs(torch.dot(v, v_new) - 1.0) < tol:
            break
        v = v_new

    # Compute top singular value: sigma = ||Av||
    Av = A @ v
    sigma = Av.norm()
    return sigma

def main():
    A = torch.randn(20, 10)
    sigma_pi = power_iteration(A)
    sigma_svd = torch.linalg.svdvals(A)[0].item()
    print(f"Power Iteration σ₁: {sigma_pi:.6f}")
    print(f"SVD σ₁: {sigma_svd:.6f}")
    print(f"Difference: {abs(sigma_pi - sigma_svd):.2e}")

if __name__ == "__main__":
    main()
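
For the Lanczos procedure described in the Explanation, a compact NumPy sketch with full reorthogonalization might look like the following. It uses `np.linalg.eigh` on the small tridiagonal matrix in place of a hand-written QR iteration, and the test matrix (eigenvalues spread over $[1, 10]$ plus one at $20$) is an assumption chosen so that the top Ritz value converges quickly.

```python
import numpy as np

def lanczos(A, k, rng=None):
    """k-step Lanczos with full reorthogonalization for symmetric A.

    Returns Ritz values (ascending) and the matching Ritz vectors.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    Q = np.zeros((n, k))
    alpha = np.zeros(k)
    beta = np.zeros(k - 1)
    q = rng.standard_normal(n)
    Q[:, 0] = q / np.linalg.norm(q)
    steps = k
    for j in range(k):
        w = A @ Q[:, j]
        alpha[j] = Q[:, j] @ w
        # Full reorthogonalization against every Lanczos vector built so far
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)
        if j == k - 1:
            break
        beta[j] = np.linalg.norm(w)
        if beta[j] < 1e-12:            # breakdown: Krylov subspace is invariant
            steps = j + 1
            break
        Q[:, j + 1] = w / beta[j]
    # Tridiagonal projection T = Q^T A Q
    off = beta[:steps - 1]
    T = np.diag(alpha[:steps]) + np.diag(off, 1) + np.diag(off, -1)
    theta, Y = np.linalg.eigh(T)
    return theta, Q[:, :steps] @ Y     # Ritz values, Ritz vectors

rng = np.random.default_rng(0)
n = 60
eigs = np.concatenate([np.linspace(1.0, 10.0, n - 1), [20.0]])
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (V * eigs) @ V.T
A = (A + A.T) / 2                      # enforce exact symmetry

theta, X = lanczos(A, k=25, rng=rng)
residual = np.linalg.norm(A @ X[:, -1] - theta[-1] * X[:, -1])
print(f"largest Ritz value: {theta[-1]:.10f}, residual: {residual:.2e}")
```

Full reorthogonalization costs $O(nk)$ per step, which is acceptable for small $k$; selective schemes reduce that cost for longer runs.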

C.3

Explanation. Implement QR decomposition via Gram–Schmidt (classical or modified). For each column $a_j$, orthogonalize against previous columns $q_1, \ldots, q_{j-1}$, normalize. Store upper triangular $R$ in-place. Demonstrate stability: modified Gram–Schmidt is more stable than classical.

ML Interpretation. QR is essential for solving least-squares problems (when $A$ is tall), computing orthonormal bases, and efficient linear solvers. Used in Cholesky-free optimization (QR-based preconditioners).

Failure Modes. (1) Classical Gram–Schmidt instability: The loss of orthogonality in $Q$ grows like $\kappa(A)^2 \varepsilon$ for classical Gram–Schmidt, versus $\kappa(A) \varepsilon$ for the modified version. Remedy: use modified Gram–Schmidt, or reorthogonalize by applying Gram–Schmidt twice. (2) Zero column: Division by zero if $a_j$ is linearly dependent on the previous columns, so that nothing remains after orthogonalization. Remedy: check norms and skip or perturb.

Common Mistakes. (1) Confusing QR with eigendecomposition; QR factors $A$ into orthonormal × triangular, not into eigenvectors. (2) Implementing only classical Gram–Schmidt and calling it “stable”; need modified version. (3) Forgetting that $Q$ is unique only up to column sign choices.

Chapter Connections. § 2 (orthogonal decompositions). § 3 (least-squares via QR). § 6 (numerical stability tradeoffs).

Code.


import numpy as np

def modified_gram_schmidt(A):
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    V = A.copy()
    for i in range(n):
        R[i, i] = np.linalg.norm(V[:, i])
        Q[:, i] = V[:, i] / R[i, i]
        for j in range(i+1, n):
            R[i, j] = np.dot(Q[:, i], V[:, j])
            V[:, j] = V[:, j] - R[i, j] * Q[:, i]
    return Q, R

def main():
    A = np.random.randn(8, 5)
    Q, R = modified_gram_schmidt(A)
    Q2, R2 = np.linalg.qr(A)
    print("||A - QR||_F:", np.linalg.norm(A - Q @ R, ord='fro'))
    print("Q orthonormal:", np.allclose(Q.T @ Q, np.eye(Q.shape[1]), atol=1e-8))

if __name__ == "__main__":
    main()
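
The stability gap between the two variants shows up clearly on a Läuchli-type matrix, a standard ill-conditioned test case. In this sketch the single function `gram_schmidt` with a `modified` flag is an illustrative construction, not the implementation above.

```python
import numpy as np

def gram_schmidt(A, modified=True):
    """Orthonormalize the columns of A by classical or modified Gram-Schmidt."""
    m, n = A.shape
    Q = np.zeros((m, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            # Modified: project the running residual v.
            # Classical: project the original column A[:, j].
            coeff = Q[:, i] @ (v if modified else A[:, j])
            v -= coeff * Q[:, i]
        Q[:, j] = v / np.linalg.norm(v)
    return Q

def orthogonality_loss(Q):
    return np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1]), 2)

eps = 1e-8    # 1 + eps**2 rounds to 1 in double precision
A = np.array([[1.0, 1.0, 1.0],
              [eps, 0.0, 0.0],
              [0.0, eps, 0.0],
              [0.0, 0.0, eps]])

loss_cgs = orthogonality_loss(gram_schmidt(A, modified=False))
loss_mgs = orthogonality_loss(gram_schmidt(A, modified=True))
print(f"classical: {loss_cgs:.2e}, modified: {loss_mgs:.2e}")
```

On this matrix the classical variant produces two columns whose inner product is about $0.5$, while the modified variant stays orthogonal to roughly the size of `eps`.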

C.4

Explanation. Code Cholesky decomposition $A = LL^\top$ for symmetric positive-definite matrices. Iterate over columns $j$: $L_{jj} = \sqrt{A_{jj} - \sum_{k < j} L_{jk}^2}$, then $L_{ij} = (A_{ij} - \sum_{k < j} L_{ik} L_{jk}) / L_{jj}$ for $i > j$. Detect matrices that are not positive definite by testing the quantity under the square root: a zero or negative value signals failure.

ML Interpretation. Cholesky is the fastest standard direct method for solving $Ax = b$ when $A$ is symmetric positive-definite (e.g., normal equations in regression, covariance matrices). It avoids explicit inversion, which is slower and less accurate. Cost is about $n^3/6$ multiplications vs. $n^3/3$ for LU.

Failure Modes. (1) Non-positive-definite input: Cholesky fails if $A$ is not positive definite (a zero or negative pivot appears). Remedy: check, or use modified Cholesky for near-singular cases. (2) Ill-conditioning: If $A$ is near-singular, rounding errors in the solve are amplified by $\kappa(A)$. Remedy: preconditioning or regularization.

Common Mistakes. (1) Assuming Cholesky works on symmetric matrices that are indefinite or singular; it requires strict positive definiteness. (2) Computing $L^{-1}$ explicitly; solve via forward and back substitution instead. (3) Comparing with LU without accounting for 2× speedup from symmetry.

Chapter Connections. § 2 (matrix decompositions). § 3 (linear solving). § 6 (numerical stability via Cholesky vs. normal equations).

Code.


import numpy as np

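def cholesky_decomp(A):
    # Returns lower-triangular L with A = L @ L.T; A must be symmetric positive definite
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        pivot = A[j, j] - np.sum(L[j, :j] ** 2)
        if pivot <= 0:
            # A zero or negative pivot means A is not (numerically) positive definite
            raise ValueError(f"matrix is not positive definite (pivot {pivot:.3e} at column {j})")
        L[j, j] = np.sqrt(pivot)
        for i in range(j+1, n):
            L[i, j] = (A[i, j] - np.sum(L[i, :j] * L[j, :j])) / L[j, j]
    return L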

def main():
    A = np.random.randn(6, 6)
    A = A @ A.T + np.eye(6) * 1e-3  # Make SPD
    L = cholesky_decomp(A)
    L_np = np.linalg.cholesky(A)
    print("||A - LL^T||_F:", np.linalg.norm(A - L @ L.T, ord='fro'))
    print("Matches numpy:", np.allclose(L, L_np, atol=1e-8))

if __name__ == "__main__":
    main()
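
Once $L$ is available, solving $Ax = b$ takes one forward and one back substitution, with no explicit inverse. A minimal sketch follows, using `np.linalg.cholesky` for the factorization; the helper names and the $2 \times 2$ test system are illustrative.

```python
import numpy as np

def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = np.zeros(len(b))
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_sub(U, y):
    """Solve U x = y for upper-triangular U."""
    x = np.zeros(len(y))
    for i in reversed(range(len(y))):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

def cholesky_solve(A, b):
    L = np.linalg.cholesky(A)    # raises LinAlgError if A is not positive definite
    return back_sub(L.T, forward_sub(L, b))

def is_positive_definite(A):
    """Use the factorization itself as a positive-definiteness test."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

A = np.array([[4.0, 2.0], [2.0, 3.0]])
b = np.array([2.0, 1.0])
x = cholesky_solve(A, b)
print("solution:", x)
print("indefinite matrix accepted?", is_positive_definite(np.array([[1.0, 2.0], [2.0, 1.0]])))
```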

C.5

Explanation. Implement matrix-free SVD of a linear operator (function that applies $A$, $A^\top$). Use power iteration with matrix-free matrix-vector products. Apply in context of sparse or structured matrices (e.g., images via 2D convolutions, graphs via adjacency operations).

ML Interpretation. Matrix-free SVD scales to massive problems (terabyte-sized data). Used in scientific computing, hyperscale ML (Google, DeepMind), quantum simulators. Only need to code $v \mapsto Av$ and $u \mapsto A^\top u$, not store $A$.

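Failure Modes. (1) Incorrect adjoint: If $A^\top$ is implemented incorrectly, the iteration converges to the wrong vector. Remedy: run the dot-product test, checking that $\langle Av, u \rangle = \langle v, A^\top u \rangle$ for random $u, v$. (2) Slow convergence: The iteration count is set by the singular value gap and the total cost by each $Av$ and $A^\top u$ product, so efficiency depends on how well those products exploit structure (e.g., sparsity).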

Common Mistakes. (1) Computing $A$ explicitly “just to test,” defeating purpose. (2) Assuming matrix-free is always faster; for small dense matrices, dense SVD is faster. (3) Not testing adjoint correctness before running full algorithm.

Chapter Connections. § 2 (SVD). § 5 (adjoints in backprop). § 8 (scalability).

Code.


import numpy as np

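# One fixed operator shared by both mat-vecs. In a real matrix-free setting A is
# never stored; matvec_A / matvec_At would apply it implicitly (e.g., a convolution).
_rng = np.random.default_rng(0)
_A = _rng.standard_normal((20, 10))

def matvec_A(v):
    # Forward operator: v -> A v
    return _A @ v

def matvec_At(u):
    # Adjoint operator: u -> A^T u (must match matvec_A exactly)
    return _A.T @ u

def matrix_free_power_iteration(n, m, num_iter=100, tol=1e-6):
    # n = output dimension, m = input dimension of the operator
    v = np.random.randn(m)
    v /= np.linalg.norm(v)
    for _ in range(num_iter):
        Av = matvec_A(v)
        AtAv = matvec_At(Av)
        v_new = AtAv / np.linalg.norm(AtAv)
        converged = np.linalg.norm(v_new - v) < tol
        v = v_new
        if converged:
            break
    return v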

def main():
    n, m = 20, 10
    v = matrix_free_power_iteration(n, m)
    print("Top right singular vector (approx):", v)

if __name__ == "__main__":
    main()
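
Before trusting a matrix-free operator, it helps to confirm that the forward and adjoint routines actually form an adjoint pair. The sketch below applies the dot-product test to a 1D convolution; the filter `h`, the signal length, and the deliberately wrong adjoint are assumptions for illustration.

```python
import numpy as np

h = np.array([1.0, -2.0, 0.5])    # hypothetical asymmetric filter
n = 16                            # input length

def conv_forward(v):
    # A v: full convolution, output length n + len(h) - 1
    return np.convolve(v, h, mode="full")

def conv_adjoint(u):
    # A^T u: correlation with the same filter, output length n
    return np.correlate(u, h, mode="valid")

def conv_wrong_adjoint(u):
    # Common bug: convolving again instead of correlating (filter not flipped)
    return np.convolve(u, h, mode="valid")

def adjoint_gap(forward, adjoint, n_in, n_out, rng):
    """Relative mismatch between <A v, u> and <v, A^T u> for random u, v."""
    v = rng.standard_normal(n_in)
    u = rng.standard_normal(n_out)
    lhs = forward(v) @ u
    rhs = v @ adjoint(u)
    return abs(lhs - rhs) / max(abs(lhs), abs(rhs), 1e-30)

rng = np.random.default_rng(0)
n_out = n + len(h) - 1
good = adjoint_gap(conv_forward, conv_adjoint, n, n_out, rng)
bad = adjoint_gap(conv_forward, conv_wrong_adjoint, n, n_out, rng)
print(f"correct adjoint gap: {good:.2e}")
print(f"wrong adjoint gap:   {bad:.2e}")
```

A gap near machine precision indicates a consistent pair. The test needs only a couple of operator applications, so it is cheap enough to run before every long computation.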

C.6

Explanation. Implement truncated SVD (keep top-$k$ singular values). Read matrix in blocks, apply power iteration to find top-$k$ singular vectors, then project data and compute tail singular values. Compare speed vs. dense SVD.

ML Interpretation. Truncated SVD avoids computing all singular values, which is critical for dimensionality reduction (PCA), matrix completion (Netflix), anomaly detection. Speed: $O(ndk)$ for iterative top-$k$ methods vs. $O(nd \min(n, d))$ for dense SVD.

Failure Modes. (1) Overestimating $k$: Including noise singular values reduces effectiveness. Remedy: use elbow method or cross-validation. (2) Underestimating $k$: Missing variance. Remedy: keep cumulative variance sum and threshold (e.g., 95%).
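
The cumulative-variance rule in (2) can be written as a small helper. The function name and the example spectrum are illustrative assumptions.

```python
import numpy as np

def rank_for_energy(singular_values, threshold=0.95):
    """Smallest k whose top-k singular values capture `threshold` of the squared energy."""
    s = np.asarray(singular_values, dtype=float)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(energy, threshold)) + 1
    return min(k, len(s))    # guard against rounding at threshold = 1.0

s = [10.0, 5.0, 1.0, 0.5]
k = rank_for_energy(s, threshold=0.95)
print("chosen rank:", k)
```

Here the first singular value alone captures about 79% of the energy and the first two about 99%, so the 95% threshold selects $k = 2$.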

Common Mistakes. (1) Comparing truncated SVD error to dense SVD error without fairness (comparing different $k$ values). (2) Not timing the actual matrix-vector multiplications; speedup depends heavily on implementation. (3) Applying truncated SVD to uncentered data when PCA is intended; center the columns first, and standardize if features are on different scales.

Chapter Connections. § 2 (SVD). § 4 (PCA dimensionality reduction). § 8 (scalable SVD).

Code.


import numpy as np
import time

def truncated_svd(A, k):
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :]

def main():
    A = np.random.randn(100, 50)
    k = 5
    t0 = time.time()
    U, S, Vt = truncated_svd(A, k)
    t1 = time.time()
    A_k = U @ np.diag(S) @ Vt
    err = np.linalg.norm(A - A_k, ord='fro')
    print(f"Truncated SVD error: {err:.2e}, time: {t1-t0:.3f}s")
    t0 = time.time()
    U2, S2, Vt2 = np.linalg.svd(A, full_matrices=False)
    t1 = time.time()
    print(f"Full SVD time: {t1-t0:.3f}s")

if __name__ == "__main__":
    main()

C.7

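Explanation. Implement randomized SVD (Halko et al., 2011). Sketch the matrix $A$ via random projection $Y = A \Omega$ (with $\Omega$ a Gaussian or structured random test matrix). Optionally refine the sketch with a few power iterations, then compute a QR factorization of the sketch to obtain an orthonormal basis $Q$. Project $A$ onto this basis ($B = Q^\top A$), take the SVD of the small matrix $B$, and map the left singular vectors back to the original space. Compare accuracy and speed vs. dense/iterative SVD.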

ML Interpretation. Randomized SVD is the industry standard for scalable PCA (scikit-learn, TensorFlow). One-pass over data possible. Streaming variants for online PCA.

Failure Modes. (1) Too little oversampling: Use a sketch of width $k + p$ with a small oversampling parameter (Halko et al. suggest $p$ around 5–10). Without oversampling, accuracy degrades. Remedy: if the rank is unknown, use adaptive rank selection. (2) Repeated sketching: If data is sketched multiple times (memory issues), correlation can bias results. (3) Non-Gaussian random matrices: Structured test matrices (sparse, SRHT) may require more samples for the same accuracy.

Common Mistakes. (1) Using pseudorandom generator without proper seeding; results are non-reproducible across runs. (2) Not implementing re-orthogonalization step; approximation quality suffers. (3) Assuming randomized SVD is always faster; depends on oversampling ratio and number of power iterations.

Chapter Connections. § 2 (SVD). § 7 (randomization, sketching). § 8 (randomized algorithms, online learning).

Code.


import numpy as np

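def randomized_svd(A, k, n_iter=2, p=5):
    n, d = A.shape
    # Gaussian test matrix with p oversampling columns
    Omega = np.random.randn(d, k + p)
    Q, _ = np.linalg.qr(A @ Omega, mode='reduced')
    for _ in range(n_iter):
        # Power iteration, re-orthogonalized at each half-step for stability
        Z, _ = np.linalg.qr(A.T @ Q, mode='reduced')
        Q, _ = np.linalg.qr(A @ Z, mode='reduced')
    # Project onto the sketched range and decompose the small matrix
    B = Q.T @ A
    U_hat, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_hat
    return U[:, :k], S[:k], Vt[:k, :]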

def main():
    A = np.random.randn(100, 30)
    k = 5
    U, S, Vt = randomized_svd(A, k)
    A_k = U @ np.diag(S) @ Vt
    err = np.linalg.norm(A - A_k, ord='fro')
    print(f"Randomized SVD error: {err:.2e}")

if __name__ == "__main__":
    main()

C.8

Explanation. Implement robust PCA via convex optimization: $\min_{L, S} \|L\|_* + \lambda\|S\|_1$ s.t. $L + S = M$. Use alternating direction method of multipliers (ADMM) or proximal gradient descent. Initialize $L = M$, apply nuclear norm and $\ell^1$ shrinkage operators.

ML Interpretation. Robust PCA decomposes data into low-rank signal (background) + sparse noise (foreground/outliers). Used in video surveillance (separate moving objects from stable background), face recognition (subtract shadow/illumination variation), anomaly detection.

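Failure Modes. (1) Rank/sparsity trade-off: If $\lambda$ is too small, $S$ absorbs everything and $L \to 0$. If it is too large, $S = 0$ and $L = M$, so no low-rank structure is recovered. Remedy: start from the standard choice $\lambda = 1/\sqrt{\max(m, n)}$ for an $m \times n$ matrix, then tune by cross-validation or automatic thresholding. (2) Convergence speed: ADMM needs many iterations (~100–1000) for useful accuracy. Remedy: warm-start if possible.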

Common Mistakes. (1) Applying on full-rank matrices; robust PCA only works if $M$ is truly low-rank + sparse. (2) Not normalizing $\|M\|_F$ before choosing λ; scaling matters. (3) Comparing to non-robust PCA unfairly; robust PCA handles outliers, standard PCA doesn’t.

Chapter Connections. § 2 (SVD, nuclear norm). § 4 (PCA). § 6 (convex optimization). § 9 (robust methods).

Code.


import numpy as np

def soft_threshold(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0)

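def robust_pca(M, lam=1.0, mu=1.0, max_iter=100):
    # Alternating minimization of the penalized relaxation
    #   0.5*||M - L - S||_F^2 + mu*||L||_* + lam*||S||_1.
    # This is a simplification of ADMM: there is no dual variable,
    # so L + S = M holds only approximately.
    L = M.copy()
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # Singular value thresholding
        U, s, Vt = np.linalg.svd(M - S, full_matrices=False)
        s_thresh = soft_threshold(s, mu)
        L = (U * s_thresh) @ Vt
        # l1 thresholding
        S = soft_threshold(M - L, lam)
    return L, S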

def main():
    np.random.seed(0)
    M = np.random.randn(20, 20)
    M[0, 0] += 10  # Add outlier
    L, S = robust_pca(M, lam=0.5, mu=0.5, max_iter=10)
    print("Low-rank part norm:", np.linalg.norm(L, 'nuc'))
    print("Sparse part norm:", np.sum(np.abs(S)))

if __name__ == "__main__":
    main()

C.9

Explanation. Implement matrix completion via nuclear norm minimization. Use ADMM or proximal operator splitting. Handle missing observations: $\min_X \|X\|_*$ s.t. $X_\Omega = M_\Omega$. Compare reconstruction on synthetic low-rank matrices vs. noise-corrupted data.

ML Interpretation. Foundational for recommender systems (Netflix Prize, collaborative filtering). Recovers missing ratings. Also used in medical imaging (MRI reconstruction), sensor network imputation.

Failure Modes. (1) Insufficient observations: With fewer than roughly $O(nr \log^2 n)$ observed entries, exact recovery is no longer guaranteed and typically fails. Remedy: more data or stronger priors (e.g., known rank). (2) Incoherence violation: If observed entries concentrate in few rows/columns, recovery fails. Remedy: randomize sampling or use structured variants.

Common Mistakes. (1) Assuming all unobserved entries are “bad”; they’re unknown, not necessarily corrupted. (2) Not checking incoherence; assuming any low-rank matrix can be recovered. (3) Using dense solver on large matrices; must use iterative or sketching methods.

Chapter Connections. § 2 (nuclear norm). § 6 (convex optimization). § 8 (scalable matrix completion).

Code.


import numpy as np

def soft_threshold(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0)

def matrix_completion(M, mask, lam=1.0, max_iter=50):
    X = np.zeros_like(M)
    for _ in range(max_iter):
        X[mask] = M[mask]
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s_thresh = soft_threshold(s, lam)
        X = (U * s_thresh) @ Vt
        X[mask] = M[mask]
    return X

def main():
    np.random.seed(0)
    # Synthetic rank-2 ground truth, since the method assumes low-rank data
    M = np.random.randn(10, 2) @ np.random.randn(2, 10)
    mask = np.random.rand(10, 10) > 0.3  # True where an entry is observed
    M_obs = M.copy()
    M_obs[~mask] = 0
    X = matrix_completion(M_obs, mask, lam=0.5, max_iter=10)
    print("Error on unobserved entries:", np.linalg.norm((M - X)[~mask]))

if __name__ == "__main__":
    main()

C.10

Explanation. Implement kernel PCA (kPCA). Compute Gram matrix $K_{ij} = k(x_i, x_j)$ for Gaussian/polynomial kernel. Center $K$ (centering in high-dim space). Compute top-$k$ eigenvectors of centered $K$. Project via $y_i = \sum_j \alpha_j k(x_i, x_j)$ (kernel trick).

ML Interpretation. Kernel PCA captures nonlinear patterns while still reducing to a symmetric eigenvalue problem. Used for manifold learning (swiss roll, spiral), clustering, regression on non-linear data.

Failure Modes. (1) Kernel choice sensitive: Gaussian width $\sigma$ is crucial. Remedy: cross-validation. (2) Scaling to large $n$: Storing the full $K$ takes $O(n^2)$ memory, and its eigendecomposition takes $O(n^3)$ time. Remedy: Nyström approximation or random features.

Common Mistakes. (1) Forgetting to center $K$ (equivalent to centering in high-dim space). (2) Using Gaussian kernel without tuning $\sigma$; default often fails. (3) Assuming kPCA always beats linear PCA; depends on data geometry.

Chapter Connections. § 4 (PCA nonlinear variants). § 7 (kernel methods). § 9 (nonlinear dimension reduction).

Code.


import numpy as np

def rbf_kernel(X, sigma=1.0):
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma ** 2))

def center_gram(K):
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n

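def kernel_pca(X, k=2, sigma=1.0):
    K = rbf_kernel(X, sigma)
    Kc = center_gram(K)
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:k]
    top_vals = np.maximum(eigvals[idx], 0)
    # Projections of the training points: Kc @ (v / sqrt(lam)) = sqrt(lam) * v
    return eigvecs[:, idx] * np.sqrt(top_vals), top_vals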

def main():
    X = np.random.randn(30, 3)
    Z, vals = kernel_pca(X, k=2, sigma=1.0)
    print("Top 2 kernel PCA eigenvalues:", vals)

if __name__ == "__main__":
    main()

C.11

Explanation. Implement spectral clustering. Compute affinity graph $W_{ij} = \exp(-\|x_i - x_j\|^2 / \sigma^2)$. Construct the graph Laplacian $L = D - W$, where $D$ is the diagonal degree matrix. Compute the $k$ eigenvectors of $L$ with the smallest eigenvalues (or of the normalized Laplacian $D^{-1/2} L D^{-1/2}$). Run k-means on the rows of the resulting eigenvector matrix.

ML Interpretation. Spectral clustering handles non-convex cluster shapes (rings, elongated clusters) that defeat plain k-means. It uses only pairwise affinities (kernel method), so it applies directly to graphs and networks.

Failure Modes. (1) Wrong $k$: If true number of clusters is misspecified, clustering fails. Remedy: eigengap heuristic (largest jump in eigenvalues) or silhouette score. (2) Bandwidth selection: Gaussian kernel width $\sigma$ is critical. Too small → all points isolated; too large → all merged. Remedy: self-tuning via local scaling (Zelnik-Manor & Perona, 2004).

Common Mistakes. (1) Forgetting to normalize Laplacian for mixed-dimension data. (2) Not centering eigendecomposition (centering in feature space). (3) Applying k-means naively on eigenvectors; for the normalized Laplacian, row-normalize the embedding to unit length first so that Euclidean distances between rows are meaningful.

Chapter Connections. § 4 (clustering). § 5 (graph Laplacians). § 6 (graph-based methods).
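
Code. A minimal NumPy sketch of the pipeline described in the Explanation, under stated assumptions: it uses the symmetric normalized Laplacian with row-normalized eigenvectors, and a bare-bones Lloyd k-means with farthest-first initialization. The function name `spectral_clustering`, its parameters, and the two-blob toy data are illustrative choices made here, not a reference implementation.

```python
import numpy as np

def spectral_clustering(X, n_clusters=2, sigma=1.0, n_kmeans_iter=50):
    # Gaussian affinities W_ij = exp(-||x_i - x_j||^2 / sigma^2), zero diagonal
    sq = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)
    W = np.exp(-sq_dists / sigma ** 2)
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized Laplacian I - D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L_sym = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh sorts eigenvalues ascending, so the first columns are the smallest
    _, vecs = np.linalg.eigh(L_sym)
    E = vecs[:, :n_clusters]
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # row-normalize
    # Farthest-first initialization of the k-means centers
    centers = [E[0]]
    for _ in range(1, n_clusters):
        d2 = np.min([((E - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(E[np.argmax(d2)])
    centers = np.array(centers)
    # Plain Lloyd iterations on the embedded rows
    for _ in range(n_kmeans_iter):
        dists = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = E[labels == c].mean(axis=0)
    return labels

# Toy data: two tight, well-separated blobs of 20 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(0.0, 0.1, (20, 2)) + 5.0])
labels = spectral_clustering(X, n_clusters=2, sigma=1.0)
print(labels)
```

The sketch assumes every point has nonzero degree; with a very small $\sigma$ the degree vector can underflow to zero and the normalization would then need a guard.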

C.12

Explanation. Implement PCA with missing values (EM algorithm). Initialize $\mu, U, \Sigma$ randomly. E-step: compute posterior $P(\mathbf{z} \mid x_{\text{obs}})$ for latent factors given observed components. M-step: update parameters. Iterate until convergence.

ML Interpretation. Handles real-world incomplete datasets without discarding incomplete samples. Used in recommendation systems (partial user histories), bioinformatics (missing gene measurements).

Failure Modes. (1) Initialization sensitivity: EM converges to local optima. Remedy: multiple restarts. (2) Slow convergence: E-step can be expensive for large $n, d$. Remedy: stochastic EM.

Common Mistakes. (1) Comparing to complete-case deletion without accounting for bias. (2) Not handling systematic missingness (MCAR vs. MAR vs. MNAR); likelihood-based EM is valid when data are missing at random (MAR, which includes MCAR) but is biased under MNAR. (3) Using EM when other methods (multiple imputation, Bayesian) are better.

Chapter Connections. § 4 (PCA on incomplete data). § 6 (EM algorithm). § 9 (missing data handling).

Code.


import numpy as np

def pca_em(X, k=2, max_iter=10):
    # Simplified EM-style scheme: alternate a rank-k PCA fit with
    # re-imputation of the missing entries (no explicit noise model).
    n, d = X.shape
    # Initialize missing values to column means
    X_filled = X.copy()
    isnan = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X_filled[isnan] = np.take(col_means, np.where(isnan)[1])
    for _ in range(max_iter):
        # M-step: PCA on filled data
        mu = X_filled.mean(axis=0)
        U, S, Vt = np.linalg.svd(X_filled - mu, full_matrices=False)
        # E-step: reconstruct missing values
        X_recon = (U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]) + mu
        X_filled[isnan] = X_recon[isnan]
    return X_filled

def main():
    X = np.random.randn(20, 5)
    X[0, 0] = np.nan
    X[3, 2] = np.nan
    X_filled = pca_em(X, k=2, max_iter=5)
    print("Imputed matrix:\n", X_filled)

if __name__ == "__main__":
    main()

C.13

Explanation. Implement incremental/online PCA (Oja’s rule). For streaming data, update eigenvectors incrementally without storing all history. Per sample $x$: $u \gets u + \eta (x^\top u) x$ (Hebbian rule with learning rate $\eta$), then normalize $u$. For $k > 1$ components, the generalized Hebbian (Sanger) update additionally subtracts the part of $x$ already explained by earlier components.

ML Interpretation. Useful for real-time systems (stock trading, IoT, streaming media). Update component estimates without recomputing from scratch. Used in adaptive signal processing, online recommenders.

Failure Modes. (1) Learning rate selection: Too large → divergence; too small → slow convergence. Remedy: decreasing schedules (Robbins–Monro). (2) Concept drift: If data distribution changes, PCA estimates lag. Remedy: adaptive decay factor (older samples weighted less).

Common Mistakes. (1) Not orthogonalizing multiple components; second component “steals” variance from first. Remedy: deflation or orthogonal subspace updates. (2) Comparing convergence to batch PCA unfairly; online converges slowly but uses constant memory.

Chapter Connections. § 4 (online PCA). § 7 (stochastic methods). § 8 (streaming/online learning).

Code.


import numpy as np

def oja_rule(X, k=1, eta=0.01, n_iter=1):
    n, d = X.shape
    W = np.random.randn(d, k)
    W /= np.linalg.norm(W, axis=0, keepdims=True)
    for _ in range(n_iter):
        for x in X:
            x = x.reshape(-1, 1)
            y = W.T @ x  # current component scores, shape (k, 1)
            # Sanger's rule: Hebbian term minus the part already
            # explained by this and earlier components
            W += eta * (x @ y.T - W @ np.triu(y @ y.T))
            # Normalize columns
            W /= np.linalg.norm(W, axis=0, keepdims=True)
    return W

def main():
    X = np.random.randn(100, 3)
    W = oja_rule(X, k=1, eta=0.01, n_iter=2)
    print("Online PCA component:", W)

if __name__ == "__main__":
    main()

C.14

Explanation. Implement sparse PCA (Zou et al., 2006). Solve $\max_w w^\top S w - \lambda\|w\|_1$ s.t. $\|w\|_2 = 1$ where $S = X^\top X / n$. Use iterative thresholding: compute PCA direction, soft-threshold coefficients, re-normalize. Extract interpretable components.

ML Interpretation. Sparse PCA recovers interpretable components (features with few nonzero weights). Standard PCA uses all features, which makes components difficult to explain. Sparse PCA is widely used in genomics (find genes driving phenotype), text analysis (identify key words per topic).

Failure Modes. (1) Sparsity level $\lambda$: Too large → all zeros; too small → dense (not sparse). Remedy: cross-validation. (2) Non-convexity: Problem is NP-hard; heuristics find local optima.

Common Mistakes. (1) Expecting sparse PCA to outperform standard PCA predictively; it optimizes interpretability, not variance. (2) Not standardizing data; feature scale affects sparsity pattern. (3) Confusing soft-thresholding (shrinks surviving coefficients) with hard-thresholding (keeps them unchanged).

Chapter Connections. § 4 (PCA variants). § 6 (sparsity-inducing regularization). § 9 (interpretable ML).

Code.


import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0)

def sparse_pca(X, lam=0.1, k=1, n_iter=10):
    # Thresholded power iteration for the first sparse component only;
    # k is kept for interface compatibility and is not used.
    n, d = X.shape
    w = np.random.randn(d)
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        w = X.T @ (X @ w) / n  # multiply by S = X^T X / n
        w = soft_threshold(w, lam)
        if np.linalg.norm(w) > 0:
            w /= np.linalg.norm(w)
    return w

def main():
    X = np.random.randn(50, 10)
    w = sparse_pca(X, lam=0.2, k=1, n_iter=10)
    print("Sparse PCA component:", w)

if __name__ == "__main__":
    main()

C.15

Explanation. Implement bilinear dimensionality reduction (reduce both rows and columns). Compute SVD of data matrix, then apply PCA to rows and columns separately: $X \approx U A V^\top$ where $U$ and $V$ have few columns and $A$ is a small core matrix. Useful for image/tensor data.

ML Interpretation. Useful for images (rows = spatial height, columns = spatial width + channels). Can compress images more effectively than vectorizing. Related to multilinear algebra, tensor methods.

Failure Modes. (1) Rank selection: Choosing ranks separately (row rank vs. column rank) requires tuning two hyperparameters. (2) Interpretation: Reduced factors are harder to interpret than in standard PCA.

Common Mistakes. (1) Confusing with 2D/multilinear PCA; implementation differs. (2) Not centering modes properly in high-order tensors.

Chapter Connections. § 4 (matrix/tensor methods). § 7 (multilinear algebra). § 8 (image/video methods).

Code.


import numpy as np

def bilinear_reduction(X, row_rank=2, col_rank=2):
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    U_r = U[:, :row_rank]
    V_c = Vt[:col_rank, :]
    # Core matrix A = U_r^T X V_c^T, shape (row_rank, col_rank);
    # this also works when the two ranks differ
    core = U_r.T @ X @ V_c.T
    X_approx = U_r @ core @ V_c
    return X_approx

def main():
    X = np.random.randn(20, 10)
    X_approx = bilinear_reduction(X, row_rank=2, col_rank=2)
    print("Approximation error:", np.linalg.norm(X - X_approx, ord='fro'))

if __name__ == "__main__":
    main()

C.16

Explanation. Implement factor analysis (FA): latent variable model $x = \mu + \Lambda z + \epsilon$ where $z \sim N(0, I)$, $\epsilon \sim N(0, \Psi)$ (diagonal noise). Estimate $\Lambda$ (loadings) and $\Psi$ (noise variances) via EM.

ML Interpretation. Probabilistic alternative to PCA. Assumes Gaussian latent structure. Provides uncertainty quantification (posterior over latent factors). Used in psychometrics, econometrics, factor investing.

Failure Modes. (1) Model selection: Number of latent factors $k$ must be chosen. Remedy: likelihood ratio tests or BIC. (2) Identifiability: $\Lambda$ is identified only up to orthogonal rotation (need to rotate for interpretability).

Common Mistakes. (1) Confusing with PCA; FA is probabilistic (generative model), PCA is descriptive. (2) Not accounting for noise variances; ignoring $\Psi$ reduces to PCA. (3) Comparing likelihood to non-probabilistic methods.

Chapter Connections. § 4 (latent variable models). § 6 (probabilistic modeling). § 9 (generative models).

Code.


import numpy as np

def factor_analysis(X, k=2, max_iter=10):
    n, d = X.shape
    # Initialize
    mu = X.mean(axis=0)
    Xc = X - mu
    Lambda = np.random.randn(d, k)
    Psi = np.ones(d)
    for _ in range(max_iter):
        # E-step: posterior moments of the latent factors
        Sigma = Lambda @ Lambda.T + np.diag(Psi)
        beta = Lambda.T @ np.linalg.inv(Sigma)          # (k, d)
        Ez = Xc @ beta.T                                # E[z | x] per sample
        Ezz = n * (np.eye(k) - beta @ Lambda) + Ez.T @ Ez  # sum of E[z z^T | x]
        # M-step: update Lambda, Psi
        Lambda = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
        Psi = np.mean(Xc ** 2, axis=0) - np.mean((Ez @ Lambda.T) * Xc, axis=0)
        Psi = np.maximum(Psi, 1e-8)  # keep noise variances positive
    return Lambda, Psi

def main():
    X = np.random.randn(50, 5)
    Lambda, Psi = factor_analysis(X, k=2, max_iter=20)
    print("Loadings shape:", Lambda.shape)
    print("Noise variances:", Psi)

if __name__ == "__main__":
    main()

C.17

Explanation. Implement manifold learning via Laplacian Eigenmaps. Construct $k$-NN graph, compute graph Laplacian, find bottom-$d$ eigenvectors (non-trivial ones). The embedding preserves local geometry (nearby points stay nearby).

ML Interpretation. Nonlinear dimensionality reduction preserving local distances. Used for visualization, clustering, semi-supervised learning on manifolds.

Failure Modes. (1) Connectivity: If graph is disconnected or poorly connected, embeddings are poor. Remedy: symmetrize the $k$-NN graph, increase the number of neighbors, and check the number of connected components. (2) Kernel bandwidth: Gaussian edge weights depend on $\sigma$. Remedy: self-tuning or cross-validation.

Common Mistakes. (1) Applying on high-ambient-dimension data without first reducing; curse of dimensionality. (2) Forgetting that embedding is defined only on training data; new points require out-of-sample extension.

Chapter Connections. § 5 (graph Laplacians). § 9 (manifold learning). § 10 (semi-supervised learning).

Code.


import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmaps(X, k=2, n_neighbors=5):
    W = kneighbors_graph(X, n_neighbors, mode='connectivity', include_self=False).toarray()
    # k-NN relations are not symmetric; symmetrize so that L is symmetric
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))
    L = D - W  # unnormalized Laplacian (a simplification of L y = lambda D y)
    eigvals, eigvecs = np.linalg.eigh(L)
    idx = np.argsort(eigvals)
    return eigvecs[:, idx[1:k+1]]  # skip trivial eigenvector

def main():
    X = np.random.randn(30, 3)
    Z = laplacian_eigenmaps(X, k=2, n_neighbors=5)
    print("Laplacian Eigenmaps embedding shape:", Z.shape)

if __name__ == "__main__":
    main()

C.18

Explanation. Implement independent component analysis (ICA): separate independent sources from mixed observations $x = A s$. Use FastICA algorithm (fixed-point iteration with nonlinearity $g(u) = \tanh(u)$ or $g(u) = u \exp(-u^2/2)$). Estimate unmixing matrix $W \approx A^{-1}$.

ML Interpretation. ICA solves the blind source separation problem. Audio (separate multiple speakers from microphone mix), medical imaging (separate brain activity sources), financial data (identify latent risk factors).

Failure Modes. (1) Permutation ambiguity: ICA recovers sources up to permutation and scale. No unique ordering. (2) Gaussian sources: ICA cannot separate Gaussian sources (identifiable only up to orthogonal rotations). Remedy: assume at most one Gaussian.

Common Mistakes. (1) Confusing with PCA; ICA seeks statistically independent components, not merely uncorrelated ones. (2) Not preprocessing (centering, whitening); ICA assumes $\mathbb{E}[x] = 0$ and $\mathbb{E}[xx^\top] = I$. (3) Expecting unique solutions; post-hoc permutation/scaling sorting needed.

Chapter Connections. § 4 (blind source separation). § 6 (independent sources). § 9 (source separation, signal processing).

Code.


import numpy as np

def g(u):
    return np.tanh(u)

def g_prime(u):
    return 1 - np.tanh(u) ** 2

def fastica(X, n_components=2, max_iter=200, tol=1e-4):
    n, d = X.shape
    # Center and whiten so that the empirical covariance is the identity
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc / n)
    Z = Xc @ evecs / np.sqrt(evals)
    # Random orthonormal rows as the initial unmixing matrix
    W = np.linalg.qr(np.random.randn(d, n_components))[0].T
    for i in range(max_iter):
        WZ = W @ Z.T
        W_new = (g(WZ) @ Z) / n - np.diag(g_prime(WZ).mean(axis=1)) @ W
        # Decorrelate (symmetric orthogonalization)
        U, S, Vt = np.linalg.svd(W_new, full_matrices=False)
        W_new = U @ Vt
        converged = np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1)) < tol
        W = W_new
        if converged:
            break
    S_ = Z @ W.T  # estimated sources; W acts on the whitened data
    return S_, W

def main():
    X = np.random.randn(100, 3)
    S, W = fastica(X, n_components=2)
    print("ICA sources shape:", S.shape)

if __name__ == "__main__":
    main()

C.19

Explanation. Implement canonical correlation analysis (CCA) for paired multiview data. Find directions in two views that maximize correlation: $\max_{u, v} u^\top X_1^\top X_2 v / \sqrt{u^\top X_1^\top X_1 u \cdot v^\top X_2^\top X_2 v}$. Solve via generalized eigenvalue problem.

ML Interpretation. Learn shared representations from multiple data modalities (text + images, audio + video). Used in domain adaptation, cross-modal retrieval, multi-view learning.

Failure Modes. (1) Overparameterization: If one view has high dimension, spurious correlations emerge (curse of dimensionality). Remedy: regularization or dimension reduction first. (2) Missing data: CCA requires paired observations.

Common Mistakes. (1) Not centering views independently. (2) Not regularizing $X_1^\top X_1$ and $X_2^\top X_2$ when the sample size is small relative to the data dimension.

Chapter Connections. § 4 (multi-view learning). § 5 (correlation analysis). § 10 (domain adaptation).

Code.


import numpy as np

def cca(X, Y, k=1):
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Cxx = Xc.T @ Xc
    Cyy = Yc.T @ Yc
    Cxy = Xc.T @ Yc
    # Solve generalized eigenproblem Cxx^{-1} Cxy Cyy^{-1} Cyx w = rho^2 w
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    eigvals, Wx = np.linalg.eig(M)
    # Eigenvalues are real in exact arithmetic; drop round-off imaginary parts
    eigvals, Wx = eigvals.real, Wx.real
    idx = np.argsort(eigvals)[::-1]
    Wx = Wx[:, idx[:k]]
    Wy = np.linalg.solve(Cyy, Cxy.T @ Wx)
    return Wx, Wy

def main():
    X = np.random.randn(50, 3)
    Y = np.random.randn(50, 2)
    Wx, Wy = cca(X, Y, k=1)
    print("CCA directions shapes:", Wx.shape, Wy.shape)

if __name__ == "__main__":
    main()

C.20

Explanation. Implement non-negative matrix factorization (NMF) via alternating least-squares with non-negativity constraints: $\min_{W, H \geq 0} \|X - WH\|_F^2$. Update $W$ and $H$ alternately using non-negative least-squares or projected gradient descent; clipping an unconstrained least-squares solution at zero is a simpler heuristic.

ML Interpretation. NMF enforces interpretability: factors are additive combinations (no cancellations). Used in document topic modeling (topics as word combinations), image analysis (parts-based decomposition, Lee & Seung 1999), audio source separation.

Failure Modes. (1) Non-uniqueness: NMF solutions are not unique (permutation of factors, arbitrary scaling within constraints). Remedy: post-hoc fixing or penalize for uniqueness. (2) Initialization sensitivity: Random init often fails. Remedy: NNDSVD initialization (nonnegative from SVD).

Common Mistakes. (1) Comparing to SVD/PCA without accounting for non-negativity constraint; NMF uses fewer “effective” dimensions. (2) Not handling data scaling; negative values break the model (must have non-negative input). (3) Expecting convexity; NMF is non-convex, local optima are common.

Chapter Connections. § 2 (matrix factorization). § 4 (topic modeling, parts-based learning). § 6 (non-convex optimization).

Code.


import numpy as np

def nmf(X, k=2, n_iter=50):
    n, d = X.shape
    W = np.abs(np.random.randn(n, k))
    H = np.abs(np.random.randn(k, d))
    for _ in range(n_iter):
        H = np.linalg.lstsq(W, X, rcond=None)[0]
        H = np.maximum(H, 0)
        W = np.linalg.lstsq(H.T, X.T, rcond=None)[0].T
        W = np.maximum(W, 0)
    return W, H

def main():
    X = np.abs(np.random.randn(30, 10))
    W, H = nmf(X, k=3, n_iter=10)
    print("NMF reconstruction error:", np.linalg.norm(X - W @ H, ord='fro'))

if __name__ == "__main__":
    main()

End of C Solutions

Appendices

In Context

Algorithmic Development History

The conceptual roots of SVD are tied to orthogonalization. Schmidt’s work on orthonormal bases (later formalized as Gram-Schmidt) provided the operational mechanism for constructing orthogonal systems, a prerequisite for spectral factorizations. These ideas matured into the spectral theorem for symmetric matrices, and the extension to rectangular matrices required recognizing that input and output spaces may need different bases, a conceptual leap that SVD formalizes.

The formal low-rank approximation result is due to Eckart and Young (1936), who showed that truncating singular values yields the optimal approximation in Frobenius norm. Mirsky later extended this to other unitarily invariant norms, establishing the modern Eckart–Young–Mirsky theorem. These results were foundational for the development of numerical linear algebra, especially as computation demanded stable, principled approximations of large matrices.

Numerical linear algebra in the mid-20th century advanced the computation of SVD through bidiagonalization and iterative methods. Golub and Kahan’s algorithms made SVD practical for large matrices, while subsequent work on Lanczos and Arnoldi methods enabled partial SVD in high dimensions. The rise of randomized algorithms in the 2000s provided scalable approximations, transforming SVD from a cubic-time tool into a routine component of large-scale data pipelines.

In statistics and signal processing, SVD emerged as the backbone of principal component analysis and the Karhunen–Loeve transform, both of which interpret data as projections onto orthogonal directions of maximum variance. In signal processing, SVD-based denoising and low-rank structure identification became standard for separating signal from noise in time-series and image data.

In machine learning, low-rank methods rose with recommender systems, latent semantic analysis, and large-scale embeddings. The Netflix Prize popularized matrix factorization and nuclear norm methods, and deep learning introduced new uses: compressing weight matrices, stabilizing training via spectral normalization, and analyzing generalization through effective rank. SVD shifted from a purely mathematical tool to a practical foundation for scalable ML.

Why This Matters for ML

Compression and Scalability

Modern ML models are large, and SVD provides a principled path to compression. Truncated SVD replaces a dense weight matrix with low-rank factors, reducing parameters and memory footprint while controlling error through the singular value spectrum. This is essential for deploying models on edge devices, reducing inference latency, and enabling training on limited hardware. Randomized and partial SVD algorithms further extend scalability, allowing approximate decompositions of massive matrices without full materialization.
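
The parameter saving described above can be made concrete with a short sketch. The weight matrix here is synthetic (a rank-8 signal plus small noise), and the helper name `low_rank_factors` is an assumption introduced for illustration; real weight matrices may have much slower singular value decay.

```python
import numpy as np

def low_rank_factors(W, k):
    """Return A (m x k) and B (k x n) with A @ B equal to the rank-k truncated SVD of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # absorb singular values into the left factor
    B = Vt[:k]
    return A, B, s

rng = np.random.default_rng(0)
# Hypothetical 256 x 128 weight matrix: rank-8 structure plus small noise
W = (rng.standard_normal((256, 8)) @ rng.standard_normal((8, 128))
     + 0.01 * rng.standard_normal((256, 128)))

k = 8
A, B, s = low_rank_factors(W, k)
params_dense = W.size                 # m * n
params_low_rank = A.size + B.size     # k * (m + n)
rel_err = np.linalg.norm(W - A @ B, "fro") / np.linalg.norm(W, "fro")

print("dense parameters:   ", params_dense)
print("low-rank parameters:", params_low_rank)
print("relative Frobenius error:", rel_err)
```

Replacing `W` by the pair `(A, B)` turns one matrix-vector product of cost $mn$ into two of total cost $k(m+n)$, which is where the latency saving comes from when $k \ll \min(m, n)$.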

Generalization and Capacity Control

Low rank acts as an implicit capacity constraint: fewer singular values means fewer degrees of freedom. This influences generalization by preventing overfitting to noise. Nuclear norm regularization makes this explicit, while optimization dynamics in overparameterized models often bias toward low effective rank even without explicit constraints. Understanding singular values thus helps interpret why certain models generalize well and how to control complexity through rank-aware design.

Failure Modes if Rank Structure Is Ignored

Ignoring rank structure leads to multiple failure modes. Overfitting occurs when models retain noisy singular modes, particularly when spectra are flat. Numerical instability arises when tiny singular values cause ill-conditioning, leading to large solution variance or slow training. In data pipelines, failure to examine singular value decay can result in ineffective dimensionality reduction or misleading PCA interpretations. These failures are not edge cases; they are common in high-dimensional datasets and large models.
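
The ill-conditioning failure mode can be seen in a two-variable toy system. The sketch below is illustrative only: the matrix, the perturbation, and the relative tolerance `1e-6` are assumptions chosen to make the effect visible, and the truncation is written out by hand from the SVD.

```python
import numpy as np

# Hypothetical 2 x 2 system with singular values 1 and 1e-8
A = np.diag([1.0, 1e-8])
x_true = np.array([1.0, 1.0])
b = A @ x_true + np.array([0.0, 1e-6])   # tiny perturbation along the weak direction

U, s, Vt = np.linalg.svd(A)
kappa = s[0] / s[-1]                      # condition number sigma_max / sigma_min

# Naive inverse: divides by every singular value, amplifying the perturbation
x_naive = Vt.T @ ((U.T @ b) / s)

# Truncated pseudoinverse: drop directions whose sigma is below a relative tolerance
keep = s > 1e-6 * s[0]
x_trunc = Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])

print(f"condition number: {kappa:.1e}")
print("naive solution:    ", x_naive)
print("truncated solution:", x_trunc)
```

A perturbation of size $10^{-6}$ moves the naive solution by about 100 in the weak direction, while the truncated solution stays bounded at the cost of giving up that direction entirely.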

Forward References to Optimization and Deep Learning

The next optimization chapters will rely on SVD for understanding convergence and stability. Gradient descent behavior is governed by singular values of Hessians or Jacobians, and preconditioning is best interpreted as reshaping the spectrum. In deep learning, spectral normalization, low-rank adaptation (LoRA), and implicit bias toward low rank are central themes. The concepts and tools from this chapter therefore serve as prerequisites for analyzing optimization dynamics, generalization, and scalable model design in later sections.

Motivation

Beyond Eigenvalues: Rectangular Matrices

Eigenvalues and eigenvectors, as developed in Chapter 06, provide a powerful lens for understanding square matrices: they reveal invariant directions (eigenvectors) and scaling factors (eigenvalues) that characterize how a linear transformation acts on its domain. However, most matrices arising in machine learning are not square. A dataset $X \in \mathbb{R}^{n \times d}$ with $n = 10^4$ samples and $d = 10^3$ features is rectangular, as are weight matrices in neural networks ($W \in \mathbb{R}^{\text{hidden} \times \text{input}}$), recommender system rating matrices ($R \in \mathbb{R}^{\text{users} \times \text{items}}$), and document-term matrices in NLP ($A \in \mathbb{R}^{\text{docs} \times \text{vocab}}$). For such matrices, the notion of eigenvector—a vector $\mathbf{v}$ satisfying $A\mathbf{v} = \lambda\mathbf{v}$—does not make sense: if $A \in \mathbb{R}^{m \times n}$ with $m \neq n$, then $A\mathbf{v} \in \mathbb{R}^m$ while $\mathbf{v} \in \mathbb{R}^n$, so they cannot be proportional (different dimensional spaces). We need a generalization.

Singular value decomposition (SVD) provides this generalization by recognizing that a rectangular matrix $A : \mathbb{R}^n \to \mathbb{R}^m$ maps between two different vector spaces, and thus requires two separate bases—one for the input space $\mathbb{R}^n$ and one for the output space $\mathbb{R}^m$. Instead of seeking a single set of eigenvectors, SVD finds right singular vectors $\mathbf{v}_i \in \mathbb{R}^n$ (input directions) and left singular vectors $\mathbf{u}_i \in \mathbb{R}^m$ (output directions) such that $A\mathbf{v}_i = \sigma_i \mathbf{u}_i$, where $\sigma_i \geq 0$ is the singular value quantifying the stretching factor along the $i$-th principal axis. This decomposition expresses $A$ as a sum of rank-1 matrices: $A = \sum_{i=1}^r \sigma_i \mathbf{u}_i\mathbf{v}_i^\top$, where $r = \text{rank}(A)$. Each term $\sigma_i \mathbf{u}_i\mathbf{v}_i^\top$ represents a mode of variation: the input pattern $\mathbf{v}_i$ is transformed into the output pattern $\mathbf{u}_i$, scaled by $\sigma_i$.

The key insight is that even though $A$ itself has no eigenvectors (being non-square), the symmetric matrices $A^\top A$ and $AA^\top$ do. The Gram matrix $A^\top A \in \mathbb{R}^{n \times n}$ is symmetric positive semi-definite (PSD), so by the spectral theorem (Chapter 06), it admits an orthogonal eigendecomposition $A^\top A = V\Lambda V^\top$ with eigenvalues $\lambda_i \geq 0$ and orthonormal eigenvectors $\mathbf{v}_i$. The singular values of $A$ are defined as $\sigma_i = \sqrt{\lambda_i(A^\top A)}$, and the right singular vectors are the eigenvectors $\mathbf{v}_i$ of $A^\top A$. Similarly, $AA^\top \in \mathbb{R}^{m \times m}$ (also symmetric PSD) has eigenvectors $\mathbf{u}_i$, the left singular vectors. The relationship $A\mathbf{v}_i = \sigma_i \mathbf{u}_i$ follows by left-multiplying the eigenvalue equation by $\mathbf{v}_i^\top$ and using $\|\mathbf{v}_i\| = 1$: \[ A^\top A \mathbf{v}_i = \lambda_i \mathbf{v}_i \implies \mathbf{v}_i^\top A^\top A\mathbf{v}_i = \lambda_i \mathbf{v}_i^\top \mathbf{v}_i \implies \|A\mathbf{v}_i\|^2 = \lambda_i \implies \|A\mathbf{v}_i\| = \sigma_i. \] For $\sigma_i > 0$, normalizing $\mathbf{u}_i = A\mathbf{v}_i / \sigma_i$ gives the left singular vectors. These are indeed eigenvectors of $AA^\top$, since $AA^\top (A\mathbf{v}_i) = A(A^\top A\mathbf{v}_i) = \lambda_i A\mathbf{v}_i$.

This construction reveals a profound symmetry: $A$ relates two orthonormal bases ($\{\mathbf{v}_i\}$ in $\mathbb{R}^n$ and $\{\mathbf{u}_i\}$ in $\mathbb{R}^m$) via diagonal scaling ($\Sigma$). In matrix form, $A = U\Sigma V^\top$, where $U \in \mathbb{R}^{m \times m}$ has columns $\mathbf{u}_i$, $V \in \mathbb{R}^{n \times n}$ has columns $\mathbf{v}_i$, and $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with entries $\sigma_i$. This factorization always exists for any real matrix (even rank-deficient or non-square), making SVD more general than eigendecomposition (which requires square diagonalizable matrices).
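
The construction above can be checked numerically. The following is a small sketch, assuming a random $5 \times 3$ matrix (which has full column rank almost surely, so every $\sigma_i > 0$ and the division is safe); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # m = 5, n = 3

# Eigendecomposition of the Gram matrix A^T A (eigh returns ascending order)
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]     # reorder so sigma_1 >= sigma_2 >= ...
lam, V = lam[order], V[:, order]

sigma = np.sqrt(lam)              # singular values sigma_i = sqrt(lambda_i)
U = (A @ V) / sigma               # columns u_i = A v_i / sigma_i

# The u_i are orthonormal, A = U diag(sigma) V^T, and sigma matches NumPy's SVD
u_orthonormal = np.allclose(U.T @ U, np.eye(3))
reconstructs = np.allclose(U @ np.diag(sigma) @ V.T, A)
matches_svd = np.allclose(sigma, np.linalg.svd(A, compute_uv=False))
print(u_orthonormal, reconstructs, matches_svd)
```

Forming $A^\top A$ explicitly squares the condition number, so this route is suitable for illustration but library SVD routines avoid it in practice.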

Data Compression as Geometry

One of the most powerful applications of SVD is lossy data compression: representing a large matrix $A$ with a smaller, approximate version $A_k$ by retaining only the top $k$ singular values and vectors. This is not merely a computational trick—it has deep geometric meaning rooted in projection theory and metric spaces.

Imagine a dataset $X \in \mathbb{R}^{n \times d}$ with $n = 10^5$ images, each $d = 256 \times 256 = 65{,}536$ pixels. Storing $X$ requires $\approx 6.5$ GB at one byte per pixel. But natural images are not random: they exhibit spatial correlations (adjacent pixels are similar), spectral structure (dominated by low-frequency components), and redundancy (many pixel patterns repeat). SVD exploits this by finding a coordinate system (the right singular vectors $\mathbf{v}_i$) in which most energy concentrates in the first $k \ll d$ coordinates. Projecting onto these $k$ directions yields $X_k = X V_k V_k^\top$, where $V_k \in \mathbb{R}^{d \times k}$ contains the first $k$ right singular vectors. The projected dataset is stored as the coefficient matrix $XV_k \in \mathbb{R}^{n \times k}$ together with the basis $V_k$, that is, $k(n+d)$ numbers instead of $nd$. For $k=50$ this is about $8.3 \times 10^6$ numbers versus $6.6 \times 10^9$, roughly 0.13% of the original, or close to 800× compression at equal precision per number.

The geometric interpretation is that each row of $X_k$ is the orthogonal projection of the corresponding row of $X$ onto the subspace $\mathcal{V}_k = \text{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k)$. Among all $k$-dimensional subspaces, $\mathcal{V}_k$ minimizes the reconstruction error measured in Frobenius norm: \[ \|X - X_k\|_F^2 = \sum_{i=1}^n \|\mathbf{x}_i - \text{proj}_{\mathcal{V}_k}(\mathbf{x}_i)\|^2 = \sum_{j=k+1}^r \sigma_j^2, \] where $r = \text{rank}(X)$. This error is the sum of squared singular values not included in the approximation. If singular values decay rapidly (e.g., $\sigma_j \sim e^{-\alpha j}$ for some $\alpha > 0$), then even small $k$ achieves tiny error: the relative squared error is approximately $e^{-2\alpha k}$, so for $\alpha = 0.1$ and $k=50$ the relative Frobenius error is about $e^{-5} \approx 0.007$ and the relative squared error about $e^{-10} \approx 4.5 \times 10^{-5}$, meaning more than 99.99% of the energy is retained.
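The error identity can be verified on synthetic data. This is a small sketch, assuming a random low-rank matrix as a stand-in for real data; `retained` is an illustrative name for the fraction of squared Frobenius energy kept.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data of rank at most 12 (a stand-in for a real dataset).
X = rng.standard_normal((40, 12)) @ rng.standard_normal((12, 30))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5
V_k = Vt[:k].T                        # d x k, top-k right singular vectors
X_k = X @ V_k @ V_k.T                 # project each row onto span(v_1, ..., v_k)

err_sq = np.linalg.norm(X - X_k, "fro") ** 2
tail_sq = np.sum(s[k:] ** 2)          # sum of discarded squared singular values
retained = 1.0 - tail_sq / np.sum(s ** 2)
```

Here `err_sq` and `tail_sq` agree up to rounding, and `X_k` coincides with the truncated factorization $U_k \Sigma_k V_k^\top$.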

Why is this projection optimal? Consider any $k$-dimensional subspace $\mathcal{W}_k$. By the Pythagorean theorem applied to each sample, the projection error is: \[ \sum_{i=1}^n \|\mathbf{x}_i - \text{proj}_{\mathcal{W}_k}(\mathbf{x}_i)\|^2 = \sum_{i=1}^n \|\mathbf{x}_i\|^2 - \sum_{i=1}^n \|\text{proj}_{\mathcal{W}_k}(\mathbf{x}_i)\|^2. \] The first term is fixed (total data energy), so minimizing error is equivalent to maximizing the projected energy $\sum_{i=1}^n \|\text{proj}_{\mathcal{W}_k}(\mathbf{x}_i)\|^2$. SVD achieves this by aligning $\mathcal{W}_k$ with the top $k$ eigenvectors of $X^\top X$ (the right singular vectors of $X$); when $X$ is centered, these are the principal axes of the data’s covariance ellipsoid. This is the geometric essence of principal component analysis (PCA): the principal components are the directions of maximum variance, and projecting onto them minimizes information loss.

In machine learning, this geometry helps explain why deep networks compress data implicitly: intermediate representations in a trained neural network typically lie near low-dimensional subspaces (they have low intrinsic dimensionality). Even if a hidden layer has 1000 neurons, the activations may lie near a 20-dimensional manifold, with the remaining 980 directions corresponding to noise or redundant features. Discarding the directions associated with small singular values of the activation or weight matrix removes much of this redundancy, often with little loss in performance.

Rank as Structural Complexity

The rank of a matrix $A$, defined as the dimension of its column space (or equivalently, row space), measures the number of independent pieces of information encoded in $A$. For an $m \times n$ matrix, the maximum rank is $\min(m,n)$—a full-rank matrix—but many real-world matrices are low-rank or approximately low-rank, meaning their true information content is much smaller than their ambient dimensions suggest.

Why does low rank arise? In statistical terms, low rank corresponds to latent structure: high-dimensional observations are generated by a small number of hidden factors. For example:
- A user-movie rating matrix $R \in \mathbb{R}^{10^5 \times 10^4}$ (100k users, 10k movies) appears to be 1 billion-dimensional, but if users’ preferences are determined by ~20 latent factors (e.g., “likes action,” “dislikes romance,” “prefers indie films”), then $R \approx UV^\top$ with $U \in \mathbb{R}^{10^5 \times 20}$ (user factors) and $V \in \mathbb{R}^{10^4 \times 20}$ (movie factors), so $R$ has approximate rank 20.
- A grayscale image $I \in \mathbb{R}^{512 \times 512}$ (262k pixels) may be well-approximated by rank 50-100 if most of the image is smooth gradients, with only edges contributing high-frequency details. The SVD separates smooth components (large $\sigma_i$, captured at low rank) from texture (small $\sigma_i$, requiring high rank).
- A document-term matrix $A \in \mathbb{R}^{10^4 \times 10^5}$ (10k documents, 100k words) in natural language processing is sparse and approximately rank ~300, because words cluster into topics: documents about “machine learning” use similar vocabularies, reducing effective dimensionality from 100k tokens to ~300 semantic topics.

Rank also controls model complexity in machine learning. A linear model $f(\mathbf{x}) = W\mathbf{x}$ with weight matrix $W \in \mathbb{R}^{m \times n}$ has $mn$ parameters if $W$ is full-rank, but only $k(m+n)$ if $\text{rank}(W) = k$ (stored as two factors of sizes $m \times k$ and $k \times n$). For large $m, n$, this difference is dramatic: a 1000×1000 matrix has 1 million parameters at full rank, but only $25 \times 2000 = 50{,}000$ for rank 25 (a 20× reduction). Low-rank constraints thus serve as regularization, preventing overfitting by limiting model expressiveness. This is formalized in matrix completion theory, where, under incoherence assumptions, recovering a rank-$r$ matrix from on the order of $r(m+n)\log^2(m+n)$ random entries is possible via nuclear norm minimization, far fewer than the $mn$ entries needed for unconstrained recovery.

The structural complexity perspective reveals that rank is a non-convex measure (NP-hard to minimize exactly), but it has a convex relaxation: the nuclear norm $\|A\|_* = \sum_{i=1}^r \sigma_i$, which sums all singular values. Just as the $\ell^1$ norm $\|\mathbf{x}\|_1 = \sum_i |x_i|$ promotes sparsity (many zero entries), the nuclear norm promotes low rank (many zero singular values). This analogy is precise: the nuclear norm is the $\ell^1$ norm of the vector of singular values. Optimization problems of the form $\min_M \tfrac{1}{2}\|A - M\|_F^2 + \lambda\|M\|_*$ (fit data $A$ with a low-rank matrix $M$) are tractable convex programs, solvable via proximal gradient methods. With the factor $\tfrac{1}{2}$ on the data term, the solution has a closed form: soft-threshold the singular values of $A$ at level $\lambda$ (without the $\tfrac{1}{2}$, the threshold becomes $\lambda/2$): \[ M^* = \sum_{i=1}^r \max(\sigma_i - \lambda, 0) \mathbf{u}_i\mathbf{v}_i^\top. \] This singular value thresholding is the matrix analog of soft-thresholding in sparse recovery (LASSO).
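Singular value thresholding is short enough to sketch directly. The function below assumes the objective $\tfrac{1}{2}\|A - M\|_F^2 + \tau\|M\|_*$, so that the threshold equals $\tau$; the name `svt` and the toy diagonal input are illustrative choices.

```python
import numpy as np

def svt(A, tau):
    """Soft-threshold the singular values of A at level tau.

    Assumes the objective 0.5 * ||A - M||_F^2 + tau * ||M||_* (note the 0.5 factor).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)    # shrink every singular value, clip at zero
    return (U * s_shrunk) @ Vt, s_shrunk   # rebuild M from the shrunken spectrum

# Toy input with known singular values 5, 3, 0.5.
A = np.diag([5.0, 3.0, 0.5])
M, s_shrunk = svt(A, tau=1.0)
```

On this input the singular values 5, 3, 0.5 become 4, 2, 0, so the result has rank 2: the smallest direction is removed entirely while the others are shrunk.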

Approximation as Optimization

A central theme of Chapter 07 is that low-rank approximation is an optimization problem, and SVD provides the globally optimal solution. This is formalized by the Eckart-Young-Mirsky theorem, which states:

Theorem (Eckart-Young-Mirsky, 1936/1960). Let $A \in \mathbb{R}^{m \times n}$ have SVD $A = \sum_{i=1}^r \sigma_i \mathbf{u}_i\mathbf{v}_i^\top$. For any $k < r$, the solution to \[ \min_{\text{rank}(X) \leq k} \|A - X\|_F \quad \text{and} \quad \min_{\text{rank}(X) \leq k} \|A - X\|_2 \] is $X^* = A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i\mathbf{v}_i^\top$, with errors: \[ \|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2 \quad \text{and} \quad \|A - A_k\|_2 = \sigma_{k+1}. \]
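Both error formulas in the theorem can be checked numerically. The sketch below assumes a small random matrix; as a sanity check only (not a proof), it also compares $A_k$ against rank-$k$ competitors built from random column spaces.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k]          # truncated SVD
err_F = np.linalg.norm(A - A_k, "fro")     # should equal sqrt(sum_{i>k} sigma_i^2)
err_2 = np.linalg.norm(A - A_k, 2)         # should equal sigma_{k+1}

# Random rank-k competitors: the best fit to A within a random k-dim column space.
best_other = np.inf
for _ in range(200):
    B = rng.standard_normal((8, k))
    X = B @ np.linalg.lstsq(B, A, rcond=None)[0]
    best_other = min(best_other, np.linalg.norm(A - X, "fro"))
```

No competitor beats `err_F`, as the theorem guarantees for every matrix of rank at most $k$.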

This theorem is remarkable for several reasons:
1. Uniqueness: In the Frobenius norm, $A_k$ is the unique minimizer when $\sigma_k > \sigma_{k+1}$; if $\sigma_k = \sigma_{k+1}$, minimizers differ by the choice of basis within the tied singular subspace. In the operator norm, $A_k$ is optimal but generally not the only minimizer.
2. Explicit error: The approximation error is determined exactly by the discarded singular values, with no dependence on data distribution or problem-specific constants.
3. Algorithmic: Computing $A_k$ reduces to computing the SVD and truncating, a mature, numerically stable procedure implemented in every scientific computing library.
4. Multi-objective: The same $A_k$ is optimal for two different norms (Frobenius and operator), unlike many optimization problems where different objectives yield different solutions.

The proof strategy uses projection theory: the Frobenius norm $\|A - X\|_F^2 = \sum_{ij} (A_{ij} - X_{ij})^2$ is the squared Euclidean distance in the space of matrices, and the set of matrices of rank at most $k$ forms a nonconvex algebraic variety. The SVD basis diagonalizes the problem: in the coordinates $A = U\Sigma V^\top$, minimizing $\|A - X\|_F$ over rank-$k$ matrices becomes minimizing $\|\Sigma - U^\top X V\|_F$ over rank-$k$ matrices, whose minimizer is the truncation of $\Sigma$ to its top $k$ diagonal entries (setting the rest to zero). For the operator norm, the result follows from the fact that $\|A - X\|_2 \geq \sigma_{k+1}$ for any $X$ of rank at most $k$ (by Weyl's inequality for singular values, $\sigma_{k+1}(A) \leq \sigma_1(A - X) + \sigma_{k+1}(X) = \|A - X\|_2$), and $A_k$ achieves this bound.

This optimization perspective explains why SVD is ubiquitous in machine learning:
- PCA: Minimize reconstruction error $\sum_{i=1}^n \|\mathbf{x}_i - \text{proj}_{\mathcal{V}}(\mathbf{x}_i)\|^2$ over $k$-dimensional subspaces $\mathcal{V}$ → solved by top $k$ right singular vectors of $X$.
- Collaborative filtering: Minimize $\sum_{(i,j) \in \Omega} (R_{ij} - (UV^\top)_{ij})^2$ (fit observed ratings) → approximated by truncated SVD of imputed $R$.
- Latent semantic analysis: Minimize $\|A - M\|_F$ subject to $\text{rank}(M) = k$ for document-term matrix $A$ → solved by SVD.
- Neural network pruning: Minimize $\|W - W_k\|_F$ to compress weights → directly given by SVD.

Common Misconceptions

Several misconceptions about SVD and low-rank approximation persist in machine learning practice, often leading to suboptimal implementations or misinterpretation of results:

Misconception 1: “SVD only works for small matrices.” In fact, randomized SVD algorithms compute approximate rank-$k$ decompositions in roughly $O(mn\log k + (m+n)k^2)$ time when structured random projections are used (or $O(mnk)$ with Gaussian projections), making SVD tractable for matrices with billions of entries. For example, Spotify’s recommender system processes a 150M×50M user-track matrix using distributed randomized SVD. The key is exploiting approximate computation: exact singular values are rarely needed; approximate values within 1% often suffice in applications.
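The idea behind randomized SVD can be sketched in a few lines. The version below is a simplified Gaussian-projection variant with oversampling and a few power iterations; the function name, default parameters, and test matrix are illustrative assumptions, not the implementation used by any particular library.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate top-k SVD via a random range finder (Gaussian test matrix)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    ell = min(k + oversample, min(m, n))       # sketch size
    Omega = rng.standard_normal((n, ell))
    Q, _ = np.linalg.qr(A @ Omega)             # orthonormal basis for the sampled range
    for _ in range(n_iter):                    # power iterations sharpen the spectrum
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    B = Q.T @ A                                # small (ell x n) matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 100))  # exactly rank 8
U, s, Vt = randomized_svd(A, k=8)
s_exact = np.linalg.svd(A, compute_uv=False)[:8]
```

Because this test matrix has exact rank 8 and the sketch size exceeds 8, the approximation is exact up to rounding; for matrices with slowly decaying spectra the accuracy depends on the oversampling and the number of power iterations.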

Misconception 2: “Low rank means poor data quality.”
Actually, low rank indicates structure, not noise. Many real datasets (images, text, genomics) have rapidly decaying singular values precisely because they contain regularities. An i.i.d. random matrix has a comparatively flat spectrum with no sharp drop (for an $m \times n$ matrix with unit-variance entries and $m \gg n$, all $\sigma_i \approx \sqrt{m}$), making low-rank approximation ineffective; structured data (e.g., images with smooth regions) often has rapidly decaying $\sigma_i$, making compression highly effective. Low rank is a blessing, not a curse.

Misconception 3: “PCA and SVD are different methods.”
PCA is SVD applied to centered data. Computing PCA via eigendecomposition of the covariance matrix $C = \frac{1}{n}X^\top X$ (written $C$ here to avoid confusion with the singular value matrix $\Sigma$) is numerically worse than computing the SVD of $X$ directly, because forming $X^\top X$ squares the condition number ($\kappa(C) = \kappa(X)^2$), amplifying numerical errors. Widely used implementations such as scikit-learn's PCA, MATLAB's pca, and R's prcomp provide SVD-based solvers for this reason, although some also offer covariance-based options for the case $n \gg d$, where forming $X^\top X$ is cheaper.
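The squaring of the condition number is easy to observe. The sketch below builds a matrix with prescribed singular values (an illustrative construction; centering is ignored for simplicity) and compares $\kappa(X)$ with $\kappa(X^\top X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Build X = U diag(s) V^T with known singular values 1, 1e-1, 1e-2, 1e-3.
U, _ = np.linalg.qr(rng.standard_normal((50, 4)))   # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))    # orthogonal matrix
s = np.array([1.0, 1e-1, 1e-2, 1e-3])
X = (U * s) @ V.T

kappa_X = np.linalg.cond(X)            # 2-norm condition number, sigma_max / sigma_min
kappa_gram = np.linalg.cond(X.T @ X)   # condition number of the Gram matrix
```

Here $\kappa(X) \approx 10^3$ while $\kappa(X^\top X) \approx 10^6$, so an eigen-solver on the Gram matrix starts with roughly three fewer digits of accuracy for the smallest component.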

Misconception 4: “Truncated SVD maximizes variance.”
Truncated SVD does maximize the cumulative captured variance $\sum_{i=1}^k \sigma_i^2$ among all $k$-dimensional projections; the misconception is that captured variance equals usefulness. Variance can also be spread almost evenly across leading components: if $\sigma_1 = 100$, $\sigma_2 = 99$, and $\sigma_3 = 1$, the second component explains almost as much variance as the first (about 49.5% versus 50.5% of the total), so dropping it discards nearly half the variance, whereas dropping the third loses only about 0.005%. PCA is greedy (selects components in order of variance), but may not yield the best subset for a downstream task: selecting components 1, 2, 5 might outperform 1, 2, 3 (e.g., if component 5 is discriminative for classification). Post-hoc component selection requires task-specific criteria (supervised feature selection).

Misconception 5: “SVD requires complete data.”
Standard SVD requires no missing entries, but matrix completion algorithms (soft-impute, OptSpace) extend SVD to partial observations by iteratively imputing missing values and recomputing the SVD. Theoretical guarantees (Candès and Recht, with later refinements) show that under incoherence conditions, exact recovery is possible from $O(nr\log^2 n)$ random entries for an $n \times n$ rank-$r$ matrix, far fewer than the $n^2$ total entries. This theory underpins the matrix-factorization recommenders popularized by the Netflix Prize.

Misconception 6: “Singular vectors are interpretable features.”
SVD vectors are orthonormal linear combinations of original features, often lacking semantic meaning. For example, in text data, a singular vector might combine “neural,” “network,” and “dog” with arbitrary signs, making interpretation difficult. Non-negative matrix factorization (NMF) imposes $A \approx WH$ with $W, H \geq 0$ (elementwise), yielding parts-based representations (e.g., facial features, topics) that are more interpretable, but NMF sacrifices optimality (NP-hard, no closed-form solution, local minima).

Misconception 7: “More components are always better.”
Retaining too many singular vectors (e.g., $k$ close to $r$) overfits noise. In practice, cross-validation determines optimal $k$: split data into train/test, compute SVD on train, reconstruct test, and choose $k$ minimizing test error. Overfitting manifests as test error increasing beyond some $k^*$, even as training error monotonically decreases. The bias-variance trade-off applies: small $k$ underfits (high bias), large $k$ overfits (high variance), and $k^*$ balances both.

ML Connection

PCA and Dimensionality Reduction

Principal Component Analysis (PCA) is perhaps the most widely used application of SVD in machine learning, serving as the foundational technique for unsupervised dimensionality reduction. Given a dataset $X \in \mathbb{R}^{n \times d}$ of $n$ samples in $d$-dimensional space (e.g., images with $d = 784$ pixels), PCA identifies a $k$-dimensional linear subspace ($k \ll d$) that captures the maximum variance in the data. The principal components are the orthonormal basis vectors spanning this subspace, and they are precisely the leading right singular vectors of the centered data matrix $\tilde{X} = X - \mathbf{1}\bar{\mathbf{x}}^\top$, where $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i$ is the sample mean and $\mathbf{1} \in \mathbb{R}^n$ is the all-ones vector.

Algorithm: PCA via SVD
1. Center the data: Compute $\tilde{X} = X - \mathbf{1}\bar{\mathbf{x}}^\top$, where $\mathbf{1} \in \mathbb{R}^n$ is the all-ones vector.
2. Compute SVD: $\tilde{X} = U\Sigma V^\top$, where $U \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{n \times d}$, $V \in \mathbb{R}^{d \times d}$.
3. Extract components: The first $k$ columns of $V$ are the principal components $\mathbf{v}_1, \ldots, \mathbf{v}_k$.
4. Project data: The low-dimensional representation is $Z = \tilde{X} V_k \in \mathbb{R}^{n \times k}$, where $V_k$ contains the first $k$ columns of $V$.
5. Variance explained: The variance captured by the $i$-th component is $\sigma_i^2 / n$, and the cumulative variance explained by $k$ components is $\sum_{i=1}^k \sigma_i^2 / \sum_{i=1}^r \sigma_i^2$.
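The five steps above can be sketched as follows. This is a minimal illustration on synthetic data, using the thin SVD rather than the full one and the $1/n$ variance convention of step 5; the function name `pca_svd` is an assumption for this example.

```python
import numpy as np

def pca_svd(X, k):
    """PCA via thin SVD of the centered data (steps 1-5, 1/n variance convention)."""
    x_bar = X.mean(axis=0)
    X_c = X - x_bar                                   # step 1: center
    U, s, Vt = np.linalg.svd(X_c, full_matrices=False)  # step 2: SVD
    V_k = Vt[:k].T                                    # step 3: top-k principal components
    Z = X_c @ V_k                                     # step 4: project
    var_per_component = s ** 2 / X.shape[0]           # step 5: sigma_i^2 / n
    cumulative_ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return Z, V_k, var_per_component, cumulative_ratio

rng = np.random.default_rng(0)
# Synthetic data whose columns have very different scales.
X = rng.standard_normal((500, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
Z, V_k, var_per_component, cumulative_ratio = pca_svd(X, k=2)
```

The variance of each column of `Z` matches the corresponding $\sigma_i^2/n$, which is a convenient self-check when wiring PCA into a pipeline.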

Concrete ML Example: MNIST Digit Classification
The MNIST dataset contains 60,000 training images of handwritten digits, each $28 \times 28 = 784$ pixels. Applying PCA:
- Compute SVD of centered $X \in \mathbb{R}^{60000 \times 784}$.
- Singular value decay: $\sigma_1 \approx 4200$, $\sigma_{50} \approx 400$, $\sigma_{100} \approx 200$, $\sigma_{500} \approx 50$.
- Cumulative variance: 50 components explain 80%, 100 explain 91%, 200 explain 97%.
- Visualize components: $\mathbf{v}_1$ (reshaped to $28 \times 28$) shows the dominant pattern of deviation from the mean digit, while later components such as $\mathbf{v}_2$ and $\mathbf{v}_3$ are often read as capturing stroke thickness and loop size (distinguishing 0, 6, 8), though such readings are informal.
- Classification: A logistic regression trained on the 50-dimensional projected data $Z \in \mathbb{R}^{60000 \times 50}$ typically lands within a few percentage points of the same model trained on all 784 pixels (which reaches roughly 92% test accuracy), so PCA reduces dimensionality by about 16× at a small accuracy cost. Exact figures depend on preprocessing and regularization.
- Speedup: Training on 50D data is roughly 10× faster than on 784D (fewer parameters per class), and overfitting is reduced (lower VC dimension).

Why PCA Works: Variance as Information
PCA assumes that variance equals information: directions with high variance contain signal, while low-variance directions are noise. This is valid when:
- Data is approximately Gaussian: Covariance ellipsoid’s principal axes align with maximum-variance directions.
- Linear subspace structure: Data lies near a low-dimensional affine subspace (e.g., images of faces under varying lighting have ~50D intrinsic dimensionality despite 10,000D ambient space).
- Noise is isotropic: Random noise adds small variance equally in all directions, filling the subspace orthogonal to the signal.

Failure Mode: Nonlinear Structure
PCA fails when data lies on a nonlinear manifold. For example, the Swiss roll dataset (a 2D sheet rolled into 3D space) has intrinsic dimension 2, but PCA needs all 3 dimensions to represent it faithfully, because the roll’s curvature violates linearity. Solutions: Kernel PCA (implicitly map data to high-dimensional feature space where structure is linear), t-SNE (nonlinear embedding for visualization), autoencoders (neural networks learn nonlinear PCA-like compression).

Model Compression and Pruning

Modern deep neural networks are overparameterized: ResNet-50 has 25M parameters, GPT-3 has 175B. Model compression reduces parameter count without sacrificing accuracy, enabling deployment on resource-constrained devices (phones, IoT). Low-rank factorization via SVD is a primary compression technique.

Algorithm: Low-Rank Layer Compression
Consider a fully connected layer $\mathbf{h} = W\mathbf{x} + \mathbf{b}$, where $W \in \mathbb{R}^{m \times n}$ (e.g., $m=1000$ output neurons, $n=10000$ input neurons, so 10M parameters).
1. Compute SVD: $W = U\Sigma V^\top$, where $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$.
2. Truncate: Retain top $k$ singular values: $W_k = U_k \Sigma_k V_k^\top$, where $U_k \in \mathbb{R}^{m \times k}$, $\Sigma_k \in \mathbb{R}^{k \times k}$, $V_k \in \mathbb{R}^{n \times k}$.
3. Factorize: Replace $W$ with two layers: $W_k = (U_k\Sigma_k^{1/2})(V_k\Sigma_k^{1/2})^\top = W_1 W_2$, where $W_1 \in \mathbb{R}^{m \times k}$, $W_2 \in \mathbb{R}^{k \times n}$.
4. New layer: $\mathbf{h} = W_1(W_2\mathbf{x}) + \mathbf{b}$ (two matmuls instead of one).
5. Parameter count: Original $mn + m$, compressed $k(m+n) + m$. For $m=1000$, $n=10000$, $k=50$: 10M → 550k (18× reduction).
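The factorization steps above can be sketched as follows. The example assumes a synthetic weight matrix of exact rank $k$, so that truncation is lossless and the two forward passes agree; for a real layer the outputs would differ by the truncation error. The function name `compress_layer` is illustrative.

```python
import numpy as np

def compress_layer(W, k):
    """Split W (m x n) into W1 (m x k) and W2 (k x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:k])
    W1 = U[:, :k] * root           # U_k Sigma_k^{1/2}
    W2 = root[:, None] * Vt[:k]    # Sigma_k^{1/2} V_k^T
    return W1, W2

rng = np.random.default_rng(0)
m, n, k = 100, 400, 10
W = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # exactly rank k
b = rng.standard_normal(m)
x = rng.standard_normal(n)

W1, W2 = compress_layer(W, k)
h_full = W @ x + b                 # original layer
h_low = W1 @ (W2 @ x) + b          # factorized layer: two smaller matmuls

params_full = m * n + m            # weights plus bias
params_low = k * (m + n) + m
```

Splitting $\Sigma_k$ evenly between the two factors is one convention among several; placing all of $\Sigma_k$ in either factor gives the same product.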

Concrete ML Example: VGG-16 Compression
VGG-16’s first fully connected layer has $W \in \mathbb{R}^{4096 \times 25088}$, totaling about 103M parameters (roughly 74% of the network’s 138M).
- Compute SVD: $\sigma_1 \approx 500$, $\sigma_{100} \approx 50$, $\sigma_{1000} \approx 5$.
- Truncate to $k=256$: Approximation error $\|W - W_{256}\|_F / \|W\|_F = 12\%$, but top-5 accuracy drops by only 0.5% (92.7% → 92.2% on ImageNet).
- Parameters: 103M → 7.5M, since $256 \times (4096 + 25088) \approx 7.5 \times 10^6$ (14× reduction).
- Speedup: Inference time reduced by 3× (due to smaller matrix multiplications and better cache locality).

When to Compress:
- After training: Compress a pre-trained model. Requires fine-tuning (few epochs) to recover accuracy.
- During training: Add nuclear norm regularization $\|W\|_*$ to the loss, encouraging low rank from the start.
- Architecture search: Train models with factorized layers from scratch (e.g., MobileNets use depthwise separable convolutions, a structured low-rank factorization).

Trade-offs:
- Accuracy vs. compression: Smaller $k$ → higher compression, lower accuracy. Cross-validate to find optimal $k$.
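- Latency vs. throughput: Two matmuls ($W_1, W_2$) cost $k(m+n)$ multiply-adds per input versus $mn$ for one ($W$), so FLOPs drop whenever $k < mn/(m+n)$; however, two smaller kernels may utilize a GPU less efficiently and add kernel launch overhead, so measured latency can improve less than the FLOP count suggests.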
- Memory vs. compute: Compression reduces memory (fewer parameters to store), crucial for edge devices, even if compute savings are modest.

Latent Structure in Data

Many machine learning problems involve high-dimensional observations generated by low-dimensional latent factors. SVD reveals this latent structure by factorizing data into a small number of components, which capture the structure even when individual singular vectors are hard to interpret.

Example 1: Collaborative Filtering (Recommender Systems)
Netflix has $m = 500{,}000$ users and $n = 20{,}000$ movies, yielding a rating matrix $R \in \mathbb{R}^{500k \times 20k}$ with 10 billion entries, of which only about 0.5% are observed (users rate ~100 movies on average). Assume $R$ has low rank $k \approx 50$, because:
- User factors: Demographics (age, gender), preferences (genre, actor), viewing context (time of day).
- Movie factors: Genre, cast, director, release year, critical reception.
- Interaction: Rating $R_{ij} \approx \mathbf{u}_i^\top \mathbf{v}_j$, where $\mathbf{u}_i \in \mathbb{R}^k$ is user $i$’s latent profile and $\mathbf{v}_j \in \mathbb{R}^k$ is movie $j$’s latent profile.

Matrix Factorization Algorithm:
1. Initialize $U \in \mathbb{R}^{m \times k}$, $V \in \mathbb{R}^{n \times k}$ randomly.
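2. For observed entries $(i,j) \in \Omega$, minimize $\sum_{(i,j) \in \Omega} (R_{ij} - \mathbf{u}_i^\top \mathbf{v}_j)^2 + \lambda\left(\sum_i \|\mathbf{u}_i\|^2 + \sum_j \|\mathbf{v}_j\|^2\right)$ via alternating least squares (ALS):
- Fix $V$, solve for $U$: $\mathbf{u}_i = (V_{\Omega_i}^\top V_{\Omega_i} + \lambda I)^{-1} V_{\Omega_i}^\top \mathbf{r}_i$, where $\mathbf{r}_i$ is the vector of user $i$’s observed ratings and $V_{\Omega_i}$ stacks the rows $\mathbf{v}_j^\top$ for the movies user $i$ rated.
- Fix $U$, solve for $V$: $\mathbf{v}_j = (U_{\Omega_j}^\top U_{\Omega_j} + \lambda I)^{-1} U_{\Omega_j}^\top \mathbf{r}_j$, where $\mathbf{r}_j$ is the vector of movie $j$’s observed ratings and $U_{\Omega_j}$ stacks the rows $\mathbf{u}_i^\top$ for the users who rated movie $j$.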
3. Iterate until convergence (typically 50-100 iterations).
4. Predict unobserved ratings: $\hat{R}_{ij} = \mathbf{u}_i^\top \mathbf{v}_j$.
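A compact sketch of the ALS loop on observed entries only follows. It assumes a small synthetic rank-2 rating matrix with a random observation mask and an unweighted ridge penalty on each factor row; the function name `als`, the initialization scale, and the hyperparameters are illustrative choices rather than tuned values.

```python
import numpy as np

def als(R, mask, k, lam=0.1, n_iters=20, seed=0):
    """Alternating least squares using only entries where mask is True."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.1 * rng.standard_normal((m, k))     # step 1: random initialization
    V = 0.1 * rng.standard_normal((n, k))
    I = np.eye(k)
    for _ in range(n_iters):                  # step 3: iterate
        for i in range(m):                    # fix V, solve a ridge problem per user
            obs = mask[i]
            if obs.any():
                Vo = V[obs]                   # rows of V for movies user i rated
                U[i] = np.linalg.solve(Vo.T @ Vo + lam * I, Vo.T @ R[i, obs])
        for j in range(n):                    # fix U, solve a ridge problem per movie
            obs = mask[:, j]
            if obs.any():
                Uo = U[obs]                   # rows of U for users who rated movie j
                V[j] = np.linalg.solve(Uo.T @ Uo + lam * I, Uo.T @ R[obs, j])
    return U, V

rng = np.random.default_rng(1)
R_true = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))  # rank-2 ground truth
mask = rng.random(R_true.shape) < 0.6                                 # ~60% observed
U, V = als(R_true, mask, k=2, lam=1e-3, n_iters=50)
R_hat = U @ V.T                                                       # step 4: predict
rmse_obs = np.sqrt(np.mean((R_hat - R_true)[mask] ** 2))
```

Each inner solve is a $k \times k$ linear system, which is why ALS parallelizes well across users and movies; production systems typically add bias terms and tune $\lambda$ and $k$ by validation.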

Performance: On the Netflix Prize dataset, $k=50$ achieves RMSE ≈ 0.89 (vs. roughly 0.95 for Netflix’s Cinematch baseline), reducing error by about 6%. Increasing to $k=200$ improves to RMSE ≈ 0.87 (another 2% gain), but requires 4× more parameters.

Example 2: Latent Semantic Analysis (NLP)
Document-term matrix $A \in \mathbb{R}^{m \times n}$ (10k documents, 50k words) is sparse (each document uses ~500 unique words, 1% density) and high-rank (~5000 algebraically), but has low effective rank (~300) due to synonymy (different words convey the same meaning: “car,” “auto,” “vehicle”) and polysemy (same word has multiple meanings: “bank” as financial institution or river bank). SVD discovers latent topics:
- Truncate to $k=300$: $A_k = U_k \Sigma_k V_k^\top$.
- Rows of $U_k \Sigma_k$ are document embeddings in topic space.
- Rows of $V_k \Sigma_k$ are word embeddings in topic space.
- Cosine similarity in topic space ($\frac{\mathbf{d}_i^\top \mathbf{d}_j}{\|\mathbf{d}_i\|\|\mathbf{d}_j\|}$, where $\mathbf{d}_i$ is row $i$ of $U_k\Sigma_k$) measures document similarity without requiring exact word matches.

Query expansion: For query “machine learning algorithms,” compute query vector $\mathbf{q} = \text{term-frequencies}$, project into topic space $\mathbf{q}_k = \Sigma_k^{-1} V_k^\top \mathbf{q}$, retrieve documents with highest $\mathbf{u}_i^\top \mathbf{q}_k$. This retrieves documents about “neural networks” and “deep learning” even if the query doesn’t mention them (latent semantic similarity).
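The query projection can be illustrated on a toy corpus. The five-term vocabulary and count matrix below are invented for this sketch; ranking uses cosine similarity between the folded-in query and the rows of $U_k$, a common normalization of the inner-product score.

```python
import numpy as np

# Toy document-term counts (rows = documents, columns = terms); data is made up.
terms = ["car", "auto", "engine", "bank", "loan"]
A = np.array([
    [2, 0, 1, 0, 0],   # doc 0: uses "car"
    [0, 2, 1, 0, 0],   # doc 1: uses "auto" but never "car"
    [1, 1, 2, 0, 0],   # doc 2: vehicles
    [0, 0, 0, 2, 1],   # doc 3: finance
    [0, 0, 0, 1, 2],   # doc 4: finance
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k].T

q = np.array([1, 0, 0, 0, 0], dtype=float)   # term-frequency vector for the query "car"
q_k = (V_k.T @ q) / s_k                      # fold-in: Sigma_k^{-1} V_k^T q

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([cosine(U_k[i], q_k) for i in range(A.shape[0])])
```

Document 1 never contains the word "car", yet it scores as highly as the documents that do, because "car" and "auto" load on the same latent topic; the finance documents score near zero.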

Implicit Regularization via Rank

A surprising empirical finding in deep learning is that overparameterized models generalize well despite overfitting the training set. For example, a ResNet-110 with 1.7M parameters can achieve 0% training error on CIFAR-10 (50k images) while maintaining 92% test accuracy—seemingly violating classical statistical learning theory, which predicts catastrophic overfitting when $\text{parameters} \gg \text{samples}$.

Recent theory (Gunasekar et al., 2017; Arora et al., 2019) suggests that gradient descent on factorized models implicitly regularizes toward low-rank solutions, even without explicit rank penalties. Three ingredients are involved:
1. Initialization: Small random weights $W_0$ with i.i.d. $\mathcal{N}(0, \sigma^2/n)$ entries are full rank with probability one, and their singular values are spread over an interval of width $O(\sigma)$ rather than dominated by a few directions (for a square matrix they fill roughly $[0, 2\sigma]$). The initial rank is full, and the effective rank $(\sum \sigma_i)^2 / \sum \sigma_i^2$ is a constant fraction of $n$.
2. Gradient flow: In the small-step-size limit, the update $W_{t+1} = W_t - \eta \nabla \mathcal{L}(W_t)$ follows gradient flow. For deep linear networks, writing the end-to-end matrix as $W(t) = \sum_{i=1}^r \sigma_i(t) \mathbf{u}_i \mathbf{v}_i^\top$, the growth rate of each $\sigma_i(t)$ increases with its current magnitude, so large singular values grow quickly while small ones stay close to their small initial values. The result is a bias toward low effective rank.
3. Nuclear norm bias: For the factorized model $W = UU^\top$ trained by gradient descent on squared loss from small initialization, Gunasekar et al. (2017) prove under commuting measurements, and conjecture more generally, convergence to the minimum nuclear norm solution $\min \|W\|_*$ subject to the linear measurement constraints. This is the matrix analog of $\ell^1$ minimization for vectors: it minimizes $\sum_i \sigma_i$, encouraging low rank. For an unfactorized linear model, by contrast, gradient descent from zero converges to the minimum Frobenius-norm solution, which carries no low-rank bias.

Concrete Example: Matrix Sensing
Problem: Recover rank-$r$ matrix $M \in \mathbb{R}^{n \times n}$ from $m$ linear measurements $y_i = \text{tr}(A_i^\top M)$, where $A_i$ are random Gaussian matrices.
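- Classical result (compressed sensing): Nuclear norm minimization recovers $M$ exactly with high probability from $m = O(nr)$ Gaussian measurements.
- Implicit regularization: For positive semidefinite $M$, gradient descent on the factorized objective $\min_U \sum_i (y_i - \text{tr}(A_i^\top UU^\top))^2$ with a full-size factor $U \in \mathbb{R}^{n \times n}$ and small initialization has been shown to converge close to $M$ from a number of measurements on the order of $nr^2$ (up to logarithmic factors), with no explicit rank constraint. The algorithm finds the low-rank solution because the factorized gradient dynamics are biased toward low nuclear norm. Plain gradient descent on $W$ itself from $W_0 = 0$ does not share this property: it converges to the minimum Frobenius-norm interpolant.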

Implications for Deep Learning:
- Neural network training: Even without dropout or weight decay, SGD tends to find solutions with low effective rank in weight matrices, which is believed to help limit overfitting. This is especially pronounced in later layers (closer to output), which learn low-rank linear maps from intermediate representations to predictions.
- Transfer learning: Pre-trained models have low-rank structure in task-specific layers, making fine-tuning efficient (only need to update a low-dimensional subspace).
- Adversarial robustness: Low-rank structure in neural networks can be exploited by adversaries (small perturbations in low-rank subspace are amplified), but also defended against (project perturbations onto principal subspace during training).

Measurement: Effective Rank
For a matrix $W$ with singular values $\sigma_1 \geq \cdots \geq \sigma_r > 0$, define: \[ r_{\text{eff}}(W) = \frac{\left(\sum_{i=1}^r \sigma_i\right)^2}{\sum_{i=1}^r \sigma_i^2} \in [1, r]. \]
- If $W$ has uniform singular values ($\sigma_i = c$), then $r_{\text{eff}} = r$ (full rank effectively).
- If $\sigma_1 \gg \sigma_2 \approx \cdots \approx \sigma_r \approx 0$, then $r_{\text{eff}} \approx 1$ (nearly rank-1).
- Trained neural networks: values on the order of $r_{\text{eff}} \approx 0.1 \times \min(m,n)$ for weight matrices $W \in \mathbb{R}^{m \times n}$ would indicate aggressive implicit compression; the actual ratio varies with architecture, layer, and training setup.
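The effective rank is straightforward to compute from the singular values. The sketch below assumes a nonzero input matrix and drops numerically zero singular values using a relative tolerance, which is an illustrative choice.

```python
import numpy as np

def effective_rank(W, tol=1e-12):
    """(sum sigma_i)^2 / sum sigma_i^2 over nonzero singular values; W must be nonzero."""
    s = np.linalg.svd(W, compute_uv=False)
    s = s[s > tol * s[0]]                 # keep singular values above a relative tolerance
    return (s.sum() ** 2) / np.sum(s ** 2)

r_uniform = effective_rank(np.eye(4))                       # all singular values equal
r_spiky = effective_rank(np.diag([100.0, 0.1, 0.1, 0.1]))   # one dominant direction
```

The identity matrix attains the maximum $r_{\text{eff}} = r = 4$, while the second matrix is algebraically rank 4 but has effective rank only slightly above 1.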

This completes the ML Connection section. At this point, we have established SVD’s role in PCA, model compression, latent structure discovery, and implicit regularization. The chapter now transitions to formal definitions.

Notation Summary

Matrix and Vector Operations:
- $A \in \mathbb{R}^{m \times n}$: Real-valued matrix with $m$ rows and $n$ columns
- $A^\top$: Transpose of $A$
- $A^+$: Moore-Penrose pseudoinverse of $A$
- $A^{-1}$: Matrix inverse (when $A$ is square and invertible)
- $\|A\|_F = \sqrt{\sum_{ij} A_{ij}^2}$: Frobenius norm (Euclidean norm of the vectorized matrix)
- $\|A\|_2 = \max_{\|x\|=1} \|Ax\|_2$: Spectral norm (largest singular value)
- $\|A\|_* = \sum_i \sigma_i(A)$: Nuclear norm (sum of singular values)
- $\|A\|_1 = \sum_i \|A(:,i)\|_1$: Entrywise $\ell^1$ norm (sum of the absolute values of all entries)
- $\text{trace}(A) = \sum_{i} A_{ii}$: Trace (sum of diagonal)
- $\text{rank}(A)$: Rank (dimension of column space)
- $\text{null}(A)$: Null space (kernel) of $A$

Spectral Decompositions:
- $A = U \Sigma V^\top$: Singular value decomposition (SVD), written here in compact form; the full SVD pads $U$ to $m \times m$, $V$ to $n \times n$, and $\Sigma$ to $m \times n$
- $U \in \mathbb{R}^{m \times r}$: Left singular vectors (orthonormal columns)
- $V \in \mathbb{R}^{n \times r}$: Right singular vectors (orthonormal columns)
- $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_r)$: Singular values ($\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$)
- $A = Q \Lambda Q^\top$: Eigendecomposition of symmetric $A$
- $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$: Eigenvalues
- $U_k, \Sigma_k, V_k$: Truncated SVD (top-$k$ singular vectors/values)
- $A_k = U_k \Sigma_k V_k^\top$: Rank-$k$ approximation

Special Matrices:
- $I_n \in \mathbb{R}^{n \times n}$: Identity matrix
- $\mathbf{0}_{m \times n}$: Zero matrix
- $\mathbf{1}_n$: Vector of all ones
- $\mathbf{e}_i$: Standard basis vector ($i$-th entry is 1, rest are 0)
- $P$: Orthogonal projection matrix (satisfies $P^2 = P$, $P^\top = P$)
- $Q, U, V$: Matrices with orthonormal columns ($Q^\top Q = I$); when square, they are orthogonal with determinant $\pm 1$

Inequalities and Bounds:
- $\|A + B\|_* \leq \|A\|_* + \|B\|_*$: Triangle inequality for nuclear norm
- $\|AB\|_2 \leq \|A\|_2 \|B\|_2$: Submultiplicativity of spectral norm
- $\|AB\|_F \leq \|A\|_2 \|B\|_F$: Frobenius norm bound
- $\|A\|_2 \leq \|A\|_F \leq \sqrt{\text{rank}(A)} \cdot \|A\|_2$: Spectral vs. Frobenius
- $\sigma_{k+1}(A) = \min_{\text{rank}(X) \leq k} \|A - X\|_2$: Spectral error characterization
- $\sum_{i>k} \sigma_i^2 = \min_{\text{rank}(X) \leq k} \|A - X\|_F^2$: Frobenius error characterization

Probability and Statistics:
- $\mathbb{E}[\cdot]$: Expectation
- $\mathbb{P}(\cdot)$: Probability
- $N(\mu, \Sigma)$: Multivariate Gaussian with mean $\mu$ and covariance $\Sigma$
- $\approx_{\text{a.s.}}$: Almost sure convergence
- $\to_d$: Convergence in distribution
- $\kappa(A) = \sigma_1(A) / \sigma_r(A)$: Condition number

Operators and Functions:
- $\text{vec}(A)$: Vectorization (stack columns of $A$ into vector)
- $\text{Kronecker}(A, B) = A \otimes B$: Kronecker product
- $\text{prox}_f(x) = \arg\min_y \tfrac{1}{2}\|x - y\|^2 + f(y)$: Proximal operator of $f$
- $\partial f(x)$: Subdifferential (generalized gradient for non-smooth $f$)
- $\mathcal{N}(f) = \{x : f(x) = \min_y f(y)\}$: Set of optimal solutions (minimizers of $f$); not to be confused with the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$

Supplementary Proofs

Proof of Eckart–Young Theorem (Extended).

Statement: Among all matrices $X$ of rank at most $k$, the truncated SVD $A_k = U_k \Sigma_k V_k^\top$ minimizes the Frobenius norm reconstruction error: \[ A_k \in \arg\min_{\text{rank}(X) \leq k} \|A - X\|_F^2, \qquad \min_{\text{rank}(X) \leq k} \|A - X\|_F^2 = \sum_{i=k+1}^r \sigma_i^2. \] When $\sigma_k > \sigma_{k+1}$, the minimizer is unique. When $\sigma_k = \sigma_{k+1}$, infinitely many minimizers exist: each choice of orthonormal basis for the singular subspace belonging to the tied value produces a different truncation with the same error.

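Proof: Let $A = U \Sigma V^\top$ be a full SVD with $r = \text{rank}(A)$, so that $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal. For a matrix $X$ of rank at most $k$, write: \[ \|A - X\|_F^2 = \|U\Sigma V^\top - X\|_F^2. \] By orthogonal invariance of the Frobenius norm, $\|B\|_F = \|U^\top B V\|_F$. Thus: \[ \|A - X\|_F^2 = \|U^\top(A - X)V\|_F^2 = \|\Sigma - U^\top X V\|_F^2. \] Let $Y = U^\top X V \in \mathbb{R}^{m \times n}$. Since $\text{rank}(X) \leq k$, we have $\text{rank}(Y) \leq k$. Now: \[ \|A - X\|_F^2 = \|\Sigma - Y\|_F^2 = \sum_{ij} (\Sigma_{ij} - Y_{ij})^2 = \sum_i (\sigma_i - Y_{ii})^2 + \sum_{i \neq j} Y_{ij}^2. \] This expression suggests a diagonal candidate with at most $k$ nonzero entries. Because zeroing the off-diagonal entries of an arbitrary $Y$ can change its rank, optimality of the diagonal candidate needs a separate lower bound. Weyl's inequality for singular values, $\sigma_{i+j-1}(B + C) \leq \sigma_i(B) + \sigma_j(C)$, applied with $B = A - X$, $C = X$, and $j = k+1$ (so that $\sigma_{k+1}(X) = 0$) gives $\sigma_{i+k}(A) \leq \sigma_i(A - X)$ for every $i \geq 1$. Hence: \[ \|A - X\|_F^2 = \sum_{i} \sigma_i(A - X)^2 \geq \sum_{i} \sigma_{i+k}(A)^2 = \sum_{i>k} \sigma_i^2. \] The diagonal candidate attains this bound:
- For $i \leq k$: set $Y_{ii} = \sigma_i$ (full preservation of top-$k$ singular values)
- For $i > k$: set $Y_{ii} = 0$ (discard tail singular values)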

This gives $Y_* = \text{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)$, so: \[ X_* = U Y_* V^\top = U_k \Sigma_k V_k^\top = A_k, \] with error: \[ \|A - A_k\|_F^2 = \sum_{i>k} \sigma_i^2. \] Uniqueness: When $\sigma_k > \sigma_{k+1}$, the top-$k$ singular subspace is isolated (eigenspace for eigenvalue block $\{\sigma_1^2, \ldots, \sigma_k^2\}$ of $AA^\top$ is unique). Any rank-$k$ minimizer must preserve this subspace, forcing $Y_{ii} = \sigma_i$ for $i \leq k$. When $\sigma_k = \sigma_{k+1}$, the eigenspace is degenerate (higher-dimensional), allowing rotations within it. $\square$

Proof of Davis–Kahan Bound (Sketch).

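Statement: For symmetric matrices $A$ and $B = A + E$ with eigenvalues ordered $\lambda_1 \geq \lambda_2 \geq \cdots$, the distance between top-$k$ eigenspaces satisfies: \[ \sin \angle(U_k(A), U_k(B)) \leq \frac{2\|E\|_2}{\lambda_k(A) - \lambda_{k+1}(A) - \|E\|_2} \] provided the spectral gap is large enough ($\lambda_k(A) - \lambda_{k+1}(A) > 2\|E\|_2$). For positive semidefinite $A$ the eigenvalues coincide with the singular values, so the bound can equally be written with $\sigma_k$ and $\sigma_{k+1}$.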

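Sketch: Use resolvent perturbation theory. For $z$ outside the spectra of $A$ and $B$, the resolvents $R_A(z) = (A - zI)^{-1}$ and $R_B(z) = (B - zI)^{-1}$ satisfy the identity $R_A(z) - R_B(z) = R_A(z)\, E\, R_B(z)$, so: \[ \|R_A(z) - R_B(z)\| \leq \|R_A(z)\| \, \|E\| \, \|R_B(z)\| = O(\|E\|), \] with the constant controlled by the distance from $z$ to the spectra, and hence by the spectral gap. The projection onto the top-$k$ eigenspace is the contour integral $P = -\frac{1}{2\pi i}\oint_\Gamma R(z)\,dz$ over a contour $\Gamma$ enclosing the top-$k$ eigenvalues and no others. Perturbation in the resolvent therefore translates to perturbation in the projection, bounded in terms of $\|E\|$ and the spectral gap. $\square$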

Proof of Convex Envelope of Rank (via Nuclear Norm).

Statement: The nuclear norm $\|A\|_* = \sum_i \sigma_i(A)$ is the convex envelope of rank on the spectral-norm ball $\mathcal{B} = \{A : \|A\|_2 \leq 1\}$: \[ \|A\|_* = \max\{ g(A) : g \text{ convex}, g(X) \leq \text{rank}(X) \text{ for all } X \in \mathcal{B} \}. \]

Proof: (1) Nuclear norm is convex: $\|A\|_* = \sum_i \sigma_i(A)$ is the $\ell^1$ norm of the singular values and is itself a matrix norm, so it satisfies the triangle inequality $\|A + B\|_* \leq \|A\|_* + \|B\|_*$ and absolute homogeneity, which together imply convexity. (2) Lower bound for rank: $\sum_i \sigma_i \leq \sum_i \mathbf{1}_{\sigma_i > 0} = \text{rank}(A)$ (since each $\sigma_i \leq 1$ on $\mathcal{B}$). (3) Tightness: By unitary invariance, restrict to diagonal $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_p)$. For diagonal matrices, rank is the number of nonzero entries. The convex envelope of the counting function $c(\sigma) = |\{i : \sigma_i > 0\}|$ on the cube $[0,1]^p$ is the $\ell^1$ norm $\sum_i \sigma_i$ (coordinatewise, the tightest convex underestimate on $[0,1]$ of the step function $t \mapsto \mathbf{1}_{t > 0}$ is the identity $t \mapsto t$). By unitary invariance, this extends to all matrices. $\square$
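
Step (2) of the proof is easy to verify numerically. A minimal sketch, assuming NumPy; the rank-3 test matrix and its dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
# Illustrative rank-3 matrix, rescaled onto the spectral-norm unit ball.
A = rng.standard_normal((7, 3)) @ rng.standard_normal((3, 6))
A /= np.linalg.norm(A, 2)

s = np.linalg.svd(A, compute_uv=False)
nuclear = s.sum()                  # nuclear norm = sum of singular values
rank = np.linalg.matrix_rank(A)

# Equality case: a partial isometry has all nonzero singular values equal to 1.
U, _, Vt = np.linalg.svd(A, full_matrices=False)
P = U[:, :3] @ Vt[:3]
nuclear_P = np.linalg.svd(P, compute_uv=False).sum()
```

On the unit ball the nuclear norm never exceeds the rank, and the two coincide exactly when every nonzero singular value equals 1, as for P above.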

ML Implementation Notes

1. Numerical Stability and Conditioning:

When implementing SVD-based methods, always check the condition number: \[ \kappa(A) = \frac{\sigma_1(A)}{\sigma_r(A)}, \] where $\sigma_r$ is the smallest nonzero singular value. As a rule of thumb in double precision, $\kappa > 10^8$ signals an ill-conditioned matrix: roughly $\log_{10} \kappa$ of the approximately 16 available decimal digits are lost to rounding. Remedies: regularization (Tikhonov/ridge, which replaces $A^\top A$ with $A^\top A + \lambda I$), truncated SVD (discard small singular values), or a preconditioned Krylov solver (GMRES, MINRES).
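
A minimal sketch of this check and of the truncated-SVD remedy, assuming NumPy. The helper names cond_from_svd and truncated_lstsq, the nearly collinear test matrix, and the cutoff 1e-8 are illustrative assumptions.

```python
import numpy as np

def cond_from_svd(A, rtol=None):
    """Condition number sigma_1 / sigma_r, with r the numerical rank."""
    s = np.linalg.svd(A, compute_uv=False)
    if rtol is None:
        rtol = max(A.shape) * np.finfo(A.dtype).eps   # common default tolerance
    r = int(np.sum(s > rtol * s[0]))
    return s[0] / s[r - 1], r

def truncated_lstsq(A, b, rcond):
    """Least-squares solve that discards singular values below rcond * sigma_1."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > rcond * s[0]
    return Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])

# Nearly collinear columns give a badly conditioned matrix.
t = np.linspace(0.0, 1.0, 50)
A = np.column_stack([np.ones_like(t), t, t + 1e-10 * np.sin(t)])
b = 1.0 + 2.0 * t

kappa, r = cond_from_svd(A)
x = truncated_lstsq(A, b, rcond=1e-8)
residual = np.linalg.norm(A @ x - b)
```

Truncation trades a tiny residual for a solution whose size stays bounded; np.linalg.lstsq(A, b, rcond=1e-8) applies the same cutoff internally.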

2. Choosing Number of Components (Rank Selection):

For data matrix $X \in \mathbb{R}^{n \times d}$:
Elbow method: Plot singular values $\sigma_1, \sigma_2, \ldots$ and look for a sharp drop-off (knee). A simple proxy is the largest $k$ with $\sigma_k / \sigma_1 > \epsilon$ (e.g., $\epsilon = 0.01$). Subjective but practical.
Cumulative variance: Retain the smallest $k$ such that $\sum_{i=1}^k \sigma_i^2 / \sum_i \sigma_i^2 \geq 0.95$ (95% variance explained, for centered $X$). A common convention; the threshold itself is a judgment call.
Cross-validation: For supervised tasks, choose $k$ on a held-out validation set and reserve the test set for the final estimate. Objective but expensive.
Statistical criteria: Bayesian model selection (Minka, 2001), parallel analysis (Buja & Eyuboglu, 1992) for determining effective dimensionality.
Information criteria: AIC, BIC (low-rank matrix estimation). Balance fit and complexity.
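
The two spectrum-based rules can be sketched in a few lines, assuming NumPy. The function names rank_by_energy and rank_by_ratio and the example spectrum are illustrative.

```python
import numpy as np

def rank_by_energy(s, threshold=0.95):
    """Smallest k whose leading singular values hold `threshold` of sum(s**2)."""
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = np.searchsorted(energy, threshold) + 1
    return int(min(k, len(s)))     # clamp in case rounding keeps energy[-1] below 1

def rank_by_ratio(s, eps=0.01):
    """Largest k with s[k-1] / s[0] > eps (s sorted in descending order)."""
    return int(np.sum(s / s[0] > eps))

s = np.array([10.0, 5.0, 1.0, 0.05, 0.01])   # illustrative spectrum
k_energy = rank_by_energy(s)   # 2: the first two values hold about 99% of the energy
k_ratio = rank_by_ratio(s)     # 3: sigma_3 / sigma_1 = 0.1 still exceeds 0.01
```

The two rules disagree on this spectrum, which is typical: the energy rule is dominated by the largest values, while the ratio rule looks only at decay relative to $\sigma_1$.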

3. Data Preprocessing:

Always center, and usually standardize, before PCA: \[ X_{\text{standardized}} = (X - \text{mean}(X)) / \text{std}(X), \] with mean and standard deviation taken column-wise.
Centering: Subtract the column-wise mean. Essential: otherwise the first component mostly captures the mean direction.
Scaling: Divide each column by its standard deviation. Critical when features have different units (height in mm vs. income in dollars). Without scaling, large-variance features dominate. When all features share a unit and their relative variances are meaningful, centering alone can be preferable.
Outlier handling: Extreme values inflate variance, skewing components. Use robust methods (median absolute deviation, Huber loss) or pre-filter.
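
The effect of scaling is easy to see on two features with very different units. A minimal sketch, assuming NumPy; the function name standardize and the synthetic height/income data are illustrative.

```python
import numpy as np

def standardize(X, scale=True):
    """Center columns; optionally divide by the sample standard deviation."""
    mu = X.mean(axis=0)
    Xc = X - mu
    if not scale:
        return Xc, mu, None
    sd = Xc.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0          # constant columns stay at zero instead of dividing by 0
    return Xc / sd, mu, sd

rng = np.random.default_rng(3)
# Two independent illustrative features on very different scales (mm vs. dollars).
X = np.column_stack([rng.normal(1700, 80, 200), rng.normal(50_000, 15_000, 200)])

Z, mu, sd = standardize(X)
_, s_raw, Vt_raw = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
_, s_std, Vt_std = np.linalg.svd(Z, full_matrices=False)
```

Without scaling, the first right singular vector Vt_raw[0] points almost entirely along the income column. After standardization both features contribute on an equal footing. In a train/test setting, mu and sd should be computed on the training data and reused on the test data.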

4. Software Implementation Best Practices:

Use existing libraries: NumPy (numpy.linalg.svd), scikit-learn (sklearn.decomposition.PCA), TensorFlow (tf.linalg.svd). They are optimized, tested, and handle edge cases.
Avoid explicit inversion: Never compute $(A^\top A)^{-1}$. Use SVD or QR-based least-squares solvers (numpy.linalg.lstsq, scipy.linalg.lstsq).
Handle rank deficiency: Matrices may have rank $< \min(m, n)$. Pseudoinverse (numpy.linalg.pinv) handles this, but check rank first.
Memory efficiency: For large $n, d$, use partial SVD on sparse matrices (scipy.sparse.linalg.svds) or randomized methods (sklearn.utils.extmath.randomized_svd). Dense SVD costs $O(n d^2)$ when $n \geq d$ and $O(n^2 d)$ otherwise.
Reproducibility: Set random seed for any stochastic steps (randomized SVD, initialization). Document $k$, preprocessing, algorithm choice.
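
A short sketch of the rank-deficiency advice, assuming NumPy. The test matrix with a dependent third column and the tolerance 1e-10 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)                 # fixed seed for reproducibility
B = rng.standard_normal((20, 2))
A = np.column_stack([B, B[:, 0] + B[:, 1]])    # third column is dependent: rank 2
b = rng.standard_normal(20)

# Check the numerical rank before solving.
rank = np.linalg.matrix_rank(A, tol=1e-10)

# SVD-based least squares; returns the minimum-norm solution when A is rank deficient.
x_lstsq, _, rank_lstsq, _ = np.linalg.lstsq(A, b, rcond=1e-10)

# The pseudoinverse gives the same minimum-norm solution.
x_pinv = np.linalg.pinv(A, rcond=1e-10) @ b
```

Both routes agree here, and neither forms $A^\top A$. An explicit tolerance is passed because the computed third singular value is rounding noise rather than an exact zero, so the cutoff decides whether it is treated as zero.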

5. Common Pitfalls:

Pitfall	Consequence	Remedy
Not centering data	First component is mean direction	Always center
Not scaling features	Large-scale features dominate	Standardize each feature
Choosing $k$ without validation	Overfitting or underfitting	Use elbow method or cross-validation
Using $AA^\top$ or $A^\top A$ directly	Squaring condition number, numerical instability	Use SVD or QR decomposition
Comparing PCA results across different libraries/implementations	Results may differ (sign ambiguity of singular vectors, ordering of tied eigenvalues)	Use same library, fix a sign convention, or document differences
Forgetting about curse of dimensionality	High-dimensional random data looks structured	Use dimensionality reduction first, validate on test set
Applying PCA on discrete/categorical data	PCA assumes continuous features	Use MCA (Multiple Correspondence Analysis)
Not handling missing values	Biased or failed computation	Use EM for missing data, or imputation
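
The condition-number pitfall in the table can be demonstrated directly. A minimal sketch, assuming NumPy; the matrix is built with prescribed singular values 1, 1e-2, 1e-4 purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative matrix with prescribed singular values 1, 1e-2, 1e-4.
U, _ = np.linalg.qr(rng.standard_normal((30, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = U @ np.diag([1.0, 1e-2, 1e-4]) @ V.T

kappa_A = np.linalg.cond(A)            # 2-norm condition number, computed from the SVD
kappa_gram = np.linalg.cond(A.T @ A)   # condition number of the Gram matrix
```

Here kappa_A is about 1e4 while kappa_gram is about 1e8: forming $A^\top A$ squares the condition number, so a moderately conditioned problem becomes a badly conditioned one.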

6. Performance Tuning for Large-Scale Data:

Dense SVD: $O(\min(m, n)^2 \max(m, n))$ floating-point operations. Practical limit: ~10,000 × 10,000.
Sparse SVD (Lanczos, ARPACK): $O(k \cdot \text{nnz}(A) \cdot \text{iterations})$. Efficient for sparse $A$ (e.g., adjacency matrices, one-hot encoded features).
Randomized SVD: $O(mnk + (m + n)k^2)$ where $k$ is target rank. Cheapest option when $k \ll \min(m, n)$. Often 10–100× faster than dense SVD for low-rank data.
Sketching/streaming: For data not fitting in memory, use random projections. Trade accuracy for memory: the sketch needs $O((m + n)k)$ space, and the data is read in one or a few passes.
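
The randomized approach can be sketched with a Gaussian range finder followed by a small dense SVD. This is an illustrative NumPy implementation under stated assumptions (the function name randomized_svd_sketch and the defaults for oversampling and power iterations are choices made here), not the code used by any particular library.

```python
import numpy as np

def randomized_svd_sketch(A, k, oversample=10, n_power_iter=2, seed=0):
    """Approximate top-k SVD via a Gaussian range finder (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    ell = min(k + oversample, min(m, n))
    Omega = rng.standard_normal((n, ell))    # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)           # orthonormal basis for the sampled range
    for _ in range(n_power_iter):            # power iterations sharpen spectral decay
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    B = Q.T @ A                              # small (ell x n) projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(6)
A = rng.standard_normal((300, 8)) @ rng.standard_normal((8, 120))   # exactly rank 8
U, s, Vt = randomized_svd_sketch(A, k=8)
s_ref = np.linalg.svd(A, compute_uv=False)[:8]
```

For an exactly low-rank matrix the result matches the dense SVD up to rounding. With slowly decaying spectra, more oversampling or more power iterations are usually needed to reach comparable accuracy.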

7. Interpreting Singular Vectors:

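Right singular vectors $V_k$: For centered data, the directions of maximum variance. These are the principal components (principal axes or loadings), expressed as weights on the original features.
Left singular vectors $U_k$: Unit-norm coordinates of the samples along the principal components. The scores (latent representation) are $U_k \Sigma_k = X V_k$. Use for visualization (scatter plots in 2D/3D), clustering.
Singular values $\sigma_i$: Importance of each component. For centered data with $n$ samples, the variance along component $i$ is $\sigma_i^2 / (n - 1)$. Interpret as “signal strength” in that direction.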
Reconstruction: $\hat{X} = U_k \Sigma_k V_k^\top$ is the rank-$k$ denoised version. Compare to original to assess information loss.
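
These relationships can be verified on a small synthetic data set. A minimal sketch, assuming NumPy; the data matrix and the choice $k = 2$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])   # illustrative data
Xc = X - X.mean(axis=0)                # center before interpreting anything
n = Xc.shape[0]

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
scores = U[:, :k] * s[:k]              # U_k Sigma_k, identical to Xc @ V_k
explained_var = s ** 2 / (n - 1)       # variance along each principal direction
X_hat = scores @ Vt[:k]                # rank-k reconstruction of the centered data
```

The rows of Vt are the principal directions, scores holds the sample coordinates, and the squared reconstruction error of X_hat equals the sum of the discarded $\sigma_i^2$.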

8. Downstream Analysis after PCA:

Classification with PCA features: Reduced features $Z = X V_k$ are inputs to classifier (logistic regression, SVM). Regularization parameter and classifier choice matter.
Clustering: Apply k-means or another method to the scores $Z = U_k \Sigma_k$ (using $U_k$ alone whitens the components, which changes distances). Spectral clustering uses the graph Laplacian instead.
Visualization: Scatter plot $U_k(:, 1)$ vs. $U_k(:, 2)$ for 2D view. Color by class label or cluster. Interpret axes via top components of $V_k$.
Anomaly detection: High reconstruction error $\|x - \hat{x}\|^2$ indicates an outlier. Set the threshold from a percentile of the training errors (e.g., the 99th percentile).
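
The anomaly-detection recipe can be sketched as follows, assuming NumPy. The helper names fit_pca and reconstruction_error, the synthetic data near a 2-D subspace, and the noise level are illustrative assumptions.

```python
import numpy as np

def fit_pca(X_train, k):
    """Return the training mean and the top-k principal directions (rows)."""
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    return mu, Vt[:k]

def reconstruction_error(X, mu, Vk):
    """Squared distance of each row from the k-dimensional principal subspace."""
    Xc = X - mu
    residual = Xc - (Xc @ Vk.T) @ Vk
    return np.sum(residual ** 2, axis=1)

rng = np.random.default_rng(8)
# Illustrative training data near a 2-D subspace of R^5, plus small noise.
W = rng.standard_normal((2, 5))
X_train = rng.standard_normal((500, 2)) @ W + 0.05 * rng.standard_normal((500, 5))

mu, Vk = fit_pca(X_train, k=2)
threshold = np.percentile(reconstruction_error(X_train, mu, Vk), 99)

x_inlier = rng.standard_normal((1, 2)) @ W       # lies on the true subspace
x_outlier = 5.0 * rng.standard_normal((1, 5))    # generic point, far off the subspace
flags = reconstruction_error(np.vstack([x_inlier, x_outlier]), mu, Vk) > threshold
```

By construction about 1% of the training points exceed the threshold, so the percentile directly sets the false-alarm rate on data resembling the training set.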

9. Validation and Benchmarking:

Cross-validation: k-fold CV or hold-out test set. Measure prediction/reconstruction error on unseen data (guards against overfitting $k$).
Comparison to baselines: Compare PCA to no dimensionality reduction (full features) and other methods (factor analysis, autoencoders). Ensure PCA is appropriate.
Timing: Profile implementation (matrix-vector products, eigenvalue solver, etc.). Identify bottlenecks. Use tools: Python cProfile, Julia @time, etc.
Accuracy: Compare eigenvectors/singular vectors to reference (NumPy SVD) using canonical angles or subspace distance.

10. Integration with Modern ML Frameworks:

PyTorch/TensorFlow: Use torch.linalg.svd (the older torch.svd is deprecated) or tf.linalg.svd on GPU for speedup. Both support batched SVD for multiple matrices in parallel.
Automatic differentiation: SVD gradients are available (PyTorch, TensorFlow, JAX), but they become numerically unstable when singular values are repeated or nearly equal. Use for end-to-end learning (e.g., SVD as a layer).
Sparse tensors: For sparse data, use framework’s sparse support (torch.sparse, TensorFlow SparseTensor).
Distributed computing: For terabyte-scale data, use Spark MLlib, Dask, or distributed solvers (e.g., DIMSUM for sampled computation of $A^\top A$ and column similarities).
