Example number
42
Example slug
example_42_svd_and_conditioning_why_normal_equations_are_risky
Background

The SVD $X = U\Sigma V^\top$ gives $X^\top X = V\Sigma^2 V^\top$, so the eigenvalues of $X^\top X$ are the squared singular values of $X$. The condition number $\kappa(A) = \|A\| \|A^{-1}\|$ equals $\sigma_{\max}/\sigma_{\min}$ for invertible $A$. Hence $\kappa(X^\top X) = \sigma_{\max}^2/\sigma_{\min}^2 = \kappa(X)^2$. Numerically, forming $X^\top X$ loses precision: if $\kappa(X) \sim 10^8$, then $\kappa(X^\top X) \sim 10^{16}$, exceeding double-precision range ($\sim 10^{16}$ digits). SVD-based lstsq operates on $X$ directly, avoiding this squaring. Understanding this justifies why explicit $(X^\top X)^{-1}$ is discouraged in favor of QR or SVD.

Purpose

Show why forming the normal equations $X^\top X w = X^\top y$ squares the condition number of $X$, making least-squares solvers that use $X^\top X$ numerically unstable for ill-conditioned data. Demonstrate via SVD that $\kappa(X^\top X) = \kappa(X)^2$, and build intuition that SVD or QR-based solvers avoid this squaring, offering stable solutions even when $X$ has near-collinear columns. Emphasize when normal equations fail and why modern ML libraries default to SVD/QR.

Problem

Compute cond(X) from singular values and compare to cond(X^T X). Verify cond(X^T X) ≈ cond(X)^2.

Solution (Math)

If $X=U\Sigma V^T$, then $X^TX = V\Sigma^2 V^T$. So $\kappa(X)=\sigma_{\max}/\sigma_{\min}$ and $\kappa(X^TX)=\sigma_{\max}^2/\sigma_{\min}^2=\kappa(X)^2$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_linreg

X, y, _ = toy_linreg(n=50, d=4, seed=2)
X = np.c_[X, X[:, [0]] + 1e-3 * X[:, [1]]]

S = np.linalg.svd(X, compute_uv=False)
cond_X = S.max() / S.min()
cond_XTX = np.linalg.cond(X.T@X)

print("cond(X):", cond_X)
print("cond(X^T X):", cond_XTX)
print("cond(X)^2:", cond_X**2)
Code Introduction

This snippet builds an ill-conditioned design matrix by appending a nearly collinear column (X[:,0] + 1e-3*X[:,1]) to a 4-feature dataset. It then compares three quantities: the spectral condition number of $X$ from its singular values (cond_X), the condition number of the normal-equation matrix $X^\top X$ (cond_XTX), and the squared condition number cond(X)^2. The printout illustrates that forming $X^\top X$ squares the condition number, making the normal equations far more ill-conditioned than the original system—why SVD/QR solvers are preferred over explicit normal-equation solves.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Generate base data: X, y, _ = toy_linreg(n=50, d=4, seed=2) produces a $50\times 4$ matrix and labels.
  2. Add near-collinear column: X = np.c_[X, X[:,[0]] + 1e-3*X[:,[1]]] appends a 5th column that is almost a linear combination of the first two, forcing a small singular value.
  3. Compute singular values: S = np.linalg.svd(X, compute_uv=False) returns the vector of singular values in descending order.
  4. Condition number of $X$: cond_X = S.max() / S.min() computes $\sigma_{\max}/\sigma_{\min}$.
  5. Condition number of $X^\top X$: cond_XTX = np.linalg.cond(X.T@X) uses a library routine (internally SVD or eigendecomp).
  6. Verify squaring: print cond(X), cond(X^T X), and cond(X)^2; the last two should match closely.
  7. Diagnostics: if $\kappa(X) > 10^8$, expect $\kappa(X^\top X) > 10^{16}$, indicating severe instability in normal-equation solves.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Condition number squaring: $\kappa(X^\top X) = \kappa(X)^2$ follows from $X^\top X = V\Sigma^2 V^\top$.
  • Normal equations risk: forming $X^\top X$ amplifies ill-conditioning, leading to unstable solutions.
  • SVD as diagnostic: compute singular values $S$ to get $\kappa(X) = S[0]/S[-1]$ without forming $X^\top X$.
  • Near-collinearity: adding a nearly dependent column (X[:,0] + 1e-3*X[:,1]) makes $\kappa(X)$ large.
  • Solver choice: lstsq uses SVD/QR to avoid squaring; explicit $(X^\top X)^{-1}$ fails for large $\kappa$.
  • Shape discipline: $X \in \mathbb{R}^{n\times d}$, $X^\top X \in \mathbb{R}^{d\times d}$; singular values live in $\mathbb{R}^{\min(n,d)}$.
Notes

Part 1: Problem Setup and SVD Decomposition
Start with $X \in \mathbb{R}^{50\times 4}$, append a near-collinear column to get $X \in \mathbb{R}^{50\times 5}$. Compute SVD: $X = U\Sigma V^\top$; singular values $S$ are the diagonal of $\Sigma$. Condition number $\kappa(X) = S[0]/S[-1]$; near-zero $S[-1]$ means near-rank-deficiency.

Part 2: Condition Number Squaring
$X^\top X = V\Sigma^2 V^\top$ (orthogonal $U$ cancels). Eigenvalues of $X^\top X$ are $\sigma_i^2$, so $\kappa(X^\top X) = (\sigma_{\max}/\sigma_{\min})^2 = \kappa(X)^2$. Numerically, if $\kappa(X) = 10^8$, then $\kappa(X^\top X) = 10^{16}$, exhausting double precision.

Part 3: Numerical Verification as Diagnostic
Print cond(X), cond(X^T X), and cond(X)^2. Match confirms the squaring relation. Large $\kappa(X^\top X)$ predicts unstable solutions if using $(X^\top X)^{-1}$; residuals and parameter estimates may be garbage. Use lstsq or QR instead.

Why This Matters for ML
Linear regression, ridge regression, and many optimization steps form normal equations. If features are collinear (common in high-dimensional data), $\kappa(X)$ is large, and forming $X^\top X$ doubles the exponent of instability. Modern libraries default to SVD/QR; understanding why prevents manual use of $(X^\top X)^{-1}$, which fails silently with large errors.

Numerical and Implementation Notes
Use np.linalg.svd(X, compute_uv=False) for singular values only; cheaper than full decomposition. For large $n$, avoid forming $X^\top X$ explicitly; use lstsq or iterative methods. Check rcond in lstsq to control singular-value truncation. Monitor condition numbers as regularization diagnostics.

Numerical and Shape Notes
Shapes: $X (n\times d)$, $X^\top X (d\times d)$, $S (\min(n,d))$. For wide $X$ ($n<d$), $X^\top X$ is rank-deficient; pseudoinverse needed. Singular values span $[0, \sigma_{\max}]$; ratios $\sigma_i/\sigma_{\max}$ indicate effective rank. Appending columns (e.g., np.c_[X, new_col]) can drastically change $\kappa$.

History and Applications

The SVD was developed by Beltrami (1873) and Jordan (1874), with modern numerical algorithms emerging from Golub and Kahan (1965). Recognition that normal equations square the condition number drove adoption of QR (Householder, 1958) and SVD-based least-squares solvers in numerical libraries (LAPACK, 1990s). In ML, this principle underpins why sklearn.linear_model.LinearRegression and np.linalg.lstsq use SVD rather than explicit $(X^\top X)^{-1}$. Applications include ridge regression (regularization stabilizes ill-conditioned $X^\top X$), PCA (covariance eigendecomposition), and modern neural-network training (gradient conditioning affects convergence). Understanding $\kappa(X^\top X) = \kappa(X)^2$ guides regularization, feature engineering, and solver choice.

Connection to Broader Examples
  • Linear maps (Ch. 4): $X$ is a linear map; SVD reveals its structure and invertibility.
  • Projections (Ch. 6): Least squares is projection; conditioning affects numerical accuracy.
  • Rank-nullity (Ch. 7): Near-rank-deficiency (small $\sigma_{\min}$) indicates near-null directions.
  • Eigen (Ch. 8): $X^\top X$ is symmetric; its eigenvalues are $\sigma_i^2$.
  • PSD (Ch. 9): $X^\top X$ is PSD; condition number from eigenvalues.
  • SVD (Ch. 10): This chapter’s core topic: SVD as the tool for stable least squares.
  • PCA (Ch. 11): Covariance $\Sigma \propto X^\top X$; ill-conditioning affects PCA stability.
  • Least squares (Ch. 12): Normal equations $X^\top Xw = X^\top y$ square condition number; SVD avoids this.
  • Conditioning (Ch. 14): This example operationalizes conditioning theory: $\kappa(X^\top X) = \kappa(X)^2$.

Comments