Example number
74
Example slug
example_74_svd_and_conditioning_why_normal_equations_are_risky
Background

Conditioning governs numerical sensitivity: small data/model perturbations can cause large solution changes when $\kappa$ is large. Forming $X^\top X$ squares $\kappa$ and magnifies roundoff errors, which is why modern solvers prefer QR/SVD factorizations of $X$ directly. Multicollinearity (nearly dependent features) is a primary source of ill-conditioning in regression; regularization (ridge) floors the smallest singular values but still cannot undo the squaring if one works with $X^\top X$.

Purpose

Build intuition for how conditioning behaves under different linear-algebra formulations, especially why forming normal equations $X^\top X$ is numerically risky. You will quantify the relationship $\kappa(X^\top X) = [\kappa(X)]^2$, see how near-collinearity (multicollinearity) creates ill-conditioning, and learn why QR/SVD-based least-squares solvers avoid the catastrophic squaring of the condition number.

Problem

Compute cond(X) from singular values and compare to cond(X^T X). Verify cond(X^T X) ≈ cond(X)^2.

Solution (Math)

If $X = U\Sigma V^\top$, then

\[ X^\top X = V\,\Sigma^2\,V^\top. \]

Thus the eigenvalues of $X^\top X$ are $\lambda_i = \sigma_i^2$, and

\[ \kappa(X) = \frac{\sigma_{\max}}{\sigma_{\min}}, \qquad \kappa(X^\top X) = \frac{\lambda_{\max}}{\lambda_{\min}} = \frac{\sigma_{\max}^2}{\sigma_{\min}^2} = [\kappa(X)]^2. \]

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_linreg

X, y, _ = toy_linreg(n=50, d=4, seed=2)
X = np.c_[X, X[:, [0]] + 1e-3 * X[:, [1]]]

S = np.linalg.svd(X, compute_uv=False)
cond_X = S.max() / S.min()
cond_XTX = np.linalg.cond(X.T@X)

print("cond(X):", cond_X)
print("cond(X^T X):", cond_XTX)
print("cond(X)^2:", cond_X**2)
Code Introduction

This snippet demonstrates how forming the normal equations explicitly squares the condition number, making least-squares problems much more sensitive to numerical error. With $X \in \mathbb{R}^{50 \times 5}$ (after adding a nearly collinear fifth column), the design matrix is ill-conditioned, and computing $X^\top X$ amplifies this conditioning problem quadratically.

The code generates synthetic regression data via toy_linreg(n=50, d=4, seed=2), yielding $X \in \mathbb{R}^{50 \times 4}$ with moderate condition number. A fifth column is then concatenated to induce multicollinearity: $X_{:,5} = X_{:,1} + 10^{-3} X_{:,2}$. The singular values $\sigma_1 \ge \cdots \ge \sigma_5$ from np.linalg.svd(X, compute_uv=False) give $\kappa(X) = \sigma_1/\sigma_5$. Because the new column is nearly dependent, $\sigma_5$ is tiny and $\kappa(X)$ is large.

The normal equations $X^\top X \hat{w} = X^\top y$ have $\kappa(X^\top X) = [\kappa(X)]^2$ since the eigenvalues of $X^\top X$ are $\sigma_i^2$. The code verifies this via np.linalg.cond(X.T @ X) and compares it to cond_X**2. In practice, this means solving via normal equations can lose roughly twice as many digits of accuracy as solving via QR/SVD on $X$ directly.

Implications and best practices: - Avoid np.linalg.solve(X.T @ X, X.T @ y); use np.linalg.lstsq(X, y) (QR/SVD-based) instead. - Standardize features and consider ridge ($\lambda I$) to improve conditioning. - Diagnose with $\min \sigma_i$, $\kappa(X)$, and comparisons between solutions from SVD/QR vs normal equations.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Generate data: X, y, _ = toy_linreg(n=50, d=4, seed=2) yields $X \in \mathbb{R}^{50\times 4}$.
  • Create near-collinearity: append $x_1 + 10^{-3} x_2$ as a 5th column to make $X \in \mathbb{R}^{50\times 5}$ nearly rank-deficient.
  • Singular values: S = np.linalg.svd(X, compute_uv=False); compute $(X) = S.max() / S.min()`.
  • Normal equations conditioning: cond_XTX = np.linalg.cond(X.T @ X).
  • Verify numerically: cond_XTX \approx (cond_X)**2 within floating-point tolerance.
  • Optional: compare solutions from np.linalg.lstsq (SVD/QR-based) vs solving normal equations; measure relative error.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Condition number squaring: $\kappa(X^\top X) = [\kappa(X)]^2$.
  • Why normal equations are numerically fragile for ill-conditioned $X$.
  • How a nearly collinear feature makes $\sigma_{\min}$ tiny and $\kappa$ huge.
  • Practical verification using SVD singular values and np.linalg.cond.
  • Safer alternatives: QR and SVD solve least squares without squaring $\kappa$.
Notes
  • Part 1: Creating an Ill-Conditioned Design Matrix. Adding a nearly collinear column $x_1 + 10^{-3} x_2$ makes $\sigma_{\min}$ small and $\kappa(X)$ large; multicollinearity is common in real data.
  • Part 2: Condition Number Squaring in the Normal Equations. Because $\lambda_i(X^\top X) = \sigma_i^2(X)$, $\kappa(X^\top X) = [\kappa(X)]^2$; solving via normal equations amplifies sensitivity.
  • Part 3: Practical Checks and Diagnostics. Report $\min \sigma_i$, $\kappa(X)$, $\kappa(X^\top X)$, and compare SVD/QR vs normal-equations solutions; large discrepancies indicate instability.
  • Why This Matters for ML. Instability harms generalization and interpretability; SVD/QR solvers preserve accuracy, while ridge/standardization mitigate ill-conditioning from feature scales and redundancy.
  • ML Examples and Patterns. Ridge regression improves conditioning by flooring small singular values; PCA removes low-variance directions; polynomial features benefit from orthogonalization (Legendre/Chebyshev) to reduce $\kappa$.
  • Connection to Linear Algebra Theory. SVD links $X$ to $X^\top X$ eigen-spectra; QR keeps $\kappa$ unchanged; pseudoinverse $X^+ = V\Sigma^+ U^\top$ operates on $\sigma_i$, not $\sigma_i^2$.
  • Numerical and Implementation Notes. Prefer np.linalg.lstsq (SVD/QR-backed) or scipy.linalg.lstsq; avoid solve(X.T@X, X.T@y). Standardize features; add ridge ($\lambda I$) if needed; detect rank deficiency with a tolerance on $\sigma_{\min}$.
  • Numerical and Shape Notes. $X \in \mathbb{R}^{n\times d}$, $y \in \mathbb{R}^n$; $S$ holds $\sigma_i$; $X^\top X \in \mathbb{R}^{d\times d}$. If $\sigma_{\min}=0$, $\kappa=\infty$ and normal equations are singular.
  • Pedagogical Significance. A vivid, empirical demonstration that equivalent algebraic forms have different numerical behavior; teaches to respect conditioning and choose algorithms accordingly.
History and Applications

The risks of normal equations have been emphasized since the early days of numerical linear algebra. Householder and Givens developed stable orthogonal factorizations (QR) in the mid-20th century to avoid squaring the condition number. Golub and Kahan (1965) popularized SVD-based algorithms that are now the gold standard for ill-conditioned least squares. In modern ML and statistics, multicollinearity routinely creates small singular values; best practice across libraries is to solve with QR or SVD (e.g., LAPACK’s DGELS, DGELSD) rather than forming $X^\top X$. Applications include linear/regression modeling under collinearity, Gauss–Newton and Levenberg–Marquardt (which form approximate normal equations but solve via QR), and large-scale problems where iterative methods with preconditioning are necessary to control effective conditioning.

Connection to Broader Examples
  • Links to Chapter 10 (SVD): singular values drive conditioning and stability.
  • Connects to Chapter 12 (least squares): QR/SVD vs normal equations; projection geometry preserved without squaring $\kappa$.
  • Relates to Chapter 14 (conditioning): ill-conditioning, error amplification, and regularization.
  • Bridges to Chapter 13 (solving systems): Cholesky on $X^\top X$ is fragile unless $X$ is well-conditioned; QR/SVD preferred.
  • Complements Chapter 11 (PCA): small singular values indicate low-variance directions and potential overfitting.

Comments