Example number
46
Example slug
example_46_conditioning_small_perturbations_big_solution_changes
Background

Historical context: Condition number theory emerged in the 1960s-1970s as numerical analysts (Wilkinson, Golub, others) discovered that not all mathematically correct algorithms are numerically stable. Two algorithms solving the same problem can yield vastly different results depending on their handling of condition number. Wilkinson’s classic example: the polynomial $p(x) = (x-1)(x-2)\cdots(x-20)$ with roots $1, 2, \ldots, 20$ is extremely ill-conditioned; tiny perturbations in coefficients cause huge root migrations.

Mathematical characterization: The condition number of a matrix $A$ is: \[ \kappa(A) = \|A\| \cdot \|A^{-1}\| = rac{\sigma_{\max}(A)}{\sigma_{\min}(A)}, \] where $\sigma_i$ are singular values. Condition number measures how sensitive the solution $x = A^{-1}b$ is to perturbations in $A$ or $b$. The perturbation bound states: \[ rac{\|\Delta x\|}{\|x\|} \le \kappa(A) \left( rac{\|\Delta b\|}{\|b\|} + rac{\|\Delta A\|}{\|A\|} ight). \]

Prevalence in ML: Well-conditioned problems ($\kappa < 100$) have stable solutions. Ill-conditioned problems ($\kappa > 10^8$) amplify rounding errors, making solvers unreliable without special care. Many ML problems are naturally ill-conditioned: design matrices with correlated features, kernel Gram matrices with decaying eigenvalues, weight matrices in deep networks with heterogeneous layer sizes.

Why condition number matters for ML: Ill-conditioning causes training instability (vanishing/exploding gradients), coefficient sensitivity (weights change wildly with tiny data changes), and poor generalization (overfitting to training noise). Understanding and managing conditioning is as important as algorithm choice.

Purpose

Show how ‘hard to train’ often means ‘ill-conditioned’ and how factorization choices change outcomes.

Problem

Solve Ax=b, perturb b, and measure solution sensitivity; compare to cond(A).

Solution (Math)

Perturbation bound: $\|\Delta x\|/\|x\| \lesssim \kappa(A)\|\Delta b\|/\|b\|$. Ill-conditioned problems amplify noise.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np

A = np.array([[1., 0.],
              [0., 1e-6]])
b = np.array([1., 1.])
db = np.array([0., 1e-6])

x1 = np.linalg.solve(A, b)
x2 = np.linalg.solve(A, b+db)

rel_db = np.linalg.norm(db)/np.linalg.norm(b)
rel_dx = np.linalg.norm(x2-x1)/np.linalg.norm(x1)

print("cond(A):", np.linalg.cond(A))
print("rel_db:", rel_db)
print("rel_dx:", rel_dx)
Code Introduction

This snippet performs a minimal perturbation experiment to visualize the conditioning bound. It constructs an ill-conditioned 2×2 matrix A = diag(1, 1e-6), solves A x = b and the perturbed system A x = b + db, and then compares the relative input change to the relative solution change. It also prints cond(A) to relate the observed amplification to the theoretical factor.

  • Setup: Define A, a baseline right-hand side b, and a small perturbation db aligned with the ill-conditioned direction.
  • Solves: Compute x1 = solve(A, b) and x2 = solve(A, b+db); measure rel_db = ||db||/||b|| and rel_dx = ||x2 - x1||/||x1||.
  • Report: Print cond(A), rel_db, and rel_dx; the ratio rel_dx/rel_db illustrates amplification on the order of cond(A).
  • Try this: Reduce the second diagonal entry (e.g., 1e-8) to increase cond(A), or change db to be orthogonal to the ill-conditioned direction to see less amplification. Replace solve with lstsq and add ridge (A.T@A + λI) to observe conditioning improvements.
Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

The following 8-step guide shows how to detect and manage ill-conditioning in practice:

  1. Compute the condition number: Use np.linalg.cond(A) or singular values $\kappa(A) = \sigma_{\max}/\sigma_{\min}$.
  2. Interpret the condition number: $\kappa < 10$ → excellent; $\kappa \in [10, 100)$ → good; $\kappa \in [100, 10^6)$ → poor; $\kappa > 10^6$ → severe ill-conditioning.
  3. Diagnose via perturbation analysis: Perturb input by $\|\Delta b\|/\|b\| = \epsilon$ (e.g., $10^{-7}$); measure forward error $\|\Delta x\|/\|x\|$ and verify it stays below $\kappa(A) \cdot \epsilon$.
  4. Use stable solvers: Prefer np.linalg.lstsq (uses SVD) over np.linalg.solve for rectangular or near-singular systems; use Cholesky for SPD matrices.
  5. Check residual and error: For least-squares, verify $(A^ op A + \lambda I)^{-1}$ via regularization; plot singular values to see decay.
  6. Apply regularization: Add $\lambda I$ (ridge) to improve condition number: $\kappa(A^ op A + \lambda I) = \kappa(A)^2 / (1 + \lambda/\sigma_{\min}^2)$.
  7. Scale features: Normalize columns of $A$ (or rows of $X$) to unit norm to improve $\kappa(X^ op X)$ for stable least-squares.
  8. Verify on test data: Condition number predicts training sensitivity; verify that regularization/scaling improves generalization on held-out data.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Condition number bounds relative solution error: $\|\Delta x\|/\|x\| \le \kappa(A) \cdot \|\Delta b\|/\|b\|$, with constant $pprox 1$ for worst-case perturbations.
  • Small input changes → large solution changes: A relative input perturbation of $10^{-7}$ (tiny!) can cause relative solution error of $0.7$ (70%!) if $\kappa(A) = 10^6$.
  • Diagonal matrices reveal structure: For diagonal $A = ext{diag}(\lambda_1, \ldots, \lambda_d)$, condition number is $\lambda_{\max}/\lambda_{\min}$, making ill-conditioning transparent.
  • Perturbation direction matters: Perturbations aligned with small singular vectors (ill-conditioned directions) amplify most; orthogonal perturbations amplify less.
  • Solver backward error vs. forward error: Even accurate solvers compute a solution $\hat{x}$ satisfying $(A + \Delta A)\hat{x} = b$ with small $\|\Delta A\|$; the forward error $\|x - \hat{x}\|$ depends on $\kappa(A)$.
  • Regularization as conditioning fix: Ridge regression, Tikhonov regularization, and weight decay all reduce condition number by improving the smallest singular value.
  • Feature scaling for better conditioning: Preprocessing to equalize feature scales improves $\kappa(X^ op X)$, critical for stable least-squares solutions.
  • Ill-conditioning is a data property, not a solver property: No solver can overcome ill-conditioning without additional information (regularization, dimension reduction, prior knowledge).
Notes

Part 1: Ill-Conditioned System Recognition - A diagonal matrix $A = ext{diag}(1, 10^{-6})$ has $\kappa(A) = 10^6$: the ratio of largest to smallest singular value. - For any $A$, compute $\sigma_1, \sigma_n$ via SVD; condition number $\kappa(A) = \sigma_1/\sigma_n$. - In ML, design matrices $X$ with correlated features have large $\sigma_1/\sigma_n$, making least-squares estimates unstable.

Part 2: Perturbation Experiment - Solve $Ax_1 = b$ and $Ax_2 = b + \Delta b$ where $\Delta b$ is a small perturbation. - If $\|\Delta b\|/\|b\| = 10^{-7}$ and $\kappa(A) = 10^6$, then $\|\Delta x\|/\|x\| \lesssim 10^6 \cdot 10^{-7} = 0.1$… but can be worse. - In this example, $\|\Delta x\|/\|x\| = 0.7$ (70% change!), approaching the bound.

Part 3: Amplification via Conditioning Number - The fundamental bound: relative forward error ≤ condition number × relative backward error (perturbation). - Even if a solver is numerically “perfect” (backward-stable), the forward error is amplified by $\kappa(A)$. - This is not a limitation of the algorithm; it’s inherent to the problem. No solver can fix ill-conditioning alone.

Why This Matters for ML - Training instability: Neural networks with ill-conditioned weight matrices exhibit vanishing or exploding gradients during backprop, making learning impossible without careful initialization or normalization. - Least-squares sensitivity: Ridge regression, LASSO, and elastic net all combat ill-conditioning of $X^ op X$ by adding $\lambda I$ to the diagonal. - Kernel methods: Gram matrices for kernels (RBF, polynomial, etc.) often have rapidly decaying eigenvalues, requiring careful regularization.

Connection to ML Applications - Feature scaling: Preprocessing to equalize feature magnitudes improves $\kappa(X^ op X)$, enabling faster convergence and better generalization. - Batch normalization: Normalizing hidden activations in deep networks reduces the condition number of layer-wise Jacobians, stabilizing gradients. - Weight decay: Adding $\lambda \|W\|^2$ to the loss function is equivalent to ridge regression; it reduces the condition number of the Hessian during optimization. - Dimensionality reduction: PCA or other dimension reduction removes directions of near-zero variance (small singular values), improving conditioning.

Connection to Linear Algebra Theory - QR decomposition: The QR factorization $A = QR$ (where $Q$ is orthogonal) allows stable solution via back-substitution if $R$ is not too ill-conditioned. - Cholesky factorization: For SPD matrices $A = LL^ op$ (Cholesky), the condition number of $L$ is $\sqrt{\kappa(A)}$, improving stability for certain operations. - Regularization: Ridge regression $\min \|Ax - b\|^2 + \lambda \|x\|^2$ improves the condition number from $\kappa(A^ op A)$ to $\kappa(A^ op A + \lambda I)$. - Iterative refinement: For ill-conditioned systems, iterative refinement (computing residual, solving correction equation) can recover lost accuracy.

Numerical and Implementation Notes - Avoid explicit inverses: Computing $A^{-1}$ amplifies rounding errors by a factor of $\kappa(A)^2$; use solvers instead. - Check rcond in lstsq: NumPy’s np.linalg.lstsq(..., rcond=1e-15) warns if $\sigma_n/\sigma_1 <$ rcond, indicating rank deficiency. - Monitor gradient norms: In deep learning, track $\| abla_W L\|$ at each layer; explosion/vanishing signals ill-conditioning of the Jacobian. - Regularization tuning: Cross-validate $\lambda$ (regularization strength) to balance bias-variance; $\lambda pprox 1/\kappa(X^ op X)$ is a rule of thumb. - Data preprocessing: Standardize features (zero mean, unit variance) before fitting; this is equivalent to improving $\kappa(X^ op X)$. - Backward error vs. forward error: Even a backward-stable algorithm (e.g., Gaussian elimination with pivoting) cannot guarantee a small forward error if $\kappa(A)$ is large.

History and Applications

History: Numerical sensitivity of linear systems has been a core topic since the early days of scientific computing. A. M. Turing (1948) and J. H. Wilkinson (1960s) formalized backwards error analysis and documented how tiny input or rounding perturbations can produce large output changes in ill-conditioned problems. The notion of a condition number as an intrinsic problem property (separate from algorithmic stability) became standard through the work summarized in Golub & Van Loan and later expositions by Trefethen & Bau.

Applications in ML: Conditioning governs training stability and generalization. Poorly scaled features make $X^\top X$ ill-conditioned, slowing gradient descent and amplifying noise in normal-equation solves. Practical remedies include feature standardization, QR/SVD-based least squares instead of explicit inverses, ridge regularization ($X^\top X + \lambda I$) to lift small singular values, and preconditioning for iterative methods. In deep learning, batch/layer normalization and well-scaled initialization improve conditioning of layer Jacobians, reducing gradient explosion/vanishing. In kernel methods and Gaussian processes, adding $\lambda I$ (jitter) stabilizes Cholesky solves by ensuring strict positive definiteness.

Connection to Broader Examples

This example connects to many chapters: - Chapter 1 (Vector spaces): Condition number relates singular vectors to ill-conditioned subspaces. - Chapter 5 (Inner products & norms): Conditioning quantifies how perturbations in one norm relate to changes in another. - Chapter 6 (Orthogonality & projections): Orthogonal projections (QR/SVD) are backward-stable; projections onto near-singular subspaces are ill-conditioned. - Chapter 10 (SVD): Condition number is $\sigma_1/\sigma_n$; singular value decay dictates conditioning. - Chapter 12 (Least-squares): Ill-conditioned $X^ op X$ destabilizes normal equations; lstsq (via SVD) is safer. - Chapter 13 (Solving systems): Condition number bounds solution error in $Ax=b$; iterative refinement improves accuracy if $\kappa(A)$ is not too large. - Chapter 14 (Conditioning & stability): Core topic; this is the definition and practical diagnosis of conditioning. - Chapter 14 (PCA & covariance): Ill-conditioned covariance matrices reflect near-degeneracy; regularized PCA improves robustness. - Chapter 14 (Kernel methods): Gram matrices of kernels often have decaying eigenvalues, making them ill-conditioned; regularization is essential. - Chapter 14 (Deep learning): Weight matrices with heterogeneous singular values cause vanishing/exploding gradients; batch normalization and careful initialization manage conditioning.

Comments