Example number
14
Example slug
example_14_condition_number_predicts_sensitivity
Background

The condition number $\kappa(A) = \|A\| \|A^{-1}\|$ measures how sensitive the solution $x$ of $Ax = b$ is to perturbations in $b$ or $A$. For the 2-norm, $\kappa(A) = \sigma_{\max}/\sigma_{\min}$ (ratio of largest to smallest singular value). A perturbation of relative size $\varepsilon$ in $b$ can cause output error up to $\kappa(A) \cdot \varepsilon$ in $x$.

In ML, conditioning appears everywhere: $\kappa(X^\top X)$ in least squares, $\kappa(\Sigma)$ in PCA, $\kappa(H)$ (Hessian) in optimization. Large condition numbers cause training instability: gradient updates become unpredictable, parameters explode, and small data changes yield wildly different models. Regularization (ridge, weight decay) adds $\lambda I$ to improve conditioning by bounding the smallest eigenvalue away from zero. Preconditioning (batch norm, adaptive optimizers) rescales to balance the eigenvalue spectrum. Stable algorithms (QR, SVD, Cholesky) mitigate but cannot remove ill-conditioning—the underlying sensitivity remains.

Purpose

Build intuition that “hard to train” and “unstable gradients” often trace to ill-conditioning—large condition numbers amplify small input noise into large output errors:

  • Understand $\kappa(A) = \sigma_{\max}/\sigma_{\min}$ as a sensitivity multiplier.
  • See how a tiny perturbation in $b$ can cause huge changes in $x$ when $\kappa(A)$ is large.
  • Recognize that regularization, preconditioning, and stable factorizations improve conditioning.
  • Connect this to ML: covariance matrices, Hessians, gradient norms—all governed by conditioning.
Problem

Solve a well-conditioned and an ill-conditioned system and compare sensitivity to perturbations.

Solution (Math)

Large $\kappa(A)$ implies small input perturbations can cause large output changes. Compare systems with different singular value spreads.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np

def trial(A):
    b = np.array([1.,1.])
    db = np.array([0.,1e-6])
    x = np.linalg.solve(A,b)
    x2 = np.linalg.solve(A,b+db)
    return np.linalg.cond(A), np.linalg.norm(x2-x)/np.linalg.norm(x)

A_good = np.array([[1.,0.],[0.,0.9]])
A_bad  = np.array([[1.,0.],[0.,1e-6]])

for name,A in [("good",A_good),("bad",A_bad)]:
    condA, rel_dx = trial(A)
    print(name, "cond:", condA, "rel_dx:", rel_dx)
Code Introduction

This snippet demonstrates how conditioning controls numerical stability by comparing two diagonal matrices with drastically different condition numbers. The trial function solves $Ax = b$ for a baseline right-hand side $b = [1, 1]^\top$, then resolves with a tiny perturbation $b + \delta b$ where $\delta b = [0, 10^{-6}]^\top$. It returns the condition number $\kappa(A) = \|A\| \|A^{-1}\|$ and the relative change in solution $\frac{\|x_2 - x\|}{\|x\|}$. The condition number bounds how much output error can be amplified from input perturbations: a perturbation of size $\varepsilon$ in $b$ can cause output error up to $\kappa(A) \cdot \varepsilon$ in relative terms.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Compute condition number: kappa = np.linalg.cond(A) returns $\kappa_2(A) = \sigma_{\max}/\sigma_{\min}$.
  • Perturbation experiment: solve $Ax = b$ and $Ax_2 = b + \delta b$; measure relative change $\|x_2 - x\|/\|x\|$.
  • Expected amplification: $\frac{\|x_2 - x\|}{\|x\|} \lesssim \kappa(A) \frac{\|\delta b\|}{\|b\|}$.
  • Diagonal test matrices: diag(1, 0.9) has $\kappa \approx 1.11$ (good); diag(1, 1e-6) has $\kappa \approx 10^6$ (bad).
  • Real-world guidance: if $\kappa(X^\top X) > 10^{10}$, expect numerical issues; add ridge regularization or switch to SVD.
  • Shape discipline: $A \in \mathbb{R}^{2\times 2}$, $b \in \mathbb{R}^2$, $x \in \mathbb{R}^2$ in this toy example; principles scale to high dimensions.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Condition number $\kappa(A)$ predicts sensitivity: small perturbations in $b$ cause output errors amplified by $\kappa(A)$.
  • Well-conditioned ($\kappa \approx 1$): input noise stays proportional to output noise.
  • Ill-conditioned ($\kappa \gg 1$): tiny input changes cause catastrophic output errors.
  • Diagonal matrices make conditioning transparent: $\kappa = \lambda_{\max}/\lambda_{\min}$.
  • ML takeaway: inspect singular values before inverting; regularize when $\kappa$ is large.
Notes

Part 1: Core setup - Condition number predicts sensitivity

State the objects, shapes, and target question for Condition number predicts sensitivity. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - Condition number predicts sensitivity

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make Condition number predicts sensitivity work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - Condition number predicts sensitivity

Give the computational recipe for Condition number predicts sensitivity, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: check dimensions before manipulating formulas.
  • Numerical note: prefer stable primitives (lstsq, QR/SVD, Cholesky for SPD) over explicit inverses.
  • Interpretation: relate algebraic steps to geometry (subspaces, projections) and to ML behavior (generalization, stability).

For the well-conditioned matrix A_good = diag(1.0, 0.9), the condition number is $\kappa \approx 1.0 / 0.9 \approx 1.11$ (ratio of largest to smallest singular value). A small perturbation $\delta b$ of relative size $10^{-6}$ in $b$ produces a relative change in $x$ of roughly the same order—input noise stays proportional to output noise. This is the ideal regime: numerical errors remain controlled, and the solution is trustworthy.

For the ill-conditioned matrix A_bad = diag(1.0, 10^{-6}), the condition number is $\kappa \approx 1.0 / 10^{-6} = 10^6$. The same $10^{-6}$ perturbation in $b$ now causes a relative change in $x$ of order $10^0$ (100% error or larger). This catastrophic amplification occurs because the small eigenvalue $10^{-6}$ makes $A$ nearly singular—inverting it multiplies errors by the reciprocal of the smallest singular value. In ML contexts, this manifests as training instability: gradient updates become unpredictable, parameter estimates explode, and small changes in data cause wildly different models.

ML Connections: The condition number $\kappa(X^\top X)$ in least squares, $\kappa(\Sigma)$ in PCA, and $\kappa(H)$ (Hessian conditioning) in optimization all determine whether algorithms converge reliably. Regularization (ridge regression, weight decay) explicitly improves conditioning by adding $\lambda I$ to matrices, bounding the smallest eigenvalue away from zero. Preconditioning (batch normalization, adaptive optimizers like Adam) rescales gradients to balance the eigenvalue spectrum. Shape discipline helps diagnose conditioning: if $X \in \mathbb{R}^{n \times d}$ with $n \ll d$ (underdetermined), $X^\top X$ is rank-deficient ($\kappa = \infty$), signaling the need for regularization or dimensionality reduction. Always inspect singular values before inverting matrices—small singular values flag numerical danger zones where floating-point arithmetic breaks down.

History and Applications

Origins of conditioning theory: The condition number emerged from early numerical analysis (von Neumann, Turing, Wilkinson) as a way to quantify how rounding errors propagate through matrix computations. Wilkinson’s backward error analysis (1960s) showed that stable algorithms produce nearly exact solutions to slightly perturbed problems, with perturbations bounded by $\kappa(A)$ times machine precision.

Modern stability theory: Higham and others formalized perturbation bounds: for linear systems, $\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \frac{\|\delta b\|}{\|b\|} + O(\kappa(A)^2 \varepsilon_{\text{mach}})$, where $\varepsilon_{\text{mach}} \approx 10^{-16}$ for double precision. This explains why condition numbers above $10^{10}$ cause visible errors even with stable solvers.

ML applications: Conditioning governs training dynamics across methods. In least squares, $\kappa(X^\top X) = \kappa(X)^2$ amplifies sensitivity—QR/SVD avoid squaring the condition number. In deep learning, poorly conditioned Hessians slow convergence; adaptive optimizers (Adam, RMSprop) implicitly precondition by normalizing per-parameter gradient history. Batch normalization reduces internal covariate shift by whitening activations, improving layer-wise conditioning. In PCA, small eigenvalues of the covariance matrix correspond to high-variance directions in noisy data—truncating them regularizes. Diagnosing conditioning via singular value inspection and mitigating via regularization ($\lambda I$) or preconditioning are foundational skills for reliable ML systems.

Connection to Broader Examples
  • Least squares (Ch. 12): $\kappa(X^\top X) = \kappa(X)^2$; ill-conditioning squared—prefer QR/SVD over normal equations.
  • PCA (Ch. 11): Covariance $\Sigma$ conditioning affects eigenvalue estimation; small eigenvalues are noisy.
  • Cholesky (Ch. 13): SPD matrices with large $\kappa$ can fail Cholesky or require jitter; check conditioning first.
  • SVD (Ch. 10): Small singular values signal conditioning issues; truncated SVD regularizes by discarding them.
  • Optimization: Hessian conditioning governs convergence rates; poorly conditioned Hessians require more iterations or preconditioning.

Comments