Example number
62
Example slug
example_62_conditioning_small_perturbations_big_solution_changes
Background

Historical context: Condition number was formally introduced by Turing (1948, “Rounding errors in matrix processes”) and developed by Wilkinson (1961, “Rounding errors in algebraic processes”) as a tool to understand why numerical algorithms failed for seemingly innocent-looking problems. Before computers, condition number wasn’t needed—hand calculations on small systems rarely encountered extreme sensitivity. Digital computers revealed the problem: apparently simple systems (e.g., $10 \times 10$ Vandermonde matrices, Hilbert matrices) produced wildly inaccurate solutions despite doubling precision. Wilkinson proved that condition number is a fundamental limit—no algorithm can guarantee better relative error than $\kappa(A) \times \text{machine precision}$. This insight revolutionized numerical analysis: stability and accuracy are intrinsic properties of the problem, not just the algorithm.

Mathematical characterization: Condition number measures the sensitivity of a linear system $Ax = b$ to perturbations in $A$ and $b$. For data perturbation $\Delta b$, the relative error in the solution is bounded by: \[ \frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\Delta b\|}{\|b\|}, \] where $\kappa(A) = \|A\|_2 \|A^{-1}\|_2 = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$ (ratio of largest to smallest singular value). Large condition number (e.g., $\kappa(A) = 10^6$) means small data errors are amplified 1-million-fold in the solution. The bound is tight—there exist perturbations where amplification is exactly $\kappa(A)$ times.

Prevalence in ML: Ill-conditioning is ubiquitous in machine learning: (1) Feature collinearity: polynomial features, interaction terms, or correlated variables cause $\kappa(X) \to \infty$, (2) Vanishing gradients: in neural networks, Hessians become singular near saddle points, causing $\kappa(\nabla^2 f) \to \infty$, (3) Data imbalance: rare classes or outliers magnify $\kappa(X)$, (4) Overparameterization: more parameters than data ($d > n$) produces rank-deficient $X^\top X$. Practitioners combat ill-conditioning through feature scaling, regularization (ridge/L2, early stopping), and solver choice (SVD over normal equations).

Purpose

Demonstrate that condition number quantifies how much data perturbations are amplified in solutions, a fundamental principle determining numerical stability of ML algorithms. Show empirically that a small relative perturbation $\frac{\|\Delta b\|}{\|b\|}$ in the right-hand side produces a much larger relative error $\frac{\|\Delta x\|}{\|x\|}$ in the solution when the matrix is ill-conditioned. Understand why “hard to train” often means “ill-conditioned”—when gradients vanish, eigenvalues cluster near zero, or features are collinear, solvers struggle because the condition number blows up. Master the relationship $\frac{\|\Delta x\|}{\|x\|} \lesssim \kappa(A) \frac{\|\Delta b\|}{\|b\|}$ and learn that no algorithm can overcome ill-conditioning—it’s a property of the problem, not the solver. This example is the canonical case study for understanding why regularization (ridge, early stopping, dropout) is essential for numerical stability in deep learning and optimization.

Problem

Solve $Ax = b$, perturb $b$, and measure solution sensitivity; compare to $\kappa(A)$.

Solution (Math)

Perturbation bound: $\|\Delta x\|/\|x\| \lesssim \kappa(A)\|\Delta b\|/\|b\|$. Ill-conditioned problems amplify noise.

We use:

  • Data matrix $X \in \mathbb{R}^{n \times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle = x^\top y$.
Solution (Python)
import numpy as np

A = np.array([[1., 0.],
              [0., 1e-6]])
b = np.array([1., 1.])
db = np.array([0., 1e-6])

x1 = np.linalg.solve(A, b)
x2 = np.linalg.solve(A, b+db)

rel_db = np.linalg.norm(db)/np.linalg.norm(b)
rel_dx = np.linalg.norm(x2-x1)/np.linalg.norm(x1)

print("cond(A):", np.linalg.cond(A))
print("rel_db:", rel_db)
print("rel_dx:", rel_dx)
Code Introduction

This code demonstrates condition number and solution sensitivity—a fundamental principle in numerical linear algebra. It shows that small perturbations in data ($b$) can be amplified dramatically in the solution ($x$) when the matrix $A$ is ill-conditioned, illustrating why condition number is the key metric for assessing numerical stability in ML solvers.

Part 1: Ill-Conditioned System. The matrix $A$ is diagonal: $A = \begin{bmatrix} 1 & 0 \\ 0 & 10^{-6} \end{bmatrix}$, with eigenvalues $\lambda_1 = 1$ and $\lambda_2 = 10^{-6}$. The condition number is: $\kappa(A) = \frac{\lambda_{\max}}{\lambda_{\min}} = \frac{1}{10^{-6}} = 10^6$, an extremely ill-conditioned system. One eigenvalue is huge compared to the other—the matrix “stretches” in one direction and “squashes” in another, making the solution highly sensitive to perturbations. The right-hand side is $b = [1, 1]^\top$, and a small perturbation is introduced: $\Delta b = [0, 10^{-6}]^\top$, where the perturbation affects only the second component (the direction of the smallest eigenvalue). This is intentional: perturbations in “weak” directions are amplified most severely.

Part 2: Solving Two Slightly Different Systems. The code solves two linear systems: $A x_1 = b$ and $A x_2 = (b + \Delta b)$, via np.linalg.solve(), which uses LU decomposition with pivoting. The solutions are: $x_1 = A^{-1} b = \begin{bmatrix} 1 \\ 10^6 \end{bmatrix}$ and $x_2 = A^{-1} (b + \Delta b) = \begin{bmatrix} 1 \\ 10^6 + 1 \end{bmatrix}$. (Verify: $A x_1 = [1 \cdot 1, 10^{-6} \cdot 10^6]^\top = [1, 1]^\top = b$ ✓) The difference is $x_2 - x_1 = [0, 1]^\top$, which is small in absolute terms but large relative to $x_1$.

Part 3: Measuring Relative Error Amplification. The code computes relative errors (perturbations normalized by original magnitudes): Relative perturbation in $b$: $\frac{\|\Delta b\|_2}{\|b\|_2} = \frac{\sqrt{0^2 + (10^{-6})^2}}{\sqrt{1^2 + 1^2}} = \frac{10^{-6}}{\sqrt{2}} \approx 0.707 \times 10^{-6}$. Relative error in solution: $\frac{\|x_2 - x_1\|_2}{\|x_1\|_2} = \frac{\sqrt{0^2 + 1^2}}{\sqrt{1^2 + (10^6)^2}} \approx \frac{1}{10^6} \approx 10^{-6}$. Typical output:

cond(A): 1000000.0
rel_db: 7.071068e-07
rel_dx: 1.000000e-06

The key observation: the relative error in the solution is roughly equal to the relative perturbation in the data, scaled by the condition number. More precisely, the amplification factor is approximately $\sqrt{\kappa(A)}$ in this specific example due to the diagonal structure and perturbation direction, but the general principle is: $\frac{\text{rel}_{\Delta x}}{\text{rel}_{\Delta b}} \lesssim \kappa(A)$.

Why This Matters for ML. Perturbation error bounds: For a linear system $Ax = b$, if $b$ is perturbed by $\Delta b$, the relative error in the solution satisfies: $\frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\Delta b\|}{\|b\|}$, where $\Delta x = A^{-1}(b + \Delta b) - A^{-1} b$. The condition number is the amplification factor—higher condition number means worse sensitivity. Data noise amplification: In ML, data is noisy. If features have near-collinear relationships (e.g., polynomial features $[x, x^2, x^3]$ for large $x$), the design matrix $X$ becomes ill-conditioned. Solving least-squares $X\hat{w} = y$ then amplifies measurement noise by a factor of $\kappa(X)$, yielding unstable weight estimates. Ridge regression adds $\lambda I$ to the normal equations, improving conditioning artificially. Numerical precision loss: In floating-point arithmetic, every operation introduces rounding errors. For ill-conditioned systems, these rounding errors get amplified. If $\kappa(A) = 10^{12}$, the effective precision is $10^{12} \times 10^{-16} = 10^{-4}$—only 4 digits of accuracy despite 16-digit precision in double-precision floats. The solver can do nothing to recover lost precision. Solver choice consequences: For ill-conditioned systems: (1) Normal equations $(X^\top X) \hat{w} = X^\top y$ square the condition number: $\kappa(X^\top X) = [\kappa(X)]^2$ (catastrophic), (2) SVD/QR-based least-squares avoid squaring, maintaining $\kappa(X)$, (3) Regularization (ridge, early stopping) improves conditioning artificially. Practitioners often don’t have a choice of $A$ (it’s determined by the problem), but choosing the right solver can prevent amplification.

Numerical Gotchas and Verification. Shape discipline: $A \in \mathbb{R}^{2 \times 2}$, $b \in \mathbb{R}^2$, $\Delta b \in \mathbb{R}^2$, $x_1, x_2 \in \mathbb{R}^2$. Diagonal structure is special: For diagonal $A$, $\kappa(A)$ is simply the ratio of diagonal elements. General matrices require SVD or eigendecomposition. Perturbation direction matters: In this example, $\Delta b$ targets the “weak” direction (smallest eigenvalue). Perturbations in “strong” directions are amplified less. Relative vs. absolute error: Always use relative errors for assessing numerical stability; absolute errors are scale-dependent. Condition number doesn’t predict wall-clock time: High $\kappa(A)$ makes convergence slow, but effect is polynomial, not exponential.

Connection to Linear Algebra Theory. Condition number definition: $\kappa(A) = \|A\|_2 \|A^{-1}\|_2 = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$ (ratio of largest to smallest singular value). Perturbation bound: For $Ax = b$ and perturbation $\Delta b$, the solution error satisfies: $\frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\Delta b\|}{\|b\|}$. This bound is tight. No algorithm can overcome it: For any solver, achievable relative error is bounded by: $\text{relative error} \ge c \cdot \kappa(A) \times \text{machine precision}$ (where $c$ is small). This is a fundamental limit of computation in finite precision.

Pedagogical Significance. This example is the canonical demonstration of condition number and perturbation sensitivity: (1) Condition number is concrete, not abstract—$\kappa(A)$ is an exact multiplier of error amplification, observable and reproducible, (2) Ill-conditioning is inevitable for some problems—you can’t escape it; it’s a property of the matrix, (3) No algorithm can overcome the fundamental limit—all solvers produce relative error $\sim \kappa(A) \times \text{machine precision}$, (4) Regularization is a numerical necessity—ridge and other regularization improve conditioning artificially. The code is minimal (11 lines), but the lesson is profound: understand condition number, and you understand why ML systems are hard to train and optimize.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Define ill-conditioned matrix: Construct $A = \begin{bmatrix} 1 & 0 \\ 0 & 10^{-6} \end{bmatrix}$ with eigenvalues $\lambda_1 = 1$ and $\lambda_2 = 10^{-6}$. Condition number: $\kappa(A) = \frac{1}{10^{-6}} = 10^6$ (extremely ill-conditioned).

  2. Set up original system: Define $b = [1, 1]^\top$ and solve $Ax = b$ via np.linalg.solve(A, b), yielding $x_1 = [1, 10^6]^\top$ (the second component blows up due to the small diagonal element).

  3. Define perturbation: Create a small perturbation $\Delta b = [0, 10^{-6}]^\top$ (only affects the second component, the direction of the smallest eigenvalue—worst-case direction for amplification).

  4. Solve perturbed system: Compute $x_2 = \text{solve}(A, b + \Delta b)$ via the same solver, yielding $x_2 = [1, 10^6 + 1]^\top$ (the solution changes dramatically in the second component).

  5. Compute relative perturbation: Calculate $\frac{\|\Delta b\|_2}{\|b\|_2} = \frac{10^{-6}}{\sqrt{2}} \approx 0.707 \times 10^{-6}$ (tiny relative perturbation, less than 1 part per million).

  6. Compute relative solution error: Calculate $\frac{\|x_2 - x_1\|_2}{\|x_1\|_2} = \frac{1}{\sqrt{1 + (10^6)^2}} \approx 10^{-6}$ (a million-fold amplification of the relative perturbation!).

  7. Verify condition number bound: Check that $\frac{\text{rel}_{\Delta x}}{\text{rel}_{\Delta b}} \lesssim \kappa(A)$. In this case, $\frac{10^{-6}}{0.707 \times 10^{-6}} \approx 1.4 \ll 10^6$ (the bound is not tight here due to diagonal structure and perturbation direction, but illustrates the principle).

  8. Interpret results: The condition number predicts that perturbations can be amplified up to 1-million-fold. Even though the actual amplification is smaller due to the specific matrix structure, the order-of-magnitude increase in relative error is dramatic and directly tied to $\kappa(A)$.

What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Condition number predicts error amplification: $\kappa(A)$ bounds how much perturbations in $b$ are amplified in the solution $x$.
  • Ill-conditioned problems are inevitable: You can’t escape ill-conditioning by choosing a better solver—it’s a property of the matrix $A$, not the algorithm.
  • Relative errors matter: Perturbation amplification is measured in relative terms ($\frac{\|\Delta x\|}{\|x\|}$ vs. $\frac{\|\Delta b\|}{\|b\|}$), which are scale-independent and meaningful for ML.
  • Perturbation direction determines amplification: Perturbations along directions of small singular values are amplified most; perturbations along directions of large singular values are amplified least.
  • No algorithm can overcome the fundamental limit: For a fixed $A$ with condition number $\kappa(A)$, every solver (LU, QR, SVD) produces relative error $\sim \kappa(A) \times \text{machine precision}$, which is unavoidable.
  • Regularization improves conditioning artificially: Adding $\lambda I$ reduces condition number of the regularized system $A + \lambda I$, trading off bias for numerical stability.
  • Normal equations square the condition number: For least-squares, normal equations $(X^\top X) \hat{w} = X^\top y$ have condition number $\kappa(X^\top X) = [\kappa(X)]^2$, doubling the exponent of ill-conditioning (catastrophic).
  • Condition number reveals why training is hard: When $\kappa(\nabla^2 f) = 10^{10}$ (highly ill-conditioned), gradient descent converges slowly—small learning rate steps avoid instability, but progress is glacial; large learning rates diverge.
Notes

Part 1: Ill-Conditioned System

  • The diagonal matrix $A = \begin{bmatrix} 1 & 0 \\ 0 & 10^{-6} \end{bmatrix}$ has a huge disparity between eigenvalues: $\lambda_1 = 1$ (large) and $\lambda_2 = 10^{-6}$ (tiny).
  • Condition number: $\kappa(A) = \frac{\lambda_{\max}}{\lambda_{\min}} = \frac{1}{10^{-6}} = 10^6$ (one million).
  • This means the system “stretches” in one direction and “squashes” in another—the matrix is nearly singular in one direction.
  • For any perturbation, components along the “squashed” direction (eigenvector of $\lambda_2$) are amplified by $1 / \lambda_2 = 10^6$.
  • Why diagonal? Diagonal matrices make $\kappa(A)$ transparent and easy to compute: just the ratio of diagonal entries. General matrices require SVD to compute condition number accurately.

Part 2: Solving Two Slightly Different Systems

  • Original system: $A x_1 = b$ where $b = [1, 1]^\top$.
    • Solution: $x_1 = A^{-1} b = \begin{bmatrix} 1 / 1 \\ 1 / 10^{-6} \end{bmatrix} = [1, 10^6]^\top$.
    • The second component is huge (10 million) because we’re dividing by a tiny number ($10^{-6}$).
  • Perturbed system: $A x_2 = b + \Delta b$ where $\Delta b = [0, 10^{-6}]^\top$.
    • Right-hand side: $b + \Delta b = [1, 1.000001]^\top$ (tiny change, one part per million).
    • Solution: $x_2 = A^{-1} (b + \Delta b) = \begin{bmatrix} 1 \\ (1 + 10^{-6}) / 10^{-6} \end{bmatrix} = [1, 10^6 + 1]^\top$.
    • The second component changed by 1 (absolute), but this is a 1-part-per-million relative change ($1 / 10^6$).
  • Key insight: A tiny perturbation ($10^{-6}$) in the right-hand side causes a tiny absolute change (1) in the solution, but this represents a huge relative change when normalized by the solution magnitude ($1 / 10^6 \approx 10^{-6}$ relative change in $x_2$).

Part 3: Measuring Relative Error Amplification

  • Relative perturbation in $b$: $\frac{\|\Delta b\|_2}{\|b\|_2} = \frac{\sqrt{0^2 + (10^{-6})^2}}{\sqrt{1^2 + 1^2}} = \frac{10^{-6}}{\sqrt{2}} \approx 7.07 \times 10^{-7}$ (0.0000007 relative change).

  • Relative error in solution: $\frac{\|x_2 - x_1\|_2}{\|x_1\|_2} = \frac{\sqrt{0^2 + 1^2}}{\sqrt{1^2 + (10^6)^2}} \approx \frac{1}{10^6} = 10^{-6}$ (one part per million).

  • Amplification factor: $\frac{\text{rel}_{\Delta x}}{\text{rel}_{\Delta b}} = \frac{10^{-6}}{7.07 \times 10^{-7}} \approx 1.41$.

  • Comparison to condition number: The bound predicts amplification up to $\kappa(A) = 10^6$, but we observe only $\sim 1.41$. Why? Because:

    1. $\Delta b$ is applied only to the “squashed” direction (second component).
    2. The Euclidean norm averages over both directions.
    3. The perturbation is in the direction of maximum amplification, but the norm-based bound accounts for all possible perturbation directions.
    4. The actual amplification is much less than the worst-case bound, which is typical in practice.

Why This Matters for ML

  • Data noise amplification: Measurement noise in features gets amplified by $\kappa(X)$ when solving least-squares. If $\kappa(X) = 10^6$, noisy data produces wildly incorrect weight estimates.

  • Gradient descent convergence: The Hessian $H = \nabla^2 f(x)$ determines curvature. If $\kappa(H) = 10^6$ (huge disparity in curvatures), gradient descent takes tiny steps along high-curvature directions and large steps along low-curvature directions. The learning rate must be small enough to avoid divergence on high-curvature directions, but this means glacially slow progress on low-curvature directions.

  • Normal equations fail catastrophically: For least-squares with ill-conditioned $X$, forming $X^\top X$ squares the condition number: $\kappa(X^\top X) = [\kappa(X)]^2$. If $\kappa(X) = 10^3$, then $\kappa(X^\top X) = 10^6$. Solving $(X^\top X) \hat{w} = X^\top y$ via LU or Cholesky produces weights with relative error $\sim 10^6 \times 10^{-16} = 10^{-10}$—only 6 digits of accuracy despite double-precision arithmetic having 16 digits.

  • Regularization improves conditioning: Ridge regression solves $(X^\top X + \lambda I) \hat{w} = X^\top y$. The regularized system has condition number $\kappa(X^\top X + \lambda I) = \frac{\sigma_1^2 + \lambda}{\sigma_n^2 + \lambda}$, which is much smaller than $[\kappa(X)]^2$ for well-chosen $\lambda$. This trades off bias (solution deviates from true optimum) for numerical stability.

ML Examples and Patterns

  • Feature scaling: Standardize features to zero mean and unit variance before fitting:

    X_std = (X - X.mean(0)) / X.std(0)
    # cond(X_std) typically 10-100, manageable
    # cond(X) potentially 1000+, problematic
  • Ridge regression hyperparameter tuning: For each $\lambda$ in a grid, solve:

    A_ridge = X.T @ X + lambda_val * np.eye(d)
    w_ridge = np.linalg.solve(A_ridge, X.T @ y)

    Larger $\lambda$ improves conditioning (reduces $\kappa(A_ridge)$), but increases bias. Cross-validate to find the sweet spot.

  • Polynomial feature truncation: Fitting polynomials of high degree ($p > 10$) creates Vandermonde-like design matrices with $\kappa(X) \sim 2^p$ (exponential growth). Truncate to $p \le 5$ or use orthogonal polynomials (Chebyshev, Legendre) which have much better conditioning.

  • Batch normalization in neural networks: Normalizing layer activations to mean 0 and variance 1 improves conditioning of Hessians, enabling faster learning rates and better convergence. This is why batch norm is so effective.

  • Early stopping for regularization: As training progresses, weights grow and ill-conditioning worsens. Stopping early keeps weights small, maintaining better conditioning.

  • Iterative refinement for ill-conditioned systems:

    x_approx = np.linalg.solve(A, b)  # Initial solve
    for _ in range(5):
        r = b - A @ x_approx  # Residual
        delta_x = np.linalg.solve(A, r)  # Correction (compute in high precision)
        x_approx += delta_x

    This recovers extra digits of accuracy by exploiting that residuals are computed accurately.

  • Preconditioned iterative solvers: For sparse ill-conditioned systems, use scipy.sparse.linalg.cg with a preconditioner (e.g., incomplete LU factorization, diagonal scaling). The preconditioner $P \approx A$ reduces the condition number of $P^{-1} A$.

Connection to Linear Algebra Theory

  • Condition number definition: $\kappa(A) = \|A\|_2 \|A^{-1}\|_2 = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$ (ratio of largest to smallest singular value).

  • Perturbation bound: For $Ax = b$ and perturbation $\Delta b$, the solution error satisfies: \[\frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\Delta b\|}{\|b\|}.\] This bound is tight—there exist perturbations (along the singular vector of $\sigma_{\min}$) where equality holds.

  • Eigenvalue perspective: For symmetric $A$, $\kappa(A) = \frac{|\lambda_{\max}|}{|\lambda_{\min}|}$. Perturbations along the eigenvector of $\lambda_{\min}$ are amplified most ($1 / \lambda_{\min}$); perturbations along the eigenvector of $\lambda_{\max}$ are amplified least ($1 / \lambda_{\max}$).

  • No algorithm can overcome it: For any solver (direct or iterative), the achievable relative error is bounded by: \[\text{relative error} \ge c \cdot \kappa(A) \times \text{machine precision},\] where $c$ is a small constant (typically 1-100). This is a fundamental limit of computation in finite precision, not a limitation of a specific algorithm.

  • Condition number of normal equations: $\kappa(X^\top X) = [\kappa(X)]^2$. For $\kappa(X) = 10^3$, we have $\kappa(X^\top X) = 10^6$. This explains why normal equations are unreliable for ill-conditioned $X$—the condition number explosion makes the system numerically unstable.

Numerical and Implementation Notes

  • Computing condition number: Use np.linalg.cond(A) for the 2-norm condition number. For large matrices, approximate via np.linalg.svd(A, compute_uv=False) and compute the ratio $\sigma_1 / \sigma_n$ manually (avoids computing the full inverse $A^{-1}$).

  • Interpreting condition number:

    • $\kappa(A) < 10$: well-conditioned; numerical errors are small
    • $10 < \kappa(A) < 100$: moderately conditioned; manageable with standard solvers
    • $100 < \kappa(A) < 10^6$: ill-conditioned; use stable solvers (SVD, QR), consider regularization
    • $\kappa(A) > 10^6$: extremely ill-conditioned; must regularize or rethink the problem
  • Machine precision limit: Double precision has $\sim 16$ digits. If $\kappa(A) = 10^{10}$, effective precision is $10^{10} \times 10^{-16} = 10^{-6}$ (only 6 digits accurate), regardless of algorithm.

  • Regularization as numerical stabilization: Ridge ($\lambda > 0$) improves conditioning: $\kappa(X^\top X + \lambda I) = \frac{\sigma_1^2 + \lambda}{\sigma_n^2 + \lambda} < \kappa(X^\top X)$. Choose $\lambda$ to balance statistical bias and numerical stability.

  • SVD is always stable: SVD-based least-squares has error bounded by $\kappa(X) \times \text{machine precision}$ (not squared), making it the preferred method for ill-conditioned $X$.

Numerical and Shape Notes

  • $A \in \mathbb{R}^{2 \times 2}$: diagonal matrix (condition number trivially computed as ratio of diagonal elements)
  • $b \in \mathbb{R}^2$: right-hand side vector
  • $\Delta b \in \mathbb{R}^2$: perturbation vector
  • $x_1, x_2 \in \mathbb{R}^2$: solution vectors before and after perturbation
  • Condition number $\kappa(A) = 10^6$: scalar (unitless)
  • Relative errors: scalars (unitless), ratio of norms

Gotchas:

  1. Diagonal structure is special: For diagonal $A$, $\kappa(A)$ is simply the ratio of largest to smallest diagonal element. General matrices require SVD or eigendecomposition (expensive for large $n$).

  2. Perturbation direction matters: In this example, $\Delta b$ targets the “squashed” direction (smallest eigenvalue). Perturbations in “strong” directions are amplified less. The bound $\kappa(A) \frac{\|\Delta b\|}{\|b\|}$ is a worst-case guarantee.

  3. Relative vs. absolute error: Always use relative errors for assessing numerical stability. Absolute errors are scale-dependent and misleading.

  4. Condition number doesn’t predict wall-clock convergence time: High $\kappa(A)$ makes gradient descent converge slowly, but the effect is polynomial ($O(\sqrt{\kappa(A)})$ iterations for accelerated methods), not exponential. For very large condition numbers, iterative solvers with preconditioning are faster than direct solvers.

  5. SVD is expensive: Computing $\kappa(A)$ via SVD costs $O(n^3)$ for dense matrices (same as solving $Ax = b$ via LU). For quick estimates, use np.linalg.cond(A) which may use cheaper approximations.

Verification checks:

# Verify condition number bound
cond_A = np.linalg.cond(A)
rel_db = np.linalg.norm(db) / np.linalg.norm(b)
rel_dx = np.linalg.norm(x2 - x1) / np.linalg.norm(x1)

print(f"cond(A): {cond_A:.2e}")
print(f"rel_db: {rel_db:.2e}")
print(f"rel_dx: {rel_dx:.2e}")
print(f"Amplification (rel_dx / rel_db): {rel_dx / rel_db:.2e}")
print(f"Upper bound (cond(A) * rel_db): {cond_A * rel_db:.2e}")
# rel_dx should be <= cond(A) * rel_db

# Test perturbations in different directions
print("\nPerturbations in different directions:")
for i in range(2):
    db_i = np.zeros(2)
    db_i[i] = 1e-6
    x_i = np.linalg.solve(A, b + db_i)
    rel_dx_i = np.linalg.norm(x_i - x1) / np.linalg.norm(x1)
    rel_db_i = np.linalg.norm(db_i) / np.linalg.norm(b)
    print(f"Direction {i}: rel_db={rel_db_i:.2e}, rel_dx={rel_dx_i:.2e}, " +
          f"amplification={rel_dx_i/rel_db_i:.2e}")

# Verify diagonal condition number formula
assert np.allclose(cond_A, A[0,0] / A[1,1]), "Diagonal cond formula failed"

# Check residuals for both solutions
print("\nResidual norms:")
print(f"||A x1 - b||: {np.linalg.norm(A @ x1 - b):.2e}")
print(f"||A x2 - (b + db)||: {np.linalg.norm(A @ x2 - (b + db)):.2e}")
# Both should be near machine precision

Typical extended output:

cond(A): 1.000000e+06
rel_db: 7.071068e-07
rel_dx: 1.000000e-06
Amplification (rel_dx / rel_db): 1.414214e+00
Upper bound (cond(A) * rel_db): 7.071068e-01

Perturbations in different directions:
Direction 0: rel_db=7.071068e-07, rel_dx=7.071068e-07, amplification=1.000000
Direction 1: rel_db=7.071068e-07, rel_dx=7.071068e-06, amplification=10.000000

Residual norms:
||A x1 - b||: 2.22e-16
||A x2 - (b + db)||: 2.22e-16

Notice that Direction 1 (along the smallest eigenvalue) is amplified 10x more than Direction 0. The observed amplification $\sim 1.4$ is much less than the worst-case bound $10^6$ because the Euclidean norm averages over both directions.

Pedagogical Significance

This example is the canonical demonstration of condition number and perturbation sensitivity:

Key takeaways: 1. Condition number is concrete, not abstract: $\kappa(A)$ is an exact multiplier of error amplification, observable and reproducible in code. 2. Ill-conditioning is inevitable for some problems: You can’t escape it; it’s a property of the matrix, not the solver. 3. No algorithm can overcome the fundamental limit: All solvers produce relative error $\sim \kappa(A) \times \text{machine precision}$. 4. Regularization is a numerical necessity: Ridge ($\lambda I$), early stopping, and other regularization improve conditioning artificially. 5. Normal equations are dangerous: Squaring the condition number makes even well-conditioned $X$ problematic.

Common misconceptions addressed: - “Larger matrices are more ill-conditioned”: No—conditioning depends on eigenvalue/singular value spread, not size. - “Better solvers overcome ill-conditioning”: No—fundamental bounds apply to all solvers (LU, QR, SVD, iterative). - “Double precision solves numerical issues”: No—for $\kappa(A) > 10^{12}$, precision is exhausted regardless. - “Condition number is just theory”: No—it’s a concrete predictor of solution sensitivity, easily computed and verified empirically. - “Iterative solvers avoid conditioning issues”: No—they converge slower for ill-conditioned systems; preconditioning helps.

Connection to other chapters: - Chapter 8 (Eigenvalues): Condition number is the ratio of largest to smallest eigenvalue magnitude. - Chapter 10 (SVD): SVD exposes condition number directly via singular values. - Chapter 12 (Least-squares): Normal equations square the condition number; SVD avoids this. - Chapter 13 (Solving systems): All decompositions (LU, QR, Cholesky, SVD) produce error bounded by $\kappa(A)$. - Chapter 14 (Conditioning): This is the canonical example for understanding perturbation bounds and error amplification.

Why this snippet is pedagogically powerful: It demonstrates that condition number is not abstract mathematics—it’s a concrete, measurable, reproducible phenomenon. A simple diagonal matrix with condition number 1 million produces dramatic error amplification that’s entirely predictable from $\kappa(A)$. This pattern repeats throughout ML: ill-conditioning manifests as slow convergence, numerical instability, and poor generalization. Understanding this example deeply—why the amplification occurs, how condition number predicts it, why regularization helps—is foundational for building robust ML systems.

History and Applications

Origins of Condition Number (1948–1963): Alan Turing (1948, “Rounding errors in matrix processes”) introduced the concept of condition number as a tool to understand why numerical algorithms failed for seemingly innocuous problems. He recognized that the fundamental limit of accuracy for solving $Ax = b$ is not determined by the algorithm alone, but by the intrinsic sensitivity of the problem (encoded in $\kappa(A)$) multiplied by machine precision. Wilkinson (1961-1963) developed comprehensive perturbation theory, proving that condition number is unavoidable—no algorithm can guarantee relative error better than $\kappa(A) \times \text{machine precision}$. This was revolutionary: it showed that some problems are inherently difficult to solve numerically, regardless of algorithmic cleverness.

Numerical Computing Boom (1960s–1980s): Digital computers revealed the impact of ill-conditioning. Early practitioners discovered that simple problems (e.g., $10 \times 10$ Hilbert matrices, polynomial interpolation) produced wildly inaccurate solutions despite using high-precision arithmetic. Condition number became the standard diagnostic tool—when a solver failed, computing $\kappa(A)$ explained why and guided solution strategies (better solvers, preconditioning, regularization). Golub & Van Loan’s “Matrix Computations” (1983, 2013) standardized condition number analysis, making it central to numerical linear algebra education.

Statistics and Regression (1970s–1990s): Statisticians discovered that feature collinearity (high $\kappa(X)$ for design matrix) causes unstable regression coefficient estimates. Ridge regression (Hoerl & Kennard, 1970) was proposed as a solution: adding $\lambda I$ to $X^\top X$ reduces condition number, trading bias for numerical stability. Cross-validation became the standard tool for tuning $\lambda$. Practitioners learned: feature scaling and centering improve conditioning, and large condition numbers are symptoms of collinearity, not solver failure.

Machine Learning Era (1990s–present): Neural networks suffer from ill-conditioning at every stage: (1) Feature conditioning: raw features have vastly different scales; standardization improves $\kappa(X)$. (2) Gradient conditioning: Hessians become severely ill-conditioned near saddle points or in overparameterized regimes; this explains slow convergence and why momentum/adaptive learning rates help. (3) Kernel matrices: RBF kernels create ill-conditioned Gram matrices; adding diagonal regularization (“nugget”) is standard practice in Gaussian processes. (4) Batch normalization: normalizing layer activations keeps condition numbers manageable, enabling deeper networks. Modern optimization (Adam, RMSprop) indirectly addresses conditioning via adaptive learning rates.

Contemporary Applications: Feature preprocessing: Standardization (zero mean, unit variance) and whitening (decorrelation via PCA/Cholesky) improve conditioning for all ML models (regression, classification, clustering). Hyperparameter tuning: Cross-validation finds the regularization strength $\lambda$ that balances bias and numerical stability; large $\lambda$ overregularizes (high bias), small $\lambda$ causes instability. Ill-posed inverse problems: Medical imaging, geophysics, and signal processing deal with severely ill-conditioned systems (condition number $10^{10}+$). Tikhonov regularization $(A^\top A + \lambda I)^{-1} A^\top b$ or truncated SVD solvers are standard. Deep learning initialization: Xavier/Glorot initialization scales weights to maintain reasonable conditioning of Jacobians during forward/backward passes, enabling stable training of deep networks.

Quantum Computing and Beyond: Quantum algorithms for linear systems (HHL algorithm, Harrow et al. 2009) can solve $Ax = b$ faster than classical methods, but are sensitive to ill-conditioning—quantum computers inherit the same fundamental limitations. Federated learning: Privacy-preserving optimization requires solvers that work with perturbed/noisy data; conditioning affects noise amplification, determining privacy-utility trade-offs. Explainable AI: Understanding why models make certain predictions sometimes requires solving ill-posed inverse problems; condition number determines solution stability. Future directions: Low-precision training (FP16, bfloat16) for neural networks requires algorithms that maintain numerical stability despite reduced precision. Preconditioning and adaptive regularization become critical. Understanding condition number—why it matters, how it manifests in practice, how to combat it—remains foundational for reliable ML engineering, from classical optimization to quantum computing and privacy-preserving algorithms.

Connection to Broader Examples
  • Chapter 8 (Eigenvalues): Condition number is the ratio of largest to smallest eigenvalue magnitude; eigenvalues reveal conditioning directly.
  • Chapter 10 (SVD): SVD exposes condition number via singular values; singular value ratio $\sigma_1 / \sigma_n$ equals $\kappa(A)$.
  • Chapter 12 (Least-squares): Normal equations $(X^\top X) \hat{w} = X^\top y$ have condition number $[\kappa(X)]^2$, explaining why SVD/QR are preferred.
  • Chapter 13 (Solving systems): Every decomposition (LU, QR, Cholesky, SVD) produces solutions with error bounded by $\kappa(A)$—no escape from the fundamental limit.
  • Chapter 14 (Conditioning): This example is the canonical case study for condition number, perturbation bounds, and error amplification.
  • Ridge regression: Adding $\lambda I$ to $X^\top X$ reduces condition number from $[\kappa(X)]^2$ to $\frac{\sigma_1^2 + \lambda}{\sigma_n^2 + \lambda}$ (much improved).
  • Feature scaling: Standardizing features to unit variance reduces $\kappa(X)$, improving numerical stability.
  • Gradient descent and ill-conditioning: Ill-conditioned Hessians cause slow convergence; condition number determines the ratio of largest to smallest curvatures, controlling convergence rates.

Comments