Condition number theory emerged from Turingâs pioneering 1948 paper Rounding-Off Errors in Matrix Processes, which introduced the concept of âconditionâ to quantify how errors propagate through matrix computations. Wilkinsonâs 1963 work Rounding Errors in Algebraic Processes formalized the perturbation bounds and established condition number as the central measure of numerical stability. By the 1970s, condition number analysis became the foundation of numerical linear algebra, guiding algorithm design (prefer QR/SVD over LU for least squares, use pivoting to reduce $\kappa$, regularize near-singular systems). In modern ML, ill-conditioning manifests as âhard to trainâ phenomena: gradient descent oscillates wildly on loss surfaces with large Hessian condition number ($\kappa(H) \sim 10^6$ common in deep networks), least-squares solutions explode when features have mismatched scales (age in years vs. income in dollars â $\kappa(X) \sim 10^4$), and adversarial perturbations exploit ill-conditioned weight matrices to flip predictions. The standard remediesâfeature normalization (standardize to zero mean, unit variance), ridge regularization (add $\lambda I$ to floor eigenvalues), batch normalization (implicitly improve Hessian conditioning), adaptive optimizers (Adam/RMSprop approximate $H^{-1/2}$), and spectral normalization (bound $\|W\|_2 = 1$)âare all conditioning repair mechanisms that reduce $\kappa$ to stabilize training. Understanding condition number as error amplification is essential for diagnosing training failures, choosing solvers (SVD vs. normal equations), and designing robust ML systems.
- Log in to post comments
This example demonstrates how condition number directly predicts error amplification in linear systems $Ax = b$. By constructing a deliberately ill-conditioned diagonal matrix ($\kappa(A) = 10^6$) and measuring how a tiny perturbation $\delta b = [0, 10^{-6}]^\top$ changes the solution, we empirically verify the perturbation bound $\frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\delta b\|}{\|b\|}$. The key insight: condition number is not an abstract theoretical curiosityâitâs the amplification factor mapping input noise to output errors. For $\kappa(A) = 10^6$, a $10^{-6}$ relative input perturbation can produce a relative output change up to $\sim 1$ (100% error). This motivates all subsequent conditioning-aware practices in ML: feature scaling (reduce $\kappa(X)$), ridge regularization (add $\lambda I$ to floor eigenvalues), SVD/QR over normal equations (avoid squaring $\kappa$), and adaptive optimizers (precondition ill-conditioned Hessians). The diagonal matrix makes the mechanism transparentâno SVD required to understand how small eigenvalues cause amplification.
Solve Ax=b, perturb b, and measure solution sensitivity; compare to cond(A).
For a linear system $Ax = b$ with perturbation $\delta b$ in the right-hand side, the condition number perturbation bound is:
\[ \frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\delta b\|}{\|b\|}, \]
where $\Delta x = x_{\text{perturbed}} - x_{\text{original}}$ and $\kappa(A) = \|A\| \|A^{-1}\|$ is the condition number. For the 2-norm, $\kappa(A) = \sigma_{\max}(A) / \sigma_{\min}(A)$ (ratio of largest to smallest singular values).
The bound states: relative output error is at most the condition number times the relative input error. For ill-conditioned systems ($\kappa(A) \gg 1$), small input perturbations are amplified by up to $\kappa(A)$ in the output.
For the diagonal matrix $A = \text{diag}(1, 10^{-6})$:
\[ \kappa(A) = \frac{\lambda_{\max}}{\lambda_{\min}} = \frac{1}{10^{-6}} = 10^6. \]
This predicts that a $10^{-6}$ relative perturbation in $b$ could cause up to a factor-of-$10^6$ amplification (though the actual amplification depends on perturbation alignment with eigenvectors).
Notation: - $A \in \mathbb{R}^{d \times d}$: square matrix - $x, b, \delta b \in \mathbb{R}^d$: solution, right-hand side, perturbation vectors - $\kappa(A) \in \mathbb{R}$: condition number (scalar) - $\sigma_{\max}, \sigma_{\min}$: largest and smallest singular values - $\|\cdot\|$: vector/matrix norm (typically 2-norm)
import numpy as np
A = np.array([[1., 0.],
[0., 1e-6]])
b = np.array([1., 1.])
db = np.array([0., 1e-6])
x1 = np.linalg.solve(A, b)
x2 = np.linalg.solve(A, b+db)
rel_db = np.linalg.norm(db)/np.linalg.norm(b)
rel_dx = np.linalg.norm(x2-x1)/np.linalg.norm(x1)
print("cond(A):", np.linalg.cond(A))
print("rel_db:", rel_db)
print("rel_dx:", rel_dx)This code empirically demonstrates how the condition number $\kappa(A)$ of a matrix governs the amplification of perturbations through linear systems. It constructs a simple 2Ã2 ill-conditioned diagonal matrix, measures how a tiny perturbation in the right-hand side propagates to the solution, and verifies the theoretical bound $\frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\delta b\|}{\|b\|}$.
Part 1: Anatomy of an Ill-Conditioned System
The matrix $A = \begin{bmatrix} 1 & 0 \\ 0 & 10^{-6} \end{bmatrix}$ is diagonal with extreme eigenvalue disparity. The first eigenvalue is $\lambda_1 = 1$ (well-scaled), while the second is $\lambda_2 = 10^{-6}$ (extremely smallâsix orders of magnitude smaller). This yields a condition number:
\[ \kappa(A) = \frac{\lambda_{\max}}{\lambda_{\min}} = \frac{1}{10^{-6}} = 10^6 \]
Geometric interpretation: The level sets of $\|Ax - b\|_2$ are ellipsoids with semi-axes proportional to $1/\sigma_i$. For $\sigma_2 = 10^{-6}$, one axis is $10^6$ times longer than the otherâa highly elongated âneedle-shapedâ feasible region. Any noise in the right-hand side $b$ along the narrow direction (second coordinate) causes large shifts in the optimal $x$ along that direction.
Why diagonal? Two reasons: (1) Pedagogical clarity: For diagonal matrices, the condition number is transparentâno SVD needed, just the ratio of diagonal entries. This makes the connection between eigenvalues and conditioning immediately obvious. (2) Numerical simplicity: Diagonal systems solve trivially ($x_i = b_i / A_{ii}$), eliminating algorithmic complexity and isolating the pure effect of conditioning.
Real-world ML analogs: (1) Feature matrices $X \in \mathbb{R}^{n \times d}$ where features have vastly different scales (e.g., age vs. income). The Gram matrix $X^\top X$ inherits squared conditioning $\kappa(X^\top X) = [\kappa(X)]^2$. (2) Neural network Hessians $H = \nabla^2_W \mathcal{L}$ in deep nets, where gradient flow causes some eigenvalues to shrink to $\sim 10^{-8}$ (vanishing gradients) while others remain $\sim 1$ (well-scaled layers). (3) Covariance matrices in PCA where a few principal components dominate ($\sigma_1 \gg \sigma_d$), making the matrix nearly rank-deficient.
Part 2: Perturbation Experiment and Verification
The code performs a controlled amplification experiment:
Baseline system: Solve $Ax_1 = b$ where $b = [1, 1]^\top$. For the diagonal matrix, the solution is trivial: $x_1 = [b_1/A_{11}, b_2/A_{22}] = [1/1, 1/10^{-6}] = [1, 10^6]^\top$. The second component is already huge because weâre âdividing by a tiny number.â
Perturbed system: Introduce a tiny perturbation $\delta b = [0, 10^{-6}]^\top$, aligned with the weak eigenvalue direction (second coordinate). Solve $Ax_2 = b + \delta b$ to get $x_2 = [1, (1 + 10^{-6})/10^{-6}] = [1, 10^6 + 1]^\top$. The solution change is $\Delta x = x_2 - x_1 = [0, 1]^\top$.
Relative perturbation measurements:
- Input: $\text{rel\_db} = \|\delta b\|_2 / \|b\|_2 = 10^{-6} / \sqrt{2} \approx 7.07 \times 10^{-7}$ (sub-parts-per-million noise).
- Output: $\text{rel\_dx} = \|\Delta x\|_2 / \|x_1\|_2 = 1 / 10^6 = 10^{-6}$ (because $\|\Delta x\| = 1$ and $\|x_1\| \approx 10^6$).
Amplification factor: $\text{rel\_dx} / \text{rel\_db} = 10^{-6} / (7 \times 10^{-7}) \approx 1.4$. This is much smaller than the condition number $\kappa(A) = 10^6$.
Why is amplification only 1.4, not $10^6$? The theoretical bound $\|\Delta x\| / \|x\| \le \kappa(A) \|\delta b\| / \|b\|$ is a worst-case upper bound. The actual amplification depends on: (1) Direction alignment: The perturbation $\delta b = [0, 10^{-6}]^\top$ aligns perfectly with the smallest eigenvector (second coordinate), causing maximal amplification in absolute terms ($\delta x_2 = 10^{-6} / 10^{-6} = 1$). However⦠(2) Large solution norm: The baseline solution $x_1 = [1, 10^6]^\top$ has $\|x_1\|_2 \approx 10^6$, dominated by the already-amplified second component. The relative change $\|\Delta x\| / \|x_1\|$ is diluted by this large norm. In other words, weâre measuring a unit absolute change ($|\Delta x_2| = 1$) relative to a $10^6$-scale baseline ($|x_{1,2}| = 10^6$), yielding $1/10^6 = 10^{-6}$.
Key ML lesson: Relative errors can be deceptively small when the solution norm is large, even though absolute errors show full amplification. For gradient descent, absolute step sizes $\|\Delta w\|$ matterâill-conditioned Hessians cause large absolute updates in some directions even if relative changes look small. This manifests as oscillations or divergence.
Verification: The code prints cond(A) = 1e6, rel_db â 7e-07, rel_dx â 1e-06. The ratio confirms that perturbations amplify by $\sim 1.4 \times$ in relative terms (subdued by large $\|x_1\|$), but the absolute change $|\Delta x_2| = 1$ shows full $10^6$ amplification in the weak direction ($10^{-6} \to 1$).
Why This Matters for ML
Four critical ML manifestations of conditioning:
Gradient descent instability: For quadratic loss $f(w) = \frac{1}{2} w^\top H w - g^\top w$, gradient descent converges at rate $O(\kappa(H) \log(1/\epsilon))$. If $\kappa(H) = 10^6$, it takes $\sim 10^6$ iterations to reach $10^{-6}$ accuracy. Worse, stochastic gradients (mini-batch noise) introduce $\sim 10^{-3}$ perturbations, which get amplified $\sim 10^6 \times$ in ill-conditioned directions, causing oscillations. Adaptive optimizers (Adam, RMSprop) approximate $H^{-1/2}$, reducing effective $\kappa \approx 1$.
Feature scaling as conditioning repair: Unnormalized features with scales ranging from $[0, 1]$ (binary) to $[0, 100000]$ (currency) yield $\kappa(X) \sim 10^5$. Normal equations $X^\top X$ square this to $\kappa \sim 10^{10}$âexceeding the stability threshold for double precision. Standardization $(X_{:,j} - \mu_j) / \sigma_j$ brings all features to unit variance, reducing $\kappa(X)$ to $\sim 10$ (only the intrinsic correlation structure remains). scikit-learnâs
StandardScalerperforms this by default for precisely this reason.Ridge regularization floors eigenvalues: Ridge $(X^\top X + \lambda I) w = X^\top y$ adds $\lambda$ to all eigenvalues of $X^\top X$. If $\lambda_{\min}(X^\top X) \sim 10^{-6}$ and $\lambda_{\max} \sim 1$, then $\kappa(X^\top X) \sim 10^6$ (this example). Adding $\lambda = 10^{-3}$ floors the spectrum: $\kappa(X^\top X + \lambda I) \le 1 / (10^{-6} + 10^{-3}) \approx 10^3$, improving conditioning by $10^3 \times$ and recovering 3 digits of precision.
Adversarial robustness and ill-conditioned classifiers: In neural networks, weight matrices $W$ with $\kappa(W) \sim 10^4$ amplify input perturbations $\delta x$: $\|\Delta y\| = \|W \delta x\| \le \|W\|_2 \|\delta x\|$. For $\|W\|_2 = \sigma_{\max}(W) \sim 10^4$ and $\|\delta x\| = 10^{-3}$, adversarial examples cause $\|\Delta y\| \sim 10$ (flipping predictions). Spectral normalization ($W / \sigma_{\max}(W)$) bounds $\|W\|_2 = 1$, limiting amplification to $\|\Delta y\| \le \|\delta x\|$. Used in GANs (Miyato et al., 2018) and robust certification.
Numerical and Shape Notes
Shape discipline: $A \in \mathbb{R}^{2 \times 2}$ (diagonal), $b, \delta b, x_1, x_2 \in \mathbb{R}^2$. Condition number $\kappa(A) \in \mathbb{R}$ (scalar). Norms $\|\cdot\|_2 \in \mathbb{R}$ (scalars). Check dimensions before computing:
A.shape = (2, 2),b.shape = (2,),np.linalg.cond(A).shape = ().Gotcha 1: Condition number norm:
np.linalg.cond(A)computes $\kappa_2(A) = \sigma_{\max} / \sigma_{\min}$ (2-norm, spectral) by default. Other norms:np.linalg.cond(A, p=1)(1-norm),p=np.inf(infinity-norm),p='fro'(Frobeniusârarely used for conditioning). For diagonal matrices, all norms agree: $\kappa_p(D) = |\lambda_{\max}| / |\lambda_{\min}|$.Gotcha 2: Diagonal matrices donât need SVD: For general $A$, computing $\kappa(A)$ requires full SVD ($O(d^3)$). For diagonal $A$, eigenvalues are the diagonal entriesâ$O(d)$ read. This is why we use diagonal matrices for pedagogy: transparent conditioning.
Gotcha 3: Perturbation direction matters: Perturbations aligned with the smallest singular vector $v_d$ (corresponding to $\sigma_{\min}$) achieve the worst-case amplification $\kappa(A)$. Random perturbations see average-case $\sim \sqrt{\kappa(A)}$ (much milder). In this example, $\delta b = [0, 10^{-6}]^\top$ aligns with $v_2 = [0, 1]^\top$, the weak direction.
Gotcha 4: Relative vs. absolute errors: Relative errors $\|\Delta x\| / \|x\|$ can be misleadingly small when $\|x\|$ is large, or misleadingly large when $\|x\|$ is small. Always check both absolute ($\|\Delta x\|$) and relative metrics. For gradient descent, absolute step sizes matter most.
Gotcha 5: Machine precision limits: Double precision (float64) has $\epsilon_{\text{mach}} \approx 2.22 \times 10^{-16}$. Roundoff error in solving $Ax = b$ is $\sim \kappa(A) \epsilon_{\text{mach}}$. For $\kappa = 10^{12}$, expect error $\sim 10^{-4}$ (losing 12 digits from 16). For $\kappa > 10^{15}$, solutions are essentially random noise. Mitigation: (1) Regularize: add $\lambda I$ to floor eigenvalues. (2) Use higher precision:
np.float128(quad, ~32 digits) ormpmath(arbitrary). (3) Iterative refinement.
Connection to Linear Algebra Theory
Perturbation bounds: For $Ax = b$ and perturbed $(A + \Delta A)(x + \Delta x) = b + \Delta b$, first-order analysis yields: \[ \frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \left( \frac{\|\Delta A\|}{\|A\|} + \frac{\|\Delta b\|}{\|b\|} \right) \] This bound is tight (achieved for specific perturbations). The condition number $\kappa(A) = \|A\| \|A^{-1}\|$ is the amplification factor.
Eigenvalue interpretation: For symmetric $A$ with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_d > 0$, the inverse $A^{-1}$ has eigenvalues $1/\lambda_i$. Small $\lambda_d$ (near-zero) inverts to large $1/\lambda_d$ (near-infinity), causing amplification. Condition number $\kappa = \lambda_1 / \lambda_d$ quantifies this.
SVD perspective: Any matrix $A = U \Sigma V^\top$ has $\kappa(A) = \sigma_{\max} / \sigma_{\min}$. Small $\sigma_{\min}$ indicates near-rank-deficiency (almost singular). The pseudoinverse $A^+ = V \Sigma^+ U^\top$ inverts non-zero singular values, providing minimum-norm solutions for ill-conditioned systems.
Geometric view: Level sets of $\|Ax - b\|_2$ are ellipsoids with semi-axes $1/\sigma_i$. For ill-conditioned $A$, these ellipsoids are extremely elongated. Small changes in $b$ cause large shifts in $x$ along the narrow directions (corresponding to small $\sigma_i$). The condition number measures the ellipsoidâs aspect ratio.
ML Examples and Patterns
Feature normalization (StandardScaler): Unnormalized features â $\kappa(X) \sim 10^5$. Standardize:
X_norm = (X - X.mean(0)) / X.std(0)â $\kappa(X_{\text{norm}}) \sim 10$. Cross-validate: computenp.linalg.cond(X)before/after.Ridge regression stabilization: Ill-conditioned $X^\top X$ ($\kappa \sim 10^{12}$) â add $\lambda I$:
np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y). Condition number: $\kappa(X^\top X + \lambda I) \le \sigma_{\max}^2 / \lambda$. Cross-validate $\lambda$ via k-fold CV.Gradient descent learning rate: For quadratic $f(w)$, optimal $\eta = 2 / (\lambda_{\min} + \lambda_{\max})$. For $\kappa = 10^6$, $\eta \approx 2 / \lambda_{\max}$ (small). Momentum accelerates by factor $\sim \sqrt{\kappa}$. Adam adaptively estimates $H^{-1/2}$, reducing $\kappa_{\text{effective}} \approx 1$.
Batch normalization: Normalizes activations $\hat{x} = (x - \mathbb{E}[x]) / \sqrt{\text{Var}[x] + \epsilon}$ per layer. Implicitly improves Hessian conditioning, enabling learning rates $10-100 \times$ higher.
Spectral normalization for GANs: $W_{\text{SN}} = W / \sigma_{\max}(W)$ ensures $\|W\|_2 = 1$, bounding Lipschitz constant. Prevents conditioning from exploding during adversarial training.
Condition number monitoring: Track $\kappa(H)$ via power iteration or Lanczos during training. If $\kappa > 10^6$, switch to adaptive optimizers or add regularization. PyTorch:
torch.linalg.cond(H)(expensive for large $H$âuse Hutchinson trace estimator for efficiency).
Pedagogical Significance
This example is pedagogically central for five reasons:
Concrete empirical verification: Students see that $\kappa(A) = 10^6$ directly predicts that $10^{-6}$ input noise causes unit-scale output changes. This visceral demonstration makes condition number tangibleânot just an abstract bound, but a measured amplification factor.
Diagonal transparency: Diagonal matrices make conditioning obviousâno SVD, no matrix factorizations, just eigenvalue ratios. This isolates the concept from algorithmic complexity.
Relative vs. absolute errors: Highlights that relative errors can be deceptively small when $\|x\|$ is large. The absolute change $|\Delta x_2| = 1$ shows full $10^6$ amplification, even though relative change is $10^{-6}$. Essential for understanding gradient descent behavior.
Motivates all conditioning-aware ML practices: Feature scaling, ridge regularization, SVD/QR solver choice, adaptive optimizers, spectral normalizationâall target reducing $\kappa$ to prevent the amplification demonstrated here.
Common misconceptions addressed: (a) âCondition number is just theoryââNo, itâs a measured amplification factor. (b) âSmall perturbations always cause small errorsââOnly for $\kappa \sim 1$; ill-conditioned systems amplify exponentially. (c) âDouble precision is always enoughââFor $\kappa > 10^{15}$, solutions are noise. (d) âIterative solvers avoid conditioningââNo, CG converges $\propto \sqrt{\kappa}$; preconditioning mandatory.
Connection to other examples: Example 74 (condition number squaring $\kappa(X^\top X) = [\kappa(X)]^2$), Example 76 (SVD vs. normal equations), Example 77 (Cholesky stability for well-conditioned SPD), Example 75 (PCA reveals ill-conditioning via small singular values). This example provides the empirical foundation for understanding why these practices matter.
Step-by-step execution of the conditioning amplification experiment:
Construct ill-conditioned diagonal matrix:
A = np.array([[1., 0.], [0., 1e-6]])creates $A = \text{diag}(1, 10^{-6}) \in \mathbb{R}^{2 \times 2}$ with eigenvalues $\lambda_1 = 1$ and $\lambda_2 = 10^{-6}$. Condition number: $\kappa(A) = 1 / 10^{-6} = 10^6$ (computed vianp.linalg.cond(A)). This represents severe ill-conditioningâthe second coordinate direction is nearly degenerate.Define baseline right-hand side:
b = np.array([1., 1.])sets $b = [1, 1]^\top \in \mathbb{R}^2$. Norm: $\|b\|_2 = \sqrt{2} \approx 1.414$.Define perturbation:
db = np.array([0., 1e-6])creates $\delta b = [0, 10^{-6}]^\top$, a tiny perturbation aligned with the weak eigenvalue direction (second coordinate). Norm: $\|\delta b\|_2 = 10^{-6}$.Solve baseline system:
x1 = np.linalg.solve(A, b)computes $x_1 = A^{-1} b = [1, 10^6]^\top$ (diagonal inversion is element-wise: $x_1[0] = 1/1 = 1$, $x_1[1] = 1/10^{-6} = 10^6$). Norm: $\|x_1\|_2 \approx 10^6$.Solve perturbed system:
x2 = np.linalg.solve(A, b+db)computes $x_2 = A^{-1}(b + \delta b) = [1, 10^6 + 1]^\top$ (second component: $(1 + 10^{-6})/10^{-6} = 10^6 + 1$).Measure relative perturbations: (a) Input:
rel_db = np.linalg.norm(db) / np.linalg.norm(b) â 10^{-6} / \sqrt{2} â 7.07 \times 10^{-7}. (b) Output:rel_dx = np.linalg.norm(x2-x1) / np.linalg.norm(x1) = 1 / 10^6 = 10^{-6}(since $\Delta x = [0, 1]^\top$, $\|\Delta x\| = 1$).Interpretation: The amplification factor is $10^{-6} / (7 \times 10^{-7}) \approx 1.4$, much smaller than $\kappa(A) = 10^6$ because the large norm of $x_1$ (dominated by the $10^6$ component) dilutes the relative error. The absolute change ($|\Delta x_2| = 1$) shows full $10^6$ amplification in the weak direction ($10^{-6} \to 1$), but in relative terms, itâs masked by $\|x_1\| \approx 10^6$.
- Condition number as error amplification: For $\kappa(A) = 10^6$, the perturbation bound $\frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\delta b\|}{\|b\|}$ predicts that a $10^{-6}$ relative input perturbation can cause up to a unit-scale (100%) relative output changeâthis snippet empirically verifies the bound by measuring
rel_dx â 1e-06forrel_db â 7e-07, confirming $\sim 1.4$ amplification (smaller than worst-case $10^6$ due to large $\|x\|$ diluting relative error). - Diagonal matrices reveal conditioning directly: For $A = \text{diag}(\lambda_1, \ldots, \lambda_d)$, the condition number is simply $\kappa(A) = |\lambda_{\max}| / |\lambda_{\min}|$ (no SVD required)âhere $1 / 10^{-6} = 10^6$ shows severe ill-conditioning from the near-zero second eigenvalue, making the mechanism transparent.
- Perturbation alignment matters: The bound is a worst-case estimate achieved when $\delta b$ aligns with the smallest singular vectorâhere $\delta b = [0, 10^{-6}]^\top$ perfectly aligns with the weak direction (second coordinate), producing maximal amplification in absolute terms ($\delta x_2 = 1$), though relative error is masked by the large norm $\|x_1\| \approx 10^6$.
- Relative vs. absolute errors: The code measures both: absolute change $\|\Delta x\| = 1$ (full amplification $10^{-6} \to 1$) vs. relative change $\|\Delta x\| / \|x\| = 10^{-6}$ (small because $\|x\| \approx 10^6$ dominates)âalways check both metrics, as relative errors can be misleading when $\|x\|$ is large.
- Machine precision limits: For $\kappa(A) = 10^{12}$, solving $Ax = b$ in double precision (16 digits) loses 12 digits, leaving only ~4 correctâfor $\kappa > 10^{15}$, solutions are essentially random noise, necessitating regularization ($A + \epsilon I$) or higher precision.
- ML implications: This amplification pattern explains gradient descent instability (ill-conditioned Hessian $\kappa(H) \gg 1$ causes oscillations), feature scaling necessity (unmormalized $X$ yields $\kappa(X) \sim 10^4$), ridge regularization benefits (add $\lambda I$ to improve $\kappa$), and adversarial vulnerability (ill-conditioned $W$ amplifies input perturbations $\delta x$ to large output changes $W \delta x$).
Part 1: Anatomy of an Ill-Conditioned System
The matrix $A = \text{diag}(1, 10^{-6})$ is diagonal with extreme eigenvalue disparity: $\lambda_1 = 1$ and $\lambda_2 = 10^{-6}$. Condition number: $\kappa(A) = 1 / 10^{-6} = 10^6$. Geometric interpretation: The level sets of $\|Ax - b\|_2$ are ellipses with semi-axes proportional to $1/\sigma_i$. For $\sigma_2 = 10^{-6}$, one axis is $10^6$ times longer than the otherâa âneedle-shapedâ feasible region where tiny changes in $b$ cause large shifts in the optimal $x$ along the narrow direction (second coordinate). The system is severely ill-conditionedâthe second coordinate direction is nearly degenerate (scaling by $10^{-6}$), making solutions hypersensitive to noise in that direction. For diagonal matrices, conditioning is transparent: no SVD needed, just eigenvalue ratio $\lambda_{\max} / \lambda_{\min}$. This pedagogical choice makes the amplification mechanism obvious.
Part 2: Perturbation Experiment and Amplification
The code solves $Ax_1 = b$ where $b = [1, 1]^\top$ (baseline) and $Ax_2 = b + \delta b$ where $\delta b = [0, 10^{-6}]^\top$ (perturbed). For diagonal $A$, solutions are trivial: $x_1 = [1, 10^6]^\top$ and $x_2 = [1, 10^6 + 1]^\top$. Solution change: $\Delta x = [0, 1]^\top$. Relative perturbations: Input: $\|\delta b\| / \|b\| = 10^{-6} / \sqrt{2} \approx 7 \times 10^{-7}$. Output: $\|\Delta x\| / \|x_1\| = 1 / 10^6 = 10^{-6}$. Amplification factor: $10^{-6} / (7 \times 10^{-7}) \approx 1.4$, much smaller than $\kappa(A) = 10^6$. Why? The perturbation $\delta b = [0, 10^{-6}]^\top$ aligns perfectly with the weak eigenvalue direction (second coordinate), causing maximal amplification in absolute terms ($\delta x_2 = 10^6 \cdot 10^{-6} = 1$), but the large norm of $x_1$ (dominated by the $10^6$ component) dilutes the relative error. This illustrates that the conditioning bound $\kappa(A) \|\delta b\| / \|b\|$ is a worst-case upper boundâactual amplification depends on how perturbations align with singular vectors.
Part 3: Absolute vs. Relative Errors and Precision Limits
The code measures both absolute ($\|\Delta x\| = 1$) and relative ($\|\Delta x\| / \|x_1\| = 10^{-6}$) errors. Key observation: Absolute change shows full $10^6$ amplification in the weak direction ($10^{-6} \to 1$), but relative change is small because $\|x_1\| \approx 10^6$ dominates. Always check both metricsârelative errors can be misleadingly small when $\|x\|$ is large, or misleadingly large when $\|x\|$ is small. Machine precision limits: Double precision has ~15-16 decimal digits. For $\kappa(A) = 10^{12}$, solving $Ax = b$ loses 12 digits of accuracy (only ~3-4 correct digits remain). For $\kappa(A) > 10^{15}$, solutions are essentially random noise. Mitigation: (1) Regularize: add $\lambda I$ to floor eigenvalues, reducing $\kappa(A + \lambda I) \le \sigma_{\max} / \lambda$. (2) Use higher precision: np.float128 (quad precision, ~32 digits) or arbitrary-precision libraries (mpmath). (3) Iterative refinement: solve $Ax_0 = b$ (inaccurate), compute residual $r_0 = b - Ax_0$ in double precision, solve $A \delta x = r_0$, update $x_1 = x_0 + \delta x$ (can recover ~8-10 lost digits).
Why This Matters for ML
Gradient descent on ill-conditioned loss: For quadratic $f(w) = \frac{1}{2} w^\top H w - g^\top w$, convergence rate is $O(\kappa(H) \log(1/\epsilon))$. If $\kappa(H) = 10^6$, small noise in gradients (from stochastic sampling) causes oscillations. Feature scaling as conditioning repair: Unnormalized features (age vs. income) yield $\kappa(X) \sim 10^4$. Standardizing: $X_{:,j} \leftarrow (X_{:,j} - \mu_j) / \sigma_j$ reduces $\kappa(X)$ to $\sim 10$, preventing amplification. Ridge regularization floors eigenvalues: $(X^\top X + \lambda I) w = X^\top y$ improves conditioning: $\kappa(X^\top X + \lambda I) \le \sigma_{\max}^2 / \lambda$. For $\lambda \sim \sigma_{\min}^2$, this reduces $\kappa$ from $10^{12}$ to $10^6$. Adversarial robustness: Ill-conditioned $W$ amplifies $\delta x$: $\Delta y = W \delta x$. If $\kappa(W) = 10^4$, a $10^{-3}$ perturbation can flip predictions. Spectral normalization ($W / \sigma_{\max}(W)$) bounds $\|W\|_2 = 1$, limiting amplification. Numerical precision: For $\kappa(X^\top X) = 10^{12}$ (normal equations), solutions lose ~12 digitsâthis is why SVD/QR avoid forming $X^\top X$ explicitly.
ML Examples and Patterns
Feature normalization: Unnormalized $X$ with features spanning $[0, 100000]$ vs. $[0, 1]$ yields $\kappa(X) \sim 10^5$. Standardization reduces to $\kappa \sim 10$. Code: X_norm = (X - X.mean(0)) / X.std(0). Ridge regression stabilization: Ill-conditioned $X^\top X$ ($\kappa \sim 10^{12}$) add $\lambda I$: $\kappa(X^\top X + \lambda I) \le \sigma_{\max}^2 / \lambda$. Cross-validate $\lambda$ via k-fold CV. Gradient descent learning rate: For $\kappa(H) = 10^6$, optimal $\eta \sim 2 / (\lambda_{\min} + \lambda_{\max}) \approx 2 / \lambda_{\max}$ (small). Adam/RMSprop adaptively estimate $H^{-1/2}$, reducing effective $\kappa \approx 1$. Batch normalization: Normalizes activations per layer: $\hat{x} = (x - \mathbb{E}[x]) / \sqrt{\text{Var}[x] + \epsilon}$. Implicitly improves Hessian conditioning, enabling higher learning rates. Spectral normalization for GANs: $W_{\text{SN}} = W / \sigma_{\max}(W)$ ensures $\|W\|_2 = 1$, bounding Lipschitz constant and stabilizing training. Preconditioning in optimization: L-BFGS approximates $H^{-1}$ via quasi-Newton updates, reducing $\kappa_{\text{effective}}$. Condition number monitoring: Track $\kappa(H)$ during training via power iteration or Lanczos. If $\kappa > 10^6$, switch to adaptive optimizers or add regularization.
Connection to Linear Algebra Theory
Perturbation bound derivation: For $Ax = b$ and $(A + \Delta A)(x + \Delta x) = b + \Delta b$, first-order analysis gives $A \Delta x \approx \Delta b - \Delta A \cdot x$, so $\Delta x \approx A^{-1}(\Delta b - \Delta A \cdot x)$. Taking norms: $\|\Delta x\| \le \|A^{-1}\| (\|\Delta b\| + \|\Delta A\| \|x\|)$. Dividing by $\|x\| = \|A^{-1} b\| \le \|A^{-1}\| \|b\|$ and using $\kappa(A) = \|A\| \|A^{-1}\|$ yields the bound. Eigenvalue interpretation for diagonal matrices: For $A = \text{diag}(\lambda_1, \ldots, \lambda_d)$, solution is $x_i = b_i / \lambda_i$. Perturbation $\delta b_i$ produces $\delta x_i = \delta b_i / \lambda_i$. Small $\lambda_i$ amplifies by $1/\lambda_i$. Condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ quantifies worst-case amplification. Geometric interpretation: Level sets of $\|Ax - b\|_2$ are ellipsoids with semi-axes $1/\sigma_i$. For ill-conditioned $A$, ellipsoids are elongatedâsmall $b$ changes cause large $x$ shifts along narrow directions. SVD perspective: $\kappa(A) = \sigma_{\max} / \sigma_{\min}$ reveals how $A$ stretches/compresses space. Small $\sigma_{\min}$ indicates near-rank-deficiency. Pseudoinverse $A^+ = V \Sigma^+ U^\top$ inverts non-zero singular values, providing minimum-norm solutions for ill-conditioned systems.
Numerical and Implementation Notes
Shape discipline: $A \in \mathbb{R}^{2 \times 2}$ (diagonal), $b, \delta b, x_1, x_2 \in \mathbb{R}^2$, $\kappa(A) \in \mathbb{R}$ (scalar). Gotchas: (1) Condition number norm: np.linalg.cond(A) uses 2-norm by default. For other norms: np.linalg.cond(A, p=1) (1-norm), p=np.inf (infinity-norm), p='fro' (Frobenius). (2) Diagonal matrices: $\kappa = |\lambda_{\max}| / |\lambda_{\min}|$ (no SVD needed). For general matrices, compute via np.linalg.cond or singular values. (3) Perturbation alignment: Worst-case amplification when $\delta b$ aligns with smallest singular vector. Random perturbations see average-case $\sim \sqrt{\kappa}$. (4) Relative vs. absolute: Always check both. Relative errors misleading when $\|x\|$ is large/small. (5) Machine precision: For $\kappa > 10^{15}$ in double precision, solutions are noise. Regularize or use higher precision. (6) Condition number estimation: For large sparse $A$, full SVD is expensive ($O(d^3)$). Use iterative estimators: scipy.sparse.linalg.norm(A, ord=2) for $\sigma_{\max}$, inverse iteration for $\sigma_{\min}$.
Numerical and Shape Notes
Verification checks: (1) Eigenvalues: np.linalg.eigvalsh(A) confirms $\{10^{-6}, 1\}$. (2) Condition number: np.linalg.cond(A) = 10^6$. (3) **Solution correctness:** Residualsnp.linalg.norm(A @ x1 - b)andnp.linalg.norm(A @ x2 - (b + db))` should be $\sim 0$ (machine precision). (4) Perturbation bound: $\|\Delta x\| / \|x_1\| \le \kappa(A) \|\delta b\| / \|b\|$ should hold. (5) Absolute changes: $\|\Delta x\| = 1$ shows full amplification in weak direction. Cost: For diagonal $A$, solving is $O(d)$. Computing $\kappa$ via SVD is $O(d^3)$, but for diagonal matrices, eigenvalues are diagonal entries ($O(d)$). Tolerance: Machine epsilon $\epsilon_{\text{mach}} \approx 2.22 \times 10^{-16}$ (float64). Roundoff error in solving $Ax = b$: $\sim \kappa(A) \epsilon_{\text{mach}}$. For $\kappa = 10^6$, expect error $\sim 10^{-10}$ (losing 6 digits from 16).
Pedagogical Significance
This example is the canonical empirical demonstration of condition number as error amplification: (1) Condition number predicts instability: $\kappa(A) = 10^6$ means $10^{-6}$ input noise can cause unit-scale output changes. (2) Diagonal matrices make it obvious: No SVD neededâeigenvalue ratio directly reveals conditioning. (3) Worst-case vs. average-case: Perturbations aligned with smallest eigenvector see full $\kappa(A)$ amplification; random perturbations see $\sim \sqrt{\kappa(A)}$ on average. (4) Relative errors can be deceptive: Absolute changes may be large even when relative changes are small (if $\|x\|$ is large). (5) Regularization is essential: For $\kappa(A) > 10^{10}$, solutions are unreliable without adding $\lambda I$. Common misconceptions addressed: âCondition number is just theoryâ (Noâit directly predicts numerical instability). âSmall perturbations always cause small errorsâ (Only for $\kappa \sim 1$; ill-conditioned systems amplify exponentially). âDouble precision is always enoughâ (For $\kappa > 10^{15}$, solutions are noise). âIterative solvers avoid conditioningâ (NoâCG converges $\propto \sqrt{\kappa}$; preconditioning mandatory). Connection to other examples: Example 74 (condition number squaring), Example 76 (SVD vs. normal equations), Example 77 (Cholesky stability). Why pedagogically powerful: Provides concrete numerical evidence that $\kappa$ is the amplification factor mapping input errors to output errors. Diagonal matrix makes mechanism transparent. Students see that $\kappa = 10^6$ means $10^{-6}$ perturbation unit-scale changeâvisceral and immediate. Motivates all conditioning-aware ML practices: feature scaling, ridge regularization, SVD/QR solver choice, adaptive optimizers.
The concept of matrix conditioning emerged from the numerical analysis community in the mid-20th century, driven by the urgent need to solve large linear systems on early digital computers. Alan Turing introduced the notion of âconditionâ in his landmark 1948 paper Rounding-Off Errors in Matrix Processes, analyzing how perturbations in matrix coefficients propagate to solutions. Turing defined a âcondition numberâ (though not using the modern term) as a measure of a systemâs sensitivity to input errorsâthe mathematical foundation for all subsequent stability analysis.
Von Neumann and Goldstine (1947) had earlier studied error propagation in matrix inversion, deriving probabilistic bounds on accumulated roundoff errors. However, Turingâs work was more accessible and directly applicable to practical computation. J. H. Wilkinson, Turingâs successor at the National Physical Laboratory, formalized the modern theory in his influential monograph Rounding Errors in Algebraic Processes (1963). Wilkinson introduced backward error analysisâthe idea that numerical algorithms produce exact solutions to slightly perturbed problemsâand showed that the condition number $\kappa(A) = \|A\| \|A^{-1}\|$ controls the magnification of backward errors to forward errors. J. R. Rice (1966) popularized the term âcondition numberâ in the modern sense, unifying disparate notions of stability under a single conceptual framework.
By the 1970s-80s, conditioning had become central to numerical linear algebra. Golub and Van Loanâs Matrix Computations (1983) systematized the theory, emphasizing the connection between conditioning and singular values: $\kappa(A) = \sigma_{\max} / \sigma_{\min}$. This led to SVD-based solvers (LAPACKâs gelsd, SciPyâs lstsq) that avoid forming ill-conditioned normal equations $X^\top X$, preserving numerical accuracy in least squares problems. Highamâs Accuracy and Stability of Numerical Algorithms (2002) provided the definitive reference, cataloging condition numbers for matrix factorizations, eigenvalue problems, and nonlinear systems.
Modern ML applications reveal conditioning challenges at unprecedented scale. Deep learning optimization faces $\kappa(H) \sim 10^6$ Hessians in deep residual networks (He et al., 2016), where gradient flow causes eigenvalue decay. Feature preprocessing (standardization, min-max scaling) became standard practice after empirical demonstrations that unnormalized features cause $\kappa(X) \sim 10^5$ data matrices, making least squares unstable (scikit-learn documentation, 2007 onwards). Ridge regularization $(X^\top X + \lambda I)^{-1}$ explicitly repairs conditioning by flooring eigenvaluesârecognized since Hoerl-Kennard (1970) but only widely adopted with MLâs scale.
Adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015) exploit ill-conditioned neural network classifiers: tiny input perturbations $\delta x$ amplify through layers with large $\|W\|_2$, causing misclassifications. Spectral normalization (Miyato et al., 2018) divides weight matrices by $\sigma_{\max}(W)$ to bound $\|W\|_2 = 1$, limiting amplificationânow standard in GAN training. Batch normalization (Ioffe-Szegedy, 2015) implicitly improves Hessian conditioning by normalizing activations, enabling learning rates $10-100 \times$ higher (empirically, $\kappa(H)$ drops from $\sim 10^8$ to $\sim 10^4$). Adaptive optimizers (Kingma-Baâs Adam, 2015; Hintonâs RMSprop, 2012) approximate $H^{-1/2}$, transforming ill-conditioned problems to effectively well-conditioned ones ($\kappa_{\text{effective}} \approx 1$), explaining their dominance over vanilla SGD in deep learning.
Conditioning remains a central concern in large-scale optimization (interior-point methods face $\kappa(H) > 10^{12}$ near optimality), sparse systems (graph Laplacians with disconnected components have $\kappa \to \infty$), and high-performance computing (mixed-precision training balances speed vs. numerical stability, requiring condition number monitoring). The simple perturbation experiment in this exampleâ$\kappa(A) = 10^6$ amplifies $10^{-6}$ noise to unit-scale errorsâcaptures the universal mechanism underlying all these phenomena.
Example 74 (SVD and condition number squaring): Normal equations $(X^\top X) w = X^\top y$ square the condition number: $\kappa(X^\top X) = [\kappa(X)]^2$. If $\kappa(X) = 10^3$, then $\kappa(X^\top X) = 10^6$ (as in this example), amplifying errors from $10^{-3}$ to $10^{-6}$ relative perturbations potentially causing unit-scale solution changesâthis motivates using SVD/QR instead of normal equations.
Example 76 (Least squares: lstsq vs normal equations): SVD-based
lstsqpreserves $\kappa(X)$, while normal equations viasolve(X.T @ X, X.T @ y)inherit $\kappa(X)^2$. For $\kappa(X) = 10^3$, normal equations see $\sim 10^6$ amplification (this exampleâs regime), losing ~6 digits of precision.Example 77 (Cholesky reuse): Cholesky is backward stable for well-conditioned SPD matrices ($\kappa < 10^6$). For $\kappa(A) = 10^{12}$ (worse than this example), Cholesky factorization may fail or produce unreliable resultsâregularization $(A + \lambda I)$ is essential.
Example 73 (PSD matrices): Small eigenvalues (near-zero but positive) indicate near-singularity and ill-conditioning. This exampleâs $\lambda_2 = 10^{-6}$ is the culpritâany eigenvalue $< 10^{-10}$ (relative to $\lambda_{\max}$) causes $\kappa > 10^{10}$, exceeding the stability threshold for double precision.
Example 75 (PCA and explained variance): When PCA reveals that a few singular values dominate ($\sigma_1 \gg \sigma_d$), the design matrix $X$ is ill-conditioned. Truncating to top $k$ components (discarding $\sigma_{k+1}, \ldots, \sigma_d$) is equivalent to regularizationâimproving effective condition number from $\sigma_1 / \sigma_d$ to $\sigma_1 / \sigma_k$.
Example 79 (Sparse systems): Sparse ill-conditioned systems (e.g., graph Laplacians with nearly disconnected components) require preconditioners to reduce $\kappa$ before iterative solvers (CG, GMRES) can converge. Incomplete Cholesky or Jacobi preconditioning targets the small eigenvalues causing ill-conditioning.
Chapter 12 (Least squares): Feature scaling (standardizing columns of $X$) reduces $\kappa(X)$ from $\sim 10^4$ (unnormalized) to $\sim 10$ (standardized), preventing the amplification demonstrated here.
Comments