Part 1: Ill-Conditioned System Recognition - A diagonal matrix $A = ext{diag}(1, 10^{-6})$ has $\kappa(A) = 10^6$: the ratio of largest to smallest singular value. - For any $A$, compute $\sigma_1, \sigma_n$ via SVD; condition number $\kappa(A) = \sigma_1/\sigma_n$. - In ML, design matrices $X$ with correlated features have large $\sigma_1/\sigma_n$, making least-squares estimates unstable.
Part 2: Perturbation Experiment - Solve $Ax_1 = b$ and $Ax_2 = b + \Delta b$ where $\Delta b$ is a small perturbation. - If $\|\Delta b\|/\|b\| = 10^{-7}$ and $\kappa(A) = 10^6$, then $\|\Delta x\|/\|x\| \lesssim 10^6 \cdot 10^{-7} = 0.1$⦠but can be worse. - In this example, $\|\Delta x\|/\|x\| = 0.7$ (70% change!), approaching the bound.
Part 3: Amplification via Conditioning Number - The fundamental bound: relative forward error ⤠condition number à relative backward error (perturbation). - Even if a solver is numerically âperfectâ (backward-stable), the forward error is amplified by $\kappa(A)$. - This is not a limitation of the algorithm; itâs inherent to the problem. No solver can fix ill-conditioning alone.
Why This Matters for ML - Training instability: Neural networks with ill-conditioned weight matrices exhibit vanishing or exploding gradients during backprop, making learning impossible without careful initialization or normalization. - Least-squares sensitivity: Ridge regression, LASSO, and elastic net all combat ill-conditioning of $X^ op X$ by adding $\lambda I$ to the diagonal. - Kernel methods: Gram matrices for kernels (RBF, polynomial, etc.) often have rapidly decaying eigenvalues, requiring careful regularization.
Connection to ML Applications - Feature scaling: Preprocessing to equalize feature magnitudes improves $\kappa(X^ op X)$, enabling faster convergence and better generalization. - Batch normalization: Normalizing hidden activations in deep networks reduces the condition number of layer-wise Jacobians, stabilizing gradients. - Weight decay: Adding $\lambda \|W\|^2$ to the loss function is equivalent to ridge regression; it reduces the condition number of the Hessian during optimization. - Dimensionality reduction: PCA or other dimension reduction removes directions of near-zero variance (small singular values), improving conditioning.
Connection to Linear Algebra Theory - QR decomposition: The QR factorization $A = QR$ (where $Q$ is orthogonal) allows stable solution via back-substitution if $R$ is not too ill-conditioned. - Cholesky factorization: For SPD matrices $A = LL^ op$ (Cholesky), the condition number of $L$ is $\sqrt{\kappa(A)}$, improving stability for certain operations. - Regularization: Ridge regression $\min \|Ax - b\|^2 + \lambda \|x\|^2$ improves the condition number from $\kappa(A^ op A)$ to $\kappa(A^ op A + \lambda I)$. - Iterative refinement: For ill-conditioned systems, iterative refinement (computing residual, solving correction equation) can recover lost accuracy.
Numerical and Implementation Notes - Avoid explicit inverses: Computing $A^{-1}$ amplifies rounding errors by a factor of $\kappa(A)^2$; use solvers instead. - Check rcond in lstsq: NumPyâs np.linalg.lstsq(..., rcond=1e-15) warns if $\sigma_n/\sigma_1 <$ rcond, indicating rank deficiency. - Monitor gradient norms: In deep learning, track $\| abla_W L\|$ at each layer; explosion/vanishing signals ill-conditioning of the Jacobian. - Regularization tuning: Cross-validate $\lambda$ (regularization strength) to balance bias-variance; $\lambda pprox 1/\kappa(X^ op X)$ is a rule of thumb. - Data preprocessing: Standardize features (zero mean, unit variance) before fitting; this is equivalent to improving $\kappa(X^ op X)$. - Backward error vs. forward error: Even a backward-stable algorithm (e.g., Gaussian elimination with pivoting) cannot guarantee a small forward error if $\kappa(A)$ is large.
Comments