This code demonstrates the relationship between the condition number of a design matrix $X$ and its Gram matrix $X^\top X$âa critical insight for understanding numerical stability in least-squares regression and why forming normal equations can amplify ill-conditioning.
Part 1: Creating an Ill-Conditioned Design Matrix. The code starts with a well-conditioned regression dataset generated by toy_linreg(n=50, d=4, seed=2), yielding $X \in \mathbb{R}^{50 \times 4}$ (50 examples, 4 features). To introduce ill-conditioning, a fifth feature is artificially created as an almost-duplicate of the first feature: $X_{:,5} = X_{:,1} + 10^{-3} X_{:,2}$, done via np.c_[X, X[:, [0]] + 1e-3 * X[:, [1]]]. This produces $X \in \mathbb{R}^{50 \times 5}$. Near-collinearityâtwo columns pointing in almost the same directionâis a hallmark of ill-conditioning in practice. When features are nearly dependent, small changes in data (noise, measurement error) cause large swings in fitted weights. The condition number quantifies this sensitivity.
Part 2: Condition Number via SVD. The condition number of $X$ is defined as the ratio of the largest to smallest singular value: $\kappa(X) = \frac{\sigma_{\max}}{\sigma_{\min}}$. The code computes it directly from singular values via S = np.linalg.svd(X, compute_uv=False) (returning only singular values, skipping $U, V^\top$), then cond_X = S.max() / S.min(). The parameter compute_uv=False saves computationâonly eigenvalues are needed for conditioning diagnosis. A condition number of $\kappa(X) = 1$ means well-conditioning (all directions equally important); as $\kappa(X)$ grows (e.g., $10^6$), the matrix becomes ill-conditionedâsome directions are much more âdominant,â and solving least-squares becomes numerically fragile.
Part 3: Condition Number of the Normal EquationsâThe Squaring Effect. The Gram matrix is $X^\top X \in \mathbb{R}^{5 \times 5}$. Its condition number is computed via np.linalg.cond(X.T@X), using eigenvalue decomposition or SVD internally. The key relationship: $\kappa(X^\top X) = [\kappa(X)]^2$. This is exact and follows from the SVD: if $X = U \Sigma V^\top$, then $X^\top X = V \Sigma^2 V^\top$, so eigenvalues of $X^\top X$ are squares of $X$âs singular values. The condition number becomes $\kappa(X^\top X) = \frac{\sigma_{\max}^2}{\sigma_{\min}^2} = [\kappa(X)]^2$. The code prints cond_X**2 to verify this relationship numerically. Typical output shows near-perfect agreement, confirming the squaring relationship.
Why This Matters for ML. When solving least-squares via normal equations $(X^\top X) \hat{w} = X^\top y$, the condition number of the system is $\kappa(X^\top X) = [\kappa(X)]^2$. If $X$ has condition number $10^3$ (moderately ill-conditioned), then $X^\top X$ has condition number $10^6$ (severely ill-conditioned). Numerical errors are amplified quadratically, leading to inaccurate solutions. Why solvers like lstsq use SVD: Modern implementations avoid forming $X^\top X$ explicitly and instead use SVD or QR decomposition directly. The SVD solves least-squares as $\hat{w} = V \Sigma^{-1} U^\top y$. No squaring occursâthe condition number is $\kappa(X)$, not $[\kappa(X)]^2$. This is why np.linalg.lstsq is universally preferred over manually solving $X^\top X \hat{w} = X^\top y$.
Regularization and Preprocessing. Ridge regression solves $(X^\top X + \lambda I) \hat{w}_\lambda = X^\top y$, adding $\lambda I$ to boost the smallest eigenvalues of $X^\top X$, reducing the condition number of the regularized system. For large $\lambda$, conditioning improves dramatically, at the cost of bias. Before fitting, practitioners often standardize features to similar scales, improving conditioning by removing scale disparityânear-collinear columns with vastly different magnitudes are more problematic than those with similar magnitudes.
Numerical Interpretation and Gotchas. - $X \in \mathbb{R}^{50 \times 5}$ (design matrix), $S \in \mathbb{R}^5$ (singular values), $X^\top X \in \mathbb{R}^{5 \times 5}$ (Gram matrix, symmetric). - Singular values from np.linalg.svd are sorted descending; use max() and min() for condition number. - If eig_S.min() is extremely close to zero ($\sim 10^{-16}$), condition number becomes unreliable; use np.linalg.cond with tolerance if needed. - $X^\top X$ is always symmetric, so symmetric eigenvalue solvers are used internally (more stable than general eigendecomposition). - The artificially added column X[:, [0]] + 1e-3 * X[:, [1]] is intentionally nearly redundant; the $10^{-3}$ perturbation makes it numerically close to dependent but not exactly soâmodeling realistic collinearity. - Condition number is scale-invariant for the matrix itself (scaling columns doesnât change the singular value ratio), but practical conditioning depends on relative feature scales.
Linear Algebra Theory Connection. The condition number bounds relative error: for perturbation $\Delta b$ in $Ax = b$, $\frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\Delta b\|}{\|b\|}$. For normal equations, this becomes $\frac{\|\Delta \hat{w}\|}{\|\hat{w}\|} \le [\kappa(X)]^2 \frac{\|\Delta y\|}{\|y\|}$. The SVD is the most numerically stable decomposition: $\kappa(X)$ (not squared) bounds error when using SVD-based solvers. Forming $X^\top X$ loses singular value information: if $X$ has singular values $[\sigma_1, \ldots, \sigma_r, 0, \ldots, 0]$, then $X^\top X$ has eigenvalues $[\sigma_1^2, \ldots, \sigma_r^2, 0, \ldots, 0]$âsmall nonzero singular values are squared, making ill-conditioning more severe.
Pedagogical Power. This example shows that a seemingly innocent choice (forming $X^\top X$ vs. using SVD) has exponential numerical consequences. The squaring relationship is elegant, exact, and universally applicableâit explains why practitioners prefer lstsq over manual normal equations. Understanding this one relationship deepens intuition for numerical linear algebra as a whole.
Comments