Part 1: Two Solution Methods
The code compares two least-squares solvers: the robust lstsq (using SVD internally) versus the numerically risky normal equations solved via np.linalg.solve. It demonstrates why direct SVD-based methods are preferred over explicit normal-equation solves in practice, even though both minimize the same objective mathematically.
Method 1 (robust): w1 = np.linalg.lstsq(X, y, rcond=None)[0] uses the least-squares solver, which internally computes the SVD $X = U\Sigma V^\top$ and solves via: \[
\hat{w} = V \Sigma^{-1} U^\top y,
\] where $\Sigma^{-1}$ treats singular values below a numerical threshold as zero. This approach avoids forming $X^\top X$ explicitly, preserving numerical stability even when $X$ is ill-conditioned. The rcond=None parameter uses machine precision as the cutoff for rank determination.
Method 2 (risky): w2 = np.linalg.solve(X.T@X, X.T@y) forms the normal equations $X^\top X w = X^\top y$ and solves them via Gaussian elimination. This method is mathematically equivalent to least squares but squares the condition number: $\kappa(X^\top X) = \kappa(X)^2$. For well-conditioned $X$, both methods agree; for ill-conditioned $X$, solve accumulates rounding errors in $X^\top X$, producing $w_2 \ne w_1$.
Part 2: Residual Verification
The residuals are computed as $r_i = y - Xw_i$ for both solutions. Both residuals should have nearly identical norms (||r1|| and ||r2||), since both methods minimize $\|y - Xw\|_2^2$ and converge to the same optimal residual vector (up to numerical error). The printed norms confirm theyâre equal; any significant difference (e.g., ||r1|| = 10.5, ||r2|| = 10.6) signals numerical instability in the normal-equations solver. This is a subtle but critical observation: two solutions can have identical residuals but very different weightsâa signature of ill-conditioning and non-identifiability.
Part 3: Weight Discrepancy as Ill-Conditioning Indicator
The quantity ||w1 - w2|| measures how much the two solutions disagree. For well-conditioned problems (||w1 - w2|| near machine precision, $\sim 10^{-14}$), both solvers work equally well, and the difference is negligible rounding error. For ill-conditioned problems, ||w1 - w2|| can be substantial (e.g., $10^{-6}$ or larger), indicating that $w_2$ from the normal equations has accumulated significant error, even though its residual ||r2|| remains close to the optimal value.
Why? The normal equations solve a different, more sensitive problem: \[
(X^\top X) w = X^\top y.
\] Small perturbations in $y$ (e.g., from rounding) are amplified by $\kappa(X^\top X) = \kappa(X)^2$ when computing $w$. However, the residual $y - Xw$ is less sensitive because it involves $X$ directly (condition number $\kappa(X)$), not the squared system. So $w_2$ can be far from $w_1$, yet both produce similar residualsâa classic symptom of non-identifiability due to ill-conditioning. The weights $w$ are unstable estimates of the true coefficients, but predictions $\hat{y} = Xw$ remain stable.
Why This Matters for ML
Interpretability vs. prediction stability: In regression, the fitted weights $w$ are often interpreted as feature importance or causal effects. Ill-conditioning makes these interpretations unreliable: tiny changes in the data produce large changes in $w$. However, if the task is pure prediction, $\hat{y} = Xw$ remains stable because predictions depend on $X$ (well-conditioned) rather than $w$ alone.
Regularization necessity: When ||w1 - w2|| is large, regularization (ridge regression, elastic net, LASSO) becomes essential. Ridge regression solves: \[
(X^\top X + \lambda I) w = X^\top y,
\] which improves conditioning by adding $\lambda I$ to the Hessian. The trade-off is bias (estimator $w$ is biased toward zero), but reduced variance and stability.
Solver selection: For $n \gg d$ (many examples, few features), both solvers are usually fast and stableâuse whichever is convenient. For $n \approx d$ or $n < d$ (fewer examples than features), always use lstsq or SVD-based methods. For $n < d$ (underdetermined), normal equations break down entirely (matrix is singular); only lstsq with rcond truncation recovers the minimum-norm solution.
Numerical diagnostics: Print ||w1 - w2|| and kappa(X) as routine diagnostics to catch ill-conditioning early:
print(f"Weight disagreement: {np.linalg.norm(w1 - w2):.2e}")
print(f"Condition number: {np.linalg.cond(X):.2e}")
if np.linalg.cond(X) > 1e8:
print("WARNING: Ill-conditioned design matrix. Consider regularization.")
Connection to ML Applications
Multicollinearity in feature engineering: When features are highly correlated (e.g., age and birth year), $X$ becomes ill-conditioned. The least-squares solution $\hat{w}$ assigns wildly different weights to correlated features, even though predictions $\hat{y}$ remain stable. This is why multicollinearity is a red flag: feature interpretability collapses, not prediction accuracy.
Ridge regression as conditioning fix: Ridge adds $\lambda I$ to the Hessian, effectively regularizing away directions with small singular values. This trades some bias for much better conditioning and weight stabilityâessential when features are correlated or when interpretability matters.
Cross-validation and hyperparameter tuning: Ill-conditioned problems are prone to overfitting because small changes in training data produce large weight changes. Cross-validation detects this via high variance in validation scores. Using lstsq (stable) instead of normal equations helps, but regularization is the fundamental fix.
Numerical and Implementation Notes
Use full_matrices=False for economy SVD in lstsq; critical when $n \gg d$ to avoid allocating huge matrices. Always center and/or standardize features before least squares to improve conditioning. Verify $\|X^\top r\|_2 \approx 0$ (orthogonality condition) to confirm optimality. For high-dimensional problems ($d > 1000$), normal equations become prohibitively expensive anyway (forming $X^\top X$ requires $O(nd^2)$ operations); lstsq scales better. When $X$ is sparse, use sparse solvers (scipy.sparse.linalg.lsqr) instead of dense lstsq. Always inspect both ||w1 - w2|| and ||r1||, ||r2|| togetherâweights can diverge even if residuals agree.
Numerical and Shape Notes
Shape discipline: - $X \in \mathbb{R}^{n \times d}$: design matrix ($n$ examples, $d$ features) - $y \in \mathbb{R}^n$: response vector - $X^\top X \in \mathbb{R}^{d \times d}$: Gram matrix (symmetric, PSD, but potentially ill-conditioned) - $X^\top y \in \mathbb{R}^d$: cross-product vector - $w_1, w_2 \in \mathbb{R}^d$: weight vectors - $r_1, r_2 \in \mathbb{R}^n$: residual vectors - $U \in \mathbb{R}^{n \times d}$, $\Sigma \in \mathbb{R}^d$, $V^\top \in \mathbb{R}^{d \times d}$ (economy SVD)
Gotchas: 1. Normal equations are fragile: For $\kappa(X) > 10^6$, the normal equations lose several digits of precision. Always prefer lstsq for production code. 2. Residuals can mask weight errors: Both residuals are similar even if $w$ disagreement is large. Always inspect both ||w1 - w2|| and ||r1||, ||r2|| together. 3. lstsq return value: np.linalg.lstsq(X, y, rcond=None) returns a tuple (w, residuals, rank, s). Extract the weight via [0], not [0, 0]. 4. Singular/near-singular X: If $\text{rank}(X) < d$, solve(X.T@X, ...) fails with a LinAlgError (matrix is singular). lstsq handles this gracefully by returning the minimum-norm solution. 5. Broadcasting in matrix-vector products: X.T @ X is shape $(d, d)$ and X.T @ y is shape $(d,)$âensure correct dimensions before calling solve.
Comments