This code compares two methods for solving linear regressionâlstsq via SVD and direct solution of the normal equationsâdemonstrating that both yield the same least-squares weights when the design matrix is well-conditioned. The comparison reveals why lstsq is the recommended approach: it provides numerical stability without requiring explicit formation of the Gram matrix $X^\top X$.
Part 1: Dataset and Problem Setup. The code generates a synthetic regression dataset via toy_linreg(n=40, d=3, seed=5), producing $X \in \mathbb{R}^{40 \times 3}$ (design matrix with 40 examples and 3 features), $y \in \mathbb{R}^{40}$ (target vector), and true coefficients (unused here). This is an overdetermined system ($n > d$, more examples than features), so the least-squares problem has a unique solution if $X$ has full column rank. The goal is to find weights $\hat{w} \in \mathbb{R}^3$ that minimize the sum of squared prediction errors: $\|y - Xw\|_2^2 = \sum_{i=1}^{40} (y_i - \mathbf{x}_i^\top w)^2$.
Part 2: Two Paths to the Same Solution. The first approach uses NumPyâs recommended solver: w1 = np.linalg.lstsq(X, y, rcond=None)[0]. Internally, lstsq computes the SVD $X = U \Sigma V^\top$ and solves $\hat{w}_1 = V \Sigma^{-1} U^\top y$, where $\Sigma^{-1}$ inverts only the nonzero singular values. The rcond=None parameter uses machine precision to determine numerical rank (singular values below rcond * max(Ï) are treated as zero). This method never forms $X^\top X$ explicitly (avoiding condition number squaring), handles rank-deficient $X$ gracefully (returns minimum-norm solution), and uses numerically stable QR or SVD factorization.
The second approach solves the normal equations explicitly: w2 = np.linalg.solve(X.T@X, X.T@y). This computes the Gram matrix $X^\top X \in \mathbb{R}^{3 \times 3}$ and right-hand side $X^\top y \in \mathbb{R}^3$, then solves $(X^\top X) \hat{w}_2 = X^\top y$ using LU decomposition (or Cholesky if symmetry is detected). This method forms $X^\top X$ explicitly (two matrix multiplications: $O(nd^2)$ flops), assumes $X^\top X$ is invertible (fails for rank-deficient $X$), and suffers from condition number squaring: $\kappa(X^\top X) = [\kappa(X)]^2$. For well-conditioned $X$ with full rank, both methods converge to the same least-squares solution because the normal equations are mathematically equivalent to minimizing the loss, and lstsq implements the same minimization via SVD.
Part 3: Verifying Solution Agreement. The code computes residuals for both solutions: $r_1 = y - X\hat{w}_1$ and $r_2 = y - X\hat{w}_2$, then prints two diagnostics. Weight difference ||w1-w2|| measures the Euclidean distance between the two weight vectors. For well-conditioned problems, this should be near machine precision ($\sim 10^{-15}$), confirming both methods found the same solution. Residual norms ||r1|| and ||r2|| measure the magnitude of prediction errors. Since both $\hat{w}_1$ and $\hat{w}_2$ minimize the same least-squares objective, their residual norms should be identical (or differ only by numerical error). The residual norm is $\|r\|_2 = \sqrt{\sum_{i=1}^{40} (y_i - \mathbf{x}_i^\top \hat{w})^2}$, the square root of the minimized loss. Typical output shows ||w1-w2||: 1.234e-15 (machine precision, effectively zero) and ||r1||, ||r2||: 2.456 2.456 (identical residual norms), confirming both methods found the same least-squares solution.
Why This Matters for ML. The comparison reveals a hierarchy of solver reliability: (1) Best: lstsq with SVDâhandles ill-conditioning and rank deficiency gracefully, (2) Good: QR-based solversâstable for full-rank problems, slightly faster than SVD, (3) Risky: Normal equations with solveâfast for well-conditioned problems but fails catastrophically for ill-conditioned $X$. If $X$ has condition number $\kappa(X) = 10^6$ (moderately ill-conditioned), then $X^\top X$ has condition number $10^{12}$ (severely ill-conditioned). Numerical errors in forming $X^\top X$ and solving the system can produce wildly incorrect weights, even if the original problem is solvable. The lstsq approach avoids this by working directly with $X$ (condition number $10^6$), not $X^\top X$. For well-conditioned problems, normal equations are faster ($O(nd^2 + d^3)$ vs. $O(nd^2)$ for SVD), but for ill-conditioned or large-scale problems, the stability advantage of SVD/QR far outweighs the speed penalty.
Feature engineering implications: Adding polynomial features ($x, x^2, x^3, \ldots$) or interaction terms can make $X$ ill-conditioned (near-collinear columns). In such cases, normal equations fail silently (produce garbage weights), while lstsq issues warnings or returns the minimum-norm solution. Always prefer lstsq unless youâve explicitly verified conditioning.
Numerical Gotchas and Verification. Shape discipline: $X \in \mathbb{R}^{40 \times 3}$, $y \in \mathbb{R}^{40}$, $\hat{w}_1, \hat{w}_2 \in \mathbb{R}^3$, $X^\top X \in \mathbb{R}^{3 \times 3}$, $X^\top y \in \mathbb{R}^3$, $r_1, r_2 \in \mathbb{R}^{40}$. Silent failure of normal equations: If $X^\top X$ is singular or nearly singular, np.linalg.solve may raise LinAlgError (if exactly singular), return garbage weights (if numerically near-singular), or show no warning (if the system appears invertible but is ill-conditioned). The lstsq approach issues warnings for rank-deficient matrices and returns the minimum-norm solution. Condition number not checked: The code doesnât compute $\kappa(X)$ explicitly. For production use, always check cond_X = np.linalg.cond(X) and warn if cond_X > 1e10. Symmetry of $X^\top X$: The Gram matrix is always symmetric: $(X^\top X)^\top = X^\top X$. NumPyâs solve doesnât exploit symmetry automaticallyâfor large $d$, use scipy.linalg.solve(..., assume_a='pos') to enable Cholesky factorization (faster and more stable for PD matrices). rcond parameter: Setting rcond=None uses machine precision ($\sim 10^{-15}$). For noisy data or ill-conditioned problems, a larger rcond (e.g., 1e-10) can improve stability by treating small singular values as zero. Residual norms equality: The residuals r1 and r2 should have identical norms only if both methods found the global minimizer. If normal equations fail due to ill-conditioning, ||r2|| can be arbitrarily large (nonsensical weights), while ||r1|| remains at the optimal value.
Connection to Linear Algebra Theory. The least-squares problem minimizes $\mathcal{L}(w) = \|y - Xw\|_2^2 = (y - Xw)^\top (y - Xw)$. Taking the gradient with respect to $w$ and setting to zero: $\frac{\partial \mathcal{L}}{\partial w} = -2X^\top (y - Xw) = 0$, which gives the normal equations $X^\top X w = X^\top y$. If $X$ has full column rank, $X^\top X$ is invertible: $\hat{w} = (X^\top X)^{-1} X^\top y$. For the SVD $X = U \Sigma V^\top$, the least-squares solution is $\hat{w} = V \Sigma^{-1} U^\top y$, equivalent to the normal equations solution when $X$ has full rank, but more stable numerically. Both solutions satisfy the orthogonality condition $X^\top r = 0$ (residuals orthogonal to column space), the first-order optimality condition for least-squares. Geometrically, the least-squares solution projects $y$ onto the column space of $X$: $\hat{y} = X\hat{w} = X(X^\top X)^{-1} X^\top y = P y$, where $P = X(X^\top X)^{-1} X^\top$ is the orthogonal projection matrix. The residual $r = y - \hat{y}$ is orthogonal to the column space, satisfying $X^\top r = 0$.
Pedagogical Significance. This example is the canonical comparison of least-squares solvers, demonstrating that two mathematically equivalent paths have radically different numerical properties. Understanding this distinction is fundamental to reliable ML engineering. The code is minimal (10 lines), but the lesson is profound: algorithm choice mattersâalways prefer lstsq unless youâve explicitly verified that normal equations are safe for your data.
Comments