Least squares emerged from Gaussâs 1809 work on celestial mechanics (predicting Ceresâs orbit from noisy observations) and Legendreâs 1805 formalization of the âmethod of least squares.â The normal equations $X^\top X w = X^\top y$ became the standard analytical solution, dominating statistical practice for 150 years. The numerical instability of forming $X^\top X$ explicitly was recognized by statisticians in the 1960sâ1970s, leading to QR decomposition (Householder 1958, Golub 1965) and SVD-based methods as the recommended approach for ill-conditioned systems. In modern ML, least squares is the prototype convex problem: closed-form solution, unique global minimum, and direct connection to projection geometry. It serves as the foundation for ridge regression, polynomial regression, Gaussian process regression, and the local quadratic approximations underlying Newtonâs method, Gauss-Newton, and Levenberg-Marquardt. Despite the availability of gradient-based methods for large-scale problems, direct least-squares solvers remain essential for small-to-medium $d$ (e.g., $d < 10^4$) and as subroutines in iterative algorithms (trust-region methods, RANSAC, Kalman filtering).
- Log in to post comments
This example demonstrates the numerical trade-offs between SVD-based lstsq and direct normal equations for solving least squares $\min_w \|y - Xw\|_2^2$. While both methods are mathematically equivalent (yielding identical optimal weights $\hat{w}$ and residual norms), they differ dramatically in numerical stability: lstsq preserves the condition number $\kappa(X)$ via SVD/QR factorization, whereas normal equations square it to $\kappa(X^\top X) = [\kappa(X)]^2$, amplifying roundoff errors. By comparing weight agreement and residual norms on well-conditioned data, we establish the pedagogical baseline: âboth methods converge when $X$ is well-conditioned, but only lstsq generalizes to ill-conditioned or rank-deficient systems.â This pattern recurs throughout MLâridge regression, weighted least squares, and gradient descent all reduce to variants of this comparison.
Solve least squares via lstsq and via normal equations; compare solutions and residual norms.
The least-squares problem is:
\[ \min_{w \in \mathbb{R}^d} \|y - Xw\|_2^2. \]
Taking the gradient and setting to zero yields the normal equations:
\[ X^\top X w = X^\top y. \]
When $X \in \mathbb{R}^{n \times d}$ has full column rank ($\text{rank}(X) = d$), $X^\top X$ is invertible, and the unique solution is:
\[ \hat{w} = (X^\top X)^{-1} X^\top y. \]
Alternatively, via the SVD $X = U \Sigma V^\top$, the pseudoinverse solution is:
\[ \hat{w} = X^+ y = V \Sigma^{-1} U^\top y. \]
Both methods yield the same $\hat{w}$ when $X$ is full-rank and well-conditioned. The residual $r = y - X\hat{w}$ satisfies the orthogonality condition $X^\top r = 0$ (residual orthogonal to column space), and its norm is uniquely determined by the projection:
\[ \|r\|_2^2 = \|y\|_2^2 - \|X\hat{w}\|_2^2. \]
Notation: - $X \in \mathbb{R}^{n \times d}$: design matrix (rows = examples, columns = features) - $y \in \mathbb{R}^n$: target vector - $w, \hat{w} \in \mathbb{R}^d$: weight vectors - $\|\cdot\|_2$: Euclidean norm - $X^\top$: transpose (not $X^T$) - $X^+$: Moore-Penrose pseudoinverse
import numpy as np
from scripts.toy_data import toy_linreg
X, y, _ = toy_linreg(n=40, d=3, seed=5)
w1 = np.linalg.lstsq(X, y, rcond=None)[0]
w2 = np.linalg.solve(X.T@X, X.T@y)
r1 = y - X@w1
r2 = y - X@w2
print("||w1-w2||:", np.linalg.norm(w1-w2))
print("||r1||, ||r2||:", np.linalg.norm(r1), np.linalg.norm(r2))This snippet compares two methods for solving the least-squares problem $\min_w \|y - Xw\|_2^2$: the numerically stable SVD-based lstsq and the direct normal equations via solve. Both compute optimal weights $\hat{w} \in \mathbb{R}^3$ for the regression problem $X \in \mathbb{R}^{40 \times 3}$, $y \in \mathbb{R}^{40}$, but differ in numerical robustness and computational cost.
The code generates synthetic regression data via toy_linreg(n=40, d=3, seed=5), yielding a design matrix $X \in \mathbb{R}^{40 \times 3}$ and targets $y \in \mathbb{R}^{40}$. Two methods compute the least-squares solution: Method 1 (SVD-based) via w1 = np.linalg.lstsq(X, y, rcond=None)[0] factorizes $X = U \Sigma V^\top$ and computes $\hat{w} = V \Sigma^{-1} U^\top y$ with condition number $\kappa(X) = \sigma_1 / \sigma_d$ (no squaring), costing $O(nd^2)$ but handling rank-deficient and ill-conditioned systems robustly. Method 2 (Normal equations) via w2 = np.linalg.solve(X.T @ X, X.T @ y) forms the Gram matrix $X^\top X \in \mathbb{R}^{3 \times 3}$ explicitly and solves $(X^\top X) w = X^\top y$ via Cholesky with condition number $\kappa(X^\top X) = [\kappa(X)]^2$ (squared conditioning), costing comparable $O(nd^2)$ but losing accuracy when $\kappa(X) > 10^6$. Both are mathematically equivalent when $X$ has full column rank, deriving from the gradient optimality condition $\nabla_w \|y - Xw\|_2^2 = -2 X^\top (y - Xw) = 0$, which yields $X^\top X w = X^\top y$ with unique solution $\hat{w} = (X^\top X)^{-1} X^\top y$ (assuming invertibility).
The code computes residuals $r_1 = y - X w_1$ and $r_2 = y - X w_2$ and checks: (1) Weight agreement via np.linalg.norm(w1 - w2) should be near machine precision ($\sim 10^{-15}$) for well-conditioned $X$, confirming both methods converge to the same solution; (2) Residual norms via np.linalg.norm(r1) and np.linalg.norm(r2) should be identical, verifying optimality. The near-zero weight difference confirms algorithmic equivalence. Identical residual norms follow from the normal equations $X^\top X \hat{w} = X^\top y$ (first-order optimality), which implies $X^\top r = X^\top (y - X\hat{w}) = 0$ (residual orthogonal to column space). Since the residual norm is uniquely determined by projection geometry $\|r\|_2^2 = \|y\|_2^2 - \|X\hat{w}\|_2^2$, and $X\hat{w}$ is the orthogonal projection of $y$ onto $\text{Col}(X)$, both methods yield the same $\|r\|_2$ by construction, independent of solver choice.
For well-conditioned $X$ (typical $\kappa(X) \sim 10â100$ in toy_linreg), both methods agree within roundoff error. Use lstsq (SVD-based) when: (1) $X$ is ill-conditioned ($\kappa(X) > 10^6$)ânormal equations lose accuracy due to condition number squaring; (2) $X$ is rank-deficient ($\text{rank}(X) < d$)âlstsq returns minimum-norm solution, normal equations fail; (3) Interpretability mattersâSVD provides singular values for diagnosing collinearity; (4) Robustness is criticalâSVD handles edge cases gracefully. Use normal equations when: (1) $X$ is well-conditioned ($\kappa(X) \ll 10^6$) and full-rank; (2) Speed is essentialâCholesky on $X^\top X$ is ~2x faster for small $d < 100$; (3) Iterative refinement availableâhigher precision mitigates conditioning issues. Default to lstsq unless profiling reveals bottlenecks and $\kappa(X) < 10^4$ is verified. 5. Ã
. Björck. Numerical Methods for Least Squares Problems. SIAM, 1996. 6. N. J. Higham. Accuracy and Stability of Numerical Algorithms (2nd ed.). SIAM, 2002. 7. C. L. Lawson and R. J. Hanson. Solving Least Squares Problems. SIAM, 1995 (reprint of 1974 edition). 8. J. W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
Step-by-step execution of the least-squares comparison:
Generate synthetic regression data:
X, y, _ = toy_linreg(n=40, d=3, seed=5)creates a well-conditioned design matrix $X \in \mathbb{R}^{40 \times 3}$ and targets $y \in \mathbb{R}^{40}$ with known ground-truth weights (returned as the third element, ignored here).Solve via SVD-based lstsq:
w1 = np.linalg.lstsq(X, y, rcond=None)[0]computes the pseudoinverse solution $\hat{w}_1 = X^+ y = V \Sigma^{-1} U^\top y$ via economy SVD ($X = U \Sigma V^\top$ with $U \in \mathbb{R}^{40 \times 3}$, $\Sigma \in \mathbb{R}^{3 \times 3}$, $V \in \mathbb{R}^{3 \times 3}$). The[0]index extracts the weight vector from the tuple(solution, residuals, rank, singular_values).rcond=Noneuses machine precision ($\sim 10^{-15}$) as the singular value cutoff.Solve via normal equations:
w2 = np.linalg.solve(X.T @ X, X.T @ y)first forms the Gram matrix $X^\top X \in \mathbb{R}^{3 \times 3}$ (cost $O(40 \cdot 3^2) = O(360)$ flops) and moment vector $X^\top y \in \mathbb{R}^3$ (cost $O(40 \cdot 3) = O(120)$ flops), then solves the symmetric positive-definite system via Cholesky decomposition (cost $O(3^3 / 3) = O(9)$ flops). Total cost: $O(nd^2) \approx O(360)$ flops, comparable to SVD for small $d$.Compute residuals:
r1 = y - X @ w1andr2 = y - X @ w2evaluate the prediction errors $r_i = y - X\hat{w}_i \in \mathbb{R}^{40}$ for both methods. Each matrix-vector product costs $O(40 \cdot 3) = O(120)$ flops.Validate agreement: Print
||w1 - w2||_2(should be $\sim 10^{-15}$ for well-conditioned $X$) and||r1||_2, ||r2||_2(should be identical to $\sim 10$ significant digits). Large weight differences indicate ill-conditioning; any residual norm discrepancy beyond roundoff error ($\sim 10^{-14}$) signals a bug.Interpretation: For well-conditioned $X$ (typical $\kappa(X) \sim 10â100$ for
toy_linreg), both methods agree. To observe divergence, generate ill-conditioned data:X_ill = np.random.randn(40, 3); X_ill[:, 2] = X_ill[:, 0] + 1e-8 * X_ill[:, 1](near-collinear columns) and re-runâ||w1 - w2||will grow to $10^{-7}$ or larger.
- Algorithmic equivalence vs. numerical stability: Both
lstsq(SVD-based) andsolve(X.T @ X, X.T @ y)(normal equations) compute the same optimal weights $\hat{w}$ when $X$ is well-conditioned, but differ in how they handle ill-conditioningâlstsqpreserves $\kappa(X)$ via orthogonal factorizations, while normal equations square it to $\kappa(X^\top X) = [\kappa(X)]^2$, amplifying roundoff errors. - Residual norm uniqueness: The residual norm $\|r\|_2 = \|y - X\hat{w}\|_2$ is uniquely determined by the projection geometry (orthogonal projection of $y$ onto $\text{Col}(X)$), independent of the solver usedâboth methods yield identical $\|r_1\|_2 = \|r_2\|_2$ within machine precision.
- Weight agreement as a conditioning diagnostic: The difference $\|w_1 - w_2\|_2$ quantifies numerical stabilityânear-zero ($\sim 10^{-15}$) indicates well-conditioned $X$; larger values signal ill-conditioning or rank deficiency, where normal equations lose accuracy.
- When to use each method:
lstsqis the default choice for robustness (handles rank deficiency, ill-conditioning, and edge cases); normal equations are acceptable only for well-conditioned $X$ ($\kappa(X) \ll 10^6$) and small $d$, where Cholesky on $X^\top X$ offers a ~2x speedup. - Connection to ML workflows: This comparison generalizes to all least-squares variantsâridge regression $(X^\top X + \lambda I) w = X^\top y$, weighted least squares $X^\top W X w = X^\top W y$, and iterative refinement all inherit the same numerical trade-offs between direct (Cholesky/LU) and orthogonal (QR/SVD) factorizations.
- Pedagogical baseline for solver choice: The pattern âverify conditioning â choose solver â validate residualsâ establishes a reusable mental model for all overdetermined linear systems in ML (e.g., feature engineering, polynomial regression, transfer learning with frozen representations).
Part 1: Two Algorithmic Paths to the Same Solution
The code generates synthetic regression data via toy_linreg(n=40, d=3, seed=5), yielding a design matrix $X \in \mathbb{R}^{40 \times 3}$ and targets $y \in \mathbb{R}^{40}$. Two methods compute the least-squares solution: Method 1 (SVD-based) via lstsq factorizes $X = U \Sigma V^\top$ and computes $\hat{w} = V \Sigma^{-1} U^\top y$ with condition number $\kappa(X) = \sigma_1 / \sigma_d$ (no squaring), costing $O(nd^2)$ but handling rank-deficient and ill-conditioned systems robustly. Method 2 (Normal equations) via solve(X.T @ X, X.T @ y) forms the Gram matrix explicitly and solves $(X^\top X) w = X^\top y$ via Cholesky with condition number $\kappa(X^\top X) = [\kappa(X)]^2$ (squared conditioning), costing comparable $O(nd^2)$ but losing accuracy when $\kappa(X) > 10^6$. Both are mathematically equivalent when $X$ has full column rank, deriving from the gradient optimality condition $\nabla_w \|y - Xw\|_2^2 = -2 X^\top (y - Xw) = 0$, which yields $X^\top X w = X^\top y$ with unique solution $\hat{w} = (X^\top X)^{-1} X^\top y$ (assuming invertibility).
Part 2: Verifying Solution Agreement
The code computes residuals $r_1 = y - X w_1$ and $r_2 = y - X w_2$ and checks: (1) Weight agreement via np.linalg.norm(w1 - w2) should be near machine precision ($\sim 10^{-15}$) for well-conditioned $X$, confirming both methods converge to the same solution; (2) Residual norms via np.linalg.norm(r1) and np.linalg.norm(r2) should be identical, verifying optimality. The near-zero weight difference confirms algorithmic equivalence. Identical residual norms follow from the normal equations $X^\top X \hat{w} = X^\top y$ (first-order optimality), which implies $X^\top r = X^\top (y - X\hat{w}) = 0$ (residual orthogonal to column space). Since the residual norm is uniquely determined by projection geometry $\|r\|_2^2 = \|y\|_2^2 - \|X\hat{w}\|_2^2$, and $X\hat{w}$ is the orthogonal projection of $y$ onto $\text{Col}(X)$, both methods yield the same $\|r\|_2$ by construction, independent of solver choice.
Part 3: Conditioning Diagnostics and Failure Modes
For well-conditioned $X$ (typical $\kappa(X) \sim 10â100$ in toy_linreg), both methods agree within roundoff error. To observe divergence, generate ill-conditioned data with near-collinear columns: X_ill = np.random.randn(40, 3); X_ill[:, 2] = X_ill[:, 0] + 1e-8 * X_ill[:, 1] yields $\kappa(X_{\text{ill}}) \sim 10^8$, causing ||w1 - w2|| to grow to $10^{-7}$ or larger (normal equations lose accuracy). For rank-deficient $X$ (e.g., X[:, 2] = X[:, 0] + X[:, 1] exactly), lstsq returns the minimum-norm solution via pseudoinverse, while solve(X.T @ X, X.T @ y) fails with LinAlgError: Singular matrix. Conditioning verification before choosing solver: cond_X = np.linalg.cond(X); assert cond_X < 1e6, "Use lstsq" checks if normal equations are safe. The condition number squaring $\kappa(X^\top X) = [\kappa(X)]^2$ amplifies roundoff errors quadratically: if $\kappa(X) = 10^6$, then $\kappa(X^\top X) = 10^{12}$, losing ~6 digits of precision (roundoff error $\sim \kappa \cdot \epsilon_{\text{mach}} \approx 10^{12} \cdot 10^{-16} = 10^{-4}$).
Why This Matters for ML
Use lstsq (SVD-based) when: (1) $X$ is ill-conditioned ($\kappa(X) > 10^6$)ânormal equations lose accuracy due to condition number squaring; (2) $X$ is rank-deficient ($\text{rank}(X) < d$)âlstsq returns minimum-norm solution, normal equations fail; (3) Interpretability mattersâSVD provides singular values for diagnosing collinearity; (4) Robustness is criticalâSVD handles edge cases gracefully. Use normal equations when: (1) $X$ is well-conditioned ($\kappa(X) \ll 10^6$) and full-rank; (2) Speed is essentialâCholesky on $X^\top X$ is ~2x faster for small $d < 100$; (3) Iterative refinement availableâhigher precision mitigates conditioning issues. Default to lstsq unless profiling reveals bottlenecks and $\kappa(X) < 10^4$ is verified. Ridge regression bridges both: $(X^\top X + \lambda I) \hat{w}_\lambda = X^\top y$ with $\lambda > 0$ ensures invertibility (even if $X^\top X$ singular) and improves conditioning $\kappa(X^\top X + \lambda I) \le \sigma_1^2 / \lambda$, stabilizing normal equations for moderately ill-conditioned $X$. QR decomposition alternative: Factoring $X = QR$ avoids forming $X^\top X$ while maintaining $\kappa(R) = \kappa(X)$ (no squaring), matching SVD stability at ~2-3x speedup (scikit-learnâs LinearRegression uses QR internally).
ML Examples and Patterns
Gradient descent for least squares: Iteratively minimizing $\|y - Xw\|_2^2$ via $w_{t+1} = w_t - \eta (X^\top X w_t - X^\top y)$ recovers normal equations at convergence ($\nabla_w f = 0$). Preferred for very large $n, d$ where direct methods cost $O(nd^2 + d^3)$ vs. $O(Tnd)$ for $T$ iterations. SGD: Minibatch $\mathcal{B}$ of size $b \ll n$ costs $O(bd^2)$ per step, enabling billions of examples. Ridge regression (L2 regularization): Modifies normal equations to $(X^\top X + \lambda I) \hat{w}_\lambda = X^\top y$, ensuring invertibility and improving conditioning. SVD-based solution $\hat{w}_\lambda = \sum_i \frac{\sigma_i}{\sigma_i^2 + \lambda} (u_i^\top y) v_i$ shrinks small singular values more (damping). Weighted least squares: Heteroscedastic errors introduce weights $W = \text{diag}(w_1, \ldots, w_n)$ modifying to $X^\top W X \hat{w} = X^\top W y$, solved via lstsq on transformed system $\tilde{X} = W^{1/2} X$. Polynomial regression: Features $[x, x^2, x^3]$ create ill-conditioned $X$ ($\kappa \sim 10^6$)âsolutions: orthogonal polynomials (Legendre/Chebyshev), feature scaling, ridge regularization. Iterative solvers (conjugate gradient): For large $d \sim 10^6$, CG solves $X^\top X w = X^\top y$ iteratively without forming $X^\top X$, but convergence depends on $\kappa(X^\top X) = [\kappa(X)]^2$ (preconditioning essential).
Connection to Linear Algebra Theory
Normal equations derivation: Minimizing $f(w) = \|y - Xw\|_2^2 = y^\top y - 2 y^\top X w + w^\top X^\top X w$ yields gradient $\nabla_w f = -2 X^\top y + 2 X^\top X w$, setting to zero gives $X^\top X w = X^\top y$. Uniqueness: If $\text{rank}(X) = d$, then $X^\top X$ is positive definite (for $v \neq 0$, $v^\top X^\top X v = \|Xv\|_2^2 > 0$), ensuring invertibility and unique solution $\hat{w} = (X^\top X)^{-1} X^\top y$. If $\text{rank}(X) < d$, $X^\top X$ is singular with infinitely many solutions, and pseudoinverse $X^+ = V \Sigma^+ U^\top$ selects minimum-norm $\hat{w}_{\text{min}} = X^+ y$. Projection interpretation: The least-squares solution $\hat{y} = X\hat{w} = X(X^\top X)^{-1} X^\top y = P_X y$ is the orthogonal projection of $y$ onto $\text{Col}(X)$ with projection matrix $P_X$. Residual $r = (I - P_X) y$ projects onto $\text{Col}(X)^\perp$, explaining $X^\top r = 0$ (orthogonality) and uniqueness of $\|r\|_2$. SVD perspective: With $X = U \Sigma V^\top$, pseudoinverse $X^+ = V \Sigma^{-1} U^\top$ gives $\hat{w} = \sum_{i=1}^{r} \frac{u_i^\top y}{\sigma_i} v_i$ (sum over $r = \text{rank}(X)$), automatically regularizing rank-deficient systems. Moore-Penrose properties: $X^+$ satisfies four axioms including $(X X^+)^\top = X X^+$ (symmetry of column space projection) and $(X^+ X)^\top = X^+ X$ (row space projection); for full-rank $X$, $X^+ = (X^\top X)^{-1} X^\top$ (right inverse).
Numerical and Implementation Notes
Shape discipline: $X \in \mathbb{R}^{n \times d}$ ($n=40$, $d=3$), $y \in \mathbb{R}^n$, $w_1, w_2 \in \mathbb{R}^d$, $r_1, r_2 \in \mathbb{R}^n$, $X^\top X \in \mathbb{R}^{d \times d}$ ($3 \times 3$), $X^\top y \in \mathbb{R}^d$. Gotchas: (1) lstsq returns tuple (solution, residuals, rank, singular_values)âuse [0] to extract $\hat{w}$; residuals (index [1]) contains $\|r\|_2^2$ only if $n > d$ and full rank, else empty. (2) rcond=None uses machine precision ($\sim 10^{-15}$) cutoff; for rank-deficient/noisy systems set rcond=1e-10 to truncate small singular values. (3) Normal equations fail if $\text{rank}(X) < d$ (LinAlgError: Singular matrix)âlstsq handles gracefully via pseudoinverse. (4) Conditioning check before using normal equations: cond_X = np.linalg.cond(X); assert cond_X < 1e10. (5) Memory: Forming $X^\top X$ requires $O(d^2)$ memory; for $d \sim 10^4$, thatâs 400 MB vs. 80 MB for $X$ with $n=10^6$. (6) Iterative refinement: If normal equations moderately ill-conditioned, refine via $w_0 = \text{solve}(X^\top X, X^\top y)$; $r_0 = y - X w_0$; $dw = \text{solve}(X^\top X, X^\top r_0)$; $w_1 = w_0 + dw$ (recovers ~8-10 lost digits).
Numerical and Shape Notes
Verification checks: (1) Weight agreement ||w1 - w2||_2 should be $\sim 10^{-15}$ for well-conditioned $X$. (2) Residual norms ||r1||_2 and ||r2||_2 should match to $\sim 10$ significant digits. (3) Normal equations satisfied: ||X.T @ X @ w1 - X.T @ y||_2 should be $\sim 10^{-14}$. (4) Orthogonality: ||X.T @ r1||_2 should be $\sim 10^{-14}$ (residual orthogonal to column space). (5) Condition number squaring: cond(X.T @ X) / cond(X)**2 should be $\approx 1.0$. (6) QR agreement: Q, R = np.linalg.qr(X); w3 = np.linalg.solve(R, Q.T @ y) yields ||w1 - w3||_2 \sim 10^{-15}. Cost analysis: SVD-based lstsq costs $O(nd^2)$ for economy SVD ($n \gg d$); normal equations cost $O(nd^2)$ for $X^\top X$ formation + $O(d^3/3)$ for Cholesky. For small $d < 100$, Cholesky is ~2x faster; for large $d > 10^4$, both are $O(nd^2)$-dominated. Tolerance: Machine epsilon $\epsilon_{\text{mach}} \approx 2.22 \times 10^{-16}$ (float64); roundoff error in least squares $\sim \kappa(X) \cdot \epsilon_{\text{mach}}$; for $\kappa(X) = 10^6$, expect ~10 lost digits ($10^{-16} \times 10^6 = 10^{-10}$).
Pedagogical Significance
This example is the canonical comparison of least-squares solution methods, establishing: (1) Both methods are mathematically equivalent (normal equations âï¸ SVD) but differ numerically. (2) lstsq is the default choice (handles ill-conditioning, rank deficiency, edge cases robustly). (3) Normal equations are faster but fragile (only for well-conditioned $X$, $\kappa \ll 10^6$, small $d$). (4) Condition number squaring is real (forming $X^\top X$ amplifies numerical errors quadratically). (5) Residual norms are unique (determined by projection geometry, independent of solver). Common misconceptions addressed: âNormal equations are always fineâ (Noâfail for rank-deficient $X$, lose accuracy for ill-conditioned $X$). âlstsq is slowerâ (Only ~2x for small $d$, stability gain worth it). âWeights can differâ (Only due to roundoffâ$\sim 10^{-15}$ for well-conditioned $X$). âResiduals differ if weights differâ (Noânorms uniquely determined by projection). Connection to other examples: Example 70 (both satisfy $X^\top r = 0$), Example 71 (minimum-norm for rank-deficient $X$), Example 74 (condition number squaring), Example 75 (SVD connects to PCA). Why pedagogically powerful: Demonstrates that algorithmic choices matterâtransition from $X w \approx y$ to $(X^\top X) w = X^\top y$ is algebraically trivial but numerically catastrophic for ill-conditioned systems. Teaches that lstsq is the numerically principled approach preserving conditioning via SVD/QR. Pattern generalizes: always prefer methods avoiding explicit Gram matrix formation unless conditioning verified.
Historical Foundations
The method of least squares emerged independently from Adrien-Marie Legendreâs 1805 publication Nouvelles m pour la d des orbites des com and Carl Friedrich Gaussâs 1809 work Theoria motus corporum coelestium, where Gauss famously applied least squares to predict the orbit of the asteroid Ceres from sparse observations. Gauss claimed to have used the method as early as 1795, establishing it as the foundation of parametric statistical inference. The normal equations $X^\\top X w = X^\\top y$ derive their name from the geometric property that the residual $r = y - X\\hat{w}$ is orthogonal (normal) to the column space of $X$, a fact recognized by early geometers. For 150 years, the normal equations were solved via explicit matrix inversion or Gaussian elimination, with little attention to numerical stability tables were computed by hand, and condition numbers were rarely calculated. The numerical crisis emerged in the 1960s when digital computers revealed catastrophic errors in ill-conditioned problems: Wilkinsonâs 1963 work Rounding Errors in Algebraic Processes exposed the dangers of forming $X^\\top X$ explicitly, showing that condition number squaring $\\kappa(X^\\top X) = [\\kappa(X)]^2$ amplifies roundoff errors quadratically. Alston Householder (1958) and Gene Golub (1965) pioneered QR decomposition as a numerically stable alternative, factorizing $X = QR$ to solve $R \\hat{w} = Q^\\top y$ without forming $X^\\top X$. Golub and Kahanâs 1965 algorithm for computing the SVD established pseudoinverse-based least squares ($\\hat{w} = X^+ y$) as the gold standard for rank-deficient and ill-conditioned systems, laying the groundwork for modern numerical linear algebra. Today, numpy.linalg.lstsq implements SVD-based least squares by default, while scipy.linalg.lstsq offers QR via LAPACKâs DGELSD routine.
Modern Applications in ML
Least squares remains the prototype optimization problem in machine learning, serving as the foundation for countless algorithms. In linear regression, lstsq fits models $y \\approx X w$ for feature engineering (polynomial basis expansions, interaction terms, radial basis functions), where ill-conditioning from collinear features or high-degree polynomials necessitates robust solvers. Ridge regression $(X^\\top X + \\lambda I) \\hat{w}_\\lambda = X^\\top y$ stabilizes the normal equations by adding $\\lambda I$, improving conditioning from $\\kappa(X^\\top X) = [\\kappa(X)]^2$ to $\\kappa(X^\\top X + \\lambda I) \\le \\sigma_1^2 / \\lambda$ is the L2 regularization ubiquitous in neural networks (weight decay), SVMs (soft-margin), and Gaussian processes (posterior mean). In Gaussian process regression, the posterior mean $\\mu(x_*) = k_* (K + \\sigma^2 I)^{-1} y$ solves a regularized least-squares problem where the kernel matrix $K$ is often ill-conditioned (near-duplicate training points) decomposition of $K + \\sigma^2 I$ (positive definite by construction) is the standard approach, but SVD-based methods are preferred for diagnosing numerical issues. Weighted least squares $\\min_w \\|W^{1/2} (y - Xw)\\|_2^2$ with weights $W = \\text{diag}(w_1, \\ldots, w_n)$ models heteroscedastic noise (variance depends on $i$), appearing in robust regression (iteratively reweighted least squares, IRLS), generalized linear models (GLMs), and sensor fusion (Kalman filtering with measurement noise covariances). In randomized linear algebra, sketching methods (e.g., count-sketch, Johnson-Lindenstrauss) approximate $X$ with a smaller matrix $\\tilde{X} \\in \\mathbb{R}^{s \\times d}$ ($s \\ll n$) and solve $\\min_w \\|\\tilde{X} w - \\tilde{y}\\|_2^2$ for $O(sd^2)$ cost instead of $O(nd^2)$, enabling least squares on massive datasets ($n \\sim 10^9$). Iterative refinement (conjugate gradient, LSQR, LSMR) avoids forming $X^\\top X$ entirely, computing matrix-vector products $X^\\top (X w_t - y)$ at each step for sparse $X$ (e.g., document-term matrices in NLP, adjacency matrices in graph learning) where $X^\\top X$ loses sparsity. Transfer learning with frozen representations solves $\\min_w \\|y - \\Phi(X) w\\|_2^2$ where $\\Phi$ is a pretrained neural network (e.g., BERT embeddings, ResNet features), and $\\Phi(X) \\in \\mathbb{R}^{n \\times d}$ with $d \\sim 10^3$ is typically well-conditioned, making normal equations viable. The Gauss-Newton algorithm for nonlinear least squares (e.g., neural network training before SGD) linearizes $f(w)$ around $w_t$ and solves $\\min_{\\Delta w} \\|J_t \\Delta w - r_t\\|_2^2$ via QR or SVD at each iteration, where $J_t$ is the Jacobian used in robotics (bundle adjustment, simultaneous localization and mapping) and scientific computing (inverse problems, parameter estimation).
Example 70 (Orthogonality and projections): Both $w_1$ and $w_2$ satisfy the orthogonality condition $X^\top r = 0$ (residual orthogonal to column space), confirming that $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{Col}(X)$. The uniqueness of $\|r\|_2$ follows from the projection theorem.
Example 71 (Rank and nullspace): When $X$ is rank-deficient ($\text{rank}(X) < d$),
lstsqreturns the minimum-norm solution $\hat{w}_{\text{min}} = X^+ y$ (lying in the row space, orthogonal to the nullspace), whilesolve(X.T @ X, X.T @ y)fails withLinAlgError: Singular matrix. This highlights the robustness advantage of SVD-based methods.Example 74 (SVD and conditioning): The condition number squaring $\kappa(X^\top X) = [\kappa(X)]^2$ explains why normal equations lose accuracy for ill-conditioned $X$. If $\kappa(X) = 10^6$, then $\kappa(X^\top X) = 10^{12}$, causing normal equations to lose ~6 digits of precision (since machine epsilon $\epsilon_{\text{mach}} \approx 10^{-16}$, and roundoff error scales as $\kappa \cdot \epsilon_{\text{mach}} \approx 10^{12} \cdot 10^{-16} = 10^{-4}$).
Example 75 (PCA and explained variance): The SVD factorization $X = U \Sigma V^\top$ used by
lstsqsimultaneously solves least squares and provides the singular values for PCA diagnostics (variance explained, effective rank). The connection is: PCAâs first principal component maximizes variance âï¸ least squares minimizes residual norm (dual optimization problems).Example 77 (Ridge regression): Adding $\lambda I$ to the normal equations $(X^\top X + \lambda I) w = X^\top y$ improves conditioning ($\kappa(X^\top X + \lambda I) \le \sigma_1^2 / \lambda$), making normal equations viable even for moderately ill-conditioned $X$. The SVD-based ridge solution $\hat{w}_\lambda = \sum_i \frac{\sigma_i}{\sigma_i^2 + \lambda} (u_i^\top y) v_i$ provides better numerical stability and interpretability (singular value damping).
Example 78 (Gradient descent for least squares): Iterative methods like gradient descent avoid forming $X^\top X$ entirely, instead computing $X^\top (X w_t - y)$ at each step (cost $O(nd)$ per iteration). Convergence depends on $\kappa(X^\top X) = [\kappa(X)]^2$, requiring preconditioning for ill-conditioned systems.
Chapter 13 (Solving systems): Least squares is a special case of overdetermined linear systems ($n > d$). The choice between direct methods (Cholesky, QR, SVD) and iterative methods (conjugate gradient, LSQR) follows the same conditioning trade-offs demonstrated here.
Comments