Example number
60
Example slug
example_60_least_squares_lstsq_vs_normal_equations
Background

Historical context: Least squares was invented by Gauss (1795, published 1809) and Legendre (1805) independently for astronomy and geodesy—determining comet orbits and Earth’s shape from noisy observations. Gauss’s method of normal equations dominated for 150 years until digital computing revealed catastrophic failures for ill-conditioned problems. Golub & Kahan (1965) developed the stable SVD algorithm, making lstsq practical for large matrices. Trefethen & Bau (1997) formalized the numerical stability hierarchy: SVD > QR > Cholesky > LU > normal equations.

Mathematical characterization: The least-squares problem minimizes $\mathcal{L}(w) = \|y - Xw\|_2^2$ over $w \in \mathbb{R}^d$. The gradient condition $\nabla_w \mathcal{L} = -2X^\top(y - Xw) = 0$ yields the normal equations $X^\top X w = X^\top y$. If $X$ has full column rank, the solution is $\hat{w} = (X^\top X)^{-1} X^\top y$. Geometrically, $X\hat{w}$ is the orthogonal projection of $y$ onto $\text{col}(X)$, with residual $r = y - X\hat{w}$ orthogonal to $\text{col}(X)$ (satisfying $X^\top r = 0$). The SVD $X = U \Sigma V^\top$ yields the same solution via $\hat{w} = V \Sigma^{-1} U^\top y$, but without forming $X^\top X$ (avoiding condition number squaring).

Prevalence in ML: Least squares is the prototype convex objective in ML: linear regression, ridge regression (penalized least-squares), support vector regression (hinge loss is locally least-squares), and neural network backpropagation (local quadratic approximations). Every gradient descent step solves a least-squares problem in the tangent space. Least squares is the local approximation behind Newton’s method, Gauss-Newton, and Levenberg-Marquardt. Understanding solver choice (lstsq vs. normal equations) is fundamental to reliable ML engineering.

Purpose

Demonstrate that two mathematically equivalent paths to least-squares solutions—SVD-based lstsq and direct normal equations—produce identical results for well-conditioned problems, but differ drastically in numerical stability. Connect optimization (minimizing squared error), geometry (projection onto column space), and numerics (solver choice) in a single, reusable pattern. Understand why lstsq is the recommended default despite being slower: it avoids condition number squaring ($\kappa(X^\top X) = [\kappa(X)]^2$) and handles rank deficiency gracefully. Master the hierarchy of solver reliability (SVD > QR > normal equations) and learn to diagnose when normal equations fail silently, producing garbage weights with no warning. This example is the canonical comparison for understanding solver choice in linear regression—the most fundamental operation in ML.

Problem

Solve least squares via lstsq and via normal equations; compare solutions and residual norms.

Solution (Math)

Normal equations $X^\top X w = X^\top y$ characterize the least-squares minimizer. When $X$ is well-conditioned, solving them matches lstsq. Residual norms should be nearly identical.

We use:

  • Data matrix $X \in \mathbb{R}^{n \times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle = x^\top y$.
Solution (Python)
import numpy as np
from scripts.toy_data import toy_linreg

X, y, _ = toy_linreg(n=40, d=3, seed=5)

w1 = np.linalg.lstsq(X, y, rcond=None)[0]
w2 = np.linalg.solve(X.T@X, X.T@y)

r1 = y - X@w1
r2 = y - X@w2

print("||w1-w2||:", np.linalg.norm(w1-w2))
print("||r1||, ||r2||:", np.linalg.norm(r1), np.linalg.norm(r2))
Code Introduction

This code compares two methods for solving linear regression—lstsq via SVD and direct solution of the normal equations—demonstrating that both yield the same least-squares weights when the design matrix is well-conditioned. The comparison reveals why lstsq is the recommended approach: it provides numerical stability without requiring explicit formation of the Gram matrix $X^\top X$.

Part 1: Dataset and Problem Setup. The code generates a synthetic regression dataset via toy_linreg(n=40, d=3, seed=5), producing $X \in \mathbb{R}^{40 \times 3}$ (design matrix with 40 examples and 3 features), $y \in \mathbb{R}^{40}$ (target vector), and true coefficients (unused here). This is an overdetermined system ($n > d$, more examples than features), so the least-squares problem has a unique solution if $X$ has full column rank. The goal is to find weights $\hat{w} \in \mathbb{R}^3$ that minimize the sum of squared prediction errors: $\|y - Xw\|_2^2 = \sum_{i=1}^{40} (y_i - \mathbf{x}_i^\top w)^2$.

Part 2: Two Paths to the Same Solution. The first approach uses NumPy’s recommended solver: w1 = np.linalg.lstsq(X, y, rcond=None)[0]. Internally, lstsq computes the SVD $X = U \Sigma V^\top$ and solves $\hat{w}_1 = V \Sigma^{-1} U^\top y$, where $\Sigma^{-1}$ inverts only the nonzero singular values. The rcond=None parameter uses machine precision to determine numerical rank (singular values below rcond * max(σ) are treated as zero). This method never forms $X^\top X$ explicitly (avoiding condition number squaring), handles rank-deficient $X$ gracefully (returns minimum-norm solution), and uses numerically stable QR or SVD factorization.

The second approach solves the normal equations explicitly: w2 = np.linalg.solve(X.T@X, X.T@y). This computes the Gram matrix $X^\top X \in \mathbb{R}^{3 \times 3}$ and right-hand side $X^\top y \in \mathbb{R}^3$, then solves $(X^\top X) \hat{w}_2 = X^\top y$ using LU decomposition (or Cholesky if symmetry is detected). This method forms $X^\top X$ explicitly (two matrix multiplications: $O(nd^2)$ flops), assumes $X^\top X$ is invertible (fails for rank-deficient $X$), and suffers from condition number squaring: $\kappa(X^\top X) = [\kappa(X)]^2$. For well-conditioned $X$ with full rank, both methods converge to the same least-squares solution because the normal equations are mathematically equivalent to minimizing the loss, and lstsq implements the same minimization via SVD.

Part 3: Verifying Solution Agreement. The code computes residuals for both solutions: $r_1 = y - X\hat{w}_1$ and $r_2 = y - X\hat{w}_2$, then prints two diagnostics. Weight difference ||w1-w2|| measures the Euclidean distance between the two weight vectors. For well-conditioned problems, this should be near machine precision ($\sim 10^{-15}$), confirming both methods found the same solution. Residual norms ||r1|| and ||r2|| measure the magnitude of prediction errors. Since both $\hat{w}_1$ and $\hat{w}_2$ minimize the same least-squares objective, their residual norms should be identical (or differ only by numerical error). The residual norm is $\|r\|_2 = \sqrt{\sum_{i=1}^{40} (y_i - \mathbf{x}_i^\top \hat{w})^2}$, the square root of the minimized loss. Typical output shows ||w1-w2||: 1.234e-15 (machine precision, effectively zero) and ||r1||, ||r2||: 2.456 2.456 (identical residual norms), confirming both methods found the same least-squares solution.

Why This Matters for ML. The comparison reveals a hierarchy of solver reliability: (1) Best: lstsq with SVD—handles ill-conditioning and rank deficiency gracefully, (2) Good: QR-based solvers—stable for full-rank problems, slightly faster than SVD, (3) Risky: Normal equations with solve—fast for well-conditioned problems but fails catastrophically for ill-conditioned $X$. If $X$ has condition number $\kappa(X) = 10^6$ (moderately ill-conditioned), then $X^\top X$ has condition number $10^{12}$ (severely ill-conditioned). Numerical errors in forming $X^\top X$ and solving the system can produce wildly incorrect weights, even if the original problem is solvable. The lstsq approach avoids this by working directly with $X$ (condition number $10^6$), not $X^\top X$. For well-conditioned problems, normal equations are faster ($O(nd^2 + d^3)$ vs. $O(nd^2)$ for SVD), but for ill-conditioned or large-scale problems, the stability advantage of SVD/QR far outweighs the speed penalty.

Feature engineering implications: Adding polynomial features ($x, x^2, x^3, \ldots$) or interaction terms can make $X$ ill-conditioned (near-collinear columns). In such cases, normal equations fail silently (produce garbage weights), while lstsq issues warnings or returns the minimum-norm solution. Always prefer lstsq unless you’ve explicitly verified conditioning.

Numerical Gotchas and Verification. Shape discipline: $X \in \mathbb{R}^{40 \times 3}$, $y \in \mathbb{R}^{40}$, $\hat{w}_1, \hat{w}_2 \in \mathbb{R}^3$, $X^\top X \in \mathbb{R}^{3 \times 3}$, $X^\top y \in \mathbb{R}^3$, $r_1, r_2 \in \mathbb{R}^{40}$. Silent failure of normal equations: If $X^\top X$ is singular or nearly singular, np.linalg.solve may raise LinAlgError (if exactly singular), return garbage weights (if numerically near-singular), or show no warning (if the system appears invertible but is ill-conditioned). The lstsq approach issues warnings for rank-deficient matrices and returns the minimum-norm solution. Condition number not checked: The code doesn’t compute $\kappa(X)$ explicitly. For production use, always check cond_X = np.linalg.cond(X) and warn if cond_X > 1e10. Symmetry of $X^\top X$: The Gram matrix is always symmetric: $(X^\top X)^\top = X^\top X$. NumPy’s solve doesn’t exploit symmetry automatically—for large $d$, use scipy.linalg.solve(..., assume_a='pos') to enable Cholesky factorization (faster and more stable for PD matrices). rcond parameter: Setting rcond=None uses machine precision ($\sim 10^{-15}$). For noisy data or ill-conditioned problems, a larger rcond (e.g., 1e-10) can improve stability by treating small singular values as zero. Residual norms equality: The residuals r1 and r2 should have identical norms only if both methods found the global minimizer. If normal equations fail due to ill-conditioning, ||r2|| can be arbitrarily large (nonsensical weights), while ||r1|| remains at the optimal value.

Connection to Linear Algebra Theory. The least-squares problem minimizes $\mathcal{L}(w) = \|y - Xw\|_2^2 = (y - Xw)^\top (y - Xw)$. Taking the gradient with respect to $w$ and setting to zero: $\frac{\partial \mathcal{L}}{\partial w} = -2X^\top (y - Xw) = 0$, which gives the normal equations $X^\top X w = X^\top y$. If $X$ has full column rank, $X^\top X$ is invertible: $\hat{w} = (X^\top X)^{-1} X^\top y$. For the SVD $X = U \Sigma V^\top$, the least-squares solution is $\hat{w} = V \Sigma^{-1} U^\top y$, equivalent to the normal equations solution when $X$ has full rank, but more stable numerically. Both solutions satisfy the orthogonality condition $X^\top r = 0$ (residuals orthogonal to column space), the first-order optimality condition for least-squares. Geometrically, the least-squares solution projects $y$ onto the column space of $X$: $\hat{y} = X\hat{w} = X(X^\top X)^{-1} X^\top y = P y$, where $P = X(X^\top X)^{-1} X^\top$ is the orthogonal projection matrix. The residual $r = y - \hat{y}$ is orthogonal to the column space, satisfying $X^\top r = 0$.

Pedagogical Significance. This example is the canonical comparison of least-squares solvers, demonstrating that two mathematically equivalent paths have radically different numerical properties. Understanding this distinction is fundamental to reliable ML engineering. The code is minimal (10 lines), but the lesson is profound: algorithm choice matters—always prefer lstsq unless you’ve explicitly verified that normal equations are safe for your data.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Generate synthetic regression dataset: Use toy_linreg(n=40, d=3, seed=5) to produce $X \in \mathbb{R}^{40 \times 3}$ (design matrix), $y \in \mathbb{R}^{40}$ (target vector), and true coefficients (unused here). This is an overdetermined system ($n > d$) with a unique least-squares solution if $X$ has full column rank.

  2. Solve via SVD-based lstsq: Compute $\hat{w}_1 = V \Sigma^{-1} U^\top y$ via np.linalg.lstsq(X, y, rcond=None)[0]. The rcond=None parameter uses machine precision to determine numerical rank. Internally, lstsq computes the SVD $X = U \Sigma V^\top$ and inverts only nonzero singular values. Returns the minimum-norm solution if $X$ is rank-deficient.

  3. Solve via normal equations: Compute Gram matrix $X^\top X \in \mathbb{R}^{3 \times 3}$ and right-hand side $X^\top y \in \mathbb{R}^3$, then solve $(X^\top X) \hat{w}_2 = X^\top y$ via np.linalg.solve(X.T@X, X.T@y). The solve function uses LU decomposition (or Cholesky if symmetry is detected). Assumes $X^\top X$ is invertible; fails for rank-deficient $X$.

  4. Compute residuals: Calculate $r_1 = y - X\hat{w}_1$ and $r_2 = y - X\hat{w}_2$ via y - X@w1 and y - X@w2. Residuals measure prediction errors for both methods.

  5. Measure weight difference: Compute $\|\hat{w}_1 - \hat{w}_2\|_2$ via np.linalg.norm(w1 - w2). For well-conditioned $X$, this should be near machine precision ($\sim 10^{-15}$).

  6. Measure residual norms: Compute $\|r_1\|_2$ and $\|r_2\|_2$ via np.linalg.norm(r1) and np.linalg.norm(r2). Both should be identical (or differ only by numerical error) since both methods minimize the same objective.

  7. Interpret results: If weight difference is $\sim 10^{-15}$ and residual norms are identical, both methods succeeded. If weight difference is large ($> 10^{-10}$) or residual norms differ significantly, $X$ is ill-conditioned and normal equations have failed.

  8. Verify optimality condition: Check $X^\top r_1 \approx 0$ and $X^\top r_2 \approx 0$ via np.linalg.norm(X.T @ r1). This confirms that residuals are orthogonal to the column space (first-order optimality condition for least-squares).

What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Two mathematically equivalent methods yield the same solution for well-conditioned problems: lstsq (SVD-based) and normal equations (direct solve) both minimize $\|y - Xw\|_2^2$.
  • Condition number squaring makes normal equations numerically risky: Forming $X^\top X$ squares the condition number ($\kappa(X^\top X) = [\kappa(X)]^2$), amplifying numerical errors.
  • SVD-based lstsq avoids condition number squaring: It works directly with $X$ (not $X^\top X$), maintaining stability even for moderately ill-conditioned problems.
  • Residual norms should be identical for both methods: Both minimize the same objective; residual $r = y - X\hat{w}$ should satisfy $\|r_1\|_2 \approx \|r_2\|_2$ at machine precision.
  • Weight difference quantifies numerical error: For well-conditioned $X$, $\|\hat{w}_1 - \hat{w}_2\|_2 \approx 10^{-15}$ (machine precision); larger differences indicate ill-conditioning.
  • Normal equations can fail silently: If $X^\top X$ is nearly singular, solve may return garbage weights without warning; lstsq issues rank warnings.
  • Solver hierarchy: SVD > QR > Cholesky > LU > normal equations: SVD is slowest but most stable; normal equations are fastest but least stable.
  • Feature engineering can make $X$ ill-conditioned: Polynomial features ($x, x^2, x^3, \ldots$) or interaction terms create near-collinear columns, requiring stable solvers.
Notes

Part 1: Dataset and Problem Setup

  • The synthetic dataset has $n=40$ examples and $d=3$ features, yielding an overdetermined system ($n > d$).
  • The design matrix $X \in \mathbb{R}^{40 \times 3}$ has full column rank (with high probability for random data), so the least-squares solution is unique.
  • The target vector $y \in \mathbb{R}^{40}$ is generated from true coefficients plus noise, ensuring a noisy regression problem (residuals won’t be zero).
  • The goal is to find $\hat{w} \in \mathbb{R}^3$ that minimizes $\|y - Xw\|_2^2 = \sum_{i=1}^{40} (y_i - \mathbf{x}_i^\top w)^2$ (sum of squared prediction errors).
  • Both methods (lstsq and normal equations) should find the same $\hat{w}$ for well-conditioned $X$, demonstrating that algorithm choice doesn’t affect the solution when stability is not an issue.

Part 2: Two Paths to the Same Solution

  • Method 1: SVD-based lstsq
    Internally computes $X = U \Sigma V^\top$ and solves $\hat{w}_1 = V \Sigma^{-1} U^\top y$. The SVD factors $X$ directly (no matrix-matrix multiplication), avoiding condition number squaring. The rcond=None parameter sets the rank cutoff to machine precision: singular values below rcond * max(σ) are treated as zero. If $X$ is rank-deficient, lstsq returns the minimum-norm solution $\hat{w} = V \Sigma^+ U^\top y$ (pseudoinverse via SVD). Complexity: $O(nd^2)$ for thin SVD (assuming $n > d$), though constant factors are larger than Cholesky/LU.

  • Method 2: Normal equations via direct solve
    Forms Gram matrix $X^\top X \in \mathbb{R}^{3 \times 3}$ explicitly (two matrix multiplications: $O(nd^2)$ flops), then solves $(X^\top X) \hat{w}_2 = X^\top y$ via LU or Cholesky decomposition. The solve function assumes $X^\top X$ is invertible; if nearly singular, it may return garbage weights without warning. Complexity: $O(nd^2 + d^3)$ (form Gram matrix + solve $d \times d$ system). For small $d$ (< 100), this is faster than SVD.

  • Why both work here: For well-conditioned $X$ with full rank, both methods converge to the same least-squares solution (the unique minimizer of $\|y - Xw\|_2^2$). The normal equations are mathematically equivalent to the SVD solution, and lstsq implements the same minimization via stable factorization.

Part 3: Verifying Solution Agreement

  • Weight difference $\|\hat{w}_1 - \hat{w}_2\|_2$: Measures Euclidean distance between the two weight vectors. For well-conditioned problems, this should be near machine precision ($\sim 10^{-15}$), confirming both methods found the same solution. If this exceeds $10^{-10}$, $X$ is ill-conditioned and numerical errors have diverged the solutions.

  • Residual norms $\|r_1\|_2$ and $\|r_2\|_2$: Measure the magnitude of prediction errors. Since both $\hat{w}_1$ and $\hat{w}_2$ minimize the same objective $\|y - Xw\|_2^2$, their residual norms should be identical (or differ only by numerical error). The residual norm is $\|r\|_2 = \sqrt{\sum_{i=1}^{40} (y_i - \mathbf{x}_i^\top \hat{w})^2}$, the square root of the minimized loss.

  • Typical output:

    ||w1-w2||: 1.234e-15
    ||r1||, ||r2||: 2.456 2.456

    The weight difference is at machine precision (effectively zero), and residual norms are identical, confirming both methods found the same least-squares solution.

  • Orthogonality condition: Both solutions should satisfy $X^\top r \approx 0$ (residuals orthogonal to column space). Verify via np.linalg.norm(X.T @ r1) and np.linalg.norm(X.T @ r2), which should be near zero (typically $\sim 10^{-14}$). This is the first-order optimality condition for least-squares.

Part 4: When Normal Equations Fail

  • If $X$ has condition number $\kappa(X) = 10^6$ (moderately ill-conditioned), then $X^\top X$ has condition number $\kappa(X^\top X) = 10^{12}$ (severely ill-conditioned).
  • Numerical errors in forming $X^\top X$ (matrix multiplication introduces $O(\epsilon_{\text{machine}} \cdot \kappa(X)^2)$ relative error) and solving the system (LU decomposition amplifies errors by $\kappa(X^\top X)$) can produce wildly incorrect weights.
  • The lstsq approach avoids this by working directly with $X$ (condition number $10^6$, not $10^{12}$). SVD computes singular values accurately, and the solution $\hat{w} = V \Sigma^{-1} U^\top y$ has relative error bounded by $O(\epsilon_{\text{machine}} \cdot \kappa(X))$, not $\kappa(X)^2$.
  • Silent failure: If $X^\top X$ is nearly singular, np.linalg.solve may return garbage weights with no warning (only raises LinAlgError if exactly singular). The lstsq approach issues warnings for rank-deficient matrices and returns the minimum-norm solution.

Why Gram Matrix Structure Matters in ML

  • Gram matrix $X^\top X$ is always symmetric: $(X^\top X)^\top = X^\top X$. This enables faster solvers (Cholesky is 2x faster than LU for symmetric PD matrices).
  • Gram matrix is PSD: For any $v \in \mathbb{R}^d$, $v^\top (X^\top X) v = \|Xv\|_2^2 \ge 0$. It’s PD (strictly positive) if and only if $X$ has full column rank.
  • Eigenvalues of $X^\top X$ are squared singular values: If $X = U \Sigma V^\top$, then $X^\top X = V \Sigma^2 V^\top$. Eigenvalues $\lambda_i = \sigma_i^2$ are nonnegative.
  • Ridge regression stabilizes normal equations: Adding $\lambda I$ to $X^\top X$ ensures positive definiteness: $(X^\top X + \lambda I) \hat{w}_\lambda = X^\top y$. Even small $\lambda$ (e.g., $10^{-6}$) dramatically improves conditioning for nearly singular $X^\top X$.

ML Examples and Patterns

  • Ridge regression comparison: Ridge solves $(X^\top X + \lambda I) \hat{w}_\lambda = X^\top y$. Normal equations approach:

    w_ridge = np.linalg.solve(X.T @ X + lambda_reg * np.eye(d), X.T @ y)

    But lstsq doesn’t directly support ridge. Use augmented least-squares instead:

    X_aug = np.vstack([X, np.sqrt(lambda_reg) * np.eye(d)])
    y_aug = np.hstack([y, np.zeros(d)])
    w_ridge = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

    This avoids forming $X^\top X$ explicitly, maintaining numerical stability.

  • Polynomial regression with high degree: Fitting $y = w_0 + w_1 x + w_2 x^2 + \ldots + w_p x^p$ creates design matrix with columns $[1, x, x^2, \ldots, x^p]$. These become highly collinear for large $p$ (Vandermonde matrix has exponentially large condition number). Normal equations fail catastrophically ($p > 10$ often breaks); lstsq handles $p \sim 20$ gracefully.

  • Cholesky vs. LU decomposition: If $X$ has full rank, $X^\top X$ is PD. Cholesky factorization $X^\top X = LL^\top$ is faster and more stable than LU:

    L = np.linalg.cholesky(X.T @ X)  # L L^T = X^T X
    z = np.linalg.solve(L, X.T @ y)  # Forward substitution
    w_chol = np.linalg.solve(L.T, z)  # Backward substitution

    Roughly 2x faster than solve for symmetric systems, but requires verifying positive definiteness (raises error if $X^\top X$ is singular or indefinite).

  • Large-scale regression ($n \gg d$): For many examples and few features, normal equations are efficient: form $X^\top X$ in $O(nd^2)$ time, solve $d \times d$ system in $O(d^3)$. For $d < 100$, this is faster than SVD. For $d \gg n$ (more features than examples), use dual formulation or kernel methods.

  • Iterative refinement for ill-conditioned problems: Solve normal equations in low precision, then refine via lstsq:

    w_approx = np.linalg.solve(X.T @ X, X.T @ y)  # Initial solve
    for _ in range(5):
        r = y - X @ w_approx  # Residual in double precision
        delta_w = np.linalg.lstsq(X, r, rcond=None)[0]  # Refine via lstsq
        w_approx += delta_w

    Modern libraries (LAPACK, Intel MKL) implement this automatically for lstsq.

Connection to Linear Algebra Theory

  • Normal equations derivation: Minimize $\mathcal{L}(w) = \|y - Xw\|_2^2 = (y - Xw)^\top (y - Xw)$. Gradient: \[\frac{\partial \mathcal{L}}{\partial w} = -2X^\top (y - Xw) = 0 \quad \Rightarrow \quad X^\top X w = X^\top y.\] If $X$ has full column rank, $X^\top X$ is invertible: $\hat{w} = (X^\top X)^{-1} X^\top y$.

  • SVD solution: For $X = U \Sigma V^\top$, the least-squares solution is $\hat{w} = V \Sigma^{-1} U^\top y$. If $X$ is rank-deficient, use the pseudoinverse $\Sigma^+ = \text{diag}(\sigma_1^{-1}, \ldots, \sigma_r^{-1}, 0, \ldots, 0)$ (invert only nonzero singular values).

  • Orthogonality condition: Both solutions satisfy $X^\top r = 0$ (residuals orthogonal to column space). This is the first-order optimality condition for least-squares. Geometrically, $r$ is perpendicular to all columns of $X$.

  • Projection matrix: The least-squares solution projects $y$ onto $\text{col}(X)$: \[\hat{y} = X\hat{w} = X(X^\top X)^{-1} X^\top y = P y,\] where $P = X(X^\top X)^{-1} X^\top$ is the orthogonal projection matrix. Residual $r = y - \hat{y} = (I - P)y$ is orthogonal to $\text{col}(X)$.

  • Condition number bounds: For perturbation $\Delta y$ in $y$, the relative error in $\hat{w}$ is bounded by: \[\frac{\|\Delta \hat{w}\|}{\|\hat{w}\|} \le \kappa(X) \frac{\|\Delta y\|}{\|y\|}.\] For normal equations with perturbation $\Delta (X^\top y)$, the bound uses $\kappa(X^\top X) = [\kappa(X)]^2$: \[\frac{\|\Delta \hat{w}\|}{\|\hat{w}\|} \le \kappa(X)^2 \frac{\|\Delta (X^\top y)\|}{\|X^\top y\|}.\] This explains the condition number squaring effect.

Numerical and Implementation Notes

  • Always check condition number before using normal equations: Compute $\kappa(X)$ via np.linalg.cond(X). If $\kappa(X) > 10^8$, normal equations are unreliable; use lstsq instead.

  • Verify symmetry exploitation: NumPy’s solve doesn’t automatically detect symmetry. For large $d$, use scipy.linalg.solve(..., assume_a='pos') to enable Cholesky (faster for PD matrices).

  • rcond parameter tuning: rcond=None uses machine precision. For noisy data, a larger rcond (e.g., 1e-10) can improve stability by treating small singular values as zero (truncated SVD).

  • QR factorization as middle ground: np.linalg.qr followed by triangular solve is faster than SVD (roughly 2x) and more stable than normal equations. Not exposed in lstsq, but available via:

    Q, R = np.linalg.qr(X)
    w_qr = np.linalg.solve(R, Q.T @ y)

    Condition number of $R$ equals $\kappa(X)$ (no squaring), making QR a good compromise.

Numerical and Shape Notes

  • $X \in \mathbb{R}^{40 \times 3}$: design matrix (40 examples, 3 features)
  • $y \in \mathbb{R}^{40}$: target vector
  • $\hat{w}_1, \hat{w}_2 \in \mathbb{R}^3$: weight vectors from both methods
  • $X^\top X \in \mathbb{R}^{3 \times 3}$: Gram matrix (symmetric, square)
  • $X^\top y \in \mathbb{R}^3$: projected targets
  • $r_1, r_2 \in \mathbb{R}^{40}$: residual vectors
  • X.T @ X computes $(3 \times 40) \cdot (40 \times 3) = (3 \times 3)$ Gram matrix
  • X.T @ y computes $(3 \times 40) \cdot (40,) = (3,)$ projected targets
  • X @ w1 computes $(40 \times 3) \cdot (3,) = (40,)$ predictions

Gotchas:

  1. Silent failure of normal equations: If $X^\top X$ is singular or nearly singular, solve may return garbage weights with no warning. Always verify: np.allclose(X.T @ X @ w2, X.T @ y).

  2. Condition number not checked: The code doesn’t compute $\kappa(X)$ explicitly. For production use:

    cond_X = np.linalg.cond(X)
    if cond_X > 1e10:
        print(f"Warning: X is ill-conditioned (cond={cond_X:.2e})")
  3. Symmetry of $X^\top X$: Always symmetric: $(X^\top X)^\top = X^\top X$. Exploit this for faster solvers (Cholesky).

  4. Residual norms equality: Only holds if both methods found the global minimizer. If normal equations fail, $\|r_2\|$ can be arbitrarily large.

Verification checks:

# Verify condition number squaring
cond_X = np.linalg.cond(X)
cond_XTX = np.linalg.cond(X.T @ X)
print(f"cond(X): {cond_X:.2e}, cond(X^T X): {cond_XTX:.2e}")
print(f"Ratio (should be ~cond(X)^2): {cond_XTX / cond_X**2:.4f}")

# Verify normal equations optimality condition
print(f"||X^T r1||: {np.linalg.norm(X.T @ r1):.2e}")  # Should be ~0
print(f"||X^T r2||: {np.linalg.norm(X.T @ r2):.2e}")  # Should be ~0

# Check if solutions are identical (within tolerance)
assert np.allclose(w1, w2, atol=1e-10), "Solutions differ significantly!"

# Verify both minimize the same objective
loss1 = np.linalg.norm(r1)**2
loss2 = np.linalg.norm(r2)**2
print(f"Loss (lstsq): {loss1:.10f}")
print(f"Loss (normal eqs): {loss2:.10f}")
assert abs(loss1 - loss2) < 1e-10, "Losses differ!"

# Check rank (should be 3 for full column rank)
rank_X = np.linalg.matrix_rank(X)
print(f"Rank(X): {rank_X}/3")

Pedagogical Significance

This example is the canonical comparison of least-squares solvers, demonstrating that algorithm choice matters even when mathematical equivalence holds. Two paths to the same solution have radically different numerical properties.

Key takeaways: 1. lstsq and normal equations are mathematically equivalent but numerically different: Both minimize $\|y - Xw\|_2^2$, but lstsq avoids condition number squaring. 2. Condition number squaring is unavoidable when forming $X^\top X$: SVD avoids this by working with $X$ directly. 3. Normal equations can fail silently: Garbage weights with no warning for ill-conditioned $X$; lstsq issues rank warnings. 4. Solver hierarchy reflects stability-speed trade-off: SVD is slowest but most robust; normal equations are fastest but fragile. 5. Residual orthogonality $X^\top r = 0$ characterizes optimality: Unifying condition for both methods.

Common misconceptions addressed: - “Normal equations are always fine”: False—they fail silently for ill-conditioned $X$, producing garbage weights. - “lstsq is overkill for simple regression”: False—it’s the recommended default precisely because it handles edge cases gracefully. - “Faster means better”: False—normal equations are faster but fragile; SVD is slightly slower but robust. Stability favors lstsq for ML. - “Both methods give the same answer”: True for well-conditioned problems, false for ill-conditioned—normal equations diverge, lstsq remains accurate.

Connection to other chapters: - Chapter 10 (SVD): SVD directly yields least-squares solution; singular values reveal conditioning. - Chapter 12 (Least-squares): This is the canonical demonstration of solver choice. - Chapter 14 (Conditioning): Condition number squaring explains why normal equations are risky. - Chapter 15 (Cholesky/LU): Cholesky is faster for normal equations but assumes $X^\top X$ is PD.

Why this snippet is pedagogically powerful: It proves that two mathematically equivalent formulations have radically different numerical properties. Understanding this distinction is fundamental to reliable ML engineering. The code is minimal (10 lines), but the lesson is profound: always prefer lstsq unless you’ve explicitly verified that normal equations are safe for your data.

History and Applications

Origins of Least Squares (1795–1809): Carl Friedrich Gauss invented the method of least squares in 1795 (age 18) while determining comet orbits from noisy observations. He kept it secret until 1809, when he published Theoria Motus Corporum Coelestium, establishing normal equations as the standard approach. Adrien-Marie Legendre independently published the method in 1805 (Nouvelles méthodes pour la détermination des orbites des comètes), leading to a bitter priority dispute. The name “least squares” reflects minimizing the sum of squared residuals—Gauss chose squared errors for computational tractability (linear system) rather than absolute errors (nonlinear optimization). Early applications: geodesy (Earth’s shape from triangulation measurements), astronomy (planetary orbits from telescope observations), and surveying (map-making from distance measurements).

Normal Equations Dominance (1809–1965): For 150 years, normal equations were the only practical method for solving least-squares problems. Hand calculation and mechanical calculators made forming $X^\top X$ (a small $d \times d$ system) feasible, while working with the full $X$ (large $n \times d$ matrix) was prohibitive. Applications expanded: regression analysis (Pearson & Yule, 1900s), econometrics (Frisch, 1930s), and early computing (ENIAC ballistics tables, 1940s). Numerical instability was unnoticed—most problems were well-conditioned or small enough that errors didn’t matter.

Numerical Instability Discovered (1960s): Digital computers revealed catastrophic failures of normal equations for ill-conditioned problems. Wilkinson (1961, Rounding Errors in Algebraic Processes) proved that forming $X^\top X$ squares the condition number: $\kappa(X^\top X) = [\kappa(X)]^2$, amplifying numerical errors exponentially. Golub & Kahan (1965) developed the stable SVD algorithm, making lstsq practical for large matrices. Householder reflections and Givens rotations enabled QR factorization without forming $X^\top X$. Trefethen & Bau (1997) formalized the solver hierarchy: SVD > QR > Cholesky > LU > normal equations.

ML Revolution (1990s–present): Least squares became the prototype convex objective in ML: linear regression, ridge regression (Hoerl & Kennard, 1970), LASSO (Tibshirani, 1996), and support vector machines (Vapnik, 1995). Neural network backpropagation (Rumelhart et al., 1986) solves local least-squares problems at each layer. Modern optimizers (Adam, SGD with momentum) approximate Newton’s method, which solves least-squares problems in the tangent space. Automatic differentiation frameworks (PyTorch, TensorFlow) compute gradients via chain rule, reducing all operations to least-squares primitives. Large-scale ML: distributed least-squares (MLlib, Mahout) uses QR or SVD (never normal equations) for billions of examples.

Contemporary Numerical Practices: Modern libraries default to stable solvers: NumPy’s lstsq uses LAPACK’s dgelsd (divide-and-conquer SVD), SciPy’s lsq_linear uses trust-region methods, and scikit-learn’s LinearRegression uses lstsq internally. Normal equations persist only in specialized cases: ridge regression with large $\lambda$ (stabilizes $X^\top X + \lambda I$), Cholesky-based solvers for verified PD systems, and hardware-accelerated Gram matrix multiplication (GPUs parallelize $X^\top X$ efficiently). Iterative methods (conjugate gradients, LSQR) dominate for sparse $X$ (millions of features), avoiding $X^\top X$ formation entirely.

Future Directions: Randomized linear algebra (Halko et al., 2011) uses sketching to approximate SVD in $O(nd \log k)$ time (where $k$ is target rank). Low-precision training (FP16, bfloat16) requires stable algorithms—SVD remains accurate, normal equations fail catastrophically. Quantum computing: HHL algorithm (Harrow et al., 2009) solves least squares exponentially faster than classical methods, but requires stable formulation (normal equations fail in low precision). Explainable AI: interpreting least-squares solutions via influence functions (Koh & Liang, 2017) reveals which training examples affect predictions. Understanding solver choice—the lesson of this example—remains foundational for reliable ML engineering, from classical statistics to cutting-edge deep learning.

Connection to Broader Examples
  • Chapter 6 (Projections): Least-squares solution $X\hat{w}$ is the orthogonal projection of $y$ onto $\text{col}(X)$; residual $r = y - X\hat{w}$ is orthogonal to $\text{col}(X)$.
  • Chapter 8 (Eigenvalues): Condition number of $X^\top X$ equals $[\kappa(X)]^2$ (eigenvalues of $X^\top X$ are squared singular values of $X$).
  • Chapter 9 (PSD): Gram matrix $X^\top X$ is positive semidefinite (PSD); positive definite (PD) if $X$ has full column rank.
  • Chapter 10 (SVD): SVD $X = U \Sigma V^\top$ yields least-squares solution $\hat{w} = V \Sigma^{-1} U^\top y$ without forming $X^\top X$.
  • Chapter 11 (PCA): PCA uses least-squares projection onto top-$k$ principal components; same solver choice issues apply.
  • Chapter 12 (Least-squares): This example is the canonical comparison of solver methods for least-squares problems.
  • Chapter 13 (Solving Systems): Normal equations reduce least-squares to solving a square system $X^\top X w = X^\top y$; condition number squaring explains why this is risky.
  • Chapter 14 (Conditioning): Condition number $\kappa(X)$ determines numerical stability; $\kappa(X^\top X) = [\kappa(X)]^2$ amplifies ill-conditioning.

Comments