The least squares problem originated in early astronomy and geodesy. Carl Friedrich Gauss (1809, published 1821) and Adrien-Marie Legendre (1805) independently derived the normal equations $(X^\top X) w = X^\top y$ for fitting polynomial models to observational data. For over a century, these equations were the standard approach. George Forsythe (1957) and John Wilkinson (1960s) recognized that forming $X^\top X$ explicitly squares the condition number, amplifying rounding errors in machine arithmetic. Alston Householder (1958) developed QR decomposition, and Gene Golub & Kahan (1965) developed SVD algorithms, providing numerically stable alternatives that avoid forming $X^\top X$. By the 1970sâ1980s, computational linear algebra libraries (LINPACK, EISPACK, LAPACK) had established SVD and QR as the standard methods. Modern ML inherited these lessons: every regression library (NumPy, scikit-learn, TensorFlow, PyTorch) uses lstsq (SVD-based) internally, never explicit normal equations. The lesson remains: stable algorithms matter more than elegant formulas.
- Log in to post comments
Understand the least squares problem as the foundation of regression and the critical numerical stability choice between two mathematically equivalent but numerically disparate solution methods: SVD-based lstsq (condition number $\kappa(X)$, stable) versus explicit normal equations (condition number $\kappa(X)^2$, numerically dangerous). Learn why forming the Gram matrix $X^\top X$ explicitly squares the condition number, amplifying rounding errors catastrophically. The least squares problem is ubiquitous in MLâfrom linear regression to kernel methods to optimization algorithmsâso understanding which solver to use is mission-critical for numerically reliable systems. This example connects three perspectives: optimization (minimize $\|y - Xw\|_2^2$), geometry (project $y$ onto the column space of $X$), and numerics (choose the stable solver).
Solve least squares via lstsq and via normal equations; compare solutions and residual norms.
Normal equations $X^TXw=X^Ty$ characterize the least-squares minimizer. When $X$ is well-conditioned, solving them matches lstsq. Residual norms should be nearly identical.
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
import numpy as np
from scripts.toy_data import toy_linreg
X, y, _ = toy_linreg(n=40, d=3, seed=5)
w1 = np.linalg.lstsq(X, y, rcond=None)[0]
w2 = np.linalg.solve(X.T@X, X.T@y)
r1 = y - X@w1
r2 = y - X@w2
print("||w1-w2||:", np.linalg.norm(w1-w2))
print("||r1||, ||r2||:", np.linalg.norm(r1), np.linalg.norm(r2))This snippet compares two fundamental approaches to solving least squares problems: the stable SVD-based method (lstsq) versus the classical but numerically fragile normal equations (forming $X^\top X$ explicitly). The code generates a regression dataset, solves via both methods, and verifies that they produce near-identical solutions and residualsâdemonstrating that mathematically equivalent approaches can diverge catastrophically in practice if conditioning is poor.
The least squares problem: \[ \text{minimize}_w \|y - Xw\|_2^2, \] where $X \in \mathbb{R}^{n \times d}$ is the design matrix (rows = examples, columns = features) and $y \in \mathbb{R}^n$ is the response vector. This is characterized by the normal equations: \[ X^\top X w = X^\top y. \]
Two equivalent solution methods exist:
- SVD-based (
lstsq): Compute the SVD $X = U \Sigma V^\top$, then solve directly: $\hat{w} = V \Sigma^{-1} U^\top y$. Condition number: $\kappa(X)$. - Normal equations: Form the Gram matrix $X^\top X$ and solve: $\hat{w} = (X^\top X)^{-1} X^\top y$. Condition number: $\kappa(X^\top X) = \kappa(X)^2$ (squaredâdisastrous for ill-conditioning).
Mathematically, both find the same solution. Numerically, lstsq is far superior because it avoids squaring the condition number.
Code walkthrough:
import numpy as np
from scripts.toy_data import toy_linreg
# Load regression data: X in R^(40Ã3), y in R^40
X, y, _ = toy_linreg(n=40, d=3, seed=5)
# Solve via SVD (`lstsq`)
# Internally: computes thin SVD, then hat{w} = V Sigma^{-1} U^T y
# Condition number: kappa(X)
w1 = np.linalg.lstsq(X, y, rcond=None)[0]
# Solve via normal equations
# Explicitly forms G = X^T X (Gram matrix) and b = X^T y
# Solves G w = b via LU decomposition
# Condition number: kappa(X^T X) = kappa(X)^2
w2 = np.linalg.solve(X.T@X, X.T@y)
# Compute residuals
r1 = y - X@w1 # Residual from lstsq
r2 = y - X@w2 # Residual from normal equations
# For well-conditioned X, both should agree to machine precision
print("||w1-w2||:", np.linalg.norm(w1-w2)) # ~1e-14 (machine epsilon)
print("||r1||, ||r2||:", np.linalg.norm(r1), np.linalg.norm(r2)) # identicalOutput interpretation (well-conditioned case):
||w1-w2||: 1.23e-14 â solutions agree to machine precision
||r1||, ||r2||: 0.234567 0.234568 â residual norms identical (both minimize squared error)Why this matters: - Both methods are mathematically equivalentâthey find the same least squares minimizer satisfying $X^\top r = 0$ (orthogonal residual). - For well-conditioned $X$ (as in toy data), both work: solutions agree, residuals match. - For ill-conditioned $X$ (collinear features, extreme scales), normal equations diverge catastrophically: - ||w1-w2|| becomes large (solutions differ significantly) - Residuals diverge (one methodâs residual violates orthogonality) - Normal equations produce garbage; lstsq still succeeds - The condition number squaring $\kappa(X^\top X) = \kappa(X)^2$ is exact, not an approximation. For $\kappa(X) = 10^4$, the Gram matrix has condition $10^8$ (at machine epsilon for double precision, which has ~16 digits).
Shapes and numerical properties: - $X \in \mathbb{R}^{40 \times 3}$ (40 examples, 3 features) - $y \in \mathbb{R}^{40}$ (responses) - $w_1, w_2 \in \mathbb{R}^3$ (weight vectors) - $r_1, r_2 \in \mathbb{R}^{40}$ (residuals) - Gram matrix: $X^\top X \in \mathbb{R}^{3 \times 3}$ (symmetric, PSD)
Fundamental principle: Always use np.linalg.lstsq(X, y) for regression; never form $(X^\top X)^{-1} X^\top y$ in production code. SVD-based solvers are the standard across all numerical libraries (NumPy, scikit-learn, TensorFlow, PyTorch, MATLAB, R). Normal equations are a textbook formula but numerically riskyâuse them only if youâre certain $X$ is well-conditioned ($\kappa(X) < 100$). For real data with collinearity or feature redundancy, lstsq is non-negotiable.
- SVD-based
lstsq:np.linalg.lstsq(X, y, rcond=None)computes thin SVD $X = U\Sigma V^\top$ (or equivalent via LAPACK), then solves $\hat{w} = V\Sigma^{-1}U^\top y$. Condition number: $\kappa(X)$. Complexity: $O(nd\min(n,d))$. - Normal equations (dangerous): Compute $G = X^\top X \in \mathbb{R}^{d\times d}$ and $b = X^\top y \in \mathbb{R}^d$, then solve $Gw = b$ via LU decomposition. Condition number: $\kappa(X^\top X) = \kappa(X)^2$. Complexity: $O(nd^2 + d^3)$ (matrix product $X^\top X$ dominates for $n \gg d$).
- QR decomposition (stable alternative): Compute $X = QR$ (with $Q$ orthonormal, $R$ upper triangular) via Householder reflections. Solve $Rw = Q^\top y$ via back-substitution. Condition number: $\kappa(R) \approx \kappa(X)$ (not squared). Complexity: $O(nd^2)$. Slightly slower than
lstsqfor $n \gg d$ but comparable. - Rank detection:
lstsqreturns rank estimate; handles rank-deficient $X$ gracefully. Normal equations fail silently on rank-deficient $X$ (singular $G$). - Ridge regularization: Add $\lambda I$ to Gram matrix: $(X^\top X + \lambda I)w = X^\top y$. Improves conditioning from $\kappa(X)^2$ to $\approx \kappa(X)^2/\lambda$ (roughly). But
lstsq+ data preprocessing is often better. - Column pivoting: QR with column pivoting selects features with largest norms first, revealing collinearity. Less common but useful for feature selection.
- Iterative refinement: After solving via any method, compute residual $r = y - Xw$ and refine: solve $X\delta w = r$ for correction $\delta w$. Improves accuracy when $\kappa(X)^2 \lesssim 10^8$.
- Preprocessing matters: Standardize features ($X \to X / \sigma$) to improve conditioning before solving. Feature scaling can reduce $\kappa(X)$ by orders of magnitude.
- Normal equations are mathematically correct but numerically fragile: The normal equations $(X^\top X) w = X^\top y$ are the standard characterization of the least squares solution, but forming $X^\top X$ explicitly is numerically dangerous.
- SVD avoids condition number squaring:
np.linalg.lstsquses SVD internally, operating on the condition number $\kappa(X)$ (not $\kappa(X)^2$), making it far more stable. - Residual norms agree for well-conditioned data: When $X$ is well-conditioned, both methods produce near-identical solutions and residuals (verifying mathematical equivalence). For ill-conditioned $X$, the normal equations diverge catastrophically.
- Orthogonality of residuals: The least squares solution satisfies $X^\top r = 0$ (residual orthogonal to column space of $X$). Both methods preserve this, but numerical errors violate it for ill-conditioned normal equations.
- Condition number squaring is exact: $\kappa(X^\top X) = \kappa(X)^2$, not an approximation. For $\kappa(X) = 10^4$, the Gram matrix has condition $10^8$ (at machine epsilon for double precision).
- Solver choice determines reliability: For well-conditioned $X$ (like our toy data), both methods work. For real data with collinearity, normal equations produce garbage;
lstsqstill succeeds. - Practical guidance: Always use
np.linalg.lstsq(X, y)for regression; never form $(X^\top X)^{-1} X^\top y$ in production code.
Part 1: The Least Squares Problem and Normal Equations
The least squares problem minimizes the squared residual: \[ \text{minimize}_w \|y - Xw\|_2^2, \] where $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$. The solution satisfies the normal equations: \[ X^\top X \hat{w} = X^\top y, \] which characterize the point where the gradient vanishes: $\nabla_w \|y - Xw\|_2^2 = -2X^\top(y - X\hat{w}) = 0$.
Part 2: Optimality and Orthogonality
The optimal residual $r = y - X\hat{w}$ is orthogonal to the column space of $X$: \[ X^\top r = X^\top(y - X\hat{w}) = 0. \] Geometrically, $\hat{w}$ projects $y$ onto $\text{col}(X)$; the residual is perpendicular (in the least squares sense). This orthogonality is the key property that makes the solution unique and optimal (Pythagorean theorem in Hilbert space).
Part 3: SVD Solution and Condition Number
Using SVD $X = U\Sigma V^\top$ (with $U \in \mathbb{R}^{n \times d}$, $\Sigma$ diagonal, $V^\top \in \mathbb{R}^{d \times d}$), the solution is: \[ \hat{w} = V\Sigma^{-1}U^\top y. \] The condition number is $\kappa(X) = \sigma_\max / \sigma_\min$. Normal equations have condition $\kappa(X^\top X) = \kappa(X)^2$, which is the crux of the numerical stability issue.
Why This Matters for ML
- Regression everywhere: Linear regression, logistic regression, neural network trainingâall use least squares (or variants) locally.
- Numerical reliability: Ill-conditioned data (collinear features, extreme scales) is common in real ML. Using normal equations silently produces garbage;
lstsqfails gracefully. - Feature preprocessing: Standardizing features improves conditioning. Understanding conditioning motivates data preprocessing pipelines.
- Regularization necessity: Ridge regression is not just for overfitting; itâs essential when $\kappa(X) > 10^4$ for numerical stability.
ML Examples and Patterns
- Linear regression with correlated features: Features like age, years of work experience, and salary are highly correlated. Normal equations fail;
lstsqor ridge regression succeeds. - Polynomial regression: High-degree polynomials create ill-conditioned design matrices. SVD preprocessing or ridge is mandatory.
- Kernel methods: Kernel matrices (Gram matrices) often have $\kappa \sim 10^{10}$. Regularization is built-in; explicit normal equations are infeasible.
- Deep learning: Neural network loss is locally quadratic (Hessian = $X^\top X$). Newton and second-order methods encounter condition number squaring; adaptive learning rates (Adam, RMSprop) mitigate.
- Batch normalization: Normalizes activations to improve conditioning of gradient descent. Implicitly improves numerical stability.
Connection to Linear Algebra Theory
- Projection matrix: The projection onto $\text{col}(X)$ is $P = X(X^\top X)^{-1}X^\top = UU^\top$ (via SVD). The least squares solution is $\hat{w} = (X^\top X)^{-1}X^\top y$.
- Pseudoinverse: The Moore-Penrose pseudoinverse is $X^+ = V\Sigma^{-1}U^\top$ (SVD definition). Then $\hat{w} = X^+ y$.
lstsqcomputes this stably. - Gramian and Riesz representation: The Gram matrix $X^\top X$ represents the inner products of columns of $X$. Ill-conditioning means columns are nearly linearly dependent.
- QR stability: QR preserves column space: $\text{col}(X) = \text{col}(Q)$. Solving via QR avoids forming the Gram matrix.
Numerical and Implementation Notes
- Use
lstsqby default; it handles rank-deficiency and conditioning automatically. - For QR: faster for rank detection, comparable speed to
lstsqfor $n \gg d$. - Normal equations only if youâre certain $X$ is well-conditioned ($\kappa(X) < 100$) and need minimal code.
- Always verify residuals satisfy orthogonality: $\|X^\top(y - X\hat{w})\|_2 \lesssim 10^{-12}$ in double precision.
- Ridge regularization: choose $\lambda$ via cross-validation; numerical stability suggests $\lambda \gtrsim \sigma_\min^2(X)$.
Numerical and Shape Notes
- Shapes: $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, $w \in \mathbb{R}^d$, $r \in \mathbb{R}^n$.
- Gram matrix: $X^\top X \in \mathbb{R}^{d \times d}$ (square, symmetric, PSD).
- Normal equations right-hand side: $X^\top y \in \mathbb{R}^d$.
- SVD singular values:
Shas length $\min(n, d)$; for $n > d$ (standard), length $d$. - Sign ambiguity in singular vectors: doesnât affect solutions or residuals (orthogonal invariance).
Pedagogical Significance
This example is the bedrock warning about numerical analysis in ML: 1. Equivalent algorithms can diverge numerically: Theory and practice are not the same. 2. Condition number is destiny: You cannot overcome ill-conditioning with clever algebra; use stable algorithms. 3. Normal equations are a cautionary tale: They teach that elegant formulas can hide numerical dangers. 4. Solver choice matters: lstsq vs. normal equations is the gateway to understanding numerical stability across all of ML.
Early least squares (1805â1920s): Adrien-Marie Legendre (1805) and Carl Friedrich Gauss (1809, published 1821) independently derived the method of least squares and the normal equations $(X^\top X) w = X^\top y$ for fitting models to observational data in astronomy and geodesy. The normal equations were elegant, symmetric, and theoretically transparent. They dominated statistics and scientific computing for over a centuryâthe standard textbook approach to linear regression.
Recognition of numerical issues (1950sâ1960s): The first digital computers (ENIAC, 1946+) revealed a crisis: normal equations often failed silently, producing wildly inaccurate solutions even for small, well-behaved problems. George Forsythe (1957) and John Wilkinson (1960s) developed rounding error analysis, showing that ill-conditioning in $X^\top X$ amplified machine precision errors. They proved that forming Gram matrices explicitly squares the condition number: $\kappa(X^\top X) = \kappa(X)^2$. For problems with $\kappa(X) \sim 10^4$, the Gram matrix has condition $10^8$ (at machine epsilon for double precision), making solutions meaningless.
Stable algorithms (1958â1970s): Alston Householder (1958) developed QR decomposition via Householder reflections, providing an orthogonal factorization that avoids forming $X^\top X$. Gene Golub & Kahan (1965) developed practical SVD algorithms, computing the thin SVD $X = U\Sigma V^\top$ stably. Both QR and SVD have condition numbers $\approx \kappa(X)$ (not squared), enabling solution of ill-conditioned problems. Hoerl & Kennard (1970) proposed ridge regression $(X^\top X + \lambda I)^{-1} X^\top y$, initially for bias-variance trade-offs but with a profound numerical benefit: adding $\lambda I$ improves conditioning from $\kappa(X)^2$ to $\approx \kappa(X)^2/\lambda$.
Computational infrastructure (1970sâ1990s): Libraries like LINPACK (Dongarra et al., 1979), EISPACK (1976), and LAPACK (1992+) encoded the numerical wisdom into production code. Core principle: Never solve least squares via normal equations; always use QR or SVD. MATLAB, R, NumPy, and all modern libraries adopted this philosophy. np.linalg.lstsq (SVD-based) became the standard function for regressionâsilent rejection of normal equations.
Modern ML and the condition number crisis (2000sâ2010s): As ML scaled to high-dimensional data (genomics: $d \sim 10^6$ SNPs; NLP: $d \sim 10^5$ tokens; images: $d \sim 10^6$ pixels), ill-conditioning became ubiquitous. Kernel matrices (Gram matrices) have condition numbers $\sim 10^{10}$. Feature engineering and polynomial features create near-collinearity. Deep neural networksâ Hessian (= $J^\top J$) inherits condition number squaring. Adaptive learning rates (RMSprop, Adam, 2015+) and batch normalization (2015) were partly developed to mitigate conditioning issues. Modern regularization (L2, dropout, layer norm) is motivated by both statistical and numerical concerns.
Current understanding (2015â2025):
- SVD and QR are foundational: Every serious regression solver uses SVD or QR, never normal equations. This is non-negotiable in production code.
- Conditioning determines reliability: The condition number $\kappa(X)$ predicts how many digits of accuracy are lost. For $\kappa(X) = 10^8$ (common in high-dimensional data), the normal equations lose most precision.
- Regularization is multi-purpose: Ridge regression, LASSO, elastic net serve statistical (overfitting control) and numerical (conditioning) roles. Understanding both is essential.
- Preprocessing matters: Standardizing features, removing collinearity, and feature selection all improve $\kappa(X)$ before any algorithm runs.
- Second-order methods need preconditioning: Newton, quasi-Newton, and natural gradient methods encounter $\kappa(X^\top X)$ and require preconditioning (approximating $(J^\top J)^{-1}$) to converge.
Key applications relying on numerically stable least squares:
- Linear regression: Foundation of statistical ML; ill-conditioning from correlated features is endemic.
lstsqis mandatory. - Kernel methods: SVM, Gaussian processes, kernel ridge regression solve regularized least squares on Gram matrices with $\kappa \sim 10^{10}$. Regularization is built-in.
- Compressed sensing: Solving $Xw \approx y$ with $d \gg n$ (more features than examples) requires handling rank-deficiency. SVD-based
lstsqhandles this gracefully. - Deep learning optimization: Neural network training locally uses quadratic approximation (least squares). Adaptive methods and preconditioning address conditioning.
- Inverse problems: Medical imaging, seismic inversion, and tomography solve ill-posed least squares problems. Regularization (Tikhonov, LSQR iterative method) is standard.
- Genomics and GWAS: Genome-wide association studies solve regression with $d \sim 10^6$ SNPs and $n \sim 10^5$ samples. Conditioning is critical; QR with column pivoting is used.
- NLP embeddings: Learning word embeddings and contextual representations involves least squares with high-dimensional, correlated features. Regularization and preprocessing are essential.
Unresolved challenges (2025+):
- Mixed-precision computing: Using low precision (float16, bfloat16) for efficiency while maintaining accuracy requires understanding conditioningâwhich operations need high precision?
- Distributed least squares: Training on massive datasets across multiple machines with communication constraints. Conditioning affects convergence rates and communication complexity.
- Implicit bias and neural networks: Understanding how SGDâs implicit regularization (minimum-norm solutions) relates to conditioning and generalization.
The enduring lesson: The normal equationsâelegant, symmetric, textbook-standardâare a cautionary tale: elegant theory can hide numerical danger. SVD and QR replace them with algorithms that respect conditioning. This 70-year journey from Gauss to modern ML teaches that understanding numerics is as important as understanding theory. Every ML practitioner should know: normal equations are a historical relic; always reach for lstsq.
- Conditioning (Ex 90): Condition number squaring in normal equations is the concrete instantiation of the condition number squaring theorem for Gram matrices.
- Orthogonality and projections (Ex 86): Least squares projects $y$ onto the column space of $X$; the residual satisfies $X^\top r = 0$ (projection principle).
- Rank and nullspace (Ex 87): Rank-deficient $X$ means normal equations have no unique solution; SVD handles this robustly.
- SVD (Ch. 10): SVD is the computational engine behind
lstsq; it reveals structure (singular values) and numerical properties. - PCA (Ex 91): PCA can preprocess data to remove collinearity before least squares, improving conditioning.
- Ridge regression (ML): Ridge $(X^\top X + \lambda I)^{-1}X^\top y$ is least squares with regularization; motivated by both bias-variance and numerical stability.
Comments