Least squares has roots in early astronomy/geodesy; in ML it is the prototype convex objective and the local approximation behind many optimization steps. Carl Friedrich Gauss developed the method around 1795 (published 1809) for computing planetary orbits from noisy telescope observations, predicting the position of the dwarf planet Ceres. Adrien-Marie Legendre independently published it in 1805, coining the term âméthode des moindres carrésâ (method of least squares). The geometric interpretationâthat the solution projects the response vector onto the column space of the design matrixâwas formalized in the 20th century with the advent of linear algebra. The normal equations $X^\top X w = X^\top y$ provide a closed-form solution when $X$ has full column rank, but direct inversion of $X^\top X$ is numerically unstable (squaring the condition number). Modern practice uses QR decomposition or SVD to solve least squares, ensuring backward stability even for ill-conditioned problems. In machine learning, least squares underpins linear regression, ridge regression (adding $\lambda I$ to $X^\top X$), and serves as the linearization step in Gauss-Newton optimization for nonlinear models. The orthogonality condition $X^\top r = 0$ is the first-order optimality condition for the convex objective $\|Xw - y\|_2^2$, making least squares the simplest instance of convex optimization and the foundation for gradient-based training methods.
- Log in to post comments
Connect optimization, geometry (projection), and numerics (solver choice) in a single, reusable pattern. Least squares regression is the foundational supervised learning technique that converts the problem âfind the best linear fitâ into a concrete computational task: solve $X^\top X w = X^\top y$ (normal equations) or, more robustly, compute $w$ via QR/SVD factorization. This example demonstrates the dual nature of least squaresâit is simultaneously an optimization problem (minimize squared error) and a geometric problem (project the response onto the feature subspace). The optimality condition $X^\top r = 0$ (residual orthogonal to column space) is not just a theoretical curiosity; it is a numerical diagnostic that validates solver correctness and reveals ill-conditioning. Understanding this orthogonality property equips you to debug training failures, design regularization strategies (ridge, lasso), and interpret gradient descent as an iterative projection. Least squares is the prototype for all convex optimization in ML: neural network training via backpropagation reduces to solving sequences of least squares subproblems (Gauss-Newton), and many advanced methods (ADMM, proximal methods) rely on efficient least squares solvers as inner loops. Mastering this example provides the conceptual foundation for understanding how optimization, geometry, and numerical stability interact in machine learning.
Fit linear regression on toy data via lstsq; report ||r|| and verify orthogonality.
Least squares gives $\hat w=\arg\min_w\|Xw-y\|^2$. Residual $r=y-X\hat w$ satisfies $X^Tr=0$.
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
import numpy as np
from scripts.toy_data import toy_linreg
X, y, _ = toy_linreg(n=50, d=4, seed=1)
w = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X@w
print("||r||_2:", np.linalg.norm(r))
print("||X^T r||_2:", np.linalg.norm(X.T@r))This code demonstrates the fundamental optimality condition of least squares regression: the residual vector is orthogonal to the column space of the design matrix. The workflow consists of four steps: (1) generate synthetic regression data with $X \in \mathbb{R}^{50 \times 4}$ and $y \in \mathbb{R}^{50}$, (2) solve for the weight vector $w$ that minimizes $\|Xw - y\|_2^2$ using np.linalg.lstsq (which internally uses QR or SVD factorization for numerical stability), (3) compute the residual $r = y - Xw$ (the prediction error), and (4) verify the optimality condition by checking that $\|X^\top r\|_2 \approx 0$. The orthogonality check is not merely a theoretical exerciseâit is a practical diagnostic for solver correctness and conditioning. If $\|X^\top r\|_2$ is near machine precision ($\sim 10^{-12}$), the solver successfully found the least squares solution; if it is larger ($> 10^{-8}$), suspect ill-conditioning, rank deficiency, or numerical instability. The residual norm $\|r\|_2$ quantifies irreducible error (noise or model misspecification), while the orthogonality condition $X^\top r = 0$ confirms that the model has extracted all explainable variance along the feature directions. This patternâsolve for parameters, compute residual, verify optimalityâis the foundation for all supervised learning: linear regression, ridge regression, logistic regression (via Newton-Raphson), and neural network training (where backpropagation computes $X^\top r$ for nonlinear feature maps $X$). Understanding least squares as a projection problem (not just an optimization problem) reveals why regularization improves stability (shrinks the projection operator), why gradient descent converges (iteratively enforces orthogonality), and why SVD-based solvers are preferred (avoid condition number squaring). This code is the computational core of regression, providing both the solution and a numerical health check for solution correctness.
Numerical and Shape Notes
- Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
- Axis discipline: Be explicit with
axisin reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1. - Broadcasting: Check that broadcasts are intended (e.g.,
(n,1)with(n,d)). Prefer reshape/expand-dims to make semantics clear. - Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
- Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
- Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
- Sanity checks: Print shapes and residuals (e.g.,
||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.
Numerical and Implementation Notes
- Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use
solve,lstsq, Cholesky/QR/SVD. - Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (
axis) and ensure broadcasts are intended. - Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
- Conditioning: Inspect
np.linalg.cond(A)when solutions look unstable; regularize (ridge) or rescale features to improve conditioning. - Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g.,
||Ax-b||, reconstruction errors) and assert finiteness. - Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
- Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
- Sanity checks: Compare against references (e.g.,
lstsqvs.solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
- Data generation:
toy_linreg(n=50, d=4, seed=1)creates a synthetic regression problem with $X \in \mathbb{R}^{50 \times 4}$ (design matrix), $y \in \mathbb{R}^{50}$ (response vector), and true weights. The data exhibits linear structure with added Gaussian noise. - Least squares solver:
np.linalg.lstsq(X, y, rcond=None)returns a tuple(w, residuals, rank, s). The[0]indexing extracts $w \in \mathbb{R}^4$, the weight vector minimizing $\|Xw - y\|_2^2$. Internally, lstsq uses QR decomposition for well-conditioned problems or SVD for rank-deficient/ill-conditioned cases. - rcond parameter: Setting
rcond=Noneuses machine precision ($\sim 10^{-15}$) as the cutoff for determining effective rank. Singular values smaller thanrcond * max(s)are treated as zero, enabling robust handling of near-singular matrices. - Residual computation: $r = y - X\hat{w}$ computes the prediction error for each training example. This is the component of $y$ that cannot be explained by the linear modelâgeometrically, itâs orthogonal to $\text{span}(X)$.
- Residual norm:
||r||_2quantifies the root sum of squared errors. For well-specified models with Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, $\|r\|_2$ concentrates around $\sqrt{n} \sigma$. Large residual norms indicate model misspecification or high noise. - Orthogonality check:
||X.T@r||_2computes the norm of the gradient $\nabla_w \|y - Xw\|_2^2 = -2 X^\top r$. At the optimum, this should be near zero ($\sim 10^{-12}$ for double precision). Values above $10^{-8}$ suggest ill-conditioning, rank deficiency, or solver failure. - Interpretation: If $\|X^\top r\|_2 / \|r\|_2 < 10^{-10}$, the solver successfully found the least squares solution. The ratio $\|r\|_2 / \|y\|_2$ indicates the fraction of variance left unexplained by the model.
Pedagogical Significance
- Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
- ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
- Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
- Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
- Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
- Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.
ML Examples and Patterns
- Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
- Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
- Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
- Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
- Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
- Orthogonality as optimality: The least squares solution $\hat{w}$ satisfies $X^\top r = 0$, meaning the residual is orthogonal to the column space of $X$âthis is the first-order optimality condition for the convex objective $\|Xw - y\|_2^2$.
- Projection interpretation: The prediction $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{span}(X)$, and the residual $r = y - \hat{y}$ is the component orthogonal to this subspace.
- Numerical diagnostic: Computing $\|X^\top r\|_2$ provides a solver health checkâit should be near machine precision ($\sim 10^{-12}$) for well-conditioned problems; large values indicate ill-conditioning or solver failure.
- Stable solver choice: Using
np.linalg.lstsq(which internally uses QR or SVD) avoids the $\kappa(X)^2$ condition number amplification of forming $X^\top X$, ensuring backward stability. - Residual decomposition: The residual norm $\|r\|_2$ quantifies irreducible error (noise, model misspecification), while the orthogonality $X^\top r = 0$ confirms the model has extracted all explainable variance along the feature directions.
- Computation pattern: This is a solve operationâfinding parameters that satisfy an optimality conditionâthat appears throughout ML: ridge regression (regularized solve), gradient descent (iterative solve), and Gauss-Newton (linearized solve for nonlinear models).
Part 1: Problem Setup and Solution
The code loads a synthetic regression dataset with toy_linreg(n=50, d=4, seed=1), producing a design matrix $X \in \mathbb{R}^{50 \times 4}$ (50 examples, 4 features) and response vector $y \in \mathbb{R}^{50}$. The least squares solver np.linalg.lstsq(X, y, rcond=None) computes the weight vector $w \in \mathbb{R}^4$ that minimizes the squared residual norm: $\hat{w} = \arg\min_{w} \|y - Xw\|_2^2$. The solver uses QR decomposition or SVD internally (depending on matrix conditioning), avoiding the numerically unstable normal equations $X^\top X w = X^\top y$. The rcond=None parameter uses machine precision as the cutoff for rank determination, ensuring robust handling of near-singular matrices. The solution $w$ defines the prediction $\hat{y} = Xw$, which is the orthogonal projection of $y$ onto the column space $\text{span}(X)$.
Part 2: Residual Norm and Reconstruction Error
The residual vector $r = y - Xw \in \mathbb{R}^{50}$ measures the prediction error for each training example. Computing ||r||_2 gives the root-mean-square error (up to a $\sqrt{n}$ scaling factor), quantifying how much of $y$ cannot be explained by the linear model. Geometrically, $r$ is the component of $y$ orthogonal to $\text{span}(X)$âthe part that lies outside the subspace defined by the feature columns. A large residual norm indicates either high noise, model misspecification (missing important features), or insufficient model capacity (wrong functional form). For well-specified models with Gaussian noise, $\|r\|_2$ concentrates around $\sqrt{n} \sigma$, where $\sigma$ is the noise standard deviation. The residual norm is the irreducible errorâno linear model can achieve lower loss without overfitting.
Part 3: Orthogonality Check as Optimality Diagnostic
The key diagnostic is ||X^T r||_2, which should be near zero (within machine precision, typically $\sim 10^{-12}$) for a correctly solved least squares problem. Why? At the optimum, the gradient of the loss $\|y - Xw\|_2^2$ with respect to $w$ must vanish: $\nabla_w \|y - Xw\|_2^2 = -2 X^\top (y - Xw) = -2 X^\top r = 0$. This is the normal equations expressed as an orthogonality condition: $X^\top r = 0$ means the residual is orthogonal to every column of $X$. Numerically, $\|X^\top r\|_2$ acts as a convergence checkâif itâs not close to zero, suspect: (1) Ill-conditioning: Large condition number $\kappa(X) = \sigma_{\max}(X) / \sigma_{\min}(X)$ amplifies rounding errors. (2) Rank deficiency: Redundant or nearly linearly dependent columns in $X$. (3) Solver failure: Rare but possible numerical instability in the underlying QR/SVD decomposition. (4) Implementation bug: Mismatched shapes or incorrect matrix transpose. For well-conditioned problems ($\kappa(X) < 10^6$), $\|X^\top r\|_2$ should be orders of magnitude smaller than $\|r\|_2$. If the ratio $\|X^\top r\|_2 / \|r\|_2$ exceeds $10^{-10}$, investigate conditioning via np.linalg.cond(X).
Connection to ML Practice
This orthogonality property is the geometric foundation of supervised learning. Linear regression projects the response onto the feature subspace, minimizing Euclidean distance. Ridge regression modifies the projection by adding $\lambda I$ to $X^\top X$, shrinking the solution toward the origin and improving stability when $X$ is ill-conditioned. Gradient descent iteratively approaches the orthogonality condition: starting from random $w_0$, each update $w_{k+1} = w_k + \eta X^\top r_k$ reduces $\|X^\top r\|_2$ until convergence. Neural network training extends this to nonlinear feature maps: backpropagation computes gradients that are analogous to $X^\top r$, ensuring parameter updates improve fit. Understanding least squares as a projection problemânot just an optimization problemâreveals why regularization, preconditioning, and iterative solvers work: they all manipulate the geometry of the projection operator $X(X^\top X)^{-1}X^\top$ to balance fit and stability.
Numerical and Shape Notes
Shape discipline is essential here. For $X \in \mathbb{R}^{50 \times 4}$, $w \in \mathbb{R}^4$, the product $Xw \in \mathbb{R}^{50}$ yields predictions matching $y$âs shape. The residual $r = y - Xw \in \mathbb{R}^{50}$ is a column vector. Computing $X^\top r$ requires transposing $X$ to shape $(4, 50)$, then matrix-vector multiplication $(4, 50) \times (50,) \to (4,)$ yields a vector in $\mathbb{R}^4$âthe gradient of the loss with respect to weights. The norm ||X.T@r||_2 collapses this to a scalar diagnostic. Printing both norms provides context: if $\|r\|_2 \approx 5.0$ but $\|X^\top r\|_2 \approx 10^{-13}$, the solver found the optimal least squares solution (within numerical precision), and the residual norm reflects irreducible error (noise or model capacity limits). If $\|X^\top r\|_2$ is unexpectedly large, investigate via SVD: U, S, Vt = np.linalg.svd(X, full_matrices=False); print("cond:", S.max()/S.min()). Condition numbers above $10^8$ often require regularization or more stable solvers.
Least squares has a distinguished history as one of the earliest computational methods in science. Carl Friedrich Gauss developed the technique around 1795 (age 18) but kept it secret until 1809, when he published it in Theoria Motus Corporum Coelestium to compute the orbit of the asteroid Ceres from sparse observations. His prediction allowed astronomers to relocate Ceres after it was lost behind the Sun, establishing least squares as a practical tool for data fitting. Adrien-Marie Legendre independently published the method in 1805, coining the term âmethod of least squares.â The normal equations $X^\top X w = X^\top y$ were the standard solution until the 20th century, when numerical analysts recognized that forming $X^\top X$ squares the condition number, amplifying rounding errors catastrophically for ill-conditioned problems. Alston Householder introduced QR decomposition in the 1950s as a stable alternative, and Gene Golubâs SVD algorithm (1965) provided an even more robust approach, handling rank-deficient matrices gracefully. The geometric interpretationâthat least squares projects the response onto the column space of the design matrixâemerged with the development of linear algebra in the mid-20th century, connecting optimization to geometry. In modern machine learning, least squares is ubiquitous: linear regression (minimize $\|Xw - y\|_2^2$ directly), ridge regression (add $\lambda \|w\|_2^2$ penalty to stabilize ill-conditioned problems), weighted least squares (handle heteroscedastic noise by minimizing $\|W^{1/2}(Xw - y)\|_2^2$), iteratively reweighted least squares (solve nonlinear problems like logistic regression by linearizing and solving sequences of least squares subproblems), and Gauss-Newton optimization (second-order method for neural network training that approximates the Hessian with $X^\top X$). Applications span every quantitative field: econometrics (fitting supply/demand curves), genomics (estimating genetic effects from GWAS data), computer vision (structure-from-motion and camera calibration), signal processing (Wiener filtering for denoising), and robotics (Kalman filtering for state estimation). Despite being over 200 years old, least squares remains the workhorse of data analysis, prized for its simplicity, interpretability, and the existence of exact, stable algorithms.
Least squares synthesizes concepts from multiple chapters:
- Chapter 6 (Orthogonality & Projections): The prediction $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{span}(X)$; the residual $r$ is the orthogonal complement.
- Chapter 7 (Rank & Nullspace): If $X$ is rank-deficient, the least squares solution is not uniqueâthe nullspace of $X$ parametrizes infinitely many solutions with the same residual norm.
lstsqreturns the minimum-norm solution. - Chapter 10 (SVD): The SVD-based solver computes $\hat{w} = V \Sigma^{-1} U^\top y$ (where $X = U \Sigma V^\top$), avoiding the condition number squaring of $X^\top X$.
- Chapter 12 (Least Squares): This example demonstrates the computational core of linear regressionâsolving $X^\top X w = X^\top y$ via stable factorizations (QR/SVD) rather than explicit inversion.
- Chapter 13 (Solving Systems): Least squares extends to overdetermined systems ($n > d$), where no exact solution exists; the solution minimizes the residual in the $\ell_2$ norm.
- Chapter 14 (Conditioning & Stability): The orthogonality check $\|X^\top r\|_2$ serves as a backward error diagnosticâsmall values confirm that the computed $\hat{w}$ satisfies the optimality condition to near machine precision.
Comments