Example 6: Least squares residual is orthogonal to column space
Chapter: 6 orthogonality_projections
Purpose
Build geometric intuition linking optimization to projections:
- Understand that least squares is an orthogonal projection onto the column space.
- See that the residual $r = y - X\hat{w}$ is orthogonal to all columns of $X$.
- Verify the optimality condition $X^\top r \approx 0$ numerically.
- Recognize the pattern: fit = project; residual orthogonality = optimality.
Problem
Fit least squares and verify X^T(y-Xŵ)≈0.
Solution (math)
The least squares problem minimizes $\|y - Xw\|_2^2$. Taking the gradient and setting it to zero gives the normal equations:
\[ X^\top X \hat{w} = X^\top y. \]
Multiplying through, we see $X^\top(y - X\hat{w}) = 0$, so the residual $r = y - X\hat{w}$ is orthogonal to each column of $X$. Geometrically, $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{span}(X)$, and $r$ lies in the orthogonal complement.
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^\top y$.
Background
Least squares regression, dating back to Gauss and Legendre’s work in astronomy (1800s), is the foundational convex optimization problem. In machine learning, it serves as the closed-form solution for linear regression, the normal equations step in iterative reweighted least squares, the inner loop of many optimization algorithms (Newton’s method, trust region methods), and the local quadratic approximation in second-order training.
The geometric perspective reveals that the optimal prediction $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{span}(X)$—the subspace of all possible linear predictions. The residual $r = y - \hat{y}$ lies in the orthogonal complement and satisfies $X^\top r = 0$. This orthogonality condition is both the first-order optimality condition (gradient = 0) and a geometric necessity (projections minimize distance). Numerically verifying $\|X^\top r\|_2 \approx 0$ confirms that the solver found the true optimum and that numerical errors are small.
What this example demonstrates
- Least squares residuals are orthogonal to the column space: $X^\top(y - X\hat{w}) = 0$.
- This orthogonality is the geometric signature of projection and the algebraic signature of optimality.
np.linalg.lstsqsolves the normal equations stably via QR or SVD.- Verifying $\|X^\top r\|_2 \approx 0$ is a numerical sanity check for regression diagnostics.
Numerical Implementation Details
- Solve $\min_w \|y - Xw\|_2^2$ via
np.linalg.lstsq(X, y, rcond=None), which uses QR or SVD internally—never form $X^\top X$ explicitly. - Compute residual: $r = y - X\hat{w}$ with shapes $y:(n), X:(n\times d), w:(d), r:(n)$.
- Compute $X^\top r$ with shape $(d)$ and take its norm:
np.linalg.norm(X.T @ r). - Expect $\|X^\top r\|_2 < 10^{-12}$ for well-conditioned problems; larger values signal ill-conditioning or bugs.
- Avoid explicit inverses: $(X^\top X)^{-1}$ is numerically unstable when $X$ has nearly dependent columns.
Connection to the broader example
- Projection operators: The matrix $P = X(X^\top X)^{-1}X^\top$ projects onto $\text{span}(X)$; residual orthogonality implies $P^2 = P$ (idempotence).
- Ridge regression: Adding $\lambda I$ to $X^\top X$ shrinks $\hat{w}$ and trades off fit quality (larger residuals) against regularization.
- Gradient descent: The residual $r$ is proportional to the gradient $\nabla_w L = -X^\top r$; orthogonality means gradient = 0 at the optimum.
- Generalization: Orthogonality generalizes to weighted least squares (with metric $W$), kernel ridge regression (in RKHS), and iteratively reweighted least squares (GLMs).
- Conditioning: If $X$ has small singular values, $X^\top X$ is nearly singular, and small $\|X^\top r\|$ may hide large errors in $\hat{w}$.
Notes
- Shape discipline: $X\in\mathbb{R}^{n\times d}$, $y\in\mathbb{R}^n$, $w\in\mathbb{R}^d$, $r\in\mathbb{R}^n$, $X^\top r\in\mathbb{R}^d$.
- Numerical note 1:
lstsquses stable QR/SVD;rcond=Nonesets the threshold for filtering small singular values to machine precision. - Numerical note 2 (tolerance): Treat $\|X^\top r\|_2$ on the order of $10^{-12}$–$10^{-14}$ as numerical zero; materially larger values signal ill-conditioning, rank deficiency, or implementation bugs.
- Interpretation: $\|X^\top r\|_2 \approx 0$ confirms optimality; non-zero values indicate numerical issues (ill-conditioning, bugs) or non-convergence.
- Geometric insight: The residual $r$ is the shortest vector from $y$ to $\text{span}(X)$; orthogonality ensures this minimum distance.
- Numerical note 3: The
lstsqsolver uses stable QR or SVD decomposition rather than explicitly forming $X^\top X$, which can be ill-conditioned when $X$ has nearly dependent columns. Thercond=Noneparameter uses machine precision to filter out effectively zero singular values. Shape discipline: $X^\top r$ has shape $(d)$ (e.g., $(3)$ when $X\in\mathbb{R}^{30\times 3}$), and the norm reduces this to a scalar. If the printed value is large (e.g., $> 10^{-10}$), it signals numerical instability or a bug in the solver.
References
- See Chapter 6.
- G. Strang. Introduction to Linear Algebra (5th ed.). Wellesley–Cambridge Press, 2016.
- G. H. Golub and C. F. Van Loan. Matrix Computations (4th ed.). Johns Hopkins University Press, 2013.
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning (2nd ed.). Springer, 2009.
- S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Code Introduction
We generate synthetic regression data with $n=30$ examples and $d=3$ features using toy_linreg. The design matrix $X\in\mathbb{R}^{30\times 3}$ and target $y\in\mathbb{R}^{30}$ are passed to np.linalg.lstsq, which returns the optimal weights $\hat{w}\in\mathbb{R}^3$. We compute the residual $r = y - X\hat{w}$ and verify orthogonality by printing $\|X^\top r\|_2$, which should be near zero (typically $< 10^{-13}$). Shapes: X.T @ r is $(3, 30) \times (30,) = (3,)$, and the norm reduces this to a scalar.
Solution (Python)
import numpy as np
from scripts.toy_data import toy_linreg
X, y, _ = toy_linreg(n=30, d=3, seed=0)
w = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X@w
print("||X^T r||_2:", np.linalg.norm(X.T@r))History
Orthogonal projections and least squares are intertwined since Gauss’s work on astronomical data (1809). Gauss showed that the best-fitting line minimizes the sum of squared deviations, and the solution satisfies the “normal equations”—a term that survives to this day.
Geometric interpretation: The orthogonality condition $X^\top r = 0$ was first interpreted geometrically in the mid-20th century as a statement about projections onto subspaces. This perspective—that regression is projection—became canonical in statistics and linear algebra texts (Strang, Gilbert).
QR and SVD stability: While the normal equations $X^\top X \hat{w} = X^\top y$ are elegant, they can be numerically unstable. The solution is to use QR decomposition or SVD (Golub & Van Loan, 1980s), which avoid explicitly forming $X^\top X$. Modern solvers (NumPy, LAPACK) default to these stable methods, making least squares a reliable computational workhorse.