Example number
38
Example slug
example_38_least_squares_residual_is_orthogonal_to_column_space
Background

Least squares minimizes $\|y - Xw\|_2^2$ over $w$. The optimality condition from calculus is $\nabla_w \|y - Xw\|_2^2 = -2 X^\top (y - Xw) = 0$, which is the normal equations $X^\top X w = X^\top y$. Geometrically, this says the residual $r = y - Xw$ is orthogonal to every column of $X$—the prediction $\hat{y} = Xw$ is the orthogonal projection of $y$ onto $\text{span}(X)$. Numerically, solvers use QR or SVD to avoid forming $X^\top X$ (which squares the condition number). The orthogonality check $\|X^\top r\|_2 \approx 0$ is a convergence diagnostic: small values confirm the solver reached the optimum to machine precision, while large values indicate ill-conditioning or rank deficiency.

Purpose

Build intuition that least squares is a projection problem, not just an optimization problem. The optimal solution places predictions $\hat{y} = X\hat{w}$ in the column space of $X$, and the residual $r = y - \hat{y}$ must be orthogonal to that subspace. This orthogonality condition $X^\top r = 0$ is the geometric heart of regression, ridge regression, generalized least squares, and neural network weight updates. Understanding it clarifies convergence diagnostics, conditioning effects, and the bias-variance tradeoff.

Problem

Fit least squares and verify X^T(y-Xŵ)≈0.

Solution (Math)

Optimality gives $X^T$Xw-y$=0$, so residual $r=y-X\hat w$ is orthogonal to each column of $X$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_linreg

X, y, _ = toy_linreg(n=30, d=3, seed=0)
w = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X@w
print("||X^T r||_2:", np.linalg.norm(X.T@r))
Code Introduction

This code verifies the fundamental orthogonality condition of least squares: at the optimal solution, the residual vector must be orthogonal to every column of the design matrix. It demonstrates that numerical linear algebra solvers achieve the geometric property underlying all projection-based regression methods.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Load data: X, y, _ = toy_linreg(n=30, d=3, seed=0) generates $X \in \mathbb{R}^{30\times 3}$ and $y \in \mathbb{R}^{30}$.
  2. Solve least squares: w = np.linalg.lstsq(X, y, rcond=None)[0] computes $\hat{w}$ minimizing $\|y - Xw\|_2^2$ using SVD.
  3. Compute residual: r = y - X@w gives the vector of prediction errors.
  4. Check orthogonality: X.T@r should be near-zero vector; np.linalg.norm(X.T@r) collapses to scalar diagnostic.
  5. Expected magnitude: for well-conditioned $X$ with $\kappa(X) \lesssim 10^3$, expect $\|X^\top r\|_2 / \|r\|_2 < 10^{-12}$.
  6. If large: compute condition number np.linalg.cond(X) and check for rank deficiency via np.linalg.matrix_rank(X).
  7. Sanity checks: print residual norm, orthogonality error, and condition number as convergence triplet.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Least squares as projection: $\hat{y} = X\hat{w}$ is the closest point in $\text{span}(X)$ to $y$ under Euclidean distance.
  • Orthogonality condition: the residual $r = y - \hat{y}$ satisfies $X^\top r = 0$, meaning $r$ is orthogonal to every column of $X$.
  • Numerical verification: $\|X^\top r\|_2$ should be near machine precision ($\sim 10^{-13}$ in double precision) for well-conditioned problems.
  • Conditioning effects: large $\kappa(X)$ amplifies rounding errors, increasing $\|X^\top r\|_2$ even when the solver converges.
  • Solver choice matters: lstsq uses SVD/QR for stability; explicit $X^\top X$ inversion squares condition number and often fails.
  • Shape discipline: $X \in \mathbb{R}^{n\times d}$, $w \in \mathbb{R}^d$, $r \in \mathbb{R}^n$, $X^\top r \in \mathbb{R}^d$.
Notes

Part 1: Problem Setup and Least Squares Solution
The toy dataset has $X \in \mathbb{R}^{30\times 3}$ (30 examples, 3 features) and $y \in \mathbb{R}^{30}$. The solver np.linalg.lstsq uses SVD internally: decompose $X = U\Sigma V^\top$, then $\hat{w} = V\Sigma^{-1}U^\top y$ (with singular value cutoff for rank deficiency). This avoids forming $X^\top X$, which squares the condition number and loses precision. The resulting $\hat{w} \in \mathbb{R}^3$ defines fitted values $\hat{y} = X\hat{w}$, the orthogonal projection of $y$ onto $\text{span}(X)$.

Part 2: Residual Vector as Orthogonal Complement
The residual $r = y - X\hat{w}$ is the component of $y$ orthogonal to $\text{span}(X)$. The normal equations $X^\top X w = X^\top y$ rearrange to $X^\top (y - Xw) = 0$, which is the orthogonality condition $X^\top r = 0$. Geometrically, this says $r$ is perpendicular to every column of $X$. The vector space $\mathbb{R}^{30}$ decomposes as $y = \hat{y} + r$ where $\hat{y} \in \text{span}(X)$ and $r \in \text{span}(X)^\perp$. This is the orthogonal direct sum decomposition.

Part 3: Numerical Verification as Diagnostic
The quantity $\|X^\top r\|_2$ measures how close the computed solution is to satisfying the normal equations. For well-conditioned $X$ with $\kappa(X) \lesssim 10^3$, expect $\|X^\top r\|_2 / \|r\|_2 < 10^{-12}$ in double precision. If the ratio exceeds $10^{-8}$, suspect: (1) ill-conditioning ($\kappa(X) > 10^8$), (2) rank deficiency (nearly collinear features), (3) solver failure (rare), or (4) bug (wrong transpose, shape mismatch). This diagnostic is more informative than just $\|r\|_2$: small residual norm means good fit, but large $\|X^\top r\|_2$ means the solver didn’t converge to the true optimum.

Connection to Projection Geometry
The projection operator $P = X(X^\top X)^{-1}X^\top$ (assuming full column rank) satisfies $P^2 = P$ (idempotent) and $P^\top = P$ (symmetric), making it an orthogonal projector. The fitted values are $\hat{y} = Py$ and the residual is $r = (I - P)y$. The complementary projector $(I - P)$ projects onto $\text{span}(X)^\perp$. The orthogonality condition $X^\top r = 0$ is equivalent to $X^\top (I - P)y = 0$, which holds by construction: $X^\top (I - P) = X^\top - X^\top X(X^\top X)^{-1}X^\top = X^\top - X^\top = 0$. This projector viewpoint clarifies ridge regression: adding $\lambda I$ changes $P$ to $P_\lambda = X(X^\top X + \lambda I)^{-1}X^\top$, which is no longer orthogonal but still a projector.

Why This Matters for ML
Gradient descent on least squares iteratively approaches $X^\top r = 0$: each step $w_{k+1} = w_k + \eta X^\top r_k$ reduces $\|X^\top r\|_2$. Monitoring this quantity diagnoses convergence and conditioning. In neural networks, backprop through linear layers computes $\nabla_W = X^\top \delta$ (where $\delta$ is upstream error), which has the same structure as $X^\top r$—understanding projection geometry clarifies why gradients vanish or explode when activations ($X$) have extreme norms. Feature engineering: if $\|X^\top r\|_2$ is small but $\|r\|_2$ is large, the model converged but lacks capacity (add features or go nonlinear). Regularization: ridge adds $\lambda w$ to the gradient, modifying the orthogonality condition to $X^\top r + \lambda w = 0$; the residual is no longer strictly orthogonal to $\text{span}(X)$, trading bias for stability.

Numerical and Implementation Notes
Use np.linalg.lstsq with rcond=None for automatic singular value cutoff. Avoid explicit $(X^\top X)^{-1}$ unless $X$ is extremely well-conditioned ($\kappa < 10$). For large $n$ or $d$, consider iterative solvers (conjugate gradients) or stochastic methods. Check orthogonality error relative to residual norm: ortho_err = np.linalg.norm(X.T@r) / np.linalg.norm(r); values $> 10^{-8}$ indicate numerical issues. For ill-conditioned $X$, use ridge regression w = np.linalg.solve(X.T@X + lam*np.eye(d), X.T@y) to stabilize.

Numerical and Shape Notes
Shapes: $X(n\times d)$, $w(d)$, $y(n)$, $\hat{y}(n)$, $r(n)$, $X^\top r(d)$. Transpose $X^\top$ has shape $(d, n)$, so $(d,n) \times (n,) \to (d,)$ for $X^\top r$. For batched problems with multiple $y$ vectors ($Y \in \mathbb{R}^{n\times k}$), compute $W = \text{lstsq}(X, Y)$ with shape $(d, k)$, residuals $R = Y - XW$ with shape $(n, k)$, and check $\|X^\top R\|_F$ (Frobenius norm). Gotcha: if $r$ is accidentally a row vector, X.T@r may succeed but produce wrong results; always verify r.shape == (n,). For intercept models, ensure $X$ has a column of ones; then r.sum() ≈ 0 (residuals sum to zero) is an additional diagnostic.

History and Applications

Least squares dates to Gauss (1809) and Legendre (1805), who used it for astronomical orbit determination and geodesy. The orthogonality condition $X^\top r = 0$ was recognized early as the key to uniqueness and optimality. Numerical methods evolved from normal equations ($X^\top X w = X^\top y$, ill-conditioned for large $\kappa(X)$) to QR decomposition (Householder, 1958) and SVD (Golub & Kahan, 1965), which avoid forming $X^\top X$ and handle rank deficiency gracefully. In statistics, least squares is the maximum likelihood estimator for linear regression with Gaussian noise; the orthogonality condition is the first-order optimality of the log-likelihood. In machine learning, least squares is the building block for ridge regression, LASSO (orthogonality plus $L^1$ penalty), generalized least squares (weighted norms), and neural network training (each layer update uses the same gradient structure as $X^\top r$). Modern optimization (gradient descent, Newton’s method, conjugate gradients) iteratively enforces $X^\top r \to 0$, with convergence rates determined by $\kappa(X)$. Understanding least squares as projection—not just curve-fitting—clarifies regularization (changing the projection subspace), conditioning (amplifying rounding errors), and generalization (bias-variance decomposition is prediction vs. residual split).

Connection to Broader Examples
  • Vector spaces (Ch. 1): Column space $\text{span}(X)$ is a subspace; projections map $\mathbb{R}^n$ to subspaces.
  • Span (Ch. 2): $\hat{y} = Xw$ lies in $\text{span}(X)$; residual $r$ lies in the orthogonal complement.
  • Inner products (Ch. 5): Orthogonality is $\langle X_{:,j}, r \rangle = 0$ for all columns $j$, which is $X^\top r = 0$.
  • Projections (Ch. 6): Least squares is projection onto $\text{span}(X)$; projector is $P = X(X^\top X)^{-1}X^\top$.
  • SVD (Ch. 10): lstsq uses SVD to handle rank deficiency; condition number $\kappa(X) = \sigma_{\max}/\sigma_{\min}$.
  • PCA (Ch. 11): Projecting data onto principal components is least squares onto a low-rank subspace.
  • Least squares (Ch. 12): This example is the canonical orthogonality verification for all LS problems.
  • Conditioning (Ch. 14): Large $\kappa(X)$ amplifies $\|X^\top r\|_2$; regularization (ridge) improves conditioning.

Comments