Example number
22
Example slug
example_22_least_squares_residual_is_orthogonal_to_column_space
Background

Least squares dates to Legendre and Gauss in the early 19th century (astronomy, geodesy). The modern geometric view explains its optimality: among all vectors in the column space $\operatorname{col}(X)$, the orthogonal projection $X\hat w$ is closest to $y$ in the $\ell_2$ sense, so the residual $r = y - X\hat w$ is orthogonal to every column of $X$. Algebraically, the normal equations $X^\top X\hat w = X^\top y$ imply $X^\top r = 0$.

Numerically, forming $X^\top X$ can amplify conditioning issues; robust solvers use QR or SVD to compute $\hat w$ without explicit inverses. When $X$ has full column rank, $X^\top X$ is symmetric positive definite and Cholesky can be applied—but QR/SVD remain safer when columns are nearly dependent. In ML, least squares underpins linear regression, Gauss–Newton steps, and local quadratic approximations of more complex objectives.

Purpose

Connect optimization, geometry (orthogonal projection), and numerics (solver choice, conditioning) in a single, reusable pattern. Least squares fits $w$ so that $Xw$ is the orthogonal projection of $y$ onto the column space of $X$, and the residual $r = y - X\hat w$ lies in the orthogonal complement. This geometric fact doubles as a powerful numerical diagnostic: $\|X^\top r\|_2$ should be near machine precision when the solve is correct and the problem is well-conditioned.

Problem

Fit least squares and verify X^T(y-Xŵ)≈0.

Solution (Math)

Optimality gives $X^T$Xw-y$=0$, so residual $r=y-X\hat w$ is orthogonal to each column of $X$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_linreg

X, y, _ = toy_linreg(n=30, d=3, seed=0)
w = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X@w
print("||X^T r||_2:", np.linalg.norm(X.T@r))
Code Introduction

This snippet verifies the orthogonality condition of least squares residuals. With toy_linreg(n=30, d=3), you have $X \in \mathbb{R}^{30\times 3}$ and $y \in \mathbb{R}^{30}$. The solver np.linalg.lstsq(X, y, rcond=None) returns $w \in \mathbb{R}^3$ that minimizes $\|y - Xw\|_2^2$. The residual is $r = y - Xw \in \mathbb{R}^{30}$. At the optimum, the normal equations imply $X^\top r = 0$, meaning the residual is orthogonal to every column of $X$. Computing $\|X^\top r\|_2$ should be near zero for a well-conditioned problem; geometrically, $Xw$ is the orthogonal projection of $y$ onto $\operatorname{col}(X)$ and $r$ lies in its orthogonal complement

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Solve for $\hat w$ via np.linalg.lstsq(X, y, rcond=None), which uses QR/SVD under the hood and avoids forming $X^X`.
  • Form the residual r = y - X @ w and verify orthogonality by computing X.T @ r and its 2-norm.
  • Expect np.linalg.norm(X.T @ r) to be near machine precision (e.g., $10^{-12}$–$10^{-8}$ in float64) for well-conditioned $X`.
  • Track shapes: (n,d) @ (d,) -> (n,) for X @ w; (d,n) @ (n,) -> (d,) for X.T @ r.
  • Stability guardrails: if the norm is large, inspect conditioning (e.g., singular values), rescale features, or switch to SVD-based solvers.
  • Alternatives: use scipy.linalg.lstsq for more control; for SPD normal equations, scipy.linalg.cho_factor/cho_solve can be efficient.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Residual orthogonality: at the optimum, $X^\top r = 0$ so $r$ is orthogonal to each column of $X$.
  • Projection viewpoint: $X\hat w$ is the orthogonal projection of $y$ onto $\operatorname{col}(X)$; $r$ lies in the orthogonal complement.
  • A practical diagnostic: computing $\|X^\top r\|_2$ verifies solver correctness and highlights conditioning issues.
  • Shape discipline: confirm $(n,d)$ for $X$, $(d,)$ for $w$, $(n,)$ for $y$ and $r$, and $(d,)$ for $X^\top r$.
  • Solver choice guidance: prefer lstsq/QR/SVD over explicit normal-equation inverses; consider Cholesky when SPD is guaranteed.
Notes

Part 1: Core setup - Least squares residual is orthogonal to column space

State the objects, shapes, and target question for Least squares residual is orthogonal to column space. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - Least squares residual is orthogonal to column space

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make Least squares residual is orthogonal to column space work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - Least squares residual is orthogonal to column space

Give the computational recipe for Least squares residual is orthogonal to column space, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: check dimensions before manipulating formulas.
  • Numerical note: prefer stable primitives (lstsq, QR/SVD, Cholesky for SPD) over explicit inverses.
  • Interpretation: relate algebraic steps to geometry (subspaces, projections) and to ML behavior (generalization, stability).

Numerical notes: np.linalg.lstsq is typically stable for moderate conditioning because it leverages QR/SVD rather than inverting $X^\top X$. In float64, a well-posed problem should yield $\|X^\top r\|_2$ close to machine precision; choose a tolerance appropriate to data scale (e.g., compare $\|X^\top r\|_2$ to $\|X\|\,\|r\|$). If $\|X^\top r\|$ is unexpectedly large, inspect singular values (SVD), rescale columns, or regularize. Finally, assert shapes: $(n,d)$ for $X$, $(d,)$ for $w$, $(n,)$ for $y$ and $r$, and $(d,)$ for $X^\top r$.

History and Applications

Legendre (1805) and Gauss (1809) formalized least squares in astronomy and geodesy, with Gauss emphasizing projection and orthogonality in deriving optimality. The condition $X^\top r = 0$ encapsulates the geometric essence: $X\hat w$ is the closest point in $\operatorname{col}(X)$ to $y$, and the residual points perpendicularly. In statistics, ordinary least squares (OLS) leverages the same principle; diagnostics such as residual plots and leverage scores build on orthogonality structure.

In modern ML, orthogonality conditions inform algorithmic choices: Gauss–Newton approximations, trust-region methods, and gradient-based solvers routinely exploit projection viewpoints. Projections and residual checks appear in dimensionality reduction (PCA), regularization (ridge/Tikhonov), and iterative refinement of models. Practically, monitoring $\|X^\top r\|$ serves as a robust sanity check for implementations and a lens on conditioning and feature engineering.

Connection to Broader Examples
  • Chapter 6 (orthogonality and projections): residual orthogonality is the core optimality condition for least squares.
  • Chapter 12 (least squares): this example is a concrete diagnostic for the projection interpretation of linear regression.
  • Chapter 14 (conditioning and stability): large $\|X^\top r\|$ often signals ill-conditioning or model misspecification.
  • Chapter 13 (solving systems): relates to SPD solves via normal equations, and why QR/SVD can be preferable.
  • Chapter 11 (PCA): projections onto low-dimensional subspaces recur in dimensionality reduction; orthogonality is central.

Comments