ex1.ai

Example number

Example slug

example_6_least_squares_residual_is_orthogonal_to_column_space

Background

Least squares regression, dating back to Gauss and Legendreâs work in astronomy (1800s), is the foundational convex optimization problem. In machine learning, it serves as the closed-form solution for linear regression, the normal equations step in iterative reweighted least squares, the inner loop of many optimization algorithms (Newtonâs method, trust region methods), and the local quadratic approximation in second-order training.

The geometric perspective reveals that the optimal prediction $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{span}(X)$âthe subspace of all possible linear predictions. The residual $r = y - \hat{y}$ lies in the orthogonal complement and satisfies $X^\top r = 0$. This orthogonality condition is both the first-order optimality condition (gradient = 0) and a geometric necessity (projections minimize distance). Numerically verifying $\|X^\top r\|_2 \approx 0$ confirms that the solver found the true optimum and that numerical errors are small.

Purpose

Build geometric intuition linking optimization to projections:

Understand that least squares is an orthogonal projection onto the column space.
See that the residual $r = y - X\hat{w}$ is orthogonal to all columns of $X$.
Verify the optimality condition $X^\top r \approx 0$ numerically.
Recognize the pattern: fit = project; residual orthogonality = optimality.

Problem

Fit least squares and verify X^T(y-XwÌ)â0.

Solution (Math)

The least squares problem minimizes $\|y - Xw\|_2^2$. Taking the gradient and setting it to zero gives the normal equations:

\[ X^\top X \hat{w} = X^\top y. \]

Multiplying through, we see $X^\top(y - X\hat{w}) = 0$, so the residual $r = y - X\hat{w}$ is orthogonal to each column of $X$. Geometrically, $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{span}(X)$, and $r$ lies in the orthogonal complement.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^\top y$.

Solution (Python)


import numpy as np
from scripts.toy_data import toy_linreg

X, y, _ = toy_linreg(n=30, d=3, seed=0)
w = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X@w
print("||X^T r||_2:", np.linalg.norm(X.T@r))

Code Introduction

We generate synthetic regression data with $n=30$ examples and $d=3$ features using toy_linreg. The design matrix $X\in\mathbb{R}^{30\times 3}$ and target $y\in\mathbb{R}^{30}$ are passed to np.linalg.lstsq, which returns the optimal weights $\hat{w}\in\mathbb{R}^3$. We compute the residual $r = y - X\hat{w}$ and verify orthogonality by printing $\|X^\top r\|_2$, which should be near zero (typically $< 10^{-13}$). Shapes: X.T @ r is $(3, 30) \times (30,) = (3,)$, and the norm reduces this to a scalar.

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

Solve $\min_w \|y - Xw\|_2^2$ via np.linalg.lstsq(X, y, rcond=None), which uses QR or SVD internallyânever form $X^\top X$ explicitly.
Compute residual: $r = y - X\hat{w}$ with shapes $y:(n), X:(n\times d), w:(d), r:(n)$.
Compute $X^\top r$ with shape $(d)$ and take its norm: np.linalg.norm(X.T @ r).
Expect $\|X^\top r\|_2 < 10^{-12}$ for well-conditioned problems; larger values signal ill-conditioning or bugs.
Avoid explicit inverses: $(X^\top X)^{-1}$ is numerically unstable when $X$ has nearly dependent columns.

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

Least squares residuals are orthogonal to the column space: $X^\top(y - X\hat{w}) = 0$.
This orthogonality is the geometric signature of projection and the algebraic signature of optimality.
np.linalg.lstsq solves the normal equations stably via QR or SVD.
Verifying $\|X^\top r\|_2 \approx 0$ is a numerical sanity check for regression diagnostics.

Notes

Part 1: Core setup - Least squares residual is orthogonal to column space

State the objects, shapes, and target question for Least squares residual is orthogonal to column space. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - Least squares residual is orthogonal to column space

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make Least squares residual is orthogonal to column space work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - Least squares residual is orthogonal to column space

Give the computational recipe for Least squares residual is orthogonal to column space, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

Shape discipline: $X\in\mathbb{R}^{n\times d}$, $y\in\mathbb{R}^n$, $w\in\mathbb{R}^d$, $r\in\mathbb{R}^n$, $X^\top r\in\mathbb{R}^d$.
Numerical note 1: lstsq uses stable QR/SVD; rcond=None sets the threshold for filtering small singular values to machine precision.
Numerical note 2 (tolerance): Treat $\|X^\top r\|_2$ on the order of $10^{-12}$â$10^{-14}$ as numerical zero; materially larger values signal ill-conditioning, rank deficiency, or implementation bugs.
Interpretation: $\|X^\top r\|_2 \approx 0$ confirms optimality; non-zero values indicate numerical issues (ill-conditioning, bugs) or non-convergence.
Geometric insight: The residual $r$ is the shortest vector from $y$ to $\text{span}(X)$; orthogonality ensures this minimum distance.
Numerical note 3: The lstsq solver uses stable QR or SVD decomposition rather than explicitly forming $X^\top X$, which can be ill-conditioned when $X$ has nearly dependent columns. The rcond=None parameter uses machine precision to filter out effectively zero singular values. Shape discipline: $X^\top r$ has shape $(d)$ (e.g., $(3)$ when $X\in\mathbb{R}^{30\times 3}$), and the norm reduces this to a scalar. If the printed value is large (e.g., $> 10^{-10}$), it signals numerical instability or a bug in the solver.

History and Applications

Orthogonal projections and least squares are intertwined since Gaussâs work on astronomical data (1809). Gauss showed that the best-fitting line minimizes the sum of squared deviations, and the solution satisfies the ânormal equationsââa term that survives to this day.

Geometric interpretation: The orthogonality condition $X^\top r = 0$ was first interpreted geometrically in the mid-20th century as a statement about projections onto subspaces. This perspectiveâthat regression is projectionâbecame canonical in statistics and linear algebra texts (Strang, Gilbert).

QR and SVD stability: While the normal equations $X^\top X \hat{w} = X^\top y$ are elegant, they can be numerically unstable. The solution is to use QR decomposition or SVD (Golub & Van Loan, 1980s), which avoid explicitly forming $X^\top X$. Modern solvers (NumPy, LAPACK) default to these stable methods, making least squares a reliable computational workhorse.

Connection to Broader Examples

Projection operators: The matrix $P = X(X^\top X)^{-1}X^\top$ projects onto $\text{span}(X)$; residual orthogonality implies $P^2 = P$ (idempotence).
Ridge regression: Adding $\lambda I$ to $X^\top X$ shrinks $\hat{w}$ and trades off fit quality (larger residuals) against regularization.
Gradient descent: The residual $r$ is proportional to the gradient $\nabla_w L = -X^\top r$; orthogonality means gradient = 0 at the optimum.
Generalization: Orthogonality generalizes to weighted least squares (with metric $W$), kernel ridge regression (in RKHS), and iteratively reweighted least squares (GLMs).
Conditioning: If $X$ has small singular values, $X^\top X$ is nearly singular, and small $\|X^\top r\|$ may hide large errors in $\hat{w}$.