Example number
54
Example slug
example_54_orthogonality_in_regression_residual_column_space
Background

Historical context: Least-squares regression emerged from Gauss’s method of least squares (1809), designed to estimate planetary orbits from noisy astronomical observations. Legendre independently developed it for geodetic surveys. The orthogonality condition (residuals perpendicular to the model space) was implicitly understood but formalized in the 20th century via projection theory and functional analysis. Gram-Schmidt orthogonalization (1880s-1900s) provided a computational method to construct orthonormal bases. Modern numerical algorithms (QR decomposition, SVD) exploited this geometry to solve least-squares problems stably. In ML, least-squares is the prototype convex objective: understanding its geometry (projection, orthogonality) clarifies how all optimization works locally.

Mathematical characterization: The residual orthogonality $X^\top r = 0$ is equivalent to the normal equations $X^\top X \hat{w} = X^\top y$, which arise from setting the gradient of the loss $\mathcal{L}(w) = \|y - Xw\|_2^2$ to zero: $\frac{\partial \mathcal{L}}{\partial w} = -2 X^\top (y - Xw) = 0$. This is first-order optimality. The projection theorem guarantees a unique orthogonal projection of $y$ onto the column space of $X$, characterized by orthogonality to the subspace. For ill-conditioned or low-rank $X$, numerical solvers (SVD, QR) exploit orthogonal decompositions to stably satisfy this condition despite rounding errors.

Prevalence in ML: Every statistical model fit via least-squares (linear regression, logistic regression pre-loss, neural networks with MSE loss) inherently solves a projection problem. Ridge regression, Lasso, and other regularizers modify the residual geometry by adding bias ($X^\top r \neq 0$) to improve generalization. Iterative solvers (gradient descent, conjugate gradient) progress toward satisfying orthogonality. Understanding this pattern is foundational for debugging fitting issues, choosing regularization, and reasoning about generalization.

Purpose

Show how residual orthogonality $X^\top r = 0$ characterizes least-squares optimality, connecting optimization, geometry (projection), and numerics (solver choice). Demonstrate that the least-squares solution $\hat{w}$ is the unique weight vector such that prediction residuals $r = y - X\hat{w}$ are orthogonal to every column of the design matrix $X$—equivalently, residuals lie in the orthogonal complement of the column space. Build intuition for why this condition is both a consequence (residuals must be orthogonal for optimality) and a characterization (any $w$ satisfying this condition is optimal). Emphasize that verifying $\|X^\top r\|_2 \approx 0$ is a sanity check that your solver succeeded, and that breaking this condition (via regularization) reveals the trade-off between fit and stability.

Problem

Compute least squares solution and verify the residual is orthogonal to X’s columns (X^T r ≈ 0).

Solution (Math)

Least squares optimality implies normal equations $X^T$Xw-y$=0$. Equivalently, the residual $r=y-X\hat w$ is orthogonal to $\mathrm{range}(X)$, i.e., $X^T r=0$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_linreg

X, y, _ = toy_linreg(n=25, d=3, seed=0)
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X@w_hat
print("||X^T r||_2:", np.linalg.norm(X.T@r))
Code Introduction

This snippet demonstrates the orthogonality condition of least-squares regression: the residual vector $r = y - X\hat{w}$ is orthogonal to the column space of the design matrix $X$. It reveals a fundamental property—least-squares solutions satisfy $X^\top r = 0$ (or near-zero numerically)—which characterizes the optimal regression coefficients and underpins projection-based intuition for fitting.

Part 1: Least-Squares Setup. The code generates a synthetic linear regression dataset and solves least-squares via np.linalg.lstsq(X, y, rcond=None)[0], which internally uses SVD or QR decomposition for numerical stability (avoiding explicit normal equations $X^\top X \hat{w} = X^\top y$). The result $\hat{w} \in \mathbb{R}^3$ is the optimal weight vector minimizing squared prediction error $\|y - Xw\|_2^2$.

Part 2: Residual Orthogonality Condition. The residual is $r = y - X\hat{w}$. The key check is computing $\|X^\top r\|_2$, the Euclidean norm of $X^\top r \in \mathbb{R}^3$. Mathematically, least-squares solutions satisfy the normal equations $X^\top X \hat{w} = X^\top y$, which rearranges to $X^\top (y - X\hat{w}) = X^\top r = 0$. This means the residual is orthogonal to every column of $X$ (lies in the orthogonal complement of the column space). The printed value should be near machine precision zero ($\sim 10^{-13}$ or smaller), confirming this orthogonality. Geometrically, least-squares projects $y$ onto the column space $\text{col}(X)$; the projection $\hat{y} = X\hat{w}$ is the closest point in $\text{col}(X)$ to $y$, and the residual $r = y - \hat{y}$ is the perpendicular component.

Why This Condition Characterizes Optimality. The residual orthogonality $X^\top r = 0$ is not just a consequence—it’s an equivalent characterization of least-squares optimality. This is the projection theorem: the orthogonal projection of a point onto a subspace is unique and characterized by orthogonality. Any weight vector $w$ satisfies this condition if and only if $w = \hat{w}$ (the least-squares solution), making this condition both a check for solver correctness and a proof of optimality.

Numerical Precision and Diagnostics. Checking $\|X^\top r\|_2 \approx 0$ is a sanity check that verifies your solver found the true least-squares solution (not a local minimum or numerical artifact). If significantly larger (e.g., $> 10^{-8}$), something went wrong—ill-conditioning, singular $X$, or solver failure. Ridge regression and other regularizers break this orthogonality: $X^\top r_\lambda \neq 0$, trading fit for stability. Iterative refinement exploits this condition: refine $w$ iteratively to progressively satisfy $\|X^\top r\|_2 < \epsilon$.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Generate synthetic dataset: X, y, _ = toy_linreg(n=25, d=3, seed=0) creates $X \in \mathbb{R}^{25 \times 3}$ and $y \in \mathbb{R}^{25}$.
  2. Solve least-squares: w_hat = np.linalg.lstsq(X, y, rcond=None)[0] computes the optimal weight vector via SVD/QR (stable method).
  3. Compute residuals: r = y - X @ w_hat computes $r = y - X\hat{w} \in \mathbb{R}^{25}$.
  4. Compute residual-feature product: X.T @ r yields $X^\top r \in \mathbb{R}^3$ (should be near zero).
  5. Measure orthogonality: np.linalg.norm(X.T @ r) computes $\|X^\top r\|_2$ (Euclidean norm).
  6. Print result: ||X^T r||_2 should be $\approx 10^{-14}$ to $10^{-13}$ (machine precision for double).
  7. Verify condition: If much larger (e.g., $> 10^{-8}$), investigate ill-conditioning or solver issues.
  8. Interpret: Perfect orthogonality (within numerical error) confirms least-squares optimality.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Orthogonality characterizes optimality: Least-squares solutions satisfy $X^\top r = 0$ (residuals orthogonal to column space).
  • Projection geometry: The solution $\hat{w}$ yields fitted values $\hat{y} = X\hat{w}$, the orthogonal projection of $y$ onto $\text{col}(X)$.
  • Residual as complement: The residual $r = y - \hat{y}$ lies entirely in the orthogonal complement of $\text{col}(X)$.
  • Normal equations equivalence: $X^\top r = 0$ is equivalent to (and derived from) the normal equations $X^\top X \hat{w} = X^\top y$.
  • First-order optimality: Orthogonality is the gradient condition $\nabla_w \mathcal{L} = 0$ for the squared-error loss.
  • Numerical verification: Computing $\|X^\top r\|_2$ checks that a solver found the true least-squares solution (should be $\approx 0$).
  • Regularization breaks orthogonality: Ridge, Lasso, weight decay modify $X^\top r$ away from zero, trading fit for stability.
  • Scale matters: The magnitude of $\|X^\top r\|_2$ reveals solver quality and conditioning of $X$.
Notes

Part 1: Least-Squares Setup - Least-squares solves: $\hat{w} = \arg\min_w \|y - Xw\|_2^2$ (minimize squared prediction error). - Internally, lstsq uses SVD or QR decomposition for numerical stability (not normal equations $X^\top X \hat{w} = X^\top y$). - The result $\hat{w} \in \mathbb{R}^3$ is the optimal weight vector (if $X$ has full column rank). - Fitted values: $\hat{y} = X\hat{w} \in \mathbb{R}^{25}$ (predictions on training set).

Part 2: Residual Orthogonality Condition - Residual: $r = y - X\hat{w}$ measures prediction error (should be small if model fits well). - Orthogonality condition: $X^\top r = 0$ means $r$ is orthogonal to every column of $X$. - This is equivalent to: $r \in \text{col}(X)^\perp$ (residuals lie in the orthogonal complement of column space). - Characterization: Any $w$ satisfying $X^\top (y - Xw) = 0$ is the unique least-squares solution (if $X$ full rank). - Geometric interpretation: $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{col}(X)$; $r$ is the perpendicular component.

Why This Matters for ML - Model diagnostics: Checking $\|X^\top r\|_2 \approx 0$ verifies that the solver found the true least-squares solution. - Residual properties: If $X$ includes a constant column, residuals automatically have zero mean (centered). - Ridge regression trade-off: Adding regularization breaks orthogonality: $X^\top r_\lambda = -\lambda \hat{w}_\lambda \neq 0$. - Overfitting indicator: In overparameterized models, residuals shrink but orthogonality still holds for training data. - Iterative solvers: Convergence is measured by how well $\|X^\top r\|_2$ decreases toward zero.

ML Examples and Patterns - Linear regression diagnostics: Residual plots assume $X^\top r \approx 0$ holds; if not, solver failed. - Feature selection: The marginal value of adding a feature is $|x_\text{new}^\top r|$ (correlation with unexplained variance). - Gram-Schmidt: Orthogonalization produces residuals at each step, analogous to least-squares. - Iterative refinement: For ill-conditioned $X$, refine $w$ via $r^{(k)} = y - Xw^{(k)}$, $\delta w^{(k)} = \text{lstsq}(X, r^{(k)})$, $w^{(k+1)} = w^{(k)} + \delta w^{(k)}$. - Gradient descent: Each GD step reduces $\|r\|_2$ but doesn’t immediately satisfy orthogonality; convergence approaches $X^\top r = 0$.

Connection to Linear Algebra Theory - Normal equations: Orthogonality $X^\top r = 0$ rearranges to $X^\top X \hat{w} = X^\top y$ (the normal equations). - QR factorization: Writing $X = QR$, the solution is $\hat{w} = R^{-1} Q^\top y$; residuals $r = (I - QQ^\top) y$ lie in the orthogonal complement of $\text{col}(Q) = \text{col}(X)$ by construction. - SVD perspective: $X = U\Sigma V^\top$ gives $\hat{w} = V \Sigma^{-1} U^\top y$; residuals satisfy $X^\top r = 0$ due to orthogonality of $U, V$. - Cauchy-Schwarz: Orthogonality is ensured by the projection theorem—a consequence of geometry, not a numerical accident. - Ill-conditioning: Small singular values in $X$ amplify numerical errors; iterative refinement mitigates by refining $r$ in high precision.

Numerical and Implementation Notes - Transpose order: X.T @ r (shape (3, 25) × (25,) → (3,)), not X @ r.T (incompatible). - Norm type: np.linalg.norm(X.T @ r) computes Frobenius/Euclidean norm; specify ord=2 for clarity. - rcond parameter: lstsq(..., rcond=None) uses machine precision for rank determination; explicitly setting is good practice. - Full vs. reduced rank: If $X$ is rank-deficient, lstsq returns minimum-norm solution; orthogonality $X^\top r = 0$ still holds. - Tolerance for verification: Values $< 10^{-10}$ indicate good orthogonality; values $> 10^{-8}$ warrant investigation.

Numerical and Shape Notes - $X \in \mathbb{R}^{25 \times 3}$ (25 examples, 3 features). - $y \in \mathbb{R}^{25}$ (target vector). - $\hat{w} \in \mathbb{R}^3$ (weight vector). - $r \in \mathbb{R}^{25}$ (residual vector). - $X^\top r \in \mathbb{R}^3$ (should be near zero). - np.linalg.norm(X.T @ r) yields a scalar (Euclidean norm of $X^\top r$).

Pedagogical Significance - Core principle: Least-squares finds weights such that residuals are orthogonal to the input feature space. - Optimization meets geometry: First-order optimality ($\nabla_w \mathcal{L} = 0$) is equivalent to orthogonal projection. - Solver verification: This single check reveals whether your solver succeeded. - Regularization trade-off: Visualizing how regularization breaks orthogonality illuminates the bias-variance trade-off. - Foundation for ML: Understanding projection and orthogonality is essential for regression, neural networks, and principled optimization.

History and Applications

Least-squares regression originated with Gauss (1809) and Legendre (independently), developed to estimate planetary orbits and survey measurements with noisy observations. The orthogonality condition (residuals perpendicular to the model space) was implicit in their derivations but formalized later via projection theory and functional analysis. Gram-Schmidt orthogonalization (1883/1911) provided computational methods to construct orthonormal bases, while the QR decomposition (Householder 1958, Golub 1965) enabled stable least-squares solving. The condition number concept (von Neumann 1940s, refined by Wilkinson 1960s-1970s) revealed that ill-conditioned $X$ compromises numerical accuracy despite mathematically satisfied orthogonality. Modern solvers exploit SVD (Golub-Kahan algorithm) and iterative refinement to maintain orthogonality within floating-point precision. In ML, least-squares is the prototype convex objective: understanding its geometry (projection, orthogonality, residuals) clarifies all optimization locally. Ridge regression (Tikhonov regularization, 1960s) trades fit for stability by relaxing orthogonality ($X^\top r_\lambda \neq 0$, with bias proportional to $\lambda$). Lasso (Tibshirani 1996) and elastic net (Zou-Hastie 2005) extend this trade-off via sparsity constraints. Residual diagnostics in classical statistics (plots, autocorrelation tests) assume orthogonality holds. Modern applications: linear regression in econometrics, factor models in finance, neural networks with MSE loss all solve projection problems. Understanding orthogonality is essential for debugging fitting issues, choosing regularization strength, and reasoning about generalization.

Connection to Broader Examples
  • Chapter 2 (Span/Linear combinations): Residuals decompose $y$ into projection ($X\hat{w}$) and orthogonal complement ($r$).
  • Chapter 4 (Linear maps): $X\hat{w}$ applies the linear map $x \mapsto Xw$ to project onto column space.
  • Chapter 5 (Inner products/norms): Orthogonality is defined via inner product: $\langle x_i, r \rangle = 0$ for each column $x_i$.
  • Chapter 6 (Projections): This is orthogonal projection; $\hat{y}$ is the closest point in $\text{col}(X)$ to $y$.
  • Chapter 8 (Eigendecomposition): For square $X$, eigenvalues relate to projection properties.
  • Chapter 10 (SVD): SVD reveals rank, conditioning, and numerical accuracy of projection.
  • Chapter 12 (Least-squares): This example is least-squares; normal equations encode orthogonality.
  • Chapter 13 (Solving systems): Least-squares projects $y$ onto column space; related to solving $Ax \approx b$.
  • Chapter 14 (Conditioning): Ill-conditioned $X$ causes numerical errors; orthogonality check diagnoses this.
  • Chapter 16 (Matrix products): All operations ($X\hat{w}$, $X^\top r$) are matrix-vector products.

Comments