Example number
86
Example slug
example_86_orthogonality_in_regression_residual_column_space
Background

Least squares was independently developed by Legendre (1805) and Gauss (1809) for astronomy and geodesy (fitting planetary orbits, surveying Earth’s shape). Gauss recognized the orthogonality condition as the optimality criterion, though the geometric interpretation (projection onto subspaces) came later with the development of linear algebra (Grassmann, 1840s). In modern ML, least squares is the prototype convex problem: closed-form solution, unique global minimum (when $X$ has full column rank), and numerically stable algorithms (QR, SVD, Cholesky). Beyond linear regression, least squares appears as the local quadratic approximation in Newton’s method, Gauss-Newton optimization, trust-region methods, and Levenberg-Marquardt. The orthogonality condition $X^\top r = 0$ generalizes to gradient vanishes $\nabla f(x^*) = 0$ for unconstrained optimization, and $\nabla f(x^*) \perp \text{feasible directions}$ for constrained optimization (KKT conditions). Every iterative optimizer in deep learning (SGD, Adam, LBFGS) seeks points where gradients vanish, making this the most fundamental pattern in computational optimization.

Purpose

Connect three fundamental perspectives on least squares in a single, verifiable relationship: optimization (minimizing squared error), geometry (projection onto column space), and numerics (stable solver implementation). Show that orthogonality $X^\top r \approx 0$ is not an approximation but an exact mathematical property of the optimal solution, providing a universal diagnostic for solver correctness, feature redundancy, and model adequacy. Teach you to use this condition as a practical sanity check in regression, neural network initialization, iterative optimization convergence, and anywhere gradient-based methods seek stationary points. Emphasize that this pattern—optimality ⟺ gradient vanishes ⟺ residual orthogonal to feasible directions—is the foundation of all constrained and unconstrained optimization in ML.

Problem

Compute least squares solution and verify the residual is orthogonal to X’s columns (X^T r ≈ 0).

Solution (Math)

For least squares regression $\min_w \|y - Xw\|_2^2$ with $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, the optimal solution $\hat{w} \in \mathbb{R}^d$ satisfies the normal equations: \[ X^\top X \hat{w} = X^\top y. \]

Rearranging gives the orthogonality condition: \[ X^\top (y - X\hat{w}) = X^\top r = 0, \] where $r = y - X\hat{w} \in \mathbb{R}^n$ is the residual vector.

Geometric interpretation: The residual $r$ lies in the orthogonal complement of the column space $\text{col}(X)$: \[ r \perp \text{col}(X) \iff X^\top r = 0. \]

Verification: Compute $\|X^\top r\|_2$. For a correct solution, this should be $\approx 0$ (within machine precision $\sim 10^{-14}$ for double precision).

We use: - $X \in \mathbb{R}^{n \times d}$ (rows = examples, $n$ = sample size) - Column vectors by default - Inner product $\langle x, y \rangle = x^\top y$; norm $\|x\|_2 = \sqrt{x^\top x}$ - Transpose $A^\top$ (not $A^T$)

Solution (Python)

import numpy as np
from scripts.toy_data import toy_linreg

X, y, _ = toy_linreg(n=25, d=3, seed=0)
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X@w_hat
print("||X^T r||_2:", np.linalg.norm(X.T@r))
Code Introduction

The code demonstrates the fundamental orthogonality relationship in least squares regression: $X^\top r = 0$, where $r = y - X\hat{w}$ is the residual vector. This condition is both a theoretical optimality criterion (gradient vanishes at minimum) and a practical diagnostic tool (verifies solver correctness).

Setup and least squares solve: - Generate synthetic linear regression data: $X \in \mathbb{R}^{25 \times 3}$, $y \in \mathbb{R}^{25}$ via toy_linreg(n=25, d=3). - Solve $\min_w \|y - Xw\|_2^2$ using np.linalg.lstsq(X, y, rcond=None). This uses SVD decomposition internally: $X = U \Sigma V^\top \implies \hat{w} = V \Sigma^{-1} U^\top y$. - Extract solution: w_hat = lstsq(...)[0] (index 0 gets coefficients; other outputs are residual sum of squares, rank, singular values).

Residual computation and orthogonality check: - Compute residual: r = y - X @ w_hat ($\in \mathbb{R}^{25}$). This is the component of $y$ orthogonal to $\text{col}(X)$. - Check orthogonality: X.T @ r ($\in \mathbb{R}^3$) should be zero vector (one entry per feature column). Transpose X.T is $3 \times 25$; multiplying by $r \in \mathbb{R}^{25}$ gives $3$-vector. - Compute norm: np.linalg.norm(X.T @ r) measures $\|X^\top r\|_2$. For correct solution, expect $\sim 10^{-14}$ (machine precision for double-precision floats).

Interpretation: - Geometric: $\hat{y} = X\hat{w}$ lies in $\text{col}(X)$ (span of feature columns); $r = y - \hat{y}$ is perpendicular to this subspace. Projection formula: $\hat{y} = X(X^\top X)^{-1} X^\top y$. - Optimization: Normal equations $X^\top X \hat{w} = X^\top y \iff X^\top (y - X\hat{w}) = 0$ say gradient vanishes. Gradient $\nabla_w L = -X^\top r$; orthogonality $\iff \nabla L = 0 \iff$ optimum. - Feature redundancy: If new feature $x_{\text{new}}$ satisfies $x_{\text{new}}^\top r \approx 0$, it’s already captured by existing features (adds no new information). - Regularization contrast: Ridge regression $(X^\top X + \lambda I) \hat{w} = X^\top y$ violates orthogonality: $X^\top r = -\lambda \hat{w} \neq 0$. Checking $\|X^\top r\|_2 > 0$ confirms regularization is active.

Why this matters for ML: - Solver verification: After any regression (gradient descent, coordinate descent, closed-form), compute $\|X^\top r\|_2$. If $> 10^{-6}$, solver hasn’t converged or encountered numerical issues (ill-conditioning, bugs). - Convergence criterion: Iterative optimizers (conjugate gradient, LBFGS, SGD) terminate when $\|\nabla L\|_2 < \epsilon$. For least squares, $\nabla L = -X^\top r$, so orthogonality is the stopping condition. - Bias term necessity: If data not centered, $\mathbf{1}^\top r \neq 0$ (residual has nonzero mean). Including intercept (first column of $X$ is all-ones) ensures $\mathbf{1}^\top r = 0$ (residuals sum to zero). - PCA centering: PCA subtracts mean $\mu$ from data. Residual from mean satisfies $\mathbf{1}^\top (X - \mu \mathbf{1}^\top) = 0$, making first PC orthogonal to constant vector. - Backprop analogy: Gradient $\frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Y}$ has same structure. Training ends when $\|X^\top \text{grad}\| \approx 0$.

Numerical notes: - lstsq uses SVD (not QR or Cholesky) for robustness to rank deficiency. If $\text{rank}(X) < d$, returns minimum-norm solution via pseudoinverse. Orthogonality still holds. - Condition number matters: If $\kappa(X) = \sigma_{\max}(X) / \sigma_{\min}(X)$ is large ($> 10^{10}$), orthogonality may degrade to $\sim 10^{-6}$. Use regularization (ridge) or feature normalization. - Shape discipline: X.T @ r is $d \times n$ times $n \times 1 \to d \times 1$. Common error: forgetting transpose, writing X @ r (fails because $n \times d$ times $n$ is invalid).

Expected output: ||X^T r||_2: 1.2e-14 (or similar $\sim 10^{-14}$). This confirms numerical solver achieved theoretical optimum within machine precision.

Numerical Implementation Details
  • Least squares solve: np.linalg.lstsq(X, y, rcond=None) uses SVD decomposition $X = U \Sigma V^\top$, computing $\hat{w} = V \Sigma^{-1} U^\top y$. Complexity: $O(nd^2)$ for $n > d$; $O(nd^2)$ for $d > n$.
  • Residual: r = y - X @ w_hat is $O(nd)$ matrix-vector multiply plus $O(n)$ subtraction.
  • Orthogonality check: X.T @ r is $O(nd)$ (transpose-matrix-vector), then np.linalg.norm is $O(d)$. Total: $O(nd)$.
  • Expected result: $\|X^\top r\|_2 \sim 10^{-14}$ for double precision (machine epsilon $\epsilon_{\text{machine}} \approx 2.22 \times 10^{-16}$).
  • Ill-conditioning amplifies error: If $\kappa(X) = \sigma_{\max}(X) / \sigma_{\min}(X)$ is large ($> 10^{10}$), orthogonality may degrade to $\sim 10^{-6}$. Use regularization (ridge) or preconditioning.
  • Rank-deficient $X$: If $\text{rank}(X) < d$, lstsq returns minimum-norm solution via pseudoinverse. Orthogonality $X^\top r = 0$ still holds.
  • Alternative: QR decomposition: For $X = QR$ (thin QR), $\hat{w} = R^{-1} Q^\top y$. Cheaper than SVD ($O(nd^2)$ vs. $O(nd^2)$ with better constants), but less robust to rank deficiency.
What This Example Demonstrates
  • Orthogonality as optimality: The condition $X^\top r = 0$ is exact (not approximate) for the optimal least squares solution.
  • Projection onto column space: $\hat{y} = X\hat{w}$ lies in $\text{col}(X)$; $r = y - \hat{y}$ is orthogonal to $\text{col}(X)$.
  • Solver verification: Checking $\|X^\top r\|_2 < 10^{-10}$ diagnoses whether lstsq, gradient descent, or any solver found the correct solution.
  • Feature redundancy test: If new feature $x_{\text{new}}$ satisfies $x_{\text{new}}^\top r \approx 0$, it’s already captured by existing features (redundant).
  • Normal equations without forming $X^\top X$: lstsq uses SVD internally (avoids squaring condition number); explicit $(X^\top X)^{-1}$ is numerically dangerous.
  • Gradient vanishing pattern: $X^\top r = -\nabla_w \|y - Xw\|_2^2 / 2$; orthogonality ⟺ gradient zero ⟺ optimality.
Notes

Part 1: The Orthogonality Principle

Claim: For least squares $\min_w \|y - Xw\|_2^2$, the optimal solution $\hat{w}$ satisfies $X^\top r = 0$ where $r = y - X\hat{w}$.

Derivation: Expand the objective: \[ f(w) = \|y - Xw\|_2^2 = (y - Xw)^\top (y - Xw) = y^\top y - 2 w^\top X^\top y + w^\top X^\top X w. \] Differentiate with respect to $w$: \[ \nabla_w f(w) = -2 X^\top y + 2 X^\top X w. \] Set gradient to zero: \[ X^\top X \hat{w} = X^\top y \iff X^\top (y - X\hat{w}) = 0 \iff X^\top r = 0. \]

Matrix interpretation: $X^\top r = 0$ means $r$ is orthogonal to every column of $X$. Since $\text{col}(X) = \{\sum_j w_j X_j : w \in \mathbb{R}^d\}$, we have $r \perp \text{col}(X)$.

Part 2: Geometric Interpretation—Projection Matrix

The least squares solution projects $y$ onto $\text{col}(X)$: \[ \hat{y} = X\hat{w} = X(X^\top X)^{-1} X^\top y = P y, \] where $P = X(X^\top X)^{-1} X^\top$ is the projection matrix. Properties: - Symmetric: $P^\top = P$ - Idempotent: $P^2 = P$ (projecting twice = projecting once) - Range: $\text{range}(P) = \text{col}(X)$ - Null space: $\text{null}(P) = \text{col}(X)^\perp$

Residual is the complement: \[ r = y - \hat{y} = (I - P) y = Q y, \] where $Q = I - P$ projects onto $\text{col}(X)^\perp$. Orthogonality: $P^\top Q = 0 \implies \hat{y}^\top r = 0$.

Pythagorean theorem: \[ \|y\|_2^2 = \|\hat{y}\|_2^2 + \|r\|_2^2. \]

Part 3: Normal Equations and Solution Methods

From $X^\top r = 0$: \[ X^\top X \hat{w} = X^\top y. \]

Solution methods: 1. Direct inverse (unstable): $\hat{w} = (X^\top X)^{-1} X^\top y$. Avoid: condition number squares $\kappa(X^\top X) = \kappa(X)^2$. 2. QR decomposition (stable): $X = QR \implies \hat{w} = R^{-1} Q^\top y$. Complexity $O(nd^2)$; $\kappa(X) = \kappa(R)$ (preserves conditioning). 3. SVD (most robust): $X = U \Sigma V^\top \implies \hat{w} = V \Sigma^{-1} U^\top y$. Handles rank-deficiency; same complexity as QR; np.linalg.lstsq uses this. 4. Cholesky (for $X^\top X$): If $X$ has full rank, $X^\top X$ is positive definite: $L L^\top = X^\top X$, solve $L z = X^\top y$ (forward sub), then $L^\top \hat{w} = z$ (back sub). Complexity $O(d^3 + nd^2)$; fast but inherits squared conditioning.

Why This Matters for ML

1. Diagnostic for solver correctness: After any regression solver (gradient descent, coordinate descent, LBFGS), compute $\|X^\top r\|_2$. If $> 10^{-6}$, solver hasn’t converged or encountered numerical issues.

2. Feature selection and redundancy: If new feature $x_{\text{new}}$ has $|x_{\text{new}}^\top r| < 0.01 \|r\|_2$, it explains < 1% of unexplained variance. Greedy forward selection picks features with largest $|x^\top r|$.

3. Bias term necessity: If data not centered, $\mathbf{1}^\top r \neq 0$ (residual has non-zero mean). Including intercept term ensures $\mathbf{1} \in \text{col}(X)$, making $\mathbf{1}^\top r = 0$ (residuals sum to zero).

4. Gradient descent convergence: Gradient $\nabla_w L = -X^\top r$ (for squared loss). Orthogonality $\iff$ gradient vanishes $\iff$ optimization converged.

5. Iterative refinement: In preconditioned conjugate gradient, residual $r_k = y - Xw_k$ guides search direction. Stop when $\|X^\top r_k\|_2 < \epsilon$.

6. Regularization breaks orthogonality: Ridge regression solves $(X^\top X + \lambda I) \hat{w} = X^\top y \implies X^\top r = -\lambda \hat{w}$. Checking $X^\top r \neq 0$ confirms regularization is active.

ML Examples and Patterns

Example 1: Linear regression with intercept \[ X = [\mathbf{1}, X_{\text{features}}] \in \mathbb{R}^{n \times (d+1)}. \] After fitting, $\sum_{i=1}^n r_i = \mathbf{1}^\top r = 0$ (residuals sum to zero). If this fails, intercept wasn’t fitted or data preprocessing error occurred.

Example 2: Polynomial regression residuals Fit $y = w_0 + w_1 x + w_2 x^2 + \cdots + w_p x^p$. Check $X^\top r = 0$ where $X = [1, x, x^2, \ldots, x^p]$. If $|x^{p+1}{}^\top r|$ is large, higher-degree term needed (model inadequacy).

Example 3: Ridge vs. OLS - OLS: $\|X^\top r_{\text{OLS}}\|_2 \approx 0$. - Ridge: $\|X^\top r_{\text{ridge}}\|_2 = \lambda \|\hat{w}_{\text{ridge}}\|_2$. Larger $\lambda \implies$ larger violation of orthogonality (more bias, less variance).

Example 4: Gradient descent sanity check Train linear model with GD. At iteration $t$: \[ \text{gradient} = -\frac{1}{n} X^\top (y - X w_t). \] Stop when $\|X^\top (y - X w_t)\|_2 < 10^{-6}$. If GD stalls at higher values, learning rate too small or ill-conditioning ($\kappa(X)$ large).

Example 5: PCA centering PCA on centered data $X_c = X - \mu \mathbf{1}^\top$ ensures first PC captures maximum variance orthogonal to mean direction. Residual from mean has $\mathbf{1}^\top r = 0$ automatically.

Connection to Linear Algebra Theory

1. Fundamental theorem of linear algebra (four subspaces): - $\text{col}(X)$ and $\text{null}(X^\top)$ are orthogonal complements in $\mathbb{R}^n$. - $\hat{y} \in \text{col}(X)$; $r \in \text{null}(X^\top)$ since $X^\top r = 0$.

2. Projection formula: \[ P = X(X^\top X)^{-1} X^\top \text{ (orthogonal projection onto } \text{col}(X)\text{)}. \] Minimizes $\|y - z\|_2$ over all $z \in \text{col}(X)$.

3. Pythagorean theorem: \[ \|y\|_2^2 = \|\hat{y}\|_2^2 + \|r\|_2^2 \text{ when } \hat{y} \perp r. \] Generalizes to $R^2 = 1 - \|r\|_2^2 / \|y - \bar{y} \mathbf{1}\|_2^2$ (coefficient of determination).

4. QR and SVD interpretations: - QR: $X = QR \implies P = QQ^\top$. Orthonormal basis $Q$ for $\text{col}(X)$ makes projection transparent. - SVD: $X = U \Sigma V^\top \implies P = U U^\top$ (using thin SVD). $U$ columns are orthonormal basis for $\text{col}(X)$.

5. Condition number: \[ \kappa(X) = \frac{\sigma_{\max}(X)}{\sigma_{\min}(X)} = \sqrt{\kappa(X^\top X)}. \] Large $\kappa \implies$ small $\sigma_{\min} \implies$ nearly rank-deficient $\implies$ orthogonality harder to achieve numerically.

Numerical and Implementation Notes

Finite-difference gradient check: Numerically verify $X^\top r = -\nabla_w L$: \[ \frac{\partial L}{\partial w_j} \approx \frac{L(w + \epsilon e_j) - L(w - \epsilon e_j)}{2\epsilon}. \] Should match $-[X^\top r]_j$ within $10^{-8}$ for $\epsilon = 10^{-5}$.

Avoid forming $X^\top X$ explicitly: If $n \gg d$ (tall-skinny matrix), computing $X^\top X$ is $O(nd^2)$. But solving normal equations via Cholesky is $O(nd^2 + d^3)$. For $d > 1000$, the $d^3$ term dominates. Use QR or iterative methods instead.

Mixed precision: In deep learning, forward pass in float16 (fast), backward pass (gradients) in float32 (accurate). Orthogonality check should use float32 to detect convergence reliably.

Checkpointing for large $n$: If $X$ doesn’t fit in memory, compute $X^\top X$ and $X^\top y$ in mini-batches: \[ X^\top X = \sum_{k} X_k^\top X_k, \quad X^\top y = \sum_{k} X_k^\top y_k. \] Then solve normal equations. Orthogonality check requires full $X$, so subsample or approximate.

Sparse $X$: If $X$ is sparse, X.T @ r is $O(\text{nnz}(X))$ (number of nonzeros). Use scipy.sparse.linalg.lsqr (iterative); orthogonality emerges as convergence criterion.

Numerical and Shape Notes

Shape discipline: - $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, $\hat{w} \in \mathbb{R}^d$. - $X \hat{w} \in \mathbb{R}^n$ (each row of $X$ dot-producted with $\hat{w}$). - $r = y - X\hat{w} \in \mathbb{R}^n$ (residual per example). - $X^\top r \in \mathbb{R}^d$ (residual projected onto each feature column).

Transpose reasoning: - Forward: $X w$ is $n \times d$ times $d \times 1 \to n \times 1$. - Backward (gradient): $\frac{\partial L}{\partial w} = X^\top \frac{\partial L}{\partial \hat{y}}$ is $d \times n$ times $n \times 1 \to d \times 1$.

Broadcasting pitfalls: If $y$ is shape (n,) and $X @ w_hatis shape(n,), subtraction works. But if $y$ is accidentally(n, 1)(column vector),X.T @ rmay broadcast incorrectly. Always usey.flatten()` or verify shapes.

Common error: Forgetting transpose. Writing X @ r (wrong, shape $n \times d$ times $n \to$ error) instead of X.T @ r (correct, $d \times n$ times $n \to d$).

Pedagogical Significance

This example distills the central insight of optimization: optimality $\iff$ gradient vanishes $\iff$ residual orthogonal to feasible directions. It appears everywhere: - Unconstrained: $\nabla f(x^*) = 0$ (stationary point). - Constrained (equality): $\nabla f(x^*) \perp \text{tangent space of constraint manifold}$ (Lagrange multipliers). - Projection: $\min_z \|y - z\|$ subject to $z \in S \implies y - z \perp S$ (perpendicular from point to subspace).

Least squares = simplest nontrivial optimization problem: Quadratic objective, linear constraints (unconstrained if $X$ full rank), closed-form solution, exact condition for optimality. Mastering this prepares students for: - Neural network training: Replace $X$ with $\frac{\partial \text{loss}}{\partial \text{params}}$ (Jacobian); same gradient-vanishing logic. - Kernel methods: Replace $X$ with $K$ (kernel matrix); orthogonality in RKHS. - Iterative optimization: CG, LBFGS, gradient descent all terminate when $\|\nabla f\|_2 < \epsilon$.

Verification as habit: Teaching students to check $\|X^\top r\|_2$ after every regression trains numerical skepticism—a critical skill when debugging ML pipelines, where silent failures (non-converged optimizers, rank-deficient features, ill-conditioning) produce plausible but wrong results.

History and Applications

History: Legendre (1805) and Gauss (1809) introduced least squares to fit astronomical observations. The geometric view—projection onto a subspace and residual orthogonality—crystallized with the maturation of linear algebra (Grassmann, late 19th century), and numerical algorithms (Householder QR, SVD) made large-scale least squares routine in the 20th century.

Applications: Orthogonality $X^\top r = 0$ is the universal optimality certificate for least squares: it verifies solver correctness, detects missing intercepts (nonzero mean residual), and guides feature engineering (redundant features have small $|x^\top r|$). In ML systems, this principle appears in backprop (transpose multiplies aggregate sensitivities), iterative solvers (stopping when gradients vanish), and regularization (ridge breaks orthogonality in a controlled way).

Connection to Broader Examples
  • PCA and SVD (Ex 11, 10): PCA centers data ($\mu = \frac{1}{n} X^\top \mathbf{1}$), ensuring residual from mean has $\mathbf{1}^\top r = 0$ (orthogonal to constant vector).
  • Backprop (Ex 84): Gradient $\frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Y}$ has the same structure as orthogonality check. Training ends when $\|X^\top \text{grad}\| \approx 0$.
  • Attention (Ex 80): Attention outputs lie in the span of value vectors; the “query not attended” component would be orthogonal (though attention uses softmax normalization, not projection).
  • Least squares as projection (Ex 6): Projection matrix $P = X(X^\top X)^{-1} X^\top$ satisfies $P = P^\top$ (symmetric) and $P^2 = P$ (idempotent). $\hat{y} = Py$; $r = (I - P)y$.
  • Regularization (Ridge, Lasso): Ridge modifies normal equations: $(X^\top X + \lambda I) \hat{w}_{\text{ridge}} = X^\top y$. Residual $r_{\text{ridge}}$ is not orthogonal: $X^\top r_{\text{ridge}} = -\lambda \hat{w}_{\text{ridge}} \neq 0$ (trades bias for variance).
  • Iterative solvers: Conjugate gradient, LBFGS converge when $\|X^\top r\|_2 < \epsilon$. Orthogonality tracks how close iterative solution is to analytical optimum.
  • Feature engineering: Checking $x_{\text{new}}^\top r$ quantifies how much unexplained variance the new feature captures. Greedy forward selection iteratively adds features with largest $|x^\top r|$.

Comments