Least squares has roots in early astronomy (Gauss, Legendre ~1800s) and geodesy, where it was used to fit orbits and survey data. In ML, it is the prototype convex objective: every solution satisfies the normal equations $X^\top X w = X^\top y$, and the residual orthogonality $X^\top r = 0$ is the first-order optimality condition. Least squares also provides the local quadratic approximation in Newtonâs method and trust-region algorithms. Understanding orthogonality deeplyâwhy it arises, how to verify it numerically, and what it reveals about model fitâis foundational for regression, PCA, and iterative solvers.
- Log in to post comments
Build deep intuition for the fundamental orthogonality condition of least squares: at the optimal solution $\hat{w}$, the residual $r = y - X\hat{w}$ is orthogonal to the column space of $X$. This connects optimization (normal equations), geometry (projection onto subspaces), and numerics (solver stability). You should come away understanding why orthogonality is both a mathematical necessity and a practical diagnostic, and how to verify it numerically.
Compute least squares solution and verify the residual is orthogonal to Xâs columns (X^T r â 0).
Least-squares optimality implies the normal equations:
\[ X^\top X \hat{w} = X^\top y. \]
Rearranging gives the orthogonality condition:
\[ X^\top (y - X\hat{w}) = 0 \quad \Leftrightarrow \quad X^\top r = 0, \]
where $r = y - X\hat{w}$ is the residual. This states that $r$ is orthogonal to every column of $X$, equivalently $r \perp \text{Col}(X)$.
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
import numpy as np
from scripts.toy_data import toy_linreg
X, y, _ = toy_linreg(n=25, d=3, seed=0)
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X@w_hat
print("||X^T r||_2:", np.linalg.norm(X.T@r))This snippet demonstrates the fundamental orthogonality condition of least squares: at the optimal solution $\hat{w}$, the residual vector $r = y - X\hat{w}$ is orthogonal to the column space of $X$. This orthogonality is both a geometric interpretation (projection onto a subspace) and an algebraic necessity (normal equations).
The code generates synthetic regression data via toy_linreg(n=25, d=3, seed=0), producing a design matrix $X \in \mathbb{R}^{25 \times 3}$ and target vector $y \in \mathbb{R}^{25}$. The least-squares solution minimizes $\|y - Xw\|_2^2$ and is computed via np.linalg.lstsq(X, y, rcond=None), which uses SVD-based decomposition for numerical stability. The function returns a tuple; [0] extracts $\hat{w} \in \mathbb{R}^3$. The residual $r = y - X\hat{w}$ measures the unexplained variance.
The key verification is computing $\|X^\top r\|_2$, which should be near machine precision (typically $\sim 10^{-14}$ for double precision). This confirms the orthogonality condition $X^\top r = 0$, derived from the normal equations $X^\top X \hat{w} = X^\top y$. Geometrically, $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{Col}(X)$, and $r$ is the projection onto $\text{Col}(X)^\perp$. The decomposition $y = X\hat{w} + r$ with $\langle X\hat{w}, r \rangle = 0$ yields the Pythagorean theorem: $\|y\|_2^2 = \|X\hat{w}\|_2^2 + \|r\|_2^2$.
Shape discipline: $X \in \mathbb{R}^{n\times d}$, $y \in \mathbb{R}^n$, $\hat{w} \in \mathbb{R}^d$, $r \in \mathbb{R}^n$, $X^\top r \in \mathbb{R}^d$. Gotchas: lstsq returns a tupleâuse [0] to extract $\hat{w}$. For ill-conditioned $X$, SVD-based lstsq is more stable than explicit normal equations $(X^\top X)^{-1} X^\top y$, which square the condition number.
Numerical and Shape Notes
- Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
- Axis discipline: Be explicit with
axisin reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1. - Broadcasting: Check that broadcasts are intended (e.g.,
(n,1)with(n,d)). Prefer reshape/expand-dims to make semantics clear. - Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
- Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
- Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
- Sanity checks: Print shapes and residuals (e.g.,
||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.
Numerical and Implementation Notes
- Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use
solve,lstsq, Cholesky/QR/SVD. - Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (
axis) and ensure broadcasts are intended. - Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
- Conditioning: Inspect
np.linalg.cond(A)when solutions look unstable; regularize (ridge) or rescale features to improve conditioning. - Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g.,
||Ax-b||, reconstruction errors) and assert finiteness. - Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
- Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
- Sanity checks: Compare against references (e.g.,
lstsqvs.solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
- Generate synthetic regression data via
toy_linreg(n=25, d=3, seed=0): $X \in \mathbb{R}^{25\times 3}$, $y \in \mathbb{R}^{25}$. - Solve least squares using
np.linalg.lstsq(X, y, rcond=None); extract $\hat{w} \in \mathbb{R}^3$ from returned tuple (index[0]). - Compute residual $r = y - X @ \hat{w}$; shape $(25,)$.
- Verify orthogonality by computing $X^\top r$ (shape $(3,)$) and checking $\|X^\top r\|_2 \approx 0$.
- Expected output: $\|X^\top r\|_2 \sim 10^{-14}$ or smaller (machine precision for double-precision floats).
- Optional checks: verify normal equations directly ($\|X^\top X \hat{w} - X^\top y\|_2 \approx 0$), Pythagorean theorem ($\|y\|_2^2 = \|X\hat{w}\|_2^2 + \|r\|_2^2$), and $R^2$ score.
Pedagogical Significance
- Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
- ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
- Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
- Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
- Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
- Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.
ML Examples and Patterns
- Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
- Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
- Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
- Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
- Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
- Least-squares solution $\hat{w}$ minimizes $\|y - Xw\|_2^2$ and satisfies the normal equations $X^\top X \hat{w} = X^\top y$.
- Residual orthogonality: $X^\top r = 0$ is the first-order optimality condition; $r$ is orthogonal to $\text{Col}(X)$.
- Numerical verification: $\|X^\top r\|_2 \approx 0$ (machine precision, typically $\sim 10^{-14}$) confirms correctness.
- Geometric interpretation: $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{Col}(X)$; $r$ projects onto $\text{Col}(X)^\perp$.
- Solver stability: SVD-based
lstsqavoids squaring the condition number, unlike explicit normal equations. - Residual diagnostics: patterns in $r$ reveal model quality; orthogonality ensures features donât correlate with unexplained variance.
- Part 1: Solving the Least-Squares Problem.
np.linalg.lstsq(X, y, rcond=None)uses SVD-based decomposition to minimize $\|y - Xw\|_2^2$, returning $\hat{w} \in \mathbb{R}^d$ that satisfies the normal equations. The residual $r = y - X\hat{w}$ measures unexplained variance. - Part 2: Verifying Orthogonality via Normal Equations. At optimality, $X^\top (y - X\hat{w}) = 0 \Leftrightarrow X^\top r = 0$. Computing $\|X^\top r\|_2$ should yield $\sim 10^{-14}$ or smaller (machine precision), confirming that $r \perp \text{Col}(X)$.
- Part 3: Geometric interpretation and diagnostics. $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{Col}(X)$; $r$ projects onto $\text{Col}(X)^\perp$. Decomposition $y = X\hat{w} + r$ with $\langle X\hat{w}, r \rangle = 0$ yields Pythagorean theorem: $\|y\|_2^2 = \|X\hat{w}\|_2^2 + \|r\|_2^2$. Patterns in $r$ diagnose model quality.
- Why This Matters for ML. Orthogonality is the stopping criterion for gradient-based solvers, the geometric meaning of projection, and a diagnostic for model fit. Residual analysis reveals misspecification, heteroscedasticity, and overfitting. Ridge regression relaxes orthogonality intentionally to reduce variance.
- ML Examples and Patterns. Fit: regression solves least squares; Project: $\hat{y}$ is projection onto feature span; Decompose: QR/SVD factorizations preserve orthogonality; Solve: iterative methods (LSQR, CGLS) converge when $X^\top r \approx 0$.
- Connection to Linear Algebra Theory. Normal equations arise from setting gradient to zero: $\nabla_w \|y - Xw\|_2^2 = -2 X^\top (y - Xw) = 0$. Projection matrix $P_X = X(X^\top X)^{-1} X^\top$ satisfies $P_X^2 = P_X$ and $P_X = P_X^\top$ (symmetric idempotent). SVD gives $\hat{w} = V \Sigma^{-1} U^\top y$; orthogonality follows from $U^\top (I - UU^\top) = 0$.
- Numerical and Implementation Notes. Use
lstsqover explicit normal equations $(X^\top X)^{-1} X^\top y$ for ill-conditioned $X$; SVD avoids squaring the condition number. Check $\|X^\top r\|_2$ with relative tolerance:tol = 1e-10 * np.linalg.norm(X.T @ y). If $X$ is rank-deficient,lstsqreturns minimum-norm solution via pseudoinverse. - Numerical and Shape Notes. Annotate $X \in \mathbb{R}^{n\times d}$, $y \in \mathbb{R}^n$, $\hat{w} \in \mathbb{R}^d$, $r \in \mathbb{R}^n$, $X^\top r \in \mathbb{R}^d$. Verify
lstsqreturns tuple; extract $\hat{w}$ with[0]. For centered data, residuals have zero mean. - Pedagogical Significance. This is the canonical demonstration that least squares is orthogonal projection. The connection between optimization (minimizing squared error) and geometry (orthogonal projection) is profound and repeats throughout ML: PCA, ridge regression, kernel methods. Understanding orthogonality deeplyâwhy it arises, how to verify it, what it revealsâis foundational for mastering linear models.
Least squares traces to Gauss (1809) and Legendre (1805), who independently developed the method for fitting planetary orbits and survey triangulations. The orthogonality condition $X^\top r = 0$ and the normal equations were recognized early as the first-order optimality condition. In the 20th century, numerical linear algebra (Golub, Van Loan, Björck) developed stable algorithms (QR, SVD, Cholesky) that preserve orthogonality to machine precision even for ill-conditioned problems. Modern ML inherited least squares as the prototype convex objective, the local approximation in Newtonâs method, and the foundation for ridge regression, kernel methods, and iterative solvers (LSQR, CGLS). Understanding residual orthogonality remains central to regression diagnostics, projection geometry, and optimization convergence criteria.
- Links to Chapter 6 (orthogonality/projections): least squares is orthogonal projection onto $\text{Col}(X)$; $r$ is projection onto the orthogonal complement.
- Connects to Chapter 12 (least squares): this example is foundationalâorthogonality defines the normal equations and projection geometry.
- Relates to Chapter 11 (PCA): PCA also projects onto subspaces; residuals are projections onto complement subspaces.
- Bridges to Chapter 13 (solving systems): SVD, QR, and Cholesky all solve least squares while preserving orthogonality numerically.
- Complements Chapter 14 (conditioning): ill-conditioned $X$ amplifies errors in $\hat{w}$, but orthogonality $X^\top r = 0$ still holds to machine precision.
- Reinforces shape discipline: $X \in \mathbb{R}^{n\times d}$, $y \in \mathbb{R}^n$, $\hat{w} \in \mathbb{R}^d$, $r \in \mathbb{R}^n$, $X^\top r \in \mathbb{R}^d$.
Comments