Least squares minimizes $\|y - Xw\|_2^2$ over $w$. The optimality condition from calculus is $\nabla_w \|y - Xw\|_2^2 = -2 X^\top (y - Xw) = 0$, which is the normal equations $X^\top X w = X^\top y$. Geometrically, this says the residual $r = y - Xw$ is orthogonal to every column of $X$âthe prediction $\hat{y} = Xw$ is the orthogonal projection of $y$ onto $\text{span}(X)$. Numerically, solvers use QR or SVD to avoid forming $X^\top X$ (which squares the condition number). The orthogonality check $\|X^\top r\|_2 \approx 0$ is a convergence diagnostic: small values confirm the solver reached the optimum to machine precision, while large values indicate ill-conditioning or rank deficiency.
- Log in to post comments
Build intuition that least squares is a projection problem, not just an optimization problem. The optimal solution places predictions $\hat{y} = X\hat{w}$ in the column space of $X$, and the residual $r = y - \hat{y}$ must be orthogonal to that subspace. This orthogonality condition $X^\top r = 0$ is the geometric heart of regression, ridge regression, generalized least squares, and neural network weight updates. Understanding it clarifies convergence diagnostics, conditioning effects, and the bias-variance tradeoff.
Fit least squares and verify X^T(y-XwÌ)â0.
Optimality gives $X^T$Xw-y$=0$, so residual $r=y-X\hat w$ is orthogonal to each column of $X$.
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
import numpy as np
from scripts.toy_data import toy_linreg
X, y, _ = toy_linreg(n=30, d=3, seed=0)
w = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X@w
print("||X^T r||_2:", np.linalg.norm(X.T@r))This code verifies the fundamental orthogonality condition of least squares: at the optimal solution, the residual vector must be orthogonal to every column of the design matrix. It demonstrates that numerical linear algebra solvers achieve the geometric property underlying all projection-based regression methods.
Numerical and Shape Notes
- Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
- Axis discipline: Be explicit with
axisin reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1. - Broadcasting: Check that broadcasts are intended (e.g.,
(n,1)with(n,d)). Prefer reshape/expand-dims to make semantics clear. - Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
- Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
- Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
- Sanity checks: Print shapes and residuals (e.g.,
||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.
Numerical and Implementation Notes
- Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use
solve,lstsq, Cholesky/QR/SVD. - Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (
axis) and ensure broadcasts are intended. - Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
- Conditioning: Inspect
np.linalg.cond(A)when solutions look unstable; regularize (ridge) or rescale features to improve conditioning. - Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g.,
||Ax-b||, reconstruction errors) and assert finiteness. - Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
- Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
- Sanity checks: Compare against references (e.g.,
lstsqvs.solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
- Load data:
X, y, _ = toy_linreg(n=30, d=3, seed=0)generates $X \in \mathbb{R}^{30\times 3}$ and $y \in \mathbb{R}^{30}$. - Solve least squares:
w = np.linalg.lstsq(X, y, rcond=None)[0]computes $\hat{w}$ minimizing $\|y - Xw\|_2^2$ using SVD. - Compute residual:
r = y - X@wgives the vector of prediction errors. - Check orthogonality:
X.T@rshould be near-zero vector;np.linalg.norm(X.T@r)collapses to scalar diagnostic. - Expected magnitude: for well-conditioned $X$ with $\kappa(X) \lesssim 10^3$, expect $\|X^\top r\|_2 / \|r\|_2 < 10^{-12}$.
- If large: compute condition number
np.linalg.cond(X)and check for rank deficiency vianp.linalg.matrix_rank(X). - Sanity checks: print residual norm, orthogonality error, and condition number as convergence triplet.
Pedagogical Significance
- Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
- ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
- Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
- Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
- Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
- Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.
ML Examples and Patterns
- Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
- Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
- Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
- Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
- Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
- Least squares as projection: $\hat{y} = X\hat{w}$ is the closest point in $\text{span}(X)$ to $y$ under Euclidean distance.
- Orthogonality condition: the residual $r = y - \hat{y}$ satisfies $X^\top r = 0$, meaning $r$ is orthogonal to every column of $X$.
- Numerical verification: $\|X^\top r\|_2$ should be near machine precision ($\sim 10^{-13}$ in double precision) for well-conditioned problems.
- Conditioning effects: large $\kappa(X)$ amplifies rounding errors, increasing $\|X^\top r\|_2$ even when the solver converges.
- Solver choice matters:
lstsquses SVD/QR for stability; explicit $X^\top X$ inversion squares condition number and often fails. - Shape discipline: $X \in \mathbb{R}^{n\times d}$, $w \in \mathbb{R}^d$, $r \in \mathbb{R}^n$, $X^\top r \in \mathbb{R}^d$.
Part 1: Problem Setup and Least Squares Solution
The toy dataset has $X \in \mathbb{R}^{30\times 3}$ (30 examples, 3 features) and $y \in \mathbb{R}^{30}$. The solver np.linalg.lstsq uses SVD internally: decompose $X = U\Sigma V^\top$, then $\hat{w} = V\Sigma^{-1}U^\top y$ (with singular value cutoff for rank deficiency). This avoids forming $X^\top X$, which squares the condition number and loses precision. The resulting $\hat{w} \in \mathbb{R}^3$ defines fitted values $\hat{y} = X\hat{w}$, the orthogonal projection of $y$ onto $\text{span}(X)$.
Part 2: Residual Vector as Orthogonal Complement
The residual $r = y - X\hat{w}$ is the component of $y$ orthogonal to $\text{span}(X)$. The normal equations $X^\top X w = X^\top y$ rearrange to $X^\top (y - Xw) = 0$, which is the orthogonality condition $X^\top r = 0$. Geometrically, this says $r$ is perpendicular to every column of $X$. The vector space $\mathbb{R}^{30}$ decomposes as $y = \hat{y} + r$ where $\hat{y} \in \text{span}(X)$ and $r \in \text{span}(X)^\perp$. This is the orthogonal direct sum decomposition.
Part 3: Numerical Verification as Diagnostic
The quantity $\|X^\top r\|_2$ measures how close the computed solution is to satisfying the normal equations. For well-conditioned $X$ with $\kappa(X) \lesssim 10^3$, expect $\|X^\top r\|_2 / \|r\|_2 < 10^{-12}$ in double precision. If the ratio exceeds $10^{-8}$, suspect: (1) ill-conditioning ($\kappa(X) > 10^8$), (2) rank deficiency (nearly collinear features), (3) solver failure (rare), or (4) bug (wrong transpose, shape mismatch). This diagnostic is more informative than just $\|r\|_2$: small residual norm means good fit, but large $\|X^\top r\|_2$ means the solver didnât converge to the true optimum.
Connection to Projection Geometry
The projection operator $P = X(X^\top X)^{-1}X^\top$ (assuming full column rank) satisfies $P^2 = P$ (idempotent) and $P^\top = P$ (symmetric), making it an orthogonal projector. The fitted values are $\hat{y} = Py$ and the residual is $r = (I - P)y$. The complementary projector $(I - P)$ projects onto $\text{span}(X)^\perp$. The orthogonality condition $X^\top r = 0$ is equivalent to $X^\top (I - P)y = 0$, which holds by construction: $X^\top (I - P) = X^\top - X^\top X(X^\top X)^{-1}X^\top = X^\top - X^\top = 0$. This projector viewpoint clarifies ridge regression: adding $\lambda I$ changes $P$ to $P_\lambda = X(X^\top X + \lambda I)^{-1}X^\top$, which is no longer orthogonal but still a projector.
Why This Matters for ML
Gradient descent on least squares iteratively approaches $X^\top r = 0$: each step $w_{k+1} = w_k + \eta X^\top r_k$ reduces $\|X^\top r\|_2$. Monitoring this quantity diagnoses convergence and conditioning. In neural networks, backprop through linear layers computes $\nabla_W = X^\top \delta$ (where $\delta$ is upstream error), which has the same structure as $X^\top r$âunderstanding projection geometry clarifies why gradients vanish or explode when activations ($X$) have extreme norms. Feature engineering: if $\|X^\top r\|_2$ is small but $\|r\|_2$ is large, the model converged but lacks capacity (add features or go nonlinear). Regularization: ridge adds $\lambda w$ to the gradient, modifying the orthogonality condition to $X^\top r + \lambda w = 0$; the residual is no longer strictly orthogonal to $\text{span}(X)$, trading bias for stability.
Numerical and Implementation Notes
Use np.linalg.lstsq with rcond=None for automatic singular value cutoff. Avoid explicit $(X^\top X)^{-1}$ unless $X$ is extremely well-conditioned ($\kappa < 10$). For large $n$ or $d$, consider iterative solvers (conjugate gradients) or stochastic methods. Check orthogonality error relative to residual norm: ortho_err = np.linalg.norm(X.T@r) / np.linalg.norm(r); values $> 10^{-8}$ indicate numerical issues. For ill-conditioned $X$, use ridge regression w = np.linalg.solve(X.T@X + lam*np.eye(d), X.T@y) to stabilize.
Numerical and Shape Notes
Shapes: $X(n\times d)$, $w(d)$, $y(n)$, $\hat{y}(n)$, $r(n)$, $X^\top r(d)$. Transpose $X^\top$ has shape $(d, n)$, so $(d,n) \times (n,) \to (d,)$ for $X^\top r$. For batched problems with multiple $y$ vectors ($Y \in \mathbb{R}^{n\times k}$), compute $W = \text{lstsq}(X, Y)$ with shape $(d, k)$, residuals $R = Y - XW$ with shape $(n, k)$, and check $\|X^\top R\|_F$ (Frobenius norm). Gotcha: if $r$ is accidentally a row vector, X.T@r may succeed but produce wrong results; always verify r.shape == (n,). For intercept models, ensure $X$ has a column of ones; then r.sum() â 0 (residuals sum to zero) is an additional diagnostic.
Least squares dates to Gauss (1809) and Legendre (1805), who used it for astronomical orbit determination and geodesy. The orthogonality condition $X^\top r = 0$ was recognized early as the key to uniqueness and optimality. Numerical methods evolved from normal equations ($X^\top X w = X^\top y$, ill-conditioned for large $\kappa(X)$) to QR decomposition (Householder, 1958) and SVD (Golub & Kahan, 1965), which avoid forming $X^\top X$ and handle rank deficiency gracefully. In statistics, least squares is the maximum likelihood estimator for linear regression with Gaussian noise; the orthogonality condition is the first-order optimality of the log-likelihood. In machine learning, least squares is the building block for ridge regression, LASSO (orthogonality plus $L^1$ penalty), generalized least squares (weighted norms), and neural network training (each layer update uses the same gradient structure as $X^\top r$). Modern optimization (gradient descent, Newtonâs method, conjugate gradients) iteratively enforces $X^\top r \to 0$, with convergence rates determined by $\kappa(X)$. Understanding least squares as projectionânot just curve-fittingâclarifies regularization (changing the projection subspace), conditioning (amplifying rounding errors), and generalization (bias-variance decomposition is prediction vs. residual split).
- Vector spaces (Ch. 1): Column space $\text{span}(X)$ is a subspace; projections map $\mathbb{R}^n$ to subspaces.
- Span (Ch. 2): $\hat{y} = Xw$ lies in $\text{span}(X)$; residual $r$ lies in the orthogonal complement.
- Inner products (Ch. 5): Orthogonality is $\langle X_{:,j}, r \rangle = 0$ for all columns $j$, which is $X^\top r = 0$.
- Projections (Ch. 6): Least squares is projection onto $\text{span}(X)$; projector is $P = X(X^\top X)^{-1}X^\top$.
- SVD (Ch. 10):
lstsquses SVD to handle rank deficiency; condition number $\kappa(X) = \sigma_{\max}/\sigma_{\min}$. - PCA (Ch. 11): Projecting data onto principal components is least squares onto a low-rank subspace.
- Least squares (Ch. 12): This example is the canonical orthogonality verification for all LS problems.
- Conditioning (Ch. 14): Large $\kappa(X)$ amplifies $\|X^\top r\|_2$; regularization (ridge) improves conditioning.
Comments