Example number
12
Example slug
example_12_least_squares_fit_and_residual_norm
Background

Least squares arises from minimizing the squared error $\|y - Xw\|_2^2$, yielding the normal equations $X^\top X\hat{w} = X^\top y$ when the columns of $X$ are independent. Geometrically, the solution $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto the column space of $X$, and the residual $r = y - \hat{y}$ lies in the orthogonal complement, giving $X^\top r = 0$.

Numerically, forming $X^\top X$ explicitly can be unstable when $X$ is ill-conditioned. Modern solvers implement QR or SVD factorizations (as in np.linalg.lstsq) to compute $\hat{w}$ reliably, filtering tiny singular values according to rcond. This stability is crucial in ML pipelines, where features may be collinear or scaled unevenly. The projection viewpoint also connects least squares to PCA (projection onto principal subspaces), ridge regression (regularized projection), and Gauss–Newton methods (local quadratic fits).

Purpose

Build a robust ML-first intuition that least squares is a projection problem solved stably by modern linear algebra routines:

  • Fit is project: $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{span}(X)$.
  • Residual optimality: $r = y - X\hat{w}$ satisfies $X^\top r \approx 0$ (up to numerical tolerance).
  • Stable computation: use lstsq (QR/SVD) rather than explicit inverses; verify shapes and diagnostics.
  • Shape discipline: $X\in\mathbb{R}^{n\times d}$, $y\in\mathbb{R}^n$, $w\in\mathbb{R}^d$, $r\in\mathbb{R}^n$.
Problem

Fit linear regression on toy data via lstsq; report ||r|| and verify orthogonality.

Solution (Math)

Least squares gives $\hat w=\arg\min_w\|Xw-y\|^2$. Residual $r=y-X\hat w$ satisfies $X^Tr=0$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_linreg

X, y, _ = toy_linreg(n=50, d=4, seed=1)
w = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X@w
print("||r||_2:", np.linalg.norm(r))
print("||X^T r||_2:", np.linalg.norm(X.T@r))
Code Introduction

This snippet solves a least-squares regression and verifies the key orthogonality condition that characterizes the solution. With data $X\in\mathbb{R}^{n\times d}$ and $y\in\mathbb{R}^n$, np.linalg.lstsq(X, y, rcond=None) computes the minimum-norm solution $w\in\mathbb{R}^d$ to $\min_w \|y - Xw\|_2^2$. The residual $r = y - Xw$ quantifies fit error, while the diagnostic $\|X^\top r\|_2$ checks the normal-equation condition $X^\top r = 0$ at the optimum. Geometrically, $Xw$ is the orthogonal projection of $y$ onto $\text{span}(X)$ and $r$ lies in the orthogonal complement. Numerically, lstsq uses QR/SVD internally and rcond=None filters tiny singular values; expect $\|X^\top r\|_2$ near machine precision (e.g., $< 10^{-12}$). If it is noticeably larger, suspect ill-conditioning or a bug.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Compute $\hat{w}$ via w = np.linalg.lstsq(X, y, rcond=None)[0]; this uses QR/SVD for stability.
  • Residual: r = y - X@w; norms: np.linalg.norm(r) and np.linalg.norm(X.T@r).
  • Tolerance guidance: for well-conditioned problems, expect $\|X^\top r\|_2 \lesssim 10^{-12}$–$10^{-14}$; materially larger values suggest ill-conditioning or numerical issues.
  • Conditioning: inspect singular values of $X$ via np.linalg.svd(X, full_matrices=False); a large condition number indicates sensitivity.
  • Avoid explicit inverses of $X^\top X$; prefer QR/SVD or Cholesky when SPD structure is available.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Orthogonality at the optimum: $X^\top(y - X\hat{w}) = 0$.
  • Projection identity: $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{span}(X)$.
  • Stable solving: np.linalg.lstsq uses QR/SVD internally and avoids explicit inverses.
  • Diagnostics: small $\|X^\top r\|_2$ (near machine precision) confirms optimality; large values flag conditioning issues or bugs.
  • Shape awareness: $X:(n\times d)$, $y:(n)$, $w:(d)$, $r:(n)$.
Notes

Part 1: Core setup - Least squares fit and residual norm

State the objects, shapes, and target question for Least squares fit and residual norm. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - Least squares fit and residual norm

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make Least squares fit and residual norm work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - Least squares fit and residual norm

Give the computational recipe for Least squares fit and residual norm, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: verify $X:(n\times d)$, $y:(n)$, $w:(d)$, $r:(n)$, and $X^\top r:(d)$.
  • Numerical practice: use lstsq (QR/SVD); avoid forming $X^\top X$ explicitly; inspect singular values for conditioning.
  • Tolerance: treat $\|X^\top r\|_2$ near machine precision as zero; larger values indicate ill-conditioning, rank deficiency, or bugs.
  • Interpretation: $\hat{y}$ is the closest point in $\text{span}(X)$ to $y$; $r$ points orthogonally away—this is the geometric meaning of optimality.
History and Applications

Origins: Least squares emerged in the early 19th century from astronomical and geodetic measurements. Legendre (1805) and Gauss (1809) formalized the method, showing that minimizing squared deviations yields the best-fit parameters and the normal equations. The orthogonality $X^\top r = 0$ later acquired a geometric interpretation as a projection condition.

Generalized inverses: In the mid-20th century, Penrose (1955) defined the Moore–Penrose pseudoinverse, providing a unique minimum-norm solution even when $X$ is rank-deficient or $n<d$. Numerically stable algorithms based on QR and SVD (Golub & Kahan, 1965) made least squares a practical and ubiquitous tool.

ML applications: Least squares underpins linear regression, Gauss–Newton and Levenberg–Marquardt updates, and serves as the local quadratic approximation in many optimization methods. In high-dimensional ML, overparameterization makes the minimum-norm property relevant to implicit bias: gradient descent initialized at zero often converges to the minimum-norm interpolant. Regularization (ridge/Tikhonov) trades fit for stability, improving conditioning and generalization. In practice, least squares appears across computer vision (pose estimation), signal processing (denoising), and data fitting pipelines where projection geometry and numerical stability matter.

Connection to Broader Examples
  • Orthogonality and projections (Ch. 6): This example operationalizes the projection viewpoint—least squares is an orthogonal projection; residuals lie in the orthogonal complement.
  • SVD and PCA (Ch. 10–11): SVD underpins lstsq and explains conditioning; PCA is projection onto principal components.
  • Conditioning and stability (Ch. 14): Small singular values make $X^\top X$ ill-conditioned; QR/SVD mitigate instability.
  • Solving systems (Ch. 13): lstsq generalizes solving both overdetermined and underdetermined systems via pseudoinverse.
  • Regularization: Ridge (Tikhonov) modifies the projection to trade fit for stability; connects to bias–variance in ML.

Comments