ex1.ai

Example number

Example slug

example_44_least_squares_lstsq_vs_normal_equations

Background

Historical roots: Least squares originates in Gaussâs work on planetary orbits (1809) and Legendreâs treatise on curve fitting (1805). Both independently arrived at minimizing $\sum (y_i - \hat{y}_i)^2$ to fit lines to noisy data. The modern matrix formulationâsolving $X^\top X w = X^\top y$âcrystallized in the mid-20th century (Golub, Van Loan) once numerical linear algebra matured.

Mathematical characterization: The least-squares minimizer $\hat{w}$ is the unique solution to: \[ \text{arg}\min_w \|y - Xw\|_2^2, \] characterized by the normal equations $X^\top X w = X^\top y$ (assuming full column rank). Geometrically, $\hat{w}$ is the orthogonal projection of $y$ onto the column space of $X$; the residual $r = y - X\hat{w}$ is orthogonal to all columns of $X$ (i.e., $X^\top r = 0$).

Two solvers, one problem: Method 1 (lstsq) directly computes the SVD $X = U\Sigma V^\top$ and solves via $\hat{w} = V\Sigma^{-1}U^\top y$. Method 2 (normal equations) forms $X^\top X$ and $X^\top y$, then solves $(X^\top X)w = X^\top y$ via LU or Cholesky. Both minimize $\|y - Xw\|_2^2$ algebraically, but the normal equations square the condition number, amplifying rounding error. In ML, least squares is the prototype convex objective and the local approximation in gradient descent; understanding its numerical behavior is foundational for all regression-based learning.

Purpose

Connect optimization, geometry (projection), and numerics (solver choice) in a single, reusable pattern. Demonstrate that normal equations and lstsq solve the same optimization problem but via different numerical paths, with drastically different numerical stability properties. Show how squaring the condition number (a consequence of forming $X^\\top X$) can render the normal-equations solver unreliable, even though residuals appear similar. Build intuition for when to trust solver outputs and when ill-conditioning signals deeper data issues (multicollinearity, near non-identifiability). Emphasize the principle: stable algorithms matterâtwo mathematically equivalent formulations can produce vastly different numerical results in the presence of rounding error.

Problem

Solve least squares via lstsq and via normal equations; compare solutions and residual norms.

Solution (Math)

Normal equations $X^TXw=X^Ty$ characterize the least-squares minimizer. When $X$ is well-conditioned, solving them matches lstsq. Residual norms should be nearly identical.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


import numpy as np
from scripts.toy_data import toy_linreg

X, y, _ = toy_linreg(n=40, d=3, seed=5)

w1 = np.linalg.lstsq(X, y, rcond=None)[0]
w2 = np.linalg.solve(X.T@X, X.T@y)

r1 = y - X@w1
r2 = y - X@w2

print("||w1-w2||:", np.linalg.norm(w1-w2))
print("||r1||, ||r2||:", np.linalg.norm(r1), np.linalg.norm(r2))

Code Introduction

This code compares two least-squares solvers: the robust lstsq (using SVD internally) versus the numerically risky normal equations solved via np.linalg.solve. It demonstrates why direct SVD-based methods are preferred over explicit normal-equation solves in practice, even though both minimize the same objective mathematically.

The code loads a linear regression dataset with toy_linreg(n=40, d=3, seed=5), producing $X \in \mathbb{R}^{40 \times 3}$ (40 examples, 3 features) and response vector $y \in \mathbb{R}^{40}$. Two weight vectors are computed:

Method 1 (robust): w1 = np.linalg.lstsq(X, y, rcond=None)[0] uses the least-squares solver, which internally computes the SVD $X = U\Sigma V^\top$ and solves via: \[ \hat{w} = V \Sigma^{-1} U^\top y, \] where $\Sigma^{-1}$ treats singular values below a numerical threshold as zero. This approach avoids forming $X^\top X$ explicitly, preserving numerical stability even when $X$ is ill-conditioned. The rcond=None parameter uses machine precision as the cutoff for rank determination.

Method 2 (risky): w2 = np.linalg.solve(X.T@X, X.T@y) forms the normal equations $X^\top X w = X^\top y$ and solves them via Gaussian elimination. This method is mathematically equivalent to least squares but squares the condition number: $\kappa(X^\top X) = \kappa(X)^2$. For well-conditioned $X$, both methods agree; for ill-conditioned $X$, solve accumulates rounding errors in $X^\top X$, producing $w_2 \ne w_1$.

The residuals are computed as $r_i = y - Xw_i$ for both solutions. Both residuals should have nearly identical norms (||r1|| and ||r2||), since both methods minimize $\|y - Xw\|_2^2$ and converge to the same optimal residual vector (up to numerical error). The printed norms confirm theyâre equal; any significant difference signals numerical instability in the normal-equations solver. This is a subtle but critical observation: two solutions can have identical residuals but very different weightsâa signature of ill-conditioning and non-identifiability.

The quantity ||w1 - w2|| measures how much the two solutions disagree. For well-conditioned problems (||w1 - w2|| near machine precision, $\sim 10^{-14}$), both solvers work equally well, and the difference is negligible rounding error. For ill-conditioned problems, ||w1 - w2|| can be substantial (e.g., $10^{-6}$ or larger), indicating that $w_2$ from the normal equations has accumulated significant error, even though its residual ||r2|| remains close to the optimal value. Small perturbations in $y$ (e.g., from rounding) are amplified by $\kappa(X^\top X) = \kappa(X)^2$ when computing $w$. However, the residual $y - Xw$ is less sensitive because it involves $X$ directly (condition number $\kappa(X)$), not the squared system. So $w_2$ can be far from $w_1$, yet both produce similar residualsâa classic symptom of non-identifiability due to ill-conditioning. The weights $w$ are unstable estimates of the true coefficients, but predictions $\hat{y} = Xw$ remain stable.

Why this matters: In ML, the fitted weights $w$ are often interpreted as feature importance or causal effects. Ill-conditioning makes these interpretations unreliable: tiny changes in the data produce large changes in $w$. However, if the task is pure prediction, $\hat{y} = Xw$ remains stable because predictions depend on $X$ (well-conditioned) rather than $w$ alone. When ||w1 - w2|| is large, regularization (ridge regression, elastic net) becomes essential to stabilize weight estimates. For $n \gg d$ (many examples, few features), both solvers are usually fast and stable. For $n \approx d$ or $n < d$ (fewer examples than features), always use lstsq or SVD-based methods. For $n < d$ (underdetermined), normal equations break down entirely (matrix is singular); only lstsq with rcond truncation recovers the minimum-norm solution. This example operationalizes the principle: normal equations square the condition number. For stable least squares, always use lstsq, QR decomposition, or SVD directly. Normal equations are mathematically elegant but numerically treacherous.

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

Load data: X, y, _ = toy_linreg(n=40, d=3, seed=5) generates $X \in \mathbb{R}^{40\times 3}$ (design matrix with 40 examples, 3 features) and $y \in \mathbb{R}^{40}$ (response vector).
Method 1 (robust): w1 = np.linalg.lstsq(X, y, rcond=None)[0] uses SVD internally: $X = U\Sigma V^\top$, then $\hat{w} = V\Sigma^{-1}U^\top y$.
Method 2 (risky): w2 = np.linalg.solve(X.T@X, X.T@y) forms the Gram matrix $G = X^\top X \in \mathbb{R}^{3\times 3}$ and solves $Gw = X^\top y$ via Gaussian elimination (LU decomposition).
Residuals: r1 = y - X@w1 and r2 = y - X@w2 compute $r_i = y - Xw_i$ (should be nearly identical for well-conditioned $X$).
Weight norms: np.linalg.norm(w1-w2) measures the Euclidean distance $\|w_1 - w_2\|_2$ between solutionsâsmall for well-conditioned $X$, large (e.g., $> 10^{-6}$) for ill-conditioned $X$.
Residual norms: np.linalg.norm(r1) and np.linalg.norm(r2) compute $\|r_i\|_2 = \sqrt{\sum_j r_{ij}^2}$; should be numerically identical (both equal to the optimal LS residual norm).
Condition number check: Compute $\kappa(X) = \sigma_1/\sigma_d$ (ratio of largest to smallest singular values) to diagnose ill-conditioning; $\kappa(X) > 10^8$ strongly suggests using lstsq and potentially regularization.
Orthogonality check: Verify $\|X^\top r\|_2 \approx 0$ for both solutions (should be $< 10^{-12}$), confirming that residuals are truly orthogonal to the column space.

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

Normal equations square the condition number: $\kappa(X^\top X) \approx \kappa(X)^2$, making the normal-equations solver fragile for ill-conditioned design matrices.
Two solvers, one optimal residual: Both lstsq and solve(X.T@X, X.T@y) minimize $\|y - Xw\|_2^2$ (residuals nearly identical), but lstsq produces stable weights $w_1$, while solve may produce drifted weights $w_2$ when $X$ is ill-conditioned.
Weight discrepancy as diagnostic: The norm $\|w_1 - w_2\|$ is a practical warning sign of ill-conditioning. When $\|w_1 - w_2\|$ is large (e.g., $> 10^{-6}$), the weights $w$ are unreliable estimates, even if residuals look similarâa subtle but critical failure mode.
Identifiability vs.Â prediction: Ill-conditioning makes coefficients $w$ unstable and hard to interpret (non-identifiable); but predictions $\hat{y} = Xw$ remain stable because they depend on the well-conditioned quantity $X$, not $w$ alone.
Solver selection rule: Always use lstsq for least squares (or QR/SVD if you need full decomposition). Use normal equations only for dense, well-conditioned problems where speed is criticalâand even then, prefer Cholesky if $X^\top X$ is explicitly positive definite.
Regularization necessity: When weights are non-identifiable (large $\|w_1 - w_2\|$), ordinary least squares is insufficient; ridge regression or other regularization is needed to stabilize weight estimates.

Notes

Part 1: Two Solution Methods

The code compares two least-squares solvers: the robust lstsq (using SVD internally) versus the numerically risky normal equations solved via np.linalg.solve. It demonstrates why direct SVD-based methods are preferred over explicit normal-equation solves in practice, even though both minimize the same objective mathematically.

Part 2: Residual Verification

The residuals are computed as $r_i = y - Xw_i$ for both solutions. Both residuals should have nearly identical norms (||r1|| and ||r2||), since both methods minimize $\|y - Xw\|_2^2$ and converge to the same optimal residual vector (up to numerical error). The printed norms confirm theyâre equal; any significant difference (e.g., ||r1|| = 10.5, ||r2|| = 10.6) signals numerical instability in the normal-equations solver. This is a subtle but critical observation: two solutions can have identical residuals but very different weightsâa signature of ill-conditioning and non-identifiability.

Part 3: Weight Discrepancy as Ill-Conditioning Indicator

Why? The normal equations solve a different, more sensitive problem: \[ (X^\top X) w = X^\top y. \] Small perturbations in $y$ (e.g., from rounding) are amplified by $\kappa(X^\top X) = \kappa(X)^2$ when computing $w$. However, the residual $y - Xw$ is less sensitive because it involves $X$ directly (condition number $\kappa(X)$), not the squared system. So $w_2$ can be far from $w_1$, yet both produce similar residualsâa classic symptom of non-identifiability due to ill-conditioning. The weights $w$ are unstable estimates of the true coefficients, but predictions $\hat{y} = Xw$ remain stable.

Why This Matters for ML

Interpretability vs.Â prediction stability: In regression, the fitted weights $w$ are often interpreted as feature importance or causal effects. Ill-conditioning makes these interpretations unreliable: tiny changes in the data produce large changes in $w$. However, if the task is pure prediction, $\hat{y} = Xw$ remains stable because predictions depend on $X$ (well-conditioned) rather than $w$ alone.

Regularization necessity: When ||w1 - w2|| is large, regularization (ridge regression, elastic net, LASSO) becomes essential. Ridge regression solves: \[ (X^\top X + \lambda I) w = X^\top y, \] which improves conditioning by adding $\lambda I$ to the Hessian. The trade-off is bias (estimator $w$ is biased toward zero), but reduced variance and stability.

Solver selection: For $n \gg d$ (many examples, few features), both solvers are usually fast and stableâuse whichever is convenient. For $n \approx d$ or $n < d$ (fewer examples than features), always use lstsq or SVD-based methods. For $n < d$ (underdetermined), normal equations break down entirely (matrix is singular); only lstsq with rcond truncation recovers the minimum-norm solution.

Numerical diagnostics: Print ||w1 - w2|| and kappa(X) as routine diagnostics to catch ill-conditioning early:

print(f"Weight disagreement: {np.linalg.norm(w1 - w2):.2e}")
print(f"Condition number: {np.linalg.cond(X):.2e}")
if np.linalg.cond(X) > 1e8:
    print("WARNING: Ill-conditioned design matrix. Consider regularization.")

Connection to ML Applications

Multicollinearity in feature engineering: When features are highly correlated (e.g., age and birth year), $X$ becomes ill-conditioned. The least-squares solution $\hat{w}$ assigns wildly different weights to correlated features, even though predictions $\hat{y}$ remain stable. This is why multicollinearity is a red flag: feature interpretability collapses, not prediction accuracy.

Ridge regression as conditioning fix: Ridge adds $\lambda I$ to the Hessian, effectively regularizing away directions with small singular values. This trades some bias for much better conditioning and weight stabilityâessential when features are correlated or when interpretability matters.

Cross-validation and hyperparameter tuning: Ill-conditioned problems are prone to overfitting because small changes in training data produce large weight changes. Cross-validation detects this via high variance in validation scores. Using lstsq (stable) instead of normal equations helps, but regularization is the fundamental fix.

Numerical and Implementation Notes

Use full_matrices=False for economy SVD in lstsq; critical when $n \gg d$ to avoid allocating huge matrices. Always center and/or standardize features before least squares to improve conditioning. Verify $\|X^\top r\|_2 \approx 0$ (orthogonality condition) to confirm optimality. For high-dimensional problems ($d > 1000$), normal equations become prohibitively expensive anyway (forming $X^\top X$ requires $O(nd^2)$ operations); lstsq scales better. When $X$ is sparse, use sparse solvers (scipy.sparse.linalg.lsqr) instead of dense lstsq. Always inspect both ||w1 - w2|| and ||r1||, ||r2|| togetherâweights can diverge even if residuals agree.

Numerical and Shape Notes

Shape discipline: - $X \in \mathbb{R}^{n \times d}$: design matrix ($n$ examples, $d$ features) - $y \in \mathbb{R}^n$: response vector - $X^\top X \in \mathbb{R}^{d \times d}$: Gram matrix (symmetric, PSD, but potentially ill-conditioned) - $X^\top y \in \mathbb{R}^d$: cross-product vector - $w_1, w_2 \in \mathbb{R}^d$: weight vectors - $r_1, r_2 \in \mathbb{R}^n$: residual vectors - $U \in \mathbb{R}^{n \times d}$, $\Sigma \in \mathbb{R}^d$, $V^\top \in \mathbb{R}^{d \times d}$ (economy SVD)

Gotchas: 1. Normal equations are fragile: For $\kappa(X) > 10^6$, the normal equations lose several digits of precision. Always prefer lstsq for production code. 2. Residuals can mask weight errors: Both residuals are similar even if $w$ disagreement is large. Always inspect both ||w1 - w2|| and ||r1||, ||r2|| together. 3. lstsq return value: np.linalg.lstsq(X, y, rcond=None) returns a tuple (w, residuals, rank, s). Extract the weight via [0], not [0, 0]. 4. Singular/near-singular X: If $\text{rank}(X) < d$, solve(X.T@X, ...) fails with a LinAlgError (matrix is singular). lstsq handles this gracefully by returning the minimum-norm solution. 5. Broadcasting in matrix-vector products: X.T @ X is shape $(d, d)$ and X.T @ y is shape $(d,)$âensure correct dimensions before calling solve.

History and Applications

Early foundations: Least squares traces to Gaussâs work on planetary motion (1809) and Legendreâs curve-fitting treatise (1805). Both independently discovered that minimizing $\sum (y_i - \hat{y}_i)^2$ yields the âbestâ fit in a natural sense. Gaussâs insightâthat least squares is optimal under Gaussian noiseâbecame foundational to modern statistics.

Matrix formulation: The characterization via normal equations $X^\top X w = X^\top y$ emerged in the mid-20th century as numerical linear algebra matured. Golub and Van Loanâs Matrix Computations (1983) crystallized the connection between least squares and SVD, showing that condition number squaring is inevitable in normal equations and that direct SVD is superior.

Numerical awakening: In the 1960s-1970s, researchers (Golub, Kahan, others) discovered that normal equations fail catastrophically for ill-conditioned $X$. Wilkinsonâs Rounding Errors in Algebraic Processes (1963) made this concrete: a simple least-squares problem solved via normal equations could lose all significant digits. This triggered a paradigm shift: always use QR or SVD for least squares, never normal equations for general use.

Ridge regression response: When least squares failed due to multicollinearity (ill-conditioned $X$), Hoerl and Kennard (1970) introduced ridge regression: $(X^\top X + \lambda I)w = X^\top y$. This improved conditioning (reducing $\kappa$) at the cost of bias, offering a principled trade-off between estimation variance and bias.

Modern applications in ML: - Linear regression baseline: Least squares remains the foundation for supervised learning; understanding its numerics is prerequisite to deep learning, kernel methods, and modern optimization. - Multicollinearity detection: Ill-conditioned design matrices (high $\kappa(X)$) signal redundant or correlated features. Practitioners use condition numbers as diagnostic tools to flag feature engineering problems. - Regularization justification: Ridge, elastic net, and LASSO all address ill-conditioning or overfitting. The ||w1 - w2|| discrepancy motivates why regularization is necessary: raw least squares produces unstable estimates when features are correlated. - Numerical libraries: Modern implementations (NumPy, TensorFlow, PyTorch) build on Golub-Van Loan principles, using SVD or QR internally for least squares. The principle âalways use lstsq, never normal equationsâ is nearly universal. - Ill-posed inverse problems: In scientific computing and imaging, least squares is used to solve inverse problems (e.g., deconvolution, tomography). Ill-conditioning from noise and discretization is endemic; understanding $\kappa(X)$ and regularization is critical for meaningful reconstructions. - Causal inference: In econometrics and observational studies, non-identifiability (large $||w_1 - w_2||$) is a red flag for confounding: no amount of least squares can extract causal effects from ill-conditioned observational data without additional assumptions. - Kernel methods: Kernel ridge regression solves $(K + \lambda I)\alpha = y$ where $K$ is the Gram matrix. The normal-equations analogy holds: forming $K^\top K$ is avoided; instead, iterative solvers (conjugate gradient) work with $K$ directly for stability.

The key historical lesson: stable algorithms matter as much as correct mathematics. Two equivalent formulations can yield vastly different results in the presence of rounding error. This principle extends far beyond least squares to all numerical computation.

Connection to Broader Examples

Vector spaces (Ch. 1): The solution $\hat{w}$ lives in $\mathbb{R}^d$ (the feature space); the residual $r$ is orthogonal to the column space of $X$.
Span and linear combinations (Ch. 2): The fitted model $\hat{y} = Xw$ is a linear combination of the columns of $X$ (each column is a feature).
Basis and dimension (Ch. 3): If $X$ has full column rank, its columns form a basis for a $d$-dimensional subspace of $\mathbb{R}^n$.
Linear maps and matrices (Ch. 4): The least-squares fit $X\cdot : \mathbb{R}^d \to \mathbb{R}^n$ is a linear map.
Inner products and norms (Ch. 5): Least squares minimizes $\|y - Xw\|_2^2 = (y - Xw)^\top(y - Xw)$, an inner product.
Orthogonality and projections (Ch. 6): The fit $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{col}(X)$; the residual $r = y - \hat{y}$ is orthogonal to $\text{col}(X)$.
Rank and nullspace (Ch. 7): If $X$ has rank $d < m$, the least-squares solution lies in the row space of $X$ (rank = column rank).
SVD (Ch. 10): lstsq uses SVD; the SVD reveals the conditioning of $X$ and the numerical rank.
Conditioning and stability (Ch. 14): $\kappa(X) = \sigma_1/\sigma_d$ measures how sensitive $\hat{w}$ is to perturbations in $y$; $\kappa(X^\top X) \approx \kappa(X)^2$ explains why normal equations are fragile.

Numerical and Shape Notes

Numerical and Implementation Notes

Pedagogical Significance

ML Examples and Patterns

Comments