Example number
7
Example slug
example_7_null_space_explains_nonidentifiability_overparameterized_linear_model
Background

Overparameterized linear models ($n<d$) are underdetermined: there are infinitely many $w$ that interpolate the data because the null space $\mathcal{N}(X)$ is nontrivial. The Moore–Penrose pseudoinverse (what lstsq computes) returns the unique minimum-norm solution among them.

The SVD $X = U\Sigma V^\top$ exposes this structure: zero (or tiny) singular values indicate directions $v$ in parameter space where $Xv=0$. Adding any linear combination of these null-space vectors leaves the predictions $Xw$ unchanged. This phenomenon underlies modern discussions of implicit bias: gradient descent starting at zero converges to the minimum-norm interpolant, while regularization (e.g., ridge) explicitly selects a solution by penalizing weight magnitude.

Purpose

Build a geometric and numerical intuition for overparameterized linear models ($n<d$):

  • See why null-space directions create infinitely many parameter vectors with identical training predictions.
  • Understand that lstsq returns the minimum-norm solution among these possibilities.
  • Diagnose rank deficiency using singular values and identify null-space directions with SVD.
  • Recognize how regularization or implicit bias selects one solution among many.
Problem

Construct n<d, fit w0, add a null-space vector z, and verify predictions are unchanged.

Solution (Math)

If $z\in\mathcal{N}(X)$, then $Xz=0$. Hence $X(w_0+z)=Xw_0$: infinitely many parameters yield identical predictions on the training inputs.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
rng = np.random.default_rng(1007)

n, d = 6, 12
X = rng.normal(size=(n,d))
w_true = rng.normal(size=d)
y = X@w_true

w0 = np.linalg.lstsq(X, y, rcond=None)[0]
U,S,Vt = np.linalg.svd(X, full_matrices=False)
z = Vt[-1]; z /= np.linalg.norm(z)

pred0 = X@w0
pred1 = X@(w0 + 5.0*z)
print("min singular value:", S.min())
print("||pred1-pred0||:", np.linalg.norm(pred1-pred0))
Code Introduction

This code illustrates the non-uniqueness of least-squares solutions in an underdetermined system ($n=6$ rows, $d=12$ columns). It generates data $X\in\mathbb{R}^{6\times 12}$, a true parameter $w_{\text{true}}\in\mathbb{R}^{12}$, and responses $y = X w_{\text{true}}$. Solving $\min_w \|Xw - y\|_2^2$ via lstsq returns the minimum-norm solution $w_0$. The SVD U, S, Vt = svd(X) reveals the right singular vectors; the last row Vt[-1] spans (approximately) the null space of $X$ because its singular value is the minimum. Normalizing this vector as $z$ and forming $w_0 + 5 z$ produces a different parameter vector that adds a null-space component. The predictions pred0 = X @ w0 and pred1 = X @ (w0 + 5 z) are compared: their difference norm is near zero, showing that moving along the null space does not change $Xw$. The printed smallest singular value (S.min()) quantifies how close $X$ is to rank-deficient; a very small value indicates a large null-space direction, which explains the non-uniqueness of $w$ and why multiple solutions yield identical predictions.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Construct an underdetermined system with $n=6$, $d=12$ using Gaussian samples for $X$ and $w_{\text{true}}$.
  • Solve $\min_w \|Xw - y\|_2^2$ with np.linalg.lstsq(X, y, rcond=None) to obtain the minimum-norm $w_0$.
  • Compute SVD via U, S, Vt = np.linalg.svd(X, full_matrices=False); the last right-singular vector Vt[-1] approximates a null-space direction.
  • Normalize z and perturb $w_0$ by a scaled null vector (e.g., $w_0 + 5z$); verify $\|X(w_0+5z) - Xw_0\|_2 \approx 0$.
  • Monitor conditioning with the smallest singular value S.min(); very small values indicate near rank-deficiency.
  • Avoid explicit inverses of $X^\top X$; use SVD/QR-based solvers for stability.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Null-space directions leave predictions unchanged: if $z\in\mathcal{N}(X)$, then $X(w_0+z)=Xw_0$.
  • lstsq returns the minimum-norm solution among infinitely many interpolants when $n<d$.
  • The smallest singular values from SVD reveal rank deficiency and span the null space.
  • Adding a null-space vector can change parameters dramatically while leaving training loss zero—highlighting non-identifiability.
Notes

Part 1: Core setup - Null space explains non-identifiability (overparameterized linear model)

State the objects, shapes, and target question for Null space explains non-identifiability (overparameterized linear model). Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - Null space explains non-identifiability (overparameterized linear model)

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make Null space explains non-identifiability (overparameterized linear model) work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - Null space explains non-identifiability (overparameterized linear model)

Give the computational recipe for Null space explains non-identifiability (overparameterized linear model), note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: $X\in\mathbb{R}^{n\times d}$ with $n<d$, $y\in\mathbb{R}^n$, $w\in\mathbb{R}^d$, $S\in\mathbb{R}^{\min(n,d)}$, $V^\top\in\mathbb{R}^{d\times d}$.
  • Numerical note: lstsq (QR/SVD) is stable; avoid forming $X^\top X$ explicitly. rcond=None filters tiny singular values at machine precision.
  • Null-space scaling: Large steps along $z\in\mathcal{N}(X)$ leave $Xw$ unchanged but can inflate $\|w\|$; regularization controls this growth.
  • Tolerance: Expect $\|X(w_0+5z) - Xw_0\|_2$ near machine precision (e.g., $<10^{-12}$) if $z$ is truly in the null space.
History and Applications

The rank-nullity theorem $\dim(\text{range}(A)) + \dim(\text{null}(A)) = \dim(\text{domain})$ was formalized by Sylvester (1851) and is fundamental to linear algebra. This decomposition shows that every vector space splits into two complementary parts: the column space and null space.

Implicit bias and modern deep learning: For decades, linear regression’s non-uniqueness in overparameterized regimes ($n < d$) was treated as a minor edge case. However, modern neural networks are massively overparameterized (billions of parameters, millions of examples). The question “which solution does gradient descent pick?” became central. Woodworth & Gunasekar (2020) and others showed that gradient descent with appropriate initialization converges to the minimum-norm interpolant—exactly what the pseudoinverse returns. This connection bridged classical numerical analysis with modern deep learning.

Regularization and inductive bias: Ridge regression adds a penalty $\lambda \|w\|_2^2$, breaking the non-uniqueness by explicitly trading off fit quality for parameter norm. This regularization bias toward smaller weights (via weight decay) is now standard in deep learning, where it acts as an implicit structure-seeking mechanism.

Connection to Broader Examples
  • Implicit bias: Gradient descent from zero on an underdetermined system converges to the minimum-norm interpolant, mirroring the lstsq solution.
  • Regularization: Ridge regression adds $\lambda I$, shrinking weights and selecting a unique solution even when $n<d$.
  • Generalization risk: Moving along null-space directions leaves training loss unchanged but can alter model complexity and downstream behavior.
  • Diagnostics: Small singular values flag rank issues; inspecting null vectors explains parameter drift without prediction changes.

Comments