Example number
23
Example slug
example_23_null_space_explains_nonidentifiability_overparameterized_linear_model
Background

The null space $\mathcal{N}(X) = \{z \in \mathbb{R}^d : Xz = 0\}$ consists of all vectors that map to zero under $X$. By the rank–nullity theorem, $\text{rank}(X) + \dim(\mathcal{N}(X)) = d$, so when $n < d$ and $X$ has full row rank $n$, the null space has dimension $d - n$. Any solution $w_0$ to $Xw = y$ can be shifted by any $z \in \mathcal{N}(X)$ to produce another solution: $X(w_0 + \alpha z) = Xw_0 = y$ for all scalars $\alpha$.

In ML, this non-identifiability appears whenever parameters outnumber training examples. Wide neural networks, large embeddings, and overparameterized linear models all exhibit null-space structure. SVD exposes this: $X = U\Sigma V^\top$ with singular values $\sigma_1 \geq \cdots \geq \sigma_r > 0$ and $r = \text{rank}(X)$. The last $d - r$ right singular vectors (rows of $V^\top$) span $\mathcal{N}(X)$. Small but nonzero singular values create near-null directions where parameters change a lot but predictions change little, leading to instability and poor generalization without regularization.

Purpose

Build a crisp, ML-first understanding of how the null space causes non-identifiability in overparameterized models. When $n < d$, infinitely many parameter vectors $w$ can produce the same predictions, because the null space $\mathcal{N}(X)$ contains directions along which changing $w$ leaves $Xw$ unchanged. This is central to modern deep learning: wide networks have vast null spaces, so SGD implicitly chooses among equivalent solutions via initialization and optimizer dynamics.

You will learn to identify null-space directions via SVD, verify that moving along them doesn’t change predictions, and interpret small singular values as near-null directions that lead to ill-conditioning. The goal is to make you fluent in diagnosing rank issues and understanding why regularization (ridge, weight decay) suppresses null-space wandering.

Problem

Construct n<d, fit w0, add a null-space vector z, and verify predictions are unchanged.

Solution (Math)

If $z\in\mathcal{N}(X)$, then $Xz=0$. Hence $X(w_0+z)=Xw_0$: infinitely many parameters yield identical predictions on the training inputs.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
rng = np.random.default_rng(1023)

n, d = 6, 12
X = rng.normal(size=(n,d))
w_true = rng.normal(size=d)
y = X@w_true

w0 = np.linalg.lstsq(X, y, rcond=None)[0]
U,S,Vt = np.linalg.svd(X, full_matrices=False)
z = Vt[-1]; z /= np.linalg.norm(z)

pred0 = X@w0
pred1 = X@(w0 + 5.0*z)
print("min singular value:", S.min())
print("||pred1-pred0||:", np.linalg.norm(pred1-pred0))
Code Introduction

This snippet illustrates non-identifiability in an underdetermined linear model and how small singular values create directions where parameters can change with minimal effect on predictions. With $n=6$ and $d=12$, $X \in \mathbb{R}^{6\times 12}$, $w_{\text{true}} \in \mathbb{R}^{12}$, and $y = X w_{\text{true}}$. The least-squares call lstsq returns the minimum-norm interpolant $w_0$ satisfying $X w_0 \approx y$.

An SVD of $X$ gives $X = U\,\Sigma\,V^\top$ with singular values $S$ and right singular vectors (rows of $V^\top$). The code picks $z$ as the right singular vector for the smallest singular value $\sigma_{\min}$. Moving the weights along this direction, $w_0 \mapsto w_0 + 5z$, changes the prediction by \[ X(w_0 + 5z) - Xw_0 \;=\; 5\,Xz \;=\; 5\,\sigma_{\min}\,u_{\min}, \] so the difference norm is $\|X(w_0+5z) - Xw_0\|_2 \approx 5\,\sigma_{\min}$. When $\sigma_{\min}$ is tiny, predictions barely change even though parameters move a lot—this is the hallmark of near-null directions and rank issues. In the exact null-space case ($\sigma_{\min}=0$), $Xz=0$ and predictions are unchanged. As a diagnostic, small singular values signal ill-conditioning and non-uniqueness; ridge regularization $(X^\top X + \lambda I)$ suppresses such directions. Gotcha: with full_matrices=False, $V^\top$ contains only the first $r=\min(n,d)$ right singular vectors (nonzero singular values). To extract true null-space vectors, use np.linalg.svd(X, full_matrices=True) and take the rows of $V^\top$ corresponding to zero singular values.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Solve for $\hat w$ via np.linalg.lstsq(X, y, rcond=None), which returns the minimum-norm solution (smallest $\|w\|_2$) among all interpolants.
  • Compute the SVD: U, S, Vt = np.linalg.svd(X, full_matrices=False) returns singular values in descending order.
  • Extract a near-null direction: z = Vt[-1] is the right singular vector for the smallest singular value; normalize via z /= np.linalg.norm(z).
  • Perturb the solution: form w1 = w0 + alpha * z and compute predictions X @ w1 versus X @ w0.
  • Measure prediction change: np.linalg.norm(X @ w1 - X @ w0) should be approximately alpha * S.min(), near zero if $\sigma_{\min}$ is tiny.
  • Gotcha: full_matrices=False only returns the first $r = \min(n, d)$ right singular vectors. For true null-space vectors (when $\sigma = 0$), use full_matrices=True and take rows of $V^\top$ corresponding to zero singular values.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Null-space non-identifiability: moving $w$ along $z \in \mathcal{N}(X)$ leaves predictions $Xw$ unchanged.
  • SVD reveals null directions: the right singular vectors for zero (or tiny) singular values span $\mathcal{N}(X)$.
  • Small singular values signal near-null directions: predictions change slowly, parameters change rapidly.
  • Practical diagnostic: if $\sigma_{\min}$ is small, $w$ is non-unique and conditioning is poor.
  • Regularization suppresses null-space wandering: ridge/weight decay penalize large $\|w\|$, stabilizing the solution.
  • Shape discipline: $(n,d)$ with $n < d$ guarantees $\dim(\mathcal{N}(X)) \geq d - n$.
Notes

Part 1: Non-identifiability and Minimum-Norm Solutions. When $n < d$, the system $Xw = y$ is underdetermined and has infinitely many solutions. lstsq returns the one with smallest $\|w\|_2$, but adding any $z \in \mathcal{N}(X)$ produces another solution with potentially much larger norm. In deep learning, this means SGD’s path through parameter space—initialization, learning rate, batch order—implicitly selects among equivalent predictors.

Part 2: SVD and Null-Space Extraction. The SVD $X = U\Sigma V^\top$ reveals rank structure: $r = \text{rank}(X)$ equals the number of nonzero singular values. The last $d - r$ rows of $V^\top$ span $\mathcal{N}(X)$. With full_matrices=False, you only get the first $\min(n,d)$ vectors; use full_matrices=True to access the full null space when $\sigma = 0$ exactly. Numerically, treat $\sigma < \epsilon$ as “zero” for some tolerance $\epsilon$.

Part 3: Small Singular Values and Conditioning. Even when $X$ has full row rank, small $\sigma_{\min}$ creates near-null directions where $\|Xz\|$ is tiny but $\|z\|$ is not. The condition number $\kappa = \sigma_{\max}/\sigma_{\min}$ governs error amplification: large $\kappa$ means small perturbations in $y$ cause large swings in $w$. Ridge regularization $(X^\top X + \lambda I)^{-1}$ shifts singular values away from zero, improving stability at the cost of bias.

Part 4: ML Implications and Debugging. Track $\sigma_{\min}$ as a diagnostic: if it’s near machine precision, expect numerical issues and non-uniqueness. Compare predictions before and after perturbing $w$ along null/near-null directions to verify your solver and data pipeline. In practice, wide architectures (transformers, large embeddings) have vast null spaces; techniques like weight decay, dropout, and early stopping implicitly regularize to select generalizable solutions from the equivalence class.

History and Applications

The rank–nullity theorem and null-space structure have roots in 19th-century linear algebra (Sylvester, Frobenius). By the mid-20th century, numerical analysts (Golub, Wilkinson) emphasized SVD as the primary tool for diagnosing rank deficiency and extracting null-space bases, especially in least-squares contexts. The observation that small singular values cause instability led to regularization techniques (Tikhonov, ridge regression) that shift eigenvalues away from zero.

In modern ML, overparameterization (more parameters than data points) is ubiquitous: wide neural networks, large embeddings, and attention mechanisms all have vast null spaces. Recent theory (Bartlett et al., 2020; Neyshabur et al.) explores why SGD finds solutions that generalize despite non-identifiability—implicit bias toward low-norm or low-complexity solutions. Practically, monitoring singular values and null-space structure helps debug feature engineering, detect redundant features, and guide regularization strategies (weight decay, dropout, early stopping). Understanding null space is essential for diagnosing model capacity, interpretability, and robustness in high-dimensional settings.

Connection to Broader Examples
  • Chapter 7 (rank and nullspace): this example directly applies the rank–nullity theorem to overparameterized models.
  • Chapter 10 (SVD): SVD exposes rank structure and null-space bases; small singular values create instability.
  • Chapter 12 (least squares): when $n < d$, lstsq returns the minimum-norm solution; infinitely many others exist.
  • Chapter 14 (conditioning): small $\sigma_{\min}$ implies large condition number $\kappa = \sigma_{\max}/\sigma_{\min}$, amplifying errors.
  • Chapter 9 (positive definite): when $n < d$, $X^\top X$ is singular (not full rank), so Cholesky fails; regularization makes it PD.
  • Modern ML: wide networks, embeddings, and attention all exhibit null-space structure; implicit regularization via SGD dynamics selects among equivalent solutions.

Comments