Example number
55
Example slug
example_55_rank_null_space_many_parameters_same_predictor
Background

Historical context: The concepts of rank and null space matured in linear algebra over the 19th-20th centuries, culminating in the rank-nullity theorem: $\text{rank}(X) + \text{nullity}(X) = d$. The theorem clarifies that as rank increases, the null space shrinks, and vice versa. The SVD (1870s-1960s) provided a computational method to reveal rank and null-space structure. In statistics and machine learning, underdetermined systems ($d > n$) became prevalent with high-dimensional data (text, images, genomics). Ridge regression (Tikhonov 1960s) and Lasso (Tibshirani 1996) address the infinite-solution problem by adding penalty terms. Modern deep learning operates in highly overparameterized regimes ($d \gg n$), where understanding the null space is crucial for explaining implicit regularization via SGD and the implicit bias toward certain solutions.

Mathematical characterization: The null space $\ker(X) = \{z : Xz = 0\}$ consists of all vectors that the matrix maps to zero. The rank-nullity theorem states: $\text{rank}(X) + \dim(\ker(X)) = d$. When $d > n$, the rank is at most $n$ (limited by the number of rows), so $\dim(\ker(X)) \ge d - n > 0$ (non-trivial null space). Any weight vector $w$ can be decomposed as $w = w_\parallel + w_\perp$ where $w_\parallel \in \text{row}(X)$ (row space) and $w_\perp \in \ker(X)$ (null space). Predictions depend only on $w_\parallel$: $Xw = X(w_\parallel + w_\perp) = Xw_\parallel$.

Prevalence in ML: Deep learning, compressed sensing, inverse problems, and NLP all operate in underdetermined regimes. Understanding the null space explains why multiple weight configurations produce identical training predictions, why regularization is necessary for unique solutions, and why implicit biases of optimizers matter. The null space is also the source of non-identifiability in statistical modeling—parameters that can’t be uniquely estimated from data.

Purpose

Show that weight vectors in the null space of $X$ don’t affect predictions—a fundamental property of underdetermined linear systems ($d > n$). Demonstrate that infinite solutions exist for underdetermined regression, yet they all produce identical predictions if they differ only by null-space components. Build intuition for why regularization is essential: without it, multiple weight vectors fit training data perfectly but may differ dramatically in magnitude or sparsity. Emphasize that the null space characterizes non-identifiable directions—parameter directions that can’t be distinguished from training data alone. Clarify why implicit regularization (e.g., SGD) and explicit regularization (L2, L1) both work by selecting from the infinite solution set.

Problem

Create X with n<d, fit least squares to y, then add a null-space direction z and verify predictions do not change.

Solution (Math)

When $d> \mathrm{rank}(X)$, null space $\mathcal{N}(X)$ is nontrivial. For any $z\in\mathcal{N}(X)$, $X(w+z)=Xw$, so parameter vectors differing by $z$ define the same predictor on the training inputs.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)
import numpy as np
rng = np.random.default_rng(1055)
n, d = 6, 12
X = rng.normal(size=(n,d))
w_true = rng.normal(size=d)
y = X@w_true
w0 = np.linalg.lstsq(X, y, rcond=None)[0]
U,S,Vt = np.linalg.svd(X, full_matrices=False)
z = Vt[-1]
z = z / np.linalg.norm(z)
pred0 = X@w0
pred1 = X@(w0 + 10.0*z)
print("min singular value:", S.min())
print("||pred1 - pred0||_2:", np.linalg.norm(pred1 - pred0))
Code Introduction

This code demonstrates that weight vectors in the null space of $X$ don’t affect predictions—a fundamental property of underdetermined linear systems. It reveals why regularization and rank deficiency matter: infinite solutions exist for underdetermined regression ($d > n$), but they all produce identical predictions if they differ only by null-space components.

Part 1: Underdetermined Regression Setup. The code generates a synthetic underdetermined regression problem with $X \in \mathbb{R}^{6 \times 12}$ ($d > n$). A null space exists with dimension at least $12 - 6 = 6$. The least-squares solution w0 = lstsq(X, y) finds one weight vector minimizing $\|y - Xw\|_2^2$. Since $X$ has more columns than rows, the column space is non-trivial: there exist vectors $z \neq 0$ such that $Xz = 0$ (null-space vectors). Any such $z$ represents a direction in weight space that changes $w$ without changing predictions.

Part 2: Null Space via SVD. The SVD factorization $X = U\Sigma V^\top$ decomposes the column space and null space structure. The code extracts the last row of $V^\top$ via z = Vt[-1] (a null-space vector). Since $Xz = U\Sigma V^\top z = 0$ (zero singular values annihilate the component of $V^\top z$ in the null-space directions), this $z$ satisfies $Xz = 0$. Normalizing z = z / ||z|| ensures unit norm for interpretable perturbation magnitude.

Part 3: Null-Space Independence of Predictions. The code constructs two weight vectors: $w_0$ (lstsq solution) and $w_1 = w_0 + 10.0 z$ (perturbed by a large null-space component). The predictions are $\text{pred}_0 = X w_0$ and $\text{pred}_1 = X(w_0 + 10.0 z) = X w_0 + 10.0 X z = X w_0 + 0 = \text{pred}_0$. Despite weights differing by magnitude $10 \|z\|$, predictions remain identical. The code verifies this by printing ||pred1 - pred0||_2, which should be near machine precision zero ($\sim 10^{-15}$), not $10 \|z\|$ as one might naively expect. This illustrates prediction-null-space invariance: infinitely many weight vectors produce identical predictions—they differ only in the null space, which doesn’t affect output. Understanding this property is crucial for regularization: without additional constraints, the solution is non-unique; regularization selects a unique solution by penalizing weights, preferring smaller norms, sparsity, or other criteria. Implicit regularization via SGD converges to the minimum-norm solution—the null-space component is never activated.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Create underdetermined system: Generate $X \in \mathbb{R}^{6 \times 12}$ ($n < d$); create $y$ from a random true weight vector.
  2. Solve least-squares: w0 = lstsq(X, y) finds one solution (the minimum-norm solution).
  3. Compute SVD: U, S, Vt = svd(X, full_matrices=False) reveals singular values and right singular vectors.
  4. Extract null-space vector: The last row z = Vt[-1] is a null-space vector (corresponds to zero singular value).
  5. Normalize null-space vector: z = z / ||z|| ensures unit norm for interpretable perturbation magnitude.
  6. Perturb weights: Create w1 = w0 + 10.0 * z (add null-space component with magnitude 10).
  7. Compute predictions: pred0 = X @ w0 and pred1 = X @ w1 (should be identical).
  8. Verify invariance: Check ||pred1 - pred0||_2 ≈ 0 (should be machine precision zero).
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Null space is prediction-neutral: Adding a null-space vector to weights doesn’t change predictions ($Xz = 0 \Rightarrow X(w + z) = Xw$).
  • Underdetermined systems have infinite solutions: With $d > n$, infinite weight vectors fit training data equally well.
  • SVD reveals null-space structure: Right singular vectors corresponding to zero singular values span the null space.
  • Prediction invariance: Even large null-space perturbations produce zero change in predictions (verified numerically).
  • Rank-nullity decomposition: Weight space decomposes into row space (affects predictions) and null space (doesn’t affect predictions).
  • Non-identifiability: Parameters are non-identifiable in null-space directions—can’t be uniquely estimated from data.
  • Regularization necessity: Without regularization, the solution is non-unique; regularization selects from the infinite set based on additional criteria (norm, sparsity).
  • Implicit regularization: SGD converges to the minimum-norm solution even without explicit regularization—the null space explains why certain solutions are preferred.
Notes

Part 1: Underdetermined Regression Setup - $X \in \mathbb{R}^{6 \times 12}$: more features ($d=12$) than examples ($n=6$); rank at most 6. - $y = X w_\text{true}$: target generated from true weights (arbitrary choice for illustration). - lstsq(X, y) returns the minimum-norm solution: $\hat{w} = \arg\min_w \{\|y - Xw\|_2^2, \|w\|_2\}$. - Infinite solutions fit the training data equally well; they form an affine subspace: $\{w : Xw = y\} = w_0 + \ker(X)$.

Part 2: Null Space via SVD - SVD: $X = U \Sigma V^\top$ with $U \in \mathbb{R}^{6 \times 6}$, $\Sigma$ diagonal, $V^\top \in \mathbb{R}^{12 \times 12}$. - Rank = number of nonzero singular values = 6 (in this case). - Null space = span of right singular vectors with zero singular values = last 6 rows of $V^\top$. - Last row z = Vt[-1] is a null-space vector: $Xz = U\Sigma V^\top z = 0$.

Part 3: Null-Space Independence of Predictions - $Xz = 0 \Rightarrow X(w_0 + \alpha z) = Xw_0 + \alpha Xz = Xw_0$ for any scalar $\alpha$. - pred1 = X(w0 + 10*z) = X*w0 = pred0 (predictions identical despite $\|w_1 - w_0\|_2 = 10 \|z\|_2 \ne 0$). - Numerical verification: ||pred1 - pred0||_2 ≈ 0 (machine precision, not $10 \|z\|_2$).

Why This Matters for ML - Regularization motivation: Infinite solutions exist for underdetermined systems; regularization (L2, L1) selects one by minimizing a penalty. - Non-identifiability: Parameters are non-identifiable in null-space directions—can’t distinguish them from training data. - Implicit regularization: SGD converges to the minimum-norm solution; the null space explains why this solution is preferred (bias of the algorithm). - Generalization: While all solutions fit training data identically, null-space perturbations can cause poor generalization on test data (large weight norm).

ML Examples and Patterns - Deep learning: Neural networks are severely overparameterized; the null space is enormous, explaining why regularization/initialization matter. - Sparse vs. dense solutions: L1 (Lasso) selects sparse solutions; L2 (Ridge) selects dense but small-norm solutions—both from the same infinite set. - Feature redundancy: Linearly dependent columns create null-space directions; PCA/whitening can remove them. - Implicit regularization via SGD: In overparameterized models, SGD converges to the minimum-norm solution—null space is ignored (has zero norm).

Connection to Linear Algebra Theory - Rank-nullity theorem: $\text{rank}(X) + \dim(\ker(X)) = d$; with $d=12$, $\text{rank}(X)=6$, $\dim(\ker(X))=6$. - Orthogonal decomposition: $\mathbb{R}^d = \text{row}(X) \oplus \ker(X)$ (direct sum); any $w$ uniquely decomposes as $w = w_\parallel + w_\perp$. - Affine subspace of solutions: All solutions to $Xw = y$ form $w_0 + \ker(X)$ (an affine subspace parallel to null space). - SVD characterization: $\ker(X) = \text{span}(v_{r+1}, \ldots, v_d)$ where $v_i$ are columns of $V$ and $r = \text{rank}(X)$.

Numerical and Implementation Notes - Full vs. reduced SVD: With full_matrices=False, reduced SVD is efficient; null space comes from rows of $V^\top$ beyond rank. - Singular value magnitude: S.min() indicates conditioning; small values signal near-rank-deficiency (approximate null space). - Matrix rank computation: np.linalg.matrix_rank(X, tol=...) uses singular values; rank = number of singular values $>$ tolerance. - Null-space basis: Last $d - \text{rank}(X)$ rows of $V^\top$ form orthonormal basis of null space; any linear combination is also in null space.

Numerical and Shape Notes - $X \in \mathbb{R}^{6 \times 12}$ (6 rows, 12 columns). - $U \in \mathbb{R}^{6 \times 6}$ (with full_matrices=False). - $V^\top \in \mathbb{R}^{12 \times 12}$ (full size, even with reduced SVD). - z = Vt[-1] has shape (12,) (null-space vector in $\mathbb{R}^{12}$). - pred0, pred1 have shape (6,) (predictions for 6 examples).

Pedagogical Significance - Core principle: Null space characterizes weight adjustments that preserve predictions; infinitely many solutions produce identical training predictions. - Regularization necessity: Without additional constraints, the solution is non-unique; regularization (explicit or implicit via SGD) selects based on norm/sparsity/other criteria. - Generalization trade-off: Large weights in null-space directions don’t affect training loss but can hurt generalization; regularization prevents this. - Foundation for overparameterized models: Modern deep learning operates with massive null spaces; understanding them explains implicit regularization, double descent, and benign overfitting.

History and Applications

The rank-nullity theorem emerged in 19th-century linear algebra, formalizing the relationship between rank and null-space dimension. The SVD (1870s-1960s) provided computational methods to compute rank and null-space structure. In classical statistics, non-identifiability (parameters non-identifiable in null-space directions) has been a known issue for decades; Bayesian methods address it via priors. Ridge regression (Tikhonov 1960s) and Lasso (Tibshirani 1996) explicitly handle underdetermined systems by adding penalties that select from the infinite solution set. Modern deep learning operates in severely overparameterized regimes ($d \gg n$) where the null space is enormous. Implicit regularization via SGD (discovered in 2015-2018, formalized by Gunasekar et al., others) shows that gradient descent converges to the minimum-norm solution even without explicit regularization—the null space is avoided due to algorithmic bias. The double descent phenomenon (Belkin et al. 2019) reveals that test error exhibits a U-shape as overparameterization increases; understanding happens around the interpolation threshold ($d \approx n$) where the null space transitions from empty to large. Understanding the null space is now central to modern ML theory: it explains generalization in overparameterized models, implicit biases of optimizers, and why regularization works despite (or because of) the infinite-solution regime.

Connection to Broader Examples
  • Chapter 2 (Span/Linear combinations): Null space is the orthogonal complement of the row space; together they span all of $\mathbb{R}^d$.
  • Chapter 3 (Basis/Dimension): Null-space dimension = $d - \text{rank}(X)$; captured by rank-nullity theorem.
  • Chapter 4 (Linear maps): Null space characterizes vectors that map to zero; row space characterizes the image direction.
  • Chapter 6 (Projections): Row space and null space are orthogonal complements; projecting $w$ onto row space removes null-space component.
  • Chapter 10 (SVD): SVD directly reveals rank, null space (via right singular vectors with zero singular values), and row space.
  • Chapter 12 (Least-squares): Minimum-norm solution is unique; null space explains why other solutions exist.
  • Chapter 13 (Solving systems): Underdetermined systems have infinite solutions; null space characterizes the solution set.
  • Chapter 14 (Conditioning): Small singular values indicate proximity to rank deficiency (approximate null space).
  • Chapter 16 (Matrix products): Predictions $Xw$ depend only on row-space component; null-space component is annihilated.
  • Deep learning: Overparameterized networks operate with large null spaces; understanding them explains generalization and implicit regularization.

Comments