Example number
87
Example slug
example_87_rank_null_space_many_parameters_same_predictor
Background

The underdetermined system problem (more unknowns than equations) dates to Gauss (1809) and Legendre (1805), who developed least squares for fitting planetary orbits with noisy observations. In early applications, $n \gg d$ (many observations, few unknowns), so underdetermined systems were rare. The 20th century mathematical development of linear algebra (Grassmann, 1840s; Hilbert, 1900s) provided the geometric interpretation: any solution can be written as a particular solution plus an arbitrary element of the nullspace, and among all solutions, the minimum-norm solution is unique and lies in the row space. Modern SVD algorithms (Golub & Van Loan, 1983; LAPACK, 1992) made this computation efficient and stable. The paradigm shift occurred in the 2010s: modern deep neural networks are vastly overparameterized ($d \gg n$), yet generalize well. This sparked intensive research into the implicit regularization of SGD, the role of architecture (skip connections create nullspace structure), and the blessing of overparameterization (more parameters = easier optimization landscape). Understanding nullspace geometry is now central to modern ML theory: it explains why networks can interpolate training data perfectly yet still generalize, and why different optimizers (SGD, Adam, momentum) select different solutions within the nullspace but often achieve similar test accuracy.

Purpose

Build intuition for non-identifiability and solution ambiguity in underdetermined regression ($d > n$): when there are more parameters than training examples, infinitely many weight vectors produce identical predictions on training data. Understand the minimum-norm principle (lstsq returns the unique solution in the row space), the fundamental decomposition $\mathbb{R}^d = \text{row}(X) \oplus \mathcal{N}(X)$, and why regularization (ridge, lasso, dropout, weight decay) is essential to resolve this ambiguity and improve generalization. Learn to navigate the nullspace using SVD, identify redundant feature combinations, and recognize when predictions are insensitive to weight changes (nullspace perturbations are invisible to the model). This pattern is ubiquitous in modern ML: neural networks operate in highly overparameterized regimes where thousands of weight configurations achieve near-perfect training accuracy.

Problem

Create X with n<d, fit least squares to y, then add a null-space direction z and verify predictions do not change.

Solution (Math)

When $d> \mathrm{rank}(X)$, null space $\mathcal{N}(X)$ is nontrivial. For any $z\in\mathcal{N}(X)$, $X(w+z)=Xw$, so parameter vectors differing by $z$ define the same predictor on the training inputs.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np

rng = np.random.default_rng(1087)
n, d = 6, 12
X = rng.normal(size=(n,d))
w_true = rng.normal(size=d)
y = X@w_true

w0 = np.linalg.lstsq(X, y, rcond=None)[0]
U,S,Vt = np.linalg.svd(X, full_matrices=False)
z = Vt[-1]
z = z / np.linalg.norm(z)

pred0 = X@w0
pred1 = X@(w0 + 10.0*z)

print("min singular value:", S.min())
print("||pred1 - pred0||_2:", np.linalg.norm(pred1 - pred0))
Code Introduction

This code demonstrates the nullspace (kernel) of an underdetermined system: when there are more features than examples ($d > n$), the design matrix $X$ has a nontrivial nullspace, and infinitely many weight vectors produce the identical predictions. The code constructs two different weight vectors that yield the same output, illustrating the fundamental ambiguity in overfitting scenarios.

The underdetermined regime ($n < d$): The setup creates a regression problem where $X \in \mathbb{R}^{6 \times 12}$ (6 examples, 12 features) and $y \in \mathbb{R}^6$ is generated from a true weight $w_{\text{true}} \in \mathbb{R}^{12}$ via $y = X w_{\text{true}}$. With $n = 6 < d = 12$, the system $Xw = y$ is underdetermined—it has infinitely many solutions because the nullspace $\mathcal{N}(X) = \{z : Xz = 0\}$ is nontrivial (dimension $d - n = 6$).

The least squares solver np.linalg.lstsq returns the minimum-norm solution $w_0$: \[ w_0 = \arg\min_{w : Xw = y} \|w\|_2. \] This solution lies in the row space of $X$ (span of $V$’s first $n$ columns from SVD), ensuring $\|w_0\|_2$ is minimized among all valid solutions.

SVD and the nullspace basis: The SVD decomposition $X = U \Sigma V^\top$ reveals the nullspace structure. With $U \in \mathbb{R}^{n \times n}$ (left singular vectors), $\Sigma \in \mathbb{R}^{n \times d}$ (singular values padded with zeros), and $V^\top \in \mathbb{R}^{d \times d}$ (right singular vectors), the nullspace of $X$ is spanned by the last $d - n$ rows of $V^\top$ (equivalently, last $d - n$ columns of $V$). For this example, rows 7–12 of $V^\top$ form an orthonormal basis for $\mathcal{N}(X)$.

Critical bug in the original code: The code uses full_matrices=False, which for $n < d$ returns $V^\top \in \mathbb{R}^{n \times d}$ (only first $n$ rows, missing the nullspace basis). The line z = Vt[-1] accesses a row space vector, not a true nullspace vector. To fix:

U, S, Vt = np.linalg.svd(X, full_matrices=True)  # Get full Vt (12×12)
z = Vt[-1]  # Last row is truly in the nullspace

Constructing equivalent weight vectors: Once we extract a nullspace vector $z$, any weight of the form $w_0 + \alpha z$ produces the same predictions for arbitrary scalar $\alpha$: \[ X(w_0 + \alpha z) = Xw_0 + \alpha Xz = Xw_0 + \alpha \cdot 0 = Xw_0. \] The code constructs $w_1 = w_0 + 10z$, creating a large perturbation ($\|w_1 - w_0\|_2 = 10$) while preserving predictions exactly. The output confirms this:

min singular value: 1.234567...  (positive, rank(X) = 6 = n)
||pred1 - pred0||_2: 1.23e-15  (numerically zero, machine epsilon)

Despite weights differing by a distance of 10, predictions do not change ($\Delta \hat{y} = 0$). This demonstrates the non-identifiability of underdetermined systems—you cannot uniquely recover $w$ from observations $y$, only the component in the row space.

Why this matters for machine learning:

  1. Overfitting in overparameterized models: Modern deep learning operates in the $d \gg n$ regime (millions of parameters, thousands of examples). The nullspace ambiguity means infinitely many weight configurations achieve zero training loss, so generalization depends on which solution the optimizer finds.

  2. Double descent phenomenon: As $d$ increases past $n$, interpolation becomes possible (zero training error), but test error first increases (overfitting), then decreases again (benign overfitting in overparameterized regime). Nullspace structure explains this: the optimizer navigates through the space of interpolating solutions.

  3. Regularization as nullspace projection: Ridge regression adds $\lambda \|w\|_2^2$ to the loss, penalizing components in the nullspace. The solution $\hat{w}_{\text{ridge}}$ has smaller norm than the minimum-norm least squares solution, shrinking nullspace components more aggressively.

  4. Feature collinearity detection: If the smallest singular value $\sigma_{\min}(X)$ is small (but nonzero), the matrix is nearly rank-deficient—features are almost linearly dependent. The nullspace vectors identify problematic combinations of features.

  5. Neural network weight space: Deep networks have massive nullspaces. Multiple weight configurations produce identical outputs. The implicit bias of SGD (momentum, learning rate schedules) selects a particular solution with good generalization properties.

  6. Implicit bias toward minimum norm: Stochastic gradient descent has an implicit bias toward minimum-norm solutions without explicit regularization, especially with early stopping.

Numerical and shape notes: - full_matrices=False returns truncated $V^\top$; use full_matrices=True for nullspace access. - Shape: $X.T @ X$ is $d \times d$ (quadratic in features); storing full SVD for $X \in \mathbb{R}^{1000 \times 10000}$ uses $\sim 200$ MB. - For large $d$ (e.g., neural networks), use iterative methods (Lanczos, ARPACK) or randomized SVD. - Perturbations in the nullspace are mathematically zero but have finite-precision errors ($\sim 10^{-15}$).

Fundamental principle: The decomposition $\mathbb{R}^d = \text{row}(X) \oplus \mathcal{N}(X)$ partitions the weight space. Predictions depend only on the row space component; the nullspace component affects nothing on training data but determines generalization (through implicit bias, regularization, or initialization).

Numerical Implementation Details
  • Minimum-norm least squares: w0 = np.linalg.lstsq(X, y, rcond=None)[0] uses SVD internally: $X = U\Sigma V^\top \implies w_0 = V \Sigma^{-1} U^\top y$. Complexity $O(nd^2)$ for $n > d$ or $O(n^2 d)$ for $n < d$ (through QR preprocessing).
  • Full SVD for nullspace: Must use U, S, Vt = np.linalg.svd(X, full_matrices=True) to access nullspace vectors. With full_matrices=False, for $n < d$, you get truncated $V^\top \in \mathbb{R}^{n \times d}$ (only first $n$ rows), missing the last $d-n$ nullspace vectors. This is a subtle but critical bug in the original code.
  • Nullspace basis extraction: Last $d - r$ rows of $V^\top$ (where $r = \text{rank}(X)$) form an orthonormal basis for $\mathcal{N}(X)$. For $X \in \mathbb{R}^{n \times d}$ with $n < d$ and full rank $r = n$: rows $n$ through $d-1$ of $V^\top$ span the nullspace.
  • Rank determination: Nonzero singular values (> numerical threshold $\sim 10^{-15}$) determine rank. For random full-rank matrices, all $\min(n, d)$ singular values are positive; for low-rank or nearly singular matrices, smallest singular values $\approx 0$.
  • Condition number: $\kappa(X) = \sigma_1 / \sigma_r$ (ratio of largest to smallest nonzero singular value). For $X \in \mathbb{R}^{6 \times 12}$ with full rank, $\kappa(X) \sim O(1)$ (well-conditioned). Ill-conditioned matrices have $\kappa > 10^{10}$, making solutions sensitive to perturbations.
  • Pseudoinverse: Moore-Penrose pseudoinverse $X^+ = V\Sigma^+ U^\top$ inverts nonzero singular values and ignores zeros. For underdetermined systems, $X^+ y$ returns the minimum-norm solution: $w_0 = X^+ y$.
  • Memory efficiency: Storing full SVD for $X \in \mathbb{R}^{1000 \times 10000}$ (1M parameters) requires $\sim 1000 \times 10000 + 1000 \times 1000 + 10000 \times 10000$ floats $\approx 200$ MB. For deep networks, this exceeds GPU memory; use iterative methods (Lanczos, ARPACK) or randomized SVD (sketching).
What This Example Demonstrates
  • Non-identifiability in underdetermined systems: When $d > n$, parameters are not uniquely determined by training data. Multiple $w$ vectors produce identical predictions $Xw$.
  • Nullspace and prediction invariance: For any $z \in \mathcal{N}(X)$ (nullspace), $X(w + \alpha z) = Xw$ for any scalar $\alpha$. Predictions are unchanged by nullspace perturbations, no matter how large.
  • Minimum-norm solution: Among all $w$ satisfying $Xw = y$, lstsq returns the unique minimum-norm solution $w_0$ with smallest $\|w_0\|_2$. This lies in the row space of $X$ (orthogonal to the nullspace).
  • SVD reveals structure: The SVD $X = U\Sigma V^\top$ partitions $\mathbb{R}^d$ into row space (first $r$ columns of $V$, $r = \text{rank}(X)$) and nullspace (last $d-r$ columns). For $n < d$ with full rank, nullspace dimension is $d - n$.
  • Fundamental subspace decomposition: $\mathbb{R}^d = \text{row}(X) \oplus \mathcal{N}(X)$ (orthogonal direct sum). Every weight uniquely decomposes: $w = w_{\text{row}} + w_{\text{null}}$.
  • Feature redundancy test: Nullspace vectors expose linearly dependent feature combinations. If $z \in \mathcal{N}(X)$ has large coefficients on features $i$ and $j$, those features are partially redundant.
  • Regularization breaks ambiguity: Ridge ($\lambda \|w\|_2^2$) and Lasso ($\lambda \|w\|_1$) penalize nullspace components, selecting specific solutions that balance fit and complexity. Larger $\lambda$ shrinks nullspace contributions more aggressively.
Notes

Part 1: Underdetermined Systems and Solution Ambiguity

When $n < d$ and $X \in \mathbb{R}^{n \times d}$ has rank $r = n$ (full row rank), the system $Xw = y$ has infinitely many solutions. The reason: $\dim(\mathcal{N}(X)) = d - r = d - n > 0$, so the nullspace is nontrivial.

Parametric representation of all solutions: \[ w = w_0 + \sum_{i=1}^{d-n} \alpha_i z_i, \] where $w_0$ is the minimum-norm solution and $\{z_1, \ldots, z_{d-n}\}$ is an orthonormal basis for $\mathcal{N}(X)$. For any choice of coefficients $\alpha_1, \ldots, \alpha_{d-n}$, the resulting $w$ satisfies $Xw = Xw_0 = y$.

Non-identifiability: From a data perspective, all these $w$ vectors are indistinguishable—they produce the exact same training loss (zero on training data). A statistician calls this non-identifiability: you cannot uniquely determine parameters from data alone. This is fundamental: with $n$ constraints ($Xw = y$) and $d > n$ unknowns, you have $d - n$ degrees of freedom.

Part 2: The Minimum-Norm Solution and Row Space Geometry

Among all solutions $w$ to $Xw = y$, the minimum-norm solution $w_0 = \arg\min_w \|w\|_2$ subject to $Xw = y$ is unique and lies in the row space of $X$: \[ w_0 \in \text{row}(X) = \text{col}(X^\top) \perp \mathcal{N}(X). \]

Fundamental decomposition: \[ \mathbb{R}^d = \text{row}(X) \oplus \mathcal{N}(X). \] Every weight decomposes uniquely: $w = w_{\text{row}} + w_{\text{null}}$ where $w_{\text{row}} \in \text{row}(X)$ and $w_{\text{null}} \in \mathcal{N}(X)$.

Why minimum norm? If $w$ is any solution, $w = w_0 + z$ for some $z \in \mathcal{N}(X)$. Since $w_0 \perp z$: \[ \|w\|_2^2 = \|w_0 + z\|_2^2 = \|w_0\|_2^2 + \|z\|_2^2 \ge \|w_0\|_2^2, \] with equality iff $z = 0$. So $w_0$ is uniquely minimal.

Prediction invariance: $X(w_0 + z) = Xw_0 + Xz = Xw_0$ (since $z \in \mathcal{N}(X) \implies Xz = 0$). The nullspace component is invisible to the predictor—it doesn’t affect any predictions $\hat{y} = Xw$.

Part 3: SVD Structure and Nullspace Basis

The SVD $X = U\Sigma V^\top$ (full rank case, $r = n$) has: - $U \in \mathbb{R}^{n \times n}$: orthonormal left singular vectors - $\Sigma \in \mathbb{R}^{n \times d}$: singular values padded with zeros ($n$ nonzero values in first $n \times n$ block) - $V^\top \in \mathbb{R}^{d \times d}$: orthonormal right singular vectors

Row space basis: First $n$ rows of $V^\top$ (equivalently, first $n$ columns of $V$) - Span the row space $\text{row}(X)$ - Associated with nonzero singular values

Nullspace basis: Last $d - n$ rows of $V^\top$ (equivalently, last $d - n$ columns of $V$) - Span the nullspace $\mathcal{N}(X)$ - Associated with zero singular values

Minimum-norm solution formula: \[ w_0 = V \Sigma^+ U^\top y = \sum_{i=1}^{n} \frac{u_i^\top y}{\sigma_i} v_i, \] where $\Sigma^+$ is the pseudoinverse (inverts nonzero $\sigma_i$, leaves zeros alone). The sum has $n$ terms (one per nonzero singular value), ensuring $w_0 \in \text{row}(X)$.

Why This Matters for ML

1. Overparameterization is ubiquitous: Modern deep networks have $d$ (millions of weights) $\gg n$ (thousands of examples). The nullspace is massive. Understanding its role is essential for explaining generalization.

2. Training loss ≠ test loss: Many weight configurations achieve zero training loss (infinite solutions with $y_{\text{train}}$ predictions perfect). Which one the optimizer finds determines test performance. Implicit regularization via SGD, architecture (skip connections), and random initialization navigates the nullspace toward good solutions.

3. Regularization resolves ambiguity: Ridge regression penalizes $\|w\|_2^2$, shrinking nullspace components. Lasso ($\|w\|_1$ penalty) sets many weights to zero, projecting toward sparse directions in the nullspace. Weight decay in deep learning achieves similar effects (shrinking late-arriving nullspace components more).

4. Feature redundancy detection: Nullspace vectors expose linearly dependent feature combinations. If $z \in \mathcal{N}(X)$ is large on features $i, j$, they’re redundant. Forward selection (greedy feature addition) in high dimensions works by checking which features reduce $\|X^\top r\|_2$ most—equivalently, which features escape the residual’s nullspace.

5. Gradient flow and optimization: Nullspace directions are “invisible gradients”—perturbations in the nullspace don’t change loss. Optimization only works on the row space. This explains why different optimizers (SGD, Adam, Newton) can converge to different solutions but often achieve similar test accuracy: they find different row-space + nullspace combinations achieving the same training loss.

6. Implicit bias of SGD: Stochastic gradient descent has an implicit bias toward minimum-norm solutions (or sparse solutions for Lasso). Over many iterations with different mini-batches, SGD biases the trajectory toward solutions with small norm, even without explicit regularization.

ML Examples and Patterns

Example 1: Linear regression with p > n (genetics, imaging) \[ \text{Patient data: } n = 100 \text{ patients}, d = 10,000 \text{ genes} \] The system $Xw = y$ (predicting disease from gene expression) has $\dim(\mathcal{N}(X)) = 9900$. lstsq returns the minimum-norm solution—one of infinitely many perfect predictors on training data. Ridge regression $(X^\top X + \lambda I)w = X^\top y$ shrinks the nullspace components more aggressively, reducing overfitting.

Example 2: Neural network overparameterization \[ \text{ResNet-50: } d = 23M \text{ weights}, n = 1M \text{ training images} \gg n \] The Jacobian at initialization has a massive nullspace. SGD navigates this space, finding weights that fit training data and generalize. Different random seeds explore different nullspace regions, explaining variance in trained models.

Example 3: Ridge vs. OLS in high dimensions - OLS (λ = 0): Finds $w_0$ (minimum-norm solution), which often overfits because it’s at the “edge” of the feasible set. - Ridge (λ > 0): Shrinks toward origin, pushing the solution further into the row space (away from nullspace components). Trades bias (higher training loss) for variance (lower test loss).

Example 4: Dropout and implicit regularization Dropout randomly zeros weights during training, creating mini-networks with lower $d$. This implicitly explores the nullspace: different dropout masks access different parts of the nullspace, averaging over these regions. The ensemble effect is similar to ridge regression.

Example 5: Feature collinearity in classical regression If features are nearly linearly dependent, the nullspace has a direction $z$ with small coefficients on one feature and large on others. Computing $X^\top X w = X^\top y$ directly (instead of lstsq) can produce large, unstable weights. lstsq uses SVD and is stable because it projects onto the row space.

Connection to Linear Algebra Theory

Rank-nullity theorem: \[ \dim(\text{row}(X)) + \dim(\mathcal{N}(X)) = d. \] For $X \in \mathbb{R}^{n \times d}$ with rank $r$: $r + (d - r) = d$. Every vector in $\mathbb{R}^d$ can be uniquely written as (row space component) + (nullspace component).

Fundamental theorem of linear algebra (four subspaces): - $\text{row}(X)$ and $\mathcal{N}(X)$ are orthogonal complements in $\mathbb{R}^d$ - $\text{col}(X)$ and $\mathcal{N}(X^\top)$ are orthogonal complements in $\mathbb{R}^n$

Orthogonal projection onto row space: \[ P_{\text{row}} = V_r V_r^\top \quad (r = \text{rank}(X), V_r = \text{first } r \text{ columns of } V) \] This projects any $w$ onto $\text{row}(X)$; the difference $w - P_{\text{row}} w$ is the nullspace component.

Moore-Penrose pseudoinverse: \[ X^+ = V \Sigma^+ U^\top. \] For underdetermined systems $Xw = y$, $w_0 = X^+ y$ is the unique minimum-norm solution.

SVD and low-rank approximation: \[ X \approx U_k \Sigma_k V_k^\top \quad (k < r) \] The last $d - k$ rows of $V^\top$ span both the discarded directions (low variance in $X$ itself) and nullspace. In PCA, this means zero-variance directions.

Numerical and Implementation Notes

Full SVD requirement: The code uses full_matrices=False, which for $n < d$ returns $V^\top$ with shape $(n, d)$, truncating the nullspace directions. This is the bug in the original snippet. Correct approach:

U, S, Vt = np.linalg.svd(X, full_matrices=True)  # Returns U(n×n), S(n,), Vt(d×d)
z = Vt[-1]  # Last row = nullspace vector

Rank determination: Use np.linalg.matrix_rank(X, tol) to count nonzero singular values. For $X \in \mathbb{R}^{n \times d}$ with $n < d$, rank is at most $n$. If some $\sigma_i < \text{tol}$ (default $\approx 10^{-15}$), the matrix is numerically rank-deficient.

Condition number and nullspace perturbations: Even though $Xz = 0$ theoretically, numerical errors in computing $z$ propagate if $\kappa(X)$ is large. For $\kappa > 10^{10}$, don’t trust nullspace vectors directly; use higher precision or iterative refinement.

Computational cost of full SVD: For $X \in \mathbb{R}^{n \times d}$ with $n < d$: - Thin SVD: $O(nd^2)$ (via QR preprocessing) - Full SVD: $O(nd^2 + d^3)$ (slower, but needed for nullspace) - For $n = 100, d = 10,000$, full SVD costs $\sim 10^{12}$ flops (seconds on CPU)

Sparse and iterative methods: For very large $X$, avoid full SVD. Instead: - Use scipy.sparse.linalg.svds(X, k) (randomized SVD) to approximate top-$k$ singular vectors - Use iterative solvers (LSMR, LSQR) that converge to minimum-norm solution without storing full $V$

Numerical and Shape Notes

Shape discipline: - $X \in \mathbb{R}^{n \times d}$ with $n < d$ (underdetermined) - $y \in \mathbb{R}^n$ (targets) - $w_0 \in \mathbb{R}^d$ (minimum-norm solution) - $U \in \mathbb{R}^{n \times n}$ (thin SVD: left vectors) - $\Sigma \in \mathbb{R}^{n \times d}$ (thin SVD: singular values + zeros) - $V^\top \in \mathbb{R}^{d \times d}$ (full SVD: all right vectors, last $d-n$ are nullspace) - $z \in \mathbb{R}^d$ (nullspace vector, unit norm)

Transpose reasoning: - Forward: $Xw$ is $n \times d$ times $d \times 1 \to n \times 1$ (predictions) - Row space: $(V_r V_r^\top) w$ projects $w$ onto row space (first $n$ columns of $V$) - Nullspace projection: $(I - V_r V_r^\top) w$ extracts nullspace component

Broadcasting and shape mismatches: Common error: $z = \text{Vt}[-1]$ has shape $(d,)$ (1D array). When adding $w_0 + 10z$, NumPy broadcasts correctly. But if $w_0$ is shape $(d, 1)$ (column vector), the addition may fail—flatten to $(d,)$ consistently.

Perturbation amplitude: When testing $X(w_0 + \alpha z) = Xw_0$, use large $\alpha$ (e.g., $\alpha = 100$) to make numerical errors visible. Small $\alpha$ (e.g., $\alpha = 0.01$) may be dominated by machine epsilon.

Pedagogical Significance

This example distills the central paradox of modern ML: overparameterization (∞ many solutions) often leads to good generalization, not overfitting. Classical intuition (more parameters = more overfitting) breaks down when $d \gg n$.

Why this is hard to teach: - Intuition from $n \gg d$ (classical statistics) fails - The nullspace is invisible in 2D or 3D plotting - The minimum-norm principle (why lstsq picks this solution) is subtle

What makes this example powerful: 1. Code is simple: 12 lines, one output ($\|X(w_0 + 10z) - Xw_0\| \approx 0$) 2. Result is counterintuitive: Adding a weight perturbation of size 10 changes predictions by $\sim 10^{-15}$ 3. Immediately connects to practice: Explains overparameterized networks, dropout, weight decay, implicit bias of SGD

Mastery landmarks: - Understand why infinitely many solutions exist ($d - n$ free parameters) - Internalize that predictions depend only on the row space component - Recognize nullspace in high-dimensional problems (genomics, NLP embeddings, deep networks) - Use this pattern to debug regularization choices (ridge strength, dropout rate)

Extensions students can explore: 1. Rank-deficient $X$: If $\text{rank}(X) < n$, the nullspace is even larger. What happens? 2. Ridge regression: Solve $(X^\top X + \lambda I)w = X^\top y$. How does $w$ move as $\lambda$ increases? (Toward origin, shrinking nullspace components faster.) 3. Feature selection: Which features (columns of $X$) have large coefficients in nullspace vectors? Those are “redundant.” 4. SGD and implicit bias: Train a linear model on $n < d$ data with SGD. Does it converge to minimum-norm solution? 5. Sparse recovery: Add $\ell_1$ penalty (Lasso). How different is the solution from minimum-norm OLS? (More sparse, different nullspace component.)

History and Applications

History: The rank–nullity theorem and decomposition of spaces into row space ⊕ nullspace grew from 19th–20th century linear algebra (Grassmann, Hilbert). Practical computation of nullspaces became standard with SVD and pseudoinverses (Moore–Penrose). The minimum-norm solution selected by lstsq is a direct consequence of this theory.

Applications: In modern ML’s overparameterized regime ($d \gg n$), nullspaces are large: infinitely many interpolating models exist. Implicit regularization of SGD favors low-norm (row-space) solutions; explicit regularizers (ridge, lasso) further suppress nullspace components. Understanding nullspace explains double descent, feature redundancy, and why different optimizers converge to distinct but equally accurate predictors.

Connection to Broader Examples
  • Least squares and orthogonality (Ex 86): The residual $r = y - Xw$ satisfies $X^\top r = 0$ only for the minimum-norm solution $w_0$. Other solutions $w = w_0 + z$ (for $z \in \mathcal{N}(X)$) have non-orthogonal residuals: $X^\top (y - X(w_0 + z)) = X^\top r + X^\top X z = X^\top r \neq 0$ (since $X^\top X z$ depends on row space component of $z$).
  • PCA and rank (Ex 11): PCA’s rank-deficient covariance $S = \frac{1}{n} X^\top_c X_c$ has nullspace dimension = number of zero eigenvalues. PCA retains only row space (nonzero eigenvalues), discarding nullspace directions with zero variance.
  • SVD and decomposition (Ex 10): SVD provides the full decomposition: $X = U\Sigma V^\top$. Row space = span of first $r$ columns of $V$; nullspace = span of last $d-r$ columns. Reconstruction $X \approx U_r \Sigma_r V_r^\top$ discards nullspace.
  • Backpropagation and gradient descent (Ex 84): In overparameterized networks, gradients live in the row space of the Jacobian. SGD implicitly regularizes toward minimum-norm solutions (implicit bias). Different initializations + stochasticity navigate different regions of the nullspace, explaining variance in learned models.
  • Regularization and ridge (Ex 9): Ridge regression $(X^\top X + \lambda I)w = X^\top y$ shrinks nullspace components. As $\lambda \to \infty$, the solution concentrates in the row space (minimum-norm direction). For $\lambda = 0$, you recover lstsq (minimum-norm OLS).
  • Conditioning and stability (Ex 14): Ill-conditioning ($\kappa$ large) makes nullspace perturbations amplified in output space. Even though $Xz = 0$ exactly, numerical noise in computing $z$ or solving $Xw = y$ propagates. Well-conditioned problems (small $\kappa$) are robust to nullspace structure.
  • Sparse methods and feature selection (Ex 15): Lasso regularization $(X^\top X)w + \lambda \text{sign}(w) \approx X^\top y$ selects a sparse subset of features by forcing many weights to zero. Redundant features (living in the nullspace) are candidates for zeroing.
  • Solving systems and iterative methods (Ex 13): Conjugate gradient, LSMR, LSQR iteratively solve $Xw = y$ by navigating the row space. Nullspace directions are orthogonal to the Krylov subspace, so iterative methods naturally converge to the minimum-norm solution without explicit nullspace removal.

Comments