When $X \in \mathbb{R}^{n\times d}$ is wide ($n<d$), $X$ cannot have full column rank. The null space $\mathcal{N}(X)$ has dimension at least $d-n$, so there are infinitely many minimizers of $\|y - Xw\|_2^2$. The minimum-norm solution is $w_0 = X^\dagger y$ (MooreâPenrose pseudoinverse), which np.linalg.lstsq returns for underdetermined systems. Any $w_0 + z$ with $z \in \mathcal{N}(X)$ gives the same predictions because $Xz=0$. The right-singular vectors of $X$ corresponding to zero (or tiny) singular values span $\mathcal{N}(X)$, making SVD a natural tool to exhibit these directions. Understanding null space explains parameter drift, implicit bias, and the need for regularization to select among infinitely many fits.
- Log in to post comments
Explain why overparameterized linear models ($n<d$) have infinitely many solutions and how the null space governs non-identifiability. Show that least-squares returns one representative $w_0$, yet any $w_0 + z$ with $z \in \mathcal{N}(X)$ yields identical predictions. Build intuition for implicit bias (minimum-norm solutions), regularization, and why parameters can move without changing training loss.
Construct n<d, fit w0, add a null-space vector z, and verify predictions are unchanged.
If $z\in\mathcal{N}(X)$, then $Xz=0$. Hence $X(w_0+z)=Xw_0$: infinitely many parameters yield identical predictions on the training inputs.
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
import numpy as np
rng = np.random.default_rng(1039)
n, d = 6, 12
X = rng.normal(size=(n,d))
w_true = rng.normal(size=d)
y = X@w_true
w0 = np.linalg.lstsq(X, y, rcond=None)[0]
U,S,Vt = np.linalg.svd(X, full_matrices=False)
z = Vt[-1]; z /= np.linalg.norm(z)
pred0 = X@w0
pred1 = X@(w0 + 5.0*z)
print("min singular value:", S.min())
print("||pred1-pred0||:", np.linalg.norm(pred1-pred0))This code illustrates non-uniqueness of least-squares solutions in an overparameterized linear system: adding any null-space vector to a least-squares solution leaves predictions unchanged because $Xz=0$).
Numerical and Shape Notes
- Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
- Axis discipline: Be explicit with
axisin reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1. - Broadcasting: Check that broadcasts are intended (e.g.,
(n,1)with(n,d)). Prefer reshape/expand-dims to make semantics clear. - Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
- Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
- Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
- Sanity checks: Print shapes and residuals (e.g.,
||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.
Numerical and Implementation Notes
- Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use
solve,lstsq, Cholesky/QR/SVD. - Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (
axis) and ensure broadcasts are intended. - Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
- Conditioning: Inspect
np.linalg.cond(A)when solutions look unstable; regularize (ridge) or rescale features to improve conditioning. - Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g.,
||Ax-b||, reconstruction errors) and assert finiteness. - Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
- Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
- Sanity checks: Compare against references (e.g.,
lstsqvs.solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
- Set shapes: choose $n=6$, $d=12$ to ensure $n<d$ (wide design matrix).
- Sample data:
X = rng.normal(size=(n,d)); true weightsw_true = rng.normal(size=d); labelsy = X@w_true. - Fit least squares:
w0 = np.linalg.lstsq(X, y, rcond=None)[0]returns the minimum-norm interpolating solution. - Compute SVD:
U, S, Vt = np.linalg.svd(X, full_matrices=False); the last row ofVtis a smallest-singular-value direction. - Extract null vector:
z = Vt[-1]; z /= np.linalg.norm(z)to normalize a null-space direction (or near-null if $\sigma_{\min} > 0$). - Perturb weights: form $w_1 = w_0 + 5 z$ to move along the null space.
- Verify invariance: compare
pred0 = X@w0andpred1 = X@w1;np.linalg.norm(pred1 - pred0)should be near zero. InspectS.min()to gauge rank deficiency.
Pedagogical Significance
- Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
- ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
- Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
- Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
- Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
- Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.
ML Examples and Patterns
- Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
- Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
- Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
- Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
- Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
- Non-identifiability in overparameterized linear systems: infinitely many $w$ achieve zero training error.
- Null space characterization: if $z \in \mathcal{N}(X)$, then $Xz = 0$ and $X(w_0 + z) = Xw_0$.
- SVD-based construction: the right-singular vector associated with the smallest singular value spans a null-space direction.
- Prediction invariance: adding scaled null-space components leaves $\hat{y}$ unchanged.
- Minimum-norm bias:
lstsqreturns the minimum-norm solution among all interpolating solutions. - Shape discipline: $X \in \mathbb{R}^{n\times d}$ with $n<d$ guarantees a nontrivial null space.
Part 1: Problem Setup and Solution Set
With $n<d$, the minimizers of $\|y - Xw\|_2^2$ form an affine space $w_0 + \mathcal{N}(X)$. The pseudoinverse solution $w_0 = X^\dagger y$ is the minimum-norm element of this set.
Part 2: Null Space Direction
Any $z \in \mathcal{N}(X)$ satisfies $Xz = 0$, so $X(w_0 + z) = Xw_0$. The SVD $X = U\Sigma V^\top$ reveals $z$ as a right-singular vector associated with $\sigma_{\min}$ (zero or tiny), giving a constructive way to move within the solution set.
Part 3: Numerical Verification as Diagnostic
Check $\|X(w_0 + \alpha z) - Xw_0\|_2$; if $\sigma_{\min} \approx 0$, this should be near machine precision. When $\sigma_{\min}$ is small but nonzero, invariance degrades proportionally to $|\alpha|\,\sigma_{\min}$.
Why This Matters for ML
Overparameterized models (wide linear layers, deep nets) often interpolate training data, leaving many parameter vectors with identical training loss. Implicit bias (gradient descent with small initialization) tends toward minimum-norm or low-complexity solutions. Null-space intuition explains parameter drift without loss change and motivates regularization to fix directions unseen by the data.
Numerical and Implementation Notes
Use SVD-based lstsq instead of forming $X^\top X$. Inspect S.min() as a quick condition diagnostic. Normalize null-space vectors before scaling; large scalings leave predictions unchanged but can explode parameter norms. Near-null directions ($\sigma_{\min} > 0$ but tiny) mean invariance is approximate.
Numerical and Shape Notes
Shapes: $X (n\times d)$, $w (d)$, $y (n)$, $\mathcal{N}(X) \subset \mathbb{R}^d$. With $n<d$, expect at least $d-n$ null directions. For multiple outputs $Y \in \mathbb{R}^{n\times k}$, the null space acts column-wise on $W \in \mathbb{R}^{d\times k}$. Always verify $X@z \approx 0$ before assuming invariance; floating-point noise can make $\|Xz\|_2$ small but nonzero.
Non-uniqueness in linear least squares traces back to Gauss and Legendre, but the modern resolution comes from the MooreâPenrose pseudoinverse (Penrose, 1955) and SVD-based algorithms (Golub & Kahan, 1965), which reveal null-space directions robustly. In statistics, identifiability and regularization (ridge/Tikhonov) govern which solution within $w_0 + \mathcal{N}(X)$ is chosen. In machine learning, overparameterized linear layers and deep networks share the same geometry: many weight configurations yield zero training loss, and optimization dynamics plus implicit/explicit regularizers (weight decay, early stopping) select minimum-norm or low-complexity solutions. Null-space awareness informs pruning, low-rank factorization, double-descent behavior, and the stability of predictions under parameter drift.
- Vector spaces (Ch. 1): Null space is a subspace of $\mathbb{R}^d$; column space is a subspace of $\mathbb{R}^n$.
- Span (Ch. 2): $\mathcal{N}(X)$ is the span of right-singular vectors with zero (or tiny) singular values.
- Basis and dimension (Ch. 3): $\dim \mathcal{N}(X) = d - \operatorname{rank}(X)$ counts the degrees of non-identifiability.
- Linear maps (Ch. 4): $X$ is a linear map; its kernel is $\mathcal{N}(X)$.
- Projections (Ch. 6): Least squares projects $y$ onto $\text{col}(X)$; null-space moves are orthogonal to the row space.
- Rank-nullity (Ch. 7): Rank plus nullity equals $d$; this example exhibits the nullity side explicitly.
- Conditioning (Ch. 14): Tiny singular values reveal near-null directions, causing instability in $w$ though predictions remain stable.
Comments