ex1.ai

Example number

Example slug

example_71_rank_null_space_many_parameters_same_predictor

Background

Rank and nullspace are foundational in linear algebra (Strang, Golub-Van Loan) and pervasive in modern ML. When $d > n$ (more features than examples), linear regression has infinitely many solutionsâthe solution space is an affine subspace $w_0 + \mathcal{N}(X)$. SVD reveals this structure explicitly: rows of $V^\top$ (with full_matrices=True) beyond the rank span the nullspace. In ML, underdetermined systems arise in overparameterized models (deep learning with $d \gg n$), feature selection (correlated features), and compressed sensing. Regularization (ridge, lasso) resolves non-uniqueness by selecting unique solutionsâridge picks the minimum-norm solution, lasso enforces sparsity. Understanding nullspaces is essential for interpreting weight ambiguity, diagnosing multicollinearity, and reasoning about implicit biases in gradient descent.

Purpose

Build deep intuition for rank deficiency and nullspace non-uniqueness in underdetermined linear systems. When features outnumber examples ($d > n$), least-squares solutions are non-unique: infinitely many weight vectors yield identical predictions. This example demonstrates how to extract nullspace vectors via SVD, verify prediction invariance, and understand why regularization is essential in high-dimensional ML. You should come away understanding the geometric structure of solution spaces and the role of minimum-norm solutions.

Problem

Create X with n<d, fit least squares to y, then add a null-space direction z and verify predictions do not change.

Solution (Math)

When $d > \text{rank}(X)$, the nullspace $\mathcal{N}(X) = \{z \in \mathbb{R}^d : Xz = 0\}$ is nontrivial. For any $z \in \mathcal{N}(X)$:

\[ X(w + z) = Xw + Xz = Xw, \]

so parameter vectors differing by $z$ define the same predictor on the training inputs. The least-squares problem:

\[ \min_w \|y - Xw\|_2^2 \]

has infinitely many solutions when $d > n$ and $X$ has full row rank $r = n$. The minimum-norm solution:

\[ w_0 = \arg\min_{w: Xw = y} \|w\|_2 \]

is unique and lies in the row space of $X$, orthogonal to $\mathcal{N}(X)$.

Notation: - $X \in \mathbb{R}^{n\times d}$: design matrix (rows = examples, $n=6$; cols = features, $d=12$) - $w \in \mathbb{R}^d$: weight vector - $y \in \mathbb{R}^n$: targets - $\|x\|_2$: Euclidean norm - $\mathcal{N}(X)$: nullspace of $X$, dimension $d - \text{rank}(X)$

Solution (Python)


import numpy as np

rng = np.random.default_rng(1071)
n, d = 6, 12
X = rng.normal(size=(n,d))
w_true = rng.normal(size=d)
y = X@w_true

w0 = np.linalg.lstsq(X, y, rcond=None)[0]
U,S,Vt = np.linalg.svd(X, full_matrices=False)
z = Vt[-1]
z = z / np.linalg.norm(z)

pred0 = X@w0
pred1 = X@(w0 + 10.0*z)

print("min singular value:", S.min())
print("||pred1 - pred0||_2:", np.linalg.norm(pred1 - pred0))

Code Introduction

This snippet demonstrates rank deficiency and nullspace non-uniqueness in underdetermined linear systems. With $n=6$ examples and $d=12$ features, the design matrix $X \in \mathbb{R}^{6\times 12}$ has at most rank 6, leaving a $(12-6)=6$-dimensional nullspace. Any vector in this nullspace can be added to a least-squares solution without changing predictions, illustrating the fundamental ambiguity when features outnumber examples.

The code generates a random design matrix $X$ and true weights $w_{\text{true}} \in \mathbb{R}^{12}$, then forms targets $y = X w_{\text{true}}$. Because $n < d$, the system $Xw = y$ has infinitely many solutions. np.linalg.lstsq(X, y, rcond=None) returns the minimum-norm solution $w_0 = \arg\min_{w: Xw = y} \|w\|_2$, which is unique and orthogonal to $\mathcal{N}(X)$.

Critical gotcha: The code uses full_matrices=False in the SVD call, which returns only $V^\top \in \mathbb{R}^{6\times 12}$ (the first 6 right singular vectors corresponding to non-zero singular values). For $d > n$, this cannot span the nullspace. The line z = Vt[-1] extracts the last row of the truncated $V^\top$, which is $v_6^\top$ (corresponding to the smallest singular value $\sigma_6 > 0$)âa row-space vector, not a nullspace vector. Consequently, $Xz \neq 0$, and adding $10z$ to $w_0$ will change predictions, yielding ||pred1 - pred0||_2 $\approx 10\sigma_{\text{min}}$ (non-zero).

To correctly extract a nullspace vector, use full_matrices=True to obtain $V^\top \in \mathbb{R}^{12\times 12}$. Rows 6-11 span $\mathcal{N}(X)$ (corresponding to the $d - r = 6$ zero singular values in the full decomposition). Extract via z = Vt_full[r] where $r = \text{rank}(X) = 6$. Then $Xz = 0$, and predictions are invariant: ||X@(w0 + 10*z) - X@w0||_2 $\approx 10^{-14}$ (machine precision).

Shape discipline: $X \in \mathbb{R}^{6\times 12}$, $w_0 \in \mathbb{R}^{12}$, $z \in \mathbb{R}^{12}$, $y \in \mathbb{R}^6$. With full_matrices=False: $U \in \mathbb{R}^{6\times 6}$, $S \in \mathbb{R}^6$, $V^\top \in \mathbb{R}^{6\times 12}$. With full_matrices=True: $V^\top \in \mathbb{R}^{12\times 12}$, rows 0-5 are row space, rows 6-11 are nullspace.

This example is pedagogically powerful because it reveals a common pitfall (full_matrices flag) and demonstrates the geometric structure of underdetermined systemsâa topic central to understanding regularization, overparameterization, and implicit biases in ML optimization.

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

Generate underdetermined system: $X \in \mathbb{R}^{6\times 12}$ (rank 6), $y = X w_{\text{true}}$ for random $w_{\text{true}} \in \mathbb{R}^{12}$.
Solve via np.linalg.lstsq(X, y, rcond=None): returns minimum-norm solution $w_0 \in \mathbb{R}^{12}$.
Critical: Use np.linalg.svd(X, full_matrices=True) to obtain $V^\top \in \mathbb{R}^{12\times 12}$; rows $r$ through $d-1$ (indices 6-11) span $\mathcal{N}(X)$.
Extract nullspace vector: z = Vt_full[r] (first nullspace vector), normalize: z /= np.linalg.norm(z).
Verify prediction invariance: compute $\|X(w_0 + 10z) - Xw_0\|_2 \approx 0$ (machine precision $\sim 10^{-14}$).
Code bug to avoid: The provided code uses full_matrices=False, so z = Vt[-1] extracts the last row-space vector (smallest singular value), not a nullspace vector. This causes ||pred1 - pred0||_2 to be non-zero ($\approx 10\sigma_{\text{min}}$). Corrected code requires full_matrices=True.

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

Underdetermined systems ($n < d$): When features outnumber examples, least-squares solutions are non-unique; infinitely many weight vectors yield identical predictions.
Nullspace extraction via SVD: Rows of $V^\top$ (with full_matrices=True) beyond rank $r$ span $\mathcal{N}(X)$; for $X \in \mathbb{R}^{6\times 12}$ with rank 6, the nullspace has dimension $12 - 6 = 6$.
Prediction invariance: Adding any $z \in \mathcal{N}(X)$ to the solution $w_0$ doesnât change predictions: $X(w_0 + \alpha z) = Xw_0$ for all $\alpha$.
Minimum-norm solution uniqueness: Among all solutions, lstsq returns the unique $w_0$ with smallest $\|w_0\|_2$, which lies in the row space and satisfies $\langle w_0, z \rangle = 0$ for all $z \in \mathcal{N}(X)$.
Critical SVD gotcha: full_matrices=False (default) only returns the first $r = \min(n,d)$ right singular vectors, which cannot span the nullspace when $d > n$. Use full_matrices=True to access nullspace vectors.
ML implications: Regularization (ridge/lasso) is essential when $d > n$ to select unique solutions and prevent overfitting; gradient descent implicitly navigates nullspaces in overparameterized networks.

Notes

Part 1: Constructing the Underdetermined System. With $n=6$ examples and $d=12$ features, $X \in \mathbb{R}^{6\times 12}$ has at most rank 6, leaving a $(12-6)=6$-dimensional nullspace. The system $Xw = y$ has infinitely many solutions: $w = w_0 + z$ for any $z \in \mathcal{N}(X)$. np.linalg.lstsq returns the minimum-norm solution $w_0 = \arg\min_{w: Xw=y} \|w\|_2$, which is unique and orthogonal to $\mathcal{N}(X)$.
Part 2: Extracting a Nullspace Vector via SVD. SVD $X = U\Sigma V^\top$ decomposes $X$ into orthogonal bases for all four fundamental subspaces. With full_matrices=True, $V^\top \in \mathbb{R}^{12\times 12}$ has rows 0-5 spanning the row space (non-zero singular values) and rows 6-11 spanning $\mathcal{N}(X)$ (zero singular values). Use z = Vt_full[r] where $r = \text{rank}(X)$. Critical gotcha: full_matrices=False returns only $V^\top \in \mathbb{R}^{6\times 12}$ (first 6 rows), which cannot span the nullspaceâVt[-1] is the smallest row-space vector, not a nullspace vector.
Part 3: Testing Prediction Invariance (What Should Happen). For a true nullspace vector $z$, $Xz = 0$, so $X(w_0 + \alpha z) = Xw_0$ for all $\alpha$. Computing ||X@(w0 + 10*z) - X@w0||_2 should yield $\sim 10^{-14}$ (machine precision). If the code uses full_matrices=False incorrectly, z is a row-space vector with $Xz = \sigma_{\text{min}} u_{\text{min}} \neq 0$, and the norm is $\approx 10\sigma_{\text{min}}$ (non-zero).
Why This Matters for ML. Underdetermined systems ($d > n$) are ubiquitous in modern ML: neural networks have millions of parameters trained on thousands of examples. Non-uniqueness means many weight configurations achieve the same training loss. Regularization (ridge/lasso) selects unique solutions: ridge picks minimum-norm $w_0$, lasso enforces sparsity. Gradient descentâs implicit regularization navigates nullspaces to find solutions that generalizeâunderstanding nullspace structure is key to interpreting this phenomenon.
ML Examples and Patterns. Deep learning: Overparameterized networks ($d \gg n$) have vast nullspaces; SGD finds low-norm solutions via implicit regularization. Feature selection: Correlated features make $X$ nearly rank-deficient; lasso selects one feature per correlated group. Matrix completion: Recommender systems with missing ratings have large nullspaces; low-rank factorization + regularization yields unique completions. Compressed sensing: $\ell_1$ minimization recovers sparse solutions uniquely despite infinite dense solutions. Equivariant networks: Nullspace encodes symmetries (translation, rotation) that donât affect predictions.
Connection to Linear Algebra Theory. Rank-nullity theorem: $\dim(\text{row}(X)) + \dim(\mathcal{N}(X)) = d$, so $\dim(\mathcal{N}(X)) = d - r$. Minimum-norm via pseudoinverse: $w_0 = X^+ y = V\Sigma^+ U^\top y$, where $\Sigma^+$ inverts non-zero singular values. Normal equations: $X^\top X w = X^\top y$ is singular when $d > n$; general solution is $w = w_0 + z$ for $z \in \mathcal{N}(X^\top X) = \mathcal{N}(X)$. Ridge regularization: $(X^\top X + \lambda I)^{-1} X^\top y$ is always invertible for $\lambda > 0$, eliminating nullspace.
Numerical and Implementation Notes. Use full_matrices=True when $d > n$ to access nullspace vectors. Determine rank via np.linalg.matrix_rank(X, tol=1e-10) with explicit tolerance for nearly rank-deficient matrices. Verify nullspace property: np.linalg.norm(X @ z) < 1e-14. Check $w_0$ orthogonality: np.dot(w0, z) < 1e-14. For large sparse $X$, use scipy.sparse.linalg.svds (partial SVD).
Numerical and Shape Notes. Annotate: $X \in \mathbb{R}^{n\times d}$, $w_0 \in \mathbb{R}^d$, $z \in \mathbb{R}^d$, $y \in \mathbb{R}^n$. With full_matrices=False: $U \in \mathbb{R}^{n\times r}$, $S \in \mathbb{R}^r$, $V^\top \in \mathbb{R}^{r\times d}$ where $r = \min(n,d)$. With full_matrices=True: $U \in \mathbb{R}^{n\times n}$, $V^\top \in \mathbb{R}^{d\times d}$. Nullspace basis: Vt_full[r:, :] has shape $(d-r, d)$.
Pedagogical Significance. This is the canonical demonstration that underdetermined systems have non-unique solutions and that SVD reveals the nullspace explicitly. Key insights: (1) Minimum-norm solution is unique but infinitely many solutions exist. (2) Nullspace vectors can be added without changing predictions. (3) Regularization resolves ambiguity. (4) full_matrices flag is critical for accessing nullspaces when $d > n$. (5) In high-dimensional ML, understanding nullspaces is essential for reasoning about generalization, implicit regularization, and weight interpretability.

History and Applications

The rank-nullity theorem (also called the fundamental theorem of linear algebra) dates to the 19th century, with rigorous formulations by Sylvester (1850s) and Frobenius (1890s). The concept of nullspace (kernel) became central in abstract algebra and functional analysis (Hilbert spaces, 1900s). SVD, independently discovered by Beltrami (1873) and Jordan (1874), was computationally impractical until Golub and Kahan (1965) developed stable algorithms. The connection between SVD and the four fundamental subspaces (column space, row space, nullspace, left nullspace) was popularized by Strangâs textbooks (1970s-present).

In modern ML, underdetermined systems ($d > n$) became prominent with the rise of overparameterized models. Vapnikâs support vector machines (1990s) used regularization to handle $d > n$ cases. The âdouble descentâ phenomenon (Belkin et al., 2019) showed that overparameterized models can generalize despite interpolating training dataânullspace structure is key to understanding this. Bartlett et al.Â (2020) formalized âbenign overfittingâ in linear regression: minimum-norm solutions generalize when features are well-distributed. In deep learning, understanding how gradient descent navigates the vast nullspace of overparameterized networks (neural tangent kernel theory, Jacot et al., 2018) is an active frontier. Nullspace concepts also appear in compressed sensing (CandÃ¨s, Donoho, 2000s), matrix completion (Netflix Prize, 2006), and equivariant neural networks (group theory symmetries).

Connection to Broader Examples

Links to Chapter 7 (rank/nullspace): This is the canonical exampleârank deficiency implies non-trivial nullspace and non-unique solutions.
Connects to Chapter 10 (SVD): SVD explicitly constructs orthonormal bases for all four fundamental subspaces (column space, row space, nullspace, left nullspace).
Relates to Chapter 12 (least squares): Underdetermined systems and minimum-norm solutions via pseudoinverse.
Bridges to Chapter 8 (eigenvalues): $X^\top X$ has zero eigenvalues corresponding to $\mathcal{N}(X)$; eigenvectors are nullspace basis.
Complements Chapter 14 (conditioning): Nearly rank-deficient matrices (small singular values) amplify solution sensitivity to noise.
Reinforces Chapter 3 (basis/dimension): Nullspace dimension is $d - r$ (rank-nullity theorem); any basis of $\mathcal{N}(X)$ has this dimension.
Connects to regularization: Ridge regression $(X^\top X + \lambda I)^{-1} X^\top y$ is always invertible for $\lambda > 0$, eliminating nullspace ambiguity.

Numerical and Shape Notes

Numerical and Implementation Notes

Pedagogical Significance

ML Examples and Patterns

Comments