Part 1: Underdetermined Regression Setup - $X \in \mathbb{R}^{6 \times 12}$: more features ($d=12$) than examples ($n=6$); rank at most 6. - $y = X w_\text{true}$: target generated from true weights (arbitrary choice for illustration). - lstsq(X, y) returns the minimum-norm solution: $\hat{w} = \arg\min_w \{\|y - Xw\|_2^2, \|w\|_2\}$. - Infinite solutions fit the training data equally well; they form an affine subspace: $\{w : Xw = y\} = w_0 + \ker(X)$.
Part 2: Null Space via SVD - SVD: $X = U \Sigma V^\top$ with $U \in \mathbb{R}^{6 \times 6}$, $\Sigma$ diagonal, $V^\top \in \mathbb{R}^{12 \times 12}$. - Rank = number of nonzero singular values = 6 (in this case). - Null space = span of right singular vectors with zero singular values = last 6 rows of $V^\top$. - Last row z = Vt[-1] is a null-space vector: $Xz = U\Sigma V^\top z = 0$.
Part 3: Null-Space Independence of Predictions - $Xz = 0 \Rightarrow X(w_0 + \alpha z) = Xw_0 + \alpha Xz = Xw_0$ for any scalar $\alpha$. - pred1 = X(w0 + 10*z) = X*w0 = pred0 (predictions identical despite $\|w_1 - w_0\|_2 = 10 \|z\|_2 \ne 0$). - Numerical verification: ||pred1 - pred0||_2 â 0 (machine precision, not $10 \|z\|_2$).
Why This Matters for ML - Regularization motivation: Infinite solutions exist for underdetermined systems; regularization (L2, L1) selects one by minimizing a penalty. - Non-identifiability: Parameters are non-identifiable in null-space directionsâcanât distinguish them from training data. - Implicit regularization: SGD converges to the minimum-norm solution; the null space explains why this solution is preferred (bias of the algorithm). - Generalization: While all solutions fit training data identically, null-space perturbations can cause poor generalization on test data (large weight norm).
ML Examples and Patterns - Deep learning: Neural networks are severely overparameterized; the null space is enormous, explaining why regularization/initialization matter. - Sparse vs. dense solutions: L1 (Lasso) selects sparse solutions; L2 (Ridge) selects dense but small-norm solutionsâboth from the same infinite set. - Feature redundancy: Linearly dependent columns create null-space directions; PCA/whitening can remove them. - Implicit regularization via SGD: In overparameterized models, SGD converges to the minimum-norm solutionânull space is ignored (has zero norm).
Connection to Linear Algebra Theory - Rank-nullity theorem: $\text{rank}(X) + \dim(\ker(X)) = d$; with $d=12$, $\text{rank}(X)=6$, $\dim(\ker(X))=6$. - Orthogonal decomposition: $\mathbb{R}^d = \text{row}(X) \oplus \ker(X)$ (direct sum); any $w$ uniquely decomposes as $w = w_\parallel + w_\perp$. - Affine subspace of solutions: All solutions to $Xw = y$ form $w_0 + \ker(X)$ (an affine subspace parallel to null space). - SVD characterization: $\ker(X) = \text{span}(v_{r+1}, \ldots, v_d)$ where $v_i$ are columns of $V$ and $r = \text{rank}(X)$.
Numerical and Implementation Notes - Full vs. reduced SVD: With full_matrices=False, reduced SVD is efficient; null space comes from rows of $V^\top$ beyond rank. - Singular value magnitude: S.min() indicates conditioning; small values signal near-rank-deficiency (approximate null space). - Matrix rank computation: np.linalg.matrix_rank(X, tol=...) uses singular values; rank = number of singular values $>$ tolerance. - Null-space basis: Last $d - \text{rank}(X)$ rows of $V^\top$ form orthonormal basis of null space; any linear combination is also in null space.
Numerical and Shape Notes - $X \in \mathbb{R}^{6 \times 12}$ (6 rows, 12 columns). - $U \in \mathbb{R}^{6 \times 6}$ (with full_matrices=False). - $V^\top \in \mathbb{R}^{12 \times 12}$ (full size, even with reduced SVD). - z = Vt[-1] has shape (12,) (null-space vector in $\mathbb{R}^{12}$). - pred0, pred1 have shape (6,) (predictions for 6 examples).
Pedagogical Significance - Core principle: Null space characterizes weight adjustments that preserve predictions; infinitely many solutions produce identical training predictions. - Regularization necessity: Without additional constraints, the solution is non-unique; regularization (explicit or implicit via SGD) selects based on norm/sparsity/other criteria. - Generalization trade-off: Large weights in null-space directions donât affect training loss but can hurt generalization; regularization prevents this. - Foundation for overparameterized models: Modern deep learning operates with massive null spaces; understanding them explains implicit regularization, double descent, and benign overfitting.
Comments