The SVD $X = U\Sigma V^\top$ gives $X^\top X = V\Sigma^2 V^\top$, so the eigenvalues of $X^\top X$ are the squared singular values of $X$. The condition number $\kappa(A) = \|A\| \|A^{-1}\|$ equals $\sigma_{\max}/\sigma_{\min}$ for invertible $A$. Hence $\kappa(X^\top X) = \sigma_{\max}^2/\sigma_{\min}^2 = \kappa(X)^2$. Numerically, forming $X^\top X$ loses precision: if $\kappa(X) \sim 10^8$, then $\kappa(X^\top X) \sim 10^{16}$, exceeding double-precision range ($\sim 10^{16}$ digits). SVD-based lstsq operates on $X$ directly, avoiding this squaring. Understanding this justifies why explicit $(X^\top X)^{-1}$ is discouraged in favor of QR or SVD.
- Log in to post comments
Show why forming the normal equations $X^\top X w = X^\top y$ squares the condition number of $X$, making least-squares solvers that use $X^\top X$ numerically unstable for ill-conditioned data. Demonstrate via SVD that $\kappa(X^\top X) = \kappa(X)^2$, and build intuition that SVD or QR-based solvers avoid this squaring, offering stable solutions even when $X$ has near-collinear columns. Emphasize when normal equations fail and why modern ML libraries default to SVD/QR.
Compute cond(X) from singular values and compare to cond(X^T X). Verify cond(X^T X) â cond(X)^2.
If $X=U\Sigma V^T$, then $X^TX = V\Sigma^2 V^T$. So $\kappa(X)=\sigma_{\max}/\sigma_{\min}$ and $\kappa(X^TX)=\sigma_{\max}^2/\sigma_{\min}^2=\kappa(X)^2$.
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
import numpy as np
from scripts.toy_data import toy_linreg
X, y, _ = toy_linreg(n=50, d=4, seed=2)
X = np.c_[X, X[:, [0]] + 1e-3 * X[:, [1]]]
S = np.linalg.svd(X, compute_uv=False)
cond_X = S.max() / S.min()
cond_XTX = np.linalg.cond(X.T@X)
print("cond(X):", cond_X)
print("cond(X^T X):", cond_XTX)
print("cond(X)^2:", cond_X**2)This snippet builds an ill-conditioned design matrix by appending a nearly collinear column (X[:,0] + 1e-3*X[:,1]) to a 4-feature dataset. It then compares three quantities: the spectral condition number of $X$ from its singular values (cond_X), the condition number of the normal-equation matrix $X^\top X$ (cond_XTX), and the squared condition number cond(X)^2. The printout illustrates that forming $X^\top X$ squares the condition number, making the normal equations far more ill-conditioned than the original systemâwhy SVD/QR solvers are preferred over explicit normal-equation solves.
Numerical and Shape Notes
- Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
- Axis discipline: Be explicit with
axisin reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1. - Broadcasting: Check that broadcasts are intended (e.g.,
(n,1)with(n,d)). Prefer reshape/expand-dims to make semantics clear. - Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
- Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
- Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
- Sanity checks: Print shapes and residuals (e.g.,
||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.
Numerical and Implementation Notes
- Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use
solve,lstsq, Cholesky/QR/SVD. - Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (
axis) and ensure broadcasts are intended. - Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
- Conditioning: Inspect
np.linalg.cond(A)when solutions look unstable; regularize (ridge) or rescale features to improve conditioning. - Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g.,
||Ax-b||, reconstruction errors) and assert finiteness. - Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
- Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
- Sanity checks: Compare against references (e.g.,
lstsqvs.solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
- Generate base data:
X, y, _ = toy_linreg(n=50, d=4, seed=2)produces a $50\times 4$ matrix and labels. - Add near-collinear column:
X = np.c_[X, X[:,[0]] + 1e-3*X[:,[1]]]appends a 5th column that is almost a linear combination of the first two, forcing a small singular value. - Compute singular values:
S = np.linalg.svd(X, compute_uv=False)returns the vector of singular values in descending order. - Condition number of $X$:
cond_X = S.max() / S.min()computes $\sigma_{\max}/\sigma_{\min}$. - Condition number of $X^\top X$:
cond_XTX = np.linalg.cond(X.T@X)uses a library routine (internally SVD or eigendecomp). - Verify squaring: print
cond(X),cond(X^T X), andcond(X)^2; the last two should match closely. - Diagnostics: if $\kappa(X) > 10^8$, expect $\kappa(X^\top X) > 10^{16}$, indicating severe instability in normal-equation solves.
Pedagogical Significance
- Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
- ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
- Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
- Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
- Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
- Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.
ML Examples and Patterns
- Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
- Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
- Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
- Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
- Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
- Condition number squaring: $\kappa(X^\top X) = \kappa(X)^2$ follows from $X^\top X = V\Sigma^2 V^\top$.
- Normal equations risk: forming $X^\top X$ amplifies ill-conditioning, leading to unstable solutions.
- SVD as diagnostic: compute singular values $S$ to get $\kappa(X) = S[0]/S[-1]$ without forming $X^\top X$.
- Near-collinearity: adding a nearly dependent column (
X[:,0] + 1e-3*X[:,1]) makes $\kappa(X)$ large. - Solver choice:
lstsquses SVD/QR to avoid squaring; explicit $(X^\top X)^{-1}$ fails for large $\kappa$. - Shape discipline: $X \in \mathbb{R}^{n\times d}$, $X^\top X \in \mathbb{R}^{d\times d}$; singular values live in $\mathbb{R}^{\min(n,d)}$.
Part 1: Problem Setup and SVD Decomposition
Start with $X \in \mathbb{R}^{50\times 4}$, append a near-collinear column to get $X \in \mathbb{R}^{50\times 5}$. Compute SVD: $X = U\Sigma V^\top$; singular values $S$ are the diagonal of $\Sigma$. Condition number $\kappa(X) = S[0]/S[-1]$; near-zero $S[-1]$ means near-rank-deficiency.
Part 2: Condition Number Squaring
$X^\top X = V\Sigma^2 V^\top$ (orthogonal $U$ cancels). Eigenvalues of $X^\top X$ are $\sigma_i^2$, so $\kappa(X^\top X) = (\sigma_{\max}/\sigma_{\min})^2 = \kappa(X)^2$. Numerically, if $\kappa(X) = 10^8$, then $\kappa(X^\top X) = 10^{16}$, exhausting double precision.
Part 3: Numerical Verification as Diagnostic
Print cond(X), cond(X^T X), and cond(X)^2. Match confirms the squaring relation. Large $\kappa(X^\top X)$ predicts unstable solutions if using $(X^\top X)^{-1}$; residuals and parameter estimates may be garbage. Use lstsq or QR instead.
Why This Matters for ML
Linear regression, ridge regression, and many optimization steps form normal equations. If features are collinear (common in high-dimensional data), $\kappa(X)$ is large, and forming $X^\top X$ doubles the exponent of instability. Modern libraries default to SVD/QR; understanding why prevents manual use of $(X^\top X)^{-1}$, which fails silently with large errors.
Numerical and Implementation Notes
Use np.linalg.svd(X, compute_uv=False) for singular values only; cheaper than full decomposition. For large $n$, avoid forming $X^\top X$ explicitly; use lstsq or iterative methods. Check rcond in lstsq to control singular-value truncation. Monitor condition numbers as regularization diagnostics.
Numerical and Shape Notes
Shapes: $X (n\times d)$, $X^\top X (d\times d)$, $S (\min(n,d))$. For wide $X$ ($n<d$), $X^\top X$ is rank-deficient; pseudoinverse needed. Singular values span $[0, \sigma_{\max}]$; ratios $\sigma_i/\sigma_{\max}$ indicate effective rank. Appending columns (e.g., np.c_[X, new_col]) can drastically change $\kappa$.
The SVD was developed by Beltrami (1873) and Jordan (1874), with modern numerical algorithms emerging from Golub and Kahan (1965). Recognition that normal equations square the condition number drove adoption of QR (Householder, 1958) and SVD-based least-squares solvers in numerical libraries (LAPACK, 1990s). In ML, this principle underpins why sklearn.linear_model.LinearRegression and np.linalg.lstsq use SVD rather than explicit $(X^\top X)^{-1}$. Applications include ridge regression (regularization stabilizes ill-conditioned $X^\top X$), PCA (covariance eigendecomposition), and modern neural-network training (gradient conditioning affects convergence). Understanding $\kappa(X^\top X) = \kappa(X)^2$ guides regularization, feature engineering, and solver choice.
- Linear maps (Ch. 4): $X$ is a linear map; SVD reveals its structure and invertibility.
- Projections (Ch. 6): Least squares is projection; conditioning affects numerical accuracy.
- Rank-nullity (Ch. 7): Near-rank-deficiency (small $\sigma_{\min}$) indicates near-null directions.
- Eigen (Ch. 8): $X^\top X$ is symmetric; its eigenvalues are $\sigma_i^2$.
- PSD (Ch. 9): $X^\top X$ is PSD; condition number from eigenvalues.
- SVD (Ch. 10): This chapterâs core topic: SVD as the tool for stable least squares.
- PCA (Ch. 11): Covariance $\Sigma \propto X^\top X$; ill-conditioning affects PCA stability.
- Least squares (Ch. 12): Normal equations $X^\top Xw = X^\top y$ square condition number; SVD avoids this.
- Conditioning (Ch. 14): This example operationalizes conditioning theory: $\kappa(X^\top X) = \kappa(X)^2$.
Comments