Part 1: Least-Squares Setup
- Least squares solves $\hat{w} = \arg\min_w \|y - Xw\|_2^2$ (minimize the squared prediction error).
- Internally, lstsq uses an orthogonal factorization for numerical stability (SVD in NumPy's np.linalg.lstsq; other implementations use QR) rather than forming the normal equations $X^\top X \hat{w} = X^\top y$.
- The result $\hat{w} \in \mathbb{R}^3$ is the optimal weight vector, and it is unique if $X$ has full column rank.
- Fitted values: $\hat{y} = X\hat{w} \in \mathbb{R}^{25}$ (predictions on the training set).
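A minimal sketch of this setup, assuming synthetic data with the shapes used in these notes. The seed, the noise level, and `w_true` are illustrative choices, not values from the original exercise.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, illustrative only
X = rng.normal(size=(25, 3))              # 25 examples, 3 features (synthetic)
w_true = np.array([1.5, -2.0, 0.5])       # hypothetical ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=25)

# lstsq returns (solution, residual sum of squares, rank, singular values).
w_hat, rss, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w_hat                         # fitted values on the training set

print("w_hat shape:", w_hat.shape)
print("y_hat shape:", y_hat.shape)
print("rank of X:", rank)
```

Because this synthetic $X$ has full column rank, the returned `rank` equals 3 and `w_hat` also satisfies the normal equations up to rounding.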
Part 2: Residual Orthogonality Condition
- Residual: $r = y - X\hat{w}$ measures the prediction error (its norm is small when the model fits well).
- Orthogonality condition: $X^\top r = 0$ means $r$ is orthogonal to every column of $X$.
- Equivalently, $r \in \text{col}(X)^\perp$ (the residual lies in the orthogonal complement of the column space).
- Characterization: Any $w$ satisfying $X^\top (y - Xw) = 0$ is a least-squares solution, and it is the unique one if $X$ has full column rank.
- Geometric interpretation: $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto $\text{col}(X)$; $r$ is the perpendicular component.
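The orthogonality condition and the projection picture can be checked numerically. The sketch below assumes random synthetic data; the names `ortho` and `pyth_gap` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(25, 3))
y = rng.normal(size=25)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w_hat                  # projection of y onto col(X)
r = y - y_hat                      # perpendicular component

# Orthogonality: X^T r should vanish up to rounding error.
ortho = np.linalg.norm(X.T @ r)

# Pythagoras: ||y||^2 = ||y_hat||^2 + ||r||^2 because y_hat and r are orthogonal.
pyth_gap = abs(y @ y - (y_hat @ y_hat + r @ r))

print("||X^T r||:", ortho)
print("Pythagoras gap:", pyth_gap)
```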
Why This Matters for ML
- Model diagnostics: Checking $\|X^\top r\|_2 \approx 0$ verifies that the solver found a true least-squares solution.
- Residual properties: If $X$ includes a constant column, the residuals automatically have zero mean (they are orthogonal to the vector of ones).
- Ridge regression trade-off: Adding the penalty $\lambda \|w\|_2^2$ breaks orthogonality. Stationarity gives $-2X^\top r_\lambda + 2\lambda \hat{w}_\lambda = 0$, so $X^\top r_\lambda = \lambda \hat{w}_\lambda$, which is nonzero whenever $\hat{w}_\lambda \neq 0$.
- Overfitting indicator: In overparameterized models the training residual shrinks, yet orthogonality still holds on the training data, so the check says nothing about generalization.
- Iterative solvers: Convergence can be monitored by how quickly $\|X^\top r\|_2$ decreases toward zero.
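Two of these claims are easy to verify on synthetic data: residuals average to zero when $X$ has a constant column, and the ridge residual satisfies $X^\top r_\lambda = \lambda \hat{w}_\lambda$ for the objective $\|y - Xw\|_2^2 + \lambda\|w\|_2^2$. This is a sketch under assumptions: `lam` is a hypothetical regularization strength, and ridge is solved through the regularized normal equations only because the example is small and well conditioned.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(25, 2))
X = np.column_stack([np.ones(25), A])     # constant column + 2 features
y = 3.0 + A @ np.array([1.0, -1.0]) + 0.2 * rng.normal(size=25)

# Ordinary least squares: r is orthogonal to the ones column,
# so the residuals sum (and average) to zero.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
r_ols = y - X @ w_ols

# Ridge via (X^T X + lam I) w = X^T y; the penalty here also covers the intercept.
lam = 0.5                                 # hypothetical regularization strength
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
r_ridge = y - X @ w_ridge

print("mean OLS residual:", r_ols.mean())
print("X^T r_ridge:      ", X.T @ r_ridge)
print("lam * w_ridge:    ", lam * w_ridge)
```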
ML Examples and Patterns
- Linear regression diagnostics: Residual plots presuppose that $X^\top r \approx 0$ holds; if it does not, the solver did not reach the least-squares solution.
- Feature selection: $|x_\text{new}^\top r|$ scores a candidate feature by its inner product with the unexplained part of $y$ (proportional to their correlation when $x_\text{new}$ and $r$ are centered and $x_\text{new}$ is normalized).
- Gram-Schmidt: Each orthogonalization step subtracts the projection onto the vectors already processed and keeps the residual, which is the least-squares residual against those vectors.
- Iterative refinement: For ill-conditioned $X$, refine $w$ via $r^{(k)} = y - Xw^{(k)}$, $\delta w^{(k)} = \text{lstsq}(X, r^{(k)})$, $w^{(k+1)} = w^{(k)} + \delta w^{(k)}$.
- Gradient descent: With a suitable step size, each GD step reduces $\|r\|_2$ but doesn't immediately satisfy orthogonality; the iterates approach $X^\top r = 0$ as they converge.
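The gradient-descent pattern can be sketched as follows, assuming the objective $\tfrac{1}{2}\|y - Xw\|_2^2$ and a fixed step size of $1/\sigma_{\max}(X)^2$. The step size, iteration count, and synthetic data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 3))
y = rng.normal(size=25)

# Step size 1 / sigma_max^2 guarantees descent for 0.5 * ||y - Xw||^2.
eta = 1.0 / np.linalg.norm(X, 2) ** 2

w = np.zeros(3)
grad_norms = []
for _ in range(500):
    r = y - X @ w
    g = X.T @ r                    # negative gradient of the objective
    grad_norms.append(np.linalg.norm(g))
    w = w + eta * g                # gradient descent update

# Reference solution from a direct solver, for comparison.
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

print("||X^T r|| at start:", grad_norms[0])
print("||X^T r|| at end:  ", grad_norms[-1])
```

The quantity tracked in `grad_norms` is exactly $\|X^\top r\|_2$, so the stopping criterion of the iteration and the orthogonality check are the same number.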
Connection to Linear Algebra Theory
- Normal equations: Orthogonality $X^\top r = 0$ rearranges to $X^\top X \hat{w} = X^\top y$ (the normal equations).
- QR factorization: Writing the reduced factorization $X = QR$ (with $X$ of full column rank), the solution is $\hat{w} = R^{-1} Q^\top y$; residuals $r = (I - QQ^\top) y$ lie in the orthogonal complement of $\text{col}(Q) = \text{col}(X)$ by construction.
- SVD perspective: The thin SVD $X = U\Sigma V^\top$ gives $\hat{w} = V \Sigma^{-1} U^\top y$ when $X$ has full column rank; then $r = (I - UU^\top) y$, and $X^\top r = 0$ follows from $U^\top U = I$.
- Projection theorem: Orthogonality is guaranteed by the projection theorem, a consequence of geometry, not a numerical accident.
- Ill-conditioning: Small singular values of $X$ amplify numerical errors; iterative refinement mitigates this by computing $r$ in higher precision.
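The QR and SVD formulas above can be compared against lstsq directly. This is a sketch on synthetic full-column-rank data; `np.linalg.solve` is used for the triangular system for brevity, although a dedicated triangular solver would also work.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 3))
y = rng.normal(size=25)

# Reduced QR: Q is 25x3 with orthonormal columns, R is 3x3 upper triangular.
Q, R = np.linalg.qr(X, mode="reduced")
w_qr = np.linalg.solve(R, Q.T @ y)        # solves R w = Q^T y
r_qr = y - Q @ (Q.T @ y)                  # (I - Q Q^T) y

# Thin SVD: U is 25x3, s holds the 3 singular values, Vt is 3x3.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((U.T @ y) / s)            # V Sigma^{-1} U^T y

w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

print("QR  vs lstsq:", np.max(np.abs(w_qr - w_ref)))
print("SVD vs lstsq:", np.max(np.abs(w_svd - w_ref)))
print("||X^T r_qr||:", np.linalg.norm(X.T @ r_qr))
```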
Numerical and Implementation Notes
- Transpose order: X.T @ r (shapes (3, 25) × (25,) → (3,)), not X @ r.T (incompatible shapes).
- Norm type: np.linalg.norm(X.T @ r) computes the Euclidean norm of a 1-D array by default; passing ord=2 makes this explicit.
- rcond parameter: lstsq(..., rcond=None) uses a cutoff based on machine precision (scaled by the larger dimension of $X$) for rank determination; setting it explicitly is good practice.
- Full vs. reduced rank: If $X$ is rank-deficient, lstsq returns the minimum-norm solution; orthogonality $X^\top r = 0$ still holds.
- Tolerance for verification: For data of order-one scale, values $< 10^{-10}$ indicate good orthogonality; values $> 10^{-8}$ warrant investigation.
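The rank-deficient case can be sketched with a deliberately redundant third column. The relative measure `rel` is an addition not in the notes above: it divides $\|X^\top r\|_2$ by $\|X\|_2 \|r\|_2$ so that the tolerance does not depend on the scale of the data.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(25, 2))
X = np.column_stack([A, A[:, 0] + A[:, 1]])   # third column is redundant
y = rng.normal(size=25)

w_hat, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ w_hat

# Scale-aware orthogonality check.
rel = np.linalg.norm(X.T @ r) / (np.linalg.norm(X, 2) * np.linalg.norm(r))

# X @ null_vec is zero (up to rounding) by construction; the minimum-norm
# solution has no component along this null-space direction.
null_vec = np.array([1.0, 1.0, -1.0])
null_component = abs(w_hat @ null_vec)

print("detected rank:", rank)
print("relative ||X^T r||:", rel)
print("null-space component of w_hat:", null_component)
```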
Numerical and Shape Notes
- $X \in \mathbb{R}^{25 \times 3}$ (25 examples, 3 features).
- $y \in \mathbb{R}^{25}$ (target vector).
- $\hat{w} \in \mathbb{R}^3$ (weight vector).
- $r \in \mathbb{R}^{25}$ (residual vector).
- $X^\top r \in \mathbb{R}^3$ (should be near zero).
- np.linalg.norm(X.T @ r) yields a scalar (Euclidean norm of $X^\top r$).
Pedagogical Significance
- Core principle: Least squares finds weights such that the residuals are orthogonal to the column space of the input features.
- Optimization meets geometry: First-order optimality ($\nabla_w \mathcal{L} = 0$) is equivalent to orthogonal projection.
- Solver verification: The single check $\|X^\top r\|_2 \approx 0$ reveals whether your solver succeeded.
- Regularization trade-off: Visualizing how regularization breaks orthogonality illuminates the bias-variance trade-off.
- Foundation for ML: Understanding projection and orthogonality is essential for regression, neural networks, and principled optimization.
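The equivalence between first-order optimality and orthogonality can be written out in two lines. Taking $\mathcal{L}(w) = \|y - Xw\|_2^2$, the objective from Part 1, is an assumption here, since the notes do not define $\mathcal{L}$ explicitly.

$$
\begin{aligned}
\mathcal{L}(w) &= \|y - Xw\|_2^2 = y^\top y - 2\, w^\top X^\top y + w^\top X^\top X w, \\
\nabla_w \mathcal{L}(w) &= -2 X^\top y + 2 X^\top X w = -2 X^\top (y - Xw).
\end{aligned}
$$

Setting the gradient to zero at $\hat{w}$ gives $X^\top (y - X\hat{w}) = X^\top r = 0$, which is the orthogonality condition.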