Part 1: Sample Covariance is PSD - $\Sigma = \frac{1}{n-1} X_c^\top X_c$ is a Gram matrix: $A = B^\top B$ form always yields PSD. - Eigenvalues measure variance in each principal direction; $\lambda_i = 0$ means data have zero variance along that axis (features are constant or linearly dependent). - Centering is essential: $\Sigma_{\text{uncentered}} = \frac{1}{n-1} X^\top X$ conflates mean and variance, inflating eigenvalues. - Rank at most $\min(n-1, d)$: with $n=40, d=2$, rank is at most 2; excess eigenvalues are zero (though here both are nonzero for generic data).
Part 2: Kernel Gram Matrix is PSD - Kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is valid (Mercer) iff the Gram matrix $K_{ij} = k(x_i, x_j)$ is PSD for any set of points. - RBF kernel $k(x, y) = \exp(-\gamma \|x - y\|_2^2)$ is a Mercer kernel; its Gram matrix is always PSD. - Gram matrix diagonal is $K_{ii} = k(x_i, x_i) = 1$ (for normalized RBF); off-diagonal entries $< 1$ decrease with pairwise distance. - Eigenvalues characterize the implicit feature space dimension: fast decay of eigenvalues suggests data lie on a low-dimensional manifold.
Part 3: Why PSD Matters in ML - Convexity: Quadratic form $\frac{1}{2} w^\top A w + b^\top w$ is convex iff $A$ is PSD; ensures unique global minimum. - Stability: Cholesky factorization $A = LL^\top$ is numerically stable for PSD matrices; used in Gaussian likelihood, matrix inversion. - Mercer feature maps: PSD kernels correspond to feature maps $\phi : \mathbb{R}^d \to \mathcal{H}$; enables implicit infinite-dimensional representations. - Condition number: $\kappa = \lambda_{\max}/\lambda_{\min}$ controls solver convergence; PSD-ness ensures all eigenvalues are nonzero or (legitimately) zero.
Why PSD Matters in ML - Ensures valid probability models (covariance in Gaussian distributions). - Enables convex optimization (global optima, efficient algorithms). - Allows stable matrix factorizations (Cholesky, eigendecomposition). - Links kernels to implicit feature spaces (Mercerâs theorem).
ML Examples and Patterns - Gaussian models: Covariance $\Sigma$ in $\mathcal{N}(\mu, \Sigma)$ must be PSD; used in LDA, GMM, Gaussian processes. - Ridge regression: $(X^\top X + \lambda I)^{-1}$ always has Cholesky factorization for $\lambda > 0$ (PSD-ness ensured). - Kernel SVM: Objective $\frac{1}{2} \alpha^\top K \alpha - \mathbf{1}^\top \alpha$ is convex iff $K$ is PSD; ensures unique solution. - Kernel PCA: Eigenvectors of $K$ give principal components in feature space; uses PSD eigendecomposition. - Gaussian processes: Kernel $k(x, x')$ evaluated on data yields PSD $K$; posterior is computed via Cholesky. - Whitening/normalization: $\Sigma = LL^\top$ enables $Z = L^{-1}(X_c^\top)^\top$ (whitened data, zero mean, unit covariance).
Connection to Linear Algebra Theory - Spectral theorem: Symmetric matrices diagonalize as $A = U \Lambda U^\top$ with orthogonal $U$ and diagonal $\Lambda$. For PSD, $\lambda_i \ge 0$. - Quadratic form: $v^\top A v = \sum_i \lambda_i (u_i^\top v)^2 \ge 0$ for PSD; equals zero only if $v$ is orthogonal to all nonzero eigenspace vectors. - Rayleigh quotient: $\lambda_{\min} = \min_v \frac{v^\top A v}{v^\top v}$, $\lambda_{\max} = \max_v \frac{v^\top A v}{v^\top v}$ characterizes eigenvalue bounds. - Gram matrix form: $A = B^\top B$ always yields PSD; rank of $A$ equals rank of $B$.
Numerical and Implementation Notes - Bias parameter: bias=False (default) uses $1/(n-1)$ (unbiased estimator); bias=True uses $1/n$ (MLE). For ML, unbiased is standard. - Symmetric solver efficiency: eigvalsh exploits symmetry (only computes upper triangle); eig is slower and less numerically stable for symmetric input. - Negative eigenvalues: If eig_S.min() < -1e-10, covariance is corrupted (centering error, numerical bug, or data issue). - Kernel evaluation cost: Computing full $K$ is $O(n^2 d)$; for large $n$, approximations (Nyström, random features) reduce memory. - Trace and sum of eigenvalues: $\text{trace}(A) = \sum_i \lambda_i$; trace is numerically cheaper to compute; useful sanity check.
Numerical and Shape Notes - $X \in \mathbb{R}^{40 \times 2}$: 40 examples, 2 features (2D visualization-friendly). - $X_c \in \mathbb{R}^{40 \times 2}$: centered (mean-subtracted along columns). - $\Sigma \in \mathbb{R}^{2 \times 2}$: symmetric, square, PSD. Typically small (full eigendecomposition is cheap). - eig_S $\in \mathbb{R}^2$: two eigenvalues (one per feature dimension). Sorted ascending: $\lambda_1 \le \lambda_2$. - $K \in \mathbb{R}^{40 \times 40}$: symmetric Gram matrix (expensive to form/store for large $n$). - eig_K $\in \mathbb{R}^{40}$: 40 eigenvalues (one per example). Typically fast decay (low-rank feature manifold). - Kernel bandwidth $\gamma = 0.5$ controls locality; larger $\gamma$ yields shorter-range interactions, often more eigenvalues near zero.
Pedagogical Significance - Core principle: PSD-ness is a hidden assumption in many ML algorithms; violations cause silent failures or numerical crashes. - Diagnostic skill: Inspecting minimum eigenvalues is a quick sanity check for correct covariance computation and valid kernel implementation. - Theory-practice bridge: Mercerâs theorem (theory) and kernel Gram matrices (practice) demonstrate the elegance of implicit feature spaces. - Regularization intuition: Ridge regression adds $\lambda I$ to ensure PSD-ness and numerical stability; regularization and PSD-ness are intimately linked. - Conditioning and stability: Small eigenvalues underlie ill-conditioning; understanding eigenvalues reveals when algorithms are fragile.
Comments