Example number
57
Example slug
example_57_psd_in_ml_covariance_and_kernel_gram_matrices
Background

Historical context: Positive definiteness emerged in quadratic form theory (19th century), with applications to optimization and statistics. Mercer’s theorem (1909) formalized the connection between PSD kernels and feature spaces. Kernel methods in machine learning (SVM, kernel PCA, Gaussian processes) rely critically on Mercer’s characterization. The stability and convexity guarantees from PSD matrices are central to modern convex optimization (Boyd & Vandenberghe, 2004).

Mathematical characterization: A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is positive semidefinite (PSD) if $v^\top A v \ge 0$ for all $v \in \mathbb{R}^n$. Equivalently: all eigenvalues $\lambda_i \ge 0$, $A = B^\top B$ for some $B$, or $A = U \Lambda U^\top$ with $\Lambda$ having nonnegative diagonal. For covariance $\Sigma = \frac{1}{n-1} X_c^\top X_c$, PSD-ness follows from the Gram matrix form. For kernels, PSD-ness is Mercer’s condition: valid kernels yield PSD Gram matrices.

Prevalence in ML: Covariance matrices underpin Gaussian models, PCA, dimensionality reduction, and regularization. Kernel Gram matrices are fundamental to SVM, kernel ridge regression, kernel PCA, and Gaussian processes. PSD-ness is assumed implicitly in many algorithms—violations (e.g., from numerical errors or invalid kernels) cause silent failures or crashes.

Purpose

Show, in an ML-first way, that covariance matrices and valid kernel Gram matrices are positive semidefinite (PSD)—and why this property is crucial for stable, convex algorithms. Build intuition for why PSD-ness guarantees nonnegative eigenvalues, enables Cholesky factorization, and ensures valid probability models. Learn to diagnose conditioning issues via the smallest eigenvalue and understand the implicit feature spaces behind kernel methods.

Problem

Compute eigenvalues of (1) a covariance matrix and (2) an RBF kernel Gram matrix; report minimum eigenvalues.

Solution (Math)

Covariance matrices are PSD because they are of the form $X_c^TX_c$ (up to scaling). Valid kernel Gram matrices are PSD because they equal inner products in a feature space. Eigenvalues should be nonnegative up to numerical tolerance.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points, toy_kernel_rbf

X = toy_pca_points(n=40, seed=0)
Xc = X - X.mean(0)
Sigma = np.cov(Xc, rowvar=False, bias=False)
eig_S = np.linalg.eigvalsh(Sigma)
print("min eig(cov):", eig_S.min())

K = toy_kernel_rbf(X, gamma=0.5)
eig_K = np.linalg.eigvalsh(K)
print("min eig(kernel):", eig_K.min())
Code Introduction

This code verifies that two fundamental ML matrices are positive semidefinite (PSD): the sample covariance matrix and the kernel Gram matrix. Both matrices have nonnegative eigenvalues, a critical property that ensures numerical stability and enables convex optimization.

Part 1: Sample Covariance is PSD. The code generates 40 data points via toy_pca_points(n=40, seed=0), yielding $X \in \mathbb{R}^{40 \times 2}$ (40 examples, 2 features). The data are centered as $X_c = X - \mu$, where $\mu = \frac{1}{n}\sum_{i=1}^n X_i$. The sample covariance matrix is $\Sigma = \frac{1}{n-1} X_c^\top X_c \in \mathbb{R}^{2 \times 2}$, computed via np.cov(Xc, rowvar=False, bias=False). The parameters rowvar=False (treat columns as variables) and bias=False (use $1/(n-1)$ normalization, unbiased estimator) ensure standard covariance semantics. The eigenvalues are computed via np.linalg.eigvalsh(Sigma), which uses a symmetric eigenvalue solver (the h suffix indicates Hermitian/symmetric input). The output eig_S is an array of eigenvalues sorted in ascending order. The printed minimum eigenvalue eig_S.min() is nonnegative (or very close to zero within machine precision), confirming that $\Sigma$ is PSD. Why covariance is PSD: By definition, $\Sigma = \frac{1}{n-1} X_c^\top X_c$ is a product of the form $A^\top A$. For any vector $v \in \mathbb{R}^2$: $v^\top \Sigma v = \frac{1}{n-1} v^\top X_c^\top X_c v = \frac{1}{n-1} \|X_c v\|_2^2 \ge 0$. The sum of squared projections is always nonnegative, so $\Sigma$ is PSD by construction. The minimum eigenvalue is zero if and only if $X_c$ is rank-deficient (columns are linearly dependent)—unlikely with random data, but possible if features are constant or perfectly correlated.

Part 2: Kernel Gram Matrix is PSD. A kernel Gram matrix $K \in \mathbb{R}^{n \times n}$ is constructed from a kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ as $K_{ij} = k(x_i, x_j)$, where $x_i$ are data points. The code uses the RBF (Radial Basis Function) kernel: $k(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|_2^2\right)$, with $\gamma = 0.5$ (bandwidth parameter). The toy_kernel_rbf(X, gamma=0.5) function computes all pairwise kernel evaluations, yielding a $40 \times 40$ matrix $K$. The eigenvalues eig_K = np.linalg.eigvalsh(K) are computed and printed. For any valid kernel, all eigenvalues are nonnegative, confirming $K$ is PSD. Why kernel Gram matrices are PSD: A kernel function $k$ is positive definite if for any finite set of points $\{x_1, \ldots, x_n\}$ and coefficients $\{c_1, \ldots, c_n\}$: $\sum_{i,j=1}^n c_i c_j k(x_i, x_j) = \mathbf{c}^\top K \mathbf{c} \ge 0$. The RBF kernel (and many others: polynomial, sigmoid) satisfies this definition. This property is equivalent to $K$ being PSD—all eigenvalues are nonnegative. Kernels in machine learning are chosen specifically to ensure this property, because PSD matrices enable: (1) Convex optimization: Quadratic forms with PSD matrix are convex objectives; (2) Stable solvers: Cholesky factorization only works for PSD matrices; (3) Geometric interpretation: Eigenvectors and eigenvalues reveal the implicit feature space.

Why PSD Matters in ML. In Gaussian modeling, inference, and regularization (ridge regression, Gaussian processes), covariance matrices must be PSD to ensure: (1) valid probability distributions (covariance encodes variance directions); (2) stable likelihood computation (Cholesky factorization for efficient Gaussian density evaluation); (3) proper regularization (ridge penalty $\lambda I$ adds $\lambda > 0$ to eigenvalues, preserving PSD-ness). In kernel methods (SVM, kernel ridge regression, kernel PCA), the kernel Gram matrix appears in the optimization problem and decision boundary. PSD-ness ensures: (1) Convexity: the objective $\frac{1}{2} \mathbf{w}^\top K \mathbf{w} + \text{loss}$ is convex; (2) Mercer’s theorem: PSD kernels correspond to implicit feature maps $\phi : \mathbb{R}^d \to \mathcal{H}$ into a Hilbert space, where $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$; (3) Efficient computation: Cholesky decomposition accelerates computations without explicitly forming high-dimensional features. The smallest eigenvalue eig_S.min() and eig_K.min() have diagnostic value: if very close to zero ($\sim 10^{-10}$), the matrix is nearly singular (rank-deficient)—small perturbations can cause large changes in solutions. In covariance, this indicates nearly-constant or perfectly-correlated features. In kernels, it indicates that the data lie approximately on a lower-dimensional manifold.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Center the data: Compute $X_c = X - \bar{X}$ (column-wise mean subtraction) to remove mean bias from covariance.
  2. Form sample covariance: $\Sigma = \frac{1}{n-1} X_c^\top X_c$ via np.cov(Xc, rowvar=False, bias=False) (unbiased estimator).
  3. Compute eigenvalues: Use np.linalg.eigvalsh(Sigma) (symmetric eigenvalue solver) to extract all eigenvalues $[\lambda_1, \ldots, \lambda_d]$ (sorted ascending).
  4. Check PSD-ness: Verify eig_S.min() >= -1e-10 (nonnegative, within numerical tolerance); negative eigenvalues indicate errors.
  5. Form kernel Gram matrix: Compute pairwise kernel evaluations $K_{ij} = k(x_i, x_j)$ for all $n$ examples via toy_kernel_rbf(X, gamma).
  6. Compute kernel eigenvalues: Apply np.linalg.eigvalsh(K) to extract $n$ eigenvalues $[\mu_1, \ldots, \mu_n]$.
  7. Diagnose conditioning: Compute condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ (for covariance); large $\kappa$ (e.g., $> 10^{10}$) signals ill-conditioning.
  8. Interpret minimum eigenvalue: Approaching zero means near-singularity (features nearly dependent, or data lie on lower-dimensional manifold); small but nonzero is normal (reflects data structure).
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Covariance matrices are PSD by construction: $\Sigma = \frac{1}{n-1} X_c^\top X_c$ (Gram matrix form) implies all eigenvalues $\ge 0$.
  • Kernel Gram matrices are PSD by Mercer’s theorem: Valid kernels (RBF, polynomial, sigmoid) satisfy PSD-ness; invalid kernels violate it.
  • Eigenvalues diagnose conditioning: Small eigenvalues signal near-singularity, ill-conditioning, and potential numerical instability.
  • Centering is required for covariance: Uncentered data includes the mean’s contribution, inflating eigenvalues and breaking PCA semantics.
  • Symmetric eigenvalue solvers are efficient: np.linalg.eigvalsh exploits symmetry; only eigenvalues (not eigenvectors) are needed for PSD checking.
  • Condition number matters: Ratio $\lambda_{\max} / \lambda_{\min}$ determines solver convergence rate and numerical accuracy.
  • Cholesky factorization requires PSD: $A = LL^\top$ exists only if $A$ is PSD; used for efficient Gaussian likelihood and solving linear systems.
  • Numerical checks catch errors: Inspecting eig_S.min() and eig_K.min() reveals bugs in covariance computation or kernel implementation.
Notes

Part 1: Sample Covariance is PSD - $\Sigma = \frac{1}{n-1} X_c^\top X_c$ is a Gram matrix: $A = B^\top B$ form always yields PSD. - Eigenvalues measure variance in each principal direction; $\lambda_i = 0$ means data have zero variance along that axis (features are constant or linearly dependent). - Centering is essential: $\Sigma_{\text{uncentered}} = \frac{1}{n-1} X^\top X$ conflates mean and variance, inflating eigenvalues. - Rank at most $\min(n-1, d)$: with $n=40, d=2$, rank is at most 2; excess eigenvalues are zero (though here both are nonzero for generic data).

Part 2: Kernel Gram Matrix is PSD - Kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is valid (Mercer) iff the Gram matrix $K_{ij} = k(x_i, x_j)$ is PSD for any set of points. - RBF kernel $k(x, y) = \exp(-\gamma \|x - y\|_2^2)$ is a Mercer kernel; its Gram matrix is always PSD. - Gram matrix diagonal is $K_{ii} = k(x_i, x_i) = 1$ (for normalized RBF); off-diagonal entries $< 1$ decrease with pairwise distance. - Eigenvalues characterize the implicit feature space dimension: fast decay of eigenvalues suggests data lie on a low-dimensional manifold.

Part 3: Why PSD Matters in ML - Convexity: Quadratic form $\frac{1}{2} w^\top A w + b^\top w$ is convex iff $A$ is PSD; ensures unique global minimum. - Stability: Cholesky factorization $A = LL^\top$ is numerically stable for PSD matrices; used in Gaussian likelihood, matrix inversion. - Mercer feature maps: PSD kernels correspond to feature maps $\phi : \mathbb{R}^d \to \mathcal{H}$; enables implicit infinite-dimensional representations. - Condition number: $\kappa = \lambda_{\max}/\lambda_{\min}$ controls solver convergence; PSD-ness ensures all eigenvalues are nonzero or (legitimately) zero.

Why PSD Matters in ML - Ensures valid probability models (covariance in Gaussian distributions). - Enables convex optimization (global optima, efficient algorithms). - Allows stable matrix factorizations (Cholesky, eigendecomposition). - Links kernels to implicit feature spaces (Mercer’s theorem).

ML Examples and Patterns - Gaussian models: Covariance $\Sigma$ in $\mathcal{N}(\mu, \Sigma)$ must be PSD; used in LDA, GMM, Gaussian processes. - Ridge regression: $(X^\top X + \lambda I)^{-1}$ always has Cholesky factorization for $\lambda > 0$ (PSD-ness ensured). - Kernel SVM: Objective $\frac{1}{2} \alpha^\top K \alpha - \mathbf{1}^\top \alpha$ is convex iff $K$ is PSD; ensures unique solution. - Kernel PCA: Eigenvectors of $K$ give principal components in feature space; uses PSD eigendecomposition. - Gaussian processes: Kernel $k(x, x')$ evaluated on data yields PSD $K$; posterior is computed via Cholesky. - Whitening/normalization: $\Sigma = LL^\top$ enables $Z = L^{-1}(X_c^\top)^\top$ (whitened data, zero mean, unit covariance).

Connection to Linear Algebra Theory - Spectral theorem: Symmetric matrices diagonalize as $A = U \Lambda U^\top$ with orthogonal $U$ and diagonal $\Lambda$. For PSD, $\lambda_i \ge 0$. - Quadratic form: $v^\top A v = \sum_i \lambda_i (u_i^\top v)^2 \ge 0$ for PSD; equals zero only if $v$ is orthogonal to all nonzero eigenspace vectors. - Rayleigh quotient: $\lambda_{\min} = \min_v \frac{v^\top A v}{v^\top v}$, $\lambda_{\max} = \max_v \frac{v^\top A v}{v^\top v}$ characterizes eigenvalue bounds. - Gram matrix form: $A = B^\top B$ always yields PSD; rank of $A$ equals rank of $B$.

Numerical and Implementation Notes - Bias parameter: bias=False (default) uses $1/(n-1)$ (unbiased estimator); bias=True uses $1/n$ (MLE). For ML, unbiased is standard. - Symmetric solver efficiency: eigvalsh exploits symmetry (only computes upper triangle); eig is slower and less numerically stable for symmetric input. - Negative eigenvalues: If eig_S.min() < -1e-10, covariance is corrupted (centering error, numerical bug, or data issue). - Kernel evaluation cost: Computing full $K$ is $O(n^2 d)$; for large $n$, approximations (Nyström, random features) reduce memory. - Trace and sum of eigenvalues: $\text{trace}(A) = \sum_i \lambda_i$; trace is numerically cheaper to compute; useful sanity check.

Numerical and Shape Notes - $X \in \mathbb{R}^{40 \times 2}$: 40 examples, 2 features (2D visualization-friendly). - $X_c \in \mathbb{R}^{40 \times 2}$: centered (mean-subtracted along columns). - $\Sigma \in \mathbb{R}^{2 \times 2}$: symmetric, square, PSD. Typically small (full eigendecomposition is cheap). - eig_S $\in \mathbb{R}^2$: two eigenvalues (one per feature dimension). Sorted ascending: $\lambda_1 \le \lambda_2$. - $K \in \mathbb{R}^{40 \times 40}$: symmetric Gram matrix (expensive to form/store for large $n$). - eig_K $\in \mathbb{R}^{40}$: 40 eigenvalues (one per example). Typically fast decay (low-rank feature manifold). - Kernel bandwidth $\gamma = 0.5$ controls locality; larger $\gamma$ yields shorter-range interactions, often more eigenvalues near zero.

Pedagogical Significance - Core principle: PSD-ness is a hidden assumption in many ML algorithms; violations cause silent failures or numerical crashes. - Diagnostic skill: Inspecting minimum eigenvalues is a quick sanity check for correct covariance computation and valid kernel implementation. - Theory-practice bridge: Mercer’s theorem (theory) and kernel Gram matrices (practice) demonstrate the elegance of implicit feature spaces. - Regularization intuition: Ridge regression adds $\lambda I$ to ensure PSD-ness and numerical stability; regularization and PSD-ness are intimately linked. - Conditioning and stability: Small eigenvalues underlie ill-conditioning; understanding eigenvalues reveals when algorithms are fragile.

History and Applications

PSD matrices in optimization: The theory of positive definite matrices crystallized in the 19th century through quadratic form analysis. Cholesky’s decomposition (1910) provided an efficient factorization $A = LL^\top$ for PSD matrices, enabling fast linear system solving and log-determinant computation. In statistics, covariance matrices in multivariate Gaussian models naturally assumed PSD structure. Kernel methods emerged from the theory of reproducing kernel Hilbert spaces (Aronszajn, 1950), with Mercer’s theorem (1909, formalized later) establishing that PSD kernels correspond to inner products in implicit feature spaces.

Modern era: Boyd & Vandenberghe (2004) placed PSD matrices at the heart of convex optimization, showing that many ML problems (SVM, logistic regression, Gaussian processes) reduce to convex objectives with PSD matrices. Schölkopf & Smola (2002) unified kernel methods theory, emphasizing PSD-ness as the defining property of valid kernels. Today, PSD matrices are assumed implicitly in nearly every ML algorithm—violations (from numerical errors, invalid kernels, or corrupted covariance) cause subtle failures. Modern applications: kernel methods in large-scale ML (SVM, kernel ridge regression, Gaussian processes), covariance estimation in high dimensions (regularization, shrinkage), and implicit regularization in deep learning (understanding the implicit bias of SGD toward low-rank solutions and PSD-preserving parameter updates). The rise of differentiable programming has re-emphasized PSD-ness—automatic differentiation of algorithms that assume PSD Hessians or covariances can fail silently if the property is violated during training.

Connection to Broader Examples
  • Chapter 8 (Eigenvalues): PSD matrices have real, orthogonal eigenvectors; eigenvalues reveal principal directions and scales.
  • Chapter 10 (SVD): Covariance $\Sigma = \frac{1}{n-1} X_c^\top X_c$ relates to SVD via $\Sigma = V \Lambda V^\top$ (right singular vectors/values of $X_c$).
  • Chapter 11 (PCA): Principal components are eigenvectors of $\Sigma$; PSD-ness guarantees real, orthogonal PC directions.
  • Chapter 12 (Least-squares): Ridge regression adds $\lambda I$ to $X^\top X$; $X^\top X + \lambda I$ is always PSD for $\lambda > 0$.
  • Chapter 14 (Conditioning): Condition number $\lambda_{\max}/\lambda_{\min}$ determines solver accuracy; PSD-ness ensures nonnegativity for interpretation.
  • Kernels and Mercer’s theorem: Valid kernels produce PSD Gram matrices; enables feature-space interpretation and convex optimization.
  • Convex optimization: PSD matrices enable convex objectives, guaranteeing global optima and efficient algorithms (gradient descent, Newton).
  • Gaussian models: Covariance in $\mathcal{N}(\mu, \Sigma)$ must be PSD; Cholesky factorization $\Sigma = LL^\top$ enables efficient likelihood.

Comments