ex1.ai

Example number

Example slug

example_41_psd_in_ml_covariance_and_kernel_gram_matrices

Background

For centered data $X_c \in \mathbb{R}^{n\times d}$, the sample covariance $\Sigma = \frac{1}{n-1} X_c^\top X_c$ is PSD because it is a Gram matrix of feature vectors $x_i - \bar{x}$. By Mercerâs theorem, a valid positive-definite kernel $k(x,x')$ yields a PSD Gram matrix $K_{ij} = k(x_i, x_j)$; $K$ equals inner products of implicit feature maps $\phi(x_i)$, so $K = \Phi \Phi^\top$ is PSD. Numerically, eigenvalues of a PSD matrix are nonnegative; small negative values reflect floating-point error or kernel misspecification. PSD structure guarantees nonnegative quadratic forms $z^\top M z \ge 0$, enabling Cholesky factorization and stable Gaussian conditioning.

Purpose

Show that two ubiquitous matrices in MLâsample covariance and kernel Gramâare positive semidefinite (PSD), and how to verify this numerically by inspecting their eigenvalues. Build intuition that covariance is PSD because it is $X_c^\top X_c$ up to scaling, and that valid kernels yield PSD Gram matrices because they are inner products in a feature space. Emphasize that PSD structure enables stable algorithms (Cholesky, eigendecompositions, Gaussian processes, SVMs) and that small negative eigenvalues usually signal numerical error or an invalid kernel.

Problem

Compute eigenvalues of (1) a covariance matrix and (2) an RBF kernel Gram matrix; report minimum eigenvalues.

Solution (Math)

Covariance matrices are PSD because they are of the form $X_c^TX_c$ (up to scaling). Valid kernel Gram matrices are PSD because they equal inner products in a feature space. Eigenvalues should be nonnegative up to numerical tolerance.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


import numpy as np
from scripts.toy_data import toy_pca_points, toy_kernel_rbf

X = toy_pca_points(n=40, seed=0)
Xc = X - X.mean(0)
Sigma = np.cov(Xc, rowvar=False, bias=False)
eig_S = np.linalg.eigvalsh(Sigma)
print("min eig(cov):", eig_S.min())

K = toy_kernel_rbf(X, gamma=0.5)
eig_K = np.linalg.eigvalsh(K)
print("min eig(kernel):", eig_K.min())

Code Introduction

This snippet checks positive semidefiniteness of two common matrices: a sample covariance and an RBF kernel Gram matrix. It loads 40 synthetic 2D points, centers them, and computes the covariance $\Sigma = \tfrac{1}{n-1} X_c^\top X_c$. Eigenvalues via eigvalsh should satisfy $\lambda_{\min} \ge 0$ for PSD; min eig(cov) reports the smallest eigenvalue as a sanity check on symmetry and numerical stability. It then builds the RBF kernel $K_{ij} = \exp(-\gamma \|x_i - x_j\|_2^2)$ with $\gamma = 0.5$. By Mercerâs theorem, $K$ must be PSD; min eig(kernel) validates that property numerically (small negative values, if any, reflect floating-point error). Both tests confirm that covariance and kernel matrices used in PCA and kernel methods are PSD, enabling Cholesky/SVD-based algorithms.

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

Generate toy points: X = toy_pca_points(n=40, seed=0) produces $X \in \mathbb{R}^{40\times 2}$.
Center data: Xc = X - X.mean(0) to form zero-mean features.
Covariance: Sigma = np.cov(Xc, rowvar=False, bias=False) computes $\Sigma \in \mathbb{R}^{2\times 2}$.
Eigenvalues (cov): eig_S = np.linalg.eigvalsh(Sigma); PSD implies $\min eig_S \ge 0$ up to numerical tolerance.
Kernel Gram: K = toy_kernel_rbf(X, gamma=0.5) builds the RBF kernel matrix $K \in \mathbb{R}^{40\times 40}$.
Eigenvalues (kernel): eig_K = np.linalg.eigvalsh(K); PSD implies $\min eig_K \ge 0$; small negatives suggest numerical error or invalid kernel parameters.
Diagnostics: print minimum eigenvalues; if significantly negative, re-check centering, kernel parameters, and numerical precision.

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

Covariance matrices are PSD because they are Gram matrices of centered features.
Kernel Gram matrices (RBF here) are PSD by construction as feature-space inner products.
Eigenvalues of PSD matrices should be nonnegative; tiny negatives are numerical noise.
Minimum eigenvalue is a quick diagnostic for PSD and for kernel validity.
PSD enables efficient solvers (Cholesky) and guarantees convexity of quadratic objectives.
Shape discipline: $\Sigma \in \mathbb{R}^{d\times d}$, $K \in \mathbb{R}^{n\times n}$; both symmetric.

Notes

Part 1: Covariance as Gram Matrix
$\Sigma = \frac{1}{n-1} X_c^\top X_c$ is PSD; for any $z$, $z^\top \Sigma z = \frac{1}{n-1}\|X_c z\|_2^2 \ge 0$. Eigenvalues equal variances along principal directions.

Part 2: Kernel Gram Matrix
For RBF kernel $k(x,x') = \exp(-\gamma \|x-x'\|_2^2)$, $K_{ij} = k(x_i, x_j)$ is PSD because $k$ is positive definite; $K = \Phi \Phi^\top$ for some feature map $\phi$. Any finite Gram built from a PD kernel is PSD.

Part 3: Numerical Verification as Diagnostic
eigvalsh on symmetric matrices should return nonnegative values; tiny negative ($\sim -10^{-12}$) often stems from rounding. Large negative values indicate issues: non-centered data for covariance, invalid kernel parameters, or numerical instability. Checking the minimum eigenvalue is a quick PSD sanity test before Cholesky or GP inference.

Why This Matters for ML
PSD guarantees nonnegative variances and ensures kernels define valid similarity measures. Gaussian processes, kernel ridge regression, and SVMs require PSD Gram matrices for convexity and numerical stability. PCA and spectral methods rely on PSD covariance to yield real, nonnegative spectra. In optimization, PSD Hessians imply convexity.

Numerical and Implementation Notes
Use eigvalsh for symmetric matrices; it is faster and more stable than eigvals. Center data before covariance; otherwise $\Sigma$ includes mean effects. Small negative eigenvalues can be clipped to zero when enforcing PSD (e.g., jitter in GP kernels). Avoid forming $K$ with ill-chosen $\gamma$ that can cause near-rank-deficiency or overflow/underflow in the RBF.

Numerical and Shape Notes
Shapes: $X (n\times d)$, $\Sigma (d\times d)$, $K (n\times n)$. Covariance size depends on feature dimension; kernel Gram depends on sample count. Both are symmetric. For larger $n$, $K$ costs $O(n^2)$ storage and $O(n^3)$ eigendecomp; use approximations (NystrÃ¶m, random features) when needed.

History and Applications

Covariance matrices have been central since Pearson (1901) formalized PCA, relying on PSD structure to interpret variances. Positive-definite kernels gained prominence with Mercer (1909) and were popularized in ML via SVMs and Gaussian processes; their Gram matrices must be PSD for convexity and well-posed inference. Numerically, PSD enables Cholesky factorizations used in GP regression and kernel ridge regression. In modern practice, PSD diagnostics guide kernel choice, hyperparameter tuning (e.g., $\gamma$ in RBF), and stabilization (adding jitter) for large-scale kernel methods.

Connection to Broader Examples

Vector spaces (Ch. 1): PSD means all quadratic forms $z^\top M z \ge 0$; $M$ acts on vectors in $\mathbb{R}^d$ or $\mathbb{R}^n$.
Inner products (Ch. 5): Covariance and kernels are Gram matrices of inner products (explicit or implicit feature maps).
Projections (Ch. 6): Covariance defines projection energy; PSD ensures nonnegative variance along any direction.
Eigen (Ch. 8): PSD matrices have nonnegative eigenvalues and orthogonal eigenvectors; spectral methods rely on this.
PCA (Ch. 11): PCA diagonalizes covariance; PSD guarantees real, nonnegative variances (eigenvalues).
Least squares (Ch. 12): Normal equations involve $X^\top X$ (PSD); conditioning depends on eigenvalues.
Conditioning (Ch. 14): Small eigenvalues indicate near-singularity; PSD with tiny negatives may be numerical noise.

Numerical and Shape Notes

Numerical and Implementation Notes

Pedagogical Significance

ML Examples and Patterns

Comments