Example number
25
Example slug
example_25_psd_checks_covariance_and_kernel_gram_matrices
Background

For any centered data matrix $X_c \in \mathbb{R}^{n\times d}$, the sample covariance $\Sigma = \frac{1}{n-1} X_c^\top X_c$ is symmetric PSD because $x^\top \Sigma x = \frac{1}{n-1}\|X_c x\|_2^2 \ge 0$ for all $x$. PSD implies real, nonnegative eigenvalues and enables factorizations such as Cholesky when strictly PD. Kernel Gram matrices $K_{ij} = k(x_i, x_j)$ are PSD when $k$ is a valid Mercer kernel; the RBF kernel is strictly PD for distinct inputs and $\gamma > 0$.

Numerically, finite precision can introduce tiny negative eigenvalues (e.g., $-10^{-12}$ in float64) even for theoretically PSD matrices. Large negative values indicate construction bugs: missing centering for covariance, incorrect kernel formula, or corrupted inputs. Checking eigenvalues or attempting Cholesky provides a quick sanity check before downstream optimization (e.g., solving linear systems, computing log-determinants) that assumes PSD/PD.

Purpose

Develop a reliable, ML-first checklist for recognizing and verifying positive semidefinite (PSD) structure in covariance matrices and kernel Gram matrices. Covariance matrices are PSD by construction; valid Mercer kernels yield PSD Gram matrices. Knowing how to check PSD quickly (eigenvalues, Cholesky) helps you catch data/implementation issues early, choose numerically stable solvers, and reason about optimization landscapes that rely on PSD assumptions.

You will learn to compute eigenvalue-based diagnostics, interpret tiny negative eigenvalues as numerical noise, and distinguish true PSD violations from floating-point artifacts. The goal is to make PSD verification a routine guardrail for regression, classification with kernels, Gaussian processes, and any method that assumes positive (semi)definiteness.

Problem

Compute min eigenvalues of covariance and RBF Gram matrix; they should be ≥0 (up to tolerance).

Solution (Math)

PSD matrices have nonnegative eigenvalues. Covariance $\propto X_c^TX_c$ is PSD. Kernel Gram matrices are PSD by definition of valid kernels.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points, toy_kernel_rbf

X = toy_pca_points(n=30, seed=0)
Xc = X - X.mean(0)

Sigma = np.cov(Xc, rowvar=False, bias=False)
print("min eig(cov):", np.linalg.eigvalsh(Sigma).min())

K = toy_kernel_rbf(X, gamma=0.5)
print("min eig(K):", np.linalg.eigvalsh(K).min())
Code Introduction

This snippet checks positive semidefinite (PSD) structure by inspecting minimum eigenvalues. Centered data Xc yields a sample covariance $\Sigma = \frac{1}{n-1}X_c^\top X_c$, which is symmetric PSD by construction; np.linalg.eigvalsh(Sigma).min() should be non-negative (up to numerical noise). The RBF kernel matrix K = toy_kernel_rbf(X, gamma=0.5) is also PSD/PD for $\gamma > 0$; its smallest eigenvalue should be non-negative, and strictly positive if points are distinct and $\gamma$ is finite. Tiny negative values near $-10^{-12}$ are typically rounding error. A clearly negative minimum eigenvalue signals a bug in construction (e.g., missing centering for covariance, or an invalid kernel).

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Center data: Xc = X - X.mean(0) ensures covariance is computed around zero mean.
  • Covariance: Sigma = np.cov(Xc, rowvar=False, bias=False) with unbiased $n-1$ denominator and features as columns.
  • Eigen check: np.linalg.eigvalsh(Sigma).min() yields the smallest eigenvalue; expect $\ge -\epsilon$ with small $\epsilon$.
  • Kernel Gram: K = toy_kernel_rbf(X, gamma=0.5) builds an RBF kernel matrix; RBF should be PSD/PD for distinct points.
  • Kernel eigen check: np.linalg.eigvalsh(K).min() should be nonnegative; small negatives suggest numerical noise or implementation issues.
  • Tolerances: choose $\epsilon$ based on dtype (float64: $10^{-12}$–$10^{-10}$); clamp or shift by $\epsilon I$ if downstream algorithms require strict PD.
  • Alternative checks: attempt np.linalg.cholesky (will raise if not PD); for PSD with rank deficiency, inspect singular values via SVD.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Covariance PSD: $\Sigma = \frac{1}{n-1} X_c^\top X_c$ has nonnegative eigenvalues by construction.
  • Kernel PSD: RBF Gram matrices are PSD/PD when inputs are distinct and $\gamma>0$.
  • Eigenvalue diagnostics: np.linalg.eigvalsh reveals PSD violations; tiny negatives are often numerical noise.
  • Practical tolerances: decide thresholds (e.g., $-10^{-10}$ in float64) to distinguish noise from errors.
  • Solver readiness: PSD enables Cholesky/CG; detect issues early to avoid crashes or wrong results.
Notes

Part 1: Core setup - PSD checks: covariance and kernel Gram matrices

State the objects, shapes, and target question for PSD checks: covariance and kernel Gram matrices. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - PSD checks: covariance and kernel Gram matrices

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make PSD checks: covariance and kernel Gram matrices work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - PSD checks: covariance and kernel Gram matrices

Give the computational recipe for PSD checks: covariance and kernel Gram matrices, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

Interpretation and PSD geometry. PSD means $x^\top A x \ge 0$ for all $x$; geometrically, $A$ does not flip directions into negative quadratic form. For covariance, this encodes nonnegative variance along any direction. For kernels, PSD ensures the Gram matrix represents inner products in an implicit feature space.

Numerical diagnostics. Use eigvalsh for symmetric matrices; inspect the minimum eigenvalue. Tiny negative values near machine epsilon are typically rounding. Large negatives indicate model or data bugs. For strict PD needs, add jitter $\epsilon I$ before Cholesky/CG.

Shape discipline. Covariance: (n,d) data, centered to (n,d), produces (d,d) Sigma. Kernel Gram: (n,n) symmetric. Ensure consistent ordering (rows = examples). Track dtype (float64 preferred for spectral checks).

Practical guardrails. Before downstream solves or log-dets, check PSD. If min eigenvalue is below tolerance, recenter data, verify kernel implementation, or regularize. PSD checks are quick, cheap, and catch issues early in pipelines for GP regression, kernel SVMs, and PCA.

History and Applications

PSD and PD matrices emerged in quadratic form theory in the 19th century; covariance PSD was formalized in statistics as variance along any direction. Mercer’s theorem (1909) characterized kernels that yield PSD Gram matrices via feature maps. In numerical linear algebra, PSD/PD structure enables efficient Cholesky factorization, CG, and log-determinant computation.

Modern ML relies heavily on PSD kernels: Gaussian processes, kernel ridge, SVMs, and spectral clustering all assume PSD Gram matrices. Covariance PSD underpins PCA and probabilistic modeling. Practical pipelines add jitter to ensure PD before Cholesky/CG, and monitor eigenvalues to diagnose conditioning. PSD checks remain a fast, high-value diagnostic step before optimization or inference.

Connection to Broader Examples
  • Chapter 9 (PSD/PD): core definitions, eigenvalue characterizations, and factorizations.
  • Chapter 10 (SVD): singular values of $X_c$ squared (scaled) give covariance eigenvalues; PSD structure is explicit.
  • Chapter 11 (PCA): covariance PSD underlies principal components; eigenvectors/eigenvalues define variance directions.
  • Chapter 14 (conditioning): small eigenvalues signal ill-conditioning; adding jitter/regularization improves stability.
  • Chapter 13 (solving systems): PD enables fast/stable Cholesky; PSD with low rank may need SVD/CG with regularization.
  • Kernels in ML: Gaussian processes, kernel ridge, SVMs, and spectral clustering rely on PSD Gram matrices.

Comments