Example number
9
Example slug
example_9_psd_checks_covariance_and_kernel_gram_matrices
Background

A symmetric matrix is PSD if all eigenvalues are nonnegative. Covariance matrices have the form $\Sigma = \frac{1}{n-1} X_c^\top X_c$, so for any vector $u$, $u^\top \Sigma u = \|X_c u\|_2^2 \ge 0$, making $\Sigma$ PSD by construction. Kernel Gram matrices are PSD when the kernel is valid (Mercer): for any coefficients $\alpha$, $\alpha^\top K \alpha = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \ge 0$.

In floating-point arithmetic, tiny negative eigenvalues can appear due to roundoff; these are tolerated if they are close to zero. Substantial negative eigenvalues indicate an invalid kernel implementation, lack of centering for covariance, or numerical instability.

Purpose

Build a practical intuition for positive semidefinite (PSD) matrices in ML:

  • Recognize that covariance matrices and valid kernel Gram matrices are PSD (eigenvalues nonnegative).
  • Use the minimum eigenvalue as a quick diagnostic for PSD vs numerical noise.
  • Connect PSD structure to stability of downstream solvers (e.g., Cholesky, Gaussian processes).
  • Keep shapes straight: $X\in\mathbb{R}^{n\times d}$, $\Sigma\in\mathbb{R}^{d\times d}$, $K\in\mathbb{R}^{n\times n}$.
Problem

Compute min eigenvalues of covariance and RBF Gram matrix; they should be ≥0 (up to tolerance).

Solution (Math)

PSD matrices have nonnegative eigenvalues. Covariance $\propto X_c^TX_c$ is PSD. Kernel Gram matrices are PSD by definition of valid kernels.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points, toy_kernel_rbf

X = toy_pca_points(n=30, seed=0)
Xc = X - X.mean(0)

Sigma = np.cov(Xc, rowvar=False, bias=False)
print("min eig(cov):", np.linalg.eigvalsh(Sigma).min())

K = toy_kernel_rbf(X, gamma=0.5)
print("min eig(K):", np.linalg.eigvalsh(K).min())
Code Introduction

This snippet checks positive semidefiniteness of two canonical matrices: the sample covariance and an RBF kernel Gram matrix. The data $X\in\mathbb{R}^{30\times 2}$ come from toy_pca_points; centering gives $X_c = X - \bar{X}$. The sample covariance $\Sigma = \frac{1}{n-1} X_c^\top X_c$ is symmetric PSD by construction, so its smallest eigenvalue should be nonnegative up to numerical noise. Next, it builds an RBF kernel $K$ with bandwidth parameter $\gamma = 0.5$; kernel Gram matrices are also symmetric PSD. The printed minimum eigenvalues should be near zero or positive. Tiny negative values indicate floating-point error, not a violation of PSD; large negative values signal a bug or ill-conditioning.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Center data: $X_c = X - \text{mean}(X)$; compute covariance with np.cov(Xc, rowvar=False, bias=False) to get $\Sigma\in\mathbb{R}^{d\times d}$.
  • Use np.linalg.eigvalsh(Sigma) (specialized for symmetric matrices) and inspect min(); expect $\ge -\varepsilon$ with $\varepsilon$ small (e.g., $10^{-12}$) due to roundoff.
  • Build RBF Gram matrix with toy_kernel_rbf(X, gamma=0.5); eigenvalues should also be nonnegative up to tolerance.
  • If minimum eigenvalues are materially negative, recheck centering, kernel parameters, or add a small jitter (e.g., $10^{-6}$ on the diagonal) before Cholesky.
  • Shapes: $X\in\mathbb{R}^{n\times d}$ (here $n=30$, $d=2$), $\Sigma\in\mathbb{R}^{2\times 2}$, $K\in\mathbb{R}^{30\times 30}$.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Covariance matrices are PSD; kernel Gram matrices for valid kernels (e.g., RBF) are PSD.
  • Minimum eigenvalues should be nonnegative up to numerical tolerance; small negatives reflect roundoff.
  • np.linalg.eigvalsh is preferred for symmetric/PSD checks.
  • PSD diagnostics are a first step before applying Cholesky or other SPD-requiring methods.
Notes

Part 1: Core setup - PSD checks: covariance and kernel Gram matrices

State the objects, shapes, and target question for PSD checks: covariance and kernel Gram matrices. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - PSD checks: covariance and kernel Gram matrices

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make PSD checks: covariance and kernel Gram matrices work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - PSD checks: covariance and kernel Gram matrices

Give the computational recipe for PSD checks: covariance and kernel Gram matrices, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: $X\in\mathbb{R}^{n\times d}$, $X_c$ centered, $\Sigma\in\mathbb{R}^{d\times d}$ symmetric, $K\in\mathbb{R}^{n\times n}$ symmetric.
  • Numerical note: use eigvalsh for symmetric matrices; tiny negative eigenvalues ($\approx -10^{-12}$) are roundoff, large negatives indicate an error.
  • Interpretation: PSD ensures convex quadratic forms and nonnegative variances; SPD (strictly positive) is needed for unique Cholesky and inverses.
  • Stability: If needed for downstream Cholesky, add a small jitter to the diagonal to enforce SPD when eigenvalues are near zero.
History and Applications

Positive semidefinite (PSD) matrices emerged from quadratic form theory in the 18th century. A matrix is PSD if its associated quadratic form $x^\top A x$ is always nonnegative—a condition that ties to geometry (axes of an ellipsoid) and optimization (convex quadratics).

Mercer’s theorem and kernels: James Mercer (1909) proved that a continuous, symmetric, positive semidefinite kernel function (in the integral operator sense) admits an eigenfunction expansion. This theoretical result enabled a powerful trick: replace unknown nonlinear decision boundaries with kernel methods on fixed-dimensional feature spaces. Support Vector Machines (SVMs, Vapnik & Colleagues, 1990s) popularized kernels in ML, making PSD Gram matrices a practical computational object.

Cholesky and numerical stability: The Cholesky factorization $A = L L^\top$ exists uniquely for PSD matrices and is numerically stable. In Gaussian processes and other probabilistic methods, checking PSD (via eigenvalues or Cholesky success) is a critical diagnostic before solving linear systems or sampling.

Connection to Broader Examples
  • PSD is required for Cholesky factorization and Gaussian process kernels; this mirrors conditioning and stability themes.
  • Covariance PSD underpins PCA/SVD (Chapter 10) and least-squares normal equations (Chapter 12).
  • Kernel Gram PSD supports SVMs, kernel ridge regression, and kernel PCA.
  • Inspecting eigenvalues is a lightweight diagnostic before more expensive factorizations.

Comments