ex1.ai

Example number

Example slug

example_73_psd_in_ml_covariance_and_kernel_gram_matrices

Background

PSD matrices arise throughout ML: covariance matrices encode data variance, kernel Gram matrices enable nonlinear methods (SVM, GP, kernel PCA), and Hessians of convex objectives are PSD at minima. Verifying PSD via eigenvalues ensures numerical stability (Cholesky decomposition requires PD), convexity (quadratic forms $v^\top A v \ge 0$), and interpretability (eigenvalues measure variance or effective dimensionality). The RBF kernel is a canonical example of a positive definite kernel (Mercerâs theorem), guaranteeing PSD Gram matrices for arbitrary finite point sets.

Purpose

Build deep intuition for why covariance matrices and valid kernel Gram matrices are positive semidefinite (PSD), how to verify this numerically via eigenvalues, and why PSD structure is foundational for convex optimization, numerical stability, and interpretability in ML. You will see that PSD is not an abstract property but a concrete guarantee arising from Gram matrix factorizations and kernel inner products.

Problem

Compute eigenvalues of (1) a covariance matrix and (2) an RBF kernel Gram matrix; report minimum eigenvalues.

Solution (Math)

Covariance matrices have the form:

\[ \Sigma = \frac{1}{n-1} X_c^\top X_c \in \mathbb{R}^{d\times d}, \]

where $X_c = X - \mu$ is centered data. For any $v \in \mathbb{R}^d$:

\[ v^\top \Sigma v = \frac{1}{n-1} v^\top X_c^\top X_c v = \frac{1}{n-1} \|X_c v\|_2^2 \ge 0, \]

so $\Sigma$ is PSD (all eigenvalues $\lambda_i \ge 0$). Valid kernel Gram matrices $K \in \mathbb{R}^{n\times n}$ satisfy:

\[ K_{ij} = k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}, \]

where $\phi$ maps to a reproducing kernel Hilbert space (RKHS). Thus $K = \Phi \Phi^\top$ is a Gram matrix, ensuring PSD.

Notation: - $X \in \mathbb{R}^{n\times d}$: data matrix (rows = examples) - $\Sigma \in \mathbb{R}^{d\times d}$: covariance matrix - $K \in \mathbb{R}^{n\times n}$: kernel Gram matrix - $\lambda_i$: eigenvalues (nonnegative for PSD)

Solution (Python)


import numpy as np
from scripts.toy_data import toy_pca_points, toy_kernel_rbf

X = toy_pca_points(n=40, seed=0)
Xc = X - X.mean(0)
Sigma = np.cov(Xc, rowvar=False, bias=False)
eig_S = np.linalg.eigvalsh(Sigma)
print("min eig(cov):", eig_S.min())

K = toy_kernel_rbf(X, gamma=0.5)
eig_K = np.linalg.eigvalsh(K)
print("min eig(kernel):", eig_K.min())

Code Introduction

This snippet verifies that two common ML matricesâthe sample covariance matrix and an RBF kernel Gram matrixâare positive semidefinite (PSD) by computing their eigenvalues and confirming all are nonnegative.

The code generates synthetic data $X \in \mathbb{R}^{40 \times 2}$ via toy_pca_points, centers it ($X_c = X - \mu$), and computes the unbiased sample covariance $\Sigma = \frac{1}{n-1} X_c^\top X_c \in \mathbb{R}^{2 \times 2}$. Using np.linalg.eigvalsh(Sigma) returns eigenvalues in ascending order. For any symmetric $\Sigma = \frac{1}{n-1} X_c^\top X_c$, the matrix is a Gram matrix (inner products of data vectors), ensuring PSD: $v^\top \Sigma v = \frac{1}{n-1} \|X_c v\|_2^2 \ge 0$ for all $v$. This implies $\lambda_i \ge 0$. The code checks eig_S.min() to verify the smallest eigenvalue is nonnegative (typically $\sim 10^{-10}$ or larger). A positive minimum eigenvalue confirms $\Sigma$ is positive definite (PD) (strictly positive eigenvalues), which occurs when $X_c$ has full column rank.

The Gram matrix $K \in \mathbb{R}^{40 \times 40}$ is constructed via the RBF kernel: $K_{ij} = \exp(-\gamma \|x_i - x_j\|_2^2)$ with $\gamma = 0.5$. The RBF kernel is a positive definite kernel (Mercerâs theorem), guaranteeing $K$ is PSD for any finite point set: $\alpha^\top K \alpha \ge 0$ for all $\alpha$. This follows from the infinite-dimensional feature map interpretation: $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$, so $K = \Phi \Phi^\top$ is a Gram matrix. Computing eig_K = np.linalg.eigvalsh(K) returns eigenvalues; eig_K.min() should be nonnegative (modulo floating-point roundoff $\sim 10^{-15}$).

Shape discipline: $X \in \mathbb{R}^{40\times 2}$, $X_c \in \mathbb{R}^{40\times 2}$, $\Sigma \in \mathbb{R}^{2\times 2}$, $K \in \mathbb{R}^{40\times 40}$. Gotchas: Use eigvalsh for symmetric matrices (faster/more accurate than eig); check minimum eigenvalue with tolerance (e.g., > -1e-10) to account for roundoff; zero eigenvalues indicate rank deficiency (PSD but not PD).

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

Generate data: $X \in \mathbb{R}^{40\times 2}$ via toy_pca_points(n=40, seed=0).
Center: $X_c = X - \mu$ where $\mu = \frac{1}{n} \sum_i x_i$.
Covariance: Sigma = np.cov(Xc, rowvar=False, bias=False) yields $\Sigma = \frac{1}{n-1} X_c^\top X_c \in \mathbb{R}^{2\times 2}$.
Eigenvalues: eig_S = np.linalg.eigvalsh(Sigma) returns eigenvalues in ascending order; check eig_S.min() >= 0.
Kernel Gram: K = toy_kernel_rbf(X, gamma=0.5) computes $K_{ij} = \exp(-0.5 \|x_i - x_j\|^2) \in \mathbb{R}^{40\times 40}$.
Kernel eigenvalues: eig_K = np.linalg.eigvalsh(K) should all be nonnegative.
Typical outputs: min eig(cov) $\approx 0.02$ (positive, so $\Sigma$ is PD); min eig(kernel) $\approx 0.001$ (positive, so $K$ is PD).

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

Covariance matrices $\Sigma = \frac{1}{n-1} X_c^\top X_c$ are always PSD (Gram matrix structure).
Valid kernel Gram matrices $K$ (e.g., RBF) are PSD due to feature space inner products (Mercerâs theorem).
Eigenvalue verification: np.linalg.eigvalsh computes eigenvalues; minimum should be $\ge 0$ (modulo roundoff $\sim 10^{-15}$).
PSD vs PD: PSD allows zero eigenvalues (rank-deficient); PD requires all $\lambda_i > 0$ (full rank, invertible).
Numerical stability: small eigenvalues indicate near-singularity; regularization ($A + \epsilon I$) ensures invertibility.
ML implications: PSD structure enables Cholesky solves, convex optimization, and interpretable variance decompositions.

Notes

Part 1: Sample Covariance is PSD. $\Sigma = \frac{1}{n-1} X_c^\top X_c$ is a Gram matrix: $v^\top \Sigma v = \frac{1}{n-1} \|X_c v\|_2^2 \ge 0$ for all $v$, so eigenvalues $\lambda_i \ge 0$. If $X_c$ has full column rank ($\text{rank}(X_c) = d$), $\Sigma$ is PD (all $\lambda_i > 0$); if $n < d$ or data are collinear, some $\lambda_i = 0$ (PSD but not PD).
Part 2: RBF Kernel Gram Matrix is PSD. The RBF kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$ is positive definite (Mercer), meaning $K_{ij} = k(x_i, x_j)$ satisfies $\alpha^\top K \alpha \ge 0$ for all $\alpha \in \mathbb{R}^n$. This follows from the feature map interpretation: $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$ for an infinite-dimensional RKHS, so $K = \Phi \Phi^\top$ is PSD.
Part 3: Eigenvalue Verification and Numerical Tolerance. np.linalg.eigvalsh computes eigenvalues of symmetric matrices; minimum should be $\ge 0$ for PSD. Floating-point roundoff can produce tiny negative values (e.g., $-10^{-16}$)âuse tolerance: assert eig.min() > -1e-10. Zero eigenvalues indicate rank deficiency (nullspace); positive eigenvalues confirm PD (invertible).
Why This Matters for ML. PSD structure ensures convex optimization (quadratic forms $v^\top A v \ge 0$), numerical stability (Cholesky for solves, regularization for invertibility), and interpretability (eigenvalues = variances or effective rank). Ridge regression, kernel methods, and Gaussian processes all require PSD/PD matrices; violations cause non-convex objectives or undefined posteriors.
ML Examples and Patterns. Ridge: $(X^\top X + \lambda I)$ is PD for $\lambda > 0$, ensuring unique solutions. GP: $(K + \sigma^2 I)^{-1}$ requires PD for posterior mean/covariance. Kernel PCA: diagonalize $K = U\Lambda U^\top$; top-$k$ eigenvectors define feature-space PCs. NystrÃ¶m: low-rank kernel approximation via submatrix $K_{mm}$, which must be PD (invertible). Covariance shrinkage: regularize $\Sigma$ for small $n$ to ensure invertibility.
Connection to Linear Algebra Theory. Spectral theorem: symmetric $A = V\Lambda V^\top$ with orthonormal $V$; PSD means $\lambda_i \ge 0$. Gram matrix: $A = B^\top B$ is always PSD. Cholesky: PD $A = LL^\top$ with lower-triangular $L$; costs $O(d^3)$ for $A \in \mathbb{R}^{d\times d}$. Kernel positive definiteness: Mercerâs theorem ensures $k$ induces PSD Gram matrices via RKHS feature maps. Matrix square root: for PSD $A$, define $A^{1/2} = V\Lambda^{1/2}V^\top$.
Numerical and Implementation Notes. Use eigvalsh (symmetric) not eig (general)âfaster and more accurate. Check eig.min() > -tol with $\text{tol} \approx 10^{-10}$ for roundoff. Condition number: $\kappa = \lambda_{\max}/\lambda_{\min}$; large $\kappa$ (e.g., $> 10^{10}$) causes instability. Regularization: add $\epsilon I$ (e.g., $\epsilon = 10^{-6}$) to ensure PD and floor minimum eigenvalue. For large $n$, avoid forming $\Sigma$ explicitly; use randomized/incomplete methods.
Numerical and Shape Notes. $X \in \mathbb{R}^{n\times d}$, $X_c \in \mathbb{R}^{n\times d}$, $\Sigma \in \mathbb{R}^{d\times d}$, $K \in \mathbb{R}^{n\times n}$. Covariance: rowvar=False treats columns as variables (default treats rows as variables). bias=False uses $n-1$ denominator (unbiased); bias=True uses $n$ (MLE). Kernel: symmetric $K = K^\top$; diagonal $K_{ii} = 1$ for normalized kernels.
Pedagogical Significance. This is the canonical demonstration that structural properties (Gram matrix, valid kernel) guarantee PSD. Key insights: (1) Covariance = Gram matrix of centered data. (2) Kernels = implicit inner products in feature space. (3) Eigenvalues diagnose PSD/PD and conditioning. (4) PSD enables convexity and stability. Understanding when and why matrices are PSD is foundational for kernel methods, regularization, and numerical linear algebra in ML.

History and Applications

The theory of positive definite matrices traces to quadratic forms in 19th-century algebra (Sylvester, Cauchy). Mercer (1909) formalized positive definite kernels and their connection to integral equations, establishing that valid kernels correspond to inner products in Hilbert spaces. Aronszajn (1950) developed the theory of reproducing kernel Hilbert spaces (RKHS), providing the mathematical foundation for modern kernel methods. In ML, kernel methods emerged in the 1990s with support vector machines (Vapnik, Boser, Guyon) and Gaussian processes (Williams, Rasmussen), both relying on PSD Gram matrices for convex optimization and probabilistic inference. The RBF (Gaussian) kernel became ubiquitous due to its universal approximation properties and smoothness. In modern practice, verifying PSD via eigenvalues is routine for kernel matrix validation, covariance regularization (Ledoit-Wolf shrinkage, 2004), and numerical stability checks in Cholesky-based solves. PSD structure underpins convex optimization, ensuring global optima in ridge regression, kernel ridge regression, and Gaussian process maximum likelihood estimation.

Connection to Broader Examples

Links to Chapter 9 (PSD/PD): This is the canonical verificationâeigenvalue checks confirm PSD structure.
Connects to Chapter 11 (PCA): Covariance eigenvalues are variances along principal directions; PSD ensures real nonnegative variances.
Relates to Chapter 8 (eigenvalues): Spectral theorem for symmetric matrices; PSD means all $\lambda_i \ge 0$.
Bridges to Chapter 13 (solving systems): Cholesky decomposition ($A = LL^\top$) requires PD; used for efficient GP posterior inference.
Complements Chapter 14 (conditioning): Small eigenvalues â large condition number $\kappa = \lambda_{\max}/\lambda_{\min}$; regularization improves stability.
Reinforces kernel methods (SVM, GP): Kernel Gram matrices must be PSD for convex dual optimization and well-defined posteriors.

Numerical and Shape Notes

Numerical and Implementation Notes

Pedagogical Significance

ML Examples and Patterns

Comments