ex1.ai

Example number

Example slug

example_89_psd_in_ml_covariance_and_kernel_gram_matrices

Background

The concept of positive definite matrices emerged in the late 1800s from work on quadratic forms in algebra and physics (energy functions in mechanics must be non-negative). Gram matrices (matrices of the form $G = B^\top B$) were studied by JÃ¸rgen Pedersen Gram (1850â1916) in the context of orthogonalization. The connection to inner products and positive definiteness became clear: any Gram matrix is PSD because $x^\top G x = \|Bx\|_2^2 \ge 0$. Mercerâs theorem (1909) formalized the theory of positive definite kernels: continuous functions $k(x, y)$ that induce PSD Gram matrices correspond to inner products in implicit feature spaces (reproducing kernel Hilbert spaces, RKHS). This became foundational for kernel methods in ML (1990sâ2000s): support vector machines, kernel ridge regression, Gaussian processes. The Cholesky decomposition (AndrÃ©-Louis Cholesky, 1910s) provides a numerically stable algorithm for decomposing PSD matrices into $A = LL^\top$, enabling efficient sampling, solves, and log-determinant computation. In modern ML (2010s+), PSD-ness is checked via eigenvalue computation (eigvalsh in NumPy), and regularization ($\lambda I$) is added routinely to ensure numerical stability in ill-conditioned problems (ridge regression, Gaussian process inference, covariance estimation in high dimensions).

Purpose

Understand the positive semidefinite (PSD) property that guarantees stability, convexity, and well-defined inner products across machine learning. Master why covariance matrices and kernel Gram matrices are always PSD, what this means for optimization (convex quadratic forms, stable Cholesky decomposition), and how to verify PSD-ness numerically (checking minimum eigenvalues). Learn to diagnose numerical issues (small negative eigenvalues from rounding), apply regularization ($\lambda I$) to recover strict positive definiteness, and recognize when PSD-ness breaks down (invalid kernels, numerical corruption). This property is the foundation of numerical reliability in ML: ridge regression, Gaussian processes, kernel methods, spectral clustering, and covariance-based methods all require PSD matrices to function correctly.

Problem

Compute eigenvalues of (1) a covariance matrix and (2) an RBF kernel Gram matrix; report minimum eigenvalues.

Solution (Math)

Covariance matrices are PSD because they are of the form $X_c^TX_c$ (up to scaling). Valid kernel Gram matrices are PSD because they equal inner products in a feature space. Eigenvalues should be nonnegative up to numerical tolerance.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


import numpy as np
from scripts.toy_data import toy_pca_points, toy_kernel_rbf

X = toy_pca_points(n=40, seed=0)
Xc = X - X.mean(0)
Sigma = np.cov(Xc, rowvar=False, bias=False)
eig_S = np.linalg.eigvalsh(Sigma)
print("min eig(cov):", eig_S.min())

K = toy_kernel_rbf(X, gamma=0.5)
eig_K = np.linalg.eigvalsh(K)
print("min eig(kernel):", eig_K.min())

Code Introduction

This code verifies positive semidefiniteness in two standard ML matrices. It builds a sample dataset, computes a covariance matrix Î£, and an RBF kernel Gram matrix K, then reports their minimum eigenvalues using eigvalsh (symmetric eigensolver).

Covariance PSD: Î£ = cov(Xc) is a Gram matrix = (1/(n-1)) Xc.T @ Xc, so all eigenvalues are â¥ 0 up to rounding.
Kernel PSD: K[i,j] = exp(-Î³ ||x_i - x_j||^2) yields a PSD Gram matrix by Mercerâs theorem.
Computation: Use np.linalg.eigvalsh for stability; print eig_S.min() and eig_K.min() to check PSD numerically.
Tip: If small negatives appear (e.g., -1e-12), treat them as numerical noise; add Î»I to enforce strict PD for Cholesky-based methods.

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

eigvalsh vs.Â eig: np.linalg.eigvalsh exploits symmetry for better stability and efficiency ($O(d^3)$ with better constants). General eig doesnât assume symmetry and may return complex eigenvalues for non-symmetric rounding errors. Always use eigvalsh for PSD matrices.
Covariance computation: np.cov(Xc, rowvar=False, bias=False) computes the unbiased estimator $\frac{1}{n-1} X_c^\top X_c$. The bias=False flag uses $n-1$ denominator (Besselâs correction); bias=True uses $n$ (biased estimator).
Kernel matrix construction: toy_kernel_rbf(X, gamma) computes $K_{ij} = \exp(-\gamma \|x_i - x_j\|^2)$ for all pairs. Complexity: $O(n^2 d)$ for $n$ examples in $\mathbb{R}^d$. For large $n$ (e.g., $10^4$), this is expensive; approximations (NystrÃ¶m, random features) reduce cost.
Eigenvalue sorting: eigvalsh returns eigenvalues in ascending order by default. Minimum is eig[0], maximum is eig[-1].
Numerical threshold for zero: Machine epsilon for double precision is $\sim 2.22 \times 10^{-16}$. Eigenvalues in range $[-10^{-10}, 0]$ are treated as numerically zero (rank deficiency). Values $< -10^{-10}$ indicate true indefiniteness (non-PSD).
Regularization for stability: Adding $\lambda I$ shifts all eigenvalues up by $\lambda$: $\text{eig}(A + \lambda I) = \text{eig}(A) + \lambda$. For $\lambda > |\lambda_{\min}(A)|$, the matrix becomes strictly positive definite.
Cholesky decomposition: np.linalg.cholesky(A) computes $L$ such that $A = LL^\top$. Fails if $A$ is not strictly PD (raises LinAlgError). Complexity: $O(d^3)$ for $d \times d$ matrix. Used for sampling from Gaussian distributions, solving linear systems, and computing log-determinants.
Memory for kernel matrices: For $n = 10^4$ examples, $K \in \mathbb{R}^{10^4 \times 10^4}$ requires $\sim 800$ MB (double precision). For $n = 10^5$, this exceeds typical RAM. Use low-rank approximations (NystrÃ¶m, random Fourier features) or iterative methods that avoid forming $K$ explicitly.

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

Covariance matrices are always PSD: By construction, $\Sigma = \frac{1}{n-1} X_c^\top X_c$ (Gram matrix), so all eigenvalues are $\ge 0$. This guarantees that variance is non-negative and PCA is well-defined.
Kernel Gram matrices are PSD by Mercerâs theorem: Valid kernels (RBF, polynomial, etc.) produce PSD Gram matrices, ensuring kernel methods define valid inner products in RKHS.
Eigenvalue inspection verifies PSD-ness: Computing $\lambda_{\min}$ checks whether all eigenvalues are non-negative. Values $< -10^{-10}$ indicate numerical issues or invalid construction.
eigvalsh for symmetric matrices: Specialized algorithm for symmetric/Hermitian matrices is more stable and efficient than general eig. Always use eigvalsh for covariance/kernel matrices.
Numerical precision vs.Â mathematical zero: Even for truly PSD matrices, floating-point rounding may produce eigenvalues $\sim -10^{-15}$. Thresholds around $-10^{-10}$ distinguish numerical noise from genuine indefiniteness.
Cholesky decomposition requires strict PD: $A = LL^\top$ exists only if all eigenvalues are strictly positive ($> 0$). Regularization $A + \lambda I$ ensures this for ill-conditioned or rank-deficient matrices.
Condition number from eigenvalues: $\kappa(A) = \lambda_{\max} / \lambda_{\min}$ measures ill-conditioning. Large $\kappa$ (e.g., $> 10^{10}$) requires regularization or preconditioning for stable solves.

Notes

Part 1: Core setup - PSD in ML: covariance and kernel Gram matrices

State the objects, shapes, and target question for PSD in ML: covariance and kernel Gram matrices. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - PSD in ML: covariance and kernel Gram matrices

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make PSD in ML: covariance and kernel Gram matrices work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - PSD in ML: covariance and kernel Gram matrices

Give the computational recipe for PSD in ML: covariance and kernel Gram matrices, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

Shape discipline: check dimensions before manipulating formulas.
Numerical note: prefer stable primitives (lstsq, QR/SVD, Cholesky for SPD) over explicit inverses.
Interpretation: relate algebraic steps to geometry (subspaces, projections) and to ML behavior (generalization, stability).

History and Applications

History: Positive (semi)definite matrices arose in the study of quadratic forms and energy in physics (19th century). Gram matrices connect to inner products ($G = B^\top B$), while Mercerâs theorem (1909) established the foundation for positive definite kernels. The Cholesky decomposition (early 1900s) provided a stable factorization for SPD matrices, central to numerical linear algebra.

Applications: PSD structure underpins covariance estimation, Gaussian processes, kernel methods (SVM, kernel ridge), and graph Laplacians in spectral clustering. In optimization, $\nabla^2 f \succeq 0$ ensures convexity. Practically, adding $\lambda I$ enforces strict PD for stable Cholesky solves; eigenvalue checks guard against invalid kernels or numerical corruption.

Connection to Broader Examples

PCA and covariance (Ex 11, 83): PCA diagonalizes the covariance matrix $\Sigma$. PSD-ness guarantees eigenvalues (variances) are non-negative, ensuring interpretable principal components.
SVD and Gram matrices (Ex 10): Covariance $\Sigma = \frac{1}{n-1} X_c^\top X_c$ relates to SVD: eigenvectors of $\Sigma$ are right singular vectors of $X_c$. PSD-ness follows from SVD structure.
Eigenvectors and power iteration (Ex 88): Power iteration on PSD matrices converges to the dominant eigenvector. PSD-ness ensures real eigenvalues and orthonormal eigenvectors.
Least squares and normal equations (Ex 12, 86): Normal equations $(X^\top X) w = X^\top y$ involve the Gram matrix $X^\top X$ (PSD). Ridge regression $(X^\top X + \lambda I)$ ensures strict PD for stable solves.
Kernel methods and RKHS: Support vector machines, kernel ridge regression, and Gaussian processes operate on kernel Gram matrices. PSD-ness ensures the kernel defines a valid inner product in the induced feature space.
Regularization (Ridge, GP inference): Regularization adds $\lambda I$ to covariance or kernel matrices, shifting eigenvalues to ensure strict PD-ness. Essential for numerical stability in ill-conditioned problems.
Optimization and convexity: Quadratic objectives $f(w) = w^\top A w$ are convex if $A$ is PSD (Hessian $\nabla^2 f = 2A \succeq 0$). PSD-ness guarantees local minima are global.
Spectral clustering (graph methods): Graph Laplacians $L = D - A$ (degree minus adjacency) are PSD. Eigendecomposition of $L$ enables spectral clustering via eigenvectors.