Example number
13
Example slug
example_13_solving_spd_system_with_cholesky
Background

For SPD matrices, Cholesky gives a stable and efficient factorization $A = LL^\top$ with $L$ lower triangular. Compared with generic LU, Cholesky halves storage, avoids pivoting (under SPD), and minimizes rounding error accumulation. In ML, such systems appear in Gaussian processes (kernel matrices), normal equations for least squares (when SPD), and second-order methods (Hessian approximations). The forward/back substitution sequence $Ly=b$, $L^\top x=y$ avoids forming inverses and preserves numerical stability. Conditioning governs the attainable accuracy: if $\kappa(A)$ is large, small perturbations in $b$ or roundoff can enlarge errors in $x$—even when using stable algorithms.

Purpose

Build intuition that many “hard to train” ML issues are really linear solves under ill-conditioning, and that choosing the right factorization yields accuracy and speed:

  • When $A$ is symmetric positive definite (SPD), use Cholesky: $A = LL^\top$.
  • Solve by two triangular systems (forward/back substitution) instead of inverting.
  • Reuse the factorization for multiple right-hand sides—amortizing cost in training loops.
  • Recognize conditioning: small pivots/eigenvalues cause instability; choose robust solvers.
Problem

Solve Ax=b with Cholesky and compare to np.linalg.solve.

Solution (Math)

For SPD $A$, Cholesky $A=LL^T$. Solve $Ly=b$ then $L^Tx=y$. This is stable and efficient.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np

A = np.array([[4.,1.],[1.,3.]])
b = np.array([1.,2.])

L = np.linalg.cholesky(A)
y = np.linalg.solve(L,b)
x = np.linalg.solve(L.T,y)

xref = np.linalg.solve(A,b)
print("x:", x, "xref:", xref, "diff:", np.linalg.norm(x-xref))
Code Introduction

This snippet solves a symmetric positive definite linear system via Cholesky factorization. The matrix $A$ is SPD, so it admits a factorization $A = L L^\top$ with $L$ lower triangular. The code computes $L = \mathrm{chol}(A)$, then solves two triangular systems: first $Ly = b$ (forward substitution), then $L^\top x = y$ (back substitution). Triangular solves avoid explicit inverses and are numerically stable. The final line compares $x$ against the direct solution np.linalg.solve(A, b) and prints the norm of their difference, which should be near zero (floating-point roundoff).

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Factorization: L = np.linalg.cholesky(A) requires $A$ to be symmetric positive definite; otherwise it raises LinAlgError.
  • Solves: y = np.linalg.solve(L, b) (forward substitution), then x = np.linalg.solve(L.T, y) (back substitution).
  • Reference: xref = np.linalg.solve(A, b) uses a general-purpose solver (typically LU with pivoting) for comparison.
  • Diagnostic: print np.linalg.norm(x - xref); expect values on the order of $10^{-15}$–$10^{-12}$ for well-conditioned $A$.
  • Multiple RHS: if solving for many $b$, reuse L to avoid repeated factorizations.
  • Failure modes: if cholesky fails, check symmetry, ensure SPD (eigenvalues > 0), optionally add a small jitter (e.g., $10^{-8}I$), or use scipy.linalg.cho_factor/cho_solve for convenience.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Cholesky factorization for SPD: $A=LL^\top$ with $L$ lower triangular.
  • Solving via two triangular systems equals the direct solve np.linalg.solve(A,b) up to roundoff.
  • Numerical sanity check: $\|x - x_{\text{ref}}\|_2$ is near machine precision.
  • Efficiency: factor once, solve many right-hand sides at $O(n^2)$ each.
  • Practical guardrails: detect non-SPD (factorization failure) and add diagonal jitter or switch methods.
Notes

Part 1: Core setup - Solving SPD system with Cholesky

State the objects, shapes, and target question for Solving SPD system with Cholesky. Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - Solving SPD system with Cholesky

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make Solving SPD system with Cholesky work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - Solving SPD system with Cholesky

Give the computational recipe for Solving SPD system with Cholesky, note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: check dimensions before manipulating formulas.
  • Numerical note: prefer stable primitives (lstsq, QR/SVD, Cholesky for SPD) over explicit inverses.
  • Interpretation: relate algebraic steps to geometry (subspaces, projections) and to ML behavior (generalization, stability).

Why Cholesky: for SPD matrices, Cholesky is faster and more stable than generic LU—no pivoting needed, fewer floating-point operations, and better numerical properties. Factor once in $O(n^3)$; each subsequent right-hand side solves in $O(n^2)$. If the factorization fails, the matrix is not SPD up to numerical tolerances; check symmetry, add a small diagonal jitter, or switch to QR/SVD or CG.

History and Applications

Origins of Cholesky: The Cholesky factorization is named after André-Louis Cholesky (early 20th century), who developed it for geodesy computations. It exploits symmetry and positive definiteness to factor $A$ into $LL^\top$ efficiently, avoiding pivoting required by general LU.

From theory to practice: With the advent of floating‑point arithmetic and stability analysis (Wilkinson, Higham), Cholesky became the default method for SPD systems because it halves storage, reduces flops, and accumulates less rounding error than LU. Its success carries into modern libraries (LAPACK, NumPy, SciPy) and hardware‑accelerated BLAS.

ML applications: SPD solves appear in Gaussian processes (kernel Gram matrices plus jitter), Kalman filtering and smoothing (covariance updates), normal equations for least squares (when acceptable), and Newton/Quasi‑Newton methods (approximate Hessians). In large‑scale settings, Conjugate Gradient (CG) solves SPD systems iteratively using only matrix‑vector products; sparse Cholesky leverages sparsity patterns for graph‑structured problems. Adding a small diagonal jitter ensures numerical SPD when eigenvalues are near zero, enabling robust factorizations in practice.

Connection to Broader Examples
  • Least squares (Ch. 12): Normal equations yield SPD matrices; Cholesky can be used when conditioning is acceptable; otherwise prefer QR/SVD.
  • PSD/PD (Ch. 9): SPD is the strict PSD case; guarantees unique solutions and valid Cholesky.
  • Conditioning (Ch. 14): The condition number controls sensitivity; stable factorizations mitigate but cannot remove ill-conditioning.
  • Sparse/large-scale: In big problems, sparse Cholesky and iterative methods (CG for SPD) inherit the same principles—exploit structure and maintain stability.

Comments