Example number
29
Example slug
example_29_solving_spd_system_with_cholesky
Background

Numerical linear algebra developed stability theory to make computation reliable under finite precision; ML inherits these issues at scale. André-Louis Cholesky, a French military officer and geodesist, developed the decomposition in the early 1900s for solving least squares problems in surveying. The method remained obscure until the 1950s when numerical analysts recognized its superior stability and efficiency for SPD systems. Solving $Ax = b$ via the normal equations $X^\top X w = X^\top y$ (used in least squares) squares the condition number, catastrophically amplifying rounding errors when $X$ is ill-conditioned; Cholesky avoids this by working directly with the system matrix. The fundamental insight is that factorization structure determines numerical stability: explicit inversion $A^{-1}$ is never computed; instead, the decomposition $A = LL^\top$ enables solving via efficient triangular substitutions that respect floating-point precision limits. In modern ML, Cholesky is ubiquitous in second-order optimization (Newton’s method, interior-point methods), Gaussian process inference (computing posterior covariances), and kernel methods (solving Gram matrix systems). The practical challenge is that Cholesky requires $A$ to be exactly symmetric positive definite; in practice, numerical rounding or ill-conditioning can make a theoretically SPD matrix appear indefinite, causing Cholesky to fail. Detecting this failure (catching the exception) is often the first diagnostic for ill-conditioning.

Purpose

Show how ‘hard to train’ often means ‘ill-conditioned’ and how factorization choices change outcomes. Solving linear systems $Ax = b$ is the computational backbone of machine learning: from Newton’s method (solving $H \Delta w = -g$ where $H$ is the Hessian) to Gaussian processes (inferring posterior distributions via $K^{-1} y$), most optimization and inference algorithms reduce to solving sequences of linear systems. For symmetric positive definite (SPD) matrices—the most common case in ML (Hessians near minima, covariance matrices, kernel Gram matrices)—Cholesky decomposition provides both efficiency and numerical stability. This example demonstrates why factorization method matters: naive direct inversion via $A^{-1}$ amplifies rounding errors and is computationally wasteful; structured solvers (Cholesky for SPD) avoid inversion entirely, solving via forward-backward substitution on triangular factors. The comparison between Cholesky-based and general-purpose solvers reveals why “hard to train” often reflects ill-conditioning (large condition number $\kappa(A)$): as $\kappa(A)$ grows, even stable Cholesky solves become unreliable, requiring preconditioning or regularization. Understanding Cholesky decomposition equips you to diagnose optimization failures, design efficient iterative solvers, and recognize when a system is inherently ill-conditioned rather than buggy.

Problem

Solve Ax=b with Cholesky and compare to np.linalg.solve.

Solution (Math)

For SPD $A$, Cholesky $A=LL^T$. Solve $Ly=b$ then $L^Tx=y$. This is stable and efficient.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

Runnable reference implementation using NumPy’s Cholesky factorization and triangular solves. Prints the solution, reference, and their norm difference.


import numpy as np

A = np.array([[4.,1.],[1.,3.]])
b = np.array([1.,2.])

L = np.linalg.cholesky(A)
y = np.linalg.solve(L,b)
x = np.linalg.solve(L.T,y)

xref = np.linalg.solve(A,b)
print("x:", x, "xref:", xref, "diff:", np.linalg.norm(x-xref))
Code Introduction

This code demonstrates Cholesky decomposition and two-stage forward-backward substitution for solving symmetric positive definite (SPD) linear systems. The workflow consists of five steps: (1) construct an SPD matrix $A$ and right-hand side $b$, (2) compute the Cholesky factorization $A = LL^\top$ via L = np.linalg.cholesky(A), (3) solve the lower triangular system $Ly = b$ via forward substitution to get intermediate vector $y$, (4) solve the upper triangular system $L^\top x = y$ via backward substitution to get the final solution $x$, and (5) verify against a direct solver np.linalg.solve(A, b). The Cholesky-based approach avoids computing $A^{-1}$ explicitly, which would amplify rounding errors, and instead leverages the triangular structure of $L$ and $L^\top$ for stable, efficient computation. The factorization $A = LL^\top$ exists and is unique for any SPD matrix, exploiting symmetry to require only $O(n^3/3)$ operations compared to $O(n^3)$ for general LU. When solving multiple systems with the same $A$, the amortized cost becomes $O(n^2)$ per system (factorization done once), making Cholesky-based solvers highly efficient. The verification step ||x - xref|| confirms solution accuracy; for well-conditioned problems, both methods agree to machine precision ($\sim 10^{-15}$). If Cholesky fails (raises “not positive definite”), the matrix is either non-SPD or severely ill-conditioned, signaling trouble in the upstream algorithm (e.g., defective Hessian estimate, singular covariance). This code is the computational pattern used throughout numerical optimization and inference: Newton’s method, interior-point methods, Gaussian processes, and kernel methods all rely on efficient Cholesky solves for computational efficiency.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. SPD matrix construction: A = np.array([[4.,1.],[1.,3.]]) creates a $2 \times 2$ symmetric positive definite matrix with $A = A^\top$ and all eigenvalues positive (here, $\lambda_1 \approx 4.56$, $\lambda_2 \approx 2.44$). Positive definiteness can be verified by checking that all leading principal minors are positive: $A_{1,1} = 4 > 0$ and $\det(A) = 4 \cdot 3 - 1^2 = 11 > 0$.
  2. Cholesky factorization: L = np.linalg.cholesky(A) computes the lower triangular factor such that $A = LL^\top$. This requires $O(n^3/3)$ floating-point operations, exploiting symmetry to be faster than general LU decomposition ($O(n^3/3)$ operations but without the $2/3$ factor). The algorithm automatically checks for positive definiteness; if violated, an exception is raised.
  3. Forward substitution: y = np.linalg.solve(L, b) solves the triangular system $Ly = b$ in $O(n^2)$ operations using forward substitution. Since $L$ is lower triangular, $y_1 = b_1 / L_{1,1}$, then $y_2 = (b_2 - L_{2,1}y_1) / L_{2,2}$, etc.
  4. Backward substitution: x = np.linalg.solve(L.T, y) solves the upper triangular system $L^\top x = y$ in $O(n^2)$ operations using backward substitution. Starting from the bottom: $x_n = y_n / L_{n,n}$, then $x_{n-1} = (y_{n-1} - L_{n,n-1}x_n) / L_{n-1,n-1}$, etc.
  5. Total cost analysis: The combined cost is $O(n^3/3)$ for Cholesky plus $O(n^2)$ for two triangular solves, totaling $O(n^3)$ per solve when amortizing. For multiple right-hand sides (solving $AX = B$ with $B$ having many columns), reuse the factorization: compute Cholesky once, solve each column in $O(n^2)$.
  6. Reference solution: xref = np.linalg.solve(A, b) uses a general-purpose solver (typically LU with partial pivoting for non-SPD matrices), confirming the Cholesky result.
  7. Verification: ||x - xref|| should be near machine precision ($\sim 10^{-15}$), confirming both methods compute the same solution within floating-point accuracy despite using different factorizations.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Cholesky factorization for SPD systems: For symmetric positive definite matrices $A$, the decomposition $A = LL^\top$ (where $L$ is lower triangular) exists and is unique, enabling efficient, stable solving without inversion.
  • Two-stage triangular solve: Forward substitution solves $Ly = b$ ($O(n^2)$ cost), then backward substitution solves $L^\top x = y$ ($O(n^2)$ cost), totaling $O(n^2)$ to solve once the $O(n^3)$ factorization is done.
  • Efficiency for repeated solves: When solving multiple systems $Ax_i = b_i$ with the same $A$, amortize the factorization cost: compute Cholesky once, then solve each system in $O(n^2)$ rather than $O(n^3)$.
  • Numerical stability: Cholesky avoids the $\kappa(A)^2$ condition number amplification of the normal equations; triangular solves respect finite-precision arithmetic. Direct inversion $A^{-1}$ is never computed, preventing catastrophic error magnification.
  • Failure diagnosis: If Cholesky raises an exception (“not positive definite”), the matrix is either non-SPD or severely ill-conditioned, signaling trouble in the calling algorithm (e.g., defective Hessian estimate, singular covariance).
  • Computation pattern: This is a solve operation—computing $A^{-1} b$ implicitly through factorization—ubiquitous in optimization (Newton steps) and inference (Gaussian process posteriors).
Notes

Part 1: The SPD Matrix and Cholesky Factorization
The symmetric positive definite matrix $A = \begin{bmatrix} 4 & 1 \\ 1 & 3 \end{bmatrix}$ and right-hand side $b = [1, 2]^\top$ define the system $Ax = b$. The Cholesky decomposition L = np.linalg.cholesky(A) factorizes $A$ into a product of a lower triangular matrix and its transpose: $A = LL^\top$, where $L \in \mathbb{R}^{2 \times 2}$ is lower triangular with positive diagonal entries. For the given $A$, this produces $L = \begin{bmatrix} 2 & 0 \\ 0.5 & \sqrt{2.75} \end{bmatrix}$ (approximately). Cholesky is faster and more numerically stable than general LU or QR decomposition for SPD matrices because it requires only half the operations (no pivoting needed) and exploits the symmetric positive definite structure to guarantee no breakdown. Computing $A^{-1}$ explicitly via inv(A) would amplify rounding errors; Cholesky avoids this by working with triangular factors.

Part 2: Forward and Backward Substitution
Once the factorization $A = LL^\top$ is obtained, solving $Ax = b$ becomes two triangular solves. Forward substitution: y = np.linalg.solve(L, b) solves $Ly = b$ for the intermediate vector $y \in \mathbb{R}^2$. Since $L$ is lower triangular, this is done efficiently by forward substitution—each equation involves only previously computed unknowns, so $y$ is computed in $O(n^2)$ time. Backward substitution: x = np.linalg.solve(L.T, y) solves $L^\top x = y$ for the final solution $x$. Since $L^\top$ is upper triangular, backward substitution computes $x$ in another $O(n^2)$ operations. Together, the two solves cost $O(n^2)$ compared to $O(n^3)$ for the Cholesky factorization itself, making repeated solves with the same $A$ very efficient.

Part 3: Verification Against Direct Solve
The reference solution xref = np.linalg.solve(A, b) solves $Ax = b$ directly using a general-purpose solver (typically LU with partial pivoting under the hood). The printed difference ||x - xref|| should be near machine precision (around $10^{-15}$), confirming that both methods compute the same solution to floating-point accuracy. The Cholesky-based approach is not more accurate, but it is more efficient when solving multiple systems with the same $A$: compute Cholesky once ($O(n^3)$), then solve each system in $O(n^2)$. This pattern is ubiquitous in optimization: computing Hessian-vector products via repeated solves, interior-point methods in convex optimization, and Gaussian process inference all exploit this structure.

Connection to Broader Linear Algebra
This example unifies several key concepts: (1) Symmetric matrices (Chapter 8-9): SPD matrices have real positive eigenvalues and orthogonal eigenvectors; Cholesky exploits this structure for efficiency. (2) Factorizations (Chapter 13): LU, QR, and Cholesky are all ways to decompose $A$ to accelerate solving; Cholesky is the most efficient for SPD matrices. (3) Numerical stability (Chapter 14): The condition number $\kappa(A) = \lambda_{\max} / \lambda_{\min}$ governs error amplification; triangular solves through Cholesky avoid squaring the condition number (unlike normal equations $X^\top X w = X^\top y$). (4) Optimization in ML: Computing Newton steps $H^{-1} g$ (where $H$ is the Hessian, which is SPD near minima) via Cholesky is standard in second-order optimization methods.

Numerical and Shape Notes
Shape discipline: $A \in \mathbb{R}^{2 \times 2}$, $L \in \mathbb{R}^{2 \times 2}$ (lower triangular), $b \in \mathbb{R}^2$, $y \in \mathbb{R}^2$ (intermediate), $x \in \mathbb{R}^2$ (solution). Crucially, np.linalg.cholesky requires $A$ to be SPD; if $A$ is not positive definite, the solver raises an error (e.g., “not positive definite”). Testing SPD before factorization can be done via eigenvalue inspection or attempting Cholesky and catching the exception. Gotchas: (1) For ill-conditioned $A$ (large condition number), Cholesky may fail due to numerical rounding pushing computed diagonal entries negative. (2) The syntax np.linalg.solve(L.T, y) solves $L^\top x = y$, not $x = L^{-\top} y$—the solver is more stable than explicit inversion. (3) Printing shapes and intermediate results (print L.shape, y.shape, x.shape) confirms correctness before comparing norms.

History and Applications

Cholesky decomposition emerged in the early 1900s when André-Louis Cholesky, a French military officer and geodesist, developed the method for solving least squares problems in surveying and cartography. He worked on triangulation computations for the French military, where solving large linear systems arising from overdetermined surveying equations was central. His method remained relatively obscure in published literature until after his death in World War I, when numerical analysts in the 1950s recognized its computational advantages. John G. F. Francis and Vera N. Kublanovskaya independently developed the QR algorithm around the same time, establishing the numerical linear algebra foundations of modern computing. The key insight—that factorization structure determines numerical stability—transformed computational mathematics: instead of explicitly forming $A^{-1}$ (catastrophic for ill-conditioned $A$), factorizations like Cholesky, QR, and SVD enable solving through structured substitutions that respect floating-point precision. In the 1960s-1970s, Gene Golub and colleagues formalized stability analysis, quantifying how condition number $\kappa(A)$ affects rounding error amplification. Today, Cholesky is ubiquitous in machine learning: Newton’s method and interior-point optimization compute Hessian systems $H \Delta w = -g$ (where $H$ is SPD at minima) via Cholesky; Gaussian process inference solves covariance systems $K y$ via Cholesky ($K$ is SPD by construction); kernel machines solve Gram matrix systems via Cholesky; Kalman filtering uses block Cholesky for covariance updates. A major challenge is detecting ill-conditioning: Cholesky failing (raising “not positive definite”) often signals that the system is inherently difficult (large $\kappa(A)$) rather than buggy, necessitating regularization, preconditioning, or iterative solvers. Modern variants like incomplete Cholesky (computing only the dominant factors to serve as a preconditioner) and randomized Cholesky (approximating the factorization) extend the method to large-scale problems. The historical progression—from Cholesky’s geodetic computations to modern ML systems solving billions of SPD systems per day—illustrates how a 120-year-old algorithm remains foundational to scientific computing.

Connection to Broader Examples

Cholesky solving synthesizes concepts from multiple chapters:

  • Chapter 5 (Inner Products & Norms): The Cholesky factor $L$ inherits the positive definiteness of $A$ through the inner product structure: $A = LL^\top$ means $\langle x, Ax \rangle = \langle L^\top x, L^\top x \rangle = \|L^\top x\|_2^2 > 0$ for $x \neq 0$.
  • Chapter 8-9 (Eigenvalues & PSD): SPD matrices have real positive eigenvalues and orthogonal eigenvectors; Cholesky exploits this structure to guarantee numerical stability (no pivoting needed).
  • Chapter 13 (Solving Systems): This example demonstrates the computational core of solving $Ax = b$ for SPD $A$—using Cholesky instead of LU or direct inversion.
  • Chapter 14 (Conditioning & Stability): The condition number $\kappa(A) = \lambda_{\max} / \lambda_{\min}$ governs error amplification; Cholesky’s stability avoids squaring $\kappa(A)$ (as would happen with normal equations).
  • Newton’s method (optimization): Computing the Newton step $H^{-1} g$ (where $H$ is the Hessian, SPD near minima) via Cholesky is standard in second-order optimization, interior-point methods, and trust-region algorithms.
  • Gaussian processes & kernel methods: Solving $K^{-1} y$ (where $K$ is a positive definite Gram matrix) via Cholesky is computationally stable and scales as $O(n^3)$ for $n$ training points—a major computational bottleneck that motivates low-rank approximations (Nyström, inducing points).

Comments