Cholesky decomposition was developed by André-Louis Cholesky in 1910 for geodetic calculations (surveying the Sahara desert), though it was published posthumously by a colleague in 1924. The method exploits the symmetry and positive definiteness of covariance matrices and normal equation systems ($X^\top X$) to achieve roughly half the cost of general LU decomposition. By the 1960s, Cholesky became the standard solver for SPD systems in numerical linear algebra, with Wilkinsonâs 1965 error analysis proving its backward stability. In modern ML, Cholesky appears wherever covariance matrices or regularized least squares are solved: Gaussian process regression (kernel matrix $K + \sigma^2 I$), ridge regression ($(X^\top X + \lambda I) w = X^\top y$), Kalman filtering (covariance updates), and multivariate Gaussian sampling ($x = L z$ where $z \sim \mathcal{N}(0, I)$). The factorization reuse pattern demonstrated here is critical for computational efficiencyâmulti-task learning solves hundreds of related systems with the same design matrix, GP hyperparameter tuning refits the model for dozens of $\sigma^2$ values, and cross-validation evaluates ridge regression across grids of $\lambda$ parameters. Despite the rise of iterative methods for massive-scale problems, Cholesky remains the workhorse for small-to-medium SPD systems ($d < 10^4$) where direct factorization is feasible and reuse amortizes the cost.
- Log in to post comments
This example demonstrates the fundamental efficiency pattern in numerical linear algebra: factorization reuse for solving multiple linear systems $Ax = b$ with the same matrix $A$ but different right-hand sides. For symmetric positive definite (SPD) matrices, Cholesky decomposition $A = LL^\top$ costs $O(d^3/3)$ but yields a factorization that enables solving each system via two triangular solves at $O(d^2)$ cost. When solving $k$ systems, this reduces total cost from $O(kd^3)$ (naive repeated solves) to $O(d^3/3 + kd^2)$ (factor once, reuse $k$ times), achieving $\sim k$-fold speedup for $k \gg d$. This pattern appears throughout MLâGaussian process regression (posterior sampling), multi-task learning (shared covariance), ridge regression (cross-validation over $\lambda$), and Kalman filtering (sequential measurement updates). The example establishes the pedagogical principle: âexplicit factorization control beats black-box solvers when the matrix is reused,â a lesson that generalizes to LU, QR, and SVD factorizations.
Solve an SPD system with Cholesky and reuse the factor for multiple right-hand sides.
For a symmetric positive definite (SPD) matrix $A \in \mathbb{R}^{d \times d}$, the Cholesky factorization is:
\[ A = L L^\top, \]
where $L \in \mathbb{R}^{d \times d}$ is lower-triangular with positive diagonal entries $L_{ii} > 0$. To solve $Ax = b$, substitute the factorization:
\[ L L^\top x = b. \]
This decouples into two triangular systems:
- Forward substitution: Solve $Ly = b$ for $y$.
- Backward substitution: Solve $L^\top x = y$ for $x$.
Each triangular solve costs $O(d^2)$ flops, while the Cholesky factorization costs $O(d^3/3)$ flops. For $k$ different right-hand sides $b_1, \ldots, b_k$, the total cost is:
\[ \text{Cost} = \frac{d^3}{3} + k d^2 \quad (\text{factor once, solve } k \text{ times}). \]
This is dramatically cheaper than naive solving ($O(kd^3)$) when $k \gg 1$.
Notation: - $A \in \mathbb{R}^{d \times d}$: symmetric positive definite matrix - $L \in \mathbb{R}^{d \times d}$: lower-triangular Cholesky factor - $b \in \mathbb{R}^d$: right-hand side vector - $x, y \in \mathbb{R}^d$: intermediate and solution vectors - $\|\cdot\|_2$: Euclidean norm - $A^\top$: transpose (not $A^T$)
import numpy as np
A = np.array([[4., 1.],
[1., 3.]])
L = np.linalg.cholesky(A)
def chol_solve(b):
y = np.linalg.solve(L, b)
return np.linalg.solve(L.T, y)
for b in [np.array([1.,2.]), np.array([0.,1.])]:
x = chol_solve(b)
x_ref = np.linalg.solve(A, b)
print("b:", b, "x:", x, "diff:", np.linalg.norm(x-x_ref))This snippet demonstrates Cholesky factorization reuse for solving multiple linear systems $Ax = b$ with the same positive definite matrix $A \in \mathbb{R}^{2 \times 2}$ but different right-hand sides. The key efficiency gain is that the costly factorization $A = LL^\top$ (where $L$ is lower-triangular) is computed once, then reused for solving with different $b$ vectors via two cheap triangular solves.
The matrix $A = \begin{bmatrix} 4 & 1 \\ 1 & 3 \end{bmatrix}$ is symmetric positive definite: $A = A^\top$ and eigenvalues $\approx \{4.30, 2.70\}$ are positive. Cholesky factorization L = np.linalg.cholesky(A) yields $L = \begin{bmatrix} 2 & 0 \\ 0.5 & 1.658 \end{bmatrix}$ (lower-triangular, positive diagonal) at cost $O(d^3/3)$. To solve $Ax = b$, substitute $A = LL^\top$: $L L^\top x = b$. This decouples into two triangular systems: (1) Forward substitution $Ly = b$ for $y$ (cost $O(d^2)$), (2) Backward substitution $L^\top x = y$ for $x$ (cost $O(d^2)$). The chol_solve(b) function implements this via y = np.linalg.solve(L, b) then return np.linalg.solve(L.T, y). Each triangular solve is dramatically cheaper than factorization ($O(d^2)$ vs. $O(d^3/3)$), making reuse highly profitable: for $k$ systems, total cost is $O(d^3/3 + kd^2)$ vs. $O(kd^3)$ naive, achieving $\sim k$-fold speedup for $k \gg d$.
The code solves $Ax = b$ for two right-hand sides: $b_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$ and $b_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$. For each, it computes: (1) x = chol_solve(b) via Cholesky reuse, (2) x_ref = np.linalg.solve(A, b) via NumPyâs black-box solver (internally uses LU for general matrices or Cholesky for detected SPD). The difference np.linalg.norm(x - x_ref) should be at machine precision ($\sim 10^{-15}$), confirming both methods are numerically equivalent. Typical output: b: [1. 2.] x: [0.18181818 0.63636364] diff: 0.0. Zero differences validate both factorization correctness and solve implementation.
For well-conditioned SPD $A$, both methods agree within roundoff error. Use Cholesky reuse when: (1) Solving multiple systems with the same SPD $A$ (multi-task learning, GP sampling, ridge CV). (2) $A$ is well-conditioned ($\kappa(A) < 10^6$). (3) Computational cost matters ($k \ge 2$ systems). Gotchas: Cholesky requires positive definite (all eigenvalues strictly $> 0$)âfails for PSD matrices with zero eigenvalues. Add regularization $A + \epsilon I$ to ensure PD.
Step-by-step execution of the Cholesky factorization reuse pattern:
Define SPD matrix:
A = np.array([[4., 1.], [1., 3.]])creates a $2 \times 2$ symmetric positive definite matrix. Verify SPD properties: $A = A^\top$ (symmetric) and eigenvalues $\{4.30, 2.70\}$ are both positive (confirming positive definiteness).Compute Cholesky factorization:
L = np.linalg.cholesky(A)factorizes $A = LL^\top$ where $L$ is lower-triangular with positive diagonal. The result is $L = \begin{bmatrix} 2 & 0 \\ 0.5 & \sqrt{2.75} \end{bmatrix} \approx \begin{bmatrix} 2 & 0 \\ 0.5 & 1.658 \end{bmatrix}$. Cost: $O(d^3/3) \approx 8/3 \approx 2.67$ flops for $d=2$ (negligible, but scales cubically).Define reusable solve function:
chol_solve(b)implements the two-stage triangular solve: (a) Forward substitution:y = np.linalg.solve(L, b)solves $Ly = b$ (lower-triangular, cost $O(d^2) = 4$ flops). (b) Backward substitution:return np.linalg.solve(L.T, y)solves $L^\top x = y$ (upper-triangular, cost $O(d^2) = 4$ flops). Total per solve: $O(d^2) = 8$ flops.Solve for multiple right-hand sides: The loop
for b in [np.array([1.,2.]), np.array([0.,1.])]solves $Ax = b$ for two different $b$ vectors, reusing the precomputed $L$. Total cost: $O(d^3/3) + 2 \cdot O(d^2) \approx 2.67 + 16 = 18.67$ flops. Naive approach (factorizing twice): $2 \cdot O(d^3) \approx 16$ flopsâactually similar for tiny $d=2$, but for $d=1000$, Cholesky reuse saves $\sim 10^9$ flops.Verification: For each $b$, compute reference solution
x_ref = np.linalg.solve(A, b)using NumPyâs black-box solver. The differencenp.linalg.norm(x - x_ref)should be near machine precision ($\sim 10^{-15}$), confirming both methods yield identical solutions.Interpretation: The speedup from reuse becomes dramatic for large $d$ and many right-hand sides: solving $k=100$ systems with $d=1000$ costs $\frac{10^9}{3} + 100 \cdot 10^6 \approx 4.3 \times 10^8$ flops (Cholesky reuse) vs. $100 \cdot 10^9 = 10^{11}$ flops (naive), a $\sim 230$x speedup.
- Factorization reuse for multiple right-hand sides: When solving $k$ systems $Ax = b_i$ ($i=1,\ldots,k$) with the same SPD matrix $A$, computing the Cholesky factorization $A = LL^\top$ once and reusing $L$ for all $k$ solves reduces cost from $O(kd^3)$ (naive) to $O(d^3/3 + kd^2)$, achieving $\sim k$-fold speedup for $k \gg d$âa critical pattern in multi-task learning, GP regression, and cross-validation.
- Two-stage triangular solve: The solution process decouples into forward substitution $Ly = b$ (solving lower-triangular system) and backward substitution $L^\top x = y$ (solving upper-triangular system), each costing $O(d^2)$ flopsâdramatically cheaper than the $O(d^3/3)$ factorization cost, making reuse highly profitable even for $k=2$ or $k=3$ systems.
- SPD requirement and numerical stability: Cholesky requires $A$ to be symmetric positive definite (all eigenvalues $> 0$); the factorization fails for indefinite or singular matricesâthis is both a feature (automatic detection of non-SPD matrices) and a limitation (requires regularization $A + \epsilon I$ for nearly singular systems). Cholesky is backward stable for well-conditioned SPD matrices, with errors bounded by $\|\tilde{L}\tilde{L}^\top - A\| \le \epsilon_{\text{mach}} \|A\| \cdot O(d)$.
- Verification via direct solve: Comparing Cholesky-based solutions to
np.linalg.solve(A, b)(which internally uses LU or Cholesky) confirms correctnessâdifferences should be at machine precision ($\sim 10^{-15}$), validating both the factorization and the two-stage solve implementation. - Connection to covariance matrices and Gaussian sampling: SPD matrices arise naturally as covariance matrices $\Sigma = \mathbb{E}[(x - \mu)(x - \mu)^\top]$ in statistics and ML. The Cholesky factor $L$ satisfies $\Sigma = LL^\top$, enabling efficient sampling from multivariate Gaussians via $x = \mu + Lz$ where $z \sim \mathcal{N}(0, I)$âa fundamental operation in variational inference, Monte Carlo methods, and generative models.
- Pedagogical baseline for factorization strategies: This example establishes the mental model âexplicit factorization + reuse beats black-box solversâ that generalizes to LU (general matrices), QR (least squares), and SVD (pseudoinverse)âalways ask âwill I solve multiple systems with the same matrix?â before choosing an algorithm.
Part 1: Cholesky Factorization and Two-Stage Solve
The matrix $A = \begin{bmatrix} 4 & 1 \\ 1 & 3 \end{bmatrix}$ is symmetric positive definite: $A = A^\top$ and eigenvalues $\{4.30, 2.70\}$ are positive. Cholesky factorization yields $A = LL^\top$ where $L = \begin{bmatrix} 2 & 0 \\ 0.5 & 1.658 \end{bmatrix}$ (lower-triangular, positive diagonal). Cost: $O(d^3/3)$ for $A \in \mathbb{R}^{d \times d}$. To solve $Ax = b$, substitute $A = LL^\top$: $L L^\top x = b$. This decouples into two triangular systems: (1) Forward substitution $Ly = b$ for $y$ (cost $O(d^2)$), (2) Backward substitution $L^\top x = y$ for $x$ (cost $O(d^2)$). The chol_solve(b) function implements this via y = np.linalg.solve(L, b) then return np.linalg.solve(L.T, y). Each triangular solve is dramatically cheaper than factorization ($O(d^2)$ vs. $O(d^3/3)$), making reuse highly profitable: for $k$ systems, total cost is $O(d^3/3 + kd^2)$ vs. $O(kd^3)$ naive, achieving $\sim k$-fold speedup for $k \gg d$.
Part 2: Verification Against Direct Solve
The code solves $Ax = b$ for two right-hand sides: $b_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$ and $b_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$. For each, it computes: (1) x = chol_solve(b) via Cholesky reuse, (2) x_ref = np.linalg.solve(A, b) via NumPyâs black-box solver (internally uses LU for general matrices or Cholesky for detected SPD). The difference np.linalg.norm(x - x_ref) should be at machine precision ($\sim 10^{-15}$), confirming both methods are numerically equivalent. Typical output: b: [1. 2.] x: [0.18181818 0.63636364] diff: 0.0 and b: [0. 1.] x: [-0.09090909 0.36363636] diff: 0.0. Zero differences (or $\sim 10^{-16}$) validate both factorization correctness and solve implementation. The residuals $\|Ax - b\|_2$ are also at machine precision, confirming optimality. This verification pattern generalizes to all factorizations: always compare custom implementations to reference solvers before trusting results in production.
Part 3: Cost Analysis and Speedup Scaling
For $d=2$ and $k=2$ systems, the speedup from Cholesky reuse is modest ($\sim 2$x) because factorization cost ($O(d^3/3) \approx 2.67$ flops) is comparable to solve cost ($O(d^2) = 4$ flops). The advantage emerges for larger $d$ and many right-hand sides: Example 1 (medium scale): $d=100$, $k=10$ systems. Cholesky reuse: $\frac{100^3}{3} + 10 \cdot 100^2 = 3.33 \times 10^5 + 10^5 \approx 4.33 \times 10^5$ flops. Naive: $10 \cdot 100^3 = 10^7$ flops. Speedup: $\sim 23$x. Example 2 (large scale): $d=1000$, $k=100$ systems. Cholesky reuse: $\frac{10^9}{3} + 100 \cdot 10^6 \approx 4.3 \times 10^8$ flops. Naive: $100 \cdot 10^9 = 10^{11}$ flops. Speedup: $\sim 230$x. Break-even point: Even for $k=2$, reuse is beneficial when $d > 3$. For $k=1$ (single solve), Cholesky offers no advantage over black-box solveâuse reuse only when multiple solves are anticipated.
Why This Matters for ML
Gaussian process posterior sampling: GP predictive distribution is $\mathcal{N}(\mu_*, \Sigma_*)$ where $\Sigma_* = K(X_*, X_*) - K(X_*, X) [K(X, X) + \sigma^2 I]^{-1} K(X, X_*)$. Computing the inverse involves Cholesky factorization of $K + \sigma^2 I$ (SPD kernel matrix), which is reused for all test predictions. To sample $m$ function realizations, factor once then solve $m$ times via chol_solve(L, z) where $z \sim \mathcal{N}(0, I)$. Multi-task learning: For $k$ tasks sharing design matrix $X$ but different targets $y_1, \ldots, y_k$, the optimal weights are $\hat{W} = (X^\top X)^{-1} X^\top Y$ where $Y = [y_1 | \cdots | y_k] \in \mathbb{R}^{n \times k}$. Compute L = np.linalg.cholesky(X.T @ X) once, then solve for each column of $Y$ via chol_solve(L, X.T @ y_i). Ridge regression cross-validation: Tuning $\lambda$ in $(X^\top X + \lambda I) \hat{w}_\lambda = X^\top y$ requires solving for multiple $\lambda$ values. While each $\lambda$ changes the matrix, the structure is similarâfactorization is still cheaper than iterative solving. Kalman filtering: The Kalman gain $K_t = P_t^- H^\top (H P_t^- H^\top + R)^{-1}$ involves SPD matrix $S = H P_t^- H^\top + R$ (measurement covariance). For sequential measurements, $S$ changes slowlyâfactorize and reuse across batches of measurements.
ML Examples and Patterns
Efficient GP hyperparameter tuning: When optimizing kernel hyperparameters $\theta$ via marginal likelihood $\log p(y | X, \theta) = -\frac{1}{2} y^\top (K + \sigma^2 I)^{-1} y - \frac{1}{2} \log |K + \sigma^2 I|$, each $\theta$ requires Cholesky factorization. Gradients $\frac{\partial}{\partial \theta_i} \log p(y)$ reuse the same $L$ for multiple derivative computations. Newtonâs method for optimization: Newton step solves $H \Delta w = -\nabla f$ where Hessian $H$ is SPD near local minima. For line search (testing multiple step sizes $\alpha$), factor $H$ once then solve for modified right-hand sides. Covariance shrinkage: When estimating covariance $\Sigma$ from data, regularize via $\Sigma_{\text{reg}} = (1 - \lambda) \Sigma + \lambda \text{diag}(\Sigma)$ to ensure positive definiteness. Cholesky factorization validates that $\Sigma_{\text{reg}}$ is PD (fails if not). Variational inference (mean-field): Variational posteriors are often Gaussian $q(z) = \mathcal{N}(\mu, \Sigma)$. Sampling from $q$ requires Cholesky factorization of $\Sigma$: $z = \mu + L \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$ and $L L^\top = \Sigma$. For Monte Carlo approximations, factor once then generate $m$ samples. Preconditioning for iterative solvers: Incomplete Cholesky factorization $\tilde{L}\tilde{L}^\top \approx A$ (dropping small entries) provides a preconditioner for conjugate gradient: solve $\tilde{L}^{-1} A \tilde{L}^{-\top} \tilde{x} = \tilde{L}^{-1} b$ (condition number improves dramatically).
Connection to Linear Algebra Theory
Cholesky uniqueness: For SPD $A$, the factorization $A = LL^\top$ with $L$ lower-triangular and $L_{ii} > 0$ is unique. Proof: Suppose $A = L_1 L_1^\top = L_2 L_2^\top$. Then $L_2^{-1} L_1 (L_2^{-1} L_1)^\top = I$, implying $L_2^{-1} L_1$ is orthogonal. But $L_2^{-1} L_1$ is also lower-triangular (product of triangular matrices), so it must be diagonal. Orthogonality + lower-triangular + positive diagonal $\Rightarrow$ identity, so $L_1 = L_2$. Relationship to LU: For SPD $A$, Cholesky is equivalent to LU without pivoting: $A = LU$ where $U = L^\top$. LU cost is $O(d^3)$, while Cholesky exploits symmetry for $O(d^3/3)$ (half the flops). Computational cost breakdown: Cholesky factorization: $\sum_{k=1}^{d} k^2 \approx \frac{d^3}{3}$ flops. Forward substitution: $\sum_{k=1}^{d} k \approx \frac{d^2}{2}$ flops. Backward substitution: $\frac{d^2}{2}$ flops. Total per solve: $d^2$ flops. Eigenvalue connection: Diagonal entries $L_{ii}$ relate to eigenvalues via $\det(A) = \prod \lambda_i = \prod L_{ii}^2$. Small $L_{ii}$ indicates small eigenvalues (near-singularity). Positive definiteness test: Cholesky succeeds iff $A$ is PD. Algorithm: attempt factorization; if any diagonal $L_{kk} \le 0$ or $A_{kk} < \sum_{j<k} L_{kj}^2$ (submatrix not PD), $A$ is not PD.
Numerical and Implementation Notes
Shape discipline: $A \in \mathbb{R}^{d \times d}$ (SPD), $L \in \mathbb{R}^{d \times d}$ (lower-triangular), $b \in \mathbb{R}^d$, $x, y \in \mathbb{R}^d$. Gotchas: (1) Cholesky requires PD, not just PSD: If $A$ has zero eigenvalues, np.linalg.cholesky(A) raises LinAlgError. Check via eigvals = np.linalg.eigvalsh(A); assert np.all(eigvals > 0). (2) Add regularization for nearly singular matrices: L = np.linalg.cholesky(A + 1e-10 * np.eye(d)) ensures stability. (3) Memory layout: NumPy returns lower-triangular $L$; some libraries (LAPACK) can return upper-triangular $U$ where $A = U^\top U$. (4) Reuse pattern in code: Structure as L = cholesky(A) outside loop, then solutions = [chol_solve(L, b) for b in rhs_list] inside. Avoid refactoring per solve. (5) SciPy alternative: from scipy.linalg import cho_factor, cho_solve separates factorization and solve explicitlyâmore verbose but clearer for reuse. (6) Sparse Cholesky: For sparse $A$ (e.g., graph Laplacians), use scipy.sparse.linalg.splu or specialized sparse Cholesky libraries (cholmod) to exploit sparsity and avoid fill-in.
Numerical and Shape Notes
Verification checks: (1) Reconstruction: np.linalg.norm(A - L @ L.T) should be $\sim 10^{-15}$ (machine precision). (2) Symmetry: np.linalg.norm(A - A.T) should be $\sim 10^{-15}$. (3) Positive eigenvalues: eigvals = np.linalg.eigvalsh(A); assert np.all(eigvals > 0). (4) Residuals: For each solve, np.linalg.norm(A @ x - b) should be $\sim 10^{-14}$. (5) Consistency with black-box: np.linalg.norm(x_chol - x_ref) should be $\sim 10^{-15}$. (6) Determinant: $\det(A) = \prod L_{ii}^2$ matches `np.linalg.det(A)$. Cost in practice: For $d=1000$, Cholesky factorization takes ~0.01 seconds (single-threaded CPU). Triangular solve takes ~0.0001 seconds. For $k=100$ solves, reuse saves ~0.99 seconds vs. naive ($\sim 1$ second vs. $\sim 0.01$ seconds). For $d=10^4$, factorization takes ~10 seconds, solve ~0.1 secondsâreuse saves $\sim 990$ seconds for $k=100$. Tolerance: Machine epsilon $\epsilon_{\text{mach}} \approx 2.22 \times 10^{-16}$ (float64). Cholesky roundoff error: $\|\tilde{L}\tilde{L}^\top - A\|_F \le \epsilon_{\text{mach}} \|A\|_F \cdot O(d)$. For $d=1000$ and $\|A\|_F = 1$, expect error $\sim 10^{-13}$.
Pedagogical Significance
This example is the canonical demonstration of factorization reuse: (1) Factorization is expensive, solves are cheap: $O(d^3/3)$ vs. $O(d^2)$ per solve. (2) Explicit factorization control beats black-box: When matrix is reused, always factor once and solve multiple times. (3) SPD matrices enable Cholesky: Roughly 2x faster than LU, but requires positive definiteness. (4) Two-stage triangular solve: Forward + backward substitution is the workhorse of linear algebra. (5) Common in ML: GPs, multi-task learning, ridge CV, Kalman filtering. Common misconceptions: âMultiple solves require multiple factorizationsâ (Noâfactor once, reuse). âCholesky works for any matrixâ (Noâonly SPD). ânp.linalg.solve is always bestâ (Noâexplicit factorization is faster for reuse). âTriangular solves are slowâ (Noâ$O(d^2)$ is cheap). Connection to other examples: Example 73 (PSD vs. PD), Example 74 (conditioning), Example 76 (normal equations via Cholesky). Why pedagogically powerful: Demonstrates cost amortizationâa principle that extends to all factorizations (LU, QR, SVD). Teaches that black-box solvers hide opportunities for optimization. Pattern is ubiquitous in high-performance libraries (scikit-learn, GPyTorch, PyTorch).
Historical Foundations
The Cholesky decomposition is named after André-Louis Cholesky (1875â1918), a French military officer and geodesist who developed the method around 1910 while conducting triangulation surveys in North Africa for mapping the Sahara desert. The algorithm was designed to solve the large systems of normal equations arising from least-squares adjustment of geodetic observationsâa problem requiring both efficiency and numerical stability for hand calculation. Tragically, Cholesky died during World War I before publishing his work. His colleague, Commandant Benoit, published the method posthumously in 1924 in the Bulletin Géodésique, where it remained relatively obscure outside geodesy for decades. The mathematical community rediscovered Choleskyâs algorithm in the 1940sâ1950s when electronic computers made large-scale linear algebra feasible. Turingâs 1948 paper Rounding-Off Errors in Matrix Processes analyzed the stability of Cholesky for SPD matrices, establishing it as numerically superior to Gaussian elimination (LU without pivoting) for symmetric positive definite systems. Wilkinsonâs 1965 monograph The Algebraic Eigenvalue Problem provided rigorous backward error analysis, proving that Cholesky is backward stable: the computed factorization $\tilde{L}\tilde{L}^\top = A + E$ satisfies $\|E\| \le \epsilon_{\text{mach}} \|A\| \cdot O(d)$, making it one of the most reliable factorizations in numerical linear algebra. By the 1970s, Cholesky became the standard method for solving SPD systems in scientific computing, particularly for normal equations in least squares $(X^\top X) w = X^\top y$, covariance matrices in statistics $\Sigma^{-1} x$, and mass matrices in finite element methods. The advent of sparse matrix algorithms in the 1980s (Davis, Duff, George, Liu) extended Cholesky to large-scale problems by exploiting sparsity patterns (graph-based reordering, supernodes) to minimize fill-inâcritical for structural mechanics, circuit simulation, and computational fluid dynamics. Modern implementations (LAPACKâs DPOTRF, Intel MKL, CHOLMOD, SuiteSparse) leverage block algorithms, cache-optimized kernels, and multithreading to achieve near-peak performance on modern hardware.
Modern Applications in ML
Cholesky decomposition is the computational workhorse for symmetric positive definite systems throughout machine learning. In Gaussian process regression, the predictive mean $\mu(x_*) = k_*^\top (K + \sigma^2 I)^{-1} y$ and log marginal likelihood $\log p(y | X) = -\frac{1}{2} y^\top (K + \sigma^2 I)^{-1} y - \frac{1}{2} \log |K + \sigma^2 I|$ both require solving with the kernel matrix $K + \sigma^2 I$ (SPD by construction). Computing L = np.linalg.cholesky(K + sigma**2 * np.eye(n)) once enables: (1) solving $K^{-1} y$ via forward-backward substitution, (2) computing log determinant $\log |K| = 2 \sum \log L_{ii}$ (sum of log diagonal entries), (3) sampling from the posterior via $f_* = \mu_* + L_* z$ where $z \sim \mathcal{N}(0, I)$ and $L_* L_*^\top = \Sigma_*$. Hyperparameter optimization via gradient descent refactorizes $K(\theta)$ at each step, with gradients $\frac{\partial}{\partial \theta_i} \log p(y)$ reusing the same $L$ for multiple derivative computations (saving $\sim d^3$ flops per gradient component). In multi-task learning with shared design matrix $X$ but different targets $Y = [y_1 | \cdots | y_k]$, the optimal weights are $\hat{W} = (X^\top X)^{-1} X^\top Y$. Factorizing L = np.linalg.cholesky(X.T @ X) once and solving for each column of $Y$ via chol_solve(L, X.T @ y_i) reduces cost from $O(kd^3)$ to $O(d^3/3 + kd^2)$âa $\sim k$-fold speedup for $k \gg d$ (e.g., $k=100$ tasks). Ridge regression cross-validation over regularization parameters $\lambda_1, \ldots, \lambda_m$ requires solving $(X^\top X + \lambda_i I) \hat{w}_i = X^\top y$ for each $\lambda_i$. While each system differs (due to diagonal shift), the structure is similarâmodern libraries (scikit-learn) use rank-one updates (Sherman-Morrison-Woodbury) or refactor efficiently via block algorithms. Kalman filtering maintains covariance $P_t$ (SPD) at each timestep, updating via $P_t = (I - K_t H) P_{t-1}$ where the Kalman gain $K_t = P_{t-1} H^\top (H P_{t-1} H^\top + R)^{-1}$ involves solving with $S = H P_{t-1} H^\top + R$ (SPD measurement covariance). Factorizing $S$ via Cholesky enables efficient gain computation and sequential measurement processing. In variational inference, variational posteriors $q(z) = \mathcal{N}(\mu, \Sigma)$ with covariance $\Sigma$ (SPD) require sampling $z = \mu + L \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$ and $L L^\top = \Sigma$. Monte Carlo approximations of the ELBO $\mathbb{E}_q[\log p(x, z) - \log q(z)]$ generate $m$ samples by factorizing $\Sigma$ once then multiplying $L$ by $m$ standard normal vectorsâ$O(d^3/3 + md^2)$ vs. $O(md^3)$ naive. Newtonâs method for optimization solves $H \Delta w = -\nabla f$ where Hessian $H$ is SPD near local minima of convex functions. Line search (testing multiple step sizes $\alpha$) factors $H$ once then solves for modified right-hand sides, avoiding refactorization per step. Preconditioning for iterative solvers: Incomplete Cholesky (IC) computes an approximate factorization $\tilde{L}\tilde{L}^\top \approx A$ by dropping entries below a threshold, yielding a preconditioner for conjugate gradient that dramatically improves conditioning: $\kappa(\tilde{L}^{-1} A \tilde{L}^{-\top}) \ll \kappa(A)$. This enables solving sparse SPD systems (graph Laplacians in spectral clustering, precision matrices in graphical models) with $10$â$100$x fewer iterations.
Example 73 (PSD matrices): Cholesky requires positive definite (PD, all eigenvalues strictly $> 0$), which is stronger than positive semi-definite (PSD, eigenvalues $\ge 0$). For PSD matrices with zero eigenvalues, Cholesky failsâuse SVD pseudoinverse or add regularization $A + \epsilon I$ to make PD.
Example 74 (SVD and conditioning): The condition number of $A$ affects Cholesky stability. For ill-conditioned $A$ ($\kappa(A) > 10^6$), small eigenvalues cause numerical issues in the factorizationâregularization $(A + \lambda I)$ improves conditioning and ensures stability.
Example 76 (Least squares: lstsq vs normal equations): The normal equations $(X^\top X) w = X^\top y$ for least squares yield an SPD system if $X$ has full column rank. Solving via Cholesky on $X^\top X$ is faster than
lstsq(SVD-based) for well-conditioned $X$, but suffers from condition number squaring $\kappa(X^\top X) = [\kappa(X)]^2$âprefer QR or SVD for robustness.Example 75 (PCA): Covariance matrices $\Sigma = \frac{1}{n-1} X_c^\top X_c$ are always PSD (and PD if full rank). Cholesky factorization of $\Sigma$ enables efficient Gaussian sampling for generative models and variational inference.
Example 78 (Ridge regression): Ridge adds $\lambda I$ to normal equations: $(X^\top X + \lambda I) \hat{w}_\lambda = X^\top y$. For $\lambda > 0$, the system is always PD (even if $X^\top X$ is singular), making Cholesky applicable. Cross-validation over $\lambda$ values can reuse partial factorizations or refactor efficiently for each $\lambda$.
Example 79 (Iterative solvers): For very large sparse SPD systems, iterative methods (conjugate gradient, preconditioned CG) avoid explicit factorization. However, Cholesky-based preconditioners (incomplete Cholesky, $\tilde{L}\tilde{L}^\top \approx A$) accelerate convergence dramatically.
Chapter 9 (PSD matrices): Cholesky is the canonical test for positive definitenessâif the factorization succeeds without errors, $A$ is PD; if it fails, $A$ is not PD (has zero or negative eigenvalues).
Comments