ex1.ai

Example number

Example slug

example_93_solving_systems_cholesky_factor_reuse

Background

The Cholesky factorization was developed by AndrÃ©-Louis Cholesky (French military officer and mathematician, 1924). The method decomposes an SPD matrix $A$ as $A = LL^\top$ (lower triangular $L$), enabling efficient solution via two triangular solves. Early numerical computing (1950sâ1960s) recognized that SPD systems are numerically superior to general systems: Cholesky requires no pivoting (unlike LU), is $2\times$ faster (only accesses lower triangle), and is provably backward-stable under finite precision. George Forsythe and John Wilkinson proved that Cholesky errors are bounded by a small multiple of machine epsilon times $\|A\|$, independent of $\kappa(A)$ (in contrast, LUâs stability depends heavily on pivoting and conditioning). By the 1970sâ1980s, libraries like LINPACK and LAPACK made Cholesky the standard for SPD systems. In modern ML (2000s+), SPD systems are ubiquitous: covariance matrices in statistics, Gram matrices in kernels, Hessians in optimization. Factor reuse (computing $L$ once, using for multiple RHS) is the key efficiency trick in Gaussian process inference (predictive mean/variance for many test points), kernel methods (kernel matrix Cholesky + multiple solves), and second-order optimization (Hessian Cholesky + Newton steps).

Purpose

Understand symmetric positive definite (SPD) systems and the Cholesky factorization as the gold-standard solver for such problems. Learn why SPD systems are numerically superior: Cholesky is $2\times$ faster than LU decomposition and always numerically stable (no pivoting needed). Discover the power of factor reuse: compute the Cholesky factorization $A = LL^\top$ once, then solve multiple right-hand sides (RHS) $Ax_i = b_i$ cheaply via two triangular solves per RHS. This amortization of factorization cost is crucial in ML: Gaussian processes, kernel methods, and second-order optimization solve many SPD systems with the same matrix $A$ but different RHS. Learn to recognize SPD structure in real problemsâcovariance matrices, Gram matrices, Hessians in quadratic optimizationâand exploit it for dramatic efficiency gains.

Problem

Solve an SPD system with Cholesky and reuse the factor for multiple right-hand sides.

Solution (Math)

For SPD $A$, Cholesky $A=LL^T$. Solve $Ax=b$ via $Ly=b$, $L^Tx=y$. Reuse $L$ for multiple $b$âs to amortize factorization cost.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


import numpy as np

A = np.array([[4., 1.],
              [1., 3.]])
L = np.linalg.cholesky(A)

def chol_solve(b):
    y = np.linalg.solve(L, b)
    return np.linalg.solve(L.T, y)

for b in [np.array([1.,2.]), np.array([0.,1.])]:
    x = chol_solve(b)
    x_ref = np.linalg.solve(A, b)
    print("b:", b, "x:", x, "diff:", np.linalg.norm(x-x_ref))

Code Introduction

This section walks through the Cholesky factorization of a symmetric positive definite (SPD) matrix and demonstrates the two-stage solve pattern for efficiency. The key insight is factor reuse: compute the factorization once, then solve multiple right-hand sides cheaply via triangular solves.

The Cholesky Factorization:

For an SPD matrix $A \in \mathbb{R}^{d \times d}$, the Cholesky factorization computes $A = L L^\top$ where $L$ is lower triangular. In NumPy:

L = np.linalg.cholesky(A)  # A = L @ L.T

Cost: $O(d^3/3)$ (half of LU decomposition). The factorization is numerically stable without pivotingâerrors are $O(\epsilon \|A\|)$ independent of conditioning. This is the key advantage: even ill-conditioned SPD matrices factor stably.

Two-Stage Solve for a Single RHS:

To solve $Ax = b$, we exploit $A = LL^\top$: 1. Forward substitution: Solve $Ly = b$ for $y$ (lower triangular system). python y = np.linalg.solve(L, b) # Ly = b Cost: $O(d^2)$.

Back substitution: Solve $L^\top x = y$ for $x$ (upper triangular system).
```
x = np.linalg.solve(L.T, y)  # L^T x = y
```
Cost: $O(d^2)$.

Total: $O(d^2 + d^2) = O(d^2)$ per RHS. This is half the cost of general LU solve (which is $O(d^2)$ but with a larger constant).

Factor Reuse for Multiple Right-Hand Sides:

The power of Cholesky emerges when solving many systems with the same matrix but different right-hand sides. Compute $L$ once ($O(d^3/3)$), then for each new RHS $b_i$, execute the two-stage solve ($2 \times O(d^2)$). Total cost: $O(d^3/3 + m \cdot d^2)$ for $m$ systems.

Comparison of total cost: - Naive: Solve $m$ independent systems via $m$ Cholesky factorizations: $O(m \cdot d^3)$. - Cholesky with reuse: One factorization + $m$ cheap solves: $O(d^3/3 + m \cdot d^2) = O(d^2(d/3 + m))$. - Speedup factor: $\approx m/2$ (for large $m$, factor reuse is $m/2 \times$ faster).

For example, with $m = 100$ RHS and $d = 100$: - Naive: $100 \times 100^3 / 3 \approx 33$ million operations. - Reuse: $100^3 / 3 + 100 \times 100^2 \approx 1.3 + 1 = 2.3$ million operations. - Speedup: $\approx 14x$.

Shapes and Broadcasting:

$A \in \mathbb{R}^{d \times d}$ (square, symmetric).
$L \in \mathbb{R}^{d \times d}$ (lower triangular, from Cholesky).
Single RHS: $b, y, x \in \mathbb{R}^d$ (column vectors).
Multiple RHS: $B, Y, X \in \mathbb{R}^{d \times m}$ (each column is a RHS). NumPy solve broadcasts: Y = solve(L, B) solves all $m$ systems simultaneously.

Numerical Verification:

After solving $Ax = b$ via Cholesky, verify by comparing with a reference solver (e.g., NumPyâs general solve):

x_chol = chol_solve(b)       # Cholesky two-stage solve
x_ref = np.linalg.solve(A, b)  # Direct solve (LU or other)
error = np.linalg.norm(x_chol - x_ref)

For well-conditioned SPD matrices, the two solutions should agree to machine precision. For ill-conditioned matrices, both will lose accuracy (conditioning affects all solvers), but Cholesky remains the most stable choice.

Recognizing SPD Structure:

Covariance matrices: Always PSD by definition ($v^\top \Sigma v = \text{Var}(v^\top X) \geq 0$). Strict SPD if full-rank data.
Gram matrices: $G = X^\top X$ is PSD; SPD if $X$ has full column rank.
Kernel matrices: Radial basis function (RBF) kernels, Gaussian kernels produce SPD matrices; add small regularization $\lambda I$ if ill-conditioned.
Hessians in convex optimization: Convex quadratic objectives have PSD Hessians; strong convexity ensures SPD.
Regularized systems: $(X^\top X + \lambda I)$ is SPD if $\lambda > 0$ and $X$ has at least one column.

Common Pitfalls:

Non-SPD input: If $A$ is not SPD, np.linalg.cholesky raises LinAlgError. Check: np.allclose(A, A.T) for symmetry; ensure all eigenvalues are positive.
Numerical near-singularity: For nearly singular SPD matrices, add regularization: L = cholesky(A + lambda * I) with small $\lambda > 0$.
Mismatched shapes: Ensure $b$ is a column vector (b.shape = (d,) or (d, 1)) for single RHS.
No factor reuse: For multiple RHS, reusing $L$ is essential; donât call cholesky(A) repeatedly.

Fundamental Principle:

For any SPD system, always factor once and reuse. Cholesky is faster, more stable, and more cache-friendly than repeated full solves. Recognize covariance, Gram, and regularized matrices as opportunities for Cholesky.

Numerical Implementation Details

Cholesky factorization: L = np.linalg.cholesky(A) computes $L$ such that $A = LL^\top$. Cost: $O(d^3/3)$. Returns lower triangular $L \in \mathbb{R}^{d \times d}$.
Forward substitution: y = np.linalg.solve(L, b) solves $Ly = b$ via forward substitution in $O(d^2)$. Numerically stable.
Back substitution: x = np.linalg.solve(L.T, y) solves $L^\top x = y$ via back substitution in $O(d^2)$. Numerically stable.
Reusing factor: Cache $L$ after first factorization; for each new $b_i$, solve $Ly_i = b_i$, then $L^\top x_i = y_i$. Total time: $O(d^3/3 + m \cdot d^2)$ for $m$ RHS.
Checking SPD: Before Cholesky, verify $A$ is symmetric (A.T == A) and attempt factorization. If it fails, $A$ is not SPD (has negative/zero eigenvalues). Fall back to lstsq.
Regularization for ill-conditioning: For covariance or Gram matrices, add small $\lambda I$ (jitter, regularization) to ensure strict PD and improve conditioning. Use $A + \lambda I$ instead of $A$.
Comparison with full solve: np.linalg.solve(A, b) uses LU or other general method; for single RHS, comparable cost to Cholesky. For many RHS, Cholesky-reuse is $\approx m/2$ times faster.
Memory considerations: Cholesky stores only lower triangle (exploits symmetry). For large $d$, memory savings are significant compared to storing full $A$.

What This Example Demonstrates

SPD systems are numerically superior: Cholesky factorization is unconditionally stable (no pivoting needed); errors are $O(\epsilon \|A\|)$ independent of $\kappa(A)$. In contrast, LU requires careful pivoting and stability depends on pivoting quality.
Cholesky factorization cost: Computing $L$ from $A$ costs $O(d^3/3)$ (half the cost of LU). Once $L$ is cached, each subsequent RHS is solved in $O(d^2)$ via two triangular solves.
Factor reuse amortization: For $m$ right-hand sides, the cost is $O(d^3/3 + m \cdot d^2)$. Solving $m$ independent systems via Cholesky-plus-reuse is $\approx m/2$ times faster than $m$ independent full LU+solve cycles.
Two-triangular-solve pattern: From $A = LL^\top$ and $Ax = b$, solve $Ly = b$ (forward substitution), then $L^\top x = y$ (back substitution). Both are $O(d^2)$, numerically stable.
Detecting SPD structure: Covariance matrices (by definition PSD), Gram matrices $X^\top X$, Hessians in quadratic optimizationârecognizing these enables using Cholesky instead of general solvers.
Failure modes: Cholesky fails if $A$ is not SPD (negative eigenvalues, singular). Try Cholesky first; if it fails, fall back to lstsq or check data for errors.
Practical efficiency gains: Gaussian process prediction on $n$ test points with $d$ training examples: one $O(d^3)$ Cholesky, then $n \times O(d^2)$ solves. Total: $O(d^3 + n \cdot d^2)$ vs.Â $O(n \cdot d^3)$ for repeated full solves.

Notes

Part 1: SPD Systems and Cholesky Factorization

A matrix $A \in \mathbb{R}^{d \times d}$ is symmetric positive definite (SPD) if: - $A = A^\top$ (symmetric) - $v^\top A v > 0$ for all $v \ne 0$ (positive definiteness)

Equivalently: all eigenvalues are positive, all principal minors are positive. The Cholesky factorization decomposes $A$ uniquely as: \[ A = LL^\top, \] where $L$ is lower triangular with positive diagonal entries. Solving $Ax = b$ becomes: \[ Ly = b \quad (\text{forward}), \quad L^\top x = y \quad (\text{back}). \]

Part 2: Factor Reuse for Multiple Right-Hand Sides

For $m$ right-hand sides $b_1, \ldots, b_m$, compute $L$ once ($O(d^3/3)$), then for each $i$: \[ y_i = L^{-1} b_i \quad (O(d^2)), \quad x_i = (L^\top)^{-1} y_i \quad (O(d^2)). \] Total: $O(d^3/3 + m \cdot d^2)$. Speedup: $\frac{m \cdot d^3}{d^3/3 + m \cdot d^2} \approx \frac{m \cdot d^3}{m \cdot d^2} = d$ for large $m$. For $m = 10, d = 100$: speedup $\approx 100x$.

Part 3: Numerical Stability and Conditioning

Cholesky is unconditionally backward-stable: errors are $O(\epsilon \|A\|)$ regardless of $\kappa(A)$. This contrasts with LU (stability depends on pivoting and $\kappa(A)$). However, the solution error still depends on $\kappa(A)$: for ill-conditioned $A$, small perturbations in $b$ cause large changes in $x$. Cure: regularization $A + \lambda I$ improves conditioning and preserves SPD-ness.

Why This Matters for ML

Gaussian processes: Predicting mean/covariance for many test points requires solving $(K + \sigma^2 I)^{-1} k_*$ repeatedly ($K$ = kernel matrix, $k_*$ = test kernel vector). Cholesky factors $(K + \sigma^2 I)$ once, then reuses for many predictions.
Kernel methods: SVM, ridge regression, and Gaussian processes all solve Gram matrix systems. Cholesky factor reuse is the efficiency workhorse.
Second-order optimization: Newtonâs method at each iteration solves $H \delta w = -g$ (Hessian $H$, gradient $g$). If $H$ is SPD (or regularized to be), Cholesky-reuse across iterations is fast.
Uncertainty quantification: Bayesian methods compute posterior covariances via Cholesky. Multiple predictions require multiple solvesâreuse is essential.

ML Examples and Patterns

Gaussian process prediction: Fit covariance matrix $K$ once via Cholesky ($O(n^3)$); predict on $m$ test points via $m$ solves ($O(m \cdot n^2)$). Total $O(n^3 + m \cdot n^2)$ vs.Â $O(m \cdot n^3)$ for repeated full solves.
Kernel ridge regression: Solve $(K + \lambda I)w = y$ where $K$ is Gram matrix (SPD). Use Cholesky for efficiency.
Laplace approximation: Posterior covariance $\approx H^{-1}$ (inverse Hessian). Factor $H$ via Cholesky if SPD; reuse for uncertainty estimates.
Proximal gradient methods: Subproblems often involve quadratic problems solvable via Cholesky.
Variational inference: ELBO optimization involves covariance computations; Cholesky enables efficient variational updates.

Connection to Linear Algebra Theory

Spectral decomposition: $A = LL^\top \implies A = P\Lambda P^\top$ where $P = L / \sqrt{\text{diag}(L L^\top)}$ (orthogonalization). SPD-ness ensures real positive eigenvalues.
Gram matrices: $A = X^\top X$ is always PSD; SPD if $X$ has full column rank. Cholesky of Gram matrix is related to QR of $X$ (lower Cholesky factor $L \approx R^\top$ from $X = QR$).
Schur complement: For block matrices, Cholesky reveals structure. SPD block structure enables efficient factorization and inversion.

Numerical and Implementation Notes

Always verify $A$ is SPD before calling Cholesky. Check symmetry: np.allclose(A, A.T).
Cholesky fails with LinAlgError if $A$ is not SPD. Use try-except and fall back to lstsq.
For ill-conditioned SPD matrices, add regularization: L = cholesky(A + lambda * I) before solving.
Store $L$ (not $A$) to save memory and clarify intent (factor reuse).
Two-stage solve: y = solve(L, b); x = solve(L.T, y) is faster than single solve(A, b) for many RHS.

Numerical and Shape Notes

Shapes: $A \in \mathbb{R}^{d \times d}$ (square, symmetric), $L \in \mathbb{R}^{d \times d}$ (lower triangular), $b, x, y \in \mathbb{R}^d$ (vectors).
Cholesky assumes lower triangular; for upper, use $A = U^\top U$ and solve via $U^\top y = b$, then $Ux = y$.
Broadcasting: for multiple RHS as matrix $B \in \mathbb{R}^{d \times m}$, solve Y = solve(L, B); X = solve(L.T, Y) (each column is a RHS).

Pedagogical Significance

This example teaches structure exploitation in numerical algorithms: recognizing SPD structure enables dramatic efficiency gains. Students learn that not all linear systems are equalâSPD systems admit specialized, faster, more stable solvers. This principle generalizes: sparse matrices, rank-deficient systems, Toeplitz matrices all have tailored algorithms. The factor reuse pattern is ubiquitous in production ML: precompute expensive factorizations, cache them, reuse for many RHS. Understanding this unlocks optimization opportunities throughout ML.

Connection to Broader Examples

Least squares (Ex 92): The normal equations $(X^\top X) w = X^\top y$ form a Gram matrix (SPD). Solving via Cholesky is numerically stable; more efficient than lstsq if using the same RHS repeatedly.
Conditioning (Ex 90): SPD matrices inherit conditioning from their eigenvalues. Ill-conditioning (large $\kappa$) means $L$ has poorly scaled entries. Cholesky is stable regardless, but conditioning still affects solution accuracy (preconditioning helps).
PSD matrices (Ex 89): Covariance matrices are PSD; Cholesky requires strict PD. Add jitter $\lambda I$ to ensure PD. Kernel matrices are PSD; regularized versions $(K + \lambda I)$ are strict PD.
PCA (Ex 91): Covariance matrix $X_c^\top X_c$ is SPD (after centering). Factor via Cholesky for stability when solving derived systems.
Orthogonality and projections (Ex 86): Gram matrix $X^\top X$ from least squares is SPD. Cholesky factors it; projection onto column space uses this factorization.
Rank and nullspace (Ex 87): For rank-deficient $X$, the Gram matrix $X^\top X$ is singular (not PD). Regularization $(X^\top X + \lambda I)$ restores PD-ness.
Ridge regression (ML): Ridge $(X^\top X + \lambda I)^{-1} X^\top y$ forms an SPD matrix; Cholesky solves this efficiently.