ex1.ai

Example number

Example slug

example_45_solving_systems_cholesky_factor_reuse

Background

Historical context: Cholesky decomposition, discovered by AndrÃ©-Louis Cholesky (a French military engineer) in the early 1900s, factorizes SPD matrices into lower-triangular factors. It remained relatively obscure until the digital computer era, when numerical analysts (Wilkinson, Golub, others) realized its critical importance: Cholesky is faster and more stable than LU for SPD matrices, requiring only $d^3/3$ operations versus $2d^3/3$ for LU, and avoiding the numerical instability of partial pivoting.

Mathematical characterization: For symmetric positive-definite matrix $A$, the Cholesky factorization $A = LL^\top$ is unique when $L$ is required to be lower-triangular with positive diagonal. The system $Ax = b$ becomes $(LL^\top)x = b$, solved in two steps: 1. Forward substitution: Solve $Ly = b$ (lower-triangular, $O(d^2)$) 2. Backward substitution: Solve $L^\top x = y$ (upper-triangular, $O(d^2)$)

Total cost: $O(d^3/3)$ factorization + $O(d^2)$ per solve (negligible amortized cost for multiple solves).

SPD prevalence in ML: Covariance matrices (always SPD), kernel Gram matrices (SPD for many kernels), and Hessians of convex loss functions (e.g., logistic regression, ridge regression) are inherently SPD. Recognizing this structure enables specialized algorithms: Cholesky for solving, eigendecomposition for analysis, SVD for low-rank approximation.

Why not always use Cholesky? The SPD check is non-trivial for noisy or ill-conditioned matrices. For general matrices, LU with partial pivoting is more robust. Cholesky fails (throws an error) if the matrix is not SPD, whereas LU degrades gracefully. In practice, assume/verify SPD structure before Cholesky; use LU or QR as fallback.

Purpose

Demonstrate how Cholesky decomposition exploits symmetric positive-definite (SPD) structure to solve linear systems efficiently and stably. Show that factorization reuseâcomputing the decomposition once and solving multiple right-hand sides via forward/backward substitutionâamortizes the factorization cost, critical for problems with many solves. Build intuition for when to use Cholesky (SPD systems) versus LU (general) versus QR (least-squares) decompositions. Connect numerical linear algebra primitives to ML patterns: Gaussian processes (covariance matrices are SPD), kernel methods (Gram matrices are SPD), optimization (Hessians of convex loss are SPD), and sampling. Emphasize that recognizing and exploiting problem structureâhere, symmetry and positive-definitenessâyields dramatic speedups and stability improvements.

Problem

Solve an SPD system with Cholesky and reuse the factor for multiple right-hand sides.

Solution (Math)

For SPD $A$, Cholesky $A=LL^T$. Solve $Ax=b$ via $Ly=b$, $L^Tx=y$. Reuse $L$ for multiple $b$âs to amortize factorization cost.

We use:

Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
Vectors are column vectors by default.
$\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.

Solution (Python)


import numpy as np

A = np.array([[4., 1.],
              [1., 3.]])
L = np.linalg.cholesky(A)

def chol_solve(b):
    y = np.linalg.solve(L, b)
    return np.linalg.solve(L.T, y)

for b in [np.array([1.,2.]), np.array([0.,1.])]:
    x = chol_solve(b)
    x_ref = np.linalg.solve(A, b)
    print("b:", b, "x:", x, "diff:", np.linalg.norm(x-x_ref))

Code Introduction

This code demonstrates Cholesky decomposition for solving symmetric positive-definite (SPD) systems, comparing a manual two-step triangular solve against NumPyâs direct solver. It illustrates how exploiting matrix structureâspecifically, symmetry and positive-definitenessâenables faster and more numerically stable solutions.

The code defines a $2 \times 2$ symmetric positive-definite matrix: \[ A = \begin{pmatrix} 4 & 1 \\ 1 & 3 \end{pmatrix}. \] The Cholesky decomposition L = np.linalg.cholesky(A) factors $A$ into a lower-triangular matrix $L$ such that: \[ A = L L^\top. \] For this example, $L$ is a $2 \times 2$ lower-triangular matrix with positive diagonal entries. The decomposition exists if and only if $A$ is symmetric positive-definiteâan important precondition. The algorithm runs in $O(d^3/3)$ time (half the cost of LU decomposition) and is numerically stable because it avoids pivoting (SPD matrices donât need it).

The function chol_solve(b) solves $Ax = b$ by exploiting the Cholesky factorization: 1. Forward substitution: y = np.linalg.solve(L, b) solves the lower-triangular system $Ly = b$ in $O(d^2)$ time via forward elimination. 2. Backward substitution: np.linalg.solve(L.T, y) solves the upper-triangular system $L^\top x = y$ in $O(d^2)$ time via back-substitution.

Together, these two operations solve $A x = (L L^\top) x = L(L^\top x) = b$ by first finding $L^\top x = y$, then solving for $y$. The total cost is $O(d^3/3)$ (Cholesky once) plus $O(d^2)$ (two triangular solves)âcompared to $O(d^3/3)$ for LU factorization plus $O(d^2)$ solves. For multiple right-hand sides, the Cholesky decomposition is reused, amortizing its cost.

The code tests chol_solve on two right-hand sides: $b_1 = [1, 2]^\top$ and $b_2 = [0, 1]^\top$. For each, it compares the Cholesky-based solution x against a reference solution x_ref = np.linalg.solve(A, b) (which uses LU internally). The printed difference np.linalg.norm(x - x_ref) should be near machine precision ($\sim 10^{-14}$ for double precision), confirming the two methods agree numerically. Any discrepancy greater than $10^{-10}$ would signal a bug in the Cholesky implementation or an ill-conditioned system (though $A$ here is well-conditioned).

Why Cholesky matters for ML: For problems with multiple right-hand sides (e.g., Bayesian inference sampling, least-squares with regularization), factorizing once and reusing dominates. Ridge regression solves $(X^\top X + \lambda I)w = X^\top y$ repeatedly as $\lambda$ variesâcomputing Cholesky once per $\lambda$ and solving triangular systems for each example is far cheaper than full LU decomposition each time. SPD matrices are inherently well-conditioned for Cholesky factorizationâno pivoting is required, and the condition number doesnât worsen during factorization. LU decomposition can require partial pivoting for non-SPD matrices, introducing extra operations. Cholesky avoids this overhead. Many ML problems naturally produce SPD matrices: covariance matrices (always SPD), kernel Gram matrices (SPD for many kernels), and Hessians of convex loss functions (logistic regression, ridge regression, SVMs). Recognizing this structure enables specialized algorithms: Cholesky for solving, eigendecomposition for analysis, SVD for low-rank approximation. Gaussian processes require solving $K\alpha = y$ where $K$ is the covariance Gram matrix (SPD); Cholesky + triangular solves is the standard approach. Support Vector Machines and kernel ridge regression require solving systems with the kernel Gram matrix $(K_{ij} = k(x_i, x_j))$. For RBF, polynomial, and many other kernels, $K$ is SPD; Cholesky enables efficient training and inference. Newtonâs method requires solving $(H)d = -\nabla L$ where $H$ is the Hessian. For convex losses (logistic regression, ridge regression, SVM), $H$ is SPD; Cholesky is the preferred factorization. Sampling from a Gaussian $\mathcal{N}(\mu, \Sigma)$ requires solving $\Sigma z = w$ for random noise $w$. Cholesky factorization $\Sigma = LL^\top$ enables efficient sampling via $z = L w$ (no solve needed; just matrix-vector multiply). This example operationalizes the principle: exploit matrix structure to unlock faster, more stable algorithms. Cholesky is the minimal example showing why structure matters: recognizing symmetry and positive-definiteness yields speedups and stability improvements unavailable for general matrices.

Numerical Implementation Details

Numerical and Shape Notes

Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1.
Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).

Define SPD matrix: A = np.array([[4., 1.], [1., 3.]]) is $2 \times 2$, symmetric ($A = A^\top$), and positive-definite (eigenvalues $\lambda_1 \approx 4.16$, $\lambda_2 \approx 2.84$ are both positive).
Cholesky factorization: L = np.linalg.cholesky(A) computes the lower-triangular factor $L$ such that $A = LL^\top$ in $O(d^3/3)$ time. The algorithm is numerically stable and requires no pivoting for SPD matrices.
Forward substitution: y = np.linalg.solve(L, b) solves the lower-triangular system $Ly = b$ in $O(d^2)$ time via forward elimination.
Backward substitution: np.linalg.solve(L.T, y) solves the upper-triangular system $L^\top x = y$ in $O(d^2)$ time via back-substitution.
Solution assembly: The composite solve chol_solve(b) returns $x$ by combining the two triangular solves.
Reference solution: x_ref = np.linalg.solve(A, b) uses LU decomposition (general, more expensive) as ground truth.
Numerical comparison: np.linalg.norm(x - x_ref) measures the Euclidean distance between solutions; should be near machine precision ($\sim 10^{-14}$ for double precision).
Multiple right-hand sides: The loop tests chol_solve on two vectors $b_1 = [1, 2]^\top$ and $b_2 = [0, 1]^\top$, demonstrating that a single factorization supports efficient reuse.

What This Example Demonstrates

Pedagogical Significance

Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.

Cholesky factorization is half the cost of LU for SPD matrices: $O(d^3/3)$ vs.Â $O(2d^3/3)$, with better numerical stability due to no pivoting requirement.
Forward/backward substitution solves triangular systems in $O(d^2)$ time: Critical for amortizing factorization cost over many right-hand sides.
Structure exploitation pays off: Recognizing that a matrix is SPD (symmetric + positive-definite) enables specialized algorithms unavailable for general matrices.
Reuse pattern: Compute the Cholesky factorization once, then solve multiple right-hand sides by only doing cheap triangular solves.
Numerical equivalence: Cholesky-based solve (chol_solve) and direct solve (np.linalg.solve) give identical results (up to rounding error), validating the mathematical approach.
SPD structure is ubiquitous in ML: Covariance matrices, kernel Gram matrices, regularized Hessiansâall are SPD; understanding Cholesky unlocks efficient algorithms throughout statistics and machine learning.

Notes

Part 1: Cholesky Factorization

Part 2: Two-Step Triangular Solve via Cholesky

Part 3: Numerical Validation

Why Cholesky Matters for ML

Computational efficiency: For problems with multiple right-hand sides (e.g., Bayesian inference sampling, least-squares with regularization), factorizing once and reusing dominates. Ridge regression solves $(X^\top X + \lambda I)w = X^\top y$ repeatedly as $\lambda$ variesâcomputing Cholesky once per $\lambda$ and solving triangular systems for each example is far cheaper than full LU decomposition each time.

Numerical stability: SPD matrices are inherently well-conditioned for Cholesky factorizationâno pivoting is required, and the condition number doesnât worsen during factorization. LU decomposition can require partial pivoting for non-SPD matrices, introducing extra operations. Cholesky avoids this overhead.

Why This Matters for ML

Gaussian processes: The covariance matrix $K$ is SPD; inference requires solving $K\alpha = y$ for the dual weights $\alpha$. Cholesky + triangular solves is the standard approach. For $n$ training examples, computing Cholesky is $O(n^3)$ once, then each prediction (solving a triangular system for a new test point) is $O(n^2)$.

Kernel methods: Support Vector Machines (SVMs) and kernel ridge regression require solving systems with the kernel Gram matrix $(K_{ij} = k(x_i, x_j))$. For RBF, polynomial, and many other kernels, $K$ is SPD. Cholesky enables efficient training and inference.

Optimization: Newtonâs method solves $(H)d = -\nabla L$ where $H$ is the Hessian. For convex losses (logistic regression, ridge regression, SVM), $H$ is SPD. Cholesky is the preferred factorization, more stable and faster than LU.

Bayesian sampling: Sampling from a Gaussian $\mathcal{N}(\mu, \Sigma)$ requires solving $\Sigma z = w$ for random noise $w$. Cholesky factorization $\Sigma = LL^\top$ enables efficient sampling via $z = L w$ (no solve needed; just matrix-vector multiply).

Connection to ML Applications

Ridge regression: Solving $(X^\top X + \lambda I)w = X^\top y$ with varying $\lambda$ requires Cholesky of the Gram matrix for each $\lambda$. The repeated solve pattern (multiple $\lambda$ values) benefits from factorization reuse.

Variational inference: Approximating complex posteriors via a Gaussian mean-field requires computing covariance-inverse-vector products repeatedlyâessential for optimization. Cholesky enables efficient computation.

Matrix completion: Low-rank matrix recovery solves SPD least-squares subproblems iteratively; Cholesky factorization per iteration is the workhorse.

Connection to Broader Linear Algebra

This example bridges several concepts: - Matrix factorization: Cholesky is one of three canonical decompositions (LU, QR, Cholesky/Eigendecomposition, SVD); choose based on structure. - Triangular systems: Forward/backward substitution are the workhorses of numerical linear algebra; understanding their cost ($O(d^2)$) is essential for algorithm design. - Positive definiteness: Many ML problems naturally produce SPD matrices (covariance, kernel Gram, Hessians of convex loss)ârecognizing this structure enables optimization. - Conditioning: SPD matrices have $\kappa(A) = \lambda_{\max} / \lambda_{\min}$ (ratio of extreme eigenvalues). Cholesky preserves conditioning, unlike some other factorizations.

Numerical and Implementation Notes

Use np.linalg.cholesky() for verified SPD matrices. For matrices that may not be SPD (e.g., covariance from finite samples), wrap in a try-except block and fall back to np.linalg.solve(A, b) or regularize $A + \epsilon I$ to improve conditioning. For large-scale systems (millions of variables), use sparse Cholesky (scipy.sparse.linalg.cholesky). The Cholesky factor $L$ also enables efficient computation of the determinant ($\det(A) = \prod_i L_{ii}^2$) and matrix inverse (via forward/backward substitution), though inverses should be avoided when possible. For sampling, Cholesky enables efficient Gaussian sampling via $z = \mu + L w$ where $w \sim \mathcal{N}(0, I)$.

Numerical and Shape Notes

Shape discipline: - $A \in \mathbb{R}^{d \times d}$: symmetric positive-definite matrix - $L \in \mathbb{R}^{d \times d}$: lower-triangular Cholesky factor - $b \in \mathbb{R}^d$: right-hand side vector (or multiple right-hand sides in a matrix) - $y \in \mathbb{R}^d$: intermediate vector after first solve - $x \in \mathbb{R}^d$: solution vector

Gotchas: 1. SPD requirement: Cholesky fails if $A$ is not positive-definite (e.g., singular, indefinite, or nearly so). Check eigenvalues or test decomposition before relying on it in production. 2. Lower vs.Â upper triangular: np.linalg.cholesky(A) returns the lower-triangular factor $L$. Using $L^\top$ (upper-triangular) in the second solve is critical; swapping them breaks the solution. 3. Symmetric assumption: Cholesky assumes $A = A^\top$ and only accesses the lower triangle. Asymmetric perturbations to the upper triangle are silently ignoredâverify symmetry if unsure. 4. Efficiency tradeoff: For a single solve, direct np.linalg.solve(A, b) may be faster due to implementation optimizations. Cholesky shines with multiple right-hand sides or repeated solves with the same SPD matrix. 5. Ill-conditioning: Even for SPD matrices, extreme condition numbers ($\kappa \gg 10^8$) lead to loss of precision. Regularization or better-conditioned reformulations may be necessary. 6. Multiple right-hand sides: To solve $A X = B$ for matrix $B$ with multiple columns, solve each column separately (via Cholesky reuse) or pass $B$ directly to np.linalg.solve(A, B) for efficiency.

History and Applications

Early foundations: Cholesky decomposition was discovered by AndrÃ©-Louis Cholesky, a French military engineer, in the early 1900s. His work focused on solving systems arising in geodesy and surveyingâclassic applications of least-squares fitting. The decomposition remained relatively obscure in the pre-computer era because hand calculation was tedious.

Numerical algebra awakening: In the digital computer era (1950s-1970s), numerical analysts (Wilkinson, Golub, Van Loan, others) realized Choleskyâs critical importance: it is faster and more stable than LU for SPD matrices, requiring only $d^3/3$ operations versus $2d^3/3$ for LU, and avoiding the numerical instability of partial pivoting. Wilkinsonâs analysis of rounding errors made clear that Choleskyâs stability (no pivoting needed) is a numerical advantage, not just a computational one.

Rise of iterative solvers: In the 1970s-1980s, as matrices grew larger, iterative solvers (conjugate gradient, MINRES) emerged for SPD systems. These algorithms implicitly exploit Cholesky structure (via matrix-vector products) without explicitly forming the factorization. Cholesky remains the gold standard for direct solving of SPD systems.

Modern ML applications: - Gaussian processes (1990s-present): Cholesky factorization of the kernel Gram matrix $K$ is the workhorse for GP inference. Scaling GPs to large datasets (via sparse approximations) fundamentally relies on efficient Cholesky implementations. - Optimization libraries: Modern optimization packages (CVXPY, PyTorch, TensorFlow) use Cholesky internally for solving convex problems with SPD Hessians. The principle âuse Cholesky for SPD systemsâ is nearly universal. - Probabilistic inference: Variational inference, belief propagation, and mean-field methods repeatedly solve SPD systems. Cholesky factorization enables efficient iterative algorithms. - Kernel methods: SVMs, kernel ridge regression, and other kernel methods rely on Cholesky for efficient training and inference. - Matrix completion and low-rank recovery: Algorithms (alternating least squares, proximal methods) solve SPD subproblems iteratively; Cholesky is the factorization of choice. - Bayesian neural networks: Posterior inference over weights requires sampling from high-dimensional Gaussians; Cholesky enables efficient sampling via $z = \mu + L w$ where $L$ is the Cholesky factor of the posterior covariance.

Key historical lesson: Recognizing and exploiting problem structureâhere, symmetry and positive-definitenessâyields dramatic algorithmic and numerical improvements. Cholesky is not just a computational trick; it is a fundamental principle: structure â specialization â efficiency + stability. This principle extends throughout numerical linear algebra and ML: exploiting low-rank structure (SVD), sparsity (sparse solvers), and other patterns consistently outperforms generic algorithms.

Contemporary challenges: Modern large-scale problems (deep learning, large-scale GPs, distributed optimization) push Cholesky to its limits. Distributed Cholesky algorithms, GPU acceleration, and hybrid direct-iterative approaches extend Choleskyâs reach. The principle remains: when a matrix is SPD, Cholesky is the natural choice.

Connection to Broader Examples

Vector spaces (Ch. 1): The solutions $x_1, x_2$ live in $\mathbb{R}^d$ (the feature space); they span a subspace determined by $A$âs invertibility.
Span and linear combinations (Ch. 2): The solve $Ax = b$ expresses $b$ as a linear combination of $A$âs rows (or columns, by duality).
Basis and dimension (Ch. 3): If $A$ is SPD with full rank, its columns form a basis for $\mathbb{R}^d$; Cholesky reveals this via the invertible $L$.
Linear maps and matrices (Ch. 4): Solving $Ax = b$ inverts the linear map $x \mapsto Ax$.
Inner products and norms (Ch. 5): $A$ is often defined via inner products (e.g., covariance $A = X^\top X$) or encodes a norm (e.g., $\|x\|_A = \sqrt{x^\top A x}$).
Orthogonality and projections (Ch. 6): Cholesky factors induce orthogonal structure; solving via Cholesky preserves orthogonal relationships.
Rank and invertibility (Ch. 7): SPD matrices have full rank; Cholesky factorization confirms invertibility and reveals no zero pivots.
Eigendecomposition (Ch. 8): For SPD $A$, eigenvalues determine conditioning; Cholesky is numerically stable because SPD matrices donât need pivoting (eigenvalues are all positive).
SVD (Ch. 10): SVD of $A$ gives $A = U\Sigma U^\top$ (for SPD); Cholesky gives $A = LL^\top$ (lower-triangular). SVD reveals singular values; Cholesky is specialized for SPD structure.
Conditioning and stability (Ch. 14): Cholesky is numerically stable because $\kappa(A)$ (condition number of SPD $A$) determines solve accuracy, not the factorization process; no pivoting amplifies errors.

Numerical and Shape Notes

Numerical and Implementation Notes

Pedagogical Significance

ML Examples and Patterns

Comments