This code demonstrates Cholesky factorization as an efficient solver for symmetric positive-definite (SPD) systemsâa critical technique in ML for solving normal equations, Gaussian likelihoods, and kernel ridge regression. Cholesky exploits the SPD structure to decompose $A = LL^\top$ and solve $Ax = b$ via two triangular solves, which is twice as fast as general LU decomposition and more numerically stable.
Part 1: Cholesky Decomposition. The matrix $A \in \mathbb{R}^{2 \times 2}$ is $A = \begin{bmatrix} 4 & 1 \\ 1 & 3 \end{bmatrix}$, which is symmetric ($A = A^\top$) and positive-definite (all eigenvalues positive). The Cholesky factorization computes a lower-triangular matrix $L$ such that $A = LL^\top$. The code computes this via L = np.linalg.cholesky(A), yielding $L = \begin{bmatrix} 2 & 0 \\ 0.5 & \sqrt{2.75} \end{bmatrix} \approx \begin{bmatrix} 2.0 & 0.0 \\ 0.5 & 1.658 \end{bmatrix}$. You can verify: $LL^\top = A$ by matrix multiplication. Cholesky factorization only exists for SPD matricesâif $A$ has negative eigenvalues or isnât symmetric, cholesky raises a LinAlgError. Why Cholesky is special: For general matrices, LU decomposition requires $O(n^3)$ flops and produces two triangular matrices ($L$ and $U$). Cholesky exploits symmetry and positive-definiteness to produce a single triangular matrix $L$ in half the flops ($\frac{1}{2}n^3$ vs. $\frac{2}{3}n^3$ for LU), and guarantees numerical stability (no pivoting needed).
Part 2: Two-Step Solve via Triangular Systems. To solve $Ax = b$ using the Cholesky factorization $A = LL^\top$, rewrite the system as $LL^\top x = b$. This is solved in two steps: (1) Forward substitution: Solve $Ly = b$ for $y$ (lower-triangular system), (2) Backward substitution: Solve $L^\top x = y$ for $x$ (upper-triangular system). The chol_solve function implements this pattern:
def chol_solve(b):
y = np.linalg.solve(L, b) # Step 1: L y = b (forward)
return np.linalg.solve(L.T, y) # Step 2: L^T x = y (backward)
Each triangular solve is $O(n^2)$ flops (much faster than the $O(n^3)$ factorization). The function signature mirrors np.linalg.solve(A, b), but internally uses the precomputed Cholesky factor $L$. Key insight: Once $L$ is computed, solving multiple right-hand sides (different $b$ vectors) is extremely fastâonly two triangular solves per $b$, no re-factorization. This is critical for applications like Gaussian process regression, where the same covariance matrix is used for many predictions.
Part 3: Verification Against Direct Solve. The code tests the Cholesky solver on two right-hand sides: $b_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$, $b_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$. For each $b$, it computes x = chol_solve(b) (solution via Cholesky two-step) and x_ref = np.linalg.solve(A, b) (solution via NumPyâs general solver using LU with pivoting). The difference np.linalg.norm(x - x_ref) measures numerical agreement. For well-conditioned SPD matrices, this should be near machine precision ($\sim 10^{-15}$). Typical output:
b: [1. 2.] x: [0.18181818 0.63636364] diff: 0.0
b: [0. 1.] x: [-0.09090909 0.36363636] diff: 0.0
The difference is exactly zero (or $\sim 10^{-16}$), confirming that Cholesky gives the same solution as LU, but faster and more stably for SPD systems.
Why This Matters for ML. Normal equations in least-squares: The Gram matrix $X^\top X$ is always symmetric and positive semidefinite (PSD). If $X$ has full column rank, $X^\top X$ is positive-definite (PD), making Cholesky the optimal solver for normal equations $(X^\top X) \hat{w} = X^\top y$. Form $L$ once via Cholesky, then solve for $\hat{w}$ via two triangular solves. This is faster than np.linalg.lstsq when $X$ is well-conditioned and you need to solve for multiple $y$ vectors (e.g., multi-output regression). Ridge regression: The regularized system is $(X^\top X + \lambda I) \hat{w}_\lambda = X^\top y$. Adding $\lambda I$ ensures $X^\top X + \lambda I$ is strictly PD (even if $X^\top X$ is only PSD). Cholesky factorization is the recommended approach. This avoids forming $(X^\top X + \lambda I)^{-1}$ explicitly and is numerically stable. Gaussian processes and kernel methods: The kernel Gram matrix $K \in \mathbb{R}^{n \times n}$ is symmetric and PSD. For inference, solve $K \alpha = y$. Cholesky factorization of $K$ is the standard approach in GP libraries (scikit-learn, GPyTorch). The log-likelihood also requires $\log |K|$, which is efficiently computed from Cholesky: $\log |K| = \log |LL^\top| = 2 \log |L| = 2 \sum_{i=1}^n \log L_{ii}$, where $L_{ii}$ are the diagonal entries of $L$ (no determinant computation needed). Condition number and stability: Cholesky factorization is numerically stable for SPD matricesâno pivoting is needed, and the growth factor is bounded. If $A$ is nearly singular (small eigenvalues), Cholesky will fail gracefully (non-positive diagonal entry) rather than produce garbage results. The condition number relationship $\kappa(L) = \sqrt{\kappa(A)}$ explains improved stability.
Numerical Gotchas and Verification. Shape discipline: $A \in \mathbb{R}^{2 \times 2}$, $L \in \mathbb{R}^{2 \times 2}$, $b \in \mathbb{R}^2$, $x \in \mathbb{R}^2$, $y \in \mathbb{R}^2$. Cholesky only for SPD matrices: If $A$ is not symmetric or not positive-definite, cholesky raises LinAlgError. Check symmetry via np.allclose(A, A.T) and eigenvalues via np.linalg.eigvalsh(A) (all must be positive). Lower vs. upper triangular: NumPyâs cholesky returns the lower-triangular $L$ (SciPy returns upper by default). Always verify with np.allclose(L @ L.T, A). Numerical near-singularity: If the smallest eigenvalue of $A$ is very small ($\sim 10^{-15}$), Cholesky may fail or produce inaccurate results. Add regularization ($A + \epsilon I$ for small $\epsilon > 0$) to stabilize. Multiple right-hand sides: If solving $Ax = B$ where $B \in \mathbb{R}^{n \times k}$ (multiple columns), both np.linalg.solve(L, B) and solve_triangular handle this efficiently (vectorized triangular solve).
Connection to Linear Algebra Theory. Cholesky decomposition existence: A symmetric matrix $A$ has a Cholesky factorization $A = LL^\top$ if and only if $A$ is positive-definite. This is equivalent to: (1) All eigenvalues of $A$ are positive, (2) All leading principal minors of $A$ are positive (Sylvesterâs criterion), (3) $x^\top A x > 0$ for all $x \neq 0$ (quadratic form test). Computational complexity: Cholesky factorization is $\frac{1}{3}n^3$ flops (half of LUâs $\frac{2}{3}n^3$), forward/backward substitution is $n^2$ flops each (total $2n^2$ per solve). Determinant computation: For SPD matrices, $\det(A) = \det(L) \det(L^\top) = \det(L)^2 = \left(\prod_{i=1}^n L_{ii}\right)^2$, where $L_{ii}$ are the diagonal entries of $L$. This is far more stable than computing $\det(A)$ via LU. Condition number propagation: $\kappa(L) = \sqrt{\kappa(A)}$, so Cholesky reduces condition number by a factor of $\sqrt{\kappa(A)}$ compared to working with $A$ directly. This is one source of improved numerical stability.
Pedagogical Significance. This example is the canonical demonstration of Cholesky factorization as a specialized solver: (1) Cholesky exploits SPD structure to achieve 2x speedup over LU and improved stability, (2) Two-step solve (forward + backward) is the standard pattern for using Cholesky factors, (3) Precompute factorization once, solve many times is a critical optimization for ML applications, (4) Cholesky failure = non-PD matrixâa diagnostic tool for debugging covariance matrices. The code is minimal (13 lines), but the lesson is profound: know your matrix structure, choose your solver accordingly.
Comments