Part 1: Cholesky Factorization
The code defines a $2 \times 2$ symmetric positive-definite matrix: \[
A = \begin{pmatrix} 4 & 1 \\ 1 & 3 \end{pmatrix}.
\] The Cholesky decomposition L = np.linalg.cholesky(A) factors $A$ into a lower-triangular matrix $L$ such that: \[
A = L L^\top.
\] For this example, $L$ is a $2 \times 2$ lower-triangular matrix with positive diagonal entries. The decomposition exists if and only if $A$ is symmetric positive-definiteâan important precondition. The algorithm runs in $O(d^3/3)$ time (half the cost of LU decomposition) and is numerically stable because it avoids pivoting (SPD matrices donât need it).
Part 2: Two-Step Triangular Solve via Cholesky
The function chol_solve(b) solves $Ax = b$ by exploiting the Cholesky factorization: 1. Forward substitution: y = np.linalg.solve(L, b) solves the lower-triangular system $Ly = b$ in $O(d^2)$ time via forward elimination. 2. Backward substitution: np.linalg.solve(L.T, y) solves the upper-triangular system $L^\top x = y$ in $O(d^2)$ time via back-substitution.
Together, these two operations solve $A x = (L L^\top) x = L(L^\top x) = b$ by first finding $L^\top x = y$, then solving for $y$. The total cost is $O(d^3/3)$ (Cholesky once) plus $O(d^2)$ (two triangular solves)âcompared to $O(d^3/3)$ for LU factorization plus $O(d^2)$ solves. For multiple right-hand sides, the Cholesky decomposition is reused, amortizing its cost.
Part 3: Numerical Validation
The code tests chol_solve on two right-hand sides: $b_1 = [1, 2]^\top$ and $b_2 = [0, 1]^\top$. For each, it compares the Cholesky-based solution x against a reference solution x_ref = np.linalg.solve(A, b) (which uses LU internally). The printed difference np.linalg.norm(x - x_ref) should be near machine precision ($\sim 10^{-14}$ for double precision), confirming the two methods agree numerically. Any discrepancy greater than $10^{-10}$ would signal a bug in the Cholesky implementation or an ill-conditioned system (though $A$ here is well-conditioned).
Why Cholesky Matters for ML
Computational efficiency: For problems with multiple right-hand sides (e.g., Bayesian inference sampling, least-squares with regularization), factorizing once and reusing dominates. Ridge regression solves $(X^\top X + \lambda I)w = X^\top y$ repeatedly as $\lambda$ variesâcomputing Cholesky once per $\lambda$ and solving triangular systems for each example is far cheaper than full LU decomposition each time.
Numerical stability: SPD matrices are inherently well-conditioned for Cholesky factorizationâno pivoting is required, and the condition number doesnât worsen during factorization. LU decomposition can require partial pivoting for non-SPD matrices, introducing extra operations. Cholesky avoids this overhead.
Why This Matters for ML
Gaussian processes: The covariance matrix $K$ is SPD; inference requires solving $K\alpha = y$ for the dual weights $\alpha$. Cholesky + triangular solves is the standard approach. For $n$ training examples, computing Cholesky is $O(n^3)$ once, then each prediction (solving a triangular system for a new test point) is $O(n^2)$.
Kernel methods: Support Vector Machines (SVMs) and kernel ridge regression require solving systems with the kernel Gram matrix $(K_{ij} = k(x_i, x_j))$. For RBF, polynomial, and many other kernels, $K$ is SPD. Cholesky enables efficient training and inference.
Optimization: Newtonâs method solves $(H)d = -\nabla L$ where $H$ is the Hessian. For convex losses (logistic regression, ridge regression, SVM), $H$ is SPD. Cholesky is the preferred factorization, more stable and faster than LU.
Bayesian sampling: Sampling from a Gaussian $\mathcal{N}(\mu, \Sigma)$ requires solving $\Sigma z = w$ for random noise $w$. Cholesky factorization $\Sigma = LL^\top$ enables efficient sampling via $z = L w$ (no solve needed; just matrix-vector multiply).
Connection to ML Applications
Ridge regression: Solving $(X^\top X + \lambda I)w = X^\top y$ with varying $\lambda$ requires Cholesky of the Gram matrix for each $\lambda$. The repeated solve pattern (multiple $\lambda$ values) benefits from factorization reuse.
Variational inference: Approximating complex posteriors via a Gaussian mean-field requires computing covariance-inverse-vector products repeatedlyâessential for optimization. Cholesky enables efficient computation.
Matrix completion: Low-rank matrix recovery solves SPD least-squares subproblems iteratively; Cholesky factorization per iteration is the workhorse.
Connection to Broader Linear Algebra
This example bridges several concepts: - Matrix factorization: Cholesky is one of three canonical decompositions (LU, QR, Cholesky/Eigendecomposition, SVD); choose based on structure. - Triangular systems: Forward/backward substitution are the workhorses of numerical linear algebra; understanding their cost ($O(d^2)$) is essential for algorithm design. - Positive definiteness: Many ML problems naturally produce SPD matrices (covariance, kernel Gram, Hessians of convex loss)ârecognizing this structure enables optimization. - Conditioning: SPD matrices have $\kappa(A) = \lambda_{\max} / \lambda_{\min}$ (ratio of extreme eigenvalues). Cholesky preserves conditioning, unlike some other factorizations.
Numerical and Implementation Notes
Use np.linalg.cholesky() for verified SPD matrices. For matrices that may not be SPD (e.g., covariance from finite samples), wrap in a try-except block and fall back to np.linalg.solve(A, b) or regularize $A + \epsilon I$ to improve conditioning. For large-scale systems (millions of variables), use sparse Cholesky (scipy.sparse.linalg.cholesky). The Cholesky factor $L$ also enables efficient computation of the determinant ($\det(A) = \prod_i L_{ii}^2$) and matrix inverse (via forward/backward substitution), though inverses should be avoided when possible. For sampling, Cholesky enables efficient Gaussian sampling via $z = \mu + L w$ where $w \sim \mathcal{N}(0, I)$.
Numerical and Shape Notes
Shape discipline: - $A \in \mathbb{R}^{d \times d}$: symmetric positive-definite matrix - $L \in \mathbb{R}^{d \times d}$: lower-triangular Cholesky factor - $b \in \mathbb{R}^d$: right-hand side vector (or multiple right-hand sides in a matrix) - $y \in \mathbb{R}^d$: intermediate vector after first solve - $x \in \mathbb{R}^d$: solution vector
Gotchas: 1. SPD requirement: Cholesky fails if $A$ is not positive-definite (e.g., singular, indefinite, or nearly so). Check eigenvalues or test decomposition before relying on it in production. 2. Lower vs. upper triangular: np.linalg.cholesky(A) returns the lower-triangular factor $L$. Using $L^\top$ (upper-triangular) in the second solve is critical; swapping them breaks the solution. 3. Symmetric assumption: Cholesky assumes $A = A^\top$ and only accesses the lower triangle. Asymmetric perturbations to the upper triangle are silently ignoredâverify symmetry if unsure. 4. Efficiency tradeoff: For a single solve, direct np.linalg.solve(A, b) may be faster due to implementation optimizations. Cholesky shines with multiple right-hand sides or repeated solves with the same SPD matrix. 5. Ill-conditioning: Even for SPD matrices, extreme condition numbers ($\kappa \gg 10^8$) lead to loss of precision. Regularization or better-conditioned reformulations may be necessary. 6. Multiple right-hand sides: To solve $A X = B$ for matrix $B$ with multiple columns, solve each column separately (via Cholesky reuse) or pass $B$ directly to np.linalg.solve(A, B) for efficiency.
Comments