Sparse linear algebra matured in scientific computing to solve huge PDE systems and circuit simulations, where storing dense matrices was impossible. The core idea: exploit structure and sparsity pattern to reduce memory and arithmetic. ML repurposed these techniques for text (bag-of-words), graphs (adjacency/Laplacian), recommenders (user-item), and large-scale optimization (Hessian/Jacobian sparsity). Coordinate (COO), CSR/CSC, and block-sparse formats enable $O(\text{nnz})$ matvec, which is the workhorse primitive inside iterative solvers and graph neural networks.
- Log in to post comments
Show how exploiting sparsity turns an $O(n^2)$ operation into $O(\text{nnz})$, making otherwise impossible matrix operations practical. Emphasize that sparse formats store only nonzeros, reducing memory from $O(n^2)$ to $O(\text{nnz})$, and that sparse matvec (y = Ax) runs in time proportional to nnz. Highlight the graph setting (adjacency matrix) as a canonical ML use case where structure, not algorithmic cleverness alone, makes the computation feasible.
Build a 6-node cycle adjacency as COO, do sparse matvec, verify vs dense, and report nnz.
COO stores only nonzeros, enabling $O(nnz)$ matvec. Adjacency of a sparse graph has nnz proportional to edges, not $n^2$.
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
import numpy as np
from scripts.sparse_helpers import coo_from_edges, coo_matvec, coo_to_dense, nnz
n = 6
edges = [(i, (i+1)%n) for i in range(n)]
A = coo_from_edges(n, edges, weights=1.0)
x = np.ones(n)
y_sparse = coo_matvec(A, x)
y_dense = coo_to_dense(A) @ x
print("y_sparse:", y_sparse)
print("y_dense :", y_dense)
print("nnz:", nnz(A), "dense entries:", n*n)This code builds a sparse adjacency matrix for a 6-node directed cycle, multiplies it by an all-ones vector using sparse routines, and checks that the result matches the dense computation. The edge list [(i, (i+1)%n)] encodes a ring: each node points to the next with wrap-around. coo_from_edges creates a COO matrix $A \in \mathbb{R}^{6\times 6}$ with one nonzero per row, so $\text{nnz}(A) = 6$ versus $36$ dense entries. coo_matvec(A, x) performs $y = Ax$ in $O(\text{nnz})$ time; with $x = \mathbf{1}$, every node receives one incoming edge, so $y$ is all ones. coo_to_dense(A) @ x serves as a sanity check that sparse and dense results agree. The nnz print highlights memory savings and the $O(\text{nnz})$ runtime model that underpins large graph and text workloads.
Numerical and Shape Notes
- Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
- Axis discipline: Be explicit with
axisin reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1. - Broadcasting: Check that broadcasts are intended (e.g.,
(n,1)with(n,d)). Prefer reshape/expand-dims to make semantics clear. - Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
- Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
- Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
- Sanity checks: Print shapes and residuals (e.g.,
||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.
Numerical and Implementation Notes
- Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use
solve,lstsq, Cholesky/QR/SVD. - Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (
axis) and ensure broadcasts are intended. - Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
- Conditioning: Inspect
np.linalg.cond(A)when solutions look unstable; regularize (ridge) or rescale features to improve conditioning. - Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g.,
||Ax-b||, reconstruction errors) and assert finiteness. - Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
- Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
- Sanity checks: Compare against references (e.g.,
lstsqvs.solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
- Define the graph: Build an edge list for a 6-node directed cycle
[(i, (i+1)%n)]. - Create sparse matrix (COO):
coo_from_edges(n, edges, weights=1.0)builds $A \in \mathbb{R}^{6\times 6}$ with one nonzero per row. - Prepare input vector:
x = np.ones(n)ensures a simple, interpretable result. - Sparse matvec:
y_sparse = coo_matvec(A, x)iterates only over nnz triples. - Dense validation:
y_dense = coo_to_dense(A) @ xchecks that sparse and dense outputs agree. - Count nnz:
nnz(A)reports stored nonzeros; compare ton*ndense entries to highlight savings. - Complexity: Time is $O(\text{nnz}) = O(n)$ for this graph; memory is $O()` vs. $O(n^2)$ dense.
- Generalization: Swap in larger graphs or weighted edges; same pattern holds, and savings grow with sparsity.
Pedagogical Significance
- Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
- ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
- Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
- Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
- Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
- Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.
ML Examples and Patterns
- Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
- Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
- Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
- Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
- Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
- Sparsity cuts memory from $O(n^2)$ to $O(\text{nnz})$: Only nonzeros are stored.
- Sparse matvec is $O(\text{nnz})$: Runtime proportional to number of nonzeros, not $n^2$.
- Graphs yield natural sparsity: Adjacency of a cycle has $\text{nnz} = O(n)$ vs. $n^2$ dense entries.
- Dense vs. sparse parity check: Converting to dense and multiplying matches sparse output, validating correctness.
- COO format basics: Store
(row, col, value)triples; easy to build from edge lists. - Memory and compute savings grow with $n$: Advantage increases as graphs scale.
- Shape discipline: Sparse matvec must respect dimensions; adjacency is square here, but methods generalize to rectangular sparse matrices.
Part 1: Sparse adjacency construction - Edge list captures graph structure; COO stores (row, col, value) for each edge. - A cycle graph has exactly one outgoing edge per node, so $\text{nnz} = n$.
Part 2: Sparse matvec - coo_matvec touches only nonzeros; work scales with $\text{nnz}$. - For $x = \mathbf{1}$ on a cycle, each node sums one neighbor â output is all ones.
Part 3: Dense check and parity - Converting to dense and multiplying provides a correctness oracle. - nnz vs. $n^2$ illustrates memory savings: $6$ vs. $36$ entries here; gap grows with $n$.
Why This Matters for ML - Sparse representations enable training on large graphs, vocabularies, and user-item matrices that would be impossible dense. - $O(\text{nnz})$ matvec is the inner loop of GNNs, PageRank, and iterative solvers for large ML systems.
Connection to ML Applications - Graph ML: Message passing uses sparse adjacency for aggregation. - Recommender systems: User-item matrices are extremely sparse; all core ops are $O(\text{nnz})$. - NLP: Bag-of-words and TF-IDF are sparse; sparse matvec powers retrieval and linear models. - Sparse attention: Block/neighbor attention restricts nnz to keep attention layers tractable.
Connection to Linear Algebra Theory - Sparsity patterns encode graphs; powers of $A$ trace walks. - Graph Laplacians are sparse, symmetric, and singular; spectral methods exploit sparsity.
Numerical and Implementation Notes - Avoid densifying large sparse matrices; memory blow-up is immediate. - Choose formats wisely: COO for construction, CSR/CSC for repeated matvec. - Check shapes: adjacency is $n\times n$; input vector must be length $n$. - Be mindful of data types; use float64 for stable accumulation.
Numerical and Shape Notes - Validate with a dense check on small problems, then trust sparse ops for scale. - Count nnz to estimate memory and runtime; $O(\text{nnz})$ dominates.
Sparse matrix methods emerged in the 1960sâ1970s to solve large PDEs and circuit problems where dense storage was infeasible. Early formats (linked lists, skyline) evolved into modern COO/CSR/CSC standards that balance construction ease with fast matvec. Graph theory provided a natural source of sparsity: adjacency and Laplacian matrices are typically $O(n)$ nnz, enabling PageRank, spectral clustering, and shortest-path algorithms at web scale. In ML, sparse linear algebra powers bag-of-words models, recommender systems, and graph neural networks; production systems often hinge on efficient $O(\text{nnz})$ kernels. Hardware and libraries (BLAS, cuSPARSE, MKL) now expose optimized sparse primitives, while block/structured sparsity underpins modern sparse attention and mixture-of-experts architectures.
- Chapter 4 (Linear maps): Sparsity is structure in the matrix representation of a linear map.
- Chapter 5 (Inner products): Sparse dot products power sparse matvec.
- Chapter 6 (Projections): Sparse projectors arise in graph-based smoothing and label propagation.
- Chapter 7 (Rank/nullspace): Sparsity affects rank; graph Laplacians are sparse and singular.
- Chapter 10 (SVD): Sparse SVD/PCA for large graphs and documents.
- Chapter 12 (Least-squares): Iterative LS solvers rely on $O(\text{nnz})$ matvec.
- Chapter 13 (Solving systems): Krylov solvers (CG/GMRES) use sparse matvec as the core primitive.
- Chapter 14 (Conditioning): Sparsity plus preconditioning improves stability and speed.
- Chapter 15 (Sparse): Core chapter; COO/CSR/CSC trade-offs.
- Graph ML: Adjacency/Laplacian sparsity underpins GNN message passing.
Comments