Sparse linear algebra matured in scientific computing (finite elements, CFD, circuit simulation) where matrices are huge but mostly zero. Modern ML reuses these ideas for text (bag-of-words has billions of zeros), recommender systems (userâitem graphs), GNNs (message passing on sparse adjacency), and large-scale optimization (sparse Jacobians/Hessians). COO/CSR/CSC formats avoid storing zeros and enable $O(\text{nnz})$ primitives; the entire stackâmemory layout, cache behavior, and kernel fusionâmatters for performance.
- Log in to post comments
Show how exploiting sparsity turns an $O(n^2)$ dense multiply into an $O(\text{nnz})$ operation, making graph-scale computations feasible in memory and time.
Build a 6-node cycle adjacency as COO, do sparse matvec, verify vs dense, and report nnz.
COO stores only nonzeros, enabling $O(nnz)$ matvec. Adjacency of a sparse graph has nnz proportional to edges, not $n^2$.
We use:
- Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
- Vectors are column vectors by default.
- $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
import numpy as np
from scripts.sparse_helpers import coo_from_edges, coo_matvec, coo_to_dense, nnz
n = 6
edges = [(i, (i+1)%n) for i in range(n)]
A = coo_from_edges(n, edges, weights=1.0)
x = np.ones(n)
y_sparse = coo_matvec(A, x)
y_dense = coo_to_dense(A) @ x
print("y_sparse:", y_sparse)
print("y_dense :", y_dense)
print("nnz:", nnz(A), "dense entries:", n*n)This code builds a sparse adjacency matrix in COO form for a directed cycle on $n=6$ nodes, then validates a sparse matrixâvector product against a dense implementation. The edge list edges = [(i, (i+1)%n) for i in range(n)] encodes one outgoing edge per node, so the adjacency $A \in \mathbb{R}^{n\times n}$ has exactly $n$ nonzeros with $A_{i,(i+1)\bmod n} = 1$.
With $x = \mathbf{1} \in \mathbb{R}^n$, the product $y = Ax$ equals the row sums of $A$. Because each row contains a single $1$, both the sparse path coo_matvec(A, x) and the dense path coo_to_dense(A) @ x return $y = \mathbf{1}$. The printout compares y_sparse and y_dense to confirm correctness and reports nnz(A) = n versus $n^2$ dense entries, highlighting the memory and compute savings of sparse storage.
COO stores $(\text{row}, \text{col}, \text{value})$ triplets and enables matvec in $O(\text{nnz})$, while dense multiplication is $O(n^2)$. Converting to dense via coo_to_dense is used here solely for verification; avoid it for large $n$ due to quadratic time and memory. Shape discipline: $A \in \mathbb{R}^{n\times n}$, $x \in \mathbb{R}^n$, $y \in \mathbb{R}^n$, and $y = Ax$. A simple robustness check is to replace $x$ with a random vector; y_sparse and y_dense should still match exactly while the sparse path retains $O(\text{nnz})$ cost.
Numerical and Shape Notes
- Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
- Axis discipline: Be explicit with
axisin reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to â1. - Broadcasting: Check that broadcasts are intended (e.g.,
(n,1)with(n,d)). Prefer reshape/expand-dims to make semantics clear. - Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
- Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
- Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
- Sanity checks: Print shapes and residuals (e.g.,
||Ax-b||, reconstruction error, row-sum â 1). Assert finiteness and expected monotonicity where applicable.
Numerical and Implementation Notes
- Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use
solve,lstsq, Cholesky/QR/SVD. - Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (
axis) and ensure broadcasts are intended. - Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
- Conditioning: Inspect
np.linalg.cond(A)when solutions look unstable; regularize (ridge) or rescale features to improve conditioning. - Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g.,
||Ax-b||, reconstruction errors) and assert finiteness. - Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
- Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
- Sanity checks: Compare against references (e.g.,
lstsqvs.solve), check orthogonality (U.T @ U â I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
- Build the edge list for a directed 6-cycle with
(i, (i+1)%n)so each row has exactly one nonzero. - Convert edges to COO via
coo_from_edges, producing aligned arrays of rows, cols, and data. - Form the test vector $x=\mathbf{1}$ to make the expected output easy to reason about (each row sum is 1).
- Compute $y_{\text{sparse}} = \text{coo\_matvec}(A, x)$ in $O(\text{nnz})$.
- Compute $y_{\text{dense}} = \text{coo\_to\_dense}(A) @ x$ as a correctness oracle (but not for production scale).
- Compare
y_sparseandy_denseelementwise; they should match exactly for this integer-weighted graph. - Report
nnz(A)alongside the dense size $n^2$ to highlight memory savings ($\text{nnz}=n$ vs $n^2$ entries). - Optional: replace $x$ with random values to stress-test equality while preserving $O(\text{nnz})$ cost on the sparse path.
Pedagogical Significance
- Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
- ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
- Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
- Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
- Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
- Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.
ML Examples and Patterns
- Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
- Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
- Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
- Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
- Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
- The single takeaway: sparse representations cut both storage and matvec cost from $O(n^2)$ to $O(\text{nnz})$.
- How an adjacency matrix for a cycle graph has $\text{nnz}=O(n)$ instead of $n^2$, and why that matters for scalability.
- Verification that sparse and dense matvec agree numerically on the same data, guarding against implementation bugs.
- How COO triplets (row, col, value) map cleanly to graph edge lists used across ML workloads.
- The connection to ML computation patterns: fit (gradient steps on sparse features), project (message passing on graphs), decompose (sparse factorizations), solve (iterative solvers on sparse systems).
- How shape discipline keeps $A\in\mathbb{R}^{n\times n}$, $x\in\mathbb{R}^n$, $y\in\mathbb{R}^n$ consistent across sparse and dense paths.
- Part 1: Constructing the sparse adjacency. Use $(i,(i{+}1)\bmod n)$ edges to create a directed cycle with exactly one nonzero per row.
- Part 2: Sparse matvec in $O(\text{nnz})$. COO matvec iterates over stored triplets; work scales with nonzeros, not $n^2$.
- Part 3: Dense verification oracle. Converting to dense is only for checking correctness on tiny graphs; avoid for large $n$.
- Why This Matters for ML. GNN message passing, TFâIDF bag-of-words, and recommender userâitem graphs are all sparse; $O(\text{nnz})$ kernels make them tractable.
- ML Examples and Patterns. Fit: gradient steps on sparse design matrices; project: graph diffusion/Laplacian smoothing; decompose: sparse factorizations; solve: iterative methods using sparse matvec.
- Connection to Linear Algebra Theory. Sparsity preserves linear structureâsame operators, different storage/complexity class.
- Numerical and Implementation Notes. Beware duplicate edges (sum or deduplicate), zero weights (skip), and index ordering (improves cache friendliness).
- Numerical and Shape Notes. Keep $A\in\mathbb{R}^{n\times n}$, $x\in\mathbb{R}^n$, $y\in\mathbb{R}^n$; verify outputs before scaling up.
- Pedagogical Significance. This is the minimal working demo that shows why sparse formats exist and how to test them safely.
Sparse matrix storage and kernels emerged in mid-20th-century scientific computing (finite elements, circuit simulation) to make otherwise impossible $O(n^2)$ and $O(n^3)$ workloads feasible. Graph algorithms later embraced sparsity for web-scale link analysis and social networks. In modern ML, sparsity underpins search/recommendation (userâitem graphs), NLP (bag-of-words/TFâIDF), and GNNs (message passing with sparse adjacency). Iterative solvers (CG/GMRES) rely on $O(\text{nnz})$ matvecs; without sparse formats, these methods would be intractable on real-world graphs. The same principles drive hardware/software co-design for sparse accelerators and kernel fusion in contemporary deep learning frameworks.
- Pairs with Chapter 15 (sparse) to show storage/layout fundamentals before factorization or iterative methods.
- Complements Example 15 sparse chapter scripts that rely on COO utilities from
scripts/sparse_helpers.py. - Bridges to graph-centric ML (PCA on graphs, Laplacian systems, spectral clustering) where adjacency sparsity is essential.
- Links to conditioning (Ch. 14): sparse structure does not fix conditioning; you still need preconditioning/regularization.
- Connects to least squares (Ch. 12): sparse design matrices benefit from $O(\text{nnz})$ matvecs inside CG or LSQR.
- Supports matrix products (Ch. 16): sparseâdense multiplies are ubiquitous in attention variants on sparse graphs.
- Reinforces shape discipline from early chapters: rows as examples, columns as features; here rows/cols are nodes.
- Sets up later examples using sparse solvers (CG/GMRES) where matvec dominates cost.
Comments