Example number
47
Example slug
example_47_sparse_matrices_adjacency_coo_onnz_matvec
Background

Sparse linear algebra matured in scientific computing to solve huge PDE systems and circuit simulations, where storing dense matrices was impossible. The core idea: exploit structure and sparsity pattern to reduce memory and arithmetic. ML repurposed these techniques for text (bag-of-words), graphs (adjacency/Laplacian), recommenders (user-item), and large-scale optimization (Hessian/Jacobian sparsity). Coordinate (COO), CSR/CSC, and block-sparse formats enable $O(\text{nnz})$ matvec, which is the workhorse primitive inside iterative solvers and graph neural networks.

Purpose

Show how exploiting sparsity turns an $O(n^2)$ operation into $O(\text{nnz})$, making otherwise impossible matrix operations practical. Emphasize that sparse formats store only nonzeros, reducing memory from $O(n^2)$ to $O(\text{nnz})$, and that sparse matvec (y = Ax) runs in time proportional to nnz. Highlight the graph setting (adjacency matrix) as a canonical ML use case where structure, not algorithmic cleverness alone, makes the computation feasible.

Problem

Build a 6-node cycle adjacency as COO, do sparse matvec, verify vs dense, and report nnz.

Solution (Math)

COO stores only nonzeros, enabling $O(nnz)$ matvec. Adjacency of a sparse graph has nnz proportional to edges, not $n^2$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.sparse_helpers import coo_from_edges, coo_matvec, coo_to_dense, nnz

n = 6
edges = [(i, (i+1)%n) for i in range(n)]
A = coo_from_edges(n, edges, weights=1.0)

x = np.ones(n)
y_sparse = coo_matvec(A, x)
y_dense = coo_to_dense(A) @ x

print("y_sparse:", y_sparse)
print("y_dense :", y_dense)
print("nnz:", nnz(A), "dense entries:", n*n)
Code Introduction

This code builds a sparse adjacency matrix for a 6-node directed cycle, multiplies it by an all-ones vector using sparse routines, and checks that the result matches the dense computation. The edge list [(i, (i+1)%n)] encodes a ring: each node points to the next with wrap-around. coo_from_edges creates a COO matrix $A \in \mathbb{R}^{6\times 6}$ with one nonzero per row, so $\text{nnz}(A) = 6$ versus $36$ dense entries. coo_matvec(A, x) performs $y = Ax$ in $O(\text{nnz})$ time; with $x = \mathbf{1}$, every node receives one incoming edge, so $y$ is all ones. coo_to_dense(A) @ x serves as a sanity check that sparse and dense results agree. The nnz print highlights memory savings and the $O(\text{nnz})$ runtime model that underpins large graph and text workloads.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Define the graph: Build an edge list for a 6-node directed cycle [(i, (i+1)%n)].
  2. Create sparse matrix (COO): coo_from_edges(n, edges, weights=1.0) builds $A \in \mathbb{R}^{6\times 6}$ with one nonzero per row.
  3. Prepare input vector: x = np.ones(n) ensures a simple, interpretable result.
  4. Sparse matvec: y_sparse = coo_matvec(A, x) iterates only over nnz triples.
  5. Dense validation: y_dense = coo_to_dense(A) @ x checks that sparse and dense outputs agree.
  6. Count nnz: nnz(A) reports stored nonzeros; compare to n*n dense entries to highlight savings.
  7. Complexity: Time is $O(\text{nnz}) = O(n)$ for this graph; memory is $O()` vs. $O(n^2)$ dense.
  8. Generalization: Swap in larger graphs or weighted edges; same pattern holds, and savings grow with sparsity.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Sparsity cuts memory from $O(n^2)$ to $O(\text{nnz})$: Only nonzeros are stored.
  • Sparse matvec is $O(\text{nnz})$: Runtime proportional to number of nonzeros, not $n^2$.
  • Graphs yield natural sparsity: Adjacency of a cycle has $\text{nnz} = O(n)$ vs. $n^2$ dense entries.
  • Dense vs. sparse parity check: Converting to dense and multiplying matches sparse output, validating correctness.
  • COO format basics: Store (row, col, value) triples; easy to build from edge lists.
  • Memory and compute savings grow with $n$: Advantage increases as graphs scale.
  • Shape discipline: Sparse matvec must respect dimensions; adjacency is square here, but methods generalize to rectangular sparse matrices.
Notes

Part 1: Sparse adjacency construction - Edge list captures graph structure; COO stores (row, col, value) for each edge. - A cycle graph has exactly one outgoing edge per node, so $\text{nnz} = n$.

Part 2: Sparse matvec - coo_matvec touches only nonzeros; work scales with $\text{nnz}$. - For $x = \mathbf{1}$ on a cycle, each node sums one neighbor → output is all ones.

Part 3: Dense check and parity - Converting to dense and multiplying provides a correctness oracle. - nnz vs. $n^2$ illustrates memory savings: $6$ vs. $36$ entries here; gap grows with $n$.

Why This Matters for ML - Sparse representations enable training on large graphs, vocabularies, and user-item matrices that would be impossible dense. - $O(\text{nnz})$ matvec is the inner loop of GNNs, PageRank, and iterative solvers for large ML systems.

Connection to ML Applications - Graph ML: Message passing uses sparse adjacency for aggregation. - Recommender systems: User-item matrices are extremely sparse; all core ops are $O(\text{nnz})$. - NLP: Bag-of-words and TF-IDF are sparse; sparse matvec powers retrieval and linear models. - Sparse attention: Block/neighbor attention restricts nnz to keep attention layers tractable.

Connection to Linear Algebra Theory - Sparsity patterns encode graphs; powers of $A$ trace walks. - Graph Laplacians are sparse, symmetric, and singular; spectral methods exploit sparsity.

Numerical and Implementation Notes - Avoid densifying large sparse matrices; memory blow-up is immediate. - Choose formats wisely: COO for construction, CSR/CSC for repeated matvec. - Check shapes: adjacency is $n\times n$; input vector must be length $n$. - Be mindful of data types; use float64 for stable accumulation.

Numerical and Shape Notes - Validate with a dense check on small problems, then trust sparse ops for scale. - Count nnz to estimate memory and runtime; $O(\text{nnz})$ dominates.

History and Applications

Sparse matrix methods emerged in the 1960s–1970s to solve large PDEs and circuit problems where dense storage was infeasible. Early formats (linked lists, skyline) evolved into modern COO/CSR/CSC standards that balance construction ease with fast matvec. Graph theory provided a natural source of sparsity: adjacency and Laplacian matrices are typically $O(n)$ nnz, enabling PageRank, spectral clustering, and shortest-path algorithms at web scale. In ML, sparse linear algebra powers bag-of-words models, recommender systems, and graph neural networks; production systems often hinge on efficient $O(\text{nnz})$ kernels. Hardware and libraries (BLAS, cuSPARSE, MKL) now expose optimized sparse primitives, while block/structured sparsity underpins modern sparse attention and mixture-of-experts architectures.

Connection to Broader Examples
  • Chapter 4 (Linear maps): Sparsity is structure in the matrix representation of a linear map.
  • Chapter 5 (Inner products): Sparse dot products power sparse matvec.
  • Chapter 6 (Projections): Sparse projectors arise in graph-based smoothing and label propagation.
  • Chapter 7 (Rank/nullspace): Sparsity affects rank; graph Laplacians are sparse and singular.
  • Chapter 10 (SVD): Sparse SVD/PCA for large graphs and documents.
  • Chapter 12 (Least-squares): Iterative LS solvers rely on $O(\text{nnz})$ matvec.
  • Chapter 13 (Solving systems): Krylov solvers (CG/GMRES) use sparse matvec as the core primitive.
  • Chapter 14 (Conditioning): Sparsity plus preconditioning improves stability and speed.
  • Chapter 15 (Sparse): Core chapter; COO/CSR/CSC trade-offs.
  • Graph ML: Adjacency/Laplacian sparsity underpins GNN message passing.

Comments