Example number
63
Example slug
example_63_sparse_matrices_adjacency_coo_onnz_matvec
Background

Sparse linear algebra matured in scientific computing (finite elements, CFD, circuit simulation) where matrices are huge but mostly zero. Modern ML reuses these ideas for text (bag-of-words has billions of zeros), recommender systems (user–item graphs), GNNs (message passing on sparse adjacency), and large-scale optimization (sparse Jacobians/Hessians). COO/CSR/CSC formats avoid storing zeros and enable $O(\text{nnz})$ primitives; the entire stack—memory layout, cache behavior, and kernel fusion—matters for performance.

Purpose

Show how exploiting sparsity turns an $O(n^2)$ dense multiply into an $O(\text{nnz})$ operation, making graph-scale computations feasible in memory and time.

Problem

Build a 6-node cycle adjacency as COO, do sparse matvec, verify vs dense, and report nnz.

Solution (Math)

COO stores only nonzeros, enabling $O(nnz)$ matvec. Adjacency of a sparse graph has nnz proportional to edges, not $n^2$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.sparse_helpers import coo_from_edges, coo_matvec, coo_to_dense, nnz

n = 6
edges = [(i, (i+1)%n) for i in range(n)]
A = coo_from_edges(n, edges, weights=1.0)

x = np.ones(n)
y_sparse = coo_matvec(A, x)
y_dense = coo_to_dense(A) @ x

print("y_sparse:", y_sparse)
print("y_dense :", y_dense)
print("nnz:", nnz(A), "dense entries:", n*n)
Code Introduction

This code builds a sparse adjacency matrix in COO form for a directed cycle on $n=6$ nodes, then validates a sparse matrix–vector product against a dense implementation. The edge list edges = [(i, (i+1)%n) for i in range(n)] encodes one outgoing edge per node, so the adjacency $A \in \mathbb{R}^{n\times n}$ has exactly $n$ nonzeros with $A_{i,(i+1)\bmod n} = 1$.

With $x = \mathbf{1} \in \mathbb{R}^n$, the product $y = Ax$ equals the row sums of $A$. Because each row contains a single $1$, both the sparse path coo_matvec(A, x) and the dense path coo_to_dense(A) @ x return $y = \mathbf{1}$. The printout compares y_sparse and y_dense to confirm correctness and reports nnz(A) = n versus $n^2$ dense entries, highlighting the memory and compute savings of sparse storage.

COO stores $(\text{row}, \text{col}, \text{value})$ triplets and enables matvec in $O(\text{nnz})$, while dense multiplication is $O(n^2)$. Converting to dense via coo_to_dense is used here solely for verification; avoid it for large $n$ due to quadratic time and memory. Shape discipline: $A \in \mathbb{R}^{n\times n}$, $x \in \mathbb{R}^n$, $y \in \mathbb{R}^n$, and $y = Ax$. A simple robustness check is to replace $x$ with a random vector; y_sparse and y_dense should still match exactly while the sparse path retains $O(\text{nnz})$ cost.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Build the edge list for a directed 6-cycle with (i, (i+1)%n) so each row has exactly one nonzero.
  • Convert edges to COO via coo_from_edges, producing aligned arrays of rows, cols, and data.
  • Form the test vector $x=\mathbf{1}$ to make the expected output easy to reason about (each row sum is 1).
  • Compute $y_{\text{sparse}} = \text{coo\_matvec}(A, x)$ in $O(\text{nnz})$.
  • Compute $y_{\text{dense}} = \text{coo\_to\_dense}(A) @ x$ as a correctness oracle (but not for production scale).
  • Compare y_sparse and y_dense elementwise; they should match exactly for this integer-weighted graph.
  • Report nnz(A) alongside the dense size $n^2$ to highlight memory savings ($\text{nnz}=n$ vs $n^2$ entries).
  • Optional: replace $x$ with random values to stress-test equality while preserving $O(\text{nnz})$ cost on the sparse path.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • The single takeaway: sparse representations cut both storage and matvec cost from $O(n^2)$ to $O(\text{nnz})$.
  • How an adjacency matrix for a cycle graph has $\text{nnz}=O(n)$ instead of $n^2$, and why that matters for scalability.
  • Verification that sparse and dense matvec agree numerically on the same data, guarding against implementation bugs.
  • How COO triplets (row, col, value) map cleanly to graph edge lists used across ML workloads.
  • The connection to ML computation patterns: fit (gradient steps on sparse features), project (message passing on graphs), decompose (sparse factorizations), solve (iterative solvers on sparse systems).
  • How shape discipline keeps $A\in\mathbb{R}^{n\times n}$, $x\in\mathbb{R}^n$, $y\in\mathbb{R}^n$ consistent across sparse and dense paths.
Notes
  • Part 1: Constructing the sparse adjacency. Use $(i,(i{+}1)\bmod n)$ edges to create a directed cycle with exactly one nonzero per row.
  • Part 2: Sparse matvec in $O(\text{nnz})$. COO matvec iterates over stored triplets; work scales with nonzeros, not $n^2$.
  • Part 3: Dense verification oracle. Converting to dense is only for checking correctness on tiny graphs; avoid for large $n$.
  • Why This Matters for ML. GNN message passing, TF–IDF bag-of-words, and recommender user–item graphs are all sparse; $O(\text{nnz})$ kernels make them tractable.
  • ML Examples and Patterns. Fit: gradient steps on sparse design matrices; project: graph diffusion/Laplacian smoothing; decompose: sparse factorizations; solve: iterative methods using sparse matvec.
  • Connection to Linear Algebra Theory. Sparsity preserves linear structure—same operators, different storage/complexity class.
  • Numerical and Implementation Notes. Beware duplicate edges (sum or deduplicate), zero weights (skip), and index ordering (improves cache friendliness).
  • Numerical and Shape Notes. Keep $A\in\mathbb{R}^{n\times n}$, $x\in\mathbb{R}^n$, $y\in\mathbb{R}^n$; verify outputs before scaling up.
  • Pedagogical Significance. This is the minimal working demo that shows why sparse formats exist and how to test them safely.
History and Applications

Sparse matrix storage and kernels emerged in mid-20th-century scientific computing (finite elements, circuit simulation) to make otherwise impossible $O(n^2)$ and $O(n^3)$ workloads feasible. Graph algorithms later embraced sparsity for web-scale link analysis and social networks. In modern ML, sparsity underpins search/recommendation (user–item graphs), NLP (bag-of-words/TF–IDF), and GNNs (message passing with sparse adjacency). Iterative solvers (CG/GMRES) rely on $O(\text{nnz})$ matvecs; without sparse formats, these methods would be intractable on real-world graphs. The same principles drive hardware/software co-design for sparse accelerators and kernel fusion in contemporary deep learning frameworks.

Connection to Broader Examples
  • Pairs with Chapter 15 (sparse) to show storage/layout fundamentals before factorization or iterative methods.
  • Complements Example 15 sparse chapter scripts that rely on COO utilities from scripts/sparse_helpers.py.
  • Bridges to graph-centric ML (PCA on graphs, Laplacian systems, spectral clustering) where adjacency sparsity is essential.
  • Links to conditioning (Ch. 14): sparse structure does not fix conditioning; you still need preconditioning/regularization.
  • Connects to least squares (Ch. 12): sparse design matrices benefit from $O(\text{nnz})$ matvecs inside CG or LSQR.
  • Supports matrix products (Ch. 16): sparse–dense multiplies are ubiquitous in attention variants on sparse graphs.
  • Reinforces shape discipline from early chapters: rows as examples, columns as features; here rows/cols are nodes.
  • Sets up later examples using sparse solvers (CG/GMRES) where matvec dominates cost.

Comments