Example number
15
Example slug
example_15_sparse_adjacency_matvec_pure_numpy_coo
Background

A matrix is sparse if most entries are zero. Storing all $n^2$ entries wastes memory and computation. Instead, sparse formats store only nonzeros: COO (coordinate) uses three arrays (row indices, column indices, values); CSR (Compressed Sparse Row) optimizes row-wise access; CSC (Compressed Sparse Column) optimizes column-wise access. For a graph adjacency matrix with $n$ nodes and $|E|$ edges, dense storage requires $O(n^2)$ memory; sparse requires $O(|E|)$—the difference between gigabytes and megabytes for social networks.

In ML, sparsity appears everywhere: graph neural networks propagate features via sparse adjacency matrices; text processing uses sparse term-document matrices (TF-IDF); recommender systems store user-item interactions sparsely (most users rate few items); attention masks in transformers indicate which tokens attend to which. Sparse operations enable scaling to millions of dimensions by touching only nonzeros. The fundamental insight: $y = Ax$ costs $O(n^2)$ dense, but $O(|E|)$ sparse—a game-changer when $|E| \ll n^2$.

Purpose

Build intuition that exploiting structure (sparsity) transforms infeasible $O(n^2)$ operations into practical $O(|E|)$ computation:

  • Understand COO (coordinate) format: store only nonzero entries as (row, col, value) triplets.
  • See that sparse matrix-vector multiplication costs $O(|E|)$ where $|E|$ is the number of edges/nonzeros.
  • Recognize when sparsity matters: graphs, text, recommender systems, attention masks.
  • Connect to ML: graph neural networks, sparse features, large-scale optimization.
Problem

Build a small sparse adjacency matrix and compute A@x using COO matvec; verify vs dense.

Solution (Math)

Sparse COO stores only nonzeros. Matvec loops over nonzeros, costing $O(nnz)$ instead of $O(n^2)$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.sparse_helpers import coo_from_edges, coo_matvec, coo_to_dense, nnz

n=5
edges=[(0,1),(1,2),(2,3),(3,4)]
A=coo_from_edges(n, edges, weights=1.0)
x=np.ones(n)

y1=coo_matvec(A,x)
y2=coo_to_dense(A)@x
print("y_coo:", y1)
print("y_dense:", y2)
print("nnz:", nnz(A))
Code Introduction

This snippet demonstrates sparse matrix-vector multiplication using coordinate (COO) format, contrasting memory-efficient sparse operations against dense matrix conversion. The code constructs a graph adjacency matrix from an edge list, then computes matrix-vector products via two paths to verify correctness.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Construction: A = coo_from_edges(n, edges, weights=1.0) builds a COO matrix from an edge list (graph edges).
  • Sparse matvec: y = coo_matvec(A, x) loops over stored nonzeros, accumulating y[row] += value * x[col]; complexity $O(|E|)$.
  • Dense conversion: A_dense = coo_to_dense(A) materializes the full $n \times n$ array; use only for debugging/small matrices.
  • Sparsity diagnostic: nnz(A) counts nonzeros; if nnz(A) << n^2, sparse wins; if nnz(A) ~ n^2, sparse overhead may not help.
  • Shapes: adjacency $A \in \mathbb{R}^{n \times n}$, vector $x \in \mathbb{R}^n$, result $y \in \mathbb{R}^n$; for path graph with $n=5$, $|E|=4$ (or 8 if symmetric).
  • Memory: dense stores $25$ floats; sparse stores $3 \times 4 = 12$ entries (row, col, value)—already a 2× savings at tiny scale.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • COO format stores only nonzeros: (row, col, value) triplets instead of full $n \times n$ array.
  • Sparse matvec costs $O(|E|)$ where $|E|$ = number of edges/nonzeros, not $O(n^2)$.
  • Memory savings: $O(|E|)$ vs $O(n^2)$ storage; for $|E| \ll n^2$, this is orders of magnitude.
  • Verification: sparse and dense results match; always check correctness when optimizing.
  • ML relevance: graph adjacency, attention masks, feature matrices—sparsity is pervasive.
Notes

Part 1: Core setup - Sparse adjacency matvec (pure NumPy COO)

State the objects, shapes, and target question for Sparse adjacency matvec (pure NumPy COO). Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - Sparse adjacency matvec (pure NumPy COO)

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make Sparse adjacency matvec (pure NumPy COO) work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - Sparse adjacency matvec (pure NumPy COO)

Give the computational recipe for Sparse adjacency matvec (pure NumPy COO), note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: check dimensions before manipulating formulas.
  • Numerical note: prefer stable primitives (lstsq, QR/SVD, Cholesky for SPD) over explicit inverses.
  • Interpretation: relate algebraic steps to geometry (subspaces, projections) and to ML behavior (generalization, stability).
  • Sparse Matrix Construction from Graph Edges The code builds a sparse $5 \times 5$ matrix representing a path graph with 4 edges: $(0,1), (1,2), (2,3), (3,4)$. The function coo_from_edges(n, edges, weights=1.0) from scripts/sparse_helpers constructs a COO-format sparse matrix where each edge becomes a nonzero entry with value 1.0. In COO format, the matrix is stored as three arrays: row indices, column indices, and values—only the 4 nonzero entries are stored, not all $25$ entries of the full matrix. This is critical for large graphs: a social network with $n = 10^6$ nodes and average degree 100 has $\sim 10^8$ edges, but a dense representation would require $10^{12}$ entries (terabytes of memory). The sparse representation uses $O(|E|)$ memory where $|E|$ is the number of edges, while dense uses $O(n^2)$—the difference between feasibility and impossibility for web-scale graphs.
  • Matrix-Vector Product via Sparse Operations The input vector x = np.ones(n) is multiplied by the adjacency matrix in two ways. First, y1 = coo_matvec(A, x) performs the product using the COO representation directly: it iterates over the stored edges, accumulating y1[row] += value * x[col] for each nonzero. This takes $O(|E|)$ time—proportional to the number of edges, not $n^2$. For the path graph with 4 edges, this is dramatically faster than scanning all 25 entries of a dense matrix. Second, y2 = coo_to_dense(A) @ x converts the sparse matrix to a dense $5 \times 5$ NumPy array, then uses standard matrix-vector multiplication. This takes $O(n^2)$ storage and time, defeating the purpose of sparsity but serving as a correctness check: the outputs y1 and y2 should be identical.
  • Verifying Sparsity Structure The final print statements reveal the structure. For the path graph adjacency, y_coo and y_dense will both equal [1., 2., 2., 2., 1.]—each interior node (1, 2, 3) has degree 2 (connected to two neighbors), while endpoints (0, 4) have degree 1. The nnz(A) call returns the number of nonzeros: 4 for the undirected edges, or 8 if the graph is stored as symmetric with both $(i,j)$ and $(j,i)$ entries. This count is the key sparsity diagnostic: if nnz(A) << n^2, sparse operations are vastly more efficient; if nnz(A) ~ n^2, the matrix is effectively dense and sparse storage overhead provides no benefit.
  • ML Connection Sparse matrices appear throughout machine learning: graph neural networks represent connectivity via sparse adjacency matrices; text processing uses sparse term-document matrices where most word-document pairs are zero; recommender systems store user-item interactions as sparse matrices (most users rate few items); feature engineering produces sparse one-hot or indicator features. Understanding COO format—and its cousins CSR (Compressed Sparse Row) and CSC (Compressed Sparse Column)—enables scaling algorithms to millions of dimensions. The principle is simple: only store and compute on nonzero entries. When debugging sparse operations, always check nnz(A) to confirm sparsity, verify that sparse and dense results match (as this code does), and monitor memory usage—converting to dense for “just one operation” can exhaust RAM instantly for large-scale problems. The scripts/sparse_helpers module encapsulates these patterns, providing reusable primitives for graph algorithms, attention masks, and kernel matrices that exploit sparsity structure.
History and Applications

Origins of sparse methods: Sparse matrix techniques emerged in the 1960s–70s from scientific computing needs—finite element analysis and circuit simulation produced huge systems where most entries were zero. Early work by George, Wilkinson, and Duff formalized efficient storage schemes (COO, CSR, CSC) and direct solvers that exploit sparsity patterns. The breakthrough was recognizing that structure is computable information: knowing where zeros are lets you skip operations entirely.

Graph theory and adjacency matrices: Graph algorithms (shortest paths, PageRank, community detection) naturally operate on sparse adjacency matrices. The adjacency matrix of a graph with $n$ nodes and $|E|$ edges has density $|E|/n^2$—for scale-free networks, this is often $< 0.01$. Sparse formats enable analyzing million-node graphs on commodity hardware; dense would require supercomputers.

ML applications: Sparse linear algebra underpins modern ML infrastructure. Graph neural networks (GNNs) propagate node features via sparse adjacency: h^{(l+1)} = \sigma(A h^{(l)} W^{(l)}) where $A$ is sparse. Natural language processing uses sparse TF-IDF matrices for document retrieval and sparse attention masks in transformers to enforce causal or local patterns. Recommender systems factorize sparse user-item matrices (Netflix problem: $480{,}000 \times 17{,}000$ with $< 1\%$ density) via alternating least squares or gradient descent. Feature engineering produces sparse one-hot encodings (categorical variables) and indicator matrices (presence/absence signals). Sparse solvers (Conjugate Gradient, preconditioned GMRES) handle large-scale optimization when Hessians or gradient-related matrices are sparse. The principle scales: exploit structure wherever it exists—sparsity in adjacency, low rank in embeddings, symmetry in kernels—to make intractable problems tractable.

Connection to Broader Examples
  • Graph neural networks: Sparse adjacency enables message passing on million-node graphs; dense would exhaust memory.
  • Attention masks (Ch. 16): Transformers use sparse masks to restrict attention patterns (local, causal, etc.).
  • Text/NLP: TF-IDF matrices are sparse (most word-document pairs are zero); sparse solvers enable scaling.
  • Recommender systems: User-item interaction matrices are sparse; collaborative filtering exploits this structure.
  • Iterative methods (Ch. 13-14): Conjugate gradient, GMRES, and preconditioned solvers use sparse matvec as the core primitive.
  • Conditioning: Sparse matrices can still be ill-conditioned; sparsity saves memory/time but doesn’t fix sensitivity.

Comments