Example number
95
Example slug
example_95_sparse_matrices_adjacency_coo_onnz_matvec
Background

Sparse matrix methods originated in scientific computing (1960s–1970s) to solve discretized partial differential equations (PDEs) and large linear systems. Early work by George (1973), Tinney and Walker (1967), and Reid (1971) on sparse Gaussian elimination recognized that explicitly storing $n \times n$ matrices was wasteful for problems with only $O(n)$ or $O(n \log n)$ nonzeros. The coordinate (COO) format emerged as a simple storage scheme: list (row, column, value) for each nonzero. Later, compressed sparse row (CSR) and compressed sparse column (CSC) formats enabled faster algorithms. By the 1980s, libraries like SPARSKIT and SuperLU implemented sparse LU decomposition and iterative solvers. In modern ML (2000s+), sparsity is everywhere: text (word-document matrices are 99% zero), graphs (social networks, knowledge graphs), recommender systems (user-item matrices have billions of entries but < 0.1% observed), and large-scale optimization (millions of features but few per sample). TensorFlow, PyTorch, and SciPy all provide sparse matrix support. The key insight: exploit sparsity or scale fails. Dense algorithms on sparse billion-node graphs are infeasible; sparse algorithms make it routine.

Purpose

Understand how exploiting sparsity transforms linear algebra from “impossible” to “practical.” Learn the coordinate format (COO) representation: store only nonzeros (row, column, value triples), enabling $O(\text{nnz})$ matrix-vector product instead of $O(n^2)$. For sparse graphs (nnz $\sim n$ or $n\log n$ edges), this is orders of magnitude faster. Discover why sparsity is endemic to ML: text is sparse (most words absent from most documents), graphs are sparse (few edges per node), recommender systems are sparse (users rate few items), optimization problems are sparse (few nonzero gradients). Learn to recognize sparse structure in real problems and exploit it via specialized data structures and algorithms. Understand the memory-compute tradeoff: COO saves memory (store $O(\text{nnz})$ vs $O(n^2)$) but requires indirect memory access; dense is faster if data fits and sparsity is low. Master both when applicable.

Problem

Build a 6-node cycle adjacency as COO, do sparse matvec, verify vs dense, and report nnz.

Solution (Math)

COO stores only nonzeros, enabling $O(nnz)$ matvec. Adjacency of a sparse graph has nnz proportional to edges, not $n^2$.

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.sparse_helpers import coo_from_edges, coo_matvec, coo_to_dense, nnz

n = 6
edges = [(i, (i+1)%n) for i in range(n)]
A = coo_from_edges(n, edges, weights=1.0)

x = np.ones(n)
y_sparse = coo_matvec(A, x)
y_dense = coo_to_dense(A) @ x

print("y_sparse:", y_sparse)
print("y_dense :", y_dense)
print("nnz:", nnz(A), "dense entries:", n*n)
Code Introduction

This section demonstrates sparse matrix storage (COO format) and sparse matrix-vector product ($y = Ax$ in $O(\text{nnz})$ time). We construct a sparse adjacency matrix for a 6-node cycle graph and compare sparse vs dense computation.

Coordinate Format (COO):

The coordinate format stores only nonzeros as (row, column, value) triples: \[ A = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 & 1 & 0 \end{pmatrix} \quad \text{(adjacency of 6-cycle)} \]

In COO format:

row = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
col = [1, 5, 0, 2, 1, 3, 2, 4, 3, 5, 4, 0]
val = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Memory savings: Dense storage: $6 \times 6 = 36$ entries. COO: $3 \times 12 = 36$ values (for 12 nonzeros). For sparse matrices, nnz $\ll n^2$, so savings are dramatic (e.g., billion-node graph: dense is $10^{18}$ bytes, COO is $10^{10}$ bytes).

Sparse Matrix-Vector Product:

To compute $y = Ax$ where $A$ is sparse (COO format):

y = np.zeros(n)
for k in range(nnz):
    y[row[k]] += val[k] * x[col[k]]

Time complexity: $O(\text{nnz})$. No nested loop; only iterate over nonzeros.

Speedup: Dense matvec is $O(n \times d)$. For 6-node graph with nnz = 12: - Dense: $6 \times 6 = 36$ multiplications - Sparse: $12$ multiplications - Speedup: $36 / 12 = 3\times$ (modest for small graph, but dramatic for billion-node graphs: $\approx n / \text{avg degree}$)

Example Walkthrough:

import numpy as np
from scripts.sparse_helpers import coo_from_edges, coo_matvec, coo_to_dense, nnz

# 1. Construct cycle graph: edges (0→1), (1→2), ..., (5→0)
n = 6
edges = [(i, (i+1) % n) for i in range(n)]  # cycle edges
A = coo_from_edges(n, edges, weights=1.0)

# 2. Create dense vector x = [1, 1, 1, 1, 1, 1]
x = np.ones(n)

# 3. Sparse matvec: y = Ax via COO
y_sparse = coo_matvec(A, x)

# 4. Dense matvec for comparison
y_dense = coo_to_dense(A) @ x

# 5. Print results and sparsity stats
print("y_sparse:", y_sparse)           # Output: [2., 2., 2., 2., 2., 2.]
print("y_dense :", y_dense)            # Output: [2., 2., 2., 2., 2., 2.]
print("nnz:", nnz(A), "dense entries:", n*n)  # Output: nnz: 12, dense entries: 36

Output Interpretation:

  • y_sparse = y_dense = [2, 2, 2, 2, 2, 2]: Both methods agree. Each node in the cycle has exactly 2 neighbors, so $Ax = [2, 2, 2, 2, 2, 2]$ (sum of neighbors, all ones).
  • nnz = 12, dense entries = 36: Sparsity ratio = $12 / 36 = 1/3$ (33% sparse; for comparison, text corpora are 99%+ sparse).

Shapes and Broadcasting:

  • $A \in \mathbb{R}^{6 \times 6}$ (square adjacency matrix)
  • $x \in \mathbb{R}^6$ (dense vector)
  • $y = Ax \in \mathbb{R}^6$ (dense output; even if $A$ is sparse, output is typically dense)
  • COO representation: 3 arrays of length 12 (nnz)

Fundamental Principle:

Exploit sparsity: if nnz $/$ $(n \times d) < 0.1$, use sparse algorithms. Recognition of sparsity → tailored storage (COO, CSR) → $O(\text{nnz})$ complexity → billion-scale suddenly feasible. Structure exploitation is the universal principle of scaling linear algebra to modern data.

Key Takeaway:

Dense linear algebra is $O(n^2)$ or $O(n^3)$; modern data is sparse ($\text{nnz} \ll n^2$). Sparse algorithms reduce complexity by factor $n^2 / \text{nnz}$ (e.g., 1000× for 99% sparse data). This is the difference between “impossible” and “routine.”

Numerical Implementation Details
  • COO format storage: Store three arrays: row (row indices), col (column indices), val (values). All length nnz. Example: $A = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 3 & 0 \end{pmatrix}$ → row=[0,0,1], col=[0,2,1], val=[1,2,3].
  • COO matvec $y = Ax$: For each $k = 1, \ldots, \text{nnz}$: y[row[k]] += val[k] * x[col[k]]. Time: $O(\text{nnz})$. No inner loop over $d$.
  • Compressed Sparse Row (CSR): Partition nonzeros by row; store row pointers (length $n+1$) plus column indices and values. Faster than COO for SpMV; worse for adding/removing entries.
  • Sparse vs dense tradeoff: Sparse matvec is $O(\text{nnz})$ but has indirect memory access (cache misses). Dense matvec is $O(n^2)$ but has good cache locality. Breakeven: roughly when $\text{nnz} / n^2 \approx 0.1$. For sparser matrices, sparse wins dramatically.
  • Construction cost: Building COO from edge list (as in Example 95) is $O(m)$ where $m$ is number of edges. Building CSR requires sorting; cost $O(\text{nnz} \log \text{nnz})$.
  • Sparsity detection and exploitation: Given matrix $A$, compute nnz and sparsity ratio $\text{nnz}/(n \times d)$. If ratio $< 0.01$, use sparse algorithms; if $< 0.1$, consider sparse; if $> 0.1$, likely dense algorithms are faster.
  • Sparse matrix products: $C = AB$ where $A, B$ sparse is nontrivial; nnz of $C$ can be dense (product is often denser than inputs). Structured algorithms (graph multiplication, Schur complements) are necessary for efficiency.
  • Memory-time tradeoff in storage: COO: $O(\text{nnz})$ memory, fast iteration but slow search. CSR: $O(\text{nnz} + n)$ memory, slower iteration but efficient row access. Choose based on access pattern.
What This Example Demonstrates
  • Coordinate (COO) format: Store matrix as list of (row, column, value) tuples. One triplet per nonzero. Memory: $O(\text{nnz})$. Simple but not cache-efficient.
  • Sparse matrix-vector product: $y = Ax$ in $O(\text{nnz})$ time by iterating over nonzero entries, not all $n^2$ entries. Speedup: $n^2 / \text{nnz}$ (dramatic for sparse matrices).
  • Sparsity patterns in graphs: Graph adjacency matrices are sparse: $n$ nodes, $m$ edges $\Rightarrow$ nnz $= O(m)$. For sparse graphs, $m \ll n^2$ (e.g., $m \sim n$ or $m \sim n \log n$).
  • Memory savings: Storing $n \times n$ dense matrix requires $O(n^2)$ space. COO stores $O(\text{nnz})$ (often 1-100× smaller). For billion-node graphs, difference between infeasible and feasible.
  • Complexity hierarchy: Dense: $O(n^2)$ matvec. Sparse: $O(\text{nnz})$. Structured (diagonal, tridiagonal): $O(n)$. Choose algorithm and storage to match problem structure.
  • Cache efficiency and indirect access: COO has poor cache locality (random-access pattern). Compressed formats (CSR, CSC) improve cache efficiency via index hierarchies.
  • Recognition in ML: Text data (term-document matrix), graphs (adjacency), recommenders (rating matrix), large optimization (feature matrix). Most real high-dimensional data is sparse.
  • Algorithm selection: If data is sparse and nnz/n² $< 0.1$, use sparse algorithms. If $< 0.01$, must use sparse algorithms. If data is dense despite appearing “large,” use dense algorithms (they’re faster and simpler).
Notes

Part 1: Sparse Matrix Formats and COO

A sparse matrix $A \in \mathbb{R}^{n \times d}$ has most entries zero. The sparsity ratio is: \[ \text{sparsity ratio} = \frac{\text{nnz}(A)}{n \times d} \] where $\text{nnz}(A)$ is the number of nonzeros. If this ratio is small (e.g., $< 0.1$), the matrix is sparse and should be stored specially.

Coordinate (COO) format: Store three arrays of length nnz: - row: row indices $i$ of nonzeros - col: column indices $j$ of nonzeros - val: values $A_{ij}$

Example: $A = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 3 & 0 \\ 4 & 0 & 5 \end{pmatrix}$ has nnz = 5:

row = [0, 0, 1, 2, 2]
col = [0, 2, 1, 0, 2]
val = [1, 2, 3, 4, 5]

Memory: $3 \times \text{nnz}$ integers/floats (vs $n \times d$ for dense).

Part 2: Sparse Matrix-Vector Product

The sparse matvec $y = Ax$ is computed as:

y = zeros(n)
for k in range(nnz):
    y[row[k]] += val[k] * x[col[k]]

Time complexity: $O(\text{nnz})$. No nested loop over all $d$ entries.

Speedup over dense: Dense matvec is $O(n \times d)$. Sparse matvec is $O(\text{nnz})$. Speedup: $\frac{n \times d}{\text{nnz}}$. For sparse graphs with $\text{nnz} \approx n$ (constant average degree), speedup is $\sim d$ (dramatic).

Cache efficiency: COO has random-access pattern (row and column indices jump around), causing cache misses. Compressed formats (CSR, CSC) improve locality via index hierarchies, but are more complex to maintain.

Part 3: Sparse Matrices in Real Data

Text (term-document matrix): $n$ documents, $d$ terms. $A_{ij}$ = count of term $j$ in document $i$. Typical sparsity: 99%+ (most terms absent from most documents). nnz $\approx 10^7$–$10^9$ (billions of documents/terms but only millions of observed term-doc pairs).

Graphs (adjacency matrix): $n$ nodes. $A_{ij} = 1$ if edge $(i, j)$ exists, 0 otherwise. For sparse graphs, nnz $\approx n$ or $n \log n$ (constant or logarithmic average degree). Social networks, knowledge graphs, protein interaction networks all sparse. nnz $\ll n^2$.

Recommender systems (rating matrix): $n$ users, $d$ items. $A_{ij}$ = rating (or 0 if unobserved). Sparsity: 99.9%+ (users rate tiny fraction of items). nnz $\approx 10^7$–$10^8$ (billions of user-item pairs observed out of trillions possible).

Feature matrices in ML: High-dimensional text features (bag-of-words, TF-IDF), categorical one-hot encoding of large dictionaries, signal processing (wavelets have few significant coefficients). Sparsity: 95%–99.9%.

Why This Matters for ML

  • Scale: Dense algorithms on billion-dimensional sparse data are infeasible (memory, time). Sparse algorithms make it routine.
  • Efficiency: For sparse data, sparse algorithms are 10–1000× faster than dense (depending on sparsity ratio).
  • Memory: Storing dense billion-node graph adjacency: $10^{18}$ bytes (exabyte). COO: $10^{10}$ bytes (10 GB, 100 million edges). Difference between “impossible” and “laptop-feasible.”
  • Iterative methods: Conjugate gradient, GMRES, and other Krylov subspace methods require only matvecs; sparse matvec enables scalable solving of $Ax = b$ without forming explicit inverse.
  • Graph algorithms: PageRank, community detection, message passing on graphs all reduce to sparse matvecs. Billions-of-node graphs are solvable via repeated sparse matvec.
  • Recommendation: Collaborative filtering (matrix factorization of sparse rating matrix) is fundamental; sparse algorithms are mandatory.

ML Examples and Patterns

  1. Text classification: Feature matrix is TF-IDF or one-hot bag-of-words (99%+ sparse). Ridge regression, SVM, logistic regression all solved via sparse least squares (LSQR). Without sparsity exploitation, infeasible on million-word vocabulary.
  2. Graph neural networks (GNNs): Message passing computes sparse matvecs with adjacency matrix and learned weight matrix. Scales to billion-node graphs because each layer is $O(\text{nnz})$.
  3. Recommender systems: Collaborative filtering factorizes sparse user-item matrix. Alternating least squares (ALS) solves per-user and per-item least squares subproblems, each on sparse matrices. Sparse algorithms are essential.
  4. NLP and embeddings: Word embeddings on sparse co-occurrence matrices (positive PMI, CCA). Sparse linear algebra computes SVD or other factorizations efficiently.
  5. Anomaly detection in networks: Sparse graph matrices from transaction/communication networks. Spectral methods (eigenvalues of adjacency) detect community structure; sparse eigensolvers (Lanczos) compute top eigenvectors in $O(\text{nnz} \cdot k \log k)$ for $k$ eigenvectors.

Connection to Linear Algebra Theory

  • Matrix-vector product: $y = Ax$ is $O(n \times d)$ dense; $O(\text{nnz})$ sparse. Fundamental operation; the algebra is unchanged, only the implementation.
  • Sparse matrix algebra: $(A + B)_{\text{nnz}} \leq \text{nnz}(A) + \text{nnz}(B)$ (sum of sparse matrices is sparse). $(AB)_{\text{nnz}}$ can be dense (product fills in). Example: if $A$ is tridiagonal, $A^2$ has entries beyond main/super/subdiagonals (“fill”).
  • Sparse factorizations: LU, QR, Cholesky of sparse matrices can be sparse (with reordering to minimize fill). Sparse Cholesky of discretized Laplacian preserves sparsity; cost $O(n^{3/2})$ (not $O(n^3)$).
  • Graph Laplacian: $L = D - A$ (degree matrix minus adjacency) is sparse (same sparsity as $A$). Eigenvalues/eigenvectors of $L$ determine graph connectivity; sparse eigensolvers compute them in $O(\text{nnz} \log n)$ per iteration.
  • Iterative methods: CG, GMRES, LSQR all require repeated matvecs, never explicitly forming matrix inverse. Sparse matvec enables scalability; algorithm complexity is iterations $\times$ matvec cost $=$ iterations $\times O(\text{nnz})$.

Numerical and Implementation Notes

  • Always check sparsity ratio before deciding: dense vs sparse. Ratio $< 0.01$ strongly suggests sparse. Ratio $> 0.1$ suggests dense despite high dimension.
  • COO format is simple to construct and iterate. For many matvecs, convert to CSR (compressed sparse row) for better cache efficiency.
  • Be aware of “fill” during factorization: sparse input can produce dense Cholesky. Reordering (nested dissection, approximate minimum degree) reduces fill.
  • Sparse eigensolvers (Lanczos, Arnoldi) compute top $k$ eigenpairs in $O(k \times \text{iterations} \times \text{matvec cost})$ without ever forming matrix explicitly.
  • For graph problems, exploit known structure (tree, planar, bounded treewidth); specialized algorithms can be superfast.

Numerical and Shape Notes

  • Shapes: $A \in \mathbb{R}^{n \times d}$ with nnz $\ll n \times d$ (sparse). $x \in \mathbb{R}^d$ (dense vector). $y = Ax \in \mathbb{R}^n$ (dense output, even if $A$ sparse).
  • Sparsity ratio: $\rho = \text{nnz} / (n \times d)$. Complexity scales with $1/\rho$: sparse algorithms faster by factor $1/\rho$ vs dense.
  • COO space: $3 \times \text{nnz}$ (three values per nonzero: two indices, one value).
  • CSR space: $2 \times \text{nnz} + n + 1$ (column indices, values, plus row pointers).

Pedagogical Significance

Sparsity is the bridge between theory and billion-scale practice in ML. Dense linear algebra assumes $O(n^2)$ or $O(n^3)$ complexity, manageable for $n \sim 10^4$. But modern data has $n, d \sim 10^6$ to $10^9$ (text, graphs). Dense algorithms fail completely. Sparse algorithms exploit structure: recognize sparsity → use tailored storage → get $O(\text{nnz})$ complexity → suddenly billion-scale is routine. This teaches structure exploitation as a universal principle: every problem has structure; recognize and exploit it or fail at scale. Students learn not to memorize algorithms but to think about problem structure first, then choose algorithms accordingly.

History and Applications

History: Sparse linear algebra developed in the 1960s–1980s alongside large-scale simulations. Early work on sparse elimination (Tinney–Walker, George, Reid) led to COO/CSR/CSC formats and efficient factorizations. Mature libraries (SPARSKIT, SuiteSparse, SuperLU) and later GPU-aware kernels standardized sparse computation.

Applications: Sparsity is ubiquitous in ML: text (bag-of-words), graphs (adjacency/Laplacian), recommenders (user–item), and large-scale optimization (Jacobian/Hessian structure). Practically, sparse SpMV enables scalable training/inference, ALS for recommenders, graph analytics (PageRank, GNN message passing), and iterative solvers that rely solely on matvecs. Recognizing and exploiting sparsity is the difference between infeasible and production-ready systems.

Connection to Broader Examples
  • Least squares with sparse $X$ (Ex 92): When feature matrix $X$ is sparse (e.g., text features), solve $(X^T X) w = X^T y$ where $X^T X$ might become dense. Sparse least squares algorithms (LSQR, MINRES) avoid forming $X^T X$; instead, work with $X$ directly via matvecs.
  • Conditioning and sparse systems (Ex 94): Sparse matrices can be ill-conditioned (e.g., discretized PDE with boundary layers). Preconditioning (via sparse approximations or incomplete factorizations) improves conditioning without destroying sparsity.
  • SVD and sparse matrices (Ex 90): Computing full SVD of sparse matrix is generally infeasible (singular vectors may be dense). Randomized SVD or truncated SVD methods exploit sparsity, computing top $k$ singular values/vectors in $O(k \cdot \text{nnz})$.
  • PCA on sparse data (Ex 91): Sparse covariance matrix (e.g., from text) can be worked with via sparse algorithms. Computing top PCs without forming explicit covariance is key for scalability.
  • Power iteration on sparse matrices (Ex 88): Computing dominant eigenvector via power iteration requires only matvecs, so $O(\text{nnz})$ per iteration. Works well for sparse matrices; eigensolvers (Lanczos, Arnoldi) exploit this.
  • Cholesky for sparse SPD matrices (Ex 93): Sparse SPD systems (e.g., discretized Laplacian, covariance from sparse data) can be factored sparsely. Sparse Cholesky preserves structure and avoids fill (creating new nonzeros during factorization).
  • Orthogonality and projections (Ex 86): Least squares on sparse $X$: projection matrix $P = X(X^T X)^{-1} X^T$ is generally dense, but can be applied implicitly via sparse operations.
  • Rank and nullspace (Ex 87): Sparse matrices have null space structure that can be exploited algorithmically; rank computation avoids explicit dense operations.

Comments