Example number
79
Example slug
example_79_sparse_matrices_adjacency_coo_onnz_matvec
Background

Sparse linear algebra emerged in the 1960s-70s for finite element methods (FEM) and partial differential equations (PDEs), where discretization yields matrices with $O(n)$ nonzeros (tridiagonal, banded, graph Laplacian structure). Early storage formats (coordinate lists, compressed row/column) enabled solving million-variable systems on kilobyte-scale memory. The field matured through LINPACK (1979), LAPACK (1992), and specialized libraries (UMFPACK for sparse LU, CHOLMOD for sparse Cholesky). Modern ML repurposed these primitives for entirely different domains: graph neural networks (billion-edge social networks with $\text{nnz} \sim 10^9$, dense storage $\sim 10^{18}$ infeasible), NLP (document-term matrices $99\%$ sparse, TF-IDF vocabulary $v \sim 10^5$), recommender systems (Netflix user-item matrix $< 1\%$ dense, $10^{12}$ entries but $10^{10}$ ratings), and deep learning embeddings (sparse gradients—only batch tokens updated per step, $< 0.1\%$ of vocabulary). The COO (coordinate) format demonstrated here is pedagogically simplest (store (row, col, value) triples) but computationally converted to CSR/CSC (compressed sparse row/column) for efficient arithmetic. Sparse methods transformed intractable $O(n^3)$ dense operations to scalable $O(\text{nnz})$ sparse ones, enabling modern web-scale ML.

Purpose

Demonstrate how sparse matrix storage (COO format) exploits structure to reduce memory from $O(n^2)$ to $O(\text{nnz})$ and computation from $O(n^2)$ to $O(\text{nnz})$ for matrix-vector products, where $\text{nnz}$ is the number of nonzero entries. For a cycle graph adjacency matrix with $n$ nodes, $\text{nnz} = n$ (one outgoing edge per node), yielding $n$-fold memory savings and $n$-fold speedup over dense operations. This transformation—from $O(n^2)$ infeasible to $O(n)$ tractable—is foundational for ML applications on graphs (GNNs, PageRank), text (TF-IDF, bag-of-words), and recommender systems (sparse user-item matrices). The example validates numerical equivalence between sparse and dense methods while quantifying the efficiency gains, motivating the universal adoption of sparse primitives (SciPy’s csr_matrix, PyTorch Sparse) in modern ML pipelines.

Problem

Build a 6-node cycle adjacency as COO, do sparse matvec, verify vs dense, and report nnz.

Solution (Math)

COO (Coordinate) format stores a sparse matrix $A \in \mathbb{R}^{n \times n}$ with $\text{nnz}(A)$ nonzero entries as three arrays:

\[ \text{rows} \in \mathbb{Z}^{\text{nnz}}, \quad \text{cols} \in \mathbb{Z}^{\text{nnz}}, \quad \text{data} \in \mathbb{R}^{\text{nnz}}, \]

where (rows[k], cols[k], data[k]) represents $A_{\text{rows}[k], \text{cols}[k]} = \text{data}[k]$ for $k = 0, \ldots, \text{nnz}-1$. All other entries are implicitly zero. Memory: $3 \times \text{nnz}$ storage vs. $n^2$ for dense.

Sparse matrix-vector product $y = Ax$ computes:

\[ y_i = \sum_{j=0}^{n-1} A_{ij} x_j = \sum_{k: \text{rows}[k]=i} \text{data}[k] \cdot x_{\text{cols}[k]}. \]

Algorithm (COO matvec):

initialize y = 0 ∈ ℝⁿ
for k = 0 to nnz-1:
    i = rows[k]
    j = cols[k]
    y[i] += data[k] * x[j]

Cost: $O(\text{nnz})$ operations (one multiply-add per stored nonzero).

Cycle graph adjacency: For a directed cycle of $n$ nodes with edges $(i, (i+1) \bmod n)$ for $i=0, \ldots, n-1$, the adjacency matrix is:

\[ A = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ 1 & 0 & 0 & \cdots & 0 \end{bmatrix} \in \mathbb{R}^{n \times n}. \]

This has $\text{nnz}(A) = n$ (one nonzero per row), so sparse storage is $3n$ vs. $n^2$ dense (an $n/3$ reduction). For $n=6$: $18$ vs. $36$ (2× savings). For $n=10^6$: $3 \times 10^6$ vs. $10^{12}$ (333,000× savings).

Standard notation: - $A \in \mathbb{R}^{n \times n}$: adjacency matrix (sparse) - $x \in \mathbb{R}^n$: input vector - $y = Ax \in \mathbb{R}^n$: output vector - $\text{nnz}(A)$: number of nonzeros (for cycle graph, $\text{nnz} = n$) - Transpose: $A^\top$ (not $A^T$)

Solution (Python)

import numpy as np
from scripts.sparse_helpers import coo_from_edges, coo_matvec, coo_to_dense, nnz

n = 6
edges = [(i, (i+1)%n) for i in range(n)]
A = coo_from_edges(n, edges, weights=1.0)

x = np.ones(n)
y_sparse = coo_matvec(A, x)
y_dense = coo_to_dense(A) @ x

print("y_sparse:", y_sparse)
print("y_dense :", y_dense)
print("nnz:", nnz(A), "dense entries:", n*n)
Code Introduction

This code demonstrates sparse matrix storage and computation by constructing a cycle graph adjacency matrix in COO (coordinate) format and comparing sparse vs. dense matrix-vector multiplication. The example shows how COO format stores only nonzero entries, achieving dramatic memory savings for sparse matrices while producing numerically identical results to dense operations.

Part 1: Building a Sparse Cycle Graph

The code constructs an adjacency matrix for a cycle graph with $n=6$ nodes, where each node connects to its successor (forming a ring):

edges = [(i, (i+1)%n) for i in range(n)]
# Result: [(0,1), (1,2), (2,3), (3,4), (4,5), (5,0)]

Each tuple (i, j) represents a directed edge from node $i$ to node $j$. The modulo operator (i+1)%n wraps the last node (5) back to the first (0), closing the cycle. This produces the adjacency matrix:

\[ A = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \in \mathbb{R}^{6 \times 6}. \]

The function coo_from_edges(n, edges, weights=1.0) converts the edge list into COO (coordinate) format, storing only the 6 nonzero entries as (row, col, value) tuples:

# COO representation (conceptual):
# rows = [0, 1, 2, 3, 4, 5]
# cols = [1, 2, 3, 4, 5, 0]
# data = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

This requires storing 3 arrays of length 6 (18 floats total) instead of a full $6 \times 6$ matrix (36 floats), a 50% memory savings. For larger sparse matrices (e.g., $n=10^6$ with $10^7$ edges), this becomes critical—storing $10^7$ entries instead of $10^{12}$ (a 100,000× reduction).

Part 2: Sparse vs. Dense Matrix-Vector Multiplication

The code multiplies $A$ by a vector $x = [1, 1, 1, 1, 1, 1]^\top$ using two methods:

Method 1 (Sparse): y_sparse = coo_matvec(A, x) - Iterates only over the 6 stored nonzero entries - For each (i, j, A_ij), accumulates y[i] += A_ij * x[j] - Cost: $O(\text{nnz}(A))$ operations ($\text{nnz} = 6$ here)

Method 2 (Dense): y_dense = coo_to_dense(A) @ x - Converts COO to full dense matrix (materializes all 36 entries) - Performs standard matrix-vector product: $y_i = \sum_{j=0}^{5} A_{ij} x_j$ - Cost: $O(n^2)$ operations (36 multiplications)

For the cycle graph, the result is:

\[ y = A x = \begin{bmatrix} 0 \cdot 1 + 1 \cdot 1 + 0 + 0 + 0 + 0 \\ 0 + 0 \cdot 1 + 1 \cdot 1 + 0 + 0 + 0 \\ 0 + 0 + 0 \cdot 1 + 1 \cdot 1 + 0 + 0 \\ 0 + 0 + 0 + 0 \cdot 1 + 1 \cdot 1 + 0 \\ 0 + 0 + 0 + 0 + 0 \cdot 1 + 1 \cdot 1 \\ 1 \cdot 1 + 0 + 0 + 0 + 0 + 0 \cdot 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}. \]

Typical output:

y_sparse: [1. 1. 1. 1. 1. 1.]
y_dense : [1. 1. 1. 1. 1. 1.]
nnz: 6 dense entries: 36

Both methods yield identical results (y_sparse == y_dense), confirming numerical equivalence. The sparse method uses 6× fewer operations (6 vs. 36), and this gap widens dramatically for larger sparse matrices.

Why This Matters for ML

Graph neural networks (GNNs): Graph convolutional layers propagate features along edges: \[ H^{(\ell+1)} = \sigma\left( \tilde{A} H^{(\ell)} W^{(\ell)} \right), \] where $\tilde{A}$ is the normalized adjacency matrix (sparse) and $H^{(\ell)} \in \mathbb{R}^{n \times d}$ are node embeddings. For social networks with millions of nodes and billions of edges, storing $\tilde{A}$ densely ($10^{12}$ floats, ~4TB) is infeasible. Sparse storage ($10^9$ floats, ~4GB) enables efficient message passing.

PageRank and web graphs: Google’s PageRank iteratively computes: \[ r_{t+1} = \alpha P^\top r_t + (1-\alpha) \mathbf{1}, \] where $P \in \mathbb{R}^{n \times n}$ is the transition matrix for $n \sim 10^{10}$ web pages. Dense storage would require $10^{20}$ floats (exabytes). Sparse storage exploits that each page links to only $\sim 10$ others (average degree), storing $\sim 10^{11}$ entries (hundreds of GB).

Collaborative filtering (recommender systems): The user-item interaction matrix $R \in \mathbb{R}^{m \times n}$ is highly sparse (most users rate $< 1\%$ of items). For Netflix ($m=10^8$ users, $n=10^4$ movies), dense storage is $10^{12}$ floats (~4TB). Sparse storage with $\sim 10^{10}$ ratings (~40GB) is tractable. Matrix factorization algorithms (ALS, SGD) operate directly on sparse $R$.

Text embeddings (TF-IDF, bag-of-words): Document-term matrices $X \in \mathbb{R}^{d \times v}$ ($d$ documents, $v$ vocabulary size) are $> 99\%$ sparse (most words absent from most documents). For $d=10^6$ documents and $v=10^5$ vocabulary, dense storage is $10^{11}$ floats (400GB). Sparse storage with $\sim 10^8$ nonzeros (400MB) enables efficient similarity search.

Sparse gradients in embeddings: In NLP models, embedding lookup layers produce sparse gradients—only embeddings for tokens in the current batch receive nonzero gradients. For a vocabulary of $v=10^5$ and batch size $b=32$, only $\sim 100$ embeddings are updated per step (0.1% sparsity). Sparse gradient updates avoid touching $99.9\%$ of the embedding matrix.

Linear solvers for FEM/PDEs: Finite element discretizations of PDEs yield sparse linear systems $Ax = b$ where $A \in \mathbb{R}^{n \times n}$ has $O(n)$ nonzeros (tridiagonal, banded, or graph Laplacian structure). Iterative solvers (conjugate gradient, multigrid) exploit sparsity to achieve $O(n)$ cost per iteration instead of $O(n^3)$ for dense LU factorization.

Numerical Implementation Details

Step-by-step execution of sparse cycle graph construction and matrix-vector multiplication:

  1. Define graph structure: n = 6 nodes, edges = [(i, (i+1)%n) for i in range(n)] creates the edge list [(0,1), (1,2), (2,3), (3,4), (4,5), (5,0)]. Each tuple (i, j) represents a directed edge from node $i$ to node $j$. The modulo operation (i+1)%n wraps the last node (5) back to the first (0), forming a cycle graph (ring topology).

  2. Build COO representation: A = coo_from_edges(n, edges, weights=1.0) converts the edge list to COO format, storing three arrays: rows = [0,1,2,3,4,5], cols = [1,2,3,4,5,0], data = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]. This represents the adjacency matrix $A \in \mathbb{R}^{6 \times 6}$ with $A_{ij} = 1$ if $(i,j)$ is an edge, 0 otherwise. Memory: $3 \times 6 = 18$ floats (COO) vs. $36$ floats (dense $6 \times 6$ matrix).

  3. Define input vector: x = np.ones(n) creates $x = [1, 1, 1, 1, 1, 1]^\top \in \mathbb{R}^6$ (all-ones vector). This will sum the neighbors of each node.

  4. Sparse matvec: y_sparse = coo_matvec(A, x) computes $y = Ax$ by iterating over the 6 stored nonzeros:

    • Initialize y = [0, 0, 0, 0, 0, 0]
    • For each (row, col, val) in COO: y[row] += val * x[col]
    • Iteration 1: (0, 1, 1.0) → y[0] += 1.0 * x[1] = 1.0
    • Iteration 2: (1, 2, 1.0) → y[1] += 1.0 * x[2] = 1.0
    • … (6 iterations total)
    • Result: y_sparse = [1, 1, 1, 1, 1, 1] (each node receives contribution from one neighbor)
  5. Dense matvec (verification): y_dense = coo_to_dense(A) @ x materializes the full $6 \times 6$ dense matrix and computes the standard matrix-vector product (36 multiplications, 30 of which are $0 \times x_j = 0$). Result: y_dense = [1, 1, 1, 1, 1, 1] (identical to sparse).

  6. Report statistics: nnz(A) = 6 (number of stored nonzeros), n*n = 36 (dense matrix size). Sparsity: $6/36 = 16.7\%$ nonzero (or $83.3\%$ sparse). Speedup: $36/6 = 6\times$ fewer operations.

Interpretation: For the cycle graph with uniform edge weights and all-ones input, $y = Ax$ computes the degree of each node (number of incoming edges). Since every node has exactly one incoming edge, $y = [1, 1, 1, 1, 1, 1]^\top$. For undirected graphs (symmetrize $A$), this would compute the sum of neighbors’ features.

What This Example Demonstrates
  • COO storage exploits sparsity: For a cycle graph adjacency matrix with $n=6$ nodes and $\text{nnz}=6$ edges, COO stores only the 6 nonzero (row, col, value) triples instead of all 36 matrix entries. Memory: $3 \times 6 = 18$ floats (COO) vs. $36$ floats (dense), a 2× savings. For larger sparse matrices ($\text{nnz} \ll n^2$), savings scale to $1000\times$ or more (e.g., web graph with $n=10^9$ nodes, average degree 10: $\text{nnz} \sim 10^{10}$ vs. dense $10^{18}$).

  • Sparse matvec is $O(\text{nnz})$ vs. $O(n^2)$ dense: The function coo_matvec(A, x) iterates only over the 6 stored nonzeros, performing 6 multiply-add operations. Dense matvec computes all $n^2 = 36$ products (including $30$ wasteful $0 \times x_j$ terms). For $n=6$, speedup is $36/6 = 6\times$. For real-world graphs ($n \sim 10^6$, $\text{nnz} \sim 10^7$), sparse is $\sim 100,000\times$ faster ($10^{12}$ ops dense vs. $10^7$ sparse).

  • Numerical equivalence: y_sparse and y_dense match exactly (within machine precision $\sim 10^{-16}$), confirming that sparse storage/computation is lossless—no approximation, just efficient exact arithmetic on nonzeros.

  • Graph structure as sparsity: Adjacency matrices for bounded-degree graphs (social networks, web graphs, molecular structures) have $\text{nnz} = O(n)$ (each node connects to $k \ll n$ neighbors). The cycle graph is the canonical minimal example: each node has degree 1, so $\text{nnz} = n$ (linear in graph size, not quadratic).

  • Format conversion for computation: COO is a storage format (easy to construct, append entries) but inefficient for repeated arithmetic. Production code converts to CSR (Compressed Sparse Row) or CSC (Compressed Sparse Column) for $O(1)$ row/column access: A_csr = A_coo.tocsr(). This example uses custom coo_matvec for pedagogy, but SciPy’s CSR is the standard.

  • ML pattern: structure → efficiency: Sparse methods generalize to all ML domains with inherent sparsity: text (bag-of-words, TF-IDF), graphs (GNNs, PageRank), recommenders (user-item matrices), embeddings (sparse gradients), attention masks (block-sparse Transformers). The pattern is universal: identify structure (zeros, low rank, Kronecker product), choose specialized format/algorithm (COO/CSR, SVD truncation, structured matrices), achieve tractability.

Notes

Part 1: Building a Sparse Cycle Graph

The edge list [(i, (i+1)%n) for i in range(n)] encodes a directed cycle: node 0 → 1 → 2 → 3 → 4 → 5 → 0. This is the simplest nontrivial sparse graph—each node has out-degree 1 and in-degree 1 (balanced). The adjacency matrix $A$ is a circulant matrix (each row is a cyclic shift of the previous row), with special structure: eigenvalues are $n$-th roots of unity $\lambda_k = e^{2\pi i k / n}$ for $k=0, \ldots, n-1$. For undirected cycles, symmetrize: A_sym = A + A.T (COO allows duplicate entries, which sum during conversion to CSR). The cycle graph is pedagogically minimal: $\text{nnz} = n$ (linear sparsity), diameter $n/2$ (maximal path length), and spectral gap $\lambda_1 - \lambda_0 = O(1/n^2)$ (slow mixing for random walks). Real-world analogs: ring topologies in distributed systems, circular dependencies in software, and polymer chains in chemistry.

Part 2: Sparse vs. Dense Matrix-Vector Multiplication

Sparse matvec (coo_matvec): Iterates over 6 stored nonzeros, performing 6 multiply-add operations. Algorithm: for (i, j, val) in zip(rows, cols, data): y[i] += val * x[j]. Cost: $O(\text{nnz})$ (6 ops). Memory access: sequential reads of rows, cols, data, random writes to y (cache-unfriendly but unavoidable for sparse). Dense matvec (@ operator): Computes all $n^2 = 36$ products $A_{ij} x_j$, including 30 zero products ($0 \times x_j = 0$, wasted computation). Cost: $O(n^2)$ (36 ops). Memory: $n^2 = 36$ floats stored. Speedup: $36/6 = 6\times$ for $n=6$. For real-world sparse matrices ($\text{nnz} \ll n^2$), speedup is $\sim n^2 / \text{nnz}$: e.g., web graph with $n=10^9$, average degree 10 yields $\text{nnz} \sim 10^{10}$, so speedup $\sim 10^{18} / 10^{10} = 10^8$ (hundred-million-fold). Numerical equivalence: Both methods compute identical results (within machine precision $\sim 10^{-16}$), confirming sparse storage is lossless—no approximation, just efficient exact arithmetic.

Part 3: Format Conversion and Computational Efficiency

COO is a storage format (easy to construct, append entries, convert from edge lists) but inefficient for repeated arithmetic. Each matvec scans all triples sequentially—no $O(1)$ access to row/column. CSR (Compressed Sparse Row) format stores indptr, indices, data arrays, enabling efficient row access: row $i$’s nonzeros are data[indptr[i]:indptr[i+1]] at columns indices[indptr[i]:indptr[i+1]]. Conversion: A_csr = A_coo.tocsr() (one-time $O(\text{nnz} \log \text{nnz})$ sort by row, then $O(\text{nnz})$ build). When to use each: (1) COO: constructing sparse matrices incrementally (FEM assembly, graph edge lists), allowing duplicates. (2) CSR: repeated row-wise operations (matvec, matrix-matrix multiply, left-multiply). (3) CSC: repeated column-wise operations (right-multiply, column access). (4) LIL (List of Lists): random access during construction (but convert to CSR/CSC for arithmetic). Production pattern: Build in COO/LIL, convert to CSR/CSC once, perform all arithmetic in CSR/CSC. SciPy default: csr_matrix for row-major, csc_matrix for column-major.

Why This Matters for ML

Graph neural networks (GNNs): Message passing layers compute $H^{(\ell+1)} = \sigma(\tilde{A} H^{(\ell)} W^{(\ell)})$, where $\tilde{A}$ is sparse normalized adjacency. For social networks ($n \sim 10^9$ users, $\text{nnz} \sim 10^{11}$ friendships), dense $\tilde{A}$ is $10^{18}$ entries (~4 exabytes)—infeasible. Sparse storage ($10^{11}$ entries, ~400 GB) enables billion-node GNNs (PyTorch Geometric, DGL). PageRank: Google’s web graph ($n \sim 10^{10}$ pages) has sparse transition matrix $P$ (average degree $\sim 10$). Sparse power iteration $r_{t+1} = \alpha P^\top r_t + (1-\alpha) \mathbf{1}$ converges in $\sim 50$ iterations at $O(\text{nnz})$ cost per iteration—dense would be $O(n^2) \sim 10^{20}$ ops (years of compute). Collaborative filtering: User-item matrices $R \in \mathbb{R}^{m \times n}$ are $< 1\%$ dense (Netflix: $10^8$ users, $10^4$ movies, $10^{10}$ ratings out of $10^{12}$ entries). Sparse storage enables matrix factorization $R \approx UV^\top$ via alternating least squares (ALS) or SGD on nonzeros only. Text embeddings: Document-term matrices $X \in \mathbb{R}^{d \times v}$ are $> 99\%$ sparse (most words absent from most documents). TF-IDF vectorization stores $\sim 10^{-4}$ of entries, enabling cosine similarity search on millions of documents. Attention masks: Block-sparse Transformers (Longformer, BigBird) use sparse attention patterns to reduce $O(L^2)$ to $O(L \log L)$ for sequence length $L$, enabling 16k-token contexts.

ML Examples and Patterns

GNN message passing (PyTorch Geometric):

from torch_geometric.nn import GCNConv
conv = GCNConv(in_channels=16, out_channels=32)
# edge_index: COO format (2, num_edges) tensor
# x: node features (num_nodes, in_channels)
x_new = conv(x, edge_index)  # Sparse neighbor aggregation

Sparse TF-IDF similarity (scikit-learn):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
vectorizer = TfidfVectorizer()
X_sparse = vectorizer.fit_transform(documents)  # CSR matrix
similarities = cosine_similarity(X_sparse)  # Sparse @ Sparse.T

Sparse linear regression (Lasso on bag-of-words):

from scipy.sparse import csr_matrix
from sklearn.linear_model import Lasso
X_sparse = csr_matrix(X)  # (n_samples, n_features) sparse
model = Lasso(alpha=0.1)
model.fit(X_sparse, y)  # Exploits sparsity in X

PageRank via sparse power iteration:

from scipy.sparse import csr_matrix
import numpy as np
def pagerank(A, alpha=0.85, max_iter=100):
    n = A.shape[0]
    out_deg = np.array(A.sum(axis=0)).flatten()
    out_deg[out_deg == 0] = 1
    P = csr_matrix((A.data / out_deg[A.col], (A.row, A.col)), shape=A.shape).T
    r = np.ones(n) / n
    for _ in range(max_iter):
        r = alpha * (P @ r) + (1 - alpha) / n
    return r

Sparse gradient update (embedding lookup):

# Only update embeddings for tokens in batch
batch_token_ids = [42, 1337, 99999]
gradients = np.random.randn(len(batch_token_ids), embed_dim)
for idx, grad in zip(batch_token_ids, gradients):
    embeddings[idx] -= lr * grad  # Sparse update (0.003% of vocab)

Connection to Linear Algebra Theory

Graph Laplacian and sparsity: For undirected graph with adjacency $A$ and degree matrix $D = \text{diag}(\sum_j A_{ij})$, the graph Laplacian is $L = D - A$. For sparse $A$ ($\text{nnz}(A) = O(n)$), $L$ is also sparse ($\text{nnz}(L) = \text{nnz}(A) + n$). The Laplacian’s eigenvalues encode graph connectivity: $\lambda_0 = 0$ (trivial), $\lambda_1 > 0$ (Fiedler value, algebraic connectivity), and $\lambda_1$ small indicates slow mixing (bottleneck structure). Sparse eigenvalue solvers (Lanczos, ARPACK) compute top/bottom $k$ eigenpairs in $O(k \cdot \text{nnz}(L))$ time per iteration.

Sparse linear systems: For sparse $A$ with $\text{nnz}(A) = O(n)$ (tridiagonal, banded, graph Laplacian), direct solvers (sparse LU, Cholesky) have complexity $O(n^{3/2})$ to $O(n^2)$ depending on fill-in (new nonzeros created during factorization). Iterative solvers (conjugate gradient, GMRES, multigrid) achieve $O(n)$ cost per iteration when $\text{nnz}(A) = O(n)$, with convergence in $O(\sqrt{\kappa})$ to $O(\log n)$ iterations (preconditioned). For large systems ($n \sim 10^6$), iterative methods dominate.

Spectral graph theory: The eigenvectors of $L$ (or normalized Laplacian $\mathcal{L} = D^{-1/2} L D^{-1/2}$) define a Fourier basis for graph signals. Low-frequency eigenvectors (small $\lambda_k$) vary smoothly across edges (graph smoothness), while high-frequency eigenvectors oscillate. Spectral clustering: Project nodes onto the first $k$ eigenvectors of $\mathcal{L}$, then cluster via k-means. This partitions the graph to minimize cut size.

Matrix-vector product as linear combination: $y = Ax$ can be viewed as a linear combination of $A$’s columns: $y = \sum_{j=0}^{n-1} x_j A_{:,j}$. For sparse $A$, only columns with $x_j \neq 0$ contribute. This interpretation underlies sparse gradient descent: only weights corresponding to nonzero features receive updates.

Numerical and Implementation Notes

Shape discipline: $A \in \mathbb{R}^{n \times n}$ (sparse adjacency, $n=6$), $x \in \mathbb{R}^n$ (input vector), $y \in \mathbb{R}^n$ (output), $\text{nnz}(A) = 6$ (scalar). COO arrays: rows, cols $\in \mathbb{Z}^{\text{nnz}}$, data $\in \mathbb{R}^{\text{nnz}}$. Check dimensions before operations: A.shape = (n, n), x.shape = (n,), y.shape = (n,).

Gotcha 1: COO allows duplicates. Multiple (i, j, val) entries sum during conversion to CSR/CSC. Useful for FEM assembly (accumulate element contributions) but can cause bugs if unintended. Use A_coo.sum_duplicates() to merge before arithmetic.

Gotcha 2: Random access is slow. A_csr[i, j] requires binary search in row $i$’s indices ($O(\log \text{nnz})$). For sparse matrix construction with random access, use LIL format during building, then convert: A_lil[i, j] = val ($O(1)$), A_csr = A_lil.tocsr().

Gotcha 3: Sparse-dense operations. A_sparse @ x_dense is efficient ($O(\text{nnz})$), but A_dense @ x_sparse doesn’t exploit sparsity in $x$. SciPy converts sparse to dense if one operand is dense—avoid by ensuring both are sparse: A_csr @ B_csr.

Gotcha 4: Memory overhead. COO stores 3 arrays (rows, cols, data), so overhead is $3 \times \text{nnz}$ floats. For very dense matrices ($\text{nnz} > 0.3 \times n^2$), dense storage can be more efficient. CSR/CSC have lower overhead (2 arrays: indptr, indices+data), more compact for moderately sparse matrices.

Gotcha 5: Sparse linear solves. scipy.sparse.linalg.spsolve(A_csr, b) uses sparse LU factorization. For SPD matrices, use cholesky from CHOLMOD: from sksparse.cholmod import cholesky; factor = cholesky(A_csr); x = factor(b). For iterative solves: cg(A_csr, b) (conjugate gradient, requires SPD), gmres(A_csr, b) (general).

Numerical and Shape Notes

Verification checks: (1) Sparsity ratio: nnz(A) / (n * n) = 6 / 36 = 0.167 (16.7% nonzero). (2) Numerical equivalence: np.allclose(y_sparse, y_dense) should be True (exact match within tolerance $10^{-9}$). (3) Memory savings: sys.getsizeof(A_coo) < sys.getsizeof(A_dense) confirms space reduction. (4) Matvec correctness: Residual np.linalg.norm(A_dense @ x - y_sparse) < 1e-14 verifies implementation.

Cost analysis: COO matvec: $O(\text{nnz})$ operations (6 multiply-adds). Dense matvec: $O(n^2)$ operations (36 multiply-adds). Memory: COO stores $3 \times \text{nnz} = 18$ floats, dense stores $n^2 = 36$ floats. Break-even sparsity: For $\text{nnz} / n^2 > 1/3$, dense storage is more efficient (less overhead). For $\text{nnz} / n^2 < 0.1$, sparse is typically $10\times$ faster and $10\times$ more memory-efficient.

Tolerance: Sparse arithmetic is exact (no approximation), so y_sparse and y_dense should match to machine precision ($\sim 10^{-16}$ relative error). Any discrepancy indicates implementation bug. For iterative sparse solvers (CG, GMRES), residual tolerance $\|Ax - b\| / \|b\| < 10^{-6}$ is typical (converges in $\sim 10$-100 iterations for well-conditioned systems).

Pedagogical Significance

This example is the foundational entry point to sparse linear algebra for ML:

Key takeaways: (1) Sparse storage exploits structure: Store only $\text{nnz}$ nonzeros instead of $n^2$ total entries. (2) Sparse operations scale linearly: Matvec costs $O(\text{nnz})$ vs. $O(n^2)$ dense. (3) Graph adjacency matrices are naturally sparse: Bounded-degree graphs have $\text{nnz} = O(n)$. (4) Format matters: COO for construction, CSR/CSC for computation. (5) Numerical equivalence: Sparse and dense methods produce identical results (no approximation).

Common misconceptions addressed: (a) “Sparse matrices are approximate”—No, they store exact nonzeros (lossless). (b) “Sparse operations are always faster”—Only for sufficient sparsity ($\text{nnz} < 0.1 \times n^2$); overhead dominates otherwise. (c) “COO is efficient for arithmetic”—No, COO is for storage; use CSR/CSC for repeated operations. (d) “Sparse linear solves are always iterative”—No, sparse direct solvers (sparse LU, Cholesky) exist but have fill-in overhead. (e) “Sparsity only matters for massive graphs”—Sparsity benefits appear even for moderate $n \sim 10^3$ (e.g., text with $10^5$ vocabulary).

Connection to other examples: Example 71 (sparse dot products), Example 74 (sparse SVD), Example 76 (sparse least squares), Example 77 (sparse Cholesky), Example 78 (sparse preconditioning). This example provides the implementation foundation for understanding how ML scales to billion-parameter models (sparse attention), billion-node graphs (GNNs), and billion-document corpora (TF-IDF).

Why pedagogically powerful: The cycle graph is the simplest nontrivial sparse example—$\text{nnz} = n$ (linear), easy to visualize (ring topology), and transparently demonstrates the core pattern: store/iterate only nonzeros. Students see that sparsity is structure, not approximation, and that the $O(\text{nnz})$ complexity is exact arithmetic, not approximation. The $6\times$ speedup for $n=6$ is modest, but the pattern scales to $10^8\times$ for real-world problems. This is the gateway to web-scale ML—without sparse methods, GNNs on social networks, PageRank on web graphs, and Transformers on long sequences would be computationally infeasible.

History and Applications

The development of sparse matrix methods was driven by the computational demands of large-scale scientific computing in the mid-20th century. Finite element methods (FEM) for structural engineering and finite difference methods for PDEs (heat equation, fluid dynamics) produced linear systems $Ax = b$ where $A \in \mathbb{R}^{n \times n}$ had only $O(n)$ nonzeros despite $n \sim 10^4$ unknowns. Dense storage ($n^2 \sim 10^8$ floats) exceeded the memory of 1960s computers (kilobytes to low megabytes), while sparse storage ($\sim 10^5$ floats) fit comfortably.

Early storage formats: The coordinate (COO) format emerged in the 1960s as a natural encoding of sparse graphs and FEM meshes—store triples (row, col, value) explicitly. Compressed Sparse Row (CSR) and Compressed Sparse Column (CSC) formats were developed in the 1970s for efficient row/column access during matrix-vector products and factorizations. Sparse direct solvers (sparse LU factorization) appeared in the 1980s, with algorithms to minimize fill-in (new nonzeros created during factorization). Key contributions: George and Liu (nested dissection ordering, 1981), Davis (multifrontal methods, 1990s), and Duff (MA27/MA57 sparse solvers, 1980s-90s). The UMFPACK (Davis, 2004) and CHOLMOD (Davis, 2006) libraries became standards for sparse LU and Cholesky factorizations, integrated into MATLAB and SciPy.

Iterative methods: Conjugate Gradient (CG) (Hestenes-Stiefel, 1952) and GMRES (Saad-Schultz, 1986) enabled solving large sparse systems without factorization, with cost $O(\text{nnz})$ per iteration. Multigrid methods (Brandt, 1977) achieved $O(n)$ total complexity for certain PDEs (Poisson, elliptic). Incomplete Cholesky and ILU (Incomplete LU) preconditioning (1970s-80s) improved convergence by approximating $A^{-1}$ with a sparse matrix. Krylov subspace methods (Lanczos, 1950; Arnoldi, 1951) enabled sparse eigenvalue computation—computing top $k$ eigenpairs in $O(k \cdot \text{nnz})$ time per iteration instead of $O(n^3)$ dense eigenvalue decomposition.

Graph algorithms and spectral methods: The connection between sparse matrices and graphs was formalized in the 1970s-80s. Graph Laplacians $L = D - A$ (where $A$ is sparse adjacency, $D$ is degree matrix) appeared in spectral graph theory (Fiedler, 1973). Spectral clustering (Shi-Malik, 2000) and Cheeger inequalities linked Laplacian eigenvalues to graph cuts. PageRank (Page-Brin, 1998) applied sparse power iteration to Google’s web graph ($n \sim 10^9$ pages), enabling search at web scale. The NetworkX library (2002) and later graph-tool (2006) provided sparse graph primitives for researchers.

Modern ML applications repurposed sparse methods for entirely new domains:

Text and NLP (1990s-2010s): Bag-of-words and TF-IDF vectorization produce document-term matrices $X \in \mathbb{R}^{d \times v}$ that are $> 99\%$ sparse. The SciPy sparse module (scipy.sparse, 2001) and scikit-learn (TfidfVectorizer, 2007) made sparse text embeddings standard. Latent Semantic Analysis (LSA) (Deerwester et al., 1990) applied sparse SVD (truncated SVD) to extract semantic topics from millions of documents. Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) trained on sparse co-occurrence matrices.

Recommender systems (2000s): Netflix Prize (2006-2009) popularized sparse matrix factorization $R \approx UV^\top$ for collaborative filtering, where $R \in \mathbb{R}^{m \times n}$ is the sparse user-item rating matrix ($< 1\%$ dense). Alternating Least Squares (ALS) and stochastic gradient descent (SGD) on sparse $R$ became industry standards (Spotify, YouTube, Amazon). Implicit feedback models (Hu-Koren-Volinsky, 2008) extended to binary sparse matrices (clicks, views, purchases).

Graph neural networks (2010s-2020s): Graph Convolutional Networks (GCNs) (Kipf-Welling, 2017) applied sparse adjacency matrices to neural networks, enabling message passing on graphs with billions of edges. PyTorch Geometric (2019) and DGL (Deep Graph Library) (2019) provided sparse primitives (SparseTensor, COO edge indices) for GNNs. GraphSAGE (Hamilton et al., 2017), GAT (Graph Attention Networks) (Veličković et al., 2018), and GIN (Graph Isomorphism Networks) (Xu et al., 2019) scaled to web-scale graphs (social networks, knowledge graphs, molecular databases).

Transformer sparsity (2019-2020s): Sparse Transformers (Child et al., 2019) introduced block-sparse attention patterns to reduce $O(L^2)$ complexity to $O(L \sqrt{L})$ for sequence length $L$. Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) used sparse attention masks (local + global + random) to enable 16k-token contexts. Linear attention mechanisms (Katharopoulos et al., 2020) replaced softmax attention with kernel approximations, achieving $O(L)$ complexity.

Sparse training and pruning (2015-2020s): Magnitude pruning (Han et al., 2015) zeroed small weights in neural networks, achieving $10\times$ sparsity with minimal accuracy loss. Lottery Ticket Hypothesis (Frankle-Carbin, 2019) showed that sparse subnetworks (found via pruning) train to full accuracy from initialization. Sparse neural networks (e.g., SparseGPT, Frantar-Alistarh, 2023) enable training billion-parameter models with $80-90\%$ sparsity, reducing memory and compute $5-10\times$.

The simple COO cycle graph in this example—storing 6 nonzeros instead of 36 entries—is the pedagogical foundation for all these modern applications. Without sparse methods, web search (PageRank), social networks (GNNs), recommendation engines (collaborative filtering), and large language models (sparse attention) would be computationally infeasible. Sparsity transformed ML from “impossible at scale” to “routine at web scale”.

Connection to Broader Examples
  • Example 71 (Inner products and norms): Sparse dot products (e.g., cosine similarity between sparse TF-IDF vectors) use the same $O(\text{nnz})$ pattern—only multiply nonzero terms. SciPy: A_csr @ B_csr.T computes pairwise similarities efficiently.

  • Example 74 (SVD and condition number): Sparse SVD (via Lanczos/Arnoldi) computes top $k$ singular vectors in $O(k \cdot \text{nnz})$ time per iteration, avoiding $O(n^3)$ dense SVD. Used in sparse PCA (text embeddings), spectral clustering (graph partitioning), and recommender systems (matrix factorization).

  • Example 76 (Least squares): Sparse linear regression (Lasso, elastic net) on high-dimensional sparse features (bag-of-words $X \in \mathbb{R}^{n \times d}$ with $\text{nnz}(X) \ll nd$) exploits sparse $X^\top X$ computation—X_csr.T @ X_csr is $O(\text{nnz}(X) \cdot d)$ instead of $O(nd^2)$ dense.

  • Example 77 (Cholesky reuse): Sparse Cholesky factorization for graph Laplacians $L = D - A$ (where $A$ is sparse adjacency) preserves sparsity with controlled fill-in. Used in Gaussian processes on graphs, network flow optimization, and finite element methods. Libraries: CHOLMOD (SuiteSparse), scipy.sparse.linalg.splu.

  • Example 78 (Conditioning): Sparse matrices can still be ill-conditioned (e.g., graph Laplacians with disconnected components have $\kappa(L) \to \infty$). Sparse preconditioning (incomplete Cholesky, Jacobi) reduces condition number to enable iterative solvers (CG, GMRES).

  • Example 70 (Projections): Graph Laplacian eigenvectors (computed via sparse eigenvalue solvers) define projections onto low-frequency graph signals. Spectral clustering projects nodes onto the first $k$ eigenvectors of the normalized Laplacian $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$ (all matrices sparse).

  • Chapter 11 (PCA): Sparse PCA via truncated SVD on sparse data matrices (document-term matrices, gene expression) recovers principal components in $O(k \cdot \text{nnz}(X))$ time. sklearn.decomposition.TruncatedSVD uses randomized algorithms (Halko et al., 2011) for efficiency.

  • Chapter 15 (Sparse methods): This example demonstrates the foundational COO format. Chapter 15 covers CSR/CSC (compressed storage), sparse factorizations (LU, Cholesky, QR), iterative solvers (CG, GMRES, multigrid), and graph algorithms (PageRank, shortest paths, spectral clustering).

Comments