Example number
88
Example slug
example_88_eigenvectors_in_ml_power_iteration_for_dominant_pca_direction
Background

The power method (or power iteration) was formalized in the 1930s by Hotelling and refined by subsequent mathematicians and numerical analysts. Early applications included computing eigenvalues for structural engineering and quantum mechanics. With the advent of computers (1950s–1960s), power iteration became a standard tool because it requires only matrix-vector multiplication—no linear system solves, no full matrix storage. The related QR algorithm (Francis, 1961) computes all eigenvalues simultaneously but is expensive for very large or sparse matrices. The Lanczos algorithm (Lanczos, 1950) and its variants (Arnoldi, 1951) generalize power iteration by building Krylov subspaces, enabling efficient approximation of multiple eigenvalues without computing the full decomposition. These iterative methods became essential in the 1980s–1990s with the rise of large-scale scientific computing. In modern ML (2010s+), power iteration and Lanczos are used for: PCA of high-dimensional data, spectral clustering of large graphs, computing the Hessian spectrum of neural networks, and training kernel methods. The implicit bias of stochastic gradient descent toward minimum-norm solutions relates to implicit regularization in the nullspace of the Jacobian—a connection involving eigenvalue analysis of the loss landscape.

Purpose

Master the power iteration algorithm, one of the most elegant and practically important numerical methods for finding dominant eigenvectors. Understand how simple repeated matrix multiplication—$v_{k+1} = \Sigma v_k / \|\Sigma v_k\|_2$—converges to the principal direction of maximum variance. Learn when power iteration beats full SVD (sparse matrices, distributed computing, online learning), how to extend it to multiple eigenvectors via deflation, and why it underlies PCA, spectral clustering, PageRank, and neural network analysis. Build intuition for iterative eigensolvers (Lanczos, Arnoldi), which generalize power iteration and solve large-scale eigenvalue problems that direct methods cannot. Recognize that the eigenvector computed via power iteration is identical to the dominant singular vector of centered data—connecting iterative and direct approaches.

Problem

Use power iteration on the toy covariance matrix to approximate the dominant eigenvector and compare with SVD PC1.

Solution (Math)

For symmetric PSD covariance $\Sigma$, power iteration $v_{t+1}=\Sigma v_t/\|\Sigma v_t\|$ converges to the top eigenvector under a spectral gap. For PCA, this eigenvector matches the first right singular vector of centered data $X_c$ (up to sign).

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points

rng = np.random.default_rng(1088)
X = toy_pca_points(n=120, seed=4)
Xc = X - X.mean(0)
Sigma = np.cov(Xc, rowvar=False, bias=False)

v = rng.normal(size=2); v /= np.linalg.norm(v)
for _ in range(50):
    v = Sigma @ v
    v /= np.linalg.norm(v)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
v_svd = Vt[0]; v_svd /= np.linalg.norm(v_svd)

print("abs cosine(power, svd):", abs(v @ v_svd))
Code Introduction

This code demonstrates the power iteration algorithm, one of the most elegant and practically important numerical methods for finding dominant eigenvectors. Starting with a random vector, the algorithm uses repeated multiplication by the covariance matrix—normalized at each step—to converge to the principal direction of maximum variance. The result shows that simple iteration achieves the same solution that SVD (Singular Value Decomposition) computes, illustrating the convergence of iterative and direct approaches.

The power iteration method: Given a symmetric positive semidefinite matrix $\Sigma \in \mathbb{R}^{d \times d}$, power iteration finds the dominant eigenvector via: \[ v_{k+1} = \frac{\Sigma v_k}{\|\Sigma v_k\|_2}, \quad v_0 \in \mathbb{R}^d \text{ (random unit vector)}. \]

After normalizing by $\|\Sigma v_k\|_2$ at each step, this sequence converges to the eigenvector $u_1$ corresponding to the largest eigenvalue $\lambda_1$. Without normalization, $v_k$ would grow as $\lambda_1^k$ (exponentially), causing numerical overflow; normalization discards this growth, preserving only the direction.

Why power iteration works: Decompose the random starting vector in the eigenbasis: \[ v_0 = c_1 u_1 + c_2 u_2 + \cdots + c_d u_d. \]

After $k$ matrix multiplications (before normalization): \[ \Sigma^k v_0 = c_1 \lambda_1^k u_1 + c_2 \lambda_2^k u_2 + \cdots + c_d \lambda_d^k u_d \approx c_1 \lambda_1^k u_1, \]

since $\lambda_1 > \lambda_i$ for $i > 1$, so all other terms vanish. Normalization each step keeps $\|v_k\| = 1$, tracking only the direction toward $u_1$.

The code setup:

X = toy_pca_points(n=120, seed=4)      # 120 examples in R^2
Xc = X - X.mean(0)                      # Center (subtract mean per feature)
Sigma = np.cov(Xc, rowvar=False, bias=False)  # 2×2 covariance matrix

The covariance matrix is computed from centered data: $\Sigma = \frac{1}{n-1} X_c^\top X_c$. For a 2D dataset, $\Sigma$ is a $2 \times 2$ symmetric positive semidefinite matrix whose eigenvectors point along principal component directions.

Power iteration loop:

v = rng.normal(size=2)                  # Random starting vector
v /= np.linalg.norm(v)                  # Normalize to unit length
for _ in range(50):
    v = Sigma @ v                       # Multiply by covariance matrix
    v /= np.linalg.norm(v)              # Normalize (prevent overflow)

Each iteration multiplies $v$ by $\Sigma$ and renormalizes. For a 2D covariance matrix with well-separated eigenvalues, convergence is rapid. The convergence rate is geometric: error decreases as $(\lambda_2 / \lambda_1)^k$. For typical PCA problems, 50 iterations is conservative; often 10–20 suffice.

SVD comparison:

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # Compute SVD of Xc
v_svd = Vt[0]                                        # First right singular vector

SVD decomposes the centered data matrix: $X_c = U \Sigma_{\text{SVD}} V^\top$. The right singular vectors (columns of $V$) are identical to the eigenvectors of the covariance matrix. This is because: \[ X_c^\top X_c = V \Sigma_{\text{SVD}}^2 V^\top = (n-1) \Sigma. \]

So the first right singular vector $V_{:,1}$ equals the dominant eigenvector of $\Sigma$.

Convergence check:

print("abs cosine(power, svd):", abs(v @ v_svd))

The dot product $v^\top v_{\text{SVD}}$ measures the cosine of the angle between the two vectors. Since both are unit vectors, this is exactly $\cos(\theta)$. A value of $\approx 1.0$ (accounting for sign ambiguity, hence abs) confirms that power iteration converged to the same direction as SVD.

Typical output:

abs cosine(power, svd): 0.9999999999

The result is $\approx 1.0$ (within numerical precision $\sim 10^{-14}$), proving that 50 iterations of power iteration found the dominant eigenvector to machine precision.

Why this matters for machine learning:

  1. Scalability: For sparse matrices or very large $d$, power iteration computes only the top-$k$ eigenvectors in $O(kd^2)$ time, avoiding the $O(d^3)$ cost of full eigendecomposition (QR algorithm).

  2. Distributed computing: Power iteration’s matrix-vector product $\Sigma v$ parallelizes naturally across machines. This is why power iteration dominates in large-scale ML (spectral clustering on billion-node graphs, PageRank on the web graph).

  3. Online/streaming PCA: Data arrives incrementally. Power iteration adapts: accumulate covariance updates, run power iteration in periodic batches. Direct methods (SVD) require offline data access.

  4. Spectral clustering: Finding eigenvectors of graph Laplacians for community detection uses power iteration (or inverse power iteration for small eigenvalues).

  5. Neural network curvature: Understanding optimization landscape requires eigenvalue analysis of the Hessian. Power iteration estimates top eigenvalues via Hessian-vector products (computed efficiently via autodiff).

  6. Kernel PCA: Power iteration implicitly works in the kernel-induced feature space without forming the kernel matrix explicitly (using kernel-vector products instead).

Numerical insights:

  • Normalization is critical: Without per-iteration normalization, $v_k$ grows as $\lambda_1^k$ (exponential). For $\lambda_1 = 2, k = 50$, this is $2^{50} \approx 10^{15}$ (overflow). Normalization prevents this by discarding magnitude.

  • Sign ambiguity: Eigenvectors are defined up to sign. Power iteration and SVD may return opposite signs; the code correctly uses abs(dot product) to handle this.

  • Eigenvalue extraction: Once the eigenvector $v$ is found, the eigenvalue is the Rayleigh quotient: $\lambda = v^\top \Sigma v$. For exact eigenvectors, this is exact; for approximate $v$, the error is $O(\|v - u\|^2)$ (quadratic in eigenvector error).

  • Deflation for multiple eigenvectors: After finding $(u_1, \lambda_1)$, construct $\Sigma' = \Sigma - \lambda_1 u_1 u_1^\top$. Power iteration on $\Sigma'$ finds $u_2$, orthogonal to $u_1$. Repeat for $u_3, \ldots, u_k$.

  • Convergence rate: Depends on $\lambda_2 / \lambda_1$ (eigenvalue ratio). For $\lambda_1 = 2, \lambda_2 = 1$, ratio $= 0.5$, so error $\sim 0.5^k$. After $k = 50$, error $\sim 10^{-16}$ (machine precision).

Shape discipline: - $X_c \in \mathbb{R}^{n \times d}$: centered data ($120 \times 2$) - $\Sigma \in \mathbb{R}^{d \times d}$: covariance matrix ($2 \times 2$, symmetric PSD) - $v \in \mathbb{R}^d$: eigenvector (shape $2$, starts random, converges to principal direction) - $V^\top \in \mathbb{R}^{n \times d}$ from thin SVD: right singular vectors ($2 \times 2$ after thin SVD of $120 \times 2$ matrix)

Fundamental principle: The eigenvector computed via iterative power iteration is identical to the dominant singular vector of centered data—illustrating that iterative and direct methods converge to the same answer. This duality connects PCA (via SVD), eigendecomposition (via power iteration), and principal component analysis across multiple computational paradigms.

Numerical Implementation Details
  • Power iteration complexity: Each iteration requires one matrix-vector multiply: $\Sigma v$ costs $O(d^2)$ for dense $\Sigma$ (or $O(\text{nnz}(\Sigma))$ for sparse). For $k$ iterations to convergence: $O(kd^2)$. For dense SVD: $O(nd^2)$. Power iteration wins for large sparse matrices or when only top-$k$ eigenvectors are needed.
  • Normalization is critical: Without normalization after each step, $\|v_k\| \sim \lambda_1^k$ (exponential growth). Normalization keeps $\|v\| = 1$, isolating the direction; the growth in norm is discarded.
  • Convergence criterion: Stop when $\|v_k - v_{k-1}\| < \epsilon$ or after a fixed number of iterations. For well-separated eigenvalues, 50 iterations is often sufficient for double-precision convergence.
  • Rayleigh quotient for eigenvalue: Once $v$ converges, $\lambda = v^\top \Sigma v$ estimates the eigenvalue. For exact eigenvector, $\lambda$ is exact; for approximate $v$, $\lambda$ is accurate to $O(\|v - u\|^2)$ (quadratic convergence).
  • Deflation implementation: After finding $u_1$, compute $\Sigma' = \Sigma - \lambda_1 u_1 u_1^\top$. Power iteration on $\Sigma'$ finds $u_2$ (which is orthogonal to $u_1$). Repeat for more eigenvectors.
  • Full SVD for comparison: np.linalg.svd(Xc, full_matrices=False) computes thin SVD (only $n$ left singular vectors, avoiding unnecessary computation). Right singular vectors $V$ from SVD equal covariance eigenvectors; singular values $\sigma_i$ relate to eigenvalues via $\lambda_i = \sigma_i^2 / (n-1)$.
  • Iterative refinement: Rayleigh quotient iteration (Newton’s method for eigenvalues) converges cubically (much faster than power iteration’s linear convergence). More advanced: Lanczos/Arnoldi build Krylov subspaces, approximating multiple eigenvalues simultaneously.
  • Condition number and stability: Ill-conditioned $\Sigma$ (nearly repeated eigenvalues, large $\kappa$) makes power iteration slow. Preconditioning (finding a matrix $M$ such that $M^{-1}\Sigma$ has better conditioning) accelerates convergence.
What This Example Demonstrates
  • Convergence via repeated amplification: Matrix multiplication by $\Sigma$ repeatedly amplifies the dominant eigenvector direction and suppresses others. After normalization, the direction of $v_k$ converges to $u_1$.
  • Eigenvalue extraction from eigenvector: Once an eigenvector is found, the corresponding eigenvalue is the Rayleigh quotient: $\lambda = v^\top \Sigma v$ (for unit $v$).
  • Covariance eigendecomposition = SVD of centered data: The eigenvector of the covariance matrix $\Sigma = \frac{1}{n-1}X_c^\top X_c$ equals the first right singular vector of $X_c$. Same direction, different computational paths (iterative vs. direct).
  • Sign ambiguity: Eigenvectors are defined up to sign: if $u$ is an eigenvector, so is $-u$. Power iteration and SVD may return opposite signs; comparison requires abs(dot product).
  • Convergence rate and spectral gap: Convergence speed depends on $\lambda_2 / \lambda_1$ (ratio of largest to second-largest eigenvalues). Larger gap $\implies$ faster convergence (geometric rate $\sim (\lambda_2/\lambda_1)^k$).
  • Deflation for multiple eigenvectors: After finding $u_1$ with eigenvalue $\lambda_1$, subtract $\lambda_1 u_1 u_1^\top$ from $\Sigma$ to remove that direction. Power iteration on the deflated matrix finds $u_2$.
  • Iterative methods for large-scale problems: For matrices with $d \sim 10^6$ or sparse structure, power iteration and Lanczos compute top-$k$ eigenvectors in $O(kd^2)$ time without storing full matrices.
Notes

Part 1: The Power Iteration Algorithm and Its Mechanics

Algorithm: Given a symmetric positive semidefinite matrix $\Sigma \in \mathbb{R}^{d \times d}$, power iteration finds the dominant eigenvector via: \[ v_{k+1} = \frac{\Sigma v_k}{\|\Sigma v_k\|_2}, \quad v_0 \in \mathbb{R}^d \text{ (random unit vector)}. \]

Why it works: Decompose $v_0$ in the eigenbasis: $v_0 = \sum_{i=1}^d c_i u_i$ (where $u_i$ is the $i$-th eigenvector, $\lambda_i$ the eigenvalue). Then: \[ \Sigma^k v_0 = \sum_{i=1}^d c_i \lambda_i^k u_i = c_1 \lambda_1^k \left( u_1 + \sum_{i=2}^d \frac{c_i}{c_1} \left(\frac{\lambda_i}{\lambda_1}\right)^k u_i \right). \]

Since $\lambda_1 > |\lambda_i|$ for $i > 1$, the bracketed term $\to u_1$ as $k \to \infty$. Normalization at each step prevents $\|\Sigma^k v_0\|$ from exploding/vanishing, keeping only the direction $u_1$.

Convergence rate: Geometric with factor $r = \lambda_2 / \lambda_1$ (ratio of largest to second-largest eigenvalue magnitude). Error decreases as $r^k$. Example: $r = 0.5 \implies$ error $\sim 10^{-16}$ after $k = 50$ iterations. Larger spectral gap (more separated eigenvalues) $\implies$ faster convergence.

Normalization prevents overflow: Without per-iteration normalization, $v_k \propto \Sigma^k v_0$ grows as $\|\Sigma^k v_0\| \sim \lambda_1^k$ (exponential). For $k = 50, \lambda_1 = 2$, this would be $2^{50} \approx 10^{15}$ (overflow). Dividing by $\|\Sigma v_k\|$ each step discards magnitude, preserving direction.

Part 2: Eigenvalue Extraction and the Rayleigh Quotient

Once the eigenvector $v \approx u_1$ is found, the corresponding eigenvalue is estimated via the Rayleigh quotient: \[ \lambda = \frac{v^\top \Sigma v}{v^\top v} = v^\top \Sigma v \quad (\text{for unit } v). \]

Accuracy: If $v$ is an exact eigenvector, $\lambda$ is the exact eigenvalue. If $v$ is approximate (error $\epsilon = \|v - u_1\|$), then $\lambda$ has error $O(\epsilon^2)$ (quadratic convergence in $\epsilon$). This is one reason Rayleigh quotient iteration (Newton’s method for eigenvalues) converges so fast: one iteration reduces error quadratically.

Deflation for multiple eigenvectors: After finding $(u_1, \lambda_1)$, construct a deflated matrix: \[ \Sigma' = \Sigma - \lambda_1 u_1 u_1^\top. \]

This removes the rank-1 contribution of the first eigenvector. The new matrix $\Sigma'$ has eigenvalues $0, \lambda_2, \lambda_3, \ldots, \lambda_d$ (second-largest eigenvalue of $\Sigma$ is now largest of $\Sigma'$). Power iteration on $\Sigma'$ finds $u_2$. Repeat to find $u_3, \ldots, u_k$.

Orthogonality guarantee: By construction, $u_1^\top u_2 = 0$ (and similarly for higher eigenvectors). This is automatic from the deflation procedure, ensuring orthonormal basis.

Part 3: Connection Between Covariance Eigendecomposition and SVD

For centered data $X_c \in \mathbb{R}^{n \times d}$, the covariance matrix is: \[ \Sigma = \frac{1}{n-1} X_c^\top X_c. \]

The SVD of $X_c$ is: $X_c = U \Sigma_{\text{SVD}} V^\top$ (where $\Sigma_{\text{SVD}}$ contains singular values). Then: \[ X_c^\top X_c = V \Sigma_{\text{SVD}}^2 V^\top = (n-1) \Sigma \quad \text{(by definition of covariance)}. \]

So: eigenvectors of $\Sigma$ are exactly the right singular vectors of $X_c$ (columns of $V$). The eigenvalues of $\Sigma$ relate to singular values via: \[ \lambda_i = \frac{\sigma_i^2}{n-1}. \]

Power iteration on $\Sigma$ (direct covariance eigendecomposition) finds $u_1$; SVD of $X_c$ also produces $V_{:,1}$ as the first principal component. Same answer, different algorithms: SVD is direct (one-shot, $O(nd^2)$ cost), power iteration is iterative (converges gradually, $O(kd^2)$ for $k$ iterations to desired precision).

Why This Matters for ML

1. PCA without forming covariance: For high-dimensional data ($d = 10^6$), explicitly computing $\Sigma = X_c^\top X_c$ requires $O(d^2)$ memory and $O(nd^2)$ time. Power iteration avoids this: compute $\Sigma v$ via $(X_c^\top (X_c v))$ (two sparse/tall matrix multiplications), saving memory and time for sparse data.

2. Online/streaming PCA: Data arrives incrementally. Computing covariance offline is infeasible. Power iteration can be adapted: accumulate covariance updates, run power iteration in periodic batches. Iterative methods adapt naturally to streaming.

3. Spectral clustering and graph analysis: Clustering via eigenvectors of the graph Laplacian $L = D - A$ (degree matrix minus adjacency). For large graphs ($d = $ nodes $\sim$ millions), full SVD is prohibitive. Power iteration (or inverse power iteration for small eigenvalues) scales to billion-node graphs.

4. PageRank and random walks: PageRank computes the stationary distribution of a random walk on the web graph. This is the dominant eigenvector of the transition matrix. Power iteration (applied to the web matrix) scales to billions of pages.

5. Neural network Hessian analysis: Understanding neural network loss landscape requires eigenvalue/eigenvector analysis of the Hessian (second derivatives). For networks with $d = 10^6$ parameters, explicit Hessian is intractable. Power iteration estimates top Hessian eigenvalues (curvature) via Hessian-vector products (forward-backward pass in autodiff).

6. Implicit bias and generalization: Modern deep learning theory studies the implicit bias of SGD—why networks generalize despite interpolation. This relates to the eigenspace structure of the loss landscape and how SGD navigates the null space vs. row space of the Jacobian (connection to Example 87).

ML Examples and Patterns

Example 1: PCA for dimensionality reduction Compute top-$k$ principal components via power iteration: \[ \text{PC}_i = u_i, \quad \text{projection: } X_c^{\text{proj}} = X_c V_k V_k^\top. \] For $n = 10^6$ images, $d = 256 \times 256 = 65536$ pixels, power iteration finds 10 PCs in minutes (vs. SVD in hours).

Example 2: Spectral clustering on social networks Build affinity matrix $A$ (similarity between users), form Laplacian $L = D - A$. Compute smallest $k$ eigenvectors via inverse power iteration. Cluster via K-means on eigenvector coordinates. Used by large social networks (Facebook, LinkedIn) for community detection.

Example 3: Natural gradient descent in optimization Gradient descent uses $w_{t+1} = w_t - \eta \nabla L(w_t)$. Natural gradient preconitions by the inverse of the Fisher information matrix: $w_{t+1} = w_t - \eta F^{-1} \nabla L(w_t)$. Computing $F^{-1}$ requires eigendecomposition; power iteration estimates top-$k$ Fisher eigenvectors, enabling approximate natural gradient.

Example 4: Kernel PCA In kernel-induced feature space (implicit RKHS), eigenvectors of the kernel matrix $K_{ij} = \kappa(x_i, x_j)$ define principal directions. For $n = 10^4$ examples and RBF kernel, $K$ is dense $10^4 \times 10^4$ matrix. Power iteration (or Lanczos) computes top-$k$ eigenvectors without forming $K$ explicitly.

Example 5: Random matrix theory in deep learning The empirical spectral distribution (histogram of eigenvalues) of the weight matrix at initialization affects training. Power iteration reveals the largest eigenvalue (spectral radius); recent work (Roberts et al., 2022) connects this to generalization bounds via implicit bias.

Connection to Linear Algebra Theory

Spectral theorem: For symmetric $\Sigma$: \[ \Sigma = \sum_{i=1}^d \lambda_i u_i u_i^\top. \] Truncating to $k$ terms gives the best rank-$k$ approximation (Eckart-Young theorem). Power iteration finds the leading term sequentially.

Gershgorin’s theorem: Eigenvalues lie in circles $|z - \Sigma_{ii}| \le \sum_{j \neq i} |\Sigma_{ij}|$ (on complex plane). For covariance (real, symmetric), eigenvalues are real and bounded by these discs.

Perron-Frobenius theorem: For non-negative $\Sigma$, the dominant eigenvalue is positive, and the corresponding eigenvector has all non-negative entries. Applies to transition matrices (PageRank) and adjacency matrices (spectral clustering).

Krylov subspaces: Power iteration implicitly builds the Krylov subspace $\mathcal{K}_k = \text{span}\{v_0, \Sigma v_0, \Sigma^2 v_0, \ldots, \Sigma^{k-1} v_0\}$. Lanczos and Arnoldi build orthonormal bases for these subspaces, approximating eigenvalues via small dense matrices (Rayleigh-Ritz approximation).

Chebyshev acceleration: Convergence of power iteration can be accelerated via Chebyshev polynomials, a classical technique for eigenvalue problems and iterative solvers.

Numerical and Implementation Notes

Eigenvalue bug (common mistake): Computing eigenvalues via np.linalg.eig(Sigma) requires $O(d^3)$ time (QR algorithm); for $d = 10^4$, this is $10^{12}$ flops (minutes). Power iteration reaches desired precision in $O(kd^2)$ time (seconds), where $k \approx 50$. For sparse $\Sigma$, savings are even greater.

Gram matrix conditioning: $\Sigma = X_c^\top X_c$ has condition number $\kappa(\Sigma) = \kappa(X_c)^2$ (squares the condition number). For ill-conditioned $X_c$ (near-collinear features), power iteration on $\Sigma$ converges slowly. Solution: compute SVD of $X_c$ directly (better conditioned) instead.

Preconditioning: If $\kappa(\Sigma)$ is large, apply a preconditioner $M$ (e.g., diagonal of $\Sigma$) and run power iteration on $M^{-1} \Sigma$ instead. This can dramatically improve convergence.

Distributed computing: Power iteration on $\Sigma v$ can be parallelized: each machine computes a block of rows of $\Sigma v$, then communication aggregates results. This is why power iteration is preferred in distributed ML frameworks.

Block power iteration: Compute top-$k$ eigenvectors simultaneously by maintaining a $d \times k$ matrix $V_k$ and iterating $V_{k+1} = \Sigma V_k / \|\Sigma V_k\|_F$ (Frobenius norm). Converges to top-$k$ subspace faster than sequential deflation.

Numerical and Shape Notes

Shape discipline: - $X_c \in \mathbb{R}^{n \times d}$: centered data ($120 \times 2$) - $\Sigma \in \mathbb{R}^{d \times d}$: covariance matrix ($2 \times 2$, symmetric PSD) - $v \in \mathbb{R}^d$: eigenvector (random start, converges to $u_1$, shape $2$) - $v_{\text{SVD}} \in \mathbb{R}^d$: first right singular vector of $X_c$ (shape $2$) - $Vt \in \mathbb{R}^{d \times d}$ (for thin SVD with $n > d$, returns $\mathbb{R}^{n \times d}$; here $120 \times 2$)

Gotchas:

  1. Thin vs. full SVD: np.linalg.svd(Xc, full_matrices=False) returns at most $\min(n, d) = 2$ rows of $Vt$ (for $120 \times 2$ matrix). full_matrices=True returns all $d = 2$ rows. For this example, they’re the same; for $n < d$ (Example 87), full_matrices=False truncates nullspace vectors.

  2. Sign ambiguity: Eigenvectors are defined up to sign. v and Vt[0] may be negatives of each other. The code correctly uses abs(v @ v_svd) to handle this.

  3. Normalization redundancy: v_svd /= np.linalg.norm(v_svd) is redundant (SVD eigenvectors already unit), but makes intent explicit.

  4. Convergence patience: For 2D covariance with well-separated eigenvalues (typical synthetic data), 50 iterations is excessive (usually converges in 5-10). For real high-dimensional data with small spectral gaps, 50 is reasonable.

Pedagogical Significance

This example distills the power iteration algorithm—one of the most elegant algorithms in numerical linear algebra. Despite its simplicity (10 lines of code), power iteration: - Reveals the structure of eigendecomposition (how matrix multiplication amplifies dominant directions) - Connects iterative to direct methods (power iteration vs. SVD) - Scales to real-world problems (billion-node graphs, millions of features) - Generalizes to advanced methods (Lanczos, Arnoldi, Krylov subspaces)

Why this is hard to teach: - Convergence is “magical”—a random vector somehow becomes the principal direction - The connection to PCA (via SVD of centered data) is non-obvious - Sign ambiguity and numerical precision trade-offs are subtle

What makes this example powerful: 1. Extreme simplicity: 10 lines of code, easily extended to 50 lines for full PCA 2. Surprising result: A random vector converges to the unique principal direction (up to sign) in 50 iterations 3. Practical relevance: Power iteration powers real systems (PageRank, spectral clustering, neural network analysis) 4. Conceptual importance: Shows how iteration (repeated matrix multiplication) solves eigenvalue problems—a pattern appearing throughout ML

Mastery landmarks: - Understand why power iteration converges (dominant eigenvalue amplifies fastest) - Recognize when power iteration beats SVD (sparse matrices, distributed computing, top-$k$ only) - Implement deflation to find multiple eigenvectors - Connect power iteration on covariance to SVD of centered data (same answer) - Extend to Lanczos/Arnoldi for more eigenvectors simultaneously

Extensions students can explore: 1. Inverse power iteration: Find smallest eigenvalues (used in spectral clustering for Laplacian) 2. Rayleigh quotient iteration: Cubic convergence vs. linear (much faster!) 3. Block power iteration: Compute top-$k$ simultaneously (faster for multiple eigenvectors) 4. Lanczos algorithm: Build Krylov subspace, approximate all eigenvalues with small matrix 5. Application to PageRank: Power iteration on normalized adjacency matrix converges to importance scores

History and Applications

Early power method (1920s–1930s): Harold Hotelling developed the power method in the 1920s–1930s for computing eigenvalues of correlation matrices in statistics. The method is simple: multiply a random vector repeatedly by the matrix, normalize each time, and watch it converge to the dominant eigenvector. Before computers, this was slow but conceptually elegant. John von Neumann and others refined the theory, proving convergence under mild spectral gap conditions.

Numerical algorithms and QR (1960s–1970s): The QR algorithm (Francis, 1961) computes all eigenvalues simultaneously by repeatedly applying QR decomposition, converging to upper triangular form. This is now the standard algorithm for full eigendecomposition. However, QR costs $O(d^3)$ time and $O(d^2)$ memory—prohibitive for large sparse matrices. Power iteration, requiring only matrix-vector products, remained essential for large-scale problems.

Lanczos and Krylov subspaces (1950s–1970s): Cornelius Lanczos (1950) developed the Lanczos algorithm, which builds an orthonormal basis for the Krylov subspace $\mathcal{K}_k = \text{span}\{v_0, \Sigma v_0, \Sigma^2 v_0, \ldots, \Sigma^{k-1} v_0\}$ using three-term recurrence relations. This approximates multiple eigenvalues via projection onto a small tridiagonal matrix, enabling computation of top-$k$ eigenvectors in $O(kd^2)$ time. Arnoldi (1951) generalized Lanczos to non-symmetric matrices. These became the workhorses of scientific computing (fluid dynamics, structural mechanics, quantum chemistry).

Preconditioned iterative methods (1980s–1990s): The rise of parallel computing and large sparse systems motivated development of preconditioned iterative methods for eigenvalue problems. Preconditioning transforms the problem to accelerate convergence without changing the solution. ARPACK (1996) implemented production-quality Lanczos/Arnoldi with sophisticated deflation and restart strategies, becoming industry standard for large eigenvalue problems.

Modern applications in machine learning (2010s+): Power iteration and Lanczos saw resurgence in ML:

  1. Spectral clustering: Computing eigenvectors of graph Laplacians for community detection in social networks, biological networks, and recommendation systems. For billion-node graphs, power iteration and Lanczos are the only feasible methods.

  2. PCA at scale: For high-dimensional data ($d \sim 10^6$), power iteration computes top-$k$ principal components orders of magnitude faster than full SVD. Used extensively in genomics (gene expression analysis), NLP (word embeddings), and image analysis.

  3. PageRank and search: Google’s PageRank algorithm (1998) uses power iteration on the web graph transition matrix to compute importance scores for billions of web pages. The algorithm converges in 20-30 iterations, fundamental to search ranking.

  4. Neural network Hessian analysis: Understanding neural network loss landscapes requires computing largest eigenvalues of the Hessian (second derivative matrix). Direct eigendecomposition is infeasible for networks with millions of parameters. Power iteration via Hessian-vector products (computed efficiently via autodiff) estimates top eigenvalues, revealing optimization landscape structure (Ghorbani et al., 2019; Yao et al., 2020).

  5. Implicit bias and generalization: Recent deep learning theory (Bartlett et al., 2020; Chizat & Bach, 2020) studies how SGD implicitly biases toward minimum-norm solutions. This analysis involves eigendecomposition of the Jacobian and Fisher information matrix, computed via power iteration for large networks.

  6. Kernel methods and scalability: Kernel PCA, kernel ridge regression, and support vector machines operate on the kernel matrix $K_{ij} = \kappa(x_i, x_j)$ ($n \times n$). For $n = 10^6$, forming $K$ is infeasible. Power iteration works with kernel-vector products $K v$, enabling scalable kernel methods.

  7. Graph neural networks: Computing spectral properties of adjacency/Laplacian matrices for graph convolutions uses power iteration internally. Graph neural networks for node classification, link prediction, and clustering rely on efficient eigenvalue computation.

  8. Adversarial robustness: Adversarial examples (inputs causing misclassification) often follow directions orthogonal to dominant gradient directions (related to nullspace geometry, Example 87). Understanding this requires spectral analysis of the loss landscape via power iteration.

  9. Natural gradient descent: Optimization via natural gradient (preconditioning by Fisher information) requires eigendecomposition. Practical approximations use power iteration on low-rank Fisher estimates.

Theoretical advances:

  • Accelerated power iteration: Chebyshev polynomials and other acceleration techniques speed convergence (Rutishauser, 1954 and later).
  • Distributed and randomized methods: Randomized SVD (Halko et al., 2011) and distributed power iteration enable computation on trillion-scale datasets.
  • Eigenvalue perturbation theory: Understanding how perturbations to the matrix affect eigenvalues/eigenvectors (Weyl, Davis-Kahan). Crucial for understanding robustness and numerical stability.

Current state (2020s):

Power iteration remains fundamental despite GPU acceleration of full eigendecomposition: - Sparse matrices: Power iteration + Lanczos scale where direct methods don’t. - Online/streaming: Eigenvectors computed incrementally as data arrives. - Distributed computing: Naturally parallelizable; matrix-vector product is the bottleneck, easily distributed. - Implicit settings: Kernel methods, neural network Hessians—no explicit matrix, only matrix-vector products available. - Integration with deep learning: PyTorch, JAX, TensorFlow incorporate efficient power iteration / Lanczos for Hessian computation, uncertainty quantification, and architecture design.

The simplicity and elegance of power iteration ensure its relevance for decades to come. Modern deep learning theory increasingly uses spectral analysis (eigenvalues/eigenvectors) to explain generalization, and power iteration is the practical tool for computing spectra at scale.

Connection to Broader Examples
  • PCA and variance (Ex 11): PCA finds principal directions by computing eigenvectors of covariance. Power iteration is a practical algorithm for computing them iteratively. PC1 = dominant eigenvector, explains maximum variance.
  • SVD and decomposition (Ex 10): SVD of centered data $X_c = U\Sigma V^\top$ gives eigenvectors as columns of $V$ (right singular vectors). Power iteration on $\Sigma = \frac{1}{n-1}X_c^\top X_c$ computes the same $V$, but iteratively.
  • Projections and reconstruction (Ex 6): Eigenvectors define projection directions. Projecting $X$ onto top-$k$ eigenvectors gives low-rank approximation: $\hat{X} = X V_k V_k^\top$.
  • Backpropagation and Hessian (Ex 84): The Hessian (second derivatives) of neural network loss has eigenvalues determining curvature. Power iteration estimates top eigenvalues, informing optimizer step sizes and identifying flat directions.
  • Rank and nullspace (Ex 87): Zero eigenvalues of the covariance correspond to nullspace directions (zero variance). Power iteration implicitly avoids these directions.
  • Orthogonality and projections (Ex 86): Eigenvectors are orthonormal (orthogonal unit vectors). Projecting onto eigenvector directions decomposes vectors into interpretable components.
  • Least squares (Ex 12): Solving normal equations $X^\top X \hat{w} = X^\top y$ involves eigendecomposition of the Gram matrix $X^\top X$. Power iteration can approximate solutions in very large dimensions.
  • Feature engineering and spectral methods: Spectral clustering uses eigenvectors of the graph Laplacian. Kernel PCA uses power iteration implicitly (in the implicit feature space defined by the kernel).

Comments