Example number
24
Example slug
example_24_power_iteration_for_dominant_eigenvector_pca_direction
Background

For a symmetric matrix $A \in \mathbb{R}^{d\times d}$ with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$ and corresponding orthonormal eigenvectors $q_1, q_2, \ldots, q_d$, the eigendecomposition is $A = Q\Lambda Q^\top$. Power iteration is a classic iterative method: start with a random unit vector $v_0$, then repeatedly compute $v_{k+1} = Av_k / \|Av_k\|$. If $|\lambda_1| > |\lambda_2|$ (spectral gap exists), $v_k \to q_1$ exponentially fast at rate $|\lambda_2/\lambda_1|^k$.

In ML, covariance matrices $\Sigma = \frac{1}{n-1}X_c^\top X_c$ are symmetric positive semidefinite (PSD), so eigenvalues are real and non-negative. The eigenvector for the largest eigenvalue is the first principal component, pointing along the direction of maximum variance. SVD of centered data $X_c = U\Sigma V^\top$ directly gives $V$ as the eigenvectors of $X_c^\top X_c$, but power iteration is faster when you only need the top few directions. Modern extensions (Lanczos, randomized SVD) scale power iteration to millions of features, enabling PCA, spectral clustering, and eigenface-style dimensionality reduction in vision and NLP.

Purpose

Build a crisp, ML-first understanding of power iteration: a simple, scalable algorithm for extracting the dominant eigenvector of a symmetric matrix. When full eigendecomposition is too expensive ($O(d^3)$ cost), power iteration computes the top eigenvector in $O(d^2)$ per iteration—just repeated matrix-vector products. This is foundational for PCA on large datasets, spectral methods, and understanding optimization landscapes via Hessian analysis.

You will learn how power iteration converges exponentially fast when there’s a spectral gap, verify convergence by comparing to SVD-derived principal components, and see why the dominant eigenvector of the covariance matrix is the first principal component. The goal is to make you fluent in iterative eigenvalue methods and their ML applications: fast PCA, graph clustering, and curvature estimation in neural network training.

Problem

Run power iteration on covariance of toy data and compare to SVD PC1.

Solution (Math)

For covariance $\Sigma$, power iteration $v_{t+1}=\Sigma v_t/\|\Sigma v_t\|$ converges to the top eigenvector given a spectral gap. That eigenvector equals PCA direction (up to sign).

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points
rng = np.random.default_rng(1024)

X = toy_pca_points(n=100, seed=2)
Xc = X - X.mean(0)
Sigma = np.cov(Xc, rowvar=False, bias=False)

v = rng.normal(size=2); v /= np.linalg.norm(v)
for _ in range(40):
    v = Sigma@v
    v /= np.linalg.norm(v)

_,_,Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]; pc1 /= np.linalg.norm(pc1)

print("abs cosine:", abs(v@pc1))
Code Introduction

This code demonstrates power iteration, an iterative algorithm for finding the dominant eigenvector of a symmetric matrix, and verifies that it converges to the first principal component from SVD. The example illustrates how eigenvalue problems and PCA are intimately connected through the covariance matrix.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Load centered data: Xc = X - X.mean(0) shifts the origin to the data center, required for PCA.
  • Compute covariance: Sigma = np.cov(Xc, rowvar=False, bias=False) with rowvar=False treating columns as features and bias=False using $n-1$ denominator.
  • Initialize random unit vector: v = rng.normal(size=2); v /= np.linalg.norm(v) ensures $\|v_0\| = 1$.
  • Power iteration loop: for _ in range(40): v = Sigma @ v; v /= np.linalg.norm(v) applies $\Sigma$ and renormalizes 40 times.
  • Extract SVD reference: U, S, Vt = np.linalg.svd(Xc, full_matrices=False) gives principal components as rows of Vt; pc1 = Vt[0] is the first.
  • Verify alignment: abs(v @ pc1) computes absolute cosine similarity, accounting for sign ambiguity; expect $\approx 1.0$ after convergence.
  • Convergence diagnostics: track $\|\Sigma v - \lambda v\|$ or cosine with previous iterate; stop when change is below tolerance.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Power iteration mechanics: repeatedly apply $v \leftarrow \Sigma v / \|\Sigma v\|$ to converge to the dominant eigenvector.
  • Exponential convergence: with spectral gap $|\lambda_2/\lambda_1| < 1$, the error decays as $(\lambda_2/\lambda_1)^k$.
  • Connection to PCA: the top eigenvector of the covariance equals the first principal component (up to sign).
  • Verification strategy: compare power iteration result to SVD-based PC1 via cosine similarity.
  • Scalability advantage: $O(d^2)$ per iteration vs $O(d^3)$ for full eigendecomposition—critical for large $d$.
  • Shape and sign discipline: eigenvectors are defined up to sign; use absolute cosine to measure alignment.
Notes

Part 1: Data Preparation and Covariance Construction. The code loads 100 points from toy_pca_points(n=100, seed=2), producing $X \in \mathbb{R}^{100 \times 2}$ with inherent directional structure along one axis. Centering via Xc = X - X.mean(0) removes the mean, shifting the origin to the data’s center of mass. The sample covariance matrix is computed as $\Sigma = \frac{1}{n-1} X_c^\top X_c \in \mathbb{R}^{2 \times 2}$, using np.cov(Xc, rowvar=False, bias=False). The rowvar=False flag treats columns as variables (features) rather than rows, and bias=False uses the unbiased estimator with $n-1$ denominator. This covariance matrix $\Sigma$ is symmetric positive semidefinite (PSD) by construction, guaranteeing real non-negative eigenvalues and orthogonal eigenvectors. The eigendecomposition $\Sigma = Q \Lambda Q^\top$ reveals the principal axes: eigenvectors $Q$ point along directions of variance, and eigenvalues $\Lambda$ quantify variance magnitude along each axis.

Part 2: Power Iteration for Dominant Eigenvector. Power iteration is a simple iterative method to extract the dominant eigenvector (corresponding to the largest eigenvalue) of a matrix. Starting from a random unit vector v = rng.normal(size=2) normalized to $\|v\|_2 = 1$, the algorithm repeatedly applies the covariance matrix and renormalizes: $v_{k+1} = \frac{\Sigma v_k}{\|\Sigma v_k\|}$. The loop for _ in range(40): v = Sigma @ v; v /= np.linalg.norm(v) executes 40 iterations. Why does this converge? Expand the initial vector in the eigenbasis: $v_0 = c_1 q_1 + c_2 q_2$, where $q_1, q_2$ are eigenvectors with eigenvalues $\lambda_1 > \lambda_2$. After $k$ iterations, $\Sigma^k v_0 = c_1 \lambda_1^k q_1 + c_2 \lambda_2^k q_2 = \lambda_1^k \left( c_1 q_1 + c_2 \left(\frac{\lambda_2}{\lambda_1}\right)^k q_2 \right)$. Since $|\lambda_2 / \lambda_1| < 1$ (strict inequality when eigenvalues are distinct), the term $(\lambda_2/\lambda_1)^k \to 0$ exponentially fast. Renormalization removes the growing factor $\lambda_1^k$, leaving $v_k \to q_1$ (up to sign). The convergence rate depends on the spectral gap $|\lambda_2/\lambda_1|$: closer eigenvalues require more iterations.

Part 3: Verification via SVD Principal Component. To validate convergence, the code computes the first principal component using SVD. The decomposition U, S, Vt = np.linalg.svd(Xc, full_matrices=False) factorizes $X_c = U \Sigma V^\top$, where $V^\top$ (stored as Vt) contains right singular vectors—the principal directions. Extracting pc1 = Vt[0] gives the first principal component, a unit vector in $\mathbb{R}^2$. The connection between SVD and eigendecomposition is direct: for centered data, the covariance is $\Sigma = \frac{1}{n-1} X_c^\top X_c = \frac{1}{n-1} V S^2 V^\top$, so the right singular vectors $V$ are the eigenvectors of $\Sigma$, and eigenvalues are $\lambda_i = \sigma_i^2 / (n-1)$. The printed abs cosine: abs(v @ pc1) computes the absolute cosine similarity between the power iteration result and the SVD-derived principal component. A value near 1.0 confirms that both methods converged to the same direction. Shape discipline: $\Sigma \in \mathbb{R}^{2 \times 2}$, $v \in \mathbb{R}^2$, $V^\top \in \mathbb{R}^{2 \times 2}$ (from SVD with full_matrices=False). The absolute cosine accounts for sign ambiguity: eigenvectors are defined up to sign, so $v$ and $-v$ are equally valid.

Part 4: Why This Matters for ML. Power iteration reveals the computational trade-offs in eigenvalue problems. Full eigendecomposition (via np.linalg.eigh) costs $O(d^3)$ for a $d \times d$ matrix, which is prohibitive when $d$ is large (e.g., high-dimensional embeddings, covariance matrices with thousands of features). Power iteration costs $O(d^2)$ per iteration—just a matrix-vector product—and often converges in dozens of iterations, making it practical for extracting the top few eigenvectors. Modern extensions like Lanczos and randomized SVD build on this principle, computing low-rank approximations efficiently for large-scale PCA, spectral clustering, and neural network weight initialization. The spectral gap $|\lambda_2/\lambda_1|$ is also a conditioning diagnostic: small gaps slow convergence and make eigenvalue estimation noisy, connecting to conditioning themes (Chapter 14). In deep learning, the dominant eigenvector of the Hessian (curvature matrix) governs optimization landscape geometry—power iteration on mini-batches estimates this direction cheaply. Gotchas: if eigenvalues are equal (e.g., spherical covariance), power iteration can converge to any direction in the eigenspace—the choice depends on initialization. For complex or non-symmetric matrices, power iteration requires specialized variants (e.g., Arnoldi iteration). Always verify convergence by tracking the residual $\|\Sigma v - \lambda v\|$ or checking alignment with reference eigenvectors, as done here.

History and Applications

Power iteration dates to the 18th century (Euler) but became computationally significant in the mid-20th century with the rise of digital computers. Von Mises (1929) formalized the method for dominant eigenvalue extraction, and Hotelling (1933) applied similar ideas to PCA in statistics. Golub and Van Loan (1970s–1980s) established rigorous convergence theory and practical variants (inverse iteration, Rayleigh quotient iteration). The Lanczos algorithm (1950) extended power iteration to efficiently compute multiple eigenvalues, forming the basis of modern sparse eigensolvers.

In modern ML, power iteration underpins fast PCA implementations (randomized SVD, streaming PCA), spectral clustering (graph Laplacian eigenvectors), and PageRank (Google’s original ranking algorithm: dominant eigenvector of a modified adjacency matrix). In deep learning, estimating the top eigenvector of the Hessian reveals the direction of maximum curvature, guiding learning rate selection and diagnosing sharp vs flat minima. Variants like Arnoldi and Krylov subspace methods scale power iteration to millions of dimensions, enabling real-time dimensionality reduction in recommendation systems, NLP embeddings, and image compression. Understanding power iteration builds intuition for iterative algorithms more broadly: gradient descent, conjugate gradient, and ADMM all share the theme of repeated linear operations converging to optimal solutions.

Connection to Broader Examples
  • Chapter 8 (eigenvalues and eigenvectors): power iteration is a practical algorithm for the dominant eigenvalue problem.
  • Chapter 11 (PCA): the first principal component is the top eigenvector of the covariance; power iteration is a fast alternative to full SVD.
  • Chapter 10 (SVD): SVD directly computes all principal components; power iteration is cheaper when you only need the top few.
  • Chapter 14 (conditioning): the spectral gap $|\lambda_2/\lambda_1|$ governs convergence rate; small gaps slow iteration and signal near-degenerate covariance.
  • Chapter 9 (positive definite): covariance matrices are PSD by construction; power iteration exploits symmetry and non-negative eigenvalues.
  • Modern ML: dominant eigenvector analysis appears in spectral clustering (graph Laplacians), neural network curvature (Hessian top eigenvector), and PageRank (Markov chain stationary distribution).

Comments