Example number
72
Example slug
example_72_eigenvectors_in_ml_power_iteration_for_dominant_pca_direction
Background

Power iteration is one of the oldest and simplest eigenvalue algorithms. In ML, eigenvectors and singular vectors underpin PCA, spectral clustering, graph embeddings, and PageRank. Computing only the top direction is often enough for dimensionality reduction and variance explanation. In large-scale settings, the covariance is never formed explicitly; instead, matrix–vector products (on-the-fly or via data passes) make power iteration attractive. Connections to randomized SVD and streaming/online PCA illustrate modern scalable variants.

Purpose

Build intuition for how power iteration recovers the dominant eigenvector of a covariance matrix and why that vector equals the first principal component direction in PCA. You will see both the iterative viewpoint (repeated multiplications and renormalization) and the decomposition viewpoint (SVD/eigendecomposition), understand convergence requirements (spectral gap), and learn when to prefer power iteration, eigendecomposition, or SVD in ML workflows.

Problem

Use power iteration on the toy covariance matrix to approximate the dominant eigenvector and compare with SVD PC1.

Solution (Math)

For symmetric PSD covariance $\Sigma$, power iteration updates

\[ v_{t+1} = \frac{\Sigma v_t}{\|\Sigma v_t\|} \]

and converges to the dominant eigenvector $u_1$ when $|\lambda_1| > |\lambda_2|$ (spectral gap) and $v_0$ has a nonzero component along $u_1$. For PCA, $\Sigma = \tfrac{1}{n-1} X_c^\top X_c$ and the leading eigenvector equals the first right singular vector of centered data $X_c$ (up to sign):

\[ X_c = U\,\Sigma_{\text{svd}}\,V^\top \quad \Rightarrow \quad \tfrac{1}{n-1} X_c^\top X_c = V\,\tfrac{\Sigma_{\text{svd}}^2}{n-1}\,V^\top. \]

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points

rng = np.random.default_rng(1072)
X = toy_pca_points(n=120, seed=4)
Xc = X - X.mean(0)
Sigma = np.cov(Xc, rowvar=False, bias=False)

v = rng.normal(size=2); v /= np.linalg.norm(v)
for _ in range(50):
    v = Sigma @ v
    v /= np.linalg.norm(v)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
v_svd = Vt[0]; v_svd /= np.linalg.norm(v_svd)

print("abs cosine(power, svd):", abs(v @ v_svd))
Code Introduction

This snippet compares two ways to recover the first principal component of a centered dataset and confirms they agree. With $X \in \mathbb{R}^{n\times d}$, the data are centered $X_c = X - \mu$ and the sample covariance is $\Sigma \in \mathbb{R}^{d\times d}$ via np.cov(Xc, rowvar=False, bias=False) (unbiased, denominator $n-1$). A random unit vector $v$ is iteratively updated by the power method, $v \leftarrow \Sigma v$ followed by normalization, which converges to the dominant eigenvector of the symmetric positive semidefinite $\Sigma$.

The code then computes the SVD $X_c = U \Sigma_{\text{svd}} V^\top$ and takes v_svd = Vt[0] (the first right singular vector). For PCA, these definitions are equivalent: the covariance factors as \[ \Sigma = \tfrac{1}{n-1}\, X_c^\top X_c = V\, \Sigma_{\text{svd}}^2\, V^\top, \] so the eigenvectors of $\Sigma$ are the columns of $V$. The final print shows abs(v @ v_svd) \approx 1, using an absolute cosine because eigenvectors are defined up to sign.

Gotchas and notes: - Shape discipline: $X \in \mathbb{R}^{n\times d}$, $X_c \in \mathbb{R}^{n\times d}$, $\Sigma \in \mathbb{R}^{d\times d}$, $v \in \mathbb{R}^d$. - Power iteration requires repeated normalization and a start vector not orthogonal to the top eigenvector; convergence slows when the leading eigenvalues are close. - For symmetric matrices, np.linalg.eigh(\Sigma) is a direct alternative; SVD on $X_c$ avoids explicitly forming $\Sigma$ for larger $d$.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Center data: $X_c = X - \mathbf{1}\mu^\top$ where $\mu = \tfrac{1}{n} \sum_i X_{i\cdot}$.
  • Covariance: Sigma = np.cov(Xc, rowvar=False, bias=False) yields $\Sigma = \tfrac{1}{n-1} X_c^\top X_c$.
  • Initialize $v$ uniformly at random on the unit sphere; normalize to $\|v\|_2 = 1$.
  • Iterate: v = Sigma @ v; v /= np.linalg.norm(v) for a fixed number of steps or until convergence $\|v_{t+1} - v_t\| < \varepsilon$.
  • Compare with SVD: _, _, Vt = np.linalg.svd(Xc, full_matrices=False) and take v_svd = Vt[0] (first right singular vector); compare abs(v @ v_svd).
  • Convergence speed: approximately geometric with ratio $|\lambda_2/\lambda_1|$; more iterations are needed when the spectral gap is small.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Power iteration: repeated multiply-and-renormalize converges to the dominant eigenvector under a spectral gap.
  • PCA equivalence: the top eigenvector of $\Sigma$ equals the first right singular vector of $X_c$ (principal axis).
  • Practical verification: cosine similarity between the power-iterated vector and SVD PC1 is near 1 (up to sign).
  • When it converges and when it struggles: role of initialization and eigenvalue separation $|\lambda_1/\lambda_2|$.
  • Implementation tradeoffs: forming $\Sigma$ vs operating on $X_c$ directly; eigh for symmetric matrices vs SVD for PCA.
Notes
  • Part 1: Power Iteration Mechanics. Start with $v_0$ (non-orthogonal to $u_1$), then iterate $v_{t+1} = \Sigma v_t / \|\Sigma v_t\|$. Writing $v_0$ in the eigenbasis shows the $u_1$ component dominates as $t$ grows if $|\lambda_1| > |\lambda_2|$.
  • Part 2: PCA Equivalence via SVD. With $X_c = U\Sigma_{\text{svd}}V^\top$, the covariance satisfies $\tfrac{1}{n-1}X_c^\top X_c = V \tfrac{\Sigma_{\text{svd}}^2}{n-1} V^\top$. Thus the top eigenvector is $v_1 = V_{\cdot 1}$, matching SVD PC1 up to sign.
  • Part 3: Practical Verification. Compute abs(v @ v_svd) after iterations; values near 1 confirm agreement. Use absolute value because eigenvectors are defined up to sign.
  • Why This Matters for ML. Efficiently finding top directions enables PCA-based dimensionality reduction, denoising, and initialization for downstream models. Many graph and ranking algorithms rely on dominant eigenvectors (e.g., PageRank); power iteration scales via matvecs without forming dense matrices.
  • ML Examples and Patterns. PCA for variance capture; spectral clustering (graph Laplacian eigenvectors); PageRank (power method on a stochastic matrix); streaming PCA/online Oja’s rule for large or evolving datasets; randomized methods to approximate top-k subspaces.
  • Connection to Linear Algebra Theory. Spectral theorem for symmetric matrices guarantees an orthonormal eigenbasis; convergence governed by eigenvalue ratios; equivalence between eigenvectors of $X_c^\top X_c$ and right singular vectors of $X_c$ follows from SVD identities.
  • Numerical and Implementation Notes. Normalize each step to avoid overflow/underflow; stop with a tolerance on angle or residual $\|\Sigma v - \lambda v\|$; prefer np.linalg.eigh(\Sigma) for small $d$; for large $n,d$, avoid forming $\Sigma$ and apply power iteration directly with $X_c$ using $v = X_c^(X_c v)/(n-1)`.
  • Numerical and Shape Notes. $X, X_c \in \mathbb{R}^{n\times d}$; $\Sigma \in \mathbb{R}^{d\times d}$; $v, v_{\text{svd}} \in \mathbb{R}^d$. With SVD: U (n×d), Σ_svd (d), Vt (d×d) for full_matrices=False.
  • Pedagogical Significance. Demonstrates how an iterative method (power iteration) aligns with decomposition-based PCA, reinforcing the unity of perspectives (iteration vs factorization) and building intuition for when each is preferable.
History and Applications

Power iteration dates back to early 20th-century numerical analysis and remains a core primitive for extracting dominant eigenvectors due to its simplicity and scalability. PCA itself traces to Pearson (1901) and Hotelling (1933), establishing principal components as directions of maximal variance. In large-scale information retrieval and web search, PageRank applies a power-method-like iteration to a stochastic matrix to find a stationary distribution (dominant eigenvector). In streaming and online settings, Oja’s rule (1982) provides an incremental PCA that mirrors power iteration with stochastic updates. Modern randomized linear algebra (Halko–Martinsson–Tropp, 2011) blends subspace iteration with random projections to approximate top-k eigenspaces efficiently. Across ML, dominant eigenvectors underpin dimensionality reduction, spectral clustering, graph embeddings, recommendation, and initialization heuristics.

Connection to Broader Examples
  • Links to Chapter 8 (eigen): power iteration exemplifies iterative eigen-solvers and spectral gap dependence.
  • Connects to Chapter 11 (PCA): principal directions are eigenvectors of covariance and right singular vectors of centered data.
  • Relates to Chapter 10 (SVD): equivalence between eigenvectors of $X_c^\top X_c$ and right singular vectors of $X_c$.
  • Bridges to Chapter 14 (conditioning): small spectral gaps slow convergence and increase sensitivity to noise.
  • Complements Chapter 16 (matrix products): repeated matvecs are the core primitive powering large-scale methods.

Comments