Example number
40
Example slug
example_40_power_iteration_for_dominant_eigenvector_pca_direction
Background

For a symmetric matrix $\Sigma$, power iteration repeatedly applies $\Sigma$ to a vector and renormalizes. If $|\lambda_1| > |\lambda_2|$ (spectral gap), $v_t$ converges to the eigenvector of $\lambda_1$ up to sign. For a centered data matrix $X_c \in \mathbb{R}^{n\times d}$, the covariance $\Sigma = \frac{1}{n-1} X_c^\top X_c$ shares eigenvectors with the right singular vectors of $X_c$, so the dominant eigenvector equals PCA’s first principal direction. Normalization prevents overflow and keeps $\|v_t\|_2=1$. Convergence speed depends on the ratio $|\lambda_2/\lambda_1|$; small gaps slow convergence.

Purpose

Show how power iteration extracts the dominant eigenvector of a symmetric matrix—in this case, the covariance—recovering the first PCA direction with only matrix–vector multiplies and normalization. Emphasize when it converges (spectral gap), why it is sign-ambiguous, and how it provides a cheap, scalable way to approximate the top component without a full SVD.

Problem

Run power iteration on covariance of toy data and compare to SVD PC1.

Solution (Math)

For covariance $\Sigma$, power iteration $v_{t+1}=\Sigma v_t/\|\Sigma v_t\|$ converges to the top eigenvector given a spectral gap. That eigenvector equals PCA direction (up to sign).

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points
rng = np.random.default_rng(1040)

X = toy_pca_points(n=100, seed=2)
Xc = X - X.mean(0)
Sigma = np.cov(Xc, rowvar=False, bias=False)

v = rng.normal(size=2); v /= np.linalg.norm(v)
for _ in range(40):
    v = Sigma@v
    v /= np.linalg.norm(v)

_,_,Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]; pc1 /= np.linalg.norm(pc1)

print("abs cosine:", abs(v@pc1))
Code Introduction

This code runs power iteration on a 2D covariance matrix from toy PCA data, then compares the resulting direction to the first principal component from SVD by reporting their absolute cosine similarity.

This snippet contrasts the power iteration eigenvector estimate with the SVD-derived principal component: both should align because the dominant eigenvector of the covariance is the first PCA direction, so a cosine near 1 confirms successful convergence.

This snippet contrasts the power iteration eigenvector estimate with the SVD-derived principal component: agreement (cosine near 1) verifies both methods recover the same dominant eigenvector of the covariance.

This snippet contrasts the power iteration eigenvector estimate with the principal component from SVD, showing they converge to the same dominant direction of the covariance. It generates 100 centered 2D points $X_c$, forms the sample covariance $\Sigma \in \mathbb{R}^{2\times 2}$, and runs 40 steps of power iteration starting from a random unit vector $v$. Each step applies $v \leftarrow \Sigma v$ then renormalizes, driving $v$ toward the top eigenvector of $\Sigma$ (largest eigenvalue). In parallel, it computes the SVD of $X_c$: $X_c = U \Sigma V^\top$, whose first right singular vector $pc_1$ equals the first principal component (also the top eigenvector of $\Sigma$). After normalizing both, the dot product $\text{abs}(v@pc1)$ reports the absolute cosine similarity; values near 1 indicate that power iteration and SVD agree on the dominant direction, verifying that PCA’s principal axis is the leading eigenvector of the covariance.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Load toy data: X = toy_pca_points(n=100, seed=2) yields $X \in \mathbb{R}^{100\times 2}$.
  2. Center features: Xc = X - X.mean(0) to form zero-mean data.
  3. Form covariance: Sigma = np.cov(Xc, rowvar=False, bias=False) so $\Sigma \in \mathbb{R}^{2\times 2}$.
  4. Initialize direction: random unit vector v; normalization avoids scale blow-up.
  5. Iterate: repeat $v \leftarrow \Sigma v; v \leftarrow v/\|v\|$ for a fixed count (40 steps) to ensure convergence given a gap.
  6. Reference via SVD: compute Vt from np.linalg.svd(Xc, full_matrices=False); pc1 = Vt[0] is the PCA direction.
  7. Check alignment: abs(v@pc1) close to 1 confirms convergence to the dominant eigenvector up to sign.
  8. Optional diagnostics: monitor $\|\Sigma v - \lambda v\|$ or successive cosine changes to set a stopping tolerance instead of a fixed iteration budget.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Power iteration converges to the dominant eigenvector when a spectral gap exists.
  • PCA link: the dominant eigenvector of covariance equals the first right singular vector of centered data.
  • Normalization each step controls scale and keeps iterates on the unit sphere.
  • Convergence rate: roughly geometric with factor $|\lambda_2/\lambda_1|$; small gaps mean slow progress.
  • Sign ambiguity: both $v$ and $-v$ are valid eigenvectors; cosine similarity uses absolute value.
  • Diagnostic: compare to SVD PC1 using $|v^\top \text{pc1}|$ to measure alignment.
Notes

Part 1: Problem Setup and Power Iteration
Given centered data $X_c$, form $\Sigma = \frac{1}{n-1} X_c^\top X_c$. Initialize $v$ randomly with $\|v\|_2=1$. Iterate $v \leftarrow \Sigma v$ then renormalize. The goal is to extract the dominant eigenvector corresponding to the largest eigenvalue of $\Sigma$.

Part 2: Spectral Gap and Convergence
Convergence speed is geometric with rate $|\lambda_2/\lambda_1|$. A clear gap ($|\lambda_1| \gg |\lambda_2|$) yields fast convergence; near-multiplicity slows progress. If $\lambda_1$ and $\lambda_2$ have the same magnitude, power iteration may fail to settle to a single direction.

Part 3: Numerical Verification as Diagnostic
Use $|v^\top \text{pc1}|$ to measure alignment with the SVD top singular vector; values near 1 indicate success. You can also monitor the Rayleigh quotient $\rho = v^\top \Sigma v$ and the residual $\|\Sigma v - \rho v\|_2$; both should stabilize as the method converges.

Why This Matters for ML
Power iteration is the simplest scalable way to get the top PCA direction for large or streaming data where full SVD is expensive. It underlies randomized PCA, spectral clustering, PageRank-like methods, and initialization strategies for low-rank models. Understanding the spectral gap guides how many iterations are needed.

Numerical and Implementation Notes
Normalize every step to avoid overflow/underflow. Use double precision for stability. Stop when the change in $v$ (or cosine with the previous $v$) is below tolerance, or when the Rayleigh quotient stabilizes. For higher components, use deflation or orthogonal iteration; plain power iteration finds only the top component.

Numerical and Shape Notes
Shapes: $X (n\times d)$, $X_c (n\times d)$, $\Sigma (d\times d)$, $v (d)$. For 2D toy data, $d=2$; for higher $d$, cost per iteration is $O(d^2)$ for dense $\Sigma$ or $O(nd)$ if you multiply by $X_c$ directly. Eigenvectors are sign-ambiguous; compare using absolute cosine.

History and Applications

Power iteration dates to early 20th-century work by von Mises and Pollaczek-Geiringer (1929) as a simple eigen-solver. PCA as an eigenproblem was popularized by Hotelling (1933). Later, Lanczos (1950) and Krylov methods generalized power iteration for faster convergence to multiple eigenpairs. In modern ML, power iteration underlies scalable PCA, spectral clustering, eigenvector-based centrality (e.g., PageRank-style scores), and initialization schemes for low-rank models. Its simplicity makes it a core building block in randomized and streaming PCA algorithms where full SVD is infeasible.

Connection to Broader Examples
  • Vector spaces (Ch. 1): Eigenvectors form invariant directions in $\mathbb{R}^d$.
  • Span (Ch. 2): Iterates remain in the span of the initialization projected onto eigenvectors.
  • Basis and dimension (Ch. 3): Eigenvectors can serve as an orthonormal basis; power iteration isolates the leading one.
  • Linear maps (Ch. 4): Power iteration repeatedly applies the linear map $v \mapsto \Sigma v$.
  • Inner products and norms (Ch. 5): Normalization uses the 2-norm; cosine tracks alignment.
  • Projections (Ch. 6): PCA projects data onto leading eigenvectors; power iteration finds that direction.
  • SVD (Ch. 10): Covariance eigenvectors equal right singular vectors; power iteration is a cheap alternative to full SVD for the top component.
  • Conditioning (Ch. 14): Spectral gap controls convergence speed; near-multiplicity slows or stalls progress.

Comments