Example number
8
Example slug
example_8_power_iteration_for_dominant_eigenvector_pca_direction
Background

Power iteration is a classic algorithm (dating to von Mises and Wielandt) for extracting the dominant eigenvector of a matrix by repeated multiplication and renormalization. For covariance matrices $\Sigma = \text{cov}(X_c)$, the dominant eigenvector points in the direction of maximum variance and matches the first principal component from PCA. A spectral gap between the largest and second-largest eigenvalues ensures fast convergence; normalization prevents vector norms from exploding.

Because covariance is symmetric positive semi-definite, its eigenvectors are orthonormal and align with right singular vectors of the centered data matrix $X_c$. Thus power iteration provides a lightweight alternative to a full SVD when only the top component is needed.

Purpose

Build intuition that ties PCA to eigenvectors and iterative methods:

  • See that power iteration converges to the dominant eigenvector of covariance.
  • Understand that this dominant eigenvector equals the first principal component (up to sign).
  • Recognize the role of centering, normalization, and spectral gap for convergence speed.
  • Keep shapes straight: $X\in\mathbb{R}^{n\times d}$ (here $d=2$), $\Sigma\in\mathbb{R}^{d\times d}$.
Problem

Run power iteration on covariance of toy data and compare to SVD PC1.

Solution (Math)

For covariance $\Sigma$, power iteration $v_{t+1}=\Sigma v_t/\|\Sigma v_t\|$ converges to the top eigenvector given a spectral gap. That eigenvector equals PCA direction (up to sign).

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points
rng = np.random.default_rng(1008)

X = toy_pca_points(n=100, seed=2)
Xc = X - X.mean(0)
Sigma = np.cov(Xc, rowvar=False, bias=False)

v = rng.normal(size=2); v /= np.linalg.norm(v)
for _ in range(40):
    v = Sigma@v
    v /= np.linalg.norm(v)

_,_,Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]; pc1 /= np.linalg.norm(pc1)

print("abs cosine:", abs(v@pc1))
Code Introduction

This code shows that power iteration recovers the top eigenvector of a covariance matrix and matches the first principal component from SVD. It loads 2D data with toy_pca_points, centers it (Xc), and computes the sample covariance $\Sigma = \text{cov}(X_c)$. A random unit vector $v$ is iteratively updated by $v \leftarrow \Sigma v$ and renormalized—this converges to the dominant eigenvector because repeated multiplication amplifies the largest eigenvalue direction.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  • Center data: $X_c = X - \text{mean}(X)$; covariance via np.cov(Xc, rowvar=False, bias=False) gives $\Sigma\in\mathbb{R}^{2\times 2}$.
  • Initialize $v$ randomly and normalize: v /= np.linalg.norm(v); avoid starting exactly orthogonal to the top eigenvector.
  • Iterate: v = Sigma @ v; v /= np.linalg.norm(v); 40 steps suffice here due to a spectral gap; more steps may be needed if eigenvalues are close.
  • Reference PCA direction with SVD: Vt[0] is the first right singular vector of $X_c$; normalize to get pc1.
  • Alignment check: use abs(v @ pc1) to ignore sign ambiguity; values near 1 confirm convergence.
  • Shapes: $X:(n\times 2)$, $X_c:(n\times 2)$, $\Sigma:(2\times 2)$, $v,\text{pc1}:(2)$.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Power iteration converges to the dominant eigenvector of $\Sigma$ when there is a spectral gap.
  • The dominant eigenvector of covariance equals the first PCA direction (up to sign).
  • Normalizing each step is essential for numerical stability and to isolate direction.
  • Comparing to SVD via cosine similarity validates convergence (value near 1 in magnitude).
Notes

Part 1: Core setup - Power iteration for dominant eigenvector (PCA direction)

State the objects, shapes, and target question for Power iteration for dominant eigenvector (PCA direction). Name the data matrices or vectors, specify their dimensions, and clarify the transformation or comparison this example develops.

Part 2: Geometry and algebraic insight - Power iteration for dominant eigenvector (PCA direction)

Describe the geometric picture (subspaces, projections, bases, or decompositions) and the algebraic identities that make Power iteration for dominant eigenvector (PCA direction) work. Highlight how these structures constrain solutions and connect to earlier linear algebra tools.

Part 3: Numerics and ML practice - Power iteration for dominant eigenvector (PCA direction)

Give the computational recipe for Power iteration for dominant eigenvector (PCA direction), note stability or conditioning checks, and tie to an ML use case. Mention parameter choices, common pitfalls, and quick sanity checks such as shape validation or reconstruction error.

  • Shape discipline: $X\in\mathbb{R}^{n\times d}$ (here $d=2$), $X_c$ centered, $\Sigma\in\mathbb{R}^{d\times d}$ symmetric PSD, $v\in\mathbb{R}^d$.
  • Spectral gap: convergence rate depends on $|\lambda_2/\lambda_1|$; if the top two eigenvalues are close, increase iterations or use more advanced methods (e.g., Lanczos).
  • Sign ambiguity: eigenvectors are defined up to sign; use absolute cosine to compare directions.
  • Stability: renormalize each step to prevent overflow/underflow; centering is required so covariance reflects variance not mean offset.
History and Applications

Eigenvalue and eigenvector theory emerged in the late 18th and 19th centuries through Euler’s work on rotations and Cauchy’s development of the characteristic equation. The connection to principal axes of ellipsoids was geometric and profound.

Power iteration (von Mises method): Von Mises (1929) and Wielandt developed power iteration as a practical algorithm for computing the dominant eigenvector of large matrices. Before computer-based SVD algorithms (Golub & Kahan, 1965), power iteration was the workhorse for extracting the leading principal component.

PageRank and modern ranking: Google’s PageRank algorithm (Brin & Page, 1998) famously uses power iteration to compute the dominant eigenvector of a link-transition matrix. Each iteration computes the expected “random surfer” position after one more web hop; convergence gives the importance of each page. This brought eigenvectors into the algorithmic mainstream and highlighted their utility for large, sparse matrix problems.

Connection to Broader Examples
  • PCA links eigenvectors of covariance to directions of maximal variance; power iteration is a cheap way to extract the top direction without full SVD.
  • Spectral gap considerations mirror conditioning themes: close eigenvalues slow convergence and can confuse direction estimates.
  • Iterative eigen methods (power iteration, Lanczos) generalize to larger $d$ when computing only a few leading components.
  • In downstream ML, the top component defines the principal axis for projection, denoising, and initialization of low-rank models.

Comments