Example number
56
Example slug
example_56_eigenvectors_in_ml_power_iteration_for_dominant_pca_direction
Background

Historical context: PCA dates back to Pearson (1901) and Hotelling (1933), with modern numerical methods refined through SVD (Golub & Reinsch, 1970). Power iteration is one of the earliest iterative eigenmethods, foundational to Krylov and Lanczos approaches. In large-scale ML (text, vision, recommender systems), iterative methods dominate because they scale with matrix–vector multiplies rather than full decompositions.

Mathematical characterization: For symmetric PSD covariance $\Sigma$, eigenpairs $(\lambda_i, u_i)$ satisfy $\Sigma u_i = \lambda_i u_i$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$. Power iteration repeatedly applies $v \leftarrow \Sigma v$ and normalizes; if $v$ has nonzero overlap with $u_1$ and $\lambda_1 > \lambda_2$, then $v \to \pm u_1$. For centered data $X_c$, $\Sigma = \frac{1}{n-1} X_c^\top X_c$, so $u_1$ equals the first right singular vector of $X_c$.

Prevalence in ML: Computing only the top direction (or a few top directions) is common in dimensionality reduction, whitening, and streaming PCA. Iterative schemes (power/Lanczos) are used when $n,d$ are large, when data arrive online, or when forming $\Sigma$ is memory-expensive. Understanding these connections lets you switch between SVD-based PCA and iterative eigen methods confidently.

Purpose

Show, in an ML-first way, that power iteration on the sample covariance converges to the dominant PCA direction, and explain when and why it works. Build intuition for the spectral-gap requirement, the role of normalization, and practical implementations that avoid explicitly forming the covariance in high dimensions. Connect the eigenvector of the covariance $\Sigma$ to the first right singular vector of centered data $X_c$ and clarify sign/scale conventions so you can confidently compare iterative and decomposition-based methods.

Problem

Use power iteration on the toy covariance matrix to approximate the dominant eigenvector and compare with SVD PC1.

Solution (Math)

For symmetric PSD covariance $\Sigma$, power iteration $v_{t+1}=\Sigma v_t/\|\Sigma v_t\|$ converges to the top eigenvector under a spectral gap. For PCA, this eigenvector matches the first right singular vector of centered data $X_c$ (up to sign).

We use:

  • Data matrix $X\in\mathbb{R}^{n\times d}$ (rows are examples).
  • Vectors are column vectors by default.
  • $\|x\|_2$ is Euclidean norm; $\langle x,y\rangle=x^Ty$.
Solution (Python)

import numpy as np
from scripts.toy_data import toy_pca_points

rng = np.random.default_rng(1056)
X = toy_pca_points(n=120, seed=4)
Xc = X - X.mean(0)
Sigma = np.cov(Xc, rowvar=False, bias=False)

v = rng.normal(size=2); v /= np.linalg.norm(v)
for _ in range(50):
    v = Sigma @ v
    v /= np.linalg.norm(v)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
v_svd = Vt[0]; v_svd /= np.linalg.norm(v_svd)

print("abs cosine(power, svd):", abs(v @ v_svd))
Code Introduction

This snippet compares two ways to find the dominant PCA direction of 2D data: power iteration on the sample covariance and SVD of the centered data. The data are centered as $X_c = X - \mu$, and the sample covariance is computed as $\Sigma = \frac{1}{n-1} X_c^\top X_c$ via np.cov(rowvar=False, bias=False). Because $\Sigma$ is symmetric positive semidefinite, its eigenvectors define the principal axes.

Power iteration initializes a unit vector $v$ and repeatedly applies $v \leftarrow \Sigma v$ followed by normalization. Under mild conditions (nonzero overlap with the top eigenvector and a gap $\lambda_1 > \lambda_2$), $v$ converges to the dominant eigenvector; the error decays geometrically at rate $(\lambda_2/\lambda_1)^k$.

The SVD route computes $X_c = U \Sigma_s V^\top$. The first right singular vector $v_{\text{svd}} = V_{:,0}$ is the leading eigenvector of $X_c^\top X_c$ (and of $\Sigma$), so normalizing both $v$ and $v_{\text{svd}}$ makes their directions comparable. The printed quantity is the absolute cosine $|\cos\theta| = |v^\top v_{\text{svd}}|$, which should be near 1; the absolute value removes the sign ambiguity (eigenvectors are defined up to $\pm$).

Gotchas: - Centering is required for PCA; otherwise the top direction can point toward the mean. - If the top two eigenvalues are close, more iterations are needed for convergence. - For higher dimensions, avoid forming $\Sigma$ explicitly; use $(\Sigma v) = \frac{1}{n-1} X_c^\top (X_c v)$ to keep memory and compute costs at $O(nd)$.

Numerical Implementation Details

Numerical and Shape Notes

  • Shapes first: Declare shapes (e.g., $X \in \mathbb{R}^{n imes d}$, $w \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{n}$). Vectors are column by convention; keep row/column usage consistent.
  • Axis discipline: Be explicit with axis in reductions and normalizations. For attention-like ops, softmax over keys (row-wise) so rows sum to ≈1.
  • Broadcasting: Check that broadcasts are intended (e.g., (n,1) with (n,d)). Prefer reshape/expand-dims to make semantics clear.
  • Stability eps: Add $arepsilon$ for divisions/logs and $arepsilon I$ (jitter) for SPD solves; use log-sum-exp for softmax.
  • Masking preserves shape: Masks should broadcast to the score/activation tensor; verify masked outputs keep the same shape and zero out excluded entries.
  • Dtype choices: Use float64 for clarity in scripts; with mixed precision, keep reductions/factorizations in float32/float64 to avoid under/overflow.
  • Sanity checks: Print shapes and residuals (e.g., ||Ax-b||, reconstruction error, row-sum ≈ 1). Assert finiteness and expected monotonicity where applicable.

Numerical and Implementation Notes

  • Dtype & precision: Prefer float64 for clarity; if using mixed precision, keep reductions (norms, softmax sums, factorizations) in float32/float64. Avoid explicit inverses; use solve, lstsq, Cholesky/QR/SVD.
  • Shapes & broadcasting: Annotate shapes (e.g., $X \in \mathbb{R}^{n imes d}$); vectors are column by default. Verify axes for reductions (axis) and ensure broadcasts are intended.
  • Stability: Use log-sum-exp for softmax; add small diagonal $arepsilon I$ (jitter) for SPD solves; prefer QR/SVD for ill-conditioned least squares.
  • Conditioning: Inspect np.linalg.cond(A) when solutions look unstable; regularize (ridge) or rescale features to improve conditioning.
  • Reproducibility: Set NumPy seed for random data; print shapes and residuals (e.g., ||Ax-b||, reconstruction errors) and assert finiteness.
  • Complexity & memory: Matmul ~ $O(n^3)$ for factorizations, $O(n^2)$ for triangular solves/products. Prefer vectorization over Python loops; avoid materializing large intermediates.
  • Masking & indexing: Use boolean masks that broadcast to target shapes; for attention-like ops, add $-\infty$ before softmax or zero-out after, then verify rows sum to ~1.
  • Sanity checks: Compare against references (e.g., lstsq vs. solve), check orthogonality (U.T @ U ≈ I), PSD (x.T @ A @ x > 0), and residual norms within tolerance (~1e-12 for float64).
  1. Center the data: Compute $X_c = X - \bar{X}$ so PCA measures variance around the mean.
  2. Form covariance (small d): $\Sigma = \frac{1}{n-1} X_c^\top X_c$ via np.cov(Xc, rowvar=False, bias=False).
  3. Initialize: Draw $v$ from a random distribution and normalize: $v \leftarrow v/\|v\|$.
  4. Iterate: For $t=1\ldots T$: set $v \leftarrow \Sigma v$ and renormalize $v \leftarrow v/\|v\|$.
  5. Compare to SVD: Compute SVD of $X_c$ with np.linalg.svd(Xc, full_matrices=False); take Vt[0] and normalize.
  6. Validate direction: Compute $|v^\top v_{\text{svd}}|$; values $\approx 1$ confirm agreement up to sign.
  7. Use matrix–vector form (large d): Replace $\Sigma v$ by $\frac{1}{n-1} X_c^\top (X_c v)$ to avoid forming $\Sigma$.
  8. Iteration count & gap: Increase iterations if the top two eigenvalues are close; monitor change in $|v^\top v_{\text{svd}}|$.
What This Example Demonstrates

Pedagogical Significance

  • Learning goals: Build intuition for when and why this tool is used in ML, not just how to compute it.
  • ML-first framing: Tie the concept to a concrete task pattern (fit / project / decompose / solve / measure) to anchor understanding.
  • Shape discipline: Habitually annotating dimensions prevents silent bugs and reinforces linear map thinking.
  • Numerical habits: Prefer stable factorizations over inverses; check residuals and condition numbers to separate bugs from ill-conditioning.
  • Transfer: Reuse the same pattern across models (e.g., projection in PCA, orthogonalization in regressions, attention as weighted sums).
  • Assessment ideas: Quick checks: predict sensitivity from $\kappa(A)$, verify projection properties, or compare solver outputs within tolerance.

ML Examples and Patterns

  • Fit: Linear/logistic regression via least squares or softmax; regularization (ridge) improves conditioning and generalization.
  • Project: PCA/SVD for dimensionality reduction; orthogonal projections to subspaces for denoising and feature extraction.
  • Decompose: Eigen/SVD factorizations to expose structure (low rank, PSD) used in recommender systems, LSA, and spectral clustering.
  • Solve: Stable solves without inversion (Cholesky/QR/SVD; CG for SPD) for optimization steps and kernel methods.
  • Measure: Norms, angles, and condition number $\kappa(A)$ to diagnose sensitivity, stability, and training difficulty.
  • Power iteration convergence: Repeatedly applying $\Sigma$ and renormalizing yields the top eigenvector under a spectral gap.
  • Equivalence to PCA direction: The resulting vector matches the first right singular vector of $X_c$ (up to sign).
  • Normalization matters: Renormalization controls growth by $\lambda_1$ and keeps iterations stable and comparable.
  • Spectral gap intuition: Convergence rate is geometric with factor $(\lambda_2/\lambda_1)$; small gaps slow convergence.
  • Centering for PCA: Using $X_c$ ensures the principal direction captures variance around the mean, not the mean itself.
  • Matrix–vector form for scale: Computing $\Sigma v$ via $X_c^\top (X_c v)$ avoids explicitly forming $\Sigma$.
  • Sign ambiguity resolution: Comparing directions uses absolute cosine since eigenvectors are defined up to sign.
  • Numerical verification: The absolute cosine between power-iteration and SVD directions should be near 1.
Notes

Part 1: Setup and Centering - Center the data $X_c = X - \bar{X}$ to remove mean bias. - Define $\Sigma = \frac{1}{n-1} X_c^\top X_c$; $\Sigma$ is symmetric PSD. - Initialize $v$ with $\|v\|_2=1$ to compare directions across iterations.

Part 2: Power Iteration Mechanics - Apply $v \leftarrow \Sigma v$ and renormalize each step. - Convergence requires nonzero overlap with the top eigenvector and $\lambda_1 > \lambda_2$. - Error decays like $(\lambda_2/\lambda_1)^k$; normalization prevents numerical overflow.

Part 3: Equivalence with SVD - SVD: $X_c = U \Sigma_s V^\top$; PC1 direction is $V_{:,0}$. - Since $X_c^\top X_c$ and $\Sigma$ share eigenvectors, power iteration and SVD agree up to sign. - Use absolute cosine $|v^\top v_{\text{svd}}|$ to resolve sign ambiguity.

Why This Matters for ML - Iterative methods scale to large $n,d$ with matrix–vector products. - Enables online/streaming PCA without storing full datasets. - Clarifies when SVD and iterative methods produce the same directions.

ML Examples and Patterns - Whitening and dimensionality reduction with only top-$k$ directions. - Feature compression in vision/NLP with budgeted computation. - Warm-starting iterative methods across epochs or streaming batches.

Connection to Linear Algebra Theory - Spectral theorem for symmetric matrices guarantees orthonormal eigenvectors. - Relation $\Sigma = \frac{1}{n-1} X_c^\top X_c$ ties PCA to covariance eigenanalysis. - Convergence depends on the ratio of leading eigenvalues and initial overlap.

Numerical and Implementation Notes - Prefer $X_c^\top (X_c v)$ over explicit $\Sigma$ for memory efficiency. - Normalize at each step to avoid under/overflow and to track direction. - Monitor $|v^\top v_{\text{svd}}|$ or successive $|v_t^\top v_{t-1}|$ for convergence.

Numerical and Shape Notes - $X \in \mathbb{R}^{n\times d}$; here $d=2$ for visualization, but method generalizes. - $X_c^\top X_c \in \mathbb{R}^{d\times d}$; $v \in \mathbb{R}^d$. - SVD shapes: $U \in \mathbb{R}^{n\times d}$, $\Sigma_s \in \mathbb{R}^{d\times d}$, $V^\top \in \mathbb{R}^{d\times d}$ with full_matrices=False.

Pedagogical Significance - Bridges iterative and decomposition viewpoints of PCA. - Provides a concrete, testable convergence story for students. - Emphasizes centering, normalization, and spectral gap as the key ingredients.

History and Applications

Early PCA work by Pearson (1901) and Hotelling (1933) established variance-based dimensionality reduction. The advent of stable SVD algorithms (Golub & Reinsch, 1970) made PCA robust and widely applicable. Iterative eigenmethods, including power iteration and Lanczos, enabled large-scale computations by relying on matrix–vector products rather than explicit matrices—central to modern applications in information retrieval (latent semantic analysis), computer vision (eigenfaces), and streaming analytics. Today, scalable PCA often combines randomized projections with power/Lanczos iterations, and online variants maintain principal directions as new data arrive, reinforcing the practicality of iteration-based approaches.

Connection to Broader Examples
  • Chapter 8 (Eigen): Iterative methods approximate leading eigenvectors without full decompositions.
  • Chapter 10 (SVD): The top PCA direction equals the first right singular vector of $X_c$.
  • Chapter 11 (PCA): Dominant variance direction defines PC1; power iteration offers a scalable way to compute it.
  • Chapter 6 (Projections): PCA projects onto leading subspaces; power iteration identifies these subspaces.
  • Chapter 14 (Conditioning): Small spectral gaps slow convergence; conditioning affects iterative method performance.
  • Large-scale ML: Prefer matrix–vector products over explicit matrices; streaming/online PCA naturally uses iterative updates.

Comments