Part 1: Data Preparation and Covariance Construction. The code loads 100 points from toy_pca_points(n=100, seed=2), producing $X \in \mathbb{R}^{100 \times 2}$ with inherent directional structure along one axis. Centering via Xc = X - X.mean(0) removes the mean, shifting the origin to the dataâs center of mass. The sample covariance matrix is computed as $\Sigma = \frac{1}{n-1} X_c^\top X_c \in \mathbb{R}^{2 \times 2}$, using np.cov(Xc, rowvar=False, bias=False). The rowvar=False flag treats columns as variables (features) rather than rows, and bias=False uses the unbiased estimator with $n-1$ denominator. This covariance matrix $\Sigma$ is symmetric positive semidefinite (PSD) by construction, guaranteeing real non-negative eigenvalues and orthogonal eigenvectors. The eigendecomposition $\Sigma = Q \Lambda Q^\top$ reveals the principal axes: eigenvectors $Q$ point along directions of variance, and eigenvalues $\Lambda$ quantify variance magnitude along each axis.
Part 2: Power Iteration for Dominant Eigenvector. Power iteration is a simple iterative method to extract the dominant eigenvector (corresponding to the largest eigenvalue) of a matrix. Starting from a random unit vector v = rng.normal(size=2) normalized to $\|v\|_2 = 1$, the algorithm repeatedly applies the covariance matrix and renormalizes: $v_{k+1} = \frac{\Sigma v_k}{\|\Sigma v_k\|}$. The loop for _ in range(40): v = Sigma @ v; v /= np.linalg.norm(v) executes 40 iterations. Why does this converge? Expand the initial vector in the eigenbasis: $v_0 = c_1 q_1 + c_2 q_2$, where $q_1, q_2$ are eigenvectors with eigenvalues $\lambda_1 > \lambda_2$. After $k$ iterations, $\Sigma^k v_0 = c_1 \lambda_1^k q_1 + c_2 \lambda_2^k q_2 = \lambda_1^k \left( c_1 q_1 + c_2 \left(\frac{\lambda_2}{\lambda_1}\right)^k q_2 \right)$. Since $|\lambda_2 / \lambda_1| < 1$ (strict inequality when eigenvalues are distinct), the term $(\lambda_2/\lambda_1)^k \to 0$ exponentially fast. Renormalization removes the growing factor $\lambda_1^k$, leaving $v_k \to q_1$ (up to sign). The convergence rate depends on the spectral gap $|\lambda_2/\lambda_1|$: closer eigenvalues require more iterations.
Part 3: Verification via SVD Principal Component. To validate convergence, the code computes the first principal component using SVD. The decomposition U, S, Vt = np.linalg.svd(Xc, full_matrices=False) factorizes $X_c = U \Sigma V^\top$, where $V^\top$ (stored as Vt) contains right singular vectorsâthe principal directions. Extracting pc1 = Vt[0] gives the first principal component, a unit vector in $\mathbb{R}^2$. The connection between SVD and eigendecomposition is direct: for centered data, the covariance is $\Sigma = \frac{1}{n-1} X_c^\top X_c = \frac{1}{n-1} V S^2 V^\top$, so the right singular vectors $V$ are the eigenvectors of $\Sigma$, and eigenvalues are $\lambda_i = \sigma_i^2 / (n-1)$. The printed abs cosine: abs(v @ pc1) computes the absolute cosine similarity between the power iteration result and the SVD-derived principal component. A value near 1.0 confirms that both methods converged to the same direction. Shape discipline: $\Sigma \in \mathbb{R}^{2 \times 2}$, $v \in \mathbb{R}^2$, $V^\top \in \mathbb{R}^{2 \times 2}$ (from SVD with full_matrices=False). The absolute cosine accounts for sign ambiguity: eigenvectors are defined up to sign, so $v$ and $-v$ are equally valid.
Part 4: Why This Matters for ML. Power iteration reveals the computational trade-offs in eigenvalue problems. Full eigendecomposition (via np.linalg.eigh) costs $O(d^3)$ for a $d \times d$ matrix, which is prohibitive when $d$ is large (e.g., high-dimensional embeddings, covariance matrices with thousands of features). Power iteration costs $O(d^2)$ per iterationâjust a matrix-vector productâand often converges in dozens of iterations, making it practical for extracting the top few eigenvectors. Modern extensions like Lanczos and randomized SVD build on this principle, computing low-rank approximations efficiently for large-scale PCA, spectral clustering, and neural network weight initialization. The spectral gap $|\lambda_2/\lambda_1|$ is also a conditioning diagnostic: small gaps slow convergence and make eigenvalue estimation noisy, connecting to conditioning themes (Chapter 14). In deep learning, the dominant eigenvector of the Hessian (curvature matrix) governs optimization landscape geometryâpower iteration on mini-batches estimates this direction cheaply. Gotchas: if eigenvalues are equal (e.g., spherical covariance), power iteration can converge to any direction in the eigenspaceâthe choice depends on initialization. For complex or non-symmetric matrices, power iteration requires specialized variants (e.g., Arnoldi iteration). Always verify convergence by tracking the residual $\|\Sigma v - \lambda v\|$ or checking alignment with reference eigenvectors, as done here.
Comments