This code demonstrates PCA as low-rank approximation via SVD: it reconstructs data using only the top principal component and measures how much variance is retained. The explained variance ratio (EVR) quantifies the information loss, while the mean squared error (MSE) measures reconstruction quality.
Part 1: Data Centering and SVD. The code loads 60 2D points via toy_pca_points(n=60, seed=3), yielding $X \in \mathbb{R}^{60 \times 2}$ (60 examples, 2 features). Centering removes the mean: $\mu = \frac{1}{n}\sum_{i=1}^n X_i, \quad X_c = X - \mu \in \mathbb{R}^{60 \times 2}$. The SVD factorization $X_c = U \Sigma V^\top$ decomposes the centered data as: $U \in \mathbb{R}^{60 \times 2}$ (left singular vectors, principal directions in data space), $\Sigma = \text{diag}(\sigma_1, \sigma_2)$ (singular values, importance/variance scales), and $V^\top \in \mathbb{R}^{2 \times 2}$ (right singular vectors, principal axes in feature space). With full_matrices=False, the SVD returns only the first $\min(60, 2) = 2$ columns of $U$ and rows of $V^\top$, matching the data dimensionality.
Part 2: Explained Variance Ratio. The explained variance ratio (EVR) measures what fraction of total variance is captured by the top-$k$ components. For variance computed from singular values: $\text{Total variance} = \sum_{i=1}^d \sigma_i^2 = \|X_c\|_F^2$, the EVR for the first component is: $\text{EVR}_1 = \frac{\sigma_1^2}{\sum_{i=1}^d \sigma_i^2}$. The code computes this as evr1 = (S[0]**2) / np.sum(S**2). For 2D data, cumulative EVR for both components is: $\text{EVR}_{1:2} = \frac{\sigma_1^2 + \sigma_2^2}{\sigma_1^2 + \sigma_2^2} = 1.0$, computed as evr2 = (S[0]**2 + S[1]**2) / np.sum(S**2), which should equal 1.0 (all variance is explained by all components). Interpretation: If $\text{EVR}_1 = 0.95$, the first component captures 95% of the variation in the dataâmost information fits along a single direction. If $\text{EVR}_1 = 0.5$, the data are more isotropic (variance spread equally across directions).
Part 3: Rank-1 Reconstruction. The rank-1 approximation of centered data uses only the dominant singular vector: $\hat{X}_c^{(1)} = \sigma_1 u_1 v_1^\top$, where $u_1 = U_{:,0}$ (first column of $U$) and $v_1 = V_{0,:}$ (first row of $V^\top$). The code constructs this via Xc1 = np.outer(U[:,0], Vt[0]) * S[0]. Here, np.outer(U[:,0], Vt[0]) computes $u_1 v_1^\top$ (outer product, shape $60 \times 2$), and multiplying by S[0] gives $\sigma_1 u_1 v_1^\top$. To reconstruct the original data scale, the mean is added back: $\hat{X}^{(1)} = \mu + \hat{X}_c^{(1)} \in \mathbb{R}^{60 \times 2}$. The code computes this as X1 = mu + Xc1. Geometric interpretation: $\hat{X}^{(1)}$ projects all data points onto a line passing through the mean $\mu$ in the direction $v_1$. The rank-1 approximation is the best least-squares fit to $X$ by a 1D affine subspace (line through $\mu$).
Part 4: Reconstruction Error (MSE). The mean squared error (MSE) per example measures reconstruction quality: $\text{MSE} = \frac{1}{n} \sum_{i=1}^n \|X_i - \hat{X}_i^{(1)}\|_2^2 = \frac{1}{n} \|X - \hat{X}^{(1)}\|_F^2$. The code computes this as mse1 = np.mean(np.sum((X - X1)**2, axis=1)). (X - X1)**2 squares element-wise differences, np.sum(..., axis=1) sums squared errors per example, and np.mean(...) averages over all examples. Relationship to EVR: The MSE is inversely related to explained variance. If $\text{EVR}_1$ is high (most variance captured), MSE is low (good reconstruction). The variance lost is $\sum_{i=2}^d \sigma_i^2$, which accounts for the reconstruction errorâthe remaining singular values determine how far points deviate from the rank-1 subspace.
Why This Matters for ML. PCA with $k$ components projects high-dimensional data onto a $k$-dimensional subspace. Choosing $k$ via EVR (e.g., retain components capturing 95% variance) balances information retention and computational efficiency. For $d = 1000$ features, keeping top-10 components might capture 90% variance, reducing dimensionality 100-fold. The rank-$k$ approximation is optimal (EckartâYoung theorem): no other rank-$k$ matrix fits the data better in least-squares sense. Reconstruction error guides model selection: if test MSE is much higher than train MSE, the low-rank structure doesnât generalize (overfitting). EVR is used in: dimensionality reduction (feature extraction for downstream tasks), visualization (2D/3D projections), denoising (truncate small singular values), compression (image/video encoding), and anomaly detection (high MSE indicates outliers).
Numerical Gotchas and Verification. - Shape discipline: $X \in \mathbb{R}^{60 \times 2}$, $X_c \in \mathbb{R}^{60 \times 2}$, $U \in \mathbb{R}^{60 \times 2}$, $S \in \mathbb{R}^2$, $V^\top \in \mathbb{R}^{2 \times 2}$. np.outer(u, v) with $u \in \mathbb{R}^{60}, v \in \mathbb{R}^2$ yields $(60, 2)$. - Centering is mandatory: Without centering, SVD captures direction toward the mean (not variance structure). Always compute $X_c = X - \mu$ before SVD for PCA. - Singular values â eigenvalues of covariance: Singular values of $X_c$ relate to covariance eigenvalues via $\lambda_i = \frac{\sigma_i^2}{n-1}$ (variance per component). EVR uses $\sigma_i^2$ directly (sum of squared centered values); dividing by $n-1$ shifts the baseline but doesnât change the ratio. - Rank-1 reconstruction via outer product: np.outer(U[:,0], Vt[0]) * S[0] is equivalent to U[:, [0]] @ (S[0:1, np.newaxis] * Vt[0:1, :]) but clearer. Always verify shape: outer product yields $(60, 2)$. - MSE vs. Frobenius norm: The computed MSE is equivalent to np.linalg.norm(X - X1, 'fro')**2 / 60 (squared Frobenius norm divided by $n$). For comparison with variance, note that Frobenius norm sums squared elements; dividing by $n$ gives average squared error per example. - 2D data edge case: For 2D data, EVR_2 is always 1.0 (both components explain all variance). The codeâs print statement âshould be ~1â reflects this special case. For higher-dimensional data, EVR_2 would be less than 1.
Verification checks:
# Verify reconstruction is rank-1
rank_Xc1 = np.linalg.matrix_rank(Xc1)
print(f"Rank of rank-1 approximation: {rank_Xc1}") # Should be 1
# Verify EVR relationship to singular values
total_variance = np.sum(S**2)
print(f"Total variance (sum S^2): {total_variance:.6f}")
print(f"EVR1 + (1 - EVR1): {evr1 + (1 - evr1):.10f}") # Should be 1.0
# Verify MSE via Frobenius norm
mse1_alt = np.linalg.norm(X - X1, 'fro')**2 / 60
print(f"MSE (direct): {mse1:.6f}, MSE (via Frobenius): {mse1_alt:.6f}")
# Verify reconstruction error relates to lost variance
lost_variance = S[1]**2 # Variance in 2nd component
mse_from_variance = lost_variance / 60
print(f"MSE: {mse1:.6f}, Lost variance / n: {mse_from_variance:.6f}")
Linear Algebra Theory Connection. The condition number bounds relative error: for perturbation $\Delta b$ in $Ax = b$, $\frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\Delta b\|}{\|b\|}$. For PCA, the SVD $X_c = U \Sigma V^\top$ reveals conditioning via singular value ratio. The EckartâYoung theorem states that for any matrix $A$, the best rank-$k$ approximation in Frobenius norm is $A_k = \sum_{i=1}^k \sigma_i u_i v_i^\top = U_k \Sigma_k V_k^\top$. The approximation error is $\|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2$. For 2D data with $k=1$, the error is exactly $\sigma_2^2$ (the missing singular value squared). Explained variance decomposition: total variance in centered data is $\text{Var}(X_c) = \frac{1}{n-1} \sum_{i=1}^d \sigma_i^2$. Principal components are ordered by decreasing variance: $\text{Var}(PC_1) \ge \text{Var}(PC_2) \ge \ldots$. The EVR measures what fraction of total variance is captured: $\text{EVR}_k = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^d \sigma_i^2}$.
Pedagogical Power. This example is the canonical demonstration of PCA as dimensionality reduction. It distills PCA into three core insightsâcentering, SVD computation, and rank-$k$ reconstructionâshowing how linear algebra enables dimensionality reduction. The EVR and MSE metrics concretely measure the trade-off: higher EVR means better reconstruction (lower MSE), but keeping more components increases computational cost. This pattern repeats throughout ML: compression via low-rank structure, guided by variance/energy measures. Understanding this one relationship deepens intuition for matrix factorization, autoencoders, and modern deep learning.
Comments