Principal Component Analysis (PCA), introduced by Pearson (1901) and Hotelling (1933), finds orthogonal directions in data space ordered by variance. The first principal component points in the direction of maximum variance; the second is orthogonal to the first and captures maximum remaining variance, and so on. This provides a natural coordinate system for data exploration and compression.
In modern ML, PCA serves as a preprocessing step to reduce input dimensionality (e.g., from thousands of pixels to dozens of components), a feature extraction method (learning interpretable axes in gene expression or text data), and a visualization tool (projecting high-dimensional data onto 2D/3D for plotting). The explained variance ratio quantifies how much of the total variance each component retains, guiding the choice of $k$ components: if the first $k$ components explain 95% of the variance, downstream models trained on the $k$-dimensional projections see data that omits only 5% of the original variance. SVD of the centered data matrix provides a numerically stable way to compute PCA without forming the covariance matrix explicitly.
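As a minimal sketch of the SVD route, the snippet below centers a synthetic data matrix, takes its thin SVD, and compares the resulting per-component variances with the eigenvalues of an explicitly formed covariance matrix. The data, the feature scales, and all variable names are assumptions chosen for illustration. The covariance route appears only as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 samples, 5 features
# with deliberately unequal scales so the components are well separated.
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

# Center each feature; PCA is defined on mean-centered data.
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Thin SVD: Xc = U @ diag(S) @ Vt. The rows of Vt are the principal
# directions, and S is returned in descending order.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
component_variances = S**2 / (n - 1)

# Cross-check only: eigenvalues of the explicitly formed covariance matrix.
cov = Xc.T @ Xc / (n - 1)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order

print(np.allclose(component_variances, eigvals))
```

The SVD route never squares the data matrix. This is the usual argument for its numerical stability: forming $X_c^\top X_c$ squares the condition number.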
The explained variance ratio (EVR) computation evr1 = (S[0]**2) / np.sum(S**2) quantifies what fraction of total variance the first component captures. Since the sample covariance of centered data is $\Sigma = \frac{1}{n-1} X_c^\top X_c$, its eigenvalues are $\lambda_i = \sigma_i^2 / (n-1)$, where $\sigma_i$ are the singular values of $X_c$. The total variance is $\text{tr}(\Sigma) = \sum_i \lambda_i \propto \sum_i \sigma_i^2$, so the factor $1/(n-1)$ cancels and the ratio $\sigma_1^2 / \sum_i \sigma_i^2$ gives the proportion of variance along the first principal direction. A value near 1.0 indicates the data is approximately 1-dimensional (all variance concentrated in one direction); a value near 0.5 for 2D data suggests roughly equal variance along both axes. This ratio guides dimensionality reduction decisions: if evr1 = 0.95, retaining only pc1 preserves 95% of the dataset's variance.
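The two regimes described above can be checked on synthetic 2D data. The following sketch assumes a hypothetical helper, `explained_variance_ratio`, and two made-up datasets: points scattered tightly around a line, and an isotropic Gaussian blob. It is meant as an illustration of the formula, not a reference implementation.

```python
import numpy as np

def explained_variance_ratio(X):
    """Per-component EVR computed from the singular values of centered X."""
    Xc = X - X.mean(axis=0)
    S = np.linalg.svd(Xc, compute_uv=False)  # singular values only
    return S**2 / np.sum(S**2)

rng = np.random.default_rng(0)

# Nearly 1-dimensional data: points along a line plus small isotropic noise.
t = rng.normal(size=500)
line = np.column_stack([t, 2.0 * t]) + 0.05 * rng.normal(size=(500, 2))

# Isotropic data: the same variance in every direction.
blob = rng.normal(size=(500, 2))

evr_line = explained_variance_ratio(line)
evr_blob = explained_variance_ratio(blob)

print("line evr1:", evr_line[0])  # expected to be close to 1
print("blob evr1:", evr_blob[0])  # expected to be slightly above 0.5
```

For 2D data the first ratio can never fall below 0.5, because the singular values are sorted in descending order. Sampling noise therefore pushes the isotropic case slightly above 0.5 rather than below it.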
Connection to ML workflows: PCA appears throughout machine learning as a preprocessing step (reducing input dimensionality before feeding to classifiers), a feature extraction method (learning interpretable directions in high-dimensional data), and a denoising technique (truncating small singular values to remove noise). The explained variance ratio helps practitioners choose the number of components $k$: plot EVR for each component (the "scree plot"), select the elbow point where additional components add little variance, and project data onto the top-$k$ principal components via $Z = X_c V_k$, where $V_k \in \mathbb{R}^{d \times k}$ contains the first $k$ columns of $V$. Understanding this relationship between singular values and variance also clarifies why poorly conditioned data (small $\sigma_{\min}$) causes numerical instability: the condition number $\kappa = \sigma_{\max} / \sigma_{\min}$ quantifies how sensitive inversions and projections are to perturbations. This is why regularization (adding $\lambda I$ to the covariance matrix, with a regularization strength $\lambda > 0$ that is distinct from the eigenvalues $\lambda_i$) stabilizes ill-conditioned problems: it shifts every eigenvalue up by $\lambda$ while leaving the eigenvectors unchanged, which bounds the smallest eigenvalue away from zero.
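The workflow above can be sketched end to end. The snippet picks $k$ from the cumulative EVR, projects via $Z = X_c V_k$, and compares the covariance condition number before and after adding $\lambda I$. The cumulative-threshold rule stands in for reading the elbow off a scree plot. The synthetic data, the 0.95 threshold, and the value of `reg` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 300 samples, 6 features with decaying scales;
# the last feature is nearly constant, which makes the covariance
# ill-conditioned.
scales = np.array([4.0, 2.0, 0.5, 0.3, 0.1, 1e-4])
X = rng.normal(size=(300, 6)) * scales
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
evr = S**2 / np.sum(S**2)
cum_evr = np.cumsum(evr)

# Smallest k whose cumulative EVR reaches the threshold
# (0.95 is an example choice, not a universal rule).
threshold = 0.95
k = int(np.searchsorted(cum_evr, threshold)) + 1

# Project onto the top-k principal directions: Z = Xc @ V_k.
V_k = Vt[:k].T   # shape (d, k): first k columns of V
Z = Xc @ V_k     # shape (n, k): the reduced representation

# Condition number of the covariance, before and after adding reg * I.
eigvals = S**2 / (n - 1)
reg = 1e-3  # hypothetical regularization strength (lambda in the text)
kappa_cov = eigvals[0] / eigvals[-1]
kappa_cov_reg = (eigvals[0] + reg) / (eigvals[-1] + reg)

print("k =", k, "Z shape:", Z.shape)
print("condition number:", kappa_cov, "->", kappa_cov_reg)
```

Because adding `reg * I` only shifts the eigenvalues, the principal directions in `Vt` are the same with or without regularization. What changes is how safely the covariance can be inverted, for example when whitening the projected data.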