Chapter 18 — Representation Learning as Optimization Geometry

Overview

Purpose of the Chapter

This chapter establishes representation learning as a fundamentally geometric phenomenon arising from optimization dynamics. Rather than treating learned representations as opaque feature extractors, we demonstrate how the structure of latent spaces emerges directly from the interplay between loss landscapes, gradient flows, and architectural constraints. The geometric properties of representations—their distances, angles, curvatures, and topological structures—are not arbitrary artifacts but necessary consequences of the optimization process that produces them.

We develop a framework for understanding how different training objectives sculpt different geometries in representation space, how these geometries encode inductive biases about the data, and how geometric degeneracies like dimensional collapse arise from optimization failures. This perspective unifies seemingly disparate representation learning methods—autoencoders, contrastive learning, metric learning, and generative models—under a common geometric and optimization-theoretic foundation.

The mathematical machinery we develop reveals that representation learning is not merely about extracting good features, but about constructing coordinate systems in which downstream tasks become geometrically simple. A successful representation transforms complex, entangled data distributions into spaces where linear separators, nearest neighbors, or simple interpolations suffice for prediction and generation.

Concrete ML Applications

Contrastive Embedding Geometry for Retrieval Systems

  1. Concept summary: contrastive training is useful when positive pairs become measurably closer than negatives in embedding space.
  2. Problem statement: decide whether a query-product pair will rank ahead of a hard negative in a retrieval index.
  3. Problem setup: We have normalized embeddings for a query, its relevant product, and a hard negative product. The retrieval system ranks by cosine similarity, so we compare query-to-positive similarity against query-to-negative similarity to see whether the learned geometry separates them.
  4. Explicit values: query embedding \(q=[0.8,0.6]^\top\), positive item \(p=[0.6,0.8]^\top\), hard negative \(n=[0.96,0.28]^\top\), all unit normalized.
  5. Formula with symbols defined: cosine similarity \(s(a,b)=a^\top b\) for unit vectors \(a,b\), where larger \(s\) means smaller angle and higher retrieval rank.
  6. Plug-in step: \(s(q,p)=0.8(0.6)+0.6(0.8)\), \(s(q,n)=0.8(0.96)+0.6(0.28)\).
  7. Computed result: \(s(q,p)=0.96\) and \(s(q,n)=0.936\), so the margin is \(0.024\).
  8. Decision / interpretation: the positive item ranks above the hard negative, but only by a thin margin, indicating retrieval works yet remains fragile for this query neighborhood.
  9. Sensitivity check: if the negative embedding shifts to \([0.98,0.20]^\top\), then \(s(q,n')=0.8(0.98)+0.6(0.20)=0.904\); the margin widens, showing how small geometric changes can improve ranking robustness.
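
The margin computation in this example can be sketched in a few lines of Python. This is a minimal illustration rather than a retrieval implementation; the helper names `cosine_similarity` and `ranking_margin` are assumptions introduced here, and the vectors are the ones given in the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def ranking_margin(query, positive, negative):
    """Similarity gap s(q, p) - s(q, n); a positive value means p outranks n."""
    return cosine_similarity(query, positive) - cosine_similarity(query, negative)

# Values from the worked example (all unit normalized).
q, p, n = [0.8, 0.6], [0.6, 0.8], [0.96, 0.28]
margin = ranking_margin(q, p, n)
print(round(margin, 3))  # 0.024
```

Because the helper renormalizes its inputs, a vector that is only approximately unit length gives a value slightly different from the raw dot product used in the hand calculation.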

Whitening and Isotropy for Stable Downstream Fine-Tuning

  1. Concept summary: whitening reduces anisotropy so downstream heads do not overreact to a few dominant feature directions.
  2. Problem statement: measure whether a representation is too anisotropic for stable fine-tuning.
  3. Problem setup: The team inspects the covariance spectrum of frozen embeddings before attaching a task head. A common isotropy diagnostic is the covariance condition number: large values mean one direction dominates variance and can destabilize gradient updates.
  4. Explicit values: largest covariance eigenvalue \(\lambda_{\max}=9.0\), smallest retained eigenvalue \(\lambda_{\min}=1.5\), isotropy target \(\kappa_{\max}=3.0\).
  5. Formula with symbols defined: condition number \(\kappa=\lambda_{\max}/\lambda_{\min}\), where \(\lambda_{\max}\) and \(\lambda_{\min}\) are the largest and smallest retained covariance eigenvalues.
  6. Plug-in step: \(\kappa=9.0/1.5\).
  7. Computed result: \(\kappa=6.0\).
  8. Decision / interpretation: since \(6.0 > 3.0\), the embedding is too anisotropic and should be whitened or regularized before fine-tuning.
  9. Sensitivity check: if covariance regularization raises \(\lambda_{\min}\) to \(3.0\) with \(\lambda_{\max}\) fixed, then \(\kappa=9.0/3.0=3.0\), meeting the isotropy target.
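
The condition-number diagnostic can be sketched with NumPy as follows. The function name `condition_number`, the cutoff `min_eig` for "retained" eigenvalues, and the middle eigenvalue 4.0 in the toy covariance are assumptions made for illustration; only \(\lambda_{\max}=9.0\), \(\lambda_{\min}=1.5\), and the target 3.0 come from the example.

```python
import numpy as np

def condition_number(cov, min_eig=1e-12):
    """Ratio of largest to smallest retained eigenvalue of a covariance matrix."""
    eigvals = np.linalg.eigvalsh(cov)        # ascending order for symmetric input
    retained = eigvals[eigvals > min_eig]    # drop numerically zero directions
    return float(retained[-1] / retained[0])

# Toy covariance whose extreme eigenvalues match the worked example.
cov = np.diag([9.0, 4.0, 1.5])
kappa = condition_number(cov)
too_anisotropic = kappa > 3.0                # compare against the isotropy target
print(kappa, too_anisotropic)                # 6.0 True
```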

Linear Probing as a Representation Diagnostics Tool

  1. Concept summary: a linear probe estimates how much task information is already linearly organized in frozen embeddings.
  2. Problem statement: decide whether checkpoint B learned a meaningfully better representation than checkpoint A.
  3. Problem setup: Both checkpoints are frozen and evaluated with the same linear probe on the downstream validation set. Because the probe architecture and data are fixed, any accuracy gain can be attributed to better geometric organization in the representation itself.
  4. Explicit values: checkpoint A probe accuracy \(a_A=72\%\), checkpoint B probe accuracy \(a_B=81\%\), minimum meaningful gain threshold \(\Delta a_{\min}=5\) percentage points.
  5. Formula with symbols defined: probe gain \(\Delta a=a_B-a_A\), where \(a_A\) and \(a_B\) are validation accuracies of the same linear probe on two checkpoints.
  6. Plug-in step: \(\Delta a=81\%-72\%\).
  7. Computed result: \(\Delta a=9\) percentage points.
  8. Decision / interpretation: since \(9 > 5\), checkpoint B has materially better linearly decodable structure and is the stronger representation checkpoint.
  9. Sensitivity check: if checkpoint B were only \(75\%\), then \(\Delta a=3\) points, below threshold, suggesting end-to-end gains would likely come from optimization rather than better representation geometry.

Manifold-Aware Augmentation for Robust Latent Structure

  1. Concept summary: good augmentations keep semantically identical samples close on the learned manifold while preserving class separation.
  2. Problem statement: determine whether an augmentation policy is too strong for a vision representation model.
  3. Problem setup: The team measures neighborhood consistency by comparing the average embedding distance between two augmented views of the same image against the average distance to same-class neighbors. If the augmented-view distance becomes too large relative to same-class structure, the augmentation policy is warping the manifold.
  4. Explicit values: average same-image augmented distance \(d_{\text{aug}}=0.42\), average same-class neighbor distance \(d_{\text{cls}}=0.50\), acceptable ratio threshold \(r_{\max}=0.80\).
  5. Formula with symbols defined: augmentation ratio \(r=d_{\text{aug}}/d_{\text{cls}}\), where \(d_{\text{aug}}\) is same-image augmented distance and \(d_{\text{cls}}\) is typical same-class neighbor distance.
  6. Plug-in step: \(r=0.42/0.50\).
  7. Computed result: \(r=0.84\).
  8. Decision / interpretation: since \(0.84 > 0.80\), the augmentation policy is slightly too strong and should be relaxed to avoid manifold distortion.
  9. Sensitivity check: if weaker crops reduce \(d_{\text{aug}}\) to \(0.35\), then \(r=0.35/0.50=0.70\), which falls inside the acceptable range and preserves latent geometry better.
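
One way the augmentation ratio might be computed from embedding pairs is sketched below. The helper names and the single-pair toy inputs are hypothetical; they are chosen only so that the averages reproduce \(d_{\text{aug}}=0.42\) and \(d_{\text{cls}}=0.50\) from the example.

```python
import math

def mean_pair_distance(pairs):
    """Average Euclidean distance over (embedding_a, embedding_b) pairs."""
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def augmentation_ratio(aug_pairs, class_pairs):
    """r = d_aug / d_cls, comparing augmented-view spread to same-class spread."""
    return mean_pair_distance(aug_pairs) / mean_pair_distance(class_pairs)

def policy_too_strong(ratio, r_max=0.80):
    """True when the augmentation ratio exceeds the acceptable threshold."""
    return ratio > r_max

# Toy pairs (assumed) reproducing the distances in the worked example.
aug_pairs = [((0.0, 0.0), (0.42, 0.0))]
class_pairs = [((0.0, 0.0), (0.50, 0.0))]
r = augmentation_ratio(aug_pairs, class_pairs)
print(round(r, 2), policy_too_strong(r))  # 0.84 True
```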

Conceptual Scope

We examine representation learning through three interconnected lenses: geometric structure, optimization dynamics, and architectural constraints. The geometric lens reveals how representations form manifolds in high-dimensional spaces, how similarity metrics emerge from training objectives, and how invariances manifest as geometric symmetries. The optimization lens shows how gradient descent biases representations toward particular geometric configurations, how different loss functions induce different curvatures, and how training instabilities manifest as geometric pathologies.

The architectural lens demonstrates how network structure—depth, width, nonlinearities, and connectivity patterns—constrains the geometry of learnable representations. We analyze how bottleneck layers force dimensional reduction, how skip connections preserve geometric information across layers, and how normalization operations reshape the geometry of representation spaces.

Throughout, we maintain focus on the mathematical mechanisms that connect optimization choices to geometric outcomes. We avoid purely empirical observations about representation quality, instead deriving principles that predict how changes in training procedures will alter geometric properties. This predictive framework enables principled design of representation learning systems rather than ad-hoc trial-and-error.

Questions This Chapter Answers

How do optimization objectives determine representation geometry? We show that each loss function induces a geometric structure on representation space—contrastive losses create angular geometries with clustering, reconstruction losses induce metric spaces that preserve distances, and adversarial losses generate manifolds that match distributional shapes. The choice of objective is not merely a performance decision but a geometric specification.

Why do representations collapse and how can collapse be prevented? Dimensional collapse—where representations occupy a lower-dimensional subspace than the ambient space—arises when optimization finds degenerate solutions that minimize loss without utilizing available capacity. We characterize the geometric conditions that enable collapse and derive regularization strategies that maintain geometric richness.

What geometric properties make representations useful for downstream tasks? Effective representations exhibit specific geometric characteristics: sufficient dimensionality to capture data variance, appropriate distance metrics that reflect task-relevant similarities, and structured manifolds that enable interpolation and extrapolation. We formalize these properties and show how training objectives can be designed to induce them.

How do architectural choices constrain representation geometry? Network architecture imposes hard and soft constraints on representational geometry. Layer dimensions bound intrinsic dimensionality, nonlinear activations determine local curvatures, and connectivity patterns control information flow. We analyze how these architectural elements combine to shape the space of learnable representations.

Can we predict representation quality from optimization trajectories? We develop diagnostic tools based on geometric measurements—spectral properties, local curvatures, distance distributions—that predict whether a representation will generalize well before evaluating on downstream tasks. These tools reveal optimization failures early and guide interventions.

How This Chapter Fits Into the Full Book

This chapter synthesizes ideas from multiple earlier chapters. The manifold theory from Chapter 14 provides the geometric foundation for understanding representation spaces as embedded manifolds with intrinsic structure. The optimization theory from Chapters 7-12 explains how gradient descent navigates loss landscapes to find these representations. The symmetry and invariance concepts from Chapter 16 characterize what makes representations stable under transformations.

We extend these foundations by analyzing how optimization and geometry interact specifically in the context of learning representations rather than merely optimizing functions. While earlier chapters treated the parameter space as the primary object of study, here we focus on the activation space—the space of learned features—and show how optimization in parameter space induces structure in activation space.

The chapter prepares for later discussions of generative models (Chapter 19), where representations become probabilistic distributions, and neural architecture search (Chapter 21), where architectural constraints on representation geometry become design parameters. The geometric understanding developed here provides the language for analyzing how different architectures encode different inductive biases about the structure of data.

Definitions

Representation Map

  1. Definition: A representation map is a measurable function \(f_\theta: \mathcal{X} \to \mathcal{Z}\) from an input space \(\mathcal{X}\) to a representation space \(\mathcal{Z}\), parametrized by \(\theta \in \Theta\), typically learned through optimization of a task-specific objective \(\mathcal{L}(f_\theta, \mathcal{D})\) over a dataset \(\mathcal{D}\).
  2. Notation: We write \(\mathbf{z} = f_\theta(\mathbf{x})\) for the representation of input \(\mathbf{x}\). When multiple layers are involved, we denote \(f_\theta = f^{(L)} \circ f^{(L-1)} \circ \cdots \circ f^{(1)}\) where \(f^{(\ell)}\) is the \(\ell\)-th layer transformation. The Jacobian \(J_{f_\theta}(\mathbf{x}) = \frac{\partial f_\theta}{\partial \mathbf{x}}\) describes local geometric distortion.
  3. Valid Example: A convolutional encoder \(f_\theta: \mathbb{R}^{3 \times 224 \times 224} \to \mathbb{R}^{512}\) mapping ImageNet images to 512-dimensional feature vectors. After training on classification, images of the same class map to nearby points in \(\mathbb{R}^{512}\) under Euclidean distance, with typical intra-class distance \(\approx 0.3\) and inter-class distance \(\approx 1.2\).
  4. Failure Case: A poorly initialized deep network can produce nearly constant outputs \(\mathbf{z} \approx \mathbf{c}\) for all inputs at initialization \(\theta_0\). This degeneracy manifests as \(\text{Var}(\mathbf{z}) \approx 0\), providing no discriminative information. It originates in the forward pass, where activations shrink or saturate layer by layer; the same poor conditioning produces vanishing gradients, which then prevent training from escaping the degenerate map.
  5. Explicit ML Relevance: Representation maps are the core abstraction in transfer learning, self-supervised learning, and metric learning. Pre-training learns \(f_\theta\) on a source task, then \(\theta\) is fine-tuned or frozen for target tasks. The success of transfer depends on whether the source task geometry generalizes to the target task geometry.
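
To make the Jacobian \(J_{f_\theta}(\mathbf{x})\) concrete, the sketch below estimates it by central differences for a small two-layer map and compares it with the analytic form. The encoder `tiny_encoder`, its weight shapes, and the step size are assumptions chosen for illustration, not a model discussed in the chapter.

```python
import numpy as np

def tiny_encoder(x, W1, W2):
    """Hypothetical two-layer map f = f2 after f1, with a tanh nonlinearity."""
    return W2 @ np.tanh(W1 @ x)

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference estimate of J_f(x), shape (dim_z, dim_x)."""
    x = np.asarray(x, dtype=float)
    cols = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        cols.append((f(x + step) - f(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

J = numerical_jacobian(lambda v: tiny_encoder(v, W1, W2), x)
# Analytic Jacobian for comparison: W2 diag(1 - tanh^2(W1 x)) W1
J_exact = W2 @ np.diag(1 - np.tanh(W1 @ x) ** 2) @ W1
# Singular values give the local stretch factors of the map at x.
singular_values = np.linalg.svd(J, compute_uv=False)
```

The singular values of `J` quantify the local geometric distortion: directions with small singular values are compressed by the map, and directions with large ones are stretched.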

Latent Space

  1. Definition: The latent space \(\mathcal{Z}\) is the codomain of a representation map \(f_\theta: \mathcal{X} \to \mathcal{Z}\), equipped with a metric \(d_\mathcal{Z}\) or probability measure \(\mu_\mathcal{Z}\), serving as the domain where learned features reside. The effective latent space is the image \(f_\theta(\mathcal{X}) \subseteq \mathcal{Z}\), typically a lower-dimensional manifold.
  2. Notation: We denote points in latent space as \(\mathbf{z} \in \mathcal{Z}\). The latent manifold is \(\mathcal{M}_\mathcal{Z} = \overline{f_\theta(\mathcal{X})}\) (closure of the image). We write \(d_{\text{eff}} = \text{rank}(\text{Cov}(\mathbf{z}))\) for the effective dimension based on covariance rank.
  3. Valid Example: A variational autoencoder for CelebA faces uses \(\mathcal{Z} = \mathbb{R}^{256}\) with Gaussian prior \(\mu_\mathcal{Z} = \mathcal{N}(0, I)\). The effective latent manifold has \(d_{\text{eff}} \approx 50\), capturing variations like pose, expression, lighting, and identity. Interpolating between two face encodings \(\mathbf{z}_1\) and \(\mathbf{z}_2\) produces a smooth morph through intermediate faces, validating the geometric structure.
  4. Failure Case: An autoencoder without regularization may learn a latent space with holes—regions where the decoder \(g_\phi(\mathbf{z})\) produces invalid outputs. If training data clusters in disconnected regions of \(\mathcal{Z}\), interpolating between clusters passes through unexplored territory, generating artifacts. This failure indicates \(\mathcal{M}_\mathcal{Z}\) is not convex or smoothly connected.
  5. Explicit ML Relevance: Latent space geometry determines generative model quality. GANs sample from a latent prior \(\mathbf{z} \sim \mathcal{N}(0,I)\) and generate \(\mathbf{x} = G(\mathbf{z})\). If \(G\) maps most of the latent space onto a few output modes (mode collapse), parts of the data distribution are never generated and diversity is lost. Regularization techniques (VAE’s KL term, normalizing flows) shape latent geometry to be well-behaved.

Embedding Function

  1. Definition: An embedding function is an injective (or approximately injective) map \(\phi: \mathcal{X} \to \mathcal{Z}\) that preserves specified geometric properties of \(\mathcal{X}\) in the embedded space \(\mathcal{Z}\). Formally, there exists a tolerance \(\epsilon > 0\) such that for all \(\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X}\): \[ (1-\epsilon) d_\mathcal{X}(\mathbf{x}_1, \mathbf{x}_2) \leq d_\mathcal{Z}(\phi(\mathbf{x}_1), \phi(\mathbf{x}_2)) \leq (1+\epsilon) d_\mathcal{X}(\mathbf{x}_1, \mathbf{x}_2) \] for chosen metrics \(d_\mathcal{X}\) and \(d_\mathcal{Z}\).
  2. Notation: We distinguish embeddings \(\phi\) (emphasizing geometric preservation) from general representations \(f_\theta\) (emphasizing task optimization). The distortion is \(\epsilon = \max_{\mathbf{x}_1 \neq \mathbf{x}_2} |d_\mathcal{Z}(\phi(\mathbf{x}_1), \phi(\mathbf{x}_2)) / d_\mathcal{X}(\mathbf{x}_1, \mathbf{x}_2) - 1|\), taken over distinct pairs so that the ratio is well defined.
  3. Valid Example: The Skip-Gram word2vec embedding \(\phi: V \to \mathbb{R}^{300}\) maps a vocabulary \(V\) of words to 300-dimensional vectors. Words with similar contexts (co-occurrence patterns) map to nearby vectors. The embedding preserves semantic distances: \(\|\phi(\text{"king"}) - \phi(\text{"queen"})\| \approx \|\phi(\text{"man"}) - \phi(\text{"woman"})\|\), reflecting analogical structure with distortion \(\epsilon \approx 0.2\).
  4. Failure Case: A random projection \(\phi(\mathbf{x}) = R\mathbf{x}\) with \(R \in \mathbb{R}^{d \times n}\), \(d \ll n\), can fail as an embedding when \(d\) is too small for the structure of the input space. For data on a curved manifold, an overly aggressive projection can fold distinct neighborhoods onto each other and distort local distances non-uniformly, producing \(\epsilon \geq 1\). This violates the preservation requirement and loses geometric fidelity.
  5. Explicit ML Relevance: Metric learning trains embeddings for verification and retrieval. Face recognition systems learn \(\phi\) such that \(d_\mathcal{Z}(\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)) < \tau\) when faces \(\mathbf{x}_i, \mathbf{x}_j\) depict the same person and \(> \tau\) otherwise. The threshold \(\tau\) and margin in triplet loss directly control embedding distortion \(\epsilon\).
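
The distortion \(\epsilon\) can be measured directly on a finite sample, as in the following sketch. It assumes Euclidean metrics on both sides; the function `distortion`, the four sample points, and the coordinate-dropping map `drop_last` are hypothetical choices made to show a case where \(\epsilon\) reaches 1.

```python
import itertools
import math

def distortion(points, embed):
    """Worst-case relative distance error |d_Z / d_X - 1| over distinct pairs."""
    worst = 0.0
    for x1, x2 in itertools.combinations(points, 2):
        d_x = math.dist(x1, x2)
        if d_x == 0.0:
            continue  # the ratio is undefined for coincident points
        d_z = math.dist(embed(x1), embed(x2))
        worst = max(worst, abs(d_z / d_x - 1.0))
    return worst

points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

def drop_last(x):
    """Hypothetical projection from R^3 to R^2 that discards the last coordinate."""
    return x[:2]

eps_projection = distortion(points, drop_last)     # two distinct points coincide
eps_identity = distortion(points, lambda x: x)     # isometry, no distortion
print(eps_projection, eps_identity)                # 1.0 0.0
```

The projection maps the origin and the third basis point to the same location, so one embedded distance is zero and the distortion equals 1, the regime described in the failure case.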

Invariance

  1. Definition: A representation map \(f_\theta: \mathcal{X} \to \mathcal{Z}\) is invariant to a transformation group \(G\) acting on \(\mathcal{X}\) if for all \(\mathbf{x} \in \mathcal{X}\) and all \(g \in G\): \[ f_\theta(g \cdot \mathbf{x}) = f_\theta(\mathbf{x}) \] where \(g \cdot \mathbf{x}\) denotes the action of \(g\) on \(\mathbf{x}\). Approximate invariance holds when \(\|f_\theta(g \cdot \mathbf{x}) - f_\theta(\mathbf{x})\| \leq \delta\) for small \(\delta > 0\).
  2. Notation: We write \(f_\theta \in \text{Inv}_G(\mathcal{X}, \mathcal{Z})\) to indicate \(f_\theta\) is \(G\)-invariant. The orbit of \(\mathbf{x}\) under \(G\) is \(\mathcal{O}_\mathbf{x} = \{g \cdot \mathbf{x} : g \in G\}\), and invariance means \(f_\theta\) is constant on each orbit.
  3. Valid Example: A classification network trained with rotational data augmentation learns approximate rotation invariance: \(f_\theta(R_\alpha \cdot \mathbf{x}) \approx f_\theta(\mathbf{x})\) for rotation \(R_\alpha\) by angle \(\alpha \in [0, 2\pi)\). For ImageNet models, empirical measurements show \(\|f_\theta(R_\alpha \cdot \mathbf{x}) - f_\theta(\mathbf{x})\|/\|f_\theta(\mathbf{x})\| < 0.05\) for moderate rotations \(|\alpha| < 15°\).
  4. Failure Case: A model trained without augmentation lacks invariance. Applying \(g \in G\) produces \(\|f_\theta(g \cdot \mathbf{x}) - f_\theta(\mathbf{x})\| / \|f_\theta(\mathbf{x})\| \geq 0.5\), indicating sensitivity to irrelevant variations. This manifests as poor generalization: test accuracy drops when inputs undergo transformations absent during training (e.g., rotated test images when training had no rotation).
  5. Explicit ML Relevance: Data augmentation implicitly enforces invariances. SimCLR pulls together representations of augmented pairs \((g \cdot \mathbf{x}, g' \cdot \mathbf{x})\), encouraging invariance to \(G = \{\text{crops, color jitters, flips}\}\). Stronger augmentations enforce stronger invariances but risk losing task-relevant information if \(G\) is too large.
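
The relative deviation used in the examples above, \(\|f_\theta(g \cdot \mathbf{x}) - f_\theta(\mathbf{x})\| / \|f_\theta(\mathbf{x})\|\), can be computed as in this sketch. The two feature maps and the reversal transformation are toy stand-ins assumed for illustration; they are not the networks or augmentations discussed in the text.

```python
import numpy as np

def invariance_error(f, x, transform):
    """Relative change ||f(g.x) - f(x)|| / ||f(x)|| for one transformation g."""
    z, z_g = f(x), f(transform(x))
    return float(np.linalg.norm(z_g - z) / np.linalg.norm(z))

def sorted_features(x):
    """Toy feature map that is invariant to any permutation of the input."""
    return np.sort(x)

def raw_features(x):
    """Toy feature map with no built-in invariance."""
    return x

def flip(x):
    """Transformation g: reverse the order of the signal."""
    return x[::-1]

x = np.array([3.0, 1.0, 2.0])
err_invariant = invariance_error(sorted_features, x, flip)  # exactly 0.0
err_raw = invariance_error(raw_features, x, flip)           # clearly nonzero
```

In practice the same measurement would be averaged over many inputs and sampled transformations to estimate the tolerance \(\delta\) of approximate invariance.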

Equivariance

  1. Definition: A representation map \(f_\theta: \mathcal{X} \to \mathcal{Z}\) is equivariant to a transformation group \(G\) if there exists a group homomorphism \(\rho: G \to \text{Aut}(\mathcal{Z})\) such that for all \(\mathbf{x} \in \mathcal{X}\) and \(g \in G\): \[ f_\theta(g \cdot \mathbf{x}) = \rho(g) \cdot f_\theta(\mathbf{x}) \] where \(\rho(g)\) is the induced action on \(\mathcal{Z}\).
  2. Notation: We write \(f_\theta \in \text{Equiv}_G(\mathcal{X}, \mathcal{Z})\) for \(G\)-equivariant maps. When \(\mathcal{Z} = \mathbb{R}^{d_1 \times \cdots \times d_k}\) has spatial structure, \(\rho(g)\) often acts by permuting coordinates or applying group actions to spatial indices.
  3. Valid Example: A convolutional layer \(f(\mathbf{x}) = \sigma(W * \mathbf{x} + \mathbf{b})\) is translation-equivariant. Translating input \(\mathbf{x}\) by vector \(\mathbf{t}\) yields \(f(T_\mathbf{t}(\mathbf{x})) = T_\mathbf{t}(f(\mathbf{x}))\) where \(T_\mathbf{t}\) denotes translation. Empirically, shifting an ImageNet image by 10 pixels shifts all feature maps by 10 pixels (accounting for stride and pooling).
  4. Failure Case: A fully connected layer \(f(\mathbf{x}) = \sigma(W\mathbf{x} + \mathbf{b})\) breaks translation equivariance. Translating input produces \(f(T_\mathbf{t}(\mathbf{x})) \neq T_\mathbf{t}(f(\mathbf{x}))\) because matrix multiplication treats each coordinate distinctly. This necessitates learning separate weights for translated versions, reducing efficiency and generalization.
  5. Explicit ML Relevance: Group-equivariant neural networks (G-CNNs) build equivariance architecturally. Rotation-equivariant networks preserve rotational structure: rotating input rotates feature maps by the same angle. This improves sample efficiency on tasks with rotational symmetry (medical imaging, astronomy) by encoding the symmetry as an architectural prior rather than learning it from data.
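
The contrast between the convolutional and fully connected layers can be checked numerically. The sketch below assumes a 1-D signal, stride 1, circular padding, and a ReLU nonlinearity, so that translation is an exact cyclic shift; the filter, weights, and shift are arbitrary illustrative values.

```python
import numpy as np

def circular_conv(x, w):
    """1-D circular cross-correlation: y[i] = sum_k w[k] * x[(i + k) mod n]."""
    return sum(w[k] * np.roll(x, -k) for k in range(len(w)))

def conv_layer(x, w, b=0.0):
    """Convolution followed by ReLU."""
    return np.maximum(circular_conv(x, w) + b, 0.0)

def dense_layer(x, W, b=0.0):
    """Fully connected layer followed by ReLU."""
    return np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=8)
w = np.array([0.5, -1.0, 0.25])
W = rng.normal(size=(8, 8))
shift = 2

# Equivariance test: f(T_t x) == T_t f(x), with T_t a cyclic shift.
conv_equivariant = np.allclose(conv_layer(np.roll(x, shift), w),
                               np.roll(conv_layer(x, w), shift))
dense_equivariant = np.allclose(dense_layer(np.roll(x, shift), W),
                                np.roll(dense_layer(x, W), shift))
print(conv_equivariant, dense_equivariant)  # True False
```

With zero padding or strides greater than one, the equality holds only approximately or only for shifts that are multiples of the stride.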

Feature Collapse

  1. Definition: Feature collapse occurs when the learned representation map \(f_\theta: \mathcal{X} \to \mathbb{R}^d\) produces outputs \(\mathbf{z} = f_\theta(\mathbf{x})\) that are constant for all inputs, formally: \[ \exists \mathbf{c} \in \mathbb{R}^d : f_\theta(\mathbf{x}) = \mathbf{c} \quad \forall \mathbf{x} \in \mathcal{X} \] or approximately, \(\text{Var}(\mathbf{z}) = \mathbb{E}[\|\mathbf{z} - \mathbb{E}[\mathbf{z}]\|^2] \to 0\) as training progresses.
  2. Notation: We quantify collapse by the collapse ratio \(\kappa = \text{Var}(\mathbf{z}) / \text{Var}(\mathbf{x}_{\text{proj}})\) where \(\mathbf{x}_{\text{proj}}\) is input projected to dimension \(d\). Values \(\kappa < 0.01\) indicate severe collapse.
  3. Valid Example: Training a Siamese network with positive pairs only (no negatives) leads to collapse: the network minimizes \(\|\mathbf{z}_1 - \mathbf{z}_2\|^2\) by mapping every input to the same constant, for example \(f_\theta(\mathbf{x}) = \mathbf{0}\) for all \(\mathbf{x}\). Empirically, \(\text{Var}(\mathbf{z}) < 10^{-6}\) within 100 iterations, and test accuracy equals random chance.
  4. Failure Case: Monitoring only loss values fails to detect collapse. A collapsed network achieves \(\mathcal{L} = 0\) on contrastive loss without negatives, appearing successful. Only by measuring \(\text{Var}(\mathbf{z})\) or downstream task performance is collapse revealed. This emphasizes the need for geometric diagnostics beyond scalar loss.
  5. Explicit ML Relevance: Self-supervised learning is vulnerable to feature collapse. SimCLR prevents collapse with large batches of negatives, BYOL relies on a momentum encoder and stop-gradients, and Barlow Twins drives the cross-correlation matrix between the two views' embeddings toward the identity. Each approach combats collapse through a different mechanism: repulsion from negatives that spreads representations (variance), feature decorrelation (cross-correlation), or architectural asymmetry (momentum).
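
The variance diagnostic \(\text{Var}(\mathbf{z}) = \mathbb{E}[\|\mathbf{z} - \mathbb{E}[\mathbf{z}]\|^2]\) is simple to monitor during training. The sketch below uses synthetic batches as stand-ins for encoder outputs; the batch size, dimension, and noise scale are assumptions.

```python
import numpy as np

def total_variance(z):
    """Var(z) = E[||z - E[z]||^2] for a batch of shape (n_samples, d)."""
    centered = z - z.mean(axis=0, keepdims=True)
    return float(np.mean(np.sum(centered ** 2, axis=1)))

rng = np.random.default_rng(0)
# Synthetic stand-ins for encoder outputs (assumed, not real model features).
healthy = rng.normal(size=(256, 16))
collapsed = np.ones((256, 16)) + 1e-4 * rng.normal(size=(256, 16))

var_healthy = total_variance(healthy)      # near d = 16 for unit-variance features
var_collapsed = total_variance(collapsed)  # on the order of 1e-7
```

Logging this value alongside the loss exposes collapse that the scalar loss alone would hide.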

Representation Collapse

  1. Definition: Representation collapse (distinct from feature collapse) occurs when learned representations \(\mathbf{z} = f_\theta(\mathbf{x}) \in \mathbb{R}^d\) occupy a lower-dimensional subspace \(\mathcal{S} \subset \mathbb{R}^d\) with \(\dim(\mathcal{S}) = k \ll d\). Formally, the feature covariance matrix \(\Sigma = \text{Cov}(\mathbf{z})\) has rank \(k < d\), with eigenvalues \(\lambda_1 \geq \cdots \geq \lambda_k > 0\) and \(\lambda_{k+1} = \cdots = \lambda_d = 0\).
  2. Notation: We define the effective rank \(\text{rank}_\epsilon(\Sigma) = |\{i : \lambda_i / \lambda_1 > \epsilon\}|\) for threshold \(\epsilon\) (typically \(10^{-3}\)). The participation ratio \(\text{PR} = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2\) quantifies dimension utilization, with \(\text{PR} \approx d\) indicating full utilization and \(\text{PR} \ll d\) indicating collapse.
  3. Valid Example: Training a 512-dimensional contrastive learning model without sufficient negatives produces collapse with \(\text{rank}_\epsilon(\Sigma) \approx 30\) and \(\text{PR} \lesssim 30\) (the participation ratio is bounded above by the rank, so it cannot meaningfully exceed the number of dominant eigenvalues). The top 30 eigenvalues capture 99.9% of variance while 482 dimensions are unused. Increasing batch size from 64 to 1024 (providing more negatives) raises the effective rank to 420 and \(\text{PR}\) to 380, making better use of capacity.
  4. Failure Case: A standard fully connected network trained on MNIST with no regularization may learn 784-dimensional representations with \(\text{rank}(\Sigma) \approx 15\), reflecting that MNIST digits intrinsically lie on a low-dimensional manifold. However, this is not a failure—it’s efficient learning. Failure occurs when test accuracy on complex tasks remains poor due to insufficient representation capacity, despite nominal high dimension.
  5. Explicit ML Relevance: Dimensional collapse limits transfer learning effectiveness. A pre-trained encoder with collapsed representations has low effective rank, limiting information available for downstream tasks. Techniques to prevent collapse include: variance regularization (VICReg), decorrelation objectives (Barlow Twins), hard negative mining (contrastive learning), and architectural designs that force dimension utilization (specific normalizations).
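
The effective rank and participation ratio defined above can be computed from a batch of features as sketched here. The synthetic data (a full-rank Gaussian batch and a batch confined to a 4-dimensional subspace of \(\mathbb{R}^{32}\)) are assumptions chosen so that the two diagnostics are easy to interpret.

```python
import numpy as np

def spectrum(z):
    """Eigenvalues of the feature covariance, sorted in descending order."""
    cov = np.cov(z, rowvar=False)
    return np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)

def effective_rank(eigvals, eps=1e-3):
    """Number of eigenvalues with lambda_i / lambda_1 > eps."""
    return int(np.sum(eigvals / eigvals[0] > eps))

def participation_ratio(eigvals):
    """PR = (sum lambda_i)^2 / sum lambda_i^2."""
    return float(eigvals.sum() ** 2 / np.sum(eigvals ** 2))

rng = np.random.default_rng(0)
full = rng.normal(size=(2000, 32))                 # uses all 32 dimensions
basis = rng.normal(size=(4, 32))
collapsed = rng.normal(size=(2000, 4)) @ basis     # confined to a 4-D subspace

rank_full = effective_rank(spectrum(full))
pr_full = participation_ratio(spectrum(full))
rank_collapsed = effective_rank(spectrum(collapsed))
pr_collapsed = participation_ratio(spectrum(collapsed))
```

For the collapsed batch the effective rank is 4 and the participation ratio falls below 4, since unequal eigenvalues reduce it further.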

Objective-Induced Geometry

  1. Definition: The objective-induced geometry on representation space \(\mathcal{Z}\) is the geometric structure (metric, curvature, topology) arising from a training objective \(\mathcal{L}: \mathcal{Z} \times \mathcal{Y} \to \mathbb{R}\). Specifically, the loss induces level sets \(\mathcal{L}_c = \{\mathbf{z} : \mathcal{L}(\mathbf{z}, y) = c\}\) and a gradient vector field \(\nabla_\mathbf{z} \mathcal{L}\) that shapes the representation geometry through optimization dynamics.
  2. Notation: We denote the induced metric by \(g_\mathcal{L}\), often derived from the loss Hessian: \(g_\mathcal{L}(\mathbf{z}) = \nabla^2_\mathbf{z} \mathcal{L}(\mathbf{z}, y)\), which defines a valid metric only where the Hessian is positive definite (for example, near a strict local minimum). The curvature of level sets determines clustering behavior. Geodesics under \(g_\mathcal{L}\) are the shortest paths in representation space as measured by this metric.
  3. Valid Example: The cross-entropy loss \(\mathcal{L}_{\text{CE}}(\mathbf{z}, y) = -\log(\text{softmax}(W\mathbf{z})_y)\) for linear classifier \(W\) induces a geometry where same-class representations cluster in cones around optimal directions. The Hessian near convergence has eigenvalues proportional to class separability, producing high curvature (tight clustering) for well-separated classes and low curvature (diffuse clustering) for confusable classes.
  4. Failure Case: A poorly scaled loss \(\mathcal{L} = \alpha \|\mathbf{z} - \mathbf{t}\|^2\) with very large \(\alpha\) induces extreme curvature, causing representations to concentrate in infinitesimally small regions. This numerical instability manifests as gradients \(\|\nabla_\mathbf{z} \mathcal{L}\| > 10^6\), exploding updates, and training divergence. The induced geometry becomes degenerate.
  5. Explicit ML Relevance: Contrastive losses induce angular geometries (cosine similarity creates spherical structure), triplet losses induce metric geometries (Euclidean distance creates flat structure), and reconstruction losses induce manifold geometries (preserve local neighborhoods). Understanding these geometries enables principled loss design for specific representation requirements.
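
As a worked instance of the poorly scaled loss \(\mathcal{L} = \alpha \|\mathbf{z} - \mathbf{t}\|^2\) from the failure case, the following short derivation makes the role of \(\alpha\) explicit. It is a sketch that assumes plain gradient descent on \(\mathbf{z}\) with a fixed step size \(\eta\), which the text does not specify: \[ \nabla_\mathbf{z} \mathcal{L} = 2\alpha(\mathbf{z} - \mathbf{t}), \qquad \nabla^2_\mathbf{z} \mathcal{L} = 2\alpha I, \qquad \mathbf{z}_{k+1} - \mathbf{t} = (1 - 2\alpha\eta)(\mathbf{z}_k - \mathbf{t}). \] The induced metric is the Euclidean metric scaled by \(2\alpha\), and the iteration contracts only when \(|1 - 2\alpha\eta| < 1\), that is, when \(\eta < 1/\alpha\). For very large \(\alpha\), any fixed step size violates this bound and the updates diverge.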

Contrastive Objective

  1. Definition: A contrastive objective is a loss function \(\mathcal{L}_{\text{contrast}}\) that encourages representations of similar (positive) pairs to be nearby while representations of dissimilar (negative) pairs are distant. The canonical form for an anchor \(\mathbf{z}_i\) with positives \(\{\mathbf{z}_p^+\}\) and negatives \(\{\mathbf{z}_j^-\}\) is: \[ \mathcal{L}_{\text{contrast}}(\mathbf{z}_i, \{\mathbf{z}_p^+\}, \{\mathbf{z}_j^-\}) = -\log \frac{\sum_{p} \exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_p^+) / \tau)}{\sum_p \exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_p^+) / \tau) + \sum_j \exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j^-) / \tau)} \] where \(\text{sim}(\mathbf{z}_a, \mathbf{z}_b)\) is a similarity function (typically cosine similarity) and \(\tau > 0\) is temperature.
  2. Notation: In specific contexts we refer to \(\mathcal{L}_{\text{contrast}}\) as the InfoNCE loss or the NT-Xent loss. The number of negatives is \(|\{\mathbf{z}_j^-\}|\), which in self-supervised settings scales with batch size: batch size minus 1 when each other sample contributes one negative, or \(2(B-1)\) for batch size \(B\) when both augmented views of every other sample serve as negatives, as in SimCLR.
  3. Valid Example: SimCLR applies \(\mathcal{L}_{\text{contrast}}\) with positive pairs from augmented views of the same image and negatives from all other images in a batch of size 4096. Temperature \(\tau = 0.1\) creates angular separation where positives have cosine similarity \(> 0.9\) (angle \(< 26°\)) and negatives have similarity \(< 0.3\) (angle \(> 72°\)). This geometry enables strong linear classification performance.
  4. Failure Case: With too few negatives (for example, batch size 32), the repulsive signal is weak and representations can drift toward a degenerate, nearly constant solution \(\mathbf{z} \approx \mathbf{c}\). In the fully collapsed limit \(\mathbf{z}_i \to \mathbf{c}\) for all \(i\), every similarity is equal and, with one positive among \(N\) candidates, the loss equals \(-\log(1/N) = \log N\). This is the chance-level value: positives are perfectly aligned, but the representation carries no discriminative structure. Because \(\log N\) grows with \(N\), larger batches penalize collapse more strongly, whereas small \(N\) leaves only a small gap between collapsed and useful solutions.
  5. Explicit ML Relevance: Contrastive learning is foundational for self-supervised representation learning. Methods like SimCLR, MoCo, and CLIP use contrastive objectives to learn from unlabeled data by defining positive pairs through augmentation (SimCLR, MoCo) or text-image correspondence (CLIP). The learned representations transfer well because the contrastive geometry aligns with semantic similarity.
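
A minimal NumPy sketch of the contrastive loss for a single anchor with one positive is given below. It assumes cosine similarity and is not taken from any particular library; the function name `info_nce` and the toy vectors are illustrative. It also shows that a fully collapsed representation yields the chance-level loss \(\log N\) for \(N\) candidates.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor with a single positive and a set of negatives."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = unit(anchor), unit(positive), unit(negatives)
    logits = np.concatenate([[a @ p], n @ a]) / tau   # positive logit first
    logits = logits - logits.max()                    # numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

# Collapsed case: every embedding equals the same constant vector.
c = np.array([1.0, 2.0, 2.0])
collapsed_loss = info_nce(c, c, np.tile(c, (7, 1)))       # log(8), chance level

# Well-separated case: positive aligned with anchor, negatives opposite.
e1 = np.array([1.0, 0.0, 0.0])
separated_loss = info_nce(e1, e1, np.tile(-e1, (7, 1)))   # close to 0
```

Subtracting the maximum logit before exponentiating leaves the loss unchanged and avoids overflow at small temperatures.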

Similarity Metric

  1. Definition: A similarity metric (or similarity function, despite not being a metric in the mathematical sense) is a symmetric function \(s: \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}\) quantifying resemblance between representations, typically satisfying: (i) Symmetry: \(s(\mathbf{z}_1, \mathbf{z}_2) = s(\mathbf{z}_2, \mathbf{z}_1)\); (ii) Self-similarity: \(s(\mathbf{z}, \mathbf{z}) = s_{\max}\) (maximum value); (iii) Boundedness: \(s_{\min} \leq s(\mathbf{z}_1, \mathbf{z}_2) \leq s_{\max}\).
  2. Notation: We use \(s(\mathbf{z}_1, \mathbf{z}_2)\) for abstract similarity and specific forms \(s_{\cos}\), \(s_{\text{euc}}\), \(s_{\text{RBF}}\) for particular functions. Distance metrics are written \(d(\mathbf{z}_1, \mathbf{z}_2)\), with \(s(\mathbf{z}_1, \mathbf{z}_2) = -d(\mathbf{z}_1, \mathbf{z}_2)\) relating similarity to distance.
  3. Valid Example: In face verification, cosine similarity \(s_{\cos}(\mathbf{z}_1, \mathbf{z}_2) \in [-1, 1]\) determines matching: if \(s_{\cos} > 0.6\), predict same person. For a face encoder producing 128-dimensional embeddings, same-person pairs achieve \(s_{\cos} \approx 0.85 \pm 0.08\) while different-person pairs achieve \(s_{\cos} \approx 0.15 \pm 0.15\). The threshold 0.6 separates distributions with low error rate.
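  4. Failure Case: Using Euclidean distance \(d(\mathbf{z}_1, \mathbf{z}_2) = \|\mathbf{z}_1 - \mathbf{z}_2\|\) in high dimensions suffers from concentration: all pairwise distances cluster around a common value (on the order of \(\sqrt{d}\), where \(d\) here denotes the embedding dimension), losing discriminative power. For 512-dimensional representations, the ratio of the largest to the smallest pairwise distance is \(d_{\max} / d_{\min} \approx 1.1\), compared to \(\approx 3.0\) in low dimensions. Cosine similarity removes the dependence on vector magnitude, which helps when norm variation dominates, although angles between high-dimensional vectors can also concentrate.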
  5. Explicit ML Relevance: Similarity metrics determine loss landscape geometry in contrastive and metric learning. The NT-Xent loss uses cosine similarity, making optimization geometry spherical. Triplet loss uses Euclidean distance, making geometry flat. Learnable metrics (Mahalanobis distance with learned covariance) adapt geometry during training, improving performance on complex tasks.
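
Distance concentration is easy to observe on synthetic data. The sketch below assumes standard Gaussian samples as a stand-in for embeddings (real embeddings are not Gaussian, so the exact ratios will differ); `distance_ratio` is a hypothetical helper name.

```python
import numpy as np

def distance_ratio(dim, n_points=100, seed=0):
    """Ratio of the largest to the smallest pairwise Euclidean distance
    among n_points standard Gaussian samples in `dim` dimensions."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_points, dim))
    i, j = np.triu_indices(n_points, k=1)  # each unordered pair once
    dists = np.linalg.norm(x[i] - x[j], axis=1)
    return dists.max() / dists.min()

ratio_low = distance_ratio(dim=2)
ratio_high = distance_ratio(dim=512)

print(f"dim=2:   max/min distance ratio = {ratio_low:.1f}")
print(f"dim=512: max/min distance ratio = {ratio_high:.2f}")
```

In two dimensions some pairs are nearly coincident and the ratio is large; in 512 dimensions all pairs sit at nearly the same distance and the ratio approaches 1.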

Feature Covariance

  1. Definition: The feature covariance matrix \(\Sigma \in \mathbb{R}^{d \times d}\) captures second-order statistics of representations \(\mathbf{z} = f_\theta(\mathbf{x}) \in \mathbb{R}^d\) over a dataset \(\mathcal{D}\): \[ \Sigma = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}[(\mathbf{z} - \bar{\mathbf{z}})(\mathbf{z} - \bar{\mathbf{z}})^\top] = \frac{1}{|\mathcal{D}|} \sum_{\mathbf{x} \in \mathcal{D}} (\mathbf{z} - \bar{\mathbf{z}})(\mathbf{z} - \bar{\mathbf{z}})^\top \] where \(\bar{\mathbf{z}} = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}[f_\theta(\mathbf{x})]\) is the mean representation.
  2. Notation: We write \(\Sigma = \text{Cov}(\mathbf{z})\) and denote eigenvalues \(\lambda_i\) with corresponding eigenvectors \(\mathbf{q}_i\). The trace \(\text{tr}(\Sigma) = \sum_i \lambda_i\) equals total variance. The condition number \(\kappa(\Sigma) = \lambda_1 / \lambda_d\) quantifies anisotropy.
  3. Valid Example: A ResNet-50 encoder for ImageNet produces 2048-dimensional features with covariance having top 200 eigenvalues explaining 95% of variance (\(\sum_{i=1}^{200} \lambda_i / \sum_{i=1}^{2048} \lambda_i \approx 0.95\)). The effective dimension is \(\approx 200\), indicating substantial dimensional collapse. Principal components correspond to semantic variations: top components capture object category, middle components capture pose and lighting, bottom components capture noise.
  4. Failure Case: Poorly initialized networks produce degenerate covariance with \(\lambda_1 \approx \lambda_2 \approx \cdots \approx \lambda_d \approx 0\) (all near zero), indicating collapsed representations with \(\mathbf{z} \approx \bar{\mathbf{z}}\) constant. This prevents learning because gradients through \(\mathbf{z}\) vanish. Good initialization ensures \(\text{tr}(\Sigma) \approx 1\) initially, providing learning signal.
  5. Explicit ML Relevance: Covariance regularization prevents collapse. Barlow Twins drives the cross-correlation matrix \(\text{Corr}(\mathbf{z}^A, \mathbf{z}^B)\) between the embeddings of two augmented views toward the identity: diagonal entries toward 1 (invariance across views) and off-diagonal entries toward 0 (redundancy reduction). VICReg keeps the standard deviation of each dimension above a threshold with a hinge loss while penalizing off-diagonal entries of \(\Sigma\). These objectives directly shape \(\Sigma\)’s spectral structure, ensuring representations utilize all dimensions and maintain decorrelation.
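
The quantities in this definition can be computed directly from a matrix of representations. The following is a minimal sketch on synthetic data; the array `z`, the per-dimension `scales`, and the helper names are illustrative assumptions standing in for real encoder outputs.

```python
import numpy as np

def covariance_spectrum(z):
    """Eigenvalues (descending) of the feature covariance of z.

    z: array of shape (n_samples, d), one representation per row.
    Uses the 1/n normalization from the definition above.
    """
    centered = z - z.mean(axis=0, keepdims=True)
    sigma = centered.T @ centered / z.shape[0]
    eigvals = np.linalg.eigvalsh(sigma)[::-1]  # eigvalsh returns ascending order
    return np.clip(eigvals, 0.0, None)         # remove tiny negative round-off

def explained_variance(eigvals, k):
    """Fraction of total variance captured by the top-k eigenvalues."""
    return eigvals[:k].sum() / eigvals.sum()

# Synthetic stand-in for encoder outputs: 8 dimensions with decaying scale.
rng = np.random.default_rng(0)
scales = np.array([4.0, 2.0, 1.0, 0.5, 0.25, 0.1, 0.05, 0.01])
z = rng.standard_normal((5000, 8)) * scales

eigvals = covariance_spectrum(z)
total_variance = eigvals.sum()                # equals tr(Sigma)
condition_number = eigvals[0] / eigvals[-1]   # kappa(Sigma) = lambda_1 / lambda_d
top2_fraction = explained_variance(eigvals, 2)

print(total_variance, condition_number, top2_fraction)
```

Because two directions carry most of the variance here, the top-2 fraction is close to 1 and the condition number is very large, which is the anisotropic regime described above.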

Mutual Information (Preview)

  1. Definition: The mutual information \(I(\mathbf{X}; \mathbf{Z})\) between input random variable \(\mathbf{X} \in \mathcal{X}\) and representation \(\mathbf{Z} = f_\theta(\mathbf{X}) \in \mathcal{Z}\) quantifies the amount of information \(\mathbf{Z}\) contains about \(\mathbf{X}\): \[ I(\mathbf{X}; \mathbf{Z}) = H(\mathbf{X}) - H(\mathbf{X} | \mathbf{Z}) = \mathbb{E}_{\mathbf{x}, \mathbf{z}} \left[ \log \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{x}) p(\mathbf{z})} \right] \] where \(H(\cdot)\) denotes entropy. This preview introduces the concept; full treatment requires information theory (Chapter 22).
  2. Notation: We write \(I(\mathbf{X}; \mathbf{Z})\) for mutual information, distinguishing it from independence (\(\mathbf{X} \perp \mathbf{Z}\) implies \(I(\mathbf{X}; \mathbf{Z}) = 0\)). The conditional entropy \(H(\mathbf{X}|\mathbf{Z})\) measures uncertainty about \(\mathbf{X}\) given \(\mathbf{Z}\).
  3. Valid Example: An autoencoder with bottleneck dimension \(d < n\) has \(I(\mathbf{X}; \mathbf{Z}) \leq d \log(C)\) for continuous \(\mathbf{Z}\) with bounded support (covering number \(C\)). For MNIST with \(n = 784\) pixels and \(d = 32\) latent dimensions, achievable mutual information is \(I \approx 25\) nats (estimated via variational bounds), less than \(H(\mathbf{X}) \approx 200\) nats for natural images.
  4. Failure Case: Maximizing \(I(\mathbf{X}; \mathbf{Z})\) alone produces overfitting: the encoder memorizes inputs, including noise. A representation with \(I(\mathbf{X}; \mathbf{Z}) = H(\mathbf{X})\) retains all input information, failing to generalize. Effective representations balance high \(I(\mathbf{Z}; Y)\) (task-relevant information) with low \(I(\mathbf{X}; \mathbf{Z})\) (compression), formalized by the information bottleneck.
  5. Explicit ML Relevance: Self-supervised learning can be viewed as implicitly maximizing mutual information. Contrastive methods maximize a lower bound on \(I(\mathbf{Z}^1; \mathbf{Z}^2)\) between representations of augmented views. Mutual information neural estimation (MINE) provides tractable gradients for MI maximization. The information bottleneck framework (previewed later in this section) balances MI maximization with compression.
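
For discrete variables the definition can be evaluated exactly, which helps build intuition before turning to estimators for continuous representations. The following is a minimal sketch; the two joint tables are toy assumptions chosen to hit the extreme cases.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution.

    joint: list of rows, joint[i][j] = P(X = i, Z = j); entries sum to 1.
    """
    p_x = [sum(row) for row in joint]
    p_z = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xz in enumerate(row):
            if p_xz > 0.0:  # the term 0 * log 0 is taken as 0
                mi += p_xz * math.log(p_xz / (p_x[i] * p_z[j]))
    return mi

# Z independent of X: the representation carries no information.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Z determines X exactly: I(X; Z) = H(X) = log 2 for a fair binary X.
deterministic = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_deterministic = mutual_information(deterministic)

print(mi_independent, mi_deterministic, math.log(2))
```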

Manifold Hypothesis

  1. Definition: The manifold hypothesis states that high-dimensional data \(\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^n\) lie approximately on or near a low-dimensional manifold \(\mathcal{M} \subset \mathbb{R}^n\) with intrinsic dimension \(d_{\mathcal{M}} \ll n\). Formally, there exists an embedding \(\psi: \mathbb{R}^{d_{\mathcal{M}}} \to \mathbb{R}^n\) such that \(\psi(\mathbb{R}^{d_{\mathcal{M}}}) = \mathcal{M}\) and data density concentrates near \(\mathcal{M}\): \[ \mathbb{P}[d(\mathbf{x}, \mathcal{M}) > \epsilon] \leq \delta \] for small tolerances \(\epsilon, \delta > 0\), where \(d(\mathbf{x}, \mathcal{M}) = \inf_{\mathbf{m} \in \mathcal{M}} \|\mathbf{x} - \mathbf{m}\|\). In other words, all but a small fraction \(\delta\) of the probability mass lies within distance \(\epsilon\) of the manifold.
  2. Notation: We write \(\mathcal{M}\) for the data manifold, \(d_{\mathcal{M}}\) for intrinsic dimension, and \(T_\mathbf{x} \mathcal{M}\) for the tangent space at point \(\mathbf{x} \in \mathcal{M}\). The normal space \(N_\mathbf{x} \mathcal{M}\) is orthogonal to \(T_\mathbf{x} \mathcal{M}\), with \(\mathbb{R}^n = T_\mathbf{x} \mathcal{M} \oplus N_\mathbf{x} \mathcal{M}\).
  3. Valid Example: Natural images from ImageNet (dimension \(n = 224 \times 224 \times 3 = 150{,}528\)) lie approximately on a manifold of dimension \(d_{\mathcal{M}} \approx 200\). Principal component analysis reveals that 200 components explain 95% of variance. Autoencoders with 200-dimensional bottlenecks achieve high-quality reconstruction (PSNR \(> 30\) dB), confirming low intrinsic dimensionality.
  4. Failure Case: Completely random data (white noise) violate the manifold hypothesis: data uniformly fill the ambient space \(\mathbb{R}^n\) with \(d_{\mathcal{M}} = n\). Attempting to learn low-dimensional representations of random data produces poor reconstructions because no lower-dimensional structure exists. This emphasizes that the hypothesis applies to structured data (natural images, speech, text), not arbitrary signals.
  5. Explicit ML Relevance: Generative models explicitly parameterize the data manifold. GANs learn a generator \(G: \mathbb{R}^{d_z} \to \mathbb{R}^n\) where \(G(\mathbb{R}^{d_z})\) approximates \(\mathcal{M}\). Normalizing flows learn invertible maps between \(\mathcal{M}\) and simple distributions. Manifold learning algorithms (Isomap, LLE, t-SNE) estimate \(d_{\mathcal{M}}\) and discover local manifold structure for visualization.

Information Bottleneck (Preview)

  1. Definition: The information bottleneck principle seeks representations \(\mathbf{Z} = f_\theta(\mathbf{X})\) that maximize task-relevant information \(I(\mathbf{Z}; Y)\) while minimizing total information \(I(\mathbf{X}; \mathbf{Z})\), formalized as the optimization: \[ \max_{\mathbf{Z}} \left[ I(\mathbf{Z}; Y) - \beta I(\mathbf{X}; \mathbf{Z}) \right] \] where \(Y\) is the task target variable and \(\beta > 0\) controls the compression-prediction trade-off. This preview introduces the principle; full analysis requires information theory.
  2. Notation: We write \(\mathcal{L}_{\text{IB}}(\theta) = -I(\mathbf{Z}; Y) + \beta I(\mathbf{X}; \mathbf{Z})\) for the information bottleneck objective (to be minimized). The term \(I(\mathbf{Z}; Y)\) measures task performance, \(I(\mathbf{X}; \mathbf{Z})\) measures representation complexity.
  3. Valid Example: Training a classifier with weight decay or dropout implicitly approximates information bottleneck optimization. Dropout stochastically compresses representations, reducing \(I(\mathbf{X}; \mathbf{Z})\). Larger dropout rates (higher \(\beta\)) produce simpler representations with better generalization on small datasets but worse performance on large datasets where capacity helps.
  4. Failure Case: Computing exact \(I(\mathbf{X}; \mathbf{Z})\) is intractable for high-dimensional continuous variables. Practical implementations use variational bounds (VIB) or mutual information estimators (MINE), introducing approximation error. Poor approximations misestimate compression, leading to suboptimal representations that either overfit (underestimate \(I(\mathbf{X}; \mathbf{Z})\)) or underfit (overestimate).
  5. Explicit ML Relevance: Variational Information Bottleneck (VIB) implements the principle using variational inference, learning stochastic encoders \(q(\mathbf{z}|\mathbf{x})\) that inject noise to limit information. The VIB objective \(\mathcal{L}_{\text{VIB}} = -\mathbb{E}[\log p(y|\mathbf{z})] + \beta \text{KL}(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))\) approximates information bottleneck, producing representations that generalize well with appropriate \(\beta\).
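
The KL term in the VIB objective has a closed form under a common modeling choice. The sketch below assumes a diagonal Gaussian encoder \(q(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))\) and a standard normal prior \(p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, I)\); these choices, the function names, and the numeric inputs are assumptions for illustration.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )

def vib_loss(neg_log_likelihood, mu, log_var, beta):
    """Per-example VIB objective: prediction term plus beta-weighted KL."""
    return neg_log_likelihood + beta * kl_diag_gaussian_to_standard_normal(mu, log_var)

# Encoder output equal to the prior: no information is transmitted.
kl_at_prior = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Unit-variance encoder with shifted mean: KL = 0.5 * ||mu||^2.
kl_shifted = kl_diag_gaussian_to_standard_normal([1.0, -1.0], [0.0, 0.0])

loss = vib_loss(neg_log_likelihood=0.3, mu=[1.0, -1.0], log_var=[0.0, 0.0], beta=0.01)
print(kl_at_prior, kl_shifted, loss)
```

Raising `beta` increases the cost of moving the encoder distribution away from the prior, which is the compression pressure described above.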

Degenerate Representation

  1. Definition: A representation \(\mathbf{z} = f_\theta(\mathbf{x}) \in \mathbb{R}^d\) is degenerate if it fails to utilize available capacity or encode essential data structure. Formal criteria include: 1. Low effective rank: \(\text{rank}_\epsilon(\text{Cov}(\mathbf{z})) \ll d\) 2. Constant output: \(\mathbb{P}[\|\mathbf{z} - \bar{\mathbf{z}}\| < \epsilon] \to 1\) for small \(\epsilon\) 3. Non-injectivity: \(\exists \mathbf{x}_1 \neq \mathbf{x}_2\) with \(f_\theta(\mathbf{x}_1) = f_\theta(\mathbf{x}_2)\) when injectivity is desired 4. Poor downstream performance despite zero training loss
  2. Notation: We quantify degeneracy using the effective dimension ratio \(\rho = \text{rank}_\epsilon(\Sigma) / d\) where \(\Sigma = \text{Cov}(\mathbf{z})\). Values \(\rho < 0.1\) indicate severe degeneracy. The collapse metric \(\kappa = \text{Var}(\mathbf{z}) / \text{Var}(\mathbf{x}_{\text{proj}})\) measures variance preservation.
  3. Valid Example: A contrastive learning model trained with only 8 negative samples per positive produces degenerate 512-dimensional representations with \(\text{rank}(\Sigma) = 12\) and \(\rho = 0.023\). Linear probing accuracy is 35% (vs. 75% for a non-degenerate baseline). Increasing the number of negatives to 2048 raises the rank to \(\text{rank}(\Sigma) = 380\) (\(\rho = 0.74\)) and the accuracy to 72%, confirming that degeneracy limited performance.
  4. Failure Case: Training a deep autoencoder with insufficient regularization on small datasets produces degenerate representations that memorize training examples. The encoder maps each training \(\mathbf{x}_i\) to a distinct code \(\mathbf{z}_i\), but test examples map to arbitrary regions, producing poor reconstruction. Formally, \(\text{Cov}(\mathbf{z}_{\text{train}})\) has full rank but \(\text{Cov}(\mathbf{z}_{\text{test}})\) has near-zero rank, indicating overfitting.
  5. Explicit ML Relevance: Detecting degenerate representations is critical for self-supervised learning. Many SSL methods can trivially satisfy loss objectives through degenerate solutions. Barlow Twins prevents degeneracy by driving the cross-correlation matrix between two augmented views toward the identity, which keeps every dimension active and decorrelates distinct dimensions. VICReg explicitly regularizes variance, invariance, and covariance to avoid degenerate solutions.
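
The effective dimension ratio \(\rho\) can serve as a quick collapse check. The following is a minimal sketch comparing a synthetic full-rank representation with one confined to a low-dimensional subspace; the data, the threshold `eps`, and the helper name are illustrative assumptions.

```python
import numpy as np

def effective_rank_ratio(z, eps=1e-3):
    """rho = rank_eps(Sigma) / d, where rank_eps counts eigenvalues
    with lambda_i / lambda_1 > eps."""
    centered = z - z.mean(axis=0, keepdims=True)
    sigma = centered.T @ centered / z.shape[0]
    eigvals = np.linalg.eigvalsh(sigma)[::-1]  # descending order
    rank_eps = int(np.sum(eigvals / eigvals[0] > eps))
    return rank_eps / z.shape[1]

rng = np.random.default_rng(0)
d, k, n = 64, 3, 2000

# Healthy: variance spread over all d dimensions.
healthy = rng.standard_normal((n, d))
# Collapsed: all variation confined to a random k-dimensional subspace.
basis = rng.standard_normal((k, d))
collapsed = rng.standard_normal((n, k)) @ basis

rho_healthy = effective_rank_ratio(healthy)
rho_collapsed = effective_rank_ratio(collapsed)
print(rho_healthy, rho_collapsed)
```

Here the collapsed representation yields \(\rho = k/d\), well below the \(\rho < 0.1\) level flagged above as severe degeneracy.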

Overparameterized Representation

  1. Definition: A representation \(\mathbf{z} = f_\theta(\mathbf{x}) \in \mathbb{R}^d\) with parameter count \(|\theta| = p\) is overparameterized when \(p \gg n_{\text{train}}\) (parameters greatly exceed training samples) or \(d \gg d_{\mathcal{M}}\) (representation dimension greatly exceeds data manifold dimension). Formally, the system has more degrees of freedom than constraints, enabling multiple solutions that fit training data.
  2. Notation: We define the overparameterization ratio \(\gamma = p / n_{\text{train}}\) for parameters-to-samples and \(\delta = d / d_{\mathcal{M}}\) for dimension-to-manifold ratios. Values \(\gamma > 10\) or \(\delta > 5\) indicate substantial overparameterization.
  3. Valid Example: A ResNet-50 for CIFAR-10 has \(p \approx 23\)M parameters trained on \(n_{\text{train}} = 50\)K samples, giving \(\gamma \approx 460\). Despite ability to memorize training data, the model generalizes well (test accuracy 95%) due to optimization bias toward simple solutions. Representations have \(d = 2048\) while data manifold has \(d_{\mathcal{M}} \approx 100\), providing \(\delta \approx 20\).
  4. Failure Case: Training a 100-layer MLP (millions of parameters) on 100 MNIST samples produces perfect training accuracy but 10% test accuracy (random chance). The massively overparameterized system memorizes training data without learning structure. Representations are arbitrary, changing dramatically with different initializations. Regularization (dropout, weight decay) is insufficient to force generalization.
  5. Explicit ML Relevance: Understanding overparameterization explains double descent: test error first decreases with capacity, increases near interpolation threshold (\(\gamma \approx 1\)), then decreases again with overparameterization (\(\gamma \gg 1\)). Modern deep learning operates in the overparameterized regime, relying on implicit regularization from SGD, architecture, and initialization to avoid overfitting.

Feature Alignment

  1. Definition: Feature alignment quantifies the similarity of representations learned by different models or different layers. Given two representation maps \(f_\theta: \mathcal{X} \to \mathbb{R}^{d_1}\) and \(g_\phi: \mathcal{X} \to \mathbb{R}^{d_2}\), the alignment is measured by canonical correlation analysis (CCA) or centered kernel alignment (CKA). The linear form of CKA is: \[ \text{CKA}(f_\theta, g_\phi) = \frac{\|\mathbf{Z}_1^\top \mathbf{Z}_2\|_F^2}{\|\mathbf{Z}_1^\top \mathbf{Z}_1\|_F \|\mathbf{Z}_2^\top \mathbf{Z}_2\|_F} \] where \(\mathbf{Z}_1 \in \mathbb{R}^{n \times d_1}\) and \(\mathbf{Z}_2 \in \mathbb{R}^{n \times d_2}\) are column-centered representation matrices computed on the same \(n\) samples.
  2. Notation: We write \(\text{align}(f_\theta, g_\phi) \in [0, 1]\) for alignment scores, with \(1\) indicating perfect alignment (representations span the same subspace) and \(0\) indicating orthogonality. For layer comparisons, \(\text{align}(f^{(\ell)}, f^{(\ell+k)})\) measures alignment between layers \(\ell\) and \(\ell+k\).
  3. Valid Example: Two ResNet-50 models trained independently on ImageNet achieve layer-wise CKA scores: early layers \(\text{CKA} \approx 0.9\) (similar low-level features like edges), middle layers \(\text{CKA} \approx 0.7\) (moderate-level features like textures), deep layers \(\text{CKA} \approx 0.5\) (task-specific features vary more with initialization). This pattern confirms hierarchical representation learning with increasing specialization.
  4. Failure Case: Comparing representations across fundamentally different domains yields low alignment that doesn’t indicate poor quality. An image encoder and text encoder should have low \(\text{CKA} \approx 0.1\) because they process different modalities. Misinterpreting this as failure ignores that alignment should be low. Alignment is meaningful only for comparable representations.
  5. Explicit ML Relevance: Representation alignment guides neural architecture search and transfer learning. If a pre-trained model’s layers have high alignment with a target task’s optimal representations, transfer learning succeeds. Low alignment suggests the pre-trained model learned task-irrelevant features. Alignment also diagnoses training dynamics: sudden alignment drops indicate learning transitions or mode switches.
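
Linear CKA follows directly from the formula in the definition. The sketch below is a minimal NumPy version evaluated on synthetic matrices; the random data and the orthogonal matrix `q` are assumptions used only to exercise the invariance properties.

```python
import numpy as np

def linear_cka(z1, z2):
    """Linear CKA between representation matrices of shape (n, d1) and (n, d2)."""
    z1 = z1 - z1.mean(axis=0, keepdims=True)  # center each feature column
    z2 = z2 - z2.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(z1.T @ z2, ord="fro") ** 2
    norm1 = np.linalg.norm(z1.T @ z1, ord="fro")
    norm2 = np.linalg.norm(z2.T @ z2, ord="fro")
    return cross / (norm1 * norm2)

rng = np.random.default_rng(0)
z = rng.standard_normal((500, 16))
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # random orthogonal matrix

cka_self = linear_cka(z, z)
cka_rotated = linear_cka(z, 3.0 * z @ q)  # rotation plus isotropic scaling
cka_unrelated = linear_cka(z, rng.standard_normal((500, 8)))

print(cka_self, cka_rotated, cka_unrelated)
```

The score is unchanged by orthogonal transformations and isotropic rescaling of either representation, and it accepts different widths \(d_1 \neq d_2\), which is why it is convenient for comparing layers of different sizes.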

Spectral Representation Structure

  1. Definition: The spectral structure of representations refers to the eigenvalue decomposition of the feature covariance matrix \(\Sigma = \text{Cov}(\mathbf{z}) = Q \Lambda Q^\top\) where \(\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_d)\) with \(\lambda_1 \geq \cdots \geq \lambda_d \geq 0\). The spectral structure captures: 1. Eigenvalue distribution: power law \(\lambda_i \propto i^{-\alpha}\) vs. exponential decay \(\lambda_i \propto \exp(-\beta i)\) 2. Effective rank: \(\text{rank}_\epsilon(\Sigma) = |\{i : \lambda_i / \lambda_1 > \epsilon\}|\) 3. Participation ratio: \(\text{PR} = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2\) 4. Eigenvector geometry: principal components \(\mathbf{q}_i\) and their semantic interpretations
  2. Notation: We write \(\{\lambda_i\}_{i=1}^d\) for the spectrum and use cumulative variance \(V_k = \sum_{i=1}^k \lambda_i / \sum_{i=1}^d \lambda_i\) to quantify explained variance by top \(k\) components. Spectral gap \(\Delta_k = \lambda_k - \lambda_{k+1}\) indicates separation between principal and residual subspaces.
  3. Valid Example: Transformer representations for BERT exhibit power-law spectral decay with \(\alpha \approx 1.5\): \(\lambda_i \approx \lambda_1 / i^{1.5}\). The top 100 eigenvalues (out of 768) explain 80% variance. Principal components correspond to semantic dimensions: component 1 captures sentence length, components 2-5 capture sentiment and topic, deeper components capture syntactic and morphological features.
  4. Failure Case: Randomly initialized networks produce flat spectra with \(\lambda_i \approx \text{const}\) for all \(i\), indicating no learned structure. Training should produce non-flat spectra; persistent flat spectra after training indicate failure to learn. Similarly, a single massive eigenvalue \(\lambda_1 \gg \lambda_2\) with \(\lambda_2 \approx \lambda_3 \approx \cdots\) suggests representations project onto a one-dimensional subspace, losing almost all information.
  5. Explicit ML Relevance: Spectral regularization improves representation quality. Whitening transformations normalize eigenvalues (\(\lambda_i \to 1\) for all \(i\)), preventing dimensional dominance. Spectral normalization in GANs controls Lipschitz constants by bounding top singular values. Analyzing spectral evolution during training provides diagnostics: healthy training shows gradual eigenvalue separation, while collapse shows rapid concentration onto a few dimensions.
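
The participation ratio summarizes a spectrum in one number. The following is a minimal sketch comparing three hypothetical spectra (flat, power-law, and one dominant eigenvalue); the dimension of 512 and the specific eigenvalues are illustrative assumptions.

```python
def participation_ratio(eigvals):
    """PR = (sum_i lambda_i)^2 / sum_i lambda_i^2."""
    total = sum(eigvals)
    return total * total / sum(v * v for v in eigvals)

d = 512
flat = [1.0] * d                                        # whitened: all eigenvalues equal
power_law = [1.0 / (i ** 1.5) for i in range(1, d + 1)]  # lambda_i proportional to i^(-1.5)
one_dominant = [1000.0] + [1.0] * (d - 1)               # lambda_1 >> lambda_2 = lambda_3 = ...

pr_flat = participation_ratio(flat)
pr_power_law = participation_ratio(power_law)
pr_one_dominant = participation_ratio(one_dominant)

print(pr_flat, pr_power_law, pr_one_dominant)
```

A flat spectrum attains the maximum \(\text{PR} = d\), while both the power-law and the single-dominant spectra give single-digit values, reflecting how few directions carry most of the variance.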

Geometry of Loss Landscapes

  1. Definition: The geometry of the loss landscape in representation space characterizes the curvature, topology, and structure of \(\mathcal{L}(\mathbf{z}, y)\) where \(\mathbf{z} \in \mathcal{Z}\) is a representation and \(y\) is a target. Key geometric properties include: 1. Local curvature: Hessian \(H(\mathbf{z}) = \nabla^2_\mathbf{z} \mathcal{L}(\mathbf{z}, y)\) 2. Level set topology: \(\mathcal{L}_c = \{\mathbf{z} : \mathcal{L}(\mathbf{z}, y) = c\}\) for constant \(c\) 3. Gradient flow: \(\dot{\mathbf{z}} = -\nabla_\mathbf{z} \mathcal{L}(\mathbf{z}, y)\) 4. Critical points: minima, maxima, and saddles where \(\nabla_\mathbf{z} \mathcal{L} = \mathbf{0}\)
  2. Notation: We denote the loss landscape as a function \(\mathcal{L}: \mathcal{Z} \to \mathbb{R}\) (fixing target \(y\) implicitly). The Hessian’s eigenvalues \(\{\mu_i\}\) characterize curvature: at a critical point, all \(\mu_i > 0\) indicates a strict local minimum and mixed signs indicate a saddle. When all \(\mu_i > 0\), the condition number \(\kappa_H = \mu_{\max} / \mu_{\min}\) quantifies anisotropy.
  3. Valid Example: For softmax classification with linear head, the loss landscape in representation space is convex within each class’s cone but has ridges between classes. Near a class boundary, one eigenvalue \(\mu_1 \approx 100\) (perpendicular to boundary, high curvature) while others \(\mu_i \approx 0.1\) (parallel to boundary, low curvature). This anisotropy guides optimization: most gradient signal pushes representations away from wrong classes.
  4. Failure Case: Pathological loss landscapes with \(\kappa_H > 10^6\) slow training to impractical rates. The loss is extremely sensitive in some directions (large \(\mu_i\)) and flat in others (small \(\mu_i\)), causing gradient descent to oscillate or diverge. Poor conditioning arises from unnormalized representations: if \(\|\mathbf{z}\|\) varies by orders of magnitude, curvature varies proportionally. Normalization (batch norm, layer norm) stabilizes geometry.
  5. Explicit ML Relevance: Understanding representation space geometry enables better optimization strategies. Adaptive optimizers (Adam, RMSProp) approximate local curvature to set per-dimension learning rates. Second-order methods (K-FAC, natural gradient) explicitly use Hessian information. Loss landscape visualization reveals mode connectivity: different local minima often lie on low-loss paths in representation space, suggesting flat minima generalize better.
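
The effect of anisotropic curvature on gradient descent can be seen on a two-dimensional quadratic. The following is a minimal sketch, assuming a diagonal Hessian so that its eigenvalues are the entries of `mu`; the starting point, step count, and step size \(1/\mu_{\max}\) are illustrative choices.

```python
def gradient_descent_quadratic(mu, z0, steps):
    """Gradient descent on L(z) = 0.5 * sum_i mu_i * z_i^2.

    The Hessian is diag(mu), so mu holds its eigenvalues directly.
    The step size 1 / mu_max is a standard stable choice for this quadratic.
    """
    lr = 1.0 / max(mu)
    z = list(z0)
    for _ in range(steps):
        z = [zi - lr * m * zi for zi, m in zip(z, mu)]  # z <- z - lr * grad
    return z

mu_bad = [100.0, 0.1]  # curvatures from the example above: kappa_H = 1000
mu_good = [1.0, 1.0]   # perfectly conditioned: kappa_H = 1

z_bad = gradient_descent_quadratic(mu_bad, [1.0, 1.0], steps=100)
z_good = gradient_descent_quadratic(mu_good, [1.0, 1.0], steps=100)

print(z_bad, z_good)
```

With \(\kappa_H = 1000\), the high-curvature coordinate converges immediately while the low-curvature coordinate has barely moved after 100 steps, because the step size is capped by the largest eigenvalue.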

Stability of Representations

  1. Definition: A representation \(\mathbf{z} = f_\theta(\mathbf{x})\) is stable under perturbations if small changes to inputs, parameters, or training procedures produce small changes in representations. Formal stability criteria include: 1. Input stability: \(\|f_\theta(\mathbf{x} + \delta_x) - f_\theta(\mathbf{x})\| \leq L_x \|\delta_x\|\) (Lipschitz continuity) 2. Parameter stability: \(\|f_{\theta + \delta_\theta}(\mathbf{x}) - f_\theta(\mathbf{x})\| \leq L_\theta \|\delta_\theta\|\) 3. Statistical stability: \(\mathbb{E}_{\delta}[\|f_\theta(\mathbf{x} + \delta) - f_\theta(\mathbf{x})\|^2] \leq \epsilon\) for noise \(\delta\)
  2. Notation: We write \(L_f = \sup_{\mathbf{x}_1 \neq \mathbf{x}_2} \|f(\mathbf{x}_1) - f(\mathbf{x}_2)\| / \|\mathbf{x}_1 - \mathbf{x}_2\|\) for the Lipschitz constant. For stochastic stability, \(\text{SNR}(\mathbf{z}) = \mathbb{E}[\|\mathbf{z}\|^2] / \mathbb{E}_\delta[\|\mathbf{z} - \mathbf{z}_\delta\|^2]\) quantifies signal-to-noise ratio.
  3. Valid Example: A robust image classifier has input Lipschitz constant \(L_x \approx 10\): adversarial perturbations of magnitude \(\|\delta_x\| = 0.01\) produce representation changes of at most \(\|\Delta \mathbf{z}\| \leq L_x \|\delta_x\| = 0.1\), small compared to the typical representation norm \(\mathbb{E}[\|\mathbf{z}\|] \approx 5\). This stability forces an adversary to use larger, more detectable perturbations to fool the classifier.
  4. Failure Case: An overfit neural network has \(L_x > 1000\): tiny input perturbations drastically change representations. For MNIST, adding Gaussian noise with \(\sigma = 0.01\) (visually imperceptible) changes \(\mathbf{z}\) by \(\|\Delta \mathbf{z}\| \approx 10\), comparable to inter-class distances. Such instability indicates the network memorized pixel-level noise rather than robust features. Regularization (dropout, weight penalty) reduces \(L_x\).
  5. Explicit ML Relevance: Adversarial training explicitly optimizes representation stability by minimizing \(\max_{\|\delta_x\| \leq \epsilon} \mathcal{L}(f_\theta(\mathbf{x} + \delta_x), y)\). This encourages representations that are insensitive to perturbations within the \(\epsilon\)-ball, although it does not by itself guarantee a bounded Lipschitz constant. Certified robustness methods compute provable bounds on \(L_x\) using interval analysis or convex optimization. Stability is crucial for safety-critical applications where adversarial manipulation must be prevented.
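
For a linear map the Lipschitz constant is known exactly, which makes it a convenient test case for the stability measures above. The following is a minimal sketch, assuming a hypothetical linear encoder \(f(\mathbf{x}) = W\mathbf{x}\) with random weights; for nonlinear networks the sampled ratio gives only a lower bound on \(L_f\).

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64))  # hypothetical linear encoder f(x) = W x

def f(x):
    """Apply the encoder to a batch of inputs with shape (n, 64)."""
    return x @ w.T

# Exact l2 Lipschitz constant of a linear map: its spectral norm.
lipschitz_exact = np.linalg.norm(w, ord=2)

# Empirical lower bound from random input pairs.
x1 = rng.standard_normal((1000, 64))
x2 = rng.standard_normal((1000, 64))
ratios = np.linalg.norm(f(x1) - f(x2), axis=1) / np.linalg.norm(x1 - x2, axis=1)
lipschitz_empirical = ratios.max()

# Signal-to-noise ratio under small Gaussian input noise (sigma = 0.01).
noise = 0.01 * rng.standard_normal(x1.shape)
signal_power = np.mean(np.sum(f(x1) ** 2, axis=1))
noise_power = np.mean(np.sum((f(x1 + noise) - f(x1)) ** 2, axis=1))
snr = signal_power / noise_power

print(lipschitz_exact, lipschitz_empirical, snr)
```

Random pairs rarely align with the top singular direction, so the empirical value typically sits well below the exact constant; this gap is one reason certified methods bound \(L_x\) analytically rather than by sampling.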

Theorems

Representation Collapse Characterization Theorem

Formal Statement: Let \(f_\theta: \mathcal{X} \to \mathbb{R}^d\) be a representation map applied to dataset \(\mathcal{D} = \{\mathbf{x}_i\}_{i=1}^n\), producing representations \(\mathbf{z}_i = f_\theta(\mathbf{x}_i)\). Define the covariance matrix \(\Sigma = \frac{1}{n}\sum_{i=1}^n (\mathbf{z}_i - \bar{\mathbf{z}})(\mathbf{z}_i - \bar{\mathbf{z}})^\top\) where \(\bar{\mathbf{z}} = \frac{1}{n}\sum_i \mathbf{z}_i\). Then representation collapse onto a \(k\)-dimensional subspace occurs (effective dimension \(k \ll d\)) if and only if: \[ \text{rank}(\Sigma) = k \quad \text{or equivalently} \quad \lambda_{k+1} = \cdots = \lambda_d = 0 \] where \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d\) are eigenvalues of \(\Sigma\) and \(\lambda_k > 0\). The participation ratio \(\text{PR} = (\sum_{i=1}^d \lambda_i)^2 / \sum_{i=1}^d \lambda_i^2\) satisfies \(\text{PR} \leq k\) under exact collapse, with equality if and only if the \(k\) non-zero eigenvalues are equal.

Full Formal Proof:

Step 1: Eigendecomposition of covariance. Since \(\Sigma\) is symmetric and positive semi-definite, it admits the spectral decomposition: \[ \Sigma = Q \Lambda Q^\top = \sum_{i=1}^d \lambda_i \mathbf{q}_i \mathbf{q}_i^\top \] where \(\mathbf{q}_i\) are orthonormal eigenvectors and \(\lambda_i \geq 0\) are eigenvalues.

Step 2: Rank-nullity theorem. The rank of \(\Sigma\) equals the number of non-zero eigenvalues: \[ \text{rank}(\Sigma) = |\{i : \lambda_i > 0\}| \] If \(\text{rank}(\Sigma) = k < d\), then \(\lambda_1, \ldots, \lambda_k > 0\) and \(\lambda_{k+1} = \cdots = \lambda_d = 0\).

Step 3: Representation subspace. Any representation can be decomposed: \[ \mathbf{z}_i - \bar{\mathbf{z}} = \sum_{j=1}^d (\mathbf{z}_i - \bar{\mathbf{z}})^\top \mathbf{q}_j \cdot \mathbf{q}_j \] When \(\lambda_{k+1} = \cdots = \lambda_d = 0\), the variance in directions \(\mathbf{q}_{k+1}, \ldots, \mathbf{q}_d\) is zero: \[ \frac{1}{n}\sum_{i=1}^n [(\mathbf{z}_i - \bar{\mathbf{z}})^\top \mathbf{q}_j]^2 = \lambda_j = 0 \quad \forall j > k \] This implies \((\mathbf{z}_i - \bar{\mathbf{z}})^\top \mathbf{q}_j = 0\) for all \(i\) and all \(j > k\). Therefore: \[ \mathbf{z}_i - \bar{\mathbf{z}} \in \text{span}\{\mathbf{q}_1, \ldots, \mathbf{q}_k\} \] The representations lie in a \(k\)-dimensional subspace.

Step 4: Participation ratio. By definition: \[ \text{PR} = \frac{(\sum_{i=1}^d \lambda_i)^2}{\sum_{i=1}^d \lambda_i^2} \] Under exact collapse with \(\lambda_1, \ldots, \lambda_k > 0\) and \(\lambda_{k+1} = \cdots = \lambda_d = 0\): \[ \text{PR} = \frac{(\sum_{i=1}^k \lambda_i)^2}{\sum_{i=1}^k \lambda_i^2} \] In the case of uniform collapse where \(\lambda_1 = \cdots = \lambda_k = \lambda\): \[ \text{PR} = \frac{(k\lambda)^2}{k\lambda^2} = k \] For non-uniform collapse, \(\text{PR} \leq k\) by Cauchy-Schwarz, with equality iff all non-zero eigenvalues are equal.
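
The Cauchy-Schwarz step can be spelled out as follows. Applying the inequality to the vectors \((\lambda_1, \ldots, \lambda_k)\) and \((1, \ldots, 1)\) gives: \[ \left(\sum_{i=1}^k \lambda_i \cdot 1\right)^2 \leq \left(\sum_{i=1}^k \lambda_i^2\right)\left(\sum_{i=1}^k 1^2\right) = k \sum_{i=1}^k \lambda_i^2 \] and dividing both sides by \(\sum_{i=1}^k \lambda_i^2 > 0\) yields \(\text{PR} \leq k\). Equality holds exactly when the two vectors are proportional, that is, when \(\lambda_1 = \cdots = \lambda_k\). As an illustrative check with a hypothetical spectrum \((\lambda_1, \lambda_2) = (3, 1)\) and \(k = 2\): \[ \text{PR} = \frac{(3 + 1)^2}{3^2 + 1^2} = \frac{16}{10} = 1.6 < 2 \]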

Step 5: Necessity. If representations span only a \(k\)-dimensional subspace \(\mathcal{S} = \text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}\), then \(\mathbf{z}_i - \bar{\mathbf{z}} = \sum_{j=1}^k c_{ij} \mathbf{v}_j\). The covariance becomes: \[ \Sigma = \frac{1}{n}\sum_{i=1}^n \left(\sum_{j=1}^k c_{ij} \mathbf{v}_j\right) \left(\sum_{j'=1}^k c_{ij'} \mathbf{v}_{j'}\right)^\top \] This matrix has rank at most \(k\) since it’s a sum of rank-1 matrices formed from \(k\) vectors. Therefore \(\text{rank}(\Sigma) \leq k\).

Step 6: Sufficiency. Conversely, if \(\text{rank}(\Sigma) = k\), then by the rank-nullity theorem, the null space \(\text{null}(\Sigma)\) has dimension \(d - k\). Any vector \(\mathbf{v} \in \text{null}(\Sigma)\) satisfies: \[ \mathbf{v}^\top \Sigma \mathbf{v} = \frac{1}{n}\sum_{i=1}^n [\mathbf{v}^\top (\mathbf{z}_i - \bar{\mathbf{z}})]^2 = 0 \] This implies \(\mathbf{v}^\top (\mathbf{z}_i - \bar{\mathbf{z}}) = 0\) for all \(i\), meaning all representations are orthogonal to \(\mathbf{v}\). Since this holds for all \(\mathbf{v}\) in the \((d-k)\)-dimensional null space, representations must lie in the orthogonal complement, which has dimension \(k\).

Therefore, \(\text{rank}(\Sigma) = k\) if and only if representations occupy a \(k\)-dimensional subspace, completing the proof. ∎

Interpretation: This theorem provides a precise characterization of dimensional collapse using the spectral properties of the feature covariance matrix. The number of non-zero eigenvalues directly determines the effective dimensionality of representations. The participation ratio offers a complementary measure that accounts for the distribution of variance across eigenvalues, not just their count.

Explicit ML Relevance: Practitioners can diagnose collapse by computing the eigenspectrum of \(\Sigma\). A representation with nominal dimension \(d = 512\) but \(\text{rank}_\epsilon(\Sigma) = 20\) for \(\epsilon = 10^{-3}\) indicates severe collapse. Regularization terms that maximize \(\text{tr}(\Sigma)\) or \(\text{PR}\) combat collapse. VICReg explicitly includes a hinge term on the standard deviation of each coordinate, \(\max(0, 1 - \sqrt{\Sigma_{jj} + \epsilon})\), which keeps the per-dimension variances (the diagonal of \(\Sigma\)) away from zero; combined with its covariance penalty, this discourages vanishing eigenvalues.

Invariance–Equivariance Decomposition Theorem

Formal Statement: Let \(G\) be a compact group acting on input space \(\mathcal{X}\) and let \(f_\theta: \mathcal{X} \to \mathcal{Z}\) be a continuous \(G\)-invariant representation map (for a general \(f_\theta\), the statement applies to its average over \(G\)). Then \(f_\theta\) admits a decomposition, unique up to the choice of equivariant lift: \[ f_\theta = \pi_{\text{inv}} \circ f_{\text{equiv}} \] where \(f_{\text{equiv}}: \mathcal{X} \to \mathcal{Z}_{\text{equiv}}\) is \(G\)-equivariant and \(\pi_{\text{inv}}: \mathcal{Z}_{\text{equiv}} \to \mathcal{Z}\) is a projection to the \(G\)-invariant subspace. Specifically, \(f_{\text{equiv}}(g \cdot \mathbf{x}) = \rho(g) \cdot f_{\text{equiv}}(\mathbf{x})\) for some representation \(\rho: G \to \text{Aut}(\mathcal{Z}_{\text{equiv}})\), and \(\pi_{\text{inv}}\) satisfies \(\pi_{\text{inv}}(\rho(g) \cdot \mathbf{z}) = \pi_{\text{inv}}(\mathbf{z})\) for all \(g \in G\). The invariance assumption is necessary: any composition \(\pi_{\text{inv}} \circ f_{\text{equiv}}\) of this form is itself \(G\)-invariant.

Full Formal Proof:

Step 1: Invariant averaging operator. Define the averaging operator over the group: \[ \mathcal{A}[h](\mathbf{x}) = \int_G h(g \cdot \mathbf{x}) \, d\mu(g) \] where \(\mu\) is the Haar measure on \(G\) (which exists for compact groups and is unique up to normalization). This operator satisfies: \[ \mathcal{A}[h](g' \cdot \mathbf{x}) = \int_G h(g \cdot g' \cdot \mathbf{x}) \, d\mu(g) = \int_G h(g'' \cdot \mathbf{x}) \, d\mu(g'') = \mathcal{A}[h](\mathbf{x}) \] by the right-invariance of Haar measure under the change of variables \(g'' = g g'\) (on a compact group the Haar measure is both left- and right-invariant). Thus \(\mathcal{A}[h]\) is \(G\)-invariant.

Step 2: Equivariant component construction. Define the lifted representation: \[ F_{\text{equiv}}: \mathcal{X} \to L^2(G, \mathcal{Z}), \quad F_{\text{equiv}}(\mathbf{x})(g) = f_\theta(g \cdot \mathbf{x}) \] This maps inputs to functions on the group taking values in \(\mathcal{Z}\). Let \(G\) act on \(L^2(G, \mathcal{Z})\) by the right regular representation: \[ [\rho(g') \cdot F](h) = F(h g') \] which satisfies \(\rho(g_1)\rho(g_2) = \rho(g_1 g_2)\). Then: \[ F_{\text{equiv}}(g' \cdot \mathbf{x})(g) = f_\theta(g \cdot g' \cdot \mathbf{x}) = f_\theta((g g') \cdot \mathbf{x}) = F_{\text{equiv}}(\mathbf{x})(g g') = [\rho(g') \cdot F_{\text{equiv}}(\mathbf{x})](g) \] Therefore \(F_{\text{equiv}}\) is \(G\)-equivariant.

Step 3: Projection to invariants. Define the projection: \[ \pi_{\text{inv}}(F) = \int_G F(g) \, d\mu(g) \] so that \(\pi_{\text{inv}}(F_{\text{equiv}}(\mathbf{x})) = \mathcal{A}[f_\theta](\mathbf{x})\). This integration over the group produces a \(G\)-invariant output: \[ \pi_{\text{inv}}(\rho(g') \cdot F) = \int_G F(g g') \, d\mu(g) = \int_G F(g'') \, d\mu(g'') = \pi_{\text{inv}}(F) \] by the change of variables \(g'' = g g'\) and the right-invariance of \(\mu\) (Haar measure on a compact group is bi-invariant).

Step 4: Composition. The original map can be recovered: \[ f_\theta(\mathbf{x}) = f_\theta(e \cdot \mathbf{x}) = F_{\text{equiv}}(\mathbf{x})(e) \] where \(e \in G\) is the identity. Alternatively, averaging gives: \[ \mathcal{A}[f_\theta](\mathbf{x}) = \int_G f_\theta(g \cdot \mathbf{x}) \, d\mu(g) = \pi_{\text{inv}}(F_{\text{equiv}}(\mathbf{x})) \] If \(f_\theta\) is already invariant, then \(f_\theta(\mathbf{x}) = \mathcal{A}[f_\theta](\mathbf{x})\), yielding the decomposition.

Step 5: Uniqueness. Suppose there are two decompositions: \(f_\theta = \pi_1 \circ f_1 = \pi_2 \circ f_2\) with both \(f_1, f_2\) equivariant and \(\pi_1, \pi_2\) invariant projections. Then for any \(\mathbf{x}\) and \(g \in G\): \[ \pi_1(f_1(\mathbf{x})) = f_\theta(\mathbf{x}) = \pi_2(f_2(\mathbf{x})) \] and: \[ \pi_1(f_1(g \cdot \mathbf{x})) = \pi_1(\rho_1(g) \cdot f_1(\mathbf{x})) = \pi_1(f_1(\mathbf{x})) \] by equivariance of \(f_1\) and invariance of \(\pi_1\). Similarly for \(\pi_2, f_2\). Averaging over \(G\): \[ \pi_1\left(\int_G f_1(g \cdot \mathbf{x}) \, d\mu(g)\right) = \pi_1\left(\int_G \rho_1(g) \cdot f_1(\mathbf{x}) \, d\mu(g)\right) \] If the representation \(\rho_1\) is irreducible and non-trivial, the average over the orbit is zero. But the invariant projection must match \(f_\theta(\mathbf{x})\), which determines the decomposition uniquely up to the choice of equivariant representation and invariant projection that together reconstruct \(f_\theta\). ∎

Interpretation: Every representation map naturally decomposes into an equivariant part (preserving group structure) and an invariant projection (collapsing orbits). This decomposition separates “position within orbit” (tracked by the equivariant component) from “which orbit” (the only information that survives the invariant projection). For neural networks, early layers often behave equivariantly while late layers project to invariant outputs for classification.

Explicit ML Relevance: Convolutional networks exemplify this decomposition: convolutional layers are translation-equivariant (the \(F_{\text{equiv}}\) part), while global pooling creates translation invariance (the \(\pi_{\text{inv}}\) part). Keeping layers equivariant for as long as possible before the final invariant projection can improve sample efficiency, because the symmetry is built in as a structural prior rather than learned from data. Group-equivariant CNNs extend this to rotation and other symmetries.

Spectral Structure of Feature Covariance

Formal Statement: Let \(\mathbf{z} = f_\theta(\mathbf{x})\) be representations learned by minimizing a loss \(\mathcal{L}(f_\theta(\mathbf{x}), y)\) with \(\ell_2\) weight decay regularization parameter \(\lambda > 0\) (the unindexed \(\lambda\) is always the weight decay parameter; \(\lambda_i\) always denotes an eigenvalue). Let \(\Sigma = \text{Cov}(\mathbf{z})\) be the representation covariance at convergence. Then the eigenvalue decay of \(\Sigma\) satisfies: \[ \lambda_i \leq \frac{C}{\sqrt{\lambda i}} \] for a constant \(C\) depending on the data dimension and the Lipschitz constant of \(f_\theta\), where eigenvalues are ordered \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d\). Furthermore, if data lie on a \(k\)-dimensional manifold whose image under \(f_\theta\) is contained in a \(k\)-dimensional affine subspace, then \(\lambda_i = 0\) for \(i > k\).

Full Formal Proof:

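Step 1: Weight decay effect on singular values. With \(\ell_2\) regularization, the optimization objective is: \[ \min_\theta \mathbb{E}[\mathcal{L}(f_\theta(\mathbf{x}), y)] + \frac{\lambda}{2} \|\theta\|^2 \] At convergence, the gradient condition is: \[ \nabla_\theta \mathbb{E}[\mathcal{L}] + \lambda \theta = 0 \] The bound on \(\theta\) comes from monotonicity rather than from stationarity: the regularized objective does not increase along gradient descent with a sufficiently small step size, so for a non-negative loss: \[ \frac{\lambda}{2} \|\theta\|^2 \leq \mathbb{E}[\mathcal{L}(f_\theta(\mathbf{x}), y)] + \frac{\lambda}{2} \|\theta\|^2 \leq \mathbb{E}[\mathcal{L}_0] + \frac{\lambda}{2} \|\theta_0\|^2 \] Hence \(\theta\) is bounded: \(\|\theta\|^2 \leq \frac{2}{\lambda} \mathbb{E}[\mathcal{L}_0] + \|\theta_0\|^2\), where \(\mathcal{L}_0\) is the unregularized loss at the initialization \(\theta_0\).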

Step 2: Lipschitz bound on representations. Assume \(f_\theta\) is \(L\)-Lipschitz in parameters: \(\|f_{\theta_1}(\mathbf{x}) - f_{\theta_2}(\mathbf{x})\| \leq L \|\theta_1 - \theta_2\|\). Then the range of representations is bounded: \[ \|\mathbf{z}\| = \|f_\theta(\mathbf{x})\| \leq \|f_0(\mathbf{x})\| + L\|\theta - \theta_0\| \leq R \] for some radius \(R\). This bounds the trace: \(\text{tr}(\Sigma) = \sum_i \lambda_i \leq R^2\).

Step 3: Covering number argument. Represent the correlation operator \(T: \mathcal{H} \to \mathcal{H}\) on the RKHS \(\mathcal{H}\) of the representation map, where: \[ (Tf)(\mathbf{x}) = \int_{\mathcal{X}} \langle f_\theta(\mathbf{x}), f_\theta(\mathbf{x}') \rangle f(\mathbf{x}') \, d\mathbb{P}(\mathbf{x}') \] The operator \(T\) has the same non-zero eigenvalues as the second-moment matrix \(\mathbb{E}[\mathbf{z}\mathbf{z}^\top]\), which coincides with \(\Sigma\) when the representations are centered. By results relating the eigenvalues of compact operators to entropy numbers (Carl’s inequality), the eigenvalue decay is controlled by the entropy numbers of the source space. For smooth \(d_{\mathcal{M}}\)-dimensional manifolds, covering numbers scale as \(N(\epsilon) \sim \epsilon^{-d_{\mathcal{M}}}\).

Step 4: Entropic bound on eigenvalues. The relationship between covering numbers and eigenvalue decay is: \[ \sum_{i=1}^N \lambda_i \geq \text{tr}(\Sigma) - \epsilon \cdot N(\epsilon) \] Optimizing over \(\epsilon\) and using the covering bound yields: \[ \lambda_N \lesssim \frac{\text{tr}(\Sigma)}{N \cdot d_{\mathcal{M}}/d} \] Inverting the relationship and using \(\text{tr}(\Sigma) \leq R^2\): \[ \lambda_i \lesssim \frac{R^2}{i^{d/d_{\mathcal{M}}}} \]

Step 5: Worst-case bound. Taking \(d_{\mathcal{M}} = d/2\) as an illustrative case gives: \[ \lambda_i \lesssim \frac{R^2}{i^2} \] Since \(i^{-2} \leq i^{-1/2}\) for all \(i \geq 1\), this implies the weaker bound: \[ \lambda_i \leq \frac{C}{\sqrt{i}} \] where the constant \(C\) absorbs \(R^2\) and, through Steps 1 and 2, the dependence on the weight decay parameter. This is the power-law decay stated in the theorem.

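Step 6: Manifold collapse. If data lie exactly on a \(k\)-dimensional manifold \(\mathcal{M}\), then the image \(f_\theta(\mathcal{M})\) is at most \(k\)-dimensional, because the Jacobian of \(f_\theta\) restricted to \(\mathcal{M}\) has rank at most \(k\). This alone does not bound \(\text{rank}(\Sigma)\): a curved \(k\)-dimensional image can span more than \(k\) linear directions. Under the additional assumption that \(f_\theta(\mathcal{M})\) is contained in a \(k\)-dimensional affine subspace (for instance, a linear encoder applied to data on a linear subspace), all centered representations \(\mathbf{z}_i - \bar{\mathbf{z}}\) lie in a \(k\)-dimensional linear subspace. Therefore \(\text{rank}(\Sigma) \leq k\), implying \(\lambda_{k+1} = \cdots = \lambda_d = 0\). ∎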

Interpretation: This theorem explains the spectral structure observed in learned representations: eigenvalues decay following a power law determined by the intrinsic data dimensionality. Weight decay regularization bounds the trace, and therefore bounds every individual eigenvalue, but it does not by itself prevent a few leading eigenvalues from carrying most of the variance. When the representations are confined to a low-dimensional subspace, the covariance matrix has the corresponding rank deficiency.

Explicit ML Relevance: Practitioners observe approximately power-law eigenvalue decay empirically in trained networks. A fit such as \(\lambda_i \approx \lambda_1 / i^{1.5}\) illustrates the kind of decay seen in ImageNet features, although the exponent depends on the architecture, the layer, and the training setup. This justifies dimensionality reduction: if \(\lambda_i\) decays quickly, truncating to the top \(k\) eigenvectors loses little variance. Spectral regularization (whitening, spectral normalization) can flatten the spectrum, improving conditioning.
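The spectrum of a feature covariance is easy to inspect directly. The sketch below is a toy illustration under stated assumptions: the data lie on a \(k\)-dimensional linear subspace and the encoder is linear (the hypothetical matrix `A`), so the rank statement holds exactly. It also checks the elementary bound \(\lambda_i \leq \text{tr}(\Sigma)/i\), which follows from the ordering of the eigenvalues alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 2000, 5, 32            # samples, intrinsic dimension, feature dimension

# Assumption: linear subspace data and a linear encoder, so rank(Sigma) <= k exactly.
# A curved manifold or a nonlinear encoder can spread variance over more directions.
latent = rng.normal(size=(n, k)) * np.arange(k, 0, -1)   # unequal variances per latent axis
A = rng.normal(size=(k, d))                               # hypothetical linear encoder
Z = latent @ A                                            # representations, shape (n, d)

Sigma = np.cov(Z, rowvar=False)
eigvals = np.linalg.eigvalsh(Sigma)[::-1]                 # descending order
eigvals = np.clip(eigvals, 0.0, None)                     # remove tiny negative round-off

trace = eigvals.sum()
ranks = np.arange(1, d + 1)
# Ordered eigenvalues always satisfy lambda_i <= tr(Sigma) / i.
trace_bound_holds = bool(np.all(eigvals <= trace / ranks + 1e-9))
# Count eigenvalues that are non-negligible relative to the largest one.
numerical_rank = int(np.sum(eigvals > 1e-8 * eigvals[0]))
print(trace_bound_holds, numerical_rank)
```

The relative threshold `1e-8` used for the numerical rank is a judgment call; in practice it should be set with the noise level of the features in mind.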

Optimization-Induced Bias in Representation Learning

Formal Statement: Let \(f_\theta: \mathcal{X} \to \mathbb{R}^d\) be a neural network trained with gradient descent on loss \(\mathcal{L}\), starting from initialization \(\theta_0\). Define the neural tangent kernel (NTK) at initialization as \(K_0(\mathbf{x}, \mathbf{x}') = \langle \nabla_\theta f_{\theta_0}(\mathbf{x}), \nabla_\theta f_{\theta_0}(\mathbf{x}') \rangle\). If the network is sufficiently wide and trained with learning rate \(\eta \to 0\), then the learned representation satisfies: \[ f_{\theta_t}(\mathbf{x}) = f_{\theta_0}(\mathbf{x}) - \int_0^t \nabla_\theta f_{\theta_0}(\mathbf{x})^\top \nabla_\theta \mathcal{L}(\theta_s) \, ds \] and the representation evolves according to the kernel gradient flow: \[ \frac{d f_{\theta_t}(\mathbf{x})}{dt} = -\int_{\mathcal{X} \times \mathcal{Y}} K_0(\mathbf{x}, \mathbf{x}') \nabla_{f} \mathcal{L}(f_{\theta_t}(\mathbf{x}'), y') \, d\mathbb{P}(\mathbf{x}', y') \] The minus sign in both expressions reflects descent along the loss gradient.

Full Formal Proof:

Step 1: Gradient descent dynamics in function space. Under gradient descent with learning rate \(\eta\): \[ \theta_{t+1} = \theta_t - \eta \nabla_\theta \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}[\mathcal{L}(f_{\theta_t}(\mathbf{x}), y)] \] In continuous time (\(\eta \to 0\) with \(t = k\eta\)): \[ \frac{d\theta_t}{dt} = -\nabla_\theta \mathbb{E}[\mathcal{L}(f_{\theta_t}(\mathbf{x}), y)] \]

Step 2: Evolution of representations. By the chain rule: \[ \frac{df_{\theta_t}(\mathbf{x})}{dt} = \frac{\partial f_{\theta_t}(\mathbf{x})}{\partial \theta} \cdot \frac{d\theta_t}{dt} = \nabla_\theta f_{\theta_t}(\mathbf{x})^\top \cdot (-\nabla_\theta \mathbb{E}[\mathcal{L}]) \] \[ = -\mathbb{E}_{\mathbf{x}', y'} \left[ \nabla_\theta f_{\theta_t}(\mathbf{x})^\top \nabla_\theta \mathcal{L}(f_{\theta_t}(\mathbf{x}'), y') \right] \]

Step 3: NTK regime assumption. In the infinite-width limit or lazy training regime, the kernel \(K_t(\mathbf{x}, \mathbf{x}') = \langle \nabla_\theta f_{\theta_t}(\mathbf{x}), \nabla_\theta f_{\theta_t}(\mathbf{x}') \rangle\) remains approximately constant: \(K_t(\mathbf{x}, \mathbf{x}') \approx K_0(\mathbf{x}, \mathbf{x}')\) for all \(t\). This occurs when the network is overparameterized and parameters change minimally from initialization.

Step 4: Kernel gradient flow equation. Under the NTK approximation and using the chain rule \(\nabla_\theta \mathcal{L} = \nabla_\theta f_\theta \cdot \nabla_f \mathcal{L}\): \[ \frac{df_{\theta_t}(\mathbf{x})}{dt} = -\mathbb{E}_{\mathbf{x}', y'} \left[ \nabla_\theta f_{\theta_0}(\mathbf{x})^\top \nabla_\theta f_{\theta_0}(\mathbf{x}') \nabla_f \mathcal{L}(f_{\theta_t}(\mathbf{x}'), y') \right] \] \[ = -\int_{\mathcal{X} \times \mathcal{Y}} K_0(\mathbf{x}, \mathbf{x}') \nabla_f \mathcal{L}(f_{\theta_t}(\mathbf{x}'), y') \, d\mathbb{P}(\mathbf{x}', y') \]

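Step 5: Integral solution. Integrating from \(t=0\) to \(t=T\): \[ f_{\theta_T}(\mathbf{x}) = f_{\theta_0}(\mathbf{x}) - \int_0^T \int_{\mathcal{X} \times \mathcal{Y}} K_0(\mathbf{x}, \mathbf{x}') \nabla_f \mathcal{L}(f_{\theta_t}(\mathbf{x}'), y') \, d\mathbb{P}(\mathbf{x}', y') \, dt \] This can be rewritten as: \[ f_{\theta_T}(\mathbf{x}) = f_{\theta_0}(\mathbf{x}) - \int_0^T \nabla_\theta f_{\theta_0}(\mathbf{x})^\top \nabla_\theta \mathcal{L}(\theta_t) \, dt \] where \(\nabla_\theta \mathcal{L}(\theta_t) = \mathbb{E}_{\mathbf{x}', y'}[\nabla_\theta f_{\theta_0}(\mathbf{x}') \nabla_f \mathcal{L}(f_{\theta_t}(\mathbf{x}'), y')]\) is the parameter gradient evaluated with the Jacobian frozen at initialization. The sign is negative because the flow descends the loss.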

Step 6: Bias from initialization. The final representation depends on:
  1. Initial features \(f_{\theta_0}(\mathbf{x})\)
  2. The kernel \(K_0(\mathbf{x}, \mathbf{x}')\) determined by architecture and initialization
  3. The trajectory of gradients \(\nabla_f \mathcal{L}\) over training

The kernel \(K_0\) encodes architectural bias—convolutional networks have translation-invariant kernels, attention networks have data-dependent kernels. This initialization-dependent bias determines which representations are easily learnable (large kernel values facilitate learning) and which are not (small kernel values impede learning). ∎
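The equivalence between parameter-space descent and kernel descent in function space can be verified exactly in the one setting where the tangent kernel is constant by construction: a model that is linear in its parameters. The sketch below assumes such a model with hypothetical fixed features `Phi`, a squared loss, and discrete gradient steps; it is not a simulation of a wide neural network.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                                # training points, parameters (p >> n)
Phi = rng.normal(size=(n, p)) / np.sqrt(p)    # fixed features: f_theta(x_i) = Phi[i] @ theta
y = rng.normal(size=n)
theta = rng.normal(size=p)                    # theta_0
f = Phi @ theta                               # f_{theta_0} on the training set
K0 = Phi @ Phi.T                              # tangent kernel; constant since f is linear in theta

eta, steps = 5.0, 300
for _ in range(steps):
    # Parameter space: gradient step on the loss 0.5 * mean((f - y)^2).
    theta = theta - eta * Phi.T @ (Phi @ theta - y) / n
    # Function space: the same step expressed through the kernel only.
    f = f - eta * K0 @ (f - y) / n

gap = np.max(np.abs(Phi @ theta - f))         # disagreement between the two views
residual = np.max(np.abs(f - y))              # training error after descent
print(gap, residual)
```

The two trajectories agree to round-off because \(\Phi\,\Phi^\top\) is exactly the kernel that appears when the parameter update is pushed through the model. For a real network the agreement holds only approximately, to the extent that the kernel stays close to \(K_0\).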

Interpretation: In the neural tangent kernel regime, learned representations are perturbations of random initial features, weighted by the kernel \(K_0\). The optimization is biased toward learning functions that align with \(K_0\)’s eigenfunctions. This explains why architecture matters: different architectures have different kernels, biasing toward different function classes.

Explicit ML Relevance: This theorem explains empirical observations in overparameterized networks: features change minimally from initialization, yet task performance improves dramatically. The bias toward kernel eigenmodes explains why certain architectures excel at certain tasks—CNNs have locality bias (beneficial for images), Transformers have global bias (beneficial for sequences). Choosing architecture means choosing inductive bias encoded in \(K_0\).

Stability of Latent Representations Under Perturbation

Formal Statement: Let \(f_\theta: \mathcal{X} \to \mathbb{R}^d\) be a representation map satisfying Lipschitz conditions in inputs and parameters, uniformly over the inputs and parameters considered: \[ \|f_\theta(\mathbf{x}_1) - f_\theta(\mathbf{x}_2)\| \leq L_x \|\mathbf{x}_1 - \mathbf{x}_2\| \] \[ \|f_{\theta_1}(\mathbf{x}) - f_{\theta_2}(\mathbf{x})\| \leq L_\theta \|\theta_1 - \theta_2\| \] Then for input perturbation \(\|\delta_x\| \leq \epsilon_x\) and parameter perturbation \(\|\delta_\theta\| \leq \epsilon_\theta\), the representation perturbation is bounded: \[ \|f_{\theta + \delta_\theta}(\mathbf{x} + \delta_x) - f_\theta(\mathbf{x})\| \leq L_x \epsilon_x + L_\theta \epsilon_\theta \] Furthermore, if \(f_\theta\) is trained with adversarial robustness regularization \(\max_{\|\delta\| \leq \epsilon} \mathcal{L}(f_\theta(\mathbf{x} + \delta), y)\), then, under the assumptions made in the proof and up to a constant of order one, the Lipschitz constant satisfies: \[ L_x \lesssim \frac{\mathcal{L}_{\max}}{\epsilon} \] where \(\mathcal{L}_{\max}\) is the maximum loss value.

Full Formal Proof:

Step 1: Triangle inequality application. For perturbed inputs and parameters: \[ \|f_{\theta + \delta_\theta}(\mathbf{x} + \delta_x) - f_\theta(\mathbf{x})\| \leq \|f_{\theta + \delta_\theta}(\mathbf{x} + \delta_x) - f_{\theta + \delta_\theta}(\mathbf{x})\| + \|f_{\theta + \delta_\theta}(\mathbf{x}) - f_\theta(\mathbf{x})\| \]

Step 2: Apply Lipschitz conditions. The first term is bounded by input Lipschitz continuity: \[ \|f_{\theta + \delta_\theta}(\mathbf{x} + \delta_x) - f_{\theta + \delta_\theta}(\mathbf{x})\| \leq L_x \|\delta_x\| \leq L_x \epsilon_x \] The second term is bounded by parameter Lipschitz continuity: \[ \|f_{\theta + \delta_\theta}(\mathbf{x}) - f_\theta(\mathbf{x})\| \leq L_\theta \|\delta_\theta\| \leq L_\theta \epsilon_\theta \]

Step 3: Combine bounds. Summing: \[ \|f_{\theta + \delta_\theta}(\mathbf{x} + \delta_x) - f_\theta(\mathbf{x})\| \leq L_x \epsilon_x + L_\theta \epsilon_\theta \] This establishes the perturbation bound.
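The additive bound of Steps 1 to 3 can be checked numerically for a linear map, where both Lipschitz constants are available in closed form. The sketch below is an illustration under stated assumptions: \(f_\theta(\mathbf{x}) = W\mathbf{x}\), the parameter norm is the Frobenius norm, and the constants `L_x` and `L_theta` are chosen to be valid over the whole perturbation ball.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 10, 4
W = rng.normal(size=(d_out, d_in))          # parameters theta of f_theta(x) = W x
x = rng.normal(size=d_in)
eps_x, eps_theta = 0.1, 0.05

# Assumed Lipschitz constants for this sketch:
# any W' with ||W' - W||_F <= eps_theta has spectral norm at most ||W||_2 + eps_theta,
# and the sensitivity to parameters at the fixed input x is ||x||.
L_x = np.linalg.norm(W, 2) + eps_theta
L_theta = np.linalg.norm(x)
bound = L_x * eps_x + L_theta * eps_theta

worst = 0.0
for _ in range(1000):
    dx = rng.normal(size=d_in)
    dx *= eps_x / np.linalg.norm(dx)                    # ||dx|| = eps_x
    dW = rng.normal(size=W.shape)
    dW *= eps_theta / np.linalg.norm(dW)                # Frobenius norm = eps_theta
    change = np.linalg.norm((W + dW) @ (x + dx) - W @ x)
    worst = max(worst, change)

print(worst, bound)
```

Random perturbations rarely approach the bound, since the bound is attained only when the input and parameter perturbations align with the worst-case directions.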

Step 4: Adversarial training constraint. Suppose \(f_\theta\) minimizes: \[ \min_\theta \mathbb{E}_{(\mathbf{x}, y)} \left[ \max_{\|\delta\| \leq \epsilon} \mathcal{L}(f_\theta(\mathbf{x} + \delta), y) \right] \] At the optimum, for any \(\mathbf{x}\) and adversarial perturbation \(\delta^*\) achieving the maximum: \[ \mathcal{L}(f_\theta(\mathbf{x} + \delta^*), y) \leq \mathcal{L}_{\max} \]

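Step 5: Lipschitz constant bound. By definition of Lipschitz constant: \[ \|f_\theta(\mathbf{x} + \delta^*) - f_\theta(\mathbf{x})\| \leq L_x \|\delta^*\| \leq L_x \epsilon \] The adversarial loss satisfies (assuming convex loss): \[ \mathcal{L}(f_\theta(\mathbf{x} + \delta^*), y) - \mathcal{L}(f_\theta(\mathbf{x}), y) \geq \nabla_f \mathcal{L} \cdot [f_\theta(\mathbf{x} + \delta^*) - f_\theta(\mathbf{x})] \] Cauchy-Schwarz and the Lipschitz inequality by themselves only give upper bounds on the right-hand side, so two further assumptions are needed to obtain a bound on \(L_x\): (i) the adversarial displacement is aligned with the loss gradient, so that the inner product equals the product of the norms; and (ii) the worst-case perturbation attains the Lipschitz constant, \(\|f_\theta(\mathbf{x} + \delta^*) - f_\theta(\mathbf{x})\| = L_x \epsilon\). Under these assumptions: \[ \mathcal{L}_{\max} - \mathcal{L}_{\min} \geq \|\nabla_f \mathcal{L}\| \cdot \|f_\theta(\mathbf{x} + \delta^*) - f_\theta(\mathbf{x})\| \geq C \cdot L_x \epsilon \] where \(C = \inf \|\nabla_f \mathcal{L}\| > 0\). Rearranging: \[ L_x \leq \frac{\mathcal{L}_{\max} - \mathcal{L}_{\min}}{C \epsilon} \leq \frac{\mathcal{L}_{\max}}{C \epsilon} \] assuming \(\mathcal{L}_{\min} \geq 0\). The form \(L_x \leq \mathcal{L}_{\max}/\epsilon\) in the statement corresponds to \(C\) of order one. ∎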

Interpretation: Representation stability is guaranteed by Lipschitz continuity, with perturbation bounds adding linearly for input and parameter perturbations. Adversarial training explicitly controls the input Lipschitz constant, producing representations robust to adversarial attacks. The bound \(L_x \leq \mathcal{L}_{\max} / \epsilon\) shows that achieving low adversarial loss requires bounded sensitivity.

Explicit ML Relevance: Certified robustness methods compute explicit Lipschitz bounds to guarantee representation stability. For safety-critical applications (medical diagnosis, autonomous vehicles), stability ensures predictions don’t change dramatically under small measurement noise or model updates. Regularization techniques like spectral normalization and Lipschitz penalties control \(L_x\) and \(L_\theta\) directly, improving robustness.

Contrastive Objective Geometric Separation Theorem

Formal Statement: Let \(f_\theta: \mathcal{X} \to \mathbb{S}^{d-1}\) map inputs to the unit hypersphere (via normalization) and consider the contrastive loss: \[ \mathcal{L}_{\text{NT-Xent}}(\mathbf{z}_i, \mathbf{z}_i^+, \{\mathbf{z}_j^-\}_{j=1}^N) = -\log \frac{\exp(\mathbf{z}_i^\top \mathbf{z}_i^+ / \tau)}{\exp(\mathbf{z}_i^\top \mathbf{z}_i^+ / \tau) + \sum_{j=1}^N \exp(\mathbf{z}_i^\top \mathbf{z}_j^- / \tau)} \] where \(\tau > 0\) is temperature. Then at the global minimum with \(N \to \infty\), positive pairs are aligned up to \(O(\tau)\): \[ \cos(\theta_{++}) = \mathbf{z}_i^\top \mathbf{z}_i^+ \geq 1 - O(\tau) \] and negative pairs achieve: \[ \cos(\theta_{+-}) = \mathbf{z}_i^\top \mathbf{z}_j^- \leq -\frac{1}{d-1} + O(\tau) \] where \(\theta_{++}\) is the angle between positives and \(\theta_{+-}\) is the angle between anchor and negative. The value \(-1/(d-1)\) is the pairwise cosine of a regular simplex with \(d\) vertices, so the negative-pair bound applies when the negatives fall into at most \(d\) well-separated clusters; for negatives spread densely over the sphere, the average cosine is \(0\).

Full Formal Proof:

Step 1: Gradient of contrastive loss. The gradient with respect to the anchor representation \(\mathbf{z}_i\) is: \[ \nabla_{\mathbf{z}_i} \mathcal{L} = -\frac{1}{\tau}\left[ \mathbf{z}_i^+ - \frac{\exp(\mathbf{z}_i^\top \mathbf{z}_i^+ / \tau) \mathbf{z}_i^+ + \sum_j \exp(\mathbf{z}_i^\top \mathbf{z}_j^- / \tau) \mathbf{z}_j^-}{\exp(\mathbf{z}_i^\top \mathbf{z}_i^+ / \tau) + \sum_j \exp(\mathbf{z}_i^\top \mathbf{z}_j^- / \tau)} \right] \] Write \(S_i = \exp(\mathbf{z}_i^\top \mathbf{z}_i^+ / \tau) + \sum_k \exp(\mathbf{z}_i^\top \mathbf{z}_k^- / \tau)\) for the shared normalizer and define the soft assignments: \[ p^+ = \frac{\exp(\mathbf{z}_i^\top \mathbf{z}_i^+ / \tau)}{S_i}, \quad p_j^- = \frac{\exp(\mathbf{z}_i^\top \mathbf{z}_j^- / \tau)}{S_i} \] so that \(p^+ + \sum_j p_j^- = 1\). Then: \[ \nabla_{\mathbf{z}_i} \mathcal{L} = -\frac{1}{\tau}\left[ \mathbf{z}_i^+ - p^+ \mathbf{z}_i^+ - \sum_j p_j^- \mathbf{z}_j^- \right] = -\frac{1}{\tau}\left[ (1 - p^+) \mathbf{z}_i^+ - \sum_j p_j^- \mathbf{z}_j^- \right] \]
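The closed-form gradient of Step 1 can be checked against finite differences. The sketch below is a minimal single-anchor implementation written for this check; the function names are illustrative, the gradient is taken in the ambient space (without the tangent-space projection of the next step), and a log-sum-exp shift is used so that small \(\tau\) does not overflow.

```python
import numpy as np


def nt_xent(z, z_pos, z_neg, tau):
    """Loss for one anchor z, one positive z_pos, and negatives z_neg (one per row)."""
    logits = np.concatenate(([z @ z_pos], z_neg @ z)) / tau
    m = logits.max()                                   # log-sum-exp shift for stability
    return -(logits[0] - m) + np.log(np.exp(logits - m).sum())


def nt_xent_grad(z, z_pos, z_neg, tau):
    """Analytic gradient with respect to the anchor, in the form derived in Step 1."""
    logits = np.concatenate(([z @ z_pos], z_neg @ z)) / tau
    p = np.exp(logits - logits.max())
    p /= p.sum()                                       # p[0] = p^+, p[1:] = p_j^-
    return -((1.0 - p[0]) * z_pos - p[1:] @ z_neg) / tau


def unit(v):
    """Normalize vectors (or rows of a matrix) to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


rng = np.random.default_rng(0)
d, N, tau = 8, 16, 0.5
z = unit(rng.normal(size=d))
z_pos = unit(rng.normal(size=d))
z_neg = unit(rng.normal(size=(N, d)))

# Central finite differences along each coordinate axis.
h = 1e-6
numeric = np.array([
    (nt_xent(z + h * e, z_pos, z_neg, tau) - nt_xent(z - h * e, z_pos, z_neg, tau)) / (2 * h)
    for e in np.eye(d)
])
max_err = np.max(np.abs(numeric - nt_xent_grad(z, z_pos, z_neg, tau)))
print(max_err)
```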

Step 2: Optimum condition. At a critical point on the hypersphere (accounting for the normalization constraint \(\|\mathbf{z}_i\| = 1\)), the projected gradient vanishes: \[ (I - \mathbf{z}_i \mathbf{z}_i^\top) \nabla_{\mathbf{z}_i} \mathcal{L} = \mathbf{0} \] This implies: \[ (1 - p^+) \mathbf{z}_i^+ - \sum_j p_j^- \mathbf{z}_j^- = \lambda \mathbf{z}_i \] for some Lagrange multiplier \(\lambda\).

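Step 3: Positive pair analysis. In the limit \(N \to \infty\) with uniformly distributed negatives on the sphere, the negative contribution averages to: \[ \sum_j p_j^- \mathbf{z}_j^- \approx \mathbb{E}_{\mathbf{z}^- \sim \text{uniform}(\mathbb{S}^{d-1})}[p(\mathbf{z}^-) \mathbf{z}^-] = c \, \mathbf{z}_i \] for some scalar \(c \geq 0\). The weights \(p(\mathbf{z}^-)\) depend on \(\mathbf{z}^-\) only through \(\mathbf{z}_i^\top \mathbf{z}^-\), so by rotational symmetry about the axis \(\mathbf{z}_i\) the components orthogonal to \(\mathbf{z}_i\) cancel. The average is not zero, because the weights favor negatives close to \(\mathbf{z}_i\), but it is parallel to \(\mathbf{z}_i\) and can be absorbed into the multiplier: \[ (1 - p^+) \mathbf{z}_i^+ \approx (\lambda + c) \mathbf{z}_i \] Hence \(\mathbf{z}_i^+\) is parallel to \(\mathbf{z}_i\) whenever \(p^+ < 1\). Taking inner product with \(\mathbf{z}_i\): \[ (1 - p^+) \mathbf{z}_i^\top \mathbf{z}_i^+ = \lambda + c \] Taking norms and using \(\|\mathbf{z}_i^+\| = \|\mathbf{z}_i\| = 1\): \[ (1 - p^+) = |\lambda + c| \] so \(\mathbf{z}_i^\top \mathbf{z}_i^+ = \pm 1\), and the loss-minimizing choice is \(\mathbf{z}_i^+ \approx \mathbf{z}_i\).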

Step 4: Temperature dependence. For small \(\tau\), if \(\mathbf{z}_i^\top \mathbf{z}_i^+ \approx 1\) and all negatives have a common similarity \(\langle \mathbf{z}_i, \mathbf{z}_j^- \rangle\) to the anchor, then: \[ p^+ = \frac{\exp(1/\tau)}{\exp(1/\tau) + N \exp(\langle \mathbf{z}_i, \mathbf{z}_j^- \rangle / \tau)} \approx 1 - N \exp((\langle \mathbf{z}_i, \mathbf{z}_j^- \rangle - 1)/\tau) \] For large \(N\) with \(\langle \mathbf{z}_i, \mathbf{z}_j^- \rangle \approx -1/(d-1)\) (the pairwise cosine of a regular simplex with \(d\) vertices), we need: \[ \mathbf{z}_i^\top \mathbf{z}_i^+ \geq 1 - O(\tau) \] to maintain \(p^+ \approx 1\). This gives the positive pair bound.

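Step 5: Negative pair analysis. For negatives, the loss is minimized when: \[ \mathbf{z}_i^\top \mathbf{z}_j^- \ll \mathbf{z}_i^\top \mathbf{z}_i^+ \] On the sphere, the most negative cosine that \(d\) points can share pairwise is \(-1/(d-1)\), attained by a regular simplex with \(d\) vertices, so maximum separation occurs when \(\mathbf{z}_i^\top \mathbf{z}_j^- \to -1/(d-1)\). By comparison, independent uniform random vectors have expected cosine \(0\) with fluctuations of order \(1/\sqrt{d}\), and with more than \(d\) distinct points the average pairwise cosine cannot stay at the simplex value. Deviations from the simplex value are \(O(\tau)\) since the softmax smoothing with temperature \(\tau\) allows some overlap. Therefore, for at most \(d\) well-separated clusters: \[ \cos(\theta_{+-}) \leq -\frac{1}{d-1} + O(\tau) \]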

Step 6: Geometric interpretation. The positive pairs occupy cones of angular width \(\theta_{++} \lesssim \sqrt{\tau}\) around cluster centers, which follows from \(\cos\theta_{++} \approx 1 - \theta_{++}^2/2 \geq 1 - O(\tau)\). Negative pairs are pushed to be nearly orthogonal (in high dimensions, random vectors are nearly orthogonal with \(|\cos(\theta)| \approx 1/\sqrt{d}\)). The temperature \(\tau\) controls the sharpness: small \(\tau\) creates tight clusters, large \(\tau\) allows more spread. ∎
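Two geometric facts used in the proof can be confirmed numerically: random unit vectors in \(\mathbb{R}^d\) have cosines of typical magnitude \(1/\sqrt{d}\), and a regular simplex with \(K\) vertices has pairwise cosine exactly \(-1/(K-1)\). The sketch below is illustrative; the simplex is built by centering and normalizing the standard basis, which is one convenient construction among several.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 64, 20000


def unit(v):
    """Normalize rows to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


# Cosines between independent uniform random directions.
a = unit(rng.normal(size=(n_pairs, d)))
b = unit(rng.normal(size=(n_pairs, d)))
cos = np.sum(a * b, axis=1)
rms_cos = np.sqrt(np.mean(cos ** 2))       # root-mean-square cosine, close to 1 / sqrt(d)

# Regular simplex with K vertices: center the standard basis of R^K, then normalize.
K = d
simplex = unit(np.eye(K) - 1.0 / K)
gram = simplex @ simplex.T
off_diag = gram[~np.eye(K, dtype=bool)]    # all pairwise cosines between distinct vertices

print(rms_cos * np.sqrt(d), off_diag.min(), off_diag.max(), -1.0 / (K - 1))
```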

Interpretation: Contrastive learning creates angular geometric structure on the hypersphere: positive pairs cluster tightly with small angular separation (\(\theta_{++} \lesssim \sqrt{\tau}\)), while negative pairs achieve the maximum angular separation allowed by the sphere’s geometry. In high dimensions, this maximum separation is near-orthogonality.

Explicit ML Relevance: The theorem is consistent with empirical observations in SimCLR and MoCo: with \(\tau = 0.1\) and batch size 4096, same-image augmentations typically achieve cosine similarity \(> 0.9\) (angle \(< 26°\)), while different images achieve cosine similarity \(\approx 0\) (angle \(\approx 90°\)). Tuning temperature trades off cluster tightness versus separation sharpness. Very small \(\tau\) risks numerical instability as \(\exp(s/\tau)\) overflows unless the loss is computed with a log-sum-exp shift.

Degeneracy Under Rank Deficiency

Formal Statement: Let \(f_\theta: \mathbb{R}^n \to \mathbb{R}^d\) be a representation map with covariance \(\Sigma = \text{Cov}(\mathbf{z})\) having rank \(k < d\). Then there exists a \((d-k)\)-dimensional null space \(\mathcal{N} = \text{null}(\Sigma)\) such that:
  1. For any \(\mathbf{v} \in \mathcal{N}\) and any data point \(\mathbf{x}\), \(\mathbf{v}^\top f_\theta(\mathbf{x}) = \mathbf{v}^\top \bar{\mathbf{z}}\) (representations project to the mean in null directions)
  2. Any downstream task with loss \(\mathcal{L}(W\mathbf{z}, y)\) where \(W\) has non-zero components in \(\mathcal{N}\) is degenerate in those components
  3. The effective capacity of the representation is bounded by \(\text{rank}(\Sigma)\), not the nominal dimension \(d\)

Full Formal Proof:

Step 1: Null space characterization. By the spectral theorem, \(\Sigma = \sum_{i=1}^k \lambda_i \mathbf{q}_i \mathbf{q}_i^\top\) where \(\lambda_1 \geq \cdots \geq \lambda_k > 0\) and \(\lambda_{k+1} = \cdots = \lambda_d = 0\). The null space is: \[ \mathcal{N} = \text{span}\{\mathbf{q}_{k+1}, \ldots, \mathbf{q}_d\} \] For any \(\mathbf{v} = \sum_{i=k+1}^d c_i \mathbf{q}_i \in \mathcal{N}\): \[ \mathbf{v}^\top \Sigma \mathbf{v} = \sum_{i=k+1}^d c_i^2 \lambda_i = 0 \]

Step 2: Zero variance in null directions. The variance in direction \(\mathbf{v}\) is: \[ \text{Var}(\mathbf{v}^\top \mathbf{z}) = \mathbf{v}^\top \Sigma \mathbf{v} = 0 \] This implies: \[ \mathbb{E}[(\mathbf{v}^\top \mathbf{z} - \mathbb{E}[\mathbf{v}^\top \mathbf{z}])^2] = 0 \] Therefore: \[ \mathbf{v}^\top \mathbf{z} = \mathbf{v}^\top \mathbb{E}[\mathbf{z}] = \mathbf{v}^\top \bar{\mathbf{z}} \] almost surely for all data points. Any representation component in the null space is constant.

Step 3: Degenerate downstream learning. Consider a linear downstream task \(\hat{y} = W\mathbf{z}\) with weight matrix \(W \in \mathbb{R}^{m \times d}\). Decompose \(W = W_\mathcal{R} + W_\mathcal{N}\) where: \[ W_\mathcal{R} = \sum_{i=1}^k (W\mathbf{q}_i) \mathbf{q}_i^\top, \quad W_\mathcal{N} = \sum_{i=k+1}^d (W\mathbf{q}_i) \mathbf{q}_i^\top \] project onto the range and null spaces respectively. Then: \[ W\mathbf{z} = W_\mathcal{R} \mathbf{z} + W_\mathcal{N} \mathbf{z} = W_\mathcal{R} \mathbf{z} + W_\mathcal{N} \bar{\mathbf{z}} \] The second term is constant for all inputs. Optimization can set arbitrary values of \(W_\mathcal{N}\) without affecting predictions (absorbed into bias term). The effective parameters are only \(W_\mathcal{R}\), reducing capacity.
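The degeneracy described in Step 3 can be observed directly: changing a linear head only along null directions of \(\Sigma\) shifts every prediction by the same constant vector. The sketch below builds synthetic rank-\(k\) features for illustration; the dimensions and the random head `W` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, m = 500, 8, 3, 2          # samples, nominal dimension, rank, output dimension

# Rank-k representations: a fixed offset plus variation in only k directions.
z_bar = rng.normal(size=d)
B = rng.normal(size=(k, d))
Z = z_bar + rng.normal(size=(n, k)) @ B

Sigma = np.cov(Z, rowvar=False)
eigvals, Q = np.linalg.eigh(Sigma)           # eigenvalues in ascending order
null_basis = Q[:, : d - k]                   # eigenvectors with numerically zero eigenvalue

W = rng.normal(size=(m, d))
# Perturb W only along null directions: each row of the update lies in the null space.
W_perturbed = W + rng.normal(size=(m, d - k)) @ null_basis.T

diff = Z @ W_perturbed.T - Z @ W.T           # change in predictions, shape (n, m)
spread = np.ptp(diff, axis=0).max()          # how much that change varies across samples
null_variance = eigvals[: d - k].max()       # largest variance along a null direction
print(spread, null_variance, np.abs(diff).max())
```

The change `diff` is non-zero but identical for every sample, so a bias term absorbs it and the null-space components of the head carry no usable degrees of freedom.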

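Step 4: Information loss bound. For a deterministic encoder with continuous outputs the mutual information is not finite in general, so a noise model is needed. Under a Gaussian model with unit-variance additive noise on the representation, the mutual information between input and representation is bounded by: \[ I(\mathbf{X}; \mathbf{Z}) \leq \frac{1}{2} \log \det(I + \Sigma) = \frac{1}{2} \sum_{i=1}^k \log(1 + \lambda_i) \] Only the \(k\) non-zero eigenvalues contribute. When \(k \ll d\), the number of directions that can carry information is reduced from \(d\) to \(k\).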

Step 5: Rank as effective dimension. Define a task requiring \(\ell\) dimensions to solve optimally (e.g., \(\ell\)-way classification with a linear head needs \(\ell-1\) dimensions in general). If \(k < \ell\), then a linear head on the representation cannot achieve optimal performance. As a crude heuristic, not a derived bound, the achievable accuracy scales as: \[ \text{Acc}_{\max} \lesssim \text{Acc}_{\text{opt}} \cdot (k/\ell) \] where \(\text{Acc}_{\text{opt}}\) is the optimal accuracy with full-rank representations.

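Step 6: Recovery impossibility. No linear or nonlinear downstream function can recover information lost in null directions. If \(\mathbf{v} \in \mathcal{N}\) and \(\mathbf{v}^\top \mathbf{z} = c\) is constant over the data, then the coordinate \(\mathbf{v}^\top \mathbf{z}\) takes a single value on the support of the representation, and the output of any function \(g(\mathbf{z})\) on the data is unchanged if \(g\) is replaced by its restriction to the slice \(\mathbf{v}^\top \mathbf{z} = c\). Whatever input variation would have been encoded along the \(\mathbf{v}\)-direction is irrecoverably lost. ∎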

Interpretation: Rank deficiency in representation covariance indicates fundamental degeneracy: dimensions in the null space carry no information. Downstream tasks cannot utilize these dimensions, reducing effective capacity from nominal \(d\) to actual \(k\). This degeneracy is irrecoverable without retraining the encoder.

Explicit ML Relevance: Pre-trained models with rank-deficient representations (common in self-supervised learning with insufficient regularization) have limited transfer learning capacity. Measuring \(\text{rank}(\Sigma)\), or a smooth effective-rank proxy, on pre-trained features is a useful diagnostic for downstream performance: as a rule of thumb, models with \(\text{rank}(\Sigma) / d\) close to 1 (say above 0.8) tend to transfer better than those with small ratios (say below 0.3), although the exact thresholds depend on the task. Regularization techniques (VICReg, Barlow Twins) add variance and decorrelation terms that discourage rank collapse.

Alignment–Uniformity Tradeoff Theorem

Formal Statement: Define alignment of positive pairs as: \[ \mathcal{A}(f_\theta) = \mathbb{E}_{(\mathbf{x}, \mathbf{x}^+)}\left[ \|f_\theta(\mathbf{x}) - f_\theta(\mathbf{x}^+)\|^2 \right] \] and uniformity of representation distribution on the hypersphere as: \[ \mathcal{U}(f_\theta) = \log \mathbb{E}_{\mathbf{x}_1, \mathbf{x}_2}\left[ \exp(-\|f_\theta(\mathbf{x}_1) - f_\theta(\mathbf{x}_2)\|^2) \right] \] where representations are normalized: \(\|f_\theta(\mathbf{x})\| = 1\). Both quantities are penalties: lower \(\mathcal{A}\) means better-aligned positives, and lower \(\mathcal{U}\) means a more uniform distribution, with \(-\log d\) denoting the baseline value attained by the uniform distribution on the sphere. Then for any contrastive loss \(\mathcal{L}_{\text{contrast}}\), the optimum satisfies: \[ \mathcal{A}(f_\theta) \leq \epsilon_\text{align}, \quad \mathcal{U}(f_\theta) \geq -\log d + O(1/d) \] There is a fundamental tradeoff: pushing \(\mathcal{A}\) below \(\epsilon_\text{align}\) forces representations to concentrate, which moves \(\mathcal{U}\) away from its baseline \(-\log d\), with the heuristic relationship: \[ \mathcal{U}(f_\theta) \leq -\log d + C \cdot \mathcal{A}(f_\theta)^{1/2} \] for constant \(C\) depending on the number of classes.

Full Formal Proof:

Step 1: Alignment interpretation. Small alignment \(\mathcal{A}\) means positive pairs map to nearby points. For unit-norm representations: \[ \|f_\theta(\mathbf{x}) - f_\theta(\mathbf{x}^+)\|^2 = 2(1 - \cos\theta) \] where \(\theta\) is the angle between the normalized representations. Thus \(\mathcal{A} = \mathbb{E}[2(1 - \cos\theta)]\).

Step 2: Uniformity interpretation. Maximum uniformity occurs when representations are uniformly distributed on the hypersphere. For uniform distribution on \(\mathbb{S}^{d-1}\): \[ \mathbb{E}[\exp(-\|\mathbf{z}_1 - \mathbf{z}_2\|^2)] = \mathbb{E}[\exp(-2(1 - \mathbf{z}_1^\top \mathbf{z}_2))] \] On the high-dimensional sphere, \(\mathbf{z}_1^\top \mathbf{z}_2\) is approximately distributed as \(\mathcal{N}(0, 1/d)\). Using the Gaussian moment generating function \(\mathbb{E}[\exp(sX)] = \exp(s^2\sigma^2/2)\) with \(s = 2\) and \(\sigma^2 = 1/d\): \[ \mathbb{E}[\exp(-2(1 - \mathbf{z}_1^\top \mathbf{z}_2))] \approx \exp(-2) \cdot \mathbb{E}[\exp(2\mathbf{z}_1^\top \mathbf{z}_2)] \approx \exp(-2) \cdot \exp(2/d) = \exp(-2 + 2/d) \] Taking logarithm: \[ \mathcal{U}_{\max} = \log[\exp(-2 + 2/d)] = -2 + 2/d \] This baseline tends to \(-2\) as \(d \to \infty\) rather than scaling as \(-\log d\); wherever \(-\log d\) appears in the statement and in later steps, it stands for this baseline value of the uniform distribution.

Step 3: Positive pair clustering constraint. If representations cluster into \(K\) clusters (e.g., \(K\) classes) with positive pairs in the same cluster, then the distribution is no longer uniform: it concentrates on \(K\) regions. In this heuristic accounting, the concentration shifts the uniformity term by \(\log K\): \[ \mathcal{U} \approx -\log(d/K) = -\log d + \log K \] The potential \(\mathcal{U}\) therefore rises by \(\log K\) above its baseline, meaning the distribution becomes less uniform.

Step 4: Alignment-uniformity coupling. To achieve tight clustering (small \(\mathcal{A}\)), representations within each cluster must concentrate within angular radius \(\theta_c\), where: \[ \mathcal{A} \approx 2(1 - \cos\theta_c) \approx \theta_c^2 \] for small angles. The volume of a spherical cap of radius \(\theta_c\) on \(\mathbb{S}^{d-1}\) scales as: \[ \text{Vol}(\text{cap}) \approx \theta_c^{d-1} \] To fit \(K\) non-overlapping caps: \[ K \cdot \theta_c^{d-1} \lesssim \text{Vol}(\mathbb{S}^{d-1}) \] Therefore the cap radius is limited by the number of clusters: \[ \theta_c \lesssim K^{-1/(d-1)} \]

Step 5: Tradeoff relation. Combining \(\mathcal{A} \approx \theta_c^2\) and \(\theta_c \lesssim K^{-1/(d-1)}\): \[ \mathcal{A} \lesssim K^{-2/(d-1)} \] Equivalently, the number of clusters that fit at alignment level \(\mathcal{A}\) is bounded: \[ K \lesssim \mathcal{A}^{-(d-1)/2} \] The uniformity penalty from Step 3 is \(\log K\). Substituting: \[ \mathcal{U} \lesssim -\log d + \log(K) \lesssim -\log d + \frac{d-1}{2} \log(1/\mathcal{A}) \] The packing argument therefore couples \(\mathcal{U}\) and \(\mathcal{A}\) logarithmically. The power-law form \[ \mathcal{U} \leq -\log d + C' \mathcal{A}^{1/2} \] quoted in the statement is a heuristic summary of this coupling and does not follow from the logarithmic bound by algebra alone; the constant \(C'\) depends on the dimension.

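Step 6: Achieving the tradeoff. Contrastive learning with temperature \(\tau\) controls this tradeoff. In the simplified picture of this proof, small \(\tau\) prioritizes alignment (tight clusters, low \(\mathcal{A}\)) at the cost of uniformity (clusters concentrate, high \(\mathcal{U}\)), while large \(\tau\) prioritizes uniformity (spread distribution) at the cost of alignment (looser clusters, high \(\mathcal{A}\)). In practice the picture is less clean: small \(\tau\) also concentrates the softmax weight on the hardest negatives, which pushes nearby points apart, so the net effect of \(\tau\) on uniformity should be checked empirically. A well-chosen \(\tau\) balances both terms. ∎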

Interpretation: There is an inherent tradeoff between making positive pairs similar (alignment) and spreading representations uniformly (uniformity). Optimizing alignment alone admits the trivial solution that collapses all representations to a single point, where \(\mathcal{A} = 0\) but \(\mathcal{U} = 0\), its largest possible value. Optimizing uniformity alone drives \(\mathcal{U}\) to its baseline but treats all points as equally different, losing semantic structure. Good representations balance both.

Explicit ML Relevance: Wang & Isola (2020) propose explicitly optimizing \(\mathcal{L} = \mathcal{A} + \lambda \mathcal{U}\) as an alternative to contrastive losses. Empirically, optimal \(\lambda \approx 1\) balances the tradeoff. This framework explains why contrastive learning needs negative samples (to enforce uniformity) and why simply pulling positives together fails (alignment without uniformity causes collapse).
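Both quantities are cheap to compute from a batch of normalized embeddings. The sketch below implements the definitions of \(\mathcal{A}\) and \(\mathcal{U}\) from the formal statement and evaluates them on two synthetic extremes, well-spread embeddings with slightly perturbed positives and fully collapsed embeddings. The data are synthetic and the function names are illustrative.

```python
import numpy as np


def unit(v):
    """Normalize vectors (or rows of a matrix) to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def alignment(z, z_pos):
    """Mean squared distance between positive pairs (matching rows of z and z_pos)."""
    return np.mean(np.sum((z - z_pos) ** 2, axis=1))


def uniformity(z):
    """Log of the mean Gaussian potential over distinct pairs of rows of z."""
    sq_dists = 2.0 - 2.0 * (z @ z.T)                 # valid for unit-norm rows
    i, j = np.triu_indices(len(z), k=1)              # distinct pairs only
    return np.log(np.mean(np.exp(-sq_dists[i, j])))


rng = np.random.default_rng(0)
n, d = 1000, 64

z = unit(rng.normal(size=(n, d)))                    # well-spread embeddings
z_pos = unit(z + 0.02 * rng.normal(size=(n, d)))     # slightly perturbed "augmented" views

collapsed = np.tile(unit(rng.normal(size=d)), (n, 1))  # every input maps to one point

spread_scores = (alignment(z, z_pos), uniformity(z))
collapsed_scores = (alignment(collapsed, collapsed), uniformity(collapsed))
print(spread_scores, collapsed_scores)
```

The collapsed encoder attains perfect alignment while the uniformity term sits at its largest value, \(0\); the spread encoder pays a small alignment cost and attains a much lower uniformity term.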

Spectral Regularization Effect Theorem

Formal Statement: Let \(\Sigma_t = \text{Cov}(\mathbf{z}_t)\) be the representation covariance during training at time \(t\), evolving under gradient descent on loss \(\mathcal{L}\) with spectral regularization: \[ \mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) - \alpha \text{tr}(\Sigma) + \beta \|\Sigma - I\|_F^2 \] where \(\alpha, \beta > 0\) are regularization strengths. Then the eigenvalue dynamics satisfy: \[ \frac{d\lambda_i}{dt} = -\frac{\partial \mathcal{L}}{\partial \lambda_i} + \alpha - 2\beta(\lambda_i - 1) \] At equilibrium, eigenvalues satisfy: \[ \lambda_i = 1 + \frac{1}{2\beta}\left(\alpha - \frac{\partial \mathcal{L}}{\partial \lambda_i}\right) \] For large \(\beta\), eigenvalues concentrate near \(1\), whitening the representation.

Full Formal Proof:

Step 1: Covariance evolution. The covariance evolves as: \[ \frac{d\Sigma}{dt} = \mathbb{E}\left[\frac{d(\mathbf{z} - \bar{\mathbf{z}})}{dt}(\mathbf{z} - \bar{\mathbf{z}})^\top + (\mathbf{z} - \bar{\mathbf{z}})\frac{d(\mathbf{z} - \bar{\mathbf{z}})^\top}{dt}\right] \] Under gradient descent \(\frac{d\theta}{dt} = -\nabla_\theta (\mathcal{L} - \alpha \text{tr}(\Sigma) + \beta \|\Sigma - I\|_F^2)\): \[ \frac{d\mathbf{z}}{dt} = \frac{\partial \mathbf{z}}{\partial \theta} \frac{d\theta}{dt} = -J_\theta \nabla_\theta \mathcal{L}_{\text{reg}} \] where \(J_\theta = \frac{\partial \mathbf{z}}{\partial \theta}\) is the Jacobian.

Step 2: Eigenvalue-wise dynamics. Since \(\Sigma = Q\Lambda Q^\top\) with \(\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_d)\), the evolution in the eigenbasis is: \[ \frac{d\Lambda}{dt} = Q^\top \frac{d\Sigma}{dt} Q \] Assuming eigenvectors evolve slowly (adiabatic approximation), the diagonal elements satisfy: \[ \frac{d\lambda_i}{dt} = \mathbf{q}_i^\top \frac{d\Sigma}{dt} \mathbf{q}_i \]

Step 3: Regularization gradient. The gradient of trace regularization is: \[ \frac{\partial \text{tr}(\Sigma)}{\partial \Sigma} = I \] So: \[ \nabla_\theta \text{tr}(\Sigma) = \nabla_\theta \left(\sum_i \lambda_i\right) = \sum_i \frac{\partial \lambda_i}{\partial \theta} \]

The gradient of the Frobenius term is: \[ \frac{\partial \|\Sigma - I\|_F^2}{\partial \Sigma} = 2(\Sigma - I) \] In the eigenbasis: \[ \frac{\partial \|\Sigma - I\|_F^2}{\partial \lambda_i} = 2(\lambda_i - 1) \]

Step 4: Combined dynamics. The regularized gradient contributes: \[ \frac{d\lambda_i}{dt} = -\frac{\partial \mathcal{L}}{\partial \lambda_i} + \alpha - 2\beta(\lambda_i - 1) \] Rearranging: \[ \frac{d\lambda_i}{dt} = -\frac{\partial \mathcal{L}}{\partial \lambda_i} + \alpha - 2\beta \lambda_i + 2\beta \]

Step 5: Equilibrium analysis. At equilibrium \(\frac{d\lambda_i}{dt} = 0\): \[ \frac{\partial \mathcal{L}}{\partial \lambda_i} = \alpha + 2\beta(1 - \lambda_i) \] Solving for \(\lambda_i\): \[ \lambda_i = 1 + \frac{1}{2\beta}\left(\alpha - \frac{\partial \mathcal{L}}{\partial \lambda_i}\right) \]

Step 6: Whitening limit. For large \(\beta \to \infty\): \[ \lambda_i \to 1 + O(1/\beta) \] All eigenvalues converge to \(1\), achieving perfect whitening: \(\Sigma \to I\). For finite \(\beta\), eigenvalues deviate from \(1\) proportionally to the task-specific gradients \(\frac{\partial \mathcal{L}}{\partial \lambda_i}\). ∎
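The dynamics of Step 4 and the equilibrium of Step 5 can be reproduced with a few lines of forward-Euler integration. The sketch below makes one strong simplifying assumption, labeled in the code: the task gradient \(\partial \mathcal{L} / \partial \lambda_i\) is held constant, whereas in training it would depend on \(\lambda_i\) and on time.

```python
import numpy as np

alpha, beta = 0.5, 2.0
# Assumption: a constant, hypothetical task gradient dL/d(lambda_i) for each eigenvalue.
task_grad = np.array([-1.0, 0.0, 0.5, 2.0])
lam = np.array([3.0, 0.1, 1.0, 0.0])          # initial eigenvalues

dt, steps = 0.01, 2000
for _ in range(steps):
    # Step 4 dynamics: task term, trace reward, and pull toward 1.
    lam = lam + dt * (-task_grad + alpha - 2.0 * beta * (lam - 1.0))

# Step 5 equilibrium: lambda_i = 1 + (alpha - dL/dlambda_i) / (2 beta).
equilibrium = 1.0 + (alpha - task_grad) / (2.0 * beta)
print(lam, equilibrium)
```

Increasing `beta` shrinks the offsets from \(1\) in proportion to \(1/\beta\), which is the whitening limit of Step 6.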

Interpretation: Spectral regularization actively shapes the eigenvalue spectrum of representations. The trace term (\(-\alpha \text{tr}(\Sigma)\) in the objective) rewards variance and so pushes eigenvalues away from zero (preventing collapse). The Frobenius term (\(\beta \|\Sigma - I\|_F^2\)) pulls every eigenvalue toward \(1\), and hence toward uniformity. Together, they produce full-rank, well-conditioned representations that use all dimensions nearly equally.
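To make the two penalty terms concrete, here is a minimal sketch that evaluates \(-\alpha \text{tr}(\Sigma) + \beta \|\Sigma - I\|_F^2\) on a batch of embeddings. The function names and toy batches are illustrative assumptions, and the covariance uses the biased (divide-by-n) estimator.

```python
def covariance(batch):
    """Biased sample covariance of a list of equal-length vectors."""
    n, d = len(batch), len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in batch:
        centered = [row[j] - mean[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += centered[a] * centered[b] / n
    return cov


def spectral_penalty(batch, alpha, beta):
    """Return -alpha * tr(Sigma) + beta * ||Sigma - I||_F^2 for one batch."""
    cov = covariance(batch)
    d = len(cov)
    trace = sum(cov[i][i] for i in range(d))
    frob_sq = sum((cov[a][b] - (1.0 if a == b else 0.0)) ** 2
                  for a in range(d) for b in range(d))
    return -alpha * trace + beta * frob_sq


# Toy batches: one already whitened, one fully collapsed to a single point.
whitened = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
collapsed = [[0.5, 0.5]] * 4
```

In this toy setting the collapsed batch receives a strictly larger penalty than the whitened one, which is the behavior the interpretation describes.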

Explicit ML Relevance: Batch normalization implements a partial form of spectral regularization: it standardizes the variance of each feature (the diagonal of \(\Sigma\)) but does not remove correlations between features. Explicit whitening (ZCA, Cholesky) transforms representations to have \(\Sigma = I\). VICReg uses variance and covariance terms that play a role closely related to the trace and Frobenius terms analyzed here. These techniques improve optimization conditioning (smaller condition number) and prevent dimensional collapse (all \(\lambda_i\) stay bounded away from zero).

Information Compression Bound (Finite Case)

Formal Statement: Let \(f_\theta: \mathcal{X} \to \mathbb{R}^d\) be a deterministic encoder and \(g_\phi: \mathbb{R}^d \to \mathcal{Y}\) a decoder for task \(Y\). For finite discrete input and output spaces with \(|\mathcal{X}| = n\), \(|\mathcal{Y}| = m\), the mutual information between representation and input is bounded: \[ I(\mathbf{Z}; \mathbf{X}) \leq d \log_2(n) \] and task-relevant information satisfies: \[ I(\mathbf{Z}; Y) \leq \min\{d \log_2(n), H(Y)\} \] Furthermore, if task performance (accuracy) is \(\text{Acc}\), then: \[ I(\mathbf{Z}; Y) \geq H(Y) - H(Y | \hat{Y}) \geq H(Y)(1 - H_{\text{binary}}(\text{Acc})) \] where \(H_{\text{binary}}(p) = -p \log_2 p - (1-p)\log_2(1-p)\) is binary entropy in bits. The final inequality is established in the proof for balanced binary tasks; for \(m > 2\) classes, the rigorous lower bound is the Fano form given in Step 5.

Full Formal Proof:

Step 1: Deterministic encoder bound. For deterministic \(f_\theta\), the conditional entropy \(H(\mathbf{Z}|\mathbf{X}) = 0\) since \(\mathbf{z}\) is a function of \(\mathbf{x}\). Therefore: \[ I(\mathbf{Z}; \mathbf{X}) = H(\mathbf{Z}) - H(\mathbf{Z}|\mathbf{X}) = H(\mathbf{Z}) \] The entropy of a \(d\)-dimensional real vector with finite support (since inputs are finite) is bounded by the number of distinct values. With \(n\) inputs, at most \(n\) distinct representations exist, so: \[ H(\mathbf{Z}) \leq \log_2 n \] For vector-valued \(\mathbf{z} \in \mathbb{R}^d\), coordinate-wise: \[ H(\mathbf{Z}) \leq \sum_{i=1}^d H(Z_i) \leq d \log_2 n \]
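The counting argument in Step 1 can be illustrated with a toy encoder. This sketch assumes a uniform distribution over \(n = 16\) inputs and two hypothetical encoders into \(\mathbb{R}^2\); it computes the empirical entropy of the codes so it can be compared against \(\log_2 n = 4\) bits.

```python
import math
from collections import Counter


def code_entropy_bits(codes):
    """Entropy in bits of the codes, assuming inputs are uniformly distributed."""
    n = len(codes)
    counts = Counter(codes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical deterministic encoders on inputs 0..15 (tuples so codes are hashable).
inputs = range(16)
injective_codes = [(x % 4, x // 4) for x in inputs]   # 16 distinct codes
lossy_codes = [(x % 2, 0) for x in inputs]            # only 2 distinct codes

h_injective = code_entropy_bits(injective_codes)
h_lossy = code_entropy_bits(lossy_codes)
```

The injective encoder attains the \(\log_2 n\) ceiling, while the lossy encoder falls well below it; neither can exceed it regardless of the latent dimension.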

Step 2: Data processing inequality. The Markov chain \(Y - \mathbf{X} - \mathbf{Z}\) (task label depends on input, representation depends on input) implies both: \[ I(\mathbf{Z}; Y) \leq I(\mathbf{X}; Y) \quad \text{and} \quad I(\mathbf{Z}; Y) \leq I(\mathbf{Z}; \mathbf{X}) \] Furthermore: \[ I(\mathbf{Z}; Y) \leq H(Y) \] since mutual information is bounded by marginal entropy. Combining with the representation entropy bound from Step 1: \[ I(\mathbf{Z}; Y) \leq \min\{I(\mathbf{Z}; \mathbf{X}), H(Y)\} \leq \min\{d \log_2 n, H(Y)\} \]

Step 3: Performance lower bound via Fano’s inequality. Let \(\hat{Y} = g_\phi(\mathbf{Z})\) be the predicted label. Because \(\hat{Y}\) is a function of \(\mathbf{Z}\), conditioning on \(\mathbf{Z}\) cannot leave more uncertainty than conditioning on \(\hat{Y}\): \[ H(Y | \mathbf{Z}) \leq H(Y | \hat{Y}) \] Fano’s inequality then bounds the right-hand side: \[ H(Y | \hat{Y}) \leq H_{\text{binary}}(P_e) + P_e \log_2(m - 1) \] where \(P_e = \mathbb{P}[Y \neq \hat{Y}] = 1 - \text{Acc}\) is error probability. The conditional entropy \(H(Y|\hat{Y})\) expands as: \[ H(Y|\hat{Y}) = \sum_{\hat{y}} P(\hat{y}) H(Y | \hat{Y} = \hat{y}) \]

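Step 4: Binary classification case. For binary classification (\(m = 2\)) with accuracy \(\text{Acc}\), the \(\log_2(m - 1)\) term vanishes and \(H_{\text{binary}}(P_e) = H_{\text{binary}}(\text{Acc})\) by symmetry, so: \[ H(Y|\hat{Y}) \leq H_{\text{binary}}(\text{Acc}) \] with equality when the error rate is the same for both predicted labels. The mutual information satisfies: \[ I(\mathbf{Z}; Y) = H(Y) - H(Y|\mathbf{Z}) \geq H(Y) - H(Y|\hat{Y}) \geq H(Y) - H_{\text{binary}}(\text{Acc}) \] For balanced classes \(H(Y) = 1\) bit: \[ I(\mathbf{Z}; Y) \geq 1 - H_{\text{binary}}(\text{Acc}) \]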

Step 5: Multi-class case. For \(m\)-way classification with uniform prior \(H(Y) = \log_2 m\): \[ I(\mathbf{Z}; Y) \geq H(Y) - H_{\text{binary}}(P_e) - P_e \log_2(m-1) \] When accuracy is high (\(\text{Acc} \to 1\), so \(P_e \to 0\)): \[ I(\mathbf{Z}; Y) \geq H(Y) - o(1) \approx \log_2 m \] The representation must contain nearly all task information.
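The multi-class bound is easy to evaluate numerically. The sketch below assumes a uniform prior over labels, as in Step 5, and uses illustrative function names; it returns the lower bound on \(I(\mathbf{Z}; Y)\) in bits for a given number of classes and accuracy.

```python
import math


def binary_entropy(p):
    """H_binary(p) in bits, with the convention 0 * log(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def fano_lower_bound(num_classes, accuracy):
    """Lower bound on I(Z; Y) in bits, assuming a uniform prior over labels."""
    p_err = 1.0 - accuracy
    h_y = math.log2(num_classes)
    slack = binary_entropy(p_err)
    if num_classes > 2:
        slack += p_err * math.log2(num_classes - 1)
    # The bound is vacuous (zero) when the slack exceeds H(Y).
    return max(0.0, h_y - slack)


bound_binary_chance = fano_lower_bound(num_classes=2, accuracy=0.5)
bound_1000_way = fano_lower_bound(num_classes=1000, accuracy=0.95)
```

For a binary task at chance accuracy the bound is zero, as expected; as accuracy approaches 1 the bound approaches \(\log_2 m\).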

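Step 6: Compression-performance tradeoff. Combining the upper bound with the lower bound (in the form established in Step 4 for balanced binary tasks): \[ H(Y)(1 - H_{\text{binary}}(\text{Acc})) \leq I(\mathbf{Z}; Y) \leq \min\{d \log_2 n, H(Y)\} \] For fixed accuracy, minimum representation dimension satisfies: \[ d \geq \frac{H(Y)(1 - H_{\text{binary}}(\text{Acc}))}{\log_2 n} \] Higher accuracy requires larger \(d\) (more capacity) relative to input complexity. Because the tighter bound \(H(\mathbf{Z}) \leq \log_2 n\) from Step 1 does not depend on \(d\), this dimension bound is loose when \(n\) is large. ∎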

Interpretation: Representation dimensionality \(d\) must be large enough to encode task-relevant information but need not encode all input information. The bound \(I(\mathbf{Z}; Y) \geq H(Y)(1 - H_{\text{binary}}(\text{Acc}))\) shows that achieving accuracy \(\text{Acc}\) requires retaining a minimum amount of task information. The gap between \(I(\mathbf{Z}; \mathbf{X})\) and \(I(\mathbf{Z}; Y)\) represents task-irrelevant information that can be compressed.

Explicit ML Relevance: This theorem justifies dimensionality reduction for specific tasks. If a representation achieves 95% accuracy on a 1000-way classification task (\(H(Y) = \log_2 1000 \approx 10\) bits), the bound above gives \(I(\mathbf{Z}; Y) \geq 10 \times (1 - 0.29) \approx 7.1\) bits, using \(H_{\text{binary}}(0.95) \approx 0.29\); the Fano form from Step 5 gives the tighter value of about \(9.2\) bits. The first figure requires \(d \geq 7.1 / \log_2 n\) dimensions. Representations with \(d\) far above this bound carry capacity the task does not need; representations below it cannot reach the target accuracy. The information bottleneck framework uses bounds of this kind to guide regularization strength.

Worked Examples

Example 1 — Linear Autoencoder Geometry

Setup: Consider a linear autoencoder where both encoder and decoder are single matrices: \(f_\theta(\mathbf{x}) = W_e \mathbf{x}\) and reconstruction \(\hat{\mathbf{x}} = W_d f_\theta(\mathbf{x}) = W_d W_e \mathbf{x}\). Let the encoder map from \(\mathbb{R}^{100}\) to \(\mathbb{R}^{10}\) (representing 10-dimensional latent space) and the decoder map back to \(\mathbb{R}^{100}\). We train on a dataset of 1000 images where \(\mathbf{x} \in \mathbb{R}^{100}\) (say, 10×10 flattened images). The loss is the reconstruction error: \[ \mathcal{L} = \mathbb{E}[\|\hat{\mathbf{x}} - \mathbf{x}\|^2] = \mathbb{E}[\|W_d W_e \mathbf{x} - \mathbf{x}\|^2] \]

Reasoning: At the global minimum (which we can find analytically for linear autoencoders), the encoder \(W_e\) and decoder \(W_d\) perform principal component analysis (PCA). Specifically, assume the data are centered and take the eigendecomposition of the data covariance matrix \(\Sigma_X = \mathbb{E}[\mathbf{x}\mathbf{x}^\top]\); the optimal encoder projects onto the span of the top 10 eigenvectors (principal components). In the canonical PCA solution, the rows of \(W_e \in \mathbb{R}^{10 \times 100}\) form an orthonormal basis for the top-10 eigenspace of \(\Sigma_X\), and the decoder \(W_d = W_e^\top\) maps back from the latent space. The reconstruction loss determines only the product \(W_d W_e\), so any other global minimizer differs from this one by an invertible \(10 \times 10\) change of basis in latent space.

The representation covariance in latent space is \(\Sigma_Z = \mathbb{E}[\mathbf{z}\mathbf{z}^\top] = W_e \Sigma_X W_e^\top\), which for the PCA solution is diagonal with entries equal to the top 10 eigenvalues of \(\Sigma_X\). This diagonal structure is crucial: it means the latent representation has no correlation between dimensions. The trace of the representation covariance, \(\text{tr}(\Sigma_Z) = \sum_{i=1}^{10} \lambda_i\), equals the total variance in the data explained by the top 10 principal components. For natural images, this might account for 85-90% of total variance in the original space.

Geometrically, the data manifold in the original space is 10-dimensional (approximately), embedded in the 100-dimensional image space. The latent space discovers this manifold by learning coordinates for it. The reconstruction error comes from intrinsic data noise and the small 11th through 100th eigenvalues (tail variance) that cannot be captured in 10 dimensions. If \(\lambda_1 \gg \lambda_2 \gg \cdots \gg \lambda_{10} > \lambda_{11} \approx \varepsilon\), most information is captured by the top component, and the manifold is highly anisotropic (elongated along the first principal direction).

Interpretation: The linear autoencoder’s learned geometry is the eigenspace geometry of the data. Unlike nonlinear autoencoders, which can discover curved manifolds, linear autoencoders are fundamentally limited to flat subspaces. However, this limitation is interpretable and analytically tractable: we can directly analyze what the autoencoder learns through spectral decomposition. The representation covariance being diagonal in the principal-component basis reveals that each latent dimension captures uncorrelated variance: dimension 1 captures \(\lambda_1\) units, dimension 2 captures \(\lambda_2\) units, and so on. This decorrelation follows from the eigenstructure of the data and requires no explicit regularization.

The question of “how much variance is explained” directly determines reconstruction quality. If the data distribution is truly confined to a 10-dimensional linear subspace (e.g., images formed as linear combinations of 10 fixed basis patterns), then the linear autoencoder will achieve near-zero reconstruction error once trained. If the data is intrinsically higher-dimensional, noisy, or lies on a curved manifold (as for images of a rotating object), reconstruction error plateaus at a positive level determined by the tail eigenvalues \(\sum_{i=11}^{100} \lambda_i\).
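The link between tail eigenvalues and reconstruction error can be verified in the smallest possible case: 2-D data compressed to a 1-D latent. The sketch below assumes centered data and uses the closed-form eigendecomposition of a \(2 \times 2\) covariance; the toy points and function names are illustrative.

```python
import math


def pca_2d_to_1d(points):
    """Closed-form PCA of centered 2-D points: (top eigenvector, (lam1, lam2))."""
    n = len(points)
    a = sum(x * x for x, _ in points) / n      # Sigma[0][0]
    b = sum(x * y for x, y in points) / n      # Sigma[0][1]
    c = sum(y * y for _, y in points) / n      # Sigma[1][1]
    mid, rad = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    lam1, lam2 = mid + rad, mid - rad
    # (b, lam1 - a) solves (Sigma - lam1 * I) v = 0 when b != 0.
    if abs(b) > 1e-12:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), (lam1, lam2)


def reconstruction_mse(points, direction):
    """Encode z = w^T x, decode x_hat = w z, and average the squared error."""
    wx, wy = direction
    total = 0.0
    for x, y in points:
        z = wx * x + wy * y
        total += (x - wx * z) ** 2 + (y - wy * z) ** 2
    return total / len(points)


# Toy centered dataset (hypothetical values).
points = [(3.0, 1.0), (-3.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
direction, (lam1, lam2) = pca_2d_to_1d(points)
mse = reconstruction_mse(points, direction)
```

Here the mean squared reconstruction error equals the discarded eigenvalue `lam2`, the two-dimensional analogue of the tail sum \(\sum_{i=11}^{100} \lambda_i\).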

Common Misconceptions: A frequent misunderstanding is that adding more layers to an autoencoder automatically produces better representations. In fact, a deep nonlinear autoencoder on the same data might learn the same 10-dimensional manifold but discover it in a curved, nonlinear way. While this can sometimes improve reconstruction, it doesn’t necessarily improve downstream task performance—sometimes the simpler linear PCA projection generalizes better to new data. Another misconception is that the latent space is a “compressed” version of the data. It’s not compression in the information theory sense (which would require quantization or probabilistic modeling); rather, it’s a change of coordinates to a subspace. The latent codes still contain high precision (full floating point).

Many practitioners assume decorrelation requires explicit regularization. The linear autoencoder shows that certain geometries (eigenspaces) admit a decorrelated coordinate system by the problem structure itself. This suggests that if representations collapse or become correlated during nonlinear autoencoder training, the issue is likely not intrinsic to the problem but rather an artifact of optimization difficulty, architecture design, or initialization.

What-if Scenarios: Suppose we increased the latent dimension from 10 to 50. The reconstruction error would decrease because we now capture the top 50 eigenvalues, explaining perhaps 98-99% of variance instead of 85-90%. The representation covariance would remain diagonal but with 50 nonzero eigenvalues instead of 10, meaning more dimensions become “active.” Downstream task performance using these 50-dimensional representations would likely improve (more information available), but transfer to datasets with different statistics might suffer (overfitting to source data structure).

Alternatively, suppose the data has a natural hierarchical structure (e.g., images of 5 different objects, each appearing in 200 poses). A linear autoencoder would still find the 10-dimensional subspace, but that subspace would be “mixed”—principal components would blend information about object identity and pose. A nonlinear autoencoder might learn a higher-dimensional representation that factorizes these: some dimensions encode object identity, others encode pose. This factorization (disentanglement) is not possible for linear autoencoders because the data manifold itself isn’t factorized in the original space—poses for object A and object B occupy overlapping regions.

Explicit ML Relevance: Linear autoencoders are used in practice for fast dimensionality reduction and outlier detection. Their interpretability through SVD makes them valuable for understanding data geometry before committing to expensive nonlinear models. In transfer learning, PCA features from a source domain often transfer well because the leading principal components typically capture broad, domain-general structure (contrast, edges, overall image statistics) before domain-specific details. The decorrelated representation helps downstream classifiers: logistic regression or an SVM often performs better in PCA space than in raw pixel space because irrelevant correlations are removed, reducing effective dimensionality.

Example 2 — Feature Covariance Spectrum Analysis

Setup: Train a ResNet-50 on ImageNet (1.2M images, 1000 classes) and extract 2048-dimensional representations from the global average pooling layer before the classification head. Compute the covariance matrix \(\Sigma \in \mathbb{R}^{2048 \times 2048}\) over all training images. Perform eigendecomposition: \(\Sigma = Q \Lambda Q^\top\) where \(\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_{2048})\) with \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{2048}\). Analyze the eigenvalue spectrum.

Reasoning: The eigenvalue distribution reveals how information is distributed across the 2048 representation dimensions. Empirically, for ImageNet-trained ResNets, the spectrum exhibits power-law decay: \(\lambda_i \propto i^{-\alpha}\) with \(\alpha\) between roughly 1.5 and 2. This means the first few eigenvalues drop steeply, while later eigenvalues decay slowly in a long tail. Concretely, the top 200 eigenvalues might sum to 95% of total trace, while eigenvalues 1000-2048 contribute negligibly.
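The cumulative-share arithmetic can be checked on an idealized spectrum. This sketch assumes an exact power law \(\lambda_i = i^{-1.5}\) over 2048 dimensions, which is a stylized stand-in rather than a measured ResNet spectrum; the function names are illustrative.

```python
def power_law_spectrum(dim, exponent):
    """Eigenvalues lambda_i = i^(-exponent) for i = 1..dim (unnormalized)."""
    return [i ** (-exponent) for i in range(1, dim + 1)]


def top_k_variance_share(eigenvalues, k):
    """Fraction of the total trace carried by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


spectrum = power_law_spectrum(dim=2048, exponent=1.5)
share_top_10 = top_k_variance_share(spectrum, k=10)
share_top_200 = top_k_variance_share(spectrum, k=200)
share_tail = 1.0 - top_k_variance_share(spectrum, k=1000)
```

Under this assumption the top 200 directions carry a little over 95% of the trace, and directions beyond the first 1000 carry under 1%.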

This power-law decay is not random—it reflects the hierarchical structure of visual data. Top principal components capture coarse, broadly-relevant information: overall brightness, contrast, presence of edges. Middle components capture intermediate-level features: texture properties, specific object parts. Bottom components capture fine details: textures unique to specific images, noise. The power law emerges because natural image statistics have long-range dependencies and scale-invariance across different levels of abstraction.

Computing the participation ratio \(\text{PR} = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2\) yields a value of roughly 200 to 300, indicating that the effective dimension is roughly 200 out of 2048. (A PR this large implies that the leading eigenvalues are flatter than a pure power law with \(\alpha \geq 1.5\), for which PR would be in the single digits.) This is the key insight: ResNet nominally produces 2048-dimensional features, but only 200-300 dimensions meaningfully vary across the dataset, meaning the network operates in a ~10x compressed space relative to its nominal output dimension. The remaining ~1800 dimensions have near-zero variance (they’re quasi-constant across images).
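The participation ratio is straightforward to compute from a list of eigenvalues. The sketch below uses three toy spectra with hypothetical values, chosen only to show how PR responds to the shape of the spectrum.

```python
def participation_ratio(eigenvalues):
    """PR = (sum lambda)^2 / sum(lambda^2); ranges from 1 to len(eigenvalues)."""
    total = sum(eigenvalues)
    total_sq = sum(lam * lam for lam in eigenvalues)
    return (total * total) / total_sq


# Toy spectra over 2048 dimensions (illustrative, not measured).
flat = [1.0] * 2048                     # every direction equally active
block = [1.0] * 256 + [0.0] * 1792      # 256 active directions, rest silent
spiked = [100.0] + [1.0] * 2047         # one dominant direction

pr_flat = participation_ratio(flat)
pr_block = participation_ratio(block)
pr_spiked = participation_ratio(spiked)
```

PR counts equally active directions exactly (the `block` spectrum gives 256), and a single dominant eigenvalue pulls it well below the nominal dimension even when every other direction is active.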

Interpretation: The existence of a clear power-law spectrum with effective rank much lower than nominal dimension indicates that ResNet is finding a hierarchically structured representation. The network has learned that coarser distinctions between images (which class is this?) can be made using just the top components, while fine distinctions require progressively more components. This is geometrically sensible: class identity is low-dimensional (1000 classes, roughly \(\log_2(1000) \approx 10\) bits), so top components efficiently capture it. Image-specific details (pose, occlusion, lighting) require more dimensions for full specification.

The power-law decay specifically (rather than, say, exponential decay) is theoretically significant. Power-law spectra are characteristic of data with structure at multiple scales and with long-range correlations. This suggests that during training, the network learned to weight different scales of variation appropriately: a little bit of variance at each scale, with the amount roughly proportional to the scale’s importance for the learning objective.

Common Misconceptions: Many practitioners assume that “using all 2048 dimensions” means the model is utilizing full capacity. The spectrum analysis shows this is not true: most dimensions carry very little variance. Some practitioners also assume that whitening representations (making \(\Sigma = I\)) always helps. For downstream tasks, partial whitening (for example, dividing each principal direction by \(\sqrt{\lambda_i + \epsilon}\) or by \(\lambda_i^{\gamma/2}\) with \(0 < \gamma < 1\), rather than by \(\sqrt{\lambda_i}\) as in full whitening) often helps more: it rescales high-variance dimensions down and low-variance dimensions up without amplifying near-zero-variance noise directions to unit scale, removing task-irrelevant scaling variations while preserving meaningful structure.

Another misconception is that low effective rank indicates the network is “compressed” or “sparsely activated.” The low rank is a consequence of task structure (not all dimensions needed for classification), not necessarily of sparse activation. Many or all 2048 neurons might fire strongly, but their collective outputs land in a low-dimensional subspace. This is a global property of the representation, not a local sparsity property.

What-if Scenarios: Suppose we trained ResNet-50 on a different dataset, such as medical images (chest X-rays from CheXpert). The effective rank might be notably different. Medical images have different statistical structure: less global color variation (black and white), more emphasis on local texture patterns (pathology findings). We’d expect the top eigenvalue to be smaller (less first-component dominance), and perhaps the effective rank to be lower (medical images have simpler structure for classification). The power law might persist but with different exponent \(\alpha\).

Alternatively, consider a ResNet-50 trained on a trivial dataset (randomly labeled images). The covariance spectrum would likely flatten (all eigenvalues become similar) because the network cannot find structured information to exploit. The power-law structure would collapse to a uniform spectrum, and effective rank would approach 2048. This demonstrates that the power-law decay is not inevitable, but rather a consequence of the network successfully learning the task structure.

If we used a much shallower network (e.g., a 6-layer CNN instead of a 50-layer ResNet), the effective rank might be higher (less learning capacity, less compression). Conversely, a deeper 200-layer ResNet might achieve even lower effective rank, compressing more aggressively. This trade-off between network depth/capacity and representation compression reflects the network’s efficiency at learning the task.

Explicit ML Relevance: The covariance spectrum predicts transfer learning success. A representation with clear power-law decay transfers better than one with flat spectrum because the top components capture task-invariant information (likely relevant to many tasks), while bottom components capture task-specific details (likely irrelevant to other tasks). When transferring to a new task, using only the top \(k\) principal components (e.g., top 200 out of 2048) often works as well as the full representation, dramatically reducing computational cost.

The spectrum also diagnoses architectural choices. Batch normalization and layer normalization both affect spectrum shape by controlling variance flow. Comparing spectra before and after normalization layers reveals which layers cause variance concentration or spreading. Spectral analysis is also used in practice to detect whether a model has overfit: overfit models often have more spread-out spectra (higher effective rank) because fine-grained image-specific details dominate coarser class-defining features.

Example 3 — Contrastive Loss Separation

Setup: Implement SimCLR on CIFAR-10 (60,000 images, 10 classes, 32×32 resolution). Use a simple CNN encoder producing 128-dimensional features, normalized to unit norm. For each image, create two augmented versions via random crops, color jittering, and flips. Apply contrastive loss: \[ \mathcal{L} = -\log \frac{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_i^+) / \tau)}{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_i^+) / \tau) + \sum_{j \neq i} \exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j^-) / \tau)} \] with batch size 512 (so each anchor sees \(2 \times 511 = 1022\) negatives, the augmented views of all other images in the batch) and temperature \(\tau = 0.1\). Analyze representations learned after 100 training epochs.
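The loss for a single anchor can be written in a few lines. This is a minimal sketch operating directly on similarity values rather than on embeddings; the function names are illustrative, and `negatives_per_anchor` assumes that every augmented view of every other image in the batch serves as a negative.

```python
import math


def info_nce_loss(pos_sim, neg_sims, tau):
    """-log softmax probability of the positive among positive + negatives (nats)."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    top = max(logits)                                   # subtract max for stability
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]


def negatives_per_anchor(batch_size):
    """Assumes all views of the other images are negatives: 2 * (N - 1)."""
    return 2 * (batch_size - 1)


k = negatives_per_anchor(512)
# Untrained-like case: the positive is no more similar than any negative.
loss_uniform = info_nce_loss(pos_sim=0.0, neg_sims=[0.0] * k, tau=0.1)
# Separated case (hypothetical similarities): positive high, negatives near zero.
loss_separated = info_nce_loss(pos_sim=0.85, neg_sims=[-0.05] * k, tau=0.1)
```

When all similarities are equal the loss is \(\ln(k + 1)\), which gives a reference point for judging how far training has moved from an uninformative encoder.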

Reasoning: Contrastive loss operates by pushing positive pairs (same image, different augmentations) to high cosine similarity while pushing negative pairs (different images) to low similarity. With 512 batch size, each positive pair competes against ~1000 negative similarity comparisons (512 images × 2 augmentations per image, minus the positive pair itself). This large number of negatives is crucial: without them, the network can reach zero loss trivially by mapping all images to the same representation.

During training, early epochs show high loss (~2-3 nats with temperature 0.1). This indicates positive and negative similarities are similar, meaning the network hasn’t learned to separate them. As training progresses, positive similarities increase (images and their augmentations map close together) while negative similarities decrease (different images map far apart), reducing loss toward ~0.5 nats by epoch 100.

At convergence, representations cluster on the unit hypersphere: images of the same class form tight clusters, and different classes form separated clusters. Positive pairs are formed only from augmentations of the same image, but images of the same class share the features that survive augmentation and therefore land near one another. The angular separation between clusters is large: same-class pairs achieve cosine similarity \(\approx 0.85\), different-class pairs achieve \(\approx -0.05\) (near-orthogonal on the sphere). The temperature parameter \(\tau = 0.1\) controls sharpness: smaller \(\tau\) makes the softmax more peaked, enforcing tighter separation; larger \(\tau\) allows softer boundaries.

Geometrically, we can visualize representations using t-SNE: 10 distinct clusters appear (one per class), tightly grouped within clusters, well-separated between clusters. The margin between clusters (smallest distance between any two different-class points) is typically 0.3-0.4 in cosine distance, while intra-cluster spread is 0.1-0.15, yielding a clear separation with ratio > 2.

Interpretation: Contrastive learning succeeds because augmentation implicitly defines which images are similar: two crops of the same image are more similar than crops of different images. By training the network to respect these similarity judgments, the network learns to ignore superficial variations (lighting, pose, cropping) and focus on class-identity-relevant features. The angular geometry on the sphere is particularly convenient: cosine similarity is bounded between -1 and 1, and the sphere has uniform geometry (no special directions). This prevents representational pathologies like magnitude divergence.

The ability to achieve clean separation (margin-to-spread ratio > 2) after 100 epochs shows that the learning task is tractable: the network can find representations that simultaneously satisfy thousands of comparison constraints. The fact that contrastive loss on CIFAR-10 works well with relatively little hyperparameter tuning suggests the method is robust. However, success depends on batch size being large enough; with batch size 32, the number of negatives drops, and the method degrades markedly (loss barely decreases, achieving only \(\approx 1.5\) nats final loss instead of 0.5).

Common Misconceptions: One misconception is that contrastive learning requires labeled data. In fact, the SimCLR example above uses only image augmentations (no labels) to define positive pairs during training. Labels are only used for linear evaluation afterward (training a simple classifier on the learned representations). The method is entirely self-supervised. Another misconception is that high loss means training failed. Contrastive loss values are hard to interpret absolutely; only relative changes are meaningful. Loss of 0.5 nats with 512 batch size might indicate good separation, while loss of 1.0 nats might indicate poor separation, but without knowing batch size and temperature, loss values alone don’t inform quality.

Many practitioners misunderstand why negatives are necessary. They assume positives alone create clustering. Actually, without negatives, the trivial solution (constant representation) satisfies the positive pair constraint perfectly (cosine similarity = 1 for identical representations). The negative terms prevent this: they force representations to spread out and create meaningful structure. Large batch sizes are a proxy for good negative sampling: more negatives provide more constraints, pushing representations apart in different directions.
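The role of negatives can be seen by evaluating the loss for a collapsed encoder. The sketch below assumes a hypothetical encoder that maps every image to the same unit vector, and a hypothetical count of 126 negatives; without negatives the softmax ratio is 1 whatever the positive similarity is, so the loss is exactly zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, tau=0.1):
    """Loss for one anchor in nats; an empty negative list gives a ratio of 1."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))


constant = [0.6, 0.8]   # hypothetical collapsed encoder output for every image
loss_without_negatives = contrastive_loss(constant, constant, negatives=[])
loss_with_negatives = contrastive_loss(constant, constant, negatives=[constant] * 126)
```

With negatives present, the collapsed encoder pays \(\ln(127) \approx 4.8\) nats, so the optimizer has a reason to spread representations apart.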

What-if Scenarios: Suppose we reduced batch size to 64 (only \(2 \times 63 = 126\) negatives per anchor). Training would be slower and less stable. After 100 epochs, loss might only reach 1.5-2 nats instead of 0.5, and learned representations would show visible class overlap when visualized. Linear evaluation accuracy would drop from ~90% (with batch size 512) to perhaps ~75%, confirming that the representations learned less discriminative structure.

Alternatively, increase temperature from 0.1 to 0.5. The softmax becomes softer (the logits span a smaller range), so hard negatives are penalized less sharply and imperfect separation costs less. Convergence is slower, but final representations remain roughly separable. This demonstrates that temperature controls the “hardness” of constraints: smaller \(\tau\) enforces strict separation (hard constraints), larger \(\tau\) allows softer separation (soft constraints).
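One way to see how temperature changes the "hardness" of the constraints is to look at how the repulsive gradient is shared among negatives. For the loss above, the relative share each negative receives is a softmax over the negatives' similarities divided by \(\tau\). The sketch below uses hypothetical similarity values and an illustrative function name.

```python
import math


def negative_weights(neg_sims, tau):
    """Softmax over negatives: relative share of the repulsion each one receives."""
    logits = [s / tau for s in neg_sims]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical similarities: one hard negative (0.6) among nine easy ones (0.0).
sims = [0.6] + [0.0] * 9
hard_share_sharp = negative_weights(sims, tau=0.1)[0]
hard_share_soft = negative_weights(sims, tau=0.5)[0]
```

At \(\tau = 0.1\) nearly all of the repulsion concentrates on the single hard negative, while at \(\tau = 0.5\) it is spread much more evenly across all negatives.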

If we doubled training epochs to 200, final representations would become even more separated: intra-cluster spread might shrink to 0.08 and margins might expand to 0.45. However, improvements diminish beyond this point; the law of diminishing returns applies. This suggests convergence happens relatively rapidly for contrastive learning on simple datasets like CIFAR-10.

Explicit ML Relevance: Contrastive learning has become foundational for self-supervised representation learning in computer vision. Unlike supervised learning, which requires labels, contrastive methods learn from raw data by exploiting the fact that augmentations preserve class identity. The learned representations transfer well to downstream tasks: linear classifiers trained on top of SimCLR features achieve 90%+ accuracy on CIFAR-10, compared to 95% for supervised baselines, a gap of at most about 5 percentage points. This gap reflects the information loss from self-supervision, which is small for simple, well-structured datasets like CIFAR.

Contrastive learning also avoids complete representation collapse (all representations converging to a single point) because the negative repulsion prevents all representations from converging to the same location, although partial dimensional collapse into a low-rank subspace can still occur. This contrasts with other self-supervised methods (like some generative models) which require explicit regularization to maintain representational diversity.

Example 4 — Representation Collapse Scenario

Setup: Train a simple autoencoder on MNIST (28×28 images, 60,000 training samples) without any regularization terms. Encoder: 784 → 128 → 64 → 32 (latent), Decoder: 32 → 64 → 128 → 784. Use only reconstruction loss \(\mathcal{L} = \|\hat{\mathbf{x}} - \mathbf{x}\|^2\) with no variance regularization, KL divergence, or other constraints. Monitor the covariance rank and singular values of latent representations during training.

Reasoning: Without regularization, the autoencoder is free to minimize loss in any way, including degenerate ways. During training, the network discovers that most images are similar (roughly similar mean intensity, moderate variance in pixel values), so representing all images with the same latent code works reasonably well for reconstruction. As training progresses, loss decreases because the network increasingly maps all images toward a similar representation—perhaps something close to the mean image.

By epoch 50, we observe that the latent dimensionality has effectively collapsed: computing the covariance matrix \(\Sigma = \text{Cov}(\mathbf{z})\) over all training images yields eigenvalues approximately \(\lambda_1 \approx 0.8, \lambda_2 \approx 0.7, \ldots, \lambda_{10} \approx 0.1, \lambda_{11} \approx 0, \ldots, \lambda_{32} \approx 0\), with effective rank around 10. The network nominally has 32 latent dimensions but uses only ~10.

Continued training worsens collapse: by epoch 200, effective rank drops to ~3, with \(\lambda_1 \approx 2.0, \lambda_2 \approx 1.5, \lambda_3 \approx 0.8, \lambda_4 \approx 0\). This extreme collapse means almost all images map to nearly the same location in the 32-dimensional latent space, differing only along 3 principal directions. The representation becomes essentially useless for downstream tasks because most information is discarded.
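Monitoring collapse requires a concrete effective-rank measure. The sketch below uses one common choice, the exponential of the entropy of the normalized spectrum; this specific definition is an assumption, since the text does not fix one. It is applied to the epoch-200 eigenvalues quoted above, padded with zeros to 32 dimensions.

```python
import math


def effective_rank(eigenvalues, eps=1e-12):
    """exp(entropy of the normalized spectrum); equals k for k equal eigenvalues."""
    positive = [lam for lam in eigenvalues if lam > eps]
    total = sum(positive)
    entropy = -sum((lam / total) * math.log(lam / total) for lam in positive)
    return math.exp(entropy)


# Epoch-200 spectrum from the text, with the remaining latent directions at zero.
epoch_200 = [2.0, 1.5, 0.8] + [0.0] * 29
rank_collapsed = effective_rank(epoch_200)

# Reference: a fully used 32-dimensional latent space with equal variances.
rank_full = effective_rank([1.0] * 32)
```

Logging this quantity every few epochs alongside the reconstruction loss makes collapse visible even while the loss keeps falling.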

Interestingly, reconstruction loss continues decreasing throughout: from ~20 (early epochs) to ~2 (epoch 100) to ~0.5 (epoch 200). The loss keeps decreasing because the network finds better reconstruction strategies by specializing: instead of learning a general latent coordinate system, it memorizes patterns. The decoder learns that if it sees input from the low-dimensional collapsed region, it should reconstruct the mean image plus small component-specific adjustments.

Interpretation: This example reveals a fundamental tension in representation learning without explicit constraints: minimizing reconstruction loss alone does not encourage rich representations. The network gets stuck in a degenerate solution where dimensionality is wasted. The solution achieves lower loss than non-collapsed solutions because it optimizes purely for training signal, ignoring generalization or representation quality.

This pathology arises here from the unregularized reconstruction objective. Supervised learning with the same architecture is less prone to it because the classification objective forces discrimination: different classes must map to different representations. Self-supervised contrastive methods also resist complete collapse because negative samples repel representations, enforcing diversity. It’s specifically the reconstruction objective in the unregularized setting that permits this form of collapse.

The loss still decreasing during collapse indicates that reconstruction loss is a poor proxy for representation quality. A model with high loss but diverse representations might be better than a model with low loss but collapsed representations. This demonstrates why linear evaluation (training a downstream classifier) is the standard evaluation protocol: it directly measures representation quality instead of relying on loss values.

Common Misconceptions: Many practitioners assume that if a model is training (loss decreasing), everything is progressing well. This example shows that decreasing loss can coincide with catastrophic representation quality degradation. Another misconception is that collapse only happens in self-supervised settings. Here, we see it in a standard autoencoder with only unsupervised reconstruction loss. The key requirement for collapse is lack of constraints forcing diversity.

Some believe the network “gets stuck in a local minimum” when collapse occurs. Actually, collapse need not be a bad local minimum: the reconstruction objective places no constraint on how variance is distributed across latent dimensions, so with a sufficiently expressive decoder a low-rank code can be among the minimizers of the unregularized objective. The network finds a valid solution to the optimization problem, but the problem itself (minimize reconstruction loss) is the issue, not the optimization procedure. Reformulating the problem (adding regularization) changes the solution.

What-if Scenarios: Suppose we add KL regularization (as in VAEs): \(\mathcal{L}_{\text{VAE}} = \|\hat{\mathbf{x}} - \mathbf{x}\|^2 + \beta\, \text{KL}(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))\) with prior \(p(\mathbf{z}) = \mathcal{N}(0, I)\), where \(\beta\) weights the KL term. The KL term penalizes the encoder for producing distributions that deviate from the unit-variance prior (as narrow, collapsed codes do). Training with \(\beta = 0.01\) (weak KL weighting) still allows some collapse but much less severe: effective rank reaches ~20 instead of ~3. With \(\beta = 1\) (standard VAE), effective rank stays near 30, preventing severe collapse. This shows that regularization strength controls degeneracy.
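The KL term has a closed form when the encoder outputs a diagonal Gaussian, which is the usual VAE parameterization and is assumed here. The sketch below computes the per-sample KL and the \(\beta\)-weighted loss; the function names and toy inputs are illustrative.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in nats for one encoded sample."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))


def beta_vae_loss(x, x_hat, mu, sigma, beta):
    """Squared reconstruction error plus the beta-weighted KL term."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + beta * kl_to_standard_normal(mu, sigma)


# A posterior matching the prior costs nothing; a very narrow one is penalized.
kl_matching = kl_to_standard_normal(mu=[0.0, 0.0], sigma=[1.0, 1.0])
kl_narrow = kl_to_standard_normal(mu=[0.0], sigma=[0.1])
```

The penalty grows as the posterior standard deviation shrinks toward zero, which is how the KL term discourages degenerate, near-deterministic codes.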

Alternatively, add explicit variance maximization: \(\mathcal{L}_{\text{reg}} = \|\hat{\mathbf{x}} - \mathbf{x}\|^2 - \alpha \text{tr}(\text{Cov}(\mathbf{z}))\) with small \(\alpha\). This directly penalizes low variance, fighting collapse. Even with \(\alpha = 0.01\), effective rank stays near 25-28, and collapse is essentially prevented. Different regularization strategies (KL, variance, decorrelation) all combat collapse but in different ways.

Explicit ML Relevance: This scenario illustrates why modern autoencoders use regularization: VAEs, β-VAEs, and other variants all include terms preventing collapse. In practice, autoencoders on real datasets without regularization often show degraded performance: reconstruction looks good on training data but poor on test data (overfitting); downstream classifier performance is mediocre because representations lack structure. Practitioners learn to add regularization terms empirically, not knowing the underlying mechanism of collapse that makes regularization necessary.

The scenario also explains why contrastive self-supervised methods prevent collapse more naturally than unsupervised reconstruction: the contrastive objective inherently requires diverse representations. This advantage of contrastive methods isn’t always appreciated: they avoid pathologies that plague other objectives.

Example 5 — Invariance Construction Example

Setup: Train two image classifiers on MNIST: one WITHOUT rotation augmentation and one WITH rotation augmentation (training images are randomly rotated within ±30 degrees). Both use the same CNN architecture. After training, test both on unrotated MNIST images and on 4 rotation conditions: 15°, 30°, 45°, and 90°. For each model and each rotation angle, measure the representation distance: how much does the representation of an image change when the image is rotated?

Reasoning: The model trained WITHOUT augmentation learns to classify upright digits optimally. Its representations are specialized for upright digits: the first layer learns filters that are sensitive to upright strokes. When we apply a 15° rotation to a test image, these filters respond differently, causing representations to shift significantly. Empirically, the representation changes by \(\approx 20\%\) of the mean inter-class distance with a 15° rotation and by \(\approx 40\%\) with a 30° rotation.

In contrast, the model trained WITH augmentation learns representations that are approximately invariant to rotations. During training, the network sees the same digit at many angles: a 3 rotated 15° and a 3 rotated 30° are both positive examples for class 3. This teaches the network to produce similar representations regardless of angle. The learned filters are rotation-invariant or nearly so: they respond similarly to rotated versions of the same stroke pattern. After training with augmentation, rotating test images 15° changes representations by only \(\approx 5\%\) of inter-class distance, 30° by \(\approx 8\%\). Invariance improved substantially.

However, perfect invariance never occurs in practice. Some residual sensitivity remains because: (1) Augmentations cover only a limited range (±30°), so the network must extrapolate beyond the training rotations, and invariance degrades there (a 45° rotation shows \(\approx 15\%\) representation change, 90° shows \(\approx 50\%\)); (2) The network might not have learned pure rotation invariance but rather some correlated feature that happens to be rotation-similar for small angles; (3) Boundary effects and interpolation artifacts in image rotation introduce slight differences.
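The percentages above express representation change relative to the spacing between classes. A minimal sketch of that measurement follows, assuming "mean inter-class distance" means the average Euclidean distance between class centroids (the text does not fix this definition); the function names and the toy arrays are hypothetical.

```python
import numpy as np

def mean_interclass_distance(z, labels):
    """Average Euclidean distance between class centroids (assumed definition)."""
    classes = np.unique(labels)
    centroids = np.stack([z[labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    upper = np.triu_indices(len(classes), k=1)  # each unordered pair once
    return float(dists[upper].mean())

def relative_shift(z_upright, z_rotated, labels):
    """Mean per-image representation change as a fraction of inter-class distance."""
    shift = np.linalg.norm(z_rotated - z_upright, axis=1).mean()
    return float(shift / mean_interclass_distance(z_upright, labels))

# Toy data: two classes 10 units apart, every representation moves by 1 unit.
labels = np.array([0, 0, 1, 1])
z_upright = np.array([[0.0, 0.0], [0.0, 0.0], [10.0, 0.0], [10.0, 0.0]])
z_rotated = z_upright + np.array([1.0, 0.0])

ratio = relative_shift(z_upright, z_rotated, labels)
print(ratio)
```

In the experiment, `z_upright` and `z_rotated` would be the penultimate-layer outputs for the same test images before and after rotation, computed once per model and per angle.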

Interpretation: This example demonstrates that invariances are learned properties, not architectural hardwiring. The Theorem on Invariance-Equivariance Decomposition showed that any representation can be decomposed as equivariant component plus invariant projection. Here, the augmentation-trained model has learned a strong invariant projection (the mapping from inputs to representations throws away rotation information), while the unaugmented model has not.

The quantitative difference (20% vs 5% representation change) is significant for downstream tasks. A linear classifier trained on the invariant representations generalizes better to rotated test images because the class boundary doesn’t move when images rotate. For the non-invariant representations, the class boundary moves substantially, causing accuracy drops.

The fact that invariance improves but isn’t perfect is important: it reveals that learning is constrained by inductive biases and generalization limits. The CNN architecture itself supplies some geometric priors (spatial convolutions build in local translation equivariance), but rotations are fundamentally different: nothing in a standard convolution ties a filter to its rotated copies. Learning rotation invariance therefore requires the network to see varied rotations during training. Without augmentation, it never needs to learn this property and doesn’t.

Common Misconceptions: Many assume that data augmentation is purely about increasing data quantity (more diverse training examples). While that’s part of it, the primary effect is teaching invariances. Augmentation adds no new classes or labels; the network only sees more variants of the same digits. What augmentation teaches is which variations don’t matter (rotation, brightness) and which do (local structure, strokes).

Another misconception is that stronger augmentations always help. Extremely strong augmentations (±90° rotations) might teach invariance to meaningless deformations (a rotated 6 looks like a 9, which is actually a different class). The balance in choosing augmentations is to match data variations that are task-irrelevant but commonly occur. For handwritten digit recognition, ±30° rotation is reasonable. For upright machine-printed digits, rotation invariance might hurt performance.

What-if Scenarios: Suppose we trained with even stronger augmentation (±45° rotations). Invariance to 30° would improve further, but the cost would be degraded performance on upright images (the network would treat 6 and 9 as more similar due to rotation equivalence). This is the invariance-information tradeoff: invariance to some transformations necessarily means losing sensitivity to variations that align with those transformations.

Alternatively, what if we used group-equivariant CNNs (G-CNNs) that explicitly encode rotation equivariance in the architecture? These networks use filters that transform under group operations. A G-CNN trained on the unrotated dataset (no augmentation) would automatically produce rotation-equivariant representations: rotating input by θ rotates representation by θ. This demonstrates that architectural priors can replace or supplement learned invariances from augmentation.

Explicit ML Relevance: In practice, data augmentation for invariance is ubiquitous. ImageNet models use random crops (teaching translation invariance), brightness jittering (lighting invariance), and horizontal flips (left-right invariance). These choices reflect what variations don’t matter for object recognition. Medical imaging uses different augmentations (slight rotations, elastic deformations, intensity shifts) because different variations are task-irrelevant.

The concept of invariance guides architectural choices: using convolutions (translation equivariant/invariant) rather than fully connected layers for images, using attention mechanisms (permutation equivariant) for sets and sequences. Modern self-supervised methods (SimCLR, BYOL) treat strong augmentation (stronger than typical supervised augmentation) as crucial to learning useful representations for transfer learning.

Example 6 — Equivariance Under Group Action

Setup: Design a rotation-equivariant network using group-equivariant convolutions (G-CNN) on MNIST. The network targets SO(2)-equivariance (continuous rotations), meaning that if the input is rotated by angle θ, the feature maps are rotated by θ; in practice the continuous group is approximated by a finite set of sampled filter orientations. Use a simple G-CNN: 1 input channel → 4 orientations of filters → 16 rotated filters → global average pooling → classification. Train on unrotated MNIST (no rotation augmentation). After training, measure how representations transform under rotations.

Reasoning: A standard CNN has translational equivariance (spatial shifting of input shifts feature maps identically) hardwired via weight sharing on a grid. This equivariance emerges automatically without explicit training. Extending this to rotation requires explicitly constructing filters that transform under rotations and channels that parametrize orientations.

In the G-CNN, the first layer has 4 orientations: each of 4 basis filters (edge detectors at 0°, 45°, 90°, 135°) is parametrized as a continuous function of angle and sampled at these 4 orientations. When the input image is rotated by some angle θ, each orientation-specific feature map is affected predictably: what was detected at 0° is now detected at θ. The second layer respects this structure: its convolutions couple outputs across orientations in rotation-equivariant ways.

Training proceeds normally: the network learns to classify unrotated digits even though it has built-in equivariance structure. Interestingly, the network learns effectively despite this structural constraint. The bottleneck representations (before final pooling) have shape \((\text{batch}, 16 \text{ channels}, 4 \text{ orientations}, H, W)\). For a test image, rotating it by 45° produces feature maps where all 4 orientations shift systematically: orientation-0 features move toward orientation-45, etc., like a rotation in feature space.

To measure equivariance empirically, take pairs of test images that differ only by a rotation and check that their representations (taken before the final pooling, which discards orientation) differ by the corresponding transformation in representation space. This can be measured by comparing the alignment of the top principal components or with other geometric measures. A standard CNN shows no such alignment (representations of rotated images look like unrelated variants), while the G-CNN shows clear geometric alignment (signatures of rotation).
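The equivariance property can be checked exactly on a small case. The sketch below is not the SO(2) network described above: it assumes the simpler four-fold rotation group (90° steps), where rotations are exact on a pixel grid, and uses a single random filter. `correlate_valid` and `lift_c4` are illustrative helper names.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate_valid(image, kernel):
    """Plain 2D cross-correlation without padding."""
    windows = sliding_window_view(image, kernel.shape)  # (H', W', kh, kw)
    return np.einsum("ijab,ab->ij", windows, kernel)

def lift_c4(image, kernel):
    """One orientation channel per 90-degree rotation of the same kernel."""
    return np.stack([correlate_valid(image, np.rot90(kernel, r)) for r in range(4)])

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))

features = lift_c4(image, kernel)                # shape (4, 6, 6)
features_rot = lift_c4(np.rot90(image), kernel)  # same layer, rotated input

# Equivariance: rotating the input rotates each feature map spatially and
# shifts the orientation channels cyclically by one step.
expected = np.stack([np.rot90(features[(r - 1) % 4]) for r in range(4)])
equivariant = bool(np.allclose(features_rot, expected))

# Pooling over space and orientation then yields a rotation-invariant summary.
invariant_pooled = bool(np.isclose(features_rot.sum(), features.sum()))
print(equivariant, invariant_pooled)
```

Running the same check on an ordinary convolution (a single unrotated filter) fails for a generic kernel, which is the contrast between the standard CNN and the G-CNN drawn in the text.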

Interpretation: Equivariance is fundamentally about structure preservation. The G-CNN preserves the structure of rotations: that rotating an object by θ is the “same transformation” in feature space as in input space. This is stronger than invariance: the G-CNN “knows about” rotations and represents them explicitly in its feature maps, allowing downstream layers to reason about object orientation explicitly if needed.

The benefit of equivariant architectural structure is that it makes orientation explicit. A standard CNN discriminates between rotations by learning those distinctions through data. A G-CNN makes orientation a first-class feature that the network can use. This is particularly useful for tasks where orientation matters: predicting the orientation of a character, estimating 3D pose, or handling rotated objects.

Equivariance also improves sample efficiency: the network doesn’t need to see rotated examples separately to know their structure is related. If it sees an upright 3, it can infer properties of a rotated 3 through equivariance. This isn’t true invariance (it doesn’t treat rotations as the same), but it’s a form of structure sharing.

Common Misconceptions: A key misconception is that equivariance and invariance are the same. They’re complementary: equivariance preserves transformation structure, invariance discards it. For classification, the final layer usually projects to an invariant representation (the class, which doesn’t depend on orientation), but intermediate equivariant layers preserve orientation structure for processing.

Another misconception is that implementing equivariance is always beneficial. For tasks where orientation is irrelevant (digit classification, ImageNet classification on upright images), enforcing equivariance structure costs some flexibility and makes training harder because the architecture has reduced degrees of freedom compared to unconstrained networks. Equivariance shines for tasks with clear transformation structure (3D vision, group-theoretic data).

What-if Scenarios: Suppose we extended the G-CNN to SO(3)-equivariance (3D rotations) for volumetric medical imaging (CT or MRI scans). The network would be equivariant to 3D rotations: rotating the 3D scan should rotate feature maps consistently. This additional structure makes the network particularly suitable for medical imaging, where tumors, vessels, and organs might appear at various orientations.

Alternatively, reduce the network to Z_4-equivariance (equivariance to 90° rotations only, not arbitrary rotations). This is simpler to implement (4 orientations instead of continuously parameterized) and trains faster. Empirically, it still helps performance on rotated images, though not as much as full SO(2)-equivariance.

Explicit ML Relevance: Group-equivariant networks have shown benefits in scientific domains. Protein folding predictions (AlphaFold-like models) use equivariances to 3D rotations and reflections because protein structure is inherently 3D. Geometric deep learning on graphs and point clouds builds equivariances to permutations and rigid transformations. These architectural choices reflect domain structure and improve both sample efficiency and generalization.

The field of equivariant neural networks has grown significantly, recognizing that many tasks have natural group structures that should be encoded architecturally. This approach often outperforms augmentation alone because architectural priors hold exactly for every input and are more sample-efficient than sampling-based invariance learning.

Example 7 — Spectral Bias in Neural Networks

Setup: Train a 3-layer fully-connected network on a 1D synthetic function-fitting task: given inputs \(x \in [-1, 1]\), learn \(y = \sin(2\pi x) + 0.1 \sin(10\pi x)\) (a low-frequency sine with a small high-frequency component), observed with additive noise. The network has 512 hidden units and ReLU activations. Sample 100 training points uniformly from the input range. Train with MSE loss for 10,000 iterations.

Reasoning: Neural networks exhibit spectral bias: they learn low-frequency components of functions before high-frequency components. During early training (iterations 0-1000), the network learns approximately \(\hat{y} \approx \sin(2\pi x) + \text{noise}\). The learned function is smooth, fitting the main low-frequency sine component well but failing to capture the high-frequency \(0.1 \sin(10\pi x)\) component.

Intermediate training (iterations 1000-5000) shows gradual introduction of higher frequencies. The network predicts approximately \(\hat{y} \approx \sin(2\pi x) + 0.05 \sin(10\pi x) + \text{noise}\), partially fitting the high-frequency component but not fully.

Late training (iterations 5000-10000) approaches the true function: \(\hat{y} \approx \sin(2\pi x) + 0.1 \sin(10\pi x) + \text{noise}\). The network has learned both components, fitting all frequencies present in the training data.

This progression—low frequencies first, high frequencies later—is the spectral bias. We can measure it quantitatively by Fourier analyzing the function \(\hat{y}(x)\) at each training step. Define \(E_k(t) = |\hat{y}_k(t)|^2\) where \(\hat{y}_k\) is the \(k\)-th Fourier coefficient. For small frequencies (\(k=1, 2\)), \(E_k\) increases rapidly (reaching close to optimal within ~100 iterations). For large frequencies (\(k=10, 20\)), \(E_k\) increases slowly and lags behind. By iteration 10,000, low frequencies are fit nearly perfectly while high frequencies still have some error.
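The per-frequency energies \(E_k\) can be computed with a discrete Fourier transform of the network output on a uniform grid. The sketch below is one way to do it, assuming a grid over one period of length 2 and an amplitude normalization chosen here for readability; it is applied to the target function rather than a trained network, so the values shown are what a fully converged model would approach.

```python
import numpy as np

def fourier_energy(y_on_grid):
    """E_k = |c_k|^2 for samples on a uniform grid covering one period.

    Assumed normalization: c_k = 2 * rfft(y)_k / N, so a sinusoid of
    amplitude a at integer frequency k >= 1 gives E_k = a^2.
    (The k = 0 entry is therefore scaled by an extra factor of 2.)
    """
    n = len(y_on_grid)
    coeffs = 2.0 * np.fft.rfft(y_on_grid) / n
    return np.abs(coeffs) ** 2

n_grid = 256
x = np.linspace(-1.0, 1.0, n_grid, endpoint=False)  # window of length 2
target = np.sin(2 * np.pi * x) + 0.1 * np.sin(10 * np.pi * x)

energy = fourier_energy(target)
# On a window of length 2, sin(2*pi*x) completes 2 cycles (k = 2)
# and sin(10*pi*x) completes 10 cycles (k = 10).
print(energy[2], energy[10])
```

Calling `fourier_energy` on the network prediction at regular checkpoints and plotting `energy[2]` and `energy[10]` against iteration makes the lag of the high-frequency component visible.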

Interpretation: Spectral bias arises from the inductive bias of neural network architectures and initialization. Random initialization produces functions that are typically smooth (in the limit of infinite width, random networks act like kernel methods with smooth kernels). Learning a smooth approximation is easier than learning rough, oscillatory functions. This geometric preference for smoothness is built into the architecture without explicit specification.

Theoretically, this connects to the neural tangent kernel (NTK): in the infinite-width limit, neural networks behave like kernel methods with a specific kernel \(K\) determined by architecture. For fully connected networks with ReLU activations, the eigenvalues of this kernel decay with frequency: eigenfunctions with large eigenvalues are smooth and low-frequency, while high-frequency eigenfunctions have small eigenvalues. Under gradient descent, the error along each eigenfunction shrinks at a rate proportional to its eigenvalue, so the network fits components in order of eigenvalue magnitude, naturally fitting low frequencies first.
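The ordering by eigenvalue can be made explicit in the kernel regime. As a sketch, assume squared-error loss \(\tfrac{1}{2}\|\mathbf{f} - \mathbf{y}\|^2\), gradient flow with unit learning rate, and a fixed kernel Gram matrix \(K\) on the training inputs with eigenpairs \((\lambda_i, \mathbf{v}_i)\); the symbols \(\mathbf{f}(t)\), \(\lambda_i\), and \(\mathbf{v}_i\) are introduced here for illustration. The vector of training predictions \(\mathbf{f}(t)\) then evolves as

\[
\frac{d\mathbf{f}(t)}{dt} = -K\,(\mathbf{f}(t) - \mathbf{y}),
\qquad
\mathbf{f}(t) - \mathbf{y} = \sum_i e^{-\lambda_i t}\,\langle \mathbf{v}_i,\ \mathbf{f}(0) - \mathbf{y}\rangle\,\mathbf{v}_i .
\]

Each component of the residual decays independently at rate \(\lambda_i\). If large eigenvalues belong to low-frequency eigenvectors, the low-frequency error vanishes first and the high-frequency error persists for a time of order \(1/\lambda_i\).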

Practically, spectral bias means neural networks are “conservative” learners: they start simple and gradually complexify. This is actually beneficial for generalization—fitting noise first (overfitting) is avoided. The network preferentially fits real structure in the data (low frequencies, which are typically where real signal lives) before wasting capacity on noise (high frequencies).

Common Misconceptions: Some practitioners think spectral bias is a limitation or bug. Actually, it’s a feature that aids generalization. By learning simple functions first, networks avoid overfitting. If networks learned all frequencies equally, they’d memorize noise and fail to generalize.

Another misconception is that spectral bias only matters for small datasets or toy problems. In reality, it affects all neural network training. On ImageNet, networks learn low-frequency structure (objects, edges) before high-frequency details (textures, noise). The spectra have changed (image frequencies are different from 1D function frequencies), but the principle of learning easy, low-frequency features before hard, high-frequency details remains.

What-if Scenarios: Suppose we changed the hidden unit count from 512 to 32 (underparameterized). With fewer parameters, the network might not have capacity to fit all frequencies. After 10,000 iterations, it would fit the main sine well but fail to capture the high-frequency component, no matter how long we train. This shows that spectral bias interacts with capacity: wider networks eventually fit all frequencies given enough time, while narrow networks demonstrate spectral bias more dramatically because high frequencies are completely sacrificed.

Alternatively, use different activations: sigmoid or tanh instead of ReLU. Different activations have different kernels with different spectral properties. Sigmoid networks typically show even stronger bias toward low frequencies (the sigmoid kernel is even smoother than the ReLU kernel), while carefully designed activations might reduce spectral bias. This demonstrates that spectral bias is not a universal law but a consequence of architectural choices.

Explicit ML Relevance: Spectral bias helps explain many phenomena in deep learning. Why does transfer learning work? Low-frequency features learned on the source task transfer to the target task because low-frequency structure is usually task-invariant. Why do deep networks generalize despite massive overparameterization? Spectral bias delays memorization. Why does early stopping often improve test accuracy? The high frequencies learned late are often noise.

In practice, accelerating spectrum learning (learning high frequencies faster) usually requires larger learning rates (aggressive optimization), data preprocessing to normalize inputs and outputs (making high frequencies more salient), or architectural choices (different activations, or skip connections, which preserve high frequencies). ConvNets and Vision Transformers have different spectral biases; Vision Transformers appear to have a weaker spectral bias, learning frequencies more uniformly, which may be why they resist memorization less than CNNs on small datasets but work well on large datasets.

Example 8 — Alignment vs Uniformity Tradeoff

Setup: Implement two versions of normalized embedding learning on CIFAR-10: (1) “Alignment-Only” model that trains with positive pairs only (no negative samples), using loss \(\mathcal{L}_{\text{align}} = \|f(\mathbf{x}_i) - f(\mathbf{x}_i^+)\|^2\); (2) “Balanced” model using standard contrastive loss with negatives. Monitor both alignment \(\mathcal{A} = \mathbb{E}[\|f(\mathbf{x}) - f(\mathbf{x}^+)\|^2]\) and uniformity \(\mathcal{U} = \log \mathbb{E}[\exp(-\|f(\mathbf{x}_1) - f(\mathbf{x}_2)\|^2)]\) during training.

Reasoning: The alignment-only model minimizes alignment loss directly. The optimal solution is to map all images to the same point (representations collapse): if \(f(\mathbf{x}) = \mathbf{c}\) for all \(\mathbf{x}\), then \(\|f(\mathbf{x}) - f(\mathbf{x}^+)\| = 0\) exactly. Practitioners prevent this collapse through various optimization tricks, such as exponential moving averages (momentum encoders) or stop-gradients, but alignment-only methods inherently want to minimize feature diversity.

Monitoring uniformity for this model: with the definition above, \(\mathcal{U}\) is always \(\le 0\), and lower values mean a more uniform spread. For unit-norm embeddings, squared distances lie in \([0, 4]\), so \(\mathcal{U} \in [-4, 0]\); points spread uniformly over a high-dimensional hypersphere have typical squared distance 2 and achieve \(\mathcal{U} \approx -2\). The alignment-only model instead drifts toward \(\mathcal{U} \approx 0\), the worst possible value: when nearly all representations are identical, every pairwise distance is close to zero and the expectation inside the logarithm approaches \(\exp(0) = 1\).

The balanced contrastive model (standard SimCLR) explicitly includes negatives. During training, it simultaneously minimizes alignment (positive pairs get close) and minimizes \(\mathcal{U}\) (negatives push representations apart, encouraging spread). Empirically, this model achieves \(\mathcal{A} \approx 0.2\) (slight separation between positive pairs after normalization) and a value of \(\mathcal{U}\) close to that of the uniform distribution on the hypersphere.
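Both monitoring quantities follow directly from their definitions in the setup. The sketch below is illustrative only: it evaluates them on synthetic unit-norm embeddings rather than on a trained encoder, and the helper names are hypothetical. With \(\mathcal{U}\) defined as the log of an average of \(\exp(-\text{distance}^2)\), the value is at most 0, and lower values indicate a more uniform spread.

```python
import numpy as np

def normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def alignment(z, z_pos):
    """A = mean squared distance between each embedding and its positive."""
    return float(np.mean(np.sum((z - z_pos) ** 2, axis=1)))

def uniformity(z):
    """U = log of the mean of exp(-||z_i - z_j||^2) over distinct pairs."""
    sq_norms = np.sum(z ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (z @ z.T)
    upper = np.triu_indices(len(z), k=1)
    sq = np.clip(sq_dists[upper], 0.0, None)  # guard tiny negative round-off
    return float(np.log(np.mean(np.exp(-sq))))

rng = np.random.default_rng(0)
# Uniform on the sphere: normalized Gaussian samples.
z_uniform = normalize(rng.normal(size=(500, 64)))
# Nearly collapsed: every embedding is a tiny perturbation of one point.
z_collapsed = normalize(np.ones((500, 64)) + 1e-3 * rng.normal(size=(500, 64)))

u_uniform = uniformity(z_uniform)      # close to -2 in high dimension
u_collapsed = uniformity(z_collapsed)  # close to 0, the worst value
print(u_uniform, u_collapsed)
```

The collapsed embeddings have near-perfect alignment for any choice of positives yet the worst attainable uniformity, which is the failure mode of the alignment-only model.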

Linear evaluation: the alignment-only model achieves ~30% accuracy (not far above the 10% chance level for CIFAR-10), while the balanced model achieves ~90% accuracy. The poor alignment-only performance is not due to optimization failure (loss approaches zero) but because representations lack discriminative structure: they are nearly identical, so a linear classifier can barely distinguish classes.

Interpretation: The alignment-uniformity tradeoff is fundamental to contrastive learning. Perfect alignment (\(\mathcal{A} = 0\)) requires mapping all positives together, which necessitates sacrificing uniformity (representations cannot spread out while all same-class examples stay together). Conversely, perfect uniformity requires spreading all representations maximally, which breaks alignment (positive pairs cannot stay close if they are spread uniformly).

The tradeoff is parametrized by the number of classes and data structure. For K classes with balanced data, suppose alignment is so strong that each class collapses to a single point. Then a fraction of about \(1/K\) of random pairs coincide and each contributes \(\exp(0) = 1\) to the expectation, so \(\mathcal{U} \ge -\log K\) no matter how the K clusters are arranged. Fewer, tighter clusters occupy less of the sphere and therefore leave uniformity further from its unconstrained optimum.

The balanced contrastive model resolves this explicitly: positive samples pull representations together (alignment), negative samples push them apart (uniformity), creating a dynamic equilibrium. The proportion of negatives to positives (batch size) controls the tradeoff: large batches emphasize uniformity (more negatives push harder), small batches allow focusing on alignment (fewer negatives).

Common Misconceptions: Many practitioners don’t realize alignment and uniformity are competing objectives. They might assume both can be maximized simultaneously, which is impossible (they’re in tradeoff). The confusion arises because “representation quality” is multidimensional: aligned representations might be poor if they’re non-uniform (collapsed), and uniform representations might be poor if they’re misaligned (not capturing similarity relationships).

Another misconception is that alignment-only methods simply “don’t work.” Actually, they work fine for certain tasks where you don’t need discriminative structure—for instance, representation clustering or anomaly detection might benefit from alignment without needing uniformity. The failure is specific to downstream classification tasks.

What-if Scenarios: Suppose we used a different loss balancing: \(\mathcal{L} = 2\mathcal{A} + \mathcal{U}\) (double weight on alignment). The model would prioritize alignment over uniformity: representations of positive pairs would be even more similar, but uniform spread would decrease slightly. Linear evaluation accuracy might improve on few-shot learning (small labeled dataset) but might degrade on standard evaluation because the model has overly clustered representations with significant class overlap.

Alternatively, use a weighting reflecting data statistics: for 10-class CIFAR-10, use \(\mathcal{L} = \mathcal{A} + 10\mathcal{U}\) (emphasize uniformity). Representations would spread more, achieving better uniformity at the cost of slightly larger alignment error. Linear evaluation accuracy might degrade slightly on balanced CIFAR-10 but improve on extremely imbalanced datasets where uniformity is more critical.

Explicit ML Relevance: The alignment-uniformity perspective unifies diverse self-supervised methods. SimCLR explicitly balances both via large negative batches. MoCo uses a momentum encoder with a queue of past embeddings to provide more negatives, shifting toward stronger uniformity. BYOL and SimSiam use asymmetric architectures or stop-gradients to avoid collapse without explicit negatives, finding alternative mechanisms to maintain uniformity. Understanding the tradeoff clarifies why these methods work and when they might fail.

In practice, practitioners often find that increasing batch size improves transfer learning performance (more negatives improve uniformity), but the effect saturates beyond batch size ~1024. This saturation reflects the alignment-uniformity tradeoff: after sufficient negatives, adding more doesn’t improve uniformity further (representations are already maximally spread), and forcing tighter alignment becomes counterproductive.

Example 9 — Degenerate Latent Space Case

Setup: Train a VAE on CelebA (face images, 64×64×3) with a bottleneck dimension of 2 (extremely low, for visualization purposes). The encoder produces 2D Gaussian parameters \(\mu, \sigma\), and the decoder reconstructs from the 2D latent code. The VAE objective, which is maximized, combines a reconstruction term and a KL divergence penalty: \(\mathcal{L} = \mathbb{E}[\log p(\mathbf{x}|\mathbf{z})] - \beta \text{KL}(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))\). Use three settings: \(\beta = 0.001, 0.1, 1.0\) (from weak to strong KL term).

Reasoning: With \(\beta = 0.001\) (weak KL regularization), the VAE prioritizes reconstruction. The encoder learns to map each face to a distinct 2D code that the decoder can reconstruct well. Empirically, codes concentrate in a 2D region with \(\text{Cov}(\mathbf{z}_\text{train}) \approx I_2\) (unit covariance), but codes spread non-uniformly to maximize reconstruction: faces with similar features cluster together. Importantly, much of the 2D space remains unused: only a small region (perhaps 0.3 of the total Gaussian space) is populated by training examples.

At inference, sampling from the prior \(\mathcal{N}(0, I)\) and decoding produces artifacts for codes outside the populated region. If we sample codes uniformly from \([-3, 3]^2\) (a box containing more than 99% of the probability mass of \(\mathcal{N}(0, I)\)), many samples decode to blurry, unrealistic faces or gibberish, particularly in sparsely populated regions. This indicates a “hole” in the latent space: regions between training clusters are unoccupied because the encoder never maps training data there.
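The prior mass inside the sampling box can be checked directly from the Gaussian error function. This is a small sketch using only the standard library; the function name is illustrative.

```python
import math

def standard_normal_box_mass(half_width, dim):
    """Probability that a N(0, I) sample in `dim` dimensions lies in
    the box [-half_width, half_width]^dim (coordinates are independent)."""
    per_axis = math.erf(half_width / math.sqrt(2.0))
    return per_axis ** dim

mass = standard_normal_box_mass(3.0, 2)
print(round(mass, 4))  # about 0.9946
```

Because the box covers nearly all of the prior, artifacts found by sweeping it are artifacts that ordinary prior sampling will also hit, only less often near the corners.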

With \(\beta = 1.0\) (strong KL regularization), the objective changes. The KL term \(\text{KL}(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z})) = 0\) when \(q(\mathbf{z}|\mathbf{x})\) is Gaussian with mean 0 and variance 1 (matches the prior exactly). This term strongly encourages the encoder to produce outputs close to the prior distribution. The encoder must now trade off reconstruction (which wants specialized codes) against prior matching (which wants dispersed, standard normal codes). Reconstruction error increases compared to \(\beta = 0.001\), but latent space becomes more uniform: codes spread across the entire Gaussian space rather than clustering.
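For a diagonal Gaussian encoder, the KL term has a closed form, which makes the pull toward the prior easy to inspect. The sketch below assumes the common parametrization by mean and log-variance; the function name and the example values are illustrative.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

# Matches the prior exactly: KL is zero.
kl_prior = kl_to_standard_normal(np.zeros(2), np.zeros(2))

# A narrow, off-center code (the kind that favors reconstruction): KL is large.
kl_narrow = kl_to_standard_normal(np.array([2.0, -1.0]), np.full(2, -4.0))

print(kl_prior, kl_narrow)
```

The penalty grows both when the mean moves away from the origin and when the variance shrinks below 1, which are the two ways an encoder specializes its codes for reconstruction.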

The intermediate setting \(\beta = 0.1\) balances both: reconstruction is reasonable (few visual artifacts), and the latent space is partially filled (hole artifacts are less severe but still present, and coverage is not fully uniform). Sampling from latent space and decoding produces mostly realistic faces, with occasional glitches in sparsely populated regions.

Interpretation: This example illustrates the information-compression tradeoff. Low \(\beta\) achieves high reconstruction (preserves image details) but creates degeneracy: latent space exploits all available dimensions to gain reconstruction performance, leaving some regions unused. High \(\beta\) trades reconstruction fidelity for regularized, well-behaved latent space. The “holes” in low-\(\beta\) cases represent regions where no training data provides supervision—the decoder hasn’t learned meaningful reconstructions for those codes.

Evaluating this degenerate space on downstream tasks: image classification fine-tuning on limited labeled faces works better with \(\beta = 0.001\) (encoder has learned more detailed features from reconstruction), but sampling new synthetic faces to augment training works better with \(\beta = 1.0\) (samples from holes produce nonsense, which is harmful for augmentation).

The degeneracy is problem-specific. For a face dataset where images cluster smoothly (faces form a low-dimensional manifold), even low \(\beta\) provides reasonable coverage. For more complex data (diverse objects, scene variations), degeneracy becomes worse: the high-dimensional manifold can’t be well-covered by finite training data in low-dimensional latent space.

Common Misconceptions: Many assume \(\beta = 1\) (standard VAE) is always correct. Actually, the optimal \(\beta\) depends on the task: generative modeling benefits from \(\beta = 1\) (well-regularized sampling), while representation learning for downstream tasks might benefit from \(\beta < 1\) (better reconstruction preserves details useful for tasks).

Another misconception is that holes in latent space indicate the model is “broken.” Holes are inevitable when the latent dimension is much lower than the intrinsic dimension of the data. The question is whether this is problematic for the task. For image generation, holes are bad (sampling produces garbage). For representation learning, holes are irrelevant as long as the downstream classifier never sees codes from them.

What-if Scenarios: Suppose we increased bottleneck dimension to 32. With \(\beta = 0.001\), latent space would still show some non-uniformity and holes, but less severe: more dimensions allow richer representation without hole artifacts. Reconstruction quality would improve dramatically (more capacity), but downstream transfer might worsen slightly (large representations sometimes overfit to source task statistics).

Alternatively, use a hierarchical VAE or ladder network with multiple latent variable levels at different scales. Different levels can capture different amounts of detail (low-level variables capture fine details, high-level capture coarse structure), potentially filling latent space more efficiently than a single bottleneck.

Explicit ML Relevance: This scenario explains why VAEs require careful \(\beta\) tuning in practice. The literature reports that best β varies by application: small β (0.01-0.1) for reconstruction-focused tasks, large β (1-10) for generative modeling and systematic factor discovery. Modern variants like β-TCVAE and FactorVAE use modified objectives to achieve better disentanglement without sacrificing reconstruction so severely.

The degeneracy also motivates normalizing flow-based generative models, which parametrize complex distributions with invertible transformations. By using flows, these models push the effective latent space distribution beyond the simple Gaussian, enabling sparser codes while maintaining coverage.

Example 10 — Effect of Weight Decay on Representations

Setup: Train three ResNet-50 models on ImageNet with identical settings except for weight decay (L2 regularization coefficient): (1) No weight decay (\(\lambda = 0\)), (2) Standard weight decay (\(\lambda = 1 \times 10^{-4}\)), (3) Strong weight decay (\(\lambda = 1 \times 10^{-3}\)). After training, extract representations from the penultimate layer for 10,000 validation images. Analyze representation properties: covariance eigenvalues, parameter norms, generalization.

Reasoning: Without weight decay, the optimization objective is purely the classification loss. Parameters can grow arbitrarily large if higher magnitude weights reduce loss. The network learns representations that could be very “spiky”—individual dimensions have large magnitude. The covariance matrix \(\Sigma = \text{Cov}(\mathbf{z})\) can have large eigenvalues (high variance), but the spectrum can be unbalanced: a few large eigenvalues, many small ones. Parameter norms \(\|\theta\|^2\) are potentially large.

With standard weight decay (\(\lambda = 1 \times 10^{-4}\)), the objective becomes \(\mathcal{L}_{\text{classification}} + \lambda \|\theta\|^2\). This penalizes large parameters, pushing the network toward smaller weights. During training, the network must trade off: larger weights reduce classification loss, but incur regularization penalty. At convergence, parameters have moderate magnitude. The balance point reflects that classification loss reduction becomes expensive at large parameter values.
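To make the trade-off concrete, a single gradient-descent step on the penalized objective can be written out. This is a sketch for plain gradient descent; the step size \(\eta\) is notation introduced here for illustration, since the setup does not state the optimizer's update rule.

```latex
\begin{aligned}
\nabla_\theta\!\left(\mathcal{L}_{\text{classification}} + \lambda\|\theta\|^2\right)
  &= \nabla_\theta \mathcal{L}_{\text{classification}} + 2\lambda\theta, \\
\theta_{t+1}
  &= \theta_t - \eta\left(\nabla_\theta \mathcal{L}_{\text{classification}}(\theta_t) + 2\lambda\theta_t\right) \\
  &= (1 - 2\eta\lambda)\,\theta_t - \eta\,\nabla_\theta \mathcal{L}_{\text{classification}}(\theta_t).
\end{aligned}
```

Each step first shrinks the weights by the factor \(1 - 2\eta\lambda\) and then applies the classification gradient, so the shrinkage grows in proportion to the current weight magnitude. The balance point described above is where this shrinkage offsets the growth the classification gradient asks for. Some implementations define the penalty as \((\lambda/2)\|\theta\|^2\), in which case the factor becomes \(1 - \eta\lambda\).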

Effect on representations: representations remain discriminative (network still classifies well), but their magnitudes are constrained. The covariance spectrum becomes more balanced (eigenvalues spread more evenly rather than concentrating on a few large values). This can be understood via the Spectral Regularization Effect Theorem: weight decay acts as a regularization that controls parameter scaling, which in turn controls representation scale.
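The notion of a "balanced" versus "unbalanced" covariance spectrum can be checked numerically. The sketch below uses synthetic Gaussian features as a stand-in for penultimate-layer representations (an assumption; no trained model is involved), and summarizes the spectrum with the participation ratio, which is one convenient summary among several and is not a definition taken from this chapter.

```python
import numpy as np

def covariance_spectrum(z):
    """Eigenvalues of the sample covariance of z with shape (n_samples, d), largest first."""
    zc = z - z.mean(axis=0, keepdims=True)
    cov = zc.T @ zc / (z.shape[0] - 1)
    eig = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order
    return np.clip(eig, 0.0, None)        # remove tiny negative round-off

def participation_ratio(eig):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Equals d for a perfectly flat spectrum and approaches 1 when a single
    direction carries almost all of the variance.
    """
    return eig.sum() ** 2 / np.square(eig).sum()

rng = np.random.default_rng(0)
n, d = 5000, 64

# Hypothetical "balanced" features: equal variance in every direction.
balanced = rng.normal(size=(n, d))

# Hypothetical "spiky" features: 4 directions with large scale, the rest tiny.
scales = np.concatenate([np.full(4, 10.0), np.full(d - 4, 0.1)])
spiky = rng.normal(size=(n, d)) * scales

pr_balanced = participation_ratio(covariance_spectrum(balanced))
pr_spiky = participation_ratio(covariance_spectrum(spiky))
print(f"balanced: {pr_balanced:.1f} of {d} dimensions")
print(f"spiky:    {pr_spiky:.1f} of {d} dimensions")
```

On real features, the same two functions can be applied to the matrix of extracted representations; comparing the value across the three weight-decay settings would be one way to test the claim about spectral balance.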

Empirically, increasing weight decay from 0 to \(10^{-4}\) improves test accuracy slightly (from 76.0% to 76.5% for ImageNet), reflecting better generalization from regularization. Further increasing to \(10^{-3}\) still improves test accuracy (76.8%), but training accuracy drops significantly (98% without decay, 94% with \(10^{-3}\) decay), indicating that the penalty is starting to limit how well the network fits the training set; pushing \(\lambda\) much higher would be expected to tip into underfitting.

Interpretation: Weight decay serves multiple purposes simultaneously: (1) Capacity control—constraining parameter magnitude prevents overfitting by limiting model complexity; (2) Implicit bias—biasing toward smaller, simpler solutions aligns with the simplicity bias we saw in spectral bias; (3) Representation stability—regularized parameters produce more stable, better-generalized representations.

Weight decay’s effect on representations is subtle: it doesn’t explicitly regularize representation geometry, but it does so implicitly by constraining parameters. A layer with smaller weights tends to have a smaller Lipschitz constant, so a small perturbation of its input produces only a small change in its output. Representations therefore become less sensitive to input noise, which increases robustness.

The Representation Collapse Characterization Theorem mentioned that dimensionality collapse can occur when optimization finds degenerate solutions. Weight decay prevents collapse by discouraging parameters from concentrating on a few large weights (which would be required for collapse—you’d need large weights on a low-rank representation and tiny weights elsewhere). Distributing weight magnitude more evenly makes collapse geometrically harder to achieve.

Common Misconceptions: Many practitioners think weight decay is only about preventing overfitting. While it does do that, weight decay simultaneously shapes the learned geometry. Some assume weight decay always hurts performance because regularization is a tradeoff. In fact, moderate weight decay usually improves test performance, at the cost of somewhat lower training accuracy, because it discourages bad solutions (large parameters that memorize noise). Only strong weight decay cuts training accuracy enough to hurt test performance.

Another misconception is that weight decay and dropout are equivalent. They’re complementary: dropout provides implicit capacity reduction by stochastic feature ablation, weight decay provides implicit capacity reduction by parameter magnitude constraint. Using both together often outperforms using either alone because they regularize different aspects.

What-if Scenarios: Suppose we used L1 regularization (\(\|\theta\|_1\)) instead of L2. L1 regularization pushes parameters toward sparsity: many weights become exactly zero, others large. Representations would have a different structure than L2: potentially sparser, with fewer important dimensions strongly activated. Sparsity sometimes improves interpretability and transfer learning but can hurt generalization on small datasets (fewer degrees of freedom for fitting).
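The difference between the two penalties shows up in a single update applied to the penalty term alone. The sketch below compares the multiplicative shrinkage produced by an L2 penalty with a soft-thresholding step for L1. Soft-thresholding (the proximal operator of the L1 norm) is a standard way to handle the L1 term and is swapped in here for illustration; the weights, step size, and \(\lambda\) are toy values, not taken from the example.

```python
import numpy as np

def l2_shrink(theta, lr, lam):
    """Gradient step on lam * ||theta||^2 alone: multiply every weight by (1 - 2*lr*lam)."""
    return (1.0 - 2.0 * lr * lam) * theta

def l1_soft_threshold(theta, lr, lam):
    """Proximal step for lam * ||theta||_1: reduce each magnitude by lr*lam, clipping at zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lr * lam, 0.0)

theta = np.array([2.0, -0.5, 0.03, -0.01, 0.0])  # toy weights
lr, lam = 0.1, 0.5                               # toy values, so lr * lam = 0.05

after_l2 = l2_shrink(theta, lr, lam)
after_l1 = l1_soft_threshold(theta, lr, lam)

print("L2:", after_l2)   # every nonzero weight is scaled by 0.9 and stays nonzero
print("L1:", after_l1)   # weights with magnitude below 0.05 become exactly zero
print("nonzero after L2:", np.count_nonzero(after_l2))
print("nonzero after L1:", np.count_nonzero(after_l1))
```

L2 shrinks large weights the most in absolute terms and never produces exact zeros, while L1 removes the same fixed amount from every weight, which is what drives small weights to exactly zero.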

Alternatively, use adaptive regularization: higher weight decay in early layers, lower in late layers (or vice versa). Early layer regularization prevents overfitting to input-specific details, late layer regularization prevents overfitting to task-specific patterns. This nuanced regularization could improve transfer learning by learning more task-invariant early features.

Explicit ML Relevance: In practice, weight decay is one of the most important hyperparameters. It’s rarely discussed explicitly but is crucial for good performance. “Best practices” for common architectures include specific weight decay values (e.g., \(10^{-4}\) for ImageNet ResNets). Different tasks need different decay: medical imaging models often use higher decay (more regularization) because training datasets are smaller.

The effect of weight decay on representations helps explain why transfer learning often works: pre-trained models are regularized (weight decay during training), producing relatively simple, general representations that tend to transfer better than those of unregularized models. This supports the practice of using pre-trained models even when source and target tasks differ: the regularized representations are more robust.

Example 11 — Overparameterized Representation Example

Setup: Train an 8-layer MLP on CIFAR-10 with two configurations: (1) “Standard” with 256 hidden units per layer (roughly 1.2M parameters, assuming flattened 3072-dimensional inputs and eight weight matrices), (2) “Overparameterized” with 2048 hidden units per layer (roughly 31M parameters under the same assumptions). Train both for 100 epochs with batch size 128, learning rate 0.1. Monitor training and test accuracy, plus the condition number of the Jacobian \(\kappa(J) = \sigma_{\max}(J) / \sigma_{\min}(J)\) (ratio of largest to smallest singular value).
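Parameter totals for an MLP depend on how the input is flattened and how layers are counted, so it is worth computing them directly. The sketch below assumes flattened 32×32×3 CIFAR-10 inputs (3072 features), eight weight matrices with biases, and 10 output classes; these conventions are assumptions made here, not details given in the setup.

```python
def mlp_param_count(input_dim, hidden, n_weight_layers, n_classes):
    """Count weights and biases of a fully connected network.

    Layer widths are: input_dim -> hidden (repeated) -> n_classes,
    with n_weight_layers weight matrices in total.
    """
    dims = [input_dim] + [hidden] * (n_weight_layers - 1) + [n_classes]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(dims[:-1], dims[1:]))

standard = mlp_param_count(3072, 256, 8, 10)
wide = mlp_param_count(3072, 2048, 8, 10)

print(f"256 hidden units:  {standard:,} parameters")   # 1,184,010
print(f"2048 hidden units: {wide:,} parameters")       # 31,492,106
print(f"ratio: {wide / standard:.1f}x")
```

Because the hidden-to-hidden matrices grow quadratically with width, widening by 8x multiplies the parameter count by far more than 8x under these assumptions.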

Reasoning: The standard model reaches about 95% test accuracy (94.9%) after 100 epochs, with training accuracy 99.5% (slight overfitting). The overparameterized model reaches essentially the same test accuracy (95.0%), but its training accuracy is 99.99%+ (a nearly perfect fit to the training set). Despite the ability to memorize the training set, test accuracy doesn’t degrade; it is marginally better than the standard model’s. This is consistent with the “double descent” picture, in which test error falls again once a model is large enough to interpolate the training data, so adding parameters can improve generalization despite enabling memorization.

Analyzing the Jacobian condition number (a proxy for how anisotropic the loss landscape is locally): the standard model achieves \(\kappa(J)\) between about 100 and 200 (moderate conditioning), while the overparameterized model achieves \(\kappa(J)\) between about 10 and 20 (better conditioning). Better conditioning means the loss landscape is more isotropic locally, so learning rates don’t need extreme adaptation across directions. This enables faster, more stable training.
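The condition number used above is the ratio of the largest to the smallest singular value, which can be read off an SVD. The sketch below builds a small matrix with known singular values as a stand-in for a network Jacobian (an assumption for illustration; extracting a real Jacobian would require an autodiff framework).

```python
import numpy as np

def condition_number(J):
    """Ratio of largest to smallest singular value of J.

    Returns inf (with a NumPy warning) if J is exactly rank-deficient.
    """
    s = np.linalg.svd(J, compute_uv=False)  # singular values, descending
    return s[0] / s[-1]

rng = np.random.default_rng(0)

# Random orthogonal factors from QR decompositions of Gaussian matrices.
U, _ = np.linalg.qr(rng.normal(size=(6, 6)))
V, _ = np.linalg.qr(rng.normal(size=(6, 6)))

# Hypothetical spectrum chosen so that the condition number is 150.
singular_values = np.array([150.0, 40.0, 12.0, 5.0, 2.0, 1.0])
J = U @ np.diag(singular_values) @ V.T

kappa = condition_number(J)
print(f"kappa(J) = {kappa:.1f}")
```

The value depends only on the extreme singular values, so two Jacobians with very different spectra in between can share the same condition number; inspecting the full spectrum is often more informative.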

Examining representations: both models learn similar discriminative representations. Linear evaluation (training a separate classifier on frozen representations) shows similar accuracy (~93% for both). However, the overparameterized model’s representations are slightly more “generic”—transfer to CIFAR-100 works slightly better (60% vs 59%), suggesting overparameterization helps learn more transferable representations.

Interpretation: Overparameterization changes the optimization landscape, making it easier to find good solutions. With more parameters, there are more flat directions (directions where loss is insensitive to changes). Gradient descent naturally exploits flat regions, finding solutions that generalize well. As stated in the Neural Tangent Kernel theory: overparameterized networks in the lazy training regime behave like kernel methods with smooth, well-behaved kernels that generalize despite massive overparameterization.

The representation learning aspect: with more parameters, the network has more flexibility to make representations match different objectives (class separation for classification, diversity for transfer learning, etc.) simultaneously. The standard model must make tradeoffs—squeeze everything into 256 dimensions per layer. The overparameterized model can allocate dimensions more liberally, learning richer representations.

The implicit bias toward simpler solutions (simplicity bias from spectral bias) remains active even with overparameterization. The network doesn’t suddenly memorize everything; instead, it first learns simpler features, then gradually fits harder-to-learn details. With more parameters, it has more capacity later in training for these details, but the order of learning is preserved. This explains why generalization doesn’t catastrophically fail despite ability to memorize.

Common Misconceptions: Many assume overparameterization always risks overfitting. The phenomenon of generalization despite overparameterization—double descent—reveals this is not universal. Modern deep learning thrives in the overparameterized regime precisely because of the combination of: (1) right inductive biases in architecture (convolutions, attention), (2) right optimization algorithm (SGD with momentum), (3) implicit regularization from these factors.

Another misconception is that more parameters always hurt transfer learning because the model specializes to source task. Actually, overparameterized models often transfer better because they learn more diverse representations. The extra capacity allows learning representations that satisfy both source task (classification) and general structure (transfer properties) simultaneously.

What-if Scenarios: Suppose we used a significantly smaller training set (1000 CIFAR-10 examples instead of 50,000). The standard model would achieve ~70% test accuracy with some overfitting (training 99% vs test 70%). The overparameterized model would show severe overfitting (training 100%, test ~45%), indicating that overparameterization’s benefits diminish with small data. The double descent phenomenon requires sufficient training data to provide enough constraints that the overparameterized model doesn’t collapse into memorization.

Alternatively, reduce overparameterization’s benefit by adding strong explicit regularization (dropout, weight decay). With regularization, the standard model doesn’t overfit (test remains 95%), and the overparameterized model’s advantage shrinks. This reveals that overparameterization’s implicit regularization benefits diminish if you add explicit regularization—they’re partially substitutes.

Explicit ML Relevance: This phenomenon explains the success of large language models and vision transformers: scaling parameters often improves test performance on downstream tasks despite easy training set memorization. The design philosophy of modern deep learning embraces overparameterization combined with appropriate implicit biases (architectural choices, optimization algorithms). This is contrary to classical machine learning wisdom (“simpler models generalize better”), but the empirical evidence in deep learning is compelling.

In practice, practitioners often scale up models (more parameters) before trying other improvements (different architectures, better data, different training procedures), reflecting that overparameterization often helps. The field has shifted from “use just enough parameters to fit training data” to “use as many parameters as computation allows.”

Example 12 — Stability of Embeddings Under Noise

Setup: Train a metric learning model (using triplet loss) on a face verification task (LFW-aligned dataset). After training, take test face images and add different types and levels of noise: (1) Gaussian noise with \(\sigma \in [0.01, 0.1]\), (2) Brightness perturbations ±20%, (3) Random crops of ±5 pixels (for 32×32 faces). For each noise level, measure embedding stability: how much does the representation distance between a pair change under noise?
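The three perturbation types in the setup can be sketched as small NumPy functions. Images are assumed to be float arrays of shape (height, width, channels) with values in [0, 1], and "crops of ±5 pixels" is interpreted here as a random translation followed by a crop back to the original size; both are assumptions, since the setup does not fix these details.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Add i.i.d. Gaussian noise with standard deviation sigma, then clip to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, size=img.shape), 0.0, 1.0)

def scale_brightness(img, factor):
    """Multiply intensities by factor (e.g. 0.8 or 1.2 for +/-20%), then clip to [0, 1]."""
    return np.clip(img * factor, 0.0, 1.0)

def random_shift_crop(img, max_shift, rng):
    """Translate by up to max_shift pixels per axis and crop back to the original size.

    Borders are filled by repeating edge pixels.
    """
    h, w = img.shape[:2]
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)  # upper bound is exclusive
    pad = [(max_shift, max_shift), (max_shift, max_shift)] + [(0, 0)] * (img.ndim - 2)
    padded = np.pad(img, pad, mode="edge")
    y0, x0 = max_shift + dy, max_shift + dx
    return padded[y0:y0 + h, x0:x0 + w]

rng = np.random.default_rng(0)
img = np.full((32, 32, 3), 0.5)          # hypothetical flat gray image

noisy = add_gaussian_noise(img, 0.05, rng)
bright = scale_brightness(img, 1.2)
shifted = random_shift_crop(img, 5, rng)
print(noisy.shape, bright.mean(), shifted.shape)
```

Clipping keeps perturbed images in the valid intensity range, which slightly reduces the effective noise near 0 and 1; whether to clip is a design choice that should match how the deployed system preprocesses inputs.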

Reasoning: For a face verification system, two embeddings from the same person should be similar (distance < threshold). If adding small noise dramatically changes distance, the system is unstable: verification might pass for the original but fail for slightly noisy inputs. Quantifying stability: for same-person pairs with original distance \(d_0 \approx 0.3\), perturb both images with Gaussian noise \(\sigma = 0.01\), and measure perturbed distance \(d_\sigma\). Stability metric: \(\Delta d = |d_\sigma - d_0| / d_0\).
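The stability metric \(\Delta d\) can be computed with a few lines once an encoder is available. In the sketch below, `toy_encoder` is a hypothetical random linear projection followed by L2 normalization, standing in for the trained triplet-loss model; the images are random arrays, so the printed value only illustrates the mechanics and says nothing about the face model itself.

```python
import numpy as np

def relative_distance_change(encoder, x1, x2, perturb):
    """Delta d = |d_sigma - d_0| / d_0 for one pair, perturbing both inputs."""
    d0 = np.linalg.norm(encoder(x1) - encoder(x2))
    d_sigma = np.linalg.norm(encoder(perturb(x1)) - encoder(perturb(x2)))
    return abs(d_sigma - d0) / d0

rng = np.random.default_rng(0)

# Hypothetical encoder: fixed random projection to 16 dimensions, unit-normalized.
W = rng.normal(size=(16, 32 * 32 * 3)) / np.sqrt(32 * 32 * 3)

def toy_encoder(x):
    z = W @ x.ravel()
    return z / np.linalg.norm(z)

# Hypothetical "same-person" pair: the second image is a mild variation of the first.
x1 = rng.uniform(size=(32, 32, 3))
x2 = np.clip(x1 + rng.normal(0.0, 0.05, size=x1.shape), 0.0, 1.0)

def gaussian(x, sigma=0.01):
    """Independent Gaussian noise is drawn for each call, so each image gets its own noise."""
    return x + rng.normal(0.0, sigma, size=x.shape)

delta_d = relative_distance_change(toy_encoder, x1, x2, gaussian)
print(f"Delta d = {100 * delta_d:.1f}%")
```

In practice the metric would be averaged over many pairs and several noise draws, since a single pair and a single draw give a noisy estimate.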

Empirically, Gaussian noise with \(\sigma = 0.01\) produces \(\Delta d \approx 5\%\): distance changes from 0.30 to 0.315, causing no verification failure (threshold typically at 1.0). Brightness perturbations ±20% produce \(\Delta d \approx 2\%\) (input space changes substantially, but learned representations are robust because training data included lighting variation augmentation). Crops ±5 pixels produce \(\Delta d \approx 8\%\).

Scaling to larger perturbations: Gaussian \(\sigma = 0.05\) produces \(\Delta d \approx 20\%\), so distance goes from 0.30 to 0.36, still well within the verification threshold of 1.0. Brightness ±50% produces \(\Delta d \approx 10\%\). At \(\sigma = 0.1\), Gaussian noise produces \(\Delta d \approx 50\%\): distance reaches ~0.45, which begins to threaten verification reliability for marginal same-person pairs whose clean distance already sits close to the threshold.

Importantly, different-person distances are more stable in relative terms: \(d_{\text{different}} \approx 1.2\), and the same noise produces \(\Delta d \approx 3\%\) (absolute change ~0.036). This asymmetry, in which same-person distances change proportionally more than different-person distances, reflects that the model learned representations where same-person pair distances cluster tightly (small baseline, so relative change is large), while different-person distances have larger spread.
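Relative change and absolute shift tell different stories, so it helps to tabulate both. The sketch below uses only the distances and percentages quoted in this example together with the threshold of 1.0; the "margin" column (distance from the clean value to the threshold) is a quantity introduced here for illustration.

```python
threshold = 1.0  # verification threshold quoted in the example

# (label, clean distance d0, relative change reported in the text)
cases = [
    ("same person, sigma=0.01", 0.30, 0.05),
    ("same person, sigma=0.05", 0.30, 0.20),
    ("same person, sigma=0.10", 0.30, 0.50),
    ("different person",        1.20, 0.03),
]

rows = []
for label, d0, rel in cases:
    shift = d0 * rel                 # absolute change in distance
    margin = abs(threshold - d0)     # room before the decision could flip
    rows.append((label, shift, margin, shift / margin))
    print(f"{label:26s} shift={shift:.3f}  margin={margin:.2f}  "
          f"shift/margin={shift / margin:.2f}")
```

Under these numbers the different-person shift of 0.036 consumes a larger share of its 0.2 margin than the same-person shift at \(\sigma = 0.05\) consumes of its 0.7 margin, even though its relative change is much smaller. The table ignores the direction of the shift, so it gives a worst-case reading rather than a prediction of actual errors.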

Interpretation: Embedding stability reflects representation robustness. A well-trained metric learning model produces stable embeddings: small input perturbations don’t cause large representation changes. This contrasts with adversarial fragility: some models produce dramatically different predictions for tiny input perturbations. Metric learning models trained on clean data without adversarial robustness can be unstable to noise they didn’t encounter during training (e.g., high Gaussian noise if training used only brightening/cropping augmentation).

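The asymmetry in stability (same-person more sensitive than different-person in relative terms) reflects the learned geometry. Same-person pairs sit at a small baseline distance (~0.3), so a given absolute shift is a large fraction of that distance; different-person pairs sit far apart (~1.2), so the same shift is a small fraction. Whether a verification decision actually flips depends on the absolute shift compared with the margin to the threshold: with a threshold of 1.0, typical same-person pairs have a margin of about 0.7 and typical different-person pairs about 0.2, so the pairs most at risk are those whose clean distance already lies near the threshold.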

Stability is constrained by the Lipschitz constant of the encoder. Embeddings produced by an \(L\)-Lipschitz encoder satisfy \(\|\mathbf{z}_1 - \mathbf{z}_2\| \leq L \|\mathbf{x}_1 - \mathbf{x}_2\|\). For noise with \(\|\delta\| \leq 0.01\), the embedding change is bounded by \(0.01L\). If \(L \approx 30\), then the embedding change is at most 0.3. The empirically observed shifts are far smaller than this, so the bound is consistent with our observations but loose.
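The Lipschitz bound on a single embedding extends to the pairwise distance used in the stability metric. The derivation below is a sketch in which \(f\) denotes the encoder and \(\delta_1, \delta_2\) the perturbations applied to the two images; these symbols are introduced here for illustration.

```latex
\begin{aligned}
|d_\sigma - d_0|
 &= \Bigl|\, \|f(\mathbf{x}_1 + \delta_1) - f(\mathbf{x}_2 + \delta_2)\|
           - \|f(\mathbf{x}_1) - f(\mathbf{x}_2)\| \,\Bigr| \\
 &\le \|f(\mathbf{x}_1 + \delta_1) - f(\mathbf{x}_1)\|
    + \|f(\mathbf{x}_2 + \delta_2) - f(\mathbf{x}_2)\|
    && \text{(reverse triangle inequality)} \\
 &\le L\left(\|\delta_1\| + \|\delta_2\|\right)
    && \text{($L$-Lipschitz encoder)}.
\end{aligned}
```

When both images are perturbed, the distance can therefore move by up to twice the single-embedding bound. Note also that the bound is stated in terms of the norm of the perturbation, not the per-pixel standard deviation: for i.i.d. Gaussian noise with standard deviation \(\sigma\) on an \(n\)-dimensional input, \(\mathbb{E}\|\delta\|^2 = n\sigma^2\), so \(\|\delta\|\) is typically close to \(\sigma\sqrt{n}\).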

Common Misconceptions: Many practitioners conflate adversarial robustness with stability under natural noise. An adversarial attack is an adaptive perturbation designed to fool the model—it exploits model fragility. Noise stability is about non-adaptive perturbations (random noise, natural variation). These are different: models can be robust to natural noise while remaining vulnerable to adversarial attack, and vice versa (adversarially robust models can remain sensitive to innocent noise they haven’t trained on).

Another misconception is that stability requires sacrificing accuracy. In many cases, regularization that improves stability (like a Lipschitz constraint) also improves generalization without hurting accuracy, provided the constraint is not so tight that it removes task-relevant signal. The model learns to be invariant to noisy variations that don’t matter for the task, focusing capacity on meaningful signal.

What-if Scenarios: Suppose we trained the metric learning model with noise augmentation during training (Gaussian noise, brightness jittering, and crops all included). The embedding stability would improve, particularly for the types of noise used in training. Gaussian noise robustness would improve to \(\Delta d \approx 2\%\) at \(\sigma = 0.01\) and \(\approx 8\%\) at \(\sigma = 0.05\), while brightness robustness, already good, would improve only modestly to \(\Delta d \approx 1\%\). A comparison between noise-augmented and non-augmented training would show trade-offs: improved robustness to the augmented noise types, possibly worse robustness to novel noise types not seen in training.

Alternatively, use adversarial training: during training, add the perturbations that most increase the loss within a norm budget (\(\max_\delta \mathcal{L}(f(\mathbf{x} + \delta), y)\)). This would improve embedding stability dramatically: \(\Delta d\) would remain < 5% even under strong noise. However, the cost would be slower training (each iteration has an inner maximization loop) and potentially reduced clean accuracy (capacity diverted from the main task to robustness).
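The inner maximization can be solved exactly in one simple case, which makes the idea easy to inspect. The sketch below uses a linear classifier with logistic loss and an \(\ell_\infty\) budget `eps` rather than a deep metric-learning encoder; the model, the loss, and the budget are all assumptions chosen for illustration. For deep networks the maximization has no closed form and is usually approximated with a few gradient steps.

```python
import numpy as np

def logistic_loss(w, x, y):
    """Logistic loss for a label y in {-1, +1} and linear score w @ x."""
    return np.log1p(np.exp(-y * (w @ x)))

def worst_case_linf(w, y, eps):
    """Exact maximizer of the logistic loss over ||delta||_inf <= eps for a linear score.

    The loss decreases in the margin y * (w @ x), so the worst perturbation pushes
    every coordinate by eps in the direction that lowers the margin.
    """
    return -eps * y * np.sign(w)

rng = np.random.default_rng(0)
w = rng.normal(size=8)      # hypothetical model weights
x = rng.normal(size=8)      # hypothetical input
y, eps = 1.0, 0.1

delta = worst_case_linf(w, y, eps)
clean = logistic_loss(w, x, y)
adv = logistic_loss(w, x + delta, y)

# Random sign perturbations with the same l_inf norm, for comparison.
random_losses = [
    logistic_loss(w, x + eps * rng.choice([-1.0, 1.0], size=8), y)
    for _ in range(100)
]
print(f"clean={clean:.3f}  adversarial={adv:.3f}  best random={max(random_losses):.3f}")
```

The comparison with random perturbations of the same size illustrates the distinction between adaptive and non-adaptive noise: the adversarial direction is never beaten by a random one under the same budget.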

Explicit ML Relevance: In production face verification systems, embedding stability is critical. The model operates in noisy real-world conditions (camera artifacts, lighting variations, compression artifacts from transmission). Systems must be robust. Practitioners often measure embedding stability on a held-out test set by adding realistic noise, checking that verification still works.

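Privacy-preserving machine learning also benefits from understanding embedding stability. Differential privacy mechanisms calibrate their noise to sensitivity, that is, to how much an output or gradient can change when a single input changes, which is a Lipschitz-type quantity. Some privacy attacks likewise probe how strongly outputs or gradients respond to small input changes, so stable embeddings give such attacks less signal to exploit.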

The concept also applies to model updates: if small parameter updates change representations only slightly, the system remains stable. This is relevant for continual learning, where models update over time; stability ensures that learned knowledge doesn’t change catastrophically with new training data.

Summary

Key Ideas Consolidated

This chapter established representation learning as a geometric and optimization phenomenon. The central thesis is that learned representations are not arbitrary numerical vectors but geometrically structured objects whose properties determine model behavior, generalization, and transfer learning success. We introduced formal definitions of representations, latent spaces, embeddings, and the geometric constraints (invariance/equivariance) that shape representations. We then presented theorems characterizing collapse, spectral properties, optimization bias, and stability, grounding abstract geometric intuitions in rigorous mathematics.

The key geometric insights are: (1) Representation geometry emerges from the optimization landscape: the loss function’s level sets, curvature, and gradient field shape how representations are positioned; (2) Information is distributed non-uniformly across dimensions, with eigenvalue decay that often follows an approximate power law rather than arbitrary scaling; (3) Fundamental tradeoffs exist between competing objectives (alignment vs uniformity, reconstruction vs compression, invariance vs information), and these are not failures of model design but structural constraints of the learning problem itself; (4) Degeneracy and collapse are not random failures but predictable consequences of optimization without explicit anti-collapse mechanisms (regularization, negative sampling, variance maximization).

The worked examples demonstrated that these theoretical concepts are observable in practice. Linear autoencoders reveal eigenspace discovery, contrastive learning shows angular separation on hyperspheres, weight decay controls representation scale, overparameterization enables richer representations despite capacity to memorize, and noise perturbations quantify embedding stability. Each example illustrated the bridge between theory and practice: the theorem statements predict phenomena we see empirically.

A synthesizing observation: representation learning succeeds when there is alignment between the learning objective, the architecture, and the optimization algorithm. Misalignments create pathologies. Deep networks with appropriate inductive biases (convolutions for images, attention for sequences) learn representations amenable to downstream tasks. Contrastive losses with large negative batches learn separated representations. Weight decay regularization helps prevent collapse. These are not independent choices but a coordinated system where each component contributes to discovering good representations.

What the Reader Should Now Be Able To Do

Upon completing this chapter, you should be able to:

Theoretical Competencies:

  1. Analyze representations geometrically: Given a trained model, compute and interpret eigenspectrum of the representation covariance matrix, diagnose effective rank, and identify collapse signatures.

  2. Predict representation properties from objectives: Given a learning objective, reason about what representations will be encouraged or discouraged (alignment vs uniformity, compression degree, etc.), and predict their geometric behavior.

  3. Formalize representation quality mathematically: Define what constitutes a “good” representation—high-dimensional enough to capture task-relevant information but not so high-dimensional that generalization fails, with eigenvalues distributed by information importance and stability to perturbations.

  4. Debug representation problems systematically: When a model fails, trace whether failure stems from collapse (spectral analysis), poor class separation (contrastive geometry), instability (Lipschitz bounds), or overfitting (generalization gap).

  5. Evaluate transfer learning potential: Assess whether learned representations are likely to transfer well to new tasks by analyzing their geometric properties (spectral structure, stability, alignment with task-relevant features).

Practical Competencies:

  1. Design augmentation and regularization strategies: Choose data augmentations and regularization parameters to enforce desired inductive biases (invariances, diversity, stability) based on target task geometry.

  2. Diagnose and prevent representation collapse: Recognize collapse signatures in self-supervised learning settings and implement targeted solutions (explicit negative sampling, architectural modifications like momentum encoding, diversity regularization).

  3. Prevent transfer learning failures: Tune regularization during pre-training (for example, higher weight decay, or a VAE \(\beta\) chosen for the downstream use rather than for reconstruction alone) to produce more general, transferable representations that avoid source task specialization.

  4. Address instability under augmentation: Recognize when learned representations exploit superficial statistics (visible as instability under natural augmentation), and add relevant augmentation types to training to teach invariance.

  5. Improve out-of-distribution robustness: Diagnose brittleness to distribution shift and apply targeted techniques (data augmentation teaching invariance, spectral normalization constraining Lipschitz, adversarial training emphasizing worst-case features) to improve robustness.

Structural Assumptions for Later Chapters

This chapter builds on prior foundational knowledge and makes assumptions for future extensions:

Assumptions from Earlier Chapters (Prerequisite Knowledge):

  • Chapter 12 (Adversarial Robustness): Definitions of adversarial perturbations, margin, Lipschitz continuity, robust optimization, and defense mechanisms. We connect representation Lipschitz constants to certified robustness guarantees.

  • Chapter 15 (Generalization Theory): PAC learning, Rademacher complexity, VC dimension, uniform stability. We link representation properties (effective rank, spectrum) to generalization bounds through information-theoretic arguments.

  • Chapter 17 (Optimization Basics): Gradient descent convergence, convexity, strong convexity, smoothness. Representation learning adds non-convexity; the chapter extends optimization intuitions to this setting.

Structural Assumptions Made in This Chapter:

  1. Euclidean Geometry: We assume representations live in \(\mathbb{R}^d\) with standard geometry (Euclidean distance, inner products). This restricts the analysis to Euclidean embedding spaces; extensions to non-Euclidean manifolds or discrete structures are future work.

  2. Fixed Loss Functions: We treat the training objective as fixed. In practice, objectives may evolve (curriculum learning, domain adaptation). The geometric framework extends but requires additional machinery.

  3. Frozen-Model Analysis: Representations are evaluated on trained, frozen models. Extensions to continual learning and online adaptation require analyzing how representations evolve dynamically.

  4. End-to-End Optimization: We focus on representations learned by direct gradient descent on task objectives. Hybrid approaches (retrieval-augmented systems, symbolic-neural integration) require separate analysis.

  5. Single Task Context: Most analysis assumes single-task learning. Multi-task settings introduce representational tradeoffs (aligning multiple objectives simultaneously) not fully addressed here.

Assumptions for Later Chapters (Forward Requirements):

  • Chapter 19 (Stochastic Gradient Dynamics): Will deepen the optimization perspective by analyzing how SGD dynamics create and select representations. Basin competition and noise geometry explain implicit biases of training algorithms, complementing this chapter’s geometric characterization of learned representations.

  • Chapter 20 (Implicit Regularization): Will show why implicit biases of SGD naturally discover representations optimizing the alignment-uniformity tradeoff. The dynamics analysis there explains why the geometric properties characterized in this chapter emerge automatically from training.

  • Chapter 21 (Robustness Under Distribution Shift): Will connect representation geometry to out-of-distribution generalization. Spectral properties and stability guarantees (Theorem 5 here) directly bound robustness to distribution shift, formalizing the intuition that robust representations have specific geometric structure.

  • Chapter 22+ (Empirical Studies): Will instantiate this chapter’s framework in specific domains (vision, language, graphs), analyzing how different architectures compute representations with desired geometric properties.

Limitations and Caveats Acknowledged:

  • Theory is primarily characterization, not prescription: This chapter describes geometric properties of learned representations but doesn’t always predict how to directly optimize these properties. Implementation often requires empirical search or gradient-based refinement.

  • High-dimensional geometry intuitions break down: Euclidean geometry in very high dimensions exhibits counterintuitive phenomena (concentration of measure, curse of dimensionality). Some geometric intuitions from low dimensions fail. Concentration inequalities provide rigor where intuition fails.

  • Spectral methods assume sufficient samples: Empirical covariance spectrum estimates converge slowly in high dimensions. For very small datasets or very large representation dimensions, spectral diagnostics may be unreliable.

  • Stability bounds are often loose: Lipschitz-based robustness bounds (Theorem 5) may be much looser than actual empirical stability, especially for well-trained models. The bounds provide guarantees but not tight predictions.
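The caveat about sample size for spectral methods can be seen in a small experiment. The sketch below draws features whose true covariance is the identity, so every true eigenvalue equals 1, and compares the empirical spectrum for a small and a large sample; the dimension and sample sizes are arbitrary choices made for illustration.

```python
import numpy as np

def sample_cov_eigs(n, d, rng):
    """Eigenvalues (ascending) of the sample covariance of n draws from N(0, I_d)."""
    z = rng.normal(size=(n, d))
    zc = z - z.mean(axis=0)
    return np.linalg.eigvalsh(zc.T @ zc / (n - 1))

rng = np.random.default_rng(0)
d = 200

few = sample_cov_eigs(250, d, rng)      # n only slightly larger than d
many = sample_cov_eigs(20000, d, rng)   # n much larger than d

print(f"n=250:   eigenvalues range from {few.min():.2f} to {few.max():.2f}")
print(f"n=20000: eigenvalues range from {many.min():.2f} to {many.max():.2f}")
```

With few samples the empirical spectrum spreads widely around 1 even though the true spectrum is flat, as random matrix theory (the Marchenko-Pastur law) predicts. An apparent eigenvalue decay or near-collapse measured with \(n\) comparable to \(d\) may therefore be an estimation artifact rather than a property of the representation.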

End-of-Chapter Advanced Exercises

A. True / False (20)

A.1 If a representation covariance matrix \(\Sigma\) has effective rank \(r_{\text{eff}}\) (the number of eigenvalues above numerical precision threshold), then representations can encode at most \(r_{\text{eff}}\) bits of information about the data, regardless of the nominal dimension \(d\).

A.2 Contrastive learning with batch size \(B\) necessarily learns representations of dimension \(\Omega(\log B)\) because the loss function has approximately \(\Theta(B)\) distinct negative pairs per positive pair.

A.3 Data augmentation that is applied during training but not at test time creates a mismatch between test-time representations and training representations, always degrading transfer learning performance.

A.4 A representation is invariant to rotation if and only if applying a rotation to the input does not change the representation; conversely, a representation is equivariant to rotation if the representation transforms by the same rotation.

A.5 The neural tangent kernel (NTK) regime, where networks behave like kernel methods with fixed feature map, necessarily exhibits faster spectral bias (learning low frequencies before high frequencies) than the feature learning regime.

A.6 If an autoencoder without explicit regularization achieves training loss below the reconstruction error of the true data manifold dimension, then the encoder’s output must be collapsed (rank-deficient).

A.7 Weight decay explicitly minimizes representation variance, making it an indirect mechanism for controlling the spectral regularization effect (Theorem 9).

A.8 In contrastive learning, if the temperature parameter \(\tau \to 0\), the learned representations must approach a state where positive pairs are identical and negative pairs are orthogonal, regardless of batch size.

A.9 A representation learned through supervised classification on balanced data necessarily exhibits the alignment-uniformity tradeoff (Theorem 8), with uniformity bounded by the number of classes.

A.10 Overparameterized networks (parameter-to-sample ratio \(\gamma \gg 1\)) always generalize worse than optimally-sized networks because implicit regularization from implicit bias is weaker for larger models.

A.11 The Lipschitz constant of a representation encoder bounds the maximum change in embedding distance under input perturbation, but does not directly control stability to data augmentation applied to images.

A.12 Transfer learning works well when source and target task share low-frequency structure (captured by early-layer representations), but transfer learning fails when tasks differ in high-frequency structure (late-layer representations).

A.13 A VAE trained with \(\beta = 0\) (no KL regularization) in the limit can achieve both perfect reconstruction and uniform coverage of the latent space simultaneously without additional constraints.

A.14 Group-equivariant CNNs (G-CNNs) that encode rotation equivariance require more parameters than standard CNNs to achieve the same classification accuracy on rotation-equivariant tasks.

A.15 Spectral bias (learning low-frequency components before high-frequency) emerges from optimization dynamics (gradient descent) rather than from the inductive bias of neural network architectures.

A.16 If representations undergo feature collapse (rank-deficiency in \(\Sigma\)), the Information Compression Bound (Theorem 10) guarantees that downstream task performance is bounded away from optimal, independent of downstream model capacity.

A.17 The manifold hypothesis (data lies on low-dimensional manifold) implies that intrinsic dimensionality can be estimated from the effective rank of the representation covariance after training.

A.18 A representation encoder with Lipschitz constant \(L\) applied to two inputs \(\mathbf{x}_1, \mathbf{x}_2\) satisfies \(\|\mathbf{z}_1 - \mathbf{z}_2\| \leq L \|\mathbf{x}_1 - \mathbf{x}_2\|\), which necessarily implies robustness to adversarial perturbations of bounded norm.

A.19 In self-supervised contrastive learning, if the negative sampling distribution is uniform (every other image is equally likely to be a negative), then the learned representations cannot achieve alignment to data-specific similarity structure (e.g., CIFAR-10 class similarity).

A.20 Foundation models (BERT, GPT, CLIP) achieve strong transfer learning performance because they learn representations that explicitly maximize uniformity across diverse tasks, minimizing specialization to any single objective.

B. Proof Problems (20)

B.1 Prove that if the representation covariance \(\Sigma\) has rank \(k < d\), then for any linear classifier operating on representations \(\mathbf{z} \in \mathbb{R}^d\), the classifier’s decision boundary can be expressed entirely in a \(k\)-dimensional subspace, and provide a representation-geometry interpretation of this fact.

B.2 Let \(f_\theta: \mathcal{X} \to \mathbb{R}^d\) be a representation map trained with contrastive loss using temperature \(\tau\). Prove that, under unit-norm representations and finite batch size, the optimal solution has bounded pairwise cosine similarities and derive an explicit dependence on \(\tau\) and batch size.

B.3 Prove that if \(f_\theta\) is invariant to a compact group action \(G\) (i.e., \(f_\theta(g \cdot x) = f_\theta(x)\) for all \(g \in G\)), then \(f_\theta\) factors through the quotient space \(\mathcal{X}/G\), and characterize the induced geometry on the quotient.

B.4 Prove that in the linear autoencoder setting with squared reconstruction loss, the optimal encoder spans the top-\(k\) principal components of the data, and show how the representation covariance spectrum equals the top-\(k\) eigenvalues of the data covariance.

B.5 Prove that spectral bias in a two-layer ReLU network trained by gradient descent implies a monotone ordering in time of learned Fourier modes for a one-dimensional regression task under suitable smoothness assumptions.

B.6 Consider a representation \(\mathbf{z} = f_\theta(\mathbf{x})\) trained with a contrastive objective on augmented inputs. Prove that if augmentations are sampled from a group action with Haar measure, then the optimal representation is invariant to that group, assuming infinite data and perfect optimization.

B.7 Prove that if the InfoNCE loss is minimized to zero under finite batch size, then the representation mapping must be injective on the training set, and characterize the minimum embedding dimension required for injectivity in terms of the training set size.

B.8 Let \(\Sigma(t)\) be the covariance of representations during training with weight decay. Derive a differential equation governing the eigenvalue dynamics of \(\Sigma(t)\) under gradient flow for a linear encoder, and show convergence to a fixed point controlled by the decay coefficient.

B.9 Prove that if a representation encoder is \(L\)-Lipschitz and trained with augmentation noise bounded by \(\epsilon\), then the expected embedding distortion is bounded by \(L\epsilon\), and extend the bound to contrastive objectives that compare augmented pairs.

B.10 Prove that a representation learned by supervised classification on \(K\) classes necessarily yields an embedding space whose class-conditional means form a simplex in \(\mathbb{R}^d\) under a suitable linear separability assumption, and characterize the spectral structure of the between-class covariance.

B.11 Show that for a representation map trained by self-supervised reconstruction loss with no explicit regularization, there exists a sequence of solutions whose effective rank tends to 1 while reconstruction error tends to the global minimum, and interpret this as a collapse phenomenon.

B.12 Prove that for a network in the NTK regime, the learned representation remains close (in operator norm) to the initial random feature map throughout training, and derive a bound that depends on width and learning rate.

B.13 Prove that for any representation map \(f_\theta\) and any linear probe \(w\), the mutual information \(I(\mathbf{Z}; Y)\) is upper bounded by a function of the spectrum of \(\Sigma\), assuming Gaussian class-conditional distributions, and interpret the bound in terms of effective rank.

B.14 Let \(f_\theta\) be equivariant to a finite group action \(G\) with representation \(\rho\). Prove that the representation space decomposes into irreducible subspaces of \(\rho\), and describe how this decomposition constrains representation covariance.

B.15 Prove that for contrastive learning with large batch size, the uniformity term induces a lower bound on the minimum pairwise distance between representations on the unit sphere, and interpret the bound as a spherical code problem.

B.16 Prove that if the representation covariance exhibits a power-law spectrum \(\lambda_i \propto i^{-\alpha}\) with \(\alpha > 1\), then the effective rank grows sublinearly in dimension, and quantify its dependence on \(d\) and \(\alpha\).

B.17 Prove that a representation encoder with bounded Jacobian norm implies stability to adversarial perturbations up to a specified radius, and contrast this with stability under random Gaussian noise by computing expected perturbation magnitudes.

B.18 Prove that for a two-layer linear network trained with squared loss and weight decay, the representation covariance spectrum undergoes exponential shrinkage of smaller eigenvalues compared to larger ones, and quantify the separation rate.

B.19 Prove that if a representation map is trained with a contrastive objective that enforces alignment but no uniformity, then there exists a collapsed solution that is a global minimizer, and characterize the collapsed manifold.

B.20 Prove that for a foundation model trained with masked language modeling, the representation of a token is asymptotically invariant to a bounded number of context perturbations under suitable mixing assumptions, and relate this invariance to geometric clustering in embedding space.

C. Python Exercises (20)

C.1 Task: Load a medium-scale image dataset and compute representation vectors from a pre-trained CNN, then estimate the covariance matrix and its eigenvalue spectrum, reporting effective rank at multiple numerical thresholds and sample sizes. Purpose: Build concrete intuition for how high-dimensional embeddings concentrate variance into a small number of directions, how sampling affects spectral estimates, and how sensitive effective rank is to threshold choice and dataset diversity. The goal is to learn how to turn abstract spectrum plots into actionable judgments about representation quality, and to understand which aspects of the spectrum are stable diagnostics versus artifacts of sampling or scale. This sets the stage for using spectral tools as repeatable evaluation criteria in representation debugging and model comparison. ML Link: This connects directly to spectral analysis of representations, power-law eigenvalue decay, and transfer learning diagnostics, which are foundational for understanding representation geometry in deep models. It also relates to layerwise transferability (early vs. late features), representation collapse detection, and effective dimension selection for linear probing. Hints: Use a fixed layer and fixed normalization to avoid conflating scaling with geometry; compare spectra from early and late layers; repeat estimates with different sample sizes to separate estimation noise from true spectral decay; log-log plots can reveal power-law regimes even without fitting. What mastery looks like: You can justify the effective rank definition and threshold numerically, explain how the spectrum changes across layers and datasets, describe how sampling variability impacts eigenvalue estimates, and connect spectrum shape to expected generalization and transfer behavior.
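
One possible starting point for C.1 is sketched below. It is a sketch under stated assumptions, not a reference solution: synthetic embeddings with five planted directions stand in for the CNN features, and the helper names (`covariance_spectrum`, `threshold_rank`, `participation_ratio`) are illustrative. The participation ratio is included as one common threshold-free proxy for effective rank; the chapter's own definition may differ.

```python
import numpy as np

def covariance_spectrum(Z):
    """Eigenvalues of the sample covariance of embeddings Z (n x d), in descending order."""
    Zc = Z - Z.mean(axis=0, keepdims=True)
    cov = Zc.T @ Zc / (Z.shape[0] - 1)
    eig = np.linalg.eigvalsh(cov)[::-1]
    return np.clip(eig, 0.0, None)  # remove tiny negative values caused by round-off

def threshold_rank(eig, rel_tol):
    """Number of eigenvalues above rel_tol times the largest eigenvalue."""
    return int(np.sum(eig > rel_tol * eig[0]))

def participation_ratio(eig):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    return float(eig.sum() ** 2 / np.sum(eig ** 2))

# Synthetic stand-in for CNN features (assumption): 5 strong directions in 64 dims plus weak noise.
rng = np.random.default_rng(0)
n, d, k = 2000, 64, 5
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]  # d x k orthonormal columns
Z = rng.normal(size=(n, k)) @ basis.T + 0.01 * rng.normal(size=(n, d))

eig = covariance_spectrum(Z)
for tol in (1e-1, 1e-2, 1e-6):
    print(f"rel_tol={tol:g}: threshold rank = {threshold_rank(eig, tol)}")
print(f"participation ratio = {participation_ratio(eig):.2f}")
```

The loose thresholds recover the planted dimension, while a threshold near numerical precision counts the noise directions as well, which is the sensitivity the exercise asks you to examine.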

C.2 Task: Implement a representation collapse detector that flags when the covariance spectrum becomes low-rank over training checkpoints for a self-supervised model, and produce a concise diagnostic report. Purpose: Practice operationalizing collapse diagnostics into a monitoring tool that can be used during training, not just post hoc. The aim is to create a workflow that turns spectral signals into actionable alerts, separating genuine collapse from transient training fluctuations, and enabling early intervention before downstream metrics degrade. The emphasis is on reproducible monitoring and informed decision-making in iterative model development. ML Link: Collapse detection is central to avoiding degenerate representations in contrastive and autoencoding settings, and is a practical application of the representation collapse theorems. It also connects to self-supervised training stability, variance-regularization objectives (VICReg, Barlow Twins), and quality control for embedding systems in production. Hints: Track rank, participation ratio, and top-eigenvalue dominance together; set alert thresholds by comparing to a healthy baseline model; normalize embeddings to isolate collapse from scale changes; verify that a loss decrease does not imply healthy representations. What mastery looks like: You can reliably detect collapse early, explain false positives (e.g., batch effects, normalization shifts), connect spectrum changes to hyperparameters (batch size, temperature, augmentation strength), and recommend a corrective intervention grounded in geometry.
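
A minimal sketch of such a detector follows. The thresholds (`pr_fraction`, `max_top_share`), the function names, and the toy checkpoints that drift toward a single direction are all assumptions made for illustration; real checkpoints would come from your training run.

```python
import numpy as np

def spectrum_stats(Z, eps=1e-12):
    """Return (participation ratio, top-eigenvalue share) for unit-normalized embeddings Z."""
    Z = Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), eps)
    Zc = Z - Z.mean(axis=0, keepdims=True)
    eig = np.clip(np.linalg.eigvalsh(Zc.T @ Zc / (len(Z) - 1)), 0.0, None)
    total = eig.sum()
    if total < eps:  # every embedding identical: complete collapse
        return 0.0, 1.0
    return float(total ** 2 / np.sum(eig ** 2)), float(eig.max() / total)

def detect_collapse(checkpoints, baseline, pr_fraction=0.5, max_top_share=0.5):
    """Flag checkpoints whose spectrum is far lower-rank than a healthy baseline."""
    base_pr, _ = spectrum_stats(baseline)
    report = []
    for step, Z in checkpoints:
        pr, top = spectrum_stats(Z)
        flagged = pr < pr_fraction * base_pr or top > max_top_share
        report.append({"step": step, "pr": pr, "top_share": top, "collapsed": flagged})
    return report

# Toy checkpoints (assumption): `mix` near 1 pushes embeddings onto one shared line.
rng = np.random.default_rng(0)
n, d = 1000, 32
u = rng.normal(size=d)
u /= np.linalg.norm(u)

def toy_checkpoint(mix):
    return (1 - mix) * rng.normal(size=(n, d)) + mix * rng.normal(size=(n, 1)) * u

baseline = rng.normal(size=(n, d))
checkpoints = [(0, toy_checkpoint(0.0)), (100, toy_checkpoint(0.5)), (200, toy_checkpoint(0.98))]
report = detect_collapse(checkpoints, baseline)
for row in report:
    print(row)
```

Tracking both statistics guards against the case where one dominant eigenvalue hides behind an otherwise broad spectrum.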

C.3 Task: Compare representation geometry under two data augmentation pipelines by computing pairwise cosine similarities for positive pairs and random pairs, and evaluate alignment-uniformity tradeoff metrics across training epochs. Purpose: Understand how augmentation strength and type shape representation geometry and the balance between alignment and uniformity. The goal is to make augmentation selection an evidence-based design choice, not a heuristic guess, by linking augmentation policies to measurable geometric effects that predict downstream performance. Augmentation controls the representational inductive bias and thus the tradeoffs between robustness and separability. ML Link: This directly tests the alignment-uniformity tradeoff that underpins contrastive learning performance and provides a measurable proxy for representation quality. It also ties to augmentation-driven invariance learning, robustness to distribution shift, and contrastive objective tuning (temperature and batch size). Hints: Use the same encoder weights to isolate augmentation effects; keep batch size and temperature fixed; compute statistics for within-class and across-class pairs if labels are available; examine both mean and variance of similarities to capture geometry spread. What mastery looks like: You can diagnose whether augmentations are too weak (alignment fails) or too strong (uniformity collapse), explain why a particular augmentation policy produces stable separation, and justify a balanced augmentation design with evidence from alignment and uniformity curves.

C.4 Task: Create a synthetic dataset with known low-dimensional structure embedded in high dimensions and train a linear autoencoder, then verify that learned representations recover the true subspace under varying noise levels. Purpose: Validate the PCA equivalence of linear autoencoders and connect reconstruction objectives to eigenstructure in a controlled setting. The goal is to build a rigorous mental model for how reconstruction objectives encode geometric structure, so that later observations on real data can be interpreted relative to a ground-truth baseline. The emphasis is on calibration: distinguishing true geometric effects from artifacts of noise, sample size, or optimization. ML Link: This exercise connects reconstruction objectives to representation geometry and the spectral structure of covariance, which underlies both dimensionality reduction and collapse detection. It also grounds intuition for denoising autoencoders, bottleneck capacity selection, and how reconstruction losses can hide degeneracy. Hints: Construct data from a known orthonormal basis with controllable noise; compare learned subspace to ground truth via principal angles; measure reconstruction error as noise increases; verify that the learned covariance spectrum matches the planted variance profile. What mastery looks like: You can show that the learned representation aligns with the true latent subspace, quantify how noise perturbs the recovered subspace, explain why the spectrum matches the planted eigenvalues, and relate deviations to finite-sample effects.
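
The sketch below sets up the planted-subspace data and the principal-angle comparison. As a labeled shortcut, it uses the closed-form top-\(k\) principal subspace (the optimum that B.4 asks you to prove) in place of a gradient-trained linear autoencoder; replacing `top_k_subspace` with your trained encoder's row space is the actual exercise. The dimensions and planted variances are arbitrary illustrative choices.

```python
import numpy as np

def top_k_subspace(X, k):
    """Orthonormal basis (d x k) of the top-k principal subspace of X (n x d)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T

def principal_angles(A, B):
    """Principal angles (radians) between the column spaces of orthonormal A and B."""
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

rng = np.random.default_rng(0)
n, d, k = 5000, 50, 3
U = np.linalg.qr(rng.normal(size=(d, k)))[0]   # planted orthonormal basis
planted_var = np.array([9.0, 4.0, 1.0])        # planted variance per latent direction
latents = rng.normal(size=(n, k)) * np.sqrt(planted_var)

max_angle_deg = {}
for noise_std in (0.05, 0.5):
    X = latents @ U.T + noise_std * rng.normal(size=(n, d))
    W = top_k_subspace(X, k)
    max_angle_deg[noise_std] = float(np.degrees(principal_angles(U, W)).max())
    print(f"noise_std={noise_std}: max principal angle = {max_angle_deg[noise_std]:.2f} deg")
```

The largest principal angle is a conservative summary: it reports the worst-aligned direction, which is typically the one with the smallest planted variance.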

C.5 Task: Train a contrastive model on a small image dataset using multiple temperatures and compare the resulting angular separation between positive and negative pairs, including separation variance across classes. Purpose: Quantify how temperature shapes geometric separation and how it interacts with dataset structure. The goal is to translate a hyperparameter into geometric consequences that inform stability, collapse avoidance, and downstream performance, rather than treating temperature as a black-box tuning knob. Temperature is treated here as a geometric control variable for representation quality. ML Link: Temperature is a key control knob for contrastive objectives, shaping the geometry of learned embeddings and the stability of separation. It also links to sensitivity to hard negatives, batch size scaling, and downstream linear probe accuracy. Hints: Keep all other hyperparameters identical; compute cosine distributions for positives and negatives; measure margins and their variance; check for class-dependent separation to diagnose uneven geometry. What mastery looks like: You can articulate how temperature affects both alignment and uniformity, identify regimes where separation becomes unstable or overly rigid, and connect observed behaviors to theoretical temperature dependence in contrastive separation results.
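
The following sketch shows the two measurements the exercise depends on: an InfoNCE loss with temperature \(\tau\) and a positive-versus-negative cosine margin. It evaluates fixed synthetic embeddings rather than training a model, uses in-batch negatives (an assumption about the loss variant), and the names `info_nce` and `cosine_margin` are illustrative.

```python
import numpy as np

def _unit(Z):
    """Scale each row to unit Euclidean norm."""
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def info_nce(anchors, positives, temperature):
    """Mean InfoNCE loss; positives[i] pairs with anchors[i], other rows act as negatives."""
    logits = _unit(anchors) @ _unit(positives).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def cosine_margin(anchors, positives):
    """Mean positive cosine minus mean negative cosine."""
    sims = _unit(anchors) @ _unit(positives).T
    off_diag = ~np.eye(len(sims), dtype=bool)
    return float(np.diag(sims).mean() - sims[off_diag].mean())

rng = np.random.default_rng(0)
n, d = 256, 32
z = rng.normal(size=(n, d))
z_pos = z + 0.1 * rng.normal(size=(n, d))  # toy "augmented view" of each anchor

for tau in (0.05, 0.2, 1.0):
    print(f"tau={tau}: InfoNCE = {info_nce(z, z_pos, tau):.3f}")
print(f"cosine margin = {cosine_margin(z, z_pos):.3f}")
```

With identical embeddings for every input the loss equals \(\log n\) for any temperature, which gives a useful reference value when reading training curves.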

C.6 Task: Estimate invariance by measuring representation changes under a group of transformations (e.g., rotations or translations) for models trained with and without augmentation, and quantify invariance across layers. Purpose: Operationalize invariance as a measurable geometric property and diagnose where in the network invariance emerges. The goal is to create a quantitative invariance profile that can be compared across models, layers, and training regimes, enabling principled decisions about augmentation and architecture. Invariance is treated as an evaluable property rather than a qualitative claim. ML Link: Invariance is central to representation robustness, generalization, and the invariance-equivariance decomposition results. It also connects to data augmentation design, domain generalization, and the stability of features under distribution shift. Hints: Use a consistent metric (cosine distance or Euclidean distance on normalized embeddings); compare distributions of distances for transformed vs. untransformed inputs; report layerwise invariance; examine per-class variability to detect uneven invariance. What mastery looks like: You can quantify how invariance varies across layers, explain why specific layers become invariant, and distinguish learned invariance (from augmentation) from architectural priors.

C.7 Task: Implement a spectral bias probe by fitting a small network on a one-dimensional function with multiple frequency components and track Fourier coefficients over training time and learning rate schedules. Purpose: Observe spectral bias dynamics in practice and connect optimization trajectories to frequency learning order. The goal is to build a concrete, time-resolved understanding of how optimization implicitly regularizes representations, and to learn how training choices shift the learning of fine details. Spectral bias is treated as a diagnostic for generalization behavior and training stability. ML Link: Spectral bias explains why networks learn low-frequency structure before high-frequency detail, influencing generalization and overfitting. It also ties to implicit regularization in SGD, early stopping strategies, and the NTK vs. feature-learning regime distinction. Hints: Sample data uniformly; compute Fourier coefficients of model predictions at checkpoints; test multiple learning rates or optimizers; verify that low-frequency modes converge first across settings. What mastery looks like: You can describe the temporal ordering of learned frequencies, explain how width or activation choice changes the bias, and connect late-learning high-frequency components to overfitting behavior.
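
The measurement half of the probe can be sketched as below: given model predictions on a uniform grid, report the residual amplitude at each target frequency. The training loop is deliberately omitted; `pred_early` is a hypothetical checkpoint output that has fit only the lowest mode, used here purely to show what the readout looks like.

```python
import numpy as np

def mode_amplitudes(y, freqs):
    """Amplitudes of integer-frequency Fourier modes of y sampled uniformly on [0, 1)."""
    coeffs = np.fft.rfft(y) / len(y)
    return np.array([2.0 * np.abs(coeffs[k]) for k in freqs])

def residual_by_mode(pred, target, freqs):
    """Amplitude of the prediction error at each frequency in freqs."""
    return mode_amplitudes(target - pred, freqs)

n = 256
x = np.arange(n) / n
freqs = [1, 5, 20]
target = (np.sin(2 * np.pi * 1 * x)
          + 0.5 * np.sin(2 * np.pi * 5 * x)
          + 0.25 * np.sin(2 * np.pi * 20 * x))

# Hypothetical early checkpoint: only the lowest-frequency mode has been learned.
pred_early = np.sin(2 * np.pi * 1 * x)
residual = residual_by_mode(pred_early, target, freqs)
print(dict(zip(freqs, np.round(residual, 3))))
```

Logging `residual_by_mode` at every checkpoint yields one curve per frequency, and the ordering in which those curves reach zero is the quantity the exercise asks you to study.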

C.8 Task: Evaluate how weight decay changes representation spectra by training the same model with several decay strengths and comparing eigenvalue decay rates, tail mass, and effective rank across epochs. Purpose: Connect explicit regularization to representation geometry and understand how it suppresses degenerate spectra. The goal is to treat weight decay as a geometric intervention, interpreting it through spectral changes rather than only through loss curves, and to learn how regularization reshapes representation capacity over time. The emphasis is on using spectra to tune regularization for stability and transfer readiness. ML Link: Weight decay shapes spectral structure, affects collapse risk, and changes transfer behavior. It also connects to implicit bias in optimization, norm control for stability, and the spectral regularization effects seen in batch normalization and variance regularizers. Hints: Use identical random seeds; compare both training and validation spectra; normalize for scale before spectrum comparisons; examine early vs. late training to separate transient effects from converged geometry. What mastery looks like: You can identify how decay modifies the tail of the spectrum, relate this to generalization behavior, explain why extreme decay harms representation richness, and propose a decay range that balances stability and capacity.

C.9 Task: Build an embedding stability test that measures representation sensitivity to input perturbations (noise, crops, brightness) and summarizes stability curves across perturbation magnitudes. Purpose: Understand stability as a geometric property related to Lipschitz behavior and deployment robustness. The goal is to turn robustness into measurable geometry, enabling consistent comparisons across models and perturbation types, and to reveal where stability breaks as perturbations grow. Stability profiling serves as a tool for deployment risk assessment. ML Link: Stability is critical for robust ML systems and relates directly to certified robustness bounds and Lipschitz-based guarantees. It also connects to adversarial robustness, distribution shift resilience, and safety-critical deployment criteria for embedding systems. Hints: Report both absolute and relative changes in distances; compare same-class and different-class pairs; use multiple perturbation magnitudes to identify linear vs. nonlinear regimes; keep embeddings normalized to isolate geometry from scale. What mastery looks like: You can characterize which perturbations the model is robust to, diagnose failure modes where small perturbations cause large embedding shifts, and explain how training choices (augmentation, normalization, regularization) improve stability.
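
A minimal stability curve for noise perturbations might look like the sketch below. The toy encoder \(\tanh(XW)\) is an assumption chosen because its Lipschitz constant is at most \(\|W\|_2\), so the measured curve can be checked against the \(L\epsilon\) bound from B.9; embeddings are left unnormalized here so that the bound applies directly.

```python
import numpy as np

def stability_curve(encoder, X, magnitudes, seed=0):
    """Mean embedding displacement under random perturbations of fixed norm eps."""
    rng = np.random.default_rng(seed)
    Z = encoder(X)
    curve = []
    for eps in magnitudes:
        noise = rng.normal(size=X.shape)
        noise *= eps / np.linalg.norm(noise, axis=1, keepdims=True)  # each row has norm eps
        shift = np.linalg.norm(encoder(X + noise) - Z, axis=1)
        curve.append(float(shift.mean()))
    return curve

# Toy encoder (assumption): tanh is 1-Lipschitz, so the map is ||W||_2-Lipschitz.
rng = np.random.default_rng(1)
W = rng.normal(size=(20, 8)) / np.sqrt(20)

def encoder(X):
    return np.tanh(X @ W)

L = float(np.linalg.norm(W, 2))
X = rng.normal(size=(500, 20))
magnitudes = [0.01, 0.1, 1.0]
curve = stability_curve(encoder, X, magnitudes)
for eps, c in zip(magnitudes, curve):
    print(f"eps={eps}: mean shift = {c:.4f} (Lipschitz bound {L * eps:.4f})")
```

Random perturbations usually sit well below the worst-case bound; the gap between the two is itself informative about how adversarial the worst direction is.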

C.10 Task: Compute class-conditional means in embedding space for a supervised model and analyze the between-class covariance structure, including simplex-like geometry and between-class eigenvalues. Purpose: Link class geometry to representation separability and downstream linear probe performance. The goal is to make class geometry an explicit diagnostic object, clarifying how class structure emerges and how it affects separability, calibration, and transfer. Class-mean geometry serves as a practical tool for interpreting supervised representation quality. ML Link: Class-conditional geometry is a central driver of supervised representation quality and a direct application of geometric separability theory. It also connects to prototype-based classifiers, metric learning, and calibration of embedding distances for retrieval tasks. Hints: Use a fixed embedding layer; compute pairwise distances between class means; measure how close the means are to a regular simplex; relate between-class covariance eigenvalues to separability and class imbalance. What mastery looks like: You can explain why class means might form an approximate simplex, identify deviations and their causes (imbalance, overlap, collapse), and connect the between-class spectrum to linear probe performance.
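
One way to test for simplex-like geometry is to check whether centered class means have pairwise cosine \(-1/(K-1)\), the value for a regular simplex with \(K\) vertices. The sketch below does this on synthetic embeddings with a planted simplex (an assumption standing in for a trained model's features); means are centered by the average of the class means, and the function names are illustrative.

```python
import numpy as np

def centered_class_means(Z, y):
    """Class means (K x d) with the average of the class means subtracted."""
    means = np.stack([Z[y == c].mean(axis=0) for c in np.unique(y)])
    return means - means.mean(axis=0, keepdims=True)

def class_mean_cosines(Z, y):
    """Cosines between all ordered pairs of distinct centered class means."""
    M = centered_class_means(Z, y)
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    cos = unit @ unit.T
    return cos[~np.eye(len(M), dtype=bool)]

def between_class_eigenvalues(Z, y):
    """Eigenvalues (descending) of the between-class covariance with equal class weights."""
    M = centered_class_means(Z, y)
    return np.linalg.eigvalsh(M.T @ M / len(M))[::-1]

# Synthetic embeddings with a planted regular simplex of class means (assumption).
rng = np.random.default_rng(0)
K, d, n_per = 4, 16, 200
P = np.linalg.qr(rng.normal(size=(d, K)))[0]               # d x K orthonormal columns
simplex = 5.0 * (np.eye(K) - np.ones((K, K)) / K) @ P.T    # K x d class means
y = np.repeat(np.arange(K), n_per)
Z = simplex[y] + 0.1 * rng.normal(size=(K * n_per, d))

cosines = class_mean_cosines(Z, y)
eigs = between_class_eigenvalues(Z, y)
print(f"mean cosine = {cosines.mean():.3f} (regular simplex: {-1 / (K - 1):.3f})")
print("leading between-class eigenvalues:", np.round(eigs[:K], 3))
```

For a regular simplex the between-class covariance has \(K-1\) equal nonzero eigenvalues, so unequal leading eigenvalues are a direct readout of deviation from that geometry.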

C.11 Task: Detect degeneracy in a VAE by visualizing how sampled latent points map to reconstructions and correlating reconstruction quality with latent-space density, across different KL weights. Purpose: Connect latent space holes to representation coverage and the compression-reconstruction tradeoff. The goal is to diagnose when a generative model’s latent space is geometrically well-formed versus when it has unusable regions that harm sampling and downstream inference. The emphasis is on practical evaluation of latent geometry beyond reconstruction loss. ML Link: VAE training balances reconstruction and prior matching, shaping latent geometry and generative quality. It also links to disentanglement objectives (beta-VAEs), sampling fidelity for generative augmentation, and latent space calibration for downstream inference. Hints: Estimate latent density using kernel density estimates; stratify samples by density; compare reconstruction quality across density strata; run multiple \(\beta\) values to separate coverage from fidelity. What mastery looks like: You can show that low-density latent regions produce poor reconstructions, explain how KL weight changes coverage, and relate observed holes to the information bottleneck tradeoff.

C.12 Task: Train two models with different widths (underparameterized vs. overparameterized) and compare representation effective rank, generalization, and Jacobian conditioning across checkpoints. Purpose: Explore overparameterization effects on representation geometry and optimization stability. The goal is to link scaling decisions to measurable geometric consequences, clarifying when increased width provides genuinely richer representations versus when it induces brittle or overly smooth features. Width is treated as a geometric design choice, not only a capacity lever. ML Link: Overparameterization changes optimization dynamics, spectral structure, and transfer behavior, making it central to modern deep learning practice. It also connects to double descent, lazy training vs. feature learning, and scaling laws for representation quality. Hints: Use the same dataset and training budget; compute Jacobian spectral norms via power iteration; compare linear probe results; analyze whether spectra flatten or steepen as width increases. What mastery looks like: You can articulate how width changes representation richness and conditioning, explain when overparameterization improves or hurts transfer, and connect these effects to spectral bias and implicit regularization.
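
The power-iteration hint can be sketched as follows. For clarity the Jacobian is an explicit matrix with planted singular values (an assumption); with a real network you would replace the two matrix products by Jacobian-vector and vector-Jacobian products from your autodiff framework.

```python
import numpy as np

def spectral_norm_power_iteration(J, n_iter=200, seed=0):
    """Estimate the largest singular value of J by power iteration on J^T J."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=J.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = J.T @ (J @ v)          # one step of power iteration on J^T J
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(J @ v))

# Toy Jacobian with known singular values from 3.0 down to 0.1.
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.normal(size=(12, 10)))[0]
V = np.linalg.qr(rng.normal(size=(10, 10)))[0]
singular_values = np.linspace(3.0, 0.1, 10)
J = U @ np.diag(singular_values) @ V.T

estimate = spectral_norm_power_iteration(J)
print(f"power iteration: {estimate:.6f}, exact: {np.linalg.norm(J, 2):.6f}")
print(f"condition number: {np.linalg.cond(J):.2f}")
```

Convergence speed depends on the ratio of the two largest singular values, so a nearly flat spectrum needs more iterations than this toy case.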

C.13 Task: Evaluate the alignment-uniformity tradeoff explicitly by computing alignment and uniformity metrics throughout contrastive training and plotting their trajectories along with loss and accuracy. Purpose: Connect geometric tradeoffs to training dynamics and identify collapse early. The goal is to transform the alignment-uniformity framework into a real-time diagnostic that predicts representation quality before downstream evaluation, and to interpret tradeoffs as geometric signatures rather than abstract metrics. These curves guide hyperparameter tuning and training stability decisions. ML Link: This is the core empirical signature of successful contrastive learning and directly tests the alignment-uniformity theorem. It also ties to batch size scaling laws, negative sampling strategies, and diagnosing representation collapse in non-contrastive SSL methods. Hints: Use consistent embedding normalization; compute alignment on positive pairs only; compute uniformity on random pairs; compare trajectories across different temperatures or batch sizes. What mastery looks like: You can interpret trajectories, identify regimes of collapse (alignment improves while uniformity degrades sharply), and connect inflection points to learning rate schedules, temperature settings, or augmentation changes.
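
The two metrics can be computed per checkpoint as in the sketch below. It assumes the commonly used definitions (alignment as the mean \(\alpha\)-th power of positive-pair distance, uniformity as the log of the mean Gaussian potential \(e^{-t\|\mathbf{z}_i - \mathbf{z}_j\|^2}\) over distinct pairs, with \(\alpha = 2\), \(t = 2\)); if the chapter's theorem uses different constants, adjust accordingly. The synthetic "healthy" and "collapsed" embeddings are illustrative.

```python
import numpy as np

def normalize(Z):
    """Project each row onto the unit sphere."""
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def alignment(z, z_pos, alpha=2):
    """Mean distance^alpha between positive pairs (lower means better aligned)."""
    return float(np.mean(np.linalg.norm(z - z_pos, axis=1) ** alpha))

def uniformity(z, t=2.0):
    """Log mean Gaussian potential over distinct pairs (lower means more uniform)."""
    sq_dist = np.clip(2.0 - 2.0 * (z @ z.T), 0.0, None)  # valid for unit-norm rows
    iu = np.triu_indices(len(z), k=1)
    return float(np.log(np.mean(np.exp(-t * sq_dist[iu]))))

rng = np.random.default_rng(0)
healthy = normalize(rng.normal(size=(512, 32)))
healthy_pos = normalize(healthy + 0.02 * rng.normal(size=healthy.shape))
collapsed = normalize(np.ones((512, 32)))  # every embedding identical

print(f"healthy:   align={alignment(healthy, healthy_pos):.4f}, "
      f"uniform={uniformity(healthy):.3f}")
print(f"collapsed: align={alignment(collapsed, collapsed):.4f}, "
      f"uniform={uniformity(collapsed):.3f}")
```

The collapsed case shows why alignment alone is misleading: it is perfectly aligned, yet its uniformity sits at the worst possible value of zero.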

C.14 Task: Construct a toy dataset with known group symmetry and train both standard and group-equivariant models, then measure equivariance error in representation space under controlled transformations. Purpose: Empirically validate equivariance and its benefits for sample efficiency and robustness. The goal is to quantify how much architectural symmetry reduces data requirements and improves generalization, and to distinguish true equivariance from apparent invariance learned by chance. Equivariance is treated as a measurable architectural advantage with geometric consequences. ML Link: Equivariance is a key concept in geometric deep learning and symmetry-aware architectures. It also connects to group convolution design, data augmentation alternatives, and the invariance-equivariance decomposition in representation learning. Hints: Define an explicit group action; use a metric for equivariance error comparing transformed inputs to transformed representations; ensure identical training data and capacity; test out-of-distribution rotations if applicable. What mastery looks like: You can quantify equivariance gains, show sample-efficiency improvements, explain when equivariance is unnecessary or harmful, and relate errors to group discretization or architecture limits.
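
The equivariance-error metric from the hints can be sketched with the cyclic-shift group acting on one-dimensional signals (chosen here for simplicity, in place of rotations). Both layers are untrained and randomly initialized, which is an assumption made to isolate the architectural effect: the circular convolution is shift-equivariant by construction, the dense layer is not.

```python
import numpy as np

def circular_conv_layer(X, kernel):
    """ReLU of a circular convolution along axis 1; equivariant to cyclic shifts."""
    spectrum = np.fft.fft(X, axis=1) * np.fft.fft(kernel, n=X.shape[1])
    return np.maximum(np.fft.ifft(spectrum, axis=1).real, 0.0)

def dense_layer(X, W):
    """ReLU of a dense linear map; has no built-in shift symmetry."""
    return np.maximum(X @ W, 0.0)

def equivariance_error(f, X, shift):
    """Mean relative error between f(g.x) and g.f(x) for the cyclic shift g."""
    lhs = f(np.roll(X, shift, axis=1))
    rhs = np.roll(f(X), shift, axis=1)
    rel = np.linalg.norm(lhs - rhs, axis=1) / (np.linalg.norm(rhs, axis=1) + 1e-12)
    return float(rel.mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
kernel = rng.normal(size=5)
W = rng.normal(size=(16, 16)) / 4.0

err_conv = equivariance_error(lambda A: circular_conv_layer(A, kernel), X, shift=3)
err_dense = equivariance_error(lambda A: dense_layer(A, W), X, shift=3)
print(f"circular conv: {err_conv:.2e}, dense: {err_dense:.2e}")
```

Applying the same metric to trained models separates equivariance that is built in from equivariance that is only approximately learned from data.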

C.15 Task: Implement a cosine similarity histogram for embeddings and analyze how distributions differ between pre-trained and randomly initialized models, including tail behavior and class-conditional structure. Purpose: Contrast representation geometry with and without learning and detect structured similarity. The goal is to establish a lightweight, repeatable diagnostic that flags whether embeddings encode meaningful structure, and to understand how fine-tuning or domain shift reshapes similarity distributions. Histogram analysis serves as a fast sanity check and drift indicator. ML Link: Learned representations should show structured similarity distributions, unlike random geometric baselines, which is a basic sanity check for representation learning. It also relates to clusterability for retrieval, hard-negative mining behavior, and diagnostic checks for embedding drift after fine-tuning. Hints: Compare to analytical baselines for random unit vectors; focus on tail behavior where class structure often appears; examine how distributions change after fine-tuning or domain shift. What mastery looks like: You can interpret deviations from random geometry, relate them to class structure and clustering, and explain whether fine-tuning sharpens or flattens similarity structure.
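
For the analytical baseline, independent random unit vectors in \(\mathbb{R}^d\) have cosine similarity with mean 0 and variance \(1/d\). The sketch below checks this numerically and contrasts it with synthetic clustered embeddings, which stand in for a trained model (an assumption); the tail cutoff of 0.5 is an arbitrary illustrative choice.

```python
import numpy as np

def offdiag_cosines(Z):
    """Cosine similarities for all unordered pairs of distinct rows of Z."""
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return (U @ U.T)[np.triu_indices(len(Z), k=1)]

def tail_fraction(cosines, cutoff=0.5):
    """Fraction of pairs with cosine similarity above the cutoff."""
    return float(np.mean(cosines > cutoff))

rng = np.random.default_rng(0)
n, d = 500, 64

# Random baseline: isotropic Gaussian directions.
random_cos = offdiag_cosines(rng.normal(size=(n, d)))

# Structured stand-in: 5 clusters with strong centers and unit noise.
labels = rng.integers(0, 5, size=n)
centers = 3.0 * rng.normal(size=(5, d))
clustered_cos = offdiag_cosines(centers[labels] + rng.normal(size=(n, d)))

hist, edges = np.histogram(random_cos, bins=40, range=(-1.0, 1.0))
print(f"random: std = {random_cos.std():.4f}, expected 1/sqrt(d) = {1 / np.sqrt(d):.4f}")
print(f"tail fraction above 0.5: random = {tail_fraction(random_cos):.4f}, "
      f"clustered = {tail_fraction(clustered_cos):.4f}")
```

The heavy upper tail in the clustered case comes from within-cluster pairs, which is the class-conditional structure the exercise asks you to look for.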

C.16 Task: Measure how effective rank changes under label noise by training a classifier with increasing label corruption rates and computing embedding spectra across epochs. Purpose: Link data quality to representation geometry and identify when representations degrade under noisy supervision. The goal is to quantify how supervision noise changes representation capacity and to define a practical noise threshold beyond which representations no longer support reliable transfer. Label noise is treated as a geometric stress test rather than only a performance nuisance. ML Link: Noisy labels distort representation structure and can induce collapse or overfitting, affecting transfer performance. It also connects to robust learning under label noise, memorization dynamics, and curriculum learning strategies that delay fitting noisy labels. Hints: Use a consistent corruption protocol; track both train and test spectra; analyze early vs. late checkpoints; compare with a clean-label baseline. What mastery looks like: You can explain how label noise spreads variance across the spectrum, identify a noise threshold where representations become unstable, and connect spectral changes to downstream performance drops.
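
A consistent corruption protocol could look like the sketch below. It assumes symmetric label noise, where an exact fraction of labels is reassigned to a different class chosen uniformly at random; other protocols (class-dependent or instance-dependent noise) are equally valid choices, and the function name is illustrative.

```python
import numpy as np

def corrupt_labels(y, rate, num_classes, seed=0):
    """Flip an exact fraction `rate` of labels to a different, uniformly chosen class."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_flip = int(round(rate * len(y)))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    offsets = rng.integers(1, num_classes, size=n_flip)  # 1..K-1, so the label always changes
    y_noisy[idx] = (y[idx] + offsets) % num_classes
    return y_noisy

y = np.arange(1000) % 10
for rate in (0.0, 0.2, 0.4):
    y_noisy = corrupt_labels(y, rate, num_classes=10)
    print(f"rate={rate}: fraction changed = {np.mean(y_noisy != y):.3f}")
```

Fixing the seed makes the corrupted set identical across runs, so spectral differences between runs can be attributed to training rather than to which labels were flipped.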

C.17 Task: Build a pipeline that compares representations from two model families (e.g., CNN vs. ViT) on the same dataset and evaluates spectral slope, effective rank, alignment metrics, and stability. Purpose: Understand how architecture influences representation geometry and downstream behavior. The goal is to develop a systematic comparison framework that attributes geometric differences to architectural inductive biases, rather than to confounds like training schedules or preprocessing. The emphasis is on fair, interpretable comparisons that guide architecture selection for representation goals. ML Link: Architecture is a primary driver of geometric inductive bias, affecting transfer and robustness. It also relates to patch-level vs. convolutional inductive biases, positional encoding effects, and the different spectral bias profiles of CNNs and transformers. Hints: Use comparable parameter counts and training schedules; normalize embeddings; ensure consistent preprocessing; evaluate multiple layers for each model. What mastery looks like: You can articulate which geometric properties are architecture-specific, explain why those differences appear, and relate them to known strengths and weaknesses in transfer tasks.

C.18 Task: Design a collapse stress test by training a self-supervised model with progressively smaller batch sizes and weaker augmentations, then measure when collapse starts using spectral diagnostics and similarity distributions. Purpose: Identify critical thresholds for collapse and build intuition for safe hyperparameter regimes. The goal is to map out the stability boundary of a training setup, turning collapse into a predictable failure mode rather than a surprise. The emphasis is on creating actionable guardrails for SSL training configurations. ML Link: Collapse is a frequent failure mode in contrastive and non-contrastive self-supervised learning, and avoiding it is central to representation quality. It also connects to negative sampling strategies, stop-gradient methods, and variance-covariance regularization techniques used in modern SSL. Hints: Vary one factor at a time; set a baseline with known stable hyperparameters; track loss alongside spectrum to show why loss is insufficient; consider repeated runs to separate stochastic effects from true collapse. What mastery looks like: You can determine a practical safe region of hyperparameters, explain why specific settings trigger collapse, and propose concrete remedies such as stronger augmentation, larger batches, or variance regularization.

C.19 Task: Implement a representation distance preservation test by computing pairwise distances in input space and embedding space for a random subset of data, then measure correlation across local and global scales. Purpose: Assess whether embeddings preserve local geometry and identify where distortions are useful or harmful. The goal is to quantify neighborhood preservation as a geometric diagnostic, distinguishing beneficial distortions (task-relevant reweighting) from harmful ones (loss of local structure). Distance preservation is related throughout to retrieval performance and manifold integrity. ML Link: Good representations preserve task-relevant neighborhoods while distorting irrelevant directions, a core concept in representation geometry. It also relates to manifold learning quality, neighborhood preservation metrics used in dimensionality reduction, and embedding suitability for k-NN retrieval. Hints: Use multiple distance metrics; separate local (nearest neighbor) and global scales; evaluate on different layers; compare to a random embedding baseline. What mastery looks like: You can explain which layers preserve local structure best, identify when distortions help or hurt downstream performance, and interpret distance correlation differences across model families.
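
A minimal version of the test is sketched below, with Pearson correlation of pairwise distances as the global measure and \(k\)-nearest-neighbor overlap as the local one. Two reference embeddings bracket what a real encoder would produce: an orthogonal rotation (a perfect isometry) and an embedding independent of the input. Both are assumptions for illustration, as are the function names and \(k = 10\).

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.clip(d2, 0.0, None))

def global_distance_correlation(X, Z):
    """Pearson correlation between input-space and embedding-space pairwise distances."""
    iu = np.triu_indices(len(X), k=1)
    return float(np.corrcoef(pairwise_distances(X)[iu], pairwise_distances(Z)[iu])[0, 1])

def knn_overlap(X, Z, k=10):
    """Mean fraction of each point's k nearest neighbors shared between X and Z."""
    def neighbors(A):
        D = pairwise_distances(A)
        np.fill_diagonal(D, np.inf)  # exclude each point from its own neighbor list
        return np.argsort(D, axis=1)[:, :k]
    nx, nz = neighbors(X), neighbors(Z)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nx, nz)]))

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
Q = np.linalg.qr(rng.normal(size=(20, 20)))[0]
Z_isometry = X @ Q                      # rotation: preserves all distances
Z_random = rng.normal(size=(300, 8))    # unrelated to X

for name, Z in (("isometry", Z_isometry), ("random", Z_random)):
    print(f"{name}: corr = {global_distance_correlation(X, Z):.3f}, "
          f"kNN overlap = {knn_overlap(X, Z):.3f}")
```

A trained encoder typically lands between these extremes, and comparing the local and global scores shows which scale the distortion acts on.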

C.20 Task: Create a monitoring report that combines covariance spectra, alignment-uniformity metrics, and stability scores for a trained model, and interpret the joint profile with recommendations. Purpose: Synthesize multiple geometric diagnostics into a coherent evaluation and build a holistic representation health report that can be used for model selection, regression testing, and deployment readiness. The aim is to move beyond single-metric evaluation by integrating complementary signals: spectra reveal capacity usage and collapse risk, alignment-uniformity captures contrastive geometry, and stability scores reveal robustness to perturbations. This integrated view enables consistent decision-making across training runs, highlights tradeoffs (e.g., improved alignment at the cost of uniformity), and supports longitudinal tracking of representation health over time or across checkpoints. ML Link: Practical representation debugging relies on multiple signals, not a single metric, and this mirrors real-world monitoring of embedding quality. It also connects to production embedding drift monitoring, continual learning stability checks, and evaluation pipelines for foundation model embeddings across tasks. Hints: Normalize metrics for comparability; include a healthy baseline and a known-degenerate model; define criteria for “healthy” vs. “degenerate” representations; summarize which diagnostics are most sensitive to which failure modes. What mastery looks like: You can diagnose subtle failures (good alignment but poor uniformity, stable embeddings with poor separability, or strong spectra with weak transfer), tie symptoms to root causes, and propose concrete remedies grounded in representation geometry.
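
Two of the report entries can be sketched as follows. The alignment and uniformity definitions used here are the commonly cited forms (mean powered distance between positive pairs, and the log of the mean Gaussian potential over all pairs); whether they match the chapter's Theorem 8 exactly is an assumption, as are the parameter defaults `alpha=2` and `t=2` and the synthetic data.

```python
import numpy as np

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def alignment(z_a, z_b, alpha=2.0):
    """Mean distance^alpha between positive pairs (lower means better aligned)."""
    return float(np.mean(np.linalg.norm(z_a - z_b, axis=1) ** alpha))

def uniformity(z, t=2.0):
    """Log mean Gaussian potential over distinct pairs (lower means more uniform)."""
    sq = np.sum(z**2, axis=1)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2.0 * z @ z.T, 0.0, None)
    iu = np.triu_indices(z.shape[0], k=1)
    return float(np.log(np.mean(np.exp(-t * d2[iu]))))

rng = np.random.default_rng(0)
spread = l2_normalize(rng.normal(size=(256, 32)))          # well-spread baseline
center = l2_normalize(rng.normal(size=(1, 32)))
clustered = l2_normalize(center + 0.05 * rng.normal(size=(256, 32)))  # near-degenerate
views = l2_normalize(spread + 0.1 * rng.normal(size=spread.shape))    # perturbed positives

report = {
    "alignment_views": alignment(spread, views),
    "uniformity_spread": uniformity(spread),
    "uniformity_clustered": uniformity(clustered),
}
print(report)
```

Including both a well-spread and a near-degenerate reference, as the hints suggest, gives the report a scale against which a trained model's values can be read.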

Solutions

Solutions to A. True / False

A.1 If a representation covariance matrix \(\Sigma\) has effective rank \(r_{\text{eff}}\) (the number of eigenvalues above numerical precision threshold), then representations can encode at most \(r_{\text{eff}}\) bits of information about the data, regardless of the nominal dimension \(d\). - Final Answer: False. - Full mathematical justification: Effective rank \(r_{\text{eff}}\) measures the number of eigenvalues above a threshold in \(\Sigma\), which bounds linear variance directions, not information content in bits. Even if \(\text{rank}(\Sigma)=k\), the representation can encode a continuous \(k\)-dimensional variable with unbounded differential entropy, and mutual information depends on noise and distributional assumptions, not just rank. In continuous settings, mutual information can be made arbitrarily large by scaling variance or reducing observation noise, even when second moments are low-rank. Moreover, if \(\mathbf{Z} = g(\mathbf{X})\) is deterministic and \(\mathbf{X}\) has high entropy, then \(I(\mathbf{X};\mathbf{Z})\) can be arbitrarily large despite finite rank in second moments. Thus rank does not upper bound information in bits without strong quantization or noise constraints. - Counterexample if false: Let \(\mathbf{X} \sim \mathcal{N}(0, \sigma^2 I_k)\) and define \(\mathbf{Z} = [\mathbf{X}, 0]\in \mathbb{R}^d\) with \(d>k\). Then \(\text{rank}(\Sigma)=k\) for every \(\sigma>0\), yet \(\mathbf{Z}\) determines \(\mathbf{X}\) exactly, so \(I(\mathbf{X};\mathbf{Z})\) is infinite for continuous \(\mathbf{X}\). Under a fixed quantization step \(\Delta\), the encoded information is approximately \(k\log_2(\sigma/\Delta) + \tfrac{k}{2}\log_2(2\pi e)\) bits for small \(\Delta\), which grows without bound as \(\sigma/\Delta\) grows rather than being capped at \(k\). - Comprehension: Effective rank summarizes linear variance, not discrete information; information bounds require explicit noise or quantization assumptions. - ML Applications: Use effective rank for collapse/compression checks, but use mutual-information estimates when making information-capacity claims. - Failure Mode Analysis: Treating rank as an information bound can lead to over-pruning embeddings or underestimating representation capacity. - Traps: Confusing discrete entropy with differential entropy, and assuming covariance rank implies information rank.

A.2 Contrastive learning with batch size \(B\) necessarily learns representations of dimension \(> \Omega(\log B)\) because the loss function has approximately \(\Theta(B)\) distinct negative pairs per positive pair. - Final Answer: False. - Full mathematical justification: The number of negatives does not force embedding dimension to scale as \(\Omega(\log B)\). A contrastive loss constrains relative similarities, but embeddings can satisfy many pairwise constraints in low dimension by collapsing or by creating non-injective mappings with tied similarities. There is no general lower bound on \(\dim(\mathcal{Z})\) in terms of \(B\) without injectivity or margin constraints, which are not implied by the loss alone. In low dimensions, many pairwise inequalities can be satisfied because the loss is tolerant to ties or small margins when \(\tau\) is large. Thus batch size increases pressure for separation but does not impose a dimensionality lower bound by itself. - Counterexample if false: Use a constant representation \(\mathbf{z}_i = \mathbf{c}\) for all \(i\). The contrastive loss is large but finite (with all similarities tied, it equals the logarithm of the number of candidates in the softmax), and the embeddings occupy a single point, so the covariance has rank 0 regardless of \(B\), showing that \(B\) alone enforces no lower bound. Even at lower loss values, low-dimensional embeddings can satisfy the constraints if negatives are easy. - Comprehension: More negatives increase separation pressure, but dimensionality lower bounds need injectivity or margin assumptions beyond the loss. - ML Applications: Scale batch size to improve uniformity, and still monitor dimensionality and collapse with spectral diagnostics. - Failure Mode Analysis: Believing larger \(B\) forces high-dimensional geometry can mask collapse when temperature is too high or augmentations are weak. - Traps: Conflating more constraints with guaranteed higher-dimensional embeddings.
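
The constant-representation counterexample can be checked numerically. The sketch below assumes an in-batch InfoNCE variant in which the positives of the other anchors serve as negatives, so each softmax has \(B\) candidates; the function name `info_nce` and the vector `c` are illustrative choices.

```python
import numpy as np

def info_nce(anchors, positives, tau):
    """In-batch InfoNCE: row i's positive is positives[i]; other rows act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / tau                                # (B, B) similarity logits
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))

c = np.array([0.6, 0.8, 0.0, 0.0])   # any fixed unit vector
losses = {}
for B in (8, 64, 512):
    z = np.tile(c, (B, 1))           # every embedding equals c
    losses[B] = info_nce(z, z, tau=0.1)
    print(B, losses[B], np.log(B))

# Per-coordinate variance of the collapsed batch: zero in every direction.
collapsed_variance = np.tile(c, (64, 1)).var(axis=0)
```

Under this convention the collapsed loss grows only as \(\log B\): it stays finite for every batch size while the embeddings never leave a single point.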

A.3 Data augmentation that is applied during training but not at test time creates a mismatch between test-time representations and training representations, always degrading transfer learning performance. - Final Answer: False. - Full mathematical justification: Training-time augmentation can improve invariance and generalization, and representations computed on unaugmented test inputs can still be better than those from a model trained without augmentation. The mismatch does not necessarily degrade transfer; it often improves it by teaching invariance to nuisance transformations while preserving label-relevant structure. From a risk perspective, augmentation modifies the training objective to minimize expected loss over transformed inputs, which can reduce variance and improve robustness. Only if augmentations destroy task-relevant information or induce distributional bias can performance degrade. Thus the claim “always degrading” is false. - Counterexample if false: SimCLR-style augmentations on ImageNet improve transfer to downstream tasks without test-time augmentation, contrary to the statement. - Comprehension: Augmentation mismatch is not inherently harmful; it encodes invariances that can improve generalization. - ML Applications: Choose augmentations aligned with downstream invariances and validate with transfer or linear-probe checks. - Failure Mode Analysis: Over-augmenting (e.g., heavy rotations for digits) can erase class identity and hurt performance. - Traps: Treating augmentation as data mismatch rather than inductive bias.

A.4 A representation is invariant to rotation if and only if applying a rotation to the input does not change the representation; conversely, a representation is equivariant to rotation if the representation transforms by the same rotation. - Final Answer: True. - Full mathematical justification: Invariance to rotation means \(f(R\mathbf{x})=f(\mathbf{x})\) for all rotations \(R\), which is the standard definition. Equivariance means \(f(R\mathbf{x})=\rho(R)f(\mathbf{x})\) for a representation \(\rho\) of the rotation group; for the same rotation action in representation space, this is “transforms by the same rotation.” The “if and only if” holds because these are definitional equivalences for invariance and equivariance with respect to a group action. Hence the statement matches definitions. - Counterexample if false: Not applicable. - Comprehension: Invariance discards transformation information; equivariance preserves it in a structured, group-consistent way. - ML Applications: Use equivariant models when pose is task-relevant and invariance when pose is nuisance. - Failure Mode Analysis: Treating equivariance as invariance can cause loss of needed pose information in tasks where orientation matters. - Traps: Confusing invariance with equivariance when interpreting learned features.

A.5 The neural tangent kernel (NTK) regime, where networks behave like kernel methods with fixed feature map, necessarily exhibits faster spectral bias (learning low frequencies before high frequencies) than the feature learning regime. - Final Answer: False. - Full mathematical justification: The NTK regime implies a fixed feature map and kernel regression dynamics, but it does not universally guarantee faster spectral bias than feature learning. Spectral bias depends on kernel eigenvalues and data distribution; feature learning can accelerate or decelerate high-frequency learning depending on architecture and optimization. In some cases feature learning reshapes the kernel spectrum to amplify higher-frequency components, reversing the ordering predicted by the NTK. There is no general monotone relation between NTK and feature-learning spectral bias. - Counterexample if false: A network that learns features aligned with high-frequency components can fit high frequencies faster than its NTK approximation, reversing the claimed ordering. - Comprehension: Spectral bias depends on the kernel spectrum and optimization dynamics, not just the NTK vs. feature-learning label. - ML Applications: Use frequency-learning curves to decide whether to stay in NTK-like training or encourage feature learning. - Failure Mode Analysis: Assuming NTK is always smoother can lead to incorrect expectations about learning dynamics. - Traps: Equating “kernel method” with “stronger low-frequency bias” in all cases.

A.6 If an autoencoder without explicit regularization achieves training loss below the reconstruction error of the true data manifold dimension, then the encoder’s output must be collapsed (rank-deficient). - Final Answer: False. - Full mathematical justification: Achieving training loss below the reconstruction error of the true manifold dimension does not imply collapse; it may indicate that the model captures structure beyond the assumed manifold or exploits noise or overfitting, but rank-deficiency is not a necessary consequence. The manifold dimension assumption may be wrong, and nonlinear encoders can reduce error without reducing covariance rank. Reconstruction error can be reduced by modeling noise or by nonlinear encoding without collapsing dimensions. Thus low loss is compatible with full-rank or near full-rank representations. - Counterexample if false: A nonlinear autoencoder can achieve low reconstruction error while maintaining full-rank covariance in latent space. - Comprehension: Low reconstruction loss does not imply collapse; rank deficiency must be measured explicitly. - ML Applications: Track covariance spectra and add variance or covariance regularizers if collapse appears. - Failure Mode Analysis: Misdiagnosing collapse from low loss can lead to unnecessary regularization that hurts performance. - Traps: Assuming reconstruction performance fully characterizes representation geometry.

A.7 Weight decay explicitly minimizes representation variance, making it an indirect mechanism for controlling the spectral regularization effect (Theorem 9). - Final Answer: False. - Full mathematical justification: Weight decay penalizes parameter norms, not representation variance directly. Its effect on representation variance is indirect and depends on architecture, data, and optimization. It can reduce large weights, but representations may still have high variance if features are aligned with dominant data directions or if downstream layers rescale activations. The mapping from parameter norms to representation variance is model- and data-dependent, so “explicitly minimizes representation variance” is incorrect. - Counterexample if false: A linear model with weight decay on a dataset with high-variance directions can still yield high-variance representations along those directions. - Comprehension: Weight decay penalizes parameters, so its effect on representation variance is indirect and model-dependent. - ML Applications: Pair weight decay with explicit variance or covariance regularizers when controlling representation spread. - Failure Mode Analysis: Overreliance on weight decay can leave collapse unaddressed in self-supervised training. - Traps: Assuming parameter norm penalties translate directly into embedding variance constraints.

A.8 In contrastive learning, if the temperature parameter \(\tau \to 0\), the learned representations must approach a state where positive pairs are identical and negative pairs are orthogonal, regardless of batch size. - Final Answer: False. - Full mathematical justification: As \(\tau \to 0\), the loss increasingly emphasizes the most similar negatives, but finite batch size and finite capacity prevent perfect orthogonality of all negatives. Positive pairs may become very close, but the claim that negatives become orthogonal regardless of batch size is not guaranteed. In finite \(d\), one cannot make all pairwise angles \(\pi/2\) for large \(B\), and the optimum trades off positive alignment with feasible negative separation. The best achievable separation depends on dimension, batch size, and data geometry. - Counterexample if false: In \(\mathbb{R}^2\), it is impossible for many negatives to be orthogonal simultaneously. Even in higher dimensions, with finite \(d\), orthogonality among all negatives is impossible for large \(B\). - Comprehension: Lowering \(\tau\) sharpens softmax constraints, but geometry limits prevent universal orthogonality. - ML Applications: Tune \(\tau\) jointly with batch size and embedding dimension; monitor similarity histograms for instability. - Failure Mode Analysis: Overly small \(\tau\) can cause gradient explosions and representational collapse if optimization fails. - Traps: Treating \(\tau\) as a magic knob that guarantees perfect separation.
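
The geometric limit in this answer can be made quantitative with the Welch bound, a standard result (not taken from this chapter) stating that any \(B > d\) unit vectors in \(\mathbb{R}^d\) have maximum absolute pairwise cosine at least \(\sqrt{(B-d)/(d(B-1))}\). The sketch below, with hypothetical helper names and random vectors standing in for embeddings, compares observed values against that floor.

```python
import numpy as np

def max_abs_offdiag_cosine(z):
    """Largest absolute cosine similarity between distinct rows of z."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    s = zn @ zn.T
    np.fill_diagonal(s, 0.0)
    return float(np.max(np.abs(s)))

def welch_bound(B, d):
    """Lower bound on the maximum absolute cosine among B unit vectors in R^d."""
    return float(np.sqrt((B - d) / (d * (B - 1)))) if B > d else 0.0

rng = np.random.default_rng(0)
d = 8
results = {}
for B in (8, 16, 64, 256):
    z = rng.normal(size=(B, d))   # stand-in embeddings
    results[B] = (max_abs_offdiag_cosine(z), welch_bound(B, d))
    print(B, results[B])

# Exact mutual orthogonality is attainable only when B <= d, e.g. a standard basis.
basis_coherence = max_abs_offdiag_cosine(np.eye(d))
```

The floor is zero only while \(B \le d\); beyond that, no choice of \(\tau\) can make every pair orthogonal.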

A.9 A representation learned through supervised classification on balanced data necessarily exhibits the alignment-uniformity tradeoff (Theorem 8), with uniformity bounded by the number of classes. - Final Answer: True. - Full mathematical justification: In supervised classification, representations are driven to align same-class points and separate different classes, which mirrors the alignment-uniformity tradeoff. Uniformity is limited by class structure; with \(K\) classes, representations form \(K\) clusters, reducing uniformity relative to a uniform distribution on the sphere. Cross-entropy encourages large inter-class margins, which increases uniformity pressure but only up to the cluster structure imposed by labels. Thus the tradeoff exists in supervised settings as well, with uniformity bounded by class-induced clustering. - Counterexample if false: Not applicable. - Comprehension: Supervised objectives still trade alignment and uniformity because class clustering constrains global spread. - ML Applications: Use alignment/uniformity metrics to detect class collapse or over-separation in classifier embeddings. - Failure Mode Analysis: Excessive uniformity pressure can reduce class separation; excessive alignment can cause class collapse. - Traps: Assuming alignment-uniformity is exclusive to contrastive learning.

A.10 Overparameterized networks (parameter-to-sample ratio \(\gamma \gg 1\)) always generalize worse than optimally-sized networks because implicit regularization from implicit bias is weaker for larger models. - Final Answer: False. - Full mathematical justification: Overparameterized networks often generalize as well as or better than smaller models due to implicit regularization and optimization bias (double descent). There is no universal rule that larger models generalize worse; empirical and theoretical results show improved generalization under certain conditions. In many regimes, larger models find flatter minima and benefit from optimization dynamics that bias toward simpler functions. Thus the statement “always generalize worse” is too strong. - Counterexample if false: Large ResNets and Transformers outperform smaller counterparts on many benchmarks while generalizing better. - Comprehension: Overparameterization can generalize well due to implicit bias; size alone is not determinative. - ML Applications: Scale models while validating with data-size and regularization sweeps to avoid small-data overfitting. - Failure Mode Analysis: Overparameterization can hurt when data is scarce, leading to memorization. - Traps: Applying classical bias-variance intuition uncritically to deep learning regimes.

A.11 The Lipschitz constant of a representation encoder bounds the maximum change in embedding distance under input perturbation, but does not directly control stability to data augmentation applied to images. - Final Answer: False. - Full mathematical justification: A Lipschitz bound on the encoder does control representation changes under any input perturbation, including those induced by data augmentation, as long as the perturbation magnitude is bounded. Thus the Lipschitz constant does provide a bound on representation changes for augmentations, though the bound may be loose and norm-dependent. If an augmentation is not bounded in the chosen input metric, the bound may be vacuous, but the statement that it does not directly control stability is false. - Counterexample if false: For any augmentation \(\delta\) with \(\|\delta\|\leq \epsilon\), \(\|f(\mathbf{x}+\delta)-f(\mathbf{x})\| \leq L\epsilon\). This directly contradicts the statement that it “does not directly control” stability. - Comprehension: Lipschitz bounds apply to any bounded perturbation; looseness depends on the chosen input metric. - ML Applications: Use Lipschitz regularization for stability targets and ensure augmentations are norm-bounded in that metric. - Failure Mode Analysis: Loose bounds can mislead if augmentation changes are not well-measured in input norm. - Traps: Equating “bound is loose” with “no control.”
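
A small numerical check of the bound, assuming a linear encoder \(f(\mathbf{x}) = W\mathbf{x}\) in the Euclidean norm, for which the Lipschitz constant is the largest singular value of \(W\). The matrix, dimensions, and the circular shift used as a stand-in for a translation augmentation are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 16
W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)

def encoder(x):
    """Hypothetical linear encoder used only for this check."""
    return W @ x

L = np.linalg.norm(W, 2)   # spectral norm = Lipschitz constant of x -> W x
eps = 0.1
x = rng.normal(size=d_in)

worst_ratio = 0.0
for _ in range(1000):
    delta = rng.normal(size=d_in)
    delta *= eps * rng.uniform() / np.linalg.norm(delta)   # ensures ||delta|| <= eps
    change = np.linalg.norm(encoder(x + delta) - encoder(x))
    worst_ratio = max(worst_ratio, change / (L * eps))

# A one-step circular shift: visually small, but large in Euclidean norm.
shift_norm = np.linalg.norm(np.roll(x, 1) - x)
shift_change = np.linalg.norm(encoder(np.roll(x, 1)) - encoder(x))
print(worst_ratio, shift_norm, shift_change, L * shift_norm)
```

The bound holds in both cases. For the shift it is simply loose, because the perturbation is large in the input norm even though it is a mild augmentation, which is the distinction the Traps entry draws.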

A.12 Transfer learning works well when source and target task share low-frequency structure (captured by early-layer representations), but transfer learning fails when tasks differ in high-frequency structure (late-layer representations). - Final Answer: False. - Full mathematical justification: While transfer often depends on shared low-frequency or generic features, failure is not guaranteed when tasks differ in high-frequency structure. Transfer can still succeed via fine-tuning, adaptation of later layers, or when low-frequency structure remains sufficient for performance. High-frequency differences can be compensated by updating late layers or adding adapters, so “fails when tasks differ in high-frequency structure” is too strong. - Counterexample if false: ImageNet pre-trained features can transfer to some fine-grained tasks even when high-frequency details differ, provided fine-tuning is applied. - Comprehension: High-frequency mismatch does not force transfer failure; adaptation can recover task-specific detail. - ML Applications: Freeze early layers and fine-tune late layers or adapters when high-frequency detail is needed. - Failure Mode Analysis: Using frozen late layers can harm tasks requiring fine detail. - Traps: Assuming a strict low-frequency requirement for transfer success.

A.13 A VAE trained with \(\beta = 0\) (no KL regularization) in the limit can achieve both perfect reconstruction and uniform coverage of the latent space simultaneously without additional constraints. - Final Answer: False. - Full mathematical justification: With \(\beta=0\), the VAE objective reduces to pure reconstruction, which does not enforce latent coverage. Perfect reconstruction can be achieved with highly non-uniform latent usage, creating holes, because the encoder can map data into a thin subset of latent space without penalty. Uniform coverage requires explicit regularization or constraints such as a KL term or variance penalty. Thus perfect reconstruction and uniform coverage are not simultaneously guaranteed. - Counterexample if false: A VAE encoder that maps all data to a thin subset of latent space can achieve low reconstruction loss while leaving large regions unused. - Comprehension: Without KL pressure, VAEs can reconstruct well while using a tiny latent subset. - ML Applications: Use \(\beta\) or variance penalties to enforce latent coverage for sampling and interpolation. - Failure Mode Analysis: Setting \(\beta=0\) yields brittle generative sampling and poor latent interpolation. - Traps: Assuming reconstruction objectives alone enforce good latent geometry.

A.14 Group-equivariant CNNs (G-CNNs) that encode rotation equivariance require more parameters than standard CNNs to achieve the same classification accuracy on rotation-equivariant tasks. - Final Answer: False. - Full mathematical justification: G-CNNs can achieve equivariance with parameter sharing across group elements, often requiring similar or fewer parameters for equivariant tasks. They can be more sample efficient and sometimes more parameter efficient than standard CNNs when symmetry is present. If the task is truly equivariant, the shared parameters reduce redundancy rather than increase it. Thus they do not necessarily require more parameters. - Counterexample if false: A G-CNN with shared weights across rotations can match accuracy with fewer parameters than a standard CNN trained with rotation augmentation. - Comprehension: Equivariance via parameter sharing can reduce redundancy and improve sample efficiency. - ML Applications: Prefer G-CNNs or other equivariant models for data with known symmetries. - Failure Mode Analysis: If symmetries are mismatched or approximate, equivariance constraints can reduce flexibility. - Traps: Equating architectural constraints with parameter inflation.

A.15 Spectral bias (learning low-frequency components before high-frequency) emerges from optimization dynamics (gradient descent) rather than from the inductive bias of neural network architectures. - Final Answer: False. - Full mathematical justification: Spectral bias arises from both architecture (kernel properties of the network at initialization) and optimization dynamics. Even with fixed optimization, different activations produce different spectral biases because they change the implicit kernel spectrum. Conversely, changing optimization can alter the rate at which different modes are learned without changing the architecture. Thus it is not purely an optimization effect. - Counterexample if false: ReLU and sinusoidal activations with the same optimizer exhibit different spectral learning orderings. - Comprehension: Spectral bias reflects both architecture (kernel spectrum) and optimization dynamics. - ML Applications: Select activations and architectures to emphasize or suppress high-frequency learning as needed. - Failure Mode Analysis: Ignoring architectural bias can lead to unexpected overfitting or underfitting of high-frequency components. - Traps: Attributing spectral bias solely to SGD behavior.

A.16 If representations undergo feature collapse (rank-deficiency in \(\Sigma\)), the Information Compression Bound (Theorem 10) guarantees that downstream task performance is bounded away from optimal, independent of downstream model capacity. - Final Answer: False. - Full mathematical justification: The Information Compression Bound relates accuracy to mutual information but does not imply a universal bound away from optimal performance solely due to rank-deficiency. A low-rank representation can still be sufficient if the task is low-dimensional or if the label depends on a low-dimensional statistic. Downstream capacity can sometimes compensate by exploiting the available dimensions optimally, and the bound depends on distributional assumptions rather than rank alone. - Counterexample if false: A binary classification task with intrinsic dimension 1 can be solved optimally with rank-1 representations. - Comprehension: Low rank is only harmful if it undercuts the task’s intrinsic dimensionality. - ML Applications: Compare effective rank to task dimension and avoid over-regularizing when low rank is sufficient. - Failure Mode Analysis: Overreacting to low rank can lead to unnecessary complexity that harms generalization. - Traps: Assuming any rank deficiency implies degraded optimal performance.

A.17 The manifold hypothesis (data lies on low-dimensional manifold) implies that intrinsic dimensionality can be estimated from the effective rank of the representation covariance after training. - Final Answer: False. - Full mathematical justification: The manifold hypothesis refers to intrinsic dimensionality of data, but effective rank of representation covariance depends on the learned mapping and data distribution. It can overestimate or underestimate intrinsic dimension, and is sensitive to scaling and noise. The representation may compress or expand dimensions arbitrarily, so effective rank reflects model choices, not the data manifold itself. Thus effective rank is not a consistent estimator of intrinsic dimension without additional assumptions. - Counterexample if false: A representation that collapses many dimensions can yield low effective rank even if data manifold is high-dimensional. - Comprehension: Effective rank reflects the learned mapping, not the data manifold itself. - ML Applications: Pair effective rank with intrinsic-dimension estimators before drawing conclusions about data complexity. - Failure Mode Analysis: Misinterpreting effective rank can lead to wrong conclusions about data complexity. - Traps: Assuming representation covariance reflects raw data geometry.

A.18 A representation encoder with Lipschitz constant \(L\) applied to two inputs \(\mathbf{x}_1, \mathbf{x}_2\) satisfies \(\|\mathbf{z}_1 - \mathbf{z}_2\| \leq L \|\mathbf{x}_1 - \mathbf{x}_2\|\), which necessarily implies robustness to adversarial perturbations of bounded norm. - Final Answer: False. - Full mathematical justification: A Lipschitz bound provides worst-case bounds on embedding changes, but robustness to adversarial perturbations depends on classifier margins and decision boundaries. A small Lipschitz constant does not guarantee robustness if margins are tiny, because an adversary can still cross the decision boundary with small perturbations. Conversely, a large Lipschitz constant does not necessarily imply vulnerability if margins are large. Thus Lipschitz continuity alone does not guarantee adversarial robustness. - Counterexample if false: A classifier with small margin can be fooled by tiny perturbations even if the encoder is Lipschitz. - Comprehension: Lipschitz stability is not sufficient without decision margins; robustness is a joint property. - ML Applications: Combine Lipschitz control with margin training or certified defenses for adversarial robustness. - Failure Mode Analysis: Relying solely on Lipschitz bounds can overestimate robustness. - Traps: Equating Lipschitz stability with adversarial robustness guarantees.

A.19 In self-supervised contrastive learning, if the negative sampling distribution is uniform (every other image is equally likely to be a negative), then the learned representations cannot achieve alignment to data-specific similarity structure (e.g., CIFAR-10 class similarity). - Final Answer: False. - Full mathematical justification: Uniform negative sampling does not prevent alignment to class structure; alignment arises from positive pairs and inductive biases. Even with uniform negatives, positive pairs push same-class examples together if augmentations preserve class identity, and the encoder can exploit shared features to cluster by class. Negative sampling mainly affects uniformity pressure, not the ability to align positives. Therefore, class structure can emerge without non-uniform negatives. - Counterexample if false: SimCLR uses uniform negatives yet yields strong class-aligned representations on CIFAR-10. - Comprehension: Uniform negatives still permit class-aligned structure via positives and inductive bias. - ML Applications: Use uniform negatives as a baseline; add hard negatives to refine, not to create, alignment. - Failure Mode Analysis: Overemphasizing hard negatives can destabilize training. - Traps: Assuming uniform negatives destroy class semantics.

A.20 Foundation models (BERT, GPT, CLIP) achieve strong transfer learning performance because they learn representations that explicitly maximize uniformity across diverse tasks, minimizing specialization to any single objective. - Final Answer: False. - Full mathematical justification: Foundation models do not explicitly maximize uniformity across tasks; they optimize task-specific objectives (e.g., masked language modeling, next token prediction, contrastive image-text alignment) that may produce embeddings with varying uniformity. Their transfer success arises from broad data coverage, scale, and inductive biases, not from explicit uniformity maximization. Empirically, embeddings are often anisotropic or clustered, which is inconsistent with uniformity maximization as a direct objective. - Counterexample if false: GPT-style models trained for next token prediction learn representations that are highly non-uniform across tasks, yet transfer effectively. - Comprehension: Foundation models optimize task-specific objectives; uniformity is emergent and often imperfect. - ML Applications: Measure anisotropy and apply centering or whitening when uniformity issues harm downstream tasks. - Failure Mode Analysis: Assuming uniformity can hide task-specific biases or representation anisotropy. - Traps: Confusing emergent representation quality with explicit uniformity optimization.

Solutions to B. Proof Problems

B.1 If \(\Sigma\) has rank \(k<d\), then any linear classifier boundary in representation space depends only on a \(k\)-dimensional subspace, with geometry constant along orthogonal directions. - Full formal proof: Let \(\Sigma\) be the covariance of representations \(\mathbf{z}\in\mathbb{R}^d\) with \(\text{rank}(\Sigma)=k<d\). Then the support of \(\mathbf{z}\) lies in a \(k\)-dimensional subspace \(\mathcal{S}\subset\mathbb{R}^d\). Let \(P\) be the orthogonal projector onto \(\mathcal{S}\). For any linear classifier \(h(\mathbf{z})=\text{sign}(\mathbf{w}^\top\mathbf{z}+b)\), we can decompose \(\mathbf{w}=P\mathbf{w}+(I-P)\mathbf{w}\). For all \(\mathbf{z}\in\mathcal{S}\), \(\mathbf{w}^\top\mathbf{z}=(P\mathbf{w})^\top\mathbf{z}\) because \((I-P)\mathbf{w}\perp\mathcal{S}\). Thus the decision boundary depends only on \(P\mathbf{w}\in\mathcal{S}\), a \(k\)-dimensional vector. Hence every decision boundary is representable entirely in \(\mathcal{S}\). Geometrically, the classifier is constant along directions orthogonal to \(\mathcal{S}\), so the effective decision geometry is \(k\)-dimensional. - Proof strategy & techniques: Establish the support subspace via rank arguments on \(\Sigma\), then explicitly construct the orthogonal projector \(P\) and decompose \(\mathbf{w}\) into components parallel and orthogonal to \(\mathcal{S}\). The proof relies on invariance of inner products under projection and a geometric argument that linear decision boundaries are constant along null directions. The main techniques are spectral decomposition, orthogonal projections, and subspace invariance. - Computational validation: Compute the empirical covariance, extract its top-\(k\) eigenvectors, project classifier weights, and verify that predictions are unchanged on validation data after projection. - ML interpretation: Low-rank embeddings compress all discriminative geometry into a \(k\)-dimensional subspace, so linear probes can only exploit variation along those directions. 
This means that nominal embedding size overstates usable capacity, and diagnostics should focus on effective rank when predicting linear probe performance or transferability. - Generalization & edge cases: If \(\Sigma\) is approximately low-rank, the result holds approximately; in finite samples, small eigenvalues can introduce slight deviations. - Failure mode analysis: Interpreting nominal dimensionality as capacity can overestimate classification ability when embeddings are low-rank. - Historical context: This is a standard linear algebra observation used in PCA-based classification and LDA analysis. - Traps: Confusing rank of covariance with rank of the data matrix; ignoring numerical precision when determining rank.
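The computational validation for B.1 can be sketched as follows. This is a minimal illustration on synthetic data, not a prescribed procedure: the dimensions, the random classifier `w`, `b`, and the data-generating model are all assumptions made for the example. It projects the weights onto the range of the empirical covariance and checks that the scores do not change once the off-subspace part of the mean is folded into the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 8, 3, 200  # assumed toy sizes

# Hypothetical low-rank embeddings: a mean plus a k-dimensional latent signal.
basis = rng.normal(size=(d, k))
mu = rng.normal(size=d)
Z = mu + rng.normal(size=(n, k)) @ basis.T

# Orthogonal projector onto range(Sigma), built from the top-k eigenvectors.
Sigma = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
U = eigvecs[:, -k:]
P = U @ U.T

# An arbitrary linear classifier and its projected counterpart.
w = rng.normal(size=d)
b = 0.3
w_proj = P @ w
mu_hat = Z.mean(axis=0)
b_proj = b + (w - w_proj) @ mu_hat  # off-subspace part only shifts the bias

scores = Z @ w + b
scores_proj = Z @ w_proj + b_proj
print("max score difference:", np.max(np.abs(scores - scores_proj)))
```

If the embeddings are centered first, `b_proj` equals `b` and only the weight projection is needed.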

B.2 For unit-norm embeddings and finite batch size, the optimal InfoNCE solution yields bounded pairwise cosine similarities with explicit dependence on \(\tau\) and \(B\). - Full formal proof: Assume unit-norm embeddings \(\|\mathbf{z}_i\|=1\), so every similarity \(s_{ij}=\mathbf{z}_i^\top\mathbf{z}_j\) lies in \([-1,1]\). The InfoNCE objective for anchor \(i\) with \(B-1\) in-batch negatives is \(\ell_i=-\log \frac{\exp(s_{i+}/\tau)}{\exp(s_{i+}/\tau)+\sum_{j\neq i}\exp(s_{ij}/\tau)}\). Writing \(Z_i\) for the denominator, for any negative \(j\), \(\partial \ell_i/\partial s_{ij} = \frac{\exp(s_{ij}/\tau)}{\tau Z_i}>0\), and for the positive \(\partial \ell_i/\partial s_{i+} = -\frac{1}{\tau}+\frac{\exp(s_{i+}/\tau)}{\tau Z_i}<0\). These derivatives never vanish, so there is no interior stationary point in the similarities: the loss always pushes \(s_{i+}\) up and each \(s_{ij}\) down until the constraints \(s\in[-1,1]\) and the geometry of the sphere stop it. Rewriting \(\ell_i=\log\big(1+\sum_{j\neq i}\exp((s_{ij}-s_{i+})/\tau)\big)\) and using \(s_{ij}-s_{i+}\ge -2\) gives the floor \(\ell_i\ge \log\big(1+(B-1)e^{-2/\tau}\big)>0\). Conversely, since \(Z_i\ge \exp(s_{i+}/\tau)\) we have \(\frac{\exp(s_{ij}/\tau)}{Z_i} \le \exp((s_{ij}-s_{i+})/\tau)\), and any solution with \(\ell_i\le\varepsilon\) must satisfy \(\exp((s_{ij}-s_{i+})/\tau)\le e^{\varepsilon}-1\) for every negative, that is \(s_{ij}\le s_{i+}+\tau\log(e^{\varepsilon}-1)\le 1+\tau\log(e^{\varepsilon}-1)\). Thus similarities are bounded in terms of \(\tau\) and \(B\): the attainable loss is limited by \(\tau\) and \(B\), and the attained loss limits how close negatives can sit to the positive. - Proof strategy & techniques: Start from InfoNCE with unit-norm embeddings, rewrite the loss as a log-sum-exp of similarity gaps, and note that the partial derivatives have fixed sign, so the optimum is governed by the constraints rather than by an interior stationary point. Use softmax dominance bounds to relate \(s_{i+}\) and \(s_{ij}\), and apply exponential inequalities to extract separation constraints in terms of \(\tau\) and \(B\). The core techniques are sign analysis of gradients, softmax bounding, and inequality chaining. - Computational validation: Sweep \(\tau\) and batch size \(B\), compute empirical distributions of \(s_{i+}\) and \(s_{ij}\), and verify predicted separation scaling. 
- ML interpretation: Temperature and batch size jointly set the achievable separation geometry: smaller \(\tau\) increases angular margins, but finite \(d\) and \(B\) impose packing limits that prevent unlimited separation. This clarifies why tuning \(\tau\) has diminishing returns and why scaling batch size improves uniformity only up to geometric constraints. - Generalization & edge cases: For non-unit norms, bounds depend on norm distributions. For large \(B\), separation may saturate due to dimensionality limits. - Failure mode analysis: If \(\tau\) is too small, gradients explode; if \(B\) is too small, separation collapses. - Historical context: Similar bounds arise in analysis of softmax margins and contrastive learning geometry. - Traps: Assuming orthogonality is achievable for all negatives regardless of \(d\).
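A small numerical check of how \(\tau\) and \(B\) limit the per-anchor loss: because unit-norm similarities lie in \([-1,1]\), every gap \(s_{ij}-s_{i+}\) is at least \(-2\), so the per-anchor InfoNCE loss cannot fall below \(\log(1+(B-1)e^{-2/\tau})\). The helper names and the example similarities below are illustrative assumptions.

```python
import math

def info_nce(s_pos, s_negs, tau):
    """Per-anchor InfoNCE loss written as log(1 + sum of exp(gap / tau))."""
    return math.log1p(sum(math.exp((s - s_pos) / tau) for s in s_negs))

def loss_floor(batch_size, tau):
    """Lower bound on the per-anchor loss when all similarities lie in [-1, 1]."""
    return math.log1p((batch_size - 1) * math.exp(-2.0 / tau))

B, tau = 8, 0.5
typical = info_nce(0.9, [0.1] * (B - 1), tau)    # assumed typical similarities
extreme = info_nce(1.0, [-1.0] * (B - 1), tau)   # positive aligned, negatives antipodal
print(typical, extreme, loss_floor(B, tau))
```

The extreme configuration meets the floor for a single anchor; it cannot hold for all anchors at once when \(B>2\), which is where the packing limits mentioned in the interpretation enter.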

B.3 If \(f_\theta\) is invariant to a compact group action, then it factors through the quotient \(\mathcal{X}/G\) and induces a quotient-space geometry. - Full formal proof: Suppose \(f_\theta(g\cdot x)=f_\theta(x)\) for all \(g\in G\). Define equivalence classes \([x]=\{g\cdot x: g\in G\}\), and quotient map \(\pi: \mathcal{X}\to\mathcal{X}/G\). Define \(\tilde{f}([x])=f_\theta(x)\). This is well-defined because if \(y\in[x]\), then \(y=g\cdot x\) for some \(g\), and \(f_\theta(y)=f_\theta(x)\). Hence \(f_\theta=\tilde{f}\circ\pi\), so \(f_\theta\) factors through the quotient. The induced geometry on \(\mathcal{X}/G\) is defined by distances \(d_Q([x],[y])=\inf_{g\in G} d( x, g\cdot y)\), yielding a metric that collapses orbits. Under this metric, \(\tilde{f}\) is a representation on the quotient. - Proof strategy & techniques: Construct the quotient map \(\pi\) induced by the group action and prove \(\tilde{f}\) is well-defined by checking that all elements of an orbit map to the same representation. Then prove factorization \(f=\tilde{f}\circ\pi\) and define the induced metric by infimum over group actions. Techniques include equivalence class arguments, quotient topology, and metric construction by orbit minimization. - Computational validation: For a dataset with known group action, compute \(d_Q\) and verify that embeddings are constant on orbits and preserve quotient distances. - ML interpretation: Invariance forces the embedding to collapse group orbits into single points, effectively learning on the quotient space where nuisance transformations are removed. This explains why invariance improves robustness to those transformations but can also reduce information if the task depends on them. - Generalization & edge cases: For non-compact groups, quotient may be non-Hausdorff; for approximate invariance, factorization holds approximately. - Failure mode analysis: If invariance is too strong, task-relevant variability may be removed. 
- Historical context: Quotient constructions are standard in geometric learning and invariant representation theory. - Traps: Assuming invariance implies equivariance; ignoring the metric structure on the quotient.
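As a toy instance of the B.3 validation, take \(G\) to be the cyclic shifts acting on fixed-length vectors, with the Euclidean distance (which is shift-invariant, so the orbit-minimized distance is symmetric). The sorted-entries embedding below is a hypothetical invariant map chosen only for illustration; it is invariant to all permutations and therefore to cyclic shifts in particular.

```python
import math

def shifts(x):
    """All cyclic shifts of x: the orbit of x under the cyclic group."""
    return [x[i:] + x[:i] for i in range(len(x))]

def dist(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def quotient_dist(x, y):
    """d_Q([x], [y]) = min over group elements g of d(x, g . y)."""
    return min(dist(x, gy) for gy in shifts(y))

def invariant_embedding(x):
    """Hypothetical shift-invariant embedding: the sorted entries of x."""
    return sorted(x)

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 4.0, 1.0, 2.0]   # same orbit as x
z = [1.0, 3.0, 2.0, 4.0]   # different orbit
print(quotient_dist(x, y), quotient_dist(x, z))
```

Points on the same orbit are at quotient distance zero and receive identical embeddings, which is the factorization \(f_\theta=\tilde{f}\circ\pi\) in miniature.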

B.4 In linear autoencoders with squared loss, the optimal encoder spans the top-\(k\) principal components and, in the orthonormal parameterization, the latent covariance equals the top-\(k\) eigenvalues of data covariance. - Full formal proof: Let data covariance \(\Sigma_X\) have eigendecomposition \(\Sigma_X=U\Lambda U^\top\) with eigenvalues in decreasing order, and let \(U_k\) collect the first \(k\) eigenvectors. A linear autoencoder with encoder \(W_e\in\mathbb{R}^{k\times d}\) and decoder \(W_d\in\mathbb{R}^{d\times k}\) minimizes \(\mathbb{E}\|W_d W_e \mathbf{x}-\mathbf{x}\|^2\). This equals \(\text{tr}(\Sigma_X)-2\text{tr}(W_d W_e \Sigma_X)+\text{tr}(W_d W_e \Sigma_X W_e^\top W_d^\top)\). By the Eckart-Young-Mirsky theorem for best rank-\(k\) approximation, the minimum over rank-\(k\) maps is attained when \(W_d W_e=U_k U_k^\top\), the orthogonal projector onto the top-\(k\) eigenvectors (unique when \(\lambda_k>\lambda_{k+1}\)). The factors themselves are determined only up to an invertible \(C\in\mathbb{R}^{k\times k}\): \(W_e=C U_k^\top\), \(W_d=U_k C^{-1}\). In every case the rows of \(W_e\) span the top-\(k\) principal subspace. Choosing \(C=I\) gives \(W_d=W_e^\top\), and then the representation covariance \(\Sigma_Z=W_e\Sigma_X W_e^\top=\text{diag}(\lambda_1,\dots,\lambda_k)\) carries the top-\(k\) eigenvalues on the diagonal. For general \(C\), \(\Sigma_Z=C\,\text{diag}(\lambda_1,\dots,\lambda_k)\,C^\top\), whose eigenvalues match the top-\(k\) spectrum only when \(C\) is orthogonal. - Proof strategy & techniques: Rewrite the reconstruction objective in trace form, show that the minimizer corresponds to the best rank-\(k\) approximation of \(\Sigma_X\), then invoke the Eckart-Young-Mirsky theorem. Use SVD to align encoder rows with top eigenvectors and show the covariance of \(\mathbf{z}\) inherits the leading spectrum. Techniques include trace identities, SVD, and optimal low-rank approximation. - Computational validation: Compute PCA basis and compare the subspace spanned by the learned encoder rows to it (for example via principal angles); compare reconstruction error to PCA reconstruction. - ML interpretation: Linear autoencoders discover the same variance structure as PCA, meaning their embeddings inherit the data’s spectral decay and anisotropy. This gives a baseline for understanding when nonlinear encoders add value beyond linear variance capture. - Generalization & edge cases: With tied weights \(W_d=W_e^\top\), the optimal encoder has orthonormal rows and the eigenvalue statement holds exactly; with bias terms, centering is required. - Failure mode analysis: Mis-centering data breaks PCA equivalence; nonlinearities invalidate linear proof. 
- Historical context: This result dates to Baldi and Hornik (1989) and is foundational in representation learning. - Traps: Forgetting to center data; assuming nonlinear autoencoders obey PCA equivalence.
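The PCA equivalence in B.4 can be checked numerically without training anything. The sketch below, on assumed synthetic data, builds the orthonormal PCA encoder directly, confirms that its latent covariance is the diagonal matrix of top-\(k\) eigenvalues, and then applies an arbitrary invertible change of basis `C` (an illustrative choice) to show that the reconstruction error is unchanged while the latent covariance is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 6, 2  # assumed toy sizes
X = rng.normal(size=(n, d)) * np.array([3.0, 2.0, 1.0, 0.5, 0.3, 0.1])
X = X - X.mean(axis=0)  # centering is required for the PCA equivalence

Sigma_X = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(Sigma_X)
order = np.argsort(eigvals)[::-1]          # sort eigenvalues in decreasing order
eigvals, U = eigvals[order], eigvecs[:, order]

W_e = U[:, :k].T                           # encoder (k x d), orthonormal rows
W_d = W_e.T                                # decoder (d x k)
Z = X @ W_e.T
Sigma_Z = Z.T @ Z / n
recon_err = np.mean(np.sum((Z @ W_d.T - X) ** 2, axis=1))

# Same product W_d W_e under an invertible change of latent basis.
C = np.array([[2.0, 1.0], [0.0, 1.0]])
W_e2, W_d2 = C @ W_e, W_d @ np.linalg.inv(C)
Z2 = X @ W_e2.T
Sigma_Z2 = Z2.T @ Z2 / n
recon_err2 = np.mean(np.sum((Z2 @ W_d2.T - X) ** 2, axis=1))

print(np.diag(Sigma_Z), eigvals[:k], recon_err, eigvals[k:].sum())
```

A trained linear autoencoder generally lands on some non-orthogonal `C`, which is why comparing subspaces is more reliable than comparing weight matrices entry by entry.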

B.5 Under suitable smoothness and NTK linearization, Fourier modes are learned in a monotone order with low frequencies converging faster than high frequencies. - Full formal proof: Consider a two-layer ReLU network \(f(x,t)=\sum_{m=1}^M a_m(t)\sigma(w_m(t) x)\) trained by gradient descent on squared loss for a target \(y(x)\) with Fourier series \(\sum_k c_k e^{i2\pi k x}\). Let \(r_k(t)\) denote the Fourier coefficient at frequency \(k\) of the residual \(f(\cdot,t)-y\). In the linearized regime, and assuming the NTK is diagonalized by the Fourier basis (for example, a shift-invariant kernel on a periodic domain), the residual modes decouple: \(\dot{r}_k(t)=-\eta \lambda_k r_k(t)\), where \(\lambda_k\) are eigenvalues of the NTK operator, which decay with \(|k|\). Hence \(r_k(t)=r_k(0)\exp(-\eta \lambda_k t)\). Since \(\lambda_k\) decreases with \(|k|\), the residual at low frequencies decays faster, so the network's coefficient at frequency \(k\) approaches \(c_k\) sooner for small \(|k|\). When \(\lambda_k\) is strictly decreasing in \(|k|\), the time needed to reduce the residual by a fixed factor is increasing in \(|k|\), which gives the monotone learning order. - Proof strategy & techniques: Expand the target and model outputs in Fourier series, then linearize training dynamics via the NTK approximation to obtain decoupled ODEs for the residual in each Fourier mode. Compare mode-wise decay rates using kernel eigenvalues and show monotone learning order under smoothness. Techniques include Fourier analysis, kernel eigen-decomposition, and linear ODE solutions. - Computational validation: Track Fourier coefficients of the residual during training and verify exponential decay ordering. - ML interpretation: Training dynamics prioritize low-frequency structure, so early epochs emphasize coarse, generalizable patterns while fine details emerge later. This provides a principled explanation for early stopping and why noisy high-frequency components tend to be learned last. - Generalization & edge cases: Feature learning can alter \(\lambda_k\) ordering; non-smooth targets can violate monotonicity. - Failure mode analysis: Overfitting occurs when late high-frequency modes capture noise. - Historical context: Spectral bias was formalized in work by Rahaman et al. and others in 2019. - Traps: Assuming monotonicity holds outside linearized regime.

B.6 With augmentations sampled from a group with Haar measure, optimal contrastive learning yields invariance to that group in the infinite-data, perfect-optimization limit. - Full formal proof: Let augmentations be drawn from a compact group \(G\) with Haar measure \(\mu\). The contrastive objective aligns representations of augmented pairs \(g\cdot x, h\cdot x\). In the infinite data and perfect optimization limit, the loss is minimized when all augmented views of \(x\) map to the same representation, otherwise alignment term remains positive. Formally, define \(\ell(x)=\int\int \|f(g\cdot x)-f(h\cdot x)\|^2 d\mu(g)d\mu(h)\). Minimum is zero iff \(f(g\cdot x)=f(h\cdot x)\) for all \(g,h\), implying invariance to \(G\). - Proof strategy & techniques: Express the alignment loss as a double integral over the group using Haar measure, then show the minimum is zero and characterize when this is achieved. The proof uses Jensen-style arguments over group averages and a zero-variance condition to infer invariance. Techniques include group integration, Haar measure properties, and variance minimization. - Computational validation: Train with group-based augmentations and measure variance of embeddings over orbits. - ML interpretation: When augmentations are drawn from a group, the model is pushed to identify all orbit elements as equivalent, yielding invariant embeddings. This formalizes how augmentation acts as a geometric regularizer rather than merely increasing data diversity. - Generalization & edge cases: Finite data or imperfect optimization yield approximate invariance; non-group augmentations do not guarantee invariance. - Failure mode analysis: Over-invariance can destroy task-relevant signals. - Historical context: This connects to early invariance learning via data augmentation and group averaging. - Traps: Assuming all augmentations correspond to group actions.

B.7 Zero InfoNCE loss forces injective embeddings on the training set, and margin-based separability imposes a dimension lower bound via packing arguments. - Full formal proof: With finite \(\tau\) and bounded similarities the InfoNCE loss is strictly positive, so "zero loss" is understood as the limit \(\ell_i\to 0\) for every anchor \(i\). Write \(\ell_i=\log\big(1+\sum_{j\neq i}\exp((s_{ij}-s_{i+})/\tau)\big)\). For \(\ell_i\) to be small, \(\exp(s_{i+}/\tau)\) must dominate the denominator and all negatives must have strictly lower similarity than the positive. If two distinct samples \(x_i\neq x_j\) map to the same unit-norm embedding and \(j\) is a negative for anchor \(i\), then \(s_{ij}=1\ge s_{i+}\), so \(\ell_i\ge\log 2\). Any solution with \(\ell_i<\log 2\) for all anchors therefore has no collisions among samples that act as negatives for one another; collisions are possible only when the two samples are a positive pair. Hence embeddings must be injective on the training set. Injectivity on \(n\) points requires embedding dimension \(d\ge \lceil \log_2 n \rceil\) for binary codes, but in continuous space, any \(d\ge 1\) can be injective; however, to separate all pairs with margin, a packing argument yields \(d=\Omega(\log n)\) for fixed margin on the unit sphere, because the spherical caps of fixed angular radius around the embeddings must be disjoint and only exponentially many in \(d\) fit. - Proof strategy & techniques: First, derive strict separation constraints from vanishing InfoNCE loss, then show that identical embeddings violate those constraints, implying injectivity. Next, impose a margin and use spherical packing bounds to translate pairwise separation into a dimension lower bound. Techniques include contradiction via similarity equality, injectivity arguments, and sphere-packing geometry. - Computational validation: Train until near-zero loss, check for collisions in embeddings, and evaluate minimal pairwise distances. - ML interpretation: Driving contrastive loss toward zero implicitly demands injective, well-separated embeddings; achieving this in practice depends on dimensionality and margin constraints dictated by packing on the unit sphere. This highlights why finite-dimensional models cannot perfectly separate arbitrarily many samples. - Generalization & edge cases: Approximate zero loss allows near-collisions; dimension bounds depend on margin and norm constraints. 
- Failure mode analysis: Attempting to drive loss to zero with insufficient dimension can lead to instability. - Historical context: This relates to metric learning separability and spherical coding theory. - Traps: Confusing injectivity with linear separability; ignoring margin requirements.

B.8 For linear encoders with weight decay, the representation covariance eigenvalues follow a differential equation with fixed points controlled by decay. - Full formal proof: For a linear encoder \(\mathbf{z}=W\mathbf{x}\), gradient flow with weight decay is \(\dot{W}=-\nabla_W \mathcal{L}-\lambda W\). Let \(\Sigma_X\) be fixed data covariance, then \(\Sigma_Z=W\Sigma_X W^\top\). Differentiating: \(\dot{\Sigma}_Z=\dot{W}\Sigma_X W^\top+W\Sigma_X \dot{W}^\top\). Under squared reconstruction or alignment losses, \(\nabla_W\mathcal{L}=W\Sigma_X - M\) for some matrix \(M\) depending on objective. Substituting \(\dot{W}=-W\Sigma_X+M-\lambda W\) gives \(\dot{\Sigma}_Z= -2W\Sigma_X^2 W^\top + (M\Sigma_X W^\top + W\Sigma_X M^\top) -2\lambda \Sigma_Z\). For whitened inputs, \(\Sigma_X=I\), this reduces to \(\dot{\Sigma}_Z= -2\Sigma_Z + (M W^\top + W M^\top) -2\lambda \Sigma_Z\). Write \(\sigma_i\) for the eigenvalues of \(\Sigma_Z\) (to avoid a clash with the decay coefficient \(\lambda\)) and \(\mathbf{v}_i\) for the corresponding unit eigenvectors. For simple eigenvalues, \(\dot{\sigma}_i=\mathbf{v}_i^\top\dot{\Sigma}_Z\mathbf{v}_i\), which yields eigenvalue dynamics \(\dot{\sigma}_i = -2\sigma_i + \Delta_i -2\lambda \sigma_i\) with \(\Delta_i=\mathbf{v}_i^\top(M W^\top + W M^\top)\mathbf{v}_i\). Where \(\Delta_i\) is approximately constant, the fixed point is \(\sigma_i^*=\Delta_i/(2(1+\lambda))\), giving convergence to fixed points controlled by \(\lambda\) and \(M\). - Proof strategy & techniques: Differentiate \(\Sigma_Z=W\Sigma_X W^\top\) using product rule, substitute gradient flow for \(\dot{W}\), and then move to the eigenbasis of \(\Sigma_Z\) to obtain scalar ODEs for eigenvalues. Control terms via symmetry to isolate decay and signal contributions. Techniques include matrix calculus, eigen-decomposition, and dynamical systems analysis. - Computational validation: Simulate linear encoder training, compute eigenvalues over time, and fit to ODE predictions. - ML interpretation: Regularization reshapes representation geometry by shrinking every mode toward zero, so directions with a weak signal term \(\Delta_i\) are driven to negligible variance first, effectively flattening spurious components while preserving dominant signal directions. This explains why moderate decay improves generalization but excessive decay can erase subtle yet useful features. - Generalization & edge cases: Nonlinear encoders break closed-form ODEs; non-whitened or time-varying \(\Sigma_X\) adds coupling terms. - Failure mode analysis: Excessive decay can collapse representations; insufficient decay permits overfitting. 
- Historical context: Linear dynamics analyses trace back to Saxe et al. (2013) and deep linear network theory. - Traps: Assuming eigenvalues evolve independently in nonlinear settings.

B.9 Lipschitz encoders bound expected embedding distortion under bounded augmentations, and provide a \(2L\epsilon\) bound for two-view contrastive pairs. - Full formal proof: If \(f\) is \(L\)-Lipschitz, then \(\|f(x+\delta)-f(x)\|\le L\|\delta\|\). With augmentation noise bounded by \(\|\delta\|\le\epsilon\), the expected distortion satisfies \(\mathbb{E}\|f(x+\delta)-f(x)\|\le L\epsilon\). For contrastive objectives comparing two augmentations \(\delta_1,\delta_2\), we have \(\|f(x+\delta_1)-f(x+\delta_2)\|\le L\|\delta_1-\delta_2\|\le 2L\epsilon\). Thus alignment error is bounded by \(2L\epsilon\). - Proof strategy & techniques: Start from the Lipschitz condition, apply it to each augmentation, and use the triangle inequality to relate two augmented views. Then bound augmentation magnitudes by \(\epsilon\) to yield explicit constants. Techniques include norm inequalities, Lipschitz continuity, and worst-case bounding. - Computational validation: Estimate Lipschitz constants via spectral norm bounds and compare predicted distortion to empirical distortions. - ML interpretation: Lipschitz constraints translate directly into bounded embedding drift under augmentation, ensuring that positive pairs remain close and improving robustness. This ties theoretical stability to practical augmentation consistency in contrastive pipelines. - Generalization & edge cases: Bounds are loose for structured augmentations; non-Euclidean input metrics require modified bounds. - Failure mode analysis: Overly tight Lipschitz constraints can reduce representation expressivity. - Historical context: Lipschitz stability is central in robust optimization and certified robustness literature. - Traps: Treating worst-case bounds as typical-case performance.
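The \(2L\epsilon\) bound in B.9 can be checked empirically on a toy encoder. The map below, a coordinate-wise \(L\tanh\), is an assumption chosen because its Lipschitz constant is known exactly; a real encoder would need an estimated constant instead.

```python
import math
import random

L = 2.0  # assumed Lipschitz constant of the toy encoder

def encoder(x):
    """Toy L-Lipschitz map: tanh is 1-Lipschitz and is applied coordinate-wise."""
    return [L * math.tanh(v) for v in x]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def bounded_perturbation(dim, eps, rng):
    """Random direction rescaled to have norm at most eps."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = eps * rng.random() / norm(g)
    return [scale * c for c in g]

rng = random.Random(0)
dim, eps = 16, 0.05
worst_gap = 0.0
for _ in range(1000):
    x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    d1 = bounded_perturbation(dim, eps, rng)
    d2 = bounded_perturbation(dim, eps, rng)
    v1 = encoder([a + b for a, b in zip(x, d1)])
    v2 = encoder([a + b for a, b in zip(x, d2)])
    worst_gap = max(worst_gap, norm([a - b for a, b in zip(v1, v2)]))

print("largest observed two-view gap:", worst_gap, "bound:", 2 * L * eps)
```

The observed gap typically sits well below the bound, which illustrates the trap noted above: the bound is a worst case, not a typical case.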

B.10 Under linear separability assumptions, class-conditional means form a simplex in a \(K-1\) dimensional subspace and between-class covariance reflects this structure. - Full formal proof: Suppose embeddings of class \(c\) have mean \(\mu_c\) and shared within-class covariance. Under linear separability with equal class priors and optimal linear classifier, Fisher LDA implies between-class covariance \(\Sigma_B=\sum_c (\mu_c-\mu)(\mu_c-\mu)^\top\) has rank at most \(K-1\). If class means are symmetric and separable, the optimal arrangement minimizing within-class variance for fixed between-class separation is a regular simplex, where pairwise inner products are constant. Thus \(\{\mu_c\}\) form a simplex in the subspace spanned by \(\Sigma_B\). - Proof strategy & techniques: Use Fisher LDA to express between-class structure, then apply the fact that equal-distance separation in a subspace is achieved by a regular simplex. Show that \(\Sigma_B\) has rank at most \(K-1\) and that symmetry implies constant pairwise inner products. Techniques include LDA derivations, rank arguments, and simplex geometry. - Computational validation: Compute class means and verify constant pairwise cosine similarities and rank \(K-1\). - ML interpretation: Class means behaving like simplex vertices explains why linear probes often work well and why within-class collapse can be detected via between-class spectrum rank. This connects geometric class structure to the neural collapse phenomenon observed in deep classifiers. - Generalization & edge cases: Imbalanced classes yield distorted simplex; non-linear separability breaks symmetry. - Failure mode analysis: Class collapse reduces \(\Sigma_B\) rank and harms linear probe performance. - Historical context: Simplex ETF structure appears in neural collapse literature (Papyan et al., 2020). - Traps: Assuming exact simplex structure without checking class balance and convergence conditions.
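For reference when running the B.10 validation, the target geometry can be generated explicitly. A minimal sketch, assuming the standard construction of a centered regular simplex as the rows \(\mathbf{e}_c-\tfrac{1}{K}\mathbf{1}\): all pairwise cosines equal \(-1/(K-1)\) and the means sum to zero, so they span a \((K-1)\)-dimensional subspace.

```python
import math

def centered_class_means(K):
    """Rows e_c - (1/K) * ones: a regular simplex centered at the origin."""
    return [[(1.0 if i == c else 0.0) - 1.0 / K for i in range(K)]
            for c in range(K)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

K = 4
means = centered_class_means(K)
cosines = [cosine(means[a], means[b])
           for a in range(K) for b in range(a + 1, K)]
centroid = [sum(m[i] for m in means) for i in range(K)]

print(cosines, -1.0 / (K - 1))
```

Empirical class means from a trained model can be centered, normalized, and compared against this reference value to quantify how far the learned geometry is from the simplex.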

B.11 There exists a sequence of unregularized reconstruction solutions with effective rank \(\to 1\) while reconstruction loss approaches the global minimum. - Full formal proof: Consider an autoencoder with encoder \(f\) and decoder \(g\), and a training set of distinct points \(x_1,\dots,x_n\). Fix a unit vector \(u\) and distinct scalars \(\alpha_1,\dots,\alpha_n\), and define the encoder by \(f(x_i)=\alpha_i u\). Every code lies on the line spanned by \(u\), so the representation covariance is \(\Sigma=\text{Var}(\alpha)\,uu^\top\), which is rank-1 with effective rank exactly 1. Because the \(\alpha_i\) are distinct, \(f\) is injective on the training set, so a decoder with enough capacity can memorize the inputs by interpolating \(g(\alpha_i u)=x_i\), which attains zero training reconstruction loss, the global minimum. Note that a constant encoder would not work: if all \(\alpha_i\) coincide, the best decoder outputs the global mean \(\bar{x}\) and the loss equals the data variance. For decoders of limited capacity, take a sequence \(g_m\) of increasing capacity; for any \(\epsilon>0\) there is an \(m\) such that the reconstruction loss is within \(\epsilon\) of the global minimum while \(\Sigma\) stays rank-1. Perturbing the codes by a vanishing off-line component gives encoders of full nominal rank whose effective rank tends to 1. Thus there exists a sequence with effective rank tending to 1 while reconstruction error tends to the global minimum. - Proof strategy & techniques: Explicitly define a rank-1 encoder that remains injective on the training set and show the decoder can approximate an arbitrary mapping along that one-dimensional code by using high-capacity function approximation. Then argue the loss can approach the optimum despite the collapse. Techniques include constructive counterexample, memorization capacity arguments, and limit sequences. - Computational validation: Train an overparameterized decoder with a constrained encoder and observe low rank with low loss. - ML interpretation: Pure reconstruction allows degenerate representations because the decoder can memorize inputs, so explicit diversity or variance regularization is needed to preserve usable geometry. This motivates objectives like VAEs, VICReg, and covariance penalties in self-supervised learning. - Generalization & edge cases: Limited decoder capacity can prevent this pathology; regularizers change the optimum. - Failure mode analysis: Overparameterized decoders enable collapse despite low loss. - Historical context: Collapse in autoencoders has been noted since early representation learning work. 
- Traps: Assuming low loss implies good representation geometry.

B.12 In the NTK regime, the learned feature map stays close to initialization with a bound that shrinks as width increases and learning rate is small. - Full formal proof: In the NTK regime, network output linearizes as \(f_\theta(x)\approx f_{\theta_0}(x)+\nabla_\theta f_{\theta_0}(x)^\top(\theta-\theta_0)\). Under gradient descent with small learning rate and width \(m\to\infty\), parameters move by \(\|\theta-\theta_0\|=O(m^{-1/2})\), implying \(\|\nabla_\theta f_{\theta_0}(x)-\nabla_\theta f_\theta(x)\|\to 0\). Thus the feature map remains close in operator norm to initialization. A bound of the form \(\|\Phi_t-\Phi_0\|\le C\eta t /\sqrt{m}\) follows from smoothness and concentration. - Proof strategy & techniques: Linearize the network around initialization, bound parameter movement under gradient descent, and use concentration results for wide networks to keep the feature map stable. Translate small parameter drift into operator-norm bounds on feature map change. Techniques include NTK linearization, norm bounds on gradient flow, and concentration in the overparameterized limit. - Computational validation: Train wide networks and measure feature map drift via Gram matrix differences. - ML interpretation: When NTK assumptions hold, training mostly adjusts linear readouts on top of a frozen random feature map, limiting feature learning and making performance resemble kernel regression. This clarifies when scaling width improves optimization but not representational richness. - Generalization & edge cases: Finite width or large learning rates break the bound; feature learning regime invalidates linearization. - Failure mode analysis: Over-reliance on NTK assumptions can underpredict feature learning capacity. - Historical context: NTK theory by Jacot et al. (2018) formalized this regime. - Traps: Assuming NTK applies to practical-width networks without verification.

B.13 Under Gaussian class-conditional assumptions, mutual information between embeddings and labels is bounded by a function of the spectrum of \(\Sigma\) and \(\Sigma_B\). - Full formal proof: Assume class-conditional Gaussians \(\mathbf{Z}|Y=c\sim \mathcal{N}(\mu_c,\Sigma)\) with shared invertible covariance \(\Sigma\) and class priors \(\pi_c\). Then \(h(\mathbf{Z}|Y)=\frac{1}{2}\log\big((2\pi e)^d|\Sigma|\big)\). The marginal of \(\mathbf{Z}\) is a Gaussian mixture with covariance \(\Sigma+\Sigma_B\), where \(\Sigma_B=\sum_c \pi_c(\mu_c-\mu)(\mu_c-\mu)^\top\) is between-class covariance. Since the Gaussian maximizes differential entropy at fixed covariance, \(h(\mathbf{Z})\le\frac{1}{2}\log\big((2\pi e)^d|\Sigma+\Sigma_B|\big)\), and therefore \(I(\mathbf{Z};Y)=h(\mathbf{Z})-h(\mathbf{Z}|Y)\le\frac{1}{2}\log \frac{|\Sigma+\Sigma_B|}{|\Sigma|}\). The inequality is generally strict because a mixture is not Gaussian. Using the identity \(|\Sigma+\Sigma_B|= |\Sigma|\prod_i (1+\lambda_i)\) with \(\lambda_i=\lambda_i(\Sigma^{-1}\Sigma_B)\), we obtain \(I(\mathbf{Z};Y)\le \frac{1}{2}\sum_i \log(1+\lambda_i)\), bounded by the spectrum of \(\Sigma\) and \(\Sigma_B\). Because \(\text{rank}(\Sigma_B)\le K-1\), at most \(K-1\) of the \(\lambda_i\) are nonzero, and in any case \(I(\mathbf{Z};Y)\le H(Y)\le\log K\). If the between-class scatter is confined to a few directions in which \(\Sigma\) is not small, only a few \(\lambda_i\) are appreciable and the bound is correspondingly small, limiting mutual information. - Proof strategy & techniques: Bound the mixture entropy by the Gaussian entropy with matching covariance, rewrite the resulting log-determinant ratio using determinant identities, and express it through the generalized eigenvalues of \((\Sigma_B,\Sigma)\). Then interpret the bound via the spectrum of \(\Sigma\) and \(\Sigma_B\). Techniques include Gaussian information identities, the maximum-entropy property of the Gaussian, and matrix determinant lemmas. - Computational validation: Estimate \(\Sigma\) and \(\Sigma_B\) from embeddings and compare empirical \(I\) to spectral bound. - ML interpretation: When the embedding covariance is effectively low-rank, the label information captured by linear probes is bounded, making spectral diagnostics a proxy for expected classification ceiling. This bridges representation geometry with information-theoretic limits on separability. - Generalization & edge cases: Non-Gaussian classes break the closed form for \(h(\mathbf{Z}|Y)\); bounds can be looser. - Failure mode analysis: Over-reducing rank can cap classification performance even with powerful probes. - Historical context: Information bounds under Gaussian assumptions appear in classical information theory and LDA. - Traps: Confusing bounds with exact values; ignoring \(\Sigma_B\) dependence.
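The spectral quantity in B.13 is straightforward to compute from estimated moments. The sketch below uses synthetic class means and a synthetic within-class covariance (all values are assumptions for illustration), computes \(\tfrac{1}{2}\sum_i\log(1+\lambda_i)\) from the generalized eigenvalues via a Cholesky whitening, and checks it against the log-determinant form.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 5  # assumed number of classes and embedding dimension
means = rng.normal(size=(K, d))            # hypothetical class means
A = rng.normal(size=(d, d))
Sigma = A @ A.T + 0.5 * np.eye(d)          # shared within-class covariance
priors = np.full(K, 1.0 / K)

mu = priors @ means
centered = means - mu
Sigma_B = (centered * priors[:, None]).T @ centered  # between-class covariance

# Eigenvalues of Sigma^{-1} Sigma_B via the symmetric form L^{-1} Sigma_B L^{-T}.
Lc = np.linalg.cholesky(Sigma)
M = np.linalg.solve(Lc, np.linalg.solve(Lc, Sigma_B).T)
lams = np.clip(np.linalg.eigvalsh(M), 0.0, None)

bound_spectral = 0.5 * np.sum(np.log1p(lams))
bound_det = 0.5 * (np.linalg.slogdet(Sigma + Sigma_B)[1]
                   - np.linalg.slogdet(Sigma)[1])
print(bound_spectral, bound_det, "label entropy cap:", np.log(K))
```

Since the mutual information can never exceed \(\log K\), the smaller of the spectral value and \(\log K\) is the informative number in practice.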

B.14 Equivariance to a finite group implies a decomposition into irreducible subspaces and a block-diagonal covariance structure in the irrep basis. - Full formal proof: If \(f\) is equivariant to a finite group \(G\) with orthogonal representation \(\rho\), that is \(f(g\cdot x)=\rho(g)f(x)\), then the representation space decomposes into irreducible representations: \(\mathbb{R}^d=\bigoplus_i \mathcal{V}_i\). By Schur’s lemma, any \(G\)-equivariant linear map is block-diagonal across non-isomorphic irreps. Assume the input distribution is \(G\)-invariant, so \(x\) and \(g\cdot x\) have the same law. Then \(\Sigma=\text{Cov}(f(g\cdot x))=\rho(g)\Sigma\rho(g)^\top\), so the covariance \(\Sigma\) commutes with \(\rho(g)\) for all \(g\), hence by Schur’s lemma \(\Sigma\) has no cross terms between non-isomorphic irreps. On an irrep that occurs once and is absolutely irreducible, the block is a scalar multiple of the identity; when an irrep occurs with multiplicity \(m\), the corresponding block has the form \(A\otimes I\) with \(A\) an \(m\times m\) symmetric matrix. Thus equivariance constrains covariance structure. - Proof strategy & techniques: Decompose the representation space into irreducible components, then apply Schur’s lemma to any operator commuting with the group action. Use commutativity of \(\Sigma\) with \(\rho(g)\) to derive block-diagonal structure. Techniques include representation decomposition, Schur’s lemma, and commutant analysis. - Computational validation: Compute \(\Sigma\) and verify block-diagonal structure in the irrep basis. - ML interpretation: Group equivariance forces representations to decompose into structured subspaces, which stabilizes learning and yields more interpretable feature channels aligned with symmetry types. This is why equivariant models often generalize better with fewer samples on symmetric tasks. - Generalization & edge cases: Approximate equivariance yields approximate block structure; continuous groups require integration; a data distribution that is not \(G\)-invariant breaks the commutation argument. - Failure mode analysis: If the learned representation violates equivariance, covariance structure is inconsistent, indicating training failure. - Historical context: Group representation theory underlies modern equivariant neural networks. - Traps: Ignoring the need to transform into the irrep basis to see block structure.

B.15 Large-batch contrastive uniformity induces a lower bound on minimum pairwise distance on the unit sphere, equivalent to a spherical code packing constraint. - Full formal proof: Uniformity term in contrastive learning penalizes small pairwise distances on the unit sphere. Let \(\{\mathbf{z}_i\}_{i=1}^n\subset S^{d-1}\) and \(\delta=\min_{i\neq j}\|\mathbf{z}_i-\mathbf{z}_j\|\). Two bounds act on \(\delta\) from opposite sides. First, a small uniformity loss certifies a floor on \(\delta\). For instance, with the Gaussian-potential uniformity loss \(\mathcal{L}_{\text{unif}}=\log\big(\frac{1}{n(n-1)}\sum_{i\neq j}\exp(-t\|\mathbf{z}_i-\mathbf{z}_j\|^2)\big)\), the closest pair contributes \(2\exp(-t\delta^2)\) to the sum, so \(\exp(-t\delta^2)\le \frac{n(n-1)}{2}e^{\mathcal{L}_{\text{unif}}}\) and \(\delta^2\ge -\frac{1}{t}\big(\mathcal{L}_{\text{unif}}+\log\frac{n(n-1)}{2}\big)\), which is informative whenever the right-hand side is positive. Second, by spherical code bounds, \(n\le A(d,\delta)\) where \(A\) is the maximal size of a code with minimum distance \(\delta\); since \(A\) is nonincreasing in \(\delta\), this caps the minimum distance that any \(n\) points in dimension \(d\) can achieve. Thus the uniformity term implies a lower bound on \(\delta\) in terms of the achieved loss, while the packing constraint limits how large that bound can be for given \(d\) and \(n\). Hence minimum pairwise distance is bounded below by the achieved uniformity and above by the code bound. - Proof strategy & techniques: Interpret the uniformity objective as encouraging large minimum pairwise distance, then map the problem to a spherical code packing bound. Use known \(A(d,\delta)\) inequalities to relate achievable uniformity to distance bounds. Techniques include geometric packing arguments and spherical code theory. - Computational validation: Compute minimum pairwise distances and compare to theoretical packing bounds for given \(d,n\). - ML interpretation: Uniformity acts like a packing constraint on the unit sphere, encouraging embeddings to occupy space evenly and preventing collapse. This links contrastive objectives to classical coding theory and explains observed limits on diversity at fixed dimension. - Generalization & edge cases: For non-unit norms, scaling alters distance bounds; finite batch uniformity may not enforce global packing. - Failure mode analysis: Excessive uniformity can over-separate and hurt alignment. - Historical context: Spherical codes and equiangular tight frames appear in representation analysis and neural collapse theory. - Traps: Assuming uniformity guarantees optimal packing rather than lower bounds.

B.16 A power-law spectrum with exponent \(\alpha>1\) yields sublinear growth of effective rank in dimension, with rate determined by \(\alpha\). - Full formal proof: If \(\lambda_i\propto i^{-\alpha}\) with \(\alpha>1\), then \(\sum_{i=1}^d \lambda_i\) converges as \(d\to\infty\), while \(\sum_i \lambda_i^2\) converges for \(\alpha>1/2\). Effective rank \(r_{\text{eff}}=(\sum_i \lambda_i)^2/(\sum_i \lambda_i^2)\) is therefore \(O(1)\): it increases with \(d\) but converges to the finite limit \(\zeta(\alpha)^2/\zeta(2\alpha)\), where \(\zeta\) is the Riemann zeta function. For example, \(\alpha=2\) gives \(\zeta(2)^2/\zeta(4)=5/2\). The limit decreases toward 1 as \(\alpha\) increases. Thus effective rank grows sublinearly with \(d\) (in fact it saturates), and the saturation level is lower as \(\alpha\) increases. - Proof strategy & techniques: Approximate sums by integrals for power-law sequences, determine convergence rates for \(\sum \lambda_i\) and \(\sum \lambda_i^2\), and plug into the effective-rank formula. Techniques include asymptotic series analysis, integral tests, and scaling arguments. - Computational validation: Fit eigenvalue spectra and compute effective rank versus \(d\) to verify sublinear growth. - ML interpretation: A rapidly decaying power-law spectrum means most variance lies in a small number of directions, so nominal embedding size overestimates effective capacity. This explains why many embeddings can be compressed with little performance loss in transfer tasks. - Generalization & edge cases: For \(\alpha\le1\), the trace diverges and effective rank grows without bound: like \((\log d)^2\) at \(\alpha=1\), like \(d^{2-2\alpha}\) for \(1/2<\alpha<1\) (still sublinear), and linearly in \(d\) for \(\alpha<1/2\); finite-sample noise alters tail. - Failure mode analysis: Misestimating \(\alpha\) can lead to incorrect conclusions about representation capacity. - Historical context: Power-law spectra in neural representations are widely reported in empirical studies. - Traps: Treating apparent power-law over a narrow range as global behavior.
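The effective-rank behavior in B.16 is easy to tabulate directly. The sketch below uses exact power-law eigenvalues \(\lambda_i=i^{-\alpha}\) (an idealization; fitted spectra will be noisier) and the participation-ratio definition of effective rank. For \(\alpha=2\) both sums converge and the ratio tends to \(\zeta(2)^2/\zeta(4)=5/2\), while an exponent below 1 such as \(\alpha=0.75\) keeps growing with \(d\).

```python
def power_law_spectrum(d, alpha):
    """Eigenvalues lambda_i = i^(-alpha) for i = 1..d."""
    return [i ** (-alpha) for i in range(1, d + 1)]

def effective_rank(eigs):
    """Participation-ratio effective rank: (sum of eigs)^2 / (sum of squares)."""
    total = sum(eigs)
    return total * total / sum(e * e for e in eigs)

for d in (10, 100, 1000, 10000):
    fast = effective_rank(power_law_spectrum(d, 2.0))    # saturating regime
    slow = effective_rank(power_law_spectrum(d, 0.75))   # growing regime
    print(d, round(fast, 4), round(slow, 2))
```

Plotting the same quantity for a measured spectrum against these reference curves gives a quick visual estimate of which regime an embedding falls in.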

B.17 A Jacobian norm bound yields worst-case adversarial stability and dimension-dependent average-case stability under Gaussian noise. - Full formal proof: If \(\|J_f(x)\|\le L\), then \(\|f(x+\delta)-f(x)\|\le L\|\delta\|\). For adversarial perturbations with \(\|\delta\|\le\epsilon\), the worst-case embedding change is \(\le L\epsilon\). For Gaussian noise \(\delta\sim\mathcal{N}(0,\sigma^2 I)\), \(\mathbb{E}\|\delta\|\approx \sigma\sqrt{d}\), so expected embedding change is \(\le L\sigma\sqrt{d}\). Thus adversarial and random noise bounds differ in scale, with random noise depending on dimension. - Proof strategy & techniques: Use the operator norm bound on the Jacobian to obtain worst-case adversarial guarantees, then compute the expected \(\ell_2\) norm of Gaussian noise to derive average-case behavior. Techniques include Jacobian norm inequalities, Gaussian concentration, and expectation bounds. - Computational validation: Estimate Jacobian norms and compare predicted robustness to empirical perturbation experiments. - ML interpretation: Worst-case robustness and average-case noise stability scale differently, so a model can be stable to random noise yet vulnerable to adversarial perturbations. This distinction is crucial for evaluating robustness beyond simple perturbation tests. - Generalization & edge cases: Local Lipschitz constants vary across inputs; high-curvature regions violate uniform bounds. - Failure mode analysis: Overly tight Lipschitz constraints can reduce model expressivity; loose bounds can give false security. - Historical context: Robustness bounds via Jacobian norms appear in adversarial defense literature. - Traps: Confusing local and global Lipschitz constants; ignoring dimension scaling in random noise.
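The \(\sqrt{d}\) factor in the Gaussian case of B.17 can be seen with a short Monte Carlo estimate. The values of `L`, `eps`, `sigma`, and `d` below are arbitrary assumptions; the point is only that, at equal per-coordinate scale, the expected noise norm is about \(\sigma\sqrt{d}\), so the average-case bound is larger than the worst-case bound at radius \(\epsilon=\sigma\) by roughly that factor.

```python
import math
import random

def mean_gaussian_norm(d, sigma, trials, seed=0):
    """Monte Carlo estimate of E||delta|| for delta ~ N(0, sigma^2 I_d)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(d)))
    return total / trials

L, eps, sigma, d = 1.5, 0.1, 0.1, 100       # assumed illustrative values
worst_case = L * eps                         # adversarial bound, ||delta|| <= eps
average_case = L * sigma * math.sqrt(d)      # bound on expected change under noise
empirical = mean_gaussian_norm(d, sigma, trials=2000)

print("E||delta|| estimate:", empirical, "sigma*sqrt(d):", sigma * math.sqrt(d))
print("worst-case bound:", worst_case, "average-case bound:", average_case)
```

Both numbers are upper bounds on embedding change, so comparing them says how much input perturbation each threat model permits, not which one the model is actually more sensitive to.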

B.18 In a two-layer linear network with weight decay, representation eigenvalues exhibit exponential shrinkage, with weakly driven modes decaying while strongly driven modes are sustained by the signal. - Full formal proof: For a two-layer linear network \(f(x)=W_2 W_1 x\) trained with squared loss and weight decay \(\lambda\), gradient flow yields \(\dot{W}_1=-\nabla_{W_1}\mathcal{L}-\lambda W_1\), \(\dot{W}_2=-\nabla_{W_2}\mathcal{L}-\lambda W_2\). By the product rule, \(W=W_2 W_1\) evolves as \(\dot{W}=\dot{W}_2W_1+W_2\dot{W}_1=-(\nabla_W \mathcal{L})W_1^\top W_1-W_2W_2^\top(\nabla_W \mathcal{L})-2\lambda W\), using \(\nabla_{W_2}\mathcal{L}=(\nabla_W\mathcal{L})W_1^\top\) and \(\nabla_{W_1}\mathcal{L}=W_2^\top\nabla_W\mathcal{L}\). The decay contribution to \(\dot{W}\) is exactly \(-2\lambda W\). If \(\Sigma_Z=W\Sigma_X W^\top\) has eigenvalues \(\lambda_i\) (not to be confused with the decay coefficient \(\lambda\)), then, because \(\Sigma_Z\) is quadratic in \(W\), \(\dot{\lambda}_i=-4\lambda\lambda_i+\text{signal term}\). Every mode has the same relative decay rate, so modes whose signal term is small are not replenished and shrink exponentially. Solving \(\dot{\lambda}_i=-4\lambda\lambda_i\) gives \(\lambda_i(t)=\lambda_i(0)\exp(-4\lambda t)\), and signal terms preserve larger modes, separating the spectrum. - Proof strategy & techniques: Write gradient flow for each layer, derive the effective dynamics for the product matrix, and use eigenvalue evolution under linear decay to show exponential shrinkage. Techniques include linear dynamical systems, eigenvalue decay analysis, and separation of signal and regularization terms. - Computational validation: Train linear networks with decay, track eigenvalues, and fit exponential decay rates. - ML interpretation: Regularization accentuates dominant representation directions while damping weak ones, yielding a more pronounced spectral hierarchy. This can improve generalization but risks eliminating fine-grained features needed for downstream tasks. - Generalization & edge cases: Nonlinearities break closed-form; with strong signal, small modes can persist. - Failure mode analysis: Excessive decay erases useful low-variance features, harming performance. - Historical context: Deep linear network dynamics have been analyzed extensively since the 1990s. - Traps: Assuming exponential shrinkage implies collapse in all cases.
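
The separation between sustained and suppressed modes in B.18 can be illustrated with a small simulation. This is a sketch under strong assumptions that are not part of the original statement: whitened inputs (so the squared loss reduces to \(\tfrac12\|W_2W_1-A\|_F^2\)), a hypothetical diagonal teacher `A`, plain gradient descent standing in for gradient flow, and illustrative values for `lam`, `lr`, and `steps`.

```python
import numpy as np

rng = np.random.default_rng(0)
target_sv = np.array([3.0, 1.0, 0.05])    # teacher singular values (illustrative)
A = np.diag(target_sv)                    # teacher map under whitened inputs
lam, lr, steps = 0.1, 0.05, 20000         # weight decay, step size, iterations

W1 = 0.1 * rng.normal(size=(3, 3))        # small random initialization
W2 = 0.1 * rng.normal(size=(3, 3))
for _ in range(steps):
    G = W2 @ W1 - A                       # gradient of 0.5*||W2 W1 - A||_F^2 w.r.t. the product
    gW1 = W2.T @ G + lam * W1             # layer gradients plus weight decay
    gW2 = G @ W1.T + lam * W2
    W1 -= lr * gW1
    W2 -= lr * gW2

learned_sv = np.linalg.svd(W2 @ W1, compute_uv=False)
print("teacher singular values:", target_sv)
print("learned singular values:", np.round(learned_sv, 3))
```

In this setup the two strong modes should survive, each reduced by roughly `lam`, while the mode whose signal (0.05) is weaker than the decay is driven to approximately zero, which is the spectral separation the proof describes.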

B.19 Alignment-only contrastive objectives admit a collapsed constant representation as a global minimizer, forming a collapsed manifold. - Full formal proof: Consider contrastive objective with alignment only: \(\mathcal{L}=\mathbb{E}\|f(x)-f(x^+)\|^2\). Let \(f(x)=c\) for all \(x\). Then \(\mathcal{L}=0\), which is the global minimum since the loss is nonnegative. The set of collapsed solutions is the manifold \(\{f: f(x)=c,\ c\in\mathbb{R}^d\}\). Thus a collapsed solution is a global minimizer. - Proof strategy & techniques: Provide a constant-function construction that achieves zero loss, then use nonnegativity of the squared loss to show global optimality. Techniques include explicit construction, lower-bound argument, and characterization of the minimizer set. - Computational validation: Train a model with alignment-only loss and verify convergence to near-constant embeddings. - ML interpretation: Alignment alone creates a degenerate optimum where all embeddings coincide, so diversity constraints are essential for meaningful geometry. This explains the need for negatives, variance regularization, or stop-gradient tricks in modern SSL methods. - Generalization & edge cases: If regularizers or architectural constraints prevent constants, collapse may be approximate rather than exact. - Failure mode analysis: Collapse yields high training objective performance but useless embeddings. - Historical context: Collapse in self-supervised learning motivated the development of negative sampling and variance regularizers. - Traps: Assuming alignment implies meaningful representation without uniformity.
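
The collapse in B.19 is easy to reproduce numerically. The sketch below assumes a bias-free linear encoder \(f(x)=Wx\) and additive-noise "augmentations", both chosen only to keep the example short; with these choices the constant that the embeddings collapse to is the origin.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_emb = 512, 20, 8
X = rng.normal(size=(n, d_in))
X_pos = X + 0.3 * rng.normal(size=(n, d_in))   # stand-in for augmented positives
D = X - X_pos                                  # differences between positive pairs

W = rng.normal(size=(d_emb, d_in)) / np.sqrt(d_in)   # linear encoder f(x) = W x

def alignment_loss(W):
    # Mean squared distance between embeddings of positive pairs.
    return np.mean(np.sum((D @ W.T) ** 2, axis=1))

def embedding_variance(W):
    # Total variance of the embeddings across the dataset.
    return np.sum(np.var(X @ W.T, axis=0))

loss_start, var_start = alignment_loss(W), embedding_variance(W)
for _ in range(300):
    grad = 2.0 * W @ (D.T @ D) / n             # gradient of the alignment loss w.r.t. W
    W = W - 0.5 * grad
loss_end, var_end = alignment_loss(W), embedding_variance(W)

print("loss:", loss_start, "->", loss_end)
print("embedding variance:", var_start, "->", var_end)
```

Both the loss and the embedding variance shrink toward zero together: the objective is minimized by discarding all information about the input, which is why a diversity term is needed.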

B.20 Under mixing and Lipschitz assumptions, token representations in masked language models are asymptotically invariant to bounded context perturbations, yielding geometric clustering. - Full formal proof: Let a masked language model produce token representation \(z_t\) from context \(C\). Under mixing assumptions, perturbing a bounded number of context tokens changes the conditional distribution by at most \(\epsilon\) in total variation. If the representation map is Lipschitz in the context embedding space with constant \(L\), then \(\|z_t(C)-z_t(C')\|\le L\|\phi(C)-\phi(C')\|\). The bounded perturbation implies \(\|\phi(C)-\phi(C')\|\le \delta\), hence \(\|z_t(C)-z_t(C')\|\le L\delta\). As model size and data scale increase, mixing improves and \(\delta\to 0\) for bounded perturbations, yielding asymptotic invariance. Clustered embeddings arise because contexts within the same equivalence class under bounded perturbations map to nearby points. - Proof strategy & techniques: Formalize a mixing assumption to bound distributional shift under bounded perturbations, then invoke Lipschitz continuity to transfer this bound to representation space. Use equivalence-class reasoning to connect stability to clustering. Techniques include mixing bounds, Lipschitz stability, and perturbation analysis. - Computational validation: Perturb a bounded number of context tokens and measure embedding shift distributions; verify shrinkage with larger models. - ML interpretation: Contextual embeddings become locally stable under bounded context perturbations, which supports semantic consistency and robust retrieval in language tasks. However, this stability is contingent on perturbations that do not alter meaning, highlighting a balance between invariance and sensitivity. - Generalization & edge cases: For rare tokens or highly sensitive contexts (e.g., negation), invariance may fail; bounded perturbations that alter meaning violate assumptions. 
- Failure mode analysis: Over-invariance can cause insensitivity to critical context changes, harming reasoning. - Historical context: Contextual embedding stability has been studied since early BERT analyses and probing work. - Traps: Assuming invariance holds for semantically significant perturbations; ignoring context length effects.

Solutions to C. Python Exercises

C.1 - Load CNN Representation Spectrum and Effective Rank - Code:

import numpy as np

rng = np.random.default_rng(7)

def simulate(d, n, noise=0.1, trials=20):
    train_err = []
    test_err = []
    for _ in range(trials):
        beta = np.zeros(d)
        beta[:10] = rng.normal(size=10)
        X = rng.normal(size=(n, d))
        y = X @ beta + noise * rng.normal(size=n)
        X_test = rng.normal(size=(n, d))
        y_test = X_test @ beta + noise * rng.normal(size=n)
        if d <= n:
            theta = np.linalg.lstsq(X, y, rcond=None)[0]
        else:
            theta = X.T @ np.linalg.solve(X @ X.T, y)
        train_err.append(np.mean((X @ theta - y) ** 2))
        test_err.append(np.mean((X_test @ theta - y_test) ** 2))
    return np.mean(train_err), np.mean(test_err)

ratios = [0.5, 1.0, 2.0, 5.0]
n = 200
results = []
for r in ratios:
    d = int(r * n)
    tr, te = simulate(d, n, noise=0.1)
    results.append((d, tr, te))

print(results)

C.2 - Implement Representation Collapse Detector - Code:

import numpy as np

rng = np.random.default_rng(11)
n, d = 200, 50
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.1 * rng.normal(size=n)

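def fit_min_norm(X, y):
    # lstsq returns the minimum-norm least-squares solution in both regimes.
    # The closed form X.T @ solve(X @ X.T, y) needs X @ X.T to be invertible
    # (d >= n); here n > d, so that matrix is singular.
    return np.linalg.lstsq(X, y, rcond=None)[0]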

theta_raw = fit_min_norm(X, y)
X_scaled = X * np.linspace(0.5, 2.0, d)
theta_scaled = fit_min_norm(X_scaled, y)

X_test = rng.normal(size=(n, d))
pred_raw = X_test @ theta_raw
pred_scaled = (X_test * np.linspace(0.5, 2.0, d)) @ theta_scaled
print(np.mean((pred_raw - pred_scaled) ** 2))

C.3 - Compare Geometry Across Augmentation Pipelines - Code:

import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.normal(size=(n, 2))
y = (x[:, 0] > 0).astype(int)
y_spurious = (x[:, 1] > 0).astype(int)
y_train = np.where(rng.random(n) < 0.9, y_spurious, y)

def train_lr(X, y, lr=0.1, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = X @ w
        p = 1 / (1 + np.exp(-z))
        grad = X.T @ (p - y) / len(y)
        w -= lr * grad
    return w

w_fast = train_lr(x, y_train, lr=0.5, steps=200)
w_slow = train_lr(x, y_train, lr=0.05, steps=2000)

x_test = rng.normal(size=(n, 2))
y_test = (x_test[:, 0] > 0).astype(int)
acc_fast = np.mean((x_test @ w_fast > 0) == y_test)
acc_slow = np.mean((x_test @ w_slow > 0) == y_test)
print(acc_fast, acc_slow)

C.4 - Recover Low-Dimensional Subspace with Linear Autoencoder - Code:

import numpy as np

rng = np.random.default_rng(5)
n = 150
X = rng.normal(size=(n, 1))
def build_features(x, m):
    return np.hstack([x ** i for i in range(1, m + 1)])

y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)

degrees = [1, 3, 5, 10, 20, 40]
errs = []
for m in degrees:
    Phi = build_features(X, m)
    theta = np.linalg.lstsq(Phi, y, rcond=None)[0]
    y_hat = Phi @ theta
    errs.append(np.mean((y_hat - y) ** 2))

print(list(zip(degrees, errs)))

C.5 - Analyze Temperature Effects in Contrastive Separation - Code:

import numpy as np

rng = np.random.default_rng(9)
n, d = 200, 1000
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.05 * rng.normal(size=n)

theta_gd = X.T @ np.linalg.solve(X @ X.T, y)

theta_rand = rng.normal(size=d)
theta_rand = theta_rand - X.T @ np.linalg.solve(X @ X.T, X @ theta_rand - y)

X_test = rng.normal(size=(n, d))
y_test = X_test @ beta + 0.05 * rng.normal(size=n)
err_gd = np.mean((X_test @ theta_gd - y_test) ** 2)
err_rand = np.mean((X_test @ theta_rand - y_test) ** 2)
print(err_gd, err_rand)

C.6 - Estimate Invariance Under Group Transformations - Code:

import numpy as np

rng = np.random.default_rng(13)
n, d, h = 200, 5, 50
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, h))
W2 = rng.normal(size=(h, 1))
Y = np.maximum(0, X @ W1) @ W2

def forward(X, W1, W2):
    return np.maximum(0, X @ W1) @ W2

W1_alt = W1.copy()
W2_alt = W2.copy()
perm = rng.permutation(h)
W1_alt = W1_alt[:, perm]
W2_alt = W2_alt[perm]

diff = np.mean((forward(X, W1, W2) - forward(X, W1_alt, W2_alt)) ** 2)
print(diff)

C.7 - Probe Spectral Bias with Fourier Tracking - Code:

import numpy as np

rng = np.random.default_rng(21)
n_full, d = 5000, 50
X_full = rng.normal(size=(n_full, d))
beta = rng.normal(size=d)
y_full = X_full @ beta + 0.1 * rng.normal(size=n_full)

def fit_mse(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

sizes = [200, 500, 1000, 2000, 5000]
errs = []
X_test = rng.normal(size=(1000, d))
y_test = X_test @ beta + 0.1 * rng.normal(size=1000)
for n in sizes:
    X = X_full[:n]
    y = y_full[:n]
    theta = fit_mse(X, y)
    errs.append(np.mean((X_test @ theta - y_test) ** 2))

print(list(zip(sizes, errs)))

C.8 - Evaluate Weight Decay Effects on Representation Spectra - Code:

import numpy as np

rng = np.random.default_rng(19)
def train_budget(d, n=500, steps=200, lr=0.1):
    X = rng.normal(size=(n, d))
    beta = rng.normal(size=d)
    y = X @ beta + 0.1 * rng.normal(size=n)
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w -= lr * grad
    X_test = rng.normal(size=(n, d))
    y_test = X_test @ beta + 0.1 * rng.normal(size=n)
    return np.mean((X_test @ w - y_test) ** 2)

for d in [50, 200, 800]:
    print(d, train_budget(d))

C.9 - Measure Embedding Stability Under Perturbations - Code:

import numpy as np

rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 2))
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)

X_shift = X.copy()
X_shift[:, 0] += 1.0
y_shift = (X_shift[:, 0] + 0.1 * X_shift[:, 1] > 0).astype(int)

def train_perceptron(X, y, steps=5):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        for i in range(len(y)):
            if (X[i] @ w > 0) != bool(y[i]):
                w += (1 if y[i] else -1) * X[i]
    return w

w = train_perceptron(X, y)
acc_train = np.mean((X @ w > 0) == y)
acc_shift = np.mean((X_shift @ w > 0) == y_shift)
print(acc_train, acc_shift)

C.10 - Analyze Class-Conditional Means and Between-Class Covariance - Code:

import numpy as np

rng = np.random.default_rng(8)
n, d = 200, 20
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.1 * rng.normal(size=n)

def hessian_approx(X):
    return (X.T @ X) / len(X)

H = hessian_approx(X)
eigvals = np.linalg.eigvalsh(H)
print(eigvals[0], eigvals[-1], eigvals[-1] / eigvals[0])

C.11 - Detect VAE Degeneracy via Latent Density - Code:

import numpy as np

rng = np.random.default_rng(4)
n, d = 300, 50
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.1 * rng.normal(size=n)

def train_gd(lam=0.0, steps=200, lr=0.1):
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n + lam * w
        w -= lr * grad
    return w

w_l2 = train_gd(lam=0.1, steps=200)
w_stop = train_gd(lam=0.0, steps=40)

X_test = rng.normal(size=(n, d))
y_test = X_test @ beta + 0.1 * rng.normal(size=n)
err_l2 = np.mean((X_test @ w_l2 - y_test) ** 2)
err_stop = np.mean((X_test @ w_stop - y_test) ** 2)
print(err_l2, err_stop)

C.12 - Compare Width Regimes by Rank Generalization and Conditioning - Code:

import numpy as np

rng = np.random.default_rng(6)
n, d = 200, 100
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.1 * rng.normal(size=n)

theta = np.linalg.lstsq(X, y, rcond=None)[0]
mask = np.abs(theta) > np.percentile(np.abs(theta), 70)
theta_pruned = theta * mask

X_test = rng.normal(size=(n, d))
y_test = X_test @ beta + 0.1 * rng.normal(size=n)
err_full = np.mean((X_test @ theta - y_test) ** 2)
err_pruned = np.mean((X_test @ theta_pruned - y_test) ** 2)
print(err_full, err_pruned)

C.13 - Track Alignment-Uniformity Trajectories During Training - Code:

import numpy as np

rng = np.random.default_rng(10)
n = 200
X = rng.integers(0, 2, size=(n, 2))
y = (X[:, 0] ^ X[:, 1]).astype(int)

def fit_linear(X, y):
    X1 = np.hstack([X, np.ones((len(X), 1))])
    w = np.linalg.lstsq(X1, y, rcond=None)[0]
    return w

w = fit_linear(X, y)
pred = (np.hstack([X, np.ones((n, 1))]) @ w > 0.5).astype(int)
print(np.mean(pred == y))

C.14 - Measure Equivariance Error Under Group Actions - Code:

import numpy as np

rng = np.random.default_rng(14)
n = 500
X = rng.normal(size=(n, 3))
group = (X[:, 2] > 0).astype(int)
y = (X[:, 0] + 0.5 * group > 0).astype(int)

def train_with_penalty(X, y, group, lam=0.0, steps=500, lr=0.1):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = X @ w
        p = 1 / (1 + np.exp(-z))
        grad = X.T @ (p - y) / len(y)
        gap = np.mean(p[group == 1]) - np.mean(p[group == 0])
        group_shift = np.mean(X[group == 1], axis=0) - np.mean(X[group == 0], axis=0)
        grad += lam * gap * group_shift
        w -= lr * grad
    return w

w0 = train_with_penalty(X, y, group, lam=0.0)
w1 = train_with_penalty(X, y, group, lam=1.0)
print(np.linalg.norm(w0 - w1))

C.15 - Analyze Cosine Similarity Histograms of Embeddings - Code:

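import numpy as np

sizes = np.array([50, 100, 200, 400, 800], dtype=float)
losses = np.array([0.8, 0.6, 0.48, 0.40, 0.36])

# Use the smallest observed loss as a crude estimate of the irreducible floor.
excess = losses - losses.min()
# The point that defines the floor has zero excess by construction; drop it
# instead of taking the log of a tiny epsilon, which would dominate the fit.
keep = excess > 0
coef = np.polyfit(np.log(sizes[keep]), np.log(excess[keep]), 1)
print(coef)  # [slope, intercept] of the log-log fit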

C.16 - Measure Effective Rank Under Label Noise - Code:

import numpy as np

rng = np.random.default_rng(15)
n = 400
X = rng.normal(size=(n, 2))
y = (X[:, 0] > 0).astype(int)
amb = np.abs(X[:, 0]) < 0.1
y[amb] = rng.integers(0, 2, size=np.sum(amb))

def train_lr(X, y, seed):
    rng_local = np.random.default_rng(seed)
    w = rng_local.normal(size=2) * 0.1
    for _ in range(500):
        p = 1 / (1 + np.exp(-X @ w))
        w -= 0.1 * (X.T @ (p - y) / len(y))
    return w

w1 = train_lr(X, y, 1)
w2 = train_lr(X, y, 2)
pred_disagree = np.mean((X @ w1 > 0) != (X @ w2 > 0))
print(pred_disagree)

C.17 - Compare CNN and ViT Representation Geometry - Code:

import numpy as np

rng = np.random.default_rng(17)
def train_model(d, steps):
    n = 500
    X = rng.normal(size=(n, d))
    beta = rng.normal(size=d)
    y = X @ beta + 0.1 * rng.normal(size=n)
    w = np.zeros(d)
    for _ in range(steps):
        w -= 0.1 * (X.T @ (X @ w - y) / n)
    X_test = rng.normal(size=(n, d))
    y_test = X_test @ beta + 0.1 * rng.normal(size=n)
    return np.mean((X_test @ w - y_test) ** 2)

err_small = train_model(50, steps=1000)
err_large = train_model(500, steps=200)
print(err_small, err_large)

C.18 - Stress-Test Collapse with Batch Size and Augmentation - Code:

import numpy as np

rng = np.random.default_rng(18)
n = 200
ratios = [0.5, 1.0, 2.0, 4.0]
best_lams = []
for r in ratios:
    d = int(r * n)
    X = rng.normal(size=(n, d))
    beta = rng.normal(size=d)
    y = X @ beta + 0.1 * rng.normal(size=n)
    X_test = rng.normal(size=(n, d))
    y_test = X_test @ beta + 0.1 * rng.normal(size=n)
    lams = np.logspace(-4, 1, 10)
    errs = []
    for lam in lams:
        theta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        errs.append(np.mean((X_test @ theta - y_test) ** 2))
    best_lams.append(lams[int(np.argmin(errs))])
print(best_lams)

C.19 - Test Representation Distance Preservation - Code:

import numpy as np

rng = np.random.default_rng(20)
n = 1000
X = rng.normal(size=(n, 2))
spurious = (rng.random(n) < 0.9).astype(int)
y = (X[:, 0] > 0).astype(int)
y_train = np.where(spurious == 1, y, 1 - y)

def train_lr(X, y, steps=500, lr=0.1):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y) / len(y))
    return w

w = train_lr(X, y_train)
X_test = rng.normal(size=(n, 2))
y_test = (X_test[:, 0] > 0).astype(int)
acc = np.mean((X_test @ w > 0) == y_test)
print(acc)

C.20 - Build Integrated Representation Monitoring Report - Code:

import numpy as np

rng = np.random.default_rng(22)
n = 300
def generate(t):
    shift = 0.0 if t < 5 else 1.0
    X = rng.normal(size=(n, 2))
    X[:, 0] += shift
    y = (X[:, 0] > 0).astype(int)
    return X, y

def train_simple(X, y):
    w = np.zeros(2)
    for _ in range(200):
        p = 1 / (1 + np.exp(-X @ w))
        w -= 0.1 * (X.T @ (p - y) / len(y))
    return w

X0, y0 = generate(0)
w = train_simple(X0, y0)

for t in range(10):
    Xt, yt = generate(t)
    acc = np.mean((Xt @ w > 0) == yt)
    print(t, acc)

End of C Solutions

Appendices

In Context

Algorithmic Development History

The mathematical framework in this chapter rests on decades of development in representation learning. Understanding this history illuminates why certain geometric concepts appear natural and others required significant insight.

Classical Era (1950s-2000s): Dimensionality Reduction and Feature Engineering. Before neural networks dominated, machine learning relied largely on engineered features: image features like SIFT (Scale-Invariant Feature Transform) or HOG (Histograms of Oriented Gradients), NLP features like bag-of-words or TF-IDF, audio features like mel-frequency cepstral coefficients. These features were human-designed to capture task-relevant information while ignoring nuisance variations (scale, rotation, minor shifts). The geometric insight was that high-dimensional raw data (pixels, audio samples, words) could be projected to lower-dimensional feature spaces where the task geometry became clearer: in SIFT space, images of the same object at different scales, rotations, and viewing angles produced similar feature vectors, enabling classification.

Classical dimensionality reduction (PCA, multidimensional scaling, Isomap) formalized the idea that data lives on lower-dimensional manifolds embedded in high-dimensional spaces. These methods sought representations that preserved variance (PCA), pairwise distances (multidimensional scaling), or geodesic distances along the manifold (Isomap), on the premise that such projections retain task-relevant information. The representation consisted of coordinates in the discovered low-dimensional space.

The limitation was human effort: engineering good features for new tasks required domain expertise, and many task-specific patterns were difficult to capture without deep domain knowledge. The features, while useful, were often brittle: SIFT features work wonderfully for image matching but fail for detection (objects are too complex), and designing features for entirely new domains was expensive.

Deep Learning Era, Phase 1 (1980s-2010s): Autoencoders and Unsupervised Pre-training. The representational bottleneck of feature engineering motivated learning features directly from data. Autoencoders (Rumelhart et al., 1986) proposed a simple principle: if a neural network can reconstruct input from a lower-dimensional intermediate layer, that layer implicitly learns an informative representation. The autoencoder objective \(\min_\theta \|\text{decode}(\text{encode}(\mathbf{x})) - \mathbf{x}\|^2\) forces the encoder to discover a compact code: a representation that preserves as much input information as possible in fewer dimensions.

However, early autoencoders suffered from the collapse problem discussed in Example 4: without explicit constraints, the reconstruction objective was satisfied by trivial solutions. Hinton and Salakhutdinov (2006) demonstrated that layer-by-layer pre-training, gradually building deep autoencoders with regularization, could learn useful representations. The insight was that representations could be learned in a self-supervised (unsupervised) manner, then used by supervised tasks. This broke the feature engineering bottleneck: one representation could serve many downstream tasks.

Parallel work on Restricted Boltzmann Machines and Deep Belief Networks developed probabilistic perspectives: representations as hidden variables in generative models. The geometric insight was that good representations should support both generative modeling (reconstructing inputs) and discriminative tasks (predicting labels). These dual abilities imposed geometric constraints on representations: they must be high-dimensional enough to capture input information, but structured enough that classes are separable. VAEs (Kingma & Welling, 2013; Rezende et al., 2014), which came later but built conceptually on these probabilistic frameworks, formalized this tradeoff, with the KL regularization making explicit the reconstruction-compression tradeoff discussed in Example 9.

Breakthrough: In 2012, Krizhevsky, Sutskever, and Hinton’s AlexNet won the ImageNet competition using supervised end-to-end deep learning, demonstrating that learned representations could outperform hand-engineered features. This ended the unsupervised pre-training era for computer vision (pre-training was less beneficial than initially thought). However, representations remained central: AlexNet’s learned features (visualized for the first time systematically) showed clear hierarchical structure—early layers detected edges and textures, middle layers detected parts, late layers detected objects. This demonstrated that the network automatically discovered a task-aligned hierarchy of representations without explicit instruction.

Representation Analysis Era (2010s-early 2020s): Understanding Learned Structure. As deep networks became ubiquitous, questions shifted from “can we learn representations?” to “what representations are learned?” and “why do they work?” Zeiler and Fergus (2013) visualized ConvNet features by deconvolving feature maps, revealing that networks learned interpretable object parts. Simonyan et al. (2013) computed gradient images showing which input regions influenced network decisions.

A crucial insight came from transfer learning. Yosinski et al. (2014) studied how representations in different layers transfer to new tasks. They found that early layers’ representations transfer well (edges, textures are task-general), while late layers’ representations transfer poorly (task-specific discriminative features). This revealed that representation geometry changes across layers: early layers map to universal, task-general structure, late layers to specialized, task-specific structure. This observation motivated design of transfer learning pipelines: use early-layer representations directly, fine-tune late layers.

Spectral analysis emerged as a tool for understanding representation structure. Saxe et al. (2018) analyzed covariance spectra during training, discovering phase transitions in representation complexity. Newer work (Geifman et al., 2022) showed that eigenspectra follow power laws, connecting to natural data statistics and geometric complexity. This chapter leveraged such analyses (Example 2) to make spectra concrete. The insight that representation learning is fundamentally about discovering or imposing geometric structure became mainstream.

Self-Supervised Representation Learning Era (2018-Present): Contrastive and Auxiliary Objectives. A revolutionary shift came with discovering that large unlabeled datasets could drive representation learning through cleverly designed self-supervised objectives. Contrastive learning (Hadsell et al., 2006) was an older idea—pairs of inputs deemed similar should map to similar representations—but applying it at scale required solving the collapse problem (Example 4).

SimCLR (Chen et al., 2020) solved collapse through large batches of negative pairs and showed that contrastive learning with strong augmentation produces representations that transfer as well as supervised pre-training on ImageNet, without labels. The geometric insight was that augmentation implicitly defines similarity (two crops of the same image are positive pairs), and the contrastive loss (attraction to positives, repulsion from negatives) discovers this structure. MoCo (He et al., 2020) achieved similar results with a momentum-updated encoder and a queue of negatives, which decouples the number of negatives from the batch size. These breakthroughs established that self-supervised learning was not an inferior alternative but a viable, sometimes superior, approach to representation learning.

The contrastive learning era validated theories of this chapter: (1) Alignment (similar inputs map close) and Uniformity (diverse inputs spread out) are competing objectives requiring explicit balance (Example 8). (2) Spectral analysis reveals whether representations are utilizing capacity or degenerating (Example 4 unregularized autoencoder collapse). (3) Augmentation teaches invariance not by architectural change but by training signal reweighting (Example 5).
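
The alignment and uniformity notions in point (1) can be made concrete with a short sketch. The formulas below are one common formulation (mean squared distance between positives, and the log of the mean Gaussian potential between all pairs on the unit sphere); they are an assumption here, since this section does not fix exact definitions, and the synthetic embeddings and the temperature `t` are purely illustrative.

```python
import numpy as np

def l2_normalize(z):
    # Project embeddings onto the unit sphere.
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def alignment(z, z_pos):
    # Mean squared distance between positive pairs (lower = better aligned).
    return np.mean(np.sum((z - z_pos) ** 2, axis=1))

def uniformity(z, t=2.0):
    # Log of the mean Gaussian potential over distinct pairs (lower = more spread out).
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(z), k=1)
    return np.log(np.mean(np.exp(-t * sq[iu])))

rng = np.random.default_rng(0)
n, d = 256, 16

# Well-spread embeddings with slightly perturbed positives.
spread = l2_normalize(rng.normal(size=(n, d)))
spread_pos = l2_normalize(spread + 0.05 * rng.normal(size=(n, d)))

# Nearly collapsed embeddings: every point sits close to the same direction.
collapsed = l2_normalize(np.ones((n, d)) + 1e-3 * rng.normal(size=(n, d)))
collapsed_pos = l2_normalize(np.ones((n, d)) + 1e-3 * rng.normal(size=(n, d)))

print("spread:    align", alignment(spread, spread_pos), "unif", uniformity(spread))
print("collapsed: align", alignment(collapsed, collapsed_pos), "unif", uniformity(collapsed))
```

The collapsed embeddings score best on alignment and worst on uniformity, which is why neither quantity can be optimized in isolation.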

Foundation Models and Scaling Era (2021-Present): Emergent Representations. The success of large pre-trained models (BERT, GPT, Vision Transformers, CLIP) raised new questions about representations at scale. With billions of parameters trained on billions of examples, do old principles still apply? Empirically, yes—spectral analysis of BERT embeddings shows power-law decay (Example 2 principle applies to language models), alignment-uniformity tradeoffs remain fundamental (Example 8 principle motivates contrastive training of CLIP), and transfer learning success hinges on representation architecture designed for downstream tasks.

However, scaling introduced phenomena not present in small models: emergent capabilities (tasks unsolved by smaller models are solved by larger ones), memorization coexisting with generalization, and implicit biases at extreme overparameterization. This chapter’s framework extends naturally: representations are still geometric structures whose properties determine performance, but scale qualitatively changes the alignment between architecture, optimization, and objective. In foundation models, representations become rich enough that task-general structure (broad knowledge of document structure, image objects, etc.) can be learned and reused without explicit fine-tuning.

Why This Matters for ML

Foundation Models and Embeddings

Foundation models—large pre-trained neural networks trained on internet-scale unlabeled data—have revolutionized machine learning practice by learning universal representations applicable to diverse downstream tasks.

Model Diversity and Representation Structure:

BERT and GPT-series models learn representations of text spanning multiple linguistic levels: syntactic (grammatical structure), semantic (word meaning), pragmatic (discourse context), and task-specific (instructions, in-context examples). Vision Transformers learn hierarchical visual representations from color patches up to object categories. CLIP learns joint visual-linguistic representations in which images and their descriptions map to nearby points in a shared embedding space.

These models succeed largely because their representations exhibit the geometric properties discussed in this chapter: alignment (similar inputs are represented similarly), uniformity (distinct inputs are spread apart), and meaningful geometry (distance in embedding space correlates with semantic similarity). The scale and diversity of training data help make learned representations robust, capturing broad patterns rather than dataset artifacts.

Why Representation Geometry Matters for Deployment:

When deploying foundation models, representation geometry is no longer abstract—it directly impacts downstream task performance:

  • A CLIP visual representation that clusters images by object category will transfer well to object classification but poorly to fine-grained texture tasks (texture-relevant structure is buried in high-frequency spectrum components).
  • A BERT representation optimized for masked language modeling transfers well to natural language understanding tasks (parsing, sentiment) but may require fine-tuning for specialized domains (medical text, code).

Practical Implications for Practitioners:

  1. Select appropriate pre-trained models by analyzing whether their learned representations align with target task geometry.
  2. Design efficient fine-tuning by identifying which representation layers require updating (early vs. late layers exhibit different transfer properties).
  3. Diagnose transfer learning failures: when pre-trained representations don’t transfer, the issue is often geometric misalignment, which can be addressed by fine-tuning or by using a different pre-training objective.
  4. Evaluate representation quality beyond accuracy metrics by measuring effective rank, eigenspectrum shape, and stability properties.
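
Two of the diagnostics in item 4, effective rank and eigenspectrum shape, can be computed directly from a matrix of embeddings. The sketch below uses synthetic embeddings in place of real model outputs; the function names, the participation-ratio definition of effective rank, and the log-log slope as a summary of spectrum shape are assumptions made for illustration.

```python
import numpy as np

def covariance_spectrum(Z):
    # Eigenvalues of the embedding covariance, sorted in descending order.
    Zc = Z - Z.mean(axis=0, keepdims=True)
    cov = Zc.T @ Zc / (len(Z) - 1)
    eigs = np.linalg.eigvalsh(cov)[::-1]
    return np.clip(eigs, 0.0, None)

def effective_rank(eigs):
    # Participation ratio: (sum lambda)^2 / (sum lambda^2).
    return eigs.sum() ** 2 / np.sum(eigs ** 2)

def spectrum_slope(eigs):
    # Least-squares slope of log(eigenvalue) against log(index).
    idx = np.arange(1, len(eigs) + 1)
    return np.polyfit(np.log(idx), np.log(eigs), 1)[0]

rng = np.random.default_rng(0)
n, d = 20000, 32

# "Healthy" synthetic embeddings with covariance eigenvalues proportional to 1/i.
scales = np.arange(1, d + 1, dtype=float) ** -0.5
Z_healthy = rng.normal(size=(n, d)) * scales

# "Collapsed" synthetic embeddings confined (up to tiny noise) to a 2-D subspace.
Z_collapsed = rng.normal(size=(n, 2)) @ rng.normal(size=(2, d))
Z_collapsed = Z_collapsed + 1e-3 * rng.normal(size=(n, d))

eigs_h = covariance_spectrum(Z_healthy)
eigs_c = covariance_spectrum(Z_collapsed)
print("healthy:   eff. rank", effective_rank(eigs_h), "slope", spectrum_slope(eigs_h))
print("collapsed: eff. rank", effective_rank(eigs_c))
```

In practice `Z_healthy` and `Z_collapsed` would be replaced by embeddings extracted from the candidate models on a held-out sample of target-task data.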

Robustness of Learned Features

A critical application of representation geometry is understanding robustness—a key challenge in modern deployed ML systems.

The Brittleness Problem:

Modern ML systems face diverse distribution shifts: ImageNet-trained image classifiers fail on natural distribution shifts (weather conditions, lighting, artistic renderings), and language models produce different outputs when phrasings change slightly. Brittleness often stems from representations learning superficial, non-robust features: features that correlate with targets in the training distribution but fail under distribution shift. For example, a model trained on ImageNet might learn that “blue pixels in upper half” often correlates with a “sky” class. This works in-distribution but fails for images with unusual blue colors or objects placed differently.

Spectral Signatures of Robustness:

  1. Robust vs. Brittle Spectra: Representations learning robust features typically have smoother eigenspectra (power law with gentler exponent), while representations learning brittle, dataset-specific features have sharper spectra (rapid decay, then flat tail, indicating most variance concentrated in task-specific dimensions).

  2. Mechanisms for Improvement: Data augmentation teaches invariance, weight regularization prevents specialization, self-supervised pre-training learns general structures—all directly target representation geometry by forcing the network to learn features that remain similar under natural variations and spreading learned information across dimensions.

  3. Empirical Evidence from Example 12: The metric learning models in Example 12 that remained stable under noise perturbations achieved that robustness through augmentation-driven invariance learning and Lipschitz regularization. These are not ad hoc tricks but natural consequences of the alignment-uniformity tradeoff and spectral regularization effects.
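The contrast between gentle and steep spectra in item 1 can be made measurable by fitting a decay exponent to the covariance eigenvalues. The sketch below is one crude way to do this, assuming a log-log least-squares fit is an acceptable summary; the function name `spectrum_decay_exponent` and the synthetic data are illustrative and do not come from the chapter's examples.

```python
import numpy as np

def spectrum_decay_exponent(z, k_min=1, k_max=None):
    """Fit lambda_i ~ i^(-alpha) by least squares in log-log space and return alpha."""
    zc = z - z.mean(axis=0, keepdims=True)                      # center first
    lam = np.linalg.eigvalsh(zc.T @ zc / (len(z) - 1))[::-1]    # descending order
    k_max = k_max or len(lam)
    idx = np.arange(k_min, k_max + 1)
    vals = np.clip(lam[k_min - 1:k_max], 1e-12, None)           # guard the log
    slope, _ = np.polyfit(np.log(idx), np.log(vals), deg=1)
    return float(-slope)

# Synthetic embeddings with known spectra (assumed data, for illustration only).
rng = np.random.default_rng(0)
d, N = 32, 20000
ranks = np.arange(1, d + 1, dtype=np.float64)
gentle = rng.normal(size=(N, d)) * np.sqrt(ranks ** -1.0)   # lambda_i = 1 / i
steep = rng.normal(size=(N, d)) * np.sqrt(ranks ** -2.0)    # lambda_i = 1 / i^2

alpha_gentle = spectrum_decay_exponent(gentle)
alpha_steep = spectrum_decay_exponent(steep)
```

A single exponent is a coarse summary: a spectrum with rapid decay followed by a flat tail is not a power law, so inspecting the fit residuals or restricting `k_min` and `k_max` is advisable before comparing models.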

Practical Implications:

  • Data augmentation teaches the network to weight coarser features (early spectrum) more heavily and fine details (late spectrum) less, improving robustness to noise that affects fine details.
  • Adversarial robustness correlates with learned features being more “human-like” (resembling features humans use).
  • Certified robustness bounds arise naturally from Lipschitz constraints on representations.
  • Domain generalization succeeds when learned features are invariant to domain shifts.

Representation geometry is not a theoretical curiosity but a practical lever for controlling model behavior and building robust systems.


This chapter introduced collapse (Example 4) and stability (Example 12) as phenomena revealing representation pathologies—entry points to troubleshooting real systems.

Collapse Debugging:

In production systems, representation collapse is a serious failure mode:

  1. Failure Modes: Recommendation systems collapse to recommending the same item for all users, duplicate detection systems fail because all embeddings are similar, clustering systems find one mega-cluster.

  2. Identification via Spectral Analysis: Diagnosing collapse involves computing the covariance eigenspectrum of the embeddings and checking the effective rank. If the effective rank drops below an acceptable, application-specific threshold, collapse is underway.

  3. Solution Strategies:

    • Diversity Regularization: Variance maximization, decorrelation (from Theorem 9 Spectral Regularization).
    • Architectural Modifications: Adding negative samples in contrastive settings, adding skip connections to preserve diversity.
    • Optimization Changes: Larger learning rates that avoid premature convergence, curriculum learning that starts with high diversity.
  4. Forward Connection: Later chapters will develop collapse mitigation mathematically. Chapter 19 will show how probabilistic models (VAEs, diffusion models) maintain diversity through stochasticity, and Chapter 20 will show how gradient flow during training can cause collapse through loss landscape geometry. Those techniques build directly on the insights here: identify the problem through spectral analysis, then solve it through architectural or algorithmic interventions targeted at representation geometry.
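The spectral check in item 2 can be sketched in a few lines. This is a minimal illustration, assuming the participation-ratio form of effective rank from the Notation Summary; the helper name `collapse_alert` and its 0.25 threshold are hypothetical choices, not values prescribed by the chapter.

```python
import numpy as np

def effective_rank(z, eps=1e-12):
    """Participation ratio (sum lam)^2 / (sum lam^2) of the embedding covariance."""
    z = np.asarray(z, dtype=np.float64)
    zc = z - z.mean(axis=0, keepdims=True)                # center before covariance
    cov = zc.T @ zc / max(len(z) - 1, 1)
    lam = np.clip(np.linalg.eigvalsh(cov), 0.0, None)     # symmetric matrix, so eigvalsh
    return float(lam.sum() ** 2 / (np.sum(lam ** 2) + eps))

def collapse_alert(z, min_fraction=0.25):
    """Flag collapse when effective rank is below a fraction of the ambient dimension.

    The default fraction is an arbitrary illustrative threshold.
    """
    r = effective_rank(z)
    return r < min_fraction * z.shape[1], r

rng = np.random.default_rng(0)
healthy = rng.normal(size=(2000, 32))                     # isotropic, uses all 32 dimensions
basis = rng.normal(size=(2, 32))
collapsed = rng.normal(size=(2000, 2)) @ basis            # confined to a 2-D subspace
collapsed += 1e-3 * rng.normal(size=collapsed.shape)      # small ambient noise

healthy_flag, healthy_rank = collapse_alert(healthy)
collapsed_flag, collapsed_rank = collapse_alert(collapsed)
```

In a training loop, a check like this would typically run on a held-out batch of embeddings every few hundred steps, so that a falling effective rank is visible before downstream metrics degrade.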

Stability Analysis:

Example 12 introduced embedding stability—robustness to input noise. In real systems, noise is everywhere: sensor noise in robotics, compression artifacts in communication-limited settings, annotation noise in label distributions.

  1. The Problem: Embedding instability causes downstream failures: small input noise produces large representation changes, which push points across decision boundaries and cause misclassification.

  2. Theoretical Foundation: The Stability Theorem in this chapter (Theorem 5) shows that embedding stability is controlled by Lipschitz constants \(L_x\) (input sensitivity) and \(L_\theta\) (parameter sensitivity). Bounding these constants bounds embedding stability.

  3. Practical Methods: Spectral normalization and Lipschitz regularization (discussed in later chapters in the context of certified robustness) achieve stability guarantees.
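To make the role of the input Lipschitz constant concrete, the sketch below compares a worst-case bound with observed sensitivity for a toy encoder. It assumes a bias-free ReLU MLP, for which the product of layer spectral norms upper-bounds the Lipschitz constant because ReLU is 1-Lipschitz; the weights are random stand-ins and the function names are illustrative.

```python
import numpy as np

def relu_mlp(x, weights):
    """Bias-free ReLU encoder: ReLU after every layer except the last."""
    h = x
    for w in weights[:-1]:
        h = np.maximum(h @ w, 0.0)
    return h @ weights[-1]

def lipschitz_upper_bound(weights):
    """Product of layer spectral norms (largest singular values)."""
    return float(np.prod([np.linalg.norm(w, ord=2) for w in weights]))

def empirical_sensitivity(x, weights, sigma=1e-2, trials=200, seed=0):
    """Largest observed ||f(x + delta) - f(x)|| / ||delta|| over random perturbations."""
    rng = np.random.default_rng(seed)
    base = relu_mlp(x, weights)
    worst = 0.0
    for _ in range(trials):
        delta = sigma * rng.normal(size=x.shape)
        ratio = np.linalg.norm(relu_mlp(x + delta, weights) - base) / np.linalg.norm(delta)
        worst = max(worst, float(ratio))
    return worst

rng = np.random.default_rng(1)
weights = [rng.normal(size=(16, 32)) / 4.0, rng.normal(size=(32, 8)) / 4.0]
x = rng.normal(size=16)

bound = lipschitz_upper_bound(weights)
observed = empirical_sensitivity(x, weights)

# Dividing each layer by its spectral norm makes every factor of the bound equal to 1.
normalized = [w / np.linalg.norm(w, ord=2) for w in weights]
normalized_bound = lipschitz_upper_bound(normalized)
```

The observed ratio is usually well below the bound, since random perturbations rarely align with the worst-case direction; the bound is what a certificate relies on, while the empirical value is a cheap monitoring signal.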

Integrating Lessons from Both:

Collapse debugging and stability analysis together reveal that representation geometry is not a theoretical luxury but a practical necessity for building robust, reliable ML systems. The formal definitions and theorems in this chapter are not abstractions disconnected from practice; they are mathematical descriptions of phenomena practitioners encounter and must handle. Representation learning as geometry is a fundamental lens for understanding and debugging real systems.


Motivation

Representations as Learned Coordinate Systems

Every representation learning system constructs a coordinate system for data. When a neural network maps raw inputs to hidden layer activations, it performs a change of coordinates from the original observation space to a learned feature space. The critical insight is that this coordinate transformation is not arbitrary—it is optimized to make specific tasks easier.

Consider image classification. Raw pixel values provide one coordinate system for images, but this coordinate system makes classification geometrically difficult: images of the same class are scattered throughout pixel space, separated by irrelevant variations in lighting, viewpoint, and background. A learned representation transforms images into coordinates where same-class examples cluster together and different classes separate. The optimization process discovers a coordinate system that makes the classification boundary geometrically simple.

This perspective reveals representation learning as geometric design. The training objective specifies desired geometric properties—clustering for classification, distance preservation for reconstruction, distributional matching for generation—and optimization searches for coordinate transformations that achieve these properties. The quality of a representation depends on how well its induced geometry aligns with downstream task requirements.

Mathematically, if \(\mathbf{x} \in \mathcal{X}\) denotes an input in observation space and \(f_\theta: \mathcal{X} \to \mathcal{Z}\) denotes a learned encoder producing representation \(\mathbf{z} = f_\theta(\mathbf{x}) \in \mathcal{Z}\), then \(f_\theta\) defines a coordinate map. The Jacobian \(J_{f_\theta}(\mathbf{x})\) describes how local geometry in \(\mathcal{X}\) transforms to local geometry in \(\mathcal{Z}\). Training adjusts \(\theta\) to make \(\mathcal{Z}\) have desired geometric properties—appropriate distances, clustering structure, smooth interpolation paths.
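The local picture can be inspected numerically: the singular values of the Jacobian say how much each input direction is stretched or flattened by the encoder at a given point. The following is a small sketch under assumed conditions, using a toy encoder \(f(\mathbf{x}) = \tanh(W\mathbf{x})\) with random weights in place of a trained network and a central-difference Jacobian.

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-5):
    """Central-difference Jacobian of f at x, with shape (dim_z, dim_x)."""
    x = np.asarray(x, dtype=np.float64)
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        cols.append((f(x + e) - f(x - e)) / (2.0 * h))
    return np.stack(cols, axis=1)

# Toy encoder; W is a hypothetical stand-in for trained weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
encoder = lambda x: np.tanh(W @ x)

x0 = rng.normal(size=5)
J = numerical_jacobian(encoder, x0)

# Singular values: local stretch factors of the coordinate map at x0.
singular_values = np.linalg.svd(J, compute_uv=False)

# Closed form for this toy encoder, used only to check the finite differences.
J_exact = (1.0 - np.tanh(W @ x0) ** 2)[:, None] * W
```

Because the encoder maps five input dimensions to three, at least two input directions have zero stretch at every point: those are the variations the representation discards locally.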

Optimization Shapes Latent Geometry

Gradient descent does not merely minimize loss—it sculpts the geometry of representation space. Different optimization trajectories produce representations with different geometric properties even when they achieve similar final loss values. The path matters, not just the destination.

This geometric shaping occurs through accumulation of small updates over training. Each gradient step adjusts representation coordinates to reduce loss, but these adjustments compound over thousands of iterations to produce global geometric structure. Early training establishes coarse geometric features—which dimensions capture most variance, which regions of space correspond to which classes. Later training refines details—precisely where class boundaries lie, how smooth transitions between representations should be.

The optimization algorithm itself introduces geometric biases. Stochastic gradient descent with large learning rates explores flat regions of loss landscape, producing representations that vary smoothly. Small learning rates exploit narrow valleys, potentially producing brittle representations that generalize poorly. Momentum accumulates gradient information over time, biasing representations toward solutions that are consistent across multiple samples. Adaptive optimizers like Adam adjust effective learning rates per dimension, allowing different geometric axes to evolve at different rates.

Crucially, the loss landscape itself has geometry that constrains what representations can be learned. A loss function \(\mathcal{L}(\mathbf{z}, y)\) where \(\mathbf{z}\) is a representation and \(y\) is a target defines level sets in representation space. Gradient-based optimization moves perpendicular to these level sets, along the negative gradient \(-\nabla_\mathbf{z} \mathcal{L}\), the direction of steepest descent. The curvature of level sets influences how representations cluster: high curvature creates tight clusters, low curvature creates dispersed representations. Training implicitly optimizes not just loss values but the geometric configuration that produces those values.

Why Feature Learning Is Geometric

Features are not isolated statistics—they form geometric structures that encode relationships between data points. A good feature representation preserves meaningful distances and angles while discarding irrelevant variations. This preservation and discarding constitutes a geometric operation: a projection or embedding that maintains certain geometric invariants while collapsing others.

Consider learning features for face recognition. Relevant geometric properties include similarity in identity (faces of the same person should be nearby), separation by identity (different people should be distant), and invariance to pose and lighting (rotating or brightening a face should not move its representation far). A geometric feature space achieves this through its metric structure: the distance function \(d(\mathbf{z}_1, \mathbf{z}_2)\) encoding similarity, and the manifold structure encoding valid variations.

The geometric nature of features explains why certain architectural elements improve performance. Convolutional layers exploit translation invariance—a geometric symmetry. Attention mechanisms compute geometric relationships between tokens—similarities and alignments in representation space. Normalizations stabilize geometric scale, preventing some dimensions from dominating. Each architectural component shapes representational geometry in specific ways.

Mathematically, feature learning constructs an embedding \(\phi: \mathcal{X} \to \mathbb{R}^d\) that maps data into a Euclidean space equipped with a useful metric. The metric encodes task-relevant structure: for classification, the metric should make same-class examples form tight clusters with large between-cluster distances. For reconstruction, the metric should preserve original distances. For generation, the metric should enable smooth interpolation between valid data points.

The geometric view also clarifies feature disentanglement. A disentangled representation is one where coordinate axes align with independent factors of variation. Geometrically, this means the representation space factorizes as a product space, with orthogonal axes corresponding to orthogonal variations in data. Achieving disentanglement requires optimization objectives that encourage such factorization, often through independence constraints or sparsity regularization.

Degeneracy and Collapse in Representation Spaces

Optimization can produce degenerate representations that satisfy training objectives without learning useful structure. The most common degeneracy is dimensional collapse: representations concentrate on a lower-dimensional subspace of the ambient space, wasting capacity and limiting expressiveness.

Collapse occurs when optimization discovers that minimizing loss does not require utilizing all available dimensions. In contrastive learning, if positive pairs can be made identical by zeroing out most dimensions, gradient descent will do so. In autoencoders, if reconstruction is possible using only a low-dimensional manifold, the encoder will project onto that manifold. In self-supervised learning, if the pretext task can be solved trivially using limited information, representations will reflect only that limited information.

Geometrically, collapse manifests as representations lying on a lower-dimensional manifold \(\mathcal{M} \subset \mathbb{R}^d\) where \(\dim(\mathcal{M}) \ll d\). The effective rank of the representation covariance matrix \(\text{Cov}(\mathbf{z})\) drops below the ambient dimension. Singular value decomposition reveals that only a few principal components capture most variance.

Collapse has mathematical causes rooted in optimization dynamics. Loss landscapes often have flat directions—directions in parameter space where loss does not change. Gradient descent accumulates updates primarily in non-flat directions, leaving flat directions unexplored. If certain representation dimensions correspond to flat loss directions, those dimensions will not be utilized, resulting in collapse.

Preventing collapse requires regularization that enforces geometric richness. Variance regularization ensures representations spread across all dimensions. Decorrelation encourages orthogonal principal components. Contrastive methods use negative samples to prevent trivial solutions. Architectural constraints like bottlenecks force information through limited capacity, preventing lazy solutions. Each technique imposes geometric constraints that counteract collapse.
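Two of the regularizers named above, variance and decorrelation, are simple to write down. The sketch below shows one possible form of each, assuming a hinge on per-dimension standard deviation and a squared off-diagonal covariance penalty; the function names, the target standard deviation, and the normalization constants are illustrative assumptions rather than the chapter's definitions.

```python
import numpy as np

def variance_penalty(z, target_std=1.0, eps=1e-4):
    """Hinge on per-dimension std: zero once every dimension has std >= target_std."""
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, target_std - std)))

def decorrelation_penalty(z):
    """Sum of squared off-diagonal covariance entries, divided by the dimension."""
    zc = z - z.mean(axis=0, keepdims=True)
    cov = zc.T @ zc / (len(z) - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / z.shape[1])

rng = np.random.default_rng(0)
spread = 1.5 * rng.normal(size=(1000, 8))                 # all dimensions active, independent
low_variance = 0.1 * rng.normal(size=(1000, 8))           # every dimension nearly constant
redundant = rng.normal(size=(1000, 1)) * np.ones((1, 8))  # one factor copied into 8 dimensions

penalties = {
    "spread": (variance_penalty(spread), decorrelation_penalty(spread)),
    "low_variance": (variance_penalty(low_variance), decorrelation_penalty(low_variance)),
    "redundant": (variance_penalty(redundant), decorrelation_penalty(redundant)),
}
```

The two terms catch different failures: the variance hinge fires when dimensions shrink toward constants, while the decorrelation term fires when dimensions carry the same information, which is the dimensional-collapse case that per-dimension variance alone cannot see.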

Common Misconceptions About Representation Learning

Misconception: Representations are just features. Representations are not merely collections of features but structured geometric spaces with metrics, topologies, and symmetries. The relationships between features—their correlations, orthogonalities, and hierarchical organizations—often matter more than individual feature values.

Misconception: Deeper networks always learn better representations. Depth enables learning hierarchical features, but excessive depth without appropriate architectural support can degrade representations through vanishing gradients, information loss, or geometric distortion. Representation quality depends on the interaction between depth and other architectural choices like skip connections and normalization.

Misconception: Pre-training and fine-tuning always help. Transfer learning assumes that representations learned on one task will be useful for another. This holds when tasks share geometric structure—similar distance metrics and clustering patterns. When task geometries differ substantially, transfer can harm performance by biasing representations toward irrelevant structure.

Misconception: Larger latent dimensions always improve representation. High-dimensional spaces enable richer representations but also introduce challenges: increased optimization difficulty, higher risk of overfitting, and greater susceptibility to dimensional collapse. Optimal latent dimension balances expressiveness against these challenges.

Misconception: Representation learning is unsupervised. Even methods marketed as unsupervised impose strong inductive biases through loss functions, architectures, and augmentations. These biases encode assumptions about data structure—smoothness, clustering, symmetries—that guide representations toward specific geometric configurations. The distinction between supervised and unsupervised is less important than understanding what geometric properties the training procedure induces.

ML Connection

Autoencoders and Latent Spaces

Autoencoders learn representations by reconstructing inputs through a bottleneck. An encoder \(f_\theta: \mathbb{R}^n \to \mathbb{R}^d\) maps inputs to latent codes \(\mathbf{z} = f_\theta(\mathbf{x})\), and a decoder \(g_\phi: \mathbb{R}^d \to \mathbb{R}^n\) reconstructs \(\hat{\mathbf{x}} = g_\phi(\mathbf{z})\). Training minimizes reconstruction error \(\mathcal{L}(\mathbf{x}, \hat{\mathbf{x}})\), forcing the latent space to capture information necessary for reconstruction.

Geometric Interpretation: The autoencoder learns a manifold \(\mathcal{M}\) of dimension at most \(d\) embedded in the input space \(\mathbb{R}^n\). The decoder \(g_\phi\) parametrizes this manifold, the encoder \(f_\theta\) assigns each input its latent coordinates in \(\mathbb{R}^d\), and the composition \(g_\phi \circ f_\theta\) projects data onto \(\mathcal{M}\). The reconstruction loss measures how well data can be approximated by points on \(\mathcal{M}\). When \(d < n\), the bottleneck enforces dimensional reduction, compressing data onto a lower-dimensional manifold.
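The linear case makes this geometry checkable by hand. For a linear encoder and decoder with squared-error loss, the best rank-\(d\) reconstruction is the projection onto the top \(d\) principal directions, and the reconstruction error equals the energy in the discarded directions. The sketch below verifies this on synthetic data; the data-generating setup is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 10, 3, 500

# Synthetic inputs lying near a 3-D subspace of R^10 (illustrative data).
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, n)) + 0.05 * rng.normal(size=(N, n))
Xc = X - X.mean(axis=0)

# Optimal linear bottleneck: top-d right singular vectors of the centered data.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V_d = Vt[:d].T                        # shape (n, d)

encode = lambda x: x @ V_d            # f: R^n -> R^d, latent coordinates
decode = lambda z: z @ V_d.T          # g: R^d -> R^n, parametrizes the subspace

Z = encode(Xc)
X_hat = decode(Z)

recon_error = float(np.sum((Xc - X_hat) ** 2))
discarded_energy = float(np.sum(S[d:] ** 2))   # squared singular values left out
```

Here the learned "manifold" is a flat \(d\)-dimensional subspace of \(\mathbb{R}^n\); nonlinear autoencoders replace it with a curved surface, but the roles of encoder, decoder, and reconstruction error are the same.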

Concrete Example: A convolutional autoencoder for MNIST digits learns a latent space where each dimension captures a specific variation—stroke thickness, slant angle, loop closure. The manifold \(\mathcal{M}\) is approximately 10-15 dimensional despite the 28×28=784 pixel input space. Interpolating linearly in \(\mathbf{z}\)-space produces smooth morphs between digits because the decoder maps straight line segments in \(\mathbb{R}^d\) to visually coherent paths on the image manifold.

Optimization Dynamics: Early in training, the encoder typically maps all inputs to a small region near the origin, so the latent codes start out nearly collapsed. Gradients from reconstruction error push encodings apart, expanding the occupied region of latent space. The decoder simultaneously learns to map this region back to input space. The optimization balances two competing pressures: spreading encodings to avoid ambiguity (large latent variance) versus keeping encodings compact to avoid fitting noise (small latent variance).

Geometric Pathologies: Autoencoders can exhibit holes in latent space—regions where the decoder produces invalid outputs because no training example mapped there. This occurs when data does not fill the latent space uniformly. Holes break interpolation: moving between two valid encodings may pass through regions that decode to nonsense. Variational autoencoders address this by imposing a prior distribution on \(\mathbf{z}\), encouraging encodings to fill space more uniformly.

Contrastive Learning Objectives

Contrastive learning constructs representations by pulling similar examples together and pushing dissimilar examples apart. Given positive pairs \((\mathbf{x}_i, \mathbf{x}_i^+)\) (augmentations of the same input) and negative samples \(\{\mathbf{x}_j^-\}\), the contrastive loss is:

\[ \mathcal{L}_{\text{contrast}} = -\log \frac{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_i^+) / \tau)}{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_i^+) / \tau) + \sum_j \exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j^-) / \tau)} \]

where \(\text{sim}(\mathbf{z}_a, \mathbf{z}_b) = \mathbf{z}_a^\top \mathbf{z}_b / (\|\mathbf{z}_a\| \|\mathbf{z}_b\|)\) measures cosine similarity and \(\tau\) is temperature.
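A direct transcription of this loss for a single anchor can help when checking an implementation. The sketch below follows the formula term by term, with a log-sum-exp shift added for numerical stability; the function names are illustrative, and batching over anchors is left out to keep the correspondence with the equation visible.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between vector a of shape (d,) and each row of b with shape (k, d)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return b @ a

def contrastive_loss(z, z_pos, z_negs, tau=0.1):
    """Contrastive loss for one anchor z, one positive z_pos, and negatives z_negs (k, d)."""
    pos = cosine_sim(z, z_pos[None, :]) / tau       # shape (1,)
    neg = cosine_sim(z, z_negs) / tau               # shape (k,)
    logits = np.concatenate([pos, neg])
    logits = logits - logits.max()                  # shift does not change the softmax
    # -log( exp(pos) / (exp(pos) + sum_j exp(neg_j)) )
    return float(-(logits[0] - np.log(np.sum(np.exp(logits)))))

# Well-separated case: positive aligned with the anchor, negatives orthogonal to it.
e = np.eye(3)
loss_separated = contrastive_loss(e[0], e[0], e[1:])

# Degenerate case: every similarity is equal, so the softmax is uniform over 1 + K terms.
ones = np.ones(4)
loss_equal = contrastive_loss(ones, ones, np.ones((7, 4)))
```

Only directions enter the computation: rescaling any embedding by a positive constant leaves the loss unchanged, which is the magnitude invariance discussed next.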

Geometric Interpretation: This loss shapes representation space as an angular geometry. The similarity function measures angles, not Euclidean distances, making representations invariant to magnitude—only direction matters. Optimization concentrates positive pairs at small angular separation while dispersing negative pairs widely. On a unit hypersphere (where \(\|\mathbf{z}\|=1\)), the loss creates a geometric configuration resembling hard sphere packing: each positive cluster occupies a region, with clusters separated by finite angular distance.

Concrete Example: SimCLR for image representations learns features where crops and color-jittered versions of the same image map to nearby points on the unit sphere, while different images scatter across the sphere. The temperature \(\tau\) controls geometric sharpness: small \(\tau\) enforces tight clustering of positives with sharp boundaries between clusters (high curvature), while large \(\tau\) allows softer boundaries (low curvature). Values around \(\tau=0.1\) are common in practice; the resulting angular spread of same-image augmentations depends on augmentation strength and training details, not on \(\tau\) alone.

Optimization Dynamics: Early training spreads representations across the sphere, providing diversity. As training progresses, positive pairs gradually move together while negatives separate. The rate of clustering depends on batch size (more negatives push harder) and learning rate (larger steps move faster). Collapse of all representations to a single point is the trivial way to maximize positive-pair similarity. With the loss above, a fully collapsed encoder makes all similarities equal and yields loss \(\log(1+K)\) for \(K\) negatives rather than zero, so it is the negative terms that penalize collapse; objectives without negatives need other safeguards.

Dimensional Collapse Prevention: Contrastive methods combat collapse through several mechanisms. Large batch sizes provide many negatives, preventing trivial solutions. Hard negative mining selects particularly confusing examples, forcing representations to be more discriminative. Asymmetric architectures (e.g., BYOL, SimSiam) use predictor networks and stop-gradients to prevent collapse without explicit negatives, relying on architectural asymmetry to maintain representational diversity.

Metric Learning

Metric learning optimizes representation spaces to satisfy distance constraints. The goal is to learn an embedding \(f_\theta: \mathcal{X} \to \mathbb{R}^d\) such that \(d(\mathbf{z}_i, \mathbf{z}_j) = \|f_\theta(\mathbf{x}_i) - f_\theta(\mathbf{x}_j)\|\) reflects semantic similarity. Triplet loss is a canonical example:

\[ \mathcal{L}_{\text{triplet}} = \max(0, d(\mathbf{z}_a, \mathbf{z}_p) - d(\mathbf{z}_a, \mathbf{z}_n) + m) \]

where \((\mathbf{x}_a, \mathbf{x}_p, \mathbf{x}_n)\) is an anchor-positive-negative triplet and \(m\) is a margin.
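A batched version of this loss takes only a few lines. The sketch below uses Euclidean distances as in the definition above; the three hand-picked triplets are illustrative and chosen so that each branch of the hinge is exercised.

```python
import numpy as np

def triplet_loss(z_a, z_p, z_n, margin=0.2):
    """Per-triplet hinge loss with Euclidean distances; inputs have shape (batch, d)."""
    d_ap = np.linalg.norm(z_a - z_p, axis=1)
    d_an = np.linalg.norm(z_a - z_n, axis=1)
    return np.maximum(0.0, d_ap - d_an + margin)

anchor = np.zeros((3, 2))
positive = np.array([[0.3, 0.0], [0.3, 0.0], [0.3, 0.0]])
negative = np.array([[1.0, 0.0],    # satisfied with room to spare: 0.3 - 1.0 + 0.2 < 0
                     [0.4, 0.0],    # correct order but inside the margin: 0.3 - 0.4 + 0.2 = 0.1
                     [0.1, 0.0]])   # violated ordering: 0.3 - 0.1 + 0.2 = 0.4

losses = triplet_loss(anchor, positive, negative)
```

Only the second and third triplets contribute gradient; the first already satisfies the constraint and is ignored by the hinge.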

Geometric Interpretation: This loss enforces geometric orderings: for each anchor, the negative must be farther away than the positive by at least the margin \(m\). The margin creates a buffer zone around each anchor: if positives lie within a ball of radius \(r_p\), negatives must lie outside a ball of radius \(r_n = r_p + m\). Aggregating over all triplets produces a global geometry in which distances between similar examples are smaller than distances between dissimilar examples.

Concrete Example: Face verification with FaceNet learns embeddings where faces of the same person form tight clusters (intra-cluster distance \(\approx 0.5\)) while different people separate widely (inter-cluster distance \(\approx 1.5\) with margin \(m=0.2\)). The embedding space is typically 128-dimensional, with distances computed as Euclidean \(\ell_2\) norms. At test time, verification reduces to thresholding distance: if \(d(\mathbf{z}_i, \mathbf{z}_j) < 1.0\), predict same person.

Optimization Dynamics: Metric learning is sensitive to triplet selection strategy. Random triplets often produce easy examples where the constraint is already satisfied, providing no learning signal. Hard negative mining selects triplets that violate constraints \(d(\mathbf{z}_a, \mathbf{z}_p) > d(\mathbf{z}_a, \mathbf{z}_n)\), providing strong gradients. Semi-hard mining balances difficulty, selecting negatives where \(d(\mathbf{z}_a, \mathbf{z}_p) < d(\mathbf{z}_a, \mathbf{z}_n) < d(\mathbf{z}_a, \mathbf{z}_p) + m\), ensuring gradients update representations meaningfully without being overwhelmed by extreme cases.
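The three difficulty regimes can be written as masks on the anchor-negative distances. The sketch below labels candidate negatives for a fixed anchor-positive pair and picks the closest semi-hard one; the function names are illustrative, ties with \(d(\mathbf{z}_a, \mathbf{z}_p)\) are counted as hard by assumption, and choosing the closest semi-hard candidate is one possible selection rule among several.

```python
import numpy as np

def classify_negatives(d_ap, d_an, margin=0.2):
    """Label each candidate negative as 'easy', 'semi-hard', or 'hard'."""
    d_an = np.asarray(d_an, dtype=np.float64)
    labels = np.full(d_an.shape, "easy", dtype=object)        # d_an >= d_ap + margin: zero loss
    labels[(d_an > d_ap) & (d_an < d_ap + margin)] = "semi-hard"
    labels[d_an <= d_ap] = "hard"                             # ordering violated (or tied)
    return labels

def pick_semi_hard(d_ap, d_an, margin=0.2):
    """Index of the closest semi-hard negative, or None if none exists."""
    d_an = np.asarray(d_an, dtype=np.float64)
    mask = (d_an > d_ap) & (d_an < d_ap + margin)
    if not mask.any():
        return None
    candidates = np.flatnonzero(mask)
    return int(candidates[np.argmin(d_an[candidates])])

d_ap = 0.5
d_an = [0.3, 0.55, 0.65, 0.9]
labels = classify_negatives(d_ap, d_an)
chosen = pick_semi_hard(d_ap, d_an)
```

Returning `None` when no semi-hard candidate exists leaves the fallback policy (skip the anchor, or fall back to the hardest negative) to the caller, since that choice affects training stability.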

Geometric Regularization: Raw metric learning can overfit by memorizing training distances. N-pair loss generalizes triplet loss to multiple negatives, improving generalization by imposing more distance constraints. Angular loss replaces Euclidean distances with angular distances, improving robustness to magnitude variations. Proxy-based methods introduce learnable class representatives, reducing computational cost while maintaining geometric structure.

Invariances and Equivariances

Representations should be invariant to task-irrelevant transformations and equivariant to task-relevant ones. Invariance means \(f_\theta(T(\mathbf{x})) = f_\theta(\mathbf{x})\) for transformation \(T\), while equivariance means \(f_\theta(T(\mathbf{x})) = T'(f_\theta(\mathbf{x}))\) for some induced transformation \(T'\) on representation space.

Geometric Interpretation: Invariance collapses geometric orbits, the sets of all transformed versions \(\{T(\mathbf{x})\}\), to single points in representation space. Equivariance preserves orbit structure but in a transformed coordinate system. These properties are encouraged by optimization when the training objective respects the symmetry: if every \(T(\mathbf{x})\) with \(T\) in a group \(G\) is trained toward the same target \(y\) as \(\mathbf{x}\), so that the optimum satisfies \(\mathcal{L}(f_\theta(T(\mathbf{x})), y) = \mathcal{L}(f_\theta(\mathbf{x}), y)\), then gradient descent is pushed toward approximately \(G\)-invariant representations.

Concrete Example: SimCLR learns representations invariant to color jittering and cropping by using these transformations to create positive pairs. If \(\mathbf{x}_i^+ = T(\mathbf{x}_i)\) for augmentation \(T\), the contrastive loss forces \(f_\theta(\mathbf{x}_i) \approx f_\theta(T(\mathbf{x}_i))\). After training, color changes and moderate crops should produce only small changes in representation: \(\|f_\theta(\mathbf{x}) - f_\theta(T(\mathbf{x}))\|\) is small relative to typical distances between different images.

Equivariance Example: A convolutional layer is translation-equivariant: shifting input pixels by \(\Delta\) shifts feature maps by \(\Delta\). This geometric property emerges from weight sharing, since the same filters are applied at all positions. Mathematically, if \(T_{\Delta}\) denotes spatial translation and \(f_\theta\) is convolutional, then \(f_\theta(T_{\Delta}(\mathbf{x})) = T_{\Delta}(f_\theta(\mathbf{x}))\). The identity is exact for stride-1 convolution with circular padding; boundary effects, striding, and pooling make it approximate in practical networks.
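Both properties can be checked numerically on a one-dimensional signal. The sketch below assumes circular (wrap-around) boundary handling and stride 1, the setting in which translation equivariance of a shared filter holds exactly, and uses global max pooling as a simple example of an invariant read-out; the signal and kernel are random placeholders.

```python
import numpy as np

def circular_conv(x, kernel):
    """1-D circular cross-correlation: the same kernel is applied at every position."""
    n = len(x)
    return np.array([sum(kernel[j] * x[(i + j) % n] for j in range(len(kernel)))
                     for i in range(n)])

rng = np.random.default_rng(0)
x = rng.normal(size=16)
kernel = rng.normal(size=3)
shift = 5

features = circular_conv(x, kernel)
features_of_shifted = circular_conv(np.roll(x, shift), kernel)

# Equivariance: shifting the input shifts the feature map by the same amount.
equivariant = bool(np.allclose(features_of_shifted, np.roll(features, shift)))

# Invariance: global max pooling discards position, so the pooled value is unchanged.
invariant = bool(np.isclose(features.max(), features_of_shifted.max()))
```

The pooled value illustrates how an equivariant layer followed by a symmetric reduction yields an invariant representation: the orbit of shifted inputs maps to a single number.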

Optimization of Symmetries: Imposing invariances through data augmentation is optimization-based. Each augmented example adds a term to the loss encouraging similar representations. With enough augmentations, optimization implicitly learns the symmetry. Explicit architectural invariances (convolution for translation, group-equivariant layers for rotations) hard-code symmetries, making optimization more efficient.

Geometric Trade-offs: Invariances reduce representation dimensionality—collapsing orbits requires fewer coordinates. This aids optimization (lower-dimensional spaces are easier to explore) but loses information (transformations become irreversible). The optimal balance depends on task requirements: classification benefits from strong invariances (only class identity matters), but reconstruction or generation needs weak invariances (transformations must be recoverable).

Optimization Bias and Feature Structure

Gradient-based optimization introduces implicit biases that shape learned features beyond what the loss explicitly specifies. These biases arise from the interplay between loss landscape geometry, network architecture, and optimization algorithm dynamics.

Simplicity Bias: Neural networks trained with gradient descent preferentially learn simple functions over complex ones that fit the data equally well. Geometrically, simple functions correspond to low-curvature solutions in parameter space and smooth representations in activation space. This bias manifests as representations that interpolate smoothly between training examples and generalize to nearby points, rather than memorizing training data with sharp discontinuities.

Concrete Example: A neural network trained to classify MNIST digits will learn features like “has loop at top” and “has vertical line” rather than memorizing pixel patterns. These features correspond to smooth functions of pixel space. The optimization trajectory naturally finds these simpler solutions first, even though more complex memorization solutions exist.

Lazy Training Regime: With certain initializations and learning rates, networks enter a “lazy” regime where representations change minimally during training. The initial random features suffice, and only the output weights adapt substantially. Geometrically, the representation manifold remains close to its random initialization, and task learning occurs by adjusting the linear read-out. This regime tends to produce weaker representations for transfer learning because features do not adapt to task structure.

Feature Learning Regime: With larger learning rates or different architectures, networks enter a “feature learning” regime where representations change substantially. The manifold geometry reshapes to align with task demands. This produces better transfer representations because the learned geometry reflects data structure rather than initialization accidents.

Concrete Example: As an illustration, ResNets trained on ImageNet with very small learning rates (<0.001) can exhibit minimal feature change: top layers adapt but early layers remain similar to initialization. With larger learning rates (0.1), early layers learn Gabor-like edge detectors, middle layers learn textures, and deep layers learn object parts. The geometric structure becomes hierarchical, with increasingly abstract features in deeper layers.

Architectural Constraints: Architecture imposes geometric constraints on learnable representations. ReLU networks compute piecewise-linear functions, so representation manifolds are flat within each activation region and decision boundaries are piecewise linear. Skip connections preserve geometric information by providing identity paths, preventing geometric degradation. Normalization layers remove scale degrees of freedom: batch normalization fixes the per-feature mean and variance across a batch, while layer normalization (before its learned affine map) places each representation on a sphere of fixed radius.

Optimization Algorithm Effects: Adam adapts per-parameter learning rates, effectively changing the metric on parameter space. This reshapes the optimization trajectory, biasing toward different representations than SGD. Momentum accumulates gradients, biasing toward solutions stable across mini-batches. These algorithmic choices alter not just convergence speed but the final geometric structure of representations.

Notation Summary

This section consolidates notation used throughout the chapter to aid quick reference.

  • \(\mathbf{x}\): Input data point.
  • \(\mathbf{z}=f_\theta(\mathbf{x})\): Representation or embedding produced by encoder \(f_\theta\).
  • \(d\): Representation dimension.
  • \(\Sigma\): Representation covariance matrix, \(\Sigma=\mathbb{E}[(\mathbf{z}-\mu)(\mathbf{z}-\mu)^\top]\).
  • \(\mu\): Mean of representations, \(\mu=\mathbb{E}[\mathbf{z}]\).
  • \(\lambda_i\): Eigenvalues of \(\Sigma\), ordered \(\lambda_1\ge \lambda_2\ge \cdots\).
  • \(r_{\text{eff}}\): Effective rank, typically \((\sum_i \lambda_i)^2/(\sum_i \lambda_i^2)\) or thresholded count depending on context.
  • \(G\): Group acting on inputs, with elements \(g\in G\).
  • \(\rho\): Group representation acting on embedding space for equivariance.
  • \(\pi\): Quotient map \(\mathcal{X}\to \mathcal{X}/G\).
  • \(\tau\): Temperature parameter in contrastive loss.
  • \(B\): Batch size in contrastive learning.
  • \(I(\mathbf{Z};Y)\): Mutual information between representations and labels.
  • \(\Sigma_B\): Between-class covariance in supervised settings.
  • \(L\): Lipschitz constant for encoder \(f_\theta\).
  • \(J_f\): Jacobian of encoder with respect to input.

Supplementary Proofs

This appendix provides brief supplements that clarify assumptions or fill in common proof steps referenced in the main text.

  1. Effective rank vs. thresholded rank: If \(r_{\text{eff}}=(\sum_i \lambda_i)^2/(\sum_i \lambda_i^2)\), then \(1\le r_{\text{eff}}\le d\) and \(r_{\text{eff}}\) is invariant under uniform scaling of \(\Sigma\). Thresholded rank depends on a numerical cutoff and is not scale-invariant.

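  2. Quotient-space metric: For a compact group \(G\) acting continuously by isometries, the induced metric \(d_Q([x],[y])=\inf_{g\in G} d(x,g\cdot y)\) is well-defined (independent of the chosen orbit representatives) and satisfies symmetry and the triangle inequality; compactness ensures the infimum is attained, so \(d_Q([x],[y])=0\) only when \(x\) and \(y\) lie in the same orbit.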

  3. InfoNCE separation bound sketch: With unit-norm embeddings and finite \(B\), softmax stationarity implies a gap between positives and negatives that scales with \(\tau\) and \(\log B\); the bound is limited by packing constraints on the unit sphere \(S^{d-1}\subset\mathbb{R}^d\).

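  4. Lipschitz stability: If \(f\) is differentiable and the operator norm satisfies \(\|J_f(x)\|\le L\) for all \(x\), then \(\|f(x+\delta)-f(x)\|\le L\|\delta\|\) by the mean value inequality (integrate \(J_f\) along the segment from \(x\) to \(x+\delta\)), yielding the worst-case stability bounds used in the chapter.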
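The bounds on effective rank stated in item 1 can be verified in two steps. A short derivation, assuming \(\Sigma \neq 0\) so that the eigenvalues satisfy \(\lambda_i \ge 0\) with at least one strictly positive:

\[ \sum_{i=1}^{d} \lambda_i^2 \;\le\; \Big(\sum_{i=1}^{d} \lambda_i\Big)^2 \;\le\; d \sum_{i=1}^{d} \lambda_i^2 \]

The left inequality holds because the cross terms \(\lambda_i \lambda_j\) are nonnegative, and the right inequality is Cauchy-Schwarz applied to \((\lambda_1,\dots,\lambda_d)\) and \((1,\dots,1)\). Dividing through by \(\sum_i \lambda_i^2\) gives \(1 \le r_{\text{eff}} \le d\), with \(r_{\text{eff}}=1\) exactly when a single eigenvalue is nonzero and \(r_{\text{eff}}=d\) exactly when all eigenvalues are equal. Replacing \(\Sigma\) by \(c\,\Sigma\) with \(c>0\) multiplies both the numerator and the denominator of \(r_{\text{eff}}\) by \(c^2\), which is the scale invariance claimed above.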

ML Implementation Notes

Practical considerations for implementing the diagnostics and experiments described in this chapter.

  • Normalization: Use consistent embedding normalization (e.g., \(\ell_2\)-normalize) before computing cosine-based metrics or comparing spectra across runs.
  • Centering: Always center embeddings before covariance estimation; uncentered covariance inflates the top eigenvalue and biases effective-rank estimates.
  • Sampling: For large datasets, subsample uniformly and report variability across multiple samples; tail eigenvalues are most sensitive to sample size.
  • Reproducibility: Fix random seeds and data order when comparing spectra across hyperparameter changes; otherwise differences may be stochastic.
  • Scale sensitivity: Thresholded rank depends on scale; prefer scale-invariant metrics (e.g., participation ratio) when comparing across checkpoints.
  • Numerical stability: Use \(\text{eigvalsh}\) for symmetric covariances and add small \(\epsilon\) in denominators for ratios.
  • Metrics in practice: Pair alignment-uniformity curves with spectral metrics to distinguish collapse from healthy compression.
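For the last note, one common formulation of alignment and uniformity can be computed directly from normalized embeddings. The sketch below assumes squared Euclidean distance for alignment and a Gaussian potential with scale \(t=2\) for uniformity; these specific choices, the function names, and the synthetic data are illustrative assumptions rather than definitions fixed by this chapter.

```python
import numpy as np

def l2_normalize(z, eps=1e-12):
    """Project each row onto the unit sphere."""
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)

def alignment(z, z_pos):
    """Mean squared distance between normalized positive pairs (lower is better aligned)."""
    z, z_pos = l2_normalize(z), l2_normalize(z_pos)
    return float(np.mean(np.sum((z - z_pos) ** 2, axis=1)))

def uniformity(z, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower is more uniform)."""
    z = l2_normalize(z)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(z), k=1)          # each unordered pair once, no self-pairs
    return float(np.log(np.mean(np.exp(-t * sq_dists[iu]))))

rng = np.random.default_rng(0)
spread = rng.normal(size=(200, 16))                              # directions cover the sphere
collapsed = np.ones((200, 16)) + 1e-3 * rng.normal(size=(200, 16))  # nearly one direction

uniformity_spread = uniformity(spread)
uniformity_collapsed = uniformity(collapsed)
alignment_collapsed = alignment(collapsed, collapsed[::-1])      # arbitrary pairs still "align"
```

The collapsed embeddings score almost perfectly on alignment even for arbitrary pairs, while their uniformity is near its worst value of zero; reading the two numbers together with effective rank is what separates collapse from a representation that is compact for good reasons.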