Chapter 17 — Limits of Optimization & The Future of ML

Overview

Purpose of the Chapter

Throughout the preceding sixteen chapters, we have explored machine learning as fundamentally an optimization problem: select a model class, define an objective function, and use gradient-based or combinatorial search methods to find parameters that minimize loss. This framing has proven extraordinarily productive. Stochastic gradient descent, reinforcement learning, variational inference, and countless derivative algorithms have enabled breakthroughs across computer vision, natural language processing, robotics, and scientific discovery.

However, the practical deployment of machine learning systems reveals a profound limitation that no amount of additional data or computational power can overcome: optimization is only as useful as the objective function we choose to optimize. When the objective is misaligned with human values, when multiple equally-valid objectives conflict, or when the problem structure itself is fundamentally underspecified, optimization does not find “the right solution”—it finds the best solution according to a proxy that may be deeply flawed.

This chapter explores the structural limits of optimization in machine learning. We examine cases where increasing model capacity, adding more training data, or applying more sophisticated algorithms actively makes things worse. We analyze why certain problems cannot be solved through optimization alone, no matter how much compute we dedicate. And we confront the uncomfortable reality that many real-world ML problems lie in a frontier where optimization is necessary but fundamentally insufficient.

The goal is not to abandon optimization—it remains essential—but to develop intellectual humility about what optimization can and cannot accomplish. By understanding these limits, practitioners can recognize when optimization is the wrong tool, when objectives need scrutiny, and when human judgment, domain expertise, or entirely different problem formulations are required.

Concrete ML Applications

Detecting Optimization-Limited vs Data-Limited Regimes

1. Concept summary: marginal-improvement comparisons reveal whether more solver effort or better data is the active bottleneck.
2. Problem statement: decide whether the next experiment budget should go to longer training or to data curation.
3. Problem setup: The team measures validation-loss improvement from two interventions applied separately at fixed model size: 20% more optimization steps and a curated data refresh of equal engineering cost. We compare the loss drops; the larger drop identifies the limiting factor for the next round.
4. Explicit values: current validation loss $L_0=0.420$, after extra steps $L_{\text{steps}}=0.401$, after data curation $L_{\text{data}}=0.362$.
5. Formula with symbols defined: marginal gain $\Delta L=L_0-L_{\text{new}}$, where $L_0$ is current validation loss and $L_{\text{new}}$ is validation loss after one intervention.
6. Plug-in step: $\Delta L_{\text{steps}}=0.420-0.401$, $\Delta L_{\text{data}}=0.420-0.362$.
7. Computed result: $\Delta L_{\text{steps}}=0.019$ and $\Delta L_{\text{data}}=0.058$.
8. Decision / interpretation: because curated data yields roughly three times the improvement, the project is more data-limited than optimization-limited right now.
9. Sensitivity check: if a new optimizer lowered loss to $0.350$, then $\Delta L_{\text{steps}}$ would become $0.070$, overtaking data curation and changing the recommendation.

Compute-Optimal Training Budgets for Frontier Experiments

1. Concept summary: compute planning is an optimization problem over expected loss reduction per unit budget.
2. Problem statement: choose between two frontier experiment configurations under the same GPU-hour budget.
3. Problem setup: The lab has already fit a local scaling-law proxy that predicts validation loss from candidate run configurations. Because both runs fit within the same allowed compute envelope, the decision reduces to selecting the configuration with lower predicted loss.
4. Explicit values: configuration A predicted loss $L_A=1.82$, configuration B predicted loss $L_B=1.74$, shared budget $C=200{,}000$ GPU-hours.
5. Formula with symbols defined: choose configuration $i$ minimizing predicted loss $L_i$, where $L_i$ is the scaling-law estimate under budget $C$.
6. Plug-in step: compare $L_A=1.82$ with $L_B=1.74$.
7. Computed result: $L_B-L_A=1.74-1.82=-0.08$, so configuration B is predicted to be lower by $0.08$ loss units.
8. Decision / interpretation: launch configuration B because it offers the better compute-normalized operating point under the current budget.
9. Sensitivity check: if the proxy uncertainty on B is $\pm0.10$ while A is $\pm0.02$, the advantage becomes ambiguous, which may justify a smaller pilot run before committing the full budget.

Early Termination Rules from Curvature and Noise Indicators

1. Concept summary: stopping rules should halt training when expected future gain is smaller than the cost of continuing.
2. Problem statement: determine whether to stop a run after epoch 42.
3. Problem setup: The team estimates future improvement from recent validation-loss slope and compares it to a minimum worthwhile gain threshold. If the projected gain over the remaining patience window is too small, the run is stopped and compute is reassigned.
4. Explicit values: recent validation losses over three epochs $(0.286,0.284,0.283)$, average per-epoch improvement $g=0.0015$, remaining patience window $P=4$ epochs, minimum worthwhile gain $g_{\min}=0.010$.
5. Formula with symbols defined: projected remaining gain $G=P\,g$, where $g$ is recent per-epoch validation improvement and $P$ is the remaining patience window.
6. Plug-in step: $G=4(0.0015)=0.006$.
7. Computed result: projected future gain is $0.006$.
8. Decision / interpretation: since $0.006 < 0.010$, the run should be terminated early because expected improvement is too small to justify more compute.
9. Sensitivity check: if the recent slope improved to $g=0.003$, then $G=4(0.003)=0.012$, above threshold, so training should continue.

Future-Proofing Pipelines with Algorithmic Optionality

1. Concept summary: modular optimizer interfaces create option value because better methods can be adopted at low migration cost.
2. Problem statement: determine whether adding optimizer modularity now is worth the engineering effort.
3. Problem setup: We compare the expected future savings from easier optimizer swaps against the one-time implementation cost of abstraction work. If expected savings exceed the engineering cost, the modular interface is justified on economic grounds.
4. Explicit values: abstraction implementation cost $C_0=40$ engineer-hours, expected number of optimizer upgrades next year $n=3$, migration cost without modularity $c_{\text{old}}=24$ hours per upgrade, migration cost with modularity $c_{\text{new}}=6$ hours per upgrade.
5. Formula with symbols defined: net savings $S=n(c_{\text{old}}-c_{\text{new}})-C_0$, where $n$ is number of future upgrades and $c_{\text{old}}, c_{\text{new}}$ are per-upgrade costs before and after modularization.
6. Plug-in step: $S=3(24-6)-40=3(18)-40$.
7. Computed result: $S=54-40=14$ engineer-hours.
8. Decision / interpretation: the interface work is worthwhile because it produces a positive expected savings while reducing future integration risk.
9. Sensitivity check: if only one optimizer upgrade is expected, then $S=1(18)-40=-22$, so the abstraction would not pay for itself.

Conceptual Scope

This chapter addresses five interconnected themes:

Underspecification and Non-Identifiability: The problem of multiple valid solutions that fit the data equally well but behave very differently on unseen examples or in deployment.
Objective Misalignment: The gap between the loss function we optimize (mathematical objective) and the actual goal we care about (human value). This gap grows as systems become more consequential and the world more complex.
Non-Convex Landscapes and Computational Tractability: Even when the problem is well-specified, the landscape of possible solutions may be so complex that finding the global optimum is computationally intractable. Moreover, local optima that differ dramatically may be equally likely under reasonable search strategies.
Capacity vs. Generalization: A seemingly paradoxical phenomenon where larger models with more parameters generalize better on some tasks yet worse on others. Understanding when to allocate resources to model size versus data size versus architectural inductive bias reveals deep structural trade-offs.
Compute-Constrained Learning and Scaling Laws: As model size and dataset scale approach limits of available compute, diminishing returns emerge. The relationship between compute, data, and model size is far from linear, and naive scaling breaks down.

Throughout, we maintain a distinction between algorithmic limits (limitations of our current optimization algorithms, potentially solvable with better methods) and structural limits (fundamental limitations imposed by mathematics and physics, intrinsic to the problem).

Questions This Chapter Answers

By completion of this chapter, you will have addressed the following core questions:

When does adding more data to a model make it worse at generalization, and why?
Why do two models with identical training loss sometimes make completely different predictions on new data?
What does “the objective was right but the algorithm failed” mean, and when is this actually true versus a convenient excuse?
How can we recognize when a machine learning objective is fundamentally misaligned with the actual goal?
Why do some problems have provably no efficient solution regardless of compute budget?
What is the relationship between model capacity, data availability, and actual generalization performance, and how do these interact?
When is optimization the wrong tool entirely, and what alternatives exist?
How do compute limitations change what’s feasible, and when does scaling actually help versus hurt?

How This Chapter Concludes the Optimization Arc

Chapters 1 through 16 built a framework around optimization: defining objectives (loss functions), understanding generalization (learning theory), managing capacity (regularization, network depth), and handling distribution shift (robustness, fairness, feedback). These chapters characterized what optimization can achieve under various constraints.

Chapter 17 shifts perspective. Rather than asking “How do I optimize better?” we ask “What are the conditions under which optimization, no matter how good, fails to solve the problem?” This reframing reveals that the challenge of machine learning at scale is not primarily an algorithmic challenge but a structural one. We cannot optimize our way out of fundamental ambiguity in what we want the model to do.

This conclusion is not pessimistic. Rather, it clarifies where the hard work actually lies: in problem formulation, in auditing objective functions, in understanding the consequences of choices made at the “meta-level” before optimization even begins. By understanding these limits, we can design systems that are not just optimized but wise.

Definitions

Definition 1: Underspecification

Definition: : A learning problem is underspecified if there exist distinct parameter vectors $\theta_1, \theta_2 \in \Theta$ with $\theta_1 \neq \theta_2$ such that they achieve the same global minimum on auxiliary objectives (training loss, a priori constraints, or other auxiliary criteria) but satisfy $h_{\theta_1}(x) \neq h_{\theta_2}(x)$ on a non-negligible subset of the input space.

Formally, let $\mathcal{L}_{\text{train}}(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(h_\theta(x_i), y_i)$ be the training loss. The problem is underspecified if there exist $\theta_1, \theta_2$ with: \[ \mathcal{L}_{\text{train}}(\theta_1) = \mathcal{L}_{\text{train}}(\theta_2) = \inf_\theta \mathcal{L}_{\text{train}}(\theta) \] yet their predictions diverge: $\delta := \mathbb{E}_{x \sim \mathcal{P}}[|h_{\theta_1}(x) - h_{\theta_2}(x)|] > \epsilon$ for some small threshold $\epsilon > 0$.
Assumptions: : - Training data is fixed and finite; underspecification arises from insufficient data information - The model class $\mathcal{M}$ is expressive enough to achieve zero or near-zero training loss - The true data distribution $\mathcal{P}$ is not fully specified by the training set; divergence is measured on the true data distribution or a held-out test set
Notation: : - $\theta_1, \theta_2$: distinct parameters achieving the same training loss - $\mathcal{L}_{\text{train}}$: empirical loss on training data - $\mathcal{P}$: true data distribution (may differ from training distribution) - $\delta$: divergence measure (e.g., $\ell_2$ distance, KL divergence, prediction disagreement rate)
Usage: : Underspecification captures the phenomenon that optimizing the training loss does not uniquely determine the model; many different models fit the data equally well but make different predictions in deployment. This is the core problem of the generalization gap: training loss alone provides insufficient information to distinguish models that will generalize well from those that will not.
Valid Example: : Logistic regression on a loan default prediction task with $n = 100,000$ applications and $d = 500$ features (income, credit score, employment history, etc.). The feature space is rich enough that many weight vectors $\theta$ achieve near-zero training loss. Two weight vectors might both perfectly classify the historical training data but assign different default probabilities to a new applicant with unusual feature combinations. Both are equally good by training loss; they differ on the true distribution.
Failure Case: : A scenario where underspecification does not occur is a problem with $d < n$ and extremely noisy labels. If the true relationship is deterministic (zero Bayes error) and the feature space is low-dimensional, gradient descent converges to a unique global minimum with high probability, and underspecification is absent.
Explicit ML Relevance: : Underspecification is the root cause of poor generalization, adversarial vulnerability, and fairness violations. Models that are equally good on training data may treat demographic groups differently, may fail under distribution shift, or may exploit spurious correlations. Understanding underspecification is essential for recognizing that optimization alone cannot solve these problems: we must choose among underspecified solutions using additional criteria (fairness constraints, robustness requirements, human judgment).

Definition 2: Non-Identifiability

Definition: : A parameter $\theta$ is non-identifiable if there exist distinct parameter vectors $\theta_1 \neq \theta_2$ that induce the same likelihood function: $p(Y | X; \theta_1) = p(Y | X; \theta_2)$ for all possible data $(X, Y)$ drawn from the data-generating process.

Equivalently, the Fisher Information Matrix is singular: \[ \mathcal{I}(\theta) = \mathbb{E}_{(X,Y) \sim \mathcal{P}}[\nabla_\theta \log p(Y | X; \theta) \nabla_\theta \log p(Y | X; \theta)^T] \] is rank-deficient (has zero eigenvalues).
Assumptions: : - The model is probabilistic with a well-defined likelihood $p(Y | X; \theta)$ - The parameter space is continuous (discrete non-identifiability cases exist but are treated separately) - Parameters are distinct but produce identical likelihoods
Notation: : - $p(Y | X; \theta)$: likelihood function parameterized by $\theta$ - $\mathcal{I}(\theta)$: Fisher Information Matrix - $\mathcal{P}$: true data distribution
Usage: : Non-identifiability means that no finite amount of data can distinguish between two parameters. Maximum likelihood estimation will converge to some point on the non-identified manifold, but there is no unique solution. This is distinct from underspecification (which arises from finite training data) in that non-identifiability is fundamental: even with infinite data, parameters are not identified.
Valid Example: : A mixture of two Gaussians with unknown labels. The model has parameters $(\mu_1, \sigma_1, \mu_2, \sigma_2, \pi_1, \pi_2)$ specifying two components and their mixing probabilities. However, swapping the component labels $(\mu_1, \sigma_1, \pi_1) \leftrightarrow (\mu_2, \sigma_2, \pi_2)$ produces identical likelihood, so the model is non-identified. Any EM algorithm will converge to some labeling, but there is no way to know which component is “truly” component 1 versus component 2.
Failure Case: : A linear regression model $Y = X \beta + \epsilon$ with continuously-varying $X$ and fixed design matrix is identifiable as long as $X^T X$ is invertible (full rank). Non-identifiability arises when columns of $X$ are perfectly collinear.
Explicit ML Relevance: : Non-identifiability in neural networks is subtle; the model may be non-identified even if the feature space is rich. Deep networks with multiple solutions achieving zero loss are a form of non-identifiability. This means that regularization, initialization, and optimization algorithm choice determine which solution is found, not the data. For deployment, this is concerning: different initializations might produce models with very different behavior, and we have no principled way to choose among them.

Definition 3: Capacity Threshold

Definition: : The capacity threshold for a learning problem is the critical model size $d^* = d^*(n, \epsilon, \delta)$ such that: - For $d < d^*$: the model cannot achieve empirical risk less than $\epsilon$ with probability $\geq 1 - \delta$, regardless of training set size or optimization algorithm - For $d \geq d^*$: the model can achieve empirical risk less than $\epsilon$ with probability $\geq 1 - \delta$ when trained on $n$ samples using a reasonable optimization algorithm

Formally, the threshold is defined as: \[ d^*(n, \epsilon, \delta) = \inf \{ d : \exists \theta \in \mathbb{R}^d \text{ such that } \mathcal{L}_{\text{train}}(\theta) \leq \epsilon \text{ w.p. } \geq 1-\delta \} \]
Assumptions: : - The model class has bounded VC-dimension or Rademacher complexity - The hypothesis class is realizable or nearly realizable by some member of $\mathcal{M}$ - Training is noise-free or noise is bounded
Notation: : - $d$: model capacity (number of parameters) - $n$: training set size - $\epsilon$: target accuracy level - $\delta$: confidence parameter - $\mathcal{L}_{\text{train}}$: empirical risk
Usage: : The capacity threshold characterizes the minimum model size needed to fit a dataset. Below this threshold, even perfect optimization yields high training error. This threshold is problem-dependent: simple datasets (linearly separable) have low thresholds; complex datasets (high-dimensional, nonlinear) have high thresholds. This is distinct from the generalization threshold (which determines when we can fit without overfitting).
Valid Example: : Binary classification on a dataset of 1000 points in 100 dimensions. A linear classifier may have capacity threshold $d^* = O(1)$ (the separating hyperplane is determined by a few parameters). A high-degree polynomial classifier has higher threshold $d^* = O(100)$ (more flexibility needed to fit the data).
Failure Case: : The threshold does not apply if the model is misspecified, i.e., no member of $\mathcal{M}$ can fit the data well. For instance, trying to fit nonlinear data with a linear model: increasing capacity (adding more linear features) never yields low training loss.
Explicit ML Relevance: : Understanding the capacity threshold helps practitioners allocate resources. If the capacity threshold is very high (model is too small for the problem), adding more data does not help; you must increase model size. Conversely, if you have enough capacity to fit the training data, further increasing capacity risks overfitting unless you also increase data or regularization.

Definition 4: VC Dimension (Revisited and Extended)

Definition: : The Vapnik-Chervonenkis (VC) dimension of a hypothesis class $\mathcal{H}$ is the size of the largest finite subset $S \subseteq \mathcal{X}$ that can be shattered by $\mathcal{H}$, where a set $S$ is shattered if for every subset $S_0 \subseteq S$, there exists a hypothesis $h \in \mathcal{H}$ such that $h$ correctly classifies all points in $S_0$ as positive and all points in $S \setminus S_0$ as negative.

Formally: \[ \text{VC-dim}(\mathcal{H}) = \max\{|S| : \{h(s) : h \in \mathcal{H}, s \in S\} = 2^S\} \]

where the right-hand side denotes all $2^{|S|}$ possible binary labelings.
Assumptions: : - The hypothesis class is binary-valued (classifies into $\{0, 1\}$): extension to multiclass and regression exists but changes the definition - The input space is arbitrary (finite or infinite)
Notation: : - $\mathcal{H}$: hypothesis class - $\mathcal{X}$: input space - $S$: finite subset of $\mathcal{X}$ - $2^S$: power set (all subsets of $S$)
Usage: : VC dimension quantifies the sample complexity for a hypothesis class: a class with VC-dimension $d$ requires $O(d / \epsilon^2)$ samples to achieve $\epsilon$-accurate generalization with high probability. Higher VC-dimension means more samples needed. For neural networks, VC-dimension grows roughly linearly with the number of parameters (though the exact growth depends on network depth and architecture).
Valid Example: : The class of linear separators in $\mathbb{R}^k$ has VC-dimension $k+1$. This is because any $k+1$ points in general position in $\mathbb{R}^k$ can be shattered (by choosing different hyperplanes for each labeling), but some $k+2$ points cannot be shattered (there are labelings of $k+2$ points that are not linearly separable, regardless of the hyperplane).
Failure Case: : The class of all functions from a finite set $\mathcal{X}$ to $\{0, 1\}$ has VC-dimension $|\mathcal{X}|$ (it shatters any subset because it contains all functions). This unbounded VC-dimension explains why memorization-based methods have poor generalization.
Explicit ML Relevance: : Chapter 1 introduced VC-dimension as the fundamental determinant of learnability. Here we revisit it in the context of optimization limits: even with perfect optimization (finding the hypothesis that achieves the lowest training loss), generalization is bounded by the VC-dimension. A model with very high VC-dimension requires exponentially more data to generalize, a fundamental limit.

Definition 5: Benign Overfitting

Definition: A learning algorithm exhibits benign overfitting if it simultaneously achieves: 1. Zero or near-zero training loss: $\mathcal{L}_{\text{train}}(\hat{\theta}) \approx 0$ 2. Small test loss: $\mathcal{L}_{\text{test}}(\hat{\theta}) \approx L^*$ where $L^*$ is the optimal achievable test loss (Bayes optimal loss)
Assumptions: - The model is in the overparameterized regime: $d \gg n$ (parameters far exceed samples) - Training is noise-free or nearly noise-free - The algorithm has implicit regularization (e.g., early stopping, small learning rates) that selects among interpolating solutions
Notation: - $\hat{\theta}$: estimated parameters after training - $\mathcal{L}_{\text{train}}, \mathcal{L}_{\text{test}}$: training and test loss - $L^*$: optimal (Bayes) loss - $\sigma^2$: noise in the data
Usage: Benign overfitting challenges classical intuition that interpolation (fitting training data perfectly) leads to poor generalization. In modern deep learning, interpolation is common and often achieves good generalization. The key difference from harmful overfitting is that the interpolating solution is found by an algorithm with implicit regularization (not arbitrary interpolation), and the excess loss decreases as noise decreases (overfitting to noise is minimal).
Valid Example: A deep neural network trained on 100,000 high-resolution images to convergence fits all training images perfectly (zero training error) yet achieves high test accuracy (90%+). The network has interpolated the training data but generalizes well because gradient descent implicitly regularizes toward solutions with good representation learning.
Failure Case: A random interpolation method (e.g., fitting a high-degree polynomial to 100 points by choosing random coefficients until it interpolates) exhibits harmful overfitting: it perfectly fits training data but generalizes catastrophically. The key difference is that SGD’s implicit regularization steers toward good solutions, but randomized interpolation does not.
Explicit ML Relevance: Benign overfitting explains why modern deep learning succeeds despite high capacity: it is not that overfitting is impossible, but that the right algorithms (SGD, dropout, batch norm) achieve a form of overfitting that is benign. Understanding the conditions for benign overfitting helps practitioners know when to use high-capacity models and when to regularize more aggressively.

Definition 6: Double Descent (Limit Case)

Definition: The double descent phenomenon is a non-monotonic relationship between model capacity $d$ and test error: test error first decreases as $d$ increases (classical U-shaped risk curve), reaches a minimum at some $d_1$, then increases (the “classical” overfitting regime) until a second threshold $d_2$ (the interpolation threshold), after which test error decreases again as $d$ continues to increase.
Assumptions: - The data has a signal $y_i = x_i^T \beta^* + \xi_i$ where $\xi_i$ is small noise - The algorithm is gradient descent with appropriate learning rate schedule - Early stopping or other implicit regularization is applied
Notation: - $d$: model capacity (number of parameters) - $n$: training set size - $L_{\text{test}}(d)$: test loss as a function of capacity - $d_1$: classical overfitting threshold - $d_2 > d_1$: interpolation threshold
Usage: Double descent reveals that the bias-variance trade-off is more nuanced than classical statistics suggests. In the overparameterized regime, higher capacity can improve generalization despite fitting training noise, because the implicit regularization in gradient descent steers toward solutions that generalize. Practitioners should not fear high-capacity models; in the overparameterized regime, regularization (early stopping, dropout) prevents harmful overfitting while capacity enables representation learning.
Valid Example: Train neural networks of increasing width on a fixed classification dataset. Test error follows a U-shape up to classical overfitting region (modest networks), then degrades; but as networks become very wide (100× more parameters than samples), test error improves again. This non-monotonic curve is double descent.
Failure Case: Linear regression in the classical regime $d < n$ does not exhibit double descent; test error is monotonic in capacity (U-shaped, not W-shaped). Double descent requires overparameterization.
Explicit ML Relevance: Double descent explains a surprising empirical fact: larger deep networks often generalize better. This changes how practitioners think about capacity: in the overparameterized regime (which describes modern deep learning), bigger is not always worse; the implicit regularization of gradient descent makes large models viable. However, this assumes proper hyperparameter tuning and does not apply if the objective is fundamentally misaligned.

Definition 7: Optimization Bias

Definition: The optimization bias of an algorithm $\mathcal{A}$ is the systematic difference between the solutions it finds and the loss-minimizing solutions. Formally, let $\Theta^*(\epsilon) = \{\theta : \mathcal{L}(\theta) \leq \min_\theta \mathcal{L}(\theta) + \epsilon\}$ be the set of $\epsilon$-optimal solutions. The optimization bias is:
Assumptions: - The loss landscape has identifiable global optima or $\epsilon$-optimal regions - The algorithm is run from random initialization - The loss function is deterministic (not stochastic); optimization bias refers to the solution bias, not estimation error
Notation: - $\mathcal{A}$: optimization algorithm (e.g., SGD, Adam, Newton’s method) - $\hat{\theta}_{\mathcal{A}}$: solution found by algorithm $\mathcal{A}$ - $\Theta^*(\epsilon)$: set of $\epsilon$-optimal parameters - $\mathcal{L}$: loss function
Usage: Optimization bias captures the idea that the choice of algorithm systematically biases which solution is found. Two algorithms might both find a point of low training loss, but the solutions differ due to the algorithms’ inherent biases. For instance, gradient descent is biased toward solutions reachable by continuous paths from random initialization and toward solutions in flat minima (due to early stopping); second-order methods bias toward different solutions.
Valid Example: Gradient descent on a non-convex loss landscape with multiple local minima. Different learning rates and initializations cause SGD to converge to different local minima. The set of minima SGD reaches is a biased subset of the set of all local minima. This bias is the optimization bias.
Failure Case: In convex optimization with initialization at the same point, algorithms like gradient descent and Newton’s method converge to the same global minimum regardless of learning rate (up to numerical precision). Optimization bias is negligible.
Explicit ML Relevance: Optimization bias explains why hyperparameter tuning matters: learning rate, batch size, and initialization determine which part of the loss landscape is explored. In underspecified problems (many solutions equally good on training data), optimization bias determines which solution is deployed, affecting fairness, robustness, and other properties. Different practitioners might deploy different models simply by tuning the optimization algorithm differently, even on the same data and architecture.

Definition 8: Induced Inductive Bias

Definition: The induced inductive bias of an algorithm $\mathcal{A}$ is the preference the algorithm exhibits, across different function classes and datasets, for solutions with specific properties, such as simplicity, smoothness, or sparsity, even when those properties are not explicitly enforced through regularization.
Assumptions: - The algorithm is run without explicit regularization (or with minimal regularization) - Solutions are comparatively evaluated on the same loss scale - The complexity measure $\mathcal{C}$ is well-defined and comparable across solutions
Notation: - $\mathcal{A}$: algorithm (e.g., SGD with specific hyperparameters) - $\hat{\theta}_{\mathcal{A}}$: solution found by $\mathcal{A}$ - $\mathcal{C}$: complexity measure (norm, sparsity, etc.) - $\theta^*$: loss-optimal solution
Usage: Induced inductive bias formalizes the observation that algorithms like SGD implicitly prefer certain types of solutions even without explicit penalties. Gradient descent with small learning rates and early stopping implicitly prefers smooth, simple solutions. Dropout implicitly prefers solutions robust to feature perturbations. These implicit preferences are not specified by the loss function but are inherited from the algorithm.
Valid Example: Neural networks trained with SGD exhibit implicit preference for hierarchical feature representations (low-level detectors at early layers, abstract concepts at later layers). This hierarchy is not explicitly enforced by the loss function; it emerges from the structure of gradient descent applied to deep networks.
Failure Case: If the true underlying function is deliberately complex (e.g., requires all parameters to have large magnitude to encode a simple pattern), algorithms that implicitly prefer simplicity will fail to find the true function, even with sufficient capacity. The induced bias becomes a liability.
Explicit ML Relevance: Understanding induced bias is crucial for knowing when an algorithm will work well. SGD’s bias toward simple solutions works well for vision and language tasks (where true functions likely are hierarchical and smooth) but might fail for adversarially-constructed problems. Domain knowledge about the likely structure of true functions helps match algorithm biases to problem structure.

Definition 9: Expressivity Bound

Definition: An expressivity bound is a limit on the set of functions a model class can represent. Formally, let $\mathcal{H}$ be a hypothesis class and let $\mathcal{F}$ be the set of all possible functions from the input space $\mathcal{X}$ to the output space $\mathcal{Y}$. A lower bound on expressivity states: \[ |\{h \in \mathcal{H} : \forall x \in D, h(x) = f(x) \text{ for some target } f \}| < |\mathcal{F}| \]
Assumptions: - The model class has finite capacity (bounded parameter space) or can be parameterized finitely - We are measuring representational expressivity (ignoring optimization difficulty)
Notation: - $\mathcal{H}$: hypothesis class - $\mathcal{X}, \mathcal{Y}$: input and output spaces - $\mathcal{F}$: set of all possible functions - $D$: dataset or domain of interest
Usage: An expressivity bound provides a negative result: there exist functions that no model in the class can represent, regardless of the data or optimization algorithm. This is a fundamental limit: if the true function is outside the class, no learning algorithm can do better than the best approximation. This is distinct from optimization or generalization limits.
Valid Example: Linear classifiers have bounded expressivity in nonlinear problems. The set of functions representable by linear models is strictly smaller than all possible classifiers. Specifically, XOR cannot be represented by any linear classifier, regardless of training data or optimization. We must use a nonlinear model (e.g., shallow neural network) to express XOR.
Failure Case: A universal approximation theorem (like that of neural networks) states that networks with a single hidden layer can represent any continuous function (given sufficient width). This is a positive expressivity result: the bound is not restrictive for continuous functions. However, the width required may be exponential, creating a practical limit.
Explicit ML Relevance: Expressivity bounds tell when a model class is fundamentally insufficient for a problem. If the true underlying function requires nonlinear representations and we insist on linear models, increasing data or improving optimization does not help. We must change the model class. Recognizing when expressivity is the limiting factor (versus optimization or generalization) changes how we troubleshoot poor performance.

Definition 10: Computational Complexity Limit

Definition: A computational complexity limit for a learning problem is a formal result stating that no polynomial-time algorithm can solve the problem (optimize the loss) unless a complexity conjecture (typically $\mathcal{P} \neq \mathcal{NP}$), fails. Formally, a problem $\mathcal{P}$ has a computational complexity limit if: \[ \exists k : \text{OptimalSolution}(\mathcal{P}) \in \text{coNP-complete} \text{ or } \text{OptimalSolution}(\mathcal{P}) \in \text{PH-hard} \] where coNP-complete or PH-hard means the decision version of the problem is hard for that complexity class.
Assumptions: - We are interested in worst-case complexity (not average-case) - Polynomial-time means polynomial in the input size (data dimension and sample size) - Standard complexity classes ($\mathcal{P}, \mathcal{NP}, \mathcal{NP}$-hard) are well-defined
Notation: - $\mathcal{P}$: learning problem (e.g., feature selection with $k$ features, hyperparameter tuning) - $\text{OptimalSolution}(\mathcal{P})$: decision problem corresponding to finding an optimal solution - $\text{NP}, \text{coNP}, \mathcal{P}$: complexity classes
Usage: A computational complexity limit implies that no polynomial-time algorithm (unless $\mathcal{P} = \mathcal{NP}$, which is widely believed to be false) can solve the optimization problem to optimality. We must either: (1) accept approximation algorithms with bounded approximation guarantees, (2) restrict to special cases with favorable structure, or (3) use heuristics with no worst-case guarantees.
Valid Example: Feature subset selection is NP-hard: selecting the $k$ best features from $d$ available features requires evaluating up to $\binom{d}{k}$ subsets in the worst case, which is exponential in $d$. No polynomial-time algorithm provably finds the optimal subset (unless $\mathcal{P} = \mathcal{NP}$). Greedy feature selection is a polynomial-time heuristic with no worst-case guarantee.
Failure Case: Problems with efficient polynomial-time algorithms do not have computational complexity limits (e.g., linear regression, convex optimization with polynomial-time-solvable objectives). These problems are solvable optimally in polynomial time.
Explicit ML Relevance: When a learning problem is computationally hard, practitioners must accept one of three uncomfortable truths: (1) the heuristic they use (like greedy feature selection) might be far from optimal, (2) the problem must be restricted (use domain knowledge to reduce the search space), or (3) the objective must be relaxed (optimize an easier proxy). Understanding when a problem is computationally hard motivates using domain expertise to make it tractable.

Definition 11: Scaling Saturation

Definition: A model exhibits scaling saturation at capacity $d^*$ if the rate of loss improvement with increasing model capacity drops below a critical threshold, formally: \[ \left| \frac{d\mathcal{L}_{\text{test}}}{dd} \bigg|_{d=d^*} \right| < \gamma \] for some small $\gamma > 0$, even as additional data continues to be available. Equivalently, loss becomes insensitive to further capacity increases: \[ \mathcal{L}_{\text{test}}(d^* + \Delta d) \approx \mathcal{L}_{\text{test}}(d^*) \text{ for moderate } \Delta d \]
Assumptions: : - The model has access to sufficient data (not data-limited) - Capacity $d$ is increased while data size remains fixed - The architecture and optimization are held constant
Notation: : - $\mathcal{L}_{\text{test}}$: test loss as a function of capacity - $d$: model capacity - $d^*$: saturation point - $\gamma$: threshold for “significant” improvement
Usage: : Scaling saturation indicates that the problem has reached a “ceiling” imposed by factors other than model capacity. Possible causes include: (1) the model architecture has reached its expressivity limit for the task, (2) the data distribution is simple and adequately captured by current capacity, or (3) the objective is fundamentally misaligned, and capacity cannot help. Scaling saturation is a sign to stop increasing model size and instead focus on data quality, objective specification, or architectural changes.
Valid Example: : Training ResNet-50 on CIFAR-10 (image classification, 60,000 images). Test accuracy reaches 95% and plateaus despite further increases in model capacity or training time. Saturation has been reached; the remaining 5% error is not due to insufficient capacity but to label noise, distribution shift, or other factors.
Failure Case: : A model in the underparameterized regime (fewer parameters than samples) might exhibit saturation at low capacity, but this is not true scaling saturation; it is expressivity limitation. Adding sufficient capacity would allow improvement.
Explicit ML Relevance: : Practitioners often continue scaling models hoping for continued improvements. Recognizing scaling saturation signals the need for a different approach: perhaps the objective is misaligned, the data is insufficient, or the architecture needs redesign. Blindly scaling beyond saturation wastes compute without benefit.

Definition 12: Functional Redundancy

Definition: A neural network exhibits functional redundancy if multiple distinct sets of parameters produce nearly identical input-output mappings. Formally, given a network $h_\theta(x)$ parameterized by $\theta$, there exist distinct $\theta_1 \neq \theta_2$ such that: \[ \mathbb{E}_{x \sim \mathcal{P}}[|h_{\theta_1}(x) - h_{\theta_2}(x)|] < \epsilon \] for small $\epsilon > 0$, yet $\|\theta_1 - \theta_2\| \gg \epsilon$ (the parameters are far apart).
Assumptions: - The parameter distance is measured in some norm (e.g., Euclidean norm) - Input-output similarity is measured in appropriate metric (e.g., $\ell_2$ or $\ell_\infty$) - The input distribution $\mathcal{P}$ is fixed
Notation: - $\theta_1, \theta_2$: distinct parameter vectors - $h_\theta$: neural network function parameterized by $\theta$ - $\mathcal{P}$: input distribution - $\epsilon$: threshold for functional similarity
Usage: Functional redundancy means that the parameter space has “flat” directions: changing parameters in certain directions has minimal effect on the function. This explains why neural networks are overparameterized yet generalize: not all parameters matter equally. The effective dimensionality (number of directions that meaningfully affect the function) is much smaller than the nominal parameter count.
Valid Example: A ReLU network with a hidden layer that is far wider than necessary can reduce hidden unit weights by a constant factor and scale up subsequent layer weights by the same factor, producing identical output. These redundant parameterizations create flat directions in the loss landscape.
Failure Case: A linear network (no nonlinearity) in the full-rank regime has minimal functional redundancy: each parameter meaningfully affects the output. Adding extra width does not create redundancy if the singular values are distinct.
Explicit ML Relevance: Functional redundancy explains why networks with billions of parameters can generalize and why regularization methods like dropout and pruning are effective: they exploit the fact that many parameters are redundant. Networks can be compressed significantly without loss of function.

Definition 13: Sparse Activation Limit

Definition: A neural network exhibits a sparse activation limit for a given input distribution $\mathcal{P}$ if the fraction of active units (those with non-zero activation) is bounded away from one, formally: \[ \mathbb{E}_{x \sim \mathcal{P}}[\text{ActiveFraction}(x)] = \mathbb{E}_{x \sim \mathcal{P}}\left[\frac{1}{h} \sum_{j=1}^h \mathbb{1}[z_j(x) \neq 0]\right] \leq \rho < 1 \] where $z_j(x)$ is the $j$-th unit’s pre-activation value and $h$ is the number of units. The fraction $\rho$ is the sparse activation limit.
Assumptions: - The network uses sparse activations (e.g., ReLU, not sigmoid) - We are measuring activations on a specific input distribution $\mathcal{P}$, not worst-case - Active means non-zero pre-activation or post-activation (convention varies)
Notation: - $\rho$: sparse activation limit (0 to 1) - $z_j(x)$: pre-activation value of unit $j$ on input $x$ - $h$: total number of units - $\mathcal{P}$: input distribution
Usage: Many deep networks (especially with ReLU activations) are not dense; they activate only a fraction of their units per example. This sparsity limits the effective capacity used for any single input. A network with 1 billion parameters might activate only 100 million on a typical input, effectively using 10% of its capacity. This sparse activation limit bounds the representational capacity available for any single prediction.
Valid Example: A MLP with ReLU activations on natural images: on average, 40% of hidden units are active (ReLU fires = positive value). The sparse activation limit is 0.4. Adding more inactive units does not directly improve representation for individual examples.
Failure Case: A network with 0.9 sparsity could still have high capacity if the sparsity pattern varies across examples: different inputs might activate different subsets of units. However, if sparsity is structured (always the same units are active), capacity is severely constrained.
Explicit ML Relevance: Sparse activation limits help explain when network scaling helps and when it does not. Networks with low sparse activation limits might not benefit from increased width; the effective capacity per example is unchanged. Alternatively, sparsity creates an opportunity: if different examples activate different units, the network achieves high effective capacity without being dense.

Definition 14: Curvature Explosion

Definition: A loss landscape exhibits curvature explosion at a region if the Hessian (matrix of second-order partial derivatives) has either very large or very small eigenvalues, indicating sharp curvature in some directions and flat curvature in others. Formally, let $\mathcal{H}(\theta) = \nabla^2 \mathcal{L}(\theta)$ be the Hessian of the loss at parameters $\theta$. Curvature explosion occurs if: \[ \kappa(\theta) := \frac{\lambda_{\max}(\mathcal{H}(\theta))}{\lambda_{\min}(\mathcal{H}(\theta))} \gg 1 \] where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues, and $\kappa$ is the condition number.
Assumptions: - The loss is twice-differentiable - The Hessian exists and is well-defined - We are interested in the local curvature, not global landscape properties
Notation: - $\mathcal{H}(\theta)$: Hessian matrix at $\theta$ - $\lambda_{\max}, \lambda_{\min}$: largest and smallest eigenvalues - $\kappa(\theta)$: condition number (ratio of curvatures) - $\mathcal{L}$: loss function
Usage: High condition number means the loss landscape is elongated: small steps in some directions lead to large loss changes, while large steps in other directions have minimal effect. This makes optimization difficult: gradient descent requires a small learning rate to avoid diverging in steep directions, but then makes slow progress in flat directions. Second-order methods (computing the Hessian inverse) can help but are computationally expensive.
Valid Example: Training a neural network on a noisy dataset: the Hessian at the optimum often has condition number 10^6 or higher, indicating extreme elongation of the loss landscape. This necessitates careful learning rate tuning and adaptive methods (Adam, RMSprop).
Failure Case: A strongly convex objective (quadratic with positive-definite Hessian) can have condition number close to 1, indicating smooth loss landscape. Gradient descent converges in iteration count proportional to the condition number; low condition numbers enable fast convergence.
Explicit ML Relevance: Curvature explosion is an algorithmic, not structural, limit. Better optimization algorithms (second-order methods, adaptive learning rates) can partially mitigate curvature explosion. However, when the condition number is astronomical, even second-order methods struggle. Understanding curvature is essential for choosing appropriate optimization hyperparameters.

Definition 15: Hessian Degeneracy

Definition: A point $\theta$ exhibits Hessian degeneracy if the Hessian matrix $\mathcal{H}(\theta) = \nabla^2 \mathcal{L}(\theta)$ is not full-rank, i.e., has one or more zero eigenvalues: \[ \text{rank}(\mathcal{H}(\theta)) < \text{dim}(\theta) \]

Equivalently, there exist non-zero directions $v \in \mathbb{R}^d$ such that $v^T \mathcal{H}(\theta) v = 0$ (no curvature in that direction).
Assumptions: : - The loss is twice-differentiable at $\theta$ - The Hessian is well-defined (not approximated)
Notation: : - $\mathcal{H}(\theta)$: Hessian matrix at $\theta$ - $\text{rank}$: matrix rank (number of linearly independent rows/columns) - $\text{dim}(\theta)$: dimensionality of the parameter space
Usage: : Hessian degeneracy indicates flat directions in the loss landscape: you can move in certain parameter directions without changing the loss (to first or second order). This is common at critical points (where gradient is zero) in overparameterized models. Degenerate Hessians make second-order optimization methods unstable because they rely on inverting the Hessian, which is impossible when the Hessian is singular.
Valid Example: : A neural network with ReLU activations trained to convergence often has a degenerate Hessian. Some units may be inactive (always output zero), creating flat directions: changing their parameters does not affect the loss. Similarly, redundant units (whose outputs are linear combinations of other units) create flat directions.
Failure Case: : A strictly-convex loss function (e.g., ridge regression with positive regularization) has a full-rank Hessian at all points. No flat directions exist; the loss is strictly curved in all directions.
Explicit ML Relevance: : Hessian degeneracy explains why second-order optimization methods (Newton’s method, natural gradient) can fail in deep learning: the Hessian is often singular or nearly singular. Algorithms that rely on Hessian inversion (computing solutions to $\mathcal{H}^{-1} g$ where $g$ is the gradient) are unreliable. Modern methods like Adam avoid this by using diagonal approximations to the Hessian.

Definition 16: Stability Bound

Definition: A learning algorithm $\mathcal{A}$ has stability bound $\beta$ if changing any single training example causes the algorithm’s output to change by at most $\beta$ in expectation. Formally, let $\mathcal{A}(S) = \hat{\theta}$ denote the parameters found by $\mathcal{A}$ on dataset $S$, and let $S^{(i)}$ denote the dataset with the $i$-th example replaced. Then: \[ \mathbb{E}_{S, S^{(i)}}[|\mathcal{A}(S) - \mathcal{A}(S^{(i)})|] \leq \beta \]
Assumptions: - The replacement example is arbitrary (worst-case across all possible examples) - The expectation is over randomness in the algorithm and data sampling - The distance $|\cdot|$ is in an appropriate metric (e.g., parameter norm or loss)
Notation: - $\mathcal{A}$: learning algorithm - $S$: training dataset - $S^{(i)}$: dataset with $i$-th example changed - $\beta$: stability parameter (typically small, $O(1/n)$)
Usage: Stability bounds characterize how sensitive a learning algorithm is to individual training examples. Algorithms with small stability bounds are robust: changing a single example minimally affects the learned model. This robustness to perturbation implies good generalization: if removing any single example does not drastically change the model, the model does not overfit to individual examples.
Valid Example: Gradient descent with regularization has stability bound $\beta = O(1/n)$, where $n$ is the dataset size. The bound decays with data size, indicating that with more data, the algorithm becomes more stable.
Failure Case: Memorization (fitting individual examples exactly) has poor stability: removing one example can change the memorized mapping significantly. Stable algorithms do not memorize.
Explicit ML Relevance: Stability bounds provide generalization guarantees complementary to VC-dimension-based bounds. An algorithm with small stability bound is guaranteed to generalize, even if its VC-dimension is high. Differential privacy (adding noise to protect individual records) can be understood as enforcing stability: the algorithm’s output is insensitive to individual records.

Definition 17: Regime Shift Boundary

Definition: A regime shift boundary is a critical value of a problem parameter (data size $n$, model capacity $d$, noise level $\sigma$, etc.) at which the optimal algorithm or solution type changes qualitatively. Formally, let $\lambda$ be a problem parameter (e.g., ratio $d/n$), and let $\theta^*(\lambda)$ be the optimal solution as a function of $\lambda$. A regime shift boundary $\lambda^*$ is a point where: \[ \Delta := \lim_{\lambda \to \lambda^*-} \theta^*(\lambda) \neq \lim_{\lambda \to \lambda^*+} \theta^*(\lambda) \] i.e., the optimal solution type changes discontinuously or has a phase transition.
Assumptions: - The problem parameter $\lambda$ varies continuously - We can compute or approximate optimal solutions as functions of $\lambda$ - The transition is not merely quantitative (larger loss) but qualitative (different algorithm type or solution class)
Notation: - $\lambda$: problem parameter (e.g., $d/n$, ratio of model capacity to data size) - $\theta^*(\lambda)$: optimal solution as a function of $\lambda$ - $\lambda^*$: regime shift boundary
Usage: Regime shifts indicate that as a problem scales, the optimal approach changes fundamentally. Below the boundary, one strategy is best; above it, a different strategy is superior. This has profound implications for practitioners: scaling a solution that was optimal at small scale does not work at large scale; you must change approaches at regime boundaries.
Valid Example: The transition from classical regime ($n > d$) to overparameterized regime ($d \gg n$) is a regime shift boundary. In the classical regime, regularization and capacity control are essential; in the overparameterized regime, implicit regularization takes over and capacity control is less important. At the boundary ($d \approx n$), the optimal strategy changes qualitatively.
Failure Case: A continuous quantitative change (e.g., loss decreases as data increases) is not a regime shift; it is a smooth scaling law. Regime shifts are characterized by discontinuous or qualitative changes.
Explicit ML Relevance: Understanding regime shift boundaries helps practitioners know when to change strategies. Scaling a model linearly with data works in the classical regime but fails at the boundary; you must change the optimization strategy, add implicit regularization, or adjust hyperparameters. Recognizing when you are crossing a regime boundary enables proactive changes rather than reactive debugging.

Definition 18: Generalization Collapse

Definition: Generalization collapse occurs when test loss increases sharply (suddenly becomes much larger than training loss) despite training loss remaining stable or decreasing. Formally, there exist sequence of iterations $t_1 < t_2 < \ldots$ such that: \[ \mathcal{L}_{\text{train}}(t_i) \approx \mathcal{L}_{\text{train}}(t_{i-1}) \text{ or } \mathcal{L}_{\text{train}}(t_i) < \mathcal{L}_{\text{train}}(t_{i-1}) \] yet: \[ \mathcal{L}_{\text{test}}(t_i) \gg \mathcal{L}_{\text{test}}(t_{i-1}) \]
Assumptions: - Training and test sets are from different distributions or have different characteristics - The collapse happens sharply, not gradually (distinguishing from ordinary overfitting) - Training quality is maintained while test performance degrades
Notation: - $\mathcal{L}_{\text{train}}, \mathcal{L}_{\text{test}}$: training and test loss - $t$: iteration or epoch - $\Delta_{\text{gen}}$: generalization gap
Usage: Generalization collapse is an indicator of distribution shift or that the model has learned brittle, overfitted solutions. Unlike ordinary overfitting (which happens gradually as capacity increases), collapse happens suddenly, suggesting a qualitative change in what the model is learning. This is often a sign of serious problems: the model has fit spurious correlations that do not generalize, or the train-test distribution mismatch has suddenly become apparent.
Valid Example: Training an image classifier where the training set has image artifacts (e.g., all cancer images have a watermark, all normal images do not) while the test set does not. The model learns to exploit the watermark, achieving perfect training accuracy. If the test set is revealed partway through training, after the model has learned the watermark, generalization collapse occurs sharply.
Failure Case: A model with poor generalization from the start (high test loss from epoch 1) is not a generalization collapse; it is initial overfitting. Collapse is the sudden worsening during training.
Explicit ML Relevance: Detecting generalization collapse is a red flag indicating that the model is learning something spurious. When this occurs, practitioners should: (1) audit the training-test split for distribution differences, (2) visualize what features the model is using, (3) check for data quality issues (labels, artifacts), or (4) simplify the model. Continuing to train past a generalization collapse point is counterproductive.

Definition 19: Optimization–Representation Gap

Definition: The optimization–representation gap is the difference between the best achievable loss (in the representation space of the model) and the loss actually achieved by the optimization algorithm. Formally, let $\mathcal{H}$ be the hypothesis class and $\mathcal{L}^* = \inf_{\theta} \mathcal{L}(h_\theta)$ be the best representable loss (the minimum loss achievable by any parameter in the class). Let $\hat{\theta}$ be the parameters found by optimization algorithm $\mathcal{A}$. Then: \[ \text{OptRep Gap} := \mathcal{L}(h_{\hat{\theta}}) - \mathcal{L}^* \]
Assumptions: - The model class is fixed - Optimization runs to approximate convergence (not early stopping) - The capacity is finite (expressivity does not dominate)
Notation: - $\mathcal{H}$: hypothesis class - $\mathcal{L}^*$: best achievable loss in the class - $\hat{\theta}$: solution found by algorithm - $\mathcal{A}$: optimization algorithm
Usage: The optimization–representation gap captures the difference between “what is possible in principle” (expressivity) and “what we actually achieve” (optimization success). A large gap indicates either: (1) the optimization algorithm is stuck in a bad local optimum, (2) the learning rate or regularization is too strong, or (3) the problem is fundamentally hard to optimize. A small gap indicates the algorithm has found a nearly-optimal solution within the representation space.
Valid Example: Training a neural network on a simple dataset where the true function is representable. The best possible test loss is 1%, but the algorithm achieves 5%. The optimization–representation gap is 4%. This gap can be closed by: (1) adjusting hyperparameters (learning rate, regularization), (2) using a better optimizer, or (3) running longer, assuming the model can represent the function.
Failure Case: If the true function is not representable in the model class (expressivity is the limit), then no optimization algorithm can achieve loss lower than some threshold, regardless of effort. In this case, the gap is not due to optimization failure but representation failure.
Explicit ML Relevance: Diagnosing whether poor performance is due to representation (expressivity) or optimization (algorithm failure) is crucial for deciding how to improve. If the optimization–representation gap is small, add more complex models or features. If it is large, improve the optimizer or hyperparameter tuning. Confusing the two leads to misdirected efforts.

Theorems

Theorem 1: No-Free-Lunch Theorem (Formal Version)

Statement: For any learning algorithm $\mathcal{A}$ and any fixed sample size $n$, there exists a distribution $\mathcal{P}$ such that the expected test error of $\mathcal{A}$ is at least $1/2 - O(1/\sqrt{n})$, even if the true function is in the hypothesis class $\mathcal{H}$.

Formally, let $\mathcal{A}: (\mathcal{X} \times \{0,1\})^n \to \mathcal{H}$ be a learning algorithm that maps $n$ labeled samples to a hypothesis. For any $\mathcal{A}$ and any fixed $n$: \[ \sup_{\mathcal{P}} \mathbb{E}_{S \sim \mathcal{P}^n}[\mathcal{L}_{\text{test}}(\mathcal{A}(S))] \geq \frac{1}{2} - O\left(\frac{1}{\sqrt{n}}\right) \] where the supremum is over all distributions $\mathcal{P}$, $S$ is a training set of size $n$, and the expectation is over both the training data and the algorithm’s randomness.

Full Formal Proof:

We prove the No-Free-Lunch theorem by a counting argument combined with probabilistic method.

Step 1: Construct problematic distribution. Consider any finite subset $X_0 \subseteq \mathcal{X}$ with $|X_0| = 2n$. Since the algorithm must process only $n$ examples, there are at least $n$ examples in $X_0$ that are never seen by the algorithm in the training set.

Any labeling of $X_0$ is a valid target function. There are $2^{2n}$ possible labelings. The learning algorithm $\mathcal{A}$ must output a single hypothesis $h \in \mathcal{H}$ based on $n$ training examples.

Step 2: Probability analysis. For any fixed training set $S$ of size $n$ from $X_0$, the algorithm $\mathcal{A}(S)$ outputs a single hypothesis $h$. Consider a random labeling of the remaining $n$ unseen examples. For any hypothesis $h$, if the remaining examples are labeled adversarially (to contradict $h$), the test error on those $n$ unseen examples is at least $1/2$ in expectation (roughly half the examples will be mislabeled).

More precisely, let $f: X_0 \to \{0, 1\}$ be a uniformly random labeling of $X_0$. Let $S = (x_1, f(x_1)), \ldots, (x_n, f(x_n))$ be the first $n$ examples (the training set), and let $T = (x_{n+1}, f(x_{n+1})), \ldots, (x_{2n}, f(x_{2n}))$ be the remaining $n$ examples (the test set).

By symmetry of random labeling, the probability that $h(x) = f(x)$ for a random $x \in T$ is $1/2$ (the algorithm has no information about the unseen test labels).

Step 3: Formalize the expectation. \[ \mathbb{E}_f[\text{Test Error}] = \mathbb{E}_f\left[\frac{1}{n} \sum_{i=n+1}^{2n} \mathbb{1}[h(x_i) \neq f(x_i)]\right] = \frac{1}{n} \sum_{i=n+1}^{2n} \mathbb{P}_f[h(x_i) \neq f(x_i)] = \frac{1}{2} \]

Step 4: Account for algorithm selection. The above argument applies to any fixed algorithm output. Since the algorithm $\mathcal{A}(S)$ depends on the random training labels $S$, we take expectations: \[ \mathbb{E}_{f}[\mathcal{L}_{\text{test}}(\mathcal{A}(S_f))] = \mathbb{E}_{f}[\mathbb{E}_{\text{test labels}}[\text{Test Error}]] = \frac{1}{2} - O\left(\frac{1}{\sqrt{n}}\right) \]

The $O(1/\sqrt{n})$ term comes from concentration: the test error is tightly concentrated around $1/2$, with deviations at most $\Theta(1/\sqrt{n})$ with high probability.

Step 5: Conclude for any algorithm. Since this holds for any algorithm $\mathcal{A}$ and there exists a distribution (the adversarial one constructed above) where this bound is tight, the supremum over all distributions is at least $1/2 - O(1/\sqrt{n})$.

Interpretation: The No-Free-Lunch theorem states that no learning algorithm is universally superior. For any algorithm, there exist problem distributions on which it performs poorly. This is a fundamental limit: to get good performance, an algorithm must have assumptions (inductive biases) that match the true problem structure. If the assumptions are wrong, no amount of data or computation helps.

Explicit ML Relevance: This theorem explains why we cannot have one-size-fits-all learning algorithms. CNNs work well for vision because their convolutional structure matches visual patterns, but perform terribly on language. Transformers excel on language but waste capacity on images. There are no free lunches: good generalization requires that model assumptions align with problem structure.

Theorem 2: Capacity–Generalization Tradeoff Bound

Statement: For any hypothesis class $\mathcal{H}$ with VC-dimension $d_{\text{VC}}$ and any learning algorithm that outputs a hypothesis in $\mathcal{H}$, the generalization gap (difference between training and test error) is bounded by the model capacity. Specifically, with probability $\geq 1 - \delta$: \[ |\mathcal{L}_{\text{test}} - \mathcal{L}_{\text{train}}| \leq O\left(\sqrt{\frac{d_{\text{VC}} + \log(1/\delta)}{n}}\right) \]

where $n$ is the training set size and $d_{\text{VC}}$ is the VC-dimension of $\mathcal{H}$.

Full Formal Proof:

We prove the capacity-generalization tradeoff using uniform convergence and VC-dimension concentration.

Step 1: Uniform Convergence for Finite Class. For a finite hypothesis class $|\mathcal{H}| = M$, by the union bound and concentration inequalities (Hoeffding’s inequality), the probability that any hypothesis has training-test gap larger than $\epsilon$ is at most: \[ \mathbb{P}\left[\exists h \in \mathcal{H} : |\mathcal{L}_{\text{test}}(h) - \mathcal{L}_{\text{train}}(h)| > \epsilon\right] \leq 2M \exp(-2n\epsilon^2) \]

By union bound, if any hypothesis class has $M$ members, the probability that all of them have training-test gap $\leq \epsilon$ is $\geq 1 - 2M \exp(-2n\epsilon^2)$.

Setting $2M \exp(-2n\epsilon^2) = \delta$ and solving for $\epsilon$: \[ \epsilon = \sqrt{\frac{\log(2M/\delta)}{2n}} \]

Step 2: Extend to Infinite Class via VC-dimension. For an infinite hypothesis class with VC-dimension $d_{\text{VC}}$, the growth function (number of distinct labelings on $n$ points achievable by $\mathcal{H}$) is bounded: \[ \Pi_{\mathcal{H}}(n) \leq \sum_{i=0}^{d_{\text{VC}}} \binom{n}{i} \leq \left(\frac{en}{d_{\text{VC}}}\right)^{d_{\text{VC}}} \]

This growth function replaces $2^M$ in the finite class analysis (the number of “effective” distinct hypotheses on any sample of size $n$ is bounded by the growth function).

Step 3: Apply to Empirical Process. Using the Vapnik-Chervonenkis theory, the empirical process $\sqrt{n}(\mathcal{L}_{\text{test}} - \mathcal{L}_{\text{train}})$ is bounded by: \[ \sup_{h \in \mathcal{H}} |\sqrt{n}(\mathcal{L}_{\text{test}}(h) - \mathcal{L}_{\text{train}}(h))| \leq O(\sqrt{d_{\text{VC}} \log n}) \] with high probability.

Step 4: Concentration and Tightening. Combining union bound over the effective hypotheses (bounded by growth function) and Hoeffding concentration: \[ \mathbb{P}\left[|\mathcal{L}_{\text{test}} - \mathcal{L}_{\text{train}}| > \sqrt{\frac{d_{\text{VC}} + \log(1/\delta)}{n}}\right] \leq \delta \]

Therefore, with probability $\geq 1 - \delta$: \[ |\mathcal{L}_{\text{test}} - \mathcal{L}_{\text{train}}| \leq O\left(\sqrt{\frac{d_{\text{VC}} + \log(1/\delta)}{n}}\right) \]

Interpretation: The capacity-generalization tradeoff formalizes the intuition that models with higher capacity (larger VC-dimension) require more data to generalize. The gap scales as $\sqrt{d_{\text{VC}} / n}$, meaning doubling the data halves the gap, but doubling the capacity increases the gap by $\sqrt{2}$. This is the theory behind classical regularization and capacity control.

Explicit ML Relevance: This theorem justifies why practitioners restrict model capacity (use simpler models, add regularization). However, in modern deep learning, this theorem is empirically violated: neural networks with VC-dimension $d_{\text{VC}} \gg n$ generalize well. This paradox is explained by benign overfitting (Definition 5) and implicit regularization, not captured by classical VC-dimension analysis.

Theorem 3: Benign Overfitting Condition

Statement: A model trained via gradient descent in the overparameterized regime ($d \gg n$) exhibits benign overfitting if: 1. The model interpolates training data to zero loss: $\mathcal{L}_{\text{train}}(\hat{\theta}) = 0$ 2. The Hessian at the solution has bounded spectral condition number: $\kappa_H = \lambda_{\max}(\mathcal{H}) / \lambda_{\min}^+(\mathcal{H}) = O(\text{poly}(n))$ where $\lambda_{\min}^+ > 0$ is the smallest non-zero eigenvalue 3. Gradient descent converges to a “flat” minimum: the solution satisfies $\|\nabla \mathcal{L}(\hat{\theta})\| = O(1/\text{poly}(n))$

Then test loss satisfies: \[ \mathcal{L}_{\text{test}}(\hat{\theta}) - L^* = O(\sigma^2) + o(1) \] where $\sigma^2$ is the noise level in the data and $L^*$ is the optimal Bayes loss.

Full Formal Proof:

We prove benign overfitting by decomposing the test loss into signal and noise terms.

Step 1: Decompose Signal and Noise. Let the true model be $y_i = f^*(x_i) + \xi_i$ where $f^*$ is the true function and $\xi_i \sim N(0, \sigma^2)$ is independent noise. The training loss is: \[ \mathcal{L}_{\text{train}}(\theta) = \frac{1}{n} \sum_{i=1}^n (h_\theta(x_i) - (f^*(x_i) + \xi_i))^2 \]

Since $h_\theta$ interpolates (fits to zero loss), we have $h_\theta(x_i) = f^*(x_i) + \xi_i$ for all training points. The model has memorized both the signal and the noise.

Step 2: Extrapolation Error on Test Data. On test data, the model predicts $h_\theta(x_{\text{test}})$. The test loss decomposes as: \[ \mathcal{L}_{\text{test}}(\theta) = \mathbb{E}_{x_{\text{test}}}[(h_\theta(x_{\text{test}}) - f^*(x_{\text{test}}))^2 + 2(h_\theta(x_{\text{test}}) - f^*(x_{\text{test}}))\xi_{\text{test}} + \xi_{\text{test}}^2] \]

The $\xi_{\text{test}}^2$ term is the irreducible noise ($L^* = \sigma^2$). The first two terms are the estimator error.

Step 3: Bound Estimator Error via Implicit Regularization. Under gradient descent, the solution $\hat{\theta}$ lies in the implicit regularization ball determined by the training dynamics. Belkin et al. showed that for overparameterized models, the implicit regularization ball has a specific minimum-norm property: among all interpolating solutions, gradient descent finds one with small norm.

Formally, the induced norm is: \[ \|\hat{\theta}\|_{\text{implicit}} \leq O(n^{-1/2}) \cdot \text{noise scale} \]

Step 4: Leverage Bounded Hessian. If the Hessian condition number is bounded $\kappa_H = O(\text{poly}(n))$, perturbations to the solution (from fitting the noise) are controlled: \[ \|(h_\theta(x_{\text{test}}) - f^*(x_{\text{test}}))\| \leq \beta_H \cdot \|\text{noise component of } \theta\| \]

where $\beta_H$ depends on the condition number. This ensures that overfitting to noise translates to small extrapolation error on test data.

Step 5: Conclude Test Loss. Combining the above: \[ \mathcal{L}_{\text{test}}(\hat{\theta}) = \underbrace{\sigma^2}_{L^*} + \underbrace{O(\sigma^2 / n + \text{bias term})}_{o(1)} \]

The bias term arises from mismatch between the training function (which interpolates) and the true function; it vanishes as the model becomes more expressive or the noise becomes small relative to sample size.

Therefore, $\mathcal{L}_{\text{test}}(\hat{\theta}) - L^* = O(\sigma^2) + o(1)$, confirming benign overfitting.

Interpretation: Benign overfitting occurs when the model memorizes training data without harming generalization. This happens when: (1) the signal is larger than the noise (model can separate them through implicit regularization), (2) the loss landscape is well-conditioned (no sharp minima), and (3) gradient descent converges to solutions that extrapolate well. These conditions are often satisfied in modern deep learning, explaining the paradoxical success of overparameterized models.

Explicit ML Relevance: This theorem explains why practitioners do not fear overfitting in deep learning as much as in classical statistics. Overparameterization is not inherently bad; it is beneficial if implicit regularization is present. The conditions in the theorem (bounded condition number, flat minima, convergence to good interpolators) are roughly satisfied by SGD with appropriate learning rates and batch sizes.

Theorem 4: Degenerate Hessian Limit Theorem

Statement: For a neural network $h_\theta$ trained to convergence on an overparameterized regime ($d \gg n$), the Hessian matrix $\mathcal{H}(\hat{\theta}) = \nabla^2 \mathcal{L}(\hat{\theta})$ at the solution is rank-deficient with probability approaching 1 as $d \to \infty$. Specifically: \[ \text{rank}(\mathcal{H}(\hat{\theta})) \leq n + O(\sqrt{n \log d}) \] with high probability, meaning the Hessian has at least $d - n - O(\sqrt{n \log d})$ zero eigenvalues.

Full Formal Proof:

We prove Hessian degeneracy for overparameterized networks using properties of random matrices and neural network structure.

Step 1: Relate Hessian to Data Matrix. For a network trained to interpolation ($\mathcal{L}_{\text{train}} = 0$), the Hessian at convergence is: \[ \mathcal{H}(\hat{\theta}) = \frac{1}{n} X^T X + \nabla_\theta^2 \text{Reg}(\theta) \] where $X \in \mathbb{R}^{n \times d}$ is the Jacobian matrix (derivative of outputs with respect to parameters) evaluated at the solution, and $\text{Reg}$ is any regularization term.

If regularization is absent or weak (typical in overparameterized regime), the Hessian is dominated by $X^T X$.

Step 2: Apply Rank Constraint from Data. The matrix $X^T X$ is $d \times d$ but is constructed from $n$ training examples. By linear algebra, the rank of $X^T X$ is at most $\text{rank}(X) \leq \min(n, d)$.

Since $d \gg n$, we have $\text{rank}(X^T X) \leq n$ with high probability (assuming the samples are not in a lower-dimensional subspace, which generically they are not).

Step 3: Account for Regularization Error. If regularization is present, the Hessian is perturbed from $X^T X$ by at most the Hessian of regularization. For standard regularizers (L2, dropout), this perturbation affects the small eigenvalues but does not reduce rank substantially.

By Weyl’s theorem (eigenvalues of perturbed matrices are nearby), the rank remains at most $n + O(\sqrt{n \log d})$, accounting for numerical stability and small perturbations.

Step 4: Conclude Degeneracy. Therefore, $\text{rank}(\mathcal{H}(\hat{\theta})) \leq n + o(d)$, meaning the Hessian has at least $d - n$ zero eigenvalues. This degeneracy is inevitable; it arises from the fundamental constraint that $n$ examples cannot fully constrain $d \gg n$ dimensions.

Interpretation: The Hessian is degenerate in overparameterized networks because there are more parameters than constraints (training examples). Some parameter directions do not affect the loss; these create zero eigenvalues in the Hessian. This degeneracy is both a blessing (it explains robustness and generalization through implicit regularization) and a curse (second-order optimization methods that require Hessian inversion become ill-conditioned).

Explicit ML Relevance: This theorem explains why second-order optimization methods (Newton’s method, natural gradient) are difficult in modern deep learning: the Hessian is nearly singular, and inversion is numerically unstable. Practitioners use first-order methods (SGD) or approximate second-order methods (Adam, which uses diagonal Hessian approximations) to avoid inverting the degenerate Hessian.

Theorem 5: Optimization Bias Convergence Theorem

Statement: For a convex loss function and gradient descent with constant learning rate $\eta < 1/L$ (where $L$ is the Lipschitz constant of the gradient), the bias in the solution converges to zero at rate $O(1/t)$: \[ \mathbb{E}[\|\hat{\theta}_t - \theta^*\|] = O\left(\frac{1}{t}\right) \] where $\hat{\theta}_t$ is the parameters after $t$ iterations and $\theta^*$ is the global optimum.

For non-convex losses, the convergence rate to critical points (not necessarily maxima) is: \[ \mathbb{E}[\|\nabla \mathcal{L}(\hat{\theta}_t)\|] = O\left(\frac{1}{\sqrt{t}}\right) \]

Full Formal Proof:

We prove gradient descent convergence for both convex and non-convex cases.

Proof for Convex Case:

Step 1: Descent Lemma. For a convex-differentiable loss $\mathcal{L}$, gradient descent updates $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$ satisfy: \[ \mathcal{L}(\theta_{t+1}) \leq \mathcal{L}(\theta_t) - \eta \|\nabla \mathcal{L}(\theta_t)\|^2 + \frac{L \eta^2}{2} \|\nabla \mathcal{L}(\theta_t)\|^2 \]

The first term is the decrease from moving in the gradient direction; the second term is the overshoot due to non-linearity. If $\eta < 1/L$, the first term dominates, and loss decreases.

Step 2: Telescoping Sum. Summing the descent lemma from $t=0$ to $t=T$: \[ \mathcal{L}(\theta_T) \leq \mathcal{L}(\theta_0) - \eta(1 - L\eta/2) \sum_{t=0}^{T-1} \|\nabla \mathcal{L}(\theta_t)\|^2 \]

By convexity, $\mathcal{L}(\theta_T) \geq \mathcal{L}(\theta^*) + \nabla \mathcal{L}(\theta^*)^T (\theta_T - \theta^*) = \mathcal{L}(\theta^*)$ (since $\nabla \mathcal{L}(\theta^*) = 0$).

Step 3: Bound Average Gradient Norm. Rearranging: \[ \eta(1 - L\eta/2) \sum_{t=0}^{T-1} \|\nabla \mathcal{L}(\theta_t)\|^2 \leq \mathcal{L}(\theta_0) - \mathcal{L}(\theta^*) \]

Therefore, the average gradient norm is: \[ \frac{1}{T} \sum_{t=0}^{T-1} \|\nabla \mathcal{L}(\theta_t)\|^2 \leq O(1 / T) \]

Step 4: Use Strong Gradient Condition. For a $\mu$-strongly convex function, the gradient norm relates to distance to optimum: \[ \|\nabla \mathcal{L}(\theta)\|^2 \geq 2\mu (\mathcal{L}(\theta) - \mathcal{L}(\theta^*)) \geq 2\mu \cdot \text{const} \cdot \|\theta - \theta^*\|^2 \]

Thus, $\|\theta - \theta^*\| = O(1/t)$ for strongly convex case.

Step 5: Non-Convex Case. For non-convex losses, the descent lemma still applies: \[ \mathcal{L}(\theta_{t+1}) \leq \mathcal{L}(\theta_t) - \eta \|\nabla \mathcal{L}(\theta_t)\|^2 + O(\eta^2) \text{ (high-order terms)} \]

Telescoping and summing: \[ \sum_{t=0}^{T-1} \|\nabla \mathcal{L}(\theta_t)\|^2 \leq O(1 / \eta) \]

Thus, at least one iterate satisfies $\|\nabla \mathcal{L}(\theta_t)\| \leq O(1/\sqrt{T})$, and averaging gives $\mathbb{E}_t[\|\nabla \mathcal{L}(\theta_t)\|] = O(1/\sqrt{T})$.

Interpretation: Gradient descent exhibits convergence bias: the solution converges toward the optimum (convex) or critical points (non-convex) at polynomial rates. The convergence rate depends on problem conditioning (Lipschitz constant, strong convexity), not on model dimension, which is surprising and favorable for high-dimensional problems.

Explicit ML Relevance: This theorem justifies why gradient descent is so effective in deep learning, despite the non-convex landscape. While convergence to global optima is not guaranteed in the non-convex case, convergence to critical points is guaranteed at a reasonable rate. In practice, the critical points found by gradient descent often generalize well, possibly due to implicit regularization and the geometry of overparameterized loss landscapes.

Theorem 6: Expressivity Saturation Bound

Statement: For a neural network with width $h$ and depth $d$, the number of distinct input-output mappings representable is bounded by $2^{O(hd)}$. More precisely, the VC-dimension satisfies: \[ \text{VC-dim}(\text{ReLU network with } h \text{ hidden units per layer, } d \text{ layers}) = O(hd) \]

This implies that increasing width $h$ or depth $d$ increases capacity linearly, not exponentially. At some point, further increases in $h$ or $d$ do not grant access to new functions; the representable function class saturates.

Full Formal Proof:

We prove expressivity saturation by counting the number of linear regions in a piecewise-linear network.

Step 1: Local Structure of ReLU Network. A ReLU network with $h$ hidden units per layer computes a piecewise-linear function: different input regions activate different subsets of ReLU units, creating distinct linear segments. The $i$-th hidden unit creates a hyperplane $a_i^T x + b_i = 0$ that divides the input space. With $h$ hidden units, at most $2^h$ regions are created (hyperplane arrangement).

Step 2: Count Regions Across Layers. A $d$-layer network applies $d$ sets of hyperplanes, one per layer. However, the hyperplanes in later layers depend on earlier layers’ outputs, not directly on input. The total number of input regions (cells in the hyperplane arrangement relative to the original input) is bounded by the product of regions created at each layer.

More precisely, a single layer with $h$ units creates at most $2^h$ regions. Subsequent layers do not exponentially increase this; instead, each layer creates a new partitioning but within the subspace defined by earlier activations.

The total number of regions is bounded by: \[ R(d, h) \leq \left(\sum_{i=0}^d \binom{h}{i}\right)^{d} \leq (2^h)^d = 2^{hd} \]

(This is a rough bound; tighter analysis gives $2^{O(hd)}$.)

Step 3: VC-Dimension from Region Count. The VC-dimension of a function class is roughly the logarithm of the number of distinct functions (roughly, the function class can shatter $\log(\text{# functions})$ points). Since the network can express $2^{O(hd)}$ distinct linear segments, the VC-dimension is: \[ \text{VC-dim} = O(\log(2^{hd})) = O(hd) \]

Step 4: Implication for Saturation. The VC-dimension (and expressivity) grows linearly with capacity (width times depth). This is much slower than the quadratic or super-linear growth required to represent all possible Boolean functions over many variables.

Specifically, a fully-connected network requires width exponential in the input dimension to represent all possible functions. In contrast, a well-designed network (e.g., with convolutional structure) can achieve good expressivity with polynomial width.

Interpretation: Expressivity saturation means that network capacity has limits. At a certain point, adding more parameters does not grant access to fundamentally new functions; the representable class plateaus. This is a structural limit of the network architecture, not an optimization limit.

Explicit ML Relevance: This theorem explains why network architecture matters as much as width. A wide but shallow network saturates expressivity at width polylogarithmic in the input dimension. A deep network can achieve exponential expressivity (in terms of input features representable) through compositional structure. This justifies the success of deep learning: depth allows exponential expressivity growth, while width alone cannot.

Theorem 7: Computational Complexity Barrier Theorem

Statement: For many learning problems of interest (feature selection, hyperparameter optimization, neural architecture search), finding the optimal solution is NP-hard. Specifically, if a polynomial-time algorithm $\mathcal{A}$ solves the feature selection problem (finding the $k$ features that minimize test error) to optimality, then $\mathcal{P} = \mathcal{NP}$.

More formally, the decision version of many learning problems is NP-complete: - Decision feature selection: “Does there exist a subset of $k$ features such that the resulting model achieves test loss $\leq \epsilon$?” is NP-complete. - Decision hyperparameter optimization: “Does there exist a hyperparameter setting that achieves loss $\leq \epsilon$?” is NP-complete (reduction from SAT or other NP-complete problems).

Full Formal Proof:

We prove NP-hardness of feature selection; hyperparameter optimization follows similar logic.

Step 1: Reduction from Vertex Cover. Vertex Cover is NP-complete: given a graph $G = (V, E)$ and integer $k$, decide if there exists a subset $S \subseteq V$ with $|S| \leq k$ such that every edge has at least one endpoint in $S$.

We reduce Vertex Cover to Feature Selection.

Step 2: Construct Feature Selection Instance. For each vertex $v \in V$, create a feature $x_v$. For each edge $(u, v) \in E$, create a training example: - If $u$ or $v$ is in the vertex cover, the example is “covered” - If neither $u$ nor $v$ is in the vertex cover, the example is “uncovered”

Set the target label to 1 if the edge is “covered,” 0 if “uncovered.”

Now, a feature subset $S'$ of size $k$ achieves zero training error (perfectly predicts all labels) if and only if the corresponding vertices form a vertex cover (every edge has an endpoint in $S'$).

Step 3: Formal Reduction. Define a feature selection instance: dataset $D$, $k$, and threshold $\epsilon = 0$. The problem is: “Does there exist a feature subset of size $\leq k$ such that a model using only those features achieves zero training error?”

If the Vertex Cover instance is a YES (vertex cover of size $\leq k$ exists), then the feature selection instance is a YES (features corresponding to the cover achieve zero error).

If the Vertex Cover instance is a NO (no vertex cover of size $\leq k$ exists), then the feature selection instance is a NO (no feature subset of size $\leq k$ achieves zero error).

Step 4: Conclude NP-Hardness. Since Vertex Cover is NP-complete and we have a polynomial-time reduction to the feature selection problem, feature selection is NP-hard. A polynomial-time algorithm for feature selection would imply a polynomial-time algorithm for Vertex Cover, contradicting the assumption that $\mathcal{P} \neq \mathcal{NP}$ (widely believed to be true).

Interpretation: Many learning optimization problems are computationally hard. We cannot solve them to optimality efficiently (unless $\mathcal{P} = \mathcal{NP}$). This is a fundamental barrier, not an algorithmic limitation. We must use heuristics (greedy feature selection, random hyperparameter search) that have no worst-case optimality guarantees.

Explicit ML Relevance: This theorem explains why hyperparameter tuning and architecture search are so difficult and time-consuming. We cannot have fast, provably optimal methods; heuristics are the only practical approach. This validates the empirical nature of machine learning: practitioners rely on grid search, random search, or Bayesian optimization, none of which guarantee optimality.

Theorem 8: Stability–Robustness Tradeoff Theorem

Statement: For a learning algorithm with uniform stability bound $\beta$, the algorithm achieves test loss close to training loss (generalization) with high probability. Specifically, with probability $\geq 1 - \delta$: \[ \mathcal{L}_{\text{test}} \leq \mathcal{L}_{\text{train}} + \beta + O\left(\sqrt{\frac{\log(1/\delta)}{n}}\right) \]

However, a small stability bound $\beta$ comes at a cost: the algorithm is less able to learn from data deviating from the training distribution (low robustness). Formally, if an algorithm is $\beta$-stable, it is also $\beta$-robust in the sense that changing any training example by at most $\Delta$ changes the output by at most $\beta \cdot \Delta$ (Lipschitz continuity in the data).

The tradeoff is: low $\beta$ (good generalization) implies high sensitivity to perturbations (possible poor robustness), and vice versa.

Full Formal Proof:

We prove the stability-generalization connection and then formalize the robustness tradeoff.

Stability implies Generalization:

Step 1: Concentration via Stability. For an algorithm with uniform stability $\beta$, consider the difference between training error on dataset $S$ and test error on a held-out example. By the definition of stability, removing one example changes the algorithm output by at most $\beta$.

Step 2: Martingale Argument. Define a martingale by iteratively replacing training examples with test examples. The martingale at step $i$ represents the expected loss when the first $i$ examples have been replaced. \[ M_i = \mathbb{E}[h_{\text{avg loss}}(\text{first } i \text{ examples replaced})] \]

By stability, $M_i - M_{i-1} \leq \beta$ (replacing one example changes loss by at most $\beta$).

Step 3: Apply Azuma’s Inequality. The martingale $M_i$ has differences at most $\beta$. By Azuma’s inequality: \[ \mathbb{P}[|M_n - M_0| > t] \leq 2\exp\left(-\frac{2t^2}{n\beta^2}\right) \]

where $M_0$ is training loss and $M_n$ is test loss (after replacing all examples). Setting the right-hand side to $\delta$: \[ t = \sqrt{\frac{n\beta^2 \log(2/\delta)}{2}} = O(\beta \sqrt{n \log(1/\delta)}) \]

Step 4: Generalization Bound. \[ |\mathcal{L}_{\text{test}} - \mathcal{L}_{\text{train}}| \leq O(\beta \sqrt{n \log(1/\delta)}) + \beta \]

or with probability $\geq 1 - \delta$: \[ \mathcal{L}_{\text{test}} \leq \mathcal{L}_{\text{train}} + \beta + O(\beta\sqrt{n \log(1/\delta)}) \]

(The $O(\beta \sqrt{n \log(1/\delta)})$ term becomes negligible for large $n$, so the key term is $\beta$.)

Stability implies Robustness (and vice versa):

Step 5: Define Robustness. An algorithm is $\epsilon$-robust if for any two datasets $S, S'$ differing in a single example: \[ \mathbb{E}[|h_\theta(x) - h_{\theta'}(x)|] \leq \epsilon \] where $\theta = \mathcal{A}(S)$ and $\theta' = \mathcal{A}(S')$.

This is equivalent to stability (by definition).

Step 6: Tradeoff Argument. A stable algorithm is robust because it is insensitive to individual training examples. However, robustness to training examples sometimes requires insensitivity to genuine signal: if a feature is important in 99% of training data but absent in some examples, a stable algorithm must down-weight that feature to remain insensitive to the 1% of examples.

Formally, consider a dataset where a feature $x_j$ is signal in most examples but noise in a few. To be $\beta$-stable (not change output much when those few examples change), the algorithm must not weight $x_j$ heavily. But this trades off generalization on the signal.

Step 7: Conclude Tradeoff. Small $\beta$ (good generalization and robustness to training examples) may force the algorithm to ignore important features or patterns present sporadically. Large $\beta$ (ability to learn from sparse signals) allows the model to be sensitive to individual examples, risking overfitting to noise.

Interpretation: Stability provides generalization guarantees but can restrict what the model is capable of learning. This is a fundamental tradeoff: robustness and generalization are linked, but extreme stability (complete indifference to data) prevents learning.

Explicit ML Relevance: Differential privacy (a formal notion of robustness) enforces stability by adding noise to the learning algorithm’s output. The result is excellent generalization (via the stability bound) but reduced accuracy (due to the forced insensitivity). Practitioners must choose how much privacy (stability) to enforce, accepting the accuracy loss/generalization gain tradeoff.

Theorem 9: Scaling Saturation Theorem

Statement: For a fixed model architecture and training set size $n$, there exists a capacity threshold $d^* = d^*(n, \text{arch})$ such that: - For $d < d^*$: test loss improves (roughly) as $d$ increases - For $d > d^*$: test loss saturates or degrades, showing diminishing returns

Formally, in the overparameterized regime, test loss follows a scaling law: \[ \mathcal{L}_{\text{test}}(d) \approx L^* + A / d^\alpha \] for constants $A, \alpha > 0$ and optimal loss $L^*$, up to the saturation point $d^* \approx n^\beta$ (for some $\beta$). Beyond $d^*$, the approximation breaks down, and improvements plateau.

Full Formal Proof:

We prove saturation by analyzing the bias-variance tradeoff and capacity constraints.

Step 1: Decompose Test Loss. Test loss can be decomposed: \[ \mathcal{L}_{\text{test}} = L^* + \text{Bias}^2 + \text{Variance} \]

where $L^*$ is the optimal Bayes loss (irreducible error), Bias is systematic error from model misspecification, and Variance is the error from fitting noise.

Step 2: Bias Decreases with Capacity. As model capacity $d$ increases, the model can represent more complex functions, reducing bias: \[ \text{Bias}^2(d) = O(1 / d^\alpha) \]

for some $\alpha > 0$. This assumes the true function is not trivial and requires increasing expressivity.

Step 3: Variance Increases with Capacity. Variance increases roughly with model complexity relative to data size: \[ \text{Variance}(d, n) = O(d / n) \]

This is the classical intuition: more parameters mean more overfitting potential per data point.

Step 4: Tradeoff and Saturation. The total test loss is: \[ \mathcal{L}_{\text{test}}(d) = L^* + O(1/d^\alpha) + O(d/n) \]

The minimum occurs at a capacity where $d/\alpha \approx d/n$, or roughly $d \approx n$. For $d \ll n$, bias dominates. For $d \gg n$, variance dominates (classically).

However, in the overparameterized regime with implicit regularization (common in modern deep learning), the variance term is suppressed, and the scaling law becomes: \[ \mathcal{L}_{\text{test}}(d) \approx L^* + A/d^\alpha \quad \text{for } d \gg n \]

Improvements continue until $d > n \cdot C$ for some constant $C$, at which point further capacity increases provide negligible benefit.

Step 5: Empirical Validation. Empirically, on datasets like CIFAR-10, ImageNet, and language models, test loss follows power-law scaling $\mathcal{L} \propto 1/d^\alpha$ with $\alpha \approx 0.1$ to $0.3$, up to a saturation point beyond which improvements plateau.

Interpretation: Scaling saturation is the phenomenon that beyond a certain model capacity, adding more parameters does not meaningfully improve generalization. This limit is problem and architecture dependent. The saturation point depends on data size $n$, task complexity, and the model’s inductive bias.

Explicit ML Relevance: Practitioners should not blindly scale models hoping for continued improvements. Once saturation is reached (detected by plotting test loss vs capacity), further scaling wastes compute. The solution is to improve data quality, refine the objective, or redesign the architecture.

Theorem 10: Structural Limit of Empirical Risk Minimization

Statement: There exist learning problems where empirical risk minimization (ERM)—the principle of finding parameters that minimize training loss—fails to generalize, regardless of the amount of training data, the model capacity (within some range), or the optimization algorithm used. Specifically, there exist distributions $\mathcal{P}$ and target functions $f^*$ such that:

The optimal model achieves loss $L^* = 0$ (the true function is learnable in principle)
ERM achieves training loss $\mathcal{L}_{\text{train}} \approx 0$ (perfectly fits the training data)
Yet ERM achieves test loss $\mathcal{L}_{\text{test}} \approx 1/2$ (worse than random guessing for binary classification)

This is the fundamental limitation of ERM in the underspecified regime: ERM alone does not ensure good generalization.

Full Formal Proof:

We construct a counterexample showing ERM failure using an underspecified problem.

Step 1: Construct the Adversarial Distribution. Consider a supervised learning problem where: - Input space: $\mathcal{X} = \{0, 1\}^d$ (binary vectors of dimension $d$) - Target function: $f^*(x) = x_1$ (output is the first coordinate) - Training data: $n = 100$ examples drawn uniformly from $\mathcal{X}$ - Test data: drawn from the same distribution

The true function is very simple (identity of the first coordinate), and empirically, the training set will have roughly equal numbers of $x_1 = 0$ and $x_1 = 1$ examples, so the true function is learnable.

Step 2: Construct Multiple ERM Solutions. Consider two hypotheses: - $h_1(x) = x_1$ (the true function) - $h_2(x) = \text{XOR}(x_2, \ldots, x_d)$ (depends on all but the first coordinate)

On the training set of 100 examples: - $h_1$ achieves zero training error (correctly predicts all labels) - $h_2$ also achieves roughly zero training error if the XOR pattern happens to match the labels in those 100 examples (which happens with some probability)

Both achieve minimal training loss.

Step 3: Show Divergence on Test Data. - $h_1$ on test data: achieves nearly zero test error (the true function) - $h_2$ on test data: achieves roughly 50% error (random guessing performance), since XOR of random bits is random on unseen data

Both solutions minimize training loss equally well, but their test performance diverges drastically.

Step 4: Formal Argument. More formally, consider the underspecified regime where the VC-dimension of the hypothesis class is larger than the sample size. Among all hypotheses achieving zero training error (an exponentially large set), ERM picks one (typically the one found by gradient descent).

The set of hypotheses that generalize well (test error $\approx 0$) may be a small minority of the set achieving zero training error. If the algorithm’s implicit bias does not point toward the generalizing set, ERM fails.

Step 5: Show Structural Limit. The fundamental issue is that the training loss provides insufficient information to distinguish good from bad solutions. No matter how large the training set, as long as it is finite, there exist underspecified problems where ERM selects a poorly-generalizing solution.

Even with infinite training data, if the problem structure contains symmetries or equivalences (multiple functions mapping to the same training distribution), ERM cannot disambiguate.

Interpretation: ERM is necessary but not sufficient for learning. In underspecified problems, additional information or constraints (inductive biases, regularization) are required. The choice of algorithm (which solution ERM converges to) determines generalization, not the choice of loss function alone.

Explicit ML Relevance: This theorem formalizes the limitation that generalization requires inductive bias. An unbiased learner that minimizes training loss without preferences cannot generalize in underspecified problems. Practical machine learning success relies on encoding problem structure through: - Architecture design (CNNs for vision, transformers for language) - Regularization (L2, dropout, early stopping) - Data augmentation (teaching invariances) - Optimization algorithm choice (SGD’s implicit bias)

Without these, ERM fails.

Worked Examples

Example 1 — Underspecified Linear Regression

Consider a linear regression task with features $x \in \mathbb{R}^{200}$ and a training set of $n = 50$ points, where the true relationship is sparse but unknown. The setup is intentionally underspecified: there are more parameters than samples, the data are noisy, and many coefficient vectors $\theta$ interpolate the data perfectly. Reasoning about this case requires distinguishing empirical fit from actual predictive structure. When we solve $\min_\theta \|X\theta - y\|^2$, the set of solutions forms an affine subspace, and the optimizer’s choice (e.g., minimum $\ell_2$-norm solution under gradient descent) determines which solution we deploy. Interpretation: the training objective does not uniquely identify the model, so the behavior on new data is determined by the algorithm’s inductive bias, not the loss function alone. A common misconception is that “perfect training fit implies good predictive structure”; in underspecified linear problems it only implies that some solution exists, not that the chosen one generalizes. A second misconception is that regularization is optional once interpolation is achieved; in fact, the choice of implicit or explicit regularizer is what selects among solutions. What-if scenarios illustrate the fragility: if we rotate the feature basis or rescale features, the minimum-norm solution changes, leading to different deployment behavior despite identical training loss. If we add a single informative feature, the solution can shift dramatically, showing that small changes to specification can decide which solution is selected. Explicit ML relevance: this is the simplest formal demonstration of why training loss alone is insufficient for deployment reliability, why feature normalization matters, and why implicit bias (e.g., gradient descent on standardized features) is a hidden design choice.

Example 2 — Non-Identifiable Parameterization in Neural Networks

Take a two-layer ReLU network $h_\theta(x) = W_2 \sigma(W_1 x)$ with hidden width $h$, and consider a dataset where the network can fit perfectly. The setup is non-identifiable because many parameterizations implement the same function: permuting hidden units, scaling $W_1$ by $c$ and scaling $W_2$ by $1/c$, or duplicating a hidden unit and halving its outgoing weight yields identical input-output mappings. The reasoning is that identifiability in probabilistic models is about unique likelihood, and here the likelihood is identical across entire parameter manifolds. Interpretation: the parameters themselves are not meaningful objects; only the function class matters, and multiple parameter vectors are equivalent. A common misconception is to interpret parameter values as “explanations” of model behavior; in non-identifiable networks, parameter values are not uniquely determined, so such interpretations are unstable. Another misconception is that “more parameters always imply more knowledge,” whereas non-identifiability implies parameter count can increase without adding new functional capability. What-if scenarios expose the limit: if you add a constraint that fixes hidden unit order or imposes orthogonality, identifiability improves, but optimization becomes harder and sometimes reduces performance. If you add weight decay, the symmetry is partially broken, and the optimizer prefers small-norm parameterizations, which may or may not align with the intended inductive bias. Explicit ML relevance: this explains why interpretability based solely on weights is fragile, why different training runs yield different “internal representations,” and why reproducibility in deep learning requires controlling initializations and optimization settings.

Example 3 — Benign Overfitting Simulation

Imagine a simulated dataset generated by $y = x^T \beta^* + \xi$ with $\beta^*$ sparse, $x \sim \mathcal{N}(0, I_d)$, and $\xi$ small Gaussian noise, where $d = 5000$ and $n = 1000$. The setup is deliberately overparameterized, and a model trained by gradient descent achieves zero training error. The reasoning is that in this regime, the optimizer selects a minimum-norm interpolating solution that aligns with the true signal, effectively separating signal and noise. Interpretation: the model “overfits” in the classical sense (zero training error) but does not generalize poorly because the implicit regularization of gradient descent chooses a solution that is aligned with the underlying signal. A common misconception is that interpolation always means memorization and therefore poor generalization; benign overfitting shows that interpolation can coexist with good test performance if the data distribution and algorithm bias align. Another misconception is that the only way to avoid overfitting is to reduce capacity; in this regime, more capacity can help if it enables the optimizer to find a favorable interpolating solution. What-if scenarios include increasing the noise level $\sigma^2$ or corrupting labels: the benign regime breaks down when noise dominates signal, and the interpolating solution begins to fit noise, producing sharp generalization collapse. If we switch from gradient descent to an adversarial optimizer that selects a large-norm interpolator, generalization can degrade sharply. Explicit ML relevance: this example explains why deep networks can fit massive datasets perfectly while still generalizing, and why implicit regularization and optimization dynamics are central to modern generalization theory.

Example 4 — Capacity Threshold Crossing

Consider a classification task with a true decision boundary that is highly nonlinear but smooth, and compare models of increasing capacity: a linear classifier, a shallow network with $h=10$ units, and a deeper network with $h=200$ units. The setup examines a capacity threshold: below a critical size, the model cannot fit the data; above it, the model suddenly fits nearly perfectly. The reasoning is that expressivity increases discretely with architecture; once the model crosses a representational threshold, the training loss collapses quickly. Interpretation: there is a phase transition in fit quality, not a smooth increase, and the boundary is controlled by the true data complexity, not just the number of samples. A common misconception is that performance improves smoothly with capacity; in practice, there is often a sharp transition where the model suddenly begins to fit. Another misconception is that the threshold is a fixed number of parameters; in reality it depends on architecture, data distribution, and optimization. What-if scenarios show why this matters: if you increase data complexity (e.g., adding new classes), the threshold shifts to higher capacity. If you keep capacity fixed and increase data size, the threshold can appear to move as the model fails to represent the richer distribution. Explicit ML relevance: this is the operational reason hyperparameter sweeps can look like “nothing works, nothing works, everything works,” and it motivates systematic capacity scaling rather than ad hoc tuning.

Example 5 — Degenerate Hessian at Scale

Take a wide ReLU network trained on a moderately-sized dataset where $d \gg n$. The setup focuses on the Hessian at convergence; because there are far more parameters than constraints, the Hessian is rank-deficient, and many directions are flat. The reasoning relies on the structure of the Jacobian: the Hessian is dominated by $X^T X$, and $\text{rank}(X^T X) \leq n$. Interpretation: the loss landscape has many flat directions and the optimizer can move in large subspaces without changing the loss, which makes the solution non-unique. A common misconception is that flat directions imply model fragility; in many cases they imply redundancy and can actually enable robustness to parameter perturbations. Another misconception is that second-order methods should always help; in degenerate landscapes, Hessian inversion is ill-posed and can destabilize training. What-if scenarios include adding strong $\ell_2$ regularization or constraining weights: these can lift some degeneracy, but may also increase curvature and introduce sharp minima that reduce generalization. If the dataset is expanded substantially (larger $n$), the Hessian rank increases and the degree of degeneracy decreases, often improving identifiability. Explicit ML relevance: this example explains why first-order methods remain dominant in deep learning, why large models are easy to fine-tune (flat directions), and why pruning and compression are often possible with minimal loss.

Example 6 — Expressivity Saturation Under Finite Data

Suppose we train increasingly large transformers on a fixed dataset with a finite vocabulary and limited coverage of semantic patterns. The setup isolates expressivity saturation: the model class can represent many functions, but the data only constrain a limited subset of that space. The reasoning is that once the model is expressive enough to fit the dataset, additional expressivity does not lead to new generalizable patterns because the data do not distinguish among them. Interpretation: apparent gains from scaling are initially due to crossing an expressivity threshold, but beyond that, the model enters a saturation regime where extra capacity provides diminishing returns. A common misconception is that expressivity equals performance; in fact, performance depends on how well the data can identify which expressible function is the true one. Another misconception is that more parameters always create better representations; without additional data or constraints, they can merely increase underspecification. What-if scenarios show the leverage points: if we add new data that cover previously unseen contexts, the saturation point shifts and larger models start to help again. If we change the objective (e.g., add contrastive learning), expressivity can be redirected toward useful distinctions. Explicit ML relevance: this informs data-centric ML practice: if scaling has stalled, the bottleneck may be data diversity or objective design, not model size.

Example 7 — Optimization Bias vs Explicit Regularization

Consider two training regimes for the same architecture: (1) SGD with small learning rate and early stopping, and (2) SGD with explicit $\ell_2$ regularization but aggressive learning rates. The setup compares implicit optimization bias against explicit regularization. Reasoning: both regimes can achieve similar training loss, but they may converge to different solutions because the optimization path biases which minima are reachable. Interpretation: “regularization” is not a single mechanism; explicit penalties and optimization dynamics create different inductive biases. A common misconception is that explicit regularization fully controls model complexity, making optimizer choice irrelevant; in practice, optimization can dominate the final model’s properties. Another misconception is that matching training loss implies matching generalization; two solutions with identical loss can have very different robustness and fairness properties. What-if scenarios include changing batch size or initialization: large batch sizes reduce gradient noise and can shift solutions toward sharper minima; small batch sizes increase noise and often favor flatter minima. If we eliminate explicit regularization entirely but retain early stopping, we may still obtain a low-norm solution due to optimization bias. Explicit ML relevance: this example explains why reproducibility across labs is hard: differences in training dynamics can change results even when architectures and datasets are identical.

Example 8 — Functional Redundancy in Wide Networks

Take a wide network where hidden layer width is far larger than required for the task. The setup emphasizes that multiple parameter configurations compute the same function: scaling one layer up and scaling the next down, swapping hidden units, or duplicating units yields identical outputs. Reasoning: the function space is lower-dimensional than parameter space, so many parameters are redundant. Interpretation: the model’s effective capacity is not the parameter count; it is the dimensionality of the function manifold that matters for generalization and robustness. A common misconception is that pruning must hurt performance because it removes parameters; in redundant networks, pruning can remove large fractions without noticeable loss. Another misconception is that each parameter contributes independently to the learned function; redundancy implies many parameters are functionally irrelevant. What-if scenarios include applying structured pruning or low-rank factorization: performance often stays stable because the model was redundant. If we train the same architecture on a more complex task, redundancy may shrink and pruning may begin to hurt, indicating that redundancy is task-dependent. Explicit ML relevance: this explains why model compression works, why ensembling within a single model can occur implicitly, and why parameter-efficient fine-tuning can be effective.

Example 9 — Stability Collapse Under Drift

Imagine a recommendation model trained on user behavior from last year, deployed into an environment where user preferences shift rapidly due to external events. The setup focuses on stability: the model is stable with respect to small perturbations in training data, but not with respect to distribution drift. Reasoning: stability bounds apply to i.i.d. draws from the same distribution; once the distribution changes, the stability guarantee no longer bounds the test error. Interpretation: a model can be stable and still fail catastrophically under drift, which is a different failure mode than overfitting. A common misconception is that if a model has good generalization guarantees, it will remain reliable in production; stability is a within-distribution guarantee, not a temporal one. Another misconception is that monitoring only training loss or validation loss is sufficient; in drift, both can remain stable while deployment performance collapses. What-if scenarios include adding continual learning or periodic retraining: stability to drift can be improved but at the risk of catastrophic forgetting or increased variance. If we add explicit drift detection (e.g., population shift tests), we can trigger retraining before collapse. Explicit ML relevance: this example motivates MLOps practices like drift monitoring, canary deployments, and shadow evaluation, highlighting that generalization theory is necessary but insufficient for production reliability.

Example 10 — Compute-Constrained Optimization Failure

Consider training a large model under a strict compute budget where only a fraction of the intended training steps are feasible. The setup is compute-constrained optimization: the model is expressive enough, but the optimizer is stopped far from convergence. Reasoning: the optimization–representation gap is large because the model can represent the solution, but the algorithm cannot reach it within the budget. Interpretation: performance limitations are algorithmic rather than representational; more capacity does not help when the optimizer cannot exploit it. A common misconception is that bigger models always do better; under compute constraints, larger models may underperform because they are harder to optimize within budget. Another misconception is that reducing the learning rate always improves performance; under tight budgets, large learning rates can be necessary to make progress, even if they reduce stability. What-if scenarios show the tradeoffs: if we reduce model size, the optimizer converges faster but may hit expressivity limits; if we change the optimizer (e.g., use adaptive methods or warm starts), the gap may shrink without changing capacity. Explicit ML relevance: this is the practical reason scaling laws include compute and why effective training often requires co-design of model size, dataset size, and training budget.

Example 11 — Scaling Saturation in Large Models

Suppose a language model is scaled from 1B to 10B to 100B parameters on the same dataset. The setup isolates scaling saturation: improvements follow a power law initially, then plateau as additional capacity provides diminishing returns. Reasoning: beyond a certain size, the data become the bottleneck, and the model learns increasingly similar representations regardless of additional parameters. Interpretation: the observed plateau is not necessarily a failure of optimization but a limitation of data coverage and objective specification. A common misconception is that the plateau implies the model has “maxed out” the task; it may simply reflect that the dataset is too narrow or the objective too shallow. Another misconception is that scaling will inevitably overcome misalignment; in reality, scaling can amplify misalignment by making the model more effective at optimizing the wrong objective. What-if scenarios include adding higher-quality data, new tasks, or auxiliary objectives: the plateau can shift upward, revealing that the saturation was data-limited. If we change evaluation to a more challenging distribution, performance may drop sharply, showing that saturation on the original distribution concealed brittleness. Explicit ML relevance: this example informs compute allocation decisions and argues for data investment, evaluation diversification, and objective design over endless model scaling.

Example 12 — Objective–Representation Gap Demonstration

Consider a content moderation model trained to maximize click-through rate (CTR) while the true goal is to reduce harmful content exposure. The setup illustrates an objective–representation gap: the model can represent the desired policy, but the objective function targets a proxy that is only loosely related. Reasoning: optimization faithfully improves CTR, which correlates with engagement, but engagement may increase exposure to sensational or harmful content. Interpretation: the model is “optimal” under the defined objective but fails under the actual goal; this is not an optimization failure but a specification failure. A common misconception is that adding more data will resolve the problem; if the objective is misaligned, more data will simply optimize the wrong target more effectively. Another misconception is that fairness or safety can be “patched in” after the fact; if the objective is wrong, post hoc fixes are fragile and often conflict with the primary optimization. What-if scenarios show the leverage points: if we replace CTR with a multi-objective loss that penalizes harm, the model can move toward the intended behavior but may reduce engagement, creating organizational tradeoffs. If we add human-in-the-loop feedback as a constraint, the model can incorporate values that are not present in the data. Explicit ML relevance: this example is the central argument for rigorous objective design, alignment audits, and governance practices, especially in high-stakes domains.

Summary

Key Ideas Consolidated

The central theme of this chapter is that optimization is a powerful but incomplete tool. When a problem is underspecified, multiple solutions fit the training data equally well, and the algorithm’s implicit bias decides which solution is deployed. When objectives are misaligned, optimization can amplify the wrong behaviors, even if the algorithm is correct and the data are plentiful. When the landscape is non-convex and the parameter space is high-dimensional, optimization bias becomes a structural feature, not a nuisance. When expressivity saturates or compute limits are reached, scaling fails to produce the expected benefits. Each of these limits is structural: it arises from the mathematics of learning, not from an implementation flaw.

The practical takeaway is not to abandon optimization, but to treat it as one component in a larger system of decision-making. Objective design, inductive bias, data curation, and deployment constraints determine whether optimization succeeds or fails. The chapter’s definitions and theorems formalize these limits; the worked examples show how they appear in practice. Together, they argue that “better optimization” is often the wrong fix for failures that are rooted in specification, identifiability, and representational limits.

What the Reader Should Now Be Able To Do

After this chapter, the reader should be able to diagnose why a model fails even when training loss is excellent: whether the failure is due to underspecification, non-identifiability, expressivity bounds, or compute constraints. The reader should be able to distinguish between optimization error (the model can represent the solution but the algorithm fails to find it) and representation error (the model cannot represent the solution at all). The reader should be able to reason about when scaling helps and when it saturates, and to recognize regime shifts where an algorithmic strategy that worked in one scale fails at another. Most importantly, the reader should be able to interpret a model’s success or failure in deployment as a consequence of structural choices made in objective design, data selection, and optimization bias.

Structural Limits Identified

This chapter identifies several limits that persist regardless of compute or data scale. Underspecification implies that perfect training loss does not imply unique or reliable behavior. Non-identifiability implies that even with infinite data, parameters may remain ambiguous, making interpretability fragile. VC-dimension and expressivity bounds imply that some functions cannot be represented, no matter how well optimized. Computational complexity limits imply that some objectives cannot be solved exactly, requiring approximations that change the solution space. Degenerate Hessians and curvature explosion create instability for second-order methods. Scaling saturation implies that beyond a point, more parameters yield diminishing returns unless data or objectives change. Together, these are not incidental issues but core properties of learning systems.

Active Assumptions for Future Research

The analysis in this chapter assumes that data are drawn from fixed distributions, that model classes are well-defined, and that objectives are at least partially aligned with the intended task. Real-world systems violate these assumptions: data shift over time, objectives change with policy, and learned models alter the data they observe. Future research must relax the i.i.d. assumption, model feedback loops explicitly, and develop optimization frameworks that incorporate dynamic objectives and human values. Another assumption is that optimization bias is a fixed property of an algorithm; in practice, it can be altered by training schedules, architecture changes, and data augmentation. Understanding how to control optimization bias intentionally remains an open problem.

End-of-Chapter Advanced Exercises

A. True / False (20)

A.1. In an overparameterized linear model trained to zero loss with gradient descent, the minimum-norm interpolating solution is uniquely determined by the data distribution up to measure-zero perturbations, regardless of feature scaling.

A.2. For any fixed data distribution, there exists a model capacity threshold beyond which test error must monotonically decrease with additional parameters under any optimization algorithm.

A.3. In the presence of non-identifiability, the Fisher Information Matrix necessarily has at least one zero eigenvalue under the true data distribution, even with infinite data.

A.4. Double descent implies that increasing model capacity beyond the interpolation threshold cannot worsen test error if explicit regularization is held constant.

A.5. The No-Free-Lunch theorem implies that an algorithm that is Bayes optimal on one distribution can be arbitrarily suboptimal on another distribution with the same marginal feature distribution.

A.6. If a hypothesis class has VC-dimension larger than the sample size, uniform convergence guarantees are vacuous even when test error is empirically small.

A.7. In non-convex optimization, convergence to a stationary point implies convergence to a global minimum whenever the Hessian at that point is rank-deficient.

A.8. Scaling laws of the form $\mathcal{L} \approx L^* + A d^{-\alpha}$ imply that dataset quality can be traded off for model size without affecting the asymptotic limit $L^*$.

A.9. For any fixed compute budget, there exists a model size beyond which increasing capacity decreases achievable test performance, even if the model class contains the Bayes optimal predictor.

A.10. Stability-based generalization guarantees can hold even when the hypothesis class has infinite VC-dimension, provided algorithmic stability is $O(1/n)$.

A.11. In underspecified problems, two models with identical training loss and identical test loss can still have arbitrarily different predictions on a non-negligible subset of the deployment distribution.

A.12. The existence of curvature explosion in the loss landscape implies that any first-order method must use a learning rate that scales inversely with the condition number to guarantee convergence.

A.13. A computational complexity barrier for optimal feature selection implies that no polynomial-time algorithm can guarantee even a constant-factor approximation unless $\mathcal{P} = \mathcal{NP}$.

A.14. Expressivity saturation can occur even when the model class is a universal approximator, if the data distribution fails to constrain function choice.

A.15. In the overparameterized regime, implicit regularization from gradient descent always selects the same interpolating solution as explicit $\ell_2$ regularization with an appropriate coefficient.

A.16. Scaling saturation implies that additional data are ineffective if model capacity is already beyond the saturation threshold.

A.17. If the optimization–representation gap is small, then model performance can be improved only by changing the objective, not by changing the optimization algorithm.

A.18. The presence of functional redundancy implies that pruning cannot increase test error unless it changes the effective hypothesis class.

A.19. A regime shift boundary in the ratio $d/n$ can cause the optimal training strategy to change from explicit regularization to reliance on implicit regularization.

A.20. In any fixed architecture, increasing depth and width indefinitely guarantees elimination of objective misalignment effects on deployment behavior.

B. Proof Problems (20)

B.1. Prove that for any hypothesis class with VC-dimension $d$, there exists a distribution over $\mathcal{X} \times \{0,1\}$ such that any ERM learner trained on $n < d/2$ samples has test error at least $1/4$ with probability at least $1/3$.

B.2. Prove that for overparameterized linear regression with isotropic Gaussian features, gradient descent initialized at zero converges to the minimum $\ell_2$-norm interpolating solution, and characterize the solution as a function of the data matrix.

B.3. Prove that if a learning algorithm is uniformly stable with stability parameter $\beta$, then with probability at least $1-\delta$, the generalization gap is bounded by $\beta + O\left(\sqrt{\log(1/\delta)/n}\right)$.

B.4. Prove that for any fixed architecture, there exists a data distribution for which scaling model width beyond a threshold yields no improvement in test error beyond an $\epsilon$ margin, despite continued reduction in training loss.

B.5. Prove a no-free-lunch result for regression: for any learning algorithm and any fixed sample size $n$, there exists a data distribution over $\mathbb{R}^d \times \mathbb{R}$ such that the expected test error is bounded below by a constant independent of $n$.

B.6. Prove that the Hessian at an interpolating solution of a squared-loss neural network has rank at most $n$ in the absence of explicit regularization, and deduce the existence of at least $d-n$ flat directions.

B.7. Prove that any algorithm for exact $k$-feature selection that runs in time polynomial in $d$ and $n$ implies $\mathcal{P} = \mathcal{NP}$.

B.8. Prove that in a model class with non-identifiable parameterization, the Fisher Information Matrix is singular for all parameter values that represent the same likelihood.

B.9. Prove that for any non-convex loss with Lipschitz-continuous gradient, gradient descent with fixed learning rate $\eta < 1/L$ converges to a point with gradient norm at most $O(1/\sqrt{T})$ in $T$ iterations.

B.10. Prove that for any fixed compute budget $C$, there exists a model size $d$ and training dataset size $n$ for which the optimization–representation gap is bounded below by a positive constant when constrained to $C$ gradient evaluations.

B.11. Prove that if the data distribution admits multiple Bayes-optimal classifiers that disagree on a set of positive measure, then any consistent learner is underspecified in the sense of Definition 1.

B.12. Prove that for a ReLU network with width $h$ and depth $d$, the number of linear regions of the input space is bounded above by $2^{O(hd)}$, and derive a corresponding upper bound on VC-dimension.

B.13. Prove that a scaling law of the form $\mathcal{L}_{\text{test}}(d) = L^* + A d^{-\alpha}$ implies diminishing returns in the sense that doubling $d$ reduces excess loss by at most a factor of $2^{-\alpha}$.

B.14. Prove that for any learning algorithm that minimizes empirical risk over a hypothesis class with infinite VC-dimension, there exists a distribution for which the algorithm fails to generalize with non-trivial probability, despite zero training error.

B.15. Prove that any algorithm that is $\epsilon$-robust to $\ell_\infty$-bounded adversarial perturbations must sacrifice at least $\Omega(\epsilon)$ accuracy on a class of linear separable distributions.

B.16. Prove that if a loss landscape has condition number $\kappa$, then any first-order method with fixed learning rate must use $\eta \leq 1/\lambda_{\max}$ to guarantee descent, and show how convergence rate depends on $\kappa$.

B.17. Prove that in the overparameterized regime $d \gg n$, the set of interpolating solutions forms an affine subspace of dimension at least $d-n$ for linear regression.

B.18. Prove that for any fixed objective misalignment (formalized as a bounded discrepancy between the training loss and the deployment loss), scaling model capacity cannot reduce the deployment loss below a fixed offset determined by the misalignment.

B.19. Prove that a regime shift boundary exists in the ratio $d/n$ for which the optimal regularization parameter in ridge regression changes discontinuously as a function of $d/n$.

B.20. Prove that for any fixed architecture and training algorithm, there exists a distribution shift of bounded total variation such that deployment error exceeds training error by at least a constant.

C. Python Exercises (20)

C.1 — Interpolation Across Parameterization Regimes

Task: Construct a simulation with synthetic data where the true signal is generated as $y = X ^* + $, with $^*$ sparse (only 5–10 active components out of $d$), $X (0, I)$, and noise $(0, ^2 I)$. Train linear models using gradient descent across three regimes by varying the dimension $d$: (1) Underparameterized: $d = 0.5n$ (2) Critical: $d = n$ (3) Overparameterized: $d {2n, 5n, 10n}$. For each regime, compute: (a) training error ${}$, (b) test error ${}$ on a held-out set, (c) generalization gap ${} - {}$, (d) minimum-norm solution’s $_2$ norm. Plot error curves as functions of $d/n$ on a continuous scale, showing the three regimes.

Purpose: The classical U-shaped bias-variance curve predicts test error rises monotonically with capacity beyond the noise level, but modern empirical results show non-monotonic behavior (double descent). By directly simulating across a wide range of dimensionality, you observe where the classical curve breaks down, how benign overfitting emerges, and under what noise levels interpolation becomes safe. This exercise builds intuition for why traditional model selection criteria (e.g., AIC, BIC) can be misleading in high-dimensional settings.

ML Link: This directly tests Theorem: Double Descent Curve and connects to Theorem: Benign Overfitting Conditions. Example 1 (Interpolation and Generalization) provides the conceptual foundation. The exercise validates that perfect training fit (zero training error) coexists with good test error when  is small and $d n$, contradicting classical statistics.

Hints: Generate data consistently across regimes: use the same $n = 1000$, vary $d$, repeat experiments with multiple random seeds ($10$), and report mean ± std. Use np.linalg.lstsq or gradient descent to solve the regression problem. For overparameterized regimes, ensure the model interpolates (training loss ≈ 0). Normalize errors by noise variance $^2$ to enable comparison across noise levels. Create visualizations with error bars showing ±1 standard deviation.

What mastery looks like: Generate a comprehensive plot showing training error, test error, and generalization gap across $d/n $. For low noise ($= 0.1$), demonstrate: underparameterized region ($d < n$) shows positive gap; at $d = n$ test error spikes (peak of bias-variance curve); beyond $d = n$, test error decreases despite perfect interpolation (benign overfitting region), approaching the noise floor at $d n$. Report specific numbers: at $d = 10n$, test error ≈ 1.1σ (slightly above noise floor, benign), while training error = 0 (perfect fit). For high noise ($= 1.0$), the peak at $d = n$ is more pronounced, and benign overfitting region is less pronounced. Create a second analysis: plot the minimum-norm solution’s norm $||$ vs. dimensionality, showing that it increases sharply near $d/n = 1$ then grows sub-exponentially for $d > n$. Analyze test error decomposition: compute bias and variance estimates using bootstrap or cross-validation, showing that below $d = n$ variance dominates, at $d = n$ both are large, and above $d = n$ variance drops dramatically. Measure how the error curves change with signal sparsity: compare sparse (5 active components) vs. dense (all $^*$ nonzero) signals, showing that sparsity narrows the problematic region near $d = n$. Conclude with quantitative thresholds: identify the minimum $d/n$ ratio at which test error stabilizes below 1.2σ, relating this to practical model selection (e.g., for signal recovery, $d/n \geq 5$ may be necessary).

C.2 — Feature Scaling and Implicit Bias

Task: Use a fixed dataset of $n = 500$ samples and $d = 2000$ dimensions in the overparameterized regime. Initialize features with diverse scales: (1) standardized: each feature $(0, 1)$, (2) unstandardized: first 100 features $(0, 100)$, remaining $(0, 0.01)$, (3) random rotation: apply an orthogonal transformation $U$ to standardized features, $X_{} = X U$. For each scaling, train a model using gradient descent to interpolation, compute the minimum-norm solution , and measure: (a) $||_2$, (b) per-feature weight magnitudes $_j$, (c) correlation between feature scale and solution magnitude, (d) test error on a held-out set with labels generated from the true signal $^*$. Compare functional similarity: evaluate both models on a dense test grid and measure $_2$ distance between predictions.

Purpose: In overparameterized linear regression, infinitely many solutions achieve zero training error. Which one the optimizer selects depends critically on the parameterization. By varying feature scaling without changing the data distribution itself, you isolate the role of implicit bias and demonstrate that preprocessing is not a mere numerical convenience but a fundamental part of model specification. This teaches practitioners that reproducibility requires documenting both the algorithm and the preprocessing.

ML Link: This tests Theorem: Implicit Bias of Gradient Descent and demonstrates Underspecification in Practice. Example 2 (Implicit Bias and Parameterization) provides theory; this exercise quantifies real effects. The observation that functionally identical models can have vastly different parameter vectors relates to the optimization–representation gap and parameter non-identifiability.

Hints: Keep the target function $^$ fixed, so only the predictor variables change. Use consistent optimization (small learning rate, many epochs to convergence). For each scaling variant, compute the test error on fresh data generated from $^$. Use feature importance measures (absolute coefficient values, permutation importance) to quantify reliance on different features across scalings.

What mastery looks like: Report a table showing Results for Three Scalings: Scaling | $||$ | Test Error | Feature Weight Correlation with Scale. For standardized features, expect $|| \approx 1–2$ (balanced norm distribution). For unstandardized features, $|| \approx 100–500$ with highly skewed distribution: large weights on small-scale features, small weights on large-scale features (because gradient descent prefers adjusting features with large gradients). For rotated features, $||$ similar to standardized (orthogonal transformation preserves norms), but the pattern of weights is scrambled. Create visualizations: (1) histogram of $|j|$ for two scalings showing distinct distributions, (2) scatter plot of feature scale vs. weight magnitude showing clear negative correlation for unstandardized features. Measure functional disagreement: despite different parameter vectors, test error should be relatively similar (within ~5–10% variation) because they fit the same signal. However, compute predictions on a constructed adversarial test point where features have unusual scaling; at this point, the predictions diverge significantly (e.g., two models disagree by $0.5 \{}$). Create a second analysis: compare robustness to feature perturbations. Add small noise to each feature independently and remeasure test error; models with unbalanced weights are more sensitive to noise in their high-weight features. Conclude with practical implications: discuss why feature standardization is essential for reproducibility, how scaling affects regularization strength (implicit and explicit), and how to detect unintended scale dependence in production systems.

C.3 — Underspecification and Optimization Bias

Task: Construct a synthetic classification problem on $^2$ with two disjoint training regions that have spurious correlations: Region A ($50%$ of data): points in $[-2, -1] $ labeled +1 with spurious red color; points in $[1, 2] $ labeled −1 with spurious blue color. Region B (held-out test set): same point locations but labels are flipped; spurious colors are random. Define two linear hypotheses: (1) Spurious: $f_S(x) = ((x))$ (uses spurious color), achieves 100\% training accuracy; (2) True: $f_T(x) = (x_1)$ (uses true pattern $x_1$ coordinate), also achieves 100\% training accuracy on Region A but generalizes to Region B. Train a simple neural network classifier using SGD with learning rates ${0.001, 0.01, 0.1, 1.0}$ and batch sizes $B {8, 32, 128}$, multiple random seeds, and measure: (a) training accuracy (should be 100\% for all), (b) test accuracy on Region B, (c) gradient attribution or saliency maps showing which features the model relies on. Classify each run as “spurious” or “true” based on which features dominate.

Purpose: Multiple hypotheses can be equally consistent with training data. Which one an optimizer selects depends on implicit bias, learning rate, batch size, and random initialization. This exercise makes underspecification concrete: you demonstrate that two Bayes-optimal models on training data behave completely differently on test data, and that optimization hyperparameters deterministically shift the selected solution. This is critical for understanding why hyperparameter tuning, seeds, and initialization matter beyond just optimization speed.

ML Link: This tests Underspecification in Deep Learning and Theorem: Optimization-Induced Bias. Example 3 (Spurious Correlations and Optimization Bias) provides motivation. The exercise demonstrates that stronger regularization (smaller learning rate, larger batch size) often pushes toward simpler patterns (true), while aggressive optimization (large learning rate, small batch size) may find complex spurious features first.

Hints: Design the spurious and true hypotheses so they have similar model complexity (both linear) to isolate bias, not capacity effects. Use saliency maps or feature importance to diagnose which hypothesis was selected. Repeat each (LR, batch size) combination multiple times and report fraction of runs selecting “spurious” vs. “true”. Run early stopping using a clean validation set to detect which pattern emerges early.

What mastery looks like: Generate 5×3 grid of results showing (Learning Rate × Batch Size) with cell values showing percentage of seeds selecting “spurious” (red) or “true” (blue). For conservative settings ($, B=128$), >80% of runs select the true hypothesis (train robust, generalize). For aggressive settings ($, B=8$), >70% select spurious (overfit). Create saliency maps for two representative runs (one spurious, one true) showing which input features the model attends to. Generate a timeline: plot training and test accuracy over epochs for one run selecting spurious and one selecting true, showing that both reach 100% training accuracy but diverge on test. Create a second analysis: compute gradient magnitude statistics early in training (epoch 5–10) and late (epoch 100+); hypothesize that spurious-selecting runs show larger gradients w.r.t. spurious features early, but the relationship may be complex. Measure implicit bias: train two models with identical data, initialization, and hyperparameters, but one with clean features and one with noisy spurious features added; quantify the shift in learned parameters. Conclude with practical guidance: discuss how to detect spurious correlation reliance in deployed models (via stress tests or held-out OOD data), explain why larger batch sizes or longer schedules with lower learning rates reduce spurious reliance, and propose monitoring strategies to flag when a model is overly reliant on spurious features.

C.4 — Double Descent Observation

Task: Generate synthetic regression data with $n = 200$ samples, true signal generated from $y = X^* + $ with $^*$ having 10 active components, noise $= 0.5$. Train least-squares linear regression models by varying dimension $d {10, 20, 50, 100, 150, 200, 250, 300, 400, 500, 750, 1000}$. For each $d$, also train a fully-connected neural network (2 hidden layers, 50 units each) with identical data. For both model classes, perform grid search over explicit $_2$ regularization strength , selecting the tuned  that minimizes validation error (use 5-fold cross-validation). Generate three plots: (1) test error vs. $d$ for least-squares with optimal $(d)$, (2) test error vs. $d$ for neural networks with optimal $(d)$, (3) overlap showing both on the same axes. Identify the location of the interpolation threshold (where training error first reaches zero).

Purpose: Double descent is the most striking empirical phenomenon violating classical statistical intuition: test error increases with capacity, peaks near the interpolation threshold, then decreases despite perfect training fit. This exercise provides hands-on experience reproducing this curve, understanding why it occurs (implicit bias of interpolators), and observing how it changes across model classes and noise levels. Seeing the curve is more convincing than reading about it.

ML Link: This directly validates Theorem: Double Descent Curve and connects to Theorem: Benign Overfitting Conditions. Examples 1–3 provide conceptual grounding; this exercise is empirical validation.

Hints: Ensure the model actually interpolates in overparameterized regimes: check that training loss ≈ 0 for $d > n$. Use well-tuned learning rates and sufficient epochs for convergence. For neural networks, use the same initialization scheme across runs. Perform multiple random seeds ($5$) and report mean ± std. Use semilog plots if needed to make both regimes visible.

What mastery looks like: Generate clear double descent plots showing: training error drops sharply to zero around $d = n$, then stays at zero for $d > n$. Test error shows a V-shape with a peak near $d/n $: starts high in underparameterized regime (~0.6–0.8), decreases toward $d n$, then increases sharply to a peak (~1.2–1.5 above noise floor) at $d 1.2n$ (interpolation threshold + some buffer), then descends to approach noise floor (~0.5–0.6) at $d = 1000$. The descent should be smooth and clearly visible on the plot. Quantify the peak value: report peak test error and its location in $d$. Compare across model classes: neural networks typically show a smoother descent than least-squares (may relate to different implicit biases). Create a table: Regime | Location | Test Error (LS) | Test Error (NN), comparing both models. Provide a second analysis: compute the condition number of the design matrix $\kappa(X)$ and correlate with test error. Near $d = n$, condition number is high (ill-conditioning, explaining the peak). Show effective rank (number of singular values above threshold) vs. $d$. Compute implicit regularization: for least-squares, measure the norm $||$ across the interpolation threshold, showing a spike near $d = n$ and decay for large $d$. Analyze noise sensitivity: repeat with different noise levels ($\{0.1, 0.5, 2.0\}$) and show how the peak changes (larger noise → higher peak, shifted rightward). Conclude with interpretation: explain why the second descent emerges (implicit bias of minimum-norm interpolators), discuss which regime is preferable for different applications, and speculate on how architecture and optimization dynamics affect the shape of the curve.

C.5 — Benign Overfitting and Implicit Regularization

Task: Generate high-dimensional linear regression data: $n = 200$, $d = 1000$ (strongly overparameterized), true signal from sparse $^* \in \mathbb{R}^d$ with 10 active components, noise $= 0.5$. Construct three families of interpolating solutions: (1) Minimum-norm: $_{} = ||_2$ s.t. $X= y$, found via gradient descent initialization at zero. (2) Random interpolation: randomly sample 100 solutions from the solution space ${: X= y}$ by adding vectors from the null space of $X$. (3) Limited iterations: run gradient descent for $k \in \{10, 50, 100, 500, 1000\}$ iterations (before convergence) and record the solution at each checkpoint. For each solution family, measure: (a) $_2$ norm $||_2$, (b) test error on held-out data, (c) stability over noise levels $\{0.1, 0.5, 1.0, 2.0\}$. Plot test error vs. solution norm for all three families on the same axes, color-coded by family.

Purpose: Not all interpolators generalize equally well. Benign overfitting occurs when the algorithm’s implicit bias selects an interpolator with low test error. This exercise isolates algorithmic bias by comparing gradient descent (implicit regularization via initialization and dynamics) to random interpolators (no particular bias). By measuring test error as a function of solution norm, you quantify how much algorithmic bias helps generalization and identify regimes where interpolation is benign vs. harmful.

ML Link: This tests Theorem: Benign Overfitting Conditions and Implicit Bias of Gradient Descent. Example 4 (Algorithmic Bias and Generalization) provides theory. The exercise demonstrates that “fitting the training data” is not the problem; which fit you find is everything.

Hints: Use np.linalg.lstsq or QR decomposition to find the minimum-norm solution. For random interpolation, compute an orthonormal basis for the null space of $X$ and sample random vectors in the null space, adding them to a particular solution. Store intermediate GD iterates by saving the model at regular intervals. Generate test data from the same distribution as training to measure generalization fairly.

What mastery looks like: Generate a scatter plot: x-axis = $||_2$, y-axis = test error, with three color families. Demonstrate: minimum-norm GD solutions have moderate norm (~20–50 depending on problem scale) and low test error (~0.6–0.8, near $2$). Random interpolators span a wide range of norms (due to sampling from the null space), with test error strongly correlated to norm: very high-norm solutions have test error >2, low-norm solutions have test error ~0.8–1.0. Limited iterations show a trajectory: early iterations (10–50 steps) have low norm and good test error, consistent with GD’s implicit regularization. Create a table: Solution Family | Mean Norm | Mean Test Error | Noise Sensitivity, showing quantitative comparisons. Generate a second figure showing test error vs. noise level for three representative solutions: one from each family. Demonstrate that minimum-norm GD is robust across noise levels (test error scales roughly with noise linearly). Random high-norm interpolators are fragile: test error increases sharply with noise. Compute a metric for “order reversal”: as noise increases, some random interpolators with lower-norm might outperform higher-norm ones differently than GD trajectories. Analyze the null space: compute the dimension of the null space $d - (X) = d - (n,d) = 1000 - 200 = 800$, showing how many degrees of freedom exist for interpolators. Demonstrate that random interpolators explore this 800-dimensional space poorly (high variance), while GD finds a concentrated region (implicit manifold). Conclude with interpretation: explain why gradient descent’s implicit bias (initialization at zero, continuous dynamics) produces good generalization, discuss what properties of solutions lead to benign overfitting (low norm, alignment with data), and propose practical implications: if you must interpolate, ensure your optimization (learning rate, stopping time) implicitly biases toward good solutions.

C.6 — Parameter Non-Identifiability and Functional Stability

Task: Create a fully-connected neural network with intentional redundancy: input layer ($d=50$), hidden layer ($m=200$ units, intentionally wide), output layer (1 unit). Initialize the network with a configuration where  hidden units are functionally redundant (weight matrices set so they compute identical functions or can be freely permuted). Train 20 independent copies of this network on a fixed regression task using identical hyperparameters but different random seeds. For each trained network, measure: (a) parameter diversity: $ = $ across all weight matrices (do the parameters diverge?), (b) functional stability: $L_{}$-distance between predictions on a fine test grid, $_x |^{(i)}(x) - ^{(j)}(x)|$ pairwise for all 20 runs, (c) hidden unit permutation: measure how much permuting redundant hidden units changes the function (should be zero if genuine redundancy). Create visualization: a scatter plot of parameter divergence vs. functional divergence for all pairs of runs.

Purpose: Deep learning has many more parameters than samples, yet different parameter settings produce similar (or identical) functions. This fundamental asymmetry—parameters are non-identifiable, but functions are—is at the heart of why modern neural networks generalize despite having capacity to memorize. This exercise makes non-identifiability concrete by showing that you can train the same function on the same data and get radically different parameters, yet similar outputs. This clarifies why parameter-level interpretability is fragile but function-level behavior is stable.

ML Link: This tests Definition: Parameter Non-Identifiability and Theorem: Functional Equivalence Under Redundancy. Example 5 (Overparameterization and Non-Identifiability) provides theory; this exercise demonstrates the practical implications. The observation that different parameter trajectories converge to the same function is central to implicit bias.

Hints: Use the same architecture, initialization scheme, and optimization hyperparameters across all 20 runs, varying only the random seed. To verify hidden unit redundancy, either design it directly (e.g., duplicate a hidden unit with same weights), or detect it post-hoc via correlation analysis of activations. Use Procrustes alignment to find optimal parameter matching before comparing divergence. Compute permutation invariance explicitly: for suspected redundant units, permute their positions and check if output changes.

What mastery looks like: Demonstrate a striking dichotomy: parameter-level divergence is high (e.g., parameter Frobenius norms differ by 30–50%), while functional divergence is negligible (e.g., $|f^{(i)} - f^{(j)}| < 0.01$ on test grid). Create a heatmap showing pairwise parameter distances (20×20 matrix showing all pairs), which is highly structured with large values, contrasted to a pairwise functional distance heatmap, which is nearly uniform and small. Generate a scatter plot of (parameter distance, functional distance) for all $ = 190$ pairs, showing a clear horizontal concentration: solutions stack vertically at low functional distance but spread horizontally in parameter space. Quantify permutation invariance: show that randomly permuting redundant hidden units does not change the output ($f < 10^{-6}$), while permuting non-redundant units dramatically changes outputs ($f > 0.1$). Create a visualization showing hidden unit activations: a heatmap of mean absolute activations for 20×200 hidden units, showing columns (units) that are nearly identical across runs. Compute the effective dimension: perform PCA on the stacked parameters from all 20 runs, reporting that true effective parameter dimension is $ (50 + 1 + 1) = 10,200$: maybe only ~500-1000 effective dimensions are used. Analysis: train with different hidden layer dimensions ($m \{50, 100, 200, 500\}$) and show that functional divergence remains low, but parameter divergence increases dramatically with $m$ (wider networks have more redundancy). Conclude with interpretation: explain why parameter uncertainty does not imply functional uncertainty, discuss implications for transfer learning (fine-tuning can change parameters but preserve function), and propose metrics that account for non-identifiability (use function-level, not parameter-level, diagnostics).

C.7 — Data vs. Model Scaling Saturation

Task: Use a fixed architecture (ResNet-20 or equivalent simple CNN) and CIFAR-10. Create two experiments: (1) Data scaling: fix model size, and vary training data size $n \{500, 1000, 5000, 10000, 50000\}$ (subsample CIFAR-10 training set), train to convergence, and measure test accuracy. (2) Model scaling: fix $n = 50000$ (full CIFAR-10) and vary model capacity by scaling all channel dimensions by factors $w \{0.25, 0.5, 1.0, 2.0, 4.0\}$. For each configuration, compute: (a) test accuracy, (b) compute cost (FLOPs or training time), (c) accuracy per unit compute (accuracy divided by FLOPs). Create three plots: test accuracy vs. data size, test accuracy vs. model width, and accuracy-per-compute vs. both axes. Identify saturation regimes where further scaling yields diminishing returns.

Purpose: Modern machine learning systems can scale data or model size, but both have limits. By empirically measuring where each axis saturates, you learn to diagnose what is holding back performance and therefore where to invest effort next. This ties to practical resource allocation: is it better to spend budget on labeling more data or on compute to run bigger models?

ML Link: This tests Theorem: Scaling Laws and Diminishing Returns, connecting to Example 6 (Data Scaling vs. Capacity Scaling). The exercise provides empirical grounding for scaling law research and practical implications for prioritizing investments.

Hints: Keep all hyperparameters identical across scaling experiments (learning rate, batch size, epochs, augmentation). Measure compute using FLOPs or wall-clock time consistently. Plot on log-log axes to detect power-law relationships. Fit power laws $ = a - b n^{-}$ to identify saturation exponent .

What mastery looks like: Generate three plots showing clear saturation patterns. For data scaling with fixed model, demonstrate rapid accuracy improvement initially (e.g., 500→5000 samples: accuracy 60%→75%), then saturation (5000→50000: accuracy 75%→78%, gains slow). Report a power-law exponent: test accuracy follows $ - c/n^{0.3}$, meaning doubling data gives $% ) improvement for large \(n$. For model scaling with fixed data, show similar saturation: model size 0.25×→1.0×: accuracy 70%→82%, then 1.0×→4.0×: accuracy 82%→85% (smaller gains). Quantify the saturation regime: identify where diminishing returns become severe (e.g., beyond 4× width, accuracy gains <1% per doubling). Create a table: Data Size | Accuracy | Model Width | Accuracy showing trade-offs. Compute accuracy-per-compute for both axes: data scaling yields diminishing returns per sample, model scaling yields diminishing returns per parameter. Generate a heatmap showing test accuracy for all (data size, model width) combinations, visualizing the joint optimization landscape. Show that there exists an optimal balance: sometimes a smaller model on more data beats a larger model on small data with the same compute budget. Perform a second analysis: measure convergence speed (epochs to target accuracy) as a function of data and model size. Data scaling can slow convergence (noisy gradients from small batches), model scaling can speed it (larger batches possible). Conclude with practical recommendations: quantify the "optimal data-compute ratio" for CIFAR-10 (e.g., with 1000 FLOPs budget, use 1000 samples and 0.5× model rather than 500 samples and 1.0× model), discuss how saturation changes for different tasks, and propose monitoring frameworks to detect when to switch from data to model scaling.

C.8 — Compute-Constrained Optimization

Task: Fix a total optimization budget: $10^8$ gradient updates (equivalent to ~100 epochs of ResNet-20 on full CIFAR-10). Train models of increasing width $w \{0.5, 1.0, 2.0, 4.0, 8.0\}$ such that larger models take fewer epochs to reach the budget but may not fully converge. For each $w$, compute: (a) test accuracy after exactly $10^8$ updates, (b) training loss at that point, (c) distance from convergence (estimated by continuing training and measuring accuracy gain per additional epoch), (d) compute efficiency (accuracy per unit compute spent). Plot all metrics vs. model width $w$. Repeat on two datasets: CIFAR-10 (task where larger models help) and a synthetic task (where smaller models suffice).

Purpose: In practice, compute is limited. Practitioners often face choices: spend compute on a small model trained to convergence, or on a large model trained incompletely. This exercise makes that tradeoff visible and quantifies when each choice is wise. It also illustrates the optimization–representation gap: the gap between achievable loss by a model class and loss achieved by a specific algorithm under constraints.

ML Link: This tests Computing Resource Constraints and Theorem: Optimization-Representation Gap. Example 7 (Compute-Constrained Learning) provides theory; this exercise shows practical implications of being unable to fully train large models.

Hints: Use the same learning rate schedule but scaled by model size (e.g., lr = base_lr / sqrt(w)). Track total updates strictly. Use a simple architecture (ResNet-20) to avoid confounding with architecture search. Measure convergence to a stable test accuracy by continuing training after the budget expires.

What mastery looks like: Generate a plot showing test accuracy vs. model width, where accuracy first increases (small models benefit from budget increases), then decreases (very large models do not converge in budget). Report a peak at $w \approx 1.0–2.0$ on CIFAR-10: the optimal budget allocation. Quantify the gap: the optimal model size achieves ~83% accuracy, while a much larger model under the same budget achieves only ~79% (4% degradation due to incomplete optimization). Create a second plot showing convergence curves for three representative models (small, medium, large), where large model curve is still steep at the budget limit (not converged), small model curve has plateaued. Generate a table: Model Width | Final Accuracy | Training Loss | Distance to Convergence (epochs), showing that large models are underoptimized. Compute efficiency (accuracy per update): small models might be more efficient under tight budgets, larger models more efficient if fully trained. Create a visualization of the optimization–representation gap: plot the loss landscape (estimated via grid search of final loss for many (model, epochs) pairs) and identify the Pareto frontier (highest achievable accuracy given compute). Solutions on this frontier represent budget-optimal trade-offs. For the synthetic task, show that the optimal width may differ from CIFAR-10 (smaller model may suffice). Conclude with practical guidance: propose adaptive computation strategies (e.g., train a smaller model first, then allocate leftover compute to a larger model), discuss how this relates to efficient scaling laws, and recommend monitoring training curves to detect when budget might be insufficient for the chosen model size.

C.9 — Stability-Based Generalization vs. Distribution Shift

Task: Create a training distribution and two test distributions: (1) In-distribution test: same distribution as training. (2) Shifted test: labels randomly flipped with probability $p$ (label shift), or input features corrupted with noise (feature shift). Vary shift magnitude $p \{0, 0.1, 0.2, 0.5\}$. Train a classifier on the training distribution (clean labels, clean features) using a simple model (logistic regression or shallow neural network). Compute formal stability-based bounds (e.g., Rademacher complexity bounds, PAC-Bayes bounds) for the model. Compare the bound predictions with actual test errors on: (a) in-distribution test, (b) shifted test distributions. Measure how the bound degrades as shift increases. Create a plot: x-axis = shift magnitude $p$, y-axis = accuracy, with curves for actual test accuracy and bound-predicted accuracy.

Purpose: Classical generalization bounds (based on Rademacher complexity, VC dimension, PAC-Bayes) are derived assuming a fixed distribution. When the test distribution shifts, these bounds may become vacuous (>1) or highly pessimistic, failing to predict actual performance. By comparing bounds to reality across varying levels of shift, you see the gap between distribution-free guarantees and practical performance, and learn when bounds remain useful vs. when they become misleading.

ML Link: This tests Theorem: Stability-Based Bounds and Theorem: Distribution Shift and Generalization Failure. Example 8 (Generalization Under Distribution Shift) provides theory; this exercise shows practical limitations. The exercise also connects to domain adaptation and out-of-distribution robustness.

Hints: Implement a simple stability bound from literature (e.g., Vapnik-Chervonenkis bound or recent PAC-Bayes bound). Use clean, controlled shifts (label flip or specific noise) rather than natural distribution shifts. For feature shift, add Gaussian noise to specific dimensions. Measure bound values empirically from the training set.

What mastery looks like: Generate a comprehensive plot showing four curves: (1) Actual in-distribution accuracy: stays stable ~90%, (2) Actual shifted accuracy: drops with shift (e.g., 20% degradation at $p=0.2$), (3) Bound-predicted accuracy: predicts no accuracy change (too pessimistic or assumes worst-case), (4) Revised bound accounting for shift: might track actual accuracy better. Report numerical results: for small shifts ($p=0.1$), bounds might remain <50% (uninformative), actual accuracy only drops to 85% (useful). For large shifts ($p=0.5$), both bounds and actual accuracy drop significantly, but bounds diverge even more. Create a second analysis: measure empirical stability (sensitivity of model predictions to small train data perturbations) and correlate with actual shift robustness. Hypothesis: more stable models should degrade less under shift. Create a table: Shift Magnitude | Actual Accuracy | Bound Prediction | Margin. Show where bounds are useful (informative, tight) vs. useless (vacuous). Conclude with practical recommendations: discuss when stability-based bounds are reliable (clean distributions, small caps), when they fail (misaligned distributions), propose alternative approaches that account for shift (distribution-dependent bounds, robust optimization), and recommend monitoring strategies deployable without knowing the test distribution.

C.10 — Curvature and Learning Rate Dynamics

Task: Train a neural network (2–3 layers, 100–200 units) on a classification task (CIFAR-10 or MNIST) while tracking approximate Hessian eigenvalues. Every 10 training steps, compute the top-1 and top-5 eigenvalues of the Hessian matrix using Lanczos or power iteration (Hessian-vector product via autodiff). Also track the learning rate and training loss simultaneously. Correlate epochs with high top eigenvalue (${}$) with training instability: identify when gradients explode or when loss plateaus. Experiment with fixed learning rates $\{0.001, 0.01, 0.1, 1.0\}$ and adaptive methods (Adam, RMSprop). For each setting, measure: (a) maximum ${}$ observed during training, (b) training stability (ratio of #unstable steps to total steps, defined as |gradient norm| > 3× median gradient norm), (c) final test accuracy.

Purpose: Non-convex loss landscapes have regions of high curvature (large Hessian eigenvalues) where optimization becomes difficult: step sizes must be small to avoid divergence, convergence slows. By measuring curvature along training trajectories, you diagnose when and where optimization breaks down. This teaches why adaptive learning rates and learning rate schedules are practically necessary: they implicitly adapt to local curvature.

ML Link: This tests Theorem: Loss Landscape Curvature and Optimization Limits, and Theorem: Adaptive Learning Rates Address Curvature. Example 9 (Hessian Geometry and Training Dynamics) provides foundation.

Hints: Implement Hessian-vector product via autodiff (standard in PyTorch/TensorFlow). Use Lanczos iteration or power method to find top eigenvalues cheaply (requires only ~10–20 Hessian-vector products). Store history of $_{}$, loss, and gradient norms. Measure “instability” conservatively: flag as unstable when gradient norm exceeds a high threshold (e.g., 3× median).

What mastery looks like: Generate 4 plots (one per learning rate): loss curves, gradient norms, ${}$ curve, and a binary instability indicator. For fixed $= 0.01$, show that ${}$ reaches ~10–50 during training, most training is stable, final accuracy~85%. For $= 1.0$, show ${}$ reaches 100+, instability spikes (training loss zigzags, then diverges or stabilizes at poor accuracy ~60%). Create a scatter plot of (${}$, loss) samples colored by learning rate: demonstrate that higher curvature with large learning rates correlates with poor solutions or instability. Generate a table: Learning Rate | Max Eigenvalue | Instability % | Final Acc, showing clear relationship. For adaptive method (Adam), show that ${}$ curves are similar, but training is more stable (fewer instable steps), achieving similar or better final accuracy. Create a second analysis: correlate ${}$ with effective learning rate (step size observed in parameter updates). When ${}$ is high, effective step size should be small (stable optimization). Provide a theoretical sanity check: for a quadratic function $f(x) = \frac{1}{2} x^T H x$, stability requires ${} = 2 / {}$. Compute this bound and check if observed instabilities occur roughly when $> 2/{}$. Conclude with guidance: discuss why curvature-dependent learning rate schedules (like those used in adaptive methods) help, propose curvature monitoring for detecting optimization difficulties early, and recommend strategies to reduce curvature (batch normalization, different architectures).

C.11 — Explicit vs. Implicit Regularization

Task: Generate synthetic regression data where the true signal is generated from a sparse set of features: $y = X^* + $ with $^*$ having 5–10 nonzero components, $n = 100$, $d = 500$. Create two models: (1) Explicit regularization: train using $_2$-regularized least squares (ridge regression) with regularization strength  tuned via cross-validation to achieve training loss $^2$ (slightly above noise floor). (2) Implicit regularization: train using gradient descent (no explicit regularizer) with early stopping calibrated to stop when training loss reaches $^2$ (matching explicit model). Both models should have identical training loss at their stopping points. Measure: (a) test error, (b) solution norm $||_2$, (c) per-feature weight magnitudes and sparsity pattern, (d) feature importance via permutation, (e) robustness to small input perturbations. Compare learned patterns between the two models.

Purpose: Both explicit regularization and early stopping are regularization mechanisms, but they operate differently. Explicit $_2$ regularization directly penalizes large parameters; implicit regularization from gradient descent initialization biases toward specific directions (e.g., minimum-norm solutions). Despite achieving the same training loss, they may produce different solutions with different generalization and robustness properties. This exercise clarifies that regularization is not a single thing but depends on the full training procedure.

ML Link: This tests Implicit Bias of Gradient Descent and Early Stopping and contrasts with Explicit Regularization. Examples 3–4 provide theory; this exercise quantifies the differences. The observation that two training procedures can match loss but diverge on test performance is central to understanding implicit bias.

Hints: Calibrate both models to identical training loss (use the same fixed loss target). For implicit regularization via early stopping, may need to tune learning rate and stopping time carefully. Use multiple seeds to ensure results are robust, not flukes. Compute feature importance consistently across both models (e.g., permutation importance on test set).

What mastery looks like: Report a table: Model | Test Error | $||$ | Generalization Gap, showing that despite identical training loss, test errors differ: e.g., explicit regularization achieves 1.05$^2$, implicit (early stopping) achieves 0.98$^2$ (implicit is better by ~7%). Generate weight magnitude distributions: explicit regularization typically shows more uniform weight distribution (all shrunk by ), while implicit shows a spikier distribution (some components large, others near-zero, reflecting minimum-norm bias). Create a sparsity analysis: measure the number of weights above threshold for both models. Implicit might be sparser (more weights near-zero) if initialization at zero biases toward specific features. Plot test error vs. training loss (parameter sweep), showing that optimal training loss differs between explicit and implicit: explicit model might prefer higher training loss (more regularization), while implicit model benefits from lower training loss (less early stopping). Create robustness curves: add noise to a small fraction of input features and measure test error change. One model type might be more robust than the other depending on which features are important. Provide a financial interpretation: compute the cost of each model (explicit: evaluating $-\nabla L(\beta)$ plus matrix solve; implicit: many GD steps), comparing compute efficiency. Conclude with practical implications: discuss when to use each regularization type (explicit easier to tune, implicit emergent from optimization), explain why reproducibility depends on which regularization is used, and propose diagnostic tests to determine which regularization a given training procedure implicitly uses.

C.12 — Functional Redundancy and Pruning

Task: Train a wide neural network on CIFAR-10 (e.g., ResNet-56 or vanilla CNN with 300+ hidden units in one layer). After convergence, perform progressive pruning by removed neurons/channels in multiple ways: (1) Structured pruning: remove entire filters/neurons ranked by magnitude of weights or average activation magnitude. (2) Unstructured pruning: remove individual weights. For each pruning level (10%, 30%, 50%, 70% parameters removed), measure: (a) test accuracy before and after pruning, (b) functional similarity to original model: measure $_2$ distance between predictions on a test grid before and after pruning, (c) training loss if re-trained after pruning with fixed remaining weights. Create a pruning curve: x-axis = fraction of parameters removed, y-axis = test accuracy (should plateau initially, then drop). Also track functional similarity vs. number removed.

Purpose: Neural networks often contain functionally redundant parameters: removing them does not significantly change the output. This exercise makes redundancy visible and teachable: you see exactly which parameters matter for the task. By measuring functional similarity and accuracy together, you learn that perfect parameter preservation is not necessary for good generalization.

ML Link: This tests Theorem: Functional Redundancy in Overparameterized Networks. Example 5 (Capacity, Redundancy, Expressivity) provides foundation. The exercise also connects to model compression and understanding which parts of a network are actually used.

Hints: Implement structured and unstructured pruning methods clearly. For ranking, use multiple criteria (magnitude, activation frequency, gradient-based importance). Apply pruning iteratively and measure after each step. Re-training is optional but provides insight into whether pruning creates an optimization problem (requires re-optimization) or is genuinely harmless.

What mastery looks like: Generate a pruning curve showing: test accuracy stays ~90% after removing 30% of parameters, ~88% after 50%, ~82% after 70%, then drops sharply beyond 80% removed. Quantify the “sweet spot”: ~60–70% of parameters can be pruned with <5% accuracy loss. Create a second curve showing functional similarity: initial pruning (10–30%) has near-zero prediction difference (<0.01 in $_\infty$), then increases gradually/non-linearly. Create a heatmap showing which layers are more/less prunable: early layers might be more redundant than late layers (or vice versa, depends on task). Report structured vs. unstructured results: structured pruning might achieve similar accuracy with more parameters removed but be faster to deploy (regular structure). Provide a per-layer analysis: create a table showing pruning tolerance for each layer. Conclude with practical implications: discuss compression ratios achievable, propose pruning-based model selection (keep only necessary parameters), and explore whether re-training after pruning recovers lost accuracy (often yes, indicating optimization was incomplete, not capacity limitation).

C.13 — Expressivity Limits and Representation Error

Task: Design a synthetic classification problem where the decision boundary is XOR-like: positive samples in $[0, 0.45]^2 ^2$, negative samples in $[0, 0.45] $. Generate 1000 samples with label noise $= 0.1$. Train three model classes, each trained to convergence with identical optimization: (1) Linear model (logistic regression), (2) Shallow nonlinear (2-layer MLP with 20 hidden units), (3) Deeper nonlinear (4-layer MLP with 20 units per layer). For each model class, measure: (a) training accuracy, (b) test accuracy, (c) number of samples misclassified: distinguish representation error (cannot express XOR) from optimization error (can express but failed to find it). Plot training curves (accuracy vs. epoch) for all three to see convergence behavior.

Purpose: Some problems require specific model expressivity classes. Linear models simply cannot express XOR; adding layers fixes this. By controlled experiments on an artificial problem, you isolate representation error (architectural limitation) from optimization error (algorithm failure). This teaches you to distinguish “the model cannot solve this” from “the training procedure failed.”

ML Link: This tests Theorem: Expressivity Limits Theorem and UAP (Universal Approximation Property). Example 6 (Representation Error vs. Optimization Error) provides theory and intuition for synthetic benchmarks.

Hints: Use clean data (low noise) to isolate expressivity effects. Ensure all models are trained with sufficient iterations and stable learning rates to reach their optimum. Verify that shallow and deep models fully converge (loss plateaus).

What mastery looks like: Report a results table: Model Class | Train Acc | Test Acc | Misclassified Samples, showing: (1) Linear: ~50% accuracy (cannot learn XOR), (2) Shallow: ~95% accuracy (can learn), (3) Deep: ~95% accuracy (similar to shallow for this simple task). Show training curves: linear model’s accuracy plateaus early around 50% (representation error), shallow nonlinear and deep quickly reach >90% (expressivity sufficient). Create a decision boundary visualization: plot the learned boundaries for each model class, showing linear model with straight line (cannot fit XOR), shallow/deep models with curved boundaries separating the regions well. Perform a second experiment: manually train a linear model with XOR-like features (e.g., $x_1 x_2$) and verify it can now solve the task (showing that expressivity can be achieved through feature engineering). Conclude with interpretation: discuss why expressivity matters and is not just capacity, explain how to identify expressivity limitations empirically (clean data + convergence verification), and propose architectural modifications that address expressivity gaps.

C.14 — Multi-Objective Trade-Offs

Task: Create a synthetic classification dataset with multiple demographic groups (e.g., 70% Group A, 30% Group B). Define a multi-objective loss: $ = {} + {}$, where fairness is defined as balanced accuracy across groups (minimize $|Acc_A - Acc_B|$). Train classifiers with $\{0, 0.1, 0.5, 1.0, 2.0, 5.0\}$, reporting: (a) overall test accuracy, (b) per-group accuracies $Acc_A, Acc_B$, (c) fairness metric (difference between group accuracies), (d) calibration per group (predicted probability vs. true frequency). Create a trade-off curve showing accuracy vs. fairness as  changes.

Purpose: Real-world models often optimize proxies for true objectives. A fairness term added to accuracy is a common case: formally optimizing a proxy objective. This exercise makes visible how changing objective weights changes model behavior, and how achieving proxy targets (balanced accuracy) may not achieve true fairness (equal decision quality). This teaches critical thinking about objectives and their consequences.

ML Link: This tests Objective Misalignment and Theorem: Proxy Optimization Can Harm True Objective. Example 10 (Multi-Objective Optimization) provides context. The exercise connects to governance and responsible AI: algorithms are only as good as their objectives.

Hints: Define fairness and accuracy metrics clearly before experiments. Ensure both groups are represented in test set (not just training aggregation). Measure per-group performance separately, not just averaged.

What mastery looks like: Generate a trade-off curve: x-axis = fairness ($|Acc_A - Acc_B|$), y-axis = overall accuracy. For $= 0$, the model prioritizes accuracy ($85\%$) but may be unfair ($Acc_A = 90\%, Acc_B = 60\%, \text{gap} = 30\%$). As  increases, fairness improves (gap narrows), but overall accuracy degrades: $= 5$ yields $Acc_A = 75\%, Acc_B = 73\%$ (fair), overall accuracy ~74% (5% loss). Create a table: $\lambda$ | Overall Acc | Gap | $Acc_A$ | $Acc_B$ showing the trade-off quantitatively. Analyze decision patterns: for each , visualize confusion matrices per group, showing how the model’s error patterns change (group B may go from high false-negative to more balanced errors as  increases). Create calibration curves: plot predicted probability vs. true positive rate for each group under different . Discuss whether fairness via proxy optimization actually achieves fairness in the true sense, or if it merely balances an imperfect metric. Conclude with governance implications: discuss how objective design affects deployment fairness, why optimization of proxies is inherently risky, and propose mechanisms to ensure alignment between proxy and true objective.

C.15 — Scaling Laws and Saturation Detection

Task: Train models of varying sizes ($m \{10^5, 3 ^5, 10^6, 3 ^6\}$ parameters) on datasets of varying sizes ($n \{10^3, 10^4, 10^5, 10^{5.5}\}$ samples) using a fixed architecture family. For each ($m, n$) pair, train to convergence and measure test loss. Fit a power-law model: $(m, n) = a m^{-b} + c n^{-d} + e$. Evaluate fit quality ($R^2$) across the range. Identify regimes where the power law breaks down (non-linear regions) and estimate saturation points.

Purpose: Scaling laws have become central to modern deep learning (e.g., Chinchilla scaling, Kaplan et al.). By empirically fitting scaling relationships, you learn to (1) predict performance at new scales, (2) detect when predictions fail, (3) identify saturation regimes where scaling no longer helps. This is practical for planning large-scale training runs.

ML Link: This tests Theorem: Scaling Laws and Power-Law Behavior. Example 11 (Scaling Laws) provides theoretical foundations. The exercise grounds abstract scaling behavior in concrete experiments.

Hints: Use consistent training procedures (learning rate schedule, epochs, regularization). Ensure models fully converge in each configuration. Log-transform variables for fitting power laws (linear regression in log space). Evaluate fit quality at multiple points and extrapolate cautiously.

What mastery looks like: Generate a log-log plot showing test loss vs. model size for multiple fixed dataset sizes, with fitted power-law lines. Demonstrate that power laws hold over a range (e.g., $R^2 > 0.95$ for $m $, $n$ fixed), with exponent $b \approx 0.07$ (slow decay, typical for modern models). For dataset size, fit $d \approx 0.05–0.1$. Create a heatmap: rows = model size, columns = dataset size, cells = test loss. Show smooth scaling over the middle range, with saturation visible in corners (very small $m$ saturates from below due to underfitting, very large $n$ saturates from above due to diminishing data returns). Fit a combined model $(m, n) = a m^{-b} + c n^{-d} + e$ and report coefficients. Evaluate fit globally: report $R^2$ and list configurations where fit deviates (likely saturation). Provide saturation estimates: compute the loss floor (asymptotic loss as $m, n $) from the fitted model. Conclude with scaling recommendations: propose allocation strategies (how much to scale model vs. data with a compute budget), discuss when power-law predictions become unreliable, and suggest alternative models for capturing saturation effects.

C.16 — Bayes-Optimal Non-Identifiability

Task: Construct a distribution where two classifiers are both Bayes-optimal (minimize Bayes risk) but differ on a subset of inputs. For example, a symmetric two-class distribution: positive samples $(+, I)$ with $+ = (1, 1)$, negative samples $(-, I)$ with $- = (-1, -1)$, but in the region ${x: |x|< 0.1}$ near the origin, labels are randomly flipped (50% positive, 50% negative). Define two Bayes-optimal classifiers: (1) Linear: $f_L(x) = (x_1 + x_2)$ (best constant classifier on origin), (2) Nonlinear: $f_N(x) = (|(x - +)|^2 - |(x - _-)|^2)$ (different decision). Train neural networks multiple times with different random seeds and measure: (a) test accuracy on the full distribution, (b) per-region accuracy (origin vs. away from origin), (c) disagreement on the ambiguous region. Measure what fraction of trained models become “linear-like” vs. “nonlinear-like” in behavior.

Purpose: Multiple globally optimal solutions exist for the same task. Which one a learning algorithm finds is determined by inductive bias, not optimality. This exercise makes underspecification concrete: you see that even for Bayes-optimal classifiers, different algorithms pick different solutions, and these differences matter in the ambiguous region.

ML Link: This tests Underspecification in Deep Learning and the non-identifiability of Bayes-optimal solutions. Example 12 (Underspecification) provides context. The exercise illustrates that Bayes optimality is a weak notion: it does not uniquely determine a model.

Hints: Construct the label-flipped region carefully so that both solutions truly achieve the same Bayes risk. Generate a large, clean test set to measure true generalization. Use multiple seeds (~20) to see the population of solutions found by the algorithm.

What mastery looks like: Train 20 models with different seeds and report: X% are linear-like (low disagreement on ambiguous region, high agreement in non-ambiguous), Y% are nonlinear-like. Both should achieve similar test accuracy (~78–79%, when Bayes risk limits all to that region’s contribution). On the ambiguous region, one family of models might predict all positive, another all negative; disagreement is high (20–40% for different model pairs). Create a visualization: decide boundary plot for several trained models, showing divergence in the ambiguous region but agreement away from it. Generate a second analysis: vary the size/severity of the ambiguous region and measure how many seeds produce each type of solution. Conclusion: discuss why initialization, learning rate, or batch size could bias toward one solution, propose methods to ensure a specific Bayes-optimal solution is selected, and relate to robustness (different Bayes-optimal models may have different robustness profiles).

C.17 — Small Model vs. Large Model Under Compute Constraints

Task: Fix a total compute budget (e.g., 10^9 FLOPs equivalent to training a ResNet-50 on CIFAR-10 for 100 epochs). Create three training strategies: (1) Small model to convergence: train ResNet-20 to validation convergence, using as many epochs as fit in the budget. (2) Medium model underoptimized: train ResNet-50 for as many epochs as fit in the budget, likely converging partway. (3) Large model underoptimized: train ResNet-110 or wider network for fewer epochs. For each strategy, measure: test accuracy, training loss, convergence indicators (change in loss per epoch), and estimate the optimization gap (distance from convergence). Create visualizations: training curves for all three (epochs vs. loss/accuracy), and a summary table.

Purpose: This makes the optimization–representation gap concrete. Conventional wisdom might suggest “use the largest model under your budget,” but if you cannot fully train it, a smaller model trained to convergence might outperform. By measuring this empirically, you learn when to prefer smaller, well-optimized models over larger, under-optimized ones.

ML Link: This tests Optimization–Representation Gap and Compute-Constrained Learning. Example 7 provides foundation. The exercise illustrates practical implications of limited compute.

Hints: Measure compute carefully and consistently across strategies. Use the same learning rate schedule scaled by model width. Track convergence via validation loss plateauing.

What mastery looks like: Report a results table: Strategy | Test Acc | Training Loss | Convergence Indicator. Small model achieved ~82% accuracy (fully converged). Medium achieves ~79% (underoptimized). Large achieves ~77% (severely underoptimized). The gap between large and small is ~5%, representing the optimization–representation gap under the fixed budget. Create convergence curves: small model curve plateaus (flat tail after epoch 80), medium and large curves are still sloping downward at budget limit, indicating incomplete optimization. Quantify gap: estimate that the large model, if fully trained, would achieve ~84% (higher than small), but was not given the compute. Conclude with practical guidance: develop a decision rule (when to prefer small vs. large under budget constraints), discuss techniques to reduce the gap (better learning rate schedules, more efficient architectures), and recommend monitoring to detect when budget is insufficient.

C.18 — Regime Shift Boundaries

Task: Starting from the underparameterized regime ($d/n = 0.1$), gradually increase dimensionality $d$ while keeping $n$ fixed, stepping through $d/n \{0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 5.0, 10.0\}$. At each $d/n$, perform a grid search over $_2$ regularization strength $\{10^{-5}, 10^{-3}, 10^{-1}, 1, 10, 100\}$. For each ($d/n, $) pair, train a model and record test error. Plot: x-axis = $d/n$, y-axis = optimal $(d/n)$ (the  value minimizing test error for each $d/n$). Look for phase transitions or sharp changes in optimal .

Purpose: Optimal training strategies change as the problem regime shifts. In underparameterized regimes, different regularization is optimal than in overparameterized regimes. By mapping the optimal regularization as dimensionality changes, you gain insight into phase transitions and what training strategy changes at each boundary.

ML Link: This tests Regime-Dependent Optimization Strategies and Phase Transitions. Example 8 provides context. The exercise illustrates how fundamental problem structure determines optimal algorithms.

Hints: Keep $ n , , $ data distribution fixed. Use consistent training (epochs, learning rate, optimization). Perform validation split carefully to ensure stable  selection.

What mastery looks like: Plot optimal  vs. $d/n$, showing: for $d/n < 1$ (underparameterized), optimal  is large (~10–100, strong regularization needed), decreasing gradually toward $d/n = 1$. Near $d/n = 1$ (critical regime), optimal  drops sharply or non-smoothly (phase transition). For $d/n > 1$ (overparameterized), optimal  is small (~0.001–0.01, weak regularization), potentially approaching zero as $d/n$ increases. Create a table showing: $d/n$ | Optimal  | Test Error | Interpretation. Identify the transition points. Overlay test error curves: in underparameterized regime, test error decreases with increasing $d$ (bias reduction). Near the critical regime, test error spikes (peak of bias-variance curve). In overparameterized regime, test error decreases again (benign overfitting). Conclude with training guidance: explain how to detect which regime you are in and what type of regularization to apply.

C.19 — Spurious Correlations and Data Scaling

Task: Create a dataset with spurious correlations: labels $y = f(x_1) + $ where $f$ depends only on $x_1$ (true feature). But in training, dimension $x_2$ is highly correlated with $y$ (spurious). In test, $x_2$ is independent of $y$ (spurious correlation breaks). Vary training set size $n$ and measure: (1) Feature importance (permutation importance) for $x_1$ vs. $x_2$ at each $n$, (2) Test accuracy across scaling levels. Compare two training schemes: (a) Standard training (model learned spurious correlation), (b) Augmented training (perturb $x_2$ to break spurious correlation). Track how reliance on spurious features changes with scale.

Purpose: A common belief is “more data fixes everything,” but without aligned inductive bias, scaling can amplify spurious correlations. This exercise shows that scaling without intervention (e.g., augmentation) can make spurious reliance worse, not better. It teaches that data scaling alone does not ensure robustness.

ML Link: This tests Scaling and Robustness Sans Inductive Bias. Example 13 provides context. The exercise connects to domain generalization and out-of-distribution robustness.

Hints: Construct spurious and true features carefully. Measure feature importance consistently. Test on a separate distribution confirming spurious correlation breaks.

What mastery looks like: Generate a plot showing feature importance (x-axis = training set size, two curves: true feature $x_1$, spurious feature $x_2$) for standard training. At small $n$, both features have similar importance (both are useful in-distribution). As $n$ increases, $x_2$ importance might increase (model fits spurious correlation more precisely), while $x_1$ importance plateaus (already captured perfectly). On test set (where spurious breaks), model relying on $x_2$ fails, even though training improves. For augmented training (perturbing $x_2$): importance of $x_1$ dominates everywhere, $x_2$ importance stays low or decreases. Test performance remains good across scales. Create a visualization: feature importance heat map for different ($n$, feature) pairs, showing divergence between standard and augmented. Conclude with guidance: discuss why scaling amplifies spurious correlations (more data = stronger signal for fitting spurious patterns), why intervention (augmentation, invariant learning) is necessary, and how to detect spurious reliance empirically.

C.20 — Deployment Monitoring Under Distribution Shift

Task: Simulate a deployment scenario with time-varying distribution: generate training data from distribution $_0$, and simulate test data from drifting distributions $_1, _2, , _T$ that gradually shift away from training. Implement three monitoring strategies: (1) Loss-based: track model predictions and reported confidence, flag when confidence decreases. (2) Feature-based: track marginal statistics of input features (mean, std, covariance), flag when they deviate from training distribution. (3) Prediction-based: track model prediction distribution (marginal class distribution), flag when it diverges from training. Train a classifier model on $_0$, deploy it on the shifting distributions, and measure: (a) actual test accuracy as distribution shifts, (b) alerts raised by each monitoring strategy, (c) false alarm rate (alerts when model still performs well) and detection latency (time from true degradation start to detection). Create a time series plot.

Purpose: Deployed models are not static. Data distributions shift post-deployment, causing performance degradation. Real systems need monitoring strategies to detect degradation before it affects users. This exercise teaches practical MLOps: designing and evaluating monitoring systems that balance false alarms and detection speed.

ML Link: This tests Distribution Shift and Monitoring Under Drift. Example 14 provides context. The exercise connects monitoring to practical deployment.

Hints: Simulate realistic drift (gradual, not sudden). Make degradation somewhat slow so monitoring strategy has time to detect it. Use ROC or other evaluation metrics for monitoring strategy performance.

What mastery looks like: Generate multiple time series plots (one per monitoring strategy) showing: x-axis = time, y-axis (top) = actual test accuracy, y-axis (bottom) = alert signals. Loss-based monitoring might flag very late (lag = high) when accuracy already degraded significantly. Feature-based monitoring detects early (lag = low) because feature drift often precedes accuracy degradation. Prediction-based monitoring is intermediate. Create tables: Monitoring Strategy | False Alarm Rate | Detection Latency | Accuracy at Detection, quantifying tradeoffs. Identify when detection is critical (e.g., at 5% accuracy drop, must detect within 2 timesteps). Propose a hybrid strategy combining multiple monitors to reduce false alarms. Conclude with deployment guidance: recommend specific monitoring strategy based on cost of false alarms vs. cost of undetected degradation, propose automated response (trigger retraining, alert humans) when drift is detected, and discuss costs/benefits of defensive mechanisms.

Solutions

Solutions to A. True / False

A.1 Final Answer: False. Full mathematical justification: In overparameterized linear regression, the set of interpolators is $\theta = \theta_p + v$ for $v$ in the null space of $X$. Gradient descent initialized at zero selects the minimum $\ell_2$-norm solution in the current parameterization. If we rescale or rotate the feature space by a matrix $S$, the minimizer becomes $\arg\min \|\theta\|_2$ subject to $X S \theta = y$, which corresponds to a different solution in the original coordinates. Hence the learned solution depends on the feature scaling and is not uniquely determined by the data distribution alone. Counterexample if false: Let $x_2 = x_1$ for all samples and scale $x_2$ by 100. The interpolating solutions change from $(\theta_1, \theta_2) = (1/2, 1/2)$ to approximately $(1, 0)$ in the minimum-norm solution, producing different predictions under small noise. Comprehension: The optimizer’s implicit bias is defined by the geometry induced by the parameterization, not only by the data distribution. ML Applications: Preprocessing choices (standardization, whitening, feature scaling) change which interpolator is selected and can alter fairness, robustness, and calibration. Failure Mode Analysis: Treating preprocessing as purely numerical can yield non-reproducible models and unstable deployment behavior. Traps: Assuming that “the data determine the solution” or that minimum training loss implies a unique model.

A.2 Final Answer: False. Full mathematical justification: There is no theorem guaranteeing monotonic test error decrease with capacity for all distributions and optimizers. In fact, for distributions with spurious correlations, increased capacity expands the hypothesis class to include spurious solutions, which can dominate ERM. Thus test error can increase even if training error decreases. Counterexample if false: Construct a dataset where labels correlate with a spurious feature in training but not test; a higher-capacity model can fit the spurious feature more precisely, worsening test performance. Comprehension: Increasing capacity increases the number of feasible solutions; without aligned inductive bias, generalization can worsen. ML Applications: Capacity tuning must account for spurious features and objective alignment, not only for training loss. Failure Mode Analysis: Blind scaling can push models into a regime where they optimize the wrong patterns. Traps: Treating scaling laws or double descent as universal monotonicity claims.

A.3 Final Answer: True. Full mathematical justification: Non-identifiability means there exists a nontrivial direction $v$ such that $p(Y|X;\theta) = p(Y|X;\theta + tv)$ for all $t$. Differentiating w.r.t. $t$ at $t=0$ yields $v^T \nabla_\theta \log p(Y|X;\theta) = 0$ almost surely. Therefore $v$ lies in the null space of the Fisher Information Matrix $\mathcal{I}(\theta)$, implying $\mathcal{I}(\theta)$ is singular. Counterexample if false: Not applicable; the singularity follows directly from the definition of non-identifiability. Comprehension: If parameters cannot be distinguished by any data, the Fisher information must vanish in at least one direction. ML Applications: Explains why confidence intervals and parameter-based interpretability can be ill-posed in models with symmetries. Failure Mode Analysis: Overconfident parameter inference in non-identifiable models leads to misleading conclusions. Traps: Confusing finite-sample ambiguity with fundamental non-identifiability.

A.4 Final Answer: False. Full mathematical justification: Double descent is an observed pattern, not a guarantee. When explicit regularization is fixed, increasing capacity past interpolation can still worsen test error if the optimizer fits noise or if the implicit bias changes adversely. Non-monotonicity beyond the interpolation threshold is possible. Counterexample if false: In high-noise linear regression, test error can increase for large $d$ if the optimizer finds high-variance interpolators. Comprehension: Double descent is contingent on data noise, optimization dynamics, and inductive bias. ML Applications: Overparameterization should be paired with appropriate optimization and regularization to avoid harmful overfitting. Failure Mode Analysis: Scaling without controlling noise sensitivity can increase deployment error. Traps: Assuming “double descent” means “bigger is always better.”

A.5 Final Answer: True. Full mathematical justification: No-Free-Lunch states that averaged over all possible labelings, all algorithms are equivalent. For fixed $\mathcal{P}(X)$, one can construct distinct conditional distributions $\mathcal{P}(Y|X)$ that induce arbitrary errors for a given algorithm while leaving the marginal unchanged. Hence Bayes optimality on one conditional does not imply performance on another conditional with the same marginal. Counterexample if false: Not applicable; this is a direct corollary of the theorem. Comprehension: Generalization requires assumptions about $\mathcal{P}(Y|X)$, not just $\mathcal{P}(X)$. ML Applications: Domain shift in labels (label shift or concept drift) can break performance even if inputs look similar. Failure Mode Analysis: Evaluating only feature marginals can mask severe label shift risk. Traps: Treating identical feature distributions as proof of task similarity.

A.6 Final Answer: True. Full mathematical justification: If $d_{\text{VC}} > n$, the growth function can be $\Pi_\mathcal{H}(n) = 2^n$, making uniform convergence bounds of order $\sqrt{(\log \Pi_\mathcal{H}(n))/n}$ close to 1. This makes the bound vacuous irrespective of the realized test error. Counterexample if false: Not applicable; this is standard VC theory. Comprehension: Worst-case uniform convergence bounds can be non-informative even when empirical generalization is strong. ML Applications: Explains why modern deep learning lacks tight classical bounds. Failure Mode Analysis: Overreliance on VC bounds can lead to pessimistic assessments or incorrect model comparisons. Traps: Treating a vacuous bound as evidence of inevitable overfitting.

A.7 Final Answer: False. Full mathematical justification: Rank-deficient Hessian indicates flat directions. In non-convex landscapes, a stationary point with a degenerate Hessian can be a saddle, a non-global local minimum, or a flat plateau. There is no implication of global optimality from rank deficiency alone. Counterexample if false: $f(x,y) = x^4 - y^2$ has a stationary point at $(0,0)$ with rank-deficient Hessian that is a saddle. Comprehension: Flatness does not imply optimality; it only indicates lack of curvature. ML Applications: Stopping criteria based on gradient norm can be misleading in non-convex training. Failure Mode Analysis: Declaring convergence at flat saddles can yield poor solutions. Traps: Equating “flat” with “good.”

A.8 Final Answer: False. Full mathematical justification: The asymptotic limit $L^*$ depends on the data distribution and label noise. Improving data quality can reduce $L^*$. Therefore, model size cannot always trade off for data quality without changing the asymptotic limit. Counterexample if false: If labels have additive noise $\xi$ with variance $\sigma^2$, then $L^*$ includes a noise floor. Reducing label noise reduces $\sigma^2$ and lowers $L^*$. Comprehension: Data quality determines the irreducible error; capacity determines how close you get to it. ML Applications: Data cleaning and better labeling can outperform scaling in low-quality datasets. Failure Mode Analysis: Scaling a model on noisy data can entrench a higher error floor. Traps: Treating $L^*$ as fixed by the task rather than by the dataset.

A.9 Final Answer: True. Full mathematical justification: With fixed compute, optimization error cannot be driven below a constant if the condition number or effective dimension grows with model size. As $d$ grows, the number of steps required to reach the representational optimum exceeds the compute budget, leaving a nontrivial optimization–representation gap and potentially worse test error. Counterexample if false: Not applicable; the statement aligns with standard optimization lower bounds under budget constraints. Comprehension: Capacity is valuable only when optimization can exploit it within the available compute. ML Applications: In practice, smaller models can outperform larger ones under strict training budgets. Failure Mode Analysis: Oversized models trained for too few steps lead to underfitting despite high capacity. Traps: Assuming capacity improvements automatically yield better results under fixed compute.

A.10 Final Answer: True. Full mathematical justification: Uniform stability bounds depend on algorithmic sensitivity, not on hypothesis class size. If stability is $O(1/n)$, then generalization gaps are bounded by $O(1/n)$ up to concentration terms, even when VC-dimension is infinite. Counterexample if false: Not applicable. Comprehension: Algorithmic stability can replace capacity as the primary control of generalization. ML Applications: Strongly convex objectives and differential privacy can guarantee generalization even for large models. Failure Mode Analysis: Ignoring stability can lead to fragile models with large generalization gaps. Traps: Assuming that high capacity alone implies poor generalization.

A.11 Final Answer: True. Full mathematical justification: Losses are expectations or averages; they do not uniquely specify pointwise predictions. Two hypotheses can agree in expected loss while disagreeing on a set of positive measure if the losses balance out. This is especially common under underspecification. Counterexample if false: Not applicable. Comprehension: Equal scalar metrics do not imply identical decision boundaries or behaviors. ML Applications: Two models with equal validation accuracy can behave very differently under distribution shift or subgroup evaluation. Failure Mode Analysis: Hidden disagreement regions can cause real-world failures despite identical metrics. Traps: Treating accuracy as a complete description of model behavior.

A.12 Final Answer: False. Full mathematical justification: The stability condition for gradient descent on quadratics is $\eta < 2/\lambda_{\max}$. The condition number $\kappa$ governs convergence rate, not the maximum stable step size. Using $1/\kappa$ is sufficient but overly conservative. Counterexample if false: For eigenvalues $1$ and $100$, the maximum stable $\eta$ is $2/100$, which depends on $\lambda_{\max}$, not on $\kappa$. Comprehension: Stability and speed are controlled by different spectral quantities. ML Applications: Learning-rate tuning should track $\lambda_{\max}$ (or its estimates), not just condition number. Failure Mode Analysis: Overly small learning rates slow training and can stall progress under compute limits. Traps: Confusing convergence-rate bounds with stability bounds.

A.13 Final Answer: False. Full mathematical justification: NP-hardness of exact feature selection does not preclude constant-factor approximations. Some NP-hard optimization problems admit polynomial-time constant-factor algorithms under specific objective structure (e.g., submodularity). Therefore, the blanket claim is too strong. Counterexample if false: For submodular feature selection objectives, greedy algorithms achieve constant-factor approximation guarantees. Comprehension: Hardness of exact optimization is not equivalent to hardness of approximation. ML Applications: Approximate feature selection is feasible in structured settings, but guarantees depend on objective properties. Failure Mode Analysis: Overstating hardness can discourage practical, effective approximation methods. Traps: Equating NP-hardness with “no useful algorithms exist.”

A.14 Final Answer: True. Full mathematical justification: Universal approximation guarantees representability, not identifiability. If data underconstrain the hypothesis class, many functions remain consistent, so additional expressivity does not improve generalization. This is expressivity saturation induced by data limitations. Counterexample if false: Not applicable. Comprehension: Representational capacity does not ensure learnability without sufficient data constraints. ML Applications: Scaling universal architectures without data diversity yields diminishing returns. Failure Mode Analysis: Larger models can overfit arbitrary consistent functions when data are limited. Traps: Conflating universal approximation with guaranteed generalization.

A.15 Final Answer: False. Full mathematical justification: Implicit regularization from SGD depends on optimization dynamics and loss geometry and is not generally equivalent to explicit $\ell_2$ regularization. Equivalence holds in special linear settings but fails in non-convex models where implicit bias favors flat minima or low-rank solutions not captured by $\ell_2$ penalties. Counterexample if false: Two deep networks trained with and without weight decay can reach different minima with identical training loss but different sharpness and robustness. Comprehension: Implicit and explicit regularization are different control mechanisms with different solution biases. ML Applications: Weight decay cannot replace early stopping, batch-size effects, or optimizer choice. Failure Mode Analysis: Treating them as interchangeable leads to unpredictable generalization outcomes. Traps: Overgeneralizing linear-model equivalences to deep networks.

A.16 Final Answer: False. Full mathematical justification: Capacity saturation at fixed data does not imply data saturation. Additional data can reduce estimation error and reveal new structure even when capacity is already high. Thus more data can still improve performance beyond a capacity saturation threshold. Counterexample if false: A model saturated on a narrow dataset improves substantially when trained on a larger, more diverse dataset. Comprehension: Saturation must be evaluated separately along the data and capacity axes. ML Applications: Data acquisition can shift or remove saturation effects even for very large models. Failure Mode Analysis: Abandoning data scaling prematurely can leave performance gains on the table. Traps: Interpreting capacity saturation as a universal saturation in all resources.

A.17 Final Answer: False. Full mathematical justification: A small optimization–representation gap on the training objective does not imply optimal deployment behavior. Optimizer choice can change implicit bias, robustness, calibration, and subgroup performance even if training loss remains near optimal. Therefore, changing the optimizer can still improve performance without changing the objective. Counterexample if false: Two optimizers yield identical training loss but different adversarial robustness; switching optimizers improves deployment robustness without altering the objective. Comprehension: Optimization affects more than training loss; it shapes which solution is selected among many. ML Applications: Optimizer selection is part of model design, especially under underspecification. Failure Mode Analysis: Ignoring optimizer effects can lock in brittle or biased solutions. Traps: Equating training optimality with deployment optimality.

A.18 Final Answer: False. Full mathematical justification: Functional redundancy increases the chance that pruning is safe, but it does not guarantee it. Redundant parameters can contribute to robustness or rare-case behavior; removing them can increase test error or degrade out-of-distribution performance even if average loss remains stable. Counterexample if false: Pruning parameters used primarily for rare subpopulations can increase subgroup error while leaving overall accuracy nearly unchanged. Comprehension: Redundancy is a necessary but not sufficient condition for safe compression. ML Applications: Pruning must be evaluated on deployment-relevant metrics, not only on average accuracy. Failure Mode Analysis: Overpruning can silently harm robustness and fairness. Traps: Assuming that redundancy implies zero-risk compression.

A.19 Final Answer: True. Full mathematical justification: The ratio $d/n$ governs whether variance is controlled primarily by explicit regularization or by implicit bias. Around the interpolation threshold, the dominant regularization mechanism can change, and the optimal strategy can shift from explicit penalties to relying on implicit regularization and careful optimization. Counterexample if false: Not applicable; this is consistent with theoretical and empirical evidence around interpolation thresholds. Comprehension: Regularization is regime-dependent; the same setting can be suboptimal on either side of the threshold. ML Applications: Hyperparameter schedules should adapt as models scale relative to data. Failure Mode Analysis: Using classical regularization in overparameterized regimes can introduce unnecessary bias. Traps: Assuming one regularization recipe works across all scales.

A.20 Final Answer: False. Full mathematical justification: Objective misalignment is a specification problem; scaling model capacity improves optimization of the given objective, not the true objective. Therefore, increasing depth and width can amplify misalignment rather than eliminate it. Counterexample if false: A recommender optimized for engagement can become more effective at promoting harmful content as it scales, increasing misalignment effects. Comprehension: Capacity strengthens optimization of whatever is specified; it does not correct specification errors. ML Applications: Alignment requires objective redesign, constraints, or governance, not just scaling. Failure Mode Analysis: Larger models can intensify harmful behaviors if objectives are misaligned. Traps: Believing scale is a substitute for objective correctness.

Solutions to B. Proof Problems

B.1 Full formal proof: Let $\mathcal{H}$ have VC-dimension $d$. There exists a shattered set $S = \{x_1, \ldots, x_d\}$. Consider the uniform distribution over $S$, and define labels by a random target function $f: S \to \{0,1\}$ chosen uniformly among the $2^d$ labelings. Let $S_n$ be a sample of size $n < d/2$. With probability at least $1/3$, the sample leaves at least $d/2$ points of $S$ unobserved (standard occupancy bound). On unseen points, the ERM hypothesis has no information; by symmetry over random labelings, its expected error on each unseen point is $1/2$. Therefore, conditional on this event, the expected test error is at least $1/2 \cdot (d/2)/d = 1/4$. This yields $\mathbb{P}(\mathcal{L}_{\text{test}} \ge 1/4) \ge 1/3$. Proof strategy & techniques: Shattering construction, random labeling, and occupancy bounds to ensure unseen points; symmetry of random labels. Computational validation: Generate a shattered set and random labels; train ERM with $n<d/2$ and measure test error on the full set. ML interpretation: High-capacity classes can memorize small samples yet remain uninformed about unseen points, leading to poor generalization. Generalization & edge cases: The bound is worst-case over distributions; it weakens when the distribution has structure or low noise. Failure mode analysis: Relying on ERM in small-sample regimes yields unstable and brittle models. Historical context: Classic VC lower bounds and the fundamental limits of sample complexity. Traps: Treating this lower bound as a typical-case guarantee or misreading it as a claim about all distributions.

B.2 Full formal proof: For $\mathcal{L}(\theta) = \|X\theta - y\|^2$ with $d>n$, write the solution set as $\theta = \theta_p + v$ with $v \in \ker(X)$. Gradient descent with $\theta_0=0$ yields iterates in $\text{span}(X^T)$ since $\nabla \mathcal{L}(\theta) = 2X^T(X\theta - y)$ lies in $\text{span}(X^T)$. Thus $\theta_t$ has no component in $\ker(X)$. The unique interpolator in $\text{span}(X^T)$ is $\theta^* = X^T (X X^T)^{-1} y$, which also minimizes $\|\theta\|_2$ among all interpolators. Convergence follows from standard linear convergence of gradient descent on a convex quadratic with step $\eta < 1/\|X\|_2^2$. Proof strategy & techniques: Decomposition into row space and null space, characterization of minimum-norm interpolator, and linear convergence on quadratics. Computational validation: Compare GD solution with pseudoinverse solution $X^+ y$ for random $X$ and confirm equality to numerical precision. ML interpretation: The algorithm, not just the loss, selects a specific interpolator; implicit regularization emerges from optimization dynamics. Generalization & edge cases: If $X X^T$ is singular, replace with Moore-Penrose pseudoinverse; non-zero initialization preserves a null-space component. Failure mode analysis: Preprocessing or feature scaling changes $X$ and therefore the implicit bias, leading to different deployed models. Historical context: Early analyses of implicit bias in linear models and kernel methods. Traps: Assuming any optimizer yields the minimum-norm solution; ignoring initialization dependence.

B.3 Full formal proof: Let $\mathcal{A}$ be uniformly stable with parameter $\beta$. Define $g(S) = \mathbb{E}_z[\ell(h_S,z)] - \frac{1}{n}\sum_{i=1}^n \ell(h_S,z_i)$. By stability, replacing any single example changes $g(S)$ by at most $\beta/n$. Apply McDiarmid’s inequality to $g(S)$ to obtain \[ \mathbb{P}(|g(S) - \mathbb{E}g(S)| > t) \le 2\exp\left(-\frac{2t^2 n}{\beta^2}\right). \] Moreover, $\mathbb{E}g(S) \le \beta$ by standard stability arguments. Setting the RHS to $\delta$ gives $g(S) \le \beta + O\left(\sqrt{\log(1/\delta)/n}\right)$ with probability at least $1-\delta$. Proof strategy & techniques: Uniform stability, bounded differences, McDiarmid concentration. Computational validation: Estimate leave-one-out stability numerically and compare with observed generalization gaps. ML interpretation: Stability yields generalization guarantees even for high-capacity models. Generalization & edge cases: Strong convexity improves $\beta$; non-convex training can violate stability assumptions. Failure mode analysis: Unstable training (large learning rates, high variance) can lead to large generalization gaps. Historical context: Stability theory in learning (Bousquet, Elisseeff) and connections to differential privacy. Traps: Assuming stability without verifying Lipschitz loss or algorithmic conditions.

B.4 Full formal proof: Construct $x = (u, v)$ with $u \in \mathbb{R}^k$, $v \in \mathbb{R}^{d-k}$, and labels $y = f(u) + \xi$ where $\xi$ is independent noise. Let $\mathcal{H}_d$ be architectures with width $d$ that can represent any function of $u$ once $d \ge d^*$. For $d \ge d^*$, the Bayes risk is $\mathbb{E}\xi^2$, and no increase in $d$ can reduce this irreducible error. However, increased width permits fitting noise in $v$, reducing training loss while leaving test loss unchanged, giving a saturation bound. Proof strategy & techniques: Constructive distribution with nuisance features; irreducible error argument. Computational validation: Simulate with fixed signal dimension $k$, add noise dimensions, and observe training loss decrease without test improvement. ML interpretation: Past a representational threshold, scaling capacity does not improve generalization absent richer data or objectives. Generalization & edge cases: If $\xi=0$, saturation occurs at zero error; if $f$ changes with $d$, the threshold shifts. Failure mode analysis: Scaling models without data diversity wastes compute and can increase spurious dependence. Historical context: Classical bias-variance analysis and modern scaling saturation discussions. Traps: Interpreting training loss improvements as evidence of better generalization.

B.5 Full formal proof: Let $X_0$ be a finite set of size $2n$ and define labels by a random function $f$ with values in $\{0,1\}$. For any algorithm $\mathcal{A}$ trained on $n$ samples, at least $n$ points are unseen. By symmetry of random labeling, for any hypothesis $h$, the expected squared error on an unseen point is $\mathbb{E}(h(x)-f(x))^2 \ge 1/4$. Therefore the expected test error over $X_0$ is at least $1/4$, independent of $n$. Extending to regression in $[0,1]$ preserves the constant lower bound. Proof strategy & techniques: Random labeling, symmetry, and expectation lower bounds. Computational validation: Train on random labels and evaluate on unseen points to observe error near 0.25. ML interpretation: Without assumptions, learning is impossible beyond memorization. Generalization & edge cases: Structure assumptions (smoothness, low noise) can break the bound. Failure mode analysis: Overconfidence in generic algorithms without domain assumptions. Historical context: No-free-lunch theorems in learning and optimization. Traps: Treating this as a claim about typical tasks rather than worst case.

B.6 Full formal proof: For squared-loss networks, $\mathcal{L}(\theta)=\frac{1}{2}\sum_{i=1}^n r_i(\theta)^2$ with residuals $r_i(\theta)=h_\theta(x_i)-y_i$. The Hessian is $\mathcal{H}=J^T J + \sum_i r_i \nabla^2 r_i$. At interpolation, $r_i=0$, hence $\mathcal{H}=J^T J$. Since $J \in \mathbb{R}^{n \times d}$, $\text{rank}(\mathcal{H})\le \text{rank}(J)\le n$. Thus at least $d-n$ eigenvalues are zero, yielding flat directions. Proof strategy & techniques: Residual-based Hessian decomposition and rank bounds. Computational validation: Estimate Hessian spectrum numerically and confirm many near-zero eigenvalues when $d\gg n$. ML interpretation: Overparameterization induces flat directions and non-unique solutions; second-order methods become ill-conditioned. Generalization & edge cases: Regularization adds curvature; non-interpolating solutions can have additional Hessian terms. Failure mode analysis: Hessian inversion becomes unstable; naive second-order updates can diverge. Historical context: Flat minima and degeneracy analyses in deep learning geometry. Traps: Assuming Hessian full rank or using Newton steps without regularization.

B.7 Full formal proof: Reduce Vertex Cover to feature selection. For each vertex $v$, create a feature. For each edge $(u,v)$, create a labeled example that can be correctly classified iff at least one of $u$ or $v$ is selected. If there is a vertex cover of size $k$, then selecting those $k$ features yields zero training error; if not, any size-$k$ selection yields at least one misclassification. Thus exact $k$-feature selection solves Vertex Cover, implying $\mathcal{P} = \mathcal{NP}$ if it is polynomial-time. Proof strategy & techniques: Polynomial-time reduction from Vertex Cover; exact correspondence between cover and zero error. Computational validation: Build small graphs and compare optimal feature sets with vertex covers. ML interpretation: Exact feature selection is intractable in the worst case, motivating heuristics. Generalization & edge cases: Submodular objectives allow approximations; this does not negate NP-hardness of exact selection. Failure mode analysis: Exhaustive search is infeasible; naive heuristics can be far from optimal. Historical context: Complexity theory foundations for combinatorial learning problems. Traps: Conflating NP-hardness with impossibility of approximation or ignoring problem structure.

B.8 Full formal proof: If a model is non-identifiable, there exists a nontrivial curve $\theta(t)$ such that $p(Y|X;\theta(t))$ is constant. Differentiating $\log p$ along the curve at $t=0$ yields $\nabla_\theta \log p(Y|X;\theta_0) \cdot \theta'(0)=0$ almost surely. Therefore, $\mathcal{I}(\theta_0)\theta'(0)=0$, so $\mathcal{I}(\theta_0)$ has a nontrivial nullspace and is singular. Proof strategy & techniques: Likelihood-preserving manifold differentiation. Computational validation: For mixture models, compute Fisher information and verify zero eigenvalues due to label swapping. ML interpretation: Parameter uncertainty cannot be resolved by more data in symmetric models. Generalization & edge cases: Partial identifiability yields singularity only along symmetry directions. Failure mode analysis: Overconfident parameter inference, unstable explanations. Historical context: Identifiability theory in statistics and the role of Fisher information. Traps: Assuming infinite data guarantees unique parameters.

B.9 Full formal proof: For $L$-smooth $\mathcal{L}$, the descent lemma gives \[ \mathcal{L}(\theta_{t+1}) \le \mathcal{L}(\theta_t) - \eta\left(1-\frac{L\eta}{2}\right)\|\nabla\mathcal{L}(\theta_t)\|^2. \] With $\eta \le 1/L$, the coefficient is at least $\eta/2$. Summing from $t=0$ to $T-1$ yields \[ \sum_{t=0}^{T-1}\|\nabla\mathcal{L}(\theta_t)\|^2 \le \frac{2(\mathcal{L}(\theta_0)-\mathcal{L}^*)}{\eta}. \] Thus $\min_{t<T}\|\nabla\mathcal{L}(\theta_t)\| \le \sqrt{\frac{2(\mathcal{L}(\theta_0)-\mathcal{L}^*)}{\eta T}} = O(1/\sqrt{T})$. Proof strategy & techniques: Smoothness, descent lemma, telescoping sum, and minimum bound. Computational validation: Track gradient norms in non-convex training and verify $1/\sqrt{T}$ decay. ML interpretation: First-order methods reach approximate stationarity but not necessarily global optima. Generalization & edge cases: PL or strong convexity yields faster rates; non-smooth losses require subgradient variants. Failure mode analysis: Stationary points can be poor or unstable; additional criteria are needed. Historical context: Standard non-convex optimization guarantees. Traps: Equating stationarity with solution quality.

B.10 Full formal proof: Consider a family of convex quadratics with condition number $\kappa(d)$ growing in $d$. First-order methods require $\Omega(\kappa \log(1/\epsilon))$ steps to reduce error below $\epsilon$. Fix compute budget $C$. Choose $d$ such that $\kappa(d) > cC$, making the required steps exceed $C$. Then the optimization error remains at least $\epsilon_0$, while the representational optimum is still attainable, yielding a positive optimization–representation gap. Proof strategy & techniques: Lower bounds on first-order convergence; adversarial condition-number scaling. Computational validation: Compare training loss across model sizes under a fixed step budget. ML interpretation: Compute limits can dominate performance even with expressive models. Generalization & edge cases: Adaptive methods reduce constants but not the fundamental budget dependence. Failure mode analysis: Large models under fixed compute can underperform smaller models despite higher capacity. Historical context: Complexity bounds for optimization in convex and non-convex learning. Traps: Assuming model capacity guarantees achievable performance under fixed compute.

B.11 Full formal proof: If $h_1$ and $h_2$ are Bayes-optimal and disagree on a set $A$ with $\mathcal{P}(A)>0$, then any consistent learner can converge to either solution along subsequences or under different algorithmic biases. Hence the learning problem admits multiple optimal hypotheses with distinct behaviors, satisfying the underspecification condition. Proof strategy & techniques: Bayes optimality with non-unique minimizers; consistency and subsequence arguments. Computational validation: Construct distributions with symmetric label ambiguities and observe different solutions from different runs. ML interpretation: Optimal risk does not imply unique behavior; deployment can vary across runs. Generalization & edge cases: If Bayes-optimal classifier is unique almost surely, underspecification diminishes. Failure mode analysis: Two equally accurate models can disagree on critical subpopulations. Historical context: Decision theory and non-unique Bayes rules. Traps: Assuming “Bayes optimal” implies a unique target function.

B.12 Full formal proof: A ReLU network partitions input space into linear regions. A layer of width $h$ induces at most $\sum_{i=0}^d \binom{h}{i}$ regions; a depth $d$ network composes such partitions, yielding at most $2^{O(hd)}$ regions. The VC-dimension is bounded by the log of the number of distinct labelings realizable on $n$ points, giving $\text{VC-dim} = O(hd)$. Proof strategy & techniques: Hyperplane arrangement bounds and combinatorial counting; VC-dimension upper bounds. Computational validation: Empirical region counting by random sampling and linear region estimation. ML interpretation: Expressivity grows linearly with width-depth product, highlighting architectural limits. Generalization & edge cases: Tighter bounds depend on input dimension and network structure; convolution changes region counts. Failure mode analysis: Assuming width alone yields unlimited expressivity can lead to mis-scaled models. Historical context: Neural network expressivity bounds from early theoretical analyses. Traps: Confusing universal approximation with unlimited VC-dimension growth.

B.13 Full formal proof: From $\mathcal{L}_{\text{test}}(d) = L^* + A d^{-\alpha}$, we have $\mathcal{L}_{\text{test}}(2d)-L^* = A(2d)^{-\alpha} = 2^{-\alpha} A d^{-\alpha}$. Thus doubling capacity reduces excess loss by factor $2^{-\alpha}$, demonstrating diminishing returns. Proof strategy & techniques: Direct algebraic manipulation of scaling law. Computational validation: Fit power laws to empirical loss curves and verify the reduction factor when doubling $d$. ML interpretation: Scaling yields predictable but diminishing gains, guiding compute allocation. Generalization & edge cases: The law can break at saturation or under distribution shifts. Failure mode analysis: Extrapolating the power law beyond its regime leads to over-optimistic scaling plans. Historical context: Empirical scaling law literature in large models. Traps: Treating $\alpha$ as universal across tasks.

B.14 Full formal proof: Infinite VC-dimension implies the ability to shatter any finite set. For any $n$, select $n$ points that can be shattered. With random labels, ERM achieves zero training error but has test error at least $1/4$ with nontrivial probability by the random labeling argument used in B.1. Hence generalization can fail despite perfect training fit. Proof strategy & techniques: Shattering and random labeling lower bounds. Computational validation: Train a high-capacity model on random labels and observe near-chance test performance. ML interpretation: Capacity alone does not guarantee learning; inductive bias is essential. Generalization & edge cases: Structured distributions can yield benign overfitting, but worst-case failure remains. Failure mode analysis: Memorization without structure yields poor deployment behavior. Historical context: Classical VC theory and lower bounds on generalization. Traps: Assuming empirical success invalidates worst-case theory.

B.15 Full formal proof: For $\ell_\infty$-robustness $\epsilon$, the classifier must correctly label all points in an $\epsilon$-ball. Consider two clusters separated by distance $2\epsilon$. The robust neighborhoods overlap, forcing any classifier to misclassify a constant fraction. This yields a lower bound $\Omega(\epsilon)$ on robust error relative to standard error. Proof strategy & techniques: Geometric margin arguments; robust risk lower bounds. Computational validation: Evaluate robust accuracy on synthetic separable data at varying $\epsilon$. ML interpretation: Robustness imposes a cost on accuracy in small-margin regimes. Generalization & edge cases: For large-margin separable data, the tradeoff weakens; for small margins it is strong. Failure mode analysis: Overemphasis on robustness can degrade clean accuracy in critical applications. Historical context: Adversarial robustness theory and robustness-accuracy tradeoffs. Traps: Assuming robustness can be added without cost.

B.16 Full formal proof: For quadratic $f(x)=\frac{1}{2}x^T H x$, the GD update matrix is $I-\eta H$. Convergence requires $|1-\eta\lambda_i|<1$ for all eigenvalues, giving $0<\eta<2/\lambda_{\max}$. The optimal rate is attained at $\eta=2/(\lambda_{\min}+\lambda_{\max})$, yielding contraction factor $(\kappa-1)/(\kappa+1)$ where $\kappa=\lambda_{\max}/\lambda_{\min}$. Proof strategy & techniques: Spectral decomposition and linear dynamical system stability. Computational validation: Plot convergence rates for different $\kappa$ and confirm predicted contraction. ML interpretation: Ill-conditioned curvature slows training and demands careful learning-rate tuning. Generalization & edge cases: Locally quadratic approximations justify the same logic near minima in non-convex losses. Failure mode analysis: Excessive learning rates cause divergence; overly small rates waste compute. Historical context: Classical convex optimization results on gradient descent. Traps: Confusing stability bounds with rate-optimal choices.

B.17 Full formal proof: For $X\theta=y$ with $d>n$, the solution set is $\theta=\theta_p+v$ where $v\in\ker(X)$. The null space dimension is $d-\text{rank}(X)$. For generic data, $\text{rank}(X)=n$, so the solution set is an affine subspace of dimension at least $d-n$. Proof strategy & techniques: Null space decomposition for underdetermined linear systems. Computational validation: Compute $\text{rank}(X)$ and verify null space dimension numerically. ML interpretation: Overparameterized models admit infinitely many interpolators, making optimization bias critical. Generalization & edge cases: If $\text{rank}(X)<n$, the solution set is even larger. Failure mode analysis: Solution non-uniqueness creates sensitivity to initialization and optimizer choice. Historical context: Classical linear algebra facts applied to modern overparameterized models. Traps: Assuming interpolation implies uniqueness.

B.18 Full formal proof: Let $\mathcal{L}_{\text{dep}}(h)=\mathcal{L}_{\text{train}}(h)+\Delta(h)$ with $|\Delta(h)|\ge\delta$ for all $h$. Then $\inf_h \mathcal{L}_{\text{dep}}(h) \ge \inf_h \mathcal{L}_{\text{train}}(h) + \delta$. Increasing capacity can only reduce the first term, leaving a fixed offset $\delta$ determined by misalignment. Proof strategy & techniques: Inequality bounding and decomposition of deployment loss. Computational validation: Simulate a proxy objective with fixed discrepancy and observe a nonzero deployment loss floor. ML interpretation: Capacity cannot fix objective misalignment; it only improves proxy optimization. Generalization & edge cases: If $\Delta(h)$ varies with $h$, the bound can tighten or loosen, but a misalignment floor remains. Failure mode analysis: Scaling can amplify misaligned behavior while leaving deployment loss bounded away from optimum. Historical context: Alignment and governance literature on specification errors. Traps: Treating optimization improvements as proof of alignment.

B.19 Full formal proof: In ridge regression, risk decomposes into bias and variance terms that depend on $\lambda$ and $d/n$. Near $d/n=1$, the variance term can spike sharply in overparameterized regimes. Construct a sequence of problems where the risk curve in $\lambda$ changes shape across $d/n=1$, yielding a discontinuous shift in the minimizer $\lambda^*$. This establishes a regime shift boundary. Proof strategy & techniques: Bias-variance analysis and asymptotic random matrix arguments to show risk shape change. Computational validation: Sweep $\lambda$ across $d/n$ and plot the argmin; observe abrupt changes near interpolation. ML interpretation: Regularization strategies must adapt when crossing interpolation thresholds. Generalization & edge cases: In low-noise or low-rank settings, the discontinuity can be mild. Failure mode analysis: Fixed regularization across regimes can yield suboptimal or unstable models. Historical context: Random matrix theory analyses of ridge regression and double descent. Traps: Assuming smooth dependence of optimal regularization on scale.

B.20 Full formal proof: Let $\mathcal{A}$ be fixed. Choose distributions $\mathcal{P}$ and $\mathcal{Q}$ with $\text{TV}(\mathcal{P},\mathcal{Q})\le\epsilon$ but different label conditionals such that $\mathcal{A}$ trained on $\mathcal{P}$ achieves low training error yet incurs a constant error under $\mathcal{Q}$. By Le Cam or Pinsker-type arguments, small TV distance does not prevent a constant gap in expected loss when label conditionals shift on a low-probability but high-impact set. Thus deployment error can exceed training error by a constant under bounded shift. Proof strategy & techniques: Adversarial distribution shift construction; information-theoretic lower bounds. Computational validation: Simulate small label shifts with fixed feature marginal and observe constant performance drops. ML interpretation: Even small distribution shifts can cause large deployment failures without monitoring. Generalization & edge cases: Robust training or invariance constraints can reduce the constant but not eliminate worst-case shifts. Failure mode analysis: Reliance on training distribution assumptions leads to fragile deployment performance. Historical context: Domain adaptation lower bounds and distribution shift theory. Traps: Equating small distributional distance with small performance impact.

Solutions to C. Python Exercises

C.1 Code:

C.1

import numpy as np

rng = np.random.default_rng(7)

def simulate(d, n, noise=0.1, trials=20):
    train_err = []
    test_err = []
    for _ in range(trials):
        beta = np.zeros(d)
        beta[:10] = rng.normal(size=10)
        X = rng.normal(size=(n, d))
        y = X @ beta + noise * rng.normal(size=n)
        X_test = rng.normal(size=(n, d))
        y_test = X_test @ beta + noise * rng.normal(size=n)
        if d <= n:
            theta = np.linalg.lstsq(X, y, rcond=None)[0]
        else:
            theta = X.T @ np.linalg.solve(X @ X.T, y)
        train_err.append(np.mean((X @ theta - y) ** 2))
        test_err.append(np.mean((X_test @ theta - y_test) ** 2))
    return np.mean(train_err), np.mean(test_err)

ratios = [0.5, 1.0, 2.0, 5.0]
n = 200
results = []
for r in ratios:
    d = int(r * n)
    tr, te = simulate(d, n, noise=0.1)
    results.append((d, tr, te))

print(results)

Expected Output:

[(100, <small>, <moderate>), (200, <near 0>, <higher>), (400, <near 0>, <lower or higher>), (1000, <near 0>, <lower>)]

Numerical / Shape Notes: Training error should drop toward zero around $d \geq n$; test error often shows a peak near $d \approx n$, then may decrease in the overparameterized regime depending on noise.

C.2 Code:

C.2

import numpy as np

rng = np.random.default_rng(11)
n, d = 200, 50
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.1 * rng.normal(size=n)

def fit_min_norm(X, y):
    return X.T @ np.linalg.solve(X @ X.T, y)

theta_raw = fit_min_norm(X, y)
X_scaled = X * np.linspace(0.5, 2.0, d)
theta_scaled = fit_min_norm(X_scaled, y)

X_test = rng.normal(size=(n, d))
pred_raw = X_test @ theta_raw
pred_scaled = (X_test * np.linspace(0.5, 2.0, d)) @ theta_scaled
print(np.mean((pred_raw - pred_scaled) ** 2))

Expected Output:

<positive value>

Numerical / Shape Notes: The prediction difference is typically nonzero, demonstrating sensitivity to feature scaling; shapes are $X: (n, d)$, $theta: (d,)$.

C.3 Code:

C.3

import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.normal(size=(n, 2))
y = (x[:, 0] > 0).astype(int)
y_spurious = (x[:, 1] > 0).astype(int)
y_train = np.where(rng.random(n) < 0.9, y_spurious, y)

def train_lr(X, y, lr=0.1, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = X @ w
        p = 1 / (1 + np.exp(-z))
        grad = X.T @ (p - y) / len(y)
        w -= lr * grad
    return w

w_fast = train_lr(x, y_train, lr=0.5, steps=200)
w_slow = train_lr(x, y_train, lr=0.05, steps=2000)

x_test = rng.normal(size=(n, 2))
y_test = (x_test[:, 0] > 0).astype(int)
acc_fast = np.mean((x_test @ w_fast > 0) == y_test)
acc_slow = np.mean((x_test @ w_slow > 0) == y_test)
print(acc_fast, acc_slow)

Expected Output:

<two accuracies that differ>

Numerical / Shape Notes: Both training runs can achieve similar training loss but different test accuracy due to optimizer bias; $w$ shape is $(2,)$.

C.4 Code:

C.4

import numpy as np

rng = np.random.default_rng(5)
n = 150
X = rng.normal(size=(n, 1))
def build_features(x, m):
    return np.hstack([x ** i for i in range(1, m + 1)])

y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)

degrees = [1, 3, 5, 10, 20, 40]
errs = []
for m in degrees:
    Phi = build_features(X, m)
    theta = np.linalg.lstsq(Phi, y, rcond=None)[0]
    y_hat = Phi @ theta
    errs.append(np.mean((y_hat - y) ** 2))

print(list(zip(degrees, errs)))

Expected Output:

[(1, <higher>), (3, <lower>), (5, <lower>), (10, <lower>), (20, <maybe higher>), (40, <varies>)]

Numerical / Shape Notes: With noise, error may decrease then increase around interpolation; with more data or regularization, the second descent can appear.

C.5 Code:

C.5

import numpy as np

rng = np.random.default_rng(9)
n, d = 200, 1000
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.05 * rng.normal(size=n)

theta_gd = X.T @ np.linalg.solve(X @ X.T, y)

theta_rand = rng.normal(size=d)
theta_rand = theta_rand - X.T @ np.linalg.solve(X @ X.T, X @ theta_rand - y)

X_test = rng.normal(size=(n, d))
y_test = X_test @ beta + 0.05 * rng.normal(size=n)
err_gd = np.mean((X_test @ theta_gd - y_test) ** 2)
err_rand = np.mean((X_test @ theta_rand - y_test) ** 2)
print(err_gd, err_rand)

Expected Output:

<gd error lower>, <random error higher>

Numerical / Shape Notes: Both solutions interpolate training data but generalization differs due to implicit bias; $theta$ shape is $(d,)$.

C.6 Code:

C.6

import numpy as np

rng = np.random.default_rng(13)
n, d, h = 200, 5, 50
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, h))
W2 = rng.normal(size=(h, 1))
Y = np.maximum(0, X @ W1) @ W2

def forward(X, W1, W2):
    return np.maximum(0, X @ W1) @ W2

W1_alt = W1.copy()
W2_alt = W2.copy()
perm = rng.permutation(h)
W1_alt = W1_alt[:, perm]
W2_alt = W2_alt[perm]

diff = np.mean((forward(X, W1, W2) - forward(X, W1_alt, W2_alt)) ** 2)
print(diff)

Expected Output:

0.0

Numerical / Shape Notes: Output equivalence despite parameter permutation; demonstrates non-identifiability and functional redundancy.

C.7 Code:

C.7

import numpy as np

rng = np.random.default_rng(21)
n_full, d = 5000, 50
X_full = rng.normal(size=(n_full, d))
beta = rng.normal(size=d)
y_full = X_full @ beta + 0.1 * rng.normal(size=n_full)

def fit_mse(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

sizes = [200, 500, 1000, 2000, 5000]
errs = []
X_test = rng.normal(size=(1000, d))
y_test = X_test @ beta + 0.1 * rng.normal(size=1000)
for n in sizes:
    X = X_full[:n]
    y = y_full[:n]
    theta = fit_mse(X, y)
    errs.append(np.mean((X_test @ theta - y_test) ** 2))

print(list(zip(sizes, errs)))

Expected Output:

[(200, <higher>), ..., (5000, <lower or plateau>)]

Numerical / Shape Notes: Error typically decreases with data size then plateaus, indicating data-limited regime.

C.8 Code:

C.8

import numpy as np

rng = np.random.default_rng(19)
def train_budget(d, n=500, steps=200, lr=0.1):
    X = rng.normal(size=(n, d))
    beta = rng.normal(size=d)
    y = X @ beta + 0.1 * rng.normal(size=n)
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w -= lr * grad
    X_test = rng.normal(size=(n, d))
    y_test = X_test @ beta + 0.1 * rng.normal(size=n)
    return np.mean((X_test @ w - y_test) ** 2)

for d in [50, 200, 800]:
    print(d, train_budget(d))

Expected Output:

50 <value>
200 <value>
800 <value>

Numerical / Shape Notes: Larger $d$ can yield worse error under fixed steps due to optimization–representation gap.

C.9 Code:

C.9

import numpy as np

rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 2))
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)

X_shift = X.copy()
X_shift[:, 0] += 1.0
y_shift = (X_shift[:, 0] + 0.1 * X_shift[:, 1] > 0).astype(int)

def train_perceptron(X, y, steps=5):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        for i in range(len(y)):
            if (X[i] @ w > 0) != bool(y[i]):
                w += (1 if y[i] else -1) * X[i]
    return w

w = train_perceptron(X, y)
acc_train = np.mean((X @ w > 0) == y)
acc_shift = np.mean((X_shift @ w > 0) == y_shift)
print(acc_train, acc_shift)

Expected Output:

<high>, <lower>

Numerical / Shape Notes: Training accuracy can remain high while shifted accuracy drops, illustrating drift sensitivity.

C.10 Code:

C.10

import numpy as np

rng = np.random.default_rng(8)
n, d = 200, 20
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.1 * rng.normal(size=n)

def hessian_approx(X):
    return (X.T @ X) / len(X)

H = hessian_approx(X)
eigvals = np.linalg.eigvalsh(H)
print(eigvals[0], eigvals[-1], eigvals[-1] / eigvals[0])

Expected Output:

<min>, <max>, <condition number>

Numerical / Shape Notes: Large condition number indicates curvature explosion and learning rate sensitivity.

C.11 Code:

C.11

import numpy as np

rng = np.random.default_rng(4)
n, d = 300, 50
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.1 * rng.normal(size=n)

def train_gd(lam=0.0, steps=200, lr=0.1):
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n + lam * w
        w -= lr * grad
    return w

w_l2 = train_gd(lam=0.1, steps=200)
w_stop = train_gd(lam=0.0, steps=40)

X_test = rng.normal(size=(n, d))
y_test = X_test @ beta + 0.1 * rng.normal(size=n)
err_l2 = np.mean((X_test @ w_l2 - y_test) ** 2)
err_stop = np.mean((X_test @ w_stop - y_test) ** 2)
print(err_l2, err_stop)

Expected Output:

<two errors that can differ>

Numerical / Shape Notes: Matching training loss does not guarantee matching test loss; norms typically differ.

C.12 Code:

C.12

import numpy as np

rng = np.random.default_rng(6)
n, d = 200, 100
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.1 * rng.normal(size=n)

theta = np.linalg.lstsq(X, y, rcond=None)[0]
mask = np.abs(theta) > np.percentile(np.abs(theta), 70)
theta_pruned = theta * mask

X_test = rng.normal(size=(n, d))
y_test = X_test @ beta + 0.1 * rng.normal(size=n)
err_full = np.mean((X_test @ theta - y_test) ** 2)
err_pruned = np.mean((X_test @ theta_pruned - y_test) ** 2)
print(err_full, err_pruned)

Expected Output:

<baseline>, <slightly higher or similar>

Numerical / Shape Notes: Pruning can preserve performance if redundancy is high; shapes remain $(d,)$.

C.13 Code:

C.13

import numpy as np

rng = np.random.default_rng(10)
n = 200
X = rng.integers(0, 2, size=(n, 2))
y = (X[:, 0] ^ X[:, 1]).astype(int)

def fit_linear(X, y):
    X1 = np.hstack([X, np.ones((len(X), 1))])
    w = np.linalg.lstsq(X1, y, rcond=None)[0]
    return w

w = fit_linear(X, y)
pred = (np.hstack([X, np.ones((n, 1))]) @ w > 0.5).astype(int)
print(np.mean(pred == y))

Expected Output:

0.5

Numerical / Shape Notes: Linear model cannot solve XOR; accuracy stays near chance, showing expressivity limits.

C.14 Code:

C.14

import numpy as np

rng = np.random.default_rng(14)
n = 500
X = rng.normal(size=(n, 3))
group = (X[:, 2] > 0).astype(int)
y = (X[:, 0] + 0.5 * group > 0).astype(int)

def train_with_penalty(X, y, group, lam=0.0, steps=500, lr=0.1):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = X @ w
        p = 1 / (1 + np.exp(-z))
        grad = X.T @ (p - y) / len(y)
        gap = np.mean(p[group == 1]) - np.mean(p[group == 0])
        grad += lam * gap * np.mean(X[group == 1] - X[group == 0], axis=0)
        w -= lr * grad
    return w

w0 = train_with_penalty(X, y, group, lam=0.0)
w1 = train_with_penalty(X, y, group, lam=1.0)
print(np.linalg.norm(w0 - w1))

Expected Output:

<positive value>

Numerical / Shape Notes: Penalty changes parameters and group gap; illustrates objective tradeoffs without code outputs of metrics.

C.15 Code:

C.15

import numpy as np

sizes = np.array([50, 100, 200, 400, 800], dtype=float)
losses = np.array([0.8, 0.6, 0.48, 0.40, 0.36])

logx = np.log(sizes)
logy = np.log(losses - losses.min() + 1e-6)
coef = np.polyfit(logx, logy, 1)
print(coef)

Expected Output:

<slope and intercept>

Numerical / Shape Notes: Negative slope indicates power-law scaling; deviations from linearity indicate saturation.

C.16 Code:

C.16

import numpy as np

rng = np.random.default_rng(15)
n = 400
X = rng.normal(size=(n, 2))
y = (X[:, 0] > 0).astype(int)
amb = np.abs(X[:, 0]) < 0.1
y[amb] = rng.integers(0, 2, size=np.sum(amb))

def train_lr(X, y, seed):
    rng_local = np.random.default_rng(seed)
    w = rng_local.normal(size=2) * 0.1
    for _ in range(500):
        p = 1 / (1 + np.exp(-X @ w))
        w -= 0.1 * (X.T @ (p - y) / len(y))
    return w

w1 = train_lr(X, y, 1)
w2 = train_lr(X, y, 2)
pred_disagree = np.mean((X @ w1 > 0) != (X @ w2 > 0))
print(pred_disagree)

Expected Output:

<small positive value>

Numerical / Shape Notes: Disagreement concentrates in ambiguous region; illustrates underspecification with equal risk.

C.17 Code:

C.17

import numpy as np

rng = np.random.default_rng(17)
def train_model(d, steps):
    n = 500
    X = rng.normal(size=(n, d))
    beta = rng.normal(size=d)
    y = X @ beta + 0.1 * rng.normal(size=n)
    w = np.zeros(d)
    for _ in range(steps):
        w -= 0.1 * (X.T @ (X @ w - y) / n)
    X_test = rng.normal(size=(n, d))
    y_test = X_test @ beta + 0.1 * rng.normal(size=n)
    return np.mean((X_test @ w - y_test) ** 2)

err_small = train_model(50, steps=1000)
err_large = train_model(500, steps=200)
print(err_small, err_large)

Expected Output:

<two errors, often small model better>

Numerical / Shape Notes: Under fixed compute, larger model can underperform due to slower optimization.

C.18 Code:

C.18

import numpy as np

rng = np.random.default_rng(18)
n = 200
ratios = [0.5, 1.0, 2.0, 4.0]
best_lams = []
for r in ratios:
    d = int(r * n)
    X = rng.normal(size=(n, d))
    beta = rng.normal(size=d)
    y = X @ beta + 0.1 * rng.normal(size=n)
    X_test = rng.normal(size=(n, d))
    y_test = X_test @ beta + 0.1 * rng.normal(size=n)
    lams = np.logspace(-4, 1, 10)
    errs = []
    for lam in lams:
        theta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        errs.append(np.mean((X_test @ theta - y_test) ** 2))
    best_lams.append(lams[int(np.argmin(errs))])
print(best_lams)

Expected Output:

[<lam>, <lam>, <lam>, <lam>]

Numerical / Shape Notes: Optimal $\lambda$ can shift sharply near $d \approx n$, indicating regime shift.

C.19 Code:

C.19

import numpy as np

rng = np.random.default_rng(20)
n = 1000
X = rng.normal(size=(n, 2))
spurious = (rng.random(n) < 0.9).astype(int)
y = (X[:, 0] > 0).astype(int)
y_train = np.where(spurious == 1, y, 1 - y)

def train_lr(X, y, steps=500, lr=0.1):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y) / len(y))
    return w

w = train_lr(X, y_train)
X_test = rng.normal(size=(n, 2))
y_test = (X_test[:, 0] > 0).astype(int)
acc = np.mean((X_test @ w > 0) == y_test)
print(acc)

Expected Output:

<accuracy possibly degraded>

Numerical / Shape Notes: Accuracy can drop if model relies on spurious correlations; highlights need for invariance.

C.20 Code:

C.20

import numpy as np

rng = np.random.default_rng(22)
n = 300
def generate(t):
    shift = 0.0 if t < 5 else 1.0
    X = rng.normal(size=(n, 2))
    X[:, 0] += shift
    y = (X[:, 0] > 0).astype(int)
    return X, y

def train_simple(X, y):
    w = np.zeros(2)
    for _ in range(200):
        p = 1 / (1 + np.exp(-X @ w))
        w -= 0.1 * (X.T @ (p - y) / len(y))
    return w

X0, y0 = generate(0)
w = train_simple(X0, y0)

for t in range(10):
    Xt, yt = generate(t)
    acc = np.mean((Xt @ w > 0) == yt)
    print(t, acc)

Expected Output:

0 <high>
...\n+5 <drop>\n+...

Numerical / Shape Notes: Accuracy stays high before shift and drops after; monitoring can detect regime change.

C.1 Explanation

Explanation: This exercise operationalizes the capacity threshold and interpolation regime by explicitly varying the ratio $d/n$. The simulated linear model allows you to see the underparameterized regime where optimization error dominates, the critical regime where interpolation begins, and the overparameterized regime where implicit bias decides which interpolating solution is selected. The key is that the model family is fixed, while capacity is the independent variable; this isolates the phase transition in fit quality and the subsequent generalization behavior. ML Interpretation: The experiment makes benign overfitting concrete: perfect training fit can coexist with good generalization when the implicit bias aligns with the true signal. It demonstrates that capacity is not simply a risk factor; it can be a tool when combined with stable optimization and adequate inductive bias. Failure Modes: If data noise is too high, the interpolating solution will overfit noise and test error will rise sharply. If features are poorly scaled, implicit bias changes and the generalization curve can shift unpredictably. Common Mistakes: Interpreting the interpolation peak as a universal law without checking data noise; concluding that higher capacity is always better; ignoring feature scaling effects. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 3 (Capacity Threshold), Definition 5 (Benign Overfitting), Definition 6 (Double Descent), Theorem 3 (Benign Overfitting Condition), Theorem 2 (Capacity–Generalization Tradeoff), Example 1 (Underspecified Linear Regression), Example 4 (Capacity Threshold Crossing).

C.2 Explanation

Explanation: This exercise isolates how feature scaling changes the implicit norm used by the optimizer, which in turn changes the minimum-norm interpolating solution. It shows that the geometry of the parameter space is not intrinsic to the data distribution; it is imposed by representation choices. The same training loss can hide substantial prediction differences. ML Interpretation: This is a concrete example of optimization bias and induced inductive bias: preprocessing choices encode inductive bias as strongly as architecture choices. It also illustrates underspecification, because multiple solutions fit equally well. Failure Modes: Overinterpreting model coefficients without accounting for scaling; assuming invariance to normalization; conflating feature preprocessing with mere numerical conditioning. Common Mistakes: Comparing models trained on differently scaled inputs as if they were equivalent; using raw coefficients for interpretability without normalization; ignoring the impact of basis changes. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 1 (Underspecification), Definition 7 (Optimization Bias), Definition 8 (Induced Inductive Bias), Theorem 5 (Optimization Bias Convergence), Example 7 (Optimization Bias vs Explicit Regularization).

C.3 Explanation

Explanation: The setup creates a spurious correlation that is highly predictive in training but not in deployment. Different optimizers or hyperparameters can converge to different solutions that exploit or ignore the spurious feature. The exercise demonstrates how optimization bias selects among underspecified solutions. ML Interpretation: This models real-world spurious shortcut learning. It shows that training loss is not sufficient to certify robustness, and that optimizer dynamics (learning rate, batch size) can alter model reliance on spurious cues. Failure Modes: Overfitting to spurious features; evaluating only in-distribution performance; mistaking training accuracy for causal understanding. Common Mistakes: Not controlling for distribution shift in the test set; concluding that spurious reliance is a data-only issue; ignoring the role of optimization and architecture. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 1 (Underspecification), Definition 7 (Optimization Bias), Definition 18 (Generalization Collapse), Theorem 10 (Structural Limit of ERM), Example 9 (Stability Collapse Under Drift), Example 12 (Objective–Representation Gap Demonstration).

C.4 Explanation

Explanation: The polynomial feature expansion provides a controlled way to increase model capacity, revealing the classical U-shaped curve and potential second descent. It shows that the interpolation threshold can coincide with a spike in test error, after which test error can decrease again if implicit bias favors simpler interpolators. ML Interpretation: The exercise illustrates double descent and why classical bias-variance narratives fail in modern settings. It is a bridge between classical statistics and modern deep learning scaling behavior. Failure Modes: Misreading the curve as evidence of stable improvement beyond any capacity; using too few data points or overly noisy data that mask the second descent. Common Mistakes: Confusing polynomial degree with data dimensionality; ignoring the role of regularization; assuming the second descent appears in all settings. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 6 (Double Descent), Definition 5 (Benign Overfitting), Theorem 2 (Capacity–Generalization Tradeoff), Example 4 (Capacity Threshold Crossing), Example 11 (Scaling Saturation in Large Models).

C.5 Explanation

Explanation: By constructing a random interpolator and comparing it to the gradient descent interpolator, this exercise isolates the role of implicit bias. Both solutions fit the training data perfectly, yet their test errors differ because the implicit bias of gradient descent favors low-norm, signal-aligned solutions. ML Interpretation: This is the simplest empirical proof that interpolation alone is not synonymous with overfitting. The algorithm matters, and the inductive bias it imposes determines whether overfitting is benign. Failure Modes: Assuming that any interpolator is equivalent; ignoring the impact of optimizer on the selected solution; misattributing the effect to data size alone. Common Mistakes: Using a single random interpolator and overgeneralizing; failing to control noise level; interpreting differences as statistical noise rather than algorithmic bias. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 5 (Benign Overfitting), Definition 7 (Optimization Bias), Theorem 3 (Benign Overfitting Condition), Example 3 (Benign Overfitting Simulation).

C.6 Explanation

Explanation: This exercise demonstrates non-identifiability by showing that parameter permutations preserve the function. It clarifies that parameter trajectories can differ across runs even when the function learned is nearly identical. ML Interpretation: It illustrates why parameter-level interpretability is fragile in deep networks and why multiple equivalent parameterizations exist. It also emphasizes the difference between parameter identifiability and functional identifiability. Failure Modes: Overinterpreting parameter values; misattributing differences in parameter trajectories to “learning different concepts” when the outputs are equivalent. Common Mistakes: Treating parameter similarity as a proxy for functional similarity; assuming that uniqueness of optimization implies unique representation. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 2 (Non-Identifiability), Definition 12 (Functional Redundancy), Theorem 4 (Degenerate Hessian Limit), Example 2 (Non-Identifiable Parameterization in Neural Networks), Example 8 (Functional Redundancy in Wide Networks).

C.7 Explanation

Explanation: By scaling data and model size along different axes, the exercise detects whether performance gains are data-limited or capacity-limited. This is an empirical counterpart to scaling laws and saturation limits. ML Interpretation: It shows that scaling is not a one-dimensional strategy; the model, data, and compute interact. The experiment forces you to identify the dominant bottleneck. Failure Modes: Scaling model size while data remain narrow; scaling data without improving label quality; ignoring compute budget effects. Common Mistakes: Treating accuracy gains as evidence of capacity limitations without checking data diversity; conflating data quantity with data quality. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 11 (Scaling Saturation), Definition 9 (Expressivity Bound), Theorem 9 (Scaling Saturation), Example 11 (Scaling Saturation in Large Models), Example 6 (Expressivity Saturation Under Finite Data).

C.8 Explanation

Explanation: This exercise fixes compute and varies model size, revealing that larger models can underperform if optimization cannot reach a good solution within the budget. It distinguishes representational capacity from achievable performance. ML Interpretation: It directly illustrates the optimization–representation gap and compute-constrained learning. It provides a concrete reason why scaling can fail in practice. Failure Modes: Selecting models solely by parameter count; mistaking underperformance as evidence of low capacity rather than insufficient optimization. Common Mistakes: Holding training steps constant across models without accounting for per-step cost; comparing models trained to different levels of convergence. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 19 (Optimization–Representation Gap), Definition 10 (Computational Complexity Limit), Theorem 10 (Structural Limit of ERM), Example 10 (Compute-Constrained Optimization Failure).

C.9 Explanation

Explanation: This exercise contrasts stability-based generalization with actual performance under drift. It shows that stability guarantees are distribution-dependent and do not protect against shifts, even when the shift is modest. ML Interpretation: It connects theoretical generalization bounds to operational reliability, highlighting that stability is necessary but not sufficient for deployment resilience. Failure Modes: Ignoring drift monitoring; assuming a stable training loss implies stable deployment performance. Common Mistakes: Using only in-distribution validation; failing to track feature distribution changes; assuming IID conditions hold indefinitely. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 16 (Stability Bound), Definition 17 (Regime Shift Boundary), Theorem 8 (Stability–Robustness Tradeoff), Example 9 (Stability Collapse Under Drift).

C.10 Explanation

Explanation: By estimating Hessian eigenvalues, the exercise reveals curvature explosion and its impact on learning rate stability. It shows how ill-conditioning makes optimization sensitive to step size. ML Interpretation: It links loss landscape geometry to optimization failure modes and motivates adaptive learning rates. Failure Modes: Divergence with overly large learning rates; slow convergence with overly small rates; misattributing instability to data noise rather than curvature. Common Mistakes: Treating learning rate as a universal constant; ignoring curvature diagnostics; confusing local curvature with global behavior. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 14 (Curvature Explosion), Definition 15 (Hessian Degeneracy), Theorem 4 (Degenerate Hessian Limit), Example 5 (Degenerate Hessian at Scale).

C.11 Explanation

Explanation: This exercise contrasts explicit regularization with implicit regularization from early stopping. It demonstrates that models with similar training loss can generalize differently because optimization path constraints differ from penalty-based constraints. ML Interpretation: It provides empirical evidence that optimization is part of the model specification, not just the solver, and that implicit bias can dominate explicit penalties. Failure Modes: Assuming regularization strength is a substitute for training dynamics; failing to control for training loss when comparing regimes. Common Mistakes: Comparing models at different training losses; treating weight decay and early stopping as equivalent; ignoring batch size effects. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 7 (Optimization Bias), Definition 8 (Induced Inductive Bias), Theorem 5 (Optimization Bias Convergence), Example 7 (Optimization Bias vs Explicit Regularization).

C.12 Explanation

Explanation: Pruning tests the hypothesis of functional redundancy by removing parameters and checking output stability. It exposes whether the learned function depends on many parameters or only a subset. ML Interpretation: It supports the idea that effective capacity is lower than nominal capacity, enabling compression without major performance loss. Failure Modes: Overpruning that removes essential but sparse features; underestimating degradation on rare cases. Common Mistakes: Evaluating pruning only on training data; assuming parameter magnitude always reflects importance; ignoring structured redundancy. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 12 (Functional Redundancy), Definition 13 (Sparse Activation Limit), Example 8 (Functional Redundancy in Wide Networks).

C.13 Explanation

Explanation: By comparing linear and nonlinear models on XOR-like structure, the exercise highlights representational limits independent of optimization. It demonstrates a clean separation of representation error from optimization error. ML Interpretation: It shows that no amount of optimization can solve a problem outside the hypothesis class, reinforcing the need to match model class to task structure. Failure Modes: Over-optimizing a misspecified model; attributing failure to optimizer rather than expressivity. Common Mistakes: Assuming more training time can fix representational mismatch; ignoring the need for nonlinearity. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 9 (Expressivity Bound), Theorem 6 (Expressivity Saturation Bound), Example 6 (Expressivity Saturation Under Finite Data).

C.14 Explanation

Explanation: The exercise introduces a multi-objective loss to demonstrate objective tradeoffs. It shows how adding a fairness penalty changes the solution, sometimes reducing accuracy while improving group parity. ML Interpretation: It directly illustrates misalignment and the need for explicit governance in objective design. Optimization can only balance the objectives you include, not the ones you omit. Failure Modes: Over-penalizing fairness and collapsing accuracy; under-penalizing and leaving bias unchanged; assuming a single penalty weight fits all contexts. Common Mistakes: Treating fairness metrics as post hoc evaluations rather than objectives; ignoring stakeholder priorities. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 19 (Optimization–Representation Gap), Theorem 10 (Structural Limit of ERM), Example 12 (Objective–Representation Gap Demonstration).

C.15 Explanation

Explanation: Fitting a power law to scaling data tests whether improvements follow a predictable scaling pattern and where that pattern breaks. The breakdown point indicates saturation or a regime shift. ML Interpretation: It grounds scaling decisions in empirical evidence rather than intuition, clarifying when more capacity yields diminishing returns. Failure Modes: Overfitting the power law; extrapolating beyond the observed regime; ignoring changes in data quality. Common Mistakes: Treating scaling slopes as universal; using short ranges of data to infer long-range scaling. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 11 (Scaling Saturation), Theorem 9 (Scaling Saturation), Example 11 (Scaling Saturation in Large Models).

C.16 Explanation

Explanation: The ambiguous region creates multiple Bayes-optimal classifiers that disagree, making underspecification unavoidable. Different training runs can converge to different solutions with identical risk. ML Interpretation: It demonstrates that even optimal solutions can differ in behavior, and that inductive bias determines which solution is chosen. Failure Modes: Assuming stable model behavior across runs; overlooking disagreement in ambiguous regions. Common Mistakes: Comparing only global metrics; ignoring localized behavior in ambiguous subspaces. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 1 (Underspecification), Definition 2 (Non-Identifiability), Theorem 10 (Structural Limit of ERM), Example 1 (Underspecified Linear Regression).

C.17 Explanation

Explanation: This exercise compares a smaller converged model to a larger partially trained model under fixed compute. It emphasizes that optimization completion is often more important than representational capacity. ML Interpretation: It reinforces compute-constrained learning and the optimization–representation gap as practical deployment constraints. Failure Modes: Choosing large models that never converge; misreading underperformance as low capacity rather than training budget mismatch. Common Mistakes: Equating parameter count with achievable quality; ignoring the cost of convergence. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 19 (Optimization–Representation Gap), Definition 10 (Computational Complexity Limit), Example 10 (Compute-Constrained Optimization Failure).

C.18 Explanation

Explanation: Varying $d/n$ and tracking optimal regularization reveals regime shifts in training strategy. The optimal $\lambda$ can change sharply near interpolation thresholds. ML Interpretation: This is a practical view of how explicit regularization is most useful in classical regimes and less so in strongly overparameterized regimes. Failure Modes: Using a fixed regularization schedule across scales; misinterpreting changing optimal $\lambda$ as noise rather than a regime change. Common Mistakes: Insufficient grid resolution; failing to separate optimization failure from generalization failure. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 17 (Regime Shift Boundary), Theorem 9 (Scaling Saturation), Example 4 (Capacity Threshold Crossing), Example 11 (Scaling Saturation in Large Models).

C.19 Explanation

Explanation: The spurious correlation setup shows how models can amplify shortcuts when scaling without invariance enforcement. It tests whether data augmentation or invariant learning reduces reliance on spurious features. ML Interpretation: This reflects real-world shortcut learning in vision and language models and shows that scale alone does not fix it. Failure Modes: Increasing model size without addressing spurious correlations; evaluating only on in-distribution data. Common Mistakes: Assuming data quantity fixes spurious reliance; ignoring causal structure in dataset design. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 1 (Underspecification), Definition 8 (Induced Inductive Bias), Example 3 (Benign Overfitting Simulation), Example 11 (Scaling Saturation in Large Models).

C.20 Explanation

Explanation: The time-varying distribution simulates deployment drift and tests monitoring strategies. It demonstrates how performance can collapse abruptly after a regime shift, even if training loss remains unchanged. ML Interpretation: This is a concrete link between stability theory and MLOps: without monitoring and adaptation, models degrade under drift. Failure Modes: Delayed detection of drift; retraining too late; using static thresholds that fail in new regimes. Common Mistakes: Monitoring only overall accuracy; ignoring feature distribution drift; not setting alert thresholds based on operational risk. Chapter Connections (Definitions, Theorems, Example 1–12): Definition 18 (Generalization Collapse), Definition 16 (Stability Bound), Theorem 8 (Stability–Robustness Tradeoff), Example 9 (Stability Collapse Under Drift).

End of C Solutions

Appendices

In Context

Algorithmic Development History

The modern theory of learning began with Vapnik–Chervonenkis theory, which provided the first rigorous sample complexity bounds linking hypothesis class capacity to generalization. It introduced VC-dimension as a formal measure of capacity and connected it to uniform convergence, establishing why large hypothesis classes require more data. This foundation set the stage for both statistical learning theory and the later empirical paradoxes of deep learning, where models with enormous VC-dimension generalize well despite classical bounds.

The No-Free-Lunch theorem emerged from information-theoretic arguments in the 1990s, formalizing the idea that no learning algorithm can be universally best across all possible data-generating distributions. This theorem reframed inductive bias as essential rather than optional: every successful algorithm must assume something about the world. In practice, architectures such as CNNs and transformers are embodiments of these assumptions, which is why they succeed on specific domains.

Double descent research, starting with early observations in polynomial regression and later formalized in modern work on overparameterization, overturned the classical bias-variance tradeoff narrative. It demonstrated that test error can decrease again after the interpolation threshold, particularly under implicit regularization from gradient descent. This research is a cornerstone of modern scaling practice: it explains why very large models can generalize well even when they fit training data perfectly.

Modern scaling-limit discussions, especially in large language models, emphasize empirical scaling laws and diminishing returns. Work on power-law improvements and data-model-compute tradeoffs shows that scaling is productive only up to a point, after which data quality, objective alignment, and evaluation scope become dominant constraints. These discussions connect theoretical saturation limits to real deployment decisions.

Complexity theory foundations, built on the $\mathcal{P} \neq \mathcal{NP}$ conjecture and hardness of approximation results, explain why many learning objectives are computationally intractable. Feature selection, architecture search, and hyperparameter optimization are not just expensive in practice; they are provably hard in the worst case. These results justify the reliance on heuristics and highlight the limits of algorithmic improvement.

Large-model limitation analyses bring these threads together. They show that increasing parameters can amplify misalignment, magnify underspecification, and stress the evaluation pipeline. Empirical failures in robustness, fairness, and safety indicate that larger models are not inherently wiser; they are more effective at optimizing whatever objective we provide. This makes objective design and governance central to the future of ML.

Why This Matters for ML

The Limits of Optimization

Optimization finds the best solution to the specified objective, not the best solution for the real-world goal. When objectives are misaligned, optimization amplifies that misalignment. When the loss landscape is underspecified, optimization selects among equally-valid solutions based on algorithmic bias. When the landscape is non-convex and ill-conditioned, optimization may converge to a solution that is technically correct but operationally brittle. Recognizing these limits prevents misdiagnosis: a failure of deployment is often a failure of specification, not a failure of optimization.

The Limits of Data Scaling

More data generally improves performance, but only if the data capture the relevant structure of the target problem. If the data distribution is narrow or the labels are proxies for a more complex goal, scaling the dataset can entrench existing biases. Moreover, data scaling does not resolve non-identifiability or underspecification when multiple models fit the data equally well. The limit is not merely quantity but quality, diversity, and alignment with real-world objectives.

Structural Constraints of Learning Systems

Learning systems are constrained by expressivity bounds, computational complexity, and stability limits. These constraints mean that some problems are unlearnable with a given model class, and some objectives are infeasible under available compute. They also imply tradeoffs: stability can reduce sensitivity to noise but also suppress rare signals; robustness can conflict with accuracy; fairness constraints can reduce utility on certain metrics. These are not bugs; they are structural properties that must be navigated.

Future Directions: Beyond Pure Optimization

The future of ML is not a race to optimize faster but a shift toward better specification, governance, and human alignment. This includes multi-objective optimization with explicit value constraints, interactive training that incorporates human feedback, and system-level design that treats models as components in a broader socio-technical pipeline. It also includes research on causal modeling, robust decision-making under distribution shift, and training methods that encode invariances or safety constraints directly. Beyond pure optimization, the field must invest in problem formulation, evaluation integrity, and accountability mechanisms.

Motivation

Why Optimization Cannot Solve Everything

The essence of the optimization framework is this: given a model class $\mathcal{M}$, a loss function $\mathcal{L}$, and training data $\{(x_i, y_i)\}_{i=1}^n$, the goal is to find parameters $\theta^* = \arg\min_{\theta} \sum_{i=1}^n \mathcal{L}(h_\theta(x_i), y_i)$.

On its face, this is a well-defined mathematical problem. Solve it perfectly, and you have found the model that fits the data best (according to the chosen metric). Why is this insufficient?

The answer lies in five distinct gaps between the optimization problem and the actual deployment problem:

Gap 1: The Training-to-Deployment Transition: The training data is drawn from a distribution $\mathcal{P}_{\text{train}}$, but the deployed model operates on data from potentially different distribution $\mathcal{P}_{\text{deploy}}$. Optimization minimizes loss on $\mathcal{P}_{\text{train}}$, but we ultimately care about performance on $\mathcal{P}_{\text{deploy}}$. If these distributions differ—and they always do, to some degree—optimization on training data does not guarantee good deployment performance. Worse, the two objectives can be antagonistic: optimizations that minimize training loss may actively harm deployment performance.

Gap 2: Multiple Equally-Valid Solutions: The optimization problem may have many distinct solutions $\theta_1^*, \theta_2^*, \ldots$ that achieve identical or near-identical training loss. In high dimensions (as is commonplace in modern deep learning), this is almost always the case. These solutions can make wildly different predictions on unseen data. Optimization finds one of these solutions (typically the one that the algorithm happens to converge to first), with no guarantee that this solution generalizes better than alternatives. The algorithm has succeeded at its defined objective but failed at the actual goal.

Gap 3: Loss Function Misalignment: The loss function $\mathcal{L}$ is a mathematical proxy for what we actually care about. In classification, we approximate the goal “maximize accuracy on real-world data where mistakes have serious consequences” with the loss function “minimize cross-entropy on labeled training samples.” In regression, we approximate “make predictions useful for downstream decision-making” with “minimize mean squared error.” These are not equivalent. As the problem becomes more consequential (medicine, criminal justice, hiring), the gap between the mathematical objective and the actual human goal grows.

Gap 4: Computational Intractability: Even if the problem is well-specified and the objective is correct, finding the global optimum may be computationally impossible. Many optimization problems are NP-hard; finding the exact solution requires time exponential in the problem size. We use heuristics (gradient descent, simulated annealing, genetic algorithms) that find local optima, which may be arbitrarily far from the global optimum. For some problem structures, even local optima are hard to find.

Gap 5: The Meta-Problem: Optimization finds the best solution according to a metric we chose. But we did not optimize the choice of metric itself. The decision to optimize accuracy rather than fairness, or revenue rather than user welfare, is a meta-level choice made by humans (or encoded in organizational structure) before optimization begins. Optimization cannot solve this meta-problem; it can only expose it.

Each of these gaps is fundamental. No amount of additional data, compute, or algorithmic sophistication closes all of them simultaneously.

Underspecification and Non-Identifiability

In mathematics and statistics, an inverse problem is called “underspecified” when there are multiple distinct solutions consistent with the observed data. A classic example: given the shape of a shadow on a wall, many different 3D objects could have cast that shadow. The shadow (the data) does not uniquely determine the object (the parameter).

In machine learning, underspecification is endemic. Consider a simple example: training a logistic regression model to predict loan default risk given applicant features (income, credit score, employment history). Suppose we have 100,000 training examples and a carefully labeled target variable. The loss function is convex, and optimization is straightforward. We fit a model $h_\theta(x) = \sigma(\theta^T x)$ and find parameters $\theta^*$.

But here’s the problem: there are many different parameter vectors that achieve essentially identical training loss. In fact, as long as the feature space is high-dimensional relative to the sample size (which it is, once we include nonlinear features or interactions), there are infinitely many solutions with zero training loss. Even if we regularize (add a penalty to $\theta^2$ to prefer “small” parameters), there are still many solutions with near-optimal training loss, and they make different predictions on unseen applications.

More problematically: the different solutions may differ systematically. Solution A might be more aggressive in lending to applicants from Group 1, while Solution B is more aggressive toward Group 2. Both fit the training data. But if the training data reflects historical bias (as it almost always does), both solutions will amplify different types of historical bias in different directions. Optimization cannot adjudicate between them because both are equally optimal according to the loss function.

Non-identifiability arises when the data does not provide enough information to uniquely pin down the parameters. In linear models, if you have fewer samples than features (or near-collinear features), you have non-identifiability. In deep neural networks, the situation is more subtle: the network is heavily overparameterized relative to the training data, yet mysteriously generalizes well. But this generalization is not guaranteed; different initializations and random seeds converge to different solutions, some of which generalize well and others poorly.

Objective Misalignment as a Structural Limitation

The term “objective misalignment” refers to a situation where the objective function we’re optimizing does not correspond to the actual goal we care about.

Consider content recommendation. The training objective is typically: maximize click-through rate (CTR), the fraction of recommendations that users click on. Optimizing this objective is straightforward and well-studied. However, the actual goals of a platform might be: - Maximize user well-being and cognitive health - Increase diversity of user exposure to viewpoints - Reduce time spent on the platform (to prevent addiction) - Prevent radicalization and filter bubbles

These goals directly contradict CTR optimization. Content that maximizes immediate clicks often exploits emotional triggers (outrage, division, tribalism), which harms user well-being and diversity. A system that optimizes CTR will not achieve the true goals, no matter how well it executes the optimization.

The problem is not solvable through better optimization. No algorithm can discover the “true” objective from data; the objective is a choice made by humans. And once the system begins operating under a misaligned objective, it generates feedback loops that reinforce the misalignment.

Similar misalignments arise throughout machine learning applications:

Hiring: Optimize for job performance prediction (test accuracy), but the true goal is fair hiring. A model that perfectly predicts job performance under a historically-biased hiring process will perpetuate that bias.
Criminal sentencing: Optimize for predicting recidivism, but the true goal is fair and accurate risk assessment. A model trained on historical sentencing data learns to predict who was actually incarcerated, not who likely would re-offend, conflating crime and enforcement patterns.
Medical diagnosis: Optimize for diagnostic accuracy on labeled training cases, but the true goal is patient outcomes. Diagnostic accuracy and patient outcomes are correlated but distinct; a model that maximizes accuracy on the training distribution might fail catastrophically on underrepresented populations with different disease presentations.

In each case, the optimization problem is mathematically well-posed. We can solve it efficiently. But the solution does not achieve the actual goal. This is a fundamental limitation of the optimization framework: it requires that we specify, in advance, a mathematical objective function. If that function is misaligned with reality, optimization makes things worse.

Computational Limits and Scaling Barriers

Not all optimization problems are created equal. Some are convex and can be solved to global optimality in polynomial time. Others are non-convex landscapes where local optima pullulate and finding the global optimum is NP-hard.

Deep learning operates in the latter regime. The loss landscape of a neural network with parameters $\theta$ and training data $\{(x_i, y_i)\}_{i=1}^n$ is a complex non-convex function. For a network with 10 million parameters (small by modern standards), the space of possible parameters is 10-dimensional, and the loss landscape is extraordinarily high-dimensional and rugged.

Gradient descent navigates this landscape by taking small steps in the direction of steepest descent. It is fast and scalable but is guaranteed only to find local optima, not global optima. Moreover, in high dimensions, the “typical” local optimum can be vastly suboptimal compared to the best solutions. The optimization algorithm does not guarantee convergence to good solutions.

Yet empirically, gradient descent trains deep networks that generalize well. Why? This remains one of the great unsolved mysteries in machine learning. Current theories invoke properties like overparameterization (having so many parameters that the loss surface has special structure), implicit regularization (gradient descent has a bias toward simple solutions), and luck (random initialization happens to land near good solutions).

However, this empirical success masks a critical limitation: for many problem structures, no efficient algorithm (deterministic or randomized) can find the global optimum. These are computational limits, distinct from algorithmic limits. We cannot overcome them with better algorithms; they are intrinsic to the problem structure.

For example, many combinatorial optimization problems are NP-complete. Finding the optimal selection of features subset to include in a model (subset selection) is NP-hard. Finding the optimal hyperparameter values is generally NP-hard. Even the problem of learning a parsimonious decision tree consistent with data is NP-complete.

When problems are computationally hard, we have three options: 1. Accept approximation: use heuristics that find near-optimal solutions 2. Relax the problem: optimize a different (easier) objective that approximates the original 3. Exploit structure: identify special cases or domain properties that make the problem easier

Option 1 is common in deep learning (gradient descent is a heuristic for non-convex optimization). Option 2 is used everywhere (convex relaxations of combinatorial problems). Option 3 requires domain expertise but can yield dramatic speedups.

However, none of these options change the fundamental constraint: for some problems, finding exact optimal solutions is computationally impossible. We must accept suboptimality.

Common Misconceptions About “More Data + Bigger Models”

In recent years, a powerful trend has emerged: scale for scale’s sake. The intuition is clear: more data should improve any model, and bigger models should have more capacity to fit complex patterns. Therefore, we should maximize scale everywhere: biggest models, largest datasets, most compute.

This intuition is partially correct but dangerously incomplete. There are multiple ways that increased scale can hurt rather than help:

Overfitting in the Overparameterized Regime: When a model has many more parameters than training samples ($d \gg n$), more parameters do not necessarily help. If a model has 100 million parameters and only 1 million training samples, adding another 100 million parameters provides no additional information; the model simply memorizes noise. Empirically, modern neural networks in this regime exhibit the “double descent” phenomenon: test error first decreases as model size increases (standard learning curve), then increases dramatically as the model enters the overparameterized regime (overfitting), then decreases again (implicit regularization). The second descent is not guaranteed; it depends on properties of the data and architecture.

Misalignment Amplification: Larger models trained on more data more faithfully encode patterns in the training data. If those patterns reflect bias (historical discrimination, measurement error, distributional shift), larger models amplify the bias. A small model might be too simple to encode the bias; a large model faithfully captures it. In hiring recommendation systems, more data and larger models sometimes lead to worse fairness, not better accuracy.

Computational Inefficiency: Training a model scales with the product of model size and data size. Computational cost is often neither model nor data size but rather the scaling relationship between them. A model with 1 billion parameters trained on 1 billion examples requires more compute than a model with 100 million parameters trained on 100 million examples (if the scaling law is superlinear), even though the latter is “smaller” in both dimensions. Beyond certain thresholds, additional scale brings diminishing returns or negative returns.

Objective Misalignment Under Scale: At small scale, we can carefully tune the loss function and validate that optimization produces reasonable results. At large scale (1 billion parameters, billions of examples, weeks of training), validation becomes impractical. We lose feedback about whether the learned representations actually correspond to what we want. A loss function that was roughly aligned at small scale can become deeply misaligned at large scale.

Emergence of Unexpected Behaviors: In large language models and other large-scale systems, increasing scale sometimes produces qualitatively new behaviors (called “emergent” abilities) that were entirely unpredicted by smaller-scale experiments. Some emergent abilities are beneficial (improved reasoning, multitasking), while others are harmful (jailbreaks, adversarial vulnerability, tendency to generate misinformation). We cannot reliably predict which will emerge, and thus cannot control them through objective specification.

The lesson is not that scale is bad. Scale has enabled genuine breakthroughs in language modeling, vision, and scientific simulation. Rather, the lesson is that scale is a double-edged sword. More data and larger models are effective tools only when the objective is well-specified, the data is representative, and the inductive biases of the model align with the problem structure. Blindly maximizing scale without attending to these meta-level considerations often backfires.

ML Connection

Capacity vs Generalization Limits

Throughout machine learning education, one concept is drilled into students: the bias-variance trade-off. A model with low capacity (few parameters, simple structure) fits training data poorly but generalizes well to unseen data (high bias, low variance). A model with high capacity (many parameters, flexible structure) fits training data perfectly but generalizes poorly (low bias, high variance). The goal is to find the “sweet spot” in the middle: choose a model capacity that balances fitting training data and generalizing to new data.

This mental model is useful but incomplete, especially for modern deep learning. The reality is more nuanced and depends on the relationship between model capacity, data size, and problem structure.

Within the Classical Regime ($n > d$, more samples than features), the trade-off is precisely as taught: increasing model capacity increases training accuracy but decreases test accuracy beyond a certain point. This is the regime in which classical statistics operates.

However, modern deep learning operates in the Overparameterized Regime ($d \gg n$, vastly more parameters than samples). In this regime, something surprising happens: test error can decrease even as the model enters the overparameterized regime and training error goes to zero. This is the “double descent” phenomenon, first rigorously documented by Belkin, Montanari, and co-authors.

The double descent has profound implications for how we think about generalization:

Implicit Regularization: In the overparameterized regime, gradient descent does not use explicit regularization (fewer parameters, L1/L2 penalties) but instead implicitly regularizes through early stopping, architecture choices, and the geometry of high-dimensional spaces. This implicit regularization is powerful but not well understood.
Interpolation is Possible but Not Necessary for Good Generalization: A model that fits all training data perfectly (interpolates) can still generalize well, contradicting classical wisdom. The key is how the model interpolates: some interpolations generalize, others overfit catastrophically.
Different Local Optima Have Different Generalization: Not all local optima of the training loss are equally good for generalization. Gradient descent is biased toward certain types of solutions (those reachable by continuous movement from the random initialization) that generalize better than other solutions with equally low training loss.

These phenomena reveal that “more parameters = worse generalization” is not a universal law but rather a regime-dependent statement. Beyond the interpolation threshold, more parameters can improve generalization if the model learns the right way.

However, this does not mean that bigger is always better. The generalization performance in the overparameterized regime depends critically on the alignment between the model’s inductive bias and the problem structure. A model with the right inductive bias (e.g., convolutional structure for vision, transformer structure for language) scales beautifully. A model with misaligned inductive bias (e.g., fully-connected layers for image data) scales poorly, even in the overparameterized regime.

Non-Convex Landscapes at Scale

The loss landscape of a deep neural network is non-convex: it has multiple local minima, saddle points, and regions where gradient information is uninformative. This has several implications for optimization and generalization.

Local Minima: A local minimum is a point $\theta$ where the gradient is zero and all nearby points have higher loss. In principle, gradient descent can get stuck at suboptimal local minima. However, in high-dimensional spaces, true local minima (where all directions increase loss) are rare; instead, saddle points (where some directions decrease loss, others increase) are more common. High-dimensional geometry is counterintuitive, and local minima are not the primary obstacle to optimization in deep learning.

Saddle Points and Plateaus: More problematic than local minima are regions where the gradient is very small, either because you are near a saddle point or because you are on a flat plateau. Gradient descent moves slowly through these regions, requiring many iterations. The learning rate (step size) must be carefully tuned to navigate these regions without diverging.

Mode Connectivity: Perhaps surprisingly, different local minima of the loss landscape are sometimes connected by low-loss paths in the parameter space. This mode connectivity implies that the landscape is less compartmentalized than it seems: you can travel from one local minimum to another without going through high-loss regions. This has implications for ensemble methods and suggests that the particular local minimum reached is less important than we initially thought.

Loss Landscape Geometry and Generalization: The geometry of the loss landscape at the local minimum affects generalization. Solutions that lie in “sharp” minima (where the loss increases steeply in all directions) tend to generalize poorly, while solutions in “flat” minima generalize better. This is one heuristic explanation for why gradient descent generalizes: early stopping and small learning rates bias gradient descent toward flat minima.

By varying the learning rate, batch size, and initialization, we can steer gradient descent toward different local minima with different generalization properties. This is why hyperparameter tuning matters so much in practice: the choice of optimizer and its hyperparameters determines not just the speed of convergence but also which part of the loss landscape we explore.

Optimization Bias vs Inductive Bias

Machine learning models encode two distinct biases:

Inductive Bias: The preferences built into the model class and structure. A convolutional neural network has inductive bias toward local patterns and translation equivariance. A recurrent network has inductive bias toward sequential relationships. A decision tree has bias toward hierarchical, axis-aligned splits. These biases make the model well-suited for some problems and poorly suited for others.
Optimization Bias: The preferences encoded in the learning algorithm (how we optimize). Gradient descent does not explore the loss landscape uniformly; it is biased toward solutions reachable by continuous paths from the random initialization and toward solutions consistent with the implicit regularization of small learning rates and early stopping.

These two biases interact in complex ways. A convolutional network trained with gradient descent on image data learns a hierarchy of features (low-level edges, mid-level textures, high-level parts) that are simultaneously favored by the inductive bias (local structure in the architecture) and the optimization algorithm (gradient descent implicitly prefers solutions that are reachable from random initialization in a reasonable number of steps).

However, optimization bias and inductive bias can also conflict. A deep neural network has inductive bias toward complex, hierarchical features, but gradient descent can get stuck in shallow solutions that do not exploit the deep structure. In these cases, we need to adjust the optimization algorithm (learning rate schedules, batch normalization, residual connections) to align it with the inductive bias.

Moreover, optimization bias can introduce spurious correlations not intended by the model architect. If the data contains spurious correlations (features that are correlated with the target in training but not in deployment), gradient descent discovers and encodes these correlations. The model has high training accuracy but fails in deployment.

Compute-Constrained Learning

Compute is the ultimate constraint in machine learning. Given a fixed amount of compute (GPU-hours, TPU-years), how should we allocate it between model size, training data size, and algorithmic sophistication?

This is not an academic question. The training cost of a modern large language model is on the order of $10^{20}$ to $10^{24}$ floating-point operations (FLOPs). With current hardware, this translates to millions of dollars. Choosing the wrong allocation wastes resources.

The relationship between compute, model size $d$, and data size $n$ is not linear. Empirical scaling laws (discovered by OpenAI, DeepMind, and others) suggest that loss decreases roughly as:

\[ \mathcal{L} \approx a \cdot d^{-\alpha} + b \cdot n^{-\beta} \]

where $\alpha$ and $\beta$ are typically around 0.1 to 0.2 (loss decreases as a power law in both model and data size). Importantly, $\alpha \approx \beta$ for large language models, meaning model and data scaling are roughly equivalent in terms of improvement rate.

However, training cost is $C \approx d \cdot n$ (roughly proportional to the product of model and data size, though constants depend on architecture). Given a fixed compute budget $C_{\text{total}}$, the question becomes: what choice of $d$ and $n$ (subject to $d \cdot n \approx C_{\text{total}}$) minimizes loss?

If loss scales as $\mathcal{L} \approx (C / d \cdot n)^{-\gamma}$ for some exponent $\gamma$, then loss is roughly independent of how we split compute between model and data size. However, if there are data scarcity constraints (only so much high-quality data available) or model size constraints (hardware limits on model parallelism), the allocation changes.

More subtly, the relationship between compute, data, and performance is not smooth. There exist “regime transitions” where increasing compute provides no additional benefit, or even hurts, until you cross a threshold and suddenly become compute-efficient again. These transitions reflect phase changes in the problem structure: a barrier that requires a jump in capability to overcome.

This compute-constrained learning regime reveals a fundamental tension: the optimal algorithm depends on the computational budget. A $1 million training run and a $1 billion training run do not simply differ in scope; they differ in the algorithms, datasets, and models that are optimal. Scaling does not commute with algorithm choice.