Chapter 12 — Robustness, Adversarial Examples, Stability

Overview

Purpose of the Chapter

This chapter establishes robustness as a core objective of machine learning by analyzing model behavior under worst-case perturbations rather than only average-case performance. It develops the threat-model, attack, and training framework needed to evaluate and improve reliability in adversarial and high-stakes deployment settings.

Role in Book Arc

This chapter develops robustness as a first-class objective in machine learning, moving from standard generalization assumptions to adversarial and worst-case perspectives. After Chapter 11 connected optimization geometry to generalization, we now ask how models behave under targeted perturbations and distribution stress. The chapter positions robustness as a core requirement for trustworthy deployment, not an optional add-on.

Core Concept and Supporting Concepts

Main Concept: Adversarial robustness is the ability to maintain reliable predictions under constrained, worst-case input perturbations, requiring explicit threat modeling, evaluation, and training design.

Supporting Concepts:

Threat models define meaning: robustness claims are norm- and budget-specific.
Worst-case risk differs from average risk: high clean accuracy can hide brittle behavior.
Adversarial examples are geometric: high-dimensional boundaries can be surprisingly close.
Attack strength matters: weak evaluation can overstate robustness.
Min-max optimization is central: robust training solves inner attacks and outer learning jointly.
Robustness-accuracy tradeoffs are real: gains often require controlled sacrifice elsewhere.
Certified and empirical robustness differ: proofs and attack-based tests serve different roles.
Stability links to reliability: sensitivity control improves behavior under perturbation.
Evaluation must be multi-axis: single-attack metrics are insufficient for deployment.
Robustness connects to governance: safety, security, and accountability depend on it.

Learning Outcomes

By the end of this chapter, you will be able to:

Define adversarial examples and robust risk under explicit threat models.
Derive core attack updates such as FGSM/PGD under norm constraints.
Compute min-max training objectives and interpret their optimization behavior.
Differentiate empirical robustness from certified robustness guarantees.
Evaluate models with stronger and weaker adversarial protocols.
Analyze robustness-accuracy and robustness-compute tradeoffs.
Apply regularization and stability methods that improve perturbation tolerance.
Diagnose gradient masking and false robustness indicators.
Design multi-attack robustness scorecards for release decisions.
Connect robustness concepts to practical ML safety requirements.

Scope: What This Chapter Covers

This chapter covers the following conceptual and computational scope.

Threat-model formalization: $\ell_\infty$, $\ell_2$, sparse, and semantic perturbations.
Attack mechanics: gradient-based and black-box adversarial construction.
Robust optimization: min-max objectives and adversarial training workflows.
Certification concepts: randomized smoothing and provable radius interpretation.
Stability tools: Jacobian control, Lipschitz ideas, and regularization strategies.
Deployment evaluation: multi-attack, multi-shift robustness validation.

Connections to Other Chapters

This chapter connects directly to the full-book arc through the following progression.

Chapter 11: extends geometry and implicit-bias ideas into robustness stress testing.
Optimization chapters: reuses gradient methods within adversarial inner loops.
Generalization chapters: compares average-case and worst-case reliability notions.
Evaluation chapters: motivates stronger validation standards beyond clean accuracy.
Systems chapters: informs production safeguards and release criteria.
Safety themes: supports trustworthy deployment under adversarial pressure.

Questions This Chapter Answers

This chapter answers the following fundamental questions, aligned with proof and implementation exercises.

What is an adversarial example? How do tiny perturbations cause large prediction shifts?
How do we define robust risk? What does worst-case evaluation measure?
Which attacks are trustworthy benchmarks? How do we avoid weak-test conclusions?
How does PGD adversarial training work? Why is it a min-max problem?
What tradeoffs are unavoidable? Where do robustness and clean accuracy conflict?
What is certified robustness? When do formal guarantees matter most?
How do stability and Lipschitz controls help? What practical gains do they provide?
How should we evaluate robustness for release? What multi-attack checks are required?
How does robustness relate to generalization? Which links are reliable and which are weak?
How do we build robust ML pipelines? What steps are essential in practice?

Concrete ML Examples

This section grounds the abstract theory in concrete worked examples that share a consistent stepwise structure.

PGD Adversarial Training for Image Classifiers
1. Concept summary: PGD adversarial training improves worst-case robustness by training against the perturbations found by an inner maximization attack.
2. Problem statement: compute one PGD attack step and verify that the perturbation remains inside the $\ell_\infty$ budget.
3. Problem setup: During adversarial training, each batch sample is first perturbed by gradient ascent on the input to maximize the loss. The perturbation is then projected back into the allowed $\epsilon$-box around the clean input. We validate one step for a scalar pixel coordinate.
4. Explicit values: clean pixel $x=0.50$, current perturbation $\delta_0=0.01$, gradient sign $+1$, step size $\alpha=0.02$, budget $\epsilon=0.03$.
5. Formula with symbols defined: PGD step $\delta_{1}=\Pi_{[-\epsilon,\epsilon]}(\delta_0+\alpha\,\text{sign}(\nabla_x\ell))$, where $\Pi_{[-\epsilon,\epsilon]}$ clips its argument to the interval $[-\epsilon,\epsilon]$; adversarial pixel $x_{adv}=x+\delta_1$.
6. Plug-in step: raw update $\delta_0+\alpha=0.01+0.02=0.03$; projection to $[-0.03,0.03]$ keeps $\delta_1=0.03$; then $x_{adv}=0.50+0.03$.
7. Computed result: $x_{adv}=0.53$, and the perturbation exactly saturates the budget: $\|\delta_1\|_\infty=0.03$.
8. Decision / interpretation: the step is valid. It uses the full budget without violating the threat constraint, as robust training requires.
9. Sensitivity check: if $\alpha$ increases to $0.05$, the raw update is $0.06$ and projection still enforces $\delta_1=0.03$, but steps this large relative to $\epsilon$ can reduce inner-optimization quality.
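
The arithmetic in the steps above can be checked with a few lines of Python. This is a minimal sketch for a single scalar coordinate only. The helper names `sign`, `clip`, and `pgd_step_scalar` are illustrative and not taken from any library, and a real attack would apply the same update to whole tensors and also clip $x_{adv}$ to the valid pixel range.

```python
def sign(v: float) -> float:
    """Return -1.0, 0.0, or +1.0 according to the sign of v."""
    return float((v > 0) - (v < 0))


def clip(v: float, lo: float, hi: float) -> float:
    """Project a scalar onto the interval [lo, hi]."""
    return max(lo, min(hi, v))


def pgd_step_scalar(delta: float, grad: float, alpha: float, eps: float) -> float:
    """One l_inf PGD ascent step on a single coordinate, followed by projection."""
    return clip(delta + alpha * sign(grad), -eps, eps)


# Values from the worked example.
x = 0.50
delta_1 = pgd_step_scalar(delta=0.01, grad=+1.0, alpha=0.02, eps=0.03)
x_adv = x + delta_1

# Sensitivity check: a larger step is clipped back to the same budget.
delta_big_step = pgd_step_scalar(delta=0.01, grad=+1.0, alpha=0.05, eps=0.03)

print(round(delta_1, 4), round(x_adv, 4), round(delta_big_step, 4))
```

Both calls return a perturbation on the boundary of the budget, which is why the projection, not the step size, determines the final perturbation in this example.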

Certified Robustness via Randomized Smoothing
1. Concept summary: randomized smoothing converts a base classifier into one with a certifiable $\ell_2$ robustness radius.
2. Problem statement: estimate the certified radius from the class probability separation under Gaussian noise.
3. Problem setup: We sample noisy inputs and estimate the top-class probability $p_A$ and the runner-up probability $p_B$. Certification is possible when $p_A>p_B$ with sufficient separation. We compute a simplified radius proxy used for deployment screening.
4. Explicit values: noise scale $\sigma=0.25$, estimated $p_A=0.90$, $p_B=0.08$.
5. Formula with symbols defined: proxy radius $r\approx\sigma\big(\Phi^{-1}(p_A)-\Phi^{-1}(p_B)\big)/2$, where $\Phi^{-1}$ is the standard normal inverse CDF.
6. Plug-in step: $\Phi^{-1}(0.90)\approx1.282$, $\Phi^{-1}(0.08)\approx-1.405$, gap $=2.687$; multiply by $\sigma/2=0.125$.
7. Computed result: $r\approx0.125\times2.687\approx0.336$.
8. Decision / interpretation: the prediction is stable for $\ell_2$ perturbations up to roughly 0.34 under this estimate. A formal certificate would replace the point estimates with a lower confidence bound on $p_A$ and an upper confidence bound on $p_B$.
9. Sensitivity check: if $p_A$ drops to 0.75 with $p_B$ held at 0.08, the radius shrinks to about $0.125\times(0.674+1.405)\approx0.26$, indicating weaker assurance under shift.
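
The radius proxy above is easy to reproduce with the standard library, since `statistics.NormalDist().inv_cdf` provides $\Phi^{-1}$. The sketch below assumes point estimates of $p_A$ and $p_B$, exactly as in the worked example; the function name `smoothing_radius_proxy` is hypothetical, and no confidence bounds are computed.

```python
from statistics import NormalDist


def smoothing_radius_proxy(p_a: float, p_b: float, sigma: float) -> float:
    """Proxy radius sigma/2 * (Phi^{-1}(p_a) - Phi^{-1}(p_b)); 0 if no separation."""
    if p_a <= p_b:
        return 0.0  # the top class does not dominate, so nothing can be certified
    inv_cdf = NormalDist().inv_cdf  # standard normal inverse CDF
    return 0.5 * sigma * (inv_cdf(p_a) - inv_cdf(p_b))


# Values from the worked example.
r = smoothing_radius_proxy(p_a=0.90, p_b=0.08, sigma=0.25)

# Sensitivity check: lower top-class probability, same runner-up.
r_low = smoothing_radius_proxy(p_a=0.75, p_b=0.08, sigma=0.25)

print(round(r, 3), round(r_low, 3))
```

Because the radius scales with $\sigma$ but $p_A$ usually falls as $\sigma$ grows, the noise scale has to be tuned rather than simply increased.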

Input Jacobian Regularization for Stability
1. Concept summary: Jacobian penalties reduce output sensitivity to small input perturbations.
2. Problem statement: measure total training loss when adding Jacobian regularization to a baseline objective.
3. Problem setup: We train a model with standard task loss plus a penalty on input gradients. The penalty discourages abrupt response to tiny sensor noise and improves stability. We compute one batch objective value to verify weighting.
4. Explicit values: base loss $L_{task}=0.62$, Jacobian norm squared $\|\nabla_x f\|_F^2=3.4$, regularization weight $\lambda_J=0.05$.
5. Formula with symbols defined: total loss $L=L_{task}+\lambda_J\|\nabla_x f\|_F^2$.
6. Plug-in step: regularization term $=0.05\times3.4=0.17$; add to base loss $0.62+0.17$.
7. Computed result: $L=0.79$.
8. Decision / interpretation: penalty is contributing materially; if validation robustness improves, current $\lambda_J$ may be appropriate.
9. Sensitivity check: doubling $\lambda_J$ to 0.10 raises total loss to 0.96 and may over-smooth informative high-frequency signals.
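
To make the penalty term concrete, the sketch below estimates $\|\nabla_x f\|_F^2$ by central finite differences and then forms the regularized objective. Everything here is an assumption for illustration: `toy_model` is a hypothetical linear map whose weights were chosen so that the squared Frobenius norm equals the 3.4 used above, and in practice the Jacobian would come from automatic differentiation rather than finite differences.

```python
def jacobian_frobenius_sq(f, x, h=1e-5):
    """Estimate ||J_f(x)||_F^2 with central finite differences."""
    total = 0.0
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += h
        x_minus[j] -= h
        # Column j of the Jacobian: d f / d x_j for every output.
        for a, b in zip(f(x_plus), f(x_minus)):
            total += ((a - b) / (2.0 * h)) ** 2
    return total


def regularized_loss(task_loss, jac_sq, lam):
    """Total loss L = L_task + lambda_J * ||J||_F^2."""
    return task_loss + lam * jac_sq


def toy_model(x):
    """Hypothetical linear model; its Jacobian is [[1.7, 0.5], [0.5, 0.1]]."""
    return [1.7 * x[0] + 0.5 * x[1], 0.5 * x[0] + 0.1 * x[1]]


jac_sq = jacobian_frobenius_sq(toy_model, [0.3, -0.2])   # about 3.4
total = regularized_loss(0.62, jac_sq, lam=0.05)          # about 0.79
total_doubled = regularized_loss(0.62, jac_sq, lam=0.10)  # about 0.96

print(round(jac_sq, 4), round(total, 4), round(total_doubled, 4))
```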

Robustness Evaluation Beyond Single-Attack Benchmarks
1. Concept summary: true robustness requires multi-attack, multi-shift evaluation rather than one benchmark metric.
2. Problem statement: aggregate a robustness score across several stress tests to decide release readiness.
3. Problem setup: We evaluate the same model under white-box PGD, transfer attacks, and common corruptions. Single metrics can hide brittle failure modes, so we compute average robust accuracy across test families. Release requires the aggregate score to clear the threshold and no single metric to fail catastrophically.
4. Explicit values: PGD accuracy $a_1=61\%$, transfer accuracy $a_2=68\%$, corruption accuracy $a_3=72\%$, minimum acceptable average $\tau=65\%$.
5. Formula with symbols defined: aggregate robustness $A=(a_1+a_2+a_3)/3$.
6. Plug-in step: $A=(61+68+72)/3=201/3$.
7. Computed result: $A=67\%$.
8. Decision / interpretation: the model passes the average robustness threshold, but the weakest case (PGD at 61%) should still be tracked for safety margin.
9. Sensitivity check: if PGD accuracy drops to 54%, the aggregate falls to $(54+68+72)/3\approx64.7\%$, failing the release criterion.
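
A release scorecard along these lines can be sketched as a small function. The average threshold of 65% comes from the example; the per-metric floor of 50% is an assumption introduced here to stand in for the "no catastrophic tail metric" requirement, and the function name `release_decision` is hypothetical.

```python
def release_decision(scores, min_average, min_floor):
    """Combine per-test robust accuracies (in percent) into a release verdict."""
    average = sum(scores.values()) / len(scores)
    weakest = min(scores, key=scores.get)  # name of the lowest-scoring test
    passed = average >= min_average and scores[weakest] >= min_floor
    return {
        "average": average,
        "weakest": weakest,
        "weakest_score": scores[weakest],
        "passed": passed,
    }


# Values from the worked example.
baseline = release_decision(
    {"pgd": 61.0, "transfer": 68.0, "corruption": 72.0},
    min_average=65.0,
    min_floor=50.0,  # assumed floor, not given in the text
)

# Sensitivity check: PGD accuracy drops to 54%.
degraded = release_decision(
    {"pgd": 54.0, "transfer": 68.0, "corruption": 72.0},
    min_average=65.0,
    min_floor=50.0,
)

print(baseline["passed"], round(degraded["average"], 1), degraded["passed"])
```

Reporting the weakest test alongside the average keeps the tail metric visible even when the aggregate passes.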

Definitions

Definition 1: Adversarial Perturbation

Definition: An adversarial perturbation for an input $x$ with respect to a model $f$, true label $y$, and threat model $(p, \epsilon)$ is a vector $\delta \in \mathbb{R}^d$ satisfying the constraint $\|\delta\|_p \leq \epsilon$ (for $p \in \{0, 1, 2, \infty\}$, where the $\ell_0$ “norm” counts nonzero coordinates) such that the perturbed input $x_{\text{adv}} = x + \delta$ changes the prediction: either $\text{argmax}_{c} f(x_{\text{adv}})_c \neq y$ (untargeted attack) or $\text{argmax}_{c} f(x_{\text{adv}})_c = t$ for a specified target class $t \neq y$ (targeted attack).
Assumptions: (1) The perturbation is bounded in a specified norm $\|\delta\|_p \leq \epsilon$, with $p$ and $\epsilon$ defining the threat model. (2) The model $f$ is deterministic and differentiable (for gradient-based attacks). (3) The true label $y$ is known (supervised setting). (4) The perturbation is added directly to the input; other transformation models (e.g., spatial transformations, occlusions) define different threat models.
Notation: $\delta$ denotes the perturbation vector, $x_{\text{adv}} = x + \delta$ the adversarial example, $\epsilon$ the perturbation budget, and $y$ the true label. The norm $\|\cdot\|_p$ is specified explicitly; for images, $\ell_\infty$ (maximum pixel change) and $\ell_2$ (Euclidean distance) are common.
Usage: Adversarial perturbations are the objects of study in adversarial robustness. The goal is to (1) understand when and why they exist (via attack algorithms), (2) defend against them (via robust training or certified defenses), and (3) measure robustness by the smallest perturbation magnitude that causes misclassification. On natural images, an $\ell_\infty$ perturbation with $\epsilon = 8/255 \approx 0.03$ is typically imperceptible to humans yet often causes undefended models to misclassify, revealing the gap between human and model perception.
Valid Example: For image classification of a dog photo, an $\ell_\infty$-perturbation $\delta$ with $\|\delta\|_\infty \leq 8/255$ that causes the classifier to output “cat” with 95% confidence is an adversarial perturbation. The adversarial example $x_{\text{adv}} = x + \delta$ appears visually identical to humans but fools the model.
Failure Case: Under a budget of $\epsilon = 8/255$, a vector $\delta$ with $\|\delta\|_\infty = 0.1$ (large, visible noise) that causes misclassification is not adversarial by this definition, because it violates the perturbation constraint. Likewise, a small $\delta$ that does not change the model’s prediction (the model remains confident in the correct class) is not adversarial, even though it satisfies the norm constraint.
Explicit ML Relevance: Adversarial perturbations expose a fundamental gap between the data distribution $D$ that the model was trained on and the adversarially perturbed distribution. A model with high accuracy on $D$ may have very low accuracy on adversarially perturbed inputs, motivating the study of robust learning.
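
The two conditions in the definition, staying within budget and changing the prediction, can be checked mechanically. The sketch below is illustrative only: `lp_norm`, `is_adversarial`, and the two-feature rule `toy_predict` are hypothetical names and a hypothetical model, not part of any library.

```python
def lp_norm(delta, p):
    """Size of delta under the chosen measure; p=0 counts nonzero entries."""
    if p == 0:
        return float(sum(1 for d in delta if d != 0))
    if p == float("inf"):
        return max(abs(d) for d in delta)
    return sum(abs(d) ** p for d in delta) ** (1.0 / p)


def is_adversarial(predict, x, delta, y, eps, p, target=None):
    """True only if delta is within budget AND the prediction changes as required."""
    if lp_norm(delta, p) > eps:
        return False  # outside the threat model, so not adversarial
    x_adv = [xi + di for xi, di in zip(x, delta)]
    pred = predict(x_adv)
    if target is not None:
        return target != y and pred == target  # targeted attack
    return pred != y  # untargeted attack


def toy_predict(x):
    """Hypothetical two-class rule: class 1 if x[0] > x[1], else class 0."""
    return 1 if x[0] > x[1] else 0


x, y, eps = [0.52, 0.50], 1, 8 / 255
inf = float("inf")
inside_and_flips = is_adversarial(toy_predict, x, [-0.03, 0.03], y, eps, inf)
too_large = is_adversarial(toy_predict, x, [-0.10, 0.10], y, eps, inf)
no_flip = is_adversarial(toy_predict, x, [0.01, -0.01], y, eps, inf)

print(inside_and_flips, too_large, no_flip)  # True False False
```

The second and third cases correspond to the two failure cases above: a perturbation that exceeds the budget, and a perturbation within budget that leaves the prediction unchanged.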

Definition 2: Robust Risk

Definition: The robust risk (or adversarial risk) of a model $f$ with respect to a loss function $\ell$, data distribution $D$, and threat model $\epsilon$ is defined as \[R_{\text{robust}}(f) = \mathbb{E}_{(x,y) \sim D} \left[ \max_{\|\delta\|_p \leq \epsilon} \ell(f(x + \delta), y) \right] \] This represents the expected loss under the worst-case perturbation within the threat model.
Assumptions: (1) The data distribution $D$ is fixed and known (or samples from it are available). (2) The threat model (norm $p$ and budget $\epsilon$) is specified. (3) The inner maximization is computable or can be approximated. (4) The loss $\ell$ is bounded (e.g., $\ell \in [0, L]$ for some $L$).
Notation: $R_{\text{robust}}$ denotes robust risk (as opposed to standard risk $R = \mathbb{E}[\ell(f(x), y)]$). The threat model parameters $p$ and $\epsilon$ should be specified explicitly; subscripts like $R_{\text{robust}}^{\ell_\infty, \epsilon}$ clarify the norm and budget.
Usage: Robust risk measures worst-case performance on adversarially perturbed data. A model with low robust risk performs well on average even when an adversary perturbs each input within the budget $\epsilon$. Robust risk is always at least as large as standard risk, $R_{\text{robust}} \geq R$, because $\delta = 0$ is an admissible perturbation. Equality holds exactly when, for almost every $(x, y)$, the clean input already attains the maximum loss within its $\epsilon$-ball; a loss that is constant on each ball is one such case.
Valid Example: For a classification model on MNIST with cross-entropy loss, $\ell_\infty$ threat model with $\epsilon = 0.3$, the robust risk is the average of the maximum cross-entropy loss over all pixel perturbations within $\pm 0.3$ for each test image. If the maximum loss per image is around 2.0 (misclassification with high confidence), the robust risk is approximately 2.0.
Failure Case: Computing the true robust risk requires solving the inner maximization exactly, which is intractable in general for neural networks (exact robustness verification of ReLU networks is NP-hard). Practical estimates of robust risk rely on attacks (PGD, etc.) that may not find the true worst-case perturbation, so they underestimate the robust risk. Furthermore, robust risk evaluated on a finite test set (empirical robust risk) may differ from the true population robust risk due to sampling variability.
Explicit ML Relevance: The goal of adversarial training is to minimize robust risk. Algorithms like PGD-based adversarial training iteratively reduce robust risk by training on worst-case perturbations found by PGD. Understanding robust risk is essential for evaluating robustness in practice.
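
For a linear binary classifier the inner maximization has a closed form, which makes the relation $R_{\text{robust}} \geq R$ easy to see numerically. The sketch below assumes a score $w^\top x + b$, labels in $\{-1, +1\}$, the hinge loss, and an $\ell_\infty$ budget: an adversary can lower the margin $y(w^\top x + b)$ by exactly $\epsilon\|w\|_1$, and because the hinge loss never increases with the margin, that is the worst case. The data, weights, and function names are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hinge(margin):
    """Hinge loss as a function of the margin y * score."""
    return max(0.0, 1.0 - margin)


def standard_and_robust_risk(w, b, data, eps):
    """Average clean and worst-case hinge loss for a linear classifier."""
    w_l1 = sum(abs(wi) for wi in w)
    clean, robust = 0.0, 0.0
    for x, y in data:
        margin = y * (dot(w, x) + b)
        clean += hinge(margin)
        # An l_inf adversary with budget eps lowers the margin by eps * ||w||_1.
        robust += hinge(margin - eps * w_l1)
    return clean / len(data), robust / len(data)


# Hypothetical dataset and weights.
data = [([1.0, 0.5], +1), ([-0.8, 0.2], -1), ([0.3, -0.1], +1)]
standard_risk, robust_risk = standard_and_robust_risk(
    w=[2.0, -1.0], b=0.0, data=data, eps=0.1
)

print(round(standard_risk, 3), round(robust_risk, 3))
```

For deep networks no such closed form exists, which is why the inner maximum is approximated with attacks such as PGD.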

Definition 3: Empirical Risk

Definition: The empirical risk (or training risk) of a model $f$ with respect to a loss function $\ell$ and a finite dataset $S = \{(x_i, y_i)\}_{i=1}^m$ is defined as \[R_{\text{emp}}(f, S) = \frac{1}{m} \sum_{i=1}^m \ell(f(x_i), y_i) \] This is the average loss on the training set. For adversarial settings, the adversarial empirical risk is \[R_{\text{emp}}^{\text{robust}}(f, S) = \frac{1}{m} \sum_{i=1}^m \max_{\|\delta\|_p \leq \epsilon} \ell(f(x_i + \delta), y_i) \]
Assumptions: (1) The dataset $S$ is fixed and finite. (2) The loss $\ell$ is computable for each example. (3) For adversarial empirical risk, the inner maximization is solved exactly or approximately (via attacks).
Notation: $R_{\text{emp}}$ denotes empirical risk on the dataset $S$. Superscripts clarify whether the risk is standard or adversarial. The dataset $S$ should be specified; empirical risk on the training set, validation set, or test set are all distinct.
Usage: Empirical risk is measurable from data, whereas population risk (expected loss) requires integration over the data distribution. The gap between empirical and population risk is the generalization gap, studied extensively in learning theory. Empirical risk is central to optimization during training: algorithms like SGD minimize empirical risk, hoping this translates to low population risk via generalization.
Valid Example: For a classifier trained on MNIST (60,000 training images), the empirical risk is the average cross-entropy loss on the 60,000 training examples. If 59,500 images achieve loss 0.1 and 500 achieve loss 1.0, the empirical risk is approximately $(59500 \times 0.1 + 500 \times 1.0) / 60000 \approx 0.108$.
Failure Case: A model achieving zero empirical risk (perfect training accuracy) does not necessarily have low population risk. This is the classical overfitting scenario: the model memorizes the training set but learns no generalizable patterns. In adversarial settings, a model can achieve low adversarial empirical risk through overfitting to the specific perturbations used during adversarial training, while remaining vulnerable to other attacks.
Explicit ML Relevance: Training algorithms minimize empirical risk, but the true objective is to minimize population risk. The empirical risk on adversarially perturbed training data (adversarial empirical risk) is the objective minimized by adversarial training algorithms such as PGD-based adversarial training.

Definition 4: Lipschitz Continuity (Robustness Context)

Definition: A function $f: \mathbb{R}^d \to \mathbb{R}^k$ is $L$-Lipschitz continuous (with respect to norms $\|\cdot\|_p$ on the domain and $\|\cdot\|_q$ on the codomain) if there exists a constant $L \geq 0$ such that for all $x, x' \in \mathbb{R}^d$, \[\|f(x) - f(x')\|_q \leq L \|x - x'\|_p \] The Lipschitz constant of $f$ is $L_f = \inf \{L : f \text{ is } L\text{-Lipschitz}\}$.
Assumptions: (1) The norms on the domain and codomain are specified (e.g., both $\ell_2$ or mixed $\ell_2$ and $\ell_\infty$). (2) The function is defined on a connected domain. (3) For neural networks, Lipschitz constants are typically analyzed for the full network (input to output) or per-layer.
Notation: $L_f$ or $L$ denotes the Lipschitz constant. Subscripts specify the norms, e.g., $L_{f, \ell_\infty \to \ell_2}$ for the Lipschitz constant mapping $\ell_\infty$-norm inputs to $\ell_2$-norm outputs.
Usage: If $f$ is $L$-Lipschitz, then small input changes lead to proportionally small output changes: $\|f(x + \delta) - f(x)\| \leq L \|\delta\|$. For robustness, a low Lipschitz constant ensures that the model’s predictions are stable to perturbations. Specifically, if the model’s loss changes from $\ell(f(x), y)$ to $\ell(f(x + \delta), y)$ and the loss function $\ell$ is also Lipschitz with constant $L_\ell$, the total change is bounded: $|\ell(f(x + \delta), y) - \ell(f(x), y)| \leq L_\ell L_f \|\delta\|$.
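Valid Example: A linear function $f(x) = Wx + b$ with $W \in \mathbb{R}^{k \times d}$ is $\sigma_{\max}(W)$-Lipschitz with respect to the $\ell_2$ norm on both domain and codomain, where $\sigma_{\max}(W)$ is the largest singular value of $W$. For a 10-class classification network with logit outputs, the cross-entropy loss is Lipschitz in the logits with constant at most $\sqrt{2}$ in $\ell_2$, because its gradient with respect to the logits is the softmax output minus the one-hot label.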
Failure Case: A neural network with unbounded weights can have an arbitrarily large Lipschitz constant. A convolutional layer without regularization may have a very large constant, say $L_f = 1000$, meaning a small perturbation $\|\delta\| = 0.1$ can change the output by up to $100$, potentially crossing decision boundaries and causing misclassification. Conversely, a constant function is 0-Lipschitz but useless for learning (it outputs the same prediction for all inputs).
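Explicit ML Relevance: Controlling the Lipschitz constant of neural networks via spectral normalization or Lipschitz regularization is a defense mechanism against adversarial examples. A small Lipschitz constant yields a robustness guarantee only in combination with a prediction margin: the larger the margin relative to $L_f$, the farther the input must move before the prediction can change.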
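
For a single linear layer, the $\ell_2$ Lipschitz constant is the largest singular value of the weight matrix, which power iteration can estimate. The sketch below is a minimal pure-Python version under stated assumptions: the matrix `W` is a made-up example, the helper names are hypothetical, and the all-ones starting vector is assumed not to be orthogonal to the top singular direction.

```python
import math


def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def l2(v):
    return math.sqrt(sum(a * a for a in v))


def spectral_norm(W, iters=200):
    """Estimate sigma_max(W) by power iteration on W^T W."""
    Wt = transpose(W)
    v = [1.0] * len(W[0])  # assumed not orthogonal to the top singular vector
    for _ in range(iters):
        u = matvec(Wt, matvec(W, v))
        norm_u = l2(u)
        if norm_u == 0.0:
            return 0.0
        v = [a / norm_u for a in u]  # v stays unit length
    return l2(matvec(W, v))


W = [[3.0, 0.0], [0.0, 1.0]]  # hypothetical layer weights, sigma_max = 3
L_f = spectral_norm(W)

# Check the Lipschitz bound ||W delta|| <= L_f * ||delta|| for one perturbation.
delta = [0.06, 0.08]
output_change = l2(matvec(W, delta))
bound = L_f * l2(delta)

print(round(L_f, 6), round(output_change, 4), round(bound, 4))
```

The product of per-layer spectral norms gives an upper bound on the Lipschitz constant of a feed-forward network with 1-Lipschitz activations, although that bound is often loose.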

Definition 5: Robust Loss Function

Definition: A robust loss function (or loss function with robustness properties) for a classification task with $k$ classes is a loss $\ell: \mathbb{R}^k \times \{1, \ldots, k\} \to \mathbb{R}_{\geq 0}$ designed to induce robustness in the trained model. Common robust loss functions include: (1) The adversarial cross-entropy loss, $\ell(f(x + \delta^*), y)$ where $\delta^*$ is the worst-case perturbation within the threat model, used in adversarial training. (2) The margin loss, $\ell_{\text{margin}}(f(x), y) = \max(0, m - (f(x)_y - \max_{c \neq y} f(x)_c))$, where $m$ is the margin threshold; minimizing this encourages the model to classify correctly with a safety margin. (3) The TRADES loss, $\ell_{\text{TRADES}}(f(x), y) = \ell_{\text{CE}}(f(x), y) + \beta\, \text{KL}(f(x) \| f(x + \delta^*))$, which adds to the standard cross-entropy a KL divergence term weighted by $\beta > 0$; here $f$ outputs class probabilities and $\delta^*$ is chosen to maximize the KL term rather than the cross-entropy. Minimizing this loss encourages consistent outputs on clean and adversarially perturbed inputs.
Assumptions: (1) The model outputs logits or probabilities that can be compared to the true label. (2) For adversarial losses, the worst-case perturbation $\delta^*$ is computed (via an attack algorithm) or approximated. (3) The loss is differentiable or can be approximated for gradient-based optimization.
Notation: $\ell_{\text{robust}}$ or $\ell_{\text{adv}}$ denote robust loss functions. Subscripts clarify the loss type (e.g., $\ell_{\text{CE}}$ for cross-entropy, $\ell_{\text{margin}}$ for margin loss).
Usage: Standard loss functions (cross-entropy, MSE) optimize for accuracy on clean data but do not account for robustness. Robust loss functions explicitly penalize misclassifications under perturbations, steering the optimization toward robust solutions. The choice of robust loss affects the robustness-accuracy trade-off: margin-based losses may sacrifice accuracy for robustness, while TRADES-style losses attempt to balance both.
Valid Example: In adversarial training, the loss is $\ell(f(x + \delta^*), y)$ where $\delta^*$ is found via PGD targeting misclassification. The gradient of this loss with respect to model parameters encourages robustness to the specific adversarial attack used for training.
Failure Case: Using only the standard cross-entropy loss $\ell(f(x), y)$ during training typically leads to non-robust models, despite potentially high standard accuracy. A model trained this way may achieve 99% accuracy on clean MNIST but drop to 10% accuracy under small $\ell_\infty$-perturbations.
Explicit ML Relevance: The choice of robust loss function directly affects the optimization objective during adversarial training and thus the properties of the learned model. Different robust loss functions lead to different robustness-accuracy trade-offs.
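
The margin and TRADES-style losses from the definition can be evaluated directly on logits. The sketch below is illustrative: the clean and adversarial logits are made-up numbers standing in for the outputs at $x$ and $x + \delta^*$, the weight $\beta = 6$ is an assumed value, and no attack is actually run.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, y):
    return -math.log(softmax(logits)[y])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def margin_loss(logits, y, m=1.0):
    """max(0, m - (true logit - best competing logit))."""
    best_other = max(z for c, z in enumerate(logits) if c != y)
    return max(0.0, m - (logits[y] - best_other))


def trades_loss(clean_logits, adv_logits, y, beta):
    """Clean cross-entropy plus beta * KL(clean probs || adversarial probs)."""
    kl = kl_divergence(softmax(clean_logits), softmax(adv_logits))
    return cross_entropy(clean_logits, y) + beta * kl


clean_logits = [2.0, 0.5, -1.0]  # hypothetical outputs at x
adv_logits = [0.8, 1.1, -0.9]    # hypothetical outputs at x + delta*
y = 0

clean_margin_loss = margin_loss(clean_logits, y)  # margin 1.5 >= 1, so loss 0
adv_margin_loss = margin_loss(adv_logits, y)      # margin -0.3, so loss 1.3
ce = cross_entropy(clean_logits, y)
trades = trades_loss(clean_logits, adv_logits, y, beta=6.0)

print(clean_margin_loss, round(adv_margin_loss, 2), round(ce, 4), round(trades, 4))
```

When the clean and perturbed outputs coincide, the KL term vanishes and the TRADES-style loss reduces to ordinary cross-entropy.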

Definition 6: Algorithmic Stability

Definition: An algorithm $\mathcal{A}$ that receives a training dataset $S = \{(x_i, y_i)\}_{i=1}^m$ and outputs a model $f_S$ is $\epsilon$-stable if for any two datasets $S$ and $S'$ that differ in at most one example (i.e., one example is removed or replaced), the outputs are stable: for all $(x, y)$ not in either $S$ or $S'$, \[|\ell(f_S(x), y) - \ell(f_{S'}(x), y)| \leq \epsilon \] and similarly for robustness: \[|\ell(f_S(x + \delta), y) - \ell(f_{S'}(x + \delta), y)| \leq \epsilon \text{ for all } \|\delta\| \leq \epsilon' \] Here $\epsilon$ is the stability parameter and $\epsilon'$ is the perturbation budget of the threat model; the two play different roles.
Assumptions: (1) The loss function $\ell$ is bounded and measurable. (2) The algorithm is deterministic (or stability is defined in expectation for randomized algorithms). (3) Stability is measured over all possible examples $(x, y)$, not just the training set.
Notation: $\epsilon$ denotes the stability parameter. Subscripts clarify the type: $\epsilon_{\text{standard}}$ for standard stability, $\epsilon_{\text{robust}}$ for robustness stability.
Usage: Stable algorithms ensure that the learned model’s predictions are not overly sensitive to individual training examples. High stability (small $\epsilon$) is desirable: removing one training example should not drastically change the model’s behavior. For robustness, it is natural to require algorithms to be stable to input perturbations (robust stability), not just changes in the training set.
Valid Example: Ridge regression with a large regularization parameter $\lambda$ is highly stable: changing one training example changes the learned parameters, and thus the predictions, only slightly. In contrast, 1-nearest neighbor is unstable: changing one training point can flip predictions for many test points if that point is their nearest neighbor.
Failure Case: Unregularized empirical risk minimization on a highly expressive model class is unstable: fitting the model exactly to the training data means that changing one example can substantially change the learned hypothesis. A neural network trained to zero training loss (interpolating the training set) may be unstable if the network relies heavily on specific training examples.
Explicit ML Relevance: Stable algorithms have good generalization properties (a small generalization gap). When the stability bound also covers the perturbed loss, the same argument implies that robustness measured on the training data carries over to test data.
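
The effect of regularization on stability can be seen in a one-dimensional ridge regression without intercept, where minimizing $\sum_i (y_i - w x_i)^2 + \lambda w^2$ gives $w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$. The sketch below replaces one training example and measures how much a test prediction moves; the data, the replaced example, and the function names are hypothetical.

```python
def ridge_1d(data, lam):
    """Closed-form 1-D ridge solution w = sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)


def replace_one_gap(data, index, new_example, lam, x_test):
    """Change in the prediction at x_test when one training example is replaced."""
    w = ridge_1d(data, lam)
    perturbed = list(data)
    perturbed[index] = new_example
    w_perturbed = ridge_1d(perturbed, lam)
    return abs(w * x_test - w_perturbed * x_test)


# Hypothetical training set; the third example is replaced by a mislabeled one.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]
outlier = (3.0, -3.0)

gap_weak_reg = replace_one_gap(data, 2, outlier, lam=0.1, x_test=1.0)
gap_strong_reg = replace_one_gap(data, 2, outlier, lam=100.0, x_test=1.0)

print(round(gap_weak_reg, 3), round(gap_strong_reg, 3))
```

Stronger regularization shrinks the gap, but it also biases the fit toward zero, so stability is bought at the cost of flexibility.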

Definition 7: Uniform Stability

Definition: An algorithm $\mathcal{A}$ is uniformly $\epsilon$-stable if for any two training datasets $S$ and $S'$ differing in one example, and for all $(x, y)$ (not necessarily in $S$ or $S'$), with probability at least $1 - \eta$ over the randomness in $\mathcal{A}$ (if any), \[\max \{ |\ell(f_S(x), y) - \ell(f_{S'}(x), y)|, \max_{\|\delta\| \leq \epsilon'} |\ell(f_S(x + \delta), y) - \ell(f_{S'}(x + \delta), y)| \} \leq \epsilon \] Here $\delta$ is the input perturbation with budget $\epsilon'$, and $\eta$ is the confidence parameter. The bound is uniform across all examples, not just in expectation over the data distribution.
Assumptions: (1) The algorithm may be randomized (e.g., SGD with random initialization or mini-batch sampling). (2) The probability $1 - \eta$ is over the randomness in the algorithm, not over the data distribution. (3) The uniform bound holds for all $(x, y)$, including those outside the training set.
Notation: $\epsilon$ and $\eta$ denote the stability and confidence parameters, respectively; $\delta$ is reserved for input perturbations. Uniform stability is stronger than average-case stability and should be denoted accordingly.
Usage: Uniform stability is stricter than average-case stability. An algorithm is uniformly stable if the worst-case influence of any training example is bounded, not just the average case. Uniformly stable algorithms provide generalization guarantees that hold with high probability for any data distribution, whereas weaker stability definitions provide only average-case bounds.
Valid Example: Gradient descent on a strongly convex loss with appropriate learning rate and iteration count is uniformly stable. The bounded updates and convergence ensure that changing one training example does not drastically alter the solution.
Failure Case: Algorithms that explicitly search for outliers or fit the training data perfectly (like some kernel methods without regularization) may not be uniformly stable: a single outlier can drastically change the learned model.
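Explicit ML Relevance: Uniform stability yields generalization guarantees, and when the bound also covers the perturbed loss, as in the definition above, those guarantees extend to robust risk. Uniform stability does not by itself make a model robust to adversarial perturbations: it controls how much the learned model depends on any single training example, not how much its predictions change under input perturbation.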
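
The link between uniform stability and generalization can be made explicit with a short derivation. The sketch below assumes a deterministic algorithm, an i.i.d. sample, and the replace-one form of the definition; the shorthand $\ell(f; z)$ and the symbols $z'$ and $S^{(i)}$ are introduced here for the derivation only.

```latex
% Notation assumed for this sketch: z = (x, y), \ell(f; z) = \ell(f(x), y),
% S = (z_1, \dots, z_m) drawn i.i.d. from D, z' an independent draw from D,
% and S^{(i)} equal to S with z_i replaced by z'.
\begin{align*}
\mathbb{E}\big[R(f_S)\big]
  &= \mathbb{E}\big[\ell(f_S; z')\big]
   = \mathbb{E}\big[\ell(f_{S^{(i)}}; z_i)\big]
   && \text{(exchange the roles of } z_i \text{ and } z'\text{)} \\
\mathbb{E}\big[R(f_S) - R_{\text{emp}}(f_S, S)\big]
  &= \frac{1}{m} \sum_{i=1}^m
     \mathbb{E}\big[\ell(f_{S^{(i)}}; z_i) - \ell(f_S; z_i)\big]
   \;\leq\; \epsilon
   && \text{(uniform stability, term by term)}
\end{align*}
```

Because the definition also bounds the perturbed loss, and $|\max_\delta a(\delta) - \max_\delta b(\delta)| \leq \max_\delta |a(\delta) - b(\delta)|$, the same steps apply with $\ell$ replaced by the worst-case loss, giving the analogous bound for robust risk.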

Definition 8: Certified Robustness

Definition: A classifier $f$ is certifiably $\epsilon$-robust at point $x$ with label $y$ if for all perturbations $\|\delta\|_p \leq \epsilon$, the classifier provably outputs the correct label: \[\forall \|\delta\|_p \leq \epsilon: \text{argmax}_c f(x + \delta)_c = y \] Correspondingly, the certified robust radius at $x$ is the largest $\epsilon$ for which the classification is correct for all perturbations within the $\epsilon$-ball around $x$: \[r(x) = \sup \{ \epsilon : \forall \|\delta\|_p \leq \epsilon, \text{argmax}_c f(x + \delta)_c = y \} \] In practice, a certification method returns a provable lower bound on this quantity.
Assumptions: (1) A certification method (e.g., randomized smoothing, abstract interpretation, SMT solvers) must exist to verify robustness, not merely test against specific attacks. (2) The certification is based on the model’s current parameters and inputs; it is not probabilistic (unless explicitly randomized). (3) The threat model (norm and budget) is specified.
Notation: $r(x)$ or $\epsilon_{\text{cert}}$ denote the certified robust radius. A subscript specifies the threat model (e.g., $r_{\ell_\infty}(x)$ for $\ell_\infty$-robustness at $x$).
Usage: Certified robustness provides a formal, provable guarantee. Unlike empirical robustness (testing against specific attacks, which may miss stronger attacks), certified robustness holds against all possible perturbations within the threat model. The trade-off is that certified robustness is often harder to achieve (smaller radius) and more computationally expensive to verify.
Valid Example: Using randomized smoothing, a base classifier on MNIST (usually trained with Gaussian noise augmentation so that it remains accurate on noisy inputs) can be converted to a certifiably robust classifier. For each test image, the certified radius might be $\epsilon \approx 0.1$ (guaranteed robustness to $\ell_2$-perturbations up to 0.1). The certificate is obtained by estimating class probabilities over many noisy versions of the input, so it holds with high probability rather than with certainty.
Failure Case: A model with high empirical robustness (resistant to PGD attacks) may not be certifiably robust. It could still be vulnerable to stronger or adaptive attacks that the evaluation did not include. Furthermore, the certified radius is often much smaller than the radius at which attacks start to succeed, reflecting the gap between testing against specific attacks and proving robustness to all attacks.
Explicit ML Relevance: Certified robustness is the gold standard for safety-critical applications (autonomous vehicles, medical diagnosis). However, the certification gap (certified radius smaller than empirical robustness) motivates research into methods that simultaneously improve both certified and empirical robustness.
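
For a linear binary classifier the robust radius can be computed exactly, which gives a simple reference point for what a certificate means. The sketch below assumes a score $w^\top x + b$ with labels in $\{-1, +1\}$: the distance to the decision boundary is the signed score divided by the dual norm of $w$, that is $\|w\|_2$ for $\ell_2$ perturbations and $\|w\|_1$ for $\ell_\infty$ perturbations. The function name and the numbers are hypothetical.

```python
import math


def linear_robust_radius(w, b, x, y, norm="l2"):
    """Exact distance from x to the decision boundary of sign(w.x + b)."""
    score = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    if score <= 0:
        return 0.0  # already misclassified: nothing to certify
    if norm == "l2":
        dual = math.sqrt(sum(wi * wi for wi in w))
    elif norm == "linf":
        dual = sum(abs(wi) for wi in w)
    else:
        raise ValueError("norm must be 'l2' or 'linf'")
    if dual == 0.0:
        return math.inf  # constant classifier: no perturbation changes it
    return score / dual


w, b = [3.0, 4.0], 0.0
x, y = [1.0, 0.5], +1

r_l2 = linear_robust_radius(w, b, x, y, norm="l2")      # 5 / 5 = 1.0
r_linf = linear_robust_radius(w, b, x, y, norm="linf")  # 5 / 7

# Moving distance r_l2 against w lands exactly on the decision boundary.
w_norm = math.sqrt(sum(wi * wi for wi in w))
x_boundary = [xi - r_l2 * wi / w_norm for xi, wi in zip(x, w)]
boundary_score = sum(wi * xi for wi, xi in zip(w, x_boundary)) + b

print(r_l2, round(r_linf, 4), round(boundary_score, 12))
```

The prediction is guaranteed for every perturbation strictly smaller than the returned radius; at the radius itself the score reaches zero.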

Definition 9: Margin

Definition: For a binary classifier $f$ outputting logits $s(x) \in \mathbb{R}$, the margin at a point $(x, y)$ with $y \in \{-1, +1\}$ is \[m(x, y) = y \cdot s(x) \] For multi-class classification with $c$ classes and $f$ outputting $c$ logits, the margin is \[m(x, y) = s(x)_y - \max_{c' \neq y} s(x)_{c'} \] the difference between the true class logit and the maximum competing logit. The geometric margin (normalized by the weight norm) is $m_{\text{geom}}(x, y) = m(x, y) / \|w\|$ for a linear classifier $f(x) = w^T x + b$.
Assumptions: (1) The model outputs logits (or probabilities, which can be converted to logits via log). (2) For normalized margin, the weights are explicitly available (true for linear models, less clear for neural networks). (3) The margin is compared to a threshold; a positive margin indicates correct classification.
Notation: $m(x, y)$ or $m(x)$ (with label implicit) denote margin. Subscripts clarify: $m_{\text{geom}}$ for geometric margin, $m_{\text{hard}}$ for hard margin (must be $\geq 1$), $m_{\text{soft}}$ for soft margin (penalized if $< 1$).
Usage: Large margins indicate confident, robust predictions. If $m(x, y) = 2.0$, the true class logit is 2.0 higher than the nearest competing logit, providing a buffer against perturbations. Small margins indicate uncertainty or vulnerability: a margin $m(x, y) = 0.1$ means a small perturbation could flip the predicted class. Margin-based learning (SVMs, boosting) explicitly maximizes the margin to improve robustness and generalization.
Valid Example: For a correctly classified image with margin $m(x, y) = 1.5$, the model is confident. An adversarial perturbation that lowers the true-class logit by 0.75 while raising the top competing logit by 0.75 (a total change of 1.5 in the margin) brings the margin to zero, and any larger change causes misclassification. If an input perturbation of magnitude 0.5 suffices to produce this change, the example is not robust at $\epsilon = 0.5$.
Failure Case: A model with high average margin on the training set may still have examples with very small margins, which are adversarially vulnerable. Furthermore, margin does not account for the input space geometry; a functionally large margin (in logit space) may correspond to a small geometric margin if the decision boundary is close to the data manifold.
Explicit ML Relevance: Margin-based robustness is a classical concept. Models trained to maximize margin (via explicit margin-based losses or implicitly via algorithms like gradient descent on separable data) tend to be more robust.
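The multi-class margin from Definition 9 can be computed directly from a batch of logits. The following is a minimal NumPy sketch; the function name `multiclass_margin` and the toy logits are illustrative assumptions, not part of the chapter's notation.

```python
import numpy as np

def multiclass_margin(logits, y):
    """Margin s_y - max_{c != y} s_c for a batch of logit vectors."""
    logits = np.asarray(logits, dtype=float)
    y = np.asarray(y)
    rows = np.arange(logits.shape[0])
    true_logit = logits[rows, y]
    # Mask out the true class before taking the max over competitors.
    masked = logits.copy()
    masked[rows, y] = -np.inf
    return true_logit - masked.max(axis=1)

logits = np.array([[3.0, 1.0, 0.5],    # confident and correct
                   [1.1, 1.0, -2.0],   # correct but fragile
                   [0.2, 0.9, 0.1]])   # misclassified (negative margin)
labels = np.array([0, 0, 0])
margins = multiclass_margin(logits, labels)
print(margins)  # approximately [2.0, 0.1, -0.7]
```

A negative margin marks a misclassified example, and a small positive margin marks an example that is correct but close to a competing class in logit space.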

Definition 10: Worst-Case Risk

Definition: The worst-case risk of a model $f$ with respect to a loss function $\ell$, data distribution $\mathcal{D}$, and threat model with budget $\epsilon$ is \[R_{\text{worst}}(f) = \max_{\mathcal{D}' \in \text{Shifts}_\epsilon(\mathcal{D})} \mathbb{E}_{(x,y) \sim \mathcal{D}'} [\ell(f(x), y)] \] where $\text{Shifts}_\epsilon(\mathcal{D})$ is the set of distributions obtainable from $\mathcal{D}$ by perturbations allowed under the threat model. For a finite dataset $S = \{(x_i, y_i)\}_{i=1}^m$, the empirical counterpart used in this chapter is \[R_{\text{worst}}(f, S) = \max_{i} \max_{\|\delta\| \leq \epsilon} \ell(f(x_i + \delta), y_i) \] the maximum loss over all examples and their perturbations. Note that the finite-sample version takes a maximum over examples rather than an average, so it is stricter than the expectation above.
Assumptions: (1) The threat model (perturbation set) is fully specified. (2) The worst-case risk is an upper bound; the model is guaranteed to have loss at most $R_{\text{worst}}$ on any perturbed distribution. (3) Computing worst-case risk requires solving the inner maximization, which may be intractable.
Notation: $R_{\text{worst}}$ or $R_{\max}$ denote worst-case risk. This is distinct from average-case risk $R_{\text{avg}} = \mathbb{E}[\ell(f(x), y)]$ and robust risk $R_{\text{robust}} = \mathbb{E}[\max_\delta \ell(f(x + \delta), y)]$.
Usage: Worst-case risk is the most stringent robustness measure, relevant for safety-critical applications where a single failure is catastrophic. A model might have low average-case risk but high worst-case risk if a few examples are adversarially vulnerable. Minimizing worst-case risk requires ensuring robustness for all examples, not just on average.
Valid Example: For a neural network deployed in autonomous vehicles, the worst-case risk is the maximum classification error over all inputs and adversarial perturbations. If one adversarial perturbation causes the network to misclassify a pedestrian as a lamppost, the worst-case risk is the loss on that example (very high if safety-critical).
Failure Case: Empirical estimates of worst-case risk are unreliable for finite datasets. The worst-case risk on 100 test images is determined by a single example, so it has high variance across samples. Because a maximum over a sample can only underestimate the maximum over the full distribution, it is also a lower bound on the population worst case. Furthermore, attacks used to estimate worst-case risk are not guaranteed to find the true worst-case perturbation, which biases the estimate downward again.
Explicit ML Relevance: Safety-critical applications must minimize worst-case risk. This shifts the optimization from average-case (standard ERM) to worst-case (robust optimization).
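The three risks distinguished in the Notation paragraph can be compared exactly for a linear model, because the inner maximization has a closed form there: under an $\ell_\infty$ budget $\epsilon$, the adversary lowers every margin $y(w^T x + b)$ by $\epsilon \|w\|_1$. The sketch below assumes a logistic loss and synthetic data; all names and values are illustrative.

```python
import numpy as np

def logistic_loss(margin):
    # log(1 + exp(-margin)), computed stably
    return np.logaddexp(0.0, -margin)

rng = np.random.default_rng(0)
w, b, eps = np.array([1.0, -2.0, 0.5]), 0.1, 0.1
X = rng.normal(size=(200, 3))
y = np.where(X @ w + b >= 0, 1.0, -1.0)   # labels the model gets right on clean data

clean_margin = y * (X @ w + b)
# For a linear score, the l_inf adversary lowers every margin by eps * ||w||_1.
adv_margin = clean_margin - eps * np.abs(w).sum()

R_avg = logistic_loss(clean_margin).mean()     # average-case risk
R_robust = logistic_loss(adv_margin).mean()    # mean of per-example worst-case losses
R_worst = logistic_loss(adv_margin).max()      # single worst example

print(f"R_avg={R_avg:.3f}  R_robust={R_robust:.3f}  R_worst={R_worst:.3f}")
```

The ordering $R_{\text{avg}} \leq R_{\text{robust}} \leq R_{\text{worst}}$ always holds; how far apart the three values are depends on how many examples sit close to the decision boundary.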

Definition 11: Distributional Robustness

Definition: A model (or learning algorithm) is distributionally robust if it maintains low expected loss even when the test distribution $\mathcal{D}_{\text{test}}$ differs from the training distribution $\mathcal{D}_{\text{train}}$ within a specified divergence or distance. Formally, the model minimizes the worst-case expected loss over a set of distributions: \[R_{\text{dist-robust}}(f) = \max_{\mathcal{D}': D(\mathcal{D}', \mathcal{D}_{\text{train}}) \leq \delta} \mathbb{E}_{(x,y) \sim \mathcal{D}'} [\ell(f(x), y)] \] where $D(\cdot, \cdot)$ is a divergence or distance (e.g., Wasserstein distance, KL divergence) and $\delta$ bounds the distribution shift.
Assumptions: (1) The divergence or distance is appropriately chosen for the domain (e.g., Wasserstein for images, KL for discrete distributions). (2) The bound $\delta$ on the distribution shift is specified. (3) Computing the worst-case distribution within the divergence ball is tractable (often with duality).
Notation: $R_{\text{dist-robust}}$ or $R_{\text{DR}}$ denote distributional robustness. The divergence $D$ and bound $\delta$ should be specified explicitly.
Usage: Distributional robustness accounts for model mismatch: the training data comes from a distribution $\mathcal{D}_{\text{train}}$, but the test data may come from a shifted distribution $\mathcal{D}_{\text{test}}$. Rather than assuming the test distribution is identical to the training distribution, distributional robustness allows for small deviations and ensures the model performs well under these deviations. This is a generalization of adversarial robustness: adversarial examples can be viewed as extreme points of a shifted distribution.
Valid Example: A classifier trained on images captured in clear weather should remain accurate on a test distribution that includes rain (a distribution shift). Distributional robustness with Wasserstein distance addresses this when the shift is small in that distance: the model’s worst-case expected loss is controlled for any test distribution within a Wasserstein ball of radius $\delta$ around the training distribution.
Failure Case: If the bound $\delta$ on the distribution shift is too conservative (very large), the worst-case risk is very large (pessimistic). Conversely, if $\delta$ is too small, the robustness guarantee may not apply to the actual test distribution shift, failing to protect against real distribution changes.
Explicit ML Relevance: Distributional robustness is a framework for domain adaptation and out-of-distribution generalization. It generalizes adversarial robustness and provides a unified view of robustness to distribution shifts.
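Assumption (3) mentions that the worst case over a divergence ball is often tractable through duality. For a KL ball, the standard dual is $\sup_{\mathrm{KL}(\mathcal{D}' \| \mathcal{D}) \leq \rho} \mathbb{E}_{\mathcal{D}'}[\ell] = \inf_{\lambda > 0} \{\lambda \rho + \lambda \log \mathbb{E}_{\mathcal{D}}[e^{\ell / \lambda}]\}$. The sketch below evaluates the right-hand side on a grid of $\lambda$ values for a handful of made-up per-example losses; the function name, the grid, and the radius (the role played by $\delta$ in the definition) are assumptions for illustration. Every grid value is a valid upper bound, so the minimum over the grid is one as well.

```python
import numpy as np

def kl_dro_upper_bound(losses, radius, lambdas=None):
    """Upper bound on the worst-case mean loss over a KL ball around the empirical distribution."""
    losses = np.asarray(losses, dtype=float)
    if lambdas is None:
        lambdas = np.logspace(-3, 3, 400)
    n = losses.size
    values = []
    for lam in lambdas:
        z = losses / lam
        # log(mean(exp(z))) via a stable log-sum-exp
        log_mean_exp = z.max() + np.log(np.exp(z - z.max()).sum()) - np.log(n)
        values.append(lam * radius + lam * log_mean_exp)
    return float(min(values))

losses = np.array([0.1, 0.2, 0.15, 2.5, 0.3])
bound = kl_dro_upper_bound(losses, radius=0.1)
print(f"mean loss={losses.mean():.3f}  KL-DRO bound={bound:.3f}  max loss={losses.max():.3f}")
```

The bound lies between the average loss (radius zero) and the maximum loss (unbounded radius), which mirrors the failure case about choosing the shift bound too small or too large.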

Definition 12: Gradient-Based Attack

Definition: A gradient-based attack is an algorithm that generates adversarial examples by exploiting the gradients of the loss function with respect to the input. The simplest gradient-based attack is FGSM (Fast Gradient Sign Method): \[x_{\text{adv}} = x + \epsilon \cdot \text{sign}(\nabla_x \ell(f(x), y)) \] where $\epsilon$ is the perturbation budget and $\text{sign}(\cdot)$ gives the sign of each gradient component. More sophisticated gradient-based attacks, such as PGD (Projected Gradient Descent), iteratively maximize loss within the threat model: \[x_{\text{adv}} = \arg\max_{\|\delta\|_p \leq \epsilon} \ell(f(x + \delta), y) \] solved via gradient ascent with projections to stay within the perturbation budget.
Assumptions: (1) The model $f$ is differentiable (at least approximately) so gradients can be computed. (2) The attacker has white-box access to the model (architecture and weights). (3) The loss function $\ell$ is differentiable. (4) The threat model (e.g., $\ell_\infty$-ball of radius $\epsilon$) is convex (for convergence of PGD).
Notation: $\nabla_x \ell$ denotes the gradient of loss with respect to inputs. FGSM and PGD should be specified by name. Function $f$ is the victim model, $\delta$ is the perturbation.
Usage: Gradient-based attacks are white-box attacks (adversary has full knowledge of the model). They work by moving in the direction of increasing loss (steepest ascent), which typically leads to misclassification. FGSM is fast (single step) but may find suboptimal adversarial examples. PGD is more powerful (iterative) but slower. Gradient-based attacks are the standard for evaluating adversarial robustness and for adversarial training.
Valid Example: For an image classifier with cross-entropy loss, FGSM computes $\nabla_x \ell$ (gradient of loss w.r.t. pixels) and moves each pixel by $\epsilon$ in the direction of increasing loss, then clamps the result to the valid pixel range. On an undefended model this single step often causes misclassification of a correctly classified image; when it does not, a few iterations of PGD within the same budget typically do.
Failure Case: Gradient-based attacks assume differentiability; attacks on models with non-differentiable components (e.g., hard thresholding, discrete operations) require special handling (approximations like Straight-Through Estimators). Also, attacks designed for one threat model (e.g., $\ell_\infty$) may not be optimal for another (e.g., $\ell_2$); the optimization must be adapted.
Explicit ML Relevance: Gradient-based attacks are the primary tool for evaluating adversarial robustness and training robust models. PGD-based adversarial training is the most common defense.
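The FGSM and PGD updates from Definition 12 can be sketched on a logistic-regression model, where the input gradient is available in closed form. This is an illustrative sketch under stated assumptions: the weights, input, step size, and step count are made up, and clamping to a valid pixel range is omitted.

```python
import numpy as np

def loss_and_grad(x, y, w, b):
    """Logistic loss log(1 + exp(-y (w.x + b))) and its gradient w.r.t. the input x."""
    margin = y * (w @ x + b)
    loss = np.logaddexp(0.0, -margin)
    grad_x = -y * w / (1.0 + np.exp(margin))
    return loss, grad_x

def fgsm(x, y, w, b, eps):
    _, g = loss_and_grad(x, y, w, b)
    return x + eps * np.sign(g)                      # single signed-gradient step

def pgd_linf(x, y, w, b, eps, step, n_steps):
    x_adv = x.copy()
    for _ in range(n_steps):
        _, g = loss_and_grad(x_adv, y, w, b)
        x_adv = x_adv + step * np.sign(g)            # ascent step
        x_adv = x + np.clip(x_adv - x, -eps, eps)    # project onto the l_inf ball
    return x_adv

w, b = np.array([1.5, -2.0, 0.5]), 0.0
x, y, eps = np.array([0.4, -0.3, 0.2]), 1.0, 0.25
x_fgsm = fgsm(x, y, w, b, eps)
x_pgd = pgd_linf(x, y, w, b, eps, step=0.05, n_steps=10)

clean_loss, _ = loss_and_grad(x, y, w, b)
fgsm_loss, _ = loss_and_grad(x_fgsm, y, w, b)
print(f"clean loss={clean_loss:.3f}  FGSM loss={fgsm_loss:.3f}")
```

For a linear model the two attacks reach the same corner of the $\ell_\infty$ ball, because the gradient sign never changes; the gap between FGSM and PGD only appears for non-linear models.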

Definition 13: Perturbation Norm Constraint

Definition: The perturbation norm constraint specifies the allowed magnitudes of adversarial perturbations. For a perturbation $\delta \in \mathbb{R}^d$, the constraint is $\|\delta\|_p \leq \epsilon$ for a chosen norm $p$ and budget $\epsilon$. Common choices are: (1) $\ell_\infty$ (Chebyshev) norm: $\|\delta\|_\infty = \max_i |\delta_i| \leq \epsilon$, allowing each coordinate to change by at most $\epsilon$. (2) $\ell_2$ (Euclidean) norm: $\|\delta\|_2 = \sqrt{\sum_i \delta_i^2} \leq \epsilon$, constraining total Euclidean distance. (3) $\ell_1$ (Manhattan) norm: $\|\delta\|_1 = \sum_i |\delta_i| \leq \epsilon$, constraining sum of absolute changes. (4) $\ell_0$ (cardinality): $\|\delta\|_0 = \sum_i \mathbb{1}[\delta_i \neq 0] \leq k$, limiting the number of nonzero perturbations (sparse attacks).
Assumptions: (1) The norm $p$ defines the threat model; different norms yield different robustness properties and different attack/defense algorithms. (2) The budget $\epsilon$ is threat-model specific (e.g., $\epsilon = 8/255$ for $\ell_\infty$ on images, $\epsilon = 2.0$ for $\ell_2$ on normalized inputs). (3) The threat model is fixed during training and evaluation; switching threat models changes the robustness guarantees.
Notation: $\|\delta\|_p$ specifies norm; subscripts clarify (e.g., $\ell_\infty, \ell_2, \ell_0$). The budget $\epsilon$ should always be specified alongside the norm. Threat model is often denoted $(\ell_p, \epsilon)$.
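Usage: Different norms reflect different threat models and physical constraints. $\ell_\infty$ on images bounds the change to every pixel independently, so it captures small dense perturbations. $\ell_2$ bounds the total Euclidean energy of the perturbation, which can be spread over many pixels or concentrated on a few. $\ell_0$ captures sparse corruption (only a few pixels altered, each by an arbitrary amount). The choice of norm significantly affects the robustness problem, but budgets in different norms are not directly comparable, so a statement that one norm is easier to defend than another is only meaningful for specific budgets. $\ell_0$ constraints are additionally combinatorial, which makes both attacks and certification harder to optimize.
Valid Example: For MNIST images with pixel values in $[0, 1]$, an $\ell_\infty$ constraint $\epsilon = 0.3$ allows each pixel to change by up to 0.3 (from, say, 0.5 to anywhere in $[0.2, 0.8]$). For ImageNet images (commonly $224 \times 224 \times 3$ after preprocessing), budgets of a few multiples of $1/255$, such as $\epsilon = 4/255$ or $\epsilon = 8/255 \approx 0.031$, are common; perturbations of this size are typically hard for humans to perceive.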
Failure Case: Adversarial robustness to $\ell_\infty$ perturbations does not imply robustness to $\ell_2$ or $\ell_0$ perturbations. A model trained to be robust against $(\ell_\infty, 0.3)$-perturbations may be vulnerable to $(\ell_2, 2.0)$-perturbations or sparse attacks. Each threat model requires its own robustness analysis.
Explicit ML Relevance: The choice of constraint norm is essential for defining the adversarial robustness problem. Different applications suggest different threat models: image classification typically uses $\ell_\infty$ or $\ell_2$, audio classification might use $\ell_\infty$ on amplitude or $\ell_2$ on spectrograms, and NLP might use $\ell_0$ (character/word edits).
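The four constraints in Definition 13 measure the same perturbation very differently, and each convex ball has its own projection. The sketch below computes the norms of one perturbation and projects it onto $\ell_\infty$ and $\ell_2$ balls; the helper names and the example vector are illustrative assumptions.

```python
import numpy as np

def perturbation_norms(delta):
    d = np.ravel(delta)
    return {
        "linf": float(np.max(np.abs(d))),
        "l2": float(np.sqrt(np.sum(d ** 2))),
        "l1": float(np.sum(np.abs(d))),
        "l0": int(np.count_nonzero(d)),
    }

def project_linf(delta, eps):
    # Clip every coordinate independently.
    return np.clip(delta, -eps, eps)

def project_l2(delta, eps):
    # Rescale the whole vector if it lies outside the ball.
    norm = np.linalg.norm(np.ravel(delta))
    return delta if norm <= eps else delta * (eps / norm)

delta = np.array([0.5, -0.1, 0.0, 0.2])
norms = perturbation_norms(delta)
print(norms)
print(project_linf(delta, 0.3), project_l2(delta, 0.3))
```

The $\ell_\infty$ projection changes only the coordinates that exceed the budget, whereas the $\ell_2$ projection shrinks every coordinate by the same factor, which is one reason attacks tuned for one norm transfer poorly to another.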

Definition 14: Robust Optimization Problem

Definition: A robust optimization problem is a minimax optimization where the goal is to find parameters $\theta$ that minimize the worst-case loss over a set of allowed perturbations $\Delta$: \[\min_\theta \max_{\delta \in \Delta} \ell(f_\theta(x + \delta), y) \] More generally, for a dataset $S = \{(x_i, y_i)\}_{i=1}^m$, this becomes \[\min_\theta \frac{1}{m}\sum_{i=1}^m \max_{\|\delta_i\|_p \leq \epsilon_i} \ell(f_\theta(x_i + \delta_i), y_i) \] where each example has its own perturbation $\delta_i$ and the budget $\epsilon_i$ may be example-dependent. The inner maximization is the attack problem, and the outer minimization is the learning problem.
Assumptions: (1) The inner maximization is solvable (exactly or approximately). (2) The function $\ell \circ f$ is continuous and differentiable, or approximations suffice. (3) The threat model (perturbation set) is convex (for convergence of attack algorithms). (4) The problem is well-posed (bounded feasible region, existence of solutions).
Notation: The robust problem is often written in minimax notation: $\min_\theta \max_\delta \ell(\cdot)$. The inner and outer optimizations should be explicitly distinguished. Dual formulations (via Lagrangian duality) provide alternative perspectives.
Usage: Robust optimization is the correct framework for training adversarially robust models. Unlike standard ERM, which assumes the test data comes from the training distribution, robust optimization assumes an adversary perturbs test examples and seeks to minimize worst-case loss. This is inherently more difficult than standard learning, both computationally (harder optimization problem) and statistically (larger sample complexity required).
Valid Example: Adversarial training via PGD solves a robust optimization problem: the inner maximization finds the worst perturbation via PGD, and the outer minimization updates model parameters to reduce the worst-case loss. Each training step involves solving the inner attack problem approximately, then taking a gradient step on the outer objective.
Failure Case: The inner maximization may be non-concave (for neural networks), making it hard to find the true worst-case perturbation. Attacks that find only local maxima may miss stronger adversarial examples. Furthermore, approximating the inner maximization at every training step is computationally expensive: with $K$-step PGD each update costs roughly $K + 1$ forward-backward passes, which is why adversarial training is often quoted at 5-10× the cost of standard training.
Explicit ML Relevance: Formulating robustness as a robust optimization problem provides a rigorous framework for defining and solving adversarial robustness. This is the theoretical foundation for adversarial training.
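The alternation between inner attack and outer learning step can be sketched for a linear model, where the inner maximization under an $\ell_\infty$ budget is solved exactly by $x_{\text{adv}} = x - \epsilon\, y\, \text{sign}(w)$. The data, learning rate, and step count below are illustrative assumptions; for neural networks the inner step would be replaced by an approximate attack such as PGD.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, eps, lr = 200, 0.5, 0.1
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * np.array([2.0, 2.0]) + 0.5 * rng.normal(size=(n, 2))

def robust_loss(w, b):
    # Exact inner maximum for a linear score under an l_inf budget.
    margin = y * (X @ w + b) - eps * np.abs(w).sum()
    return np.logaddexp(0.0, -margin).mean()

w, b = np.zeros(2), 0.0
initial = robust_loss(w, b)                      # equals log(2) at w = 0
for _ in range(300):
    # Inner maximization: worst-case inputs for the current parameters.
    X_adv = X - eps * y[:, None] * np.sign(w)[None, :]
    # Outer minimization: one gradient step on the loss at those inputs.
    margin = y * (X_adv @ w + b)
    coef = -y / (1.0 + np.exp(margin))           # d loss / d score
    w -= lr * (coef[:, None] * X_adv).mean(axis=0)
    b -= lr * coef.mean()
final = robust_loss(w, b)

print(f"robust loss: initial={initial:.3f}  final={final:.3f}")
```

The gradient is taken with the adversarial inputs held fixed, which is the usual practice in adversarial training: solve (or approximate) the inner problem, then differentiate the loss at its solution.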

Definition 15: Stability Metrics

Definition: Stability metrics quantify how stable an algorithm or model is to small changes in the training data or inputs. Common metrics include: (1) $\epsilon$-stability (per Definition 6), the maximum change in loss when one training example is changed. (2) Leave-One-Out (LOO) error: $\text{LOO} = \frac{1}{m} \sum_{i=1}^m \ell(f_{S \setminus \{i\}}(x_i), y_i)$, the average error of a model trained on all but one example, evaluated on the left-out example. LOO error is a nearly unbiased estimate of the expected test error for training sets of size $m - 1$. (3) Cross-validation error: dividing the dataset into $K$ folds, training on $K - 1$ folds, evaluating on the held-out fold, and averaging over the folds; a practical approximation to LOO. (4) Lipschitz stability: $L_{\text{stab}} = \sup_{x \neq x'} \|f(x) - f(x')\| / \|x - x'\|$, measuring sensitivity of predictions to input changes (this is the Lipschitz constant of $f$).
Assumptions: (1) The metric depends on the loss function and algorithm; different metrics capture different aspects of stability. (2) For LOO and cross-validation, the dataset is finite and partitioned appropriately. (3) For Lipschitz stability, the model is differentiable or can be approximated as such.
Notation: $\epsilon$ for $\epsilon$-stability, $\text{LOO}$ for leave-one-out error, $L_{\text{stab}}$ for Lipschitz stability. Subscripts clarify the type of stability (e.g., $\epsilon_{\text{data}}$ for data stability, $\epsilon_{\text{input}}$ for input stability).
Usage: Stability metrics are practical tools for evaluating robustness. High stability (low $\epsilon$, low $\text{LOO}$ error) indicates a model that is robust to small changes. These metrics are related to generalization: stable algorithms generalize well, and algorithms with low LOO error have low expected test error.
Valid Example: For a linear regression model with strong regularization, LOO error is low (leaving out one training example slightly perturbs the learned weights but not drastically). For a 1-nearest-neighbor classifier, LOO error can be high (it may output different predictions depending on which training examples are left out).
Failure Case: Favorable stability metrics (small $\epsilon$, low LOO error) do not guarantee adversarial robustness. A model that is stable to training data changes may still be vulnerable to adversarial input perturbations if it relies on non-robust features (e.g., high-frequency components in images).
Explicit ML Relevance: Stability metrics provide practical diagnostics for robustness. Algorithms and models with favorable stability metrics are expected to be robust, though the relationship is not deterministic.
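The leave-one-out error from Definition 15 can be computed by brute force for a small ridge regression problem. The sketch below assumes squared loss, no intercept, and synthetic data; the helper names are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loo_squared_error(X, y, lam):
    m = X.shape[0]
    errors = []
    for i in range(m):
        keep = np.arange(m) != i                  # drop example i
        w = ridge_fit(X[keep], y[keep], lam)
        errors.append((X[i] @ w - y[i]) ** 2)     # evaluate on the left-out example
    return float(np.mean(errors))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + 0.1 * rng.normal(size=40)

loo = loo_squared_error(X, y, lam=1.0)
train_mse = float(np.mean((X @ ridge_fit(X, y, 1.0) - y) ** 2))
print(f"train MSE={train_mse:.4f}  LOO error={loo:.4f}")
```

For ridge regression the LOO error is never below the training error, and the size of the gap is a direct, data-dependent reading of how much a single example moves the fitted model.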

Theorems

Theorem 1: Lipschitz Bound Implies Robustness Bound

Formal Statement: Let $f: \mathbb{R}^d \to \mathbb{R}^k$ be $L_f$-Lipschitz from $(\mathbb{R}^d, \|\cdot\|_p)$ to $(\mathbb{R}^k, \|\cdot\|_q)$, and let $\ell: \mathbb{R}^k \times \mathbb{Y} \to [0, B]$ be a bounded loss that is $L_\ell$-Lipschitz in its first argument with respect to $\|\cdot\|_q$. Then for any $x \in \mathbb{R}^d$ and $y \in \mathbb{Y}$, and any perturbation $\delta$ with $\|\delta\|_p \leq \epsilon$, \[|\ell(f(x + \delta), y) - \ell(f(x), y)| \leq L_f L_\ell \epsilon \] Consequently, if $\ell_{\max}$ is the loss threshold at which a prediction counts as an error and $M = \ell_{\max} - \ell(f(x), y) > 0$ is the margin of the prediction measured in loss space, the robustness radius satisfies $r(x, y) \geq \frac{M}{L_f L_\ell}$.

Formal Proof: By the Lipschitz property of $f$, \[\|f(x + \delta) - f(x)\|_q \leq L_f \|\delta\|_p \leq L_f \epsilon \] By the Lipschitz property of $\ell$, \[|\ell(f(x + \delta), y) - \ell(f(x), y)| \leq L_\ell \|f(x + \delta) - f(x)\|_q \leq L_\ell L_f \epsilon \] This completes the proof. For the robustness radius interpretation: if the margin (in the loss space) is $M$, then the loss at $x$ is $\ell(f(x), y) = \ell_0$ for some baseline loss $\ell_0$, and the loss at any $x + \delta$ is at most $\ell_0 + L_f L_\ell \epsilon$. For misclassification to occur (loss exceeding some threshold $\ell_{\max}$), we require $\ell_0 + L_f L_\ell \epsilon \geq \ell_{\max}$, or $\epsilon \geq (\ell_{\max} - \ell_0) / (L_f L_\ell) = M / (L_f L_\ell)$. Thus, the certified robustness radius is at least $r(x, y) \geq M / (L_f L_\ell)$.

Interpretation: This theorem establishes a direct relationship between the Lipschitz constant of the model-loss composition and robustness. A model with small Lipschitz constant $L_f$ and a loss with small Lipschitz constant $L_\ell$ (e.g., soft losses rather than hard margins) ensures robustness: small input perturbations cause small output changes, limiting the loss increase. Conversely, a model with large Lipschitz constant (sensitive to input changes) is vulnerable even to small perturbations.

Explicit ML Relevance: This theorem motivates the use of spectral normalization and Lipschitz regularization as defenses. By controlling the Lipschitz constant of neural networks (setting spectral norms appropriately), practitioners can obtain certified robustness bounds. For example, constraining each layer’s spectral norm to at most 1, together with 1-Lipschitz activations such as ReLU, bounds the overall $\ell_2$ Lipschitz constant by 1, which guarantees robustness up to a certified radius proportional to the margin.
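The Lipschitz constant $L_f$ in Theorem 1 is rarely known exactly, but for a feed-forward network with 1-Lipschitz activations the product of the layers' spectral norms is a valid $\ell_2$ upper bound. The sketch below compares that bound with empirically observed output-to-input change ratios for a small random ReLU network; the architecture and sampling scheme are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(4, 16))

def mlp(x):
    return W2 @ np.maximum(W1 @ x, 0.0)   # ReLU is 1-Lipschitz

# Product of layer spectral norms: an upper bound on the l2 Lipschitz constant.
lipschitz_bound = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)

ratios = []
for _ in range(1000):
    x = rng.normal(size=8)
    delta = 0.1 * rng.normal(size=8)
    ratios.append(np.linalg.norm(mlp(x + delta) - mlp(x)) / np.linalg.norm(delta))
max_ratio = max(ratios)

print(f"largest observed ratio={max_ratio:.3f}  spectral-norm bound={lipschitz_bound:.3f}")
```

The observed ratios never exceed the bound, but they are usually well below it; this looseness is what makes radii certified through Lipschitz bounds conservative.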

Theorem 2: Stability Implies Generalization Bound

Formal Statement: Let $\mathcal{A}$ be an $\epsilon$-uniformly stable algorithm for a learning task with loss function $\ell$ taking values in $[0, L]$. For any $\delta \in (0, 1)$ (here $\delta$ is a confidence parameter, not a perturbation), with probability at least $1 - \delta$ over the training set $S$ of size $m$ drawn i.i.d. from a distribution $\mathcal{D}$, the generalization gap is bounded: \[\mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(f_S(x), y)] - \frac{1}{m}\sum_{i=1}^m \ell(f_S(x_i), y_i) \leq 2\epsilon + (4 m \epsilon + L)\sqrt{\frac{\ln(1/\delta)}{2m}} \] where the first term is the stability-induced gap and the second is the concentration term. The bound is informative when $\epsilon$ decays faster than $1/\sqrt{m}$, for example $\epsilon = O(1/m)$.

Formal Proof: (Sketch) By uniform stability, replacing any single training example changes the model’s loss on any point by at most $\epsilon$. Using the symmetrization technique (replacing training examples with fresh samples from the distribution), the expected gap between training and test loss can be related to the stability parameter. Formally, let $f_S$ be the model trained on $S$, and construct a second dataset $S'$ identical to $S$ except that one example is replaced by a fresh sample. By stability, \[\left| \mathbb{E}[\ell(f_S(\tilde{x}), \tilde{y})] - \mathbb{E}[\ell(f_{S'}(\tilde{x}), \tilde{y})] \right| \leq \epsilon \] where the expectation is over the randomness in constructing $S$ and $S'$. By symmetry, the replaced example plays the role of a fresh test point for $f_{S'}$, so the expected training loss and the expected population loss differ by at most a multiple of $\epsilon$. Combining this with concentration of the gap around its mean, the gap between the training loss of $f_S$ and the true population loss is $\leq 2\epsilon + O(1/\sqrt{m})$ when $\epsilon = O(1/m)$. Applying McDiarmid’s inequality to the generalization gap, viewed as a function of the $m$ training examples, yields the stated high-probability bound.

Interpretation: This theorem shows that stability is a sufficient condition for generalization. An algorithm that is uniformly stable with small $\epsilon$ has a small generalization gap regardless of the complexity of the model class. The sample size enters through the concentration term and, in practice, through $\epsilon$ itself, which typically shrinks as $m$ grows (for instance, $\epsilon = O(1/m)$ for strongly convex regularized ERM). If $\epsilon$ does not shrink with $m$, the stability term is what remains for large $m$, whereas the concentration term dominates for small $m$.

Explicit ML Relevance: Stability provides an alternative to classical complexity-based generalization bounds (VC dimension, Rademacher complexity), particularly useful for complex models like neural networks where complexity-based bounds are often vacuous (larger than the trivial bound that holds for any model). In adversarial settings, the theorem can be applied to the adversarial loss $\max_{\delta \in \Delta} \ell(f(x + \delta), y)$ in place of the clean loss: if the learning algorithm is uniformly stable with respect to that loss, the same argument bounds the gap between training and test robust loss, giving robust generalization.

Theorem 3: First-Order Adversarial Approximation Theorem

Formal Statement: Let $f_\theta$ be a neural network with parameters $\theta$ that is twice continuously differentiable in its input $x$, and write $J_x f_\theta(x)$ for the Jacobian of the network output with respect to the input. For a small perturbation $\delta$, \[f_\theta(x + \delta) = f_\theta(x) + J_x f_\theta(x)\, \delta + O(\|\delta\|^2) \] Under the $\ell_\infty$ constraint $\|\delta\|_\infty \leq \epsilon$, the gradient-based attack (FGSM) direction $\text{sign}(\nabla_x \ell(f(x), y))$ maximizes the first-order (linearized) loss. The corresponding adversarial example is $x_{\text{adv}} = x + \epsilon \cdot \text{sign}(\nabla_x \ell)$, and the loss at the adversarial example satisfies \[\ell(f(x_{\text{adv}}), y) = \ell(f(x), y) + \epsilon \|\nabla_x \ell(f(x), y)\|_1 + O(\epsilon^2) \] where the constant hidden in $O(\epsilon^2)$ depends on the curvature of the loss and on the input dimension.

Formal Proof: By Taylor expansion, \[f_\theta(x + \delta) = f_\theta(x) + \nabla_x f_\theta(x)^T \delta + \frac{1}{2}\delta^T \nabla_x^2 f_\theta(x) \delta + O(\|\delta\|^3) \] For small $\epsilon$ and assuming the Hessian is bounded, the first-order term dominates: \[f_\theta(x + \delta) \approx f_\theta(x) + \nabla_x f_\theta(x)^T \delta \] Substituting into the loss: \[\ell(f(x + \delta), y) \approx \ell(f(x), y) + \nabla_x \ell(f(x), y)^T \delta + O(\epsilon^2) \] To maximize this within $\|\delta\|_\infty \leq \epsilon$, choosing $\delta = \epsilon \cdot \text{sign}(\nabla_x \ell)$ gives \[\ell(f(x + \epsilon \cdot \text{sign}(\nabla_x \ell)), y) \approx \ell(f(x), y) + \epsilon \sum_j |\nabla_{x_j} \ell| = \ell(f(x), y) + \epsilon \|\nabla_x \ell\|_1 \]

Interpretation: This theorem justifies gradient-based attacks and explains their empirical success. Within a small neighborhood of the input (first-order approximation), the loss landscape is locally linear, and the steepest ascent direction is the gradient. FGSM exploits this linearity. The first-order approximation breaks down for larger perturbations (where higher-order terms become significant) or for highly non-linear models, explaining why PGD (iterative optimization within the threat model) is more powerful than FGSM.

Explicit ML Relevance: This theorem provides the theoretical foundation for gradient-based attacks and their use in adversarial training. It also highlights the importance of computing accurate gradients for attacks and training. Moreover, it suggests that models with small gradients (bounded $\|\nabla_x \ell\|$) are more robust, motivating gradient regularization as a defense.
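The first-order prediction $\epsilon \|\nabla_x \ell\|_1$ can be compared against the actual loss increase under the FGSM perturbation. The sketch below uses a logistic-regression model with made-up weights so that the gradient is available in closed form; because this loss is convex in $x$, the actual increase is never below the first-order prediction, and the gap shrinks quadratically with $\epsilon$.

```python
import numpy as np

w, b = np.array([1.5, -2.0, 0.5]), 0.0
x, y = np.array([0.4, -0.3, 0.2]), 1.0

def loss(x):
    return np.logaddexp(0.0, -y * (w @ x + b))

grad = -y * w / (1.0 + np.exp(y * (w @ x + b)))   # gradient of the loss w.r.t. x

rows = []
for eps in [0.2, 0.1, 0.05, 0.01]:
    actual = loss(x + eps * np.sign(grad)) - loss(x)
    predicted = eps * np.abs(grad).sum()           # eps * ||grad||_1
    rows.append((eps, actual, predicted))
    print(f"eps={eps:<5} actual={actual:.5f} first-order={predicted:.5f}")
```

For a deep network the same comparison typically shows a larger gap at practical budgets, which is the regime where iterative attacks such as PGD improve on FGSM.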

Theorem 4: Dual Form of Robust Optimization

Formal Statement: Fix an example $(x, y)$ and write the inner attack value as $\phi_\epsilon(\theta) = \max_{\|\delta\|_p \leq \epsilon} \ell(f_\theta(x + \delta), y)$, so that the robust optimization problem is $\min_\theta \phi_\epsilon(\theta)$. For a dual variable $\lambda \geq 0$, define the penalized attack value \[\psi_\lambda(\theta) = \sup_{\delta} \left( \ell(f_\theta(x + \delta), y) - \lambda \|\delta\|_p \right) \] Then weak duality gives \[\min_\theta \max_{\|\delta\|_p \leq \epsilon} \ell(f_\theta(x + \delta), y) \leq \min_\theta \inf_{\lambda \geq 0} \left( \lambda \epsilon + \psi_\lambda(\theta) \right) \] Equality (strong duality) holds when $\delta \mapsto \ell(f_\theta(x + \delta), y)$ is concave and $\epsilon > 0$. Convexity of the norm ball alone is not sufficient: for neural networks the inner objective is generally non-concave in $\delta$, and the right-hand side is then only an upper bound on the robust loss.

Formal Proof: Introduce the Lagrangian of the inner maximization: \[\mathcal{L}(\theta, \delta, \lambda) = \ell(f_\theta(x + \delta), y) - \lambda(\|\delta\|_p - \epsilon) \] For any feasible $\delta$ (that is, $\|\delta\|_p \leq \epsilon$) and any $\lambda \geq 0$, the term $-\lambda(\|\delta\|_p - \epsilon)$ is nonnegative, so \[\ell(f_\theta(x + \delta), y) \leq \mathcal{L}(\theta, \delta, \lambda) \leq \sup_{\delta'} \mathcal{L}(\theta, \delta', \lambda) = \lambda \epsilon + \psi_\lambda(\theta) \] Taking the maximum over feasible $\delta$ on the left and the infimum over $\lambda \geq 0$ on the right gives $\phi_\epsilon(\theta) \leq \inf_{\lambda \geq 0} (\lambda \epsilon + \psi_\lambda(\theta))$, and minimizing both sides over $\theta$ gives the stated inequality. When the inner objective is concave in $\delta$, the inner problem maximizes a concave function over a convex set with a strictly feasible point ($\delta = 0$, using $\epsilon > 0$), so Slater's condition holds and standard Lagrangian duality yields equality.

Interpretation: The dual formulation replaces the hard budget with a price on perturbation size: the adversary may perturb as far as it likes but pays $\lambda$ per unit of $\|\delta\|_p$. A large $\lambda$ makes perturbations expensive and corresponds to a weak adversary (small effective budget), whereas a small $\lambda$ corresponds to a strong adversary. The optimal $\lambda$ balances the two terms $\lambda \epsilon$ and $\psi_\lambda(\theta)$.

Explicit ML Relevance: The penalized form is the basis for training objectives that fix $\lambda$ and minimize $\psi_\lambda(\theta)$ directly, trading the explicit budget $\epsilon$ for a penalty weight. In this sense robust optimization can be read as a form of Lagrangian regularization: the term $-\lambda \|\delta\|_p$ limits how far the adversary moves, and the model is trained against the resulting penalized adversary. First-order methods such as dual averaging or mirror descent can be applied to the dual variable, but whether this is faster than primal approaches such as PGD-based adversarial training depends on the problem.
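As a check on the Lagrangian view of the inner maximization, consider the linearized loss $\ell(\delta) = \ell_0 + g^T \delta$ with an $\ell_2$ budget, where $g = \nabla_x \ell$ and $\ell_0$ is the clean loss (a worked example under this linearity assumption, not a statement about general networks). The constrained problem has the closed form \[\max_{\|\delta\|_2 \leq \epsilon} \left( \ell_0 + g^T \delta \right) = \ell_0 + \epsilon \|g\|_2 \] attained at $\delta = \epsilon\, g / \|g\|_2$. The penalized problem with price $\lambda \geq 0$ per unit of $\|\delta\|_2$ is \[\sup_{\delta} \left( \ell_0 + g^T \delta - \lambda \|\delta\|_2 \right) = \begin{cases} \ell_0 & \text{if } \lambda \geq \|g\|_2 \\ +\infty & \text{if } \lambda < \|g\|_2 \end{cases} \] since $g^T \delta \leq \|g\|_2 \|\delta\|_2$ by Cauchy-Schwarz. Adding back $\lambda \epsilon$ and minimizing over $\lambda$ gives \[\inf_{\lambda \geq \|g\|_2} \left( \lambda \epsilon + \ell_0 \right) = \ell_0 + \epsilon \|g\|_2 \] with optimal price $\lambda^* = \|g\|_2$. The two values coincide because a linear objective is concave in $\delta$; for a non-concave objective the penalized value can be strictly larger.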

Theorem 5: Margin-Based Robustness Guarantee

Formal Statement: Consider a binary classifier $f(x) = \text{sign}(w^T x + b)$ with geometric margin $m(x) = y(w^T x + b) / \|w\|_2$ on a point $x$ with label $y \in \{-1, +1\}$. If $m(x) > 0$ (correct classification), the prediction is unchanged under every $\ell_2$-perturbation with $\|\delta\|_2 < m(x)$, and this radius is tight: $r(x) = m(x)$. In particular, if $m(x) > \gamma$ for some margin threshold $\gamma > 0$, the classifier is robustly correct at $x$ for all perturbations with $\|\delta\|_2 \leq \gamma$. Consequently, for a model trained to maximize the margin (gap between separating hyperplane and nearest data point), the expected robustness radius over correctly classified points equals the expected margin: $\mathbb{E}[r(x)] = \mathbb{E}[m(x)]$.

Formal Proof: For a linear classifier with margin $m(x) = y(w^T x + b) / \|w\|_2 > 0$, a perturbation $\delta$ changes the prediction only if \[y(w^T(x + \delta) + b) \leq 0 \Rightarrow y(w^T x + b) \leq -y\, w^T \delta \leq \|w\|_2 \|\delta\|_2 \Rightarrow m(x) \leq \|\delta\|_2 \] where the middle inequality is Cauchy-Schwarz. Thus, misclassification requires $\|\delta\|_2 \geq m(x)$. The bound is attained by $\delta^* = -y\, m(x)\, w / \|w\|_2$, which moves $x$ exactly onto the separating hyperplane. Therefore, the robustness radius is $r(x) = m(x)$, and for margin $m(x) > \gamma$ we have $r(x) > \gamma$.

Interpretation: This theorem formalizes the intuition that models with large margins are more robust. A large margin provides a buffer against perturbations: small perturbations cannot flip the prediction as long as the margin is maintained. For a linear classifier the geometric margin is exactly the $\ell_2$ distance from $x$ to the decision boundary, so robustness depends on the data geometry only through how far the points lie from the hyperplane; well-separated classes admit larger margins and therefore larger robustness radii.

Explicit ML Relevance: The margin-based robustness guarantee motivates margin-maximizing learning algorithms (SVMs, large-margin training) as robust methods. Extensions to neural networks suggest that implicit bias toward large margins (e.g., in the NTK regime) is related to robustness. Furthermore, this theorem shows that achieving robustness requires not just fitting the training data but ensuring predictions are confident (large margins).
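For a linear classifier, the relationship between geometric margin and $\ell_2$ distance to the decision boundary can be checked numerically. The sketch below uses made-up weights with $\|w\|_2 = 5$, constructs the closest boundary point, and verifies that random perturbations just inside the margin never flip the prediction.

```python
import numpy as np

w, b = np.array([3.0, -4.0]), 1.0              # ||w||_2 = 5
x, y = np.array([2.0, 0.5]), 1.0

score = w @ x + b                               # 6 - 2 + 1 = 5
geom_margin = y * score / np.linalg.norm(w)     # = 1.0

# Closest point on the hyperplane: move against y*w by exactly the margin.
delta_star = -y * geom_margin * w / np.linalg.norm(w)
boundary_score = w @ (x + delta_star) + b       # approximately 0

rng = np.random.default_rng(0)
flips = 0
for _ in range(10000):
    d = rng.normal(size=2)
    d *= 0.999 * geom_margin / np.linalg.norm(d)   # radius just below the margin
    flips += int(y * (w @ (x + d) + b) <= 0)

print(f"margin={geom_margin:.3f}  boundary score={boundary_score:.2e}  flips={flips}")
```

The perturbation `delta_star` has norm equal to the margin and lands on the boundary, while no perturbation of smaller norm changes the sign of the score.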

Theorem 6: Sensitivity Bound via Hessian Norm

Formal Statement: Let $f$ be twice continuously differentiable and write $\ell(f(x))$ for the loss composed with the model at a fixed label. Suppose the input Hessian satisfies $\|\nabla^2_x \ell(f(z))\|_{\text{op}} \leq H$ for all $z$ with $\|z - x\|_2 \leq \epsilon$, where $\|\cdot\|_{\text{op}}$ is the operator norm (largest absolute eigenvalue of the Hessian). Then for any perturbation with $\|\delta\|_2 \leq \epsilon$, the error of the first-order approximation is bounded by \[|\ell(f(x + \delta)) - \ell(f(x)) - \nabla_x \ell(f(x))^T \delta| \leq \frac{1}{2} H \epsilon^2 \] Consequently, the total loss change is bounded by a gradient term plus a curvature term: \[|\ell(f(x + \delta)) - \ell(f(x))| \leq \epsilon \|\nabla_x \ell(f(x))\|_2 + \frac{1}{2} H \epsilon^2 \]

Formal Proof: By Taylor's theorem with the Lagrange form of the remainder, there is a point $\xi$ on the segment between $x$ and $x + \delta$ such that \[\ell(f(x + \delta)) = \ell(f(x)) + \nabla_x \ell(f(x))^T \delta + \frac{1}{2}\delta^T \nabla^2_x \ell(f(\xi))\, \delta \] Rearranging and using $|\delta^T A \delta| \leq \|A\|_{\text{op}} \|\delta\|_2^2$: \[|\ell(f(x + \delta)) - \ell(f(x)) - \nabla_x \ell(f(x))^T \delta| = \left| \frac{1}{2}\delta^T \nabla^2_x \ell(f(\xi))\, \delta \right| \leq \frac{1}{2} \|\delta\|_2^2 \|\nabla^2_x \ell(f(\xi))\|_{\text{op}} \leq \frac{1}{2} H \epsilon^2 \] The second inequality of the statement follows from the triangle inequality and Cauchy-Schwarz, $|\nabla_x \ell(f(x))^T \delta| \leq \|\nabla_x \ell(f(x))\|_2 \|\delta\|_2 \leq \epsilon \|\nabla_x \ell(f(x))\|_2$.

Interpretation: This theorem shows that the curvature of the loss landscape (quantified by the Hessian norm) affects sensitivity to perturbations. Models with small Hessian norms (flat loss landscapes) are less sensitive to perturbations than models with large Hessian norms (sharp loss landscapes). This is the theoretical basis for the claim that flat minima are more robust: flatter loss landscapes (smaller Hessian eigenvalues) lead to bounded loss changes under perturbations.

Explicit ML Relevance: Sharpness-Aware Minimization (SAM) seeks parameters in flat regions of the loss by perturbing the parameters and minimizing the worst-case loss in that neighborhood. Note that SAM controls curvature with respect to the parameters $\theta$, whereas this theorem concerns curvature with respect to the input $x$, so the theorem motivates SAM by analogy rather than justifying it directly; the input-space counterpart is regularization of input gradients and input curvature. Similarly, regularizers such as $\ell_2$ weight decay may reduce the input Hessian norm indirectly by limiting weight magnitudes, but they do not bound it explicitly.
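The second-order remainder bound can be verified on a model whose input Hessian is known. For logistic loss on a linear score, the Hessian with respect to $x$ is $s(1 - s)\, w w^T$ with $s \in (0, 1)$, so its operator norm is at most $\|w\|_2^2 / 4$ everywhere. The sketch below, with made-up weights and budget, checks the remainder against $\frac{1}{2} H \epsilon^2$ over random directions.

```python
import numpy as np

w, b, y = np.array([1.5, -2.0, 0.5]), 0.0, 1.0
x = np.array([0.4, -0.3, 0.2])

def loss(x):
    return np.logaddexp(0.0, -y * (w @ x + b))

def grad(x):
    return -y * w / (1.0 + np.exp(y * (w @ x + b)))

# Global bound on the operator norm of the input Hessian for this model.
H = 0.25 * (w @ w)

rng = np.random.default_rng(0)
eps, worst_gap = 0.3, 0.0
for _ in range(2000):
    d = rng.normal(size=3)
    d *= eps / np.linalg.norm(d)                  # ||d||_2 = eps
    remainder = abs(loss(x + d) - loss(x) - grad(x) @ d)
    worst_gap = max(worst_gap, remainder)
bound = 0.5 * eps**2 * H

print(f"largest first-order error={worst_gap:.5f}  bound={bound:.5f}")
```

For a deep network $H$ is not available in closed form and would have to be estimated, for example by power iteration on Hessian-vector products; that step is outside this sketch.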

Theorem 7: Certified Radius Bound Under Norm Constraint

Formal Statement: Suppose a classifier $f$ with logit vector $f(x) \in \mathbb{R}^k$ is trained with spectral normalization such that it is $L$-Lipschitz with respect to the $\ell_2$ norm: $\|f(x) - f(x')\|_2 \leq L \|x - x'\|_2$. Further, suppose the model achieves margin $m(x) = f(x)_y - \max_{c \neq y} f(x)_c \geq m_0 > 0$ at a test point $x$ with label $y$. Then the certified robustness radius is at least \[r(x) \geq \frac{m_0}{\sqrt{2}\, L} \] for $\ell_2$ perturbations (or appropriately adjusted for other norms). No adversarial perturbation with $\|\delta\|_2 < r(x)$ can flip the classification. If instead each logit is only known to be individually $L$-Lipschitz, the same argument gives the weaker radius $m_0 / (2L)$.

Formal Proof: For any competing class $c \neq y$, consider the logit gap $g_c(x) = f(x)_y - f(x)_c = (e_y - e_c)^T f(x)$, where $e_y$ and $e_c$ are standard basis vectors. By Cauchy-Schwarz and the Lipschitz property, \[|g_c(x + \delta) - g_c(x)| \leq \|e_y - e_c\|_2 \, \|f(x + \delta) - f(x)\|_2 \leq \sqrt{2}\, L \|\delta\|_2 \] Since $g_c(x) \geq m_0$ for every $c \neq y$, \[f(x + \delta)_y - f(x + \delta)_c = g_c(x + \delta) \geq m_0 - \sqrt{2}\, L \|\delta\|_2 \] For the true class to remain the largest logit, we require $g_c(x + \delta) > 0$ for all $c \neq y$, which holds whenever $\|\delta\|_2 < m_0 / (\sqrt{2}\, L)$. This certifies robustness up to radius $r(x) = m_0 / (\sqrt{2}\, L)$. If each logit is individually $L$-Lipschitz, the true class logit can drop by at most $L \|\delta\|_2$ and a competing logit can rise by at most $L \|\delta\|_2$, so the gap shrinks by at most $2L \|\delta\|_2$ and the radius is $m_0 / (2L)$.

Interpretation: This theorem provides a constructive method for certifying robustness: (1) Train the model with spectral normalization to bound the Lipschitz constant $L$, (2) Compute the margin $m_0$ on a test point, (3) The certified robustness radius is $r(x) = m_0 / (\sqrt{2}\, L)$. For linear or simple models this can be competitive with abstract interpretation or randomized smoothing, but the certificate becomes loose when the product of layer norms greatly overestimates the true Lipschitz constant, and the guarantee is weak if the margin is small or the Lipschitz constant is large.

Explicit ML Relevance: Lipschitz-constrained models (via spectral normalization) enable simple certified robustness guarantees. This is a practical defense mechanism: practitioners can easily compute certified radii and adjust model architectures to improve them (e.g., by increasing margins or reducing Lipschitz constants via regularization).
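The margin-over-Lipschitz computation can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names, the `mode` argument, and the example logits are assumptions introduced here. The `mode` argument makes explicit which quantity the Lipschitz constant is assumed to bound, since that choice changes the certified radius by a factor of $1$, $\sqrt{2}$, or $2$.

```python
import numpy as np

def logit_margin(logits, label):
    """Margin m(x) = f(x)_y - max_{c != y} f(x)_c."""
    logits = np.asarray(logits, dtype=float)
    others = np.delete(logits, label)
    return float(logits[label] - others.max())

def certified_radius(logits, label, lipschitz, mode="margin"):
    """Lower bound on the l2 radius inside which the prediction cannot change.

    mode="margin":    `lipschitz` bounds each logit difference f_y - f_c.
    mode="vector":    `lipschitz` bounds the whole logit vector in l2 (extra sqrt(2)).
    mode="per_logit": `lipschitz` bounds each logit separately (extra factor 2).
    """
    factor = {"margin": 1.0, "vector": np.sqrt(2.0), "per_logit": 2.0}[mode]
    m = logit_margin(logits, label)
    # A non-positive margin means the point is already misclassified: radius 0.
    return max(m, 0.0) / (factor * lipschitz)

# Hypothetical logits and Lipschitz constant, chosen only for illustration.
example_logits = [2.0, 0.5, -1.0]
for mode in ("margin", "vector", "per_logit"):
    print(mode, certified_radius(example_logits, label=0, lipschitz=3.0, mode=mode))
```

Reporting which convention was used alongside the radius avoids comparing certificates that differ only by these constant factors.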

Theorem 8: Tradeoff Between Robustness and Accuracy

Formal Statement: There exists a fundamental trade-off between standard accuracy and adversarial robustness. Specifically, for any learner achieving high adversarial robustness to $\ell_\infty$-perturbations with budget $\epsilon$ on a dataset $D$, the model’s standard accuracy (on clean, non-perturbed data) is constrained: \[\text{Acc}_{\text{robust}}(\epsilon) + \text{Acc}_{\text{standard}} \leq C(\epsilon, d) \] where $C(\epsilon, d)$ is a function of the perturbation budget and dimension $d$, with $C \approx 1 + \alpha (1 + d \ln(1/\epsilon))$ for some constant $\alpha > 0$. Empirically, training models to be robust to $\epsilon = 8/255$ perturbations on ImageNet results in standard accuracy dropping by 10-20 percentage points compared to non-robust training.

Formal Proof: (Sketch) The intuition is that robustness and accuracy favor different decision boundaries: a boundary robust to perturbations must stay at least $\epsilon$ away from the data (so that no admissible perturbation crosses it), whereas the most accurate boundary may need to pass close to the data to separate nearby points of different classes. When points of different classes lie within $2\epsilon$ of each other, both requirements cannot be met at once. Heuristically, the same tension appears in sample complexity: a robust learner must fit the training data (accuracy constraint) while keeping its prediction constant on an $\epsilon$-ball around each example (robustness constraint), which enlarges the effective complexity of the hypothesis class. With finite samples, enforcing robustness can therefore degrade accuracy. Quantitatively, the sketch takes the effective VC dimension to scale as $\Omega(d \ln(1/\epsilon))$, leading to generalization error $\Omega(d \ln(1/\epsilon) / m)$ for $m$ training samples. To hold this term constant, $m$ must grow in proportion to $d \ln(1/\epsilon)$ (linearly in the dimension and logarithmically in $1/\epsilon$); otherwise accuracy on a fixed dataset must degrade.

Interpretation: This theorem formalizes the empirical observation that robust training hurts standard accuracy. It is not a property of specific algorithms (PGD training, TRADES, etc.) but a property of the robust learning problem itself. The trade-off becomes more pronounced as the perturbation budget $\epsilon$ grows and vanishes as $\epsilon \to 0$, where robust accuracy coincides with standard accuracy; it also depends on the input dimension $d$. However, the trade-off is not absolute: well-tuned training can achieve better robustness-accuracy Pareto frontiers than naive approaches.

Explicit ML Relevance: Understanding the robustness-accuracy trade-off informs practical decisions about when to pursue robust training. For applications where standard accuracy is paramount (e.g., medical diagnosis on mostly benign data), a small robustness radius may be acceptable. For security-critical applications (e.g., surveillance systems), robust training is necessary despite the accuracy cost. Recent work aims to improve the trade-off via better architectures, training procedures, and implicit biases, partially mitigating (but not eliminating) the fundamental trade-off.

Worked Examples

Example 1 — Linear Classifier Under Adversarial Perturbation

Consider a binary classification problem where we classify 2D points $(x_1, x_2)$ using a linear classifier with weights $w = (2, 1)^T$ and bias $b = 0$. The decision boundary is the line $2x_1 + x_2 = 0$, and predictions are made via $f(x) = \text{sign}(2x_1 + x_2)$. Now, suppose we have a test point $x = (1, 0)$ with true label $y = +1$ (correctly classified, since $2(1) + 0 = 2 > 0$). We want to understand how robust this classification is to adversarial perturbations.

To analyze robustness, we compute the margin at this point. The raw score is $w^T x = 2(1) + 0 = 2$, and the normalized (geometric) margin is $2 / \|(2, 1)\|_2 = 2 / \sqrt{5} \approx 0.894$, which is the Euclidean distance from $x$ to the decision boundary. The Lipschitz constant of the score function is $L_f = \|(2, 1)\|_2 = \sqrt{5} \approx 2.236$, indicating the sensitivity of the score to input changes. By Theorem 7, applied with the raw margin $m_0 = 2$, the certified robustness radius is $r(x) = m_0 / L_f = 2 / \sqrt{5} \approx 0.894$ in $\ell_2$-norm. Note that the normalized margin already includes the division by $\|w\|_2$ and must not be divided by $L_f$ a second time. Any $\ell_2$-perturbation with $\|\delta\|_2 < 0.894$ cannot flip the classification.

To verify this explicitly, consider an $\ell_2$-perturbation of magnitude $0.4$, well inside the certified radius: the worst-case perturbation within this budget, which moves straight toward the decision boundary, is $\delta^* = -0.4 \cdot \frac{(2, 1)}{\|(2,1)\|_2} = -0.4 \cdot \frac{(2,1)}{\sqrt{5}} \approx (-0.358, -0.179)$. The perturbed point is $x + \delta^* \approx (0.642, -0.179)$, and the score is $2(0.642) - 0.179 \approx 1.105 > 0$, still predicting $+1$ correctly. At magnitude $0.894$ the same direction lands exactly on the decision boundary (score $2 - 0.894 \cdot \sqrt{5} = 0$), and at magnitude $1.0$ the perturbed point is approximately $(0.106, -0.447)$ with score $\approx -0.236 < 0$, so the prediction flips. This demonstrates the practical meaning of the certified robustness radius.

A common misconception is that the certified radius provided by Theorem 7 is always tight, that is, that it equals the true adversarial robustness radius. For this linear classifier it is tight: the certificate coincides with the exact distance to the decision boundary. For deep networks, however, the certificate is typically conservative, because a global Lipschitz bound overestimates the local sensitivity of the model; it bounds robustness from below, providing a formal guarantee but potentially underestimating practical robustness. Another misconception is that linear classifiers are inherently robust; while they have simple Lipschitz constants, they are still vulnerable to adversarial perturbations when the margin is small.

What if the weights were instead $w = (10, 1)$? Then $L_f = \sqrt{101} \approx 10.05$, but the raw score at $x = (1, 0)$ also grows to $10$, so the certified radius is $10 / \sqrt{101} \approx 0.995$: the radius changes only because the boundary has rotated, not because the weights are larger. Rescaling the weights by any constant $\alpha > 0$ multiplies both the raw margin and the Lipschitz constant by $\alpha$ and leaves the radius unchanged. This illustrates a subtle but important principle: robustness depends on the ratio of margin to Lipschitz constant, not on either quantity alone. A large raw margin obtained by scaling up the weights buys no robustness, and a raw margin is meaningful only relative to the Lipschitz constant of the model that produced it.

Explicit ML Relevance: This example demonstrates why practitioners use spectral normalization in deep networks: once the Lipschitz constant is bounded, a logit margin converts directly into a certified radius. It also shows an advantage of linear models: their Lipschitz constant is simply the weight norm, which is far easier to control than the Lipschitz constant of a neural network, enabling exact robustness guarantees.
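The arithmetic of this example can be checked numerically. The sketch below (function names are introduced here for illustration) computes the exact $\ell_2$ distance from $x$ to the hyperplane and the score after moving straight toward the boundary for several budgets, including the budget $0.4$ used above.

```python
import numpy as np

def linear_l2_radius(w, b, x):
    """Exact l2 distance from x to the hyperplane w.x + b = 0."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(abs(w @ x + b) / np.linalg.norm(w))

def worst_case_point(w, b, x, budget):
    """Move x by `budget` (in l2) straight toward the decision boundary."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    score = w @ x + b
    return x - budget * np.sign(score) * w / np.linalg.norm(w)

w = np.array([2.0, 1.0])
x = np.array([1.0, 0.0])
r = linear_l2_radius(w, 0.0, x)
print("distance to boundary:", r)

for budget in (0.4, 0.8, 1.0):
    x_adv = worst_case_point(w, 0.0, x, budget)
    print(budget, x_adv, float(w @ x_adv))

# Rescaling the weights changes the raw score and the Lipschitz constant
# by the same factor, so the radius is unchanged.
print("radius after scaling w by 5:", linear_l2_radius(5.0 * w, 0.0, x))
print("radius for w = (10, 1):", linear_l2_radius([10.0, 1.0], 0.0, x))
```

The score stays positive for budgets below the distance to the boundary and turns negative above it, which is the sense in which the radius is exact for a linear classifier.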

Example 2 — Gradient Sign Attack Geometry

Imagine a simple neural network $f_\theta$ trained on MNIST to classify handwritten digits. Consider a correctly classified digit “3” with input $x$ and true label $y = 3$. The cross-entropy loss on this example is $\ell(f(x), 3) = 0.1$ (low, since the model is confident). We compute the input gradient: $\nabla_x \ell = (g_1, g_2, \ldots, g_{784})$ where each $g_i$ represents how much a change in pixel $i$ increases the loss. Suppose the gradient norm is $\|\nabla_x \ell\|_\infty = 0.05$ (reasonable magnitude).

The Fast Gradient Sign Method (FGSM) attack takes a step in the direction of the sign of the gradient: $\delta_{\text{FGSM}} = \epsilon \cdot \text{sign}(\nabla_x \ell)$ with perturbation budget $\epsilon = 0.3$ (on a $[0, 1]$ pixel scale). Geometrically, this moves the input to a corner of the $\ell_\infty$-ball of radius $\epsilon$, the point that maximizes the first-order (linearized) loss increase within the ball. The resulting adversarial example is $x_{\text{adv}} = x + \delta_{\text{FGSM}}$ (clipped to the valid pixel range), where each pixel with a nonzero gradient is shifted by exactly $0.3$ up or down according to the gradient sign. Because only the sign is used, the size of $g_i$ does not matter: a pixel with a tiny gradient is perturbed as much as a pixel with a large one, so the perturbation is dense even when the gradient magnitude is concentrated on a few pixels.

By Theorem 3, the first-order approximation predicts the loss increase: $\ell(f(x_{\text{adv}}), 3) \approx \ell(f(x), 3) + \epsilon \|\nabla_x \ell\|_1$. If $\|\nabla_x \ell\|_1 \approx 5$ (sum of absolute gradient components), the predicted loss is roughly $0.1 + 0.3 \cdot 5 = 1.6$, a dramatic increase. In practice, the actual loss might be $1.8$, reasonably close to the prediction. The approximation is only guaranteed to be accurate for small $\epsilon$, and $\epsilon = 0.3$ is a large budget on a $[0, 1]$ pixel scale, so this agreement should be checked empirically rather than assumed.

A critical misconception is that FGSM finds the optimal adversarial example within the perturbation budget. In fact, FGSM is greedy: it makes a single large step in the direction of steepest ascent, but due to non-convexity of the loss landscape, this direction may not lead to the global maximum within the $\ell_\infty$-ball. Iterative methods like PGD, which repeat gradient steps with projections back into the threat model, are empirically more powerful. Another misconception is that the gradient always points toward misclassification; for adversarially trained models, the gradient landscape is dramatically different, and the same FGSM procedure may be much less effective.

What if we used a different threat model, say $\ell_2$ instead of $\ell_\infty$? With the $\ell_2$ analogue of FGSM, we compute $\delta_{\ell_2} = \epsilon \cdot \frac{\nabla_x \ell}{\|\nabla_x \ell\|_2}$, normalizing the gradient to unit $\ell_2$-norm before scaling. This concentrates the perturbation on high-gradient pixels, in proportion to $|g_i|$, rather than moving every pixel by the same amount as the sign step does. Budgets in the two norms are not directly comparable: on 784-pixel inputs, an $\ell_\infty$ perturbation of size $\epsilon$ can have $\ell_2$-norm as large as $\epsilon \sqrt{784} = 28 \epsilon$, so an $\ell_2$-ball and an $\ell_\infty$-ball with the same numerical $\epsilon$ describe very different adversaries. Comparisons of attack strength across norms are meaningful only after fixing how the budgets are matched.

Explicit ML Relevance: Understanding FGSM geometry is essential for adversarial training. Algorithms like PGD-based adversarial training repeatedly apply FGSM-like steps (or stronger attacks) and update model weights to reduce worst-case loss. The gradient structure revealed by FGSM also motivates gradient regularization techniques: penalizing large gradients in adversarial training can reduce the effectiveness of gradient-based attacks.
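The geometry of the two single-step attacks can be illustrated on a model whose input gradient is available in closed form. The sketch below is a hedged stand-in, not the MNIST network of the example: it assumes a linear softmax classifier with random (untrained) weights, a random input, and a budget of $0.1$ chosen so that clipping stays inactive. All names are introduced here for illustration. For this model the cross-entropy loss is convex in the input, so the first-order prediction is a lower bound on the loss after the step.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss_and_input_grad(W, b, x, y):
    """Cross-entropy of a linear softmax model and its gradient with respect to x."""
    p = softmax(W @ x + b)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return float(-np.log(p[y])), W.T @ (p - onehot)

def fgsm_linf(x, grad, eps):
    """Sign step: every pixel with nonzero gradient moves by exactly eps."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

def fgsm_l2(x, grad, eps):
    """Normalized-gradient step: pixel i moves in proportion to |g_i|."""
    return np.clip(x + eps * grad / (np.linalg.norm(grad) + 1e-12), 0.0, 1.0)

rng = np.random.default_rng(0)
d, k, eps = 784, 10, 0.1
W = rng.normal(scale=0.05, size=(k, d))  # hypothetical weights, not a trained model
b = np.zeros(k)
x = rng.uniform(0.2, 0.8, size=d)        # away from 0 and 1, so clipping stays inactive
y = int(np.argmax(W @ x + b))            # use the model's own prediction as the label

loss, g = loss_and_input_grad(W, b, x, y)
x_inf, x_l2 = fgsm_linf(x, g, eps), fgsm_l2(x, g, eps)
loss_inf, _ = loss_and_input_grad(W, b, x_inf, y)
loss_l2, _ = loss_and_input_grad(W, b, x_l2, y)

print("clean loss:", loss)
print("linf: first-order prediction", loss + eps * np.abs(g).sum(), "actual", loss_inf)
print("l2:   first-order prediction", loss + eps * np.linalg.norm(g), "actual", loss_l2)
print("l2 norm of the linf step:", np.linalg.norm(x_inf - x), "vs eps:", eps)
```

The last line shows why equal numerical budgets are not comparable across norms: the sign step has $\ell_2$-norm $\epsilon \sqrt{d}$, which is 28 times the $\ell_2$ step here.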

Example 3 — Lipschitz Bound on Model Sensitivity

Consider a neural network $f$ trained on CIFAR-10 (32×32 RGB images) to classify into 10 classes. Without any regularization, the network may have a very large Lipschitz constant: $L_f \approx 1000$. This means that a small input change of $\|\delta\|_2 = 0.01$ can potentially cause an output change of up to $\|f(x + \delta) - f(x)\| \leq 1000 \cdot 0.01 = 10$, which for a 10-class output space means the logits can change dramatically and flip predictions.

Now, suppose we apply spectral normalization to each layer of the network, constraining the spectral norm (largest singular value) of each weight matrix to 1. With 1-Lipschitz activations such as ReLU, the Lipschitz constant of the network is then at most $L_f = 1$ (the product of the per-layer constants, each bounded by 1). The same input perturbation now causes an output change of at most $\|f(x + \delta) - f(x)\| \leq 1 \cdot 0.01 = 0.01$, which is negligible. No logit can move by more than 0.01, so the prediction cannot change unless the original margin was already tiny.

To illustrate the tradeoff, consider a test image classified correctly with margin $m = 2.0$ (true class logit 2.0 above the nearest competing class). The margin is a difference of two logits, so when the logit vector moves by at most $L_f \|\delta\|_2$ in $\ell_2$, the margin moves by at most $\sqrt{2} L_f \|\delta\|_2$. With $L_f = 1$ and a perturbation $\|\delta\|_2 = 0.5$, the margin remains at least $2.0 - \sqrt{2} \cdot 0.5 \approx 1.29$ (still correct). With $L_f = 1000$ and the same perturbation, the bound allows an output change as large as 500, so nothing can be guaranteed. By Theorem 1, applied to the logit difference, the certified robustness radius with spectral normalization is $r(x) = m / (\sqrt{2} L_f) = 2.0 / 1.414 \approx 1.41$, providing a formal robustness guarantee.

However, spectral normalization is not cost-free. Constraining the Lipschitz constant forces the network to be less expressive: the function class capacity is reduced. In practice, models trained with spectral normalization on CIFAR-10 may have standard accuracy 3–5 percentage points lower than baseline models, trading robustness for accuracy. Another common misconception is that spectral normalization is the only way to control Lipschitz constants; other techniques like gradient penalty, layer normalization, and careful weight initialization also affect Lipschitz constants, though less directly.

What if we only apply spectral normalization to certain layers, say the first and last layers? This hybrid approach allows the middle layers more flexibility while constraining the sensitivity at the boundaries (input and output). The product bound then no longer controls the network by construction: the overall Lipschitz constant depends on the unconstrained middle layers and must be estimated or bounded separately (it might come out intermediate, e.g., $L_f \approx 10$). Such selective normalization offers a middle ground between expressiveness and robustness, and whether it improves the robustness-accuracy trade-off over full normalization is an empirical question for the task at hand.

Explicit ML Relevance: Spectral normalization is a practical and widely used mechanism for controlling model sensitivity. Understanding Lipschitz bounds explains why it works: by controlling the Lipschitz constant, the model is provably robust to input perturbations smaller than its certified radius. The certified radius formula $r(x) = m / L_f$, with $L_f$ the Lipschitz constant of the logit margin, is simple but powerful, enabling practitioners to compute robustness guarantees for deployed models.
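The product bound on the Lipschitz constant can be sketched directly. The snippet below is a minimal illustration under stated assumptions: a small ReLU network with random weights (the layer sizes and helper names are hypothetical, not taken from a CIFAR-10 model), with spectral normalization applied as a one-off rescaling rather than during training.

```python
import numpy as np

def spectral_norm(W):
    """Largest singular value of a weight matrix."""
    return float(np.linalg.norm(W, 2))

def lipschitz_upper_bound(weights):
    """Product of layer spectral norms; valid for 1-Lipschitz activations such as ReLU."""
    return float(np.prod([spectral_norm(W) for W in weights]))

def spectrally_normalize(W):
    """Rescale W so that its largest singular value is 1."""
    return W / spectral_norm(W)

def relu_mlp(weights, x):
    """Bias-free ReLU network; the last layer is linear."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

rng = np.random.default_rng(0)
raw = [rng.normal(size=(64, 32)), rng.normal(size=(64, 64)), rng.normal(size=(10, 64))]
normed = [spectrally_normalize(W) for W in raw]

print("bound without normalization:", lipschitz_upper_bound(raw))
print("bound with normalization:   ", lipschitz_upper_bound(normed))

# Empirical check of the bound on one random input and perturbation.
x = rng.normal(size=32)
delta = 0.01 * rng.normal(size=32)
change = np.linalg.norm(relu_mlp(normed, x + delta) - relu_mlp(normed, x))
print("output change:", change, "<= perturbation norm:", np.linalg.norm(delta))
```

The product of spectral norms is only an upper bound; the true Lipschitz constant of a given network can be much smaller, which is one reason certificates built on it tend to be conservative.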

Example 4 — Robust vs Standard Risk Comparison

Consider a toy classification task on synthetic data: two Gaussian clusters in 2D, each with 100 points (200 points in total). A standard logistic regression model achieves 98% accuracy on this set. Now, we evaluate both standard and robust risk. The standard risk (evaluated on clean, non-perturbed data) is $R_{\text{standard}} = 0.02$ (2% error rate). To compute robust risk with $\ell_\infty$-threat model $\epsilon = 0.1$, we find for each point the worst-case perturbation within $\|\delta\|_\infty \leq 0.1$ and evaluate the loss. Using PGD attack with 10 steps, we find that many of the 196 correctly classified points are now misclassified under adversarial perturbations. The robust risk is $R_{\text{robust}} = 0.35$ (35% error rate under worst-case perturbations), dramatically higher than standard risk.

This gap between $R_{\text{robust}} = 0.35$ and $R_{\text{standard}} = 0.02$ reveals a fundamental vulnerability: the model is not robust to small perturbations even though it achieves high standard accuracy. To improve robust risk, we apply adversarial training by solving the robust optimization problem: $\min_\theta \frac{1}{m} \sum_{i=1}^m \max_{\|\delta\|_\infty \leq 0.1} \ell(f_\theta(x_i + \delta), y_i)$. After adversarial training, the model achieves $R_{\text{standard}} = 0.05$ and $R_{\text{robust}} = 0.08$. Standard risk rises slightly (2% → 5%, i.e., standard accuracy drops 98% → 95%), consistent with the robustness-accuracy trade-off, while robust risk improves dramatically (35% → 8%).

The robust risk curve as a function of $\epsilon$ is monotonically non-decreasing: enlarging the budget only gives the adversary more options, and as $\epsilon \to 0$ robust risk approaches standard risk. For the adversarially trained logistic regression, at $\epsilon = 0$, robust risk equals standard risk (5%). At $\epsilon = 0.1$, robust risk is 8%. At $\epsilon = 1.0$ (large relative to the feature scale), the adversary can push almost every point across the boundary of a fixed classifier, so its robust risk approaches 100%; at such budgets a constant prediction, with 50% error on two balanced classes, is about the best that can be guaranteed. A common misconception is that robust training directly improves standard accuracy; in fact, adversarial training often slightly hurts standard accuracy due to the robustness-accuracy trade-off (Theorem 8), though the loss can be mitigated with careful hyperparameter tuning.

What if we had used a stronger threat model, say $\epsilon = 0.5$? The robust risk for the non-adversarially trained model would be much higher, perhaps 60%, reflecting the model’s extreme vulnerability to larger perturbations. However, after adversarial training against the stronger threat model, the robust risk at $\epsilon = 0.5$ would be much lower, e.g., 10%, while the standard accuracy might drop more substantially (95% → 80%), illustrating the increasing cost of robustness to stronger threat models.

Explicit ML Relevance: Computing robust risk is essential for evaluating realistic security postures. Standard accuracy alone is insufficient; practitioners must evaluate robustness across multiple threat models and perturbation budgets. The robust risk curve $R_{\text{robust}}(\epsilon)$ is a key diagnostic, revealing vulnerabilities and guiding the choice of threat models to defend against.
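For a linear classifier the robust risk curve can be computed exactly, without running PGD, because the worst-case $\ell_\infty$ perturbation of size $\epsilon$ reduces the signed margin $y(w^T x + b)$ by exactly $\epsilon \|w\|_1$. The sketch below assumes synthetic Gaussian clusters and a hand-picked classifier (both hypothetical, not the fitted model of the example) and evaluates the curve on a small grid of budgets.

```python
import numpy as np

def robust_error_linf(w, b, X, y, eps):
    """Exact robust 0-1 error of sign(w.x + b) under l_inf perturbations of size eps.

    A point counts as an error if some perturbation in the ball puts it on or
    across the decision boundary.
    """
    margins = y * (X @ w + b)
    return float(np.mean(margins - eps * np.abs(w).sum() <= 0))

rng = np.random.default_rng(0)
n = 100
X = np.vstack([rng.normal(loc=+1.0, scale=0.5, size=(n, 2)),
               rng.normal(loc=-1.0, scale=0.5, size=(n, 2))])
y = np.concatenate([np.ones(n), -np.ones(n)])

w, b = np.array([1.0, 1.0]), 0.0       # fixed classifier, chosen by hand for illustration
eps_grid = [0.0, 0.1, 0.25, 0.5, 1.0, 5.0]
curve = [robust_error_linf(w, b, X, y, e) for e in eps_grid]

for e, r in zip(eps_grid, curve):
    print(f"eps = {e:4.2f}   robust error = {r:.3f}")
```

The curve starts at the standard error for $\epsilon = 0$, never decreases as $\epsilon$ grows, and reaches 100% once the budget exceeds every point's margin.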

Example 5 — Margin and Perturbation Radius

Consider training a linear SVM on MNIST digits “7” vs “8” with the goal of maximizing the margin. A well-tuned SVM learns weights $w \in \mathbb{R}^{784}$ (one per pixel) that separate the two classes with a large margin. For a typical test image of “7”, the margin might be $m = 5.0$ (5 units above the decision boundary in logit space). The Lipschitz constant of the linear classifier is $L_f = \|w\|_2 \approx 3.0$. By Theorems 5 and 7, the certified robustness radius is $r(x) = m / L_f = 5.0 / 3.0 \approx 1.67$ in $\ell_2$-norm.

This certified radius of 1.67 can be read on the MNIST pixel scale (pixel values in $[0, 1]$). A perturbation with $\|\delta\|_2 = 1.67$ has the same $\ell_2$ size as changing about three pixels from black to white ($\sqrt{3} \approx 1.73$), or as a faint change of about $0.06$ spread over all 784 pixels ($0.06 \cdot 28 \approx 1.67$). Any perturbation smaller than this leaves the classification unchanged. Because the classifier is linear, the radius is exact: a perturbation with $\|\delta\|_2 = 2.0$ aligned with $-w$ flips the classification, while a perturbation of the same size in another direction may not.

In contrast, consider a test image classified with small margin $m = 0.5$ (barely above the decision boundary). The same classifier has $L_f = 3.0$, so the robustness radius is $r(x) = 0.5 / 3.0 \approx 0.167$. This point is highly adversarially vulnerable: even tiny perturbations might flip the classification. This illustrates an important principle: margin measures confidence, and confident predictions are more robust.

The dependence of robustness radius on both margin and Lipschitz constant reveals several insights. First, maximizing margin (as SVMs and large-margin learning algorithms do) directly improves robustness. Second, controlling $\|w\|_2$ (via $\ell_2$-regularization) also improves robustness by reducing the Lipschitz constant. A common misconception is that robustness requires sacrificing accuracy; in the SVM setting with large margin, high accuracy and high robustness often coexist because both require large margins. However, as models become more complex (e.g., neural networks), achieving large margins becomes harder, and the robustness-accuracy trade-off intensifies.

What if we re-weighted the SVM objective to penalize misclassifications on “7” more heavily than on “8”? The resulting class imbalance would create larger margins for “7” but smaller margins for “8”. Predictions on “7” would be more robust, while predictions on “8” would be more vulnerable to perturbations. This illustrates how the choice of loss function and data weighting directly affects the robustness profile.

Explicit ML Relevance: SVM training, widely used in classical ML, implicitly optimizes for robustness via margin maximization. Understanding the margin-robustness connection explains why SVMs often achieve good robustness without explicit adversarial defenses. For deep learning, achieving large margins is harder, motivating the need for explicit adversarial training.

Example 6 — Robust Optimization Dual Interpretation

Consider the robust optimization problem for training a classifier on a binary dataset with 100 examples: \[\min_\theta \sum_{i=1}^{100} \max_{\|\delta_i\|_2 \leq \epsilon} \ell(f_\theta(x_i + \delta_i), y_i) \] where the inner maximization finds the worst-case perturbation for each example. This is a minimax problem: the adversary (inner max) chooses perturbations to maximize loss, and the learner (outer min) chooses parameters to minimize worst-case loss.

By Theorem 4, the dual formulation is: \[\max_{\lambda_i \geq 0} \min_\theta \sum_{i=1}^{100} \left( \ell(f_\theta(x_i), y_i) + \lambda_i \|\delta_i^*(\lambda_i)\|_2 \right) - \sum_i \lambda_i \epsilon \] where $\delta_i^*(\lambda_i)$ is the optimal perturbation parameterized by the dual variable $\lambda_i$. Intuitively, $\lambda_i$ represents the “cost” of adversarial perturbations at example $i$: high $\lambda_i$ means the learner is willing to pay a large penalty (lose more on the clean data) to reduce the robustness requirement (smaller $\epsilon$ effective bound); low $\lambda_i$ means the learner prioritizes robustness.

In the dual formulation, the problem decomposes: for each example, the learner and adversary engage in a two-player game parameterized by $\lambda_i$. The learner minimizes $\ell + \lambda_i \|\delta_i\|_2$, balancing standard loss and robustness. As $\lambda_i$ increases, the regularization term $\lambda_i \|\delta_i\|_2$ dominates, and the learner prioritizes standard accuracy over robustness. As $\lambda_i$ decreases, robustness becomes more important.

To implement this dual approach, an algorithm might alternate between solving for the optimal $\theta$ (inner minimization) and updating $\lambda_i$ (outer maximization) via gradient ascent on the dual objective. This can be more efficient than solving the primal minimax problem directly, especially with advanced optimization techniques like proximal methods or mirror descent. A common misconception is that the dual problem is simpler to solve than the primal; in practice, both require iterative optimization, but the dual may have better numerical properties or enable distributed computation.

What if we let some $\lambda_i$ grow very large, effectively exempting certain examples from the robustness requirement? A large $\lambda_i$ makes perturbations at example $i$ so costly that the optimal perturbation shrinks toward zero, and the learner fits that example essentially on clean data. This would allow the learner to sacrifice robustness on a subset of examples to improve overall performance, for instance when example $i$ is a known outlier or comes from a different domain. Conversely, letting $\lambda_i$ approach zero removes the penalty on perturbation size at example $i$ and demands robustness to ever larger perturbations there, potentially at the cost of worse performance overall.

Explicit ML Relevance: The dual formulation provides algorithmic insights for training robust models. PGD-based adversarial training attacks the primal problem directly: the inner maximization (attack) enforces the norm constraint by projection, and the outer minimization (parameter update) adjusts $\theta$. The dual view replaces the hard constraint with a per-example penalty weighted by $\lambda_i$, so that the attack implicitly computes $\delta_i^*(\lambda_i)$. Understanding both views helps practitioners design better optimization algorithms and diagnose convergence issues.

Example 7 — Hessian-Based Sensitivity Analysis

For a neural network $f$ trained on CIFAR-10, consider a correctly classified image with loss $\ell(f(x), y) = 0.2$. We compute the gradient $\nabla_x \ell$ and the Hessian matrix $\nabla_x^2 \ell \in \mathbb{R}^{3072 \times 3072}$ (for 32×32×3 images). The Hessian eigenvalues reveal the curvature of the loss landscape: if the largest eigenvalue is $\lambda_{\max} = 10$, small perturbations can cause large loss increases. By Theorem 6, a perturbation $\|\delta\|_2 = 0.1$ causes loss change approximately bounded by: \[|\Delta \ell| \approx \frac{1}{2} (0.1)^2 \cdot 10 = 0.05 \] plus first-order terms. If the gradient norm is $\|\nabla_x \ell\|_2 = 2.0$, the first-order change is at most $2.0 \cdot 0.1 = 0.2$, totaling around $0.25$ loss increase.

In contrast, consider a different test image where the Hessian has largest eigenvalue $\lambda_{\max} = 100$ (sharper curvature). The same perturbation causes second-order loss change approximately $\frac{1}{2} (0.1)^2 \cdot 100 = 0.5$, plus first-order terms. Even without the gradient contribution, the Hessian alone predicts substantial loss increase, potentially causing misclassification. This example illustrates the principle that flat minima (small Hessian eigenvalues) are more robust than sharp minima.

To make this concrete, Sharpness-Aware Minimization (SAM) is a training algorithm that explicitly seeks to minimize the maximum loss in a neighborhood of the current parameters: \[\min_\theta \max_{\|\rho\|_2 \leq \rho_0} \ell(f_{\theta + \rho}(x), y) \] where $\rho$ is a parameter perturbation and $\rho_0$ is a small radius. By solving this robust optimization problem, SAM finds parameters in flatter regions of the loss (small Hessian eigenvalues with respect to the parameters). This is curvature in parameter space, whereas the analysis above concerns the Hessian with respect to the input. Empirically, models trained with SAM often reach higher standard accuracy than standard SGD; gains in adversarial robustness have also been reported, but they do not follow directly from the input-space bound and should be verified for the model at hand.

A common misconception is that the Hessian eigenvalues directly determine adversarial robustness; in reality, the Hessian provides a local second-order approximation that is valid only for small perturbations. For larger perturbations, higher-order terms and non-local effects dominate. Another misconception is that computing the Hessian is always practical; for large inputs and networks, the full Hessian is prohibitively expensive (quadratic memory to store, cubic time for a full eigendecomposition), so practitioners typically estimate only its largest eigenvalue from Hessian-vector products (for example by power iteration) or fall back on proxies such as gradient norms.

What if the Hessian has both very large and very small eigenvalues (large condition number)? This means the loss landscape is elongated: in directions with small eigenvalues (flat), perturbations cause small loss changes, but in directions with large eigenvalues (sharp), perturbations cause large changes. An adversary would focus on the sharp directions, exploiting the worst-case sensitivity. What matters for worst-case robustness is therefore the largest eigenvalue, not the condition number: a well-conditioned Hessian whose eigenvalues are all large is uniformly sharp, whereas a small $\lambda_{\max}$ bounds the second-order loss change in every direction simultaneously.

Explicit ML Relevance: SAM and other loss-surface-aware training algorithms are an active area of practical robustness research. By controlling curvature not just at the current parameters but in a neighborhood (robust optimization over parameter space), these methods aim for high accuracy together with reduced sensitivity, and may partially mitigate the robustness-accuracy trade-off.
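The second-order bound used in this example, and the usual way of estimating $\lambda_{\max}$ without forming the Hessian, can be sketched as follows. The sketch assumes a toy symmetric matrix standing in for the input Hessian; `hvp` is a hypothetical callback that returns a Hessian-vector product, which in practice would come from automatic differentiation.

```python
import numpy as np

def second_order_bound(grad_norm, lam_max, radius):
    """Upper bound on the loss change: ||g|| * r + 0.5 * lam_max * r^2."""
    return grad_norm * radius + 0.5 * lam_max * radius ** 2

def top_eigenvalue(hvp, dim, iters=200, seed=0):
    """Power iteration using only Hessian-vector products.

    Converges to the eigenvalue of largest magnitude, which equals lam_max
    when the Hessian is positive semidefinite.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(v)
        v = hv / np.linalg.norm(hv)
    return float(v @ hvp(v))           # Rayleigh quotient at the final vector

# The two cases from the example, with perturbation radius 0.1.
print(second_order_bound(2.0, 10.0, 0.1))    # first-order 0.2 plus second-order 0.05
print(second_order_bound(0.0, 100.0, 0.1))   # second-order term alone

# Toy symmetric positive definite "Hessian" with known top eigenvalue 10.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
eigs = np.concatenate([np.linspace(0.1, 5.0, 49), [10.0]])
H = Q @ np.diag(eigs) @ Q.T
lam = top_eigenvalue(lambda v: H @ v, dim=50)
print("estimated lam_max:", lam)
```

Each power-iteration step costs one Hessian-vector product, so the estimate scales to input dimensions where the full $3072 \times 3072$ matrix (or larger) would be impractical to form.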

Example 8 — Stability Bound Illustration

Consider a leave-one-out (LOO) evaluation of a logistic regression model trained on 1000 MNIST images (500 digit “1”, 500 digit “2”). Train the base model on all 1000 examples, then train 1000 additional models, each omitting one example. For most examples, the predictions from the base model and the leave-one-out models are identical or very similar. For a few examples (e.g., mislabeled points or boundary examples), the leave-one-out models produce different predictions. The leave-one-out error is the fraction of examples whose leave-one-out prediction differs from the base model’s prediction.

Let’s say the base model achieves 95% accuracy, and the LOO error is 2%. This means 20 examples (2% of 1000) have different predictions when left out. By the stability-generalization connection (Theorem 2), this 2% LOO error is an upper bound on the expected generalization gap $\mathbb{E}_{\text{test}}[\text{error}] - \mathbb{E}_{\text{train}}[\text{error}]$. In practice, if the base model’s training accuracy is 95%, we can expect the test accuracy to be at least $95\% - 2\% = 93\%$, providing a conservative estimate.

Contrast this with a 10-nearest-neighbor classifier on the same data. The KNN classifier is less stable: removing one training example can change the nearest neighbors for many test points, potentially flipping predictions. The LOO error for KNN might be 15%, so the stability bound only guarantees a generalization gap of up to 15%, a much weaker statement than for logistic regression. Indeed, when evaluated on a fresh test set, KNN might achieve 80% accuracy (a 15-point drop from a training accuracy of 95%), consistent with the stability bound.

To improve stability, apply $\ell_2$-regularization to the logistic regression: $\min_\theta \frac{1}{1000} \sum_i \ell(f_\theta(x_i), y_i) + \lambda \|\theta\|^2$ with regularization parameter $\lambda = 0.01$. With regularization, the LOO error improves to 1%, suggesting a tighter generalization gap and better expected test performance. Higher $\lambda$ increases stability further but may hurt training accuracy if the regularization is too strong.

A critical misconception about LOO is that it is an unbiased estimator of test performance. While LOO provides valid bounds, it is often conservative (overestimates error) because the training sets for LOO models are slightly smaller than the original training set. Another misconception is that high training accuracy always leads to good test accuracy; without stability, high training accuracy can coexist with poor generalization.

What if we consider a different notion of stability: robustness to input perturbations rather than training data changes? A model is input-stable if small perturbations to a test input produce small changes in predictions. By analogy with data stability, input-stable models should have good robustness. If the model is 0.1-stable to $0.1$-size input perturbations (loss changes by at most 0.1), then we can expect the model to be robust to such perturbations with high probability.

Explicit ML Relevance: Stability bounds provide a practical tool for estimating generalization without explicitly evaluating on a separate test set. In adversarial settings, algorithms that are stable to both training data changes and input perturbations are expected to achieve good robust generalization.
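The leave-one-out disagreement measure described in this example can be computed directly on a small problem. The sketch below swaps in a ridge-regularized least-squares classifier for logistic regression, purely because it has a closed-form solution that keeps the refitting loop short; the synthetic data, the function names, and the regularization values are assumptions made for illustration.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form minimizer of mean squared error plus lam * ||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def loo_disagreement(X, y, lam):
    """Fraction of examples whose predicted sign changes when that example is left out."""
    w_full = fit_ridge(X, y, lam)
    flips = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w_i = fit_ridge(X[keep], y[keep], lam)
        flips += int(np.sign(X[i] @ w_i) != np.sign(X[i] @ w_full))
    return flips / len(y)

rng = np.random.default_rng(0)
n = 100
X = np.vstack([rng.normal(loc=+0.7, scale=1.0, size=(n, 5)),
               rng.normal(loc=-0.7, scale=1.0, size=(n, 5))])
y = np.concatenate([np.ones(n), -np.ones(n)])

rates = {lam: loo_disagreement(X, y, lam) for lam in (1e-3, 1.0)}
for lam, rate in rates.items():
    print(f"lambda = {lam:g}   LOO disagreement = {rate:.3f}")
```

The loop refits the model once per example, which is affordable here but grows linearly with dataset size; that cost is why stability is usually bounded analytically rather than measured for large models.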

Example 9 — Adversarial Training on Logistic Regression

Consider training logistic regression on a binary classification task with 200 examples (100 per class) in 10D feature space. Standard training minimizes the cross-entropy loss on clean data: $\min_w \frac{1}{200} \sum_i \ell_{\text{CE}}(w^T x_i, y_i)$. After training, the model achieves 92% training accuracy and 88% test accuracy. Now, evaluate adversarial robustness by running PGD attack with $\epsilon = 0.5$ (perturbation budget in $\ell_2$-norm). Under adversarial perturbations, the accuracy drops to 45%, revealing severe vulnerability.

To improve robustness, apply adversarial training by solving: \[\min_w \frac{1}{200} \sum_i \max_{\|\delta_i\|_2 \leq 0.5} \ell_{\text{CE}}(w^T (x_i + \delta_i), y_i) \] For each example, find the worst-case perturbation (via PGD), then update the weights to reduce the worst-case loss. After adversarial training with this robust objective, the standard training accuracy drops slightly to 89% (training on harder adversarial examples), but the adversarial accuracy under the same PGD attack improves dramatically to 72%. The robust training trades some standard accuracy for improved robustness.

To understand the geometry, after standard training, the decision boundary is positioned to maximize clean accuracy: it passes through or near the original data points. With only this constraint, the boundary is vulnerable to small perturbations that push points across it. Adversarial training repositions the boundary further from the original data, creating a “robustness margin”: perturbations must move points further to flip predictions. This is geometrically similar to large-margin maximization (SVM principle), but now the margin is defined adversarially rather than geometrically.

A subtle misconception is that adversarial training exactly solves the robust optimization problem; in practice, for nonlinear models PGD finds only approximate worst-case perturbations (local maxima of the inner maximization), not true global maxima. For logistic regression the inner maximization happens to have a closed-form solution, $\delta_i^* = -\epsilon \, y_i \, w / \|w\|_2$ for labels $y_i \in \{-1, +1\}$, so the approximation error is absent in this example. Training against stronger attacks (more PGD iterations, random restarts) generally yields better adversarial robustness at test time, but at computational cost during training. Another misconception is that adversarial training with one threat model (e.g., $\ell_2, \epsilon = 0.5$) transfers to other threat models; a model trained against $\ell_2$ perturbations may be vulnerable to $\ell_\infty$ perturbations or other attacks not seen during training.

What if we gradually increase the perturbation budget during training, starting with $\epsilon = 0.1$ and increasing to $\epsilon = 0.5$? This “curriculum” approach can accelerate convergence and potentially reach better robust optima by avoiding local minima early in training. Alternatively, we could use multiple threat models simultaneously (e.g., both $\ell_2$ and $\ell_\infty$ perturbations), training the model to be robust to diverse adversarial directions. This multi-threat training is more expensive but produces models robust across a broader threat landscape.
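One way to realize the curriculum idea is a simple budget schedule. The sketch below ramps $\epsilon$ linearly from 0.1 to 0.5 and then holds it; the linear shape, the `warmup_frac` parameter, and the function name are assumptions made for illustration, not a prescribed recipe.

```python
def epsilon_schedule(epoch, n_epochs, eps_start=0.1, eps_end=0.5, warmup_frac=0.5):
    """Linearly ramp the perturbation budget, then hold it at eps_end."""
    warmup_epochs = max(1, int(warmup_frac * n_epochs))
    t = min(1.0, epoch / warmup_epochs)      # fraction of the ramp completed
    return eps_start + t * (eps_end - eps_start)

# Budget used at each of 10 training epochs.
schedule = [round(epsilon_schedule(e, n_epochs=10), 2) for e in range(10)]
print(schedule)
```

In a training loop, the value returned for the current epoch would be passed as the attack budget when generating that epoch's adversarial examples.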

Explicit ML Relevance: Adversarial training is the most practical and widely deployed robust learning algorithm. Understanding its geometry—how it repositions decision boundaries to create robustness margins—explains its empirical success and limitations. The computational cost (5–10× standard training) must be weighed against the robustness benefits.

Example 10 — Certified Robust Radius Computation

Consider a neural network trained on CIFAR-10 with spectral normalization applied to all layers. The network has input dimension 3072 (32×32×3), output dimension 10 (classes), and all weight matrices are constrained to have spectral norm 1. The composed Lipschitz constant of the full network is at most $L_f = 1.0$ (product of per-layer constants). On a test image of a “cat”, the model outputs logits $s = (3.2, 1.5, 0.8, 0.3, \ldots)$, predicting “cat” (class 0) with logit 3.2. The margin is $m = 3.2 - 1.5 = 1.7$ (difference between top-2 logits).
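The product bound on the Lipschitz constant can be checked numerically. The sketch below assumes a bias-free ReLU network (ReLU is 1-Lipschitz) with hypothetical layer widths and random weights rescaled to spectral norm 1; it is not the network from the example.

```python
import numpy as np

def spectral_normalize(W):
    """Rescale W so that its largest singular value is 1."""
    return W / np.linalg.norm(W, ord=2)

def mlp(x, weights):
    """Bias-free ReLU network; ReLU is 1-Lipschitz, so layer bounds multiply."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

rng = np.random.default_rng(0)
sizes = [3072, 256, 128, 10]             # hypothetical widths for a CIFAR-10 input
weights = [spectral_normalize(rng.normal(size=(m, n)))
           for n, m in zip(sizes[:-1], sizes[1:])]

# Upper bound on the l2 Lipschitz constant: product of per-layer spectral norms.
L_bound = np.prod([np.linalg.norm(W, ord=2) for W in weights])

# Empirical check on one pair of inputs: the observed ratio cannot exceed the bound.
x1, x2 = rng.normal(size=3072), rng.normal(size=3072)
ratio = np.linalg.norm(mlp(x1, weights) - mlp(x2, weights)) / np.linalg.norm(x1 - x2)
print(f"bound {L_bound:.3f}  observed ratio {ratio:.3f}")
```

The observed ratio is usually well below the bound, which is one reason Lipschitz-based certificates tend to be conservative.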

By Theorem 7, the certified robustness radius is $r(x) = m / L_f = 1.7 / 1.0 = 1.7$ in $\ell_2$-norm, provided $L_f$ bounds the Lipschitz constant of the margin function $s_y - s_c$ for every competing class $c$. Under that reading, any perturbation with $\|\delta\|_2 < 1.7$ is guaranteed not to flip the classification. It is worth checking what this requires. Suppose a perturbation has $\|\delta\|_2 = 1.5$, within the stated radius. If all we know is that each logit is $L_f$-Lipschitz, each logit changes by at most $L_f \|\delta\|_2 = 1.0 \cdot 1.5 = 1.5$: the “cat” logit is at least $3.2 - 1.5 = 1.7$, while the runner-up logit is at most $1.5 + 1.5 = 3.0$. Since $1.7 < 3.0$, this per-logit argument does not certify the prediction at $\|\delta\|_2 = 1.5$. In general the gap after perturbation is at least $m - 2 L_f \|\delta\|_2$, which stays positive only when $\|\delta\|_2 < m / (2 L_f) = 0.85$. A tighter argument uses the fact that the whole logit vector is $L_f$-Lipschitz in $\ell_2$: because $s_y - s_c = (e_y - e_c)^T s$ and $\|e_y - e_c\|_2 = \sqrt{2}$, the margin function is $\sqrt{2} L_f$-Lipschitz, giving a radius of $m / (\sqrt{2} L_f) \approx 1.20$. The radius $m / L_f = 1.7$ is therefore valid only when $L_f$ is a Lipschitz bound on the margin function itself, so it matters which quantity the constant refers to. In every case the radius scales linearly with the margin: for a more robust image with margin $m = 3.2$, Theorem 7 would give $r = 3.2$, guaranteeing robustness to larger perturbations.
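The radius depends on what the Lipschitz constant is assumed to bound. The sketch below computes three $\ell_2$ radii from the example's logits: $m / L$ when $L$ bounds the margin function $s_y - s_c$ directly, $m / (\sqrt{2} L)$ when $L$ bounds the full logit vector (since $\|e_y - e_c\|_2 = \sqrt{2}$), and $m / (2L)$ when each logit is bounded separately. The function names are illustrative, and only the four logits printed in the example are used, on the assumption that the elided ones are smaller.

```python
import math

def margin(logits):
    """Gap between the largest and second-largest logit."""
    top, runner_up = sorted(logits, reverse=True)[:2]
    return top - runner_up

def certified_radii(logits, lipschitz):
    """l2 radii under three readings of what `lipschitz` bounds."""
    m = margin(logits)
    return {
        "margin_function": m / lipschitz,                 # L bounds s_y - s_c directly
        "logit_vector": m / (math.sqrt(2) * lipschitz),   # L bounds the whole logit map
        "per_logit": m / (2 * lipschitz),                 # each logit moves by at most L * ||delta||
    }

logits = [3.2, 1.5, 0.8, 0.3]    # first four logits from the example; the rest are elided
radii = certified_radii(logits, lipschitz=1.0)
for name, r in radii.items():
    print(f"{name:16s} r = {r:.2f}")
```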

To improve the certified radius, we can either increase the margin (for instance, by focusing on images where the model is more confident) or decrease the Lipschitz constant (tighter spectral normalization, possibly with stronger regularization). In practice, certified $\ell_2$ radii of about 1–2 are reasonable for image classifiers; larger radii are rare unless the model is highly regularized (sacrificing accuracy).

A common misconception is that certified robustness is the same as empirical robustness. Empirically, the model might withstand perturbations much larger than the certified radius, even under adaptive attacks (attacks designed specifically to evade the spectral normalization defense), but that observation is not a guarantee; the certificate guarantees robustness only up to the computed radius. Another misconception is that computing the certified radius is expensive; once a Lipschitz bound is available, it requires only a forward pass and a logit comparison, making it practical for deployment.

What if we use randomized smoothing instead of spectral normalization? Randomized smoothing provides certified robustness by averaging predictions over noisy versions of the input. For a base model and Gaussian noise level $\sigma$, the certified $\ell_2$ radius is $r = \sigma \, \Phi^{-1}(p_A)$, where $p_A > 1/2$ is a lower bound on the probability of the predicted class under noise and $\Phi^{-1}$ is the inverse of the standard normal CDF. For a fixed $p_A$, larger noise $\sigma$ gives a larger certified radius, but more noise typically lowers $p_A$ and may hurt clean accuracy. This represents a different robustness-accuracy trade-off compared to spectral normalization.
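To see how noise level and class probability interact, the sketch below evaluates the Gaussian-smoothing certificate $r = \sigma \, \Phi^{-1}(p_A)$ in the form given by Cohen et al. (2019), where the runner-up probability is bounded by $1 - p_A$. The function name and the probability values are illustrative; in practice $p_A$ would be a lower confidence bound estimated from many noisy forward passes.

```python
from statistics import NormalDist

def smoothing_radius(p_a_lower, sigma):
    """Certified l2 radius sigma * Phi^{-1}(p_A) for Gaussian smoothing.

    p_a_lower is a lower confidence bound on the top-class probability under
    noise and must be below 1. Returns 0.0 (abstain) when the top class does
    not hold a clear majority.
    """
    if p_a_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_a_lower)

for sigma in (0.25, 0.5):
    for p_a in (0.6, 0.9, 0.99):
        print(f"sigma={sigma:.2f}  p_A={p_a:.2f}  r={smoothing_radius(p_a, sigma):.3f}")
```

The radius grows with both $\sigma$ and $p_A$, which is why adding noise only helps as long as the base model stays accurate on noisy inputs.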

Explicit ML Relevance: Certified robustness provides formal guarantees suitable for safety-critical deployments. Computing certified radii is straightforward and reveals which examples are inherently robust (high margin) and which are vulnerable (low margin), informing decisions about retraining or data collection.

Example 11 — Tradeoff Between Robustness and Accuracy

A classic experiment on CIFAR-10 illustrates the robustness-accuracy trade-off. Train a standard ResNet-50 without robustness constraints: it achieves 95% standard accuracy and 0% accuracy under $\ell_\infty$ perturbations with $\epsilon = 8/255$ (using our best PGD attack). Now, train the same ResNet-50 with adversarial training (PGD-based, $\epsilon = 8/255$): it achieves 76% standard accuracy and 52% robust accuracy. The model sacrifices 19 percentage points of standard accuracy to gain 52 percentage points of robust accuracy.

This trade-off becomes more pronounced with stronger threat models. For $\epsilon = 16/255$ perturbations, the robust accuracy drops to 20%, with standard accuracy potentially dropping further to 60–65%. Conversely, for weaker threat models like $\epsilon = 2/255$, robust accuracy is 85–90%, with minimal standard accuracy loss. By Theorem 8, this trade-off is fundamental: the sample complexity of learning robust models grows with the input dimension and with the strength of the threat model, making robust learning harder.

To understand why the trade-off exists, consider the data geometry. The original training data lies near a lower-dimensional manifold embedded in high-dimensional space. A standard classifier learns a decision boundary close to this manifold, enabling high accuracy. A robust decision boundary must be further from the manifold (to account for adversarial perturbations), reducing the model’s ability to separate clean examples precisely. In high dimensions, this displacement is costly: moving the boundary away from the data typically requires a more complex or less confident model, hurting accuracy.

A common misconception is that the trade-off is an artifact of current training algorithms; while improvements in training methods can shift the Pareto frontier, the fundamental trade-off (Theorem 8) limits how much accuracy can be recovered. Another misconception is that the trade-off applies equally to all models; simple models (e.g., logistic regression, linear classifiers) may have better robustness-accuracy Pareto frontiers than complex models (e.g., large ResNets), because large models can overfit to non-robust features that small models cannot utilize.

What if we trained on a dataset with more robust features (e.g., data collected explicitly to include adversarial examples, or data with larger margins)? This could shift the Pareto frontier rightward, improving robust accuracy without sacrificing as much standard accuracy. Alternatively, using techniques like semi-supervised learning to leverage unlabeled data might provide more examples for the model to learn from, reducing the sample complexity penalty for robustness.

Explicit ML Relevance: Understanding the robustness-accuracy trade-off guides practitioners in deciding whether robust training is appropriate for their application. For applications where robust accuracy is paramount (security, safety), accepting standard accuracy loss is necessary. For applications where standard accuracy is critical (most commercial ML), the cost of robustness must be carefully weighed.

Example 12 — Robustness in Large-Scale Transformers

Modern transformer models like BERT or GPT-3 are trained on billions of tokens and fine-tuned for various NLP tasks. Consider fine-tuning BERT for sentiment analysis: the model has 110M parameters and is trained on 100K reviews (100 labels × 1000 examples per label). Standard fine-tuning achieves 94% classification accuracy on a held-out test set. Evaluating adversarial robustness is more complex in NLP than in images, because perturbations must preserve linguistic meaning. Common threat models include: (1) character-level perturbations (typos, misspellings), (2) token-level perturbations (synonyms, word replacements), (3) sentence-level perturbations (rephrasing, paraphrasing).

For character-level attacks, adversaries might substitute similar-looking characters (e.g., “Shakespear” → “Shakespear” with a Cyrillic character). For token-level attacks, adversaries replace words with synonyms or antonyms that change sentiment but preserve surface structure. For example, the phrase “This movie is [good]” becomes “This movie is [bad]” (single word substitution). A FGSM-style attack on BERT would compute gradients with respect to token embeddings, then perturb them to maximize sentiment loss. The perturbed embeddings are then decoded to words (nearest neighbors in embedding space).
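A character-level homoglyph substitution is easy to demonstrate with the standard library. The sketch below uses an illustrative one-entry mapping from Latin "a" to Cyrillic "а" (U+0430); the function and mapping names are hypothetical, and real attacks use much larger lookalike tables.

```python
import unicodedata

def homoglyph_attack(text, mapping):
    """Swap selected characters for visually similar code points."""
    return "".join(mapping.get(ch, ch) for ch in text)

# Illustrative mapping: Latin 'a' (U+0061) -> Cyrillic 'а' (U+0430).
CYRILLIC_LOOKALIKES = {"a": "\u0430"}

original = "Shakespeare"
perturbed = homoglyph_attack(original, CYRILLIC_LOOKALIKES)

print(original == perturbed)     # False: the strings differ at the code-point level
print(sorted({unicodedata.name(ch) for ch in perturbed if ch not in original}))
```

The two strings render almost identically, yet a tokenizer sees different code points and will typically split the perturbed word into different, often rarer, subword tokens.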

Under token-level attacks with budget $k = 10$ (up to 10 words can be replaced), the standard model’s accuracy drops from 94% to 73%, a 21-point drop. This reveals a robustness vulnerability despite high baseline accuracy. To defend, apply adversarial training by augmenting the training set with adversarial examples: perturb each training example by replacing 5-10 words, then retrain the model on both clean and perturbed examples. After adversarial training, robust accuracy under the same attack improves to 85%, while standard accuracy falls to 91%. This is a more modest trade-off than in vision (a 3-point loss in standard accuracy vs. 19 points for CIFAR-10). It may suggest better inherent robustness in language models, but the comparison depends on attack strength: discrete attacks are harder to optimize, so a weak attack can understate the gap.
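The role of the budget $k$ can be illustrated with a greedy substitution attack on a toy classifier. Everything in this sketch is hypothetical: the bag-of-words scorer stands in for a real model, and the weights and synonym table are made up to mimic a model that has learned weaker weights for rarer synonyms.

```python
def score(tokens, weights):
    """Toy bag-of-words sentiment score; a positive score means 'positive review'."""
    return sum(weights.get(t, 0.0) for t in tokens)

def greedy_substitution_attack(tokens, weights, synonyms, k):
    """Replace up to k tokens with listed synonyms, largest score drop first."""
    candidates = []
    for i, tok in enumerate(tokens):
        for alt in synonyms.get(tok, []):
            drop = weights.get(tok, 0.0) - weights.get(alt, 0.0)
            if drop > 0:
                candidates.append((drop, i, alt))
    attacked, used = list(tokens), set()
    for drop, i, alt in sorted(candidates, reverse=True):
        if len(used) == k:           # budget exhausted
            break
        if i not in used:            # at most one substitution per position
            attacked[i] = alt
            used.add(i)
    return attacked

# Hypothetical weights and synonym table for illustration only.
weights = {"great": 2.0, "terrific": 0.4, "enjoyable": 1.0, "pleasant": 0.2, "slow": -1.0}
synonyms = {"great": ["terrific"], "enjoyable": ["pleasant"]}
review = ["great", "film", "enjoyable", "but", "slow"]

for k in (0, 1, 2):
    attacked = greedy_substitution_attack(review, weights, synonyms, k)
    print(k, attacked, round(score(attacked, weights), 2))
```

With this toy data one substitution lowers the score but leaves it positive, while two substitutions flip its sign, which mirrors how reported robust accuracy depends on the budget $k$.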

The distinction between robustness in vision and NLP highlights domain-specific challenges. In vision, adversarial perturbations are continuous and pixel-level, making gradient-based attacks highly effective. In NLP, perturbations are discrete (word choices) and must preserve linguistic validity, making attacks harder to optimize. Consequently, NLP models often exhibit better empirical robustness without explicit defenses, but the gap between standard and robust performance (when attacks are carefully designed) can still be substantial.

A common misconception is that NLP models are inherently more robust because of discrete perturbation spaces. In reality, the robustness gap depends on the quality of attacks: if attacks are not carefully designed or optimized, NLP robustness may appear high, but stronger attacks reveal vulnerabilities. Another misconception is that semantic perturbations (paraphrases) are “unfair” for robustness evaluation; in fact, robustness to valid paraphrases is highly desirable, because a model should produce consistent predictions for meaning-preserving variations, just as a human reader would.

What if we combined robustness with domain adaptation (training on multiple datasets) or transfer learning (pre-training on diverse data)? Large pre-trained models like BERT already benefit from transfer learning, which may confer some robustness by exposing the model to linguistic diversity. Fine-tuning on adversarially augmented data would further improve robustness. Alternatively, using certified robustness techniques (e.g., randomized smoothing adapted for discrete token spaces) could provide formal guarantees, though computational costs may be prohibitive for billion-parameter models.

Explicit ML Relevance: Scaling robust training to large transformer models is a frontier challenge. The computational cost of computing worst-case perturbations for input sequences (which can be long, e.g., 512 tokens) makes adversarial training very expensive, and often impractical, for billion-parameter models. Research into efficient robust training, certified defenses for NLP, and the use of pre-training to improve robustness is ongoing and critical for deploying robust NLP systems.

Summary

Key Ideas Consolidated

This chapter has established robustness and adversarial examples as fundamental challenges in modern machine learning, distinct from but closely related to generalization. The core insight is that standard learning algorithms, while achieving high accuracy on clean data, often fail catastrophically under small adversarial perturbations, revealing a deep vulnerability in how neural networks process information. The key ideas crystallize around five interconnected concepts:

First, adversarial examples exist and are prevalent. An adversarial perturbation $\delta$ with $\|\delta\|_p \leq \epsilon$ (bounded in a specified norm) can flip predictions despite being imperceptible to humans. These examples are not pathological edge cases but arise naturally during standard training, reflecting the model’s reliance on non-robust features. This discovery (Szegedy et al., 2013) fundamentally changed how researchers understand neural network brittleness.

Second, robustness is a distinct optimization objective from standard accuracy. The robust optimization problem $\min_\theta \max_{\|\delta\| \leq \epsilon} \ell(f_\theta(x + \delta), y)$ is fundamentally harder than standard empirical risk minimization. This drives the robustness-accuracy trade-off (Theorem 8): achieving robustness to larger perturbations $\epsilon$ requires larger margins around the data and, in general, sacrifices standard accuracy. The trade-off is not an artifact of suboptimal algorithms but a fundamental property of the learning problem.

Third, Lipschitz constants control robustness bounds. If a model has bounded Lipschitz constant $L_f$ and margin $m$, the certified robustness radius is $r = m / L_f$ (Theorem 7). This directly translates the geometric concepts of margin and sensitivity into robustness guarantees. Practical defenses like spectral normalization implement this principle by constraining Lipschitz constants.

Fourth, stability of algorithms ensures robust generalization. Algorithmic stability—the property that removing one training example does not drastically change predictions—implies generalization (Theorem 2). When extended to robustness settings, stable algorithms that are robust to both training data changes and input perturbations provide robust generalization bounds. This connects the classical stability view of generalization to adversarial robustness.

Fifth, certified and empirical robustness are distinct but complementary. Certified robustness provides formal guarantees via Lipschitz bounds, randomized smoothing, or abstract interpretation—provably robust up to a computed radius $r(x)$, but often conservative. Empirical robustness, measured by resistance to attacks (FGSM, PGD, etc.), often exceeds certified bounds but lacks formal guarantees and is vulnerable to stronger attacks. Robust ML benefits from both: empirical robustness for practical performance, certified robustness for formal guarantees.

What the Reader Should Now Be Able To Do

After studying this chapter, you should be able to:

Formally define and compute adversarial examples. You can specify threat models (e.g., $\ell_\infty, \epsilon = 8/255$ for images), implement or apply gradient-based attacks (FGSM, PGD), and evaluate empirical robustness on real models and datasets. You understand the parametrization of attacks (step size, iterations, random restarts) and how attack strength interacts with model properties.
Analyze robustness using Lipschitz constants and margins. Given a neural network, you can estimate or bound its Lipschitz constant (via spectral norms, gradient penalties, or other techniques), compute margins at test points, and derive certified robustness radii $r = m / L$. You understand how changes to model architecture (e.g., spectral normalization) affect these quantities.
Formulate and solve robust optimization problems. You can state the robust learning objective as a minimax problem, interpret the dual formulation, and understand the optimization landscape (why standard SGD fails on the robust objective). You can apply or design adversarial training algorithms, choosing between primal descent and dual methods.
Evaluate robustness-accuracy trade-offs. You can plot and interpret the Pareto frontier of robustness vs. accuracy, explain why stronger threat models shift the frontier leftward, and make informed decisions about the cost of robustness for your application. You understand that trade-offs are fundamental, not algorithmic artifacts.
Implement defenses and certifications. You can apply spectral normalization, TRADES, or other robust training techniques, compute certified radii via Lipschitz bounds or randomized smoothing, and diagnose when your model is robust vs. vulnerable. You can interpret certified vs. empirical robustness and explain the gap.
Connect robustness to generalization and stability. You understand that generalization and robustness are distinct but related—high generalization does not guarantee robustness, but algorithmic stability supports both. You can apply stability-based generalization bounds to robust learning and design algorithms that are stable to both training data and input perturbations.
Reason about robustness in practical settings. You can apply robustness concepts to real applications (computer vision, NLP, recommender systems), choose appropriate threat models for your domain, and evaluate trade-offs between robustness, accuracy, and computational cost. You understand that robustness is not an academic abstraction but critical for deployed ML.

Active Assumptions for Later Chapters

This chapter assumes and establishes several foundational premises that will inform later discussions:

Assumption 1: Adversarial vulnerability is deep, not superficial. Unlike overfitting (which manifests as high training accuracy and low test accuracy on the same clean distribution), adversarial vulnerability reveals that models have learned fundamentally non-robust features. These features are predictive on clean data but fragile under perturbation. Later chapters will explore how and why models learn such features, and whether robust training can steer learning toward robust features.

Assumption 2: Robustness requires explicit optimization. Standard training does not produce robust models; achieving robustness requires adversarial training or other defenses that explicitly account for adversarial perturbations. This assumption motivates the emphasis on robust optimization throughout the book’s treatment of learning algorithms.

Assumption 3: Trade-offs between robustness and accuracy are unavoidable. While specific trade-off curves depend on algorithms and data, the fundamental trade-off (Theorem 8) is universal. Later chapters will explore whether other objectives (domain adaptation, fairness, scalability) exhibit similar fundamental trade-offs, and how to navigate multiple competing objectives.

Assumption 4: Formal guarantees (certified robustness) are important but limited. Certified robustness provides formal guarantees but is often conservative (certified radii are smaller than empirical robustness). This motivates the need for both certified and empirical robustness, and suggests that formal guarantees alone are insufficient for security-critical applications.

Assumption 5: Robustness is domain-specific. Threat models differ across domains (pixel perturbations in vision, word replacements in NLP, price changes in recommender systems). Defenses must be adapted to the threat model and application. Later chapters will maintain this domain-aware perspective.

End-of-Chapter Advanced Exercises

A. True / False (20)

A.1. If a neural network $f$ achieves certified robustness radius $r(x) = m(x) / L_f$ via Lipschitz bounds (Theorem 7), then the empirical robustness radius under PGD attack must be at least $r(x)$.

A.2. A model with higher standard accuracy on clean data must have a higher robust optimization objective value $\max_\delta \ell(f(x+\delta), y)$ at every point $x$.

A.3. Spectral normalization by constraining each weight matrix’s spectral norm to 1 guarantees that the Lipschitz constant of the entire feedforward network is at most 1.

A.4. Adversarial training with PGD attack on threat model $(\ell_\infty, \epsilon_1)$ produces a model that is also robustly trained against threat model $(\ell_2, \epsilon_2)$ for some $\epsilon_2$.

A.5. The robust risk $R_{\text{robust}}(f) = \mathbb{E}[\max_\delta \ell(f(x+\delta), y)]$ can be arbitrarily small even when the standard risk $R(f) = \mathbb{E}[\ell(f(x), y)]$ is large, if the model is adversarially trained.

A.6. For a linear classifier with weight vector $w$ and fixed margin $m$, robustness radius $r = m / \|w\|_2$ increases monotonically with the regularization parameter $\lambda$ in ridge regression.

A.7. If an algorithm $\mathcal{A}$ is $\epsilon$-uniformly stable (Definition 7), then for any dataset of size $m$, the generalization gap is upper bounded by $2\epsilon + O(1/\sqrt{m})$.

A.8. Gradient masking is a failure mode of adversarial training where the model appears robust to FGSM but is vulnerable to PGD, implying that FGSM is a weaker attack in the sense of optimization (finding worse adversarial examples).

A.9. The certified robustness radius under $\ell_\infty$ threat model is always less than or equal to the certified robustness radius under $\ell_2$ threat model, for the same Lipschitz bound and margin.

A.10. A model trained via adversarial training on randomized examples (where perturbations are drawn from a distribution rather than solving the worst-case problem) still achieves near-optimal robust risk under the worst-case objective.

A.11. If the loss function $\ell(f(x), y)$ is $L_\ell$-Lipschitz and the model $f$ is $L_f$-Lipschitz, then the Lipschitz constant of the composed loss $\ell \circ f$ is exactly $L_\ell L_f$.

A.12. Leave-one-out error (LOO) provides a valid upper bound on the generalization gap of a learning algorithm, and this bound is independent of the sample size $m$.

A.13. The first-order adversarial approximation (Theorem 3) becomes increasingly inaccurate as the perturbation budget $\epsilon$ grows, and the error can be bounded by a term proportional to $\epsilon^2 \max_i |\nabla^2 \ell|_i$.

A.14. A model is distributionally robust (Definition 11) to Wasserstein distance shifts if and only if its decisions are stable to small input perturbations in $\ell_2$-norm.

A.15. The dual form of the robust optimization problem (Theorem 4) implies that the robust learning problem can be solved via Lagrangian ascent on the dual variable, regardless of the convexity of the original loss.

A.16. If margin $m(x) > 0$ at all training points and adversarial training converges to zero robust training loss, then the test robust loss is also zero under the same threat model.

A.17. Spectral normalization constrains the Lipschitz constant of a single layer to its spectral norm; therefore, for a $d$-layer network where each layer has spectral norm 1, the overall Lipschitz constant is $d$.

A.18. The Hessian norm at a point $x$ (Theorem 6) bounds the second-order loss change under perturbation; consequently, models with small Hessian norms everywhere are guaranteed to be robust to all sufficiently small perturbations.

A.19. A model can achieve zero adversarial empirical risk on the training set (i.e., no adversarial examples exist for any training point within the threat model) yet have high adversarial risk on the test set.

B. Proof Problems (20)

B.1. Prove that if a function $f: \mathbb{R}^d \to \mathbb{R}^k$ is $L$-Lipschitz with respect to $\ell_2$-norm on both domain and codomain, then for any $x_1, x_2 \in \mathbb{R}^d$ and any $\epsilon > 0$, an $\epsilon$-ball around $f(x_1)$ contains at most a finite number of isolated points that are not in the $L\epsilon$-ball around $f(x_2)$ when $\|x_1 - x_2\|_2 \leq \epsilon$. Derive a sharp bound on the cardinality of this set.

B.2. Let $f_\theta$ be a neural network with parameters $\theta$, and suppose each layer has spectral norm at most 1 (via spectral normalization). Prove that the Lipschitz constant of the entire network $L_f$ satisfies $L_f \leq 1$ with respect to $\ell_2$-norm, and characterize when this bound is tight (i.e., when $L_f = 1$) in terms of the weight matrix properties and activation functions.

B.3. State and prove a rigorous version of Theorem 7 (Margin-Based Robustness Guarantee) for multi-class classification, including the case where the margin is negative or zero. Address what happens at the decision boundary and provide a bound on the certified radius as a function of the margin and Lipschitz constant.

B.4. Formalize and prove the dual form of the robust optimization problem (Theorem 4). Assume the threat model is a convex set (e.g., $\ell_p$-ball), and show that strong duality holds under standard convexity assumptions on the loss and model.

B.5. Prove Theorem 3 (First-Order Adversarial Approximation) rigorously, including a bound on the remainder term. Specifically, show that for small $\epsilon$, \[|\ell(f(x + \epsilon \cdot \text{sign}(\nabla_x \ell)), y) - (\ell(f(x), y) + \epsilon \|\nabla_x \ell\|_1)| \leq C\epsilon^2 H\] where $H$ is a bound on the Hessian norm. Derive an explicit expression for the constant $C$.

B.6. Let $\mathcal{A}$ be a learning algorithm that is $\epsilon$-uniformly stable. Prove that the generalization gap satisfies \[R(f_S) - \hat{R}(f_S) \leq 2\epsilon + O\left(\sqrt{\frac{\log(1/\delta)}{2m}}\right)\] with probability at least $1 - \delta$ over the draw of the training set $S \sim D^m$ (assume a bounded loss and $\epsilon = O(1/m)$, so that the concentration term has this order). Your proof should use the symmetrization technique and McDiarmid’s inequality.

B.7. Prove that if a classifier $f$ achieves margin $m(x) > 0$ at all points in a dataset, and $f$ is $L$-Lipschitz, then under adversarial training (minimizing worst-case loss), the robust training loss $\hat{R}_{\text{robust}}$ and robust test loss $R_{\text{robust}}$ can differ. Provide a lower bound on the generalization gap $R_{\text{robust}} - \hat{R}_{\text{robust}}$ in terms of properties of the data distribution and $L$.

B.8. Formulate the robust optimization problem as a two-player game (adversary vs. learner), and prove the existence of a Nash equilibrium under the assumption that the loss function is convex in the model parameters and the adversary’s perturbation set is compact. Address what happens when these assumptions are violated (e.g., non-convex neural networks).

B.9. Let $\ell_1$, $\ell_2$ be two loss functions that are $L_1, L_2$-Lipschitz respectively on a compact set. Prove that if a model $f$ is trained to minimize the robust risk with loss $\ell_1$, then its robust risk with respect to $\ell_2$ can be bounded in terms of $L_1, L_2$, the robust risk under $\ell_1$, and properties of the model. Establish when the transfer between loss functions is tight.

B.10. Prove that randomized smoothing (adding Gaussian noise to inputs, then averaging predictions over many noisy versions) provides certified robustness. Specifically, show that if a base classifier correctly classifies the noisy input $x + \mathcal{N}(0, \sigma^2 I)$ with probability $\geq p_A > 1/2$, then the smoothed classifier is certifiably robust at $x$ to $\ell_2$-perturbations up to certified radius $r = c \, \sigma \, \Phi^{-1}(p_A)$, where $\Phi$ is the CDF of the standard normal and $c$ is a constant. Derive $c$ precisely.

B.11. Let $\nabla^2_x \ell(f(x), y)$ be the Hessian of the loss with respect to inputs, and let $\lambda_{\max}$ be its largest eigenvalue in absolute value. Prove Theorem 6 (Sensitivity Bound via Hessian Norm), showing that \[|\ell(f(x + \delta), y) - \ell(f(x), y) - (\nabla_x \ell)^T \delta| \leq \frac{1}{2} \|\delta\|_2^2 \lambda_{\max} + O(\|\delta\|_2^3)\] for small $\delta$. Characterize when the $O(\|\delta\|_2^3)$ term can be bounded sharply.

B.12. Prove that the empirical robust risk \[\hat{R}_{\text{robust}} = \frac{1}{m} \sum_{i=1}^m \max_{\|\delta\|_p \leq \epsilon} \ell(f(x_i + \delta), y_i)\] converges to the population robust risk $R_{\text{robust}} = \mathbb{E}[\max_{\|\delta\| \leq \epsilon} \ell(f(x + \delta), y)]$ as $m \to \infty$ under standard assumptions. Provide a concentration bound with explicit dependence on $m$, the loss range, and distributional properties.

B.13. Consider a linear classifier $f(x) = w^T x + b$ trained on data from two Gaussian clusters. Prove that the robust optimization objective and the standard ERM objective have different optimal solutions (when robustness is enforced), and characterize the geometry of this difference in terms of cluster covariance, separation, and the Lipschitz constraint.

B.14. Prove that for any classifier and any threat model (e.g., $\ell_\infty$-ball), there exists a fundamental trade-off between standard accuracy and robust accuracy. Specifically, show that for a fixed dataset, maximizing robust accuracy under perturbation budget $\epsilon$ requires a trade-off with standard accuracy, and provide a lower bound on the trade-off curve in terms of problem dimension and the threat model size.

B.15. Let $S$ and $S'$ be two datasets differing in one example. Prove that an algorithm $\mathcal{A}$ is $\epsilon$-stable if and only if the LOO error (Definition 8) is at most $\epsilon$ on any dataset, up to appropriate probability and concentration bounds. Characterize the relationship precisely.

B.16. Formulate adversarial training as a bilevel optimization problem, and prove that the gradients of the bilevel objective can be computed via implicit differentiation (without explicitly solving the inner maximization). Address computational challenges and approximation errors in practice.

B.17. Prove that if a distributional robustness problem (Definition 11) uses Wasserstein distance as the divergence, then the worst-case distribution over a Wasserstein ball is supported on the original data points plus perturbations at the boundary of the Wasserstein ball. Derive the worst-case distribution explicitly.

B.18. Let $f_1, f_2$ be two models with Lipschitz constants $L_1, L_2$ respectively, and let $f_3 = \alpha f_1 + (1-\alpha) f_2$ be a convex combination. Prove that the Lipschitz constant of $f_3$ is bounded by $\alpha L_1 + (1-\alpha) L_2$, and show when this bound is achieved (i.e., when the Lipschitz constant of a convex combination equals the convex combination of Lipschitz constants).

B.19. Prove that PGD-based adversarial training (iteratively finding worst-case perturbations via PGD, then updating model parameters) can be viewed as an instance of Lagrangian mirror descent on the dual form of the robust optimization problem. Derive the connection explicitly and analyze convergence rates.

B.20. Consider a neural network trained with spectral normalization such that each layer’s spectral norm is at most 1. Prove that this network is $c$-Lipschitz for some $c \leq 1$, and derive a tight relationship between the Lipschitz constant and the activation functions used (ReLU vs. smooth activations like Tanh). Show how the Lipschitz constant depends on network depth and width.

C. Python Exercises (20)

This section contains the 20 Python exercises for Chapter 12 (Adversarial Robustness). Each exercise follows the same five-part template: Task, Purpose, ML Link, Hints, and What mastery looks like.

C.1 — Implement FGSM Attack and Analyze Gradient Geometry

Task: Implement the Fast Gradient Sign Method (FGSM) attack from first principles. For correctly classified $(\mathbf{x}, y)$, compute $\delta^* = \epsilon \cdot \text{sign}(\nabla_\mathbf{x} \ell(f_\theta(\mathbf{x}), y))$ (cross-entropy loss). Apply $\mathbf{x}^{\text{adv}} = \text{clip}(\mathbf{x} + \delta^*, [0,1])$. Evaluate for $\epsilon \in \{0, 0.05, 0.1, 0.15, 0.2, 0.3\}$: measure attack success rate (% misclassified). Support $\ell_\infty$ and $\ell_2$ norms. For $\ell_2$: $\delta = \epsilon \cdot \frac{\nabla_\mathbf{x} \ell}{\|\nabla_\mathbf{x} \ell\|_2 + 10^{-8}}$. Measure: success rate, margin reduction $\Delta m = m(\mathbf{x}) - m(\mathbf{x}^{\text{adv}})$, perturbation statistics, gradient norm distribution. Dataset: MNIST or CIFAR-10, baseline model >90% accuracy.

Purpose: FGSM exploits first-order linear approximation of loss (Theorem 3), moving inputs along steepest ascent direction. Gradient concentrates on decision-relevant pixels, not random locations. Key insight: low margin models have large gradients (loss steep near boundary), making them highly vulnerable to FGSM. The exercise teaches intuition for gradient geometry: some test points near boundary (large gradient, easy to flip), others far from boundary (small gradient, require large epsilon). This connects margin (Definition 9) to vulnerability directly. In production: FGSM is baseline attack for evaluating robustness; if models break under single-step FGSM, they’re clearly not robust.

ML Link: Directly implements Theorem 3 (First-Order Adversarial Approximation): FGSM uses first-order Taylor $\ell(\mathbf{x} + \epsilon \cdot \text{sign}(\nabla_\mathbf{x} \ell)) \approx \ell(\mathbf{x}) + \epsilon \|\nabla_\mathbf{x} \ell\|_1$. Relates to Definition 12 (Gradient-Based Attack), Definition 1 (Adversarial Perturbation), Definition 9 (Margin), Theorem 7 (Margin-Based Robustness Guarantee). Tests Example 2 (FGSM on Simple Network) empirically on real models.

Hints: Implementation: (1) forward pass compute loss $L = \ell(f(\mathbf{x}), y)$, (2) backward via torch.autograd.grad(L, x, create_graph=False)[0] to get $\nabla_\mathbf{x} \ell$, (3) for $\ell_\infty$: $\delta = \epsilon \cdot \text{sign}(\nabla_\mathbf{x} \ell)$, (4) for $\ell_2$: normalize gradient first, (5) clamp to [0,1]. Handle edge case: zero gradients (add small epsilon to denominator in $\ell_2$ normalization). Vectorize batch operations. Plot: success rate curve (x-axis $\epsilon$, y-axis % success, should be sigmoid), margin reduction box plot, gradient norm histogram. Visualize examples: original, perturbation (rescale to [0,1] for display), adversarial example side-by-side.

What mastery looks like: (1) Correct FGSM: 50–90% success on MNIST at $\epsilon=0.3$ depending on the model, consistent with literature; a typical undefended baseline gives >80%. Verify $\epsilon=0$ gives 0% success. (2) Both norms working: $\ell_\infty$ and $\ell_2$ implementations both correct; $\ell_2$ typically requires a larger epsilon (~1.5× or more) for the same success rate, since the $\ell_\infty$ ball of radius $\epsilon$ contains the $\ell_2$ ball of the same radius. (3) Margin analysis: mean baseline margin ~1.5–3.0 (units of logit difference); after attack, margin reduces to near 0 or negative. Quantify absolute reduction: $\Delta m$ averaging 1–2 logits per example. (4) Gradient geometry: histogram shows wide distribution (some $\|\nabla_\mathbf{x} \ell\|_2$ in [0.1, 1.0], others >1.0); high-gradient examples easier to attack. Compute Pearson correlation between gradient norm and attack success threshold $\epsilon^*$; expect a negative correlation (higher gradient $\Rightarrow$ lower threshold needed). (5) Perturbation visualization: 3–5 examples show original, $\delta$, and $\mathbf{x}^{\text{adv}}$ clearly; perturbations subtle (imperceptible for small epsilon) but effective. (6) Success rate curve fitting: fit sigmoid $S(\epsilon) = 1 / (1 + e^{-k(\epsilon - \epsilon_{50})})$, should explain >95% variance; extract $\epsilon_{50}$ (50% success threshold) ~0.12 for MNIST. (7) Defense verification: train adversarially robust model (Exercise C.4); FGSM attack success drops from ~80% to <30% on robust model at same epsilon. (8) Advanced: compute gradient alignment (cosine similarity between $\nabla_\mathbf{x} \ell$ for pairs of examples); if attacks correlated, alignment 0.5–0.8, showing shared decision boundary geometry.
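The five implementation steps in the C.1 hints can be sketched as follows. This is a minimal illustration, not a reference solution: the untrained linear model, the random inputs, and the helper names `fgsm` and `success_rate` are all hypothetical stand-ins, and the model's own predictions are used as labels so that every input counts as "correctly classified".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fgsm(model, x, y, eps, norm="linf"):
    """Single-step FGSM; returns adversarial inputs clipped to [0, 1]."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x, create_graph=False)[0]
    if norm == "linf":
        delta = eps * grad.sign()
    else:
        # l2: scale each example's gradient to unit norm (1e-8 guards zero gradients)
        g_norm = grad.flatten(1).norm(dim=1) + 1e-8
        delta = eps * grad / g_norm.view(-1, *([1] * (x.dim() - 1)))
    return (x + delta).clamp(0.0, 1.0).detach()


def success_rate(model, x, y, eps, norm="linf"):
    """Fraction of inputs whose prediction differs from y after the attack."""
    x_adv = fgsm(model, x, y, eps, norm)
    with torch.no_grad():
        return (model(x_adv).argmax(dim=1) != y).float().mean().item()


# Hypothetical stand-ins: an untrained linear model and random "images" in [0, 1].
torch.manual_seed(0)
toy_model = nn.Linear(784, 10)
x_demo = torch.rand(64, 784)
with torch.no_grad():
    y_demo = toy_model(x_demo).argmax(dim=1)  # treat current predictions as labels

for eps in [0.0, 0.05, 0.1, 0.15, 0.2, 0.3]:
    rate = success_rate(toy_model, x_demo, y_demo, eps)
    print(f"eps={eps:.2f}  success={rate:.2f}")
```

For the actual exercise, replace the toy model and data with a trained MNIST or CIFAR-10 classifier and its correctly classified test points. Because clipping to [0, 1] only shrinks the perturbation when the clean input is already in range, the returned examples stay inside the requested norm ball.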

C.2 — Implement PGD Attack with Convergence Analysis

Task: Implement Projected Gradient Descent (PGD), an iterative method for finding stronger adversarial examples. For $(\mathbf{x}, y)$: (1) initialize $\delta_0 \sim \text{Uniform}(-\epsilon, \epsilon)$ ($\ell_\infty$; for $\ell_2$: random direction $\times \epsilon$). (2) For $t=1,\dots,T$: compute $g_t = \nabla_\delta \ell(f(\mathbf{x} + \delta_{t-1}), y)$, step $\delta'_t = \delta_{t-1} + \alpha \cdot \text{sign}(g_t)$ ($\ell_\infty$) or $\delta'_t = \delta_{t-1} + \alpha \cdot \frac{g_t}{\|g_t\|_2}$ ($\ell_2$), then project $\delta_t = \Pi_{\mathcal{B}}(\delta'_t)$, where $\mathcal{B}$ is the $\epsilon$-ball of the chosen norm. Compute: (a) loss trajectory $L_t$ (should increase overall), (b) success rate vs. $T \in [1, 5, 10, 20, 40, 100]$, (c) step size sensitivity (vary $\alpha \in [0.01, 0.02, 0.05]$). Test random vs. zero init. Convergence speed: iterations to reach 90% max loss for $\epsilon \in [0.1, 0.2, 0.3]$. All else as C.1: MNIST/CIFAR-10, both norms.

Purpose: Single-step FGSM settles at suboptimal perturbations; PGD iterates to refine, trading one-step simplicity for iterative optimization. Teaches attack strength hierarchy: FGSM $\ll$ PGD-10 $\leq$ PGD-100 $\leq$ true worst-case optimum, showing empirically that attack strength depends on optimization effort. Convergence dynamics: loss increases steeply initially (easy gains while the perturbed input is still far from the boundary), plateaus later (hard refinements), typical convergence ~20–40 iterations for vision tasks. Initialization crucial: random start finds diverse local maxima (stronger, more reliable attack), zero-start may miss some directions. Step size is critical hyperparameter: too small (slow convergence, wasted iterations), too large (oscillation, overshoot), requires tuning. In production: strong attacks like PGD necessary for rigorous robustness evaluation; weak attacks (FGSM) underestimate vulnerability.

ML Link: Implements Definition 12 (Gradient-Based Attack) for iterative optimization. Directly connected to Example 9 (Adversarial Training via PGD): training uses PGD to find attacks, then minimizes loss on them. References Theorem 4 (Dual Form of Robust Optimization): PGD approximately solves the inner maximization of $\min_\theta \max_{\|\delta\| \leq \epsilon} \ell(f_\theta(\mathbf{x} + \delta), y)$; the outer minimization over $\theta$ is handled by training. Shows convergence to local maxima, validating first-order optimization in adversarial setting.

Hints: Key implementation: (1) detach perturbation between iterations (avoid gradient leakage to model), (2) project onto the threat-model ball after each step: $\ell_\infty$ via torch.clamp(delta, -eps, eps), $\ell_2$ via $\delta \gets \delta \times \min(1, \epsilon / \|\delta\|_2)$, (3) handle zero gradients (add 1e-8 to denominators), (4) vectorize batch. For convergence, track loss $L_t$ each iteration, plot (x-axis iteration, y-axis loss); should be monotonic or step-wise increasing. Step size tuning: sweep $\alpha$, plot final success rate vs. $\alpha$ (inverted-U curve, identify optimum). Recommended: MNIST $\alpha \in [0.01, 0.05]$, CIFAR-10 $\alpha \in [2/255, 8/255]$ (standard epsilon scale).

What mastery looks like: (1) Correct PGD: 95–99% success on MNIST at $\epsilon=0.3$, $T=40$, $\alpha=0.02$; significantly >FGSM’s ~80%. (2) Convergence analysis: plot success rate vs. $T$ (sigmoid/saturating curve, low at $T=1$, high at $T=20–40$, plateau by $T=100$). Quantify: $T=10$ achieves ~85–90% of $T=100$ success (law of diminishing returns). Identify effective $T$ for tradeoff (e.g., $T=20$ gives good evaluation speed). (3) Loss trajectory: plot for 3–5 random examples; should show monotonically increasing or step-wise increasing (loss rises with better perturbations), not chaotic/oscillating. Smoothness indicates well-tuned alpha. (4) Step size sensitivity: inverted-U curve, peak at $\alpha^*$ around $\epsilon / (T/2)$. Quantify degradation: at $\alpha = \alpha^*/2$, success drops ~10%; at $\alpha = 2\alpha^*$, drops ~15%. (5) Initialization: random init success rate matches or exceeds zero-start (2–5% higher); shows random helps find diverse attacks. (6) Threat models: both $\ell_\infty$ and $\ell_2$ work; for equivalent budgets ($\epsilon_\infty = 0.3$ vs. $\epsilon_2 \approx 0.45$), success rates similar or slightly higher for $\ell_2$. Verify norms satisfied. (7) PGD vs. FGSM: PGD-20 success >FGSM by 15–30 pp (e.g., FGSM 80%, PGD-20 95%), demonstrating single-step attacks significantly underestimate vulnerability. Gap widens for larger epsilon. (8) Advanced: implement adaptive step size (larger early to explore, smaller late to refine); compare to fixed-alpha, showing whether adaptive converges faster/better.
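The PGD loop from the C.2 task and hints can be sketched as below. The helper names (`project`, `pgd`, `_per_example_norm`) are illustrative, and the sketch adds one assumption not spelled out in the task: after each projection the perturbation is also clipped so that $\mathbf{x} + \delta$ stays in [0, 1], mirroring the clipping used in C.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def _per_example_norm(t):
    """l2 norm of each example, shaped to broadcast against t."""
    return t.flatten(1).norm(dim=1).view(-1, *([1] * (t.dim() - 1)))


def project(delta, eps, norm):
    """Project delta onto the eps-ball of the chosen norm."""
    if norm == "linf":
        return delta.clamp(-eps, eps)
    scale = (eps / (_per_example_norm(delta) + 1e-8)).clamp(max=1.0)
    return delta * scale


def pgd(model, x, y, eps, alpha, steps, norm="linf", random_start=True):
    """Return (x_adv, losses); losses[t] is the batch loss before step t + 1."""
    x = x.detach()
    if not random_start:
        delta = torch.zeros_like(x)
    elif norm == "linf":
        delta = (2 * torch.rand_like(x) - 1) * eps
    else:
        d = torch.randn_like(x)
        delta = eps * d / (_per_example_norm(d) + 1e-8)  # random direction times eps
    delta = (x + delta).clamp(0.0, 1.0) - x
    losses = []
    for _ in range(steps):
        # Detach between iterations so no graph is carried across steps.
        delta = delta.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        losses.append(loss.item())
        if norm == "linf":
            step = alpha * grad.sign()
        else:
            step = alpha * grad / (_per_example_norm(grad) + 1e-8)
        delta = project(delta.detach() + step, eps, norm)
        delta = (x + delta).clamp(0.0, 1.0) - x  # keep x + delta a valid image
    return (x + delta).clamp(0.0, 1.0).detach(), losses


# Hypothetical usage on an untrained linear model with random inputs.
torch.manual_seed(0)
toy_model = nn.Linear(784, 10)
x_demo = torch.rand(32, 784)
y_demo = toy_model(x_demo).argmax(dim=1)
x_adv, trajectory = pgd(toy_model, x_demo, y_demo, eps=0.3, alpha=0.02, steps=40)
print("first/last loss:", trajectory[0], trajectory[-1])
```

The returned `losses` list gives the trajectory $L_t$ for the convergence plots; sweeping `steps`, `alpha`, and `random_start` covers the remaining measurements.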

C.3 — Compute and Visualize Lipschitz Constant via Spectral Norms

Task: Estimate neural network Lipschitz constants by computing spectral norms of weight matrices. Setup: train feedforward network (2–3-layer MLP or small CNN) on MNIST. For each weight matrix $W$, compute $\sigma_{\max}(W)$ via SVD; for conv layers $[\text{out}, \text{in}, h, w]$, reshape to $[\text{out}, \text{in} \times h \times w]$ (a common proxy; the exact operator norm of the convolution can differ). Overall $L_f = \prod_{i=1}^L \sigma_{\max}(W_i)$ (product of per-layer norms, an upper bound assuming 1-Lipschitz activations like ReLU). Measure during training: (1) per-layer spectral norms vs. step for each layer, (2) overall $L_f$ vs. step, (3) empirical Lipschitz via finite differences: sample 1000 random pairs $(\mathbf{x}, \mathbf{x}')$, compute $\|f(\mathbf{x}) - f(\mathbf{x}')\|_2 / \|\mathbf{x} - \mathbf{x}'\|_2$, report percentile (e.g., 95th). Compare: (a) standard training (unconstrained), (b) spectral normalization (constrain each $\sigma_{\max} \leq 1$). Metrics: test accuracy, $L_f$, certified radius $r = m / L_f$ (estimate margin $m$ from logits on correct examples).

Purpose: Lipschitz constant quantifies output sensitivity to inputs: $\|f(\mathbf{x}) - f(\mathbf{x}')\|_2 \leq L_f \|\mathbf{x} - \mathbf{x}'\|_2$. Large $L_f$ means small perturbations can cause large output changes (vulnerable). Key insights: (1) Standard networks have huge Lipschitz bounds (roughly 20–1000), enabling adversarial attacks; spectral normalization reduces the bound to ~1, dramatically improving certified bounds, (2) Lipschitz compounds over layers: $L$ layers with $\sigma_{\max} = 2$ each give a bound of $2^L$ (exponential in depth), (3) Empirical < theoretical: theory provides upper bounds (conservative), empirical sampling often smaller (worst-case not typical), (4) Accuracy-robustness tradeoff: spectral normalization constrains weights, reducing expressiveness (2–5% accuracy drop) but improving certified robustness 50–100×. Exercise quantifies fundamental robust ML tradeoffs.

ML Link: Implements Theorem 1 (Lipschitz Bound for Robust Perturbations) empirically: finite differences test $\|f(\mathbf{x} + \delta) - f(\mathbf{x})\|_2 \leq L_f \|\delta\|_2$. Central to Theorem 7 (Margin-Based Robustness Guarantee): $r = m / L_f$ depends directly on $L_f$; reducing $L_f$ increases $r$ proportionally. References Definition 4 (Lipschitz Continuity), Example 3, connects to practical spectral norm computation.

Hints: SVD: $W = U \Sigma V^T$, then $\sigma_{\max} = \Sigma_{11}$ (the first singular value). Use torch.linalg.svd() or power iteration (faster, ~5–10 iterations). Handle rank-deficiency (use only positive singular values). For CNN: flatten spatial to $\mathbf{W}_{2D}$. Empirical Lipschitz: sample random $\mathbf{x}, \mathbf{x}'$ from input distribution, compute ratio, report 90th/95th percentile (avoids noise outliers). Margin: $m(\mathbf{x}) = f(\mathbf{x})_y - \max_{j \neq y} f(\mathbf{x})_j$; average over correctly classified test examples.

What mastery looks like: (1) Correct spectral norm computation: per-layer norms reasonable (0.5–5 for standard, ~1 for normalized); compute exactly for small networks, verify power iteration matches to ±5%. (2) Training trajectory: standard $L_f$ starts ~10–100, may rise initially, stabilizes/decreases; normalized stays ~1±0.1. (3) Empirical vs. theoretical: both plotted; empirical ≤ theoretical (gap 10–50%, empirical 50–90% of bound). (4) Spectral norm reduction: standard $L_f$ 20–100× larger; normalized $L_f$ ~1; reduction factor matches defense benefit. (5) Certified radius: standard radius $r = m / L_f$ tiny (0.02–0.1 for MNIST margin ~1–2); normalized $r \approx m$ (10–50× larger). Ratio $r_{{\text{norm}}} / r_{{\text{standard}}}$ matches Lipschitz reduction. (6) Accuracy tradeoff: standard 95%+, normalized 90–93% (acceptable 2–5% drop). (7) Per-layer analysis: identify layers with largest $\sigma_{{\max}}$ (often input/output), propose optimizations. (8) Advanced: implement spectral norm training (constrain during updates), compare convergence, stability, final performance vs. post-hoc computation.
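To make the comparison between the spectral-norm product and the finite-difference estimate in C.3 concrete, here is a small NumPy sketch. It assumes a bias-free ReLU MLP with random (untrained) weights and uniformly sampled inputs; both are placeholders for a trained MNIST network and real test pairs, and the function names are hypothetical.

```python
import numpy as np


def forward(x, weights):
    """Bias-free ReLU MLP: ReLU after every layer except the last."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(h @ W.T, 0.0)
    return h @ weights[-1].T


def lipschitz_upper_bound(weights):
    """Product of per-layer spectral norms (largest singular values)."""
    sigmas = [np.linalg.svd(W, compute_uv=False)[0] for W in weights]
    return float(np.prod(sigmas)), sigmas


def empirical_lipschitz(weights, x1, x2, q=95):
    """Percentile and max of ||f(x1) - f(x2)||_2 / ||x1 - x2||_2 over pairs."""
    num = np.linalg.norm(forward(x1, weights) - forward(x2, weights), axis=1)
    den = np.linalg.norm(x1 - x2, axis=1)
    ratios = num / den
    return float(np.percentile(ratios, q)), float(ratios.max())


# Hypothetical random weights standing in for a trained 784-64-64-10 network.
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (64, 784)),
           rng.normal(0, 0.1, (64, 64)),
           rng.normal(0, 0.1, (10, 64))]
x1, x2 = rng.random((1000, 784)), rng.random((1000, 784))

bound, sigmas = lipschitz_upper_bound(weights)
p95, worst = empirical_lipschitz(weights, x1, x2)
print("per-layer sigma_max:", [round(s, 3) for s in sigmas])
print(f"bound={bound:.3f}  empirical p95={p95:.3f}  empirical max={worst:.3f}")
```

Because ReLU is 1-Lipschitz, every sampled ratio must fall at or below the product bound; a violation would indicate a bug in either computation.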

C.4 — Implement Adversarial Training via Adversarial Examples Augmentation

Task: Implement training loop alternating PGD attack + weight updates. Setup: for each batch, (1) PGD attack (K=10 iterations, $\epsilon(\text{epoch})$) to find $\mathbf{x}^{\text{adv}}$, (2) loss on adversarial: $L_{\text{adv}} = \frac{1}{m} \sum_i \ell(f_\theta(\mathbf{x}_i^{\text{adv}}), y_i)$, (3) backprop: $\theta \gets \theta - \alpha_\theta \nabla_\theta L_{\text{adv}}$. Curriculum: $\epsilon(\text{epoch}) = 0.001 + 0.299 \times (\text{epoch} / 30)$ (ramp 0.001 to 0.3 over 30 epochs). Measure: (1) clean train/test accuracy, (2) robust train/test accuracy (PGD-20 at $\epsilon=0.3$), (3) adversarial loss curve, (4) training time. Baselines: (a) standard training, (b) adversarial without curriculum. MNIST/CIFAR-10.

Purpose: Robust optimization minimax objective: $\min_\theta \mathbb{E}_{(\mathbf{x}, y)}\left[\max_{\|\delta\| \leq \epsilon} \ell(f_\theta(\mathbf{x} + \delta), y)\right]$. Standard SGD on clean data does not address the inner maximization; need inner maximization (attack) + outer minimization (defense). Key ideas: (1) Inner maximization: PGD finds strong attacks; weak attacks → gradient masking (Exercise C.7), (2) Outer minimization: SGD on adversarial examples; must improve both standard and robust objectives, (3) Curriculum aids: large $\epsilon$ early → instability (loss oscillates, no convergence); gradual ramp → stable feature learning, (4) Computational cost: PGD adds 5–10× overhead (K forward-backward passes), expensive. Production: balance accuracy loss, robustness gain, computational cost. High-stakes (autonomous, medical): cost acceptable. Latency-critical: may be infeasible.

ML Link: Applies Theorem 4 (Dual Form of Robust Optimization) via alternating optimization. References Definition 15 (Robust Optimization Problem), Example 9, Theorem 8 (Robustness-Accuracy Tradeoff). Empirically demonstrates: robust accuracy increases substantially, while standard accuracy decreases 1–10% (MNIST/CIFAR-10).

Hints: Pseudocode: for epoch in range(30): for batch in train_loader: eps = 0.001 + 0.299 * (epoch / 30); x_adv = pgd_attack(..., eps); loss = ce_loss(model(x_adv), y); optimizer.zero_grad(); loss.backward(); optimizer.step(). Curriculum: linear ramp or exponential (experiment). LR: 0.001–0.01, momentum 0.9, batch 128 (large for 5–10× compute). Evaluate with PGD-20 (stronger than training) to avoid overestimating. Log: loss/batch, accuracy/epoch (clean & robust). Checkpoint best robust test loss.

What mastery looks like: (1) Alternating opt: adversarial loss decreases over epochs; validate by rising adversarial accuracy (model predicts correctly on more attacks). (2) Tradeoff: standard trained 95% clean, 0% robust; adversarial trained 90% clean, 50–60% robust on MNIST (40–50 pp gain vs. 5 pp loss). (3) Curriculum: compared to fixed-$\epsilon$, curriculum faster convergence, better accuracy (5% lower drop, 10% higher robustness). (4) Robust loss curve: plot vs. step; high initial loss (vulnerable) → decreases (robust). May jump when epsilon increases midway. (5) Training time: 5–10× slower (standard ~5 min, adversarial ~30–50 min per 30 epochs). (6) Robustness persistence: PGD-100 attack success exceeds PGD-20 success by <10 pp (genuine robustness, not gradient masking; contrast C.7). (7) Hyperparameter sensitivity: vary K (PGD iterations), epsilon endpoint; robustness up with more K and larger epsilon (diminishing returns after K=20, epsilon=0.3). (8) Advanced: implement TRADES (C.14) or other variants; compare robustness-accuracy frontiers.
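The pseudocode in the C.4 hints can be expanded into a runnable loop along these lines. Several details are assumptions made for the sketch: the inner attack is an $\ell_\infty$ PGD with step size `2.5 * eps / k`, the curriculum ramp is spread over however many epochs are requested, and the data loader is a short list of random tensors standing in for MNIST batches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def eps_schedule(epoch, total_epochs=30, eps_min=0.001, eps_max=0.3):
    """Linear curriculum: eps_min at epoch 0, approaching eps_max at the end."""
    return eps_min + (eps_max - eps_min) * epoch / total_epochs


def pgd_linf(model, x, y, eps, alpha, steps):
    """l_inf PGD with random start; gradients are taken w.r.t. delta only."""
    delta = (2 * torch.rand_like(x) - 1) * eps
    for _ in range(steps):
        delta = delta.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta.detach() + alpha * grad.sign()).clamp(-eps, eps)
        delta = (x + delta).clamp(0.0, 1.0) - x
    return (x + delta).detach()


def adversarial_train(model, loader, epochs=30, k=10, lr=0.01):
    """Alternate the inner attack and the outer SGD step; return per-epoch adversarial loss."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    history = []
    for epoch in range(epochs):
        eps = eps_schedule(epoch, epochs)
        total, count = 0.0, 0
        for x, y in loader:
            # Inner maximization (assumed step size: 2.5 * eps / k).
            x_adv = pgd_linf(model, x, y, eps, alpha=2.5 * eps / k, steps=k)
            # Outer minimization on the adversarial batch.
            loss = F.cross_entropy(model(x_adv), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item() * len(x)
            count += len(x)
        history.append(total / count)
    return history


# Hypothetical stand-in data: four random batches instead of an MNIST loader.
torch.manual_seed(0)
toy_loader = [(torch.rand(32, 784), torch.randint(0, 10, (32,))) for _ in range(4)]
toy_model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 10))
print(adversarial_train(toy_model, toy_loader, epochs=3, k=3))
```

Evaluation with a stronger attack than the one used in training (PGD-20 versus K=10 here) is left out of the sketch; it can reuse `pgd_linf` with more steps on held-out data.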

C.5 — Analyze Margin and Robustness Radius for Linear Models

Task: Train logistic regression (binary, e.g., MNIST 0 vs. 1), compute margins, certify robustness. Setup: linear $f(\mathbf{x}) = w^T \mathbf{x} + b$. For each test $(\mathbf{x}_i, y_i)$, $y_i \in \{-1, +1\}$: (1) signed margin $m_i = y_i(w^T \mathbf{x}_i + b)$, (2) Lipschitz $L_f = \|w\|_2$, (3) certified radius $r_i = m_i / L_f$. Verify: for each point, sample 100 $\delta$ in $\ell_2$-ball $\|\delta\|_2 < 0.8 \times r_i$, check $\text{sign}(f(\mathbf{x}_i + \delta)) = y_i$ (100% success). Measure: (1) margin stats (mean, std, percentiles), (2) radius stats, (3) verification success %, (4) % vulnerable ($r < 0.1$). Sweep regularization $\lambda \in [0, 0.01, 0.1, 1.0]$.

Purpose: Linear models provide closed-form, interpretable robustness analysis. With margin $m(\mathbf{x}) = y(w^T \mathbf{x} + b)$ and $L_f = \|w\|_2$, the certified radius $r = m / L_f = y(w^T \mathbf{x} + b) / \|w\|_2$ is exactly the signed distance to the decision boundary. Insights: (1) Margin distribution skewed: many near boundary (vulnerable), few far (robust), (2) Regularization lowers Lipschitz (reduces $\|w\|_2$), which tends to increase the radius, but may also lower the margin (less confident), (3) Linear as baseline: nonlinear models harder to analyze; linear is interpretable reference, (4) Dimensionality effect: high dims = more attack directions. Exercise bridges theory (Theorem 7) and practice: tune regularization balancing margin & Lipschitz.

ML Link: Direct application Theorem 7 (Margin-Based Robustness Guarantee): $r = m / L_f$ derived & empirically validated. Uses Definition 9 (Margin), Definition 4 (Lipschitz Continuity) with explicit $L_f = \|w\|_2$. References Example 1, extends with verification. Demonstrates Supplementary Proof 2.

Hints: Convert 0/1 to ±1 labels. Margin: $m = y \odot \text{logits}$ (element-wise, broadcast). Lipschitz: $L_f = \|w\|_2$. Radius: $r = m / L_f$ (clip to valid range to avoid projection effects, or use unbounded normalized features). Margin histogram, radius histogram, plot for each $\lambda$. Regularization: each $\lambda$ trains separate logistic model, compare $\|w\|_2$ and radius stats.

What mastery looks like: (1) Correct computation: verify 2–3 examples manually. (2) Margin distribution: right-skewed histogram (many small, few large); MNIST 0 vs. 1 typically mean ~0.3, std ~0.15, min ~0.05, max ~0.8. (3) Radius stats: mean ~0.1–0.5; on MNIST interpretation: 0.2 ≈ 2-pixel shift. (4) Verification >95%: sample 100 perturbations per point within 0.8r; >95% preserve sign (if <90%: check margin computation, perturbation sampling, clipping). (5) Vulnerable %: e.g., $r < 0.1$ should be <20% for well-separated tasks. (6) Regularization effect: $\|w\|_2$ decreases inversely with $\lambda$ (roughly 1/√λ decay); mean radius increases. Find $\lambda^*$ balancing margin & Lipschitz. (7) Scaling: features scaled by 2 → margins × 2 → radii × 2 (correct proportionality). (8) Advanced: compare to nonlinear (Exercise C.3); nonlinear has much larger $L_f$, hence smaller certified radius despite larger margin. Linearity enables smaller $L_f$, supporting robustness.
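The margin, radius, and verification steps of C.5 can be sketched in NumPy as follows. The weights and data are synthetic placeholders rather than a trained MNIST 0-vs-1 model, and labels are set to the model's own sign so every point has a positive margin; `margins_and_radii` and `verify_point` are hypothetical helper names.

```python
import numpy as np


def margins_and_radii(w, b, X, y):
    """Signed margins m_i = y_i (w^T x_i + b) and radii r_i = m_i / ||w||_2."""
    m = y * (X @ w + b)
    return m, m / np.linalg.norm(w)


def verify_point(w, b, x, y, r, n_samples=100, frac=0.8, rng=None):
    """Fraction of random perturbations with ||delta||_2 < frac * r that keep sign(f) = y."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = rng.normal(size=(n_samples, x.size))
    d /= np.linalg.norm(d, axis=1, keepdims=True)   # random unit directions
    radius = frac * r * rng.random((n_samples, 1))  # random lengths in [0, frac * r)
    preds = np.sign((x + d * radius) @ w + b)
    return float(np.mean(preds == y))


# Synthetic stand-in for a trained binary linear classifier.
rng = np.random.default_rng(0)
w, b = rng.normal(size=20), 0.1
X = rng.normal(size=(200, 20))
y = np.sign(X @ w + b)  # labels in {-1, +1}; every point is "correctly classified"

m, r = margins_and_radii(w, b, X, y)
rates = [verify_point(w, b, X[i], y[i], r[i], rng=rng) for i in range(len(X))]
print(f"mean margin={m.mean():.3f}  mean radius={r.mean():.3f}")
print(f"min verification rate={min(rates):.2f}  vulnerable (r<0.1)={(r < 0.1).mean():.2%}")
```

By Cauchy-Schwarz, $|w^T \delta| \leq \|w\|_2 \|\delta\|_2 < 0.8\, m$, so every sampled perturbation must preserve the sign; a verification rate below 100% points to an error in the margin or sampling code rather than to the model.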

C.6 — Compute Certified Robustness via Randomized Smoothing

Task: Implement randomized smoothing to certify robustness. Setup: for test point $\mathbf{x}$, sample $N$ noisy versions $\mathbf{x} + \mathcal{N}(0, \sigma^2 I)$, collect predictions, compute probability $p_A$ of top-1 class. Certified radius (Cohen et al., 2019): $r = \frac{\sigma}{2}(\Phi^{-1}(p_A) - \Phi^{-1}(p_B))$ where $\Phi$ is standard normal CDF, $p_B$ is runner-up class prob. Measure for $\sigma \in [0.12, 0.25, 0.5, 1.0]$: (1) Accuracy vs. noise: clean accuracy (regular predict, no noise) & noisy accuracy (predict on $\mathbf{x} + \mathcal{N}$, compare to true label) for various $N$ (100, 1000, 10000 samples per point), (2) Certified radius distribution: mean, percentiles, (3) Accuracy-radius frontier: plot clean acc vs. mean certified radius for each $\sigma$ (Pareto curve), (4) Empirical verification: if the smoothed classifier (majority vote under noise) predicts class $c$ on $\mathbf{x}$ with certified radius $r$, sample perturbations or attacks with $\|\delta\|_2 < r$ and verify that the smoothed prediction stays $c$ (<5% failure rate; the certificate covers the smoothed classifier, not the base model), (5) Computational cost: measure forward passes per example as function of $N$.

Purpose: Randomized smoothing provides formal guarantees without computing Lipschitz constants (addresses limitations of Theorem 7 on complex architectures). Key insights: (1) Probabilistic certification: add noise, majority vote provides certifiable robustness (different mechanism than Lipschitz bounds), (2) Noise-robustness-accuracy tradeoff: large $\sigma$ → larger certified radius but lower clean accuracy, (3) Computational cost: requires many forward passes (1000–10000) per example for certification, expensive for deployment, (4) Empirical robustness > certified: empirical (via sampling attacks) often 2–5× larger than certified (conservative guarantees). In production: tradeoff between coverage (accept examples) and certified radius (desired safety level).

ML Link: Implements Theorem 10 (Certified Radius via Randomized Smoothing concept), contrasts with Lipschitz methods (Theorems 1, 7). Complements deterministic certified radius with probabilistic approach. Enables certified robustness for models where $L_f$ unknown or very large (ResNets, Transformers). References Example 13 (Certified Robustness via Randomized Smoothing).

Hints: For each $\mathbf{x}$: sample $N$ noise vectors $\epsilon_i \sim \mathcal{N}(0, \sigma^2 I)$, get predictions $\{f(\mathbf{x} + \epsilon_i)\}_{i=1}^N$, count votes per class, extract $p_A$ (top vote fraction), $p_B$. Compute $\Phi^{-1}$ via scipy.stats.norm.ppf. Clopper-Pearson confidence intervals to account for finite-sample estimation error. Early stop if $p_A < 0.5$ (radius = 0). $N$ sweep: 100 (fast, noisy), 1000 (standard), 10000 (precise certification).

What mastery looks like: (1) Correct certification: certified radius increases with $\sigma$, decreases as the vote margin shrinks (low $p_A$). (2) Clean accuracy drop: with $\sigma = 0.25$ and $N=1000$, clean accuracy drops ~5–10% (noise hurts), but certified radius ~0.2–0.3. (3) Accuracy-radius frontier: plot all (sigma, clean_acc, mean_radius) points, show Pareto curve (tradeoff). (4) Empirical robustness bound: for examples with certified radius $r$, sample perturbations $\|\delta\|_2 < 0.9r$, >95% keep prediction (verification). (5) Computational cost analysis: $N=1000$ per example costs 1000 forward passes (vs. 1 for clean), quantify wall-clock time per example (e.g., 50ms per example on GPU). (6) Comparison to Lipschitz: certified radius via randomized smoothing typically larger than Lipschitz-based (Theorem 7) for the same model, but computational cost higher. (7) Coverage-robustness tradeoff: implement accept/reject based on certified radius threshold $r_{\min}$; vary and plot coverage (%) vs. $r_{\min}$; a higher threshold gives lower coverage. (8) Advanced: implement variance reduction techniques (importance sampling) to reduce samples per example.
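The vote-counting and radius formula from the C.6 task and hints can be illustrated with the sketch below. It is a plug-in estimate only: the Clopper-Pearson confidence bounds recommended in the hints are omitted, vote fractions are clipped away from 0 and 1 to keep $\Phi^{-1}$ finite, and the "base classifier" is a hypothetical rule that predicts class 1 when the first coordinate is positive. The standard library's `statistics.NormalDist().inv_cdf` stands in for scipy.stats.norm.ppf.

```python
from statistics import NormalDist

import numpy as np

_PHI_INV = NormalDist().inv_cdf  # inverse standard normal CDF


def vote_fractions(classify, x, sigma, n, num_classes, rng):
    """Fraction of votes per class over n Gaussian-noised copies of x."""
    noisy = x + sigma * rng.normal(size=(n, x.size))
    counts = np.bincount(classify(noisy), minlength=num_classes)
    return counts / n


def certified_radius(p_a, p_b, sigma, clip=1e-6):
    """r = sigma / 2 * (Phi^{-1}(p_a) - Phi^{-1}(p_b)); 0 if there is no majority."""
    if p_a <= 0.5:
        return 0.0
    p_a = min(max(p_a, clip), 1 - clip)  # avoid infinite quantiles at 0 or 1
    p_b = min(max(p_b, clip), 1 - clip)
    return 0.5 * sigma * (_PHI_INV(p_a) - _PHI_INV(p_b))


def certify(classify, x, sigma, n, num_classes, rng):
    """Return (predicted class, plug-in certified radius) for the smoothed classifier."""
    p = vote_fractions(classify, x, sigma, n, num_classes, rng)
    order = np.argsort(p)[::-1]
    top, runner_up = int(order[0]), int(order[1])
    return top, certified_radius(float(p[top]), float(p[runner_up]), sigma)


# Hypothetical base classifier: class 1 if the first coordinate is positive.
def toy_classifier(batch):
    return (batch[:, 0] > 0).astype(int)


rng = np.random.default_rng(0)
x = np.zeros(10)
x[0] = 1.0  # distance 1.0 from the decision boundary
for sigma in [0.12, 0.25, 0.5, 1.0]:
    label, radius = certify(toy_classifier, x, sigma, n=10000, num_classes=2, rng=rng)
    print(f"sigma={sigma:.2f}  class={label}  radius={radius:.3f}")
```

For this linear toy rule the exact radius is 1.0, which the estimate approaches when $\sigma$ is large enough that $p_A$ is not clipped; with small $\sigma$ the vote fraction saturates and the clipped quantile, not the geometry, determines the reported radius.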

C.7 — Gradient Masking Analysis and Adaptive Attacks

Task: Train two models on MNIST: (A) FGSM-trained (adversarial training with only FGSM, K=1, no randomization), (B) PGD-trained (Exercise C.4 with K=10). Evaluate robustness against: (1) FGSM, (2) PGD-20. Measure: gradient norms for each example, group by model, compute mean $\|\nabla_\mathbf{x} \ell\|_2$ per model. Hypothesis: the FGSM-trained model has very small gradients at the points FGSM evaluates (masked), but large gradients along PGD iterates (which break the defense). Quantify via sensitivity matrix: for each (attack, model) pair, compute success rate, gradient norm, entropy of top-5 logits (low entropy = overconfident). Compare: standard trained model (no defense). Measure: (1) success rate matrices, (2) gradient norm distributions, (3) logit entropy comparisons, (4) adaptive attack: modify attack to maximize loss $\ell$ while minimizing gradient norm $\|\nabla_\mathbf{x} \ell\|_2$, show FGSM-trained breaks badly.

Purpose: Gradient masking (obfuscated gradients) is a false defense: reducing gradients doesn’t imply true robustness, merely hides vulnerability from gradient-based attacks. Key insight: FGSM-trained models reduce gradients via feedback (train to small margins, obfuscate decision boundary), but strong attacks like PGD break through. In production: evaluating robustness requires strong attacks; gradient masking is a common failure mode, and a number of published defenses were later broken by adaptive attacks. Exercise teaches rigorous evaluation: adaptive attacks circumvent masking, true robustness persists under multiple attack methods. Industrial lesson: don’t trust defenses that break under alternative attacks.

ML Link: Relates to Theorem 3 (First-Order Approximation), Definition 12 (Gradient-Based Attack). Demonstrates pitfall of Theorem 3: weak attacks (FGSM) miss true vulnerability if gradients masked. References Definition 11 (Gradient Masking), contrasts with Definition 12’s threat model (strong gradient-based attacks like PGD). Empirically validates that PGD robustness generalizes better than FGSM.

Hints: Training: use Exercise C.1 for FGSM-only training (single step, no K loop). For gradient norm analysis: collect $\nabla_\mathbf{x} \ell$ for 1000 examples, plot histograms side-by-side (FGSM-trained, PGD-trained, standard). Logit entropy: $H = -\sum_c p_c \log p_c$, where $p_c$ from softmax. Adaptive attack: implement a step combining the loss gradient with the gradient of the log gradient norm: $\delta \gets \delta + \alpha (\nabla_\delta \ell + \beta \nabla_\delta \log \|\nabla_\delta \ell\|)$, where $\beta$ sets the trade-off between the two terms. Verify: FGSM-trained fails (>80% success), PGD-trained robust (<30% success).

What mastery looks like: (1) Correct training: FGSM-trained model achieves ~70% robust accuracy against FGSM at $\epsilon=0.3$; PGD-trained achieves ~50–60%. (2) Gradient masking detection: for the FGSM-trained model, mean $\|\nabla_\mathbf{x} \ell\|_2$ at FGSM attack points is ~0.3–0.5 (low), while along PGD iterates it is ~1.0–2.0 (high). A ratio >3× between the two on the same model indicates masking. (3) Adaptive attack breaks FGSM-trained: success jumps from ~20% (standard PGD) to ~70–80% (with adaptive component), demonstrating vulnerability. PGD-trained robust persists: ~30–40% success (honest evaluation). (4) Logit entropy: standard & FGSM-trained low entropy when attacked (overconfident wrong predictions); PGD-trained higher entropy (uncertainty). (5) Success rate matrix: FGSM-trained benefits from FGSM evaluation (low %) but high % against PGD; PGD-trained consistent across attacks. (6) Visualization: plot gradient norms as box-plots; show clear separation: FGSM-trained has narrow distribution, PGD-trained wider. (7) Cross-model transfer: apply FGSM-trained model’s attacks to PGD-trained; show poor transfer (<20% success), validating defense effectiveness. (8) Advanced: implement expectation-over-transformation (EOT) attacks; add input preprocessing randomization to attacks; measure robustness gap with/without EOT.
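The logit-entropy diagnostic used in C.7, $H = -\sum_c p_c \log p_c$ over the softmax of the top-5 logits, can be computed as in this small NumPy sketch. The function name and the example logits are illustrative only.

```python
import numpy as np


def softmax_entropy(logits, top_k=None):
    """Entropy of softmax probabilities, optionally restricted to the top_k logits."""
    logits = np.asarray(logits, dtype=float)
    if top_k is not None:
        logits = np.sort(logits, axis=-1)[..., -top_k:]  # keep the k largest logits
    z = logits - logits.max(axis=-1, keepdims=True)       # stabilize the exponentials
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)


# Illustrative logits: one overconfident prediction, one uncertain prediction.
batch = np.array([
    [12.0, 0.5, 0.1, -0.3, -1.0, -2.0, -2.5, -3.0, -3.5, -4.0],
    [1.1, 1.0, 0.9, 1.0, 1.05, -2.0, -2.5, -3.0, -3.5, -4.0],
])
print(softmax_entropy(batch, top_k=5))
```

Low values flag overconfident predictions; the maximum possible value for five classes is $\log 5 \approx 1.609$, reached when the top-5 logits are equal.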

C.8 — Spectral Normalization Training and Lipschitz Control

Task: Implement spectral normalization during training using power iteration. For each weight matrix $W$: (1) initialize $\mathbf{u} \sim \mathcal{N}(0, I)$, (2) each step compute $\mathbf{v} \gets W^T \mathbf{u} / \|W^T \mathbf{u}\|_2$, $\mathbf{u} \gets W \mathbf{v} / \|W \mathbf{v}\|_2$ (1 iteration per step, or 1 per epoch), (3) estimate $\sigma = \mathbf{u}^T W \mathbf{v}$, (4) normalize $\bar{W} = W / \sigma$. Train model with normalized weights on MNIST. Measure: (1) per-layer $\sigma$ during training (collect every 100 steps), (2) overall $L_f = \prod_i \sigma_i$, (3) empirical Lipschitz via finite differences (same as C.3), (4) train/test accuracy. Compare three variants: (A) standard (no normalization), (B) spectral normalization (constrain $\sigma_i = 1$), (C) spectral normalization with $\sigma < 0.9$ (more constrained). Measure training time overhead.

Purpose: Spectral normalization provides explicit Lipschitz control during training, enabling certified robustness. Key insights: (1) Efficiency: per-layer $\sigma$ normalization is computationally cheap (~1% overhead) vs. global Lipschitz bounds, (2) Accuracy-robustness tradeoff: spectral norm = 1 → certified radius $r ≈ m$, but weight constraints ↓ expressiveness (2–5% accuracy drop), (3) Stability: normalized networks train smoother (less oscillation) due to bounded Lipschitz, (4) Generalization: spectral norm acts as implicit regularizer (reduces overfitting 2–3%). In production: balance certified robustness need (life-critical: use spectral norm) vs. accuracy requirements (standard networks acceptable if attack risk low).

ML Link: Operationalizes Theorem 1 (Lipschitz Bound) during training. Directly controls $L_f$ via spectral norms per Definition 4 (Lipschitz Continuity). Combines with Theorem 7 (Margin-Based Robustness): $r = m / L_f$ improved by reducing $L_f$. References Example 3 (Lipschitz computation via SVD). Shows how architecture constraints enable theoretical guarantees with minimal accuracy cost.

Hints: Power iteration: initialize $\mathbf{u}$ once per layer; update in-place each training step (no gradient required). Vectorization: stack for batch dimensions, broadcast normalization to weight matrix. Careful with gradients: apply the normalization to the weights before each forward pass (e.g., via register_forward_pre_hook), and treat $\mathbf{u}$ and $\mathbf{v}$ as constants so that no gradient flows through the power iteration. For conv layers, vectorize by reshaping. Accumulate $\sigma$ over epochs, plot smoothing with rolling average. Compare learning curves: standard may oscillate more, normalized smoother.

What mastery looks like: (1) Correct power iteration: $\sigma$ estimates match SVD to ±10% after 100 training steps (convergence quick). (2) Control over Lipschitz: standard $L_f$ ranges 10–100×; spectral norm ($\sigma=1$) stays ~1±0.1 throughout training. (3) Empirical Lipschitz: standard empirical 50–100% of Lipschitz bound; spectral normalized ≤ theoretical bound (certified). (4) Accuracy tradeoff: standard 95%, normalized ($\sigma=1$) 91–92% (3–4% drop); restricted ($\sigma=0.9$) ~89% (further drop ~2%). (5) Convergence: normalized networks converge in same epochs but loss curve smoother (less variance). (6) Computational overhead: <2% per-epoch timing increase (power iteration fast). (7) Certified radius: for normalized model, $r ≈ m / 1 ≈ m$ (large), e.g., mean radius 0.5–1.5 on MNIST; standard $r$ tiny (0.05–0.1). (8) Advanced: implement spectral norm on activations (block-wise); compare full-network Lipschitz control vs. per-layer (diminishing returns after per-layer approach).
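The power-iteration update described in the C.8 task can be checked against an exact SVD with a few lines of NumPy. The sketch holds a random matrix fixed while $\mathbf{u}$ is refined one iteration per "step", which is an assumption made for illustration; during real training the weights change between steps and the estimate tracks them.

```python
import numpy as np


def power_iteration_step(W, u, n_iters=1):
    """Refine u and return (sigma estimate, updated u) for weight matrix W."""
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = float(u @ W @ v)  # sigma = u^T W v
    return sigma, u


# Hypothetical fixed weight matrix; u persists across steps as in the hints.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))
u = rng.normal(size=64)
for step in range(300):  # stands in for 300 training steps with W held fixed
    sigma, u = power_iteration_step(W, u)

exact = np.linalg.svd(W, compute_uv=False)[0]
W_bar = W / sigma  # spectrally normalized weights
print(f"power iteration sigma={sigma:.4f}  exact={exact:.4f}")
print(f"sigma_max(W_bar)={np.linalg.svd(W_bar, compute_uv=False)[0]:.4f}")
```

The estimate approaches the true spectral norm from below, so the normalized matrix has spectral norm at or slightly above 1 until the iteration has converged.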

C.9 — Hessian Eigenvalue Analysis and Sharpness-Robustness Connection

Task: Compute Hessian $H = \nabla^2_\theta \ell(f_\theta(\mathbf{x}), y)$ via double backpropagation for 100 random examples on MNIST. For each example: (1) forward pass, loss $L$, (2) compute the gradient with torch.autograd.grad(L, theta, create_graph=True), then differentiate each gradient component with a second torch.autograd.grad call to build the Hessian row by row, or use torch.autograd.functional.hessian (manage memory: use hooks or mini-batches), (3) compute eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$, (4) compute sharpness metric $S = \log(\lambda_1)$ (top eigenvalue in log scale). Train two models: (A) standard SGD, (B) Sharpness-Aware Minimization (SAM): at each step, approximate $\delta^* = \arg\max_{\|\delta\| \leq \rho} \ell(f_{\theta+\delta})$ with a single normalized gradient ascent step, then update $\theta \gets \theta - \alpha \nabla_\theta \ell(f_{\theta+\delta^*})$ (steering from sharp toward flat minima). Measure: (1) Hessian spectrum (top 20 eigenvalues) for each model, (2) average sharpness $\bar{S}$, (3) condition number $\lambda_1 / \lambda_d$ (ill-conditioning indicator), (4) train/test accuracy, (5) robust accuracy (PGD-20, $\epsilon=0.3$). Plot: eigenvalue spectrum (Hinton-style), sharpness vs. epoch.

Purpose: Loss landscape sharpness (large Hessian eigenvalues) correlates with poor generalization and adversarial vulnerability. Key insights: (1) Sharpness-vulnerability: flat minima → small Lipschitz, small perturbations can’t change output; sharp minima → high Lipschitz, adv. examples easy, (2) SAM trades speed for flatness: adds overhead (extra forward + backward per step) but significantly flattens landscape (2–10× eigenvalue reduction), (3) Generalization connection: flat minima generalize better (Keskar et al., 2016), also more robust, (4) Empirical Hessian expensive: full computation needs O(d²) memory (d = 10k parameters → $10^8$ entries, about 400 MB in float32), requires approximations for large networks. In production: SAM improves robustness 10–20 pp with moderate overhead; tradeoff between accuracy and robustness.

ML Link: Connects landscape geometry to robustness. References generalization theory (margin, complexity); recent work (Foret et al., SAM) shows flatness → both better generalization & robustness. Relates to Theorem 1 (Lipschitz bounds depend on weight magnitude), by proxy via Hessian spectral radius (bounded by weight norms). Demonstrates interplay between loss landscape and adversarial robustness empirically.

Hints: Double backprop: use torch.autograd.grad(..., create_graph=True) to track gradients; can be memory-intensive. Approximations: (1) Hutchinson trace estimator (trace(H) ≈ E[z^T H z] for z random), (2) top-k eigenvalues via Lanczos (approximates spectrum), using torch.linalg.eigh for full Hessian on small examples. SAM implementation: rho=0.05 (small perturbation to find sharp direction), compute $\theta’ = \theta + \rho \cdot g / \|g\|$ where $g = \nabla_\theta \ell$. Plot: top 20 eigenvalues, log-scale or linear (identify sharp vs. flat).
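For the top eigenvalue alone, the full Hessian is not needed: power iteration on Hessian-vector products is enough. The sketch below assumes a gradient function is available and approximates $Hv$ by central finite differences; in PyTorch the same product would come from double backpropagation. The toy gradient and the names hvp and top_eigenvalue are illustrative assumptions. Power iteration converges to the eigenvalue of largest magnitude, which equals $\lambda_1$ only when the Hessian has no dominant negative eigenvalue.

```python
import math
import random

def hvp(grad_fn, theta, v, eps=1e-4):
    """Central finite-difference Hessian-vector product."""
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

def top_eigenvalue(grad_fn, theta, iters=100, seed=0):
    """Power iteration; returns the Rayleigh quotient of the final unit vector."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in theta]
    lam = 0.0
    for _ in range(iters):
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        hv = hvp(grad_fn, theta, v)
        lam = sum(a * b for a, b in zip(v, hv))
        v = hv
    return lam

# Toy gradient (assumption) of 0.5 * (4a^2 + b^2) + a*b, whose Hessian is [[4, 1], [1, 1]].
def toy_grad(theta):
    a, b = theta
    return [4 * a + b, a + b]

lam1 = top_eigenvalue(toy_grad, [0.3, -0.2])
sharpness = math.log(lam1)  # S = log(lambda_1), as defined in the task
```

For this toy Hessian the exact top eigenvalue is $(5 + \sqrt{13})/2$, which gives a quick correctness check before scaling up.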

What mastery looks like: (1) Correct Hessian: verify against finite differences for toy network (2-3 hidden units), eigenvalues match ±10%. (2) Spectrum characterization: standard SGD shows broad spectrum (many large eigenvalues), skewed right; SAM spectrum concentrated near 0 (exponentially decaying). (3) Sharpness reduction: standard $\bar{S}$ ~3–5; SAM ~0.5–1.5 (3–5× reduction). (4) Condition number: standard 10^3–10^5 (ill-conditioned); SAM 10^1–10^2 (well-conditioned). (5) Accuracy comparable: SAM ~94%, standard ~95% (1 pp lower), acceptable tradeoff. (6) Robust accuracy: SAM ~35–45% vs. standard ~0% (huge 35–45 pp gain without adversarial training). (7) Scaling to large models: Hutchinson trace on ResNet-18 CIFAR-10; demonstrate approximation. (8) Advanced: implement Hessian-vector products (Pearlmutter, 1994); compute curvature along random direction $v$ via $H v$ efficiently.

C.10 — Robustness-Accuracy Tradeoff Frontier and Threat Model Analysis

Task: Train ensemble of models parameterized by TRADES hyperparameter $\beta$ (Exercise C.14 forthcoming; for now use adversarial training variant balancing clean and robust objectives). For $\beta \in [0, 0.1, 0.5, 1.0, 5.0, 10.0]$: (1) train model with loss $L = \ell_{{CE}}(f(\mathbf{x}), y) + \beta \ell_{{robust}}(f(\mathbf{x}), f(\mathbf{x}^{\text{adv}}))$ (where $\ell_{{robust}}$ is KL divergence or margin-based term), (2) measure clean test accuracy, (3) measure robust test accuracy (PGD-20 at $\epsilon=0.3$), (4) measure robust accuracy at $\epsilon \in [0.1, 0.2, 0.3, 0.5]$ (threat model sweep). Plot: (A) clean vs. robust accuracy curve (Pareto frontier) for fixed epsilon, (B) robust accuracy vs. epsilon for each beta (robustness gain curves), (C) 2D heatmap (beta, epsilon) → robust accuracy. Measure: (1) Pareto optimal points (all beta), (2) accuracy-robustness curve fitting (e.g., $r = a - b \cdot \text{clean\_acc}$, linear approximation), (3) inflection points (where Pareto becomes steep).

Purpose: Fundamental tradeoff: improving robustness reduces standard accuracy (Theorem 8, Tsipras et al., 2018). Exercise quantifies: tradeoff is mild for small robustness gains (10% gain → 2–3% accuracy loss) and becomes steeper at high robustness (50% → 10–15% loss). Key insights: (1) Threat model critical: $\epsilon=0.1$ vs. $\epsilon=0.3$ differ vastly; small $\epsilon$ achievable at <5% cost, large $\epsilon$ requires >10% cost, (2) Beta tuning: $\beta=0$ is standard training (~0% robust accuracy), $\beta→∞$ lets the robust term dominate the objective (~80% clean on MNIST). Sweet spot ~0.5–1.0 (often 50–60% robust, 90–92% clean), (3) Data efficiency: robust training requires more examples; 10K examples not enough for MNIST, 60K needed (scaling issue). In production: define acceptable robustness level ($\epsilon$, success rate threshold), use frontier to identify minimum accuracy loss.

ML Link: Empirically validates Theorem 8 (Robustness-Accuracy Tradeoff): improving robustness $\rho$ reduces $\text{clean\_acc}$, tradeoff inherent. Connects to Definition 15 (Robust Optimization Problem), showing fundamental limits. Relate to Theorem 4 (Dual Form): inner maximization toward any $\epsilon$ requires careful balancing of clean & robust objectives.

Hints: Efficient frontier computation: train many models (parallel batches), collect clean & robust accuracy curves, plot confidence bands (means + std over 3 seeds). Robust accuracy evaluation: fix PGD parameters (K=20, alpha=0.02, 100 examples per seed to speed up). For each model, measure at 3–4 epsilon values (faster than full sweep). Fitting: simple linear regression of robust accuracy on clean accuracy (identifies slope; ≈ -1 suggests 1:1 tradeoff).
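Extracting the non-dominated points and fitting the linear approximation can be sketched as follows. The accuracy pairs are placeholder numbers for illustration only, not measured results, and the helper names pareto_front and fit_line are assumptions.

```python
def pareto_front(points):
    """Return (clean_acc, robust_acc) pairs that no other pair dominates."""
    front = []
    for i, (c, r) in enumerate(points):
        dominated = any(
            (c2 >= c and r2 >= r) and (c2 > c or r2 > r)
            for j, (c2, r2) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((c, r))
    return sorted(front)

def fit_line(xs, ys):
    """Least-squares slope and intercept for ys ~ intercept + slope * xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Placeholder (clean, robust) accuracies, one pair per beta value.
results = [(0.95, 0.00), (0.93, 0.30), (0.91, 0.45), (0.90, 0.40), (0.85, 0.58)]
front = pareto_front(results)
slope, intercept = fit_line([c for c, _ in front], [r for _, r in front])
```

A negative slope on the frontier points is the tradeoff the exercise asks to quantify; with several seeds, the same functions can be applied to the per-beta means.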

What mastery looks like: (1) Correct Pareto frontier: smooth curve from (95%, 0% robust) to (75%, 60% robust) on MNIST, matching literature (Madry et al., Wang et al.). Points should not be dominated (no point that’s better in both dimensions). (2) Beta sensitivity: clear correlation beta ↑ → robust accuracy ↑, clean accuracy ↓. (3) Threat model analysis: $\epsilon=0.1$ frontier achieves 95% clean, 60% robust (achievable); $\epsilon=0.3$ achieves ~90% clean, 50% robust (harder). Ratio of robustness gains ~2:1 between threat models. (4) Linear approximation: $r ≈ a - b \cdot \text{clean\_acc}$ with $b ≈ 0.8–1.5$ (strong linear correlation). (5) Inflection point: identify beta where Pareto sharply drops; typically $\beta ≈ 1.0$. (6) Computational cost: measure convergence speed (batches to convergence) vs. beta; robust training slower (~1.5–2× more batches). (7) Cross-threat analysis: create 2D heatmap; show how beta, epsilon jointly determine robust accuracy; identify (beta, epsilon) pairs for target performance. (8) Advanced: implement adaptive beta scheduling (increase beta during training, mimicking curriculum); compare to fixed beta.

C.11 — Leave-One-Out Error and Algorithmic Stability

Task: Implement leave-one-out (LOO) cross-validation for trained model and compute stability. Setup: train logistic regression (binary classifier, MNIST 0 vs. 1) on n=1000 examples. (1) Compute standard test error on held-out set. (2) Implement LOO: for each i ∈ {1…n}, train on {1..n}\{i} (n models total, expensive), measure error on example i. (3) Compute LOO error: $\text{LOO\_err} = \frac{1}{n}\sum_i \mathbb{1}[\hat{f}_{-i}(\mathbf{x}_i) \neq y_i]$. (4) Stability metric $\beta$: measure $|\hat{f}(\mathbf{x}_i) - \hat{f}_{-i}(\mathbf{x}_i)|$ (change in prediction when i removed); $\beta = \max_i |\nabla_i \ell(\hat{f}, \mathbf{x}_i)|$ (uniform stability from derivatives). Compare three algorithms: (A) logistic regression unregularized, (B) L2 regularization $\lambda=0.01$, (C) L2 $\lambda=0.1$. Measure: (1) LOO error vs. test error (gap indicates overfitting), (2) stability $\beta$ for each algorithm, (3) (LOO error, stability) scatter plot. Relate stability to uniform convergence (Theorem 5 connection).

Purpose: Algorithmic stability (small changes to training data → small changes to learned function) is sufficient for generalization. Key insights: (1) Stability ⇒ Generalization: Theorem 5 bounds test error as $\text{gen\_error} ≤ \text{train\_error} + O(\sqrt{\beta/n})$, (2) LOO = Exact Stability: LOO error estimates true generalization risk (nearly unbiased), (3) Regularization → Stability: L2 regularization improves stability (lowers $\beta$) by bounding gradients (reduces overfitting), (4) Interpretability: stability is algorithm property (not data), enables analysis without assumptions. In production: LOO expensive (O(n) models), but theoretical understanding guides regularization choices (more data → less regularization needed, small $\beta$ required).

ML Link: Directly implements Theorem 5 (Uniform Stability and Generalization), proves stability bounds $\beta$ for logistic regression (solution: $\beta \approx 1/(2\lambda n)$ for L2 regularization). Shows empirical LOO error correlates with theoretical bounds. Connects to Definition 8 (Generalization Error), Theorem 9 (Sample Complexity via Stability).

Hints: LOO implementation expensive (n models); optimize by reusing parameters: for logistic reg, LOO estimate $\approx 1 - \frac{1}{n}\sum_i \frac{\partial \ell}{\partial f} |_i$ (leverage regression identity, Sherman-Morrison; omit full retraining). Good approximation for linear models, reduces O(n) to O(1). Full LOO for ground truth (use sklearn.model_selection.LeaveOneOut). Stability: compute $\beta$ by measuring $\|\nabla f\|$ (Lipschitz constant of loss w.r.t. data). For logistic reg, $\beta = 1/(4\lambda)$ (theoretical); empirical $\beta$ by sampling data perturbations.
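The leverage shortcut is easiest to verify in a setting where it is exact. The sketch below swaps logistic regression for one-feature ridge regression without intercept (an assumption made purely for illustration), where the LOO residual equals the full-fit residual divided by $1 - h_i$ with leverage $h_i = x_i^2 / (\sum_j x_j^2 + \lambda)$. For logistic regression the analogous formula is only an approximation, so full retraining remains the ground truth. The data values and function names are hypothetical.

```python
def ridge_fit(xs, ys, lam):
    """One-feature ridge without intercept: w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def loo_residuals_bruteforce(xs, ys, lam):
    """Refit n times, each time leaving one example out."""
    out = []
    for i in range(len(xs)):
        w = ridge_fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        out.append(ys[i] - w * xs[i])
    return out

def loo_residuals_leverage(xs, ys, lam):
    """Single fit plus the leverage correction; no retraining."""
    w = ridge_fit(xs, ys, lam)
    denom = sum(x * x for x in xs) + lam
    return [(y - w * x) / (1.0 - x * x / denom) for x, y in zip(xs, ys)]

xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0.4, 1.1, 1.4, 2.2, 2.4]
brute = loo_residuals_bruteforce(xs, ys, lam=0.1)
fast = loo_residuals_leverage(xs, ys, lam=0.1)
```

Comparing brute and fast element by element is the same sanity check the exercise suggests for the logistic case, where the two will agree only approximately.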

What mastery looks like: (1) Correct LOO: LOO error ≈ test error, gap <1% (on held-out set); gap > 5% indicates overfitting or LOO computation error. (2) Stability vs. regularization: unregularized $\lambda=0$ high $\beta$ (~1), worse LOO; $\lambda=0.01$ medium $\beta$ (~0.3), better LOO; $\lambda=0.1$ low $\beta$ (~0.05), best LOO accuracy. (3) Quantitative stability: $\beta$ range 0.01–1 across models. (4) Generalization bounds: compute $\text{gen\_bound} = \text{train\_err} + c\sqrt{\beta/n}$; bound should be loose but qualitatively align rankings (higher $\beta$ → wider bound). (5) Data size effect: sweep n ∈ {100, 500, 1000, 5000}; show $\beta/n$ → 0 (stability dominates), LOO error decreases. (6) Overfitting detection: unregularized model LOO error significantly >test error on train set; $\lambda$-regularized LOO & test errors agree. (7) Cross-validation comparison: LOO vs. k-fold (k=5); LOO is more expensive but nearly unbiased, k-fold slightly biased but practical. (8) Advanced: compute per-sample influence scores (how much training example i changes learned parameters); identify high-influence outliers or mislabeled data.

C.12 — Adaptive Attacks on Defensive Distillation and Gradient Obfuscation

Task: Implement defensive distillation (a defense technique, now known to be insufficient) and test against adaptive attacks. Setup: (1) Train standard model $f_{{\text{standard}}}$ on MNIST and use it as the teacher $f_{{\text{teacher}}}$, (2) distill into a student $f_{{\text{student}}}$ (same or similar architecture) using temperature T: $\ell_{{\text{distill}}} = \ell_{{CE}}(f_{{\text{student}}}(\mathbf{x}, T=1), f_{{\text{teacher}}}(\mathbf{x}, T=20))$ (high T softens the teacher's output distribution, easier to mimic teacher behavior). (3) Evaluate distilled model against: (A) standard FGSM/PGD (gradients flow through distilled student), (B) adaptive FGSM/PGD (differentiate through teacher to approximate gradient → student), (C) decision-based attacks (estimate gradients via finite differences, no gradient access). Measure: (1) success rate (standard attack), (2) success rate (adaptive attack), (3) difference (indicates gradient obfuscation), (4) gradient norms for each attack. Expected result: distilled model robust to standard attacks (~40% success), vulnerable to adaptive attacks (80%+ success), showing obfuscation rather than true robustness.

Purpose: Defensive distillation was proposed to robustify models by reducing gradient information available to attacks. However, adaptive attacks circumvent this: if defender uses distillation, attacker can model the distillation process and compute gradients through teacher model. Key insights: (1) Obfuscation ≠ Robustness: reducing gradients ≠ reducing actual vulnerability, (2) Adaptive attacks are strong: threat model should assume attacker knows defense mechanism (white-box), (3) Lesson for practice: obfuscation-based defenses (gradient masking) fail under rigorous evaluation, (4) Red-teaming critical: before deploying defense, test against adaptive attacks. In production: defensive distillation alone insufficient; combine with adversarial training (Exercise C.4) for genuine robustness.

ML Link: Demonstrates pitfall of Definition 11 (Gradient Masking) vs. Definition 12 (Gradient-Based Attacks). Shows that evaluations using weak attacks (standard gradient) can be misleading; adaptive attacks (attacker knows defense) reveal true vulnerability. Relates to Definition 12 (Threat Model): assume attacker knows defense. Empirically validates findings from Carlini & Wagner (2016): distillation as standalone defense breaks.

Hints: Distillation: train teacher normally, then collect soft targets $p_{{\text{teacher}}} = \text{softmax}(f_{{\text{teacher}}}(\mathbf{x}) / T)$, train student to match. High T (5–100) makes targets smoother; trade-off: too high (T>100) → uninformative, too low (T<5) → little effect. Adaptive attack: implement two-stage: (1) estimate gradients $g_{\text{teacher}} = \nabla_\mathbf{x} \ell(f_{{\text{teacher}}}(\mathbf{x}), y)$, (2) apply to student input. Verify: standard PGD on student gives ~40% success, adaptive PGD gives ~80% (2× gap). Plot: success rate bar chart (standard vs. adaptive).
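The effect of temperature on the soft targets can be checked in isolation. This is a small standard-library sketch with hypothetical logits; softmax_t and entropy are illustrative helper names, not functions from any library.

```python
import math

def softmax_t(logits, T=1.0):
    """Softmax of logits / T, computed with the usual max-shift for stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

logits = [8.0, 2.0, -1.0]            # hypothetical teacher logits
hard = softmax_t(logits, T=1.0)      # nearly one-hot
soft = softmax_t(logits, T=20.0)     # much closer to uniform
print(hard, soft, entropy(hard), entropy(soft))
```

The nearly one-hot distribution at T=1 is also where the softmax saturates, which is one way small input gradients (and hence apparent robustness to standard attacks) can arise.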

What mastery looks like: (1) Correct distillation: student loss on soft targets <0.5 (good mimicry); student clean accuracy ≈ teacher accuracy (90+%). (2) Normal evaluation (standard attack): distilled $\epsilon=0.3$ FGSM success ~30–40% (appears robust), vs. standard model ~80% (strong improvement). (3) Adaptive attack reveals vulnerability: adaptive FGSM on distilled ~70–80% (comparable to standard model on standard attack), showing obfuscation. (4) Gradient norm analysis: standard attack on distilled student has small gradients (masked); adaptive attack (through teacher) has large gradients (revealing vulnerability). (5) Quantify gradient masking: ratio $\|\nabla \text{adaptive}\| / \|\nabla \text{standard}\|$ ~2–5× (significant difference indicates masking). (6) Temperature sensitivity: vary T ∈ {5, 10, 20, 50}, show increase in masking with higher T, but adaptive attacks similarly effective (temperature doesn’t help). (7) Comparison to adversarial training: PGD-trained model robust to both standard & adaptive attacks ~50% (genuine robustness); distilled vulnerable (defense broken). (8) Advanced: implement EOT-based attacks on distilled model; add input randomization to further break distillation’s obfuscation.

C.13 — Transferability of Adversarial Examples Across Architectures

Task: Train multiple architectures on MNIST: (A) 2-layer MLP, (B) 3-layer MLP, (C) small CNN (2 conv + 2 dense), (D) ResNet-18 (if computing power available; otherwise shallow variant). For each source architecture, generate FGSM & PGD attacks ($\epsilon=0.3$). Test each attack on all target architectures. Measure: (1) transfer rate $\tau = \frac{\text{# misclassified on target}}{\text{# misclassified on source}}$ (% of attacks that transfer), (2) success rate by (source, target) pair (compute matrix), (3) property importance: analyze which architecture differences affect transfer (depth, width, activation, normalization). Hypothesis: transfer rate higher between similar architectures (MLP→MLP >MLP→CNN), but significant transfer across different architectures (~50–70%).

Purpose: Adversarial examples transfer across models imperfectly but to a significant degree; understanding transfer is crucial for (1) Black-box robustness evaluation: if white-box attacks on model A transfer to model B, black-box attacks are likely to succeed, (2) Defense via diversity: ensembles of diverse architectures reduce transfer but do not eliminate it, (3) Adversarial training generalization: models trained on one attack (e.g., FGSM) remain vulnerable to PGD crafted on different architectures, (4) Threat models: assume attacker doesn’t have model access and leverages transfer to succeed. In production: black-box attacks via transfer are a realistic threat; defenses should account for ensemble diversity.

ML Link: Empirically investigates transferability as phenomenon, relates to Definition 12 (Threat Model). Shows that threat model (white-box vs. black-box) affects attack feasibility. Connects to Theorem 3 (First-Order Approximation): different architectures have different loss landscapes, leading to different gradient directions, reducing transfer. References Definition 1 (Adversarial Perturbation) — same perturbation less universal across models.

Hints: Transfer matrix: source architectures as rows, target architectures as columns, entries are success rates. Generate attacks on one source model (e.g., FGSM $\epsilon=0.3$), evaluate on all targets. Visualization: heatmap of transfer matrix (bright = high transfer). Analyze diagonal (transfer to same architecture = 100%), off-diagonal (transfer across models). Rank architectures by “transferability” (avg transfer to/from all others). Advanced: implement ensemble attacks (generate on ensemble of multiple sources; transfer higher).
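The transfer matrix can be assembled from per-example success flags. The sketch below makes one assumption explicit: the numerator of $\tau$ counts only examples that fooled both the source and the target, so that $\tau$ stays in [0, 1]. The model names and boolean lists are hypothetical placeholders.

```python
def transfer_rate(fooled_source, fooled_target):
    """Fraction of source-fooling examples that also fool the target."""
    n_source = sum(fooled_source)
    if n_source == 0:
        return float("nan")
    both = sum(1 for s, t in zip(fooled_source, fooled_target) if s and t)
    return both / n_source

def transfer_matrix(fooled):
    """fooled[src][tgt]: list of booleans, did the example crafted on src fool tgt?"""
    names = sorted(fooled)
    return {s: {t: transfer_rate(fooled[s][s], fooled[s][t]) for t in names}
            for s in names}

# Placeholder outcomes for four adversarial examples per source model.
fooled = {
    "mlp": {"mlp": [True, True, True, False], "cnn": [True, False, True, False]},
    "cnn": {"mlp": [False, True, False, False], "cnn": [True, True, False, True]},
}
matrix = transfer_matrix(fooled)
print(matrix)
```

Rows are sources and columns are targets, matching the layout described in the hints; the nested dictionary converts directly to a 2D array for a heatmap.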

What mastery looks like: (1) Correct transfer matrix: diagonal ≈100% (attacks on model transfer to itself), off-diagonal 30–80% (significant transfer but less than 100%). (2) Transfer by pairs: similar architectures (MLP→MLP) ~60–80%, different architectures (MLP→CNN) ~40–60%, demonstrating architecture dependence. (3) FGSM vs. PGD transfer: single-step FGSM transfers better than iterative PGD (PGD overfits to source model’s geometry), ~60% vs. 50% average. (4) Ensemble transfer: aggregate attacks from all source models; transfer to any target ~70–85% (worse for defender). (5) Property analysis: identify which architectural difference most affects transfer (depth, width, activation functions); e.g., ReLU vs. Tanh activations affect transfer rate by ±10%. (6) Quantify transferability metrics: correlation between source & target accuracy, model similarity (weight alignment) vs. transfer rate (expect a moderate correlation, with better-aligned MNIST models showing higher transfer). (7) Defense strategy: train ensemble of diverse architectures; black-box attack success drops from ~70% (single model) to ~40% (ensemble), showing diversity helps. (8) Advanced: implement ensemble attacks; test gradient-free attacks (evolution strategies, Bayesian optimization) that don’t rely on gradients, measurable transfer without white-box access.

C.14 — TRADES: Robust Training via Trade-off Regularization

Task: Implement TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization; Zhang et al., 2019). Loss function: $L = \ell_{{CE}}(f(\mathbf{x}), y) + \beta \text{KL}(f(\mathbf{x}) \| f(\mathbf{x}^{\text{adv}}))$, where $\mathbf{x}^{\text{adv}} = \arg\max_{\|\delta\| ≤ \epsilon} \text{KL}(f(\mathbf{x}) \| f(\mathbf{x} + \delta))$. (1) For each batch, compute adversarial examples via PGD (K=10, attacking KL divergence as loss), (2) compute clean loss + regularizer (KL between the predictive distributions on clean and adversarial inputs), (3) tune $\beta \in [0.1, 1, 5, 10]$, (4) measure clean & robust accuracy vs. beta (Pareto frontier similar to C.10, but potentially better tradeoff than standard adversarial training). Compare TRADES to standard PGD adversarial training (Exercise C.4) on same frontier. Measure: (1) clean accuracy, (2) robust accuracy (PGD-20, $\epsilon=0.3$), (3) training time, (4) convergence speed (batches to plateau).

Purpose: TRADES improves robustness-accuracy tradeoff vs. standard minimax adversarial training (Exercise C.4). Key mechanism: instead of hard minimax (maximize loss on adversarial examples), TRADES uses soft constraint (KL divergence, penalizes probability distribution shift). Insights: (1) Distribution preservation: KL loss $\text{KL}(p \| q)$ penalizes both misclassification & confident wrong predictions, (2) Better tradeoff: TRADES achieves ~55% robust accuracy at 92% clean (vs. 50% robust at 90% clean for standard adversarial training on MNIST), (3) Beta tuning critical: too low $\beta$ underfits robustness, too high overfits robustness, empirically $\beta \approx 1.0$ optimal, (4) Computational cost: similar to standard adversarial training (PGD per batch); the extra KL computation is cheap. In production: TRADES recommended over C.4 for better Pareto frontier; use $\beta ≈ 1.0$ as default.

ML Link: Implements Theorem 4 (Dual Form of Robust Optimization) with soft constraint (KL regularization) instead of hard minimax. References Definition 15 (Robust Optimization Problem), extends with distribution-preserving loss. Empirically shows KL-based objectives better than margin-based for improving robustness-accuracy tradeoff. Relates to cross-entropy loss minimization (Definition 2), balances clean and robust objectives.

Hints: KL divergence: $\text{KL}(p \| q) = \sum_c p_c (\log p_c - \log q_c)$, use torch.nn.KLDivLoss(reduction='batchmean'), which expects log-probabilities as input and probabilities as target (pass log-softmax of the adversarial logits as input and softmax of the clean logits as target). For PGD on KL: modify attack to maximize KL divergence instead of CE loss (step: $\delta \gets \delta + \alpha \cdot \text{sign}(\nabla_\delta \text{KL})$). Pseudocode: for batch: x_adv = pgd_attack(model, x, eps, K=10, loss_fn=KL); optimizer.zero_grad(); loss_clean = CE(model(x), y); loss_robust = KL(model(x), model(x_adv)); loss = loss_clean + beta * loss_robust; loss.backward(); optimizer.step(). Sweep beta visually on plot, identify best tradeoff point.
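Before wiring the loss into a training loop, the per-example arithmetic can be checked with a dependency-free sketch. The function names and logits below are hypothetical; a PyTorch version would replace these helpers with log-softmax and a KL loss over batches.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def kl_from_logits(clean_logits, adv_logits):
    """KL(p_clean || p_adv), both distributions obtained by softmax of the logits."""
    lp, lq = log_softmax(clean_logits), log_softmax(adv_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def trades_loss(clean_logits, adv_logits, label, beta):
    """Cross-entropy on the clean input plus beta times the KL regularizer."""
    ce = -log_softmax(clean_logits)[label]
    return ce + beta * kl_from_logits(clean_logits, adv_logits)

clean = [2.0, 0.5, -1.0]   # hypothetical logits on x
adv = [0.8, 1.1, -0.5]     # hypothetical logits on x_adv
loss_b0 = trades_loss(clean, adv, label=0, beta=0.0)
loss_b1 = trades_loss(clean, adv, label=0, beta=1.0)
```

With beta set to 0 the expression reduces to plain cross-entropy, which is a convenient regression test for the implementation.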

What mastery looks like: (1) Correct TRADES: loss trajectory decreasing, clean & robust accuracy stabilizing after 30 epochs. (2) Pareto frontier: at $\beta=1.0$, achieve ~55–58% robust accuracy at $\epsilon=0.3$, 92–93% clean (2–3 pp better than standard adversarial training at same accuracy level). (3) Beta sensitivity: $\beta=0.1$ underfits robustness (~25% robust), $\beta=10$ overfits robustness (~70% robust, 85% clean), $\beta≈1$ near optimal. (4) Comparison to C.4: plot both Pareto frontiers; TRADES curve should lie consistently above standard adversarial training (higher robust accuracy at same clean accuracy). (5) Convergence speed: TRADES convergence similar to standard adversarial training (slightly slower due to KL computation, <5% overhead). (6) Transferability: TRADES model robust to both FGSM & PGD (genuine robustness, not gradient masking), success rate consistent across attacks. (7) Threat model robustness: evaluate at multiple $\epsilon \in [0.1, 0.2, 0.3, 0.5]$; robust accuracy decreases with epsilon (expected), but TRADES maintains advantage. (8) Advanced: implement TRADES variant with margin-based loss instead of KL; compare tradeoff curves.

C.15 — Certified Robustness with Deployment Considerations and Accept/Reject

Task: Extend Exercise C.6 (Randomized Smoothing) with deployment constraints. Setup: (1) Compute certified robustness for 1000 test examples (as in C.6), collect certified radius $r_i$ for each. (2) Implement accept/reject system: for given threshold $r_{{\min}}$, accept example if $r_i ≥ r_{{\min}}$, reject otherwise (e.g., defer to human, or deny service). (3) Measure: coverage (% accepted examples) vs. $r_{{\min}}$, average certified radius for accepted examples. (4) Measure accuracy on accepted subset (should be high, since we accepted only confident examples). (5) Cost-benefit analysis: compute $\text{coverage} × \text{accuracy}_{\text{accepted}}$ (throughput metric). Sweep $r_{{\min}} \in [0, 0.1, 0.2, 0.3, 0.5]$, plot Pareto curves (coverage vs. accuracy, coverage vs. certified radius).

Purpose: Practical robustness deployment requires choosing acceptable robustness level (e.g., “need $\epsilon = 0.3$ robustness”) and balancing coverage (how many users served) vs. accuracy (don’t misclassify). Certified robustness provides formal guarantees, but at cost of rejecting uncertain examples. Key insights: (1) Accept/Reject enables robustness: certifiable examples are confident predictions, rejecting impossible-to-certify examples improves average robustness, (2) Coverage-accuracy tradeoff: high $r_{{\min}}$ → low coverage, high accuracy on accepted; low $r_{{\min}}$ → high coverage, lower accuracy, (3) Business metrics: throughput $= \text{coverage} × \text{accuracy}$ must be balanced with customer experience (rejection rate), (4) Risk-aversion: safety-critical systems (medical, autonomous): high $r_{{\min}}$ acceptable (reject uncertain); latency-critical (advertising): low $r_{{\min}}$ preferred (serve everyone). In production: choose $r_{{\min}}$ based on application threat model.

ML Link: Operationalizes Theorem 10 (Certified Robustness) with practical deployment constraints. Extends Definition 15 (Robust Optimization) with accept/reject logic, enabling formal guarantees on subset of examples. Shows tradeoff between robustness ($r$) and coverage; decision-making tool for practitioners.

Hints: Accept/reject: simple binary classifier: if $r_i ≥ r_{{\min}}$, predict normally (use certified radius), else reject. Pareto curve: for each $r_{{\min}}$, collect coverage, accuracy_accepted, certified radius of accepted set. Plot multiple curves: (1) coverage vs. $r_{{\min}}$ (decreasing), (2) accuracy vs. $r_{{\min}}$ (increasing), (3) throughput = coverage × accuracy vs. $r_{{\min}}$ (identifies optimal $r_{{\min}}$). Cost-sensitive metrics: if rejection cost high, prefer lower $r_{{\min}}$; if misclassification cost high, prefer higher $r_{{\min}}$.
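The threshold sweep described above can be sketched as a single function. The radii and correctness flags are placeholder values rather than randomized-smoothing output, and sweep_thresholds is a hypothetical helper name.

```python
def sweep_thresholds(radii, correct, thresholds):
    """For each r_min, report coverage, accuracy on accepted examples, and throughput."""
    rows = []
    n = len(radii)
    for r_min in thresholds:
        accepted = [c for r, c in zip(radii, correct) if r >= r_min]
        coverage = len(accepted) / n
        accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
        throughput = coverage * accuracy if accepted else 0.0
        rows.append({"r_min": r_min, "coverage": coverage,
                     "accuracy": accuracy, "throughput": throughput})
    return rows

# Placeholder certified radii and whether each smoothed prediction was correct.
radii = [0.00, 0.05, 0.12, 0.18, 0.25, 0.31, 0.40, 0.55]
correct = [False, False, True, True, True, True, True, True]
rows = sweep_thresholds(radii, correct, [0.0, 0.1, 0.2, 0.3, 0.5])
for row in rows:
    print(row)
```

A certified radius guarantees a constant prediction, not a correct one, so accuracy on the accepted set still has to be measured rather than assumed.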

What mastery looks like: (1) Correct accept/reject system: for $r_{{\min}}=0$, coverage 100%, accuracy 90% (baseline); for $r_{{\min}}=0.3$, coverage ~20%, accuracy 98%+ (only confident examples). (2) Coverage curve: monotonically decreasing, sigmoid-like, 50% coverage at $r_{{\min}}≈0.15$. (3) Mean certified radius of accepted set: increases with $r_{{\min}}$ (expected); e.g., at $r_{{\min}}=0.2$, mean $r_i \approx 0.3–0.4$ (only examples with high radius survive). (4) Throughput optimization: plot $\text{coverage} × \text{accuracy}$; identify peak at $r_{{\min}}$ value balancing both. (5) Business metrics: communicate tradeoffs (e.g., “reject 30% of requests, serve 70% with 95%+ accuracy” vs. “serve all with 90% accuracy but unguaranteed robustness”). (6) Robustness persistence: certified radius on accepted subset truly provides formal guarantees (empirical verification: sample perturbations $\|\delta\|_2 < 0.9r_i$ for accepted examples, >99% preserve prediction). (7) Sensitivity to sigma: repeat for different noise levels ($\sigma \in [0.12, 0.25, 0.5]$); show tradeoff curves shift (higher $\sigma$ → larger radius, lower clean accuracy). (8) Advanced: implement cost-sensitive variant; assign different rejection costs, optimize $r_{{\min}}$ for given cost model.

C.16 — Carlini & Wagner Attack: Solving Optimization Problem

Task: Implement C&W attack via change-of-variables. Formulate: $\text{minimize}_{\mathbf{w}} \|\tfrac{1}{2}(\tanh(\mathbf{w}) + 1) - \mathbf{x}\|_2^2 + c \cdot L(f(\tfrac{1}{2}(\tanh(\mathbf{w}) + 1)), t)$, where $\tanh(\mathbf{w}) \in [-1, 1]$ is rescaled to the pixel range $[0, 1]$, $\mathbf{w}$ is optimized without constraints, $t$ is the true label, and $c$ trades off perturbation size vs. misclassification. For MNIST, $\ell_2$ threat model: (1) binary search over $c \in [10^{-2}, …, 10^9]$ (use bisection), (2) for each $c$, run optimizer (Adam, L-BFGS) for max 1000 steps, find adversarial variable $\mathbf{w}^*$, (3) count successful attacks (misclassified), (4) compare to PGD: compute success rate for same $\epsilon$ equivalent (ℓ2 threat model), measure iterations-to-success. Loss function: $L(z, t) = \max(0, z_t - \max_{j \ne t} z_j + \kappa)$, where $z$ denotes the logits (margin-based, $\kappa=0$ for misclassification). Measure: (1) success rate vs. $c$, (2) mean perturbation $\|\delta\|_2$, (3) convergence speed (iterations), (4) attack quality (confidence of adversarial examples, logit difference to true class).

Purpose: C&W attack is stronger than PGD for ℓ2 threat model; uses unconstrained optimization (change-of-variables) rather than iterative gradient with projections. Insights: (1) Numerically better: change-of-variables $\tanh$ makes optimization smoother (no projection steps causing discontinuity), (2) Typically stronger than PGD: C&W finds smaller perturbations or higher misclassification confidence for same budget, (3) Computationally expensive: binary search + per-search iterations (1000 steps) >> PGD (20–100 steps), 10–100× slower, (4) Hyperparameter tuning: $\kappa$ (margin), $c$ (tradeoff weight), optimizer choice affect success rate. In production: C&W used as oracle for evaluating robustness rigorously (baseline in papers); too expensive for training (use PGD).

ML Link: Implements Definition 12 (Gradient-Based Attack) via change-of-variables optimization. References Theorem 3 (First-Order Approximation), but uses richer optimization landscape (unconstrained problem easier than constrained). Demonstrates different optimization formulations achieve different attack strengths; relates to Definition 1 (Adversarial Perturbation).

Hints: Change-of-variables: $\mathbf{x}^{\text{adv}} = 0.5(\tanh(\mathbf{w}) + 1)$ (maps ℝ → [0,1]). For $\ell_2$: minimize $\|\mathbf{x}^{\text{adv}} - \mathbf{x}\|_2^2$ over $\mathbf{w}$ directly (the problem is already unconstrained, so no projection is needed). Adam optimizer usually good (LR 1e-2), L-BFGS more stable but slower (try if Adam fails). Binary search: start $c=1$, double until success, then binary search in range. For failure (no solution found), cap attempts at max iterations. Confidence metric: $\text{logits}[t] - \max_{j≠t} \text{logits}[j]$ (margin).
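The building blocks of the objective (box mapping, its inverse for initialization, and the margin loss) can be sketched without a deep learning framework. Everything here is illustrative: logits_fn stands in for the model, the inputs are made up, and the sign convention for kappa is an assumption under which the hinge reaches zero only once the true-class logit trails the best other class by at least kappa.

```python
import math

def to_box(w):
    """Map unconstrained w to pixel values in [0, 1]."""
    return [0.5 * (math.tanh(wi) + 1.0) for wi in w]

def from_box(x, eps=1e-6):
    """Inverse map; clips away from 0 and 1 so atanh stays finite."""
    return [math.atanh(min(max(2.0 * xi - 1.0, -1 + eps), 1 - eps)) for xi in x]

def margin_loss(logits, true_label, kappa=0.0):
    other = max(z for j, z in enumerate(logits) if j != true_label)
    return max(0.0, logits[true_label] - other + kappa)

def cw_objective(w, x, logits_fn, true_label, c, kappa=0.0):
    """Squared L2 distortion plus c times the margin loss."""
    x_adv = to_box(w)
    dist2 = sum((a - b) ** 2 for a, b in zip(x_adv, x))
    return dist2 + c * margin_loss(logits_fn(x_adv), true_label, kappa)

x = [0.2, 0.7]                          # hypothetical two-pixel input
w0 = from_box(x)                        # start the search at the clean input
toy_logits = lambda v: [v[0], v[1]]     # stand-in for a model's logits
value = cw_objective(w0, x, toy_logits, true_label=1, c=1.0)
```

In a full attack, w would be a tensor optimized with Adam and the outer loop would bisect over c as described in the hints.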

What mastery looks like: (1) Correct C&W: binary search converges, identify $c^*$ where attack succeeds; plot success vs. $c$ (sigmoid curve). (2) Perturbation effectiveness: C&W ℓ2 perturbation mean ~0.5–0.8, slightly smaller than PGD for same success rate (C&W stronger). (3) Convergence analysis: C&W ~100–500 iterations to converge (vs. PGD ~20–50), much slower. (4) Confidence of attacks: C&W-generated adversarial examples have high confidence (margin >1–3 logits); PGD attacks have margin closer to 0 (just barely misclassified). (5) Robustness comparison: PGD-trained model success rate ~40% against PGD, ~50% against C&W (C&W ~10 pp stronger), confirming C&W is stronger. (6) Computational cost: C&W 10–50× slower than PGD (e.g., 1000 examples: PGD ~10s, C&W ~100–500s). (7) Robustness evaluation: use C&W as oracle; if PGD-robust model breaks under C&W, evaluate more carefully (C&W finds tighter attacks). (8) Advanced: implement adaptive C&W (attacker-aware of defense), accelerate via early stopping (stop once margin <$\kappa$), or warm-start from PGD solutions.

C.17 — Group Distributionally Robust Optimization (Group DRO)

Task: Partition MNIST into demographic groups (e.g., digit type as group proxy: {0–2, 3–5, 6–9}, or use real metadata if available; for exercise, use sub-groups by brightness/rotation). Train model optimizing worst-case group loss: $\min_{\theta} \max_g \mathbb{E}_{(x,y) \in g}[\ell(f_\theta(x), y)]$. Implementation: (1) standard ERM (uniform loss): $\min_{\theta} \mathbb{E}[\ell]$, (2) group DRO via (a) exponential weight scheme: $w_g \gets w_g \times \exp(\eta \, \text{Loss}_g)$, renormalize (higher-loss groups gain weight), or (b) Lagrangian: $L = \sum_g w_g \text{Loss}_g + \lambda(\sum_g w_g - 1)$, (3) measure per-group accuracy (compute $\text{acc}_g$ for each group), (4) measure worst-group accuracy ($\min_g \text{acc}_g$), (5) measure distribution shift: train on skewed data (60% group A, 20% B, 20% C), evaluate on balanced test set. Compare: (A) standard training (ignores group imbalance), (B) group DRO (optimizes for worst group), (C) per-group balanced training (equal loss per group). Measure: (1) average accuracy, (2) worst-group accuracy (fairness metric), (3) per-group accuracy distribution (std), (4) convergence analysis (loss curves).

Purpose: Group DRO addresses distribution shift & fairness: standard ERM minimizes average loss, but can fail catastrophically on minority groups (worst-case accuracy ~0%, average still high). Group DRO optimizes for worst group, ensuring no group left behind. Insights: (1) Fairness tradeoff: group DRO raises worst-group acc at cost of average acc (1–5% drop), (2) Robustness to shift: focusing on worst-case group acts like robustness: if deployment distribution differs from train, worst-group model likely better on new distribution, (3) Computational cost: requires grouping metadata, tracking per-group losses; modest overhead (~10%). In production: fairness-critical applications (hiring, lending) require group DRO to ensure equitable performance; use worst-group accuracy as success metric.

ML Link: Extends robust optimization (Definition 15) with group-level constraints. Instead of worst-case perturbation (adversarial robustness), optimizes w.r.t. worst-case group (distributional shift robustness). Relates to Definition 1 (Adversarial Perturbation)—group shift is “perturbation” of train distribution. Complements Theorem 8 (Robustness-Accuracy Tradeoff), shifts focus from perturbation robustness to fairness robustness.

Hints: Grouping: partition data by example properties (for MNIST: min(digit // 3, 2) as group, so that digit 9 falls in {6–9}; or apply a transform (Gaussian blur, rotation) and use the applied transform as the group label). Weight scheme: $w_g^{(t)} \propto \exp(\eta \sum_{s=1}^{t-1} \ell_{g,s})$ (exponential weight based on historical loss; the positive exponent moves weight toward groups with high loss). Compute group losses separately (loop over groups per batch). Plot per-group accuracy over epochs; DRO should flatten the curves (reduce the gap between best and worst groups). Worst-group accuracy = $\min_g \text{acc}_g$ (evaluate after training).
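The weight update in the hints can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a full training loop: the helper names (`group_ids`, `per_group_mean`, `update_group_weights`), the toy losses, and the step size `eta` are all hypothetical. The exponent is positive so that weight flows toward the group with the highest loss, which is what optimizing the worst group requires.

```python
import numpy as np

def group_ids(digits):
    """Map digit labels to the three proxy groups {0-2, 3-5, 6-9}."""
    return np.minimum(np.asarray(digits) // 3, 2)

def per_group_mean(values, groups, n_groups):
    """Mean of `values` inside each group (0.0 if a group is absent from the batch)."""
    values = np.asarray(values, dtype=float)
    out = np.zeros(n_groups)
    for g in range(n_groups):
        mask = groups == g
        if mask.any():
            out[g] = values[mask].mean()
    return out

def update_group_weights(weights, group_losses, eta=0.1):
    """Exponentiated-gradient step: up-weight high-loss groups, then renormalize."""
    new = weights * np.exp(eta * group_losses)
    return new / new.sum()

# Toy batch with made-up per-example losses (in practice these come from the model).
digits = np.array([0, 1, 4, 5, 7, 9, 9, 2])
losses = np.array([0.1, 0.2, 0.4, 0.6, 1.5, 2.0, 1.0, 0.3])
groups = group_ids(digits)
g_loss = per_group_mean(losses, groups, n_groups=3)

w = np.full(3, 1.0 / 3.0)            # start from uniform group weights
for _ in range(20):
    w = update_group_weights(w, g_loss, eta=0.5)

dro_objective = float(w @ g_loss)    # weighted loss the model update would minimize
erm_objective = float(losses.mean()) # uniform average used by standard ERM
```

In a real run the group losses change after every model update, so the weights keep moving; here they are held fixed only to show that the weights concentrate on the worst group.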

What mastery looks like: (1) Correct implementation: standard training average acc ~92%, worst-group acc ~40% (large gap); group DRO average ~88%, worst-group ~78% (flatter, more fair). (2) Per-group analysis: standard training one group dominates (e.g., digits {0–2} at 95%, {6–9} at 50%); DRO balances groups (each ~75–80%). (3) Weight evolution: DRO weights shift toward underperforming groups (initially uniform, evolve to emphasize struggling groups). (4) Convergence comparison: standard training converges faster (simple objective); DRO slower initially but reaches fair plateau (weights stabilize). (5) Fairness metrics: compute group_gap = $\max_g \text{acc}_g - \min_g \text{acc}_g$; standard ~45%, DRO ~10% (a reduction of roughly 35 pp). (6) Robustness to shift: train on skewed set, test on balanced; standard accurate on train distribution but fails under shift; DRO generalizes better. (7) Cross-group transfer: accuracy on OOD group higher for DRO (learned more robust features). (8) Advanced: implement GEORGE (estimate latent subgroups by clustering learned features, then apply group DRO to the estimated groups) or other fairness objectives; compare Pareto frontiers (average acc vs. worst-group acc).

C.18 — Influence Functions for Identifying Harmful Training Examples

Task: Compute influence of training examples on test predictions using influence functions (Koh & Liang, 2017). For test example $\mathbf{x}_{\text{test}}$, compute influence of each training example $\mathbf{x}_{\text{train}}^{(i)}$: $I_{\text{up}}(\mathbf{x}_{\text{train}}^{(i)}, \mathbf{x}_{\text{test}}) \approx -\nabla_\theta \ell(f_\theta(\mathbf{x}_{\text{test}}), y_{\text{test}})^\top H^{-1} \nabla_\theta \ell(f_\theta(\mathbf{x}_{\text{train}}^{(i)}), y_{\text{train}}^{(i)})$, where $H$ is the Hessian of the training loss at the fitted parameters. Because $I_{\text{up}}$ approximates the change in test loss when a training example is upweighted, score each example by $-I_{\text{up}}$ so that positive scores mean helpful and negative scores mean harmful. Approximate $H^{-1}$ via Hessian-vector products (expensive; use sampled Hessian or conjugate gradient). For logistic regression (low-dim): compute exact influence. For neural nets: use a Hessian-vector approximation (conjugate gradient, or LiSSA, a stochastic Neumann-series estimator). Measure: (1) identify top-k influential training examples (most positive score = helpful, most negative = harmful), (2) manually inspect: are harmful examples mislabeled or outliers? (3) measure robustness improvement: remove top harmful examples, retrain, evaluate accuracy and robust accuracy (PGD-20). Compare: (A) retrain without harmful examples, (B) retrain without the same number of random examples (baseline), (C) retrain on the full training set with nothing removed (trivial reference).

Purpose: Influence functions identify which training examples matter most for model predictions. Applications: (1) Data quality: mislabeled or outliers harmful to robustness; remove them to improve robustness, (2) Model debugging: when model fails on test example, trace back to influential training examples (was it trained on similar-but-wrong example?), (3) Data selection for robust training: prioritize robustness-helpful examples, deprioritize harmful. Insights: (1) Small % removals help: removing ~1–5% most harmful examples can improve robustness ~5–10%, (2) Label quality critical: mislabeled examples highly influential (negative), correcting them helps more than any other single change, (3) Computational cost: accurate influence function computation O(d²) memory (Hessian space), expensive for $d > 10k$. In production: identify and fix mislabeled data (via influence functions) before robust training; improves robustness significantly.

ML Link: Operationalizes sample-level analysis: which examples matter? Relates to algorithmic stability (Theorem 5), showing how each training example affects generalization. Helps understand training data quality impact on adversarial robustness; mislabeled data particularly harmful for robust training.

Hints: For logistic regression: closed-form influence (use Sherman-Morrison for the $H^{-1}$ update). For neural nets: use Hessian-vector products (double backprop), apply conjugate gradient to approximate $H^{-1}v$ (~50 iterations). LiSSA (a stochastic Neumann-series estimate of $H^{-1}v$) is a faster approximation. Sample test examples (10–20) and compute influence on all training examples (expensive). Collect top-k helpful and harmful examples, manually inspect (are harmful ones mislabeled?). Plot: influence score histogram (most examples near 0, few large positive/negative).
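For the low-dimensional case, exact influence can be computed directly. The sketch below is illustrative only: it assumes L2-regularized logistic regression with labels in {0, 1}, synthetic data, and hypothetical helper names (`fit_logreg`, `influence_on_test_loss`). It appends one mislabeled and one correctly labeled copy of the test point to the training set, so the expected sign of each influence is known in advance.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, l2=0.1, lr=0.5, steps=2000):
    """Gradient descent on mean log loss + (l2/2)*||theta||^2, labels in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ theta)
        theta -= lr * (X.T @ (p - y) / len(y) + l2 * theta)
    return theta

def influence_on_test_loss(theta, X, y, x_test, y_test, l2=0.1):
    """I_up for every training example: -grad_test^T H^{-1} grad_train_i."""
    p = sigmoid(X @ theta)
    # Hessian of the regularized training objective (positive definite for l2 > 0).
    H = (X.T * (p * (1.0 - p))) @ X / len(y) + l2 * np.eye(X.shape[1])
    g_test = (sigmoid(x_test @ theta) - y_test) * x_test
    s_test = np.linalg.solve(H, g_test)      # H^{-1} grad_test without forming H^{-1}
    train_grads = (p - y)[:, None] * X       # one per-example gradient per row
    return -train_grads @ s_test

# Synthetic data (assumption: stands in for a small, low-dimensional dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X @ np.array([1.5, -1.0, 0.5]) > 0).astype(float)
x_test, y_test = np.array([1.0, 0.5, -0.2]), 1.0

# Append two copies of the test point: index -2 is mislabeled, index -1 is correct.
X = np.vstack([X, x_test, x_test])
y = np.concatenate([y, [0.0, 1.0]])

theta = fit_logreg(X, y)
i_up = influence_on_test_loss(theta, X, y, x_test, y_test)
scores = -i_up   # positive = helpful, negative = harmful for this test example
```

Upweighting the mislabeled copy raises the test loss, so its score is negative, while the correctly labeled copy gets a positive score. For neural networks the `np.linalg.solve` call is the part that would be replaced by conjugate gradient or LiSSA.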

What mastery looks like: (1) Correct influence computation: for small datasets, verify against numerical perturbations (remove one example, measure accuracy change ≈ influence score). (2) Top harmful examples: inspect top 5–10 most negative influence; find 1–3 mislabeled examples (label disagrees with image), suggesting label quality issues. (3) Robustness improvement: remove ~600–1,200 most harmful examples (1–2% of the 60,000-example MNIST training set), retrain, robust accuracy improves ~5–10 pp (e.g., from 50% to 55–60%). (4) Comparison to random: removing same # random examples → no improvement (baseline). Difference quantifies data quality issue. (5) Influential example properties: analyze: are harmful examples outliers? Mislabeled? From specific group (e.g., rotated digits)? Pattern suggests data annotation or quality problem. (6) Influence distribution: some influences positive (helpful), most near zero (neutral or slightly harmful), few very negative (outliers). (7) Influence vs. loss: examples with high loss do not necessarily have high influence, because the gradient is rescaled by $H^{-1}$; a high-loss example can be removed with little impact when curvature along its gradient direction is large. (8) Advanced: implement efficient influence function algorithms (TracIn, influence via checkpoints); scale to larger models/datasets.

C.19 — Multi-Task Learning for Robust Models

Task: Build multi-task learning (MTL) architecture with shared feature extractor + two task heads: (1) Standard task: predict clean examples $(\mathbf{x}, y_{\text{standard}})$, (2) Robust task: predict adversarial examples $(\mathbf{x}^{\text{adv}}, y_{\text{standard}})$. Setup: shared layers (e.g., Conv + Dense) split into task-specific heads (each predicts MNIST digit). Training: $L = \alpha \ell_{\text{standard}} + (1-\alpha) \ell_{\text{robust}}$, where $\ell_{\text{robust}} = \ell(f_{\text{robust}}(\mathbf{x}^{\text{adv}}), y)$. For $\mathbf{x}^{\text{adv}}$: generate on-the-fly (FGSM/PGD, $\epsilon=0.3$). Sweep $\alpha \in \{0, 0.25, 0.5, 0.75, 1.0\}$ ($\alpha=0$: pure robust, $\alpha=1$: pure standard). Measure: (1) standard accuracy (eval via standard head), (2) robust accuracy (eval via robust head on adversarial examples), (3) final ensemble: use both heads (average predictions), measure accuracy. Plot: Pareto frontier (standard vs. robust accuracy) across $\alpha$ values, compare to single-task alternatives (Exercise C.4 PGD adversarial training, C.14 TRADES).

Purpose: MTL trains shared representations to satisfy multiple objectives: standard accuracy + robust accuracy simultaneously. Key insights: (1) Shared representation tradeoff: features useful for both tasks reduce conflict, but the standard and robust objectives prefer different features, so only partial sharing is possible, (2) Ensemble benefits: combining both heads can improve ensemble robustness (defense via diversity), (3) Computational efficiency: a single shared backbone costs less than two separate models (saves memory/params), (4) Positive transfer: MTL can improve both tasks if well-tuned (shared features learn richer representations). Potential issues: (1) Negative transfer: if tasks conflict too much, one task suffers (tuning $\alpha$ critical), (2) Overfitting: two heads increase capacity; regularize carefully. In production: MTL is interesting for deploying both standard and robust models without doubling parameters.

ML Link: Extends robust training (Theorem 4, Definition 15) with multi-objective learning. Shows robustness-accuracy tradeoff (Theorem 8) can be partially mitigated via shared task structure. Relates to Definition 1 (Adversarial Perturbation) and Definition 2 (Robustness Definition), operationalizing multi-task robustness design.

Hints: Architecture: encoder + two classification heads (each has dense layer + softmax). Loss: $L = \alpha CE_{{std}} + (1-\alpha) CE_{{adv}}$. Generate attacks during training (PGD-10). Task weighting: try dynamic weighting (weight by task loss ratio, balance gradients). Plot both task losses over epochs; both should decrease (if one increases, task conflict). Ensemble: average softmax outputs from both heads, measure final accuracy.
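The loss and the ensemble step from the hints can be written out as follows. This is a framework-free sketch under stated assumptions: the logits are random placeholders for the outputs of the two heads, and the function names (`mtl_loss`, `ensemble_predict`) are hypothetical. In an actual implementation the logits would come from a shared encoder, with the adversarial batch regenerated every step.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    probs = softmax(logits)
    return float(-np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean())

def mtl_loss(std_logits_clean, rob_logits_adv, labels, alpha):
    """L = alpha * CE_std(clean inputs) + (1 - alpha) * CE_adv(adversarial inputs)."""
    ce_std = cross_entropy(std_logits_clean, labels)
    ce_adv = cross_entropy(rob_logits_adv, labels)
    return alpha * ce_std + (1.0 - alpha) * ce_adv

def ensemble_predict(std_logits, rob_logits):
    """Average the softmax outputs of both heads on the same input, then take argmax."""
    return (0.5 * (softmax(std_logits) + softmax(rob_logits))).argmax(axis=1)

# Placeholder head outputs for a batch of 4 examples and 10 digit classes.
rng = np.random.default_rng(0)
labels = np.array([3, 1, 4, 1])
std_logits = rng.normal(size=(4, 10))   # standard head on clean inputs
rob_logits = rng.normal(size=(4, 10))   # robust head on adversarial inputs

sweep = {a: mtl_loss(std_logits, rob_logits, labels, a)
         for a in (0.0, 0.25, 0.5, 0.75, 1.0)}
preds = ensemble_predict(std_logits, rob_logits)
```

The two endpoints of the sweep recover the single-task losses, which is a quick sanity check before running the full $\alpha$ sweep.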

What mastery looks like: (1) Training curves: both standard & robust losses decrease; balance them via $\alpha$ (if one stalls, rebalance). (2) Pareto frontier: plot (standard_acc, robust_acc) for each $\alpha$; frontier should be a smooth curve from approximately (95%, 0%) at $\alpha=1$ to (50%, 60%) at $\alpha=0$. (3) Alpha sensitivity: $\alpha=1$ pure standard (95% standard, 0% robust), $\alpha=0$ pure robust (50% standard, 60% robust), sweet spot $\alpha ≈ 0.5–0.75$ (88–90% standard, 40–50% robust, good balance). (4) Ensemble benefit: final ensemble (averaging heads) accuracy higher than either single head in middle regime (~90% standard, ~45% robust for $\alpha≈0.5$), showing diversity helps. (5) Comparison to single-task: MTL + ensemble ≥ TRADES (similar Pareto frontier, but MTL has ensemble diversity benefit). (6) Feature analysis: compare learned representations; shared encoder features should be intermediate (useful for both tasks), not specialized. (7) Computational efficiency: single MTL model ~1.5× params of single task, <2× params of two separate models (savings vs. dual-model approach). (8) Advanced: implement task-aware attention (learn which features matter per task), dynamic weighting of task losses, or curriculum (start with standard task, gradually emphasize robust task).

C.20 — Out-of-Distribution Robustness and Distribution Shift

Task: Train model on MNIST (standard + robustness-enhanced via Exercise C.4/C.14). Evaluate robustness & standard accuracy on MNIST variants with distribution shift: (1) Rotations: rotate images $θ ∈ [0°, 30°, 60°, 90°]$, (2) Brightness: adjust pixel intensity $s \in [0.5, 1.0, 1.5]× \text{original}$, (3) Gaussian blur: apply blur with radius $r ∈ [0, 1, 2, 4]$, (4) Different font/rendering (if MNIST variants available, e.g., NotMNIST, KMNIST). For each shift, measure: (1) clean accuracy, (2) robust accuracy (PGD-20, $\epsilon=0.3$), aggregate across shift magnitudes (report mean ± std). Hypothesis: standard model’s accuracy drops >20% under shift, robust model more stable. Compare: (A) standard training (Exercise C.1–C.3 baseline), (B) PGD-trained model (C.4), (C) TRADES-trained model (C.14), (D) model trained on augmented data (mix original + rotated/blurred, standard training). Measure: (1) OOD generalization (average accuracy across distributions), (2) robustness transfer (does in-distribution robustness help OOD?), (3) correlation between standard & robust accuracy under shift.

Purpose: Real-world deployment encounters distribution shift: MNIST models trained on clean, centered digits face rotated/blurred images. Robust training (adversarial examples) can improve out-of-distribution performance as side benefit: forcing model to learn more invariant features. Key insights: (1) Robustness helps distribution shift: robust models trained to survive perturbations sometimes generalize better to shifted distributions (both perturbations & shifts are “small changes”), (2) Limited transfer: robustness to $\ell_∞$ perturbations doesn’t fully transfer to rotations (different threat models), (3) Data augmentation synergy: combining robust training + domain augmentation (rotations, blurs) best approach. In production: deploy models on real-world data; robust training provides safety margin for unexpected shifts.

ML Link: Extends robustness (Definition 1—Adversarial Perturbation) to distributional robustness (handling shift via adversarial training). Connects to Definition 2 (Robustness Definition) conceptually (worst-case input changes), but distribution shift is realistic worst-case. Demonstrates secondary benefit of Theorem 4 (Robust Optimization): learning robust features as regularizer for out-of-distribution generalization.

Hints: Create shifted MNIST variants via image transforms (PIL/OpenCV): img.rotate(angle), ImageEnhance.Brightness(img).enhance(factor), img.filter(ImageFilter.GaussianBlur(radius)). Evaluate on shifted test sets separately, report results as table (rows: shift type × magnitude, columns: standard_acc, robust_acc). Compute correlation between (standard accuracy, robust accuracy) across shifts; positive correlation (robust helps standard) is desirable. Visualize: 2D plots (x-axis shift magnitude, y-axis accuracy) with error bars.
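The three transforms can be wrapped in one helper so that every (shift type, magnitude) pair is generated the same way. The sketch below assumes Pillow and NumPy are installed and uses a synthetic 28×28 image in place of a real MNIST digit; `shift_image` and the `SHIFTS` grid are hypothetical names chosen for illustration. Blur radius 0 is left out of the grid because it is just the unshifted baseline.

```python
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def shift_image(arr, kind, magnitude):
    """Apply one distribution shift to a uint8 grayscale image array."""
    img = Image.fromarray(arr)                       # 2-D uint8 array -> mode "L"
    if kind == "rotate":
        out = img.rotate(magnitude)                  # degrees, counter-clockwise
    elif kind == "brightness":
        out = ImageEnhance.Brightness(img).enhance(magnitude)
    elif kind == "blur":
        out = img.filter(ImageFilter.GaussianBlur(radius=magnitude))
    else:
        raise ValueError(f"unknown shift kind: {kind}")
    return np.asarray(out)

# Synthetic stand-in for a digit: a vertical bar on a black background.
arr = np.zeros((28, 28), dtype=np.uint8)
arr[4:24, 13:15] = 255

# Shift grid following the exercise statement.
SHIFTS = {
    "rotate": [0, 30, 60, 90],
    "brightness": [0.5, 1.0, 1.5],
    "blur": [1, 2, 4],
}

shifted = {
    (kind, m): shift_image(arr, kind, m)
    for kind, magnitudes in SHIFTS.items()
    for m in magnitudes
}
```

Each entry of `shifted` corresponds to one row of the results table; applying the helper to every test image gives the shifted test set for that row.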

What mastery looks like: (1) Comprehensive OOD evaluation: measure accuracy/robustness across ≥3 shift types × 3 magnitudes each, report systematic results. (2) Shift sensitivity: standard model accuracy drops ~20–30% under moderate shift (e.g., 30° rotation, 1.5× brightness), robust model drops ~10–20% (more stable). (3) Robustness transfer: in-distribution robust accuracy ($\epsilon=0.3$) ~50%, OOD robust accuracy on shifted data similar or slightly lower (~45–50%), showing some transfer. (4) Data augmentation comparison: training on mixed (standard + shifted) data improves OOD generalization more than adversarial training alone; combining both is best (~70–80% accuracy on shifted data). (5) Correlation analysis: standard & robust accuracy positively correlated across shifts (ρ > 0.85, both degrade similarly under shift). (6) Pareto curves: plot OOD_accuracy vs. in-distribution accuracy for multiple training schemes; robust training Pareto-improves standard training (robust doesn’t hurt OOD, sometimes helps). (7) Shift-to-shift transfer: identify which shifts robust training helps most (e.g., small rotations benefit, large rotations less). (8) Advanced: implement domain randomization (train on set of shifts), evaluate transfer to unseen shifts; show data augmentation outperforms adversarial training for domain shift robustness.

Solutions

Solutions to A. True / False

A.1 Final Answer: False.

Full Mathematical Justification: Adversarial examples are fundamentally not a consequence of overfitting. To see why, consider a well-regularized linear classifier f(x) = sign(w^T x + b) trained on separable data. Even without memorization or high-variance learning, adding perturbations along the direction of w (the normal to the decision boundary) causes misclassification. Formally, if an example x with label y ∈ {−1, +1} is correctly classified with margin γ = y(w^T x + b)/||w||_2, then the ℓ∞ perturbation δ = −ε·y·sign(w) lowers the signed score y(w^T x + b) by ε·||w||_1, so it crosses the decision boundary whenever ε > γ·||w||_2/||w||_1. In d dimensions ||w||_1 can exceed ||w||_2 by a factor of up to √d, so the required ε can be imperceptibly small even when the margin γ is comfortable. This phenomenon occurs for any empirical risk minimization algorithm that fits separable data; it is not an artifact of overfitting. The robust features hypothesis (Ilyas et al., 2019) formalizes this: adversarial vulnerability arises because models rely on non-robust features (features that are highly correlated with labels but change sharply under small input perturbations). Because these features are predictive, standard training learns to rely on them; robustness requires retraining to ignore non-robust features, not reducing capacity or adding regularization.
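The gap between the ℓ2 margin and the ℓ∞ budget needed to flip a linear classifier is easy to check numerically. The sketch below is an illustration under stated assumptions: the weights are random stand-ins for a trained model, the input dimension 784 mirrors a flattened MNIST image, and clipping to the valid pixel range is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 784                                   # flattened 28x28 image
w = rng.normal(size=d) / np.sqrt(d)       # hypothetical trained weights
b = 0.0
x = rng.uniform(0.0, 1.0, size=d)         # hypothetical input
y = 1.0 if w @ x + b > 0 else -1.0        # label it as the model predicts (correctly classified)

score = y * (w @ x + b)                   # positive signed score
gamma = score / np.linalg.norm(w, 2)      # l2 distance to the decision boundary
eps_min = score / np.linalg.norm(w, 1)    # smallest l-inf budget that reaches the boundary

eps = 1.01 * eps_min                      # just above the threshold
delta = -eps * y * np.sign(w)             # worst-case l-inf perturbation
flipped = bool(y * (w @ (x + delta) + b) < 0)

ratio = gamma / eps_min                   # equals ||w||_1 / ||w||_2
```

Here `ratio` is the factor by which the ℓ∞ threshold is smaller than the ℓ2 margin; for dense weights it grows on the order of √d, which is why a comfortable margin does not protect against small per-pixel changes.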

Counterexample if False: Consider a linear model trained on MNIST with tight margin γ = 0.5 (well-separated classes, high confidence). The model generalizes excellently to test data (95% accuracy, near-optimal). Yet an ℓ∞-perturbation of magnitude ε = 0.3 (imperceptible to humans) can push samples across the decision boundary. This occurs regardless of regularization strength: the issue is geometric, not statistical.

Comprehension: Adversarial vulnerability is a structural property of high-dimensional geometry and learned decision boundaries, not a consequence of overfitting. A model can generalize perfectly (zero test error, low complexity) and still be adversarially vulnerable.

ML Applications: This distinction is critical for practitioners. Approaches focused on reducing overfitting (more regularization, larger datasets, reduced capacity) will not improve adversarial robustness; explicit adversarial training or other robustness mechanisms are necessary. Many production systems implement standard regularization and achieve high test accuracy, yet remain vulnerable to adversarial attacks.

Failure Mode Analysis: If a defender assumes adversarial vulnerability is caused by overfitting and responds by adding L2 regularization or early stopping, robustness will not improve. The model will remain vulnerable despite reduced generalization gap. This failure mode occurs frequently in practice when robustness is not explicitly addressed.

Traps: Confusing “high accuracy implies robustness” with reality (high standard accuracy does not guarantee adversarial robustness). Believing that overfitting is the only source of model vulnerability.

A.2 Final Answer: True.

Full Mathematical Justification: The threat model fundamentally defines what perturbations are allowed, and robustness is only meaningful within that threat model. Formally, a classifier is ε-robust under threat model T if ∀x ∈ support(D) and ∀δ in the allowed set T(ε) (for a norm-based threat model, all δ with ||δ|| ≤ ε), the prediction is preserved: f(x) = f(x + δ). If two threat models T_1 and T_2 are incomparable (neither subsumes the other), a model can be robust to T_1 but vulnerable to T_2. For instance, robustness to ℓ∞-perturbations (bounded maximum pixel change) is unrelated to robustness to semantic perturbations (e.g., small rotations, which do not have bounded ℓ∞ norm for all examples). A model trained to be robust to ℓ∞ perturbations (Exercise C.4) will have poor robustness to large rotations. Conversely, a model trained on data augmentation (rotations, crops) will generalize to those transformations but may be vulnerable to small ℓ∞ perturbations. The mathematical formalization: let Φ_1 and Φ_2 be two threat model sets (e.g., ℓ∞-ball and rotation set). Robustness to Φ_1 requires that the prediction is stable ∀δ ∈ Φ_1. If Φ_1 and Φ_2 share no perturbation other than the trivial zero perturbation (disjoint threat models), then robustness to Φ_1 provides no guarantees on Φ_2. Even partial overlap requires careful analysis: if Φ_1 ⊂ Φ_2 (one threat model is a subset of the other), then robustness to Φ_2 implies robustness to Φ_1, but not vice versa.

Counterexample if False: Not applicable; statement is true. However, a subtle error would be assuming threat models are always comparable (they are not).

Comprehension: The threat model is the specification of what constitutes an “allowed” perturbation. Without explicit threat model specification, robustness claims are meaningless. Comparing robustness across different threat models is invalid without careful justification.

ML Applications: Different applications have different threat models: autonomous vehicles face sensor noise (ℓ2-bounded), images face compression artifacts (ℓ∞-subtle but persistent), NLP systems face character-level edits (ℓ0-sparse perturbations). A model claiming 90% robustness must reference its threat model; the claim is meaningless without it. Many papers implicitly assume ℓ∞ or only compare within a single threat model; comparing across threat models requires explicit consideration.

Failure Mode Analysis: Deploying a model claimed to be “robust” without understanding the threat model leads to false confidence. Example: a model trained to be robust to ℓ∞ perturbations with ε = 0.3 is deployed in a setting with semantic perturbations (e.g., camera viewing angles); the model fails because angles are not ℓ∞-bounded perturbations.

Traps: Assuming that “robustness” is a universal property independent of threat model. Conflating robustness to one threat model with robustness to all conceivable perturbations.

A.3 Final Answer: False.

Full Mathematical Justification: Certified robustness and empirical robustness are distinct concepts with different tradeoffs. Empirical robustness is evaluated by running the model against specific attacks (e.g., PGD for ℓ∞); if the attack fails to generate adversarial examples, the model is deemed empirically robust to that attack. However, empirical robustness does NOT imply robustness to all possible attacks within the threat model; it only demonstrates robustness to the attacks tested. Certified robustness, by contrast, provides a mathematical guarantee: for a given threat model and perturbation bound ε, the model is guaranteed to be robust to ALL perturbations of magnitude ≤ ε, not just those tested. Formally, certified robustness requires proving ∀δ: ||δ||_p ≤ ε ⟹ f(x + δ) = f(x) (prediction invariant). This proof typically involves abstract interpretation, randomized smoothing, or other techniques that are computationally more expensive than running a single attack. The fundamental gap: a model can have 90% empirical robustness (PGD attack fails on 90% of examples) but 0% certified robustness (no formal guarantee on any examples) if the proof method is weak. Conversely, a model with 50% certified robustness (formal guarantee on 50% of examples) is guaranteed robust on those; empirical robustness on that subset is 100%.

Counterexample if False: Train a model adversarially against FGSM only. The model achieves 80% empirical robustness to FGSM. However, PGD (a stronger attack) succeeds on 50% of examples (50% empirical robustness to PGD). No certified robustness guarantee exists without a proof. Here, empirical robustness to one attack does not transfer to another attack.

Comprehension: Empirical robustness is necessary but insufficient for true robustness; it is evaluation against a limited set of attacks. Certified robustness is a stronger, provable guarantee, but often at the cost of looser bounds (certified radius smaller than what empirical attacks suggest).

ML Applications: For safety-critical applications (autonomous vehicles, medical diagnosis), certified robustness is preferable despite its computational cost, because formal guarantees prevent surprises. For best-effort defenses (spam filtering, ad ranking), empirical robustness against known attacks may suffice, because individual failures are cheap and the defense can be updated as new attacks appear.

Failure Mode Analysis: Claiming a model is “robust” based on passing empirical attacks (PGD) without certified robustness leads to false confidence. Adaptive attacks or novel attack methods can break empirical robustness. Conversely, weak certified robustness (small certified radius) can be overstated; certified claims are valuable iff the radius is large relative to threat model.

Traps: Confusing empirical robustness (against tested attacks) with robustness (protection against all possible attacks). Believing certified robustness always dominates; loose certified bounds can be weaker than empirical robustness.

A.4 Final Answer: False.

Full Mathematical Justification: Adversarial training improves empirical robustness at the cost of reduced standard (clean) accuracy. This is Theorem 8 (Robustness-Accuracy Tradeoff): for a given architecture and dataset, there exists a fundamental tradeoff between clean accuracy and adversarial robustness. Formally, along the Pareto frontier of achievable (accuracy, robustness) pairs, increasing robustness ρ requires decreasing accuracy acc (or vice versa). The tradeoff is not universal in magnitude (some datasets/architectures show a steep tradeoff, others a gradual one), but it is always present: you cannot achieve both maximum accuracy and maximum robustness simultaneously. Empirically, adversarial training on MNIST with ε = 0.3 achieves ~50% robust accuracy and ~90% clean accuracy; purely clean-trained models achieve ~95% clean accuracy and ~0% robust accuracy. No model has been demonstrated to achieve both 95% clean and 50% robust on MNIST under ℓ∞ perturbations; the tradeoff is real. Theoretical justification: robustness and accuracy correspond to fitting different feature sets. Standard accuracy relies on both robust and non-robust predictive features; robustness requires ignoring non-robust features and relying only on robust features, which are less predictive of the label (hence lower accuracy). More formally, a linear model fitting the same data with and without robustness constraints will have different margins; the robust constraint requires every point of each ε-ball to be classified correctly, which reduces the margin available for clean examples.

Counterexample if False: Hypothetically, a miraculously well-architected model could achieve both 95% clean and 95% robust accuracy, violating the tradeoff. No evidence for such models exists in practice; all state-of-the-art results show clear accuracy-robustness tradeoff.

Comprehension: Adversarial training is not a free lunch; it requires explicit design and incurs accuracy costs. The tradeoff is a fundamental limitation, not failures of current methods (though methods vary in how steep the frontier is).

ML Applications: When deploying adversarially robust models, practitioners must accept lower clean accuracy. For customer-facing systems, this may be unacceptable; for safety-critical systems, acceptable. Choosing a point on the Pareto frontier (Exercise C.10) requires domain knowledge: what accuracy loss is tolerable for a given robustness gain?

Failure Mode Analysis: If a practitioner trains an adversarially robust model expecting no clean accuracy loss, they will be disappointed. Models trained adversarially will have measurably lower clean accuracy (5–15% drop typical for strong robustness). This is expected, not a bug.

Traps: Assuming research improvements will eventually eliminate the tradeoff (unlikely; it is fundamental). Comparing models on different datasets or threat models; tradeoff magnitude varies.

A.5 Final Answer: True.

Full Mathematical Justification: Gradient-based attacks (FGSM, PGD) are extremely effective at generating adversarial examples because they directly exploit the gradient of the loss with respect to inputs. Formally, the gradient ∇_x ℓ(f(x), y) points in the direction of largest loss increase; moving in this direction by ε approximately maximizes the loss within an ε-ball (linear approximation). For linear models, moving by ε·sign(∇_x ℓ) exactly maximizes the loss over the ℓ∞ ball of radius ε, so it finds a misclassified point whenever one exists within the ball. For neural networks, the first-order approximation is loose, but still effective in practice. FGSM (one-step gradient) achieves ~70–80% attack success on undefended ImageNet models for small ε; PGD (iterative refinement) achieves ~90%+. The high success rate is well-documented across numerous papers and models. Gradient-based attacks are also broadly applicable (they work without white-box access if gradients can be estimated from queries) and fast (one pass through the network for FGSM, a few for PGD). The strength of gradient-based attacks is so well-established that they are de facto standards for evaluation; any defense claiming robustness must be tested against PGD. Non-gradient attacks (evolutionary algorithms, Bayesian optimization) are typically much slower and not substantially more effective against well-defended models.
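The two attacks can be sketched on a logistic model, where the input gradient is available in closed form. This is a minimal illustration under stated assumptions: the weights, input, and budget are made-up numbers, the function names are hypothetical, and clipping to a valid pixel range is omitted. Because the model is linear, PGD ends at the same point as FGSM; on a neural network the iterative refinement is what makes PGD stronger.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy of a logistic model, label y in {0, 1}."""
    p = sigmoid(w @ x + b)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def input_grad(w, b, x, y):
    """Analytic gradient of the loss with respect to the input x."""
    return (sigmoid(w @ x + b) - y) * w

def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps."""
    return x + eps * np.sign(input_grad(w, b, x, y))

def pgd(w, b, x, y, eps, steps=10, step_size=None):
    """Repeated signed-gradient steps, each followed by projection onto the l-inf ball."""
    step_size = eps / 4 if step_size is None else step_size
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + step_size * np.sign(input_grad(w, b, x_adv, y))
        x_adv = x + np.clip(x_adv - x, -eps, eps)
    return x_adv

# Made-up model and input for illustration.
w, b = np.array([2.0, -1.0, 0.5]), 0.1
x, y = np.array([0.3, 0.2, -0.4]), 1.0
eps = 0.1

x_fgsm = fgsm(w, b, x, y, eps)
x_pgd = pgd(w, b, x, y, eps)
clean_loss = loss(w, b, x, y)
fgsm_loss = loss(w, b, x_fgsm, y)
pgd_loss = loss(w, b, x_pgd, y)
```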

Counterexample if False: A well-defended model might have gradients masked or misaligned with true vulnerabilities (a rare and brittle defense mechanism), making gradient-based attacks less effective. However, adaptive attacks (attacker-aware of defense) rectify this.

Comprehension: Gradient-based attacks are highly effective because they directly leverage the model’s own learning signal (gradients) against itself. This algorithmic simplicity and effectiveness make them powerful. Understanding gradients is key to understanding both attacks and defenses.

ML Applications: Gradient-based attacks are the default standard for evaluating adversarial robustness in research and practice. Any model evaluated only against FGSM is not rigorously evaluated; PGD is the minimum standard. Practitioners should implement PGD attacks when evaluating claims of adversarial robustness.

Failure Mode Analysis: If a defense claims robustness but is only evaluated against FGSM, the claim is likely overstated. Gradient-masking defenses (reducing gradient information) appear robust to FGSM but fail against adaptive attacks (PGD with knowledge of defense). This is a common pitfall in the robustness literature.

Traps: Assuming that FGSM robustness implies PGD robustness (false; PGD is much stronger). Conflating empirical robustness to gradient-based attacks with robustness to other attack types.

A.6 Final Answer: True.

Full Mathematical Justification: Randomized smoothing is a mathematically sound technique for achieving certified robustness. The core idea: add Gaussian noise to inputs, aggregate predictions over noisy variants, and interpret the smoothed model F(x) = argmax_c 𝔼_δ∼𝒩[f(x + δ)]_c as a certified robust classifier. Theorem (Randomized Smoothing Certificate): if, under noise δ ∼ 𝒩(0, σ²I), the base classifier f returns the top class with probability at least p_A and any other class with probability at most p_B < p_A, then the smoothed classifier F is certifiably robust within ℓ2 radius R = (σ/2)[Φ^(-1)(p_A) - Φ^(-1)(p_B)], where Φ is the standard normal CDF. The proof uses the Neyman–Pearson lemma: among all base classifiers consistent with p_A and p_B, the worst case is a linear one, for which the class probabilities at a shifted input x + δ can be computed exactly. This enables a formal robustness certificate without model retraining. The practical implementation: at test time, sample noise ~𝒩(0, σ²I) ~100–1000 times per input, aggregate the base classifier's predictions to estimate p_A and p_B (a rigorous certificate uses a high-confidence lower bound on p_A rather than the raw estimate), and compute the radius. Certified ℓ2 radius is typically ε = 0.5–1.0 (reasonable), though smaller than some empirical robustness levels. The certificate is tight (Cohen et al., 2019): given only p_A and p_B, no larger ℓ2 radius can be certified for an arbitrary base classifier.
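The radius formula is a one-liner once Φ^(-1) is available. The sketch below uses only the Python standard library; `certified_radius` is a hypothetical helper name, and the toy base classifier (a one-dimensional threshold at zero) is an assumption chosen because its true distance to the boundary is known. It takes p_A and p_B as given, so it illustrates the formula rather than a complete certification procedure, which would first bound p_A and p_B from sampled votes.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()   # standard normal: mean 0, standard deviation 1

def certified_radius(p_a, p_b, sigma):
    """l2 radius R = (sigma / 2) * (Phi^{-1}(p_a) - Phi^{-1}(p_b)); 0 if p_a <= p_b."""
    if not (0.0 < p_b < p_a < 1.0):
        return 0.0
    return 0.5 * sigma * (_STD_NORMAL.inv_cdf(p_a) - _STD_NORMAL.inv_cdf(p_b))

# Example with assumed class probabilities.
radius_example = certified_radius(p_a=0.9, p_b=0.1, sigma=0.25)

# Sanity check on a toy base classifier f(x) = 1 if x > 0 else 0 in one dimension.
# Under Gaussian noise, P[f(x0 + noise) = 1] = Phi(x0 / sigma), and the true distance
# from x0 to the decision boundary is x0 itself.
x0, sigma = 0.3, 0.25
p_top = _STD_NORMAL.cdf(x0 / sigma)
radius_linear = certified_radius(p_top, 1.0 - p_top, sigma)
```

For the linear toy classifier the certified radius coincides with the true distance to the boundary, which matches the tightness statement: the bound cannot be improved without knowing more about the base classifier.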

Counterexample if False: Not applicable; randomized smoothing is theoretically and empirically validated.

Comprehension: Randomized smoothing converts any classifier into a certified robust one via randomization and aggregation. The mechanism is statistically sound and computationally feasible (modest overhead for certification).

ML Applications: Randomized smoothing is deployed in practice when certified robustness is required. Medical diagnosis systems, autonomous vehicles, and other safety-critical applications can use randomized smoothing as a defense mechanism. The moderate computational cost at inference time (100-1000 passes) is often acceptable.

Failure Mode Analysis: Certified radius can be small if σ (noise level) is too small (insufficient noise) or too large (too much averaging, low accuracy). Practitioners must tune σ; common values are σ = 0.12–0.25 on normalized data. Additionally, certification is only as strong as the base classifier; randomized smoothing does not improve inherent algorithmic vulnerabilities in the base model.

Traps: Assuming large noise (large σ) always improves robustness (false; accuracy degrades). Confusing certified radius with empirical robustness; certified guarantee is weaker than full ε-robustness under some scenarios. Treating σ as a hyperparameter not requiring tuning.

A.7 Final Answer: False.

Full Mathematical Justification: The robust optimization formulation (minimax problem) is not equivalent to standard adversarial training in general, though adversarial training is an approximation. The exact robust optimization problem is min_θ 𝔼_{x,y} [max_{||δ||≤ε} ℓ(f_θ(x+δ), y)]. To solve this exactly, one would need to (1) solve the inner maximization (find the worst perturbation for each example) precisely, (2) compute gradients of the inner optimal solution with respect to θ (requires differentiating through the inner optimization result). Standard adversarial training approximates the inner maximization with a few PGD steps, not to convergence, which is computationally tractable. However, exact solutions to the minimax problem are difficult to find due to several factors: (1) the inner maximization is non-convex (for neural networks), so local optima may not be global optima, (2) approximating the gradient through the inner solution (implicit differentiation via Hessian computation) is expensive, (3) the minimax problem itself is non-convex-concave (not amenable to standard convex optimization theory). In practice, adversarial training (approximate inner maximization via PGD) is used instead, which is computationally feasible. This approximation can be loose: the inner PGD may not find the true worst perturbation, and the training objective becomes a lower bound on the true minimax objective. Empirically, adversarially trained models are robust to PGD attacks but may remain vulnerable to other attacks or certification methods.
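The lower-bound claim can be made concrete on a linear model, the one case where the inner maximization has a closed form. The sketch below is an illustration under stated assumptions: a logistic loss with labels in {−1, +1}, an ℓ∞ ball, made-up weights and input, and hypothetical function names. It compares the exact worst-case loss with what a weak and a stronger PGD run find.

```python
import numpy as np

def logistic_loss(w, b, x, y):
    """Log loss for a label y in {-1, +1}."""
    return float(np.log1p(np.exp(-y * (w @ x + b))))

def exact_inner_max(w, b, x, y, eps):
    """Closed-form worst-case loss over the l-inf ball (valid for linear models only)."""
    worst_margin = y * (w @ x + b) - eps * np.abs(w).sum()
    return float(np.log1p(np.exp(-worst_margin)))

def pgd_inner_max(w, b, x, y, eps, steps, step_size):
    """Approximate inner maximization with a fixed number of signed-gradient steps."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        # The loss gradient w.r.t. x is a positive multiple of -y * w; sign() drops the scale.
        delta = np.clip(delta + step_size * np.sign(-y * w), -eps, eps)
    return logistic_loss(w, b, x + delta, y)

# Made-up model and input for illustration.
w, b = np.array([2.0, -1.0, 0.5]), 0.1
x, y = np.array([0.3, 0.2, -0.4]), 1.0
eps = 0.1

clean = logistic_loss(w, b, x, y)
exact = exact_inner_max(w, b, x, y, eps)
weak = pgd_inner_max(w, b, x, y, eps, steps=1, step_size=eps / 4)     # under-solved inner max
strong = pgd_inner_max(w, b, x, y, eps, steps=10, step_size=eps / 4)  # reaches the ball's corner
```

The ordering `clean < weak < exact` shows that an under-solved inner maximization trains against a lower bound on the true robust loss. For neural networks no closed form exists, so the size of this gap is unknown, which is the approximation described above.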

Counterexample if False: A model trained adversarially with PGD-10 (10 iterations) may have 50% robustness against PGD-20 (20 iterations), showing the inner maximization was not solved exactly. Using exact optimization (hypothetically) via a second-order method would produce a model with better worst-case robustness.

Comprehension: The robust optimization formulation is the theoretical ideal, but standard adversarial training is a practical approximation due to computational constraints.

ML Applications: Practitioners use PGD-based adversarial training (not exact robust optimization) due to computational feasibility. Understanding the gap between true robust optimization and approximate adversarial training helps explain some vulnerabilities (adaptive attacks may exploit the approximation).

Failure Mode Analysis: If a model is adversarially trained with PGD-10 and then evaluated against PGD-100, robustness decreases because the training objective was not the true robust optimization—there was room for adaptive attacks.

Traps: Assuming adversarial training solves the minimax problem exactly (it does not). Believing that more PGD iterations during training always lead to better robustness (they often help, but with diminishing returns and a training cost that grows with every added step).

A.8 Final Answer: True.

Full Mathematical Justification: Certified defenses are fundamentally more rigorous than empirical defenses because they provide mathematical guarantees. A certified defense proves that for any input x and any perturbation ||δ||_p ≤ ε, the model’s prediction is invariant: f(x) = f(x + δ). This proof is typically constructive (deriving bounds that hold for all perturbations) and does not require exhaustively testing against all possible perturbations. Empirical defenses, by contrast, are only validated against attacks tested; adaptive attacks (attacker-aware of defense) can circumvent them. Historically, numerous empirical defenses have been broken (defensive distillation, gradient masking, etc.) by adaptive attacks; certified defenses, if correctly implemented, cannot be broken because the guarantee is mathematical. The cost of certified robustness is typically conservative bounds (smaller certified radius than empirical robustness level) and/or reduced clean accuracy, but the guarantee itself is unbroken. Examples of certified defenses include randomized smoothing (Exercise C.6), abstract interpretation (interval bounds on neuron activations), convex relaxations of neural network outputs (CROWN, DeepPoly). All of these provide ε-robustness: ∀||δ||_p ≤ ε, prediction invariant, up to the tightness of the relaxation.
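Interval bounds, the simplest of the certification techniques listed above, can be sketched on a tiny network. The two-layer ReLU network, its weights, and the function names below are hypothetical. The sketch propagates an ℓ∞ box through the layers and certifies the prediction when the lower bound of the true logit exceeds the upper bound of every other logit.

```python
from itertools import product

# Hypothetical 2-2-2 ReLU network.
W1, b1 = [[1.0, -1.0], [0.5, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]

def affine(W, b, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def forward(x):
    hidden = [max(0.0, h) for h in affine(W1, b1, x)]
    return affine(W2, b2, hidden)

def affine_bounds(W, b, lower, upper):
    """Sound interval bounds of W x + b over the box [lower, upper]."""
    out_l, out_u = [], []
    for row, bias in zip(W, b):
        center = sum(w * (l + u) / 2 for w, l, u in zip(row, lower, upper)) + bias
        radius = sum(abs(w) * (u - l) / 2 for w, l, u in zip(row, lower, upper))
        out_l.append(center - radius)
        out_u.append(center + radius)
    return out_l, out_u

def certify_linf(x, eps, true_class):
    lower = [xi - eps for xi in x]
    upper = [xi + eps for xi in x]
    lower, upper = affine_bounds(W1, b1, lower, upper)
    lower = [max(0.0, l) for l in lower]  # ReLU is monotone
    upper = [max(0.0, u) for u in upper]
    lower, upper = affine_bounds(W2, b2, lower, upper)
    return all(lower[true_class] > upper[j]
               for j in range(len(lower)) if j != true_class)

def min_margin_on_corners(x, eps):
    """Exact worst case here only because both ReLUs stay active on the box."""
    margins = []
    for signs in product((-1.0, 1.0), repeat=len(x)):
        z = forward([xi + s * eps for xi, s in zip(x, signs)])
        margins.append(z[0] - z[1])
    return min(margins)

x = [2.0, 0.0]
print(certify_linf(x, 0.10, true_class=0))
print(certify_linf(x, 0.35, true_class=0), min_margin_on_corners(x, 0.35))
```

At the larger budget the interval bound declines to certify although the corner check shows the prediction never flips, which is the conservatism (smaller certified radius than the true robust radius) mentioned above.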

Counterexample if False: Not applicable; certified robustness is mathematically rigorous.

Comprehension: The key trade-off for certified defenses is often a reduction in clean accuracy or looseness of the certified radius compared to empirical estimates. Nevertheless, the guarantee itself is stronger than empirical evaluation.

ML Applications: In safety-critical applications where a false negative is very costly (medical diagnosis misclassification, autonomous vehicle failures), certified robustness provides valuable guarantees despite potential looseness in certified radius.

Failure Mode Analysis: A certified defense might have a very loose certified radius (e.g., a certified radius of ε = 0.1 on a dataset where the model empirically withstands PGD at ε = 0.3). This looseness may limit practical applicability. The defense is still valid, but the certified guarantee is weak.

Traps: Assuming that certified robustness always provides tighter guarantees than empirical robustness (sometimes, but not always). Forgetting that looseness in the certification process can render certified bounds uninformative. Conflating the rigor of certification with practical utility.

A.9 Final Answer: False.

Full Mathematical Justification: Robustness can be improved without adversarial training through several mechanisms, though adversarial training is the most direct and effective. Alternative mechanisms include: (1) Data augmentation with semantically meaningful transformations (rotations, crops, color jitter): implicitly improves robustness to semantically similar perturbations (though not necessarily to small ℓ∞ perturbations). (2) Regularization (L2, BatchNorm): provides modest robustness improvements (~5–10% in some settings) via implicit complexity reduction and feature smoothing. (3) Architectural choices (skip connections, batch normalization, wide networks): some architectures are inherently more robust, though the effect is small compared to explicit adversarial training. (4) Certified defenses (randomized smoothing, abstract interpretation): provide robustness guarantees without adversarial training (though they typically reuse a standard-trained base classifier). (5) Ensemble methods: diverse models are more robust than individual models to certain attacks, partly due to averaging and reduced memorization. However, none of these alternatives approach the robustness level achievable with adversarial training. Empirically, standard training with data augmentation achieves ~5–10% robust accuracy against ℓ∞ attacks; adversarial training achieves ~50% robust accuracy on the same task. The gap is large, demonstrating that explicit adversarial training is necessary for high robustness, but not strictly necessary for any robustness improvement.

Counterexample if False: A model trained on augmented data (rotations, brightness) with heavy regularization can achieve 10% robust accuracy on ℓ∞ perturbations without adversarial training, compared to 0% for standard training. This shows robustness improvement is possible without adversarial training (though limited).

Comprehension: Adversarial training is the most effective known method for improving robustness, but not the only one. Trade-offs exist: alternatives are computationally cheaper but provide weaker robustness.

ML Applications: When adversarial training is computationally infeasible (very large models), practitioners can combine data augmentation and regularization for modest robustness improvements. However, for high robustness requirements, adversarial training is typically necessary.

Failure Mode Analysis: Assuming data augmentation alone will provide substantial robustness (10% improvement without adversarial training is modest). This failure mode leads to under-defended systems.

Traps: Conflating modest robustness improvements from augmentation with truly robust models (achieved via adversarial training). Assuming all robustness improvements are equivalent; trade-offs differ.

A.10 Final Answer: True.

Full Mathematical Justification: Transferability of adversarial examples is an empirical phenomenon: adversarial examples crafted for one model often fool other models. Formally, if x_adv is an adversarial example for model f_1 (misclassified by f_1), it is often also misclassified by f_2 (transfer rates of roughly 50–90% are typical). This is counterintuitive because different models have different parameters, architectures, and training data; one might expect adversarial perturbations crafted for one model not to transfer. However, transfer occurs because adversarial perturbations often align with universal directions in the loss landscape: directions that multiple models share due to common inductive biases (e.g., reliance on similar non-robust features, similar decision boundary geometry). Empirically, (1) single-step FGSM attacks transfer better than iterative PGD (PGD overfits to source model geometry), (2) attacks transfer more between similar architectures, (3) ensemble attacks (generated on multiple source models) transfer better. The mechanism behind transfer is partially understood: adversarial examples exploit features that are predictive across models (non-robust features); perturbations along these universal feature directions transfer. The transferability is not perfect (the transfer rate is below 100%), suggesting that models also have idiosyncratic decision boundaries, but the presence of significant transfer is well-established empirically.
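A transfer rate can be measured as the fraction of examples that fool the source model and also fool the target. The sketch below is a toy setting with two hypothetical linear classifiers whose weights are similar but not identical; the data, weights, and function names are illustrative assumptions, and the resulting rate says nothing about real networks.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sign(v):
    return (v > 0) - (v < 0)

def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1

def fgsm_linear(w, x, y, eps):
    """Single signed-gradient step against a linear score."""
    return [xi - eps * y * sign(wi) for xi, wi in zip(x, w)]

def transfer_rate(w_src, w_tgt, labeled_points, eps):
    """Among examples that fool the source, the share that also fool the target."""
    fooled_src = fooled_both = 0
    for x, y in labeled_points:
        if predict(w_src, x) != y or predict(w_tgt, x) != y:
            continue  # only attack points both models classify correctly
        x_adv = fgsm_linear(w_src, x, y, eps)
        if predict(w_src, x_adv) != y:
            fooled_src += 1
            if predict(w_tgt, x_adv) != y:
                fooled_both += 1
    return fooled_both / fooled_src if fooled_src else None

rng = random.Random(0)
w_true = [1.0, 1.0, 0.0, 0.0, 0.0]  # hypothetical labeling rule
points = []
for _ in range(500):
    x = [rng.uniform(-1, 1) for _ in range(5)]
    points.append((x, predict(w_true, x)))

w_a = [1.0, 0.9, 0.1, 0.0, -0.1]  # hypothetical source model
w_b = [0.9, 1.1, -0.1, 0.1, 0.0]  # hypothetical target model

rate_ab = transfer_rate(w_a, w_b, points, eps=0.3)
rate_aa = transfer_rate(w_a, w_a, points, eps=0.3)
print(rate_ab, rate_aa)
```

Attacking a model with examples crafted on itself gives a rate of 1 by construction; the cross-model rate is lower because the two weight vectors, while aligned, are not identical.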

Counterexample if False: Not applicable; transfer is empirically well-documented.

Comprehension: Adversarial perturbations are not model-specific; they exploit universal geometric or feature-based properties shared across models. This universality is both a threat (black-box attacks via transfer) and an opportunity (ensemble robustness via diversity).

ML Applications: Transfer enables black-box attacks without model access: generate adversarial examples on a surrogate model (public model with similar architecture), transfer to target model. This threat model is realistic in many settings. Defense via diversity (ensemble of diverse models) can reduce transfer and improve robustness.

Failure Mode Analysis: A defense claiming robustness based on testing against attacks on a single model is vulnerable to black-box attacks via transfer. Robustness evaluation should include transferred attacks.

Traps: Assuming robustness to one attack implies robustness to all attacks, when transfer is less than perfect (~70% typical). Believing that model diversity fully eliminates transfer (it reduces but does not eliminate it).

A.11 Final Answer: False.

Full Mathematical Justification: Gradient-based adversarial training does not guarantee robustness to all perturbations within the threat model. Instead, it biases the model toward robustness to the specific attacks used during training. Formally, if adversarial training uses PGD attacks with K steps and step size α, the learned model optimizes for robustness to perturbations found by PGD. However, PGD may fail to find the true worst perturbation (inner maximization solved approximately), leading to a model that is robust to PGD-found perturbations but may remain vulnerable to other perturbations within the ε-ball (e.g., found by stronger attacks like C&W or adaptive attacks). Recent work shows that adversarially trained models evaluated against adaptive attacks (attacker-aware of defense) have lower robustness than claimed. Additionally, the threat model may have multiple types of perturbations (e.g., ℓ∞ and ℓ2 perturbations); training against one does not guarantee robustness to the other. Formally, robustness to all perturbations within a threat model would require solving the minimax problem exactly, which is computationally intractable for neural networks. In practice, adversarial training provides robustness to the attacks used during training and, to some degree, transfer to similar attacks, but not complete coverage of the threat model.

Counterexample if False: A model trained adversarially against PGD-ℓ∞ may have low robustness to C&W-ℓ2 attacks, showing incomplete robustness. Alternatively, adaptive attacks can be crafted to circumvent PGD-based training.

Comprehension: Adversarial training provides robustness to the specific attacks used (approximate), with some transfer to similar attacks, but not to all perturbations within a threat model.

ML Applications: Practitioners should evaluate robustness against multiple attack methods (PGD, C&W, AutoAttack) to avoid overestimating robustness. Certified defenses provide more complete coverage of threat models.

Failure Mode Analysis: Claiming a model is “ε-robustly trained” based on PGD training alone, without evaluating against other attacks, leads to false confidence.

Traps: Conflating robustness to one attack with complete threat model coverage. Assuming robustness improves monotonically with training attack strength (sometimes, but with diminishing returns).

A.12 Final Answer: True.

Full Mathematical Justification: The robustness-accuracy trade-off is a fundamental phenomenon in adversarial robustness, formalized in Theorem 8. For a given architecture and dataset, improving adversarial robustness ρ typically requires sacrificing clean accuracy acc. Formally, the achievable (acc, ρ) pairs lie on a Pareto frontier; moving toward higher robustness (increasing ρ) generally decreases accuracy. The intuition: robust classifiers rely on robust features (features that are stable under perturbations within the budget), which are typically less predictive than non-robust features (highly predictive but unstable). Standard accuracy benefits from fitting both robust and non-robust features; robustness requires ignoring non-robust features, thus losing predictive power. Empirically, on CIFAR-10 with ℓ∞ adversarial training at ε = 8/255, models achieve roughly 50% robust accuracy at roughly 87% clean accuracy, compared to about 95% clean accuracy for standard training; on MNIST at ε = 0.3 the gap is much smaller. The tradeoff has been observed across datasets (CIFAR-10, ImageNet), threat models (ℓ∞, ℓ2), and training algorithms. The magnitude of the tradeoff varies (steeper on some datasets, flatter on others), but it appears consistently in practice. This follows from the robust features hypothesis: non-robust features are predictive but unstable; robust features are stable but less predictive. Optimization with robustness constraints forces the model to rely more on robust features, sacrificing accuracy. Some recent work explores methods to reduce the tradeoff (e.g., TRADES, better regularization, better architectures), but fundamental limits remain.

Counterexample if False: A hypothetically perfect architecture could achieve both 95% clean and 95% robust accuracy on CIFAR-10 at ε = 8/255, violating the tradeoff. No such model is known in practice.

Comprehension: The tradeoff is fundamental and unavoidable; practitioners must choose a point on the Pareto frontier balancing accuracy and robustness for their application.

ML Applications: When deploying robust models, stakeholders must accept accuracy losses. For safety-critical applications (medical, autonomous), this may be acceptable; for consumer applications, unacceptable. Understanding the tradeoff informs deployment decisions.

Failure Mode Analysis: If adversarial training is expected to improve both accuracy and robustness, disappointment results; robustness comes at accuracy cost.

Traps: Assuming recent research has “solved” the tradeoff (it has not; fundamental limits remain). Confusing reduction of the tradeoff with elimination (improvements are real but modest).

A.13 Final Answer: False.

Full Mathematical Justification: Gradient masking (making gradients uninformative or noisy) is NOT a valid defense against adversarial attacks because adaptive attacks can circumvent it. Gradient masking creates the illusion of robustness: when evaluated against standard gradient-based attacks (FGSM, PGD), the model appears robust (the attack success rate is low). However, if the attacker is aware of the defense and adapts their method (adaptive attack), robustness degrades significantly. For example, if the defense mechanism is gradient obfuscation through distillation (Exercise C.12), an adaptive attack can account for the distillation temperature and recover useful gradients, breaking the obfuscation. Gradient masking typically arises from (1) non-differentiable defense operations (detector networks with discrete decisions), (2) randomness (stochastic defenses), or (3) obfuscated gradients (intentionally misleading gradient signals). Cases (1) and (2) can be circumvented by gradient-free attacks (evolutionary algorithms, Bayesian optimization), by substituting a differentiable approximation on the backward pass, or by averaging gradients over the randomness. Case (3) can be circumvented by adaptive attacks (the attacker models the defense). Empirically, defenses based on gradient masking (like defensive distillation) have been broken by adaptive attacks. Athalye et al. (2018) showed that several then-recent published defenses relied on obfuscated gradients rather than true robustness. The lesson: gradient masking is not a valid defense strategy; true robustness requires that the model resist attacks even when its gradients are fully informative to the attacker.
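The false sense of security can be reproduced in a few lines. The sketch below assumes a hypothetical defense that quantizes inputs before a linear classifier, which makes the input gradient zero almost everywhere. A naive gradient attack stalls, while an adaptive attack that treats the quantizer as the identity on the backward pass (the idea behind BPDA, Backward Pass Differentiable Approximation) succeeds. All names and numbers are illustrative.

```python
def quantize(x, levels=10):
    """Hypothetical non-differentiable input preprocessing."""
    return [round(xi * levels) / levels for xi in x]

def score(w, b, x):
    return sum(wi * qi for wi, qi in zip(w, quantize(x))) + b

def sign(v):
    return (v > 0) - (v < 0)

def finite_diff_grad(w, b, x, h=1e-4):
    """What a naive attacker sees: the gradient through the quantizer."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((score(w, b, x_plus) - score(w, b, x_minus)) / (2 * h))
    return grad

def signed_step(x, grad, y, eps):
    """Move each coordinate by eps in the direction that lowers y * score."""
    return [xi - eps * y * sign(gi) for xi, gi in zip(x, grad)]

w, b = [1.0, 1.0], -0.75
x, y, eps = [0.52, 0.31], 1, 0.2  # clean score is positive, label +1

# Naive attack: the masked gradient is zero, so the input does not move.
naive_grad = finite_diff_grad(w, b, x)
x_naive = signed_step(x, naive_grad, y, eps)

# Adaptive attack: pretend quantize() is the identity when differentiating,
# so the gradient of the score with respect to x is approximated by w.
bpda_grad = list(w)
x_bpda = signed_step(x, bpda_grad, y, eps)

print("naive attack fools model:", y * score(w, b, x_naive) < 0)
print("adaptive attack fools model:", y * score(w, b, x_bpda) < 0)
```

The defended model is no more robust than the underlying linear classifier; only the naive attacker's gradient signal was removed.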

Counterexample if False: Defensive distillation provides apparent robustness to FGSM (low success rate), but adaptive attacks (differentiating through teacher model) achieve high success rate, showing it is gradient masking, not true robustness.

Comprehension: Gradient masking is a pitfall; defenses that appear to work against gradient-based attacks may fail against adaptive attacks. Rigorous evaluation requires adaptive attacks.

ML Applications: When evaluating robustness claims, practitioners should apply adaptive attacks (attacker-aware of defense mechanism) to verify that robustness is genuine, not gradient masking.

Failure Mode Analysis: Deploying a defense based on gradient masking, testing only against standard attacks, and claiming robustness leads to a false sense of security. Real attackers (adaptive) will break it.

Traps: Confusing low attack success rate against standard attacks with genuine robustness. Not evaluating against adaptive attacks.

A.14 Final Answer: True.

Full Mathematical Justification: The perturbation budget ε fundamentally determines what robustness is achievable and at what cost. Robustness to a smaller ε (a tighter threat model) is easier to achieve; a larger ε requires more investment and incurs larger accuracy losses. Formally, for a fixed threat model (e.g., ℓ∞), the set of allowed perturbations grows with ε. If ε is very small (~0.01 on a [0, 1] pixel scale, about 2–3 out of 255 intensity levels, barely perceptible), models can achieve robustness with modest accuracy loss (1–3%). For ε = 0.3 (the customary budget on MNIST, but clearly visible on natural images), achieving robustness requires 5–15% accuracy loss. For ε = 1.0 (obviously visible; on a [0, 1] scale an ℓ∞ budget this large lets the attacker reach any image), robustness is even more expensive in accuracy and eventually unattainable. The relationship is not linear; the cost per unit of ε is higher for large ε. This is because as ε grows, the set of perturbations grows, and finding a decision boundary stable to all of them becomes harder. Practically, threat models must specify ε; default choices vary by dataset (ε = 8/255 ≈ 0.03 is standard on CIFAR-10, ε = 0.3 on MNIST). Comparison across different ε values is invalid without careful justification.
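The effect of the budget can be made concrete with a robust-accuracy sweep. The sketch below assumes a hypothetical linear classifier, for which a point is robust under an ℓ∞ budget ε exactly when its margin exceeds ε‖w‖₁; the dataset is a handful of made-up points, and `to_pixel_units` converts a [0, 1]-scale budget to 0–255 intensity levels.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def to_pixel_units(eps, max_value=255):
    """Convert a budget on the [0, 1] scale to intensity levels."""
    return eps * max_value

def robust_accuracy(w, b, data, eps):
    """Exact l_inf robust accuracy; this closed form holds for linear models only."""
    w_l1 = sum(abs(wi) for wi in w)
    robust = sum(1 for x, y in data if y * (dot(w, x) + b) > eps * w_l1)
    return robust / len(data)

# Hypothetical classifier and toy labeled points.
w, b = [1.0, -1.0], 0.0
data = [
    ([0.90, 0.10], 1), ([0.60, 0.40], 1), ([0.55, 0.50], 1),
    ([0.70, 0.20], 1), ([0.45, 0.50], 1),
    ([0.10, 0.80], -1), ([0.40, 0.50], -1), ([0.30, 0.95], -1),
]

budgets = [0.0, 0.01, 8 / 255, 0.1, 0.3]
curve = [robust_accuracy(w, b, data, eps) for eps in budgets]
for eps, acc in zip(budgets, curve):
    print(f"eps={eps:.4f} ({to_pixel_units(eps):.1f}/255)  robust acc={acc:.3f}")
```

Because the curve can only fall as ε grows, a robust-accuracy figure quoted without its ε cannot be compared with one measured at a different budget.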

Counterexample if False: If the statement were false, achieving robustness to ε = 0.01 and to ε = 1.0 would have identical cost; in practice ε = 1.0 is much harder.

Comprehension: The perturbation budget is the scale of the threat; larger budgets are harder to defend against, requiring more sophisticated defenses and accepting larger accuracy losses.

ML Applications: Application-specific threat models should determine ε (e.g., autonomous vehicles need robustness to sensor noise ~0.05, not 0.3). Deploying a model robust to ε = 0.1 in a setting with ε = 0.3 threats leads to vulnerabilities.

Failure Mode Analysis: Comparing robustness claims across different ε values without disclosing ε leads to confusion (a model supposedly 70% robust at ε = 0.05 may be 10% robust at ε = 0.3).

Traps: Assuming robustness to one ε transfers to larger ε (false; robustness is threat-model-dependent). Ignoring the threat model when evaluating robustness.

A.15 Final Answer: False.

Full Mathematical Justification: Adversarial robustness and standard generalization are distinct concepts, though related. Standard generalization asks: does the model generalize from train to test data drawn from the same distribution? Adversarial robustness asks: does the model remain correct under worst-case adversarial perturbations within a threat model? These are orthogonal: (1) A model can have high standard generalization (95% train, 95% test accuracy) but low robustness (0% robust accuracy), because the model relies on non-robust features that generalize well but are unstable. (2) A model can have low standard generalization (overfitting: 99% train, 85% test accuracy) but still have low robustness (non-robust features cause both generalization gap and adversarial vulnerability). (3) Interventions improving one do not necessarily improve the other: early stopping improves generalization but only modestly improves robustness; adversarial training improves robustness but reduces standard generalization. Formally, generalization depends on uniform stability and complexity of the model class; robustness depends on the geometry of decision boundaries and feature stability. While both relate to stability (broadly construed), they are distinct mechanisms. Some recent work suggests that robustness can aid generalization through learned robust features, but the effect is small and not universal.

Counterexample if False: A model with 95% train and 95% test accuracy can have 0% robustness to ℓ∞-ε=0.3 attacks, showing orthogonality between standard generalization and robustness.

Comprehension: Robustness and generalization are related but distinct; work on one does not automatically improve the other.

ML Applications: Practitioners cannot assume that generalizing models are robust. Both generalization and robustness must be explicitly designed for.

Failure Mode Analysis: A system achieving high standard generalization is not secure against adversarial attacks; both properties must be targeted.

Traps: Conflating generalization with robustness (two different properties). Assuming standard ML best practices (train/test split, regularization) automatically provide robustness (they do not).

A.16 Final Answer: True.

Full Mathematical Justification: The minimax formulation (min_θ 𝔼_{x,y} [max_{||δ||≤ε} ℓ(f_θ(x + δ), y)]) for robust optimization is more complex and harder to solve than standard empirical risk minimization (min_θ 𝔼[ℓ(f_θ(x), y)]). The inner maximization (max_δ ℓ) is non-concave for neural networks, requiring iterative approximation (PGD, C&W); the outer gradient is exact only at a true inner maximizer (Danskin’s theorem), so an approximate maximizer yields an approximate gradient. Furthermore, the minimax problem has saddle-point structure, and gradient-based updates are not guaranteed to converge to a solution of it; they may cycle or stall at points that are stationary for the approximate objective but not min-max optimal. Algorithmically, solving minimax problems is significantly harder than solving standard convex or even non-convex minimization problems. Research on robust optimization confirms the increased complexity: adversarial training requires more computational resources (5–10× standard training time), more careful tuning (learning rate schedules for robustness are different), and sometimes more data (sample complexity is higher for robust learning). This increased complexity translates to practical challenges: running adversarial training on very large models (ViT, BERT-scale) is very expensive, and scaling robust learning to ImageNet or larger remains costly and an active research area.
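The quoted training overhead follows from counting gradient computations. As a rough sketch, assume each PGD step costs one forward-backward pass with respect to the input and each parameter update costs one more pass; data loading and other overheads are ignored, and the function name is hypothetical.

```python
def gradient_passes_per_batch(pgd_steps: int) -> int:
    """Forward-backward passes per batch: one per PGD step plus the weight update."""
    if pgd_steps < 0:
        raise ValueError("pgd_steps must be non-negative")
    return pgd_steps + 1

for k in (0, 4, 7, 9):
    print(f"PGD-{k}: about {gradient_passes_per_batch(k)}x the cost of standard training")
```

Under this accounting, inner loops of 4 to 9 steps correspond to the 5–10× range cited above.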

Counterexample if False: Not applicable; the difficulty of robust optimization is well-established.

Comprehension: Robust optimization is inherently harder than standard optimization. Practical defenses often approximate robustness to manage computational cost.

ML Applications: The computational cost of robust training is a practical barrier to deployment. Cost-benefit analysis is necessary: does the gain in robustness justify the training cost?

Failure Mode Analysis: Attempting to robustly train very large models with standard hardware may be infeasible; specialized methods or approximations needed.

Traps: Underestimating the computational cost of adversarial training. Assuming adversarial training is a simple modification of standard training (it introduces significant overhead).

A.17 Final Answer: False.

Full Mathematical Justification: Certified robustness methods can be computationally expensive, contradicting the claim that they are efficient. Randomized smoothing, a common certified defense, requires aggregating predictions over ~100–1000 noisy input samples at test time, increasing inference cost by 100–1000×. Abstract interpretation methods (computing bounds on neural network activations) require computing interval bounds for each neuron, which can be expensive for large networks. Convex relaxations require either propagating linear bounds layer by layer (as in CROWN and DeepPoly) or solving linear programs, adding polynomially more computation than standard inference. While some of these methods are more efficient than solving a full robust optimization problem, they are not universally “efficient” compared to standard inference. The practical cost is typically a 10–100× increase in inference time, which can be prohibitive for real-time applications (autonomous vehicles, robotics). Some certified methods are more efficient than others (e.g., interval bound propagation is much cheaper than LP-based relaxations, at the price of looser bounds), but “certified” and “efficient” should not be assumed synonymous. Recent work on efficient certified defenses aims to reduce this cost, but current state-of-the-art methods are significantly more expensive than standard inference.
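The inference overhead of randomized smoothing is easy to see by counting base-classifier calls. The sketch below uses a hypothetical two-feature threshold classifier wrapped in a call counter; the smoothed prediction is a majority vote over Gaussian-perturbed copies of the input. It omits the statistical test that a full certification procedure would add.

```python
import random
from collections import Counter

class CountingClassifier:
    """Hypothetical base classifier that records how often it is evaluated."""
    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return 1 if x[0] + x[1] > 1.0 else 0

def smoothed_predict(classifier, x, sigma, n_samples, rng):
    """Majority vote of the base classifier over Gaussian-noised inputs."""
    votes = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[classifier(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples

base = CountingClassifier()
label, vote_share = smoothed_predict(
    base, [1.0, 1.0], sigma=0.25, n_samples=1000, rng=random.Random(0)
)
print(label, vote_share, base.calls)
```

One smoothed prediction costs `n_samples` base evaluations, so the latency multiplier is set directly by the sample count.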

Counterexample if False: Randomized smoothing with 1000 samples per input is 1000× more expensive than standard inference.

Comprehension: Certified robustness provides guarantees at the cost of increased computational complexity, especially at test time.

ML Applications: For real-time applications, practitioners must choose between certified robustness (expensive) and empirical robustness (cheaper but weaker guarantees). This tradeoff is application-dependent.

Failure Mode Analysis: Deploying a certified defense expecting standard inference speed leads to latency issues.

Traps: Assuming certified defenses are always efficient (they are not). Confusing theoretical efficiency (better than exponential) with practical efficiency (acceptable latency).

A.18 Final Answer: False.

Full Mathematical Justification: Not all attacks are effective against all models; adversarial attack success depends on the model’s robustness, architecture, and training. Some models are more robust to certain attacks (e.g., a model trained against FGSM is more robust to FGSM than to PGD). Additionally, white-box attacks (with gradient access) are more powerful than black-box attacks (without gradient access); a model robust to black-box attacks may be vulnerable to white-box attacks. Furthermore, adaptive attacks (attacker-aware of defense mechanism) are more powerful than non-adaptive attacks; if a defense is evaluated only against non-adaptive attacks, claims of robustness may be overstated. Finally, certified robustness provides guarantees against all perturbations within the certified bounds; attacks that exceed those bounds operate under a different threat model. Empirically, attack success rates vary: ~70% for FGSM on undefended ImageNet, ~90%+ for PGD, and higher for adaptive attacks. Conversely, a certification procedure (e.g., randomized smoothing) may fail to certify an input even when no attack succeeds on it, because the certified bound is loose. The claim of universal attack success is too strong.

Counterexample if False: A model fine-tuned for ℓ∞-ε=0.3 robustness has low vulnerability to small ℓ∞ perturbations but high vulnerability to large ones (ε=1.0) or ℓ2 perturbations.

Comprehension: Attack effectiveness depends on threat models, model robustness, and attack type (white-box vs. black-box, adaptive vs. non-adaptive).

ML Applications: Evaluating robustness requires testing against multiple attack types and threat models. A model robust to one attack may be vulnerable to others.

Failure Mode Analysis: Assuming that a defense successful against FGSM is robust (false; PGD and other attacks likely break it). This leads to false confidence.

Traps: Over-generalizing from one successful attack to all attacks. Applying attack labels (“FGSM is weak”, “PGD is strong”) universally (relative strength depends on the model and the threat model).

A.19 Final Answer: True.

Full Mathematical Justification: An ensemble of models typically provides better robustness than a single model, especially when the models are diverse. Formally, an ensemble’s prediction is aggregated (e.g., majority vote): f_ens(x) = argmax_c ∑_i 1[f_i(x) = c]. If individual models f_i have different decision boundaries and non-overlapping vulnerabilities, an adversarial example that fools one model may not fool the others, reducing the attack’s success rate against the ensemble. This is the defense-via-diversity principle. Empirically, an ensemble of ~5 diverse models (different architectures, training data orders, regularization) achieves ~30% higher robust accuracy than individual models. The mechanism: adversarial perturbations exploit model-specific vulnerabilities; diversity means vulnerabilities are not shared. However, ensemble robustness is not perfect; adversarial examples can transfer across models, and adaptive attacks can find perturbations that fool multiple models jointly (maximizing the loss of the ensemble as a whole). Still, ensemble robustness is substantially better than single-model robustness, especially when models are diverse. Computationally, ensembles are expensive (M times the cost of a single model), but the robustness gain often justifies the cost.
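The majority-vote rule f_ens can be sketched directly. The three member classifiers below are hypothetical threshold rules chosen to rely on different features; ties are broken toward the smallest label, which is an arbitrary choice made for this illustration.

```python
from collections import Counter

def ensemble_predict(models, x):
    """Majority vote: argmax_c sum_i 1[f_i(x) = c], ties to the smallest label."""
    votes = Counter(model(x) for model in models)
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)

# Hypothetical diverse members, each relying on a different feature.
def f1(x):
    return 1 if x[0] > 0.5 else 0

def f2(x):
    return 1 if x[1] > 0.5 else 0

def f3(x):
    return 1 if x[0] + x[1] > 1.0 else 0

models = [f1, f2, f3]
clean = [0.60, 0.80]            # all three members predict class 1
single_target = [0.45, 0.80]    # perturbs feature 0 only, fools f1
joint_attack = [0.25, 0.45]     # larger perturbation of both features

print(f1(single_target), ensemble_predict(models, single_target))
print(ensemble_predict(models, joint_attack))
```

The perturbation that flips f1 leaves the vote unchanged, while the larger joint perturbation flips every member, mirroring the caveat that ensembles reduce but do not eliminate vulnerability.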

Counterexample if False: Not applicable; ensemble robustness is well-established empirically.

Comprehension: Ensemble methods reduce adversarial vulnerability via diversity. The principle is analogous to ensemble methods in standard learning (ensemble generalization from diversity).

ML Applications: Production robust systems often use ensembles. Example: autonomous vehicles might use multiple classifiers with diverse architectures; an adversarial perturbation must fool all to succeed.

Failure Mode Analysis: An ensemble of identical models (no diversity) provides no robustness benefit; diversity is essential.

Traps: Assuming ensembles guarantee robustness (they reduce, not eliminate, vulnerability). Conflating ensemble robustness with certified robustness (ensembles are empirical, not certified).

A.20 Final Answer: False.

Full Mathematical Justification: Determining whether a predicted label is “trustworthy” based solely on model confidence (softmax probability) is unreliable. A neural network can output high confidence on incorrect predictions, especially on out-of-distribution or adversarial examples. Formally, let p denote the softmax probability assigned to the predicted class. On clean inputs p ∈ [0, 1] may track correctness reasonably well, but on an adversarial example the model can assign p close to 1 to a wrong class. High confidence is therefore not a marker of correctness. Empirically, models misclassify adversarial examples with high confidence (>95% softmax probability). A practical example: FGSM attacks can fool neural networks into wrong predictions made with 99%+ confidence. Confidence calibration (training so confidence matches accuracy) can partially address this, but calibration itself depends on the distribution; models can be well-calibrated on the training distribution but miscalibrated on adversarial examples. Therefore, using confidence as a trustworthiness signal is fundamentally unsound. Valid approaches to identify trustworthy predictions include: (1) certified robustness (formal guarantees), (2) abstention (model says “I don’t know”), (3) out-of-distribution detection, (4) multiple-model consensus. But raw confidence (softmax probability) alone is not a valid signal.
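A small numeric example shows how a perturbed input can receive high confidence for the wrong class. The two-class linear logit function, its scale factor, and the perturbation below are hypothetical; the point is only that softmax confidence measures distance from the decision boundary, not correctness.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits(x, scale=10.0):
    """Hypothetical two-class linear model."""
    return [scale * (x[0] - x[1]), scale * (x[1] - x[0])]

def predict_with_confidence(x):
    probs = softmax(logits(x))
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]

true_class = 0
x_clean = [0.60, 0.40]
x_adv = [0.35, 0.65]  # l_inf perturbation of 0.25 on each coordinate

clean_label, clean_conf = predict_with_confidence(x_clean)
adv_label, adv_conf = predict_with_confidence(x_adv)
print(clean_label, round(clean_conf, 4))
print(adv_label, round(adv_conf, 4))
```

A filter that accepts every prediction above a fixed confidence threshold would accept the perturbed input, because its confidence is as high as that of the clean one.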

Counterexample if False: An adversarial example with 99% predicted probability of wrong class is high-confidence but incorrect, contradicting the claim.

Comprehension: Model confidence does not imply correctness. Adversarial examples are a prime illustration of this gap.

ML Applications: Practitioners should not assume that confident predictions are correct. Critical systems require additional robustness mechanisms beyond confidence scoring.

Failure Mode Analysis: Deploying a system that rejects low-confidence predictions but accepts high-confidence ones, without other robustness mechanisms, fails on adversarial examples (which have high confidence).

Traps: Conflating confidence with trustworthiness (distinct concepts). Assuming calibration on training distribution transfers to adversarial distribution (false).

End of A Solutions

Solutions to B. Proof Problems

B.1 Full Formal Proof: Let $f: \mathbb{R}^d \to \mathbb{R}^k$ be $L$-Lipschitz. For $x_1, x_2$ with $\|x_1 - x_2\|_2 \leq \epsilon$, Lipschitz continuity gives $\|f(x_1) - f(x_2)\|_2 \leq L\epsilon$. Consider the set $S = B(f(x_1), \epsilon) \cap B(f(x_2), L\epsilon)^c$, i.e., the points $y$ with $\|y - f(x_1)\|_2 \leq \epsilon$ and $\|y - f(x_2)\|_2 > L\epsilon$. By the triangle inequality, every $y \in S$ satisfies $\|y - f(x_2)\|_2 \leq \|y - f(x_1)\|_2 + \|f(x_1) - f(x_2)\|_2 \leq \epsilon + L\epsilon = (1+L)\epsilon$. Hence $S$ is bounded and lies in the annulus $\{y : L\epsilon < \|y - f(x_2)\|_2 \leq (1+L)\epsilon\}$ centered at $f(x_2)$, whose radial width is $\epsilon$. Emptiness: $S = \emptyset$ exactly when $B(f(x_1), \epsilon) \subseteq B(f(x_2), L\epsilon)$, which for closed Euclidean balls holds if and only if $\|f(x_1) - f(x_2)\|_2 + \epsilon \leq L\epsilon$, i.e., $\|f(x_1) - f(x_2)\|_2 \leq (L-1)\epsilon$; otherwise $S$ is non-empty but bounded. Isolated points: $S$ is the intersection of a closed ball of positive radius with an open set. Each $y \in S$ has a neighborhood on which $\|\cdot - f(x_2)\|_2 > L\epsilon$ still holds, and that neighborhood contains other points of $B(f(x_1), \epsilon)$, so $y$ is not isolated. Sharp bound: the boundary of $S$ is contained in the union of the spheres $\partial B(f(x_1), \epsilon)$ and $\partial B(f(x_2), L\epsilon)$, which are $(k-1)$-dimensional and have Lebesgue measure zero in $\mathbb{R}^k$; and, by the argument above, the set of isolated points of $S$ has cardinality 0.

Proof Strategy & Techniques: Uses Lipschitz continuity and triangle inequality as foundation. Geometric argument via balls and annuli. Measure-theoretic argument for cardinality via Hausdorff dimension reduction. The key insight is that non-isolated points form a lower-dimensional set; isolated points have measure zero.

Computational Validation: Consider $f(x) = 2x$ (Lipschitz with $L=2$), $x_1 = 0, x_2 = 0.5$. Then $B(f(x_1), 1) = B(0, 1) = [-1, 1]$ and $B(f(x_2), 1) = B(1, 1) = [0, 2]$. Points in $[-1, 1]$ but outside $[0, 2]$ form $[-1, 0)$, with length 1 = $(L-1)\epsilon$ for $\epsilon = 1$, confirming the bound. No isolated points arise, since this set is an interval.

ML Interpretation: Lipschitz-constrained models have stable decision boundaries. If $f$ is a classifier output, this theorem ensures that perturbing inputs slightly doesn’t create isolated decision boundary regions; robustness holds uniformly. Lipschitz constraints prevent sharp transitions.

Generalization & Edge Cases: (1) For $L \leq 1$, the annulus inner radius is non-positive; all points in $B(f(x_1), \epsilon)$ lie in $B(f(x_2), L\epsilon)$, so the set is empty. (2) For non-compact domains, sets may have infinite measure. (3) For non-smooth $f$, isolated points might exist at corners; measure-theoretic argument still holds.

Failure Mode Analysis: If the Lipschitz constant $L$ is overestimated, the bound becomes loose; if it is underestimated, the bound is no longer valid. If $f$ has discontinuities, the Lipschitz assumption breaks, invalidating the proof. In practice, estimating $L$ for neural networks is non-trivial; spectral normalization provides an upper bound, which is generally not tight.

Historical Context: Lipschitz continuity is foundational in robust optimization (Cevher et al., Tsipras et al.). The use of Lipschitz constants to derive robustness certificates became standard post-2018. This problem is inspired by works on certified robustness via Lipschitz bounds (e.g., CROWN, DeepPoly).

Traps: Confusing cardinality (discrete count) with measure (continuous). Assuming isolated points can accumulate to form a continuum. Forgetting that smooth functions reduce the dimensionality of boundary sets. Over-generalizing to non-Lipschitz functions without additional regularity.

B.2 Full Formal Proof: Let $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$, where each layer $f_i(x) = \sigma(\mathbf{W}_i x + b_i)$ with $\|\mathbf{W}_i\|_{\text{sp}} \leq 1$ (spectral norm); here $L$ denotes the number of layers. By definition of the spectral norm: $\|\mathbf{W}_i x\|_2 \leq \|\mathbf{W}_i\|_{\text{sp}} \|x\|_2 \leq \|x\|_2$. If the elementwise activation $\sigma$ is 1-Lipschitz (e.g., ReLU satisfies $|\sigma(a) - \sigma(b)| \leq |a - b|$), then $\|\sigma(\mathbf{W}_i x_1 + b_i) - \sigma(\mathbf{W}_i x_2 + b_i)\|_2 \leq \|\mathbf{W}_i(x_1 - x_2)\|_2 \leq \|x_1 - x_2\|_2$, so each layer is 1-Lipschitz. Composing the $L$ layers: $\text{Lip}(f) \leq \prod_{i=1}^{L} \text{Lip}(f_i) \leq 1 \cdot 1 \cdots 1 = 1$. Tightness: Achieved when all $\|\mathbf{W}_i\|_{\text{sp}} = 1$ AND $\sigma$ saturates the 1-Lipschitz bound along the relevant directions. For ReLU, saturation occurs only on active units (derivative 1); inactive units (derivative 0) shrink the difference. A sufficient condition for tightness is that all weight matrices have singular values equal to 1 and activation derivatives equal 1 almost everywhere.

Proof Strategy & Techniques: Induction on layer composition. Uses spectral norm submultiplicativity: $\|\mathbf{A}\mathbf{B}\|_{\text{sp}} \leq \|\mathbf{A}\|_{\text{sp}} \|\mathbf{B}\|_{\text{sp}}$. Leverages 1-Lipschitz property of common activations. The tightness characterization requires analyzing when the chain achieves equality in all inequalities.

Computational Validation: Consider a 2-layer network with $\mathbf{W}_1 = [[0.5, 0.5], [0.5, 0.5]]$ (spectral norm = 1), $\mathbf{W}_2 = [[1, 0]]$ (spectral norm = 1), and a ReLU (1-Lipschitz) between them: $f(x) = \mathbf{W}_2\, \text{ReLU}(\mathbf{W}_1 x)$. Direct calculation: $\|f(x_1) - f(x_2)\|_2 \leq 1 \cdot \|x_1 - x_2\|_2$, confirming the bound.
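
The two-layer example can be checked numerically. The sketch below is a minimal illustration, assuming the network is $f(x) = \mathbf{W}_2\,\text{ReLU}(\mathbf{W}_1 x)$ with no bias terms; the random seed, sample count, and variable names are arbitrary choices, not part of the original problem.

```python
import numpy as np

# Weights from the example; both have spectral norm 1.
W1 = np.array([[0.5, 0.5], [0.5, 0.5]])
W2 = np.array([[1.0, 0.0]])

def f(x):
    """Two-layer network f(x) = W2 @ ReLU(W1 @ x), biases omitted (assumption)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

# For a matrix, ord=2 returns the largest singular value (the spectral norm).
sp1 = np.linalg.norm(W1, 2)
sp2 = np.linalg.norm(W2, 2)

# Empirical Lipschitz ratio over random input pairs.
rng = np.random.default_rng(0)
max_ratio = 0.0
for _ in range(10_000):
    x1, x2 = rng.normal(size=2), rng.normal(size=2)
    ratio = np.linalg.norm(f(x1) - f(x2)) / np.linalg.norm(x1 - x2)
    max_ratio = max(max_ratio, ratio)

print(sp1, sp2, max_ratio)
```

The observed ratio never exceeds $1/\sqrt{2} \approx 0.707$: the product bound of 1 is valid but not tight here, because the direction read out by $\mathbf{W}_2$ is not aligned with the top singular direction of $\mathbf{W}_1$.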

ML Interpretation: Spectral normalization is a practical technique for enforcing Lipschitz constraints on neural networks, essential for GAN training and adversarial robustness. This proof shows that spectral normalization per layer provably yields global Lipschitz bound. GANs use this to stabilize training; robust classifiers use it to achieve certified robustness.

Generalization & Edge Cases: (1) Other activations (e.g., Tanh, which is also 1-Lipschitz but saturates): the bound still holds whenever the activation’s Lipschitz constant is ≤ 1. (2) Sigmoid: its Lipschitz constant is $1/4 < 1$, so the network’s Lipschitz constant is strictly below 1. (3) Skip connections: a residual block $x \mapsto x + g(x)$ has Lipschitz constant at most $1 + \text{Lip}(g)$, so the bound of 1 does not carry over unless the branches are rescaled (e.g., averaged); tightness analysis also becomes more complex. (4) Bias terms $b_i$: don’t affect the Lipschitz constant, since a translation cancels in the difference $f_i(x_1) - f_i(x_2)$ before the activation.

Failure Mode Analysis: If spectral normalization is not applied, $\|\mathbf{W}_i\|_{\text{sp}}$ can exceed 1, violating the bound. In practice, spectral normalization adds computational overhead (~10-20% per training epoch). Matrices with near-singular spectra (e.g., $\sigma_1 \approx 1, \sigma_2, \sigma_3, \ldots \approx 0$) satisfy the bound with room to spare; tightness requires more careful design.

Historical Context: Spectral normalization for GANs introduced by Miyato et al. (2018). The connection to Lipschitz constraints was formalized in Gouk et al. (2019) and others studying certified robustness. This proof formalizes the folklore intuition that “spectral norm ≤ 1 per layer ⟹ global Lipschitz ≤ 1.”

Traps: Assuming bias terms affect Lipschitz constant (they don’t). Confusing spectral norm of a composition with product of spectral norms (submultiplicativity, not multiplicativity). Forgetting that Lipschitz-1 activations are critical; many standard activations satisfy this, but not all. Believing tightness is always achievable; it requires careful construction.

B.3 Full Formal Proof (Theorem 7 multi-class): Let $f(x) \in \mathbb{R}^k$ output logits for $k$ classes, let $y^*$ be the true class, and define margin $m(x) = f(x)_{y^*} - \max_{j \neq y^*} f(x)_j$. The network is $L$-Lipschitz. For any perturbation $\delta$ with $\|\delta\|_p \leq \epsilon$: $|f(x+\delta)_i - f(x)_i| \leq L\epsilon$ for all $i$. Thus $f(x+\delta)_{y^*} \geq f(x)_{y^*} - L\epsilon$ and $f(x+\delta)_j \leq f(x)_j + L\epsilon$ for $j \neq y^*$. Margin at perturbed point: $m(x+\delta) = f(x+\delta)_{y^*} - \max_{j \neq y^*} f(x+\delta)_j \geq [f(x)_{y^*} - L\epsilon] - [\max_{j \neq y^*} f(x)_j + L\epsilon] = m(x) - 2L\epsilon$. If $m(x) > 2L\epsilon$, then $m(x+\delta) > 0$, ensuring correct classification. Certified radius: $r = m(x) / (2L)$, i.e., every perturbation with $\epsilon < r$ leaves the prediction unchanged. At decision boundary ($m(x) = 0$): $r = 0$ and there is no guarantee. For negative margin ($m(x) < 0$): input is already misclassified, so no positive radius can be certified. Rigorous bound: if $m(x) \geq 0$, certified robustness radius is $r(x) = m(x) / (2L)$.

Proof Strategy & Techniques: Leverages Lipschitz continuity to bound movement of logits. Uses max function properties to bound worst-case margin degradation. The resulting radius is linear in the margin and inversely proportional to the Lipschitz constant. The key is bounding both the true-class logit (lower bound via $-L\epsilon$) and competing classes (upper bound via $+L\epsilon$), then using their difference.

Computational Validation: Consider 2-class MNIST with $f(x) = \mathbf{w}^T \phi(x)$, $\|\mathbf{w}\|_2 = 1$, $\phi$ is 1-Lipschitz (e.g., one ReLU layer). For some input $x$: $f(x)_{\text{true}} = 5, f(x)_{\text{false}} = 2$, margin $m = 3$, $L = 1$. Certified radius: $r = 3/2 = 1.5$. Perturbations with $\|\delta\|_2 < 1.5$ preserve correct classification. Direct check at the boundary $\epsilon = 1.5$: $f(x+\delta)_{\text{true}} \geq 5 - 1.5 = 3.5$ and $f(x+\delta)_{\text{false}} \leq 2 + 1.5 = 3.5$, so margin $\geq 0$.
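
The certificate $r = m(x)/(2L)$ is simple enough to compute directly. The following is a minimal sketch; the function name `certified_radius` and its arguments are illustrative, and the Lipschitz constant is assumed to be a known upper bound supplied by the caller.

```python
import numpy as np

def certified_radius(logits, true_class, lipschitz):
    """Margin-based radius r = m / (2L); returns 0 when the margin is not positive."""
    logits = np.asarray(logits, dtype=float)
    best_other = np.delete(logits, true_class).max()   # max over j != y*
    margin = logits[true_class] - best_other
    return max(margin, 0.0) / (2.0 * lipschitz)

# Numbers from the worked example: logits (5, 2), L = 1.
r = certified_radius([5.0, 2.0], true_class=0, lipschitz=1.0)

# Worst-case logits permitted by the Lipschitz bound at budget r.
worst_true = 5.0 - 1.0 * r
worst_other = 2.0 + 1.0 * r
print(r, worst_true, worst_other)   # 1.5 3.5 3.5

# A misclassified input (negative margin) gets no certificate.
r_wrong = certified_radius([1.0, 3.0], true_class=0, lipschitz=1.0)
```

At exactly $\epsilon = r$ the two worst-case logits meet, which is why the guarantee is stated for perturbations strictly inside the radius.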

ML Interpretation: Certified robustness radius depends on both model confidence (margin) and sensitivity (Lipschitz constant). High-margin classifiers with low Lipschitz constants achieve large certified radii. This motivates both margin-based training (maximize margin) and Lipschitz regularization (minimize $L$). The theorem provides a certificate: verifiable proof that no adversarial example exists within radius $r$.

Generalization & Edge Cases: (1) Negative margin: classifier is already wrong; certified radius is undefined or zero. (2) At decision boundary ($m = 0$): certified radius is zero, as expected (infinitesimal perturbations can change classification). (3) Very large margin ($m \gg 2L\epsilon$): certified radius scales linearly with margin. (4) Different $\ell_p$ norms: bound is agnostic to norm type; only Lipschitz constant changes depending on norm.

Failure Mode Analysis: If margin estimate is loose (e.g., using a lower bound $m_{\text{est}} < m_{\text{true}}$), certified radius becomes conservative. If Lipschitz constant $L$ is overestimated, radius shrinks unnecessarily. In practice, computing exact $L$ for neural networks is NP-hard; upper bounds are used, making the radius potentially loose. Networks trained on adversarial examples may have high margins but very high Lipschitz constants, offsetting the robustness gain.

Historical Context: Margin-based robustness is classical (Kolter & Wong, 2017, and earlier work on large-margin classifiers). Theorem 7 formalizes this connection. The multi-class extension to the framework is standard in certified robustness literature (CROWN, Zonotope abstract domains).

Traps: Confusing margin-based certification with empirical robustness (different concepts). Assuming negative margin implies un-robustifiable status (true for certified robustness, but empirical training might rescue it). Believing certified radius is tight; it depends on Lipschitz constant tightness. Over-interpreting confidence (softmax probability) as margin; margin in logit space is the right notion.

B.4 Full Formal Proof (Robust optimization duality): Primal form: $\min_\theta \mathbb{E}_{x,y} [\max_{\delta \in \mathcal{C}} \ell(f_\theta(x + \delta), y)]$ where $\mathcal{C} = \{\delta : \|\delta\|_p \leq \epsilon\}$ is convex and compact. For fixed $(x, y)$ this reads $\min_\theta \max_{\delta \in \mathcal{C}} \ell(f_\theta(x+\delta), y)$. Define the worst-case perturbation $\delta^*(x, y; \theta) = \arg\max_{\delta \in \mathcal{C}} \ell(f_\theta(x+\delta), y)$, so that the primal objective is $\mathbb{E}[\ell(f_\theta(x+\delta^*), y)]$. Exchanging the order of optimization requires a minimax theorem: the objective must be convex in $\theta$ and concave in $\delta$ (convexity of $\mathcal{C}$ and of $\ell$ in $\theta$ alone is not enough), and $\mathcal{C}$ must be convex and compact, with non-empty interior containing $\delta = 0$ (always true for $\epsilon > 0$). Under these conditions strong duality holds: $\min_\theta \max_\delta \ell(f_\theta(x+\delta), y) = \max_\delta \min_\theta \ell(f_\theta(x+\delta), y)$. Dual problem: For each $\delta \in \mathcal{C}$, find $\theta$ minimizing $\ell(f_\theta(x+\delta), y)$. The outer max over $\delta$ finds the worst-case perturbation. Equivalently: $\max_{\delta \in \mathcal{C}} \min_\theta \ell(f_\theta(x+\delta), y) = \min_\theta \max_{\delta \in \mathcal{C}} \ell(f_\theta(x+\delta), y)$.

Proof Strategy & Techniques: Leverages saddle-point (minimax) duality. Requires convexity of $\ell$ in $\theta$, concavity in $\delta$, and convexity and compactness of $\mathcal{C}$. The key insight is identifying the min-max structure and applying standard duality theory. The regularity condition on $\mathcal{C}$ is satisfied trivially since $\ell_p$-balls are compact and their interior is non-empty and contains 0.

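Computational Validation: Linear case: $\ell(f(x+\delta), y) = (w^T(x+\delta) - y)^2$. Primal: $\min_w \max_{\|\delta\|_2 \leq \epsilon} (w^T(x+\delta) - y)^2 = \min_w (|w^T x - y| + \epsilon \|w\|_2)^2 = \min_w \left[(w^T x - y)^2 + 2\epsilon \|w\|_2 |w^T x - y| + \epsilon^2 \|w\|_2^2\right]$, which equals $y^2 \min(1, \epsilon/\|x\|_2)^2$. Dual: $\max_{\|\delta\|_2 \leq \epsilon} \min_w (w^T(x+\delta) - y)^2$. For fixed $\delta$ with $x + \delta \neq 0$, $\min_w$ is attained at $w^* = \frac{y}{(x+\delta)^T (x+\delta)}(x+\delta)$ with objective 0, so the dual value is 0 whenever $\|x\|_2 > \epsilon$. The two values therefore differ when $y \neq 0$ and $\|x\|_2 > \epsilon$: the squared loss is convex rather than concave in $\delta$, so this example exhibits weak duality ($\max \min \leq \min \max$) with a strictly positive gap rather than equality.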
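
A numerical check of this linear example is worthwhile because the squared loss is convex, not concave, in $\delta$, so equality of the two orders of optimization is not guaranteed. The sketch below evaluates both; the specific values of $x$, $y$, $\epsilon$, the grids, and the function names are hypothetical choices made only for illustration.

```python
import numpy as np

def worst_case_loss(w, x, y, eps):
    """Closed form of max over ||delta||_2 <= eps of (w^T (x + delta) - y)^2."""
    return (abs(w @ x - y) + eps * np.linalg.norm(w)) ** 2

def primal_value(x, y, eps, t_grid):
    """min over w of the worst-case loss, searching only w = t * x / ||x||.

    Restricting to multiples of x is an assumption of this sketch; it loses
    nothing here because components of w orthogonal to x only increase ||w||.
    """
    u = x / np.linalg.norm(x)
    return min(worst_case_loss(t * u, x, y, eps) for t in t_grid)

def dual_value(x, y, eps, n_dirs=360):
    """max over sampled delta on the eps-sphere of min over w (closed-form w*)."""
    best = -np.inf
    for angle in np.linspace(0.0, 2.0 * np.pi, n_dirs, endpoint=False):
        delta = eps * np.array([np.cos(angle), np.sin(angle)])
        z = x + delta
        w_star = y * z / (z @ z)            # fits the single perturbed point exactly
        best = max(best, (w_star @ z - y) ** 2)
    return best

x, y, eps = np.array([3.0, 4.0]), 2.0, 1.0   # ||x||_2 = 5 > eps
primal = primal_value(x, y, eps, np.linspace(-2.0, 2.0, 40001))
dual = dual_value(x, y, eps)
print(primal, dual)
```

For these values the primal comes out near $y^2(\epsilon/\|x\|_2)^2 = 0.16$ while the dual is numerically zero, so only the weak-duality inequality holds in this instance.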

ML Interpretation: The dual formulation provides an alternative optimization perspective. In some settings, direct dual optimization may be easier than solving the primal minimax. Understanding duality informs algorithm design for adversarial training. The dual suggests that for fixed perturbations, finding optimal parameters is convex (if $\ell$ is convex); alternating optimization inspired by duality has practical relevance.

Generalization & Edge Cases: (1) Non-convex loss (e.g., 0-1 loss): strong duality fails. (2) Non-convex $\mathcal{C}$ (e.g., $\ell_0$ ball): duality may not hold. (3) Unbounded sets: Slater’s condition may fail. (4) Stochastic setting: primal and dual become suprema/infima over distributions; duality holds in expectations.

Failure Mode Analysis: In practice, neural networks are non-convex in $\theta$; strong duality doesn’t hold exactly. The inequality $\min \max \geq \max \min$ (weak duality) always holds, but the gap (duality gap) can be large. Approximately solving the dual provides a lower bound on primal robustness, useful for verification but not tight for practical models.

Historical Context: Minimax duality for robust optimization is classical (Bertsimas & Sim, 2004; and earlier). Application to adversarial robustness comes from Madry et al. (2018), who optimize the primal directly via PGD. Dual perspectives have been explored for certified methods (e.g., convex relaxations).

Traps: Assuming duality holds for non-convex neural networks (it doesn’t exactly). Confusing strong duality with optimal solution characterization. Forgetting that even if duality holds theoretically, computing the dual solution may be as hard as primal. Overcomplicating: for practical adversarial training, the primal (min-max) form is typically more tractable.

B.5 Full Formal Proof (Theorem 3): Let $\ell$ be twice-differentiable in its input. Second-order Taylor expansion: $\ell(f(x+\epsilon \delta), y) = \ell(f(x), y) + \epsilon (\nabla_x \ell)^T \delta + \frac{\epsilon^2}{2} \delta^T \nabla_x^2 \ell\, \delta + O(\epsilon^3)$. For $\delta = \text{sign}(\nabla_x \ell)$ we have $(\nabla_x \ell)^T \delta = \|\nabla_x \ell\|_1$, so $\ell(f(x+\epsilon \delta), y) = \ell(f(x), y) + \epsilon \|\nabla_x \ell\|_1 + O(\epsilon^2)$. The remainder bound: $|O(\epsilon^2)| \leq \frac{\epsilon^2}{2}\|\delta\|_2^2 \|\nabla_x^2 \ell\|_{\text{op}} + O(\epsilon^3) \leq \frac{\epsilon^2}{2} d H + O(\epsilon^3)$, where $H = \|\nabla_x^2 \ell\|_{\text{op}}$ (operator norm of the Hessian, i.e., largest eigenvalue in absolute value) and $\|\delta\|_2^2 \leq d$ for a sign vector. Thus: $|\ell(f(x+\epsilon\, \text{sign}(\nabla_x \ell)), y) - (\ell(f(x), y) + \epsilon \|\nabla_x \ell\|_1)| \leq C\epsilon^2 H$ with $C = d/2$. Explicit expression for $C$: For a function $g: \mathbb{R}^d \to \mathbb{R}$, if $\|D^2 g\|_{\text{op}} \leq H$ along the segment from $x$ to $x + v$, then $|g(x + v) - g(x) - (\nabla g(x))^T v| \leq \frac{H}{2}\|v\|_2^2$ for any $v$. Here $\|v\|_2 = \epsilon \|\delta\|_2 \leq \epsilon \sqrt{d}$ (for $\delta = \text{sign}(\nabla_x \ell)$), so the bound is $\frac{H}{2} \epsilon^2 d$. To be precise: $C = \frac{d}{2}$ when $H$ is the $\ell_2$ operator norm; $C = \frac{1}{2}$ holds only if $H$ is instead defined to bound $|\delta^T \nabla_x^2 \ell\, \delta|$ over sign vectors $\delta$.

Proof Strategy & Techniques: Uses Taylor expansion to second order. The key step is recognizing that $\delta = \text{sign}(\nabla_x \ell)$ aligns perturbation with gradient (maximizing first-order increase). Hessian norm bounds the second-order error. The proof is a straightforward application of Taylor’s theorem with Lagrange remainder.

Computational Validation: Consider $\ell = (w^T x - y)^2$ with $w$ fixed and $w^T x - y > 0$. $\nabla_x \ell = 2w(w^T x - y)$, $\nabla_x^2 \ell = 2ww^T$, $\|\nabla_x^2 \ell\|_{\text{op}} = 2\|w\|_2^2 = H$. FGSM perturbation: $\delta = \text{sign}(\nabla_x \ell) = \text{sign}(w)$. Perturbed loss: $\ell(x + \epsilon\, \text{sign}(w)) = (w^T(x + \epsilon\, \text{sign}(w)) - y)^2 = (w^T x - y + \epsilon \|w\|_1)^2 = (w^T x - y)^2 + 2\epsilon (w^T x - y)\|w\|_1 + \epsilon^2 \|w\|_1^2$. The first two terms are the first-order prediction $\ell + \epsilon \|\nabla_x \ell\|_1$; the remainder $\epsilon^2 \|w\|_1^2 \leq d\, \epsilon^2 \|w\|_2^2 = \frac{d}{2}\epsilon^2 H$ matches the $O(\epsilon^2)$ bound with $C = d/2$.
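
The quadratic example lends itself to a direct numerical check of the first-order approximation and its remainder. This is a minimal sketch: the dimension, seed, and $\epsilon$ are arbitrary, and $y$ is chosen (as an assumption of the sketch) so that $w^T x - y = 1 > 0$.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed
d = 5
w = rng.normal(size=d)
x = rng.normal(size=d)
y = float(w @ x) - 1.0                   # makes the residual w^T x - y equal to 1
eps = 0.01

def loss(v):
    """Squared loss (w^T v - y)^2 as a function of the input v."""
    return (w @ v - y) ** 2

grad = 2.0 * w * (w @ x - y)             # gradient of the loss with respect to x
H = 2.0 * (w @ w)                        # operator norm of the Hessian 2 w w^T
delta = np.sign(grad)                    # FGSM direction

actual = loss(x + eps * delta)
first_order = loss(x) + eps * np.abs(grad).sum()
remainder = actual - first_order         # equals eps^2 * ||w||_1^2 for this loss
bound = (d / 2.0) * eps**2 * H           # C * eps^2 * H with C = d/2

print(remainder, bound)
```

The remainder is non-negative here because the Hessian $2ww^T$ is positive semidefinite, and it stays below the $C = d/2$ bound since $\|w\|_1^2 \leq d\|w\|_2^2$.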

ML Interpretation: FGSM (Fast Gradient Sign Method) is a one-step attack exploiting first-order model structure. The theorem justifies FGSM’s effectiveness for small $\epsilon$ and Hessian-bounded losses, showing that first-order approximation captures most of the attack success. As $\epsilon$ grows, second-order terms matter more, explaining why iterative methods (PGD) outperform FGSM for large $\epsilon$.

Generalization & Edge Cases: (1) Non-differentiable $\ell$ (e.g., hinge loss): first-order approximation is undefined at non-smooth points. (2) Very large Hessian norm: remainder term dominates even for modest $\epsilon$. (3) High dimensions ($d$ large): $C$ scales with $d$, making the remainder larger. (4) Negative definite Hessian: loss can decrease; $\text{sign}(\nabla_x \ell)$ may not be optimal.

Failure Mode Analysis: If the Hessian norm $H$ is overestimated, the claimed $O(\epsilon^2)$ bound becomes loose; if it is underestimated, the bound is invalid. For neural networks with ReLU activations, the Hessian is not defined everywhere (it fails on a measure-zero set of kinks); in those cases, the approximation is heuristic. Gradient masking (intentionally making $\nabla_x \ell$ misleading) violates the theorem’s assumptions.

Historical Context: FGSM proposed by Goodfellow et al. (2014) without rigorous justification. Theorem 3 provides a theoretical foundation for FGSM’s utility. The first-order approximation framework is central to understanding why gradient-based attacks work.

Traps: Assuming $C = 1/2$ universally; with the $\ell_2$ operator norm and a sign-vector perturbation it is $d/2$. Confusing $\epsilon \|\nabla_x \ell\|_1$ (the first-order increase) with the actual loss increase (which carries an $O(\epsilon^2)$ correction). Believing FGSM is optimal; iterated PGD is typically stronger. Forgetting that the bound is one-sided (upper bound on loss increase via first-order plus error); lower bounds are different.

B.6 Full Formal Proof (Stability and Generalization): Let $\mathcal{A}$ be $\epsilon$-uniformly stable: $|\ell(f_S(x), y) - \ell(f_{S'}(x), y)| \leq \epsilon$ for any $S, S'$ differing in one example and any $(x, y)$, and assume the loss is bounded, $0 \leq \ell \leq M$. Define the generalization gap $G = R(f_S) - \hat{R}(f_S) = \mathbb{E}_{x,y}[\ell(f_S(x), y)] - \frac{1}{m} \sum_i \ell(f_S(x_i), y_i)$. Lemma (Symmetrization): let $S^{(i)}$ be $S$ with $(x_i, y_i)$ replaced by a fresh draw from $D$. Because the fresh draw and $(x_i, y_i)$ are exchangeable, $\mathbb{E}_S[R(f_S)] = \mathbb{E}[\ell(f_{S^{(i)}}(x_i), y_i)]$, hence $\mathbb{E}_S[G] = \frac{1}{m}\sum_i \mathbb{E}[\ell(f_{S^{(i)}}(x_i), y_i) - \ell(f_S(x_i), y_i)]$ and, by stability, $|\mathbb{E}_S[G]| \leq \epsilon$. This controls the gap only in expectation; for a high-probability statement, apply McDiarmid’s inequality. Bounded differences: replacing one example of $S$ changes $R(f_S)$ by at most $\epsilon$ and $\hat{R}(f_S)$ by at most $\epsilon + M/m$ (the $m - 1$ unchanged terms each move by at most $\epsilon$, the replaced term by at most $M$), so $G$ changes by at most $c = 2\epsilon + M/m$. By McDiarmid: $\Pr[G - \mathbb{E}_S[G] > t] \leq \exp\left(-\frac{2t^2}{m c^2}\right)$. Setting the right-hand side to $\delta$ gives $t = (2m\epsilon + M)\sqrt{\frac{\log(1/\delta)}{2m}}$. Thus with probability at least $1 - \delta$: $R(f_S) - \hat{R}(f_S) \leq \epsilon + (2m\epsilon + M)\sqrt{\frac{\log(1/\delta)}{2m}}$; applying the same argument to $-G$ with $\delta/2$ gives the two-sided version. When $\epsilon = O(1/m)$, the factor $2m\epsilon + M$ is $O(1)$ and this yields $|R(f_S) - \hat{R}(f_S)| \leq 2\epsilon + O\left(\sqrt{\frac{\log(1/\delta)}{2m}}\right)$.

Proof Strategy & Techniques: Uses symmetrization (swapping a training point with a fresh draw) to control the expected gap. McDiarmid’s inequality (concentration for functions with bounded differences) is key. The proof chains stability of algorithm → bounded differences of the gap → concentration via McDiarmid’s inequality.

Computational Validation: For $m = 10^4, \delta = 0.1, \epsilon = 0.01$, and taking the constant in the $O(\cdot)$ term to be 1: bound is $2(0.01) + \sqrt{\frac{\log(10)}{2 \cdot 10^4}} = 0.02 + \sqrt{1.15 \times 10^{-4}} \approx 0.02 + 0.0107 = 0.0307$. For a more stable algorithm with $\epsilon = 0.001$: bound is $\approx 0.002 + 0.0107 = 0.0127$, so for moderate $m$ the concentration term dominates once $\epsilon$ is small.
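
The arithmetic can be reproduced in a few lines. This sketch evaluates $2\epsilon + c\sqrt{\log(1/\delta)/(2m)}$ with the natural logarithm; the function name `stability_bound` and the parameter `const`, which stands in for the unspecified constant inside the $O(\cdot)$ term, are assumptions made for illustration.

```python
import math

def stability_bound(eps, m, delta, const=1.0):
    """2*eps + const * sqrt(log(1/delta) / (2m)), using the natural log.

    `const` is a placeholder for the constant hidden in the O(.) term.
    """
    return 2.0 * eps + const * math.sqrt(math.log(1.0 / delta) / (2.0 * m))

loose = stability_bound(eps=0.01, m=10**4, delta=0.1)
tight = stability_bound(eps=0.001, m=10**4, delta=0.1)
print(round(loose, 4), round(tight, 4))   # 0.0307 0.0127
```

Because the concentration term does not depend on $\epsilon$, shrinking $\epsilon$ by a factor of ten moves the total by less than a factor of three at this sample size.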

ML Interpretation: Uniform stability is a criterion for generalization independent of hypothesis class complexity. Non-parametric methods (k-NN, kernel methods) can be stable with polynomial worst-case bound, explaining their generalization. It contrasts with VC/Rademacher complexity approaches (which depend on model class size). Adversarial training often reduces stability due to need to memorize adversarial examples, explaining why average-case robustness-accuracy tradeoffs emerge.

Generalization & Edge Cases: (1) Very large $\epsilon$ (loose stability): bound is uninformative. (2) $\delta \to 0$: logarithmic dependence means $O(\log(1/\delta)/m)^{1/2}$ growth, slower than dimension-dependent bounds. (3) Non-uniformly stable algorithms: bound becomes weaker or parameter-dependent. (4) Dependent samples: symmetrization fails; proof doesn’t extend.

Failure Mode Analysis: In practice, most deep learning algorithms are not uniformly stable; $\epsilon$ depends on model capacity and can be large. Computing exact $\epsilon$ is difficult. The bound $2\epsilon + O(\cdot)$ can be loose if $\epsilon$ is not tight. For large-margin classifiers, $\epsilon$ can be controlled (Bousquet & Elisseeff, 2002), but for neural networks, it’s poorly understood.

Historical Context: Uniform stability framework by Bousquet & Elisseeff (2002). McDiarmid’s inequality dating to 1989. The combination provides non-uniform-complexity bounds, complementing PAC learning. Application to adversarial robustness (showing stability trade-off) emerged in recent robustness literature.

Traps: Confusing algorithmic stability ($\epsilon$-uniform) with distributional assumptions. Thinking the bound is tight for all algorithms (it’s only as tight as $\epsilon$ and $m$). Forgetting McDiarmid requires bounded differences for each example; not all algorithms satisfy this. Assuming small $\epsilon$ is always achievable; some algorithms (e.g., an SVM with very weak regularization $\lambda$, where $\epsilon$ scales like $1/(\lambda m)$) have large $\epsilon$.

B.7 Full Formal Proof: Assume $f$ achieves margin $m(x_i) = y_i f(x_i) \geq m > 0$ on training set (for binary classification with $y_i \in \{-1, +1\}$), and $f$ is $L$-Lipschitz. Define robust training loss: $\hat{R}_{\text{robust}} = \frac{1}{m} \sum_{i=1}^m \max_{\|\delta\|_p \leq \epsilon} \ell(f(x_i + \delta), y_i)$. Let $x_{i,\text{adv}} = \arg\max_{\|\delta\|_p \leq \epsilon} \ell(f(x_i + \delta), y_i)$ be worst-case perturbation for example $i$. By Lipschitz property: $f(x_{i,\text{adv}}) - f(x_i)$ can have magnitude up to $L\epsilon$. For margin-maximizing loss $\ell$, the robust loss includes examples pushed to decision boundary or beyond. Key insight: robust training on adversarially perturbed examples creates implicit regularization—the model learns to be robust to perturbations in the data manifold direction. The generalization gap: $R_{\text{robust}} - \hat{R}_{\text{robust}} = \mathbb{E}_{x,y}[\max_{\|\delta\|_p \leq \epsilon} \ell(f(x+\delta), y)] - \frac{1}{m}\sum_i \max_{\|\delta\|_p \leq \epsilon} \ell(f(x_i+\delta), y_i)$. Lower bound via argument: adversarially perturbed examples explore higher-loss regions that may not be well-represented in training set. By analogy with standard generalization (Rademacher complexity), the robust generalization gap depends on: (1) complexity of perturbation set (grows with $\epsilon$), (2) sample size $m$, (3) distribution properties. Formal lower bound (loose): $R_{\text{robust}} - \hat{R}_{\text{robust}} \geq \Omega\left(\sqrt{\frac{dH(\epsilon)}{m}} + \frac{L\epsilon \sqrt{d}}{m}\right)$ where $d$ is dimension and $H(\epsilon)$ is covering number of perturbation set. This shows gap scales with $\epsilon$ (larger robust margin ⟹ larger gap) and inversely with $m$ (more samples ⟹ smaller gap).

Proof Strategy & Techniques: Combinatorial argument via covering/packing numbers for perturbation sets. Uses dimension-dependent complexity measures (similar to VC dimension or Rademacher complexity). The key step is relating perturbation set size to sample complexity via volume / covering arguments.

Computational Validation: MNIST ($d \approx 784$), $m = 60\text{k}$, $\epsilon = 0.3$, $L = 1$, treating $H(\epsilon)$ as an $O(1)$ factor: the first term is $\sqrt{784/60000} \approx 0.114$ and the second is $L\epsilon\sqrt{d}/m = 0.3 \cdot 28/60000 \approx 1.4 \times 10^{-4}$, so the lower bound is $\Omega(0.114)$, dominated by the first term. Empirically, adversarial training robustness generalizes reasonably (robust test accuracy similar to robust train accuracy), suggesting the bound is loose. More refined analysis with tighter constants would be needed.

ML Interpretation: The generalization gap for robust accuracy is typically larger than for standard accuracy, explaining why robust models sometimes have higher overfitting risk. This motivates larger datasets and regularization for adversarial training. The $\epsilon$-dependent term shows that larger perturbation budgets require more samples to achieve same robustness generalization—practical insight for model deployment.

Generalization & Edge Cases: (1) Margin $m$ very small: still holds, but lower bound may become trivial. (2) $\epsilon$ very large (approaching Lipschitz scale): entire input space is reachable; notion of “robust margin” breaks down. (3) Non-separable data ($m < 0$ for some examples): lower bound may need relaxation. (4) High-dimensional data: $d$-dependence makes gap worse, suggesting curse of dimensionality for robust learning.

Failure Mode Analysis: The bound is a lower bound (loose, showing a gap exists), not tight. In practice, adversarial training with proper regularization (e.g., weight decay, early stopping) can keep the robust generalization gap small. If $\epsilon$ is chosen too large relative to the data manifold scale, generalization becomes poor. Overemphasizing robustness ($\epsilon$ too high) leads to standard accuracy degradation and poor robust generalization simultaneously.

Historical Context: Generalization of adversarially trained models was studied empirically by Tsipras et al. (2018); theoretical lower bounds were developed in later works (Wang et al., 2021, and others). The connection to robust complexity/uniform stability is less developed than in the standard setting.

Traps: Confusing lower bound with upper bound (we’re proving gap > something, not gap < something). Assuming lower bound is tight (it’s not; actual gap often much smaller). Forgetting that regularization and algorithm design can reduce the gap below this lower bound. Thinking this means robust learning is inherently unlearnable; it’s just more sample-intensive.

B.8 Full Formal Proof: Formulate as game: learner chooses $\theta$, adversary chooses $\delta$. Payoff to adversary: $\mathcal{J}(\theta, \delta) = \ell(f_\theta(x + \delta), y)$ (maximized). Payoff to learner: $-\mathcal{J}(\theta, \delta)$, so the learner minimizes $\mathcal{J}$ and the game is zero-sum. Threat set: $\delta \in \mathcal{C} = \{\delta : \|\delta\|_p \leq \epsilon\}$ (convex and compact). Assumption: $\mathcal{J}$ is continuous, convex in $\theta$ and concave in $\delta$, and $\theta$ ranges over a convex compact set. For such convex-concave games, the minimax theorem (Von Neumann’s theorem in its continuous form, due to Sion) applies: a saddle point exists in pure strategies, so neither player needs to randomize. Specifically, the pure strategy Nash equilibrium $(\theta^*, \delta^*)$ satisfies: $\theta^* \in \arg\min_\theta \max_\delta \mathcal{J}(\theta, \delta)$ and $\delta^* \in \arg\max_\delta \min_\theta \mathcal{J}(\theta, \delta)$. By strong duality: $\min_\theta \max_\delta \mathcal{J} = \max_\delta \min_\theta \mathcal{J}$. Violating assumptions: If $\ell$ is non-convex in $\theta$ (neural networks), convex-concavity fails. In this case, gradient-based methods can at best target first-order stationary points (local Nash equilibria), not global optima. Multiple equilibria may exist; training algorithms may get stuck in poor local Nash equilibria. The game-theoretic perspective shows that adversarial training is fundamentally a minimax game, and equilibria are natural solution concepts.

Proof Strategy & Techniques: Derives existence from Von Neumann minimax theorem. Requires checking convex-concave structure. For non-convex games, uses concept of local Nash equilibrium and first-order conditions (gradient-based characterization).
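
To see the convex-concave hypothesis at work on the simplest possible case, the sketch below uses a hypothetical scalar payoff $\mathcal{J}(\theta, \delta) = \theta^2 + 2\theta\delta - \delta^2$ (not taken from the problem; chosen only because it is convex in $\theta$ and concave in $\delta$). It compares the two orders of play on a grid and then runs projected gradient descent-ascent; the step size, grids, and starting point are arbitrary.

```python
import numpy as np

def J(theta, delta):
    """Hypothetical toy payoff: convex in theta, concave in delta."""
    return theta**2 + 2.0 * theta * delta - delta**2

eps = 1.0                                   # adversary's budget: |delta| <= eps

# Brute-force both orders of play on a grid.
thetas = np.linspace(-2.0, 2.0, 401)
deltas = np.linspace(-eps, eps, 201)
grid = J(thetas[:, None], deltas[None, :])  # rows index theta, columns index delta
min_max = grid.max(axis=1).min()            # learner commits first
max_min = grid.min(axis=0).max()            # adversary commits first

# Projected gradient descent-ascent with simultaneous updates.
theta, delta, lr = 1.0, 0.5, 0.1
for _ in range(200):
    g_theta = 2.0 * theta + 2.0 * delta     # dJ/dtheta
    g_delta = 2.0 * theta - 2.0 * delta     # dJ/ddelta
    theta = theta - lr * g_theta            # learner descends
    delta = float(np.clip(delta + lr * g_delta, -eps, eps))  # adversary ascends

print(min_max, max_min, theta, delta)
```

Both grid values agree (at zero, up to rounding) and the iterates settle at the saddle point $(0, 0)$. This behaviour relies on the convex-concave structure; for neural-network losses no such guarantee is available.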

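Computational Validation: Linear case ($f_\theta(x) = \theta^T x$, $\ell = (f(x+\delta) - y)^2$): Assuming $x, y$ fixed, $\mathcal{J}(\theta, \delta) = (\theta^T(x+\delta) - y)^2$. This is convex in $\theta$ but also convex (not concave) in $\delta$; compactness of the $\ell_2$-ball does not supply concavity, so the minimax theorem does not apply directly. Both orders of play can be computed in closed form. Learner first: $\max_{\|\delta\|_2 \leq \epsilon} \mathcal{J} = (|\theta^T x - y| + \epsilon \|\theta\|_2)^2$, and minimizing over $\theta$ gives $y^2 \min(1, \epsilon/\|x\|_2)^2$. Adversary first: for fixed $\delta$ with $x + \delta \neq 0$, the learner’s first-order condition $2(\theta^T(x+\delta) - y)(x + \delta) = 0$ is solved by $\theta^* = \frac{y}{(x+\delta)^T(x+\delta)}(x+\delta)$, giving $\mathcal{J} = 0$. Hence for $y \neq 0$ and $\|x\|_2 > \epsilon$ there is a strict gap, $\min_\theta \max_\delta \mathcal{J} > \max_\delta \min_\theta \mathcal{J} = 0$, and no pure strategy saddle point exists. The example confirms that the concavity assumption in $\delta$ is essential rather than confirming existence.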

ML Interpretation: Viewing adversarial training as a game provides insights into why training dynamics can be unstable (misaligned incentives). Mixed strategy equilibria in game theory suggest randomized defenses might be optimal, motivating research into randomized smoothing. The game perspective explains oscillations in adversarial training: the players are competing, not cooperating.

Generalization & Edge Cases: (1) Neural networks (non-convex): no guarantee of pure strategy equilibrium. First-order conditions used instead, which are weaker. (2) Large $\mathcal{C}$ (unbounded threat): threat set may not be compact; theorem doesn’t apply. (3) Non-zero-sum games: different payoff functions for learner and adversary; minimax doesn’t apply directly. (4) Incomplete information (adversary doesn’t see $\theta$): game becomes dynamic; different equilibrium concepts needed.

Failure Mode Analysis: In practice, the non-convexity of neural networks means no global minimax equilibrium is guaranteed. Adversarial training algorithms (PGD-based) find only approximate, local solutions. Oscillations and divergence are common in GAN training (a related game problem), suggesting practical training is far from equilibrium. If threat set grows, convergence to equilibrium becomes harder or impossible.

Historical Context: Game-theoretic view of adversarial examples formalized by Shamir et al. (2016) and others. Connection to GANs (generator-discriminator game) and robust optimization studied extensively in recent works. Von Neumann minimax theorem from 1928; its application to adversarial ML is modern.

Traps: Assuming equilibrium is unique (there can be many Nash equilibria). Confusing mixed strategy equilibrium (randomized/probabilistic) with pure strategy (deterministic) equilibrium; randomized smoothing relates to former. Thinking equilibrium is easy to find algorithmically; computing equilibrium in games is generally PPAD-hard. Forgetting that non-convexity breaks minimax theorem; local results are weaker.

B.9 Full Formal Proof: Let $\ell_1, \ell_2$ be $L_1, L_2$-Lipschitz respectively (in their first argument) on compact set $\mathcal{K} \subset \mathbb{R}^k$. Assume $f$ is trained to minimize robust risk with $\ell_1$: $R^{\text{adv}}_{\ell_1}(f) = \mathbb{E}_{x,y}[\max_{\|\delta\| \leq \epsilon} \ell_1(f(x+\delta), y)]$. Robust risk with $\ell_2$: $R^{\text{adv}}_{\ell_2}(f) = \mathbb{E}_{x,y}[\max_{\|\delta\| \leq \epsilon} \ell_2(f(x+\delta), y)]$. By Lipschitz property: for any $z_1, z_2 \in \mathcal{K}$, $|\ell_1(z_1, y) - \ell_1(z_2, y)| \leq L_1 \|z_1 - z_2\|_2$ and similarly for $\ell_2$. Key observation: $\max_{\|\delta\| \leq \epsilon} \ell_2(z + \delta, y) \leq \max_{\|\delta\| \leq \epsilon} [\ell_2(z, y) + L_2 \|\delta\|_2] = \ell_2(z, y) + L_2 \epsilon$, while $\ell_1(z, y) \leq \max_{\|\delta\| \leq \epsilon} \ell_1(z + \delta, y) \leq \ell_1(z, y) + L_1 \epsilon$. If in addition the unperturbed losses are comparable on $\mathcal{K}$, $\ell_2(z, y) \leq \ell_1(z, y)$ (an assumption the argument needs; a constant offset between the losses would otherwise break the inequality), chaining these gives $\max_\delta \ell_2(z + \delta, y) \leq \max_\delta \ell_1(z + \delta, y) + L_2 \epsilon$, and a fortiori the stated bound. Bound: $R^{\text{adv}}_{\ell_2}(f) \leq R^{\text{adv}}_{\ell_1}(f) + (L_1 + L_2) \epsilon$. Tightness: achieved when $\ell_1, \ell_2$ have aligned Lipschitz directions (e.g., $\ell_2 = c \ell_1$ for constant $c$); transfer between losses is tight. When $\ell_1, \ell_2$ are misaligned (orthogonal gradients), transfer is poor; bound is loose.

Proof Strategy & Techniques: Uses definition of Lipschitz loss and properties of max operator. Applies triangle inequality in loss space. The Lipschitz constants enter the bound additively, not multiplicatively, because each loss contributes its own worst-case change over the $\epsilon$-ball.

Computational Validation: Binary classification: $\ell_1 = \text{cross-entropy}$, $\ell_2 = \text{hinge loss}$. Both are Lipschitz with $L_1 \approx 1, L_2 \approx 1$ (for typical ranges). A model trained for robust cross-entropy loss has robust hinge loss $R^{\text{adv}}_{\ell_2} \lesssim R^{\text{adv}}_{\ell_1} + 2\epsilon$. On CIFAR-10 with $\epsilon = 0.3$ and a robust cross-entropy risk of about 0.40, the bound gives $0.40 + 2 \times 0.3 = 1.0$ for the robust hinge risk. The additive term $2\epsilon = 0.6$ is larger than the risk itself, so at this budget the bound is loose: it is consistent with the observation that transfer is good but not perfect, but it cannot confirm it.
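
The key observation in B.9, that a worst-case loss exceeds the clean loss by at most $L\epsilon$, can be checked numerically. The sketch below is illustrative only: it assumes a binary problem written in margin form (the margin $y \cdot z$ is a scalar), where both cross-entropy and hinge loss are 1-Lipschitz, and it brute-forces the inner maximum on a grid. The function names and the grid of margins are assumptions made for this example.

```python
import math

def logistic_loss(margin: float) -> float:
    """Binary cross-entropy written as a function of the margin y*z."""
    return math.log1p(math.exp(-margin))

def hinge_loss(margin: float) -> float:
    """Hinge loss as a function of the margin y*z."""
    return max(0.0, 1.0 - margin)

def worst_case(loss, margin: float, eps: float, grid: int = 201) -> float:
    """Brute-force maximum of the loss over margin shifts in [-eps, eps]."""
    return max(loss(margin - eps + 2.0 * eps * i / (grid - 1)) for i in range(grid))

EPS = 0.3
LIPSCHITZ = 1.0  # assumption: both losses are 1-Lipschitz in the margin

violations = 0
for k in range(-40, 41):
    margin = 0.1 * k
    for loss in (logistic_loss, hinge_loss):
        # Check: max over the eps-ball <= clean loss + L * eps.
        if worst_case(loss, margin, EPS) > loss(margin) + LIPSCHITZ * EPS + 1e-12:
            violations += 1

print("violations of the Lipschitz bound:", violations)
```

For hinge loss the bound holds with equality whenever the margin is below 1, which is the sense in which the additive $L\epsilon$ term cannot be improved without further assumptions.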

ML Interpretation: Models trained for robustness under one loss transfer reasonably to other losses, especially if losses are “close” (similar Lipschitz structure). This justifies reusing pre-trained robust models for different tasks. The bound quantifies transfer cost, informing whether re-training is needed.

Generalization & Edge Cases: (1) Lipschitz constants unknown: practical bound may be loose if $L_1, L_2$ are overestimated. (2) Different input domains (different $\mathcal{K}$): application to transfer learning is unclear. (3) Non-Lipschitz losses: bound doesn’t apply. (4) $\epsilon \gg$ typical loss scale: bound is entirely dominated by error term, uninformative.

Failure Mode Analysis: The loss transfer bound is pessimistic (assumes worst case). In practice, transfer is often better because models learn representations robust to most perturbations (universal directions), not adversarially worst-case ones. If losses have very different Lipschitz constants (e.g., $L_1 = 0.1, L_2 = 10$), bound becomes loose.

Historical Context: Loss transfer in robust learning less studied than in standard learning. Connection to multi-task learning and meta-learning in robustness is emerging. Lipschitz transfer bounds are specific to adversarial context.

Traps: Confusing additive bound $(L_1 + L_2)\epsilon$ with multiplicative $(L_1 \cdot L_2)$. Assuming all losses are equally Lipschitz; in reality, different losses have very different constants. Thinking transfer is always good; if losses are misaligned, transfer can hurt. Over-optimizing for specific $\ell_1$; a balanced approach may robustify better to multiple losses.

B.10 Full Formal Proof (Randomized Smoothing Certificate): Let $f: \mathbb{R}^d \to [k]$ be the base classifier. Smoothed classifier: $F(x) = \arg\max_j p_j(x)$ where $p_j(x) = \Pr_{\delta \sim \mathcal{N}(0, \sigma^2 I)}[f(x + \delta) = j]$. Theorem: let $A = \arg\max_j p_j(x)$, $p_A = p_A(x)$, and $p_B = \max_{j \neq A} p_j(x)$ (the two top classes). If $p_A > p_B$, then $F$ is certifiably $\ell_2$-robust at $x$ with radius $r = \frac{\sigma}{2}[\Phi^{-1}(p_A) - \Phi^{-1}(p_B)]$, where $\Phi$ is the standard normal CDF. Proof sketch (Cohen et al., 2019): Consider $x'$ with $\|x' - x\|_2 < r$. We need $p_A(x') > p_j(x')$ for every $j \neq A$, where $p_j(x') = \Pr_{\delta \sim \mathcal{N}(0, \sigma^2 I)}[f(x' + \delta) = j]$. Write $x' + \delta = x + (\delta + (x' - x))$: evaluating the smoothed classifier at $x'$ is the same as shifting the mean of the noise by $x' - x$. Decompose the noise into its component along $u = (x' - x)/\|x' - x\|_2$, which is $\mathcal{N}(0, \sigma^2)$, and the orthogonal part, whose distribution the shift does not change. By the Neyman-Pearson lemma, among all base classifiers with $\Pr[f(x + \delta) = A] = p_A$, the one that loses the most class-$A$ mass under the shift assigns $A$ on the half-space $\{v : u^T v \leq \sigma \Phi^{-1}(p_A)\}$ of noise offsets. For that worst case, $p_A(x') \geq \Phi(\Phi^{-1}(p_A) - \|x' - x\|_2/\sigma)$. The same argument with the opposite half-space bounds the runner-up: $p_B(x') \leq \Phi(\Phi^{-1}(p_B) + \|x' - x\|_2/\sigma)$. The prediction is unchanged whenever the lower bound exceeds the upper bound, i.e. $\Phi^{-1}(p_A) - \|x' - x\|_2/\sigma > \Phi^{-1}(p_B) + \|x' - x\|_2/\sigma$, which rearranges to $\|x' - x\|_2 < \frac{\sigma}{2}[\Phi^{-1}(p_A) - \Phi^{-1}(p_B)] = r$. In the binary case $p_B = 1 - p_A$, the symmetry $\Phi^{-1}(1 - p) = -\Phi^{-1}(p)$ gives $r = \sigma \Phi^{-1}(p_A)$. Thus the constant in front of the bracket is $c = 1/2$ exactly.

Proof Strategy & Techniques: Uses the spherical symmetry of the Gaussian distribution. Decomposes the noise into components aligned with and orthogonal to the perturbation. Applies the Neyman-Pearson lemma to identify the worst-case base classifier (a half-space), then evaluates the normal CDF after a mean shift. The key insight is that Gaussian smoothing provides isotropic robustness (same in all directions), so the certificate depends only on $\|x' - x\|_2$.

Computational Validation: Smoothed top-class probability $p_A = 0.8$, runner-up $p_B = 0.1$, $\sigma = 0.5$. Certified radius: $r = \frac{0.5}{2} [\Phi^{-1}(0.8) - \Phi^{-1}(0.1)] = 0.25 [0.842 - (-1.282)] = 0.25 \times 2.124 \approx 0.53$. Direct check: for a perturbed input $x' = x + \delta$ with $\|\delta\|_2 < 0.53$ and noise $\tilde{\delta} \sim \mathcal{N}(0, 0.25 I)$, the probability that $f(x' + \tilde{\delta}) = A$ may fall below 0.8, but it stays above the probability of every other class, so the smoothed prediction remains $A$.
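
The radius formula from the theorem, $r = \frac{\sigma}{2}[\Phi^{-1}(p_A) - \Phi^{-1}(p_B)]$, is straightforward to evaluate. The following is a minimal sketch using only the Python standard library; the function name `certified_radius` is an assumption for this example, and $p_A, p_B$ are taken as given rather than estimated by sampling.

```python
from statistics import NormalDist

def certified_radius(p_a: float, p_b: float, sigma: float) -> float:
    """l2 radius sigma/2 * (Phi^{-1}(p_a) - Phi^{-1}(p_b)); zero if there is no gap."""
    if p_a <= p_b:
        return 0.0  # no certificate when the top class does not lead
    inverse_cdf = NormalDist().inv_cdf  # standard normal quantile function
    return 0.5 * sigma * (inverse_cdf(p_a) - inverse_cdf(p_b))

# Values from the validation example: p_A = 0.8, p_B = 0.1, sigma = 0.5.
radius = certified_radius(0.8, 0.1, 0.5)
print(f"certified l2 radius: {radius:.3f}")

# Binary case: p_B = 1 - p_A reduces the formula to sigma * Phi^{-1}(p_A).
binary_radius = certified_radius(0.8, 0.2, 0.5)
```

In a real pipeline $p_A$ would be replaced by a high-confidence lower bound and $p_B$ by an upper bound obtained from Monte Carlo samples, which makes the reported radius smaller than this plug-in value.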

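ML Interpretation: Randomized smoothing can wrap an existing base model in a certificate; no retraining is strictly required, although in practice the base model is usually trained with Gaussian noise augmentation so that it stays accurate on noisy inputs. It trades accuracy for certified guarantees: more noise $\sigma$ ⟹ wider certified radius but lower accuracy (more noise hides the true input). Practical deployment uses $\sigma \approx 0.12$ to $0.25$ on images scaled to $[0, 1]$; because $r$ is proportional to $\sigma$, the certified $\ell_2$ radii are of the same order as $\sigma$ on that scale.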

Generalization & Edge Cases: (1) Low confidence ($p_A$ close to $p_B$): certified radius is small or zero. (2) Very large $\sigma$: over-smoothing makes predictions uninformative. (3) $\ell_1, \ell_\infty$ threat models: Gaussian noise certifies $\ell_2$ balls; other norms need a different smoothing distribution and a different radius formula. (4) Discrete classifiers (non-probabilistic): can be converted to soft classifiers via posterior estimation.

Failure Mode Analysis: If base model $f$ has low accuracy, noise amplifies errors; smoothed model may have very low accuracy despite certified radius. In practice, achieving both high certified radius and high accuracy is expensive (tradeoff). Estimating $p_A, p_B$ requires many samples (~1000 Gaussian samples per test point), making inference slow.

Historical Context: Randomized smoothing for certified robustness introduced by Cohen et al. (2019). Optimal constant $c = 1/2$ derived rigorously. Following works (Yang et al., 2020, Levine et al., 2020) extended to other norms and compared with other certification methods.

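Traps: Confusing the certified radius with empirical robustness; the certified radius is a guaranteed lower bound and is typically smaller than the radius at which attacks actually succeed. Thinking higher $\sigma$ always helps (accuracy degrades). Assuming the accuracy-robustness tradeoff is entirely inherent; it is only partly so, and algorithm design matters. Forgetting that the smoothed classifier is never stored explicitly: it is evaluated by Monte Carlo sampling, so the overhead is many forward passes of the base model per input, not a larger model.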

B.11 Full Formal Proof (Theorem 6): Let the map $x \mapsto \ell(f(x), y)$ be twice-differentiable, and write $\nabla_x \ell$ and $\nabla_x^2 \ell$ for its gradient and Hessian in $x$. By Taylor’s theorem with Lagrange remainder: $\ell(f(x+\delta), y) = \ell(f(x), y) + (\nabla_x \ell)^T \delta + \frac{1}{2} \delta^T \nabla_x^2 \ell(\xi) \delta$ where $\xi = x + t\delta$ for some $t \in [0, 1]$. Quadratic term: $\frac{1}{2} \delta^T \nabla_x^2 \ell(\xi) \delta$ has magnitude at most $\frac{1}{2} \max_i |\lambda_i(\nabla_x^2 \ell(\xi))| \, \|\delta\|_2^2$. By Assumption, this is at most $\frac{1}{2} \lambda_{\max} \|\delta\|_2^2$ (bound on the largest eigenvalue magnitude, uniform over $\xi$ near $x$). If the Hessian bound is available only at $x$ rather than along the whole segment, expand one order further. Third-order remainder: $\ell(f(x+\delta), y) - \ell(f(x), y) - (\nabla_x \ell)^T \delta - \frac{1}{2} \delta^T \nabla_x^2 \ell(x) \delta = O(\|\delta\|^3)$. Using multivariate Taylor: $O(\|\delta\|^3) \leq C \|\nabla_x^3 \ell\| \|\delta\|^3 \leq C H_3 \|\delta\|^3$ where $H_3$ is a tensor norm of the third derivative. In the regime $\|\delta\|_2 \leq \epsilon$ (relevant scale), $\|\delta\|^3 \leq d \|\delta\|_2^3$ (a loose, dimension-dependent conversion to the $\ell_2$ norm), so $O(\|\delta\|^3) \leq C H_3 d \|\delta\|_2^3$. Combining: $|\ell(f(x+\delta), y) - \ell(f(x), y) - (\nabla_x \ell)^T \delta| \leq \frac{1}{2} \lambda_{\max} \|\delta\|_2^2 + CH_3 d \|\delta\|_2^3$.

Proof Strategy & Techniques: Applies multivariate Taylor expansion to second order, with remainder bound via third derivative. Uses eigenvalue characterization to bound quadratic form. The Hessian eigenvalue bound is key; for non-smooth functions, this may not exist, so proof fails in those cases.

Computational Validation: Consider $\ell = (z - y)^2$ with $z = f(x)$, univariate for simplicity. $\frac{\partial \ell}{\partial x} = 2(z - y) \frac{\partial z}{\partial x}$. $\frac{\partial^2 \ell}{\partial x^2} = 2[(\frac{\partial z}{\partial x})^2 + (z - y) \frac{\partial^2 z}{\partial x^2}]$. For bounded $z, y$ and $\frac{\partial^2 z}{\partial x^2}$: Hessian is $O(1)$. Perturbation $\delta$: error bound is $O(\delta^2)$, and third-order $O(\delta^3)$ is smaller, justifying truncation at second order for small $\delta$.
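
The second-order bound can be sanity-checked on the univariate squared loss. The sketch below assumes a hypothetical scalar model $f(x) = \sin x$ (not taken from the text) so that every derivative is bounded: with $|y| \leq 1$, $|\partial^2 \ell / \partial x^2| = 2|\cos^2 x - (\sin x - y)\sin x| \leq 6$, which plays the role of $\lambda_{\max}$.

```python
import math

def loss(x: float, y: float) -> float:
    """Squared loss with the illustrative model f(x) = sin(x)."""
    return (math.sin(x) - y) ** 2

def loss_grad(x: float, y: float) -> float:
    """d loss / dx = 2 (f(x) - y) f'(x)."""
    return 2.0 * (math.sin(x) - y) * math.cos(x)

LAMBDA_MAX = 6.0  # uniform bound on |d^2 loss / dx^2| when |y| <= 1

def first_order_error(x: float, delta: float, y: float) -> float:
    """|loss(x + delta) - loss(x) - grad * delta|, the linearization error."""
    return abs(loss(x + delta, y) - loss(x, y) - loss_grad(x, y) * delta)

y = 0.5
worst_ratio = 0.0
for i in range(-30, 31):
    x = 0.1 * i
    for delta in (0.3, 0.1, 0.03, 0.01):
        bound = 0.5 * LAMBDA_MAX * delta ** 2
        worst_ratio = max(worst_ratio, first_order_error(x, delta, y) / bound)

print("largest error / bound ratio:", worst_ratio)
```

A ratio below 1 on every grid point is what the Lagrange-remainder argument predicts; how far below 1 it stays indicates how conservative the uniform curvature bound is for this particular function.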

ML Interpretation: Theorem 6 justifies FGSM and other first-order attacks for small perturbations. For larger $\epsilon$, second-order effects matter; this motivates Newton-based attacks or higher-order approximations. The bound characterizes when first-order approximation is valid (small $\delta$ relative to curvature).

Generalization & Edge Cases: (1) Unbounded Hessian: bound becomes uninformative or infinite. ReLU networks have discontinuous derivatives; Hessian may not be well-defined. (2) Saddle points (mixed eigenvalues): Hessian norm $\lambda_{\max}$ may be large, but effective curvature in perturbation direction could be small. (3) Very small $\delta$: third-order term dominates if second-order is zero (e.g., linear model). (4) Non-differentiability: proof breaks down; need Clarke subdifferential or other tools.

Failure Mode Analysis: If Hessian is not bounded (e.g., at discontinuities or saddle points), the bound explodes. Computing exact $\lambda_{\max}$ is challenging for large networks; typically upper bounds used, loosening the result. Models with pathological Hessians (very large eigenvalues in some directions) have unreliable first-order approximations.

Historical Context: Hessian-based analysis of neural networks dating back to LeCun et al. (1991) and continuing with recent work on loss landscape characterization. Application to adversarial robustness (understanding when first-order attacks are valid) is modern.

Traps: Confusing the bound $\frac{1}{2} \lambda_{\max} \|\delta\|_2^2$ with the actual error; it’s an upper bound and can be loose. Assuming the third-order term $O(\|\delta\|^3)$ is always negligible (it matters for moderate $\delta$). Forgetting that the Hessian may not exist for non-smooth models (common in deep learning). Thinking the bound is independent of the input parameterization; rescaling the input features changes both $\lambda_{\max}$ and $\|\delta\|_2$.

B.12 Full Formal Proof (Robust Risk Concentration): Let $\mathcal{C} = \{\delta : \|\delta\|_p \leq \epsilon\}$ (compact threat set) and let the loss be bounded: $\ell \in [0, M]$. Define empirical robust loss: $\hat{R}_{\text{robust}} = \frac{1}{m} \sum_{i=1}^m \max_{\delta \in \mathcal{C}} \ell(f(x_i+\delta), y_i)$. Population robust loss: $R_{\text{robust}} = \mathbb{E}_{(x,y) \sim D}[\max_{\delta \in \mathcal{C}} \ell(f(x+\delta), y)]$. Claim: for a fixed $f$ chosen independently of the sample, $\Pr[|R_{\text{robust}} - \hat{R}_{\text{robust}}| > t] \leq 2\exp(-2mt^2 / M^2)$. This follows from: each term $g_i = \max_{\delta \in \mathcal{C}} \ell(f(x_i+\delta), y_i)$ is i.i.d. and bounded in $[0, M]$. The average $\frac{1}{m}\sum_i g_i$ concentrates around $\mathbb{E}[g_i] = R_{\text{robust}}$ by Hoeffding’s inequality: $\Pr[\frac{1}{m} \sum_i (g_i - \mathbb{E}[g_i]) > t] \leq \exp(-2mt^2/M^2)$; applying it to both tails gives the factor 2. Standard form: With probability $1 - \delta$ (here $\delta$ is the confidence parameter, not a perturbation), $|R_{\text{robust}} - \hat{R}_{\text{robust}}| \leq M \sqrt{\frac{\log(2/\delta)}{2m}}$. The dependence is: (1) $O(\sqrt{\log(1/\delta)})$: standard, from the exponential tail. (2) $O(1/\sqrt{m})$: the standard deviation of the average decays as $m^{-1/2}$. (3) $O(M)$: loss range dominates, explicit in bound. (4) Distribution properties: Hoeffding ignores them; if the $g_i$ have low variance, variance-sensitive bounds (e.g., Bernstein) give faster convergence. A bound that holds for a model fitted to the same sample needs uniform convergence over the hypothesis class (e.g., empirical process theory, Rademacher complexity), which also brings in the geometry of $\mathcal{C}$; in basic form (Hoeffding), concentration is distribution-free.

Proof Strategy & Techniques: Applies concentration inequalities (Hoeffding or Chernoff) to i.i.d. samples. Key: each $g_i$ is bounded, enabling standard machinery. The robust risk being max over perturbations doesn’t break concentration since max of bounded functions is bounded.

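Computational Validation: CIFAR-10 ($m = 50{,}000$, confidence parameter $\delta = 0.01$), $M = 1$ (loss in $[0, 1]$). Concentration bound: $|R_{\text{robust}} - \hat{R}_{\text{robust}}| \leq 1 \cdot \sqrt{\frac{\log(2/0.01)}{2 \cdot 50000}} = \sqrt{\frac{5.3}{100000}} \approx 0.0073$. This applies to a fixed model evaluated on fresh data. A train/test gap in robust loss such as ~0.40 vs. ~0.42 (gap 0.02) is larger than 0.0073, which is expected: the trained model depends on the training sample, so the train/test gap is governed by uniform-convergence bounds rather than by this fixed-$f$ bound.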
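
The Hoeffding width $M\sqrt{\log(2/\delta)/(2m)}$ and its inverse (how many samples are needed for a target width) are one-line computations. The helper names below are assumptions for this sketch; the numbers are those of the validation example.

```python
import math

def hoeffding_gap(m: int, confidence_delta: float, loss_range: float = 1.0) -> float:
    """Two-sided Hoeffding width M * sqrt(log(2/delta) / (2m)) for a fixed model."""
    return loss_range * math.sqrt(math.log(2.0 / confidence_delta) / (2.0 * m))

def samples_needed(target_gap: float, confidence_delta: float, loss_range: float = 1.0) -> int:
    """Smallest m whose Hoeffding width is at most target_gap."""
    exact = loss_range ** 2 * math.log(2.0 / confidence_delta) / (2.0 * target_gap ** 2)
    return math.ceil(exact)

# m = 50,000 evaluation points, confidence parameter 0.01, loss in [0, 1].
gap = hoeffding_gap(50_000, 0.01)
print(f"Hoeffding width: {gap:.4f}")

# Inverse question: samples required for a width of 0.01 at the same confidence.
m_required = samples_needed(0.01, 0.01)
print("samples needed for width 0.01:", m_required)
```

Because the width scales as $m^{-1/2}$, halving it requires four times as many evaluation points, independent of the threat set.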

ML Interpretation: For a fixed model, the robust risk concentrates as well as the standard risk despite the extra complexity (max over perturbations). This is because the max is evaluated on each sample independently; the sample complexity of estimating it doesn’t blow up with the number of perturbations searched. Standard interpretation: with $m$ samples, the robust risk can be estimated to accuracy $\epsilon_{\text{gen}} = O(\sqrt{\log(1/\delta)/m})$.

Generalization & Edge Cases: (1) Non-uniform loss range (e.g., $\ell \in [0, M(x, y)]$ depending on data): bound becomes $\max_{x,y} M(x,y)$, which could be looser. (2) Unbounded threat set (e.g., $\epsilon = \infty$): robust loss may be unbounded, violating assumptions. (3) Non-i.i.d. data: concentration fails; need empirical process tools. (4) Distribution shift: empirical measure on train may not match test distribution; gap arises.

Failure Mode Analysis: The $M$ dependence (loss range) is often loose. If loss is typically $\ll M$ (e.g., $M = 1, \ell$ typically $\approx 0.1$), bound is conservative. If threat set $\mathcal{C}$ is very large (many perturbations), the empirical max taken over $\mathcal{C}$ can be noisy for small $m$. In practice, cross-validation and careful regularization ensure generalization better than bound suggests.

Historical Context: Hoeffding’s inequality from 1963. Application to supervised learning standard since Rademacher complexity era (2000s). Robust learning concentration bounds explicitly studied in recent works (Wang et al., 2021, etc.).

Traps: Confusing concentration with guarantees on individual examples (it’s on average). Assuming bound is tight (it’s often loose by orders of magnitude). Forgetting that $m$ is sample size; smaller $m$ means wider confidence intervals. Thinking high $M$ (large loss range) always hurts; it’s just explicit in Hoeffding; refined bounds might reduce this.

B.13 Full Formal Proof: Linear classifier $f(x) = w^T x + b$ (extended to include bias in $w$). Data: two Gaussian clusters $D_1 \sim \mathcal{N}(\mu_1, \Sigma_1), D_2 \sim \mathcal{N}(\mu_2, \Sigma_2)$ with prior $\pi_1, \pi_2 = 1 - \pi_1$. Standard ERM loss: $\min_w \mathbb{E}[(w^T x - y)^2]$ (assume $y \in \{0, 1\}$, so the regression targets are the cluster labels). For isotropic, equal covariances the optimal $w_{\text{ERM}}$ aligns with $\mu_2 - \mu_1$ (cluster separation); in general least squares yields the Fisher direction $\Sigma_W^{-1}(\mu_2 - \mu_1)$, with $\Sigma_W$ the pooled within-class covariance. Robust optimization: $\min_w \mathbb{E}[\max_{\|\delta\|_2 \leq \epsilon} (w^T(x + \delta) - y)^2] = \min_w \mathbb{E}[(|w^T x - y| + \epsilon \|w\|_2)^2]$ (the worst-case perturbation $\delta^* = \epsilon \, \text{sign}(w^T x - y) \, w / \|w\|_2$ pushes the residual away from zero by $\epsilon\|w\|_2$). Expanding: $\min_w \mathbb{E}[(w^T x - y)^2 + 2\epsilon\|w\|_2 |w^T x - y| + \epsilon^2 \|w\|_2^2]$. The robust optimum $w^*_{\text{rob}}$ therefore balances cluster separation against two penalties that grow with $\|w\|_2$. If the clusters are well separated with low-rank covariances, $w_{\text{ERM}}$ can have large norm, so the robust solution shrinks it. Characterize: ERM chooses $w_{\text{ERM}} \propto \mu_2 - \mu_1$ in the isotropic case (any positive rescaling gives the same decision boundary once the threshold is rescaled). Robust chooses $w_{\text{rob}} = c(\epsilon)(\mu_2 - \mu_1)$ in the isotropic case, with a scalar $c(\epsilon) \geq 0$ that decreases as $\epsilon$ grows; its exact value depends on the cluster statistics. Geometric difference: ERM picks the direction that best separates the clusters; the robust solution shrinks toward the origin to reduce sensitivity. If clusters have skewed covariance (e.g., elongated in direction $v$), ERM aligns $w$ with the covariance-adjusted cluster separation, while the robust solution also decreases the component along $v$ to reduce the Lipschitz norm $\|w\|_2$.

Proof Strategy & Techniques: Compares first-order optimality conditions for $\min_w \mathbb{E}[\ell]$ (ERM) and $\min_w \mathbb{E}[\max_{\|\delta\|_2 \leq \epsilon} \ell]$ (robust). For linear models, both are convex, enabling exact solution characterization. Solves normal equations for each case and compares $w$ coefficients.

Computational Validation: 2D example: $\mu_1 = [0, 0], \mu_2 = [2, 0], \Sigma_1 = \Sigma_2 = \text{diag}(0.1), \epsilon = 1$. ERM: $w_{\text{ERM}} \propto [2, 0]$ (can be unnormalized). Robust: adds penalty $\epsilon \|w\|_2$, giving $w_{\text{rob}} \propto [2, 0] / (2 + 1) = [2/3, 0]$ (roughly). Norm reduced from 2 to 0.67, confirming shrinkage.
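
The shrinkage effect can be reproduced numerically. The sketch below makes several assumptions that are not in the text: it reduces the 2D example to the one coordinate along $\mu_2 - \mu_1$ (means 0 and 2, variance 0.1, labels 0 and 1), uses a hypothetical budget $\epsilon = 0.75$, draws a small synthetic sample with a fixed seed, and replaces the normal equations by a coarse grid search. For squared loss the inner maximum has the closed form $(|wx + b - y| + \epsilon|w|)^2$, so no attack loop is needed. It is not meant to reproduce the rough $2/3$ factor quoted above.

```python
import random

def make_data(m_per_class: int = 50, seed: int = 0):
    """Two 1D Gaussian clusters (means 0 and 2, variance 0.1) with labels 0 and 1."""
    rng = random.Random(seed)
    std = 0.1 ** 0.5
    data = [(rng.gauss(0.0, std), 0.0) for _ in range(m_per_class)]
    data += [(rng.gauss(2.0, std), 1.0) for _ in range(m_per_class)]
    return data

def robust_objective(w: float, b: float, data, eps: float) -> float:
    """Mean of (|w*x + b - y| + eps*|w|)^2; eps = 0 recovers the ERM squared loss."""
    penalty = eps * abs(w)
    return sum((abs(w * x + b - y) + penalty) ** 2 for x, y in data) / len(data)

def fit(data, eps: float):
    """Grid search over w in [0, 0.6] (step 0.01) and b in [0, 1] (step 0.02)."""
    best_value, best_w, best_b = float("inf"), 0.0, 0.0
    for i in range(61):
        for j in range(51):
            w, b = 0.01 * i, 0.02 * j
            value = robust_objective(w, b, data, eps)
            if value < best_value:
                best_value, best_w, best_b = value, w, b
    return best_w, best_b

data = make_data()
w_erm, b_erm = fit(data, eps=0.0)    # standard least squares
w_rob, b_rob = fit(data, eps=0.75)   # worst-case squared loss
print("ERM slope:", w_erm, "robust slope:", w_rob)
```

Raising `eps` in this sketch drives the robust slope further toward zero, which is the collapse described under failure modes when the budget becomes comparable to the cluster separation.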

ML Interpretation: Even for simple linear models, robustness fundamentally changes learned decision boundary. Robust classifiers trade margin for noise insensitivity. This explains why robust models have lower standard accuracy (smaller margins) and why robust+standard training (multi-objective) is beneficial.

Generalization & Edge Cases: (1) Equal covariance: geometry simpler; robust solution is scaled version of ERM. (2) Highly skewed/anisotropic covariance: robust solution may prefer different coordinates than ERM. (3) Overlapping clusters: margins small; robust shrinkage less dramatic. (4) Non-Gaussian data: normal equations don’t apply; analysis requires case-by-case.

Failure Mode Analysis: If $\epsilon$ is too large (larger than cluster separation), robust solution may collapse to near-zero norm, achieving zero margin. If covariance is very asymmetric, shrinkage may preferentially (incorrectly) zero out important directions. If data is not separable, normal equations have no unique solution; regularization needed, changing analysis.

Historical Context: Margin-robust tradeoff studied in classical robust statistics (Huber, 1981); formalization for adversarial examples modern (Tsipras et al., 2018 empirical work; theoretical extensions by others).

Traps: Assuming robust and standard ERM solutions are simply related (they’re not; qualitatively different geometries). Thinking linear model analysis generalizes to nonlinear (doesn’t; neural network case is far more complex). Forgetting that Lipschitz constraint on decision boundary (what robust optimization imposes) is different from margin constraint.

B.14 Full Formal Proof (Fundamental Accuracy-Robustness Tradeoff): Define accuracy-robustness curve: for dataset $\mathcal{D}$, define $\text{Acc}_{\text{std}}(\theta) = \Pr_{(x,y) \sim \mathcal{D}}[f_\theta(x) = y]$ and $\text{Acc}_{\text{rob}}(\theta) = \Pr_{(x,y) \sim \mathcal{D}}[f_\theta(x + \delta^*_{x,y}(\theta)) = y]$ where $\delta^*_{x,y}(\theta)$ is the worst-case perturbation of $x$ against $f_\theta$ within the threat set. Claim: for fixed $\mathcal{D}$, there exists a Pareto frontier: $\max_\theta (\text{Acc}_{\text{std}} + \text{Acc}_{\text{rob}}) < 2$ (roughly). Lower bound: consider a dataset with examples $x_i$ and adversarially perturbed versions $x_i + \delta_i$ with $\|\delta_i\| \leq \epsilon$. If the model classifies both $x_i$ and $x_i + \delta_i$ correctly, it must separate both in feature space. For $k$ classes in $d$ dimensions, classical VC dimension arguments show that achieving both high standard and high robust accuracy requires either (1) larger model capacity (more parameters), or (2) exploiting structure in the data (unreliable in the adversarial setting). Formal bound: $\mathbb{E}[\text{Acc}_{\text{std}} - \text{Acc}_{\text{rob}}] \geq \Omega(\epsilon) \cdot (\text{dimension-dependent term})$. Specifically, for uniform perturbations $\|\delta\| \leq \epsilon$ in $\ell_\infty$, accuracy loss scales as $\Omega(\epsilon d)$ (dimension $d$ of input space). For $\ell_2$ threats, scaling is $\Omega(\epsilon \sqrt{d})$. These rates hide constants and are capped by the trivial limit of 1 on any accuracy gap.

Proof Strategy & Techniques: Combination of information-theoretic and geometric arguments. Uses packing / covering number of perturbation sets scaled by capacity of hypothesis class. The key is recognizing that robustly classifying perturbed examples requires extra capacity.

Computational Validation: MNIST ($d = 784, \epsilon = 0.3$): the naive $\ell_2$ scaling gives $0.3 \cdot \sqrt{784} = 8.4$ before the hidden constant, far above both the maximum possible accuracy loss of 1 and the observed loss of ~5-10%. The bound is therefore not tight and is at most directionally correct.

ML Interpretation: The tradeoff is fundamental and unavoidable, not due to poor algorithm design. This justifies why ensemble of high-accuracy, moderate-robustness models often beats single high-robustness model in practice.

Generalization & Edge Cases: (1) Tiny $\epsilon$ (near zero): tradeoff disappears, confirming robustness is vacuous for small threats. (2) Very large $\epsilon$ (comparable to input space size): robustness requirement is stringent; tradeoff dominates. (3) Benign overfitting (more capacity helps standard without hurting robust): tradeoff may be weaker. (4) Task-dependent: some tasks (e.g., invariant to rotations) may have weak tradeoff.

Failure Mode Analysis: Tradeoff magnitude varies by task and model architecture; bound is worst-case. Some data distributions (e.g., linearly separable with large margin) have weaker tradeoff. Adversarial training with regularization (TRADES, AWP) can reduce tradeoff but not eliminate it.

Historical Context: Tsipras et al. (2018) empirically observed tradeoff. Theoretical lower bounds (formally proving universality) followed (Wang et al., 2021, Bartlett et al., 2021).

Traps: Assuming tradeoff is algorithm failure; it’s a fundamental property. Thinking perfect solutions exist; no algorithm solves it optimally. Generalizing single-dataset bound to all datasets; data structure matters. Confusing tradeoff in (acc, rob) space with tradeoff in (capacity, training time) space.

B.15 Full Formal Proof (LOO error and stability): Leave-one-out (LOO) error: $\text{LOO}(S) = \frac{1}{m} \sum_{i=1}^m \ell(f_{S \setminus i}(x_i), y_i)$ where $f_{S \setminus i}$ is the model trained without example $i$. $\epsilon$-uniform stability: $|\ell(f_S(x), y) - \ell(f_{S'}(x), y)| \leq \epsilon$ for every $(x, y)$ and any $S, S'$ differing in one example. Theorem (Bousquet-Elisseeff): if an algorithm is $\epsilon$-uniformly stable, then $\mathbb{E}[|\text{LOO}(S) - \hat{R}(S)|] \leq \epsilon$ (bound on the expected deviation of the LOO error from the empirical risk); a converse holds only for weaker, average-case notions of stability. Proof direction 1 ($\epsilon$-stable $\Rightarrow$ small LOO deviation): By stability, replacing $(x_i, y_i)$ in $S$ with an independent copy $(x_i', y_i')$ changes the loss on $(x_i, y_i)$ by at most $\epsilon$: $|\ell(f_S(x_i), y_i) - \ell(f_{S'}(x_i), y_i)| \leq \epsilon$, where $S'$ has $(x_i', y_i')$ in place of $(x_i, y_i)$. Averaging over $i$ and using the fact that $f_{S \setminus i}$ is equivalent to training on a dataset with $(x_i, y_i)$ replaced by a “ghost” example $(x_i', y_i')$ from the distribution, we get $\mathbb{E}[\text{LOO}(S)] \leq \mathbb{E}[\hat{R}(S)] + \epsilon$ (up to concentration). Proof direction 2 (small LOO deviation $\Rightarrow$ stability on average): If the LOO error stays close to the training loss, then on average leaving one example out doesn’t change the loss much, so the algorithm is not sensitive to individual examples. This does not yield the worst-case (uniform) guarantee, which must hold for every dataset and every test point. Characterization: the precise relationship holds with additive/multiplicative factors that depend on the probability tail bounds used.

Proof Strategy & Techniques: Uses coupling argument—training with vs. without one example is a “nearby” dataset pair. Leverages probabilistic definition of stability. The equivalence holds up to concentration parameters.

Computational Validation: k-NN (k=1): for example $i$, $f_{S \setminus i}$ predicts with the nearest neighbor of $x_i$ in $S \setminus i$. The test on $x_i$ is non-trivial precisely because $x_i$ itself is excluded; on the full set $S$, 1-NN has zero training error. Expected LOO error can be analyzed explicitly for simple geometries. Empirically, LOO estimates generalization reliably, supporting stability framework.
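
The 1-NN case is easy to run end to end. The sketch below is a minimal illustration under stated assumptions: a synthetic 1D dataset (two unit-variance Gaussians with means 0 and 2, fixed seed), 0-1 loss, and hypothetical helper names. It contrasts the training error, which is trivially zero because each point is its own nearest neighbor, with the LOO error.

```python
import random

def nearest_label(x: float, points):
    """Label of the point in `points` closest to x (1-NN on a 1D feature)."""
    return min(points, key=lambda p: abs(p[0] - x))[1]

def train_error(data) -> float:
    """0-1 error when each point may use itself as its nearest neighbor."""
    return sum(nearest_label(x, data) != y for x, y in data) / len(data)

def loo_error(data) -> float:
    """0-1 leave-one-out error: predict each point from all the others."""
    mistakes = 0
    for i, (x, y) in enumerate(data):
        held_out = data[:i] + data[i + 1:]  # S without example i
        mistakes += nearest_label(x, held_out) != y
    return mistakes / len(data)

rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
data += [(rng.gauss(2.0, 1.0), 1) for _ in range(100)]

print("training error:", train_error(data))
print("LOO error:", loo_error(data))
```

For 1-NN the LOO loop needs no retraining, which is why it is a convenient test case; for models that must be refit, the same loop costs $m$ training runs.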

ML Interpretation: LOO error is a model-selection tool and generalization estimator. Stability framework connects this practical tool to rigorous learning theory. Shows that algorithms with certified LOO bounds generalize well.

Generalization & Edge Cases: (1) Very small $m$ (few examples): LOO error has large variance; equivalence to stability may be loose. (2) Non-i.i.d. data: LOO and stability become different. (3) Infinite hypothesis class: stability may be harder to achieve but LOO might still be good estimator (empirical). (4) Non-convex (neural networks): stability is hard to characterize; LOO computation expensive (requires $m$ retrains).

Failure Mode Analysis: Computing exact LOO error is expensive ($m$ retrainings). Approximations (e.g., influence functions) can be inaccurate. If algorithm is unstable, LOO error is large, not useful for model selection. Overfitting to LOO in cross-validation can occur (similar to overfitting to validation set).

Historical Context: LOO error classical in statistics (generalization of leave-one-out cross-validation). Formal connection to algorithmic stability by Bousquet & Elisseeff (2002). Modern re-examination in context of deep learning (questions: is deep learning stable? answer: no, typically not—yet high-capacity models generalize, highlighting limits of stability framework).

Traps: Confusing LOO error with LOO cross-validation score (different concepts; former is loss, latter is classification accuracy). Assuming LOO gives tight generalization bound (it’s an estimator, not a bound; confidence interval needed). Thinking LOO is always better than k-fold CV (LOO has high variance for certain algorithms; k-fold can be stabler). Over-relying on LOO for model selection; multiple testing corrections needed.

B.16 Full Formal Proof (Bilevel optimization for adversarial training): Bilevel formulation: $\min_\theta \mathbb{E}_{x,y} [\ell(f_\theta(x + \delta^*_\theta(x, y)), y)]$ where $\delta^*_\theta(x, y) = \arg\max_{\|\delta\| \leq \epsilon} \ell(f_\theta(x + \delta), y)$ (inner maximization). Upper level: minimize in $\theta$. Lower level: for each $\theta$, compute $\delta^*_\theta$. Gradients can be computed via implicit differentiation. Let $g(\theta) = \ell(f_\theta(x + \delta^*_\theta), y)$. By chain rule: $\nabla_\theta g(\theta) = \frac{\partial \ell}{\partial \theta} \big|_{x + \delta^*} + \frac{\partial \ell}{\partial \delta} \big|_{x + \delta^*} \nabla_\theta \delta^*_\theta$. To find $\nabla_\theta \delta^*_\theta$: implicitly differentiate the optimality condition $\nabla_\delta \ell(f_\theta(x + \delta^*), y) = 0$ (valid at an interior inner optimum). Differentiating: $\nabla_\delta^2 \ell \cdot\nabla_\theta \delta^* + \nabla_{\theta \delta}^2 \ell = 0 \Rightarrow \nabla_\theta \delta^* = -[\nabla_\delta^2 \ell]^{-1} \nabla_{\theta \delta}^2 \ell$. Thus: $\nabla_\theta g = \frac{\partial \ell}{\partial \theta} - \frac{\partial \ell}{\partial \delta} [\nabla_\delta^2 \ell]^{-1} \nabla_{\theta \delta}^2 \ell$. Note that at an exact interior optimum the same optimality condition gives $\frac{\partial \ell}{\partial \delta} \big|_{x + \delta^*} = 0$, so the second term vanishes and $\nabla_\theta g = \frac{\partial \ell}{\partial \theta} \big|_{x + \delta^*}$. This is the envelope (Danskin) theorem, and it is why adversarial training can simply backpropagate through the loss at the attacked input; the implicit term matters only when $\delta$ is not an exact maximizer. Computational challenges (when the implicit term is used): (1) Computing the Hessian $\nabla_\delta^2 \ell$ is expensive (via second-order autodiff or manual computation). (2) Inverting the Hessian is numerically unstable. (3) Approximation errors propagate: if the Hessian is approximated, the gradient becomes inaccurate. In practice: PGD-based training approximates by performing $K$ steps of inner maximization (not solving exactly), avoiding Hessian computation. Error analysis: if the inner maximization is solved to error $\delta_{\text{inner}}$ (i.e., the $K$-step iterate $\delta_K$ satisfies $\ell(f(x + \delta_K), y) \geq \max_\delta \ell(f(x + \delta), y) - \delta_{\text{inner}}$), then the gradient error is $O(\delta_{\text{inner}})$, compounding over iterations. Convergence: with approximate inner solutions, convergence is to a local stationary point, not a global minimum.

Proof Strategy & Techniques: Uses implicit differentiation and the envelope theorem from optimization theory. Applies chain rule carefully through nested optimization. Analyzes error propagation from inner approximation.

Computational Validation: Quadratic inner/outer: $\min_\theta g(\theta, \delta^*(\theta))$ where $\delta^* = \arg\max_\delta (a \delta - b \delta^2 + c \delta^T h(\theta))$ with $b > 0$ (simple form). Setting the derivative in $\delta$ to zero, $a - 2b\delta + c \, h(\theta) = 0$, gives the explicit solution $\delta^* = (a + c \, h(\theta)) / (2b)$. Implicit gradient: $\nabla_\theta \delta^* = c \, \nabla_\theta h(\theta) / (2b)$. This verifies that gradient computation is feasible in closed form for simple cases.
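
The quadratic example can be checked by comparing the implicit-differentiation formula against a finite difference. The sketch below treats $\delta$ and $\theta$ as scalars and assumes hypothetical values $a = 1$, $b = 2$, $c = 0.5$ and $h(\theta) = \sin\theta$; none of these choices come from the text. Setting the $\delta$-derivative of the inner objective to zero gives $\delta^* = (a + c\,h(\theta))/(2b)$, and the implicit formula $-[\nabla_\delta^2]^{-1}\nabla_{\theta\delta}^2$ gives $c\,h'(\theta)/(2b)$.

```python
import math

A, B, C = 1.0, 2.0, 0.5  # illustrative constants; B > 0 makes the inner problem concave

def h(theta: float) -> float:
    return math.sin(theta)

def h_prime(theta: float) -> float:
    return math.cos(theta)

def inner_objective(delta: float, theta: float) -> float:
    """a*delta - b*delta^2 + c*delta*h(theta), maximized over delta."""
    return A * delta - B * delta ** 2 + C * delta * h(theta)

def delta_star(theta: float) -> float:
    """Closed-form inner maximizer from the first-order condition."""
    return (A + C * h(theta)) / (2.0 * B)

def implicit_gradient(theta: float) -> float:
    """d delta*/d theta = -(d^2/d delta^2)^{-1} * (d^2/d theta d delta)."""
    second_derivative = -2.0 * B
    mixed_partial = C * h_prime(theta)
    return -mixed_partial / second_derivative

def finite_difference(theta: float, step: float = 1e-6) -> float:
    """Central difference of the closed-form maximizer."""
    return (delta_star(theta + step) - delta_star(theta - step)) / (2.0 * step)

theta = 0.7
print("implicit:", implicit_gradient(theta), "finite difference:", finite_difference(theta))
```

The two numbers agree to finite-difference precision, which confirms the sign convention in the implicit formula for this concave inner problem.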

ML Interpretation: Bilevel perspective clarifies why PGD-based adversarial training is an approximation: it replaces inner max with finite-step gradient ascent, incurring approximation error. Methods that avoid explicit Hessian (one-step attacks, gradient approximations) trade off ‘exactness’ for computational efficiency. Understanding bilevel structure motivates research into efficient outer loop optimization (e.g., accelerated methods, second-order approximations).

Generalization & Edge Cases: (1) Non-convex inner problem: implicit differentiation may not apply (multiple local optima). (2) Constraint boundaries: if $\delta^*$ is on boundary of threat set, differentiability fails; KKT conditions needed instead. (3) Degenerate Hessians: inversion fails; pseudoinverse or regularization required. (4) Stochastic setting: noise in gradients from Monte Carlo approximation adds variance to bilevel gradient.

Failure Mode Analysis: Hessian inversion is numerically unstable if condition number is large (common for neural networks near saddle points). Approximations (e.g., using diag(Hessian) or low-rank approximations) degrade accuracy, potentially leading to incorrect descent directions. In practice, PGD-based training sidesteps this by not solving inner problem exactly, accepting approximate solutions.

Historical Context: Bilevel optimization studied in economics and control (Stackelberg games). Application to adversarial training formalized explicitly in recent works (Shafahi et al., 2019, via implicit differentiation; Lorraine et al., 2020, connections to game theory). Earlier PGD-based training (Madry et al., 2018) implicitly uses bilevel structure.

Traps: Assuming bilevel formulation always has solutions ($\delta^*, \theta^*$ may not exist if inner problem is unbounded or degenerate). Confusing implicit differentiation feasibility with convergence of the algorithm (existence of gradient ≠ algorithm will converge). Believing sophisticated bilevel solvers always beat simple PGD (often not true in practice; simplicity of PGD outweighs computational gains from approximation).

B.17 Full Formal Proof (Wasserstein DRO and extremal distributions): Distributional robustness: $\min_f \max_{P: W(P, D) \leq \rho} \mathbb{E}_{(x,y) \sim P}[\ell(f(x), y)]$ where $W$ is the Wasserstein distance and $D$ is the empirical distribution. Wasserstein ball: $\mathcal{P}_\rho = \{P : W(P, D) \leq \rho\}$. Theorem: the worst-case distribution $P^*$ is supported on the original data $\{(x_i, y_i)\}$ together with perturbed copies of the $x_i$ that use up the transport budget (in the simplest case, perturbed examples at distance exactly $\rho$). Proof: by Kantorovich duality (in its Kantorovich-Rubinstein form), $W(P, D) = \max_{\phi: \|\phi\|_L \leq 1} [\mathbb{E}_P[\phi(x, y)] - \mathbb{E}_D[\phi(x, y)]]$, where the maximum is over 1-Lipschitz functions $\phi$. For worst-case loss, we want $P$ to maximize loss while staying within the Wasserstein ball. Convex duality: $\max_P \mathbb{E}_P[\ell]$ subject to $W(P, D) \leq \rho$ has dual $\min_{\lambda \geq 0} \{\lambda \rho + \mathbb{E}_{(x,y) \sim D}[\max_{x'} (\ell(f(x'), y) - \lambda \mathbf{c}(x', x))]\}$ where $\mathbf{c}$ is the transport cost. By duality, the extremal $P^*$ is supported on points maximizing the Lagrangian: for each data point $(x_i, y_i)$, the points $x'$ where $\ell(f(x'), y_i) - \lambda \mathbf{c}(x', x_i)$ is maximal. For $\mathbf{c}$ as ℓ2 distance in input space, these are the original data $\{x_i\}$ plus perturbed versions (where the constraint is active). Explicit form, to first order in $\rho$: $P^* = \sum_i p^*_i \delta_{(x_i, y_i)} + \sum_i q^*_i \delta_{(x_i + \rho \frac{\nabla_x \ell}{\|\nabla_x \ell\|_2}, y_i)}$ where $\delta_{(\cdot)}$ denotes a point mass, $p^*_i, q^*_i \geq 0$ are optimal probabilities, and $\nabla_x \ell$ is the gradient at $x_i$ (direction of steepest ascent in loss). More generally the budget $\rho$ bounds the average transport cost, so individual displacements can differ from $\rho$.

Proof Strategy & Techniques: Applies Monge-Kantorovich duality theory to distributional robustness. Uses convex duality to characterize extremal $P^*$. Relies on fact that optimal transport between empirical and worst-case distribution is achieved by perturbing original points.

Computational Validation: Binary classification on 2D data: $x_1 = [0, 0], x_2 = [2, 0], y_1 = -1, y_2 = 1$. Wasserstein radius $\rho = 0.5$. Worst-case distribution per Theorem: the original points appear with some probability, plus perturbed versions at distance 0.5 along the loss-maximizing direction. For a linear classifier $f(x) = w^T x$ with a margin-based loss, the input gradient $\nabla_x \ell$ is proportional to $-y \, w$, so the worst-case perturbation moves each point along $\pm w$ toward the decision boundary (direction of maximum loss increase). The extremal distribution is explicitly computable.
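
For the two-point example, the perturbed support points of the extremal distribution can be written down directly. The sketch below assumes a hypothetical linear classifier $f(x) = w^T x + b$ with $w = [1, 0]$ and $b = -1$ (not specified in the text) and a margin-based loss, for which the loss-maximizing direction is $-y\,w/\|w\|_2$. It computes only the displaced points, not the optimal weights $p^*_i, q^*_i$.

```python
import math

def worst_case_point(x, y, w, rho):
    """Move x by rho along -y * w / ||w||_2, the margin-decreasing direction."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [xi - rho * y * wi / norm for xi, wi in zip(x, w)]

def margin(x, y, w, b):
    """Signed margin y * (w^T x + b); larger means a more confident correct prediction."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

# Data from the example; the classifier parameters are illustrative assumptions.
points = [([0.0, 0.0], -1), ([2.0, 0.0], 1)]
w, b, rho = [1.0, 0.0], -1.0, 0.5

for x, y in points:
    x_adv = worst_case_point(x, y, w, rho)
    print(x, "->", x_adv, "margin", margin(x, y, w, b), "->", margin(x_adv, y, w, b))
```

Each margin drops by exactly $\rho\|w\|_2$, the same quantity that appears in the adversarial-training analysis of linear models, which is the link between the transport-cost view and $\ell_2$ perturbation attacks.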

ML Interpretation: Wasserstein robustness can be computed efficiently via worst-case distribution on finite support. This motivates algorithms: instead of solving distributional robustness analytically, search over mixture of original + perturbed data. Computationally, this is similar to adversarial training (finding worst-case perturbations), but interpretation is via transportation cost.

Generalization & Edge Cases: (1) Very small $\rho$ (Wasserstein ball tiny): extremal distribution approaches empirical (robustness is vacuous). (2) Very large $\rho$ (large ball): extremal distribution can be arbitrarily far, including pathological cases. (3) Different cost function $\mathbf{c}$ (other than ℓ2 distance): extremal distribution changes (e.g., for ℓ1, perturbed points at ℓ1 distance $\rho$). (4) Discrete data (categorical features): Wasserstein distance not standard; theorems may not apply.

Failure Mode Analysis: If Wasserstein radius $\rho$ is not calibrated to data scale, robustness may be uninformative. On high-dimensional data, Wasserstein distance is poorly scaled (curse of dimensionality); actual Wasserstein ball is very large. Theorem assumes Lipschitz loss; non-Lipschitz losses require different formulation. Computing extremal distribution requires solving a convex program, which is tractable but not as simple as gradient-based attacks.

Historical Context: Distributionally robust optimization with Wasserstein geometry by Esfahani & Kuhn (2018) and others. Application to adversarial robustness (connection to adversarial training) explored in recent papers (Lam, 2019; Lee et al., 2020).

Traps: Confusing Wasserstein ball with ℓp balls (very different geometry, especially in high dimensions). Assuming extremal distribution on finite support is optimal (true by theorem, but doesn’t mean algorithms find it easily). Thinking Wasserstein robustness is “better” than ℓp robustness (different tradeoffs, not better/worse). Forgetting theorem applies to specific loss and cost functions; generalizations are non-trivial.

B.18 Full Formal Proof (Lipschitz of convex combination): Let $f_1, f_2: \mathbb{R}^d \to \mathbb{R}^k$ be $L_1, L_2$-Lipschitz respectively (under the $\ell_2$ norm in both domain and codomain). Define $f_3(x) = \alpha f_1(x) + (1 - \alpha) f_2(x)$ with $\alpha \in [0, 1]$. For any $x, x'$: $\|f_3(x) - f_3(x')\|_2 = \|\alpha(f_1(x) - f_1(x')) + (1-\alpha)(f_2(x) - f_2(x'))\|_2 \leq \alpha \|f_1(x) - f_1(x')\|_2 + (1-\alpha) \|f_2(x) - f_2(x')\|_2 \leq \alpha L_1 \|x - x'\|_2 + (1-\alpha) L_2 \|x - x'\|_2 = (\alpha L_1 + (1-\alpha) L_2) \|x - x'\|_2$. Thus $L_{f_3} \leq \alpha L_1 + (1-\alpha) L_2$. Tightness: the bound is attained when the Lipschitz constants of $f_1$ and $f_2$ are achieved in aligned directions, for instance when both are linear maps that are positive multiples of each other. Example: $f_1(x) = x$ (Lipschitz 1), $f_2(x) = 2x$ (Lipschitz 2), $\alpha = 0.5$. Then $f_3(x) = 1.5 x$ (Lipschitz 1.5), achieving $0.5 \cdot 1 + 0.5 \cdot 2 = 1.5$ exactly. For non-linear or misaligned functions the bound may be loose: with $f_1(x) = (x, x^2)$ and $f_2(x) = (x, -x^2)$ on a bounded interval and $\alpha = 0.5$, the quadratic components cancel and $f_3(x) = (x, 0)$ is 1-Lipschitz, although each $f_i$ has a larger constant. A tight bound therefore requires extra conditions.

Proof Strategy & Techniques: Triangle inequality for weighted combinations. Uses absolute homogeneity of the norm to pull out the weights $\alpha$ and $1 - \alpha$, then applies each function's Lipschitz bound. Extension to multiple functions is straightforward (induction).

Computational Validation: $f_1(x) = \sin(x)$ (Lipschitz 1), $f_2(x) = 2\sin(x)$ (Lipschitz 2), $\alpha = 1/3$. $f_3 = \frac{1}{3}\sin(x) + \frac{2}{3}(2\sin(x)) = \frac{5}{3}\sin(x)$ (Lipschitz $5/3 = 1/3 + 2/3 \cdot 2$). Achieves bound exactly.
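The check above can be reproduced numerically. This is a small sketch under stated assumptions: the helper `lipschitz_estimate` is introduced here for illustration and only lower-bounds the true constant by the steepest slope on a fine grid. The last lines add a misaligned pair ($g_2 = -g_1$) to show that the bound can be far from tight.

```python
import numpy as np

def lipschitz_estimate(f, xs):
    """Largest slope between neighboring grid points (a lower bound on the Lipschitz constant)."""
    vals = f(xs)
    return np.max(np.abs(np.diff(vals)) / np.diff(xs))

xs = np.linspace(-10.0, 10.0, 200001)
alpha = 1.0 / 3.0

f1 = np.sin
f2 = lambda x: 2.0 * np.sin(x)
f3 = lambda x: alpha * f1(x) + (1 - alpha) * f2(x)

L1, L2, L3 = (lipschitz_estimate(f, xs) for f in (f1, f2, f3))
bound = alpha * L1 + (1 - alpha) * L2

# Misaligned pair: g2 = -g1 cancels, so the equal-weight combination is constant.
g3 = lambda x: 0.5 * np.sin(x) + 0.5 * (-np.sin(x))
L_cancel = lipschitz_estimate(g3, xs)

print(f"L1={L1:.4f}, L2={L2:.4f}, L3={L3:.4f}, bound={bound:.4f}")
print(f"cancelling pair: L={L_cancel:.4f} vs bound={0.5 * L1 + 0.5 * L1:.4f}")
```

For the aligned pair the estimate of $L_{f_3}$ matches the bound $5/3$; for the cancelling pair the combination has constant 0 while the bound stays at 1.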

ML Interpretation: Ensembling diverse models preserves Lipschitz structure. An ensemble of models with different Lipschitz constants is itself Lipschitz, with constant bounded by the convex combination of the individual constants. This is useful in understanding ensemble robustness: an average of models with certified Lipschitz bounds inherits a certified bound.

Generalization & Edge Cases: (1) $\alpha = 0$ or $1$: reduces to $f_2$ or $f_1$ respectively. (2) Different norms (ℓ1, ℓ∞): need norm-compatible Lipschitz definitions; generalization holds similarly but constants differ. (3) Finite-dimensional vs. infinite-dimensional spaces: proof generalizes to Banach spaces with appropriate norms. (4) Complex functions (non-real codomain): Lipschitz definition requires appropriate distance in codomain.

Failure Mode Analysis: If Lipschitz constants are not computed correctly (estimated or upper-bounded loosely), the bound is loose. For models trained with different objectives (e.g., one robust, one standard), their Lipschitz constants can be very different; averaging may not preserve the desired robustness. If $\alpha$ is very skewed (e.g., $\alpha = 0.99$), the ensemble is dominated by one model and the benefit of diversity is lost.

Historical Context: Convex combination of Lipschitz functions basic in functional analysis. Application to robust ensemble models modern (e.g., Carlini & Kurakin, 2019, on ensemble robustness).

Traps: Confusing the Lipschitz constant of the convex combination with the Lipschitz constants of the individual models (not comparable without knowing $\alpha$). Assuming the ensemble always has a Lipschitz constant between $L_1$ and $L_2$ (the theorem gives only the upper bound $\alpha L_1 + (1-\alpha) L_2$; the true constant can fall below $\min(L_1, L_2)$, e.g., when $f_2 = -f_1$). Thinking convex combination is the only way to combine models; other methods (voting, product of probabilities) have different properties. Forgetting that a tight bound requires aligned functions; misaligned models may have a loose bound.

B.19 Full Formal Proof (PGD as Lagrangian Mirror Descent): Primal-dual robust optimization problem: $\min_\theta \max_{\delta \in \mathcal{C}} L(\theta, \delta)$ with $L(\theta, \delta) = \mathbb{E}[\ell(f_\theta(x + \delta), y)]$ and threat set $\mathcal{C}$. Writing the threat set as $\mathcal{C} = \{\delta : h(\delta) \leq 0\}$ (e.g., $h(\delta) = \|\delta\| - \epsilon$), the Lagrangian is $\mathcal{L}(\theta, \delta, \lambda) = \mathbb{E}[\ell] - \lambda^T h(\delta)$, where $\lambda \geq 0$ are Lagrange multipliers for the constraint. Dual: $g(\lambda) = \min_\theta \max_\delta [\ell - \lambda^T h(\delta)]$ (rearranged as a saddle-point problem). PGD outer loop: $\theta_{t+1} = \theta_t - \eta_\theta \nabla_\theta \ell(f_\theta(x+\delta_t), y)$ (gradient descent on $\theta$). PGD inner loop: $\delta_t = \text{Proj}_{\mathcal{C}}[\delta_{t-1} + \eta_\delta \nabla_\delta \ell(f_\theta(x+\delta_{t-1}), y)]$ (gradient ascent on $\delta$ with projection). Mirror descent perspective: Define the divergence $D_\Psi(\delta, \delta') = \Psi(\delta) - \Psi(\delta') - (\nabla \Psi(\delta'))^T (\delta - \delta')$ (Bregman divergence for regularizer $\Psi$). Lagrangian mirror descent: $\delta_{t+1} = (\nabla \Psi)^{-1}[\nabla \Psi(\delta_t) + \eta_\delta \nabla_\delta \ell]$ (implicit update). With the Euclidean regularizer $\Psi(\delta) = \frac{1}{2}\|\delta\|^2$, this becomes standard gradient ascent: $\delta_{t+1} = \delta_t + \eta_\delta \nabla_\delta \ell$. Projection onto $\mathcal{C}$ is the Lagrangian constraint enforcement: $\delta_{t+1} = \arg\min_{\delta' \in \mathcal{C}} D_\Psi(\delta', \delta_t + \eta_\delta \nabla_\delta \ell)$ (one projected gradient step). Equivalence: PGD with Euclidean distance corresponds to Lagrangian mirror descent with Euclidean regularization. Conversion to Lagrangian dual: dual variables $\lambda_t$ track constraint violation: $\lambda_{t+1} = [\lambda_t + \rho\, h(\delta_t)]_+$, a projected subgradient step on the Lagrangian in $\lambda$ with step size $\rho$, which raises the multiplier when the constraint is violated ($h(\delta_t) > 0$). Combined algorithm: interleaved updates of $\theta$ and $\lambda$ (descent) and $\delta$ (ascent) converge to a saddle-point equilibrium under convex-concave assumptions.

Proof Strategy & Techniques: Reformulates PGD as saddle-point optimization in Lagrangian form. Uses mirror descent theory (proximal methods in Bregman geometry) to characterize updates. Connects discrete PGD steps to continuous-time mirror flow.

Computational Validation: Quadratic loss $\ell = \|f_\theta(x + \delta) - y\|^2$, linear model $f(x) = Ax$. PGD inner: $\delta_{t+1} = \text{Proj}_{\mathcal{C}}[\delta_t + \eta \cdot 2 A^T(Ax + A\delta_t - y)]$ (explicit gradient $\nabla_\delta \ell = 2 A^T(A(x + \delta) - y)$). Mirror descent dual on Lagrangian: equivalent steps with implicit form. Convergence rates: both PGD and mirror descent converge at $O(1/\sqrt{t})$ for convex settings, $O(1/t)$ for strongly convex.
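The inner loop of this validation can be written out directly. The sketch below is illustrative only: the matrix `A`, the point `x`, the target `y`, the radius `eps`, and the step size `eta` are all made-up values, and the threat set is taken to be an ℓ2 ball so that the Euclidean projection is a simple rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))      # hypothetical linear model f(x) = A x
x = rng.normal(size=4)
y = rng.normal(size=3)
eps, eta, steps = 0.5, 0.05, 200

def loss(delta):
    """Quadratic loss ||A (x + delta) - y||^2."""
    r = A @ (x + delta) - y
    return float(r @ r)

def grad(delta):
    """Gradient with respect to delta: 2 A^T (A x + A delta - y)."""
    return 2.0 * A.T @ (A @ (x + delta) - y)

def project_l2(delta, radius):
    """Euclidean projection onto the l2 ball of the given radius."""
    n = np.linalg.norm(delta)
    return delta if n <= radius else delta * (radius / n)

# Projected gradient ascent on delta (the PGD inner loop).
delta = np.zeros(4)
history = [loss(delta)]
for _ in range(steps):
    delta = project_l2(delta + eta * grad(delta), eps)
    history.append(loss(delta))

print(f"loss at delta=0: {history[0]:.4f}, after PGD: {history[-1]:.4f}")
print(f"||delta|| = {np.linalg.norm(delta):.4f} (budget {eps})")
```

Because the loss is convex in $\delta$, each projected ascent step cannot decrease it, so the recorded `history` is non-decreasing. This monotonicity is specific to the convex case and does not carry over to neural networks.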

ML Interpretation: Understanding PGD as mirror descent clarifies its connection to convex optimization theory. Modern accelerated methods (Nesterov momentum, etc.) from convex optimization can be adapted to adversarial training. The dual perspective suggests alternative algorithms (e.g., dual ascent on $\lambda$) for solving robust problems.

Generalization & Edge Cases: (1) Non-convex neural networks: mirror descent theory breaks down; convergence is only to critical points, not global optimum. (2) Non-Euclidean regularizers: different geometries (e.g., divergence-based, entropy regularization) yield different algorithms and convergence rates. (3) Stochastic setting: with mini-batches, convergence rates worsen (variance added). (4) Adaptive learning rates: standard mirror descent uses fixed $\eta$; adaptive rates (Adam-style) have different convergence guarantees.

Failure Mode Analysis: PGD only approximates the inner maximization, so it solves the true minimax problem inexactly and the loss it finds lower-bounds the worst case. The connection to Lagrangian mirror descent assumes convexity, which neural networks violate; in practice, the algorithm may not converge to equilibrium or may oscillate. Non-adaptive step sizes can lead to unstable training (also seen in GAN training, a related problem).

Historical Context: Mirror descent from Nemirovsky, Yudin (1983); Bregman Proximal methods from Censor & Lent (1981). Application to adversarial training (recognizing PGD as primal-dual algorithm) explored more recently (e.g., Zheng et al., 2019, on connections to game theory).

Traps: Conflating connection to mirror descent with convergence (existing theory helps understanding, not guaranteed for neural networks). Assuming Lagrangian dual is easier to solve (often not; solving dual directly can be as hard as primal). Thinking fixed-step PGD converges fast (it can be slow; adaptive or accelerated methods better). Forgetting non-convexity breaks assumptions of mirror descent theory.

B.20 Full Formal Proof (Spectral norm and Lipschitz function): Let $f = f_L \circ \ldots \circ f_1$ with spectral normalization: $\|\mathbf{W}_i\|_{\text{sp}} \leq 1$ for each layer $i$. By submultiplicativity of Lipschitz constants under composition: $L_f \leq \prod_{i=1}^L L_i$ where $L_i$ is the Lipschitz constant of layer $i$. For layer $f_i(x) = \sigma(\mathbf{W}_i x + b_i)$, if $\|\mathbf{W}_i\|_{\text{sp}} = 1$ and $\sigma$ is 1-Lipschitz (e.g., ReLU), then $L_i \leq 1$. Overall: $L_f \leq 1$, with equality only when every layer attains its bound along a common direction. Relationship to activation: (1) ReLU: $\sigma(z) = \max(0, z)$ is 1-Lipschitz everywhere, with slope exactly 1 on active (positive) units and 0 on inactive ones. (2) Tanh: $\sigma(z) = \tanh(z)$ is 1-Lipschitz (derivative max is 1 at $z = 0$), but in regions far from the origin the derivative is near zero, so the effective Lipschitz constant can be much smaller. (3) Sigmoid: $\sigma(z) = 1/(1 + e^{-z})$ is $\frac{1}{4}$-Lipschitz (derivative max is $1/4$ at $z = 0$) and saturates like Tanh. (4) Leaky ReLU: $\sigma(z) = \max(\alpha z, z)$ with $\alpha \in (0, 1)$ is $1$-Lipschitz and reduces to ReLU at $\alpha = 0$. Tight relationship: For ReLU networks with $\|\mathbf{W}_i\|_{\text{sp}} = 1$, the bound 1 can be attained only where the relevant units are active (pre-activations positive, slope 1) and the top singular directions of consecutive layers align. Where some units are inactive, the local Lipschitz constant is typically lower, and it is 0 where every unit in a layer is inactive. For smooth activations (Tanh, Sigmoid), the spectral norm constraint gives $L_f \leq 1$, but tightness depends on activation saturation. Derivation of dependence on depth and width: for depth $L$ and widths $d_i$ (hidden dimensions), if all spectral norms are 1, the bound does not depend on $L$ (a product of 1’s is 1) or on width. However, the effective Lipschitz constant (in specific regions) can be lower for very deep networks (vanishing gradients due to composition). Width affects the Lipschitz constant via the number of activation patterns; higher width doesn’t increase it if spectral norms are controlled.

Proof Strategy & Techniques: Uses spectral norm submultiplicativity for matrix products. Applies activation function Lipschitz bounds layer by layer. Characterizes tightness by analyzing when all layers achieve maximal Lipschitz.

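Computational Validation: 2-layer ReLU network with $\mathbf{W}_1, \mathbf{W}_2$ both having full rank with largest singular value 1. $f(x) = \mathbf{W}_2 \sigma(\mathbf{W}_1 x)$. For $x$ with all pre-activations $\mathbf{W}_1 x$ positive, $f$ is locally the linear map $\mathbf{W}_2 \mathbf{W}_1$, so the local Lipschitz constant is $\|\mathbf{W}_2 \mathbf{W}_1\|_{\text{sp}} \leq 1 \cdot 1 = 1$, tight when the top right singular vector of $\mathbf{W}_2$ aligns with the top left singular vector of $\mathbf{W}_1$. For $x$ with mixed signs in $\mathbf{W}_1 x$: some ReLU gradients are zero, and the effective Lipschitz constant is typically $< 1$ locally.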
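A quick numerical version of this validation is sketched below. The layer sizes, random weights, and helper names (`normalize_spectral`, `local_lipschitz`) are assumptions made for illustration; the local Lipschitz constant at a point is computed as the spectral norm of the Jacobian $\mathbf{W}_2\, \text{diag}(\mathbb{1}[\mathbf{W}_1 x > 0])\, \mathbf{W}_1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize_spectral(W):
    """Rescale W so that its largest singular value is 1."""
    return W / np.linalg.svd(W, compute_uv=False)[0]

# Hypothetical 4 -> 8 -> 3 ReLU network with spectrally normalized layers.
W1 = normalize_spectral(rng.normal(size=(8, 4)))
W2 = normalize_spectral(rng.normal(size=(3, 8)))

def local_lipschitz(x):
    """Spectral norm of the Jacobian W2 diag(1[W1 x > 0]) W1 at the point x."""
    active = (W1 @ x > 0).astype(float)
    J = W2 @ (active[:, None] * W1)
    return np.linalg.svd(J, compute_uv=False)[0]

samples = rng.normal(size=(1000, 4))
local = np.array([local_lipschitz(x) for x in samples])

# Jacobian norm if every hidden unit were active (the locally linear map W2 W1).
all_on = np.linalg.svd(W2 @ W1, compute_uv=False)[0]

print(f"product bound: 1.000, all-units-active norm: {all_on:.3f}")
print(f"local Lipschitz over samples: max={local.max():.3f}, mean={local.mean():.3f}")
```

With random weights the singular directions of the two layers are generally not aligned, so even the all-active case falls below the product bound of 1.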

ML Interpretation: Spectral normalization ensures networks that are at most 1-Lipschitz (or, more generally, have a bounded Lipschitz constant). For GAN discriminators, the Lipschitz constraint stabilizes training (SN-GAN, Miyato et al., 2018). The choice of activation (Tanh vs ReLU) affects how tightly the spectral norm bound controls the Lipschitz constant. Deep networks with spectral constraints remain Lipschitz (don’t explode), but the practical effective Lipschitz constant can be lower due to mixed activation regimes.

Generalization & Edge Cases: (1) Residual connections: $f(x) = \mathbf{W} \sigma(x) + x$ (skip connection). The Lipschitz bound becomes $L_{\text{skip}} \leq \|\mathbf{W}\|_{\text{sp}} \sigma'_{\max} + 1$ (sum due to additive combination). If $\|\mathbf{W}\|_{\text{sp}} = 1, \sigma'_{\max} = 1$, the bound is $2$ (not 1), so spectral normalization needs adjustment. (2) Batch normalization: introduces a normalization layer; Lipschitz analysis becomes more complex (depends on batch statistics). (3) Mixed depth/width: very deep networks with narrow hidden layers have a different Lipschitz-depth relationship than shallow wide networks (empirically, very deep networks can have a small effective Lipschitz constant due to composition of small-range functions). (4) Non-full-rank matrices: if some singular values are 0, the Lipschitz constant can be much smaller (collapsed dimensions).
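The residual-connection case (1) can be checked in the locally linear regime $\sigma' = 1$, where the block is the matrix $I + \mathbf{W}$. The matrices below are random illustrations chosen here, not taken from the text; the symmetric positive semidefinite case is included because it attains the bound exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
I = np.eye(5)

# Generic W with spectral norm 1: the residual map I + W has norm at most 2.
W = M / np.linalg.svd(M, compute_uv=False)[0]
residual_norm = np.linalg.svd(I + W, compute_uv=False)[0]

# Symmetric PSD S with top eigenvalue 1: I + S attains the bound 1 + ||S||_sp = 2.
S = M @ M.T
S = S / np.linalg.eigvalsh(S)[-1]
residual_norm_psd = np.linalg.svd(I + S, compute_uv=False)[0]

print(f"||I + W||_sp = {residual_norm:.3f} (bound 2)")
print(f"||I + S||_sp = {residual_norm_psd:.3f} (bound attained)")
```

One way to read this: normalizing $\mathbf{W}$ alone leaves a residual block with Lipschitz constant up to 2, so a stack of such blocks can grow geometrically with depth unless the branch is scaled down further.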

Failure Mode Analysis: Spectral normalization per layer doesn’t guarantee tight Lipschitz for network; it’s an upper bound. In practice, networks trained with spectral norm 1 often have effective Lipschitz $< 1$ in most data regions. Achieving tight Lipschitz 1 requires carefully designed inputs and architecture. Over-relying on spectral normalized networks for certified robustness can be misleading if the Lipschitz is not actually tight.

Historical Context: Spectral normalization for GANs by Miyato et al. (2018). Connection to Lipschitz constraint rigorous in later works (Gouk et al., 2019, on certified Lipschitz networks). Depth/width dependence studied in recent works on neural network Lipschitz (Kumar et al., 2020 etc.).

Traps: Confusing spectral norm ≤ 1 per layer with tight global Lipschitz 1 (upper bound ≠ achieved). Assuming ReLU always achieves Lipschitz 1 (depends on region; mixed activation regimes reduce it). Thinking depth and width don’t affect Lipschitz (they affect composition/convergence numerically, though not asymptotically if all spectral norms are fixed). Forgetting non-linear interaction of layers affects effective Lipschitz; linear analysis (product of norms) is only a bound.

End of B Solutions

Solutions to C. Python Exercises

C.1 — FGSM Attack Implementation

Code:

import torch
import torch.nn as nn
import numpy as np
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Load MNIST
transform = transforms.Compose([transforms.ToTensor()])
test_set = datasets.MNIST(root='./data', train=False, download=True, transform=transform)
test_loader = DataLoader(test_set, batch_size=100, shuffle=False)

# Simple model
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
model.load_state_dict(torch.load('mnist_model.pth'))  # Assume pre-trained
model.eval()

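def fgsm_attack(x, y, model, eps, norm='linf'):
    """Single-step attack: move x by eps along the loss gradient (sign for linf, unit vector for l2)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.CrossEntropyLoss()(model(x_adv), y)
    # Gradient w.r.t. the input only; model parameter gradients are left untouched
    grad = torch.autograd.grad(loss, x_adv)[0]

    if norm == 'linf':
        delta = eps * torch.sign(grad)
    elif norm == 'l2':
        # Per-example l2 norm over all non-batch dimensions
        grad_norm = grad.flatten(1).norm(p=2, dim=1).view(-1, *([1] * (x.dim() - 1)))
        delta = eps * grad / (grad_norm + 1e-8)
    else:
        raise ValueError(f"unsupported norm: {norm}")

    # Keep pixels in the valid [0, 1] range
    return torch.clamp(x + delta, 0, 1).detach()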

def true_class_margin(logits, y):
    """Logit of the true class minus the largest other logit (positive iff correctly classified)."""
    true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    other = logits.clone()
    other.scatter_(1, y.unsqueeze(1), float('-inf'))  # mask out the true class
    return true_logit - other.max(dim=1)[0]

# Evaluate
eps_values = [0.0, 0.05, 0.1, 0.15, 0.2, 0.3]
results = []

for eps in eps_values:
    success_count = 0
    margin_reductions = []

    for x, y in test_loader:
        with torch.no_grad():
            logits_clean = model(x)
            preds_clean = logits_clean.argmax(dim=1)
            margins_clean = true_class_margin(logits_clean, y)

        x_adv = fgsm_attack(x, y, model, eps, norm='linf')

        with torch.no_grad():
            logits_adv = model(x_adv)
            preds_adv = logits_adv.argmax(dim=1)
            margins_adv = true_class_margin(logits_adv, y)

        # Count only examples that were correct before the attack and wrong after it
        success_count += ((preds_clean == y) & (preds_adv != y)).sum().item()
        margin_reductions.extend((margins_clean - margins_adv).cpu().numpy())

    success_rate = 100 * success_count / len(test_set)
    mean_margin_reduction = np.mean(margin_reductions)
    results.append((eps, success_rate, mean_margin_reduction))
    print(f"ε={eps}: Success Rate={success_rate:.1f}%, Mean Margin Reduction={mean_margin_reduction:.3f}")

Expected Output:

ε=0.0: Success Rate=0.0%, Mean Margin Reduction=0.000
ε=0.05: Success Rate=15.3%, Mean Margin Reduction=0.157
ε=0.1: Success Rate=44.2%, Mean Margin Reduction=0.523
ε=0.15: Success Rate=65.8%, Mean Margin Reduction=0.892
ε=0.2: Success Rate=76.5%, Mean Margin Reduction=1.241
ε=0.3: Success Rate=82.1%, Mean Margin Reduction=1.685

Numerical / Shape Notes:
- Shape: x ∈ [1, 28, 28] (MNIST), logits ∈ [100, 10] (batch, classes)
- Success rate is roughly sigmoidal in ε, S(ε) = 1/(1+exp(-k(ε-ε₅₀))); in the table above the 50% threshold ε₅₀ lies between ε=0.1 and ε=0.15
- Margin reduction is roughly linear in ε: ≈5-6×ε for ε ≥ 0.1 in the table above (consistent with Theorem 3)
- Gradient norm: ≈0.5-1.5 across test set (high variance indicates vulnerability range)

C.2 — PGD Attack with Iterations

Code:

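def project_l2(delta, eps):
    """Rescale each example's perturbation so its l2 norm is at most eps."""
    norms = delta.flatten(1).norm(p=2, dim=1).view(-1, *([1] * (delta.dim() - 1)))
    return delta * torch.clamp(eps / (norms + 1e-12), max=1.0)

def pgd_attack(x, y, model, eps, alpha=0.02, steps=20, norm='linf'):
    """Iterated gradient ascent on the loss, projected back onto the eps-ball after each step."""
    if norm not in ('linf', 'l2'):
        raise ValueError(f"unsupported norm: {norm}")
    # Random start in the linf box; for l2 it is projected onto the l2 ball
    delta = torch.zeros_like(x).uniform_(-eps, eps)
    if norm == 'l2':
        delta = project_l2(delta, eps)
    delta.requires_grad_(True)

    for _ in range(steps):
        loss = nn.CrossEntropyLoss()(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]

        with torch.no_grad():
            if norm == 'linf':
                delta += alpha * torch.sign(grad)
                delta.clamp_(-eps, eps)
            else:
                # Normalize the gradient per example, then project onto the l2 ball
                grad_norm = grad.flatten(1).norm(p=2, dim=1).view(-1, *([1] * (x.dim() - 1)))
                delta += alpha * grad / (grad_norm + 1e-8)
                delta.copy_(project_l2(delta, eps))

    return torch.clamp(x + delta.detach(), 0, 1)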

# Convergence analysis
step_values = [1, 5, 10, 20, 40, 100]
convergence_results = []

for steps in step_values:
    success_count = 0
    loss_trajectory = []
    
    for x, y in test_loader:
        x_adv = pgd_attack(x, y, model, eps=0.3, alpha=0.02, steps=steps, norm='linf')
        
        with torch.no_grad():
            logits_adv = model(x_adv)
            preds_adv = logits_adv.argmax(dim=1)
            loss_final = nn.CrossEntropyLoss()(logits_adv, y)
        
        success_count += (preds_adv != y).sum().item()
        loss_trajectory.append(loss_final.item())
    
    success_rate = 100 * success_count / len(test_set)
    mean_loss = np.mean(loss_trajectory)
    convergence_results.append((steps, success_rate, mean_loss))
    print(f"T={steps}: Success Rate={success_rate:.1f}%, Mean Loss={mean_loss:.3f}")

Expected Output:

T=1: Success Rate=82.1%, Mean Loss=1.234
T=5: Success Rate=93.5%, Mean Loss=1.845
T=10: Success Rate=95.8%, Mean Loss=2.103
T=20: Success Rate=97.2%, Mean Loss=2.256
T=40: Success Rate=98.1%, Mean Loss=2.289
T=100: Success Rate=98.7%, Mean Loss=2.298

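Numerical / Shape Notes:
- Convergence: success rate saturates at T≈20-40 (diminishing returns)
- Loss increase: gains shrink quickly with T (in the table above, +0.61 from T=1 to T=5 but only +0.03 from T=20 to T=40)
- Step size α: the total travel α·T should be on the order of ε or larger so that iterates can reach the boundary of the ball (here 0.02 ≈ 0.3/15, so about 15 steps cover a distance of ε)
- Computational cost: 100 steps ≈ 5× cost of 20 steps (one forward-backward pass per step)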

C.3 — Lipschitz Constant via Spectral Norms

Code:

import torch.linalg as LA

def compute_spectral_norm(weight_matrix):
    """Compute the largest singular value of a weight tensor"""
    if weight_matrix.dim() > 2:
        # Conv layer: reshape [out, in, h, w] → [out, in*h*w]
        # (norm of the reshaped kernel, a proxy for the norm of the convolution operator)
        weight_matrix = weight_matrix.reshape(weight_matrix.size(0), -1)
    # Singular values are returned in descending order
    return LA.svdvals(weight_matrix)[0].item()

# Measure the Lipschitz upper bound during training
# (train_loader is assumed to be an MNIST training DataLoader built like test_loader)
params = list(model.parameters())
spectral_norms_per_layer = [[] for _ in params]
overall_L_f = []

for epoch in range(30):
    for x, y in train_loader:
        # ... training step ...
        pass

    if epoch % 3 == 0:
        L_f = 1.0  # product of per-layer spectral norms (upper bound for 1-Lipschitz activations)
        for i, param in enumerate(params):
            if param.dim() >= 2:  # Weight matrices only
                sigma = compute_spectral_norm(param.data)
                spectral_norms_per_layer[i].append(sigma)
                L_f *= sigma
        overall_L_f.append(L_f)
        print(f"Epoch {epoch}: L_f = {L_f:.2f}")

# Empirical Lipschitz estimation
def empirical_lipschitz(model, dataset, n_samples=1000):
    ratios = []
    for _ in range(n_samples):
        idx1, idx2 = np.random.choice(len(dataset), 2, replace=False)
        x1, x2 = dataset[idx1][0], dataset[idx2][0]
        
        with torch.no_grad():
            f1 = model(x1.unsqueeze(0))
            f2 = model(x2.unsqueeze(0))
        
        dist_input = torch.norm(x1 - x2).item()
        dist_output = torch.norm(f1 - f2).item()
        
        if dist_input > 1e-5:
            ratios.append(dist_output / dist_input)
    
    return np.percentile(ratios, 95)

empirical_L = empirical_lipschitz(model, test_set)
theoretical_L = overall_L_f[-1]
print(f"Theoretical L_f (product of σ): {theoretical_L:.2f}")
print(f"Empirical L_f (95th percentile): {empirical_L:.2f}")
print(f"Gap: {(theoretical_L / empirical_L - 1) * 100:.1f}%")

Expected Output:

Epoch 0: L_f = 145.23
Epoch 3: L_f = 89.45
Epoch 6: L_f = 56.78
...
Epoch 27: L_f = 23.45
Theoretical L_f (product of σ): 23.45
Empirical L_f (95th percentile): 15.67
Gap: 49.7%

Numerical / Shape Notes:
- Spectral norms per layer: typically 0.5-5 for standard networks, and they change during training (in the output above the product falls from ≈145 to ≈23)
- Product can grow exponentially with depth: L_f ≤ ∏σᵢ (the product can exceed 100 for deep networks)
- Empirical < theoretical: 95th percentile empirical ≤ theory (upper bound, not tight); ratios between random data pairs only lower-bound the true Lipschitz constant
- Certified radius r = m/L_f: margin m≈1-2, L_f≈23 gives r≈0.04-0.09 (small radii for standard training)
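The two quantities in these notes, a per-layer spectral norm and the certified radius r = m/L_f, can be illustrated with NumPy alone. This is a sketch under stated assumptions: `power_iteration_sigma` is a helper written here (a cheaper alternative to the full SVD used in the exercise), the weight matrix is random rather than trained, and L_f = 23.45 is simply the value quoted in the expected output.

```python
import numpy as np

def power_iteration_sigma(W, iters=500, seed=0):
    """Estimate the largest singular value of W by alternating W and W^T."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    # u^T W v never exceeds sigma_max and approaches it as the vectors converge
    return float(u @ W @ v)

rng = np.random.default_rng(42)
W = rng.normal(size=(128, 784)) * 0.05   # hypothetical first-layer weights
sigma_power = power_iteration_sigma(W)
sigma_svd = float(np.linalg.svd(W, compute_uv=False)[0])

# Certified l2 radius r = m / L_f for the margins and L_f quoted in the notes above
L_f = 23.45
radii = {m: m / L_f for m in (1.0, 2.0)}

print(f"power iteration: {sigma_power:.4f}, SVD: {sigma_svd:.4f}")
print({m: round(r, 3) for m, r in radii.items()})
```

Power iteration only needs matrix-vector products, which is why spectral normalization layers typically run a step or two of it per forward pass instead of a full SVD.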

C.4 — Adversarial Training with Curriculum

Code:

# Assumes train_set / train_loader (MNIST training split, built like test_set / test_loader)
# and a helper evaluate_model(model, loader, eps) returning (clean_acc, robust_acc) in percent.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
epochs = 30
K = 10  # PGD iterations

for epoch in range(epochs):
    # Curriculum epsilon: linear ramp 0.001 → 0.3, reaching 0.3 in the final epoch
    eps = 0.001 + 0.299 * (epoch / (epochs - 1))

    train_loss = 0.0
    robust_acc = 0
    clean_acc = 0

    for x, y in train_loader:
        # PGD attack (inner maximization)
        x_adv = pgd_attack(x, y, model, eps=eps, alpha=0.02, steps=K, norm='linf')

        # Train on adversarial examples (outer minimization)
        loss = nn.CrossEntropyLoss()(model(x_adv), y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

        # Track accuracy
        with torch.no_grad():
            clean_preds = model(x).argmax(dim=1)
            clean_acc += (clean_preds == y).sum().item()

            adv_preds = model(x_adv).argmax(dim=1)
            robust_acc += (adv_preds == y).sum().item()

    train_loss /= len(train_loader)  # mean loss per batch
    clean_acc = 100 * clean_acc / len(train_set)
    robust_acc = 100 * robust_acc / len(train_set)

    # Evaluate on test set at the full budget
    test_clean_acc, test_robust_acc = evaluate_model(model, test_loader, eps=0.3)

    print(f"Epoch {epoch}: Loss={train_loss:.3f}, Clean={clean_acc:.1f}%, Robust={robust_acc:.1f}% | Test: {test_clean_acc:.1f}%, {test_robust_acc:.1f}%")

Expected Output:

Epoch 0: Loss=2.156, Clean=50.2%, Robust=2.1% | Test: 48.5%, 1.8%
Epoch 5: Loss=1.203, Clean=78.5%, Robust=28.4% | Test: 77.8%, 30.2%
Epoch 10: Loss=0.845, Clean=85.3%, Robust=38.2% | Test: 84.2%, 40.1%
Epoch 15: Loss=0.634, Clean=87.9%, Robust=47.1% | Test: 87.1%, 48.5%
Epoch 20: Loss=0.523, Clean=89.1%, Robust=51.3% | Test: 88.5%, 52.0%
Epoch 29: Loss=0.412, Clean=90.2%, Robust=53.8% | Test: 89.8%, 54.2%

Numerical / Shape Notes:
- Training time: ~50 min for 30 epochs (standard ≈5 min, 10× slowdown for PGD)
- Curriculum helps: early large ε causes instability (loss oscillates, avoided via ramp)
- Tradeoff magnitude: clean accuracy drops from ≈95% (standard training) to ≈90% (adversarial training), about 5 points, in exchange for ~54% robust accuracy at ε=0.3
- Convergence: loss plateau by epoch 20, training benefit minimal after

C.5 — Margin and Robustness for Linear Models

Code:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits

# Binary classification: 0 vs. 1
X_train, y_train = load_digits(return_X_y=True, as_frame=False)
mask = np.isin(y_train, [0, 1])
X_train, y_train = X_train[mask], y_train[mask]
y_train = 2 * y_train - 1  # Convert to {-1, +1}

X_test, y_test = X_train[::10], y_train[::10]  # Hold-out test
X_train, y_train = X_train[1::10], y_train[1::10]

# Sweep regularization
lambdas = [0.0, 0.01, 0.1, 1.0]
results = []
rng = np.random.default_rng(0)

for lam in lambdas:
    # sklearn's C is the inverse regularization strength; a huge C approximates λ = 0
    model = LogisticRegression(C=1/lam if lam > 0 else 1e10, solver='lbfgs', max_iter=1000)
    model.fit(X_train, y_train)

    w = model.coef_[0]
    b = model.intercept_[0]
    L_f = np.linalg.norm(w)  # Lipschitz constant of x ↦ w·x + b under ℓ2

    # Functional margins y·(w·x + b); positive iff correctly classified
    margins = y_test * (X_test @ w + b)

    # Certified ℓ2 radii: distance from each point to the decision hyperplane
    certified_radii = margins / L_f

    # Empirical verification
    verified = 0
    for i, r in enumerate(certified_radii):
        # Sample 100 perturbations of ℓ2 norm 0.8*r in random directions
        for _ in range(100):
            direction = rng.standard_normal(X_test.shape[1])
            delta = 0.8 * r * direction / np.linalg.norm(direction)
            logit_pert = (X_test[i] + delta) @ w + b
            if np.sign(logit_pert) == y_test[i]:
                verified += 1

    verify_rate = 100 * verified / (len(X_test) * 100)
    vulnerable_pct = 100 * np.sum(certified_radii < 0.1) / len(certified_radii)

    results.append({
        'lambda': lam, 'L_f': L_f, 'margin_mean': margins.mean(),
        'radius_mean': certified_radii.mean(), 'verify_rate': verify_rate,
        'vulnerable_pct': vulnerable_pct
    })

    print(f"λ={lam}: L_f={L_f:.3f}, Margin={margins.mean():.3f}, Radius={certified_radii.mean():.3f}, Verify={verify_rate:.1f}%, Vulnerable={vulnerable_pct:.1f}%")

Expected Output:

λ=0.0: L_f=2.156, Margin=0.283, Radius=0.131, Verify=96.2%, Vulnerable=18.5%
λ=0.01: L_f=1.834, Margin=0.261, Radius=0.142, Verify=95.8%, Vulnerable=15.3%
λ=0.1: L_f=1.245, Margin=0.218, Radius=0.175, Verify=97.1%, Vulnerable=8.2%
λ=1.0: L_f=0.789, Margin=0.156, Radius=0.198, Verify=98.3%, Vulnerable=3.1%

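Numerical / Shape Notes:
- Lipschitz constant decreases with λ: ~2.2 → 0.8 in the table above (stronger regularization shrinks ‖w‖)
- Margin-Lipschitz tradeoff: as λ↑ margin↓ but radius↑ (certified radius = margin/L_f)
- Verification rate >95%: confirms certified radii are valid; for a linear model no perturbation of ℓ2 norm below r can flip a correctly classified point, so with perturbations sampled inside the certified ball any shortfall from 100% comes from test points that are already misclassified (negative margin)
- Vulnerable fraction: percentage with r<0.1 (~15% near decision boundary)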
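For a linear model the certified radius margin/L_f is exact, which can be checked in a few lines. The weights, bias, and input below are random placeholders introduced for illustration, not the fitted digits model from the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=64)   # hypothetical weight vector
b = 0.3                   # hypothetical bias
x = rng.normal(size=64)

logit = w @ x + b
y = np.sign(logit)                         # treat the current prediction as the label
radius = y * logit / np.linalg.norm(w)     # certified l2 radius = margin / L_f
unit_w = w / np.linalg.norm(w)

# Worst-case direction: straight toward the decision hyperplane
x_inside = x - 0.99 * radius * y * unit_w   # just short of the boundary
x_outside = x - 1.01 * radius * y * unit_w  # just past the boundary

# Random directions at 99% of the radius never flip the prediction
flips = 0
for _ in range(1000):
    d = rng.normal(size=64)
    d *= 0.99 * radius / np.linalg.norm(d)
    flips += int(np.sign(w @ (x + d) + b) != y)

print("inside keeps label:", np.sign(w @ x_inside + b) == y)
print("outside flips label:", np.sign(w @ x_outside + b) == -y)
print("random flips within radius:", flips)
```

The prediction survives every perturbation strictly inside the radius and flips just beyond it along $-y\,w/\|w\|$, so the certificate cannot be improved for a linear classifier.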

C.6 — Randomized Smoothing Certification

Code:

from scipy.stats import norm

def randomized_smoothing(model, x, sigma, n_samples=1000):
    """Estimate the smoothed prediction and a plug-in certified radius for one input x of shape [1, ...]"""
    with torch.no_grad():
        # Draw all noise samples at once: [n_samples, ...], broadcast against x
        noise = torch.randn(n_samples, *x.shape[1:], device=x.device) * sigma
        predictions = model(x + noise).argmax(dim=1).cpu().numpy()

    # Count votes
    unique, counts = np.unique(predictions, return_counts=True)
    top_class = unique[counts.argmax()]
    p_A = counts.max() / n_samples

    # Second-best class probability
    p_B = np.sort(counts)[-2] / n_samples if len(counts) > 1 else 0.0

    # Certified radius (Cohen et al., 2019), here with empirical frequencies in place of
    # confidence bounds; clip away from 0 and 1 so that norm.ppf stays finite
    tiny = 0.5 / n_samples
    if p_A > p_B:
        radius = (sigma / 2) * (norm.ppf(min(p_A, 1 - tiny)) - norm.ppf(max(p_B, tiny)))
    else:
        radius = 0.0

    return top_class, p_A, p_B, radius

# Evaluation across noise levels
from itertools import islice

sigmas = [0.12, 0.25, 0.5, 1.0]
n_samples_vals = [100, 1000, 10000]

for sigma in sigmas:
    for n_samples in n_samples_vals:
        clean_acc = 0
        n_total = 0
        radius_list = []

        # A DataLoader cannot be sliced; take the first 10 batches for speed
        for x, y in islice(test_loader, 10):
            for i in range(len(x)):
                pred, p_A, p_B, radius = randomized_smoothing(model, x[i:i+1], sigma, n_samples=n_samples)

                if pred == y[i].item():
                    clean_acc += 1
                radius_list.append(radius)
                n_total += 1

        clean_acc = 100 * clean_acc / n_total
        mean_radius = np.mean(radius_list)

        print(f"σ={sigma}, N={n_samples}: Clean={clean_acc:.1f}%, Mean Radius={mean_radius:.3f}")

Expected Output:

σ=0.12, N=100: Clean=92.3%, Mean Radius=0.045
σ=0.12, N=1000: Clean=91.8%, Mean Radius=0.048
σ=0.12, N=10000: Clean=91.5%, Mean Radius=0.051
σ=0.25, N=1000: Clean=88.2%, Mean Radius=0.162
σ=0.5, N=1000: Clean=82.5%, Mean Radius=0.387
σ=1.0, N=1000: Clean=71.3%, Mean Radius=0.723

Numerical / Shape Notes:
- Noise-accuracy tradeoff: σ=0.12 loss ~3%, σ=1.0 loss ~20% clean accuracy
- Certified radius vs σ: roughly linear scaling r ≈ k·σ with k≈0.4-0.8 in the table above (Cohen et al.)
- Sample efficiency: N=1000 ≈ 10× speed of N=10000 with ≈6% radius difference at σ=0.12 (0.048 vs 0.051)
- Radius distribution: mean ≈ median (symmetric), few very high (confident examples)
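A plug-in estimate of p_A is not a certificate by itself, because the radius must hold with high probability over the noise samples. The sketch below shows one way to add that step. It is an assumption-laden illustration: it uses a Hoeffding lower bound in place of the Clopper-Pearson interval used by Cohen et al., bounds the runner-up probability by 1 - p_A, and the function name `certify_radius` and the vote counts in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist

def certify_radius(n_top, n_samples, sigma, alpha=0.001):
    """Return a certified l2 radius that holds with probability >= 1 - alpha, or None to abstain.

    n_top: number of noisy samples voting for the top class (hypothetical input).
    Uses a one-sided Hoeffding bound: p_A >= p_hat - sqrt(log(1/alpha) / (2 n)).
    """
    p_hat = n_top / n_samples
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n_samples))
    if p_lower <= 0.5:
        return None  # cannot certify that the top class wins under noise
    # With p_B <= 1 - p_lower, the radius (sigma/2)(ppf(p_A) - ppf(p_B)) reduces to sigma * ppf(p_lower)
    return sigma * NormalDist().inv_cdf(p_lower)

# Hypothetical vote counts
print(certify_radius(1000, 1000, sigma=0.25))   # unanimous vote
print(certify_radius(550, 1000, sigma=0.25))    # narrow majority: abstains
print(certify_radius(9900, 10000, sigma=0.5))   # more samples tighten the bound
```

Because the lower bound is always strictly below 1, the radius stays finite even for a unanimous vote, which avoids the infinite radius that the raw formula gives when p_A = 1. In the full procedure of Cohen et al., the top class is also selected on a separate batch of samples before the counts used for the bound are drawn.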

C.7 — Gradient Masking Detection

Code:

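# Train model A (FGSM-only, K=1) and model B (PGD, K=10)
# (train_with_attacks is an assumed helper that runs the C.4 training loop with the given attack;
#  x_test, y_test are assumed to be test tensors of shape [N, 1, 28, 28] and [N])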
model_A = train_with_attacks(attack_type='fgsm', K=1)
model_B = train_with_attacks(attack_type='pgd', K=10)

def input_grad_norms(model, x, y):
    """Per-example ℓ2 norm of the loss gradient with respect to the input."""
    x = x.clone().detach().requires_grad_(True)
    # Summed loss so that each example's gradient does not depend on the batch size
    loss = nn.CrossEntropyLoss(reduction='sum')(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return grad.flatten(1).norm(dim=1)

def evaluate_attacks(model, x_test, y_test, eps=0.3):
    """Evaluate against FGSM and PGD, measure gradients"""

    # FGSM attack
    x_fgsm = fgsm_attack(x_test, y_test, model, eps=eps)

    # PGD attack
    x_pgd = pgd_attack(x_test, y_test, model, eps=eps, steps=20)

    # Evaluate success
    with torch.no_grad():
        fgsm_success = (model(x_fgsm).argmax(dim=1) != y_test).sum().item() / len(y_test)
        pgd_success = (model(x_pgd).argmax(dim=1) != y_test).sum().item() / len(y_test)

    # Measure gradients on the first 500 examples.
    # FGSM uses the gradient at the clean input; for PGD we take (as an assumption about
    # the intended comparison) the gradient at the final PGD adversarial point.
    grad_norm_fgsm = input_grad_norms(model, x_test[:500], y_test[:500])
    grad_norm_pgd = input_grad_norms(model, x_pgd[:500], y_test[:500])

    return {
        'fgsm_success': fgsm_success, 'pgd_success': pgd_success,
        'grad_norm_fgsm': grad_norm_fgsm.mean().item(),
        'grad_norm_pgd': grad_norm_pgd.mean().item()
    }

results_A = evaluate_attacks(model_A, x_test, y_test)
results_B = evaluate_attacks(model_B, x_test, y_test)

print("Model A (FGSM-trained):")
print(f"  FGSM Success: {results_A['fgsm_success']:.1%}, PGD Success: {results_A['pgd_success']:.1%}")
print(f"  Gradient norms: FGSM={results_A['grad_norm_fgsm']:.3f}, PGD={results_A['grad_norm_pgd']:.3f}")
print(f"  Masking indicator (ratio): {results_A['grad_norm_pgd'] / (results_A['grad_norm_fgsm'] + 1e-6):.2f}x")

print("\nModel B (PGD-trained):")
print(f"  FGSM Success: {results_B['fgsm_success']:.1%}, PGD Success: {results_B['pgd_success']:.1%}")
print(f"  Gradient norms: FGSM={results_B['grad_norm_fgsm']:.3f}, PGD={results_B['grad_norm_pgd']:.3f}")
print(f"  Masking indicator (ratio): {results_B['grad_norm_pgd'] / (results_B['grad_norm_fgsm'] + 1e-6):.2f}x")

Expected Output:

Model A (FGSM-trained):
  FGSM Success: 22.3%, PGD Success: 68.4%
  Gradient norms: FGSM=0.450, PGD=1.230
  Masking indicator (ratio): 2.73x

Model B (PGD-trained):
  FGSM Success: 31.2%, PGD Success: 35.1%
  Gradient norms: FGSM=0.980, PGD=1.040
  Masking indicator (ratio): 1.06x

Numerical / Shape Notes:
- Model A shows masking: an FGSM success rate of 22% makes it appear robust, but PGD succeeds 68% of the time (actually vulnerable); gradient-norm ratio 2.7×
- Model B is consistently robust: both attacks succeed ~30-35% of the time and the gradient-norm ratio is ≈1 (no sign of masking)
- Gradient masking detection: as a rule of thumb, a ratio >2× suggests obfuscation and <1.5× is consistent with genuine robustness; neither threshold is a guarantee
- Cross-attack vulnerability: Model A falls to stronger iterative attacks; Model B behaves consistently across attacks
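
The rule of thumb above can be written as a small decision helper. This is a minimal sketch: the function name, the returned keys, and the 20-point success-gap threshold are assumptions made for illustration, while the 2× ratio threshold and the input numbers come from the listing above.

```python
def masking_report(fgsm_success: float, pgd_success: float, grad_ratio: float,
                   ratio_threshold: float = 2.0, gap_threshold: float = 0.2) -> dict:
    """Flag possible gradient masking from attack success rates and a gradient-norm ratio."""
    gap = pgd_success - fgsm_success  # how much stronger the iterative attack is
    ratio_flag = grad_ratio > ratio_threshold
    gap_flag = gap > gap_threshold    # gap_threshold is an assumed, illustrative cutoff
    return {
        "success_gap": gap,
        "ratio_flag": ratio_flag,
        "gap_flag": gap_flag,
        "suspect_masking": ratio_flag or gap_flag,
    }

# Values taken from the expected output above
model_a = masking_report(fgsm_success=0.223, pgd_success=0.684, grad_ratio=2.73)
model_b = masking_report(fgsm_success=0.312, pgd_success=0.351, grad_ratio=1.06)
print("Model A suspect:", model_a["suspect_masking"])
print("Model B suspect:", model_b["suspect_masking"])
```

A flag here is only a prompt for further testing with stronger or adaptive attacks, not a verdict.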

C.8 — Spectral Normalization Training

Code:

import torch
import torch.nn as nn

class SpectrallyNormalizedLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.randn(out_features))

        # Power iteration state: unit-norm estimate of the top left singular vector
        u = torch.randn(out_features)
        self.register_buffer('u', u / torch.norm(u))

    def forward(self, x):
        W = self.weight

        # Power iteration (1 step per forward pass), kept outside the autograd graph
        # so the buffer never holds a reference to a previous iteration's graph
        with torch.no_grad():
            v = W.T @ self.u
            v = v / (torch.norm(v) + 1e-12)
            u = W @ v
            u = u / (torch.norm(u) + 1e-12)
            if self.training:
                self.u.copy_(u)

        # Estimate sigma = u^T W v (gradients flow through W only)
        sigma = u @ W @ v

        # Normalize weight
        W_normalized = W / (sigma + 1e-12)

        return torch.nn.functional.linear(x, W_normalized, self.bias)

# Train with spectral normalization (train_loader is assumed to yield MNIST batches)
model_spectral = nn.Sequential(
    SpectrallyNormalizedLinear(784, 128),
    nn.ReLU(),
    SpectrallyNormalizedLinear(128, 10)
)
optimizer = torch.optim.Adam(model_spectral.parameters(), lr=1e-3)  # optimizer choice is an assumption

# Measure L_f during training
L_f_trajectory = []

for epoch in range(30):
    for x, y in train_loader:
        # Forward pass (includes power iteration)
        logits = model_spectral(x.view(len(x), -1))
        loss = nn.CrossEntropyLoss()(logits, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if epoch % 5 == 0:
        # Upper bound on L_f: product of the spectral norms of the effective
        # (normalized) weights; ReLU is 1-Lipschitz and contributes a factor of 1
        L_f = 1.0
        with torch.no_grad():
            for module in model_spectral:
                if isinstance(module, SpectrallyNormalizedLinear):
                    W = module.weight
                    v = W.T @ module.u
                    v = v / (torch.norm(v) + 1e-12)
                    sigma_est = module.u @ W @ v                     # estimate used for normalization
                    sigma_true = torch.linalg.matrix_norm(W, ord=2)  # exact largest singular value
                    L_f *= (sigma_true / (sigma_est + 1e-12)).item()

        L_f_trajectory.append(L_f)
        print(f"Epoch {epoch}: L_f = {L_f:.4f}")

Expected Output:

Epoch 0: L_f = 0.8932
Epoch 5: L_f = 1.0234
Epoch 10: L_f = 0.9876
Epoch 15: L_f = 1.0012
Epoch 20: L_f = 0.9998
Epoch 25: L_f = 1.0001

Numerical / Shape Notes:
- L_f stays ≈1: spectral normalization keeps each layer's effective spectral norm near 1, so the product over layers stays bounded
- Power iteration convergence: σ estimates stabilize within ~10 iterations (linear convergence)
- Training overhead: <2% per epoch (power iteration is cheap: two matrix-vector products per layer per step)
- Accuracy tradeoff: standard ~95%, spectral-normalized ~92% (3 pp drop, acceptable in exchange for certified bounds)
- Empirical vs. certified: the empirical L_f from finite differences (≈0.8) is smaller than the theoretical value of 1.0, which is expected because the product of layer norms is only an upper bound
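
The power iteration used for the σ estimate can be followed by hand on a tiny matrix. The sketch below is pure Python, independent of the PyTorch layer above; the helper names and the 2×2 example matrix are chosen for illustration only.

```python
import math

def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def normalize(x):
    n = math.sqrt(sum(a * a for a in x))
    return [a / n for a in x]

def spectral_norm(W, n_iter=50):
    """Estimate the largest singular value of W by alternating power iteration."""
    Wt = [list(col) for col in zip(*W)]
    # Start vector; it must not be orthogonal to the top left singular vector
    u = normalize([1.0] + [0.0] * (len(W) - 1))
    for _ in range(n_iter):
        v = normalize(matvec(Wt, u))   # v <- W^T u / ||W^T u||
        u = normalize(matvec(W, v))    # u <- W v / ||W v||
    return sum(a * b for a, b in zip(u, matvec(W, v)))  # sigma = u^T W v

W = [[1.0, 2.0], [2.0, 1.0]]           # singular values 3 and 1
sigma = spectral_norm(W)
W_normalized = [[w / sigma for w in row] for row in W]
print(round(sigma, 6), round(spectral_norm(W_normalized), 6))
```

Dividing W by the estimate leaves a matrix whose largest singular value is 1, which is the per-layer property the Lipschitz bound relies on.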

C.9 — Hessian Eigenvalue Analysis

Code:

import itertools
import numpy as np
import torch
import torch.nn as nn

def compute_hessian_eigenvalues(model, x, y, top_k=20, n_iter=20):
    """Estimate the top-k (largest-magnitude) eigenvalues of the loss Hessian w.r.t. parameters"""
    model.eval()
    params = [p for p in model.parameters() if p.requires_grad]

    # Forward and loss
    loss = nn.CrossEntropyLoss()(model(x), y)

    # First gradient over all parameters, flattened, with the graph kept for second derivatives
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])

    def hvp(v):
        # Hessian-vector product: gradient of (grad · v) w.r.t. the parameters
        Hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True, allow_unused=True)
        return torch.cat([(h if h is not None else torch.zeros_like(p)).reshape(-1)
                          for h, p in zip(Hv, params)])

    eigs, eigvecs = [], []

    def deflated_hvp(v):
        # Remove the components along eigenvectors that were already found
        Hv = hvp(v)
        for lam, u in zip(eigs, eigvecs):
            Hv = Hv - lam * (u @ v) * u
        return Hv

    # Power iteration with deflation (expensive; use Lanczos for large models)
    for _ in range(top_k):
        v = torch.randn_like(flat_grad)
        v = v / torch.norm(v)
        for _ in range(n_iter):
            Hv = deflated_hvp(v)
            v = Hv / (torch.norm(Hv) + 1e-8)
        lambda_est = (v @ deflated_hvp(v)).item()  # Rayleigh quotient
        eigs.append(lambda_est)
        eigvecs.append(v)

    return sorted(eigs, reverse=True)

# Compare SAM (Sharpness-Aware Minimization) vs Standard Training
# model_standard, model_sam, and test_loader are assumed to exist
standard_eigs = []
sam_eigs = []

for x, y in itertools.islice(test_loader, 10):  # first 10 batches
    eigs_std = compute_hessian_eigenvalues(model_standard, x, y, top_k=20)
    eigs_sam = compute_hessian_eigenvalues(model_sam, x, y, top_k=20)

    standard_eigs.append(eigs_std)
    sam_eigs.append(eigs_sam)

# Aggregate
std_eigs_mean = np.mean(standard_eigs, axis=0)
sam_eigs_mean = np.mean(sam_eigs, axis=0)

print("Top-10 eigenvalues (avg):")
print(f"Standard: {std_eigs_mean[:10]}")
print(f"SAM:      {sam_eigs_mean[:10]}")
print(f"Sharpness reduction: σ₁ = {std_eigs_mean[0]:.3f} → {sam_eigs_mean[0]:.3f} ({std_eigs_mean[0]/sam_eigs_mean[0]:.2f}× reduction)")
print(f"Condition number: Standard={std_eigs_mean[0]/std_eigs_mean[-1]:.1e}, SAM={sam_eigs_mean[0]/sam_eigs_mean[-1]:.1e}")

Expected Output:

Top-10 eigenvalues (avg):
Standard: [3.245, 2.134, 1.876, 1.523, 1.234, 0.876, 0.654, 0.432, 0.234, 0.098]
SAM:      [0.654, 0.543, 0.432, 0.387, 0.298, 0.187, 0.134, 0.087, 0.054, 0.023]
Sharpness reduction: σ₁ = 3.245 → 0.654 (4.96× reduction)
Condition number: Standard=3.10e+01, SAM=2.85e+01

Numerical / Shape Notes:
- Eigenvalue spectrum: standard training shows a roughly exponentially decaying spectrum; the SAM spectrum is more concentrated at small values (flatter landscape)
- Top eigenvalue (σ₁): indicator of sharpness, ~5× reduction with SAM
- Condition number: only slightly improved (≈31 → ≈28.5); SAM lowers the overall curvature scale rather than the ratio between directions
- Robust accuracy: SAM +15-20 pp vs standard training (a secondary effect of the flatness benefit)

C.10 — Robustness-Accuracy Pareto Frontier

Code:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# TRADES loss: CE(clean) + beta * KL(p(x) || p(x + delta))
def trades_loss(model, x, y, eps=0.3, beta=1.0, step_size=0.02, steps=10, T=3.0):
    # T is the softmax temperature used in this listing; standard TRADES corresponds to T=1
    model.eval()
    with torch.no_grad():
        p_x = torch.softmax(model(x) / T, dim=1)

    # PGD attack on KL divergence: KL(p(x) || p(x+delta))
    delta = torch.zeros_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        delta.requires_grad_(True)
        log_p_delta = torch.log_softmax(model(x + delta) / T, dim=1)
        kl_loss = F.kl_div(log_p_delta, p_x, reduction='batchmean')
        grad, = torch.autograd.grad(kl_loss, delta)  # input gradient only
        delta = (delta.detach() + step_size * grad.sign()).clamp(-eps, eps)

    # Recompute both terms in train mode so gradients reach the parameters
    model.train()
    logits_x = model(x)
    logits_adv = model(x + delta)
    clean_loss = nn.CrossEntropyLoss()(logits_x, y)
    robust_loss = F.kl_div(torch.log_softmax(logits_adv / T, dim=1),
                           torch.softmax(logits_x / T, dim=1), reduction='batchmean')

    return clean_loss + beta * robust_loss

# Sweep beta and measure frontier
betas = [0.0, 0.1, 0.5, 1.0, 5.0, 10.0]
frontier = []
n_test = len(test_loader.dataset)

for beta in betas:
    # Train with TRADES (train_with_loss is an assumed helper; `model` must be the network it optimizes)
    model_trades = train_with_loss(loss_fn=lambda x, y, b=beta: trades_loss(model, x, y, beta=b), epochs=30)
    model_trades.eval()

    # Evaluate
    clean_correct = 0
    robust_correct = 0

    for x, y in test_loader:
        with torch.no_grad():
            clean_preds = model_trades(x).argmax(dim=1)
            clean_correct += (clean_preds == y).sum().item()

        x_adv = pgd_attack(x, y, model_trades, eps=0.3, steps=20)
        with torch.no_grad():
            robust_preds = model_trades(x_adv).argmax(dim=1)
            robust_correct += (robust_preds == y).sum().item()

    clean_acc = 100 * clean_correct / n_test
    robust_acc = 100 * robust_correct / n_test

    frontier.append((clean_acc, robust_acc, beta))
    print(f"β={beta}: Clean={clean_acc:.1f}%, Robust={robust_acc:.1f}%")

# Fit Pareto curve (linear approximation)
clean_accs, robust_accs, _ = zip(*frontier)
slope, intercept = np.polyfit(clean_accs, robust_accs, 1)
print(f"Pareto frontier: Robust ≈ {intercept:.1f} + {slope:.2f} × Clean")

Expected Output:

β=0.0: Clean=95.2%, Robust=0.8%
β=0.1: Clean=93.4%, Robust=15.2%
β=0.5: Clean=91.8%, Robust=32.5%
β=1.0: Clean=89.5%, Robust=48.2%
β=5.0: Clean=82.3%, Robust=56.1%
β=10.0: Clean=75.1%, Robust=58.3%
Pareto frontier: Robust ≈ 261.5 + -2.58 × Clean

Numerical / Shape Notes:
- Tradeoff slope: the least-squares fit to the six points above has slope ≈ -2.6, i.e. each 1 pp of clean accuracy given up buys ~2.6 pp of robust accuracy on average (roughly a 3:1 ratio)
- Sweet spot β≈1: achieves ~90% clean, 48% robust (good balance for MNIST)
- Frontier curvature: the frontier is concave rather than linear; robust accuracy rises steeply up to β≈1 and then saturates (β=5→10 costs ~7 pp clean for ~2 pp robust), so the linear fit is only a coarse summary
- Comparison to PGD: TRADES typically 3-5 pp higher robust accuracy at same clean accuracy
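
Before fitting a curve, it can help to check which runs are actually on the frontier. The following is a minimal sketch of a non-dominated filter over (clean, robust) pairs; the six sweep points are the values from the expected output above, and the extra "dominated" run is hypothetical, added only to show the filter removing something.

```python
def pareto_front(points):
    """Keep points that no other point beats on both clean and robust accuracy."""
    front = []
    for i, (c_i, r_i, tag_i) in enumerate(points):
        dominated = any(
            c_j >= c_i and r_j >= r_i and (c_j > c_i or r_j > r_i)
            for j, (c_j, r_j, _) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((c_i, r_i, tag_i))
    return sorted(front, key=lambda p: p[0], reverse=True)

sweep = [
    (95.2, 0.8, "beta=0.0"), (93.4, 15.2, "beta=0.1"), (91.8, 32.5, "beta=0.5"),
    (89.5, 48.2, "beta=1.0"), (82.3, 56.1, "beta=5.0"), (75.1, 58.3, "beta=10.0"),
    (88.0, 30.0, "hypothetical dominated run"),  # worse than beta=0.5 on both axes
]
front = pareto_front(sweep)
for clean, robust, tag in front:
    print(f"{tag}: clean={clean:.1f}%, robust={robust:.1f}%")
```

All six β settings survive the filter because each one trades clean accuracy for robust accuracy; only the hypothetical run is dropped.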

C.11 — Leave-One-Out Error and Stability

Code:

import numpy as np
from sklearn.linear_model import LogisticRegression

def loo_cross_validation(X, y, C_val=1.0):
    """Approximate LOO error for binary logistic regression via the hat-matrix shortcut"""
    model = LogisticRegression(C=C_val, solver='lbfgs', max_iter=1000)
    model.fit(X, y)

    # Predictions on training set
    y_pred = model.predict(X)
    train_acc = np.mean(y_pred == y)

    # LOO shortcut: residual_i / (1 - H_ii), where H is the ridge hat matrix.
    # Exact for ridge regression; only an approximation for logistic regression.
    lam = 1 / (2 * C_val)
    H_diag = np.einsum('ij,ji->i', X, np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T))

    loo_residuals = (y - y_pred) / (1 - H_diag + 1e-8)
    loo_error_est = np.sqrt(np.mean(loo_residuals ** 2))  # RMSE

    # Heuristic stability proxy used in this listing: beta ≈ 1 / (4 * C * lambda_min),
    # with lambda_min the smallest singular value of X^T X
    lambda_min = np.linalg.svd(X.T @ X, compute_uv=False).min()
    beta_est = 1 / (4 * C_val * lambda_min + 1e-8)

    return train_acc, loo_error_est, beta_est

# Sweep regularization
Cs = [1e-3, 1e-2, 0.1, 1.0, 10]
results = []

for C in Cs:
    train_acc, loo_error, beta = loo_cross_validation(X_train, y_train, C_val=C)
    results.append({'C': C, 'train_acc': train_acc, 'loo_error': loo_error, 'beta': beta})
    
    # Generalization bound: test_acc ≥ train_acc - c*sqrt(beta/n)
    gen_bound = np.sqrt(beta / len(X_train))
    
    print(f"C={C}: Train={train_acc:.3f}, LOO≈{loo_error:.3f}, β={beta:.3f}, Gen_bound={gen_bound:.4f}")

Expected Output:

C=0.001: Train=0.892, LOO≈0.154, β=0.234, Gen_bound=0.0156
C=0.01: Train=0.915, LOO≈0.108, β=0.156, Gen_bound=0.0112
C=0.1: Train=0.938, LOO≈0.062, β=0.089, Gen_bound=0.0085
C=1.0: Train=0.951, LOO≈0.043, β=0.045, Gen_bound=0.0067
C=10: Train=0.958, LOO≈0.038, β=0.023, Gen_bound=0.0049

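Numerical / Shape Notes:
- LOO error vs. C: ~0.15 → 0.04 as C increases; a larger C means weaker regularization (the penalty weight is 1/(2C)), so over this range relaxing the penalty reduces underfitting
- Stability proxy: the formula used above gives β ∝ 1/C (halving C doubles β); classical uniform-stability bounds for regularized ERM instead scale as 1/(λn) with λ ∝ 1/C, so treat this β as a heuristic
- Generalization bound: tightens as sample size grows (∝ sqrt(1/n)), so it is tightest for large datasets
- Overfitting detection: a LOO error well above the training error indicates overfitting, data quality issues, or non-smoothness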
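
The hat-matrix shortcut is exact in the ridge-regression case, which is easy to verify on a one-dimensional problem. The sketch below compares the shortcut residual e_i / (1 − H_ii) against brute-force refitting; the data, λ, and function names are hypothetical and chosen only for illustration.

```python
def ridge_fit_1d(xs, ys, lam):
    """Closed-form 1-D ridge regression without intercept: w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def loo_residuals_bruteforce(xs, ys, lam):
    """Refit n times, each time leaving one example out."""
    residuals = []
    for i in range(len(xs)):
        w = ridge_fit_1d(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        residuals.append(ys[i] - w * xs[i])
    return residuals

def loo_residuals_shortcut(xs, ys, lam):
    """Single fit: LOO residual = training residual / (1 - H_ii), H_ii = x_i^2 / (sum(x^2) + lam)."""
    denom = sum(x * x for x in xs) + lam
    w = ridge_fit_1d(xs, ys, lam)
    return [(y - w * x) / (1 - x * x / denom) for x, y in zip(xs, ys)]

# Hypothetical toy data
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
brute = loo_residuals_bruteforce(xs, ys, lam=0.5)
fast = loo_residuals_shortcut(xs, ys, lam=0.5)
print(max(abs(a - b) for a, b in zip(brute, fast)))
```

For logistic regression the same expression is only an approximation, because the fitted probabilities change nonlinearly when an example is removed.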

C.12 — Defensive Distillation and Adaptive Attacks

Code:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def distill_model(teacher, student, X_train, y_train, temperature=20.0, epochs=30):
    """Distill teacher into student via soft targets"""
    optimizer = torch.optim.Adam(student.parameters(), lr=0.001)
    criterion = nn.KLDivLoss(reduction='batchmean')  # true KL divergence, averaged over the batch
    loader = DataLoader(TensorDataset(X_train, y_train), batch_size=128, shuffle=True)
    teacher.eval()
    student.train()

    for epoch in range(epochs):
        for x, _ in loader:  # hard labels are not used during distillation
            # Teacher soft targets
            with torch.no_grad():
                teacher_logits = teacher(x)
                teacher_probs = torch.softmax(teacher_logits / temperature, dim=1)

            # Student predictions
            student_logits = student(x)
            student_log_probs = torch.log_softmax(student_logits / temperature, dim=1)

            # KL divergence loss
            loss = criterion(student_log_probs, teacher_probs)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return student

# Train distilled model (model_standard and student_model are assumed to exist)
model_distilled = distill_model(model_standard, student_model, X_train, y_train, temperature=20)

# Evaluate: standard vs. adaptive attacks
def evaluate_distillation(model_student, model_teacher, X_test, y_test):
    """Evaluate robustness to standard (non-adaptive) and adaptive attacks"""
    model_student.eval()

    # Standard FGSM (gradients taken on the student directly)
    x_fgsm_std = fgsm_attack(X_test, y_test, model_student, eps=0.3)

    # Adaptive FGSM, realized here as a transfer attack: gradients are taken on the
    # undistilled teacher, then the examples are evaluated on the student
    x_fgsm_adv = fgsm_attack(X_test, y_test, model_teacher, eps=0.3)

    with torch.no_grad():
        fgsm_std_success = (model_student(x_fgsm_std).argmax(dim=1) != y_test).float().mean().item()
        fgsm_adv_success = (model_student(x_fgsm_adv).argmax(dim=1) != y_test).float().mean().item()

    return fgsm_std_success, fgsm_adv_success

std_success, adv_success = evaluate_distillation(model_distilled, model_standard, X_test, y_test)

print(f"Distilled Model:")
print(f"  Standard FGSM success: {std_success:.1%}")
print(f"  Adaptive FGSM success: {adv_success:.1%}")
print(f"  Gradient masking indicator (ratio): {adv_success / max(std_success, 1e-8):.2f}x")

Expected Output:

Distilled Model:
  Standard FGSM success: 28.3%
  Adaptive FGSM success: 76.5%
  Gradient masking indicator (ratio): 2.70x

Numerical / Shape Notes:
- Standard vs. adaptive gap: 28% → 77% (~48 pp increase) indicates gradient masking
- Adaptive advantage: the attacker accounts for the defense and bypasses the obfuscated gradients
- Temperature effect: T=20 increases the gap relative to T=5 (higher T → stronger obfuscation)
- Comparison to PGD-trained: a PGD-trained model shows ~30% success for both standard and adaptive attacks (no gap, genuine robustness)
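
The temperature effect can be seen on a single logit vector. This sketch assumes the usual defensive-distillation setup in which the student is trained at temperature T and queried at T=1; the logit values are hypothetical, and the quantity 1 − p_max is used as a stand-in for the size of the cross-entropy gradient with respect to the predicted class's logit.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits of a student trained at T=20 (large logit gaps are typical there)
logits = [40.0, 10.0, -5.0]

p_train = softmax(logits, temperature=20.0)  # distribution seen during distillation
p_test = softmax(logits, temperature=1.0)    # distribution seen by an attacker at test time

# For cross-entropy, the gradient w.r.t. the predicted class's logit has size 1 - p_max
print(f"T=20: p_max={max(p_train):.4f}, 1 - p_max={1 - max(p_train):.4f}")
print(f"T=1:  p_max={max(p_test):.4f}, 1 - p_max={1 - max(p_test):.2e}")
```

At T=1 the softmax is saturated, so a single-step gradient attack on the student receives almost no signal even though the decision boundary has not moved.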

C.13 — Transferability of Adversarial Examples

Code:

import numpy as np
import torch
import torch.nn as nn
import torchvision

def compute_transfer_matrix(models_dict, X_test, y_test, eps=0.3):
    """Compute attack transfer rates across architectures"""

    architectures = list(models_dict.keys())
    n = len(architectures)
    transfer_matrix = np.zeros((n, n))

    for i, source_arch in enumerate(architectures):
        # Generate attacks on source model
        x_fgsm = fgsm_attack(X_test, y_test, models_dict[source_arch], eps=eps)

        for j, target_arch in enumerate(architectures):
            # Evaluate on target model
            with torch.no_grad():
                target_preds = models_dict[target_arch](x_fgsm).argmax(dim=1)
            transfer_rate = (target_preds != y_test).sum().item() / len(y_test)
            transfer_matrix[i, j] = transfer_rate

    return transfer_matrix, architectures

def build_resnet18_mnist():
    """ResNet-18 with a 1-channel stem so it accepts 1x28x28 inputs"""
    net = torchvision.models.resnet18(num_classes=10)
    net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    return net

# Diverse architectures; each is assumed to be trained on the same data before
# the transfer matrix is computed (build_small_cnn is an assumed helper)
models = {
    '2-layer MLP': nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)),
    '3-layer MLP': nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10)),
    'Small CNN': build_small_cnn(),
    'ResNet-18': build_resnet18_mnist()
}

# Compute transfer matrix
transfer_matrix, archs = compute_transfer_matrix(models, X_test, y_test)

# Display as heatmap
print("Transfer matrix (rows=source, cols=target):")
print("       ", "  ".join(f"{a[:8]:>8}" for a in archs))
for i, src in enumerate(archs):
    print(f"{src[:8]:8}", "  ".join(f"{transfer_matrix[i, j]:8.1%}" for j in range(len(archs))))

# Analyze
diagonal_transfer = np.mean(np.diag(transfer_matrix))
off_diagonal_transfer = np.mean(transfer_matrix[~np.eye(len(archs), dtype=bool)])
print(f"\nDiagonal (same arch): {diagonal_transfer:.1%}")
print(f"Off-diagonal (cross-arch): {off_diagonal_transfer:.1%}")
print(f"Transfer gap: {100 * (diagonal_transfer - off_diagonal_transfer):.1f} pp")

Expected Output:

Transfer matrix (rows=source, cols=target):
         2-layer M  3-layer M  Small CNN ResNet-18
2-layer M  100.0%     72.4%     58.3%     42.1%
3-layer M   68.9%    100.0%     54.2%     40.5%
Small CNN   61.2%     58.7%    100.0%     48.3%
ResNet-18   35.4%     38.1%     42.5%    100.0%

Diagonal (same arch): 100.0%
Off-diagonal (cross-arch): 51.7%
Transfer gap: 48.3 pp

Numerical / Shape Notes:
- Diagonal dominance: the 100% diagonal entries are the white-box success rates on the source model itself (expected to be the largest entry in each row)
- Off-diagonal patterns: MLP→MLP transfers better than MLP→CNN or MLP→ResNet (~70% vs. ~40-58%); similar architectures transfer better
- Average transfer: ~50% across diverse architectures (non-negligible black-box threat)
- Ensemble attack: aggregating attacks from all source models raises transfer above 70%, showing that source diversity helps the attacker

C.14 — TRADES Training

Code:

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def trades_train(model, train_loader, test_loader, beta=1.0, eps=0.3, step_size=0.02, epochs=30):
    """Train model with TRADES loss"""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    for epoch in range(epochs):
        train_loss_clean = 0
        train_loss_robust = 0

        for x, y in train_loader:
            # PGD attack on KL divergence (eval mode, clean prediction held fixed)
            model.eval()
            with torch.no_grad():
                p_clean = torch.softmax(model(x), dim=1)
            delta = torch.zeros_like(x).uniform_(-eps, eps)

            for _ in range(10):  # K=10 PGD steps
                delta.requires_grad_(True)
                log_p_adv = torch.log_softmax(model((x + delta).clamp(0, 1)), dim=1)
                kl_loss = F.kl_div(log_p_adv, p_clean, reduction='batchmean')
                grad, = torch.autograd.grad(kl_loss, delta)  # input gradient only
                delta = (delta.detach() + step_size * grad.sign()).clamp(-eps, eps)

            # Compute TRADES loss
            model.train()
            logits_clean = model(x)
            logits_adv = model((x + delta).clamp(0, 1))

            loss_clean = nn.CrossEntropyLoss()(logits_clean, y)
            loss_robust = F.kl_div(torch.log_softmax(logits_adv, dim=1),
                                   torch.softmax(logits_clean, dim=1), reduction='batchmean')

            loss = loss_clean + beta * loss_robust

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            train_loss_clean += loss_clean.item()
            train_loss_robust += loss_robust.item()

        # Evaluate (evaluate_model is an assumed helper returning clean and robust accuracy in %)
        clean_acc, robust_acc = evaluate_model(model, test_loader, eps=eps)
        print(f"Epoch {epoch}: Clean={clean_acc:.1f}%, Robust={robust_acc:.1f}%, Loss_c={train_loss_clean/len(train_loader):.3f}, Loss_r={train_loss_robust/len(train_loader):.3f}")

    return model

# Train with different beta values, each run starting from a fresh copy of the initial model
for beta in [0.1, 1.0, 5.0]:
    print(f"\nβ={beta}:")
    model_trades_beta = trades_train(copy.deepcopy(model), train_loader, test_loader, beta=beta, epochs=20)

Expected Output:

β=0.1:
Epoch 0: Clean=48.2%, Robust=3.1%, Loss_c=2.134, Loss_r=0.892
Epoch 10: Clean=91.5%, Robust=28.4%, Loss_c=0.234, Loss_r=0.156
Epoch 19: Clean=92.8%, Robust=32.3%, Loss_c=0.182, Loss_r=0.124

β=1.0:
Epoch 0: Clean=42.1%, Robust=2.3%, Loss_c=2.245, Loss_r=1.234
Epoch 10: Clean=88.3%, Robust=45.2%, Loss_c=0.251, Loss_r=0.234
Epoch 19: Clean=89.5%, Robust=48.2%, Loss_c=0.198, Loss_r=0.212

β=5.0:
Epoch 0: Clean=35.4%, Robust=1.8%, Loss_c=2.623, Loss_r=1.845
Epoch 10: Clean=80.1%, Robust=54.3%, Loss_c=0.423, Loss_r=0.678
Epoch 19: Clean=75.8%, Robust=58.1%, Loss_c=0.523, Loss_r=0.812

Numerical / Shape Notes:
- Pareto frontier: β=0.1 achieves 93% clean/32% robust, β=5.0 achieves 76% clean/58% robust
- TRADES vs. standard PGD: ~3-5 pp higher robust accuracy at same clean accuracy (better frontier)
- KL loss behavior: in the runs above the KL term stays within the same order of magnitude as the CE term (e.g. 0.12 vs. 0.18 at β=0.1, 0.81 vs. 0.52 at β=5.0); it measures a shift in the predicted distribution rather than a hard margin
- Convergence: TRADES is slower than standard training (the inner PGD loop adds forward/backward passes for every batch), with a cost similar to full PGD adversarial training
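
The two terms of the objective can be evaluated by hand for a single example. The sketch below is a pure-Python illustration of CE(clean) + β·KL(p_clean || p_adv); the logits and function names are hypothetical, and the adversarial logits are simply given rather than produced by an attack.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(p_i * (math.log(p_i + eps) - math.log(q_i + eps)) for p_i, q_i in zip(p, q))

def trades_objective(clean_logits, adv_logits, label, beta):
    """Cross-entropy on the clean input plus beta times the clean-vs-adversarial KL term."""
    p_clean = softmax(clean_logits)
    p_adv = softmax(adv_logits)
    cross_entropy = -math.log(p_clean[label] + 1e-12)
    return cross_entropy + beta * kl_divergence(p_clean, p_adv)

# Hypothetical logits for one example with true label 0
clean_logits = [3.0, 0.5, -1.0]
adv_logits = [1.0, 1.5, -0.5]   # the perturbation has shifted mass toward class 1

for beta in (0.0, 1.0, 5.0):
    print(f"beta={beta}: objective={trades_objective(clean_logits, adv_logits, 0, beta):.4f}")
```

With β=0 the objective reduces to ordinary cross-entropy; larger β penalizes any change in the predicted distribution under perturbation, whether or not the label flips.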

C.15 — Accept/Reject with Certified Robustness

Code:

import numpy as np
import torch
from scipy.stats import beta as beta_dist, norm

def certified_accept_reject_system(model, test_loader, sigma=0.25, n_samples=1000, radius_threshold=0.2, alpha=0.001):
    """Deploy with certified robustness via accept/reject"""
    model.eval()

    accepted = []
    rejected = []

    for x, y in test_loader:
        for i in range(len(x)):
            # Randomized smoothing: classify n_samples noisy copies in one batch
            noise = torch.randn((n_samples, *x[i].shape), device=x.device) * sigma
            with torch.no_grad():
                predictions = model(x[i].unsqueeze(0) + noise).argmax(dim=1).cpu().numpy()

            # Top class and its vote count
            unique, counts = np.unique(predictions, return_counts=True)
            top_class = unique[counts.argmax()]
            k_A = counts.max()

            # Certified radius. A one-sided Clopper-Pearson bound gives p_A >= p_A_lower
            # with confidence 1 - alpha; bounding p_B by 1 - p_A_lower turns
            # (sigma / 2) * (Phi^-1(p_A) - Phi^-1(p_B)) into sigma * Phi^-1(p_A_lower).
            # Simplification: the same samples select the class and estimate p_A.
            p_A_lower = beta_dist.ppf(alpha, k_A, n_samples - k_A + 1)
            certified_radius = sigma * norm.ppf(p_A_lower) if p_A_lower > 0.5 else 0.0

            # Accept/reject decision
            if certified_radius >= radius_threshold:
                accepted.append((x[i], y[i], certified_radius, top_class))
            else:
                rejected.append((x[i], y[i], certified_radius))

    # Compute metrics
    coverage = 100 * len(accepted) / (len(accepted) + len(rejected))
    if accepted:
        accuracy_accepted = 100 * np.mean([pred == y_i.item() for _, y_i, _, pred in accepted])
        mean_radius_accepted = np.mean([r for _, _, r, _ in accepted])
    else:
        accuracy_accepted = 0.0
        mean_radius_accepted = 0.0

    # Fraction of all inputs that are both accepted and correct
    throughput = coverage * accuracy_accepted / 100

    return coverage, accuracy_accepted, mean_radius_accepted, throughput

# Sweep radius thresholds
thresholds = [0.0, 0.1, 0.2, 0.3, 0.5]
results_frontier = []

for threshold in thresholds:
    cov, acc, radius, throughput = certified_accept_reject_system(model, test_loader, radius_threshold=threshold)
    results_frontier.append((cov, acc, radius, throughput))
    print(f"Threshold r_min={threshold}:")
    print(f"  Coverage: {cov:.1f}%, Accuracy: {acc:.1f}%, Throughput: {throughput:.1f}%\n")

Expected Output:

Threshold r_min=0.0:
  Coverage: 100.0%, Accuracy: 88.2%, Throughput: 88.2%

Threshold r_min=0.1:
  Coverage: 68.3%, Accuracy: 94.2%, Throughput: 64.3%

Threshold r_min=0.2:
  Coverage: 32.1%, Accuracy: 97.8%, Throughput: 31.4%

Threshold r_min=0.3:
  Coverage: 8.5%, Accuracy: 99.1%, Throughput: 8.4%

Threshold r_min=0.5:
  Coverage: 0.5%, Accuracy: 100.0%, Throughput: 0.5%

Numerical / Shape Notes:
- Coverage-accuracy tradeoff: coverage falls along a sigmoid-shaped curve, reaching 50% at r_min≈0.15
- Throughput: highest at r_min=0 (88.2%) and decreasing as r_min grows; at r_min≈0.1 it is ~64% with 94% accuracy on accepted inputs
- Certified radius distribution: right-skewed; median ≈0.15 (where coverage crosses 50%), 95th percentile between 0.3 and 0.5
- Business metrics: ~64% throughput (~32% rejection rate) with 94% accuracy on accepted inputs may be acceptable for some applications
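
The three deployment metrics depend only on each input's certified radius and whether its prediction was correct. The sketch below recomputes them from a list of such records; the records and the function name are hypothetical, standing in for the per-example results collected above.

```python
def accept_reject_metrics(records, r_min):
    """records: list of (certified_radius, prediction_correct) pairs."""
    accepted = [correct for radius, correct in records if radius >= r_min]
    coverage = len(accepted) / len(records)
    accuracy = sum(accepted) / len(accepted) if accepted else 0.0
    throughput = coverage * accuracy  # fraction of all inputs accepted AND correct
    return coverage, accuracy, throughput

# Hypothetical per-example results: (certified radius, was the prediction correct?)
records = [
    (0.05, False), (0.08, True), (0.12, True), (0.18, True), (0.22, True),
    (0.31, True), (0.02, False), (0.15, False), (0.27, True), (0.40, True),
]

for r_min in (0.0, 0.1, 0.2, 0.3):
    cov, acc, thr = accept_reject_metrics(records, r_min)
    print(f"r_min={r_min}: coverage={cov:.0%}, accuracy={acc:.0%}, throughput={thr:.0%}")
```

Because throughput counts inputs that are both accepted and correct, raising r_min can never increase it; the threshold buys accuracy on accepted inputs at the cost of coverage.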

C.16 — Carlini & Wagner Attack

Code:

import itertools
import numpy as np
import torch

def carlini_wagner_attack(model, x, y, eps=0.3, lr=0.02, iterations=1000, search_steps=20):
    """C&W L2 attack via change-of-variables.

    eps is unused: the attack minimizes the L2 norm instead of enforcing a budget.
    """
    label = y.item()

    # Binary search over c (trade-off weight), on a log scale
    c_lower, c_upper = 1e-2, 1e9
    best_x_adv = x.clone()
    best_dist = float('inf')

    # Start the optimization at x itself: w0 = atanh(2x - 1)
    w_init = torch.atanh((2 * x - 1).clamp(-1 + 1e-6, 1 - 1e-6))

    for _ in range(search_steps):  # Binary search iterations
        c = (c_lower * c_upper) ** 0.5
        success = False

        # Optimize in tanh space (change of variables)
        w = w_init.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([w], lr=lr)

        for step in range(iterations):
            # Map to [0, 1] via tanh
            x_adv = 0.5 * (torch.tanh(w) + 1)
            logits = model(x_adv)

            # Margin loss (correct class vs. best other class), confidence kappa = 0:
            # positive while the input is still classified correctly
            correct_logit = logits[0, label]
            other_mask = torch.arange(logits.shape[1], device=logits.device) != label
            max_other_logit = logits[0, other_mask].max()
            margin_loss = torch.clamp(correct_logit - max_other_logit, min=0)

            # Squared L2 perturbation
            pert_loss = torch.sum((x_adv - x) ** 2)

            # Total loss
            loss = c * margin_loss + pert_loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Track the smallest perturbation that is actually misclassified
            if logits.argmax(dim=1).item() != label:
                success = True
                if pert_loss.item() < best_dist:
                    best_dist = pert_loss.item()
                    best_x_adv = x_adv.detach().clone()

        if success:
            c_upper = c  # Success, decrease c (smaller perturbation needed)
        else:
            c_lower = c  # Failure, increase c (need more emphasis on margin)

    return best_x_adv

# Evaluate C&W vs. PGD
cw_success = 0
pgd_success = 0
n_total = 0
cw_perturbations = []
pgd_perturbations = []

for x, y in itertools.islice(test_loader, 10):  # Subsample for speed
    for i in range(len(x)):
        n_total += 1

        # C&W attack
        x_cw = carlini_wagner_attack(model, x[i:i+1], y[i:i+1], eps=0.3, iterations=100)
        cw_pred = model(x_cw).argmax(dim=1).item()
        if cw_pred != y[i].item():
            cw_success += 1
            cw_perturbations.append(torch.norm(x_cw - x[i:i+1]).item())

        # PGD attack
        x_pgd = pgd_attack(x[i:i+1], y[i:i+1], model, eps=0.3, steps=20)
        pgd_pred = model(x_pgd).argmax(dim=1).item()
        if pgd_pred != y[i].item():
            pgd_success += 1
            pgd_perturbations.append(torch.norm(x_pgd - x[i:i+1]).item())

print(f"C&W success: {100*cw_success/n_total:.1f}%, Mean pert: {np.mean(cw_perturbations):.3f}")
print(f"PGD success: {100*pgd_success/n_total:.1f}%, Mean pert: {np.mean(pgd_perturbations):.3f}")
print(f"C&W advantage: {np.mean(cw_perturbations) / np.mean(pgd_perturbations):.2f}x smaller perturbations for same success")

Expected Output:

C&W success: 97.5%, Mean pert: 0.268
PGD success: 95.3%, Mean pert: 0.301
C&W advantage: 0.89x smaller perturbations for same success

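Numerical / Shape Notes:
- Success rate: C&W ≥ PGD (its unconstrained optimization over perturbation size is often tighter than a projected attack inside a fixed budget)
- Perturbation efficiency: C&W finds 10-15% smaller perturbations for the same misclassification
- Computational cost: with 20 binary-search steps of 100 iterations each, C&W takes ~2,000 gradient steps per example, roughly 100× the cost of PGD-20
- Confidence level: with κ=0 as in the code above, C&W stops just past the decision boundary (margin ≈0); higher-confidence adversarial examples (margins of 2-5 logits) require a confidence parameter κ>0, while PGD's final margin depends on its budget ε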
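
The change of variables used by the attack can be checked in isolation. This sketch shows the mapping x = 0.5·(tanh(w) + 1) and its inverse w = atanh(2x − 1) for a single pixel value; the function names and the clipping constant are illustrative assumptions.

```python
import math

def to_w(x, eps=1e-6):
    """Inverse map: pixel value in [0, 1] -> unconstrained variable w."""
    z = min(max(2.0 * x - 1.0, -1.0 + eps), 1.0 - eps)  # keep atanh finite at 0 and 1
    return math.atanh(z)

def to_x(w):
    """Forward map: any real w -> pixel value in [0, 1]."""
    return 0.5 * (math.tanh(w) + 1.0)

for pixel in (0.0, 0.1, 0.5, 0.9, 1.0):
    w = to_w(pixel)
    print(f"x={pixel:.1f} -> w={w:+.3f} -> x'={to_x(w):.6f}")

# Any optimizer step on w stays inside the valid pixel range
print(to_x(-50.0), to_x(50.0))
```

Because every real w maps to a valid pixel, the optimizer needs no projection or clipping step, and initializing w at to_w(x) starts the search at the clean image.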

C.17 — Group Distributionally Robust Optimization

Code:

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def group_dro_train(model, train_loader, n_groups, epochs=30, eta=0.01):
    """Train with group DRO (worst-case group loss); train_loader yields (x, y, groups) batches"""

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    criterion = nn.CrossEntropyLoss(reduction='none')
    group_weights = torch.ones(n_groups) / n_groups  # Initialize uniform

    for epoch in range(epochs):
        # Per-group loss accumulation
        group_losses = torch.zeros(n_groups)
        group_counts = torch.zeros(n_groups)

        model.eval()
        with torch.no_grad():
            for x, y, groups in train_loader:
                loss_per_example = criterion(model(x), y)
                for g in range(n_groups):
                    mask = groups == g
                    if mask.any():
                        group_losses[g] += loss_per_example[mask].sum()
                        group_counts[g] += mask.sum()

        # Average per-group losses
        avg_group_losses = group_losses / (group_counts + 1e-8)

        # Update weights: exponentiated-gradient ascent, so groups with higher
        # loss receive more weight (eta is the step size for the weight update)
        group_weights = group_weights * torch.exp(eta * avg_group_losses)
        group_weights = group_weights / group_weights.sum()  # Normalize

        # Train one epoch with the weighted loss
        model.train()
        for x, y, groups in train_loader:
            loss_per_example = criterion(model(x), y)

            # Weighted loss
            weighted_loss = 0
            for g in range(n_groups):
                mask = groups == g
                if mask.any():
                    weighted_loss += group_weights[g] * loss_per_example[mask].mean()

            optimizer.zero_grad()
            weighted_loss.backward()
            optimizer.step()

        # Evaluate per-group accuracy (on the training set)
        group_correct = np.zeros(n_groups)
        model.eval()
        with torch.no_grad():
            for x, y, groups in train_loader:
                preds = model(x).argmax(dim=1)
                for g in range(n_groups):
                    mask = groups == g
                    if mask.any():
                        group_correct[g] += (preds[mask] == y[mask]).sum().item()

        per_group_acc = group_correct / (group_counts.numpy() + 1e-8)
        avg_acc = per_group_acc.mean()
        worst_group_acc = per_group_acc.min()

        print(f"Epoch {epoch}: Avg Acc={avg_acc:.1%}, Worst Group={worst_group_acc:.1%}, Weights={group_weights.numpy()}")

    return model

# Create groups (e.g., digit ranges): 3 groups covering 0-2, 3-5, 6-9
# train_images and train_labels are assumed to be the training tensors
train_groups = torch.clamp(train_labels // 3, max=2)
train_loader = DataLoader(TensorDataset(train_images, train_labels, train_groups), batch_size=128, shuffle=True)

# Train with DRO
model_dro = group_dro_train(model, train_loader, n_groups=3, epochs=20)

Expected Output:

Epoch 0: Avg Acc=92.1%, Worst Group=40.3%, Weights=[0.33, 0.33, 0.33]
Epoch 5: Avg Acc=87.5%, Worst Group=68.2%, Weights=[0.28, 0.35, 0.37]
Epoch 10: Avg Acc=86.8%, Worst Group=77.3%, Weights=[0.25, 0.32, 0.43]
Epoch 15: Avg Acc=86.4%, Worst Group=78.1%, Weights=[0.24, 0.31, 0.45]
Epoch 19: Avg Acc=86.2%, Worst Group=78.5%, Weights=[0.23, 0.30, 0.47]

Numerical / Shape Notes:
- Fairness improvement: worst-group accuracy 40% → 79% (~38 pp gain)
- Weight evolution: DRO emphasizes struggling groups (the weight of the third group rises from 0.33 to 0.47)
- Average accuracy tradeoff: 92% → 86% (~6 pp drop, the cost of the worst-group gain)
- Group gap reduction: the gap between average and worst-group accuracy narrows from ~52 pp (92.1% vs. 40.3%) to ~8 pp (86.2% vs. 78.5%)
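
The weight evolution comes from a multiplicative update followed by renormalization. The sketch below isolates that step in pure Python; the group losses and the step size are hypothetical and held fixed across iterations, whereas in training the losses change as the model adapts.

```python
import math

def update_group_weights(weights, group_losses, eta):
    """Exponentiated-gradient step: higher-loss groups gain weight, then renormalize."""
    unnormalized = [w * math.exp(eta * loss) for w, loss in zip(weights, group_losses)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

weights = [1 / 3, 1 / 3, 1 / 3]
group_losses = [0.2, 0.5, 1.4]   # hypothetical average losses; group 3 struggles most

for step in range(20):
    weights = update_group_weights(weights, group_losses, eta=0.1)

print([round(w, 3) for w in weights])
```

The sign in the exponent matters: with exp(−η·loss) the update would shift weight toward the groups that are already doing well, which is the opposite of the worst-case objective.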

C.18 — Influence Functions

Code:

import numpy as np
from scipy.special import expit as sigmoid
from sklearn.linear_model import LogisticRegression

def influence_function_logistic(model, X_train, y_train, X_test, y_test, sample_size=1000, damping=1e-8):
    """Score training examples by their estimated effect on the mean test loss
    (binary logistic regression, labels in {0, 1}, intercept ignored)"""

    w = model.coef_[0]

    # Hessian of the mean training loss: X^T diag(p(1-p)) X / n, plus damping for invertibility
    p_train = sigmoid(X_train @ w)
    H = (X_train.T * (p_train * (1 - p_train))) @ X_train / len(X_train)
    H_inv = np.linalg.inv(H + damping * np.eye(H.shape[0]))

    # Gradient of the mean loss over the test set
    grad_test = X_test.T @ (sigmoid(X_test @ w) - y_test) / len(y_test)

    # Per-example gradients for the first sample_size training examples (subset for efficiency)
    n = min(sample_size, len(X_train))
    grads_train = (p_train[:n] - y_train[:n])[:, None] * X_train[:n]

    # Score: grad_test^T H^{-1} grad_i, the predicted drop in test loss when
    # example i is upweighted, so positive = helpful and negative = harmful
    influences = grads_train @ (H_inv @ grad_test)

    return influences

# Identify beneficial vs. harmful examples
influences = influence_function_logistic(model_lr, X_train, y_train, X_test, y_test)

top_helpful = np.argsort(influences)[-10:]
top_harmful = np.argsort(influences)[:10]

print("Top 10 helpful examples:")
for idx in top_helpful:
    print(f"  Example {idx}: Influence={influences[idx]:.4f}")

print("\nTop 10 harmful examples:")
for idx in top_harmful:
    print(f"  Example {idx}: Influence={influences[idx]:.4f}")

# Robustness improvement via removal
# Retrain without top harmful
X_train_cleaned = np.delete(X_train, top_harmful, axis=0)
y_train_cleaned = np.delete(y_train, top_harmful)

model_cleaned = LogisticRegression(C=1.0, solver='lbfgs').fit(X_train_cleaned, y_train_cleaned)

# Compare robustness
robust_acc_original = evaluate_robust_acc(model_lr, X_test, y_test, eps=0.3)
robust_acc_cleaned = evaluate_robust_acc(model_cleaned, X_test, y_test, eps=0.3)

print(f"\nRobustness improvement via harmful example removal:")
print(f"  Original: {robust_acc_original:.1%}")
print(f"  Cleaned:  {robust_acc_cleaned:.1%}")
print(f"  Gain:     {100*(robust_acc_cleaned - robust_acc_original):.1f} pp")

Expected Output:

Top 10 helpful examples:
  Example 453: Influence=0.0342
  Example 127: Influence=0.0328
  Example 891: Influence=0.0301
  ...

Top 10 harmful examples:
  Example 234: Influence=-0.0456
  Example 678: Influence=-0.0421
  Example 445: Influence=-0.0389
  ...

Robustness improvement via harmful example removal:
  Original: 45.2%
  Cleaned:  52.3%
  Gain:     7.1 pp

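Numerical / Shape Notes:
- Influence sign: positive scores are treated as helpful, negative scores as harmful. The scores estimate effects on test loss, not directly on robust accuracy; the robustness comparison is an empirical follow-up.
- Magnitude scale: typical influences are ±0.01-0.04, so each example has a small individual impact, but the cumulative effect can be large.
- Robustness gain: about 7 pp from removing the 10 most harmful examples (about 1% of the training data in this setup), which suggests data quality has a significant impact.
- Harmful example properties: often mislabeled, outliers, or adversarial-like (unusual feature patterns).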

C.19 — Multi-Task Learning for Robustness

Code:

class MTLRobustModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared encoder
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU()
        )
        # Task heads
        self.standard_head = nn.Linear(128, 10)
        self.robust_head = nn.Linear(128, 10)
    
    def forward(self, x, task=None):
        feats = self.encoder(x)
        if task == 'standard':
            return self.standard_head(feats)
        elif task == 'robust':
            return self.robust_head(feats)
        else:
            return self.standard_head(feats), self.robust_head(feats)

def mtl_train(model, train_loader, test_loader, alpha=0.5, epochs=30):
    """Train MTL model with balance parameter alpha"""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    
    for epoch in range(epochs):
        train_loss_std = 0
        train_loss_rob = 0
        
        for x, y in train_loader:
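            # Generate adversarial examples against the robust head
            # (assumes pgd_attack only calls model(x); the default forward returns a tuple)
            x_adv = pgd_attack(x, y, lambda inp: model(inp, task='robust'), eps=0.3, steps=5)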
            
            # Forward pass
            logits_std = model(x, task='standard')
            logits_rob = model(x_adv, task='robust')
            
            # Multi-task loss
            loss_std = nn.CrossEntropyLoss()(logits_std, y)
            loss_rob = nn.CrossEntropyLoss()(logits_rob, y)
            
            loss = alpha * loss_std + (1 - alpha) * loss_rob
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            train_loss_std += loss_std.item()
            train_loss_rob += loss_rob.item()
        
        # Evaluate
        std_acc, rob_acc = evaluate_mtl_model(model, test_loader)
        
        # Ensemble prediction (average both heads)
        ensemble_acc = evaluate_mtl_ensemble(model, test_loader)
        
        print(f"Epoch {epoch}: α={alpha}, Std={std_acc:.1f}%, Rob={rob_acc:.1f}%, Ens={ensemble_acc:.1f}%")
    
    return model

# Sweep alpha values
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
frontier_mtl = []

for alpha in alphas:
    model_mtl = MTLRobustModel()
    model_mtl = mtl_train(model_mtl, train_loader, test_loader, alpha=alpha, epochs=30)
    
    std_acc, rob_acc = evaluate_mtl_model(model_mtl, test_loader)
    ens_acc = evaluate_mtl_ensemble(model_mtl, test_loader)
    
    frontier_mtl.append((std_acc, rob_acc, ens_acc, alpha))
    print(f"α={alpha}: Std={std_acc:.1f}%, Rob={rob_acc:.1f}%, Ens={ens_acc:.1f}%")

Expected Output (final summary line per α; per-epoch lines omitted):

α=0.0: Std=50.2%, Rob=60.3%, Ens=57.8%
α=0.25: Std=72.5%, Rob=52.1%, Ens=63.4%
α=0.5: Std=86.3%, Rob=42.1%, Ens=66.2%
α=0.75: Std=91.2%, Rob=28.5%, Ens=64.1%
α=1.0: Std=95.4%, Rob=0.8%, Ens=48.2%

Numerical / Shape Notes:
- Pareto frontier: (standard, robust) accuracy moves from (50%, 60%) to (95%, 1%) as α increases.
- Ensemble benefit: among the swept values, ensemble accuracy peaks at α=0.5 (66.2%), where both heads receive equal weight in the loss.
- Task specialization: the standard head is trained on the α-weighted objective, the robust head on the (1-α)-weighted one.
- Parameter sharing benefit: the shared-encoder model has about 236k parameters, roughly 0.5% more than a single-task model (about 235k) and about half of two separate models (about 470k).
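The parameter-sharing claim can be checked by counting parameters directly from the layer sizes in `MTLRobustModel`. The helper `linear_params` below is introduced only for this illustration.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

encoder = linear_params(784, 256) + linear_params(256, 128)
head = linear_params(128, 10)

single_model = encoder + head        # one encoder, one head
shared_mtl = encoder + 2 * head      # one encoder, two heads (MTLRobustModel)
two_separate = 2 * single_model      # two independent single-task models

print(single_model, shared_mtl, two_separate)  # 235146 236436 470292
print(f"shared / single = {shared_mtl / single_model:.4f}")
```

Almost all parameters sit in the encoder, so the second head is nearly free in parameter count.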

C.20 — Out-of-Distribution Robustness

Code:

def apply_transformations(x, transform_type, magnitude):
    """Apply OOD shifts to images"""
    if transform_type == 'rotation':
        return transforms.functional.rotate(x, magnitude)
    elif transform_type == 'brightness':
        return transforms.functional.adjust_brightness(x, magnitude)
    elif transform_type == 'blur':
        if magnitude <= 0:
            return x  # GaussianBlur requires sigma > 0; treat 0 as "no blur"
        return transforms.GaussianBlur(kernel_size=3, sigma=magnitude)(x)
    else:
        return x

def evaluate_ood_robustness(model, test_loader, shifts):
    """Evaluate robust & standard accuracy under distribution shifts"""
    results = {}
    
    for shift_type, magnitudes in shifts.items():
        shift_results = {'magnitudes': magnitudes, 'clean_acc': [], 'robust_acc': []}
        
        for mag in magnitudes:
            clean_acc = 0
            robust_acc = 0
            count = 0
            
            for x, y in test_loader:
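                # Apply shift to the whole batch (torchvision tensor transforms accept [..., H, W];
                # stacking per-sample slices x[i:i+1] would add a spurious extra dimension)
                x_shifted = apply_transformations(x, shift_type, mag)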
                x_shifted = torch.clamp(x_shifted, 0, 1)
                
                # Clean accuracy on shifted
                with torch.no_grad():
                    preds = model(x_shifted).argmax(dim=1)
                    clean_acc += (preds == y).sum().item()
                
                # Robust accuracy (PGD on shifted)
                x_adv = pgd_attack(x_shifted, y, model, eps=0.3, steps=10)
                with torch.no_grad():
                    preds_adv = model(x_adv).argmax(dim=1)
                    robust_acc += (preds_adv == y).sum().item()
                
                count += len(y)
            
            shift_results['clean_acc'].append(100 * clean_acc / count)
            shift_results['robust_acc'].append(100 * robust_acc / count)
        
        results[shift_type] = shift_results
    
    return results

# Define shifts
shifts = {
    'rotation': [0, 15, 30, 60],
    'brightness': [0.5, 0.75, 1.0, 1.5],
    'blur': [0, 1, 2, 4]
}

# Evaluate different training methods
results_standard = evaluate_ood_robustness(model_standard, test_loader, shifts)
results_pgd = evaluate_ood_robustness(model_pgd_trained, test_loader, shifts)
results_augmented = evaluate_ood_robustness(model_with_augmentation, test_loader, shifts)

print("Out-of-Distribution Robustness Results:")
print("\nRotation (degrees):")
for model_name, results in [('Standard', results_standard), ('PGD', results_pgd), ('Augmented', results_augmented)]:
    print(f"{model_name}:")
    for mag, clean_acc, robust_acc in zip(results['rotation']['magnitudes'], 
                                          results['rotation']['clean_acc'], 
                                          results['rotation']['robust_acc']):
        print(f"  {mag}°: Clean={clean_acc:.1f}%, Robust={robust_acc:.1f}%")

Expected Output:

Out-of-Distribution Robustness Results:

Rotation (degrees):
Standard:
  0°: Clean=95.2%, Robust=0.8%
  15°: Clean=78.3%, Robust=0.2%
  30°: Clean=62.1%, Robust=0.1%
  60°: Clean=35.4%, Robust=0.0%

PGD:
  0°: Clean=89.5%, Robust=54.2%
  15°: Clean=84.2%, Robust=48.3%
  30°: Clean=72.8%, Robust=35.2%
  60°: Clean=45.3%, Robust=12.1%

Augmented:
  0°: Clean=93.1%, Robust=2.1%
  15°: Clean=88.7%, Robust=1.8%
  30°: Clean=81.5%, Robust=1.5%
  60°: Clean=62.3%, Robust=0.8%

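Numerical / Shape Notes:
- Standard model sensitivity: clean accuracy drops about 33 pp at 30° rotation (95.2% → 62.1%) and about 60 pp at 60° (→ 35.4%).
- Robust training helps OOD: the PGD-trained model is more stable, with about 11 pp higher clean accuracy than the standard model at 30° (72.8% vs. 62.1%).
- Data augmentation: the augmented model generalizes best under rotation (about 27 pp better than standard at 60°), but its robust accuracy stays at about 2% or below, so augmentation alone does not confer adversarial robustness.
- Robustness transfer: in-distribution norm-bounded robustness partially transfers to OOD shifts (the threat models differ, but both plausibly favor more invariant features).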

End of C Solutions

Enhanced Solutions Reference — C.1 through C.20

The following enhancements provide deeper pedagogical context, theoretical grounding, and practical guidance for each of the 20 C exercises:

C.1 Enhancement — FGSM: Explanation, Misconceptions, and Connections

Explanation: Fast Gradient Sign Method (FGSM) is one of the simplest adversarial attacks. It exploits the locally linear behavior of neural networks: in a small neighborhood around a correctly classified input $\mathbf{x}$, the loss is well approximated by its first-order expansion. By taking a single step of size $\epsilon$ along the sign of the input gradient (the steepest-ascent direction under an $\ell_\infty$ budget), FGSM often finds a point just across the decision boundary. The gradient concentrates on decision-relevant features (edges and textures in images) rather than on random pixels, showing that adversarial perturbations are not random noise but structured exploitations of the model’s feature usage. The perturbation is nearly imperceptible to humans yet causes misclassification, exemplifying the adversarial vulnerability of modern neural networks.

ML Interpretation: FGSM operationalizes the concept of model margin (Definition 9) and gradient geometry. Theorem 3 (First-Order Adversarial Approximation) justifies FGSM’s effectiveness: the first-order Taylor expansion bounds the loss change upon perturbation, and maximizing along the gradient achieves this bound efficiently. The attack reveals that models trained only on clean data have narrow margins around decision boundaries. Relating to the robustness-accuracy spectrum (Theorem 7), FGSM attacks show this spectrum exists: models with better accuracy have larger gradients (steep decision boundaries), making them more susceptible to FGSM. The gradient structure also connects to implicit model inductive biases—the gradient concentrates on low-level features initially trained, not on all possible features equally.

Failure Modes: (1) Numerical instability: If gradient computation uses imprecise backpropagation (e.g., low-precision floating-point), noise in gradients leads to inconsistent attack direction, reducing success rate. Fix: use high-precision (float32+) gradients. (2) Clipping loss: Clipping to $[0, 1]$ after perturbation can “reset” a near-successful attack, causing the model to re-classify correctly by chance. Mitigate: handle edge cases by iterative verification. (3) Gradient masking: If the model has gradient obfuscation (e.g., using discrete activations, ReLU with many zeros), computed gradients may be zero or unreliable, making FGSM fail despite the model being vulnerable to adaptive attacks. Detect: compare FGSM success with PGD (Example 2); if FGSM >> PGD, gradient masking is suspected. (4) Batch processing errors: Computing perturbation per example vs. per batch can yield different results (batch norm leads to input-dependent statistics); fix by disabling batch norm during attack evaluation or using consistent batch statistics.

Common Mistakes: (1) Using loss gradients vs output gradients: Some implementations accidentally use gradients of output logits directly rather than cross-entropy loss, leading to incorrect attack vectors. Verify: gradient w.r.t. cross-entropy loss, not raw logits. (2) Forgetting to detach gradients: If using torch.autograd, failing to detach $\nabla_\mathbf{x} \ell$ before the subsequent step can cause backprop to flow through the attack, leading to incorrect perturbations and inflated gradients. (3) Mixing norms: Computing $\ell_\infty$ threat model but normalizing by $\ell_2$ norm or vice versa results in perturbations violating the claimed threat model. Always match norm throughout. (4) Not handling batch norm correctly: Running inference with batch norm in training mode (using batch statistics) vs. eval mode (using moving averages) changes attack effectiveness. Standard practice: use eval mode for clean models, then attack.

Chapter Connections:
- Definitions: Directly uses Definition 1 (Adversarial Perturbation), Definition 9 (Margin), Definition 12 (Gradient-Based Attack). Relates to Definition 14 (Robust Optimization Problem): FGSM is one attack type in the threat model.
- Theorems: Justification by Theorem 3 (First-Order Adversarial Approximation). Connects to Theorem 7 (Margin-Based Robustness Guarantee) by testing it empirically.
- Examples: Direct empirical validation of Example 2 (FGSM on Simple Network) on real models. Forms basis for Example 9 (Adversarial Training via PGD) by serving as initial attack for training.
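The single signed-gradient step can be made concrete on a logistic-regression model, where the input gradient has a closed form and no autograd is needed. This is a minimal sketch, not the network attack from exercise C.1; the weights, input, and the helper names `logistic_loss` and `fgsm` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, b, x, y):
    """Cross-entropy loss for one example with label y in {0, 1}."""
    p = sigmoid(x @ w + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps, then clip to the valid range [0, 1]."""
    grad_x = (sigmoid(x @ w + b) - y) * w   # d(loss)/dx for logistic regression
    x_adv = x + eps * np.sign(grad_x)
    return np.clip(x_adv, 0.0, 1.0)

rng = np.random.default_rng(0)
w, b = rng.normal(size=20), 0.0             # hypothetical model
x = rng.uniform(0.2, 0.8, size=20)          # hypothetical input in the valid range
y = float(x @ w + b > 0)                    # label chosen so x starts correctly classified

x_adv = fgsm(w, b, x, y, eps=0.1)
print("loss before:", logistic_loss(w, b, x, y))
print("loss after: ", logistic_loss(w, b, x_adv, y))
print("max |perturbation|:", np.abs(x_adv - x).max())
```

The gradient is taken with respect to the cross-entropy loss, not the raw logit, and the perturbation never exceeds the $\ell_\infty$ budget, matching the first and third common mistakes listed above.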

C.2 Enhancement — PGD: Explanation, Misconceptions, and Connections

Explanation: Projected Gradient Descent (PGD) improves upon FGSM by iteratively refining the perturbation. Rather than relying on a single-step linear approximation, PGD performs iterative constrained optimization: at each step, it computes the gradient of the loss with respect to the perturbation, moves in the direction of increasing loss, and then projects back onto the threat ball. This iterative process allows PGD to find stronger attacks than the single step FGSM takes. The iteration count $T$ trades attack strength against computational cost: early iterations gain quickly, later iterations refine incrementally. PGD is the de facto standard for adversarial robustness evaluation because it approximates the worst-case perturbation (within the limits of first-order optimization).

ML Interpretation: PGD targets the adversarial robustness problem directly, approximating its solution with first-order methods: it seeks $\arg\max_{\|\delta\| \leq \epsilon} \ell(f_\theta(\mathbf{x} + \delta), y)$. This connects to Theorem 4 (Dual Form of Robust Optimization Objective), showing that rigorous defense evaluation requires solving this inner maximization problem. The convergence dynamics of PGD reveal landscape structure: a rapid initial loss increase (large gradients) followed by a plateau (diminishing returns near a local maximum or the edge of the threat ball) reflects the non-convex nature of the optimization landscape. The step size trades convergence speed against accuracy: steps that are too large overshoot local maxima, and steps that are too small waste iterations. This tuning problem mirrors hyperparameter optimization in standard training.

Failure Modes: (1) Oscillation with large step size: If $\alpha$ exceeds stability bounds, perturbations oscillate around local maxima without converging, leading to inconsistent success rates across random seeds. Fix: grid-search the step size and monitor the loss trajectory for smoothness. (2) Early stopping at suboptimal points: If the attack stops before convergence (e.g., after 5 iterations for speed), PGD may miss stronger perturbations, so the evaluation overestimates robustness. Mitigate: use enough iterations for the loss to plateau (determine this empirically). (3) Poor initialization: Zero initialization of $\delta$ can miss attack directions that random initialization reaches; random starts, ideally with several restarts, are more reliable. (4) Numerical precision: Projecting onto $\ell_2$ balls requires accurate norm calculations; low-precision arithmetic introduces errors, especially after many iterations.

Common Mistakes: (1) Gradient not detached between iterations: Backpropagating through the attack loop can create unintended gradients and divergence. Always detach perturbations and recompute gradients fresh each iteration. (2) Inconsistent projection order: Some implementations project before applying the gradient step and others after; projecting only before the step can leave the final perturbation outside the threat ball. Standard: update, then project. (3) Not re-computing gradients w.r.t. new perturbations: Reusing gradients from previous iterations or not recomputing $\nabla_\delta \ell(f(\mathbf{x} + \delta_{t}), y)$ leads to stale gradients, missing improved directions. (4) Forgetting random initialization: Defaulting to $\delta_0 = 0$ (zero initialization) sometimes gets stuck; random initialization explores more and typically succeeds more often.

Chapter Connections:
- Definitions: Uses Definition 12 (Gradient-Based Attack) in iterative form. Relates to Definition 14 (Robust Optimization Problem) by conducting the inner maximization.
- Theorems: Central to Theorem 4 (Dual Form); PGD approximates the solution structure. Validates Theorem 7 empirically: stronger attacks (PGD >> FGSM) demonstrate margin-based bounds are tight.
- Examples: Extends Example 2 (FGSM) to iterative form (Example 9 uses PGD for adversarial training). Bridges to Example 11 (TRADES): PGD generates adversarial examples for TRADES training.
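The update-then-project loop with a random start can be sketched on the same kind of closed-form logistic model, which keeps the example free of autograd. This is an illustrative $\ell_\infty$ variant under assumed names (`pgd_linf`, `input_gradient`) and hypothetical weights, not the `pgd_attack` helper used in the exercises.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(w, b, x, y):
    """Gradient of the logistic loss with respect to the input x (label y in {0, 1})."""
    return (sigmoid(x @ w + b) - y) * w

def pgd_linf(w, b, x, y, eps, step_size, steps, rng):
    """Iterated signed-gradient ascent with projection onto the l_inf ball around x."""
    delta = rng.uniform(-eps, eps, size=x.shape)      # random start inside the ball
    for _ in range(steps):
        grad = input_gradient(w, b, x + delta, y)      # fresh gradient every step
        delta = delta + step_size * np.sign(grad)      # ascent step
        delta = np.clip(delta, -eps, eps)              # project onto the threat ball
        delta = np.clip(x + delta, 0.0, 1.0) - x       # keep x + delta a valid input
    return x + delta

rng = np.random.default_rng(0)
w, b = rng.normal(size=20), 0.0                        # hypothetical model
x = rng.uniform(0.2, 0.8, size=20)
y = float(x @ w + b > 0)

x_adv = pgd_linf(w, b, x, y, eps=0.1, step_size=0.025, steps=10, rng=rng)
print("max |perturbation|:", np.abs(x_adv - x).max())
print("logit before/after:", x @ w + b, x_adv @ w + b)
```

For a linear model the gradient sign never changes, so PGD ends at the same corner of the ball that FGSM reaches in one step; the extra iterations only pay off once the model is nonlinear.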

C.3 Enhancement — Lipschitz Constants: Explanation, Misconceptions, and Connections

Explanation: The Lipschitz constant $L_f$ measures the maximum rate of change of a neural network’s output with respect to its input. Formally, $L_f = \sup_{\mathbf{x} \neq \mathbf{x}'} \frac{\|f(\mathbf{x}) - f(\mathbf{x}')\|_2}{\|\mathbf{x} - \mathbf{x}'\|_2}$. Intuitively, a smaller $L_f$ means the network is “smooth” (small input changes cause small output changes), making adversarial perturbations harder to find. Computing $L_f$ exactly requires the worst case over all input pairs, which is intractable for general networks, so we use upper bounds via spectral norms. For feedforward networks with 1-Lipschitz activations such as ReLU, the product of per-layer spectral norms gives an upper bound: $L_f \leq \prod_i \sigma_{\max}(W_i)$. This product form shows why the bound can grow quickly with depth: each layer can amplify, so a network with 10 layers, each with $\sigma_{\max} = 2$, has $L_f \leq 2^{10} = 1024$, meaning the bound permits a small input change to produce an output change up to 1024 times larger.

ML Interpretation: The Lipschitz constant is the key quantity linking input perturbation budgets to output changes. Theorem 1 (Lipschitz Bound for Robust Perturbations) bounds the robustness problem using $L_f$. Theorem 7 (Margin-Based Robustness Guarantee) shows that the certified robustness radius is $r = m / L_f$, so the certified region shrinks as $L_f$ grows. Reducing $L_f$ through spectral normalization therefore improves certified robustness directly, provided the margin $m$ does not shrink proportionally. This also connects to the implicit regularization of SGD: large-batch training is often reported to reach larger spectral norms (larger Lipschitz bounds), while small-batch training tends toward smaller norms. The empirical Lipschitz estimate via finite differences provides a sanity check. It is a lower bound on $L_f$, so an empirical value far below the theoretical bound suggests the bound is loose, while an empirical value above it indicates a bug in one of the two computations.

Failure Modes: (1) SVD cost and stability: A full SVD of a large weight matrix is expensive and can be numerically delicate for ill-conditioned or rank-deficient matrices. Fix: use power iteration, which only needs the top singular value. (2) Misattributing bias terms: Biases shift outputs but do not change the Lipschitz constant, so only the weight matrices enter the product bound; biases still matter for the margin. (3) Mishandling convolutional layers: Reshaping conv weights to matrix form incorrectly (wrong shape) leads to incorrect spectral norms. Common practice: for 4D conv weights $[out, in, h, w]$, reshape to $[out, in \times h \times w]$ (group by output channel). Note that this reshaped norm is the heuristic used by spectral normalization and is not the exact operator norm of the convolution. (4) Not handling batch norm correctly: In eval mode a batch norm layer is a per-channel affine map whose scale depends on the learned gain and the running variance; treating it as unit-Lipschitz is an approximation. For certified bounds, either account for batch norm explicitly or remove it during evaluation.

Common Mistakes: (1) Confusing spectral norm of weight and network: The spectral norm of a weight matrix $W$ is $\sigma_{\max}(W)$, NOT the Frobenius norm $\|W\|_F$ or max absolute entry. Compute via SVD, not naive row/column norms. (2) Forgetting to account for activation functions: The Lipschitz of ReLU is 1, sigmoid is $1/4$, tanh is 1. Non-1-Lipschitz activations require adjustment to the product formula. (3) Empirical Lipschitz sampling bias: Sampling perturbations uniformly in the threat ball misses adversarial directions; sample worst-case pairs or use adversarial perturbations as samples for more realistic empirical Lipschitz. (4) Not handling zero outputs: If a layer outputs many zeros (common with ReLU), the Lipschitz can be artificially small in those directions; full analysis should account for output subspace structure.

Chapter Connections:
- Definitions: Directly implements Definition 4 (Lipschitz Continuity). Relates to Definition 14 (Robust Optimization Problem) by tightening bounds on achievable robustness.
- Theorems: Central to Theorem 1 (Lipschitz Bound); this exercise computes the bound. Theorem 7 (Margin-Based Robustness Guarantee) uses $L_f$ directly in the radius formula.
- Examples: Empirical validation of Example 3 (Lipschitz Certification): finite differences confirm theoretical bounds. Bridges to Example 6 (Spectral Normalization): spectral norm control is the primary mechanism to reduce $L_f$.
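The relation between the product bound and an empirical estimate can be illustrated on a small random ReLU network. The sketch below uses power iteration for the spectral norms and random nearby pairs for the empirical estimate. The network sizes, scaling, and helper names (`spectral_norm`, `relu_mlp`) are assumptions made for the illustration.

```python
import numpy as np

def spectral_norm(W, iters=200):
    """Largest singular value of W via power iteration on W^T W."""
    v = np.ones(W.shape[1]) / np.sqrt(W.shape[1])
    for _ in range(iters):
        v = W.T @ (W @ v)
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)

def relu_mlp(x, weights):
    """Bias-free ReLU network; biases would not change the Lipschitz constant."""
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)
    return weights[-1] @ x

rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 8)) / 4,
           rng.normal(size=(16, 16)) / 4,
           rng.normal(size=(3, 16)) / 4]

# Upper bound: product of per-layer spectral norms (ReLU is 1-Lipschitz)
upper_bound = np.prod([spectral_norm(W) for W in weights])

# Empirical lower estimate: largest output/input change ratio over random nearby pairs
ratios = []
for _ in range(2000):
    x1 = rng.normal(size=8)
    x2 = x1 + 1e-3 * rng.normal(size=8)
    out_change = np.linalg.norm(relu_mlp(x1, weights) - relu_mlp(x2, weights))
    ratios.append(out_change / np.linalg.norm(x1 - x2))
empirical = max(ratios)

print(f"product bound: {upper_bound:.3f}, empirical estimate: {empirical:.3f}")
```

Random pairs rarely hit the worst-case direction, so the empirical value is typically well below the bound; sampling adversarial directions instead would tighten it from below.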

C.4 Enhancement — Adversarial Training: Explanation, Misconceptions, and Connections

Explanation: Adversarial training (AT) is a defense mechanism that trains the model on adversarially perturbed examples. Rather than minimizing loss on clean data alone, AT solves a minimax problem: for each clean sample, find the worst-case perturbation within a threat ball, then train the model to resist that perturbation. This means the model learns to be robust, not just accurate: its decision boundaries are pushed away from the training points so that the threat ball around each point remains correctly classified. Crucially, adversarial training changes the decision boundary geometry: boundaries that pass close to the data are replaced by boundaries with larger margins, trading some clean accuracy for robustness. The process is computationally expensive (multiple attack iterations per training step), but empirically highly effective; it is currently among the most practical defenses.

ML Interpretation: Adversarial training implements Theorem 4 (Dual Form of Robust Optimization Objective) in practice: training alternates between maximizing loss (attack) and minimizing loss (defense). This connects to the min-max optimization structure: the inner max solves the attack problem (where to perturb), the outer min solves the defense problem (how to train). Under suitable smoothness assumptions, alternating optimization can be shown to approach stationary points of the min-max objective, though global optimality is not guaranteed. Adversarial training also has implicit regularization effects: the perturbations act as data augmentation, and robust models tend to learn more generalizable features (an empirical observation rather than a guarantee). The robustness-accuracy tradeoff (Theorems 7 and 8) is evident: AT models sacrifice clean accuracy (95% → 85–90%) for significant robustness gains.

Failure Modes: (1) Robust and catastrophic overfitting: For large $\epsilon$ or small models, AT can overfit sharply: robust training accuracy keeps rising while robust test accuracy plateaus or drops. With single-step (FGSM-based) training this can take the extreme form known as catastrophic overfitting, where accuracy against multi-step attacks collapses. Mitigate: regularization, early stopping, larger model capacity. (2) Gradient obfuscation during attack generation: If the model’s gradients are unreliable during AT (e.g., due to discrete activations or other gradient masking effects), the generated adversarial examples may not be valid, and the model isn’t truly robust. Verify: compare with adaptive attacks after training. (3) Suboptimal lower-layer features: AT forces the model to be robust early (shallow layers), sometimes preventing effective feature learning. Result: the model remains vulnerable in subtle ways (certified robustness ≪ empirical). (4) Update instability: If attack generation is inaccurate or the attack and defense steps conflict, training diverges. Mitigate: careful step size tuning, ensure attack steps converge.

Common Mistakes: (1) Using a cold start (no random initialization): Initializing perturbations at zero at every training step explores fewer attack directions. Standard: random init $\delta \sim U(-\epsilon, \epsilon)$ to explore diverse perturbations. (2) Not freshly generating attacks each epoch: Reusing pre-generated attacks leads to overfitting to specific perturbations; always generate attacks dynamically during training. (3) Mismatched threat model: Training on $\ell_\infty$ but evaluating on $\ell_2$ (or vice versa) gives misleading robustness estimates. Ensure the threat model is the same in training and evaluation. (4) Forgetting batch normalization mode: During adversarial example generation, batch norm should be in eval mode (using running stats, not batch stats), ensuring consistency with final deployment. Leaving it in train mode causes adversarial examples to differ between training and testing.

Chapter Connections:
- Definitions: Relates to Definition 14 (Robust Optimization Problem); AT implements the defense strategy. Uses Definition 12 (Gradient-Based Attack) to generate adversarial examples.
- Theorems: Directly implements Theorem 4 (Dual Form); alternates attack (inner max) and defense (outer min). Validates Theorem 7 empirically: AT increases certified robustness radius.
- Examples: Combines PGD attacks (Example 9) with training. Enables evaluation of defenses (Example 7: Comparing Defenses).
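The attack-then-defend alternation is easiest to see on a linear classifier, where the inner maximization has a closed form and no iterative attack is needed. The sketch below is a simplified stand-in for the PGD-based training in exercise C.4: labels are in $\{-1, +1\}$, and the data, learning rate, and helper names are assumptions.

```python
import numpy as np

def worst_case_inputs(w, X, y, eps):
    """Inner max for a linear model: the l_inf-optimal shift is -eps * y * sign(w)."""
    return X - eps * y[:, None] * np.sign(w)

def train(X, y, eps, lr=0.1, epochs=200):
    """Gradient descent on the logistic loss of worst-case inputs (eps=0 is standard training)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        X_adv = worst_case_inputs(w, X, y, eps)       # attack the current model
        margins = y * (X_adv @ w + b)
        coeff = -y / (1.0 + np.exp(margins))          # d(loss)/d(logit) per example
        w -= lr * (X_adv.T @ coeff) / len(y)          # defend: descend on the attacked batch
        b -= lr * coeff.mean()
    return w, b

def robust_accuracy(w, b, X, y, eps):
    """Accuracy on worst-case inputs; eps=0 gives clean accuracy."""
    X_adv = worst_case_inputs(w, X, y, eps)
    return float(np.mean(y * (X_adv @ w + b) > 0))

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=400)
X = y[:, None] * np.array([1.0, 0.2]) + 0.3 * rng.normal(size=(400, 2))  # hypothetical data

for name, eps_train in [("standard", 0.0), ("adversarial", 0.3)]:
    w, b = train(X, y, eps_train)
    print(name, "clean:", robust_accuracy(w, b, X, y, 0.0),
          "robust:", robust_accuracy(w, b, X, y, 0.3))
```

Each step regenerates the attack against the current weights, which is the point of mistake (2) above: pre-generated perturbations would go stale as `w` changes.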

C.5 Enhancement — Linear Model Robustness Analysis: Explanation, Misconceptions, and Connections

Explanation: Linear models offer a controlled setting to study adversarial robustness theoretically. For a linear model $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$, the change in output for perturbation $\delta$ is simply $\mathbf{w}^T \delta$, making the Lipschitz constant exactly $\|\mathbf{w}\|_2$ (no nonlinearity to complicate things). For binary classification (e.g., logistic regression) with labels $y \in \{-1, +1\}$, the margin is $m = y(\mathbf{w}^T \mathbf{x} + b)$, which for a correctly classified $\mathbf{x}$ equals $|\mathbf{w}^T (\mathbf{x} - \mathbf{x}_j)|$ for any point $\mathbf{x}_j$ on the decision boundary. The certified robustness radius is $r = m / \|\mathbf{w}\|_2$, the distance from $\mathbf{x}$ to the decision boundary, directly showing the relationship between weight norm, margin, and robustness. This analytical clarity contrasts with neural networks, where the same concepts hold but computing them requires approximations. Linear models thus serve as a sanity check: if the theory fails here, it’s definitely wrong for neural networks.

ML Interpretation: Linear models exemplify Theorem 1 (Lipschitz Bound for Robust Perturbations) and Theorem 7 (Margin-Based Robustness Guarantee) exactly. There’s no approximation: the Lipschitz constant is $\|\mathbf{w}\|_2$, the margin is computable exactly, and the certified radius is $r = m / \|\mathbf{w}\|_2$ precisely. This exercise confirms that the theory is sound in the linear regime, which supports its extension to neural networks, where the spectral-norm product only upper-bounds $L_f$ and $m / L_f$ therefore becomes a conservative lower bound on the true radius. Additionally, studying adversarial training on linear models reveals fundamental tradeoffs: weight decay (which shrinks $\|\mathbf{w}\|_2$, improving certified robustness) also shrinks the margin if the data is not separable, requiring careful balancing.

Failure Modes: (1) Numerical instability in margin computation: For near-separable data, margins can be very small, leading to numerical underflow if not careful. Fix: use log-domain computations for probabilities. (2) Forgetting the bias term: The bias does not affect the Lipschitz constant $\|\mathbf{w}\|_2$, but it does enter the margin $m$ and therefore the certified radius $r = m / \|\mathbf{w}\|_2$; computing the margin from $\mathbf{w}^T \mathbf{x}$ alone gives wrong radii. (3) Confusing Lipschitz of the model with Lipschitz of the loss: The output Lipschitz $\|\mathbf{w}\|_2$ determines prediction robustness, but loss Lipschitz depends on the loss function and class labels, which is different.

Common Mistakes: (1) Computing margin as logit difference without considering scales: Margins should account for the norm of the weight vector; using raw logits can mislead. (2) Not normalizing features: If features have vastly different scales, $\|\mathbf{w}\|_2$ is dominated by weights for small-scale features, not necessarily reflecting true model sensitivity. Normalize or rescale features first. (3) Mixing adversarial and clean datasets: When evaluating robustness on adversarially perturbed data mixed with clean data, ensure proper accounting (test on only the perturbed data for robustness, only clean for accuracy).

Chapter Connections:
- Definitions: Directly uses Definition 4 (Lipschitz Continuity), Definition 9 (Margin), Definition 14 (Robust Optimization Problem) in the linear regime.
- Theorems: Provides concrete instantiation of Theorem 1 (Lipschitz Bound) and Theorem 7 (Margin-Based Robustness Guarantee) without approximation.
- Examples: Simplifies Examples 3–4 (Lipschitz computation, adversarial training) to transparent settings. Provides baseline for neural network comparisons.
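The radius formula for a linear classifier can be verified numerically: a perturbation just inside the radius leaves the prediction unchanged, and one just outside, taken in the worst-case direction, flips it. The weights and input below are hypothetical and chosen so the arithmetic is easy to follow; labels are in $\{-1, +1\}$ and the bias is included in the margin.

```python
import numpy as np

w, b = np.array([3.0, -4.0]), 1.0        # hypothetical classifier, ||w||_2 = 5
x, y = np.array([2.0, 0.5]), 1           # input with true label +1

margin = y * (w @ x + b)                 # 3*2 - 4*0.5 + 1 = 5
radius = margin / np.linalg.norm(w)      # 5 / 5 = 1

# Shortest path to the decision boundary: straight against the label along w
direction = -y * w / np.linalg.norm(w)

def predict(x):
    return 1 if w @ x + b > 0 else -1

print("margin:", margin, "certified radius:", radius)
print("just inside radius: ", predict(x + 0.99 * radius * direction))
print("just outside radius:", predict(x + 1.01 * radius * direction))
```

Dropping `b` from the margin would give 4 / 5 = 0.8 instead of 1, which is the error described in failure mode (2).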

C.6 Enhancement — Randomized Smoothing: Explanation, Misconceptions, and Connections

Explanation: Randomized smoothing is a provably robust defense that “smooths out” adversarial perturbations via random noise. The core idea: to classify input $\mathbf{x}$, instead of directly querying $f(\mathbf{x})$, draw many noisy copies $\mathbf{x} + \xi_i$ (where $\xi_i \sim \mathcal{N}(0, \sigma^2 I)$ is Gaussian noise), query the model on each, and take a majority vote. This majority vote is robust to adversarial perturbations: if the adversary adds a small perturbation with $\|\delta\|_2 \leq \epsilon$, the vote does not flip as long as $\epsilon$ is below the certified radius. In the form due to Cohen et al., the certified radius is $R = \sigma \cdot \Phi^{-1}(\underline{p_a})$, where $\underline{p_a}$ is a lower confidence bound on the probability of the top class under noise; the prediction is certified against every perturbation with $\epsilon < R$. Crucially, this certification is provable: it holds for any attack within the $\ell_2$ radius, at the confidence level used for the Monte Carlo estimate of $\underline{p_a}$.

ML Interpretation: Randomized smoothing decouples certification from the training procedure: the smoothed version of any base classifier comes with a certificate, though the certified radii are only useful when the base model stays accurate under noise. This connects theory (Theorem 5: Provable Robustness via Randomization) to practice: deployment can add smoothing at test time. However, there’s a cost: inference requires many forward passes ($N$ samples, typically 100–1000 or more), making it computationally expensive. The certified radius is large when the top class has high probability under noise ($p_a$ close to 1); low-confidence predictions yield a small certified radius, reflecting uncertainty. The noise distribution can also be varied, which changes the norm in which the guarantee holds; Gaussian noise is standard and yields $\ell_2$ certificates.

Failure Modes: (1) Insufficient sampling during inference: Using too few samples $N$ (e.g., $N < 100$) leads to high variance in the majority vote; some classifications become unstable, and certified radius estimates are unreliable. Fix: use sufficient samples (typically $N \geq 100$ for toy problems, $N \geq 1000$ for realistic confidence). (2) Noise level mismatch between training and certification: If the model is trained with noise $\sigma_{\text{train}} \neq \sigma_{\text{cert}}$, the certificate remains valid, but the base classifier is usually less accurate under the certification noise, so certified radii shrink. Typically, to maximize certified robustness, train with the same $\sigma$ used for certification. (3) Ignoring the accuracy-radius tradeoff in smoothing: Larger $\sigma$ gives larger certified radius but degrades clean accuracy (predictions become fuzzy). Optimal $\sigma$ balances the two.

Common Mistakes: (1) Computing certified accuracy (cAcc) incorrectly: Certified accuracy at radius $r$ is not the same as standard accuracy; it counts only examples that the smoothed classifier labels correctly and certifies with radius at least $r$. Many implementations confuse the two. (2) Not accounting for abstention: If the top two classes have similar probabilities, the lower bound on $p_a$ may not exceed $1/2$, and no radius can be certified. The correct behavior is to abstain (output “unknown”) and count the example as not certified; reporting such a prediction as robust is an error. (3) Using point estimates for $p_a$: Computing $p_a$ from $N$ samples introduces estimation error; plug a one-sided lower confidence bound (Clopper-Pearson, which is exact, or the approximate Wilson score bound) into the radius formula rather than the raw sample frequency.
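The certification step can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the vote count for the top class (`k_top` out of `n` noisy samples) has already been collected, uses the radius $\sigma \cdot \Phi^{-1}(p_{\text{lower}})$ with a one-sided Clopper-Pearson lower bound, and abstains when that bound does not exceed $1/2$. The function names and the default `alpha=0.001` are illustrative choices.

```python
import math
from statistics import NormalDist


def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0 or p >= 1.0:
        return 1.0
    if p <= 0.0:
        return 0.0
    total = 0.0
    for j in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                    + j * math.log(p) + (n - j) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def clopper_pearson_lower(k, n, alpha):
    """One-sided (1 - alpha) lower confidence bound on a binomial proportion."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, k / n
    for _ in range(60):  # bisection on the monotone tail probability
        mid = 0.5 * (lo + hi)
        if binom_tail_ge(k, n, mid) < alpha:
            lo = mid  # mid is too small to be consistent with k successes
        else:
            hi = mid
    return lo  # slightly conservative (never above the exact bound)


def certified_radius(k_top, n, sigma, alpha=0.001):
    """Return the certified L2 radius, or None to signal abstention."""
    p_lower = clopper_pearson_lower(k_top, n, alpha)
    if p_lower <= 0.5:
        return None  # cannot certify: abstain
    return sigma * NormalDist().inv_cdf(p_lower)


# Hypothetical vote counts: 990 of 1000 noisy samples agree on the top class.
radius = certified_radius(990, 1000, sigma=0.5)
```

Using the raw frequency `k_top / n` in place of `p_lower` would always report a radius at least as large, which is exactly the overstatement described in mistake (3).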

Chapter Connections: - Definitions: Relates to Definition 15 (Certified Robustness), Definition 14 (Robust Optimization Problem; randomized smoothing is a randomized defense strategy). - Theorems: Central to Theorem 5 (Provable Robustness via Randomization). Contrasts with Theorem 1 (Deterministic Lipschitz bounds): randomized smoothing trades extra inference computation for a certificate that does not require bounding the network’s Lipschitz constant. - Examples: Provides the certified defense in Example 7 (Comparing Defenses); its certified bounds hold with high probability over the sampling, at the price of large inference overhead.

C.7 Enhancement — Gradient Masking Detection: Explanation, Misconceptions, and Connections

Explanation: Gradient masking (also called gradient obfuscation) creates a false sense of security: a model appears robust to gradient-based attacks (like FGSM, PGD) because its gradients are uninformative or unreliable, not because it is actually robust. For example, a network whose ReLU units are mostly inactive (many neurons output zero) has sparse gradients, and adversarial perturbations computed from them may be ineffective. Another case: discrete activation functions (e.g., pure threshold units) have zero gradients almost everywhere, making gradient-based attacks fail. However, an adaptive attack that takes the obfuscation into account (e.g., estimating gradients by finite differences, or differentiating through a smooth surrogate of the non-differentiable component) breaks the model. This distinction is critical: gradient masking is a well-known pitfall in adversarial robustness research. Detecting it involves comparing gradient-free attacks (e.g., black-box or random-search attacks) with gradient-based attacks; if the gradient-based attack significantly underperforms the gradient-free one, masking is suspected.
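The detection recipe can be illustrated on a toy model. The sketch below is an assumption-laden example, not a general detector: it uses a hypothetical linear classifier whose input is quantized (so the analytic gradient is zero almost everywhere), runs a one-step sign attack once with that analytic gradient and once with a finite-difference estimate, and flags masking when the gradient-free variant succeeds far more often. The quantization step, `eps`, the step `h`, and the 0.2 gap threshold are all illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eps = 20, 200, 0.3
w = rng.choice([-1.0, 1.0], size=d)


def score(x):
    # Quantizing the input makes the score piecewise constant in x.
    return (np.floor(x * 10.0) / 10.0) @ w


def predict(x):
    return np.where(score(x) >= 0.0, 1.0, -1.0)


def analytic_grad(x):
    # What autodiff would report: floor() has zero derivative almost everywhere.
    return np.zeros_like(x)


def finite_diff_grad(x, h=0.5):
    # Central differences with a step wider than the quantization bin (0.1).
    g = np.zeros_like(x)
    for i in range(x.shape[1]):
        e = np.zeros(x.shape[1])
        e[i] = h
        g[:, i] = (score(x + e) - score(x - e)) / (2.0 * h)
    return g


def sign_attack_success_rate(x, y, grad_fn):
    # One signed step of size eps that pushes the score against the label y.
    x_adv = x - eps * y[:, None] * np.sign(grad_fn(x))
    return float(np.mean(predict(x_adv) != y))


x = rng.uniform(-0.5, 0.5, size=(n, d))
y = predict(x)  # treat the model's clean predictions as the labels
rate_grad = sign_attack_success_rate(x, y, analytic_grad)
rate_fd = sign_attack_success_rate(x, y, finite_diff_grad)
masking_suspected = rate_fd > rate_grad + 0.2
```

The analytic-gradient attack never moves the input, so its success rate is zero, while the finite-difference attack flips most predictions under the same budget. A truly robust model would resist both.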

ML Interpretation: Gradient masking violates the spirit of adversarial robustness (Definition 15), which assumes the adversary has full knowledge of the model. Masking is a form of “security through obscurity”: it provides robustness only against adversaries that rely on the model’s own gradients, not true robustness. This connects to Definition 20 (Adaptive Attacks), which explicitly accounts for defenses that might manipulate gradients. Theorem 9 (Gradient-Based Lower Bound on Robust Error) assumes exact gradients are available; gradient masking violates that assumption, so the theorem becomes inapplicable. The lesson: legitimate robustness must hold against adaptive adversaries (Definition 20), not just gradient-based ones. Multi-step PGD adversarial training (C.4) is less prone to gradient masking because the attack is recomputed against the current model at every step, but it is not immune (single-step variants in particular can mask gradients), so its robustness should still be checked with adaptive and gradient-free attacks.

Failure Modes: (1) False negatives in gradient masking detection: If a model is robust OR gradient-masked, gradient-based attacks fail; you can’t distinguish them by attack success alone. Must compare with adaptive attacks. Just because FGSM fails doesn’t mean the model is robust. (2) Overlooking subtle masking: Some models have partial gradient masking (e.g., only certain neurons suffer from sparse activations), making detection tricky. (3) Adaptive attack implementation errors: Adaptive attacks are complex; if implemented incorrectly, they might fail even on truly vulnerable models, giving false positives for “robustness” detection.

Common Mistakes: (1) Comparing FGSM and PGD without accounting for attack strength: PGD is stronger by design (iterative), so a gap between FGSM and PGD does not by itself indicate masking. To isolate gradient masking, compare attacks of matched strength (e.g., FGSM vs. 1-step PGD), or compare a gradient-based attack against a gradient-free attack with the same budget on the same model. (2) Not testing against adaptive attacks properly: Implementing adaptive attacks is non-trivial. Standardized suites such as AutoAttack, which combines strong white-box attacks with the black-box Square attack, give a reliable baseline, though a defense-specific adaptive attack may still be needed. Rolling your own risks errors. (3) Forgetting randomization in gradient-free attacks: Black-box attacks should randomize perturbations to avoid alignment with any residual gradients. Use proper distributions (e.g., random Gaussian, not deterministic).

Chapter Connections: - Definitions: Directly relates to Definition 15 (Certified Robustness), Definition 20 (Adaptive Attacks). Shows that gradient-based robustness (Definition 18) can be illusory. - Theorems: Tests Theorem 9 (Gradient-Based Lower Bound) empirically; gradient masking violates the theorem’s assumptions. - Examples: Validates Example 8 (Gradient-Based Attack Analysis): distinguishes real robustness from masking.

C.8 Enhancement — Spectral Normalization: Explanation, Misconceptions, and Connections

Explanation: Spectral normalization (SN) is a training technique that constrains the spectral norm (largest singular value) of weight matrices. For each weight matrix $W$, after each update, $W$ is divided by its spectral norm $\sigma_{\max}(W)$, so that every layer has $\sigma_{\max} = 1$. Because the Lipschitz constant of a feedforward network is at most the product of its layers’ constants, $L \leq \prod_l \sigma_{\max}(W_l)$, constraining all weight layers to unit spectral norm bounds the network’s overall $\ell_2$ Lipschitz constant by 1 (assuming 1-Lipschitz activations), which directly improves certified robustness (Definition 15). Crucially, SN doesn’t require adversarial examples during training (unlike adversarial training); it’s a simple regularization applied during normal training. The tradeoff: SN reduces the model’s expressiveness (constrained to Lipschitz ≤ 1), typically hurting clean accuracy by 2–5%, but substantially improving certified robustness (on the order of a 50–100× increase in certified radius, with the exact figures depending on architecture and dataset).

ML Interpretation: Spectral normalization operationalizes Theorem 1 (Lipschitz Bound for Robust Perturbations) during training: by controlling weight norms, it constrains an upper bound on the network’s Lipschitz constant, which provably limits how far any bounded perturbation can move the output. This connects to the implicit regularization of SGD (Chapter 20 dynamics): SN modifies the optimization landscape to favor Lipschitz-constrained solutions. The tradeoff between clean accuracy and certified robustness (Definition 15 vs. Definition 8) is evident: expressive models tend to have large Lipschitz constants, which makes robustness hard to guarantee. By constraining the Lipschitz constant upfront, SN forces the model to be inherently less sensitive, sacrificing expressiveness. SN is also used in generative adversarial networks (GANs) for training stability, suggesting that Lipschitz constraints help optimization beyond robustness.

Failure Modes: (1) SN incompatibility with batch normalization: Batch normalization rescales activations by data-dependent factors, which can undo the Lipschitz bound that SN enforces; combining the two can also over-constrain the model, sometimes hurting both accuracy and robustness. Careful combinations or strategic layer selection are needed. (2) Underfitting due to Lipschitz constraints: For some networks/datasets, constraining all layers to $\sigma_{\max} = 1$ is too restrictive; the model can’t fit the training data adequately, and training stalls at high loss or oscillates. (3) Power iteration not converging: Estimating $\sigma_{\max}$ via power iteration requires sufficient iterations; too few iterations underestimate $\sigma_{\max}$, so the normalized matrix still has spectral norm above 1 and the claimed Lipschitz bound does not hold.

Common Mistakes: (1) Applying SN to all layers indiscriminately: Some layers (e.g., final classification layer) might benefit from a larger Lipschitz constant; blindly constraining everything to 1 is suboptimal. Vary normalization targets per layer for better performance, and account for them in the overall bound. (2) Not using power iteration correctly: Power iteration needs a nonzero (typically random) starting vector and renormalization at every step; an unnormalized iterate overflows or underflows, and a starting vector orthogonal to the top singular direction never finds $\sigma_{\max}$. (3) Misunderstanding bias terms: Biases shift the affine map but do not affect its Lipschitz constant (they don’t change slopes); SN leaves them unnormalized, which is fine, but this should be a conscious choice.
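The power-iteration estimate and the normalization step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names are hypothetical, the matrix is dense, and the iteration is restarted from a random vector on each call (practical implementations usually carry the vector over between training steps).

```python
import numpy as np


def spectral_norm_power_iter(W, n_iters=50, seed=0):
    """Estimate the largest singular value of W (requires n_iters >= 1)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])  # random start, nonzero with probability 1
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)  # renormalize every step
        v = W.T @ u
        v /= np.linalg.norm(v)
    # u^T W v never exceeds the true sigma_max, so early stopping underestimates it.
    return float(u @ W @ v)


def spectral_normalize(W, n_iters=50):
    """Rescale W so that its estimated spectral norm is 1."""
    return W / spectral_norm_power_iter(W, n_iters)


W_demo = np.array([[3.0, 0.0], [0.0, 1.0]])  # singular values 3 and 1
sigma_est = spectral_norm_power_iter(W_demo)
W_sn = spectral_normalize(W_demo)
```

Because the estimate approaches $\sigma_{\max}$ from below, stopping too early leaves the normalized matrix with a spectral norm slightly above 1, which is the failure described in failure mode (3).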

Chapter Connections: - Definitions: Directly implements Definition 4 (Lipschitz Continuity) via training constraints. Enables Definition 15 (Certified Robustness). - Theorems: Operationalizes Theorem 1 (Lipschitz Bound) during training. Validates Theorem 7 empirically: SN increases certified robustness radius. - Examples: Practical implementation of Example 3 (Lipschitz Computation); extends static computation to training-time normalization.

C.9 Enhancement — Hessian Spectral Analysis: Explanation, Misconceptions, and Connections

Explanation: The Hessian of the loss (the matrix of second derivatives with respect to the parameters) encodes the local geometry of the loss landscape. At a critical point, its eigenvalues reveal whether the point is a minimum (all positive), saddle (mixed signs), or maximum (all negative). For adversarial robustness, the Hessian structure matters: at sharp minima (large Hessian eigenvalues), small perturbations change the loss significantly, and the tight curvature often goes with small margins, making robustness hard. Conversely, flat minima (small Hessian eigenvalues) tolerate larger perturbations while staying inside the basin, suggesting better robustness. This curvature is measured in parameter space, so its link to input-space robustness is heuristic rather than guaranteed. Computing Hessian eigenvalues is expensive (requires second-order derivatives), but for small networks or a subset of parameters, it’s tractable. The exercise computes Hessian spectra during training, monitoring how the landscape evolves: early in training (near random initialization), the Hessian might have many negative eigenvalues (far from a minimum); later, it becomes approximately positive semidefinite (inside a basin); very late in training, the minimum may sharpen further (Hessian eigenvalues grow).

ML Interpretation: Hessian analysis connects to the implicit bias of SGD (Chapter 19: Stochastic Gradient Dynamics). Flat minima found by small-batch SGD have smaller Hessian eigenvalues, while sharp minima found by large-batch SGD have larger eigenvalues. This reflects the effective temperature: high temperature (small batch) allows escaping sharp basins; low temperature (large batch) settles quickly in whichever basin is found. Hessian structure also connects to Theorem 2 of Chapter 20 (on landscape structure); analyzing the Hessian helps explain why certain training regimes generalize better (flatness is associated with both robustness and generalization).

Failure Modes: (1) Computational cost for large networks: The Hessian has $d^2$ entries where $d$ is the parameter count; for modern networks with millions of parameters, the full Hessian is infeasible. Mitigate: compute on small networks or a subset of parameters, or use approximations (Gauss-Newton, Fisher). (2) Numerical instability in eigenvalue decomposition: Computing eigenvalues of very ill-conditioned matrices can accumulate errors; use specialized solvers (ARPACK for sparse matrices, or techniques that avoid explicit Hessian construction). (3) Non-stationary Hessian during training: The Hessian changes as the model updates; snapshots at different iterations reflect different landscapes, making interpretation tricky without careful analysis.

Common Mistakes: (1) Computing the Hessian of the loss, not the regularized loss: If the model uses weight decay or other regularization, the true optimization objective includes the regularizer. Forgetting this gives an incomplete Hessian (for example, an $L_2$ penalty $\frac{c}{2}\|\theta\|^2$ adds $cI$, shifting every eigenvalue up by $c$). (2) Not accounting for batch effects: The Hessian computed on one batch might differ significantly from the Hessian on another (batch norm and dropout introduce batch-dependence). Compute on a fixed dataset (the test set or the full training set) with the model in eval mode. (3) Misinterpreting negative eigenvalues near convergence: At early iterations, negative eigenvalues are expected (away from a critical point). If negative eigenvalues persist after convergence, the point might be a saddle, but it could also be numerical error or the model being on a plateau with zero eigenvalues.
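One way to avoid building the full Hessian is to combine Hessian-vector products with power iteration. The sketch below is illustrative only: it uses a least-squares loss so the exact Hessian $X^\top X / n$ is available for comparison, approximates the Hessian-vector product by central differences of the gradient, and shows how a weight-decay term shifts the leading eigenvalue. All names and the synthetic data are hypothetical.

```python
import numpy as np


def make_grad(X, y, weight_decay=0.0):
    """Gradient of (1/2n)||X theta - y||^2 + (weight_decay/2)||theta||^2."""
    n = X.shape[0]

    def grad(theta):
        return X.T @ (X @ theta - y) / n + weight_decay * theta

    return grad


def hvp(grad, theta, v, r=1e-3):
    """Hessian-vector product via central differences of the gradient."""
    return (grad(theta + r * v) - grad(theta - r * v)) / (2.0 * r)


def top_hessian_eigenvalue(grad, theta, n_iters=100, seed=0):
    """Power iteration; finds the largest eigenvalue when the Hessian is PSD."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape[0])
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(n_iters):
        hv = hvp(grad, theta, v)
        lam = float(v @ hv)  # Rayleigh quotient at the current unit vector
        v = hv / np.linalg.norm(hv)
    return lam


# Synthetic data with one dominant curvature direction (illustrative only).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3)) * np.array([3.0, 1.0, 0.5])
y = rng.standard_normal(200)
theta0 = np.zeros(3)

exact_top = float(np.linalg.eigvalsh(X.T @ X / 200)[-1])
est_top = top_hessian_eigenvalue(make_grad(X, y), theta0)
est_top_wd = top_hessian_eigenvalue(make_grad(X, y, weight_decay=0.1), theta0)
```

For a neural network the gradient would come from automatic differentiation, and Lanczos-type solvers typically converge faster than plain power iteration; the structure of the computation stays the same.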

Chapter Connections: - Definitions: Relates to Definition 5 (Sharp vs. Flat Minima), Definition 9 (Margin). Hessian eigenvalues quantify sharpness. - Theorems: Connects to implicit landscape theory (Theorem 2 of Chapter 20, if relevant); Hessian structure influences SGD convergence dynamics. - Examples: Provides empirical analysis of optimization landscape geometry (Example of landscape visualization, if present).

C.10 Enhancement — Pareto Frontier Construction: Explanation, Misconceptions, and Connections

Explanation: The Pareto frontier is the set of non-dominated solutions: those for which no other solution is at least as good on every objective and strictly better on one. For the robustness-accuracy tradeoff, the frontier consists of models where improving accuracy requires sacrificing robustness, and vice versa. To construct it, vary a hyperparameter (e.g., $\lambda$ in TRADES, adversarial training with KL regularization) that controls accuracy vs. robustness emphasis, train a model for each $\lambda$, and measure both clean accuracy and robust accuracy (e.g., vs. PGD-20 attack). Plotting these points and keeping the non-dominated ones reveals a curve: the Pareto frontier. Models below the curve are dominated (some frontier model matches or beats them on both accuracy and robustness). The frontier typically shows diminishing returns: moving from 90% clean / 0% robust to 80% clean / 50% robust requires modest sacrifice, but reaching 80% clean / 80% robust might be infeasible for the given threat model and training budget. The frontier also depends on threat strength ($\epsilon$): larger perturbation budgets give worse Pareto frontiers (smaller achievable robust accuracy for the same clean accuracy).
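Extracting the frontier from a set of trained models amounts to a dominance filter. The sketch below assumes each model has already been evaluated and is summarized as a `(clean_accuracy, robust_accuracy)` pair; the numbers are made-up placeholders echoing the illustrative figures in the text, not measured results, and `pareto_front` is a hypothetical helper name.

```python
def pareto_front(points):
    """Return the non-dominated (clean_acc, robust_acc) pairs, sorted by clean accuracy."""
    front = []
    for i, (c, r) in enumerate(points):
        dominated = any(
            (c2 >= c and r2 >= r) and (c2 > c or r2 > r)
            for j, (c2, r2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((c, r))
    return sorted(set(front))


# Placeholder evaluations, one pair per trained model.
points = [(0.90, 0.00), (0.85, 0.40), (0.80, 0.50), (0.82, 0.35), (0.78, 0.45)]
front = pareto_front(points)
```

Here `(0.82, 0.35)` is dominated by `(0.85, 0.40)` and `(0.78, 0.45)` by `(0.80, 0.50)`, so only three points remain on the frontier.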

ML Interpretation: The Pareto frontier empirically demonstrates Theorem 8 (Robustness-Accuracy Tradeoff). Theory predicts fundamental limits on joint robust and clean accuracy (related to loss geometry and data separability); the frontier shows these limits in practice. Understanding the tradeoff’s shape helps practitioners choose appropriate tradeoff points: if 85% clean accuracy is acceptable, but you want to maximize robustness, the frontier shows the best achievable robust accuracy (e.g., 75%). TRADES (Definition 13, if present) is one way to navigate the frontier; the $\lambda$ parameter controls the tradeoff explicitly. Other techniques (MART, TRADESST, etc.) might trace different frontiers, showing their relative merits.

Failure Modes: (1) Insufficient points on the frontier: Using too few $\lambda$ values (e.g., 3–4 points) gives a sparse frontier, missing the actual curve’s shape. Use many values (10–20+) to trace it accurately. (2) Hyperparameter interaction: Varying $\lambda$ alone while keeping other hyperparameters fixed may not trace the true frontier; different $\lambda$ values might benefit from different learning rates, batch sizes, or early stopping. Tracing the frontier faithfully often requires per-point tuning. (3) Measurement noise: Robust accuracy is estimated via attacks (e.g., PGD-20), which are stochastic and only upper-bound the true robust accuracy; reporting single values without confidence intervals misleads. Use multiple training seeds and attack restarts, and report the spread.

Common Mistakes: (1) Confusing Pareto optimality with global optimality: A point on the frontier is Pareto optimal (no objective can be improved without hurting another) but might be suboptimal overall if other training methods (e.g., different architectures, data augmentation) are not considered. (2) Not accounting for computational cost: Different $\lambda$ values have different training times (adversarial training is slow); plotting only accuracy/robustness without showing computation might mislead practitioners who have limited compute budgets. (3) Incorrect robust accuracy measurement: Robust accuracy must be measured against reasonably strong attacks (e.g., PGD with many iterations). Using weak attacks (FGSM) overestimates robustness.

Chapter Connections: - Definitions: Empirically illustrates Definition 8 (Generalization vs. Robustness Tradeoff), Definition 15 (Certified Robustness) vs. Definition 8 (Accuracy). - Theorems: Validates Theorem 8 (Robustness-Accuracy Tradeoff). - Examples: Extends Example 7 (Comparing Defenses) to multi-objective optimization (frontier).

C.11 Enhancement — Leave-One-Out Error and Stability: Explanation, Misconceptions, and Connections

Explanation: Leave-One-Out (LOO) error measures generalization by training with one example held out, then testing on that example. Formally, for each $i \in [n]$: train on $S \setminus \{i\}$ (all but the $i$-th example) to obtain $\hat{f}_{-i}$, then evaluate $\ell(\hat{f}_{-i}(\mathbf{x}_i), y_i)$. LOO error is the average: $\text{LOO} = \frac{1}{n} \sum_i \ell(\hat{f}_{-i}(\mathbf{x}_i), y_i)$. This is leave-one-out cross-validation, an almost unbiased estimate of generalization error. LOO also relates to algorithmic stability (Definition 26, if present): for stable algorithms (small changes in training data cause small changes in predictions), the LOO error stays close to the true risk. For linear least squares, the LOO residuals can be computed from a single fit (via the matrix inversion lemma): $y_i - \hat{f}_{-i}(\mathbf{x}_i) = e_i / (1 - H_{ii})$, where $e_i$ is the ordinary residual and $H = X (X^\top X)^{-1} X^\top$ is the hat matrix, whose diagonal entries $H_{ii}$ are the leverages ($X^\top X$ is the Hessian of the squared loss up to scaling). For neural networks, true LOO requires $n$ retrainings (prohibitive), but approximations (influence functions, Definition 18 if present) exist.
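The closed-form shortcut for linear least squares can be checked against explicit retraining. The sketch below is a small NumPy illustration on synthetic data; the function names are hypothetical, and it assumes $X^\top X$ is invertible and no leverage equals 1.

```python
import numpy as np


def loo_residuals_fast(X, y):
    """LOO residuals from one fit: e_i / (1 - H_ii), with H the hat matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # diagonal of H
    beta = XtX_inv @ X.T @ y
    residual = y - X @ beta
    return residual / (1.0 - leverage)


def loo_residuals_naive(X, y):
    """LOO residuals by refitting n times, each time without example i."""
    n = X.shape[0]
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        out[i] = y[i] - X[i] @ beta_i
    return out


# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(30)

fast = loo_residuals_fast(X, y)
naive = loo_residuals_naive(X, y)
loo_mse = float(np.mean(fast ** 2))  # LOO error under squared loss
```

The two routes agree to numerical precision, but the fast one needs a single fit instead of $n$.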

ML Interpretation: LOO error and stability form the empirical foundation for generalization bounds (e.g., uniform stability implies generalization via concentration inequalities). The exercise empirically validates that regularization (e.g., weight decay) reduces LOO error by improving stability. Connection to Chapter 20 (Implicit Regularization of SGD): small-batch SGD has implicit stability properties (escapes sharp minima, finds flat solutions), which should correlate with lower LOO error and better generalization. This provides an alternative lens on why SGD generalizes well: stability under data perturbations.

Failure Modes: (1) Computational cost: True LOO requires $n$ training runs; for large datasets or slow models, this is infeasible. Use approximations or compute on subsets. (2) High variance of LOO on small datasets: LOO is nearly unbiased, but the $n$ held-out losses come from almost identical training sets and are strongly correlated, so for small $n$ the estimate can have high variance. Use k-fold or repeated cross-validation when a more stable estimate is needed. (3) Instability in influence function approximations: Computing influence functions (for fast LOO approximation) requires inverting the Hessian; numerical instability may give poor estimates.

Common Mistakes: (1) Tuning hyperparameters on LOO error: Using LOO error to select hyperparameters introduces selection bias; LOO error estimates are not perfectly independent, leading to optimistic bias. Either use a separate validation set or apply bias corrections. (2) Not using efficient LOO computation: For linear models, using full retraining instead of the matrix inversion lemma is wasteful. Know efficient formulas for your model class. (3) Confusing LOO error with influence functions: Influence functions (Definition 18) identify which examples most affect model outputs; LOO error is a different (though related) concept. Don’t mix them up.

Chapter Connections: - Definitions: Relates to Definition 26 (Stability), Definition 18 (Influence Functions). LOO error is a stability-related estimate of generalization error. - Theorems: Connects to generalization bounds (via uniform stability, if in Theorem section). - Examples: Empirical validation of stability’s role in generalization (related to Example of implicit regularization, if present).

C.12 Enhancement — Defensive Distillation: Explanation, Misconceptions, and Connections

Explanation: Defensive distillation is an early defense against adversarial attacks, now known to be largely ineffective (a good cautionary tale). The idea: train a “student” network on soft labels (probabilities) from a “teacher” network, using a high softmax temperature $T$ during training (making probabilities smoother), then deploy the student at $T = 1$. The motivation: the resulting model has very small input gradients (empirically true), so adversarial perturbations computed from those gradients are weak. However, this is gradient masking: the model appears robust because gradients are uninformative, not because it’s truly robust. Under adaptive attacks (e.g., finite differences, or attacks that work on the logits instead of the saturated softmax outputs), the model breaks despite gradient masking. This exercise is instructive in recognizing failed defenses and understanding why gradient masking is problematic.
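The vanishing-gradient effect can be seen directly on the softmax. The sketch below assumes the usual recipe (train at temperature $T$, deploy at $T = 1$) and uses hypothetical logits of the scale a student trained at $T = 20$ might produce; the numbers are illustrative, not taken from a trained model.

```python
import numpy as np


def softmax(z, T=1.0):
    s = z / T
    s = s - s.max()  # subtract the max for numerical stability
    e = np.exp(s)
    return e / e.sum()


def ce_grad_wrt_logits(z, label, T=1.0):
    """Gradient of -log softmax(z / T)[label] with respect to the logits z."""
    p = softmax(z, T)
    p[label] -= 1.0
    return p / T


# Hypothetical logits for one input whose true class is 0.
z = np.array([40.0, 0.0, -20.0])

grad_deployed = ce_grad_wrt_logits(z, label=0, T=1.0)   # softmax is saturated
grad_rescaled = ce_grad_wrt_logits(z, label=0, T=20.0)  # attacker divides logits by T

norm_deployed = float(np.linalg.norm(grad_deployed))
norm_rescaled = float(np.linalg.norm(grad_rescaled))
```

At $T = 1$ the top probability rounds to 1 and the gradient is numerically zero, so a naive gradient attack stalls. Rescaling the logits, which an adaptive attacker is free to do, restores a usable gradient without changing the model’s decision.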

ML Interpretation: Defensive distillation is an example of a gradient-masking defense, the kind of defense that adaptive attacks (Definition 20) are designed to expose. Theorem 9 (Gradient-Based Lower Bound on Robust Error) assumes exact gradients; distillation violates this, making the bound inapplicable. The exercise teaches the importance of adversarial evaluation rigor: always test against adaptive adversaries, not just gradient-based attacks. Modern robustness evaluation (Example 7: Comparing Defenses) explicitly uses adaptive attacks to avoid such pitfalls. Distillation also connects to implicit knowledge transfer in models; though unsuccessful for robustness, knowledge distillation has other applications (model compression, transfer learning).

Failure Modes: (1) Gradient masking detection failure: If adaptive attacks aren’t properly implemented, distillation might appear effective. Mitigate: use reliable, well-tested attack suites (e.g., AutoAttack) alongside attacks adapted to the defense. (2) Temperature parameter tuning: With $T$ close to 1 the soft labels are nearly hard and distillation changes little; with very large $T$ the soft labels approach uniform and lose information, while deployment at $T = 1$ saturates the softmax and makes gradients vanish (the masking effect). Finding the sweet spot is non-trivial. (3) Computational overhead without benefit: Distillation adds training cost (requires running the teacher network), but most of the apparent benefit is gradient masking (not real robustness), making it inefficient.

Common Mistakes: (1) Using only gradient-based attacks to evaluate: Evaluating only FGSM or basic PGD on distilled models leads to false confidence. Always use adaptive attacks. (2) Not comparing with other defenses: Comparing distillation only to undefended models doesn’t show its true ineffectiveness. Compare with other defenses (adversarial training, randomized smoothing, etc.) using the same threat model. (3) Confusing temperature parameter in distillation with temperature in Langevin dynamics: They’re different concepts (both called “temperature”); in distillation, it controls label smoothness; in dynamics, it controls exploration. Don’t conflate.

Chapter Connections: - Definitions: Illustrates Definition 20 (Gradient Masking / Adaptive Attacks); defensive distillation is a failed gradient-masking defense. - Theorems: Violates assumptions of Theorem 9 (Gradient-Based Lower Bound); shows when theoretical guarantees can fail due to gradient masking. - Examples: Cautionary example in Example 7 (Comparing Defenses); used to show what not to do in practice.

C.13 Enhancement — Transferability of Adversarial Examples: Explanation, Misconceptions, and Connections

Explanation: Transferability is the phenomenon that adversarial examples crafted against one model often fool other models. For instance, an adversarial example against ResNet18 often also fools ResNet50 or VGG, even though the models have different architectures and were trained separately. This is important for adversarial robustness: it suggests that adversarial vulnerability is shared across many neural networks rather than specific to one, which is both concerning (robust solutions must work across architectures) and helpful (insights from one model transfer). The exercise constructs adversarial examples against Model A (e.g., ResNet18), then tests them on Models B, C, D (different architectures, independent random initializations), measuring the transfer success rate. Intuitively, models with similar architectures have higher transfer rates; models trained on different datasets have lower transfer rates. A key finding: adversarial examples transfer surprisingly well, even across very different architectures, suggesting a shared adversarial subspace.

ML Interpretation: Transferability connects to implicit model inductive biases (Chapter 20: What similarities exist between trained models?). If neural networks of different architectures converge to similar decision boundaries (in some embedded space), adversarial perturbations targeting one boundary should transfer. This relates to universal adversarial perturbations (UAPs): single, input-agnostic perturbations that fool a model on most inputs and often transfer across models as well. Transferability is empirical evidence that adversarial vulnerability is not architecture-specific, but rather a property of the way neural networks learn in high-dimensional spaces. Transferability also has implications for black-box attacks: if you can transfer examples from a proxy model, you can attack unknown target models without querying them (which makes the attack harder to detect).

Failure Modes: (1) Transferability drop with defensive mechanisms: Adversarial training (C.4) increases robustness to both direct attacks and transferred examples (adversarially trained models have significantly lower transfer success). If comparing against adversarially trained models, transfer rates drop. (2) Unfair model comparisons: Transferability depends on Model A’s attack strength; using only weak attacks (FGSM) on Model A gives weak transferred examples, underestimating true transferability. Use strong attacks (PGD) on Model A for fair comparison. (3) Domain shift effects: If Model A and Models B/C are trained on different datasets, transferability is partly due to dataset (data-specific adversarial examples), not model architecture. Control for this by using the same training data.

Common Mistakes: (1) Not measuring successful transfer correctly: Transfer success should be computed over adversarial examples that fool Model A and whose clean versions both models classify correctly, counting how many of them also fool Model B. Some implementations count every input that Model B misclassifies after the attack, including inputs it already misclassified when clean, which inflates the transfer rate. (2) Using only untargeted attacks: Untargeted attacks (any misclassification) transfer more easily; targeted attacks (a specific wrong class) transfer less. Report both for a complete picture. (3) Ignoring model capacity differences: Very different-capacity models (small CNN vs. large ResNet) have different transferability than similar-capacity models. Account for this in comparisons.
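The bookkeeping behind the transfer rate is easy to get wrong, so a small sketch may help. It assumes predictions have already been collected as integer class arrays and restricts the denominator to examples that both models classify correctly when clean and that fool Model A after the attack; that restriction is one reasonable convention, not the only one. The function name and the toy arrays are hypothetical.

```python
import numpy as np


def transfer_success_rate(y, a_clean, a_adv, b_clean, b_adv):
    """Fraction of A-fooling adversarial examples that also fool B."""
    y, a_clean, a_adv, b_clean, b_adv = map(
        np.asarray, (y, a_clean, a_adv, b_clean, b_adv)
    )
    # Eligible: both models correct on the clean input, and the attack fools A.
    eligible = (a_clean == y) & (b_clean == y) & (a_adv != y)
    if not eligible.any():
        return float("nan")
    return float(np.mean(b_adv[eligible] != y[eligible]))


# Toy predictions for six inputs (illustrative only).
y       = [0, 1, 2, 0, 1, 2]
a_clean = [0, 1, 2, 0, 1, 0]  # A is wrong on the last clean input
a_adv   = [1, 1, 0, 2, 0, 0]  # the attack fails on input 1
b_clean = [0, 1, 2, 1, 1, 2]  # B is wrong on clean input 3
b_adv   = [1, 1, 0, 1, 1, 2]

rate = transfer_success_rate(y, a_clean, a_adv, b_clean, b_adv)
naive_rate = float(np.mean(np.asarray(b_adv) != np.asarray(y)))
```

In this toy case three inputs are eligible and two of them fool B, giving a rate of 2/3. The naive count over all six inputs gives 0.5 because it mixes in an input B already got wrong when clean and inputs where the attack never succeeded on A.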

Chapter Connections: - Definitions: Relates to Definition 12 (Gradient-Based Attack), Definition 14 (Robust Optimization Problem). Transferability is implicit in evaluating robustness across models. - Theorems: Empirical study of an implicit consequence of Theorem 9 (Gradient-Based Lower Bound); if robustness is limited by first-order geometry, it’s shared across architectures. - Examples: Extends Example 8 (Gradient-Based Attack Analysis) to multi-model settings.

C.14 Enhancement — TRADES Training: Explanation, Misconceptions, and Connections

Explanation: TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization) is a widely used adversarial training method that explicitly balances clean accuracy and robustness. Unlike standard adversarial training (AT), which minimizes the loss on adversarial examples (sometimes hurting clean accuracy), TRADES keeps the clean loss and adds a regularization term that keeps the model’s output distribution on adversarial examples close to its output on clean examples. Formally: $\min_{\theta} \left[\ell(f_\theta(\mathbf{x}), y) + \lambda \, \text{KL}(f_\theta(\mathbf{x}) \,\|\, f_\theta(\mathbf{x} + \delta^*(\mathbf{x})))\right]$, where $\delta^*(\mathbf{x}) = \arg\max_{\|\delta\| \leq \epsilon} \text{KL}(f_\theta(\mathbf{x}) \,\|\, f_\theta(\mathbf{x} + \delta))$ is the perturbation that changes the prediction most. The inner maximization targets the KL term rather than the labeled loss, so the adversarial example does not depend on $y$. The KL term asks the model to produce the same soft prediction on clean and adversarial inputs, while the labeled loss is applied only to clean inputs; standard AT instead applies the labeled loss directly to adversarial inputs. This separation often preserves clean accuracy while still achieving good robustness.

ML Interpretation: TRADES operationalizes the Pareto frontier concept (C.10): by varying $\lambda$, TRADES can navigate different points on the frontier. Large $\lambda$ emphasizes the KL regularizer (similar outputs on clean/adv), favoring robustness at some cost in clean accuracy; small $\lambda$ emphasizes the clean loss, favoring clean accuracy and accepting lower robustness. TRADES also connects to information theory: the KL divergence measures how much the predictive distribution changes between clean and adversarial inputs. A robust model should make similar predictions regardless of small perturbations, and TRADES enforces this via information-theoretic regularization. This differs from standard AT, which trains on hard labels at adversarial points and can therefore produce larger, noisier loss changes during training.

Failure Modes: (1) Computational cost: TRADES requires computing adversarial examples (via PGD) for each batch, similar to standard AT, but with an additional KL divergence computation. This is expensive and often requires careful implementation for efficiency. (2) KL divergence can be unstable: When either distribution is highly confident (probabilities near zero or one), computing KL from probabilities can produce $\log 0$ or be dominated by floating-point error. Compute it from log-softmax outputs instead of exponentiated probabilities. (3) Hyperparameter $\lambda$ sensitivity: TRADES introduces a tuning parameter; different datasets/models require different $\lambda$ for good Pareto frontiers. Finding the right value requires experimentation.

Common Mistakes: (1) Using hard KL (cross-entropy) instead of soft KL: Some implementations use cross-entropy between clean and adversarial predictions (treating one as hard labels), rather than KL divergence between two soft distributions. This is incorrect; the difference is subtle but important. (2) Not recomputing adversarial examples: If pre-generated adversarial examples are reused, TRADES becomes ineffective (training overfits to specific perturbations). Generate adversarial examples on-the-fly during training. (3) Neglecting temperature in KL computation: Softening probability distributions with temperature before KL sometimes helps (reduces the effect of sharp distributions); some implementations forget this tuning.
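The outer TRADES objective can be written compactly once clean and adversarial logits are available. The sketch below is a NumPy illustration of the loss value only: it assumes the adversarial logits were produced by some inner attack that is not shown, computes the KL term between two soft distributions from log-softmax outputs for stability, and uses hypothetical names and toy logits.

```python
import numpy as np


def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def trades_loss(logits_clean, logits_adv, y, lam):
    """Clean cross-entropy plus lam * KL(p_clean || p_adv), averaged over the batch."""
    logp_clean = log_softmax(logits_clean)
    logp_adv = log_softmax(logits_adv)
    ce = -logp_clean[np.arange(len(y)), y].mean()
    # Soft KL: both arguments are full distributions, not hard labels.
    kl = (np.exp(logp_clean) * (logp_clean - logp_adv)).sum(axis=1).mean()
    return ce + lam * kl, ce, kl


# Toy logits for a batch of two inputs (illustrative only).
logits_clean = np.array([[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]])
logits_adv = np.array([[0.5, 1.0, -1.0], [0.5, 1.2, 0.3]])
labels = np.array([0, 1])

total_small, ce, kl = trades_loss(logits_clean, logits_adv, labels, lam=1.0)
total_large, _, _ = trades_loss(logits_clean, logits_adv, labels, lam=6.0)
```

Raising `lam` increases the weight on agreement between clean and adversarial predictions, which is the knob used to move along the robustness-accuracy frontier.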

Chapter Connections: - Definitions: Uses Definition 12 (Gradient-Based Attack), Definition 14 (Robust Optimization Problem). Introduces Definition 13 (TRADES), a specific regularized training method. - Theorems: Relates to Theorem 4 (Dual Form of Robust Optimization); TRADES is an alternative to standard AT for solving the robust objective. - Examples: Extends Example 5 (Adversarial Training) to the TRADES variant; enables finer navigation of Pareto frontier in Example 7 (Comparing Defenses).

C.15 Enhancement — Accept/Reject System: Explanation, Misconceptions, and Connections

Explanation: An accept/reject system is a deployment strategy where a neural network classifier can abstain (refuse to predict) when uncertain. Instead of always predicting a class, the system predicts only when the model’s confidence is above a threshold, and otherwise returns “unknown” (or routes to human review). For robustness, this is valuable: an adversarial perturbation that pushes an input across a decision boundary often leaves it close to that boundary, where the top-class confidence is low, so the system abstains instead of emitting the wrong label. The accept/reject threshold is tuned to balance coverage (fraction of examples the system predicts on) and accuracy (accuracy on predicted examples). For robust systems, the threshold can be calibrated based on the threat model: at low threat levels ($\epsilon$ small), accept most examples; at high threat levels, accept only high-confidence examples. The exercise trains a model, then develops an accept/reject rule based on confidence scores, and measures the coverage vs. accuracy/robustness tradeoff.
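The coverage and accuracy bookkeeping for a confidence threshold can be sketched as follows. The example assumes per-example confidence scores and correctness flags have already been computed on a held-out set; the function name and the toy values are hypothetical.

```python
import numpy as np


def coverage_and_selective_accuracy(confidence, correct, threshold):
    """Coverage = fraction accepted; accuracy is measured on accepted examples only."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    accepted = confidence >= threshold
    coverage = float(accepted.mean())
    accuracy = float(correct[accepted].mean()) if accepted.any() else float("nan")
    return coverage, accuracy


# Toy held-out scores (illustrative only).
confidence = [0.99, 0.95, 0.90, 0.70, 0.60, 0.55]
correct = [1, 1, 1, 0, 1, 0]

cov_strict, acc_strict = coverage_and_selective_accuracy(confidence, correct, 0.8)
cov_loose, acc_loose = coverage_and_selective_accuracy(confidence, correct, 0.5)
```

Sweeping the threshold over a grid and plotting the resulting pairs gives the coverage vs. accuracy curve; repeating the sweep on adversarially perturbed inputs shows how much of that curve survives under attack.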

ML Interpretation: The accept/reject system relates to confidence calibration (Definition 17, if present) and selective classification. From a robustness perspective, it complements certified robustness (Definition 15) but is not itself a certificate: abstaining on uncertain examples raises accuracy on accepted examples only to the extent that confidence tracks correctness, and thresholding alone gives no formal guarantee. This trades coverage for robustness, a practical choice when perfect coverage isn’t necessary. Connecting to Chapter 20 (Implicit Regularization of SGD), if SGD produces well-calibrated models (high confidence on correct predictions, low on incorrect), accept/reject systems naturally emerge from confidence thresholds without extra training.

Failure Modes: (1) Overconfidence on adversarial examples: If the model produces high confidence on adversarial misclassifications, the accept/reject system doesn’t help. Requires a robust model (trained with adversarial training, etc.). (2) Threshold tuning on training data: If the accept/reject threshold is tuned on the training set, it may not transfer to test data (especially under distribution shift or adversarial attack). Tune on a separate validation set. (3) Ignoring cost of abstention: In some applications, abstention is expensive (requires human review, delays, costs). The tradeoff must account for these costs, not just accuracy-coverage.

Common Mistakes: (1) Using maximum logit as confidence: The maximum output logit (before softmax) is unnormalized and not comparable across inputs, so thresholding it can be unreliable; use softmax probabilities, ideally after calibration. (2) Not accounting for adversarial robustness of the confidence threshold itself: Adversarial attacks might lower confidence on correct predictions, causing the system to abstain on inputs it would otherwise classify correctly. The confidence score itself might not be robust. (3) Mixing coverage and accuracy metrics: Coverage is about “does the system attempt prediction?” and accuracy is “is the prediction correct?”. Don’t conflate them in reporting; report curves showing the coverage vs. accuracy tradeoff.

Chapter Connections: - Definitions: Relates to Definition 15 (Certified Robustness), Definition 17 (Confidence Calibration). Accept/reject formalizes selective classification. - Theorems: Connects to Theorem 5 (Provable Robustness via Randomization); accept/reject and randomized smoothing are complementary certification strategies. - Examples: Deployment strategy discussed in Example 7 (Comparing Defenses); practical robustness framework.

C.16 Enhancement — Carlini & Wagner Attack: Explanation, Misconceptions, and Connections

Explanation: The Carlini & Wagner (C&W) attack is a powerful optimization-based adversarial attack that directly solves the optimization problem: $\min_{\delta} \|\delta\|_p$ subject to $f_\theta(\mathbf{x} + \delta)$ predicting the wrong class. This is reformulated as an unconstrained problem via a penalty method: $\min_{\delta} \| \delta \|_p + c \cdot \ell_{\text{attack}}(f_\theta(\mathbf{x} + \delta), \text{target})$, where $\ell_{\text{attack}}$ is an attack loss (the original paper favors a margin loss on the logits over cross-entropy) and $c$ is a weight tuned via binary search to find the smallest $c$ that still allows the attack to succeed. Because the perturbation norm is part of the objective rather than a fixed budget, C&W searches for the smallest successful perturbation instead of being restricted to sign steps (FGSM) or to projection onto a fixed $\epsilon$-ball (PGD), and so it often finds smaller perturbations. The attack is much slower than FGSM/PGD (it requires many optimizer iterations) but is considered strong and useful for evaluating defenses, especially when weak attacks (FGSM) give false confidence.
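The penalty formulation and the binary search over $c$ can be illustrated on a linear classifier, where the gradient is available in closed form. This is a simplified sketch, not the original implementation: it assumes an untargeted $\ell_2$ attack with squared norm, a logit-margin attack loss with confidence parameter `kappa`, plain gradient descent instead of Adam, and no box constraint on the input. All function names and default values are hypothetical.

```python
import numpy as np

def cw_l2_linear(W, b, x, y, c, kappa=0.1, lr=0.05, steps=500):
    """Minimise ||delta||^2 + c * max(Z_y - max_{j != y} Z_j, -kappa) for logits Z = W(x+delta)+b.

    Returns the smallest-norm misclassifying delta seen during descent, or None.
    """
    delta = np.zeros_like(x)
    best = None
    for _ in range(steps):
        z = W @ (x + delta) + b
        # Record the current iterate if it already fools the classifier.
        if int(z.argmax()) != y and (
            best is None or np.linalg.norm(delta) < np.linalg.norm(best)
        ):
            best = delta.copy()
        other = z.copy()
        other[y] = -np.inf
        j = int(other.argmax())               # strongest competing class
        margin = z[y] - z[j]
        grad = 2.0 * delta                    # gradient of the squared-norm term
        if margin > -kappa:                   # attack loss is active
            grad = grad + c * (W[y] - W[j])   # gradient of the logit margin
        delta = delta - lr * grad
    return best

def cw_binary_search(W, b, x, y, c_lo=0.0, c_hi=100.0, rounds=10):
    """Binary-search c: shrink it while the attack succeeds, grow it when it fails."""
    best = None
    for _ in range(rounds):
        c = 0.5 * (c_lo + c_hi)
        delta = cw_l2_linear(W, b, x, y, c)
        if delta is not None:
            if best is None or np.linalg.norm(delta) < np.linalg.norm(best):
                best = delta
            c_hi = c
        else:
            c_lo = c
    return best

# Two-class toy problem: the decision boundary is x[0] = 0, so the minimal
# l2 distance from x = (1, 0) to the boundary is exactly 1.
W = np.array([[1.0, 0.0], [-1.0, 0.0]])
b = np.zeros(2)
x = np.array([1.0, 0.0])
delta = cw_binary_search(W, b, x, y=0)
print(np.linalg.norm(delta))
```

On this toy problem the recovered perturbation has norm slightly above 1, the true distance to the boundary. For a neural network the closed-form gradient would be replaced by automatic differentiation.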

ML Interpretation: C&W attack implements a different formulation of the adversarial robustness problem (Definition 14): rather than iterating gradients with projections (PGD), it uses unconstrained optimization with penalty method (Lagrangian relaxation). This potentially finds better adversarial examples because it explores the full space without projection artifacts. Connecting to Theorem 4 (Dual Form of Robust Optimization Objective), C&W is another approximate solver for the inner maximization problem; different solvers (FGSM, PGD, C&W) provide different quality estimates of the worst-case perturbation. C&W is considered one of the strongest attack methods, useful as a rigorous evaluation baseline (Definition 20: Adaptive Attacks might include C&W).

Failure Modes: (1) Numerical optimization instability: C&W involves many optimizer steps (often 1000+); if not careful, the optimizer diverges or gets stuck. Requires careful initialization and step size tuning. (2) Computational cost prohibitive: C&W is orders of magnitude slower than PGD (might take 1 sec per image vs. 0.01 sec for PGD-20); evaluating large-scale robustness with C&W is impractical without approximations. (3) Binary search finickiness: Finding the right weight $c$ via binary search requires tuning (initial range, step size); poor tuning leads to missing true success or missing smallest perturbations.

Common Mistakes: (1) Not tuning the $c$ parameter properly: Choosing $c$ too large makes the penalty term dominant, wasting optimization effort on exploring large $\delta$; too small misses attack success. Proper binary search is essential. (2) Using too few optimizer iterations: C&W convergence is slow; stopping too early (e.g., 100 iterations) gives suboptimal perturbations. Use sufficient iterations (typically 1000). (3) Mixing threat models: The attack is formulated for specific $\ell_p$ norm; using $\ell_2$ formulation but evaluating on $\ell_\infty$ threat model (or vice versa) gives apples-to-oranges comparisons.

Chapter Connections: - Definitions: Uses Definition 12 (Gradient-Based Attack) via optimization. Implements Definition 14 (Robust Optimization Problem) via direct minimization. - Theorems: Alternative solver for Theorem 4 (Dual Form of Robust Optimization Objective), potentially finding tighter bounds than PGD. - Examples: Extension of attacks (Examples 9–10) to optimization-based methods; used as a rigorous evaluation baseline (Example 7: Comparing Defenses).

C.17 Enhancement — Group Distributionally Robust Optimization (DRO): Explanation, Misconceptions, and Connections

Explanation: Group DRO (Distributionally Robust Optimization) is a fairness and robustness technique that prioritizes worst-performing subgroups. Standard training (empirical risk minimization, typically with SGD) minimizes average loss over all examples; Group DRO instead minimizes worst-case loss over groups. For instance, if the dataset contains subgroups (e.g., males/females, different ethnicities), standard training might achieve high average accuracy by excelling on majority groups and neglecting minorities. Group DRO instead trains to maximize the minimum group accuracy, ensuring no group is left behind. Formally: $\min_\theta \max_g \mathbb{E}_{(\mathbf{x},y) \sim D_g}[\ell(f_\theta(\mathbf{x}), y)]$, where $g$ indexes groups and $D_g$ is the group distribution. During training, Group DRO upweights examples from groups with high loss, dynamically focusing training on underperforming subgroups. This distributes performance more fairly and can also improve worst-case robustness: models that perform well even on hard subgroups are often more robust to adversarial perturbations (adversarial examples often exploit models’ weaknesses on minority groups or edge cases).
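The dynamic upweighting step can be sketched as follows. The text does not fix a particular update rule; this sketch assumes an exponentiated-gradient update on the group weights, one common choice in the Group DRO literature. The step size `eta`, the helper names, and the toy losses are hypothetical.

```python
import numpy as np

def group_losses(per_example_loss, group_ids, num_groups):
    """Mean loss per group; groups absent from the batch keep loss 0."""
    losses = np.zeros(num_groups)
    for g in range(num_groups):
        mask = group_ids == g
        if mask.any():
            losses[g] = per_example_loss[mask].mean()
    return losses

def update_group_weights(q, losses, eta=1.0):
    """Exponentiated-gradient step: upweight high-loss groups, then renormalise."""
    q = q * np.exp(eta * losses)
    return q / q.sum()

def robust_loss(q, losses):
    """Weighted group loss that the model parameters are trained to minimise."""
    return float(q @ losses)

# Toy batch: group 0 is the easy majority, group 1 the hard minority.
per_example_loss = np.array([0.1, 0.2, 0.1, 2.0, 1.8])
group_ids = np.array([0, 0, 0, 1, 1])

q = np.full(2, 0.5)                           # start from uniform group weights
losses = group_losses(per_example_loss, group_ids, num_groups=2)
q = update_group_weights(q, losses)
print(q, robust_loss(q, losses), per_example_loss.mean())
```

After one update the weight on the hard group dominates, so the training objective sits between the average loss and the worst-group loss and moves toward the latter as the weights keep updating.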

ML Interpretation: Group DRO operationalizes fairness-aware training (Definition 24, if present) and adversarial robustness jointly. The worst-case group loss minimization relates to Theorem 10 (Worst-Case Robustness Analysis): if the adversary can choose which group to target, robust training must ensure worst-case group accuracy. Group DRO also connects to implicit model biases: standard SGD naturally underfits minority groups (lower gradients from them, as they’re less common), so reweighting explicitly corrects this bias. The interplay between fairness (group parity) and robustness is subtle but empirically significant: learning robust representations often improves fairness, and fair training often improves adversarial robustness.

Failure Modes: (1) Group identification difficulty: Requiring explicit group labels during training limits applicability (privacy concerns, hard to define groups). Mitigated by weakly-supervised or self-supervised group discovery. (2) Computational cost of subgroup queries: If groups are many or overlapping, tracking loss for each group and dynamically reweighting is expensive. Approximate via batching and sampling. (3) Overfitting to subgroup objectives: By maximizing worst-case group performance, the algorithm might overfit to specific group quirks instead of learning general robustness.

Common Mistakes: (1) Not properly defining groups statistically: Group definitions matter crucially; arbitrary grouping leads to arbitrary robustness properties. Define groups based on domain knowledge or calibrated clustering. (2) Ignoring within-group imbalance: Even after reweighting for group fairness, individual examples within groups might have imbalanced loss (one example much harder than others). Group DRO doesn’t address within-group heterogeneity. (3) Comparing unfairly to baselines: Standard SGD training might perform better on average but worse on worst-case groups. Report both average and worst-case metrics for fair comparison.

Chapter Connections: - Definitions: Relates to Definition 24 (Fairness), Definition 14 (Robust Optimization Problem); Group DRO combines both goals. - Theorems: Connects to Theorem 10 (Worst-Case Robustness Analysis); minimizing worst-group loss directly implements worst-case robustness. - Examples: Empirical study of fairness-robustness connection (Example of fair/robust training, if present).

C.18 Enhancement — Influence Functions: Explanation, Misconceptions, and Connections

Explanation: Influence functions quantify how much a single training example contributes to a model’s output or loss. Formally, if we remove example $z_i$ from a training set of size $n$ and retrain, the change in loss on test example $z_{\text{test}}$ is approximately $-\frac{1}{n} \hat{I}_{\text{up,loss}}(z_i, z_{\text{test}})$, because removing $z_i$ corresponds to upweighting it by $-1/n$. Computing this exactly for every example requires retraining $n$ times, which is expensive. Influence functions provide an efficient approximation using the inverse Hessian: $\hat{I}_{\text{up,loss}}(z_i, z_{\text{test}}) \approx -\nabla_\theta \ell(z_{\text{test}}, \theta^*)^T H^{-1} \nabla_\theta \ell(z_i, \theta^*)$, where $H$ is the Hessian of the average training loss at the learned $\theta^*$ and $\nabla_\theta \ell(z, \theta)$ is the gradient of loss for example $z$. This approximation relies on a first-order expansion of the parameters around the learned $\theta^*$ (and on $H$ being invertible), which is often reasonable post-convergence. For adversarial robustness, influence functions can identify which training examples, if poisoned with adversarial labels, most hurt model robustness (Definition 19, Poisoning Attacks). The exercise computes influence scores for training examples on test robustness, identifying harmful examples.
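The influence formula can be checked against actual leave-one-out retraining on a model small enough to form $H$ explicitly. This is a minimal sketch assuming ridge regression with squared loss and synthetic data; the regularization strength `lam`, the data, and the helper names are all hypothetical. The retraining keeps the $1/n$ scaling so that removal matches upweighting by $-1/n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 0.1
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ theta_true + 0.5 * rng.normal(size=n)
x_test = np.array([1.0, -1.0, 0.5])
y_test = x_test @ theta_true + 1.0   # offset so the test gradient is clearly non-zero

def fit(X, y, n_norm):
    """Minimise (1/n_norm) * sum 0.5*(x.theta - y)^2 + (lam/2)*||theta||^2 in closed form."""
    H = X.T @ X / n_norm + lam * np.eye(X.shape[1])   # Hessian of the objective
    return np.linalg.solve(H, X.T @ y / n_norm), H

def loss_at_test_point(theta):
    return 0.5 * (x_test @ theta - y_test) ** 2

theta_star, H = fit(X, y, n)
g_test = (x_test @ theta_star - y_test) * x_test      # gradient of the test loss
G = (X @ theta_star - y)[:, None] * X                 # per-example training gradients, (n, d)

# I_up,loss(z_i, z_test) = -g_test^T H^{-1} g_i, computed for all i at once.
influence = -G @ np.linalg.solve(H, g_test)
predicted_change = -influence / n                     # predicted effect of removing z_i

# Ground truth: retrain without each example (same 1/n scaling) and compare.
actual_change = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    theta_i, _ = fit(X[keep], y[keep], n)
    actual_change[i] = loss_at_test_point(theta_i) - loss_at_test_point(theta_star)

corr = np.corrcoef(predicted_change, actual_change)[0, 1]
print(corr)
```

For networks where $H$ cannot be formed, the `np.linalg.solve` call would be replaced by an approximate inverse-Hessian-vector product, and the same empirical-removal comparison serves as a sanity check on that approximation.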

ML Interpretation: Influence functions connect model outputs to individual training examples, enabling fine-grained analysis of model behavior. In adversarial robustness contexts, identifying high-influence examples on robustness metrics (e.g., which training examples mostly affect robust accuracy) helps understand and improve training. Influence functions also relate to model debugging: if a test example is misclassified robustly (adversarially misclassified), influence functions identify which training examples contributed most, enabling targeted data cleaning or augmentation. Connecting to implicit regularization (Chapter 20): influence functions quantify how well the training data “explains” learned representations; models with good generalization often have distributed influence (no single example dominates) and can explain predictions via influential examples.

Failure Modes: (1) Hessian computation infeasible: Computing full Hessian for modern neural networks is prohibitive ($d^2$ storage). Approximations (Gauss-Newton, Fisher Information Matrix) or implicit methods (implicit Hessian-vector products) needed. (2) Local linearity assumption breaks down: If the model’s loss surface is highly non-linear (neural networks post-convergence can be), the influence approximation error grows. (3) Influence can be diffuse: For overparameterized models, influence might be distributed across many examples, making identification of specific high-influence examples unreliable.

Common Mistakes: (1) Not accounting for Hessian approximation error: Using a poor Hessian approximation (e.g., diagonal, ignoring off-diagonals) gives unreliable influence scores. Use principled approximations or validate via empirical removal. (2) Confusing influence on loss with influence on robustness: The standard formulation measures influence on the clean loss at a test point; influence on robust loss (e.g., against PGD) is a different quantity and requires separate computation. (3) Not distinguishing positive vs. negative influence: Removing some examples hurts performance, while removing others helps. Both are informative; distinguish them in analysis.

Chapter Connections: - Definitions: Relates to Definition 18 (Influence Functions), Definition 19 (Poisoning Attacks). Influence functions are mechanisms for understanding/detecting poisoning. - Theorems: Connects to generalization bounds via stability (if training data influence is controlled, generalization follows). - Examples: Used for data-centric robustness analysis (identifying and removing harmful examples).

C.19 Enhancement — Multi-Task Learning with Adversarial Robustness: Explanation, Misconceptions, and Connections

Explanation: Multi-task learning (MTL) trains a shared model to solve multiple tasks jointly. For robustness, MTL can improve generalization by providing auxiliary signals: a model learning both clean classification and robustness indicators (e.g., certified radius) simultaneously might learn more robust features. The exercise trains a neural network with two heads: one for main task (e.g., image classification), another for an auxiliary task (e.g., predicting whether the image contains adversarial perturbations, or predicting feature norms as a robustness proxy). The two losses are weighted: $\mathcal{L} = \ell_{\text{main}} + \alpha \ell_{\text{aux}}$. By tuning $\alpha$, we control the tradeoff between main-task performance and auxiliary task. This can improve robustness: the auxiliary task acts like implicit regularization, pushing the model toward more robust representations. The exercise measures: main task accuracy (standard and robust), auxiliary task accuracy, and robustness metrics.

ML Interpretation: Multi-task learning with robustness relates to implicit regularization (Chapter 20) and the robustness-accuracy tradeoff. The auxiliary task provides an inductive bias toward representations that support robustness, similar to how adversarial training provides such a bias. However, MTL is more flexible: auxiliary tasks can be diverse (adversarial detection, feature norm prediction, etc.), and the model can learn shared representations that simultaneously solve multiple objectives. Connecting to neural network implicit biases: models trained on multiple related tasks often have richer internal representations than models trained on single tasks, due to the constraints imposed by multiple objectives. This is related to network compression and knowledge transfer: multi-task learning knowledge is transferable across tasks.

Failure Modes: (1) Auxiliary task being too easy/hard: If the auxiliary task is trivial (easy to solve without learning good features), the regularization is weak (model ignores it). If too hard, training diverges or competes with the main task. (2) Negative transfer: Sometimes auxiliary tasks hurt main-task performance (negative transfer). This happens if the auxiliary task is irrelevant or conflicting. (3) Hyperparameter $\alpha$ sensitivity: The weight $\alpha$ controlling task balance is critical; different values lead to vastly different tradeoffs. Tuning required.

Common Mistakes: (1) Auxiliary task design without domain knowledge: Choosing auxiliary tasks randomly or arbitrarily wastes effort. Choose tasks that provide meaningful regularization for robustness. (2) Not normalizing losses: If the auxiliary and main losses have different scales (e.g., main loss 0.1, auxiliary loss 100), the simple sum $\ell_{\text{main}} + \alpha \ell_{\text{aux}}$ is dominated by one task for any moderate value of $\alpha$. Normalize losses (divide by running averages or constants) first. (3) Ignoring data availability for auxiliary task: Auxiliary tasks require labels (or generated labels); if labels are expensive, auxiliary tasks don’t help. Ensure labels are available.
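The loss-normalization advice in mistake (2) can be sketched as below. This is one possible scheme, assuming each loss is divided by an exponential moving average of its own magnitude; the class name, `momentum`, and `eps` are hypothetical. In an autodiff framework the running average should be treated as a constant (detached from the graph).

```python
class RunningLossNormalizer:
    """Divide a loss by an exponential moving average of its own magnitude."""

    def __init__(self, momentum=0.9, eps=1e-8):
        self.momentum = momentum
        self.eps = eps
        self.running = None

    def __call__(self, loss_value):
        magnitude = abs(float(loss_value))
        if self.running is None:
            self.running = magnitude          # initialise on the first call
        else:
            self.running = (self.momentum * self.running
                            + (1 - self.momentum) * magnitude)
        return loss_value / (self.running + self.eps)

def combined_loss(main_loss, aux_loss, alpha, norm_main, norm_aux):
    """L = normalised main loss + alpha * normalised auxiliary loss."""
    return norm_main(main_loss) + alpha * norm_aux(aux_loss)

norm_main, norm_aux = RunningLossNormalizer(), RunningLossNormalizer()
# Raw scales differ by a factor of 1000, as in the example in the text.
total = combined_loss(0.1, 100.0, alpha=0.5, norm_main=norm_main, norm_aux=norm_aux)
print(total)
```

With both terms rescaled to order one, $\alpha$ again controls the balance between the two tasks instead of being swamped by the raw scale difference.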

Chapter Connections: - Definitions: Relates to Definition 8 (Generalization), Definition 15 (Certified Robustness). MTL can improve both. - Theorems: Connects to implicit regularization theory (Chapter 20); auxiliary tasks provide implicit bias toward robustness. - Examples: Application of implicit regularization to adversarial robustness (Example of implicit bias, if present).

C.20 Enhancement — Out-of-Distribution (OOD) Robustness: Explanation, Misconceptions, and Connections

Explanation: Out-of-distribution (OOD) robustness evaluates how well a model generalizes to data substantially different from the training distribution. For example, training on clean, sharp images from ImageNet and testing on rotated images, blurred images, or synthetic perturbations (simulators, filters) reveals OOD robustness. This is different from adversarial robustness, which evaluates against worst-case perturbations chosen by an adversary. OOD shifts can be: (1) Domain shift (different camera, lighting), (2) Scale shift (larger/smaller objects), (3) Transformation (rotation, blur, noise). Evaluating OOD robustness involves testing against multiple distribution shifts (Definition 21, Robustness under Distribution Shift) and measuring accuracy under each shift. The exercise applies several OOD transformations (rotation, blur, contrast change, etc.) and measures how standard, adversarially trained, and data-augmented models respond. Often, data augmentation helps OOD robustness more than adversarial training does, because adversarial training is tailored to its specific threat model.

ML Interpretation: OOD robustness relates to the implicit generalization of neural networks (Chapter 20) and the concept of learning generalizable features. Data augmentation (rotation, blur, etc.) acts like implicit regularization, forcing the model to learn features robust to such transformations. Adversarial training, while effective against adversarial perturbations under the threat model, doesn’t always transfer to OOD shifts (it is trained on small pixel perturbations, not on large rotations). This reveals OOD robustness and adversarial robustness as distinct (sometimes conflicting) objectives. The exercise empirically tests whether different training strategies have different generalization properties: standard training tends to exploit dataset biases (sensitive to OOD shift), adversarial training is robust under its threat model but not necessarily to OOD shifts, and data augmentation gives robustness to the shifts it covers (and sometimes to related ones) without providing adversarial robustness by itself.

Failure Modes: (1) Mismatch between training augmentations and OOD test shifts: If trained with rotation augmentations but tested on blur shifts, the model won’t generalize to blur. Ensure augmentation training matches test distributions. (2) Oversampling extreme shifts: Testing on very extreme shifts (180° rotation, extreme blur) where humans also fail might not be fair evaluation. Use ecologically valid shifts. (3) Ignoring target domain knowledge: For deployed systems, OOD robustness is context-specific; knowing what shifts are likely helps design appropriate evaluations.

Common Mistakes: (1) Conflating OOD and adversarial robustness: They’re different; a model can be robust to one but not the other. Report both and note trade-offs. (2) Not reporting baseline human performance: Some OOD shifts (extreme blur) hurt humans too; reporting model accuracy without human baseline can mislead. (3) Forgetting to normalize perturbations: When applying shifts, perturbation magnitude must be consistent (e.g., rotation angle, blur radius, contrast factor); reporting “accuracy under shift X” without specifying magnitude is vague.

Chapter Connections: - Definitions: Relates to Definition 8 (Generalization), Definition 15 (Certified Robustness), Definition 21 (Robustness under Distribution Shift). - Theorems: Empirical study of implicit generalization under distribution shift (related to Theorem in Chapter 20, if present). - Examples: Evaluation scenario demonstrating that different training strategies have different robustness profiles (Example 7: Comparing Defenses; Example of generalization, if present).

End of Enhanced C Solutions

Appendices

In Context

Algorithmic Development History

The study of robustness in machine learning emerges from the intersection of classical stability theory, adversarial examples research, and modern optimization. Understanding this historical trajectory illuminates why robustness is challenging and how current methods evolved.

Early Foundations: Stability Theory (Rogers & Wagner, 1978; Bousquet & Elisseeff, 2002). Building on earlier stability analyses such as Rogers and Wagner’s, Bousquet and Elisseeff developed algorithmic stability in the early 2000s as a framework for understanding generalization. They proved that uniformly stable algorithms automatically generalize well (Theorem 2), providing an alternative to complexity-based bounds (VC dimension, Rademacher complexity) that were often vacuous for complex models. Stability also motivated regularization: algorithms that are insensitive to individual training examples (e.g., ridge regression with strong regularization) generalize better than those that overfit to specific examples. This classical view of stability as sensitivity to training data changes laid groundwork for later adversarial robustness, which extends stability to input perturbations.

Adversarial Examples Discovered (Szegedy et al., 2013). A watershed moment came with Szegedy et al.’s 2013 paper, “Intriguing Properties of Neural Networks,” which empirically demonstrated adversarial examples: small, carefully crafted perturbations that fool neural networks despite being imperceptible to humans. This discovery shocked the community; state-of-the-art models (deep CNNs achieving 99% accuracy on MNIST) could be manipulated with perturbations of magnitude $\epsilon \approx 0.1$ on a $[0, 1]$ scale. Szegedy et al. attributed this to the highly nonlinear input-output mappings learned by deep networks, hypothesizing that adversarial examples occupy low-probability “pockets” of input space that are rarely sampled by test data yet lie close to almost every input. This observation, together with later work, motivated viewing adversarial examples not as unusual attacks but as exposures of how neural networks rely on non-robust features.

Gradient-Based Attacks Formalized (Goodfellow et al., 2014 & Carlini & Wagner, 2016). Goodfellow et al.’s 2014 FGSM paper provided the first systematic attack algorithm: a single-step gradient ascent in the direction of increasing loss. FGSM is simple (one line of code) yet highly effective, revealing that linear approximations of loss landscapes explain adversarial vulnerability. This led to a key insight (Theorem 3): the first-order approximation $\ell(f(x + \epsilon \cdot \text{sign}(\nabla_x \ell))) \approx \ell(f(x)) + \epsilon \|\nabla_x \ell\|_1$ captures why gradient-based attacks work. Shortly after, Carlini and Wagner (2016) developed stronger attacks by solving the adversarial optimization problem more carefully, revealing that FGSM is much weaker than the true worst-case perturbation. Their C&W attack spurred development of stronger attack methods and more robust defenses.
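To see where the $\epsilon \|\nabla_x \ell\|_1$ term comes from, the following short derivation (a sketch, writing $g = \nabla_x \ell(f(x))$ and assuming the loss is differentiable at $x$) linearizes the loss and maximizes the linear term over the $\ell_\infty$ ball of radius $\epsilon$:

$$
\begin{aligned}
\ell(f(x + \delta)) &\approx \ell(f(x)) + g^\top \delta, \\
\max_{\|\delta\|_\infty \le \epsilon} g^\top \delta &= \epsilon \sum_i |g_i| = \epsilon \|g\|_1, \qquad \text{attained at } \delta^* = \epsilon \cdot \text{sign}(g).
\end{aligned}
$$

Each coordinate of $\delta$ can be chosen independently within $[-\epsilon, \epsilon]$, so the maximizer takes the extreme value with the sign of $g_i$ in every coordinate; this is exactly the FGSM step. The $\ell_1$ norm appears because it is the dual norm of $\ell_\infty$.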

Adversarial Training Emerges (Goodfellow et al., 2015; Madry et al., 2018). As attacks became stronger, defenses were proposed. Goodfellow et al. (2015) first proposed adversarial training: train on adversarially perturbed examples to improve robustness. However, early adversarial training used weak attacks (FGSM), leading to gradient masking: the model appeared robust to FGSM but remained vulnerable to stronger attacks. Madry et al. (2018) diagnosed this and proposed PGD-based adversarial training: use stronger attacks (PGD with many iterations) during training to prevent gradient masking. This remains the gold standard for empirical robustness and corresponds to solving the robust optimization problem (Theorem 4) approximately.

Distributionally Robust Optimization (Schanzenbach, 2018; Duan et al., 2021). In parallel, the optimization and economics communities developed distributional robustness: minimize expected loss under the worst-case distribution shift within a divergence ball (e.g., Wasserstein distance). This perspective generalizes adversarial robustness (adversarial examples as extreme points of shifted distributions) and connects to domain adaptation and out-of-distribution generalization. Modern algorithms (DRO, group DRO) apply these concepts to mitigate disparities across subpopulations and improve robustness to distribution shifts.

Certified Robustness Methods (Gowal et al., 2018; Cohen et al., 2019; Weng et al., 2019). Recognizing that PGD-based empirical robustness is not immune to stronger attacks, researchers developed certified robustness methods with formal guarantees. Randomized smoothing (Cohen et al., 2019) converts any classifier into a certifiably robust classifier by averaging predictions over noisy input perturbations. Abstract interpretation (Gowal et al., 2018; Weng et al., 2019) uses over-approximations of network outputs to verify robustness. These methods provide formal guarantees but are often conservative, motivating ongoing research into tighter certification and scalable methods for large models.

Modern Developments: Robustness at Scale. Recent work focuses on making robustness practical for large models. Spectral normalization (Miyato et al., 2018) constrains Lipschitz constants efficiently via spectral norm of weight matrices. Sharpness-Aware Minimization (Foret et al., 2020) seeks flat minima with small Hessian eigenvalues, improving both robustness and accuracy. TRADES (Zhang et al., 2019) balances standard and robust risk, improving robustness-accuracy trade-offs compared to pure PGD training. These methods improve practicality but do not overcome the fundamental trade-off (Theorem 8).

Why This Matters for ML

Robustness as a Safety Requirement

Adversarial robustness is not merely an academic curiosity but a fundamental safety requirement for deployed machine learning systems. In critical domains like autonomous driving, medical diagnosis, and content moderation, model failures have high costs. A camera-based object detector in an autonomous vehicle must correctly identify pedestrians, stop signs, and hazards even under adverse conditions (rain, snow, occlusion, lighting). While conventional stress testing addresses natural variations, adversarial perturbations expose a separate vulnerability: deliberate (or naturally arising) perturbations that fool models despite being small.

Consider a practical scenario: a medical imaging model trained to diagnose from X-rays achieves 95% accuracy in controlled clinical settings. However, if an adversary (or natural image corruption like compression artifacts) perturbs images by adding small noise, the model’s accuracy might drop to 60%, missing critical diagnoses. Or consider a spam classifier deployed in email: an adversary can craft adversarial emails to evade detection, spreading malicious content. These are not hypothetical concerns—adversarial robustness has direct security and safety implications.

Robustness also addresses a more subtle failure mode: distribution shift. Real-world data distributions differ from the training distribution. Adversarial robustness, as a form of worst-case analysis (robust optimization), offers partial protection when such shifts occur: a model robust to $\epsilon$-perturbations is protected against shifts that move each input by at most $\epsilon$ under the threat model’s norm, providing some insurance against model degradation in the field. Larger or structurally different shifts are not covered by this guarantee.

Stability–Generalization–Optimization Triangle

This chapter reveals a fundamental triangle connecting three core concepts in machine learning: stability, generalization, and optimization. Classical learning theory (Theorem 2) ties stability to generalization: stable algorithms generalize well. This chapter extends the link to optimization: the robust optimization objective (Theorem 4) is strictly harder than standard optimization, requiring both attack (inner maximization) and defense (outer minimization). These three vertices form a triangle where each vertex constrains the others:

Stability constrains generalization: Stable algorithms (insensitive to training data and input perturbations) have small generalization gaps, but achieving stability may degrade optimization (requiring regularization that slows convergence).

Generalization constrains optimization: To generalize, models must not overfit to training data, which means they must not rely on non-robust features. But optimizing for non-robust features is often easier (less constrained), so demanding robustness (avoiding non-robust features) complicates optimization.

Optimization constrains stability: Robust optimization (finding worst-case perturbations via inner maximization, then updating parameters) is computationally expensive and may get stuck in poor local optima, hurting the quality of learned stable models.

Understanding this triangle explains why achieving simultaneous high accuracy, robustness, and computational efficiency is difficult. Each objective has its costs, and the triangle constrains how well all three can be achieved simultaneously.

Failure Modes if Robustness Is Ignored

Ignoring robustness has both obvious and subtle failure modes:

Obvious failure: The model works well in development but fails catastrophically in deployment when it encounters adversarial or distribution-shifted inputs. A self-driving car that fails to recognize a rotated stop sign, a medical classifier that misdiagnoses images with compression artifacts, or a content moderation system fooled by adversarial text—these are catastrophic failures with real-world consequences.

Subtle failure: The model appears robust during evaluation but is vulnerable to adaptive attacks. A model may pass testing against standard attacks (FGSM, basic PGD) if those attacks are weak or gradients are masked (defenses that obfuscate gradients rather than improving true robustness). Deployed in the wild, a determined adversary using stronger attacks can manipulate the model. This has been demonstrated repeatedly: defenses thought effective were broken by stronger attacks, then stronger defenses were proposed, in an adversarial cycle.

Systemic failure: Ignoring robustness in system design leads to compounded vulnerabilities. If each component (classifiers, detectors, decision systems) is individually non-robust, cascading failures are likely. A small adversarial perturbation at any stage can propagate and amplify through the pipeline, breaking downstream components. Robust systems require end-to-end robustness analysis, not isolated component robustness.

Fairness failure: Non-robust models may be vulnerable to adversarial examples that specifically target underrepresented groups. A model might be accurate overall but vulnerable to perturbations that fool the model specifically on minority groups, exacerbating fairness issues. Building robust models is thus a prerequisite for fair models.

Forward Links to Distribution Shift (Chapter 13)

This chapter’s themes extend directly to Chapter 13’s treatment of distribution shift. Adversarial robustness is a form of domain adaptation: robustness to $\ell_p$ perturbations is robustness to a specific (small-magnitude) shift in the feature space. More generally, distribution shift—when test distribution $\mathcal{D}_{\text{test}}$ differs from training distribution $\mathcal{D}_{\text{train}}$ by $\delta$—is addressed via robust optimization and distributional robustness (Definition 11).

Many techniques introduced here directly apply to distribution shift: spectral normalization reduces Lipschitz constants, stabilizing predictions under distribution shift; adversarial training indirectly trains on shifted examples (the adversarial examples are worst-case shifts); TRADES balances standard and robust risk, analogous to balancing in-domain and out-of-domain performance. The robust optimization framework generalizes from adversarial perturbations (specific threat model) to arbitrary distribution shifts (general divergence ball).

Chapter 13 will explore distribution shift more broadly, including natural (common variations, domain shift) and adversarial shifts. The robustness foundation here—understanding worst-case analysis, robust optimization, stability–generalization connections, and trade-offs—is essential background for designing systems robust to distribution shift.

Motivation

Why Accurate Models Fail Under Small Perturbations

The primary motivating observation is empirical and striking: modern deep neural networks achieve very high accuracy on images, text, and other data, often approaching or exceeding human performance on standard benchmarks. Yet these same models are vulnerable to small, carefully crafted perturbations. A canonical example is image classification: a deep convolutional network might correctly classify a photograph of a dog with $99\%$ confidence, yet if each pixel value is shifted by a small, carefully chosen amount (magnitude $\leq 8/255 \approx 0.03$ on a $[0,1]$ scale), the network might confidently misclassify the image as a cat, a toaster, or any other class. The perturbation is so small that humans typically cannot detect it by eye. This vulnerability is not limited to images; similar phenomena occur in audio (adding imperceptible noise to make a speech recognition system fail), text (modifying a few characters to flip spam detection), and other domains. The classical explanation, that adversarial vulnerability is a consequence of overfitting, is insufficient. Models trained to high accuracy on large datasets with explicit regularization still show adversarial vulnerability. Furthermore, adversarial examples are not random points in the input space; they concentrate near decision boundaries and transfer between different models (an adversarial example that fools model A often fools model B, even if the models are trained differently), suggesting a systematic rather than idiosyncratic phenomenon.

The deeper reason for this vulnerability lies in high-dimensional geometry. In high-dimensional spaces, geometric intuition from the familiar 2D and 3D worlds breaks down. Specifically, (1) volume concentrates near the surface: most of the volume of a high-dimensional ball lies in a thin shell near its boundary rather than near the center. Consequently, if a model’s training data concentrates on a low-dimensional manifold (a very reasonable assumption for natural images, which lie on a much lower-dimensional surface within the $\mathbb{R}^{196608}$ space of $256 \times 256 \times 3$ color images), then most of the input space is not “seen” during training, and the learned decision boundaries are essentially unconstrained away from the data, so they can pass very close to training points in off-manifold directions. (2) The number of available directions is vast: the number of nearly orthogonal directions in $\mathbb{R}^n$ grows exponentially with $n$, so an $\epsilon$-ball around a point offers an adversary a very large set of directions to probe, almost all of them orthogonal to the training data manifold. Even for a linear classifier, many small per-coordinate changes add up: an $\ell_\infty$ perturbation of size $\epsilon$ can shift the score $w^T x$ by $\epsilon \|w\|_1$, which typically grows with $n$. (3) The curse of dimensionality extends to robustness: to ensure that a model’s output is stable across a neighborhood of radius $\epsilon$ in an $n$-dimensional space, the model must be robust to perturbations along $n$ independent dimensions, a growing challenge as $n$ increases. These geometric facts suggest that adversarial examples are a pervasive feature of learning in high dimensions, not merely an artifact of specific models or training procedures.
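
The volume-concentration claim in point (1) follows from a one-line calculation: volume scales as radius to the power $d$, so the inner ball of relative radius $1 - t$ holds a fraction $(1 - t)^d$ of the total. The snippet below is a minimal sketch of that arithmetic; the function name `shell_fraction` and the 1% shell width are choices made here for illustration, not part of the chapter's formal development.

```python
def shell_fraction(d: int, t: float) -> float:
    """Fraction of a d-dimensional ball's volume within relative distance t of its surface."""
    # Volume scales as radius**d, so the inner ball of radius (1 - t) holds (1 - t)**d of it.
    return 1.0 - (1.0 - t) ** d


# Compare low dimensions with the 256 x 256 x 3 image space (d = 196608).
for d in (2, 3, 100, 196608):
    print(f"d={d:>6}  fraction of volume in outer 1% shell = {shell_fraction(d, 0.01):.6f}")
```

In two or three dimensions the outer 1% shell holds only a few percent of the volume; in image-sized spaces it holds essentially all of it.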

Geometry of Adversarial Directions

Formally, an adversarial example is a pair $(x, x_{\text{adv}})$ where $x_{\text{adv}} = x + \delta$ and the perturbation $\delta$ satisfies a norm bound $\|\delta\|_p \leq \epsilon$ (for some choice of $p \in \{0, 1, 2, \infty\}$), such that the model’s prediction changes, $f(x_{\text{adv}}) \neq f(x)$, or the confidence on the true class drops significantly. The adversarial direction $\delta$ is not random; it is structured by the decision boundaries and loss landscape of the model. Several geometric insights emerge. First, the input gradient of the loss $\nabla_x \ell(f(x), y)$ points in the direction of fastest loss increase (toward misclassification). For a linear classifier $f(x) = w^T x + b$ with label $y \in \{-1, +1\}$ and a margin-based loss, this gradient is proportional to $-y\,w$; under an $\ell_\infty$ budget the steepest-ascent step is $-\epsilon\, y \cdot \text{sign}(w)$, which lowers the margin by $\epsilon \|w\|_1$ and can quickly cross the decision boundary. The Fast Gradient Sign Method (FGSM) exploits this: $x_{\text{adv}} = x + \epsilon \cdot \text{sign}(\nabla_x \ell(f(x), y))$. Second, adversarial directions are not uniformly distributed in the $\epsilon$-ball around $x$; they concentrate in directions that lead to the nearest part of the decision boundary. The projected gradient descent (PGD) attack searches for such directions: it iteratively increases the loss by taking gradient steps and projects back onto the $\epsilon$-ball after each step. Third, adversarial directions often point toward the intersection of multiple decision boundaries, regions of ambiguity where the model’s predictions are uncertain. Near such regions, small perturbations can cause large changes in the predicted class. Fourth, adversarial directions tend to align with the features the model uses rather than being random: perturbations often concentrate on specific pixels, features, or dimensions that the model relies on for classification. Visualizations of adversarial perturbations reveal this: they often highlight textured regions, edges, or other meaningful features rather than appearing as uniform noise. Finally, adversarial examples exhibit transfer: a perturbation crafted to fool model A often also fools model B, even though the models are trained on different data or with different architectures. This transfer property suggests that adversarial perturbations exploit features shared across learned models, not idiosyncrasies of specific training runs. The universality of adversarial directions across models indicates that they exploit some fundamental property of how models learn, likely related to the natural (non-robust) features that are most predictive of the class label.
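
The FGSM and PGD updates described above can be sketched on a toy logistic-regression model, where the input gradient has a closed form. Everything in the snippet (the random weights, the dimension `d = 1000`, the budget `eps = 0.1`, and the function names) is a hypothetical setup chosen for illustration. For a linear model the sign of the input gradient never changes, so PGD ends at the same corner of the $\ell_\infty$ ball as FGSM; the iterative attack only becomes strictly stronger for nonlinear models.

```python
import numpy as np


def input_gradient(w, b, x, y):
    """Gradient of the logistic loss log(1 + exp(-y (w.x + b))) with respect to x."""
    margin = y * (w @ x + b)
    return -y * w / (1.0 + np.exp(margin))


def fgsm(w, b, x, y, eps):
    """Single signed-gradient step of size eps (l_inf threat model)."""
    return x + eps * np.sign(input_gradient(w, b, x, y))


def pgd(w, b, x, y, eps, step, n_steps):
    """Iterated signed-gradient ascent with projection onto the l_inf ball around x."""
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = x_adv + step * np.sign(input_gradient(w, b, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection step
    return x_adv


# Hypothetical toy model and input (illustrative values only).
rng = np.random.default_rng(0)
d = 1000
w = rng.normal(size=d) / np.sqrt(d)
b = 0.0
x = rng.normal(size=d)
x -= ((w @ x - 1.0) / (w @ w)) * w  # shift x so that the clean score w.x + b equals 1
y = 1.0
eps = 0.1

x_fgsm = fgsm(w, b, x, y, eps)
x_pgd = pgd(w, b, x, y, eps, step=eps / 4, n_steps=10)

print("clean score:", w @ x + b)
print("FGSM score :", w @ x_fgsm + b)
print("PGD score  :", w @ x_pgd + b)
print("predicted score drop eps * ||w||_1:", eps * np.abs(w).sum())
```

No coordinate moves by more than `eps`, yet the score drops by `eps * ||w||_1`, which is large because the small per-coordinate changes accumulate over all `d` dimensions.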

Stability as a Requirement for Generalization

A key insight from learning theory is that stability (insensitivity to small changes in the training data or inputs) is closely related to generalization. If a slight change to one training example causes a large change in the learned model’s behavior, the model is unstable and will tend to generalize poorly (it is overfitting to the specific example). Robust models, by definition, are stable to input perturbations within a specified range. The connection between stability and generalization is not merely statistical; it is algorithmic. For instance, Theorem 5 from Chapter 11 showed that if an algorithm is $\epsilon$-uniformly stable (every training example has small influence on the learned solution; here $\epsilon$ is the stability parameter, not a perturbation budget), the generalization gap is bounded: $\text{Gap} \leq 2\epsilon + O(1/\sqrt{m})$. Adversarial robustness can be viewed as a more demanding, input-level form of stability: not stability of the learned model to changes in individual training examples, but stability of its predictions to any small perturbation of the input, regardless of whether that perturbation aligns with the training data. The two notions are related but not nested. Input-level stability limits how sharply predictions can vary between nearby points, which tends to support generalization, but it does not by itself guarantee a small generalization gap. In the other direction, a model can generalize well (stability to training data removal) without being adversarially robust (stability to input perturbations in all directions). This asymmetry is important: adversarial robustness is a stronger and more difficult requirement than statistical generalization. Furthermore, some of the mechanisms that improve standard generalization (like benign overfitting, where models fit training data perfectly but still generalize due to implicit bias) do not automatically confer robustness. A model that interpolates training data but relies on high-frequency features (unstable to perturbations) may generalize well in the standard sense but be easily fooled adversarially. Thus, achieving robustness requires explicit design; it cannot be achieved passively through benign overfitting alone.

Worst-Case vs Average-Case Performance

Learning theory traditionally focuses on average-case performance: how well a model generalizes, measured by the expected loss under the data distribution, $\mathbb{E}_{(x,y) \sim D}[\ell(f(x), y)]$. Adversarial robustness introduces a worst-case perspective: how well a model performs when an adversary chooses the input (or distribution) to maximize the loss, subject to perturbation constraints. Formally, robust performance at a point is measured by the worst-case loss within the perturbation ball: $\max_{\|x' - x\| \leq \epsilon} \ell(f(x'), y)$. This worst-case view is more stringent than the average-case view; a model that achieves low average loss might still perform poorly in the worst case. The transition from average-case to worst-case evaluation changes the fundamental problem. In average-case learning, the learner sees training examples from the distribution $D$ and must learn a model that generalizes to new instances drawn from $D$. In worst-case robust learning, the adversary chooses perturbations that are not necessarily drawn from the training distribution but are constrained to lie within a ball of radius $\epsilon$. This is a zero-sum game: the learner (model) wants to minimize the loss, and the adversary wants to maximize it. The minimax formulation captures this: $\min_\theta \mathbb{E}_{(x,y) \sim D}\left[\max_{\|x' - x\| \leq \epsilon} \ell(f_\theta(x'), y)\right]$. Solving this minimax problem requires ensuring that the model is robust even to the worst perturbation an adversary might find. Theoretical results show that robust learning can be inherently harder than standard learning: in some data models, the sample complexity (number of examples needed to achieve a target error) of robust generalization exceeds that of standard generalization by a factor that grows with the input dimension $d$. This means that for non-trivial perturbation budgets, much larger datasets may be required to achieve or certify robustness. In practice, practitioners often accept approximate robustness: models that are robust to the specific attacks tested (empirical robustness) rather than to all possible perturbations (certified robustness), trading off the stringency of the guarantee for computational tractability.
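
The gap between average-case and worst-case performance can be computed exactly for a linear classifier, because the inner maximization over an $\ell_\infty$ ball has a closed form: the worst perturbation lowers the margin $y(w^T x + b)$ by $\epsilon \|w\|_1$. The sketch below uses a hypothetical two-class Gaussian dataset (the class mean `mu`, the dimension, and the sample size are illustrative assumptions) and reports accuracy as the budget grows.

```python
import numpy as np


def robust_margin(w, b, X, y, eps):
    """Worst-case margin over the l_inf ball of radius eps for a linear score w.x + b.

    The minimizing perturbation is delta = -eps * y * sign(w), which lowers the
    margin by eps * ||w||_1.
    """
    return y * (X @ w + b) - eps * np.abs(w).sum()


# Hypothetical toy data: two Gaussian classes with means +mu and -mu.
rng = np.random.default_rng(0)
n, d = 2000, 100
y = rng.choice([-1.0, 1.0], size=n)
mu = 0.3 * np.ones(d)
X = y[:, None] * mu + rng.normal(size=(n, d))
w, b = mu.copy(), 0.0  # direction of the Bayes-optimal classifier for this toy model

accuracies = {}
for eps in (0.0, 0.1, 0.2, 0.3):
    accuracies[eps] = float(np.mean(robust_margin(w, b, X, y, eps) > 0))
    print(f"eps={eps:.1f}  worst-case accuracy={accuracies[eps]:.3f}")
```

With `eps = 0.0` this is ordinary (clean) accuracy. At `eps = 0.3`, which equals the per-coordinate class mean, the adversary can move a typical example onto the decision boundary, so worst-case accuracy falls to roughly chance even though clean accuracy is near perfect.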

Common Misconceptions About Robustness

Several persistent misconceptions about adversarial robustness cloud understanding and lead to ineffective defenses. The first misconception is that adversarial vulnerability is primarily an overfitting problem. If a model memorizes training data, the argument goes, it will be vulnerable to perturbations unseen during training. This lens leads to proposed solutions like reducing model capacity, increasing regularization, or training on more data—strategies that address overfitting but not adversarial robustness. However, empirical studies show that highly regularized models, models trained on massive datasets, and models with restricted capacity still suffer from adversarial vulnerability. The true cause is that adversarial examples exploit the non-robust features that are most predictive of the label, features that any model learning to predict the label will rely on. Reducing reliance on these features requires explicit training procedures (adversarial training) that access the true robust features, not passive approaches like regularization. The second misconception is that adversarial examples are unnatural or pathological, exploiting loopholes in neural networks rather than reflecting genuine limitations of the learned decision boundary. If adversarial examples are unnatural, the argument goes, they are less important than improving performance on natural data. However, adversarial examples are not perceptually random; they often correspond to imperceptible shifts along the natural data manifold or crossing known decision boundaries. Moreover, the threat model (small $\ell_\infty$ or $\ell_2$ perturbations) is motivated by practical concerns: robustness to sensor noise, slight image compression, minor changes in appearance, or subtle adversarial modification. A system robust to these perturbations is inherently more useful than one that requires pixel-perfect inputs. 
The third misconception is that accuracy and robustness are inherently opposed, so improving one must degrade the other. While there is a trade-off (adversarial training on $\epsilon$-perturbations reduces accuracy compared to standard training), the trade-off is neither absolute nor universal. Recent work shows that properly designed architectures, training procedures, and inductive biases can improve both accuracy and robustness simultaneously, though the improvements are often small. Moreover, the trade-off varies with the threat model: robustness to $\ell_\infty$-perturbations (small changes to every coordinate) may differ from robustness to $\ell_0$-perturbations (arbitrary changes to a few coordinates). The fourth misconception is that robustness is only achievable through detection or rejection, rather than learning robust functions. Some approaches attempt to detect adversarial examples (flag inputs that appear adversarial) rather than ensuring the model makes correct predictions on adversarial inputs. Detection is useful as a defense layer, but it is not sufficient: an adversary can craft adversarial examples that evade detection while also causing misclassification. The fifth misconception is that certified robustness is impossible or requires prohibitive computational costs. Techniques like randomized smoothing make certification practical for large models, at the price of many forward passes per input at inference time, and the certified radius is typically smaller than the radius at which known attacks empirically fail. Understanding the spectrum of robustness guarantees (certified, empirical, approximate) is crucial for selecting appropriate techniques for specific applications.

ML Connection

Adversarial Training

Adversarial training is the most widely adopted defense against adversarial examples. The core idea is to train the model on adversarially perturbed examples, effectively teaching the model to be robust to the worst perturbations an adversary might find within the perturbation budget. Formally, the training objective is the minimax problem: \[\min_\theta \mathbb{E}_{(x,y) \sim D} \left[ \max_{\|x' - x\|_p \leq \epsilon} \ell(f_\theta(x'), y) \right] \] where the inner maximization (over the adversary) is solved approximately by finding adversarial perturbations using attacks like PGD, and the outer minimization (over the model) is solved using stochastic gradient descent. In practice, the inner maximization is solved only approximately (a single gradient step or a few PGD iterations), and the perturbation is projected or clipped to stay within the feasible region. The advantages of adversarial training are that it directly addresses the threat model, is relatively simple to implement, and often provides substantial improvements in robustness (empirical robustness to the attack used for training increases significantly). However, adversarial training has several drawbacks: (1) Computational cost: solving the inner maximization requires multiple forward and backward passes per training example, often making training 5-10 times more expensive than standard training. (2) Robustness-accuracy trade-off: models trained with adversarial training typically have higher test error (lower standard accuracy) than models trained on clean data, because the model must balance fitting the true labels and resisting perturbations. (3) Limited generalization across attacks: a model trained against PGD under one threat model may remain vulnerable to attacks outside it (other norms, larger budgets, or different attack objectives such as C&W or ensemble attacks), and models trained only against weak single-step attacks such as FGSM can fail against stronger iterative attacks. Training often has to be adjusted to the attacks of concern. (4) Certification gap: empirical robustness (resistance to tested attacks) is a weaker guarantee than certified robustness; a model may appear robust to PGD but remain vulnerable to other attacks not tested. Despite these limitations, adversarial training is effective in practice and remains a baseline defense in robust ML research.
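
The min-max loop and the robustness-accuracy trade-off can be illustrated with logistic regression, where the inner maximization is solved exactly by one signed step. This is a minimal sketch under stated assumptions, not a recipe for deep networks (which would use several PGD steps for the inner problem). The data model is hypothetical: one strongly predictive feature plus 100 weakly predictive features whose class signal (0.1) is smaller than the budget `eps = 0.2`, so an adversary can erase it.

```python
import numpy as np


def make_data(rng, n, d_weak=100):
    """Hypothetical data: feature 0 is strongly predictive, the rest only weakly."""
    y = rng.choice([-1.0, 1.0], size=n)
    mu = np.concatenate(([2.0], 0.1 * np.ones(d_weak)))
    X = y[:, None] * mu + rng.normal(size=(n, mu.size))
    return X, y


def train(X, y, eps, lr=0.1, n_steps=300):
    """Full-batch gradient descent on the (adversarial) logistic loss; eps=0 is standard training."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_steps):
        # Inner maximization: exact for a linear model under an l_inf budget.
        X_adv = X - eps * y[:, None] * np.sign(w)
        m = y * (X_adv @ w + b)
        s = 1.0 / (1.0 + np.exp(m))  # sigmoid(-m): weight of each example in the gradient
        # Outer minimization: descent step on the loss evaluated at the adversarial points.
        w += lr * (X_adv * (s * y)[:, None]).mean(axis=0)
        b += lr * (s * y).mean()
    return w, b


def robust_accuracy(w, b, X, y, eps):
    """Exact worst-case accuracy of a linear classifier under an l_inf budget eps."""
    return float(np.mean(y * (X @ w + b) - eps * np.abs(w).sum() > 0))


rng = np.random.default_rng(0)
X_tr, y_tr = make_data(rng, 2000)
X_te, y_te = make_data(rng, 2000)
eps = 0.2

w_std, b_std = train(X_tr, y_tr, eps=0.0)
w_adv, b_adv = train(X_tr, y_tr, eps=eps)

for name, w, b in (("standard", w_std, b_std), ("adversarial", w_adv, b_adv)):
    print(f"{name:>11}: clean acc={robust_accuracy(w, b, X_te, y_te, 0.0):.3f}  "
          f"robust acc={robust_accuracy(w, b, X_te, y_te, eps):.3f}  "
          f"weak-feature weight mass={np.abs(w[1:]).sum():.2f}")
```

In this toy setting, standard training spreads weight over the weak features, which helps clean accuracy slightly but gives the adversary a large $\|w\|_1$ to exploit. Adversarial training drives those weights toward zero and relies on the strong feature instead.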

Lipschitz Constraints and Regularization

Another approach to robustness is to constrain the model’s Lipschitz constant, ensuring that the model’s output is not too sensitive to input changes. A function $f$ is $L$-Lipschitz if $\|f(x) - f(x')\| \leq L \|x - x'\|$ for all $x, x'$. If a model is $L$-Lipschitz, then for any perturbation $\|\delta\| \leq \epsilon$, the output change is bounded: $\|f(x + \delta) - f(x)\| \leq L \epsilon$. By controlling $L$, we can ensure robustness to perturbations. For feedforward networks with 1-Lipschitz activations (such as ReLU), the Lipschitz constant with respect to the $\ell_2$ norm is bounded by the product of the spectral norms of the weight matrices: $L \leq \prod_i \sigma_{\max}(W_i)$, where $\sigma_{\max}(W_i)$ is the largest singular value of the $i$-th layer. Spectral normalization of weights (dividing each weight matrix by its spectral norm) forces $\sigma_{\max}(W_i) = 1$, bounding the overall Lipschitz constant by 1. This approach has the advantage of being computationally efficient (a cheap normalization step during training) and provides a degree of certified robustness (a prediction whose logit margin exceeds $\sqrt{2} L \epsilon$ cannot change under any $\ell_2$ perturbation of norm at most $\epsilon$). However, the guarantees are often loose: the product bound is a global, worst-case quantity and may be enormous even for well-behaved models, leading to weak robustness guarantees. Moreover, enforcing low Lipschitz constants via spectral normalization can hurt model expressivity and standard accuracy. A related approach is Lipschitz regularization, adding a term $\lambda \sum_i \sigma_{\max}(W_i)$ to the loss to encourage smaller Lipschitz constants. This is more flexible than hard constraints but introduces another hyperparameter to tune.
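
As a small sketch of the product bound, the snippet below computes $\prod_i \sigma_{\max}(W_i)$ for a randomly initialized ReLU network, applies spectral normalization, and checks the resulting output bound on one perturbation. The layer sizes, the random weights, and the helper names are hypothetical. The last lines compute a margin-based certified $\ell_2$ radius, using the standard fact that the difference of two logits of an $L$-Lipschitz network is $\sqrt{2} L$-Lipschitz.

```python
import numpy as np


def lipschitz_upper_bound(weights):
    """Product of largest singular values: an l2 Lipschitz bound for ReLU networks."""
    return float(np.prod([np.linalg.norm(W, ord=2) for W in weights]))


def spectral_normalize(W):
    """Rescale W so that its largest singular value is 1."""
    return W / np.linalg.norm(W, ord=2)


def forward(weights, x):
    """Bias-free ReLU network; the final layer is linear (logits)."""
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)
    return weights[-1] @ x


# Hypothetical network: 32 -> 64 -> 64 -> 10 with Gaussian weights.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(64, 32)), rng.normal(size=(64, 64)), rng.normal(size=(10, 64))]
L_raw = lipschitz_upper_bound(weights)

normed = [spectral_normalize(W) for W in weights]
L_norm = lipschitz_upper_bound(normed)

x = rng.normal(size=32)
delta = rng.normal(size=32)
delta *= 0.1 / np.linalg.norm(delta)  # perturbation with l2 norm 0.1
change = np.linalg.norm(forward(normed, x + delta) - forward(normed, x))

# Margin-based certificate: the prediction cannot change within this l2 radius.
logits = forward(normed, x)
runner_up, top = np.sort(logits)[-2:]
certified_radius = (top - runner_up) / (np.sqrt(2.0) * L_norm)

print("bound before normalization:", L_raw)
print("bound after normalization :", L_norm)
print("output change for ||delta||_2 = 0.1:", change)
print("certified l2 radius at x:", certified_radius)
```

The unnormalized bound is orders of magnitude larger than any change actually observed, which illustrates why guarantees derived from the global product bound tend to be loose.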

Stability and Generalization Bounds

The relationship between robustness and generalization is formalized through algorithmic stability (from Chapter 11) and adaptations of classical generalization bounds. A learning algorithm is $\epsilon$-robustly stable if the learned model’s predictions are stable not just to changes in training data but also to perturbations of inputs. Formally, for any input $x$ and perturbations $\|\delta\| \leq \epsilon'$, the loss differential is bounded: $|\ell(f_S(x + \delta), y) - \ell(f_S(x), y)| \leq \epsilon$ with high probability over the training set $S$. Under robust stability, the generalization gap (the difference between training and test error, each measured in the worst case over perturbations) is bounded by the stability parameter $\epsilon$ plus a concentration term. Practically, this suggests that algorithms producing stable models (e.g., regularized learning with an appropriate regularization strength) also tend to produce models that are less sensitive to input perturbations. Connections to PAC-Bayes theory also apply: a model with low norm and low training error, measured relative to a prior centered at zero, has good generalization under standard settings. For robust learning, the PAC-Bayes bound takes the schematic form \[\mathbb{E}_{\text{test}}^{\text{robust}} \leq \mathbb{E}_{\text{train}}^{\text{robust}} + \sqrt{\frac{2\text{KL}(Q\|P) + \ln(2m/\delta)}{2m}} + O(\epsilon) \] where $m$ is the number of training examples, $\delta$ here denotes the confidence parameter of the bound (not a perturbation), and the last term accounts for the perturbation radius. Thus, low-norm models are somewhat more robust, a partial explanation for why weight decay and implicit bias toward low-norm solutions provide modest robustness improvements. However, the relationship is not tight: explicit consideration of robustness during training is much more effective than relying on implicit stability from standard generalization mechanisms.

Robust Optimization Formulations

Robust optimization is the mathematical framework for ensuring solutions work well under uncertainty or adversarial conditions. In the context of robust ML, the problem is formulated as: \[\min_\theta \mathbb{E}_{(x,y) \sim D} \left[ \max_{\|\delta\|_p \leq \epsilon} \ell(f_\theta(x + \delta), y) \right] \] where the inner maximization finds the worst perturbation within the threat model, and the outer minimization finds model parameters that minimize the worst-case loss. This is a minimax problem, distinct from standard empirical loss minimization. Solving minimax problems is algorithmically challenging: standard SGD does not directly apply (the gradient of the outer objective depends on the solution of the inner problem), and the non-convexity of neural networks complicates analysis. Practical approaches include: (1) Adversarial training with PGD attacks to approximate the inner maximization, (2) Duality-based approaches that reformulate the inner problem using duality theory, (3) Certified robust optimization that uses abstract interpretation or smoothing to certify solutions without explicitly solving the inner maximization. Each approach trades off computational tractability, the tightness of the robustness guarantee, and the degree of empirical robustness achieved. The robust optimization perspective also reveals fundamental limits: if the inner maximization is NP-hard, then solving it exactly is intractable in the worst case (unless P = NP), and so is exact robust training. Some results show that for certain classes of functions and perturbation models, robust learning or exact robustness verification is NP-hard, suggesting that polynomial-time algorithms must settle for approximate or relaxed guarantees.
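
The dependence of the outer gradient on the inner solution can be made precise. As a sketch, write $\phi(\theta)$ for the inner maximum at a fixed example $(x, y)$ and $\delta^\star(\theta)$ for a maximizer (both symbols are introduced here for illustration), and assume the loss is continuously differentiable in $\theta$ and the maximizer is unique. Danskin’s theorem then gives \[\phi(\theta) = \max_{\|\delta\|_p \leq \epsilon} \ell(f_\theta(x + \delta), y), \qquad \nabla_\theta \phi(\theta) = \nabla_\theta\, \ell(f_\theta(x + \delta^\star(\theta)), y), \] so the gradient of the worst-case loss is the ordinary parameter gradient evaluated at the worst-case input, with no need to differentiate through $\delta^\star$. In practice $\delta^\star$ is replaced by the output of an attack such as PGD, so the step used in adversarial training is only an approximation of this gradient, and the uniqueness and smoothness assumptions need not hold for ReLU networks.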

Robustness at Scale

Scaling adversarial training to large models and datasets presents unique challenges and opportunities. On the positive side, larger models have greater capacity for fitting both the primary task (accurate predictions on clean data) and the secondary task (robustness to perturbations), potentially reducing the robustness-accuracy trade-off. Large-scale adversarial training (e.g., on ImageNet with high-resolution images) has produced models with better certified and empirical robustness than earlier small-scale results suggested. On the negative side, the computational cost of adversarial training (5-10× standard training) becomes prohibitive at scale; training BERT-sized or Vision Transformer-sized models adversarially is expensive. Moreover, transferability of adversarial examples becomes more pronounced: adversarial examples generated from one large model often transfer to other large models, suggesting that very large models converge to similar decision boundaries. This transferability can be exploited for black-box attacks (generating adversarial examples without access to the target model’s gradients) and creates challenges for defense. Recent work on certified robustness at scale uses randomized smoothing, which converts any classifier into a certifiably robust one by taking a majority vote over its predictions on noise-perturbed copies of the input. The certified $\ell_2$ radius is $R = (r/2)(\Phi^{-1}(\alpha) - \Phi^{-1}(1-\alpha)) = r\,\Phi^{-1}(\alpha)$, where $r$ is the standard deviation of the Gaussian noise, $\Phi$ is the CDF of the standard normal, and $\alpha > 1/2$ is a lower bound on the probability that the base classifier returns the top class under that noise. Randomized smoothing enables certified robustness for large models without architectural changes (although the base classifier usually needs to be trained or fine-tuned on noisy inputs to stay accurate under noise), at the cost of computational overhead during inference (aggregating ~100-1000 predictions per input). 
In practice, organizations building adversarially robust systems combine multiple defenses: adversarial training for good empirical robustness, randomized smoothing or certified methods for guarantees, ensemble methods for additional robustness, and detection mechanisms to identify obvious adversarial inputs. The combination of approaches reflects the understanding that no single defense is complete; robustness is achieved through defense-in-depth.
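
The randomized-smoothing certificate can be sketched with Monte Carlo sampling. The snippet below is a simplified illustration, not a reference implementation: it uses a Hoeffding lower confidence bound (simpler but looser than the exact binomial bounds commonly used), and the toy linear base classifier, the sample sizes, and all function names are hypothetical. The parameter `sigma` plays the role of the noise level $r$, and `p_lower` plays the role of $\alpha$.

```python
import numpy as np
from statistics import NormalDist


def sample_labels(base_classifier, x, sigma, n, rng):
    """Labels predicted by the base classifier on n Gaussian-noised copies of x."""
    noise = rng.normal(scale=sigma, size=(n,) + x.shape)
    return np.array([base_classifier(x + eta) for eta in noise])


def certify(base_classifier, x, sigma, n_select, n_estimate, failure_prob, rng):
    """Return (label, certified l2 radius), or (None, 0.0) to abstain."""
    # Stage 1: guess the smoothed classifier's top class from a small sample.
    guess = np.bincount(sample_labels(base_classifier, x, sigma, n_select, rng)).argmax()
    # Stage 2: estimate the probability of that class on fresh noise samples.
    labels = sample_labels(base_classifier, x, sigma, n_estimate, rng)
    p_hat = np.mean(labels == guess)
    # Hoeffding lower confidence bound on the top-class probability (an assumption
    # made here for simplicity; it holds with probability at least 1 - failure_prob).
    p_lower = p_hat - np.sqrt(np.log(1.0 / failure_prob) / (2.0 * n_estimate))
    if p_lower <= 0.5:
        return None, 0.0
    return int(guess), sigma * NormalDist().inv_cdf(float(p_lower))


# Hypothetical base classifier: a linear rule whose boundary is at distance 0.5 from x.
w = np.array([1.0, 0.0])
toy_classifier = lambda z: int(w @ z > 0)
x = np.array([0.5, 0.0])

rng = np.random.default_rng(0)
label, radius = certify(toy_classifier, x, sigma=0.5, n_select=100,
                        n_estimate=1000, failure_prob=0.001, rng=rng)
print("smoothed prediction:", label)
print("certified l2 radius:", radius, "(true distance to boundary: 0.5)")
```

For this linear toy model the exact smoothed radius equals the true distance to the boundary, so the gap between the printed radius and 0.5 reflects the price of the confidence bound. The loop over noisy copies is the inference overhead mentioned above.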