Chapter 13 — Distribution Shift & Continual Learning

Overview

Purpose of the Chapter

This chapter develops the mathematical and operational tools needed when data distributions evolve after deployment. It shows how to identify different shift types, choose correction strategies that match assumptions, and manage continual updates without catastrophic forgetting or misleading offline validation.

Role in Book Arc

This chapter transitions from static generalization assumptions to dynamic deployment reality. After robustness and optimization foundations in prior chapters, we now examine what fails when production data evolves over time and how to keep models reliable under nonstationarity. It is the bridge from model-quality analysis in controlled settings to lifecycle management in live systems.

Core Concept and Supporting Concepts

Main Concept: Distribution shift and continual learning require matching mitigation strategy to the specific part of the data-generating process that changed, while controlling forgetting and evaluation bias in sequential updates.

Supporting Concepts:

Shift taxonomy matters: covariate, label, and concept shift require different interventions.
Detection and correction differ: identifying shift is easier than safely correcting for it.
Importance weighting has limits: heavy tails and support mismatch can destabilize estimates.
Temporal validation is essential: IID splits can hide real deployment risk.
Feedback loops are endogenous: model decisions can change future training data.
Continual learning is a tradeoff: plasticity competes with stability under memory constraints.
Forgetting is geometric: task optima separation predicts interference severity.
Regret is not full reliability: online guarantees do not automatically imply calibration.
Metrics need uncertainty context: drift scores require sample-size-aware interpretation.
Operations complete the theory: monitoring, update policy, and rollback logic determine safety.

Learning Outcomes

By the end of this chapter, you will be able to:

Differentiate covariate, label, and concept shift from observed data patterns.
Apply importance weighting and prior correction under valid assumptions.
Diagnose when weighting fails due to support gaps or extreme variance.
Design temporally faithful validation and monitoring pipelines.
Compute online regret and interpret what it does and does not guarantee.
Evaluate drift with IPM, total variation, and Wasserstein-style diagnostics.
Quantify catastrophic forgetting and identify mitigation priorities.
Compare replay, regularization, and adaptation methods for continual updates.
Anticipate feedback-induced shift in ranking and recommendation systems.
Specify practical update, gating, and rollback criteria for production models.

Scope: What This Chapter Covers

This chapter covers the following conceptual and computational scope.

Shift typology: covariate shift, label shift, concept drift, and temporal drift.
Risk correction tools: importance weighting, posterior adjustment, and adaptation bounds.
Online learning foundations: regret, stability under drift, and streaming updates.
Continual learning mechanics: forgetting metrics, rehearsal, and regularized updates.
Feedback dynamics: endogenous data generation and amplification effects.
Operational governance: deployment-aware validation, monitoring, and retraining policy.

Connections to Other Chapters

This chapter connects directly to the full-book arc through the following progression.

Chapter 11: extends generalization geometry into nonstationary environments.
Chapter 12: complements adversarial robustness with natural distribution shift.
Optimization chapters: reuses stability and convergence tools under sequential updates.
Systems chapters: links mathematical drift notions to production monitoring logic.
Evaluation chapters: motivates time-aware benchmarks and post-deployment auditing.
Governance themes: sets foundations for reliable long-horizon model maintenance.

Questions This Chapter Answers

This chapter answers the following fundamental questions, aligned with proof and implementation exercises.

What type of shift is present? Which assumptions are broken and why it matters.
When is reweighting valid? What conditions make correction trustworthy?
How do we detect harmful drift early? Which metrics are robust in practice?
What does online regret guarantee? What reliability gaps remain?
Why does catastrophic forgetting occur? How can we measure and control it?
How should replay be configured? What stability-plasticity tradeoffs are acceptable?
When do feedback loops amplify bias? How do we break self-reinforcement?
What validation protocol mirrors deployment? How should time splits be chosen?
When should we retrain vs adapt? Which trigger logic is defensible?
How do we keep systems reliable over time? What monitoring and rollback are essential?

Concrete ML Examples

This purpose section grounds the abstract theory in concrete worked examples with consistent stepwise structure.

Covariate Shift Correction with Importance Weighting
1. 1) Concept summary: importance weighting re-expresses test risk under changed feature marginals when $p(y\mid x)$ stays stable.
2. 2) Problem statement: compute weighted risk estimate and decide whether covariate-shift correction changes deployment readiness.
3. 3) Problem setup: We have validation samples drawn from training-like data plus estimated density ratios to target traffic. Unweighted risk underestimates target error when target distribution over-represents hard regions. We compare unweighted and weighted risk to determine if recalibration is needed.
4. 4) Explicit values: three samples with losses $\ell=[0.20,0.50,0.90]$, weights $w=[0.8,1.2,2.0]$.
5. 5) Formula with symbols defined: weighted risk $\hat R_w=\frac{\sum_i w_i\ell_i}{\sum_i w_i}$, where $w_i$ is sample importance ratio.
6. 6) Plug-in step: numerator $=0.8(0.20)+1.2(0.50)+2.0(0.90)=0.16+0.60+1.80=2.56$; denominator $=0.8+1.2+2.0=4.0$.
7. 7) Computed result: $\hat R_w=2.56/4.0=0.64$. Unweighted mean is $(0.20+0.50+0.90)/3=0.533$.
8. 8) Decision / interpretation: target-adjusted risk is materially higher, so direct deployment is riskier than unweighted evaluation suggests.
9. 9) Sensitivity check: clipping max weight to 1.5 gives adjusted risk $(0.16+0.60+1.35)/(0.8+1.2+1.5)=2.11/3.5=0.603$, reducing variance but introducing bias.
Test-Time Adaptation with Entropy Minimization
1. 1) Concept summary: entropy minimization sharpens uncertain predictions on unlabeled target batches to adapt under mild drift.
2. 2) Problem statement: check whether one adaptation step reduces average predictive entropy on incoming stream data.
3. 3) Problem setup: We monitor model confidence before and after updating only normalization-affine parameters. Lower entropy indicates improved certainty under shifted inputs, though overconfidence risks must be tracked. We compare batch-average entropy to a rollback threshold.
4. 4) Explicit values: two binary predictions before update: $[0.6,0.4]$, $[0.55,0.45]$; after update: $[0.7,0.3]$, $[0.65,0.35]$.
5. 5) Formula with symbols defined: entropy $H(p)=-\sum_c p_c\log p_c$, batch mean $\bar H=\frac{1}{B}\sum_{b=1}^{B}H(p^{(b)})$.
6. 6) Plug-in step: before: $H_1\approx0.673$, $H_2\approx0.688$, $\bar H_{pre}\approx0.681$; after: $H'_1\approx0.611$, $H'_2\approx0.647$, $\bar H_{post}\approx0.629$.
7. 7) Computed result: entropy drops by $0.681-0.629=0.052$ (about 7.6%).
8. 8) Decision / interpretation: adaptation is beneficial on this batch and may be kept if downstream calibration remains acceptable.
9. 9) Sensitivity check: if confidence rises but error audits worsen, gate updates or rollback because entropy-only gains can hide drift-induced overconfidence.
Continual Learning with Elastic Weight Consolidation
1. 1) Concept summary: EWC preserves prior-task knowledge by penalizing movement on Fisher-important parameters.
2. 2) Problem statement: compute EWC penalty contribution for a candidate update and determine if forgetting risk is high.
3. 3) Problem setup: After task A, we store reference parameters and Fisher scores. During task B, optimization adds a quadratic penalty that scales with parameter importance. We evaluate the regularization term before accepting an aggressive step.
4. 4) Explicit values: two parameters with $\theta^*=[1.0,-0.5]$, current $\theta=[1.2,-0.3]$, Fisher $F=[4,1]$, regularization $\lambda=10$.
5. 5) Formula with symbols defined: $L_{EWC}=\sum_i \frac{\lambda}{2}F_i(\theta_i-\theta_i^*)^2$.
6. 6) Plug-in step: deviations are $[0.2,0.2]$; weighted squares $4(0.2)^2=0.16$, $1(0.2)^2=0.04$; sum $=0.20$; multiply by $\lambda/2=5$.
7. 7) Computed result: $L_{EWC}=5(0.20)=1.0$.
8. 8) Decision / interpretation: penalty is substantial, signaling this update could damage prior-task performance unless compensated by strong task-B gain.
9. 9) Sensitivity check: halving $\lambda$ to 5 drops penalty to 0.5, increasing plasticity but likely increasing forgetting.
Replay-Based Stability in Long-Horizon Personalization
1. 1) Concept summary: replay buffers mix recent and historical data to stabilize continual updates and reduce forgetting.
2. 2) Problem statement: determine whether current replay mix overemphasizes recency relative to policy target.
3. 3) Problem setup: We target a training mix that preserves long-term preference patterns while adapting to new trends. Current mini-batch composition is compared to desired proportions. The gap indicates whether we risk seasonal forgetting.
4. 4) Explicit values: desired recent:historical ratio $=0.6:0.4$; observed batch has 80 recent and 20 historical samples.
5. 5) Formula with symbols defined: observed proportions $p_r=n_r/(n_r+n_h)$, $p_h=n_h/(n_r+n_h)$; deviation $\Delta_r=p_r-0.6$.
6. 6) Plug-in step: total $=100$, so $p_r=80/100=0.8$, $p_h=0.2$; $\Delta_r=0.8-0.6=0.2$.
7. 7) Computed result: recency is overrepresented by 20 percentage points.
8. 8) Decision / interpretation: rebalance sampling toward historical data to prevent regression on long-tail cohorts.
9. 9) Sensitivity check: moving to 65:35 still favors recency while reducing forgetting risk; monitor per-cohort AUC after change.

Definitions

Covariate Shift

Definition: Covariate shift holds between training distribution $P_{\text{train}}(x, y)$ and test distribution $P_{\text{test}}(x, y)$ when $P_{\text{train}}(y \mid x) = P_{\text{test}}(y \mid x)$ but $P_{\text{train}}(x) \neq P_{\text{test}}(x)$. The support of $P_{\text{test}}(x)$ is assumed to be contained in the support of $P_{\text{train}}(x)$ to avoid infinite importance weights.
Assumptions: The conditional label distribution is invariant, the test feature distribution is absolutely continuous with respect to the train feature distribution, and the model is evaluated on the test distribution without retraining.
Notation: Use $x \in \mathcal{X}$ for inputs, $y \in \mathcal{Y}$ for labels, $P_{\text{train}}$ and $P_{\text{test}}$ for joint distributions, and $w(x) = P_{\text{test}}(x) / P_{\text{train}}(x)$ for importance weights.
Usage: Covariate shift indicates that the decision rule learned from training remains valid, but test examples occur in different regions of input space. Correcting for this shift focuses on reweighting or rebalancing the training data rather than changing the labeling function.
Valid Example: A loan default model trained on applicants from one region is deployed in a new region where income and age distributions differ, but the relationship between features and default remains unchanged.
Failure Case: If the relationship between features and labels changes due to new regulations, then $P(y \mid x)$ changes and covariate shift no longer applies.
Explicit ML Relevance: Importance weighting, domain adaptation, and feature drift monitoring are effective when covariate shift holds; otherwise they may be misleading.

Label Shift

Definition: Label shift holds when $P_{\text{train}}(x \mid y) = P_{\text{test}}(x \mid y)$ but $P_{\text{train}}(y) \neq P_{\text{test}}(y)$. Equivalently, the class conditional distributions are stable while class priors change.
Assumptions: Class conditional feature distributions are invariant, labels are discrete or can be discretized, and class priors shift without altering class conditionals.
Notation: Use $\pi_{\text{train}}(y)$ and $\pi_{\text{test}}(y)$ for priors, and $P(x \mid y)$ for class conditionals.
Usage: Label shift is handled by adjusting predicted probabilities or decision thresholds to reflect new class priors, without retraining feature representations.
Valid Example: A disease screening model trained before a public health campaign faces a test population with higher disease prevalence, while symptom patterns per disease remain stable.
Failure Case: If a new diagnostic device changes symptom measurements for a class, $P(x \mid y)$ changes and label shift fails.
Explicit ML Relevance: Confusion matrix correction and prior adjustment are standard in calibration pipelines and monitoring dashboards.

Concept Shift

Definition: Concept shift, also called conditional shift or concept drift, occurs when $P_{\text{train}}(y \mid x) \neq P_{\text{test}}(y \mid x)$. The mapping from inputs to labels changes over time or across environments.
Assumptions: The input space and label space remain comparable, and the shift is not due solely to sampling noise.
Notation: Use $f^*(x) = \arg\max_y P(y \mid x)$ for the Bayes optimal decision rule and note that $f^*_{\text{train}} \neq f^*_{\text{test}}$ under concept shift.
Usage: Concept shift requires model updates, new labels, or causal modeling; reweighting alone is insufficient because the decision boundary changes.
Valid Example: Fraud patterns evolve as attackers change tactics, altering which feature patterns correspond to fraud.
Failure Case: If only the feature distribution changes but the labeling mechanism remains fixed, the issue is covariate shift rather than concept shift.
Explicit ML Relevance: Continual learning methods, drift detectors on performance, and adaptive retraining schedules target concept shift.

Temporal Drift

Definition: Temporal drift is a time indexed sequence of distributions $\{P_t(x, y)\}_{t \geq 0}$ such that $P_t \neq P_{t+\Delta}$ for some $\Delta > 0$. It may include covariate shift, label shift, and concept shift as special cases over time.
Assumptions: Data arrives in order or with time stamps, and the distribution evolves smoothly or abruptly across time intervals.
Notation: Use $t$ for discrete or continuous time, $P_t$ for the distribution at time $t$, and $\Delta_t$ for a drift magnitude such as $\mathrm{TV}(P_t, P_{t+1})$.
Usage: Temporal drift frames distribution change as a process, enabling rolling evaluation windows and time based validation splits.
Valid Example: Seasonal demand in retail causes predictable shifts in purchasing behavior each quarter.
Failure Case: If time is irrelevant and differences are due to sampling variance, labeling it drift leads to overfitting or unnecessary updates.
Explicit ML Relevance: Time aware validation, windowed retraining, and change point detection are built on temporal drift modeling.

Feedback-Induced Shift

Definition: Feedback-induced shift occurs when a deployed model influences the data generation mechanism, creating a new distribution $P_{\text{test}}$ that depends on the model $f$ or policy $\pi$. Formally, $P_{\text{test}}(x, y) = P(x, y \mid f)$ where the conditional distribution changes as a function of model actions.
Assumptions: The model or policy affects user behavior, data collection, or measurement, and these changes persist into future data.
Notation: Use $P(x, y \mid f)$ to denote model dependent data, and distinguish $P_0$ as the pre deployment baseline.
Usage: This shift is endogenous, not exogenous, and mitigation must consider causal effects and exploration policies.
Valid Example: A recommender system repeatedly promotes certain items, changing the observed click distribution and user preferences.
Failure Case: If the model has no effect on the environment, the shift is exogenous and standard domain adaptation applies.
Explicit ML Relevance: Counterfactual evaluation, randomized exploration, and policy learning are required to avoid self reinforcing bias.

Endogenous Data

Definition: Endogenous data is data whose distribution depends on the model or policy that generated it, typically through selection or feedback. Formally, $P(x, y) = P(x, y \mid f)$ for a deployed model $f$, with $P$ changing when $f$ changes.
Assumptions: The data collection process is affected by model actions, and the model is updated based on the resulting data.
Notation: Use $D_f$ for data collected under model $f$, and $P_f$ for its distribution.
Usage: Endogeneity implies naive retraining can amplify biases because the observed data is not an unbiased sample of the world.
Valid Example: A hiring model screens applicants, so future training data only includes those screened in, biasing the dataset.
Failure Case: If labels are collected independently of the model, the data is exogenous.
Explicit ML Relevance: Debiasing, inverse propensity weighting, and causal inference are needed for reliable updates.

Continual Learning

Definition: Continual learning is the problem of learning a sequence of tasks $\{\mathcal{T}_t\}$ or distributions $\{P_t\}$ with a single model that updates over time while retaining performance on past tasks. The objective is to minimize cumulative loss across tasks while controlling forgetting.
Assumptions: Data arrives sequentially, past data may be inaccessible or limited, and the model must adapt under resource constraints.
Notation: Use $f_t$ for the model after training on task $t$, $\mathcal{L}_t(f)$ for the loss on task $t$, and $F_t$ for a forgetting metric such as $\mathcal{L}_{t-1}(f_t) - \mathcal{L}_{t-1}(f_{t-1})$.
Usage: Continual learning formalizes stability plasticity tradeoffs and motivates methods like replay buffers and regularization.
Valid Example: A translation model that adds new languages without losing performance on previously learned languages.
Failure Case: Training on the full joint dataset after each task is not continual learning; it is offline retraining.
Explicit ML Relevance: Lifelong ML systems, personalization, and streaming analytics use continual learning to stay current without full retraining.

Online Learning

Definition: Online learning is a sequential prediction framework where at each round $t$, the learner predicts $\hat{y}_t$, then observes a loss $\ell_t(\cdot)$ and updates its model. The learner aims to minimize cumulative loss compared to a reference class.
Assumptions: Data arrives sequentially, the learner must update without seeing the full dataset, and performance is measured via regret.
Notation: Use $\ell_t(f)$ for loss at round $t$, $f_t$ for the predictor at round $t$, and $T$ for the horizon.
Usage: Online learning provides theoretical tools for tracking drift and adapting to non stationary environments.
Valid Example: An online ad bidding system updates its model after each auction based on observed outcomes.
Failure Case: If the data is IID and the model can be trained in batch, online learning is not necessary.
Explicit ML Relevance: Streaming recommendation, real time anomaly detection, and adaptive control are online learning settings.

Regret

Definition: Regret at horizon $T$ is $\mathrm{Regret}_T = \sum_{t=1}^T \ell_t(f_t) - \min_{f \in \mathcal{F}} \sum_{t=1}^T \ell_t(f)$, comparing the learner to the best fixed predictor in $\mathcal{F}$.
Assumptions: Losses are revealed sequentially, and the comparator class is fixed over time.
Notation: Use $\mathcal{F}$ for hypothesis class, $f_t$ for the online model, and $\ell_t$ for per round loss.
Usage: Low regret means the online learner performs nearly as well as the best static predictor in hindsight.
Valid Example: In online logistic regression, regret bounds quantify how well the model adapts to user click feedback.
Failure Case: If the comparator should itself be time varying, standard regret may be too weak and dynamic regret is required.
Explicit ML Relevance: Regret bounds justify online updates and provide guarantees under drift.

Stability Under Drift

Definition: A learning algorithm is stable under drift if, for a sequence of distributions $\{P_t\}$ with bounded change, the excess risk at time $t$ is controlled by a function of the drift magnitude. Formally, for some $\phi$, $R_{P_t}(f_t) - R_{P_t}(f_t^*) \leq \phi(\Delta_t)$, where $\Delta_t$ measures distribution change.
Assumptions: There exists a drift metric $\Delta_t$ (such as total variation or Wasserstein distance), and the learner updates at a bounded rate.
Notation: Use $R_{P_t}(f)$ for risk under $P_t$, $f_t^*$ for the Bayes optimal model under $P_t$, and $\Delta_t$ for drift magnitude.
Usage: Stability under drift formalizes the idea that small distribution changes should not cause large performance drops.
Valid Example: A linear regression model updated with sliding windows maintains low error when the regression coefficients change slowly.
Failure Case: Sudden regime changes violate bounded drift and can break stability guarantees.
Explicit ML Relevance: Stability criteria guide retraining frequency and drift monitoring thresholds.

Catastrophic Forgetting

Definition: Catastrophic forgetting is the phenomenon where training on new data causes a significant increase in loss on previously learned tasks. Quantitatively, forgetting for task $i$ after training up to task $t$ is $F_{i,t} = \max_{k \leq t} \mathcal{L}_i(f_k) - \mathcal{L}_i(f_t)$.
Assumptions: Tasks are sequential, training is incremental, and past data is limited or unavailable.
Notation: Use $\mathcal{L}_i$ for loss on task $i$, $f_t$ for model after task $t$, and $F_{i,t}$ for forgetting.
Usage: Catastrophic forgetting captures stability failure in continual learning, where new tasks overwrite old representations.
Valid Example: A model trained on new product images loses accuracy on earlier product categories.
Failure Case: If tasks are jointly trained with full data, forgetting is not catastrophic because tasks are co optimized.
Explicit ML Relevance: Mitigation via rehearsal, regularization, and architectural isolation is central to continual learning systems.

Task-Incremental Learning

Definition: Task incremental learning is a continual learning setting where the task identity $t$ is known at both training and test time, and a single model must handle multiple tasks with the help of the task label.
Assumptions: Task labels are available at inference, and tasks are distinct with possibly different output spaces.
Notation: Use $\mathcal{T}_t$ for task $t$, and $f(x, t)$ for the model that conditions on task identity.
Usage: This setting is easier than fully task agnostic continual learning because the model can route inputs based on task identity.
Valid Example: A multi domain classifier that receives a domain ID at test time to select the proper head.
Failure Case: If task identity is unknown at inference, task incremental assumptions are violated.
Explicit ML Relevance: Many rehearsal and regularization methods are first validated in task incremental settings.

Domain-Incremental Learning

Definition: Domain incremental learning is a continual learning setting where the task identity is not provided at inference, but the label space is shared across tasks. The model must generalize across domains without knowing the domain label.
Assumptions: Output space is fixed, tasks are defined by distribution changes over inputs rather than label changes.
Notation: Use $P_t(x, y)$ for domain $t$ and a shared label space $\mathcal{Y}$.
Usage: This setting models real deployment where domain shifts happen but the task is the same, such as object classification across different cameras.
Valid Example: A digit classifier trained across multiple fonts and imaging conditions without being told the domain.
Failure Case: If label space changes across tasks, the setting is class incremental instead.
Explicit ML Relevance: Domain incremental learning connects to covariate shift and transfer learning in production systems.

Class-Incremental Learning

Definition: Class incremental learning is a continual learning setting where new classes are introduced over time, and the model must classify among all classes seen so far without task identity at inference.
Assumptions: The label space grows over time, and the model must avoid bias toward recent classes.
Notation: Use $\mathcal{Y}_t$ for the cumulative label set after task $t$, with $\mathcal{Y}_t \supset \mathcal{Y}_{t-1}$.
Usage: This is the most challenging continual learning setting because the model must retain old classes while learning new ones with limited data.
Valid Example: A product classifier that adds new product categories monthly while maintaining accuracy on older categories.
Failure Case: If task identity is provided at test time, it becomes task incremental rather than class incremental.
Explicit ML Relevance: Methods like rehearsal buffers and class balanced distillation are critical in class incremental learning.

Distributional Distance (e.g., Total Variation, Wasserstein)

Definition: A distributional distance is a function $d(P, Q)$ that measures how different two probability distributions are. Total variation is $\mathrm{TV}(P, Q) = \sup_{A} |P(A) - Q(A)|$. The Wasserstein-1 distance is $W_1(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \mathbb{E}_{(x, x') \sim \gamma}[\|x - x'\|]$, where $\Gamma(P, Q)$ is the set of couplings.
Assumptions: Distributions are defined on a measurable space, and the distance is finite under the chosen metric.
Notation: Use $P, Q$ for distributions, $\Gamma(P, Q)$ for couplings, and $d(\cdot, \cdot)$ for a generic distance.
Usage: Distributional distances quantify drift magnitude and are used in drift detection and generalization bounds.
Valid Example: Monitoring feature drift using Wasserstein distance between recent data and training data in a credit scoring model.
Failure Case: High dimensional estimates of Wasserstein can be noisy if sample sizes are small, causing false alarms.
Explicit ML Relevance: Drift thresholds and domain adaptation algorithms use distributional distances to trigger updates.

Theorems

Generalization Bound Under Covariate Shift

Formal statement. Let $\ell(f(x), y)$ be a nonnegative loss bounded by $0 \leq \ell \leq L$. Assume covariate shift with $P_{\text{train}}(y \mid x) = P_{\text{test}}(y \mid x)$, and define importance weights $w(x) = P_{\text{test}}(x) / P_{\text{train}}(x)$ with $0 \leq w(x) \leq W$. For any predictor $f$, with probability at least $1 - \delta$ over $m$ IID training samples from $P_{\text{train}}$, \[ R_{\text{test}}(f) \leq \hat{R}_{\text{IW}}(f) + W L \sqrt{\frac{\log(1/\delta)}{2m}}, \] where $R_{\text{test}}(f) = \mathbb{E}_{P_{\text{test}}}[\ell(f(x), y)]$ and $\hat{R}_{\text{IW}}(f) = \frac{1}{m} \sum_{i=1}^m w(x_i) \ell(f(x_i), y_i)$. Proof. Since $P_{\text{train}}(y \mid x) = P_{\text{test}}(y \mid x)$, the test risk can be written as \[ R_{\text{test}}(f) = \mathbb{E}_{P_{\text{test}}(x)} \mathbb{E}_{P(y \mid x)}[\ell(f(x), y)] = \mathbb{E}_{P_{\text{train}}(x)} \left[w(x) \mathbb{E}_{P(y \mid x)}[\ell(f(x), y)]\right]. \] Define the random variables $Z_i = w(x_i) \ell(f(x_i), y_i)$ for IID samples $(x_i, y_i)$ from $P_{\text{train}}$. Then $\mathbb{E}[Z_i] = R_{\text{test}}(f)$. Since $0 \leq \ell \leq L$ and $0 \leq w(x) \leq W$, we have $0 \leq Z_i \leq W L$. By Hoeffding’s inequality, for any $\delta$, with probability at least $1 - \delta$, \[ \mathbb{E}[Z_i] \leq \frac{1}{m} \sum_{i=1}^m Z_i + W L \sqrt{\frac{\log(1/\delta)}{2m}}. \] Substituting back gives the stated bound. Q.E.D. Interpretation. Under covariate shift, the importance weighted empirical risk concentrates around the true test risk, with a penalty that scales with the maximum weight $W$. Larger shift (larger weights) leads to looser bounds. Explicit ML relevance. This bound justifies reweighting for covariate shift in practice and warns that heavy weights can produce high variance and unreliable estimates.

Importance Weighting Correctness Theorem

Formal statement. Assume covariate shift with $P_{\text{train}}(y \mid x) = P_{\text{test}}(y \mid x)$ and $P_{\text{test}}(x)$ absolutely continuous with respect to $P_{\text{train}}(x)$. For any measurable loss $\ell$ and predictor $f$, \[ \mathbb{E}_{P_{\text{test}}}[\ell(f(x), y)] = \mathbb{E}_{P_{\text{train}}}[w(x) \ell(f(x), y)], \] where $w(x) = P_{\text{test}}(x) / P_{\text{train}}(x)$. Proof. By the law of total expectation, \[ \mathbb{E}_{P_{\text{test}}}[\ell(f(x), y)] = \int_{\mathcal{X}} \int_{\mathcal{Y}} \ell(f(x), y) P_{\text{test}}(y \mid x) P_{\text{test}}(x) \, dy \, dx. \] Under covariate shift, $P_{\text{test}}(y \mid x) = P_{\text{train}}(y \mid x)$, so \[ \int_{\mathcal{X}} \int_{\mathcal{Y}} \ell(f(x), y) P_{\text{train}}(y \mid x) P_{\text{test}}(x) \, dy \, dx. \] Multiply and divide by $P_{\text{train}}(x)$, using absolute continuity to avoid division by zero: \[ \int_{\mathcal{X}} \int_{\mathcal{Y}} \ell(f(x), y) P_{\text{train}}(y \mid x) \frac{P_{\text{test}}(x)}{P_{\text{train}}(x)} P_{\text{train}}(x) \, dy \, dx. \] Recognize this as expectation under $P_{\text{train}}(x, y)$: \[ \mathbb{E}_{P_{\text{train}}}[w(x) \ell(f(x), y)]. \] Q.E.D. Interpretation. Importance weighting is an exact correction for covariate shift in expectation, provided the weights are accurate and finite. Explicit ML relevance. Many domain adaptation methods derive from this identity; it justifies reweighted risk minimization and informs weight estimation.

Regret Bound in Online Convex Optimization

Formal statement. Let $\mathcal{F} \subset \mathbb{R}^d$ be a convex set with diameter $D$ in $\ell_2$, and let losses $\ell_t(f)$ be convex and $G$-Lipschitz. Online Gradient Descent with step size $\eta = D / (G \sqrt{T})$ satisfies \[ \mathrm{Regret}_T \leq D G \sqrt{T}. \] Proof. Let $f_t$ be the OGD iterate and $f^* \in \mathcal{F}$ minimize cumulative loss. The standard OGD update is $f_{t+1} = \Pi_{\mathcal{F}}(f_t - \eta g_t)$, where $g_t \in \partial \ell_t(f_t)$. By nonexpansiveness of projection, \[ \|f_{t+1} - f^*\|_2^2 \leq \|f_t - \eta g_t - f^*\|_2^2 = \|f_t - f^*\|_2^2 - 2\eta g_t^T (f_t - f^*) + \eta^2 \|g_t\|_2^2. \] Rearrange to get \[ g_t^T (f_t - f^*) \leq \frac{\|f_t - f^*\|_2^2 - \|f_{t+1} - f^*\|_2^2}{2\eta} + \frac{\eta}{2} \|g_t\|_2^2. \] By convexity, $\ell_t(f_t) - \ell_t(f^*) \leq g_t^T (f_t - f^*)$. Summing over $t = 1$ to $T$ yields \[ \sum_{t=1}^T \ell_t(f_t) - \ell_t(f^*) \leq \frac{\|f_1 - f^*\|_2^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^T \|g_t\|_2^2. \] Since $\|f_1 - f^*\|_2 \leq D$ and $\|g_t\|_2 \leq G$, we have \[ \mathrm{Regret}_T \leq \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2}. \] Choosing $\eta = D / (G \sqrt{T})$ gives \[ \mathrm{Regret}_T \leq D G \sqrt{T}. \] Q.E.D. Interpretation. Regret grows sublinearly, so average loss approaches that of the best fixed predictor. This supports online learning under drift. Explicit ML relevance. Streaming logistic regression and online ranking can be justified by this regret bound.

Stability Under Slowly Varying Distributions

Formal statement. Let $\{P_t\}$ be a sequence of distributions with $\mathrm{TV}(P_t, P_{t+1}) \leq \rho$. Suppose a learner produces predictors $f_t$ with risk $R_{P_t}(f_t)$. If the learning algorithm is $\beta$-stable in the sense that changing one sample changes the loss by at most $\beta$, then for a sliding window estimator with window size $m$, the excess risk satisfies \[ R_{P_t}(f_t) - R_{P_t}(f_t^*) \leq O\left(\beta + L \rho m\right), \] where $L$ bounds the loss and $f_t^*$ is the Bayes optimal predictor for $P_t$. Proof. Let $\hat{P}_t$ be the empirical distribution over the window $\{t-m+1, \dots, t\}$. The risk gap decomposes as \[ R_{P_t}(f_t) - R_{P_t}(f_t^*) = \left(R_{P_t}(f_t) - R_{\hat{P}_t}(f_t)\right) + \left(R_{\hat{P}_t}(f_t) - R_{\hat{P}_t}(f_t^*)\right) + \left(R_{\hat{P}_t}(f_t^*) - R_{P_t}(f_t^*)\right). \] The middle term is nonpositive if $f_t$ minimizes empirical risk on $\hat{P}_t$. The first and third terms are bounded by stability and drift. By algorithmic stability, the deviation between empirical and expected risk is at most $\beta$. For drift, note that for any bounded loss, $|R_{P_t}(f) - R_{\hat{P}_t}(f)| \leq L \mathrm{TV}(P_t, \hat{P}_t)$. The window distribution $\hat{P}_t$ differs from $P_t$ by at most $\rho$ per step, so $\mathrm{TV}(P_t, \hat{P}_t) \leq \rho m$. Thus, \[ R_{P_t}(f_t) - R_{P_t}(f_t^*) \leq \beta + L \rho m + \beta = O(\beta + L \rho m). \] Q.E.D. Interpretation. If drift is slow and the learner is stable, a sliding window learner tracks the optimal predictor with bounded excess risk. Explicit ML relevance. This motivates windowed retraining schedules and stability regularization in non stationary environments.

Drift Detection Bound

Formal statement. Let $P$ and $Q$ be two distributions over $\mathcal{X}$, and let $\hat{P}$, $\hat{Q}$ be empirical distributions from $n$ IID samples each. For a bounded measurable test function class $\mathcal{H}$ with $|h(x)| \leq 1$, define the empirical integral probability metric \[ \widehat{\mathrm{IPM}}(\hat{P}, \hat{Q}) = \sup_{h \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^n h(x_i) - \frac{1}{n} \sum_{j=1}^n h(x'_j) \right|. \] Then with probability at least $1 - \delta$, \[ \left| \widehat{\mathrm{IPM}}(\hat{P}, \hat{Q}) - \mathrm{IPM}(P, Q) \right| \leq 2 \mathcal{R}_n(\mathcal{H}) + \sqrt{\frac{\log(2/\delta)}{2n}}, \] where $\mathcal{R}_n(\mathcal{H})$ is the Rademacher complexity. Proof. Define $S = \{x_i\}_{i=1}^n$ and $S' = \{x'_j\}_{j=1}^n$. Let \[ \phi(S, S') = \sup_{h \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^n h(x_i) - \frac{1}{n} \sum_{j=1}^n h(x'_j) \right|. \] By symmetrization and standard Rademacher complexity bounds, the deviation between empirical and population IPM is bounded by twice the Rademacher complexity plus a concentration term. Specifically, by McDiarmid’s inequality and the bounded differences property (changing one sample changes $\phi$ by at most $2/n$), \[ \phi(S, S') \leq \mathbb{E}[\phi(S, S')] + \sqrt{\frac{\log(2/\delta)}{2n}}. \] Symmetrization yields \[ \mathbb{E}[\phi(S, S')] - \mathrm{IPM}(P, Q) \leq 2 \mathcal{R}_n(\mathcal{H}). \] Combining gives the bound. Q.E.D. Interpretation. Drift detection via IPM is reliable when the function class is not too complex and sample sizes are large. Explicit ML relevance. This supports kernel two sample tests and neural drift detectors by quantifying their finite sample uncertainty.

Catastrophic Forgetting Bound in Sequential Tasks

Formal statement. Consider two tasks with losses $\mathcal{L}_1$ and $\mathcal{L}_2$, each $\mu$-strongly convex and $L$-smooth over $\mathcal{F}$. Let $f_1^*$ and $f_2^*$ be their minimizers. If a model is trained to convergence on task 1 and then to convergence on task 2, then the performance degradation on task 1 satisfies \[ \mathcal{L}_1(f_2^*) - \mathcal{L}_1(f_1^*) \geq \frac{\mu}{2} \|f_2^* - f_1^*\|_2^2. \] Proof. By strong convexity of $\mathcal{L}_1$, for any $f$, \[ \mathcal{L}_1(f) \geq \mathcal{L}_1(f_1^*) + \nabla \mathcal{L}_1(f_1^*)^T (f - f_1^*) + \frac{\mu}{2} \|f - f_1^*\|_2^2. \] Since $f_1^*$ minimizes $\mathcal{L}_1$, $\nabla \mathcal{L}_1(f_1^*) = 0$, so \[ \mathcal{L}_1(f) - \mathcal{L}_1(f_1^*) \geq \frac{\mu}{2} \|f - f_1^*\|_2^2. \] Substitute $f = f_2^*$ to obtain \[ \mathcal{L}_1(f_2^*) - \mathcal{L}_1(f_1^*) \geq \frac{\mu}{2} \|f_2^* - f_1^*\|_2^2. \] Q.E.D. Interpretation. If the optimal solutions for two tasks are far apart, training on the second task necessarily degrades performance on the first, unless explicit mechanisms preserve it. Explicit ML relevance. This bound formalizes catastrophic forgetting and motivates regularization toward the previous optimum or rehearsal data.

Domain Adaptation Risk Decomposition

Formal statement. Let $R_S(f)$ and $R_T(f)$ be risks on source and target domains, and let $\mathcal{H}$ be a hypothesis class. For any $f \in \mathcal{H}$, \[ R_T(f) \leq R_S(f) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(S, T) + \lambda, \] where $d_{\mathcal{H}\Delta\mathcal{H}}(S, T)$ is the $\mathcal{H}\Delta\mathcal{H}$ divergence between domains and $\lambda = \min_{f \in \mathcal{H}} (R_S(f) + R_T(f))$. Proof. For any $f$, write \[ R_T(f) = R_S(f) + (R_T(f) - R_S(f)). \] For any $f, g \in \mathcal{H}$, the disagreement probability between domains satisfies \[ |R_T(f, g) - R_S(f, g)| \leq \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(S, T), \] where $R_D(f, g)$ denotes disagreement under domain $D$. Let $f^*$ be the joint minimizer achieving $\lambda$. Then \[ R_T(f) \leq R_T(f^*) + R_T(f, f^*). \] Add and subtract $R_S(f, f^*)$: \[ R_T(f) \leq R_T(f^*) + R_S(f, f^*) + |R_T(f, f^*) - R_S(f, f^*)|. \] Use the divergence bound: \[ R_T(f) \leq R_T(f^*) + R_S(f, f^*) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(S, T). \] Finally, note that $R_S(f, f^*) \leq R_S(f) + R_S(f^*)$ and $R_T(f^*) + R_S(f^*) = \lambda$, giving \[ R_T(f) \leq R_S(f) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(S, T) + \lambda. \] Q.E.D. Interpretation. Target risk depends on source risk, a divergence between domains, and the best joint error. If source and target require different optimal predictors, adaptation is limited. Explicit ML relevance. This theorem underlies many domain adaptation methods and clarifies when domain alignment can succeed.

Feedback Loop Amplification Bound

Formal statement. Let $P_0$ be the initial data distribution and $P_t$ the distribution after $t$ rounds of deployment. Suppose the update rule induces a drift operator $\mathcal{T}$ such that $P_{t+1} = \mathcal{T}(P_t)$ and the operator is $\alpha$-expansive in total variation: $\mathrm{TV}(\mathcal{T}(P), \mathcal{T}(Q)) \geq \alpha \, \mathrm{TV}(P, Q)$ for $\alpha > 1$. Then for any perturbation $\Delta_0$, the deviation grows at least geometrically: \[ \mathrm{TV}(P_t, P_0) \geq \alpha^t \mathrm{TV}(P_0, P_0 + \Delta_0). \] Proof. Let $Q_0 = P_0 + \Delta_0$ be a perturbed distribution. Define $Q_{t+1} = \mathcal{T}(Q_t)$. Then by the expansiveness assumption, \[ \mathrm{TV}(P_{t+1}, Q_{t+1}) = \mathrm{TV}(\mathcal{T}(P_t), \mathcal{T}(Q_t)) \geq \alpha \, \mathrm{TV}(P_t, Q_t). \] By induction, $\mathrm{TV}(P_t, Q_t) \geq \alpha^t \mathrm{TV}(P_0, Q_0)$. Since $Q_0 = P_0 + \Delta_0$, this yields \[ \mathrm{TV}(P_t, Q_t) \geq \alpha^t \mathrm{TV}(P_0, P_0 + \Delta_0). \] Choosing $Q_t$ to represent a perturbed initialization gives the stated lower bound on amplification. Q.E.D. Interpretation. If the deployment policy amplifies differences, small biases or perturbations can explode over time, creating severe distribution shift. Explicit ML relevance. This formalizes why feedback loops in recommender systems and policing can lead to runaway bias unless controlled.

Worked Examples

Example 1 — Covariate Shift in Linear Regression

Consider a linear regression model $f(x) = \theta^T x$ trained on a dataset where $x$ represents features like income, age, and credit history, and $y$ is a continuous outcome such as default probability. The setup assumes $P_{\text{train}}(y \mid x)$ matches $P_{\text{test}}(y \mid x)$, but $P_{\text{train}}(x)$ differs from $P_{\text{test}}(x)$ because the deployed system targets a new applicant population. The reasoning uses importance weights $w(x) = P_{\text{test}}(x) / P_{\text{train}}(x)$ to reweight the training loss so that the empirical risk better matches the test risk. The interpretation is that the linear relationship remains valid, but the training data underrepresents regions of the feature space now common in deployment. A common misconception is that merely adding more training data from the old distribution will fix the problem; in fact, without coverage of the new regions, the model still extrapolates. A what if scenario is when the new population includes feature combinations never seen in training; then weights become unbounded and the reweighted estimator becomes unstable, signaling the need for targeted data collection or model redesign. The explicit ML relevance is high: credit scoring, demand forecasting, and sensor calibration often encounter covariate shift, and importance weighting can improve calibration and fairness when the conditional relationship is stable.

Example 2 — Importance Weighting Correction

Suppose a classifier predicts ad clicks with loss $\ell$ bounded by $L$, and we estimate test performance using weighted training data. The setup defines $\hat{R}_{\text{IW}} = \frac{1}{m} \sum_i w(x_i) \ell(f(x_i), y_i)$, with weights estimated from a density ratio model. The reasoning is that under covariate shift, the weighted empirical risk is an unbiased estimator of test risk, but variance depends on weight magnitude. The interpretation is that correctness in expectation does not guarantee accuracy in finite samples, especially if weights are heavy tailed. A common misconception is that any weight estimator is adequate; poor ratio estimation can introduce bias or extreme variance. A what if scenario is when the estimated weights are clipped to reduce variance, which trades bias for stability; this is often beneficial in practice and can be analyzed with bias variance decomposition. The explicit ML relevance appears in domain adaptation pipelines and offline evaluation of recommender systems, where reweighted evaluation is the only feasible proxy for real test performance.

Example 3 — Label Shift and Posterior Adjustment

Imagine a medical screening model that outputs $P(y=1 \mid x)$ for disease presence, trained when prevalence is 2 percent but deployed during an outbreak at 10 percent. The setup assumes $P(x \mid y)$ is stable while the class prior changes, so posterior probabilities must be adjusted. The reasoning applies Bayes rule: $P_{\text{test}}(y \mid x) \propto P_{\text{train}}(y \mid x) \cdot \frac{\pi_{\text{test}}(y)}{\pi_{\text{train}}(y)}$, with normalization across classes. The interpretation is that the model’s likelihood features remain valid, but the baseline rate shifts, so decision thresholds and calibrated probabilities must move. A common misconception is that recalibrating on a small labeled set is always enough; if labels are delayed or biased, prior estimation can be inaccurate. A what if scenario is when the class conditional distribution changes because symptoms differ in a new variant; then label shift corrections fail and the model must be retrained or extended. The explicit ML relevance spans health, spam detection, and anomaly detection where priors can vary sharply and calibration depends on rapid prior estimation.

Example 4 — Concept Drift in Time Series

Consider demand forecasting for a product whose popularity changes due to a competitor’s launch. The setup models a time series with features such as price, promotions, and seasonality, but the mapping from features to demand changes as customers substitute products. The reasoning treats this as concept drift: $P_t(y \mid x)$ changes over time, so a model trained on earlier data becomes systematically biased. The interpretation is that the data distribution may appear similar in marginal feature statistics, yet the conditional response has shifted, and thus traditional drift metrics on $P(x)$ may not detect the problem. A common misconception is that more recent data always improves performance; in reality, aggressive updates can reduce generalization to slightly older regimes still present in the market. A what if scenario is the presence of a new policy that makes the old relationship obsolete, requiring model re specification or new features to capture the change. The explicit ML relevance is critical for forecasting, monitoring, and supply chain optimization where concept drift directly impacts operational costs.

Example 5 — Regret in Online Learning

An online news ranking system updates its model after each user interaction, aiming to minimize cumulative click loss. The setup uses online convex optimization with a convex loss $\ell_t$ at each round, and the model updates via gradient steps. The reasoning relies on regret bounds that compare performance to the best fixed model in hindsight, ensuring $\mathrm{Regret}_T = O(\sqrt{T})$ under Lipschitz losses. The interpretation is that although the model may not be optimal at any single time step, it does not fall far behind the best static policy, even under mild drift. A common misconception is that low regret implies the model is optimal for a drifting environment; in fact, regret is measured against a static comparator, so dynamic regret can still be high. A what if scenario is when the environment changes rapidly, in which case adaptive step sizes or meta learning may be needed to track the moving target. The explicit ML relevance is that online ranking, ad bidding, and adaptive control rely on regret guarantees to justify continuous updates.

Example 6 — Continual Learning With Sequential Tasks

Suppose a vision model learns to classify industrial defects across multiple factories, each with different imaging conditions. The setup presents tasks sequentially: each factory defines a task with its own distribution, and the model cannot store all data. The reasoning emphasizes the stability plasticity tradeoff: the model must adapt to each new factory while retaining prior knowledge. The interpretation is that continual learning is not just about accuracy on the newest task, but about maintaining a low forgetting metric across tasks. A common misconception is that regular retraining on the latest data with a small replay buffer is always enough; in practice, buffer diversity and task similarity matter more than buffer size alone. A what if scenario is when tasks are highly dissimilar; then parameter isolation or multi head architectures may be necessary. The explicit ML relevance is high for multi site deployment in manufacturing, medical imaging, and robotics where data access is restricted and updates must be incremental.

Example 7 — Catastrophic Forgetting in Neural Networks

Consider a neural network trained on handwritten digits, then fine tuned on letters without retaining digit data. The setup reflects sequential learning with no access to earlier data, a common constraint in edge devices. The reasoning explains forgetting as gradient updates that overwrite parameters important for earlier tasks, especially when the network uses shared representations. The interpretation is that forgetting is structural: optimizing for the new task moves the parameters toward a new optimum that is far from the old one. A common misconception is that freezing early layers always preserves old knowledge; in practice, drift in later layers can still break the old mapping or cause calibration errors. A what if scenario is when a small replay buffer is introduced; forgetting diminishes but does not vanish unless the buffer is representative, and the tradeoff between memory and performance becomes explicit. The explicit ML relevance is central to continual learning in mobile personalization, incremental product catalogs, and real time translation systems.

Example 8 — Domain Adaptation Risk Decomposition

Consider a sentiment classifier trained on movie reviews and adapted to product reviews. The setup uses a source domain $S$ and target domain $T$, with a hypothesis class $\mathcal{H}$ and the $\mathcal{H}\Delta\mathcal{H}$ divergence to measure domain discrepancy. The reasoning decomposes target risk into source risk, domain divergence, and the joint optimal error $\lambda$. The interpretation is that even if source accuracy is high, large divergence or a large $\lambda$ limits adaptation because the tasks are intrinsically different. A common misconception is that feature alignment alone guarantees performance; if $\lambda$ is large because labels differ in meaning, no amount of alignment can fix it. A what if scenario is when adding a small labeled target set reduces $\lambda$ by enabling a better joint predictor; in that case, semi supervised adaptation becomes effective. The explicit ML relevance includes cross domain NLP, transfer learning, and multi domain speech recognition.

Example 9 — Wasserstein Distance and Distribution Shift

Suppose a bank monitors feature drift by computing Wasserstein distance between last month’s applicant distribution and the training distribution. The setup treats the applicant feature space as a metric space and uses $W_1(P, Q)$ to quantify how much mass must move to match distributions. The reasoning highlights that Wasserstein captures geometry, so small shifts in many correlated features can produce a large distance even if marginal histograms look similar. The interpretation is that Wasserstein is sensitive to shape changes and can detect shifts invisible to simpler metrics like mean or variance. A common misconception is that high Wasserstein automatically implies performance degradation; the model may be invariant to the shifted features. A what if scenario is that the distance spikes due to a temporary data pipeline bug; without proper validation, the system could trigger unnecessary retraining. The explicit ML relevance is strong for drift detection, data quality monitoring, and robust validation of pipeline changes.

Example 10 — Feedback Loop Amplification in Recommender Systems

Imagine a streaming platform that recommends music using a model trained on historical clicks. The setup deploys the model, which influences exposure, and the new data reflects model choices. The reasoning models the distribution evolution as a feedback operator that can amplify small biases: if a genre is slightly over recommended, it gains more clicks, which further increases its recommendation probability. The interpretation is that observed preferences become a function of the model’s policy, not just of user taste, making naive retraining self reinforcing. A common misconception is that higher engagement always reflects better recommendations; it may instead reflect a narrowed exposure set. A what if scenario is adding exploration: randomizing some recommendations introduces counterfactual data that can reduce amplification and improve long term diversity. The explicit ML relevance is central to recommender systems, ad targeting, and personalized feeds where feedback loops can cause bias and reduce discovery.

Example 11 — Stability Under Gradual Drift

Consider a sensor based predictive maintenance model retrained weekly using a sliding window of the most recent data. The setup assumes that the distribution drifts gradually, with bounded $\mathrm{TV}(P_t, P_{t+1})$, and the model uses a stable learning algorithm. The reasoning shows that if drift is slow and the model is stable, the excess risk remains bounded by a term proportional to the window size and drift rate. The interpretation is that there is a tradeoff: larger windows reduce variance but increase lag, while smaller windows track change but increase variance. A common misconception is that increasing the window always improves performance; in non stationary environments, it can freeze the model and cause stale predictions. A what if scenario is a sudden regime change, such as a new sensor calibration, which violates the gradual drift assumption and requires a reset rather than continued windowing. The explicit ML relevance includes monitoring and retraining schedules for industrial IoT, predictive maintenance, and energy forecasting.

Example 12 — Continual Training in Large Language Models

Consider a large language model updated monthly with new data to track emerging topics, slang, and product knowledge. The setup treats each month as a new task distribution with partial overlap, and data privacy may prevent storing the full historical corpus. The reasoning highlights that continual training can improve current relevance but risks forgetting older knowledge or changing behavior in ways that break downstream integrations. The interpretation is that forgetting in LLMs often appears as subtle shifts in style, factuality, or tool usage rather than sharp accuracy drops, so evaluation must be multifaceted. A common misconception is that more pretraining steps always help; in practice, later data can bias the model toward recent domains, causing regressions on long tail topics. A what if scenario is using retrieval augmented generation to offload factual updates to a retriever, reducing the need for frequent weight updates and mitigating forgetting. The explicit ML relevance is strong for deployed LLM systems, where continual updates, safety constraints, and stable behavior are all required for reliable operations.

Summary

Key Ideas Consolidated

Distribution shift is not a single phenomenon but a family of changes that affect different parts of the data generating process. Covariate shift changes $P(x)$, label shift changes $P(y)$, and concept shift changes $P(y \mid x)$, and each demands different diagnostics and mitigation. Continual learning is the operational response that enables systems to update under drift while controlling catastrophic forgetting. The chapter emphasizes that the stability of learning algorithms, the geometry of high dimensional drift, and the feedback effects of deployed models must all be considered together. Risk bounds and regret analyses provide formal guardrails, but they rely on assumptions that must be validated in practice.

What the Reader Should Now Be Able To Do

The reader should be able to identify which form of distribution shift is present from empirical patterns, and to choose mitigation strategies that match the shift mechanism. This includes using importance weighting or prior adjustment for covariate and label shifts, and opting for retraining or causal modeling under concept drift. The reader should be able to design evaluation protocols that reflect deployment reality, including temporal validation and drift detection thresholds. The reader should also be able to reason about continual learning tradeoffs, select appropriate replay or regularization strategies, and interpret regret and stability bounds in real systems.

Active Assumptions for Later Chapters

Later chapters assume that the reader can distinguish between exogenous drift and feedback induced shift, and can evaluate whether monitoring signals are reliable. They also assume familiarity with basic drift metrics like total variation and Wasserstein distance, and with the limits of importance weighting under heavy tails. The subsequent governance and alignment chapters build on the idea that data is not static and that model updates change the data distribution, so the assumptions of IID and stationarity cannot be taken for granted.

End-of-Chapter Advanced Exercises

A. True / False (20)

A.1 In covariate shift with $P_{\text{train}}(y \mid x) = P_{\text{test}}(y \mid x)$, minimizing unweighted empirical risk is consistent for $R_{\text{test}}(f)$ if and only if $P_{\text{train}}(x) = P_{\text{test}}(x)$. A.2 Importance weighting yields an unbiased estimator of test risk under covariate shift even when the support of $P_{\text{test}}(x)$ is strictly larger than the support of $P_{\text{train}}(x)$. A.3 Under label shift, posterior adjustment via Bayes rule is valid without retraining provided that class conditional distributions are invariant and the class priors are known. A.4 In concept drift, any algorithm that minimizes weighted empirical risk with correct density ratios is sufficient to recover the Bayes optimal predictor on the test distribution. A.5 For online convex optimization with convex Lipschitz losses, sublinear static regret implies sublinear dynamic regret under arbitrary distribution drift. A.6 A drift detector based on an IPM with a sufficiently rich function class can distinguish covariate shift from concept shift without label access. A.7 Under bounded total variation drift $\mathrm{TV}(P_t, P_{t+1}) \leq \rho$, a sliding window learner with window size $m$ can achieve excess risk that scales as $O(\rho m)$ when the loss is bounded. A.8 A model exhibiting low regret in an online learning setting is guaranteed to be well calibrated on the current data distribution. A.9 If $P_{\text{train}}(x \mid y) = P_{\text{test}}(x \mid y)$, then reweighting predictions by estimated label priors is sufficient to correct any covariate shift. A.10 In class incremental learning, access to task identity at inference eliminates catastrophic forgetting. A.11 For domain adaptation, a small $\mathcal{H}\Delta\mathcal{H}$ divergence and low source risk are sufficient for low target risk regardless of the joint optimal error $\lambda$. A.12 If a feedback loop operator is $\alpha$-expansive in total variation with $\alpha > 1$, then small deployment biases can amplify geometrically over time. A.13 Under covariate shift with bounded importance weights, the variance of the importance weighted estimator is always lower than the unweighted estimator. A.14 In continual learning, stability regularization alone can prevent catastrophic forgetting even when tasks are highly dissimilar and no replay data is used. A.15 Wasserstein distance between feature distributions provides an upper bound on the target risk for any fixed predictor. A.16 If an online learner achieves $O(\sqrt{T})$ regret, then its average loss converges to the minimum achievable loss under the drifting distribution sequence $\{P_t\}$. A.17 In label shift, the confusion matrix estimated on training data can be used to adjust predictions on test data without any labeled test samples. A.18 A model can exhibit perfect accuracy on a held out IID test set and still fail under temporal drift when deployed. A.19 In covariate shift, importance weighting remains correct if labels are generated by a deterministic function of inputs. A.20 When concept drift is purely due to changes in the decision boundary orientation but not its margin, retraining is unnecessary if the model has sufficient capacity.

A. True / False (20)

A.1 Under covariate shift with bounded density ratio $w(x)$, the importance weighted empirical risk is a consistent estimator of test risk, but its finite sample variance can still diverge if $w(x)$ has heavy tails.

A.2 Label shift correction via prior adjustment is valid only if the class conditional distributions are invariant and the classifier is calibrated on the training distribution.

A.3 In concept drift, reweighting by $P_{\text{test}}(x) / P_{\text{train}}(x)$ can recover the Bayes optimal predictor even when $P(y \mid x)$ changes.

A.4 For online convex optimization, a sublinear static regret bound does not imply sublinear dynamic regret when the comparator is allowed to change over time.

A.5 Under bounded total variation drift, the excess risk of a sliding window ERM can scale linearly in both window size and drift rate when loss is bounded.

A.6 A drift detector based only on feature marginals can detect covariate shift but cannot, in general, distinguish it from label shift without labels.

A.7 If $\mathcal{H}\Delta\mathcal{H}$ divergence between domains is zero, then any hypothesis has identical risk on source and target domains.

A.8 In class incremental learning, a model can achieve high accuracy on new classes while exhibiting severe bias against older classes even if overall accuracy appears stable.

A.9 A feedback loop that is contractive in total variation ensures that distribution shift diminishes over time even if the model is updated aggressively.

A.10 A low Wasserstein distance between training and test feature distributions guarantees low target risk for any fixed classifier.

A.11 Under label shift, confusion matrix adjustment on predictions is consistent only if the confusion matrix is invertible and estimated without bias.

A.12 In online learning, achieving $O(\sqrt{T})$ regret guarantees the per round loss converges to the optimal loss under a stationary distribution but not necessarily under a drifting one.

A.13 Importance weighting can increase the bias of risk estimates if weights are truncated or clipped.

A.14 For covariate shift with overlapping supports, minimizing unweighted empirical risk can still be asymptotically optimal if the learned model class is invariant under the feature shift.

A.15 Under gradual drift, increasing the training window always decreases prediction error because it reduces estimator variance.

A.16 In continual learning, stability regularization alone cannot prevent forgetting when task optima are widely separated in parameter space.

A.17 For concept drift, delayed label feedback can cause a monitoring system to underestimate the true rate of performance decay.

A.18 In domain adaptation, a small joint optimal error $\lambda$ is necessary but not sufficient for low target risk.

A.19 Dynamic regret bounds can be made sublinear if the path length of the comparator sequence is sublinear in $T$.

A.20 Under covariate shift, density ratio estimation using a misspecified model can produce biased risk estimates even with infinite data.

B. Proof Problems (20)

B.1 Prove that under covariate shift with bounded density ratio $w(x)$, the importance weighted empirical risk is an unbiased estimator of test risk, and derive an explicit variance bound in terms of $\mathbb{E}_{P_{\text{train}}}[w(x)^2]$ and the loss bound.

B.2 Prove a lower bound showing that if $P_{\text{test}}(x)$ has support outside $P_{\text{train}}(x)$, then no estimator based solely on training data can guarantee finite worst case test risk for all bounded losses.

B.3 Prove that label shift correction via prior adjustment is consistent when class conditional distributions are invariant and the confusion matrix is invertible, and provide conditions under which the correction fails.

B.4 Prove that in concept drift, any estimator that only reweights by $P_{\text{test}}(x) / P_{\text{train}}(x)$ can be arbitrarily far from the Bayes optimal predictor on the test distribution.

B.5 Prove the $O(\sqrt{T})$ regret bound for online gradient descent with convex $G$-Lipschitz losses and a convex feasible set of diameter $D$.

B.6 Prove a dynamic regret bound in terms of the path length of the comparator sequence for online mirror descent under convex losses.

B.7 Prove that a sliding window ERM learner has excess risk bounded by $O(\rho m)$ under total variation drift $\mathrm{TV}(P_t, P_{t+1}) \leq \rho$, assuming bounded loss.

B.8 Prove that for any hypothesis class $\mathcal{H}$, the $\mathcal{H}\Delta\mathcal{H}$ divergence is upper bounded by twice the total variation distance between source and target feature distributions.

B.9 Prove the domain adaptation risk decomposition bound $R_T(f) \leq R_S(f) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(S, T) + \lambda$, making explicit all measure theoretic assumptions.

B.10 Prove that if $\mathcal{H}\Delta\mathcal{H}$ divergence is zero but $\lambda$ is large, then there exists a hypothesis with low source risk but high target risk.

B.11 Prove that a drift detector based on an IPM with function class $\mathcal{H}$ satisfies a concentration bound in terms of the Rademacher complexity of $\mathcal{H}$.

B.12 Prove that for any two distributions $P$ and $Q$, the Wasserstein distance $W_1(P, Q)$ is lower bounded by the difference in expectations of any 1 Lipschitz test function.

B.13 Prove a sufficient condition under which catastrophic forgetting is unavoidable when the minima of two tasks are separated by a fixed distance in parameter space and no replay data is used.

B.14 Prove that elastic weight consolidation yields a quadratic upper bound on the increase in loss for the previous task under a local Gaussian approximation.

B.15 Prove that in class incremental learning with a fixed memory budget, the expected error on old classes cannot remain constant as the number of classes grows, under mild separability assumptions.

B.16 Prove that in feedback induced shift with an expansive operator $\mathcal{T}$, small perturbations in the initial distribution grow at least geometrically in total variation.

B.17 Prove that any online learner with bounded regret against a static comparator can still incur linear regret against a rapidly drifting comparator sequence.

B.18 Prove that in label shift, the maximum likelihood estimator of class priors obtained from unlabeled data is consistent when the class conditional distributions are known.

B.19 Prove that a covariate shift correction that clips density ratios introduces bias bounded by the clipping threshold and the tail mass of the true ratio distribution.

B.20 Prove a lower bound on adaptation error in terms of the joint optimal error $\lambda$ for any algorithm that uses only source labeled data and unlabeled target data. ### C. Python Exercises (20)

C.1 — Synthetic Covariate Shift Benchmark for Linear Regression

Task: Build a covariate shift benchmark by generating synthetic training and test datasets where features shift but the conditional model $P(y|x)$ remains fixed. Implement: (1) Training distribution $P_{\text{train}}(x) = \mathcal{N}(\mu_0, \Sigma_0)$, (2) Test distribution $P_{\text{test}}(x) = \mathcal{N}(\mu_0 + \delta, \Sigma_0)$ with tunable shift magnitude $\delta \in [0, 5]\sigma_0$, linear regression model $y = w^T x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 0.01)$. Fit regressor on training data ($n_{\text{train}} = 500$), compute two risk estimates on test data ($n_{\text{test}} = 200$): (a) unweighted empirical risk $\hat{R}_{\text{unweighted}} = \frac{1}{m}\sum_i (y_i - \hat{w}^T x_i)^2$ (biased under shift), (b) importance weighted risk $\hat{R}_{\text{IW}} = \frac{1}{m}\sum_i w(x_i) (y_i - \hat{w}^T x_i)^2$ with density ratio $w(x) = P_{\text{test}}(x) / P_{\text{train}}(x)$ (unbiased but higher variance). For this synthetic setting, density ratio is analytically computable: $w(x) = \frac{\exp(-\frac{1}{2}(x-\mu_0-\delta)^T\Sigma_0^{-1}(x-\mu_0-\delta))}{\exp(-\frac{1}{2}(x-\mu_0)^T\Sigma_0^{-1}(x-\mu_0))}$. Sweep shift parameter $\delta$ across grid $[0, 0.5, 1, 2, 5]$, record: (a) unweighted risk, (b) IW risk, (c) weight statistics (max, mean, effective sample size $\text{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$), (d) empirical variance of risk estimator (via bootstrap or repeated trials). Test with $d \in \{5, 20\}$ dimensions.

Purpose: Covariate shift is the most common distribution shift in practice: input distribution changes but the relationship between inputs and labels is preserved. Understanding density ratio estimation and importance weighting is foundational for off-policy evaluation, domain adaptation, and federated learning. Three core concepts: (1) Importance weighting principle: reweighting training samples by $w(x) = P_{\text{test}}(x)/P_{\text{train}}(x)$ converts unbiased estimation from test distribution into weighted estimation from training distribution, enabling learning from historical data under shift. (2) Bias-variance tradeoff: unweighted estimator is biased (underestimates test risk when test examples are atypical under training distribution) but low variance; IW estimator is unbiased but suffers high variance when weights are concentrated (heavy tails in $w(x)$, small effective sample size). (3) Weight concentration: as shift magnitude grows, density ratio becomes increasingly concentrated on a small subset of training samples, causing ESS to collapse and IW estimates to become unreliable. Governance applications: credit risk models trained on historical lending decisions must adapt to new applicant pools with different feature distributions; medical diagnostics trained on one hospital’s population must reweight when used at a different hospital with different patient demographics; fraud detection trained on one time period must account for seasonal feature shifts in deployment.

ML Link: Implements fundamental concept of density ratio and importance weighting, related to covariate shift definition in Chapter 13. Connects to Example 1 (Covariate Shift in Linear Regression) which covers theoretical foundations. Relates to Example 2 (Importance Weighting Correction) for practical implementation. The key insight is that under covariate shift (fixed $P(y|x)$), importance weighting recovers test risk unbiasedly: $\mathbb{E}_{P_{\text{test}}}[R(w)] = \mathbb{E}_{P_{\text{train}}}[w(x) \ell(w, x, y)]$. Weight concentration problem is connected to finite-sample regret analysis in Example 5 (Regret in Online Learning) where unstable weights cause algorithm instability.

Hints: For Gaussian shift, density ratio has closed form: $w(x) = \exp(\frac{1}{2}(\delta^T\Sigma_0^{-1}(2x - \mu_0 - \delta)))$. Implement by: (1) computing training empirical mean/cov, (2) generating test data with shifted mean, (3) evaluating density ratio analytically, (4) clipping weights to avoid extreme values (optional for variance reduction), (5) comparing unweighted vs IW risk across shift magnitudes. Compute ESS as $\text{ESS} = (\sum w_i)^2 / \sum_i w_i^2$, reporting as percentage of original sample size. Use numpy for linear regression (numpy.linalg.lstsq) and scipy.stats for Gaussian density evaluation. Plot: (a) unweighted vs IW risk vs shift magnitude (show unweighted bias and IW variance tradeoff), (b) weight statistics (max weight, mean weight, ESS) vs shift, (c) distribution of weights at different shift levels (histogram showing concentration).

What mastery looks like: 1. Correctly generate covariate shift benchmark with analytically computable density ratios; verify unweighted estimator is biased (increasingly so with larger $\delta$) and IW is unbiased by comparing to oracle risk computed on large test set. 2. Demonstrate bias-variance tradeoff: plot showing unweighted risk error shrinks with shift (bias decreasing) while IW risk error initially increases (variance increasing) before stabilizing. 3. Quantify weight concentration: report ESS as function of shift magnitude $\delta$, showing ESS drops to <10% of sample size at large shifts, explaining why IW becomes unreliable. 4. Identify breakdown regime: identify critical shift $\delta^*$ where IW variance exceeds unweighted bias (optimal tradeoff point), or where ESS becomes <5% and IW is no longer practical. 5. Dimension scaling analysis: run experiments with $d \in \{5, 20, 50\}$; show how weight concentration degrades with dimension (curse of dimensionality in density ratio estimation). 6. Compare weight clipping variants: implement soft clipping (weights clamped to $[w_\min, w_\max]$) and show how clipping level affects bias-variance tradeoff; report clipping parameters that balance them. 7. Bootstrap variance estimation: estimate confidence intervals for risk estimates via bootstrap resampling; show IW CIs are wider than unweighted, especially under large shift. 8. Advanced: Implement doubly robust estimator $\hat{R}_{\text{DR}} = \hat{R}_{\text{IW}} + \text{bias correction term}$ combining IW with regression outcome model to reduce variance; compare to pure IW.

C.2 — Label Shift Simulation with Posterior Adjustment

Task: Implement a label shift benchmark where class conditional distributions $P(x|y)$ are fixed but class priors $\pi(y) = P(y)$ shift between training and test. Build: (1) Training data with balanced classes ($\pi_{\text{train}}(y=0) = \pi_{\text{train}}(y=1) = 0.5$), (2) Test data with drifted priors ($\pi_{\text{test}}(y) \in \{[0.3, 0.7], [0.1, 0.9], [0.05, 0.95]\}$, demonstrating mild/moderate/severe shifts), (3) Binary or multi-class classifier trained on training data (logistic regression or Random Forest, $n_{\text{train}} = 2000$). Implement posterior adjustment: given estimated confusion matrix $C \in \mathbb{R}^{K \times K}$ on validation set ($n_{\text{val}} = 500$) and unlabeled target data ($n_{\text{test}} = 1000$), estimate target priors via confusion matrix approach: solve $\hat{\pi}_{\text{test}} = (C^T)^{-1} \hat{\pi}_{\text{train}, obs}$ where $\hat{\pi}_{\text{train}, obs}$ is observed label distribution on unlabeled data under trained classifier. Evaluate: (a) calibration (Expected Calibration Error with 10 bins), (b) decision threshold performance with oracle and estimated priors, (c) sensitivity to confusion matrix estimation error. Report metrics across prior shifts.

Purpose: Label shift is prevalent in deployment: disease prevalence changes in medical screening, spam prevalence fluctuates in email filtering, fraud rates vary across customer segments. Understanding the mechanics of prior estimation and calibration-dependent correction is critical for reliability. Three core concepts: (1) Label shift assumption: if $P(x|y)$ is invariant across train/test, then $P_{\text{test}}(x) = \sum_y \pi_{\text{test}}(y) P(x|y)$ depends only on prior shift, enabling prior-only adaptation without model retraining. (2) Posterior adjustment mathematics: from Bayes rule $P_{\text{test}}(y|x) \propto P(x|y) \pi_{\text{test}}(y) / \pi_{\text{train}}(y) \cdot P_{\text{train}}(y|x)$, so adjusted posterior is scaling original posterior by prior ratio $\pi_{\text{test}}(y) / \pi_{\text{train}}(y)$; requires calibrated base classifier and accurate prior estimates. (3) Calibration-adjustment coupling: naive adjustment can amplify miscalibration; heavily miscalibrated classifiers cannot be corrected by prior adjustment alone; thus calibration (via temperature scaling, Platt scaling, isotonic regression) is prerequisite for robust correction. Governance applications: medical AI deployed across hospitals must adjust for different patient mixes; loan approval models adapt to applicant pool changes; recommendation systems update for changing user preference prevalences; content moderation adjusts to changing platform demographics.

ML Link: Implements label shift correction concept from Chapter 13. Directly related to Example 2 (Importance Weighting Correction) for the conceptual reweighting, and Example 3 (Label Shift and Posterior Adjustment) which covers the mathematics. Connects to calibration theory: prior adjustment requires $P_{\text{train}}(y|x)$ estimated by classifier to be calibrated (well-matched to true conditional probabilities), which differs from accuracy. If classifier assigns $P(y=1|x) = 0.7$ to a set of examples, true label frequency in that set should be ~0.7 (calibration), regardless of accuracy.

Hints: Estimate priors from unlabeled test data: if classifier assigns confidence scores, compute $\hat{\pi}_{\text{test}}(y) = \frac{1}{n} \sum_i P_{\text{train}}(y|x_i)$ (soft estimate) or use majority voting (hard estimate). Estimate confusion matrix on validation set: $C_{ij} = P(\hat{y}=j | y=i)$ computed from cross-validation or held-out set. Implement prior adjustment: (a) soft scores: multiply posterior by prior ratio $P_{\text{test}}(y|x) \propto P_{\text{train}}(y|x) \cdot \pi_{\text{test}}(y) / \pi_{\text{train}}(y)$. (b) hard predictions: adjust decision threshold based on prior shift. For multi-class, use confusion matrix inversion (use pinv if singular). Calibration methods: use scikit-learn’s CalibratedClassifierCV with method=‘sigmoid’ or ‘isotonic’. Plot: (a) ECE vs prior shift (showing calibration importance), (b) confusion matrices before/after adjustment, (c) adjusted posteriors’ distribution, (d) accuracy/F1/ROC-AUC under oracle vs estimated priors.

What mastery looks like: 1. Correctly implement label shift simulator with multiple prior shifts; verify that naive classifier (unaware of prior change) miscalibrates (e.g., assigns class 1 with 0.5 frequency when true frequency is 0.9). 2. Accurate prior estimation from unlabeled data: estimate $\hat{\pi}_{\text{test}}$ via confusion matrix approach, compare to true prior, report estimation error within 5% on mild shifts, <15% on severe shifts. 3. Demonstrate prior adjustment effectiveness: show ECE improves from ~0.20 (naive) to ~0.05 (adjusted) on well-calibrated base classifier; if base classifier is miscalibrated, adjustment fails gracefully (residual error remains). 4. Calibration prerequisite analysis: compare unadjusted vs best-calibrated base classifier; show calibration is essential—adjusting miscalibrated posteriors produces bad estimates even with correct priors. 5. Sensitivity analysis: introduce confusion matrix estimation error by using smaller validation set (n_val in [100, 500, 1000]); measure how prior estimation error propagates to adjusted posterior error and downstream metrics (F1, ROC-AUC). 6. Threshold adjustment visualization: plot precision-recall curve and optimal threshold location under different priors; show threshold must shift from 0.5 when priors imbalance. 7. Multi-class extension: repeat experiment with 3+ classes and drifting priors; report per-class calibration and multi-class F1 score (macro and weighted). 8. Advanced: Compare confusion matrix approach with EM-based prior estimation (harder approach: iteratively estimate priors and classifier parameters); assess which method is more robust to assumption violations.

C.3 — Concept Drift in Time Series Regression

Task: Simulate concept drift (changing $P(y|x)$) in time series prediction with both abrupt and gradual scenarios. Implement: (1) Abrupt drift: piecewise linear target function $f_t(x) = w_0^T x$ for $t < t_*$, $f_t(x) = w_1^T x$ for $t \geq t_*$ with change point $t_* = N/2$ (sudden regime change). (2) Gradual drift: slowly rotating coefficient vector $w_t = w_0 \cos(\phi_t) + w_1 \sin(\phi_t)$ with $\phi_t = 2\pi t / T$ (smooth transition over full horizon). Generate synthetic time series with $y_t = f_t(x_t) + \epsilon_t$, $\epsilon_t \sim \mathcal{N}(0, 0.01)$, $T = 1000$ time steps, $d = 10$ features. Implement four predictors: (a) Static: train once on first $N/2$ samples, predict on entire horizon (baseline for slow adaptation), (b) Sliding window: retrain every $w=50$ steps on the most recent $w$ samples (reactive to drift), (c) Adaptive with drift detection: train initially, monitor drift via test-based detector (e.g., compare recent vs reference window error), retrain only when drift detected (tradeoff between stability and responsiveness), (d) Online: update parameters incrementally at each step (can have high variance, susceptible to noise). Evaluate: (a) mean squared error over time $\text{MSE}_t = (y_t - \hat{y}_t)^2$, (b) cumulative error $\sum_t \text{MSE}_t$, (c) lag (steps until error recovers after drift), (d) parameter drift $\|w_t - w_t^*\|_2$. Compare learning curves and performance frontiers.

Purpose: Concept drift is the fundamental challenge of continual learning: the target changes over time, making static models obsolete. Understanding tradeoffs between plasticity (adapting to new patterns) and stability (retaining old knowledge) is crucial for real-world systems that must remain performant as tasks evolve. Three core concepts: (1) Performance degradation under drift: without adaptation, model exhibits persistent bias in new regime; magnitude depends on drift magnitude and allowable lag (delay until update). (2) Adaptation lag and recovery time: retraining introduces lag—time until new model is deployed and begins improving—and recovery time—further lag until performance stabilizes; these vary with drift speed and update frequency. (3) Stability-plasticity dilemma: frequent retraining (high plasticity) reduces lag but can cause overfitting to noisy data or forgetting old valid patterns; infrequent retraining (high stability) removes noise but introduces larger lag and potential obsolescence. Governance applications: demand forecasting during seasonality shifts must balance following new seasonal patterns (plasticity) vs. preserving long-term trend knowledge (stability); anomaly detection in manufacturing must update to new failure modes while not flagging historical normal variations as anomalies; recommender systems must track user preference evolution without losing historical user profiles.

ML Link: Implements concept drift scenario from Chapter 13. Relates to Example 4 (Concept Drift in Time Series) which covers theory. Connects to Example 11 (Stability Under Gradual Drift) for adaptive strategy comparison. The key

theoretical result is the regret analysis under drift (Example 5, Regret in Online Learning): static model’s regret is $\Omega(T \cdot L^2)$ where $L$ is drift magnitude, while adaptive online learning achieves $O(\sqrt{T} P_T)$ where $P_T$ is “path length” of optimal comparator (related to drift speed).

Hints: For abrupt drift, use piecewise linear function: $w_0^T x$ with $w_0 = [1, 0.5, -0.2, ...]$, then $w_1^T x$ with $w_1 = [0.1, 2.0, 0.3, ...]$ (different slopes in different directions). For gradual drift, parameterize rotation: $w_t = w_0 \cdot \cos(2\pi t/T) + w_1 \cdot \sin(2\pi t/T)$, controlling rotation speed via period $T$. Implement sliding window by refitting on each window (use online/stochastic gradient descent with small learning rates for efficiency). Drift detection: compare error on recent window ([t-50, t]) vs reference window ([0, 50]), use statistical test (e.g., t-test on mean squared errors), trigger retrain if test statistic exceeds threshold. Measure lag as time between true drift and when algorithm detects/adapts (can be estimated by comparing predicted vs actual; sharp rise in error indicates drift). Visualize: (a) true $w_t$ vs estimated $\hat{w}_t$ over time (showing parameter drift), (b) $\text{MSE}_t$ trajectory for each method (showing lag and recovery), (c) cumulative regret curves, (d) heatmap of performance across (drift magnitude, retraining frequency) grid.

What mastery looks like: 1. Correctly implement both abrupt and gradual drift scenarios; verify by plotting true target function vs estimated and confirming visual difference post-drift. 2. Performance degradation in static model: show MSE increases sharply after drift, remains elevated (demonstrating persistent bias), stays $>5\times$ baseline error if drift magnitude is large. 3. Lag and recovery quantification: measure time-to-detection (steps until algorithm detects drift), lag until recovery (steps until error returns to pre-drift level); report for all methods, showing sliding window has zero lag but reactive, static has maximum lag. 4. Stability-plasticity tradeoff visualization: plot cumulative error for different retraining frequencies ([always, every 10 steps, every 50 steps, never]); identify frequency balancing optimally under this drift speed. 5. Parameter trajectory analysis: plot $\|w_t - w_t^*\|_2$ over time, showing smooth adaptation (gradual drift case) vs discontinuous jump (abrupt drift case); explain why adaptive methods track $w_t^*$. 6. Drift detection evaluation: if adaptive-with-detection method implemented, report detection false positive rate (on pre-drift data where no retraining needed) and true positive rate (on post-drift data). Threshold that minimizes both using ROC/PR curves. 7. Generalization across drift speeds: repeat experiment with multiple gradual drift periods ([T/4, T/2, 2T] rotations); show how optimal retraining frequency scales with drift speed. 8. Advanced: Implement online passive-aggressive or gradient-based update rule (update only when error large enough) that adapts without explicit drift detection; compare regret vs explicit detection-based retrain.

C.4 — Drift Detection Using Integral Probability Metrics

Task: Implement drift detection by monitoring integral probability metrics (IPMs) between recent and reference data windows. Setup: Generate sequential data with no drift for first $t_1 = 500$ steps, then introduce drift for $t_2 = 500$ steps (total $T=1000$), allowing measurement of false positive rate (FPR, errors during no-drift phase) and true positive rate (TPR, detection rate during drift phase). At each time $t$, compute drift statistic comparing reference window $W_{\text{ref}} = \{x_{t-200:t-100}\}$ vs recent window $W_{\text{recent}} = \{x_{t-50:t}\}$ using: (1) Linear IPM: maximum mean discrepancy (MMD) with linear kernel, $\text{MMD}_L^2 = \|\mu_{\text{ref}} - \mu_{\text{recent}}\|_2^2$ (fast, interpretable, sensitive to mean shift), (2) RBF Kernel IPM: MMD with Gaussian RBF kernel (more sensitive to distributional shape, slower), (3) Wasserstein distance: optimal transport-based metric (theoretically grounded, computationally expensive). Set decision threshold $\tau$ on validation period (first $t_1/2 = 250$ steps), tune to achieve target FPR ($\leq 5\%$) while maximizing TPR. Report at multiple drift magnitudes (feature mean shift $\delta \in [0.1, 0.5, 1.0]$ in units of std dev). Measure: (a) FP rate, (b) TP rate, (c) detection delay (steps between true drift start and algorithm declaration), (d) threshold transferability across datasets (does threshold tuned on one dataset work on another?).

Purpose: Drift detection is the operational foundation of continual learning: systems must know when to update to remain reliable. Understanding IPM-based detection reveals the tradeoff between sensitivity (detecting small, useful drifts) and selectivity (ignoring noise). Three core concepts: (1) IPM definition: IPMs are distribution distances in Hilbert space; $\text{IPM}_\mathcal{H}(P,Q) = \sup_{\|h\|_\mathcal{H} \leq 1} |\mathbb{E}_P[h(x)] - \mathbb{E}_Q[h(x)]|$ measures maximum difference between expectations under two distributions, generalizing many metrics (MMD, Wasserstein, Kolmogorov-Smirnov). (2) Detection sensitivity-specificity tradeoff: lower thresholds (sensitive) detect drift early but flag noise as drift (false positives trigger unnecessary retraining), higher thresholds (specific) avoid false positives but miss real drift (model operates on stale distribution). (3) Computational-accuracy tradeoff: linear IPM is $O(n)$ but less sensitive, RBF kernel IPM is $O(n^2)$ but more sensitive, Wasserstein computation is expensive but theoretically motivated. Governance applications: production models require automatic drift detection triggering alerts and retraining pipelines; fraud detection must detect emerging fraud strategies (concept drift) without raising too many false alarms causing alert fatigue; recommendation systems must detect changes in user preference distributions; clinical models must detect population shift (e.g., admission criteria change).

ML Link: Relates to drift detection methodology in Chapter 13. Connects to Example 1 (Covariate Shift in Linear Regression) conceptually—detecting covariate shift directly. IPMs are theoretical tools underlying distribution testing; the MMD test is a statistical hypothesis test with known Type I/II error rates asymptotically. Related to Example 9 (Wasserstein Distance and Distribution Shift) for the Wasserstein variant. Detection threshold calibration connects to optimal hypothesis testing (controlling false positive rate while maximizing power).

Hints: Implement MMD with linear kernel: $\text{MMD}^2 = \|\frac{1}{m}\sum_i x_i - \frac{1}{n}\sum_j x'_j\|_2^2$ where unprimed are reference, primed are recent. For RBF kernel (with bandwidth $\sigma$), compute Gram matrix entries $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$, then $\text{MMD}^2 = \frac{1}{m^2}\sum K(x_i, x_j) + \frac{1}{n^2}\sum K(x'_i, x'_j) - \frac{2}{mn}\sum K(x_i, x'_j)$. For Wasserstein, use scikit-learn’s wasserstein_distance or scipy.spatial.distance. Set window sizes (reference 100, recent 50) to balance number of samples vs staleness of reference. Threshold tuning: on validation period, compute IPM at each time, set threshold such that $\text{FPR} = \text{\#\{IPM}_t > \tau \text{ during no-drift\}} / \text{(length of no-drift)} \leq 0.05$. Visualize: (a) IPM trajectory over time, showing plateau during no-drift, sharp rise at drift time, with threshold line, (b) ROC curve (FPR vs TPR) for different thresholds showing tradeoff, (c) detection delay vs threshold (higher threshold = longer delay but fewer FP).

What mastery looks like: 1. Implement all three IPM variants (linear, RBF, Wasserstein); verify they increase after synthetic drift introduction (sanity test on $\delta = 1.0$ shift). 2. Threshold calibration: on validation period, set threshold to achieve $\text{FPR} \leq 5\%$, report threshold value and resulting TPR on drift period. 3. False positive and true positive analysis: report FPR and TPR across drift magnitudes ($\delta \in [0.1, 0.5, 1.0]$), showing how detection sensitivity scales with drift magnitude (larger drift easier to detect). 4. Detection delay quantification: measure lag (time from true drift start to detected). Report for each IPM type and drift magnitude, showing tradeoff between lag and FP rate. 5. Threshold transferability: train threshold on first dataset, apply to second (or out-of-distribution data); report FPR/TPR to quantify if detector generalizes or requires retuning per domain. 6. Computation cost comparison: measure runtime for each IPM (linear O(n), RBF O(n^2), Wasserstein O(n^3) or better with approximations); discuss practical tradeoff (fast linear good for streaming, expensive Wasserstein good for batch). 7. Window size sensitivity: vary window sizes (reference $\in [50, 200]$, recent $\in [20, 100]$) and report ROC curves showing effect; larger windows reduce variance but increase lag after drift. 8. Advanced: Implement statistical testing version—hypothesis test $H_0:$ no drift, compute p-value from permutation test; show connection between IPM threshold and p-value threshold, discuss advantages of statistical framing.

C.5 — Regret Analysis in Online Convex Optimization Under Drift

Task: Implement online convex optimization learner and empirically measure static vs dynamic regret under drifting environments. Setup: Define convex loss function $\ell_t(w) = (w - u_t^*)^T A (w - u_t^*)$ (quadratic loss, easy gradients) with time-varying optimal parameter $u_t^*$ following a smooth trajectory: $u_t^* = u_0 + v \cdot (t / T) \quad\text{or}\quad u_t^* = u_0 + r \sin(2\pi t / T)$ (linear path or periodic oscillation). Run online learner (e.g., online gradient descent $w_{t+1} = w_t - \alpha_t \nabla \ell_t(w_t)$) with fixed step size $\alpha$ or adaptive step size $\alpha_t = \alpha_0 / \sqrt{t}$. At each step $t$, compute cumulative loss $\text{StaticRegret}_T = \sum_{t=1}^T \ell_t(w_t) - \min_u \sum_{t=1}^T \ell_t(u)$ (comparing to best fixed predictor) and $\text{DynamicRegret}_T = \sum_{t=1}^T \ell_t(w_t) - \sum_{t=1}^T \ell_t(u_t^*)$ (comparing to pathfollowing oracle). Measure path length $P_T = \sum_t \|u_t^* - u_{t-1}^*\|$ (quantifies drift speed). Run experiments with drift speeds (path length) $P_T \in [0, 0.1, 0.5, 2.0]$, step sizes $\alpha \in [0.01, 0.1, 1.0]$, and both on fixed and adaptive step sizes. Report: (a) static and dynamic regret curves vs time, (b) cumulative regret vs path length $P_T$, (c) regret scaling $\text{DynReg} / \sqrt{T} P_T$ (theory predicts $O(\sqrt{T} P_T)$ or better).

Purpose: Regret analysis formalizes algorithm performance under drift, capturing the fundamental difference between static and dynamic environments. Understanding when dynamic regret is relevant (vs when static regret suffices) and how algorithm design (step size, adaptation) affects regret is critical for online learning systems. Three core concepts: (1) Static regret: $\text{StaticReg}_T = \sum_t \ell_t(w_t) - \min_u \sum_t \ell_t(u)$ measures cost of not knowing best fixed predictor in hindsight; under drift, this is overly pessimistic because no fixed predictor is optimal. (2) Dynamic regret: $\text{DynReg}_T = \sum_t \ell_t(w_t) - \sum_t \ell_t(u_t^*)$ measures cost of not following the oracle path; accounts for drift by comparing to time-varying optimum, quantity is bounded by $O(\sqrt{T}P_T)$ for smooth paths. (3) Path length and drift speed: $P_T = \sum_t \|u_t^* - u_{t-1}^*\|$ quantifies how fast the optimum moves; high $P_T$ means rapid drift, making online learning harder. Governance applications: recommendation algorithms operate in time-varying preference landscape; bid optimization for ads faces dynamic click rates and competitor bids; resource allocation in cloud systems must adapt to changing demand; automated trading must track non-stationary market dynamics.

ML Link: Implements regret theory from Chapter 13. Directly related to Example 5 (Regret in Online Learning) which provides the theoretical foundation. The key theoretical results are: (1) static regret under fixed environment is $O(\sqrt{T})$ for strongly convex loss, (2) under drift with oracle path length $P_T$, dynamic regret is $O(\sqrt{T}P_T + P_T^2)$ (first term dominates when drift is smooth, second term when drift is rapid), (3) adaptive step size can improve constants but not exponent. The exercise empirically validates these theoretical bounds.

Hints: For quadratic loss $\ell_t(w) = \|w - u_t^*\|_A^2$, gradient is $\nabla \ell_t(w) = 2A(w - u_t^*)$ (easy to compute). Generate drift trajectory: (a) linear path: $u_t^* = u_0 + v(t/T)$ with slope $v$ controlling drift speed, (b) periodic: $u_t^* = u_0 + r \sin(2\pi t/T)$ with amplitude $r$ and period $T$. Implement OGD: $w_{t+1} = w_t - \alpha \nabla \ell_t(w_t) = w_t - 2\alpha A(w_t - u_t^*)$. Compute path length: $P_T = \sum_{t=1}^{T-1} \|u_t^* - u_{t-1}^*\|$; for linear path $P_T = \|v\|$, for periodic (P_T = $ circumference-like. Measure regrets: accumulate losses for learner and oracle; subtract oracle cumulative loss from learner cumulative loss. Plot: (a) regret vs time (showing cumulative curve growth), (b) regret vs path length (showing $O(P_T)$ dependence), (c) normalized regret $\text{DynReg}/(\sqrt{T}P_T)$ vs $t$ (should stabilize, validating theory), (d) effect of step size (plot regret curves for different $\alpha$, showing optimal $\alpha$).

What mastery looks like: 1. Correctly implement online GD learner and regret measurements (both static and dynamic); verify on simple test case (e.g., no drift) that regret is $O(\sqrt{T})$ scaling. 2. Demonstrate static regret inadequacy under drift: show that even if static regret is $O(\sqrt{T})$, actual dynamic regret is much larger because no fixed predictor can track drift well; report ratio $\text{StaticReg} / \text{DynReg} \gg 1$ under drift. 3. Path length dependence: measure cumulative regret for multiple path lengths $P_T \in [0, 0.1, 0.5, 2.0]$, report that dynamic regret scales roughly linearly with $P_T$ (confirming $O(\sqrt{T}P_T)$ or similar). 4. Step size optimization: sweep step size $\alpha \in [10^{-3}, 10]$, report regret for each; identify near-optimal $\alpha$ showing clear minimum/valley (too small step = slow adaptation, too large = oscillation). 5. Adaptive vs fixed step size: compare fixed $\alpha$ vs adaptive $\alpha_t = \alpha_0 / \sqrt{t}$ over multiple drift speeds; report relative performance and when each dominates. 6. Regret bound validation: compute empirical regret and theoretical bound $O(\sqrt{T}P_T + P_T^2)$ for the same parameters; check that empirical is within factor 2-3 of bound (showing theory captures essential behavior). 7. Convergence trajectory: plot learner parameters $w_t$ vs oracle $u_t^*$ over time; visualize distance $\|w_t - u_t^*\|$ trajectory, showing lag (time for learner to catch oracle after drift). 8. Advanced: Implement variance-reduced online learner (e.g., SVRG-style variance reduction for quadratic loss) or meta-learning approach that adapts step size based on observed drift; measure improvement in dynamic regret compared to fixed/diminishing step size baseline.

C.6 — Continual Learning with Sequential Tasks and Replay Buffer

Task: Implement continual learning on a sequence of classification tasks, measuring the stability-plasticity tradeoff with and without a replay buffer. Setup: Define task sequence $\{T_1, T_2, ..., T_5\}$ where each task is binary classification on different features or subsets (e.g., $T_i$ distinguishes $y \in \{i, i+5\}$ in CIFAR-10 or a synthetic 20-class dataset split into 5 binary tasks). Train a shared feature encoder + classifier sequentially: train on $T_1$ ($n_1 = 2000$ samples), then on $T_2$ (new parameters must retain $T_1$ knowledge), progressing through all 5 tasks. Implement two variants: (a) Without replay: train on current task only; measure catastrophic forgetting via accuracy on held-out $T_1$ samples after training on $T_5$, (b) With replay buffer of size $B = [10, 50, 200]$ samples per task: during training on $T_i$, mix in $B$ samples from previously learned tasks (uniform or weighted sampling). Compute metrics: (1) Backward transfer (forgetting): accuracy on $T_j$ after learning $T_i$ for $j < i$, averaged over task pairs. (2) Forward transfer: accuracy on $T_i$ when trained sequentially vs trained standalone (can be negative if earlier tasks hurt). (3) Average accuracy: $\text{AvgAcc} = \frac{1}{5}\sum_i \text{Acc}_i(T_i | \text{after task 5})$ (average performance across all tasks at end). Compare across buffer sizes and different task similarity (similar tasks less forgetting, dissimilar tasks more).

Purpose: Continual learning is the practical challenge of updating models without re-training from scratch on all data. Understanding the tradeoff between learning new tasks (plasticity) and retaining old knowledge (stability) via memory is foundational. Three core concepts: (1) Catastrophic forgetting: when learning new tasks, model parameters shift to optimize new task, potentially degrading old task performance; magnitude depends on task similarity (related tasks interfere less) and parameter sharing (shared layers suffer more forgetting). (2) Replay buffer mechanism: storing samples from old tasks and mixing them into current training acts as a regularizer, anchoring parameters to retain old task knowledge; increases parameter stability (inverse: trains slower on new task). (3) Stability-plasticity frontier: there exists a tradeoff frontier where increasing replay buffer decreases forgetting but slows new task learning; the frontier’s shape depends on task similarity and model capacity. Governance applications: personal recommendation systems must adapt to new user preferences without forgetting past history; autonomous vehicle models must learn new road conditions without losing knowledge of common traffic patterns; language models must update to new topics/language without degrading on standard benchmarks.

ML Link: Relates to continual learning framework in Chapter 13. Example 6 (Continual Learning With Sequential Tasks) and Example 7 (Catastrophic Forgetting in Neural Networks) provide context. The key insight is that replay buffers form a simple but effective mitigation—more sophisticated methods regularize parameters via importance weighting or sparse updates, but replay is most practical and interpretable for this exercise. Connection to Example 11 (Stability Under Gradual Drift): replay approximates training on mixed distribution of all tasks, reducing the effective drift magnitude between sequential tasks.

Hints: Use a simple architecture (e.g., 2-layer MLP or small CNN) to make experiments feasible. Generate synthetic task sequence: either use CIFAR-10 split into binary subtasks (Task 1: airplane vs bird, Task 2: car vs dog, etc.) or create synthetic data where Task $i$ has feature shift in different dimensions (Task 1: shift in [x1,x2], Task 2: shift in [x3,x4], etc.). Implement replay: maintain buffer for each previous task, during training on current task, sample a batch from current task and a batch from buffer (mix them or train separate heads), update model on both. Measure forgetting: use held-out test set for each task; after training on Task 5, evaluate on held-out test for Task 1, report $\text{Forgetting} = \text{InitialAcc}(T_1) - \text{FinalAcc}(T_1|after T_5)$. Plot: (a) accuracy on each task at each training stage (heatmap: rows=tasks, cols=train step), (b) forgetting vs buffer size (showing buffer helps), (c) task similarity matrix (multi-dim task space) and how similarity correlates with forgetting.

What mastery looks like: 1. Correctly implement sequential task training and forgetting measurement; verify without replay that forgetting increases significantly (e.g., Task 1 accuracy drops from 0.95 to 0.70 after learning Tasks 2-5). 2. Replay buffer effectiveness: show that with buffer $B=50$ per task, forgetting decreases substantially (Task 1 accuracy remains >0.85); quantify improvement as reduction factor (e.g., forgetting reduced by 3×). 3. Buffer size tradeoff: plot forgetting vs buffer size $B \in [10, 50, 200]$, showing diminishing returns (improvement from 10→50 larger than 50→200); identify sweet spot where cost (memory) vs benefit (forgetting) are balanced. 4. Task similarity analysis: generate task pairs with known similarity (e.g., easy vs hard distinction, high vs low overlap); show forgetting is lower for similar task pairs, higher for dissimilar pairs; quantify correlation between similarity and forgetting. 5. Forward transfer measurement: compare Task 2 accuracy when trained sequentially after Task 1 vs trained standalone; report if forward transfer is positive (Task 1 helps Task 2), negative (Task 1 hurts Task 2), or near-zero (interference). 6. Average accuracy frontier: plot AvgAcc(final) vs plasticity (accuracy on current task during learning) for different buffer sizes; identify Pareto frontier showing stability-plasticity tradeoff. 7. Parameter trajectory visualization: for simple setting (low-dim synthetic data), plot parameter changes from Task 1→Task 5 (with and without replay); show how replay keeps parameters closer to Task 1 optimum. 8. Advanced: Implement task-specific heads (one classifier per task, shared encoder) vs single shared classifier; show single shared classifier suffers more forgetting (all tasks fight for parameter space), task-specific heads reduce forgetting but increase parameters per task.

C.7 — Feedback-Induced Distribution Shift in Recommender Systems

Task: Simulate feedback loop where a recommender’s decisions change the distribution of data it later trains on. Implementation: (1) Initialize user preference model (e.g., matrix factorization with latent factors, or bandit-style payoff estimates), (2) At each round (day/week), recommend items to users based on predicted utility $\hat{u}(user, item) = P_\text{model}(\text{feedback}|\text{item})$, (3) Users provide feedback (click/no-click or rating) influenced by recommendation (selection bias: recommended items receive more feedback), (4) Accumulate feedback data and retrain model, (5) Measure distribution shift: compare observed feedback distribution after policy change vs pre-policy “natural” distribution (if users hadn’t been shown different items). Run two strategies: (a) Greedy: always recommend top-k items by model’s prediction (myopic, exploits biases), (b) Exploration-augmented: mix top-k with random items (exploration term $\epsilon = 0.1$ or 0.2). Metrics: (1) Long-term performance (cumulative feedback over 100 rounds), (2) Distribution divergence (KL or Wasserstein distance between observed vs natural preference distribution), (3) Diversity: fraction of unique items recommended (should increase with exploration), (4) Average feedback (proxy for user satisfaction).

Purpose: Feedback loops and endogenous data are the hidden challenge of continual learning: naive retraining can amplify biases by learning from biased data collection (more exposure for recommended items tunes model further toward those items, creating runaway loop). Understanding this requires modeling the feedback process and the distribution shift it induces. Three core concepts: (1) Selection bias and amplification: recommended items receive more user engagement purely because of visibility, not quality; if model treats engagement as signal of quality, it learns to over-recommend those items, amplifying bias over time. (2) Feedback loop dynamics: each round’s recommendation affects next round’s data, which affects next recommendation; without external signal (ground truth preferences), model can drift far from true user utility. (3) Exploration as stabilizer: random exploration exposes the model to items’ true utilities (unbiased feedback), providing information to correct for selection bias, though at cost of short-term performance. Governance applications: recommendation systems risk amplifying initial biases (e.g., if model initially recommends male-dominated topics to male users, those users see male-dominated items, providing more feedback to male topics, reinforcing original bias); credit scoring can amplify initial discrimination if unfairly trained model denies credit to certain groups, those groups don’t build credit history, model sees them as worse credit risks, creating feedback loop; fraud detection can miss new fraud types if model only learns from past detected fraud (biased labels), emerging fraud patterns evade detection longer, creating delayed feedback loop.

ML Link: Relates to feedback loops and endogenous data in Chapter 13. Related to Example 10 (Feedback Loop Amplification in Recommender Systems) which covers the conceptual framework. The key theoretical result is that without exploration, the data distribution converges to a fixed point that may be far from the optimal (unbiased) distribution—exploration breaks this feedback loop by injecting independent signal. Connection to Example 7 (Catastrophic Forgetting) in the sense that positive feedback loops cause persistent drift, whereas continual learning with replay prevents catastrophic backsliding.

Hints: Implement user preferences as fixed (ground truth), recommendation as biased toward logged items. Simulate: (1) Initialize model with random weights, (2) Each round: recommend top-k items (deterministic or probabilistic), users click items with probability proportional to $P(\text{feedback}|\text{item})$ $\times$ recommendation probability (selection bias), (3) Collect feedback, (4) Retrain model on accumulated feedback (standard MLE or factorization update), (5) Measure distribution of feedback as empirical frequency of clicks per item vs true preference frequency. For greedy: $\text{recomm}_t = \arg\max_{\text{items}} P_t(\text{feedback}|\text{item})$. For exploration: $\text{recomm}_t = \text{top-k}(P_t) \text{ with prob } 1-\epsilon, \text{ random item with prob } \epsilon$. Use simple model: Bernoulli preference per user-item (feedback = Bernoulli(p_ij)), estimate $\hat{p}_{ij}$ from feedback count via MLE. Plot: (a) cumulative feedback over 100 rounds (greedy should increase faster but plateau at biased distribution, exploration should be lower initially but higher long-term), (b) KL divergence between feedback distribution and true preference (greedy divergence increases, exploration keeps it low), (c) item diversity (greedy recommends same items repeatedly, exploration covers more items).

What mastery looks like: 1. Correctly implement feedback-loop simulation with selection bias; verify that observed feedback distribution is biased toward recommended items (e.g., if greedy always recommends items 1-5, they receive 80% of feedback even if true preferences are uniform). 2. Amplification demonstration: show that greedy strategy amplifies initial random model biases—even if initial model is barely biased, after 50 rounds of retraining on biased feedback, model becomes severely biased (divergence from true distribution >0.5 in KL divergence). 3. Distribution divergence measurement: quantify $\text{KL}(P_{\text{true}} || P_{\text{observed}})$ over time for both strategies; show greedy increases unbounded, exploration stabilizes (or decreases). 4. Exploration effectiveness: show that exploration-augmented strategy (even with small $\epsilon = 0.1$) significantly reduces divergence; compare $\epsilon = [0, 0.1, 0.2]$ and report how much exploration is needed to stabilize. 5. Performance vs stability tradeoff: quantify cumulative reward (short-term), divergence (long-term stability), and diversity; show greedy optimizes reward but fails on stability/diversity, exploration balances (Pareto frontier). 6. Sensitivity analysis: vary user preference strength (weak prefs = easier to manipulate, strong prefs = harder), show how feedback loop severity scales; vary exploration rate and plot optimal $\epsilon$. 7. Fairness/coverage: if multiple user groups exist, measure if greedy creates coverage bias (recommends only to profitable groups, ignores others); show exploration reduces group-level bias. 8. Advanced: Implement causal model of user preferences and recommendation effects (e.g., item recommendations affect future preferences via learning curve); model more complex feedback dynamics (e.g., user fatigue, seasonal trends) and show how selection bias interacts with these factors.

C.8 — Domain Adaptation Risk Decomposition and Bound Validation

Task: Empirically test the domain adaptation risk decomposition: $\text{Risk}_T(h) \leq \text{Risk}_S(h) + \text{H\Delta H divergence} + \lambda$, where S=source (training), T=target (test), and $\lambda$ is joint optimal error. Implementation: (1) Create source domain with $n_S = 1000$ samples and target domain with $n_T = 500$ samples with controlled similarity. (2) Generate data: use Gaussian features with two classes, rotate decision boundary between source and target (parameterized rotation angle $\theta$) to control $\mathcal{H}\Delta\mathcal{H}$ divergence. (3) Train multiple hypotheses on source domain: (a) linear classifier, (b) nonlinear (RBF kernel SVM), (c) ensemble. (4) Compute risk terms: (i) $\text{Risk}_S(h)$ = source test error, (ii) $\mathcal{H}\Delta\mathcal{H}$ divergence via classifier disagreement (for hypothesis class $\mathcal{H}$, estimate as $2 \cdot P_S(h_1(x) \neq h_2(x))$ where $h_1, h_2$ are two diverse classifiers trained differently), (iii) $\lambda$ = joint optimal error (error of best labeling consistent with both source and target labels; approximate via oracle on true labels). (5) Measure target risk and check if bound $\text{Risk}_T(h) \leq \text{Risk}_S(h) + \mathcal{H}\Delta\mathcal{H} + \lambda$ holds. Vary rotation angle $\theta \in [0°, 30°, 60°, 90°]$ (from identical domains to fully rotated) and hypothesis classes.

Purpose: Domain adaptation risk decomposition provides a principled bound on target performance in terms of source error, hypothesis class divergence (how much classifiers disagree across domains), and joint optimal error. Understanding which term dominates (source vs divergence vs optimal) reveals whether the problem is primarily about source domain difficulty, hypothesis class mismatch, or fundamental domain gap. Three core concepts: (1) Risk decomposition: target risk is bounded by source risk (error on training data), divergence term (how well hypothesis class can fit both domains), and optimal error (irreducible error due to domain inherent difference). (2) H-Delta-H divergence: measures how much the hypothesis class must shift to explain target vs source data; small divergence means hypothesis class is flexible enough, large divergence means class needs to specialize to each domain. (3) Joint optimal error: the best possible error when optimizing jointly over both source and target; related to source-label shift and concept drift—large $\lambda$ means the tasks are fundamentally different. Governance applications: models adapted across geographic regions (US to EU) need to account for domain differences in user behavior, regulation, and data collection; clinical models from one hospital to another must account for differences in equipment, patient population, and labeling practices; fraud detection models adapted across time periods must account for evolution in fraud tactics and data distribution.

ML Link: Relates to domain adaptation and distribution shift theory in Chapter 13. Example 8 (Domain Adaptation Risk Decomposition) directly covers this bound. The theoretical result gives explicit error bounds in terms of measurable quantities on source and target data (without needing target labels, using only disagreement-based divergence estimation). Connection to Example 4 (Concept Drift in Time Series): domain shift can be viewed as spatial drift (cross-domain) vs temporal drift (time-based), and the decomposition applies to both.

Hints: Generate synthetic data: source domain with 2D features from $\mathcal{N}(\mu_0, I)$, target domain from $\mathcal{N}(\mu_0, I)$ rotated by angle $\theta$. Labels are determined by linear boundary $w^T x > 0$ in source, same boundary rotated by $\theta$ in target (simulating concept drift). Compute $\mathcal{H}\Delta\mathcal{H}$ divergence: (a) train two diverse classifiers on source (different regularization, different data subsets), measure disagreement on source and target, (b) $\text{divergence} = 2 |\mathbb{E}_S[h_1(x) \neq h_2(x)] - \mathbb{E}_T[h_1(x) \neq h_2(x)]|$ (approximation). Compute $\lambda$: if target labels are available, train on target alone to get target-optimal risk, $\lambda = \text{Risk}_T^*$. Compile bound check: at each rotation angle, record source risk, divergence, lambda, target risk; check if sum of first three is >= target risk (bound should hold). Plot: (a) bar chart comparing bound term magnitudes (source, divergence, lambda) vs rotation angle, (b) bound tightness = (RHS - LHS) / RHS (fraction by which bound is loose), (c) which term dominates at each rotation (shows whether gap is primarily source error, divergence, or optimal).

What mastery looks like: 1. Correctly generate domain adaptation benchmark with controlled similarity (decreasing similarity with rotation angle); verify visually that source and target domains diverge as $\theta$ increases. 2. Hypothesis class selection and divergence computation: train at least 2-3 diverse hypotheses, compute disagreement-based divergence estimate, verify divergence increases with rotation angle (stronger domain shift = higher divergence). 3. Bound validation: empirically compute $\text{Risk}_S(h) + \mathcal{H}\Delta\mathcal{H} + \lambda$ and compare to $\text{Risk}_T(h)$ for multiple classifiers; verify bound holds (RHS ≥ LHS) even when it’s loose. 4. Bound tightness analysis: report tightness (fraction of slack, ideally <2× so bound is not vacuous) for different rotation angles and hypothesis classes; identify settings where bound is tight vs loose. 5. Term decomposition: for each rotation angle, report which of the three terms dominates target risk (e.g., at small rotation, source risk dominates; at large rotation, divergence/lambda dominate); explain why term priority shifts. 6. Comparison across hypothesis classes: repeat with linear, RBF, and ensemble; show how more expressive classes (RBF) reduce divergence but may increase source error (overfitting); discuss flexibility-bias tradeoff. 7. Multi-class extension: generalize to 3+ classes and measure how bound scales; show bound still holds and discuss per-class term decomposition if one class drifts more than others. 8. Advanced: Implement adaptation method (e.g., importance reweighting or representation alignment) and measure how it reduces divergence term; show bound decreases as adaptation succeeds, validating that convergence reduces the theoretical bottleneck.

C.9 — Label Shift Estimation via Confusion Matrix with Calibration Sensitivity

Task: Implement label shift estimation using the confusion matrix method and analyze robustness to miscalibration. Setup: Train a binary classifier on source data (logistic regression or neural net), compute confusion matrix on held-out validation set ($n_{\text{val}} = 500$): $C = \begin{bmatrix} P(\hat{y}=0|y=0) & P(\hat{y}=1|y=0) \\ P(\hat{y}=0|y=1) & P(\hat{y}=1|y=1) \end{bmatrix}$. Receive unlabeled target data ($n_{\text{target}} = 1000$) and apply trained classifier to get predicted label distribution $\hat{\pi}_{\text{obs}} = [P(\hat{y}=0), P(\hat{y}=1)]$. Infer target prior via OLS: $\hat{\pi}_{\text{target}} = (C^{-1})^T \hat{\pi}_{\text{obs}}$ (or pseudoinverse if singular). Compare estimated priors to true target priors (ground truth if available) and also test robustness: (1) Uncalibrated classifier: train on imbalanced source (e.g., 90-10), classifier will overpredict majority class, confusion matrix will be biased. (2) Calibrated classifier: after training, apply temperature scaling or Platt scaling to calibrate posteriors, then recompute confusion matrix (should improve prior estimates). Metrics: (a) prior estimation error $||\hat{\pi}_{\text{target}} - \pi_{\text{target}}||$, (b) downstream impact on downstream task (e.g., decision threshold, F1 score with true target prior vs. estimated prior), (c) sensitivity to miscalibration (how much does uncalibrated vs calibrated change estimation error).

Purpose: Label shift prior estimation is a practical technique for deployment: given unlabeled target data and a source-trained classifier, estimate target label prevalences to enable prior adjustment without retraining. The method’s reliability depends critically on classifier calibration (confusionmatrix accuracy), revealing a tight link between calibration quality and shift correction ability. Three core concepts: (1) Confusion matrix inversion: under label shift assumption, observed label distribution on target equals $C \cdot \pi_{\text{target}}$ (confusion matrix times true prior); inverting recovers the prior. (2) Calibration dependence: confusion matrix is accurate only if classifier’s predicted posteriors match observed frequencies; miscalibration (e.g., always predicts P(y=1)=0.6 even if true frequency is 0.4) makes confusion matrix incorrect, causing prior estimation to fail. (3) Post-hoc correction: calibration methods (temperature scaling, Platt scaling) can be applied after training with a small validation set, converting miscalibrated scores to calibrated probabilities before confusion matrix estimation. Governance applications: medical diagnosis systems must estimate disease prevalence in new populations; fraud detection deployed to new merchant categories must estimate fraud rate; loan approval models must estimate default rates in new applicant pools—all require accurate prior estimation from unlabeled data.

ML Link: Relates to label shift estimation in Chapter 13. Example 3 (Label Shift and Posterior Adjustment) covers the theoretical framework. The method assumes label shift (P(x|y) invariant) and uses empirical confusion matrix instead of theoretical version, making it practical but potentially noisy if validation set is small. Connection to calibration theory: the method implicitly assumes the classifier’s predictions are calibrated, which links back to statistical proper scoring rules and reliability diagrams (tools for assessing calibration).

Hints: Implement confusion matrix computation: on validation set with known labels, predict $\hat{y}_i$ and count $C[j,k] = \frac{\#\{i: \hat{y}_i = k, y_i = j\}}{|\{i: y_i = j\}|}$ (row-normalized). For prior inference: solve $\hat{\pi}^T = \hat{\pi}_{obs}^T (C^{-1})$ (or use pseudo-inverse). For calibration, use scikit-learn’s CalibratedClassifierCV with methods ‘sigmoid’ and ‘isotonic’, then recompute confusion matrix on calibrated predictions. Introduce miscalibration by training on imbalanced data (90-10) without explicit calibration, showing prior estimates suffer. Measure downstream impact: given estimated prior, compute new decision threshold (threshold = prior_minority / prior_majority for F1-optimal threshold), apply to test data, measure F1 score vs oracle threshold (using true prior). Plot: (a) prior estimation error vs calibration method (uncalibrated, sigmoid-calibrated, isotonic-calibrated), (b) downstream F1 score with estimated vs oracle thresholds, (c) confusion matrix heatmaps before/after calibration (visual difference in off-diagonal structure).

What mastery looks like: 1. Correctly implement confusion matrix computation and prior estimation; verify on ground truth (if available) that estimated prior error is small (<5%) for well-calibrated classifiers. 2. Demonstrate calibration as prerequisite: for uncalibrated classifier (high ECE), show prior estimation error is large (>15-20%); after calibration, error drops (<5%), proving calibration is essential. 3. Quantify calibration-sensitivity tradeoff: report prior estimation error before/after each calibration method (sigmoid, isotonic); show which method is most effective for this particular classifier. 4. Downstream task validation: estimate prior, compute optimal threshold, measure F1/precision/recall on target test set using estimated threshold; compare to oracle threshold (using true prior); report performance difference (should be small if prior estimation is good). 5. Sensitivity to validation set size: vary validation set size used to compute confusion matrix ([100, 500, 1000]); show how estimation error increases with smaller validation (larger variance due to finite samples), explaining why calibration requires sufficient data. 6. Multi-class extension: repeat for 3+ classes with prior shift across all classes; report per-class prior estimation accuracy and aggregate multi-class metrics (macro/weighted F1 with estimated priors). 7. Robustness analysis: introduce label noise in validation set (e.g., 10% of labels flipped), measure how confusion matrix and prior estimation degrade; discuss detection of this corruption and mitigation strategies. 8. Advanced: Compare confusion matrix approach with alternative prior estimation methods (e.g., EM-based expectation maximization assuming label shift model); show which is more robust to miscalibration or model misspecification.

C.10 — Temporal Drift Simulation and Update Strategy Comparison

Task: Simulate temporal drift in feature/label distribution and compare three update strategies. Setup: Generate time series data ($T = 500$ time steps) with smooth, persistent feature shift (e.g., feature mean drifts linearly or follows a sinusoidal pattern). Implement synthetic regression task: target $y_t = w^* \cdot x_t(t) + \epsilon_t$ where features $x_t(t) = x_0 + \delta \cdot (t/T)$ drift linearly, and conditional relationship $w^*$ may also drift slowly. Generate training data on first $T/4$ steps, then deploy the model on remaining $3T/4$ steps with continuous drift. Implement three update strategies: (a) Batch retraining: retrain on all data collected so far every $f=50$ steps (computes cost of full retrain rounds), (b) Sliding window retraining: maintain a window of recent $w=100$ samples, retrain every $f=50$ steps using only window data (low cost, reactive but forgets old patterns), (c) Online SGD: update parameters continuously at each step using single-sample SGD with learning rate $\alpha = 0.01$ (memory-efficient, responds immediately, high variance). Metrics: (1) prediction MSE over time $\text{MSE}_t = (y_t - \hat{y}_t)^2$, (2) parameter drift $||w_t - w_0||$ (how much parameters shift from initial), (3a) total computational cost (number of training samples processed), (4) stability (variance of MSE in stable regions vs jumping regions). Vary drift speed (slope $\delta \in [0.1, 0.5, 1.0]$) and plot how strategy performance changes.

Purpose: Temporal drift is ubiquitous in real systems (features, labels, conditionals change over time), and choosing the right update frequency requires understanding the problem’s drift speed and computational budget. Different strategies optimize for different regimes: batch retraining is stable but slow (high lag), sliding window is responsive but forgetful (loses trends), online is fast but noisy (vulnerable to transient spikes). Understanding the tradeoff enables principled system design. Three core concepts: (1) Drift speed and lag: faster drift requires more frequent updates to stay close to the true model; if drift speed is high and update frequency is low, model will incur persistent error until next retraining event—lag time = time between drift occurrence and model update. (2) Stability-plasticity in update frequency: high frequency (online) trades stability (parameter consistency, robustness to noise) for plasticity (rapid adaptation); low frequency (batch) trades plasticity for stability. (3) Computational-statistical tradeoff: batch retraining uses all data (lower variance estimates) but requires expensive recomputation; online uses one sample (high variance) but cheap; sliding window balances both. Governance applications: demand forecasting systems must balance following changing consumption patterns (plasticity) with maintaining stable predictions on historical trends (stability); credit risk models must update to track credit market changes without overfitting to recent anomalies; recommendation systems must adapt to preference drift while preserving long-term user models.

ML Link: Relates to temporal drift and update strategies in Chapter 13. Example 4 (Concept Drift in Time Series) covers adaptive vs. static models. Example 11 (Stability Under Gradual Drift) specifically addresses balance between stability and adaptation. The key insight is that optimal update frequency grows with drift speed (faster environment = more frequent updates needed); theoretical results show cumulative error is minimized at a specific update frequency depending on drift rate.

Hints: Generate drift trajectory: (a) linear $x_t(t) = x_0 + \delta(t/T)$ where $\delta$ is drift magnitude per time unit, (b) sinusoidal $x_t(t) = x_0 + \delta \sin(2\pi t/T)$ for periodic drift (e.g., seasonal patterns). Implement retraining: batch = fit on all accumulated data, sliding window = fit on most recent $w$ samples, online = $w_t = w_{t-1} - \alpha \nabla \ell_t(w_{t-1})$. Measure error: compute MSE on test set at each time step (or on small held-out test samples from each time window). Compute parameter drift: $||w_t - w_0||$ over time. Computational cost: count number of training passes (batch = 1 per retraining, sliding window = 1 per retraining, online = 1 per sample). Visualize: (a) MSE trajectory for all 3 methods (showing lag after drift, recovery time, steady-state error), (b) parameter evolution $||w_t - w_0||$ (batch should show jumps at retraining, online should be smooth), (c) cost vs error tradeoff frontier (varying retraining frequency or algorithm choice), (d) heatmap of MSE vs (drift speed, update frequency) grid.

What mastery looks like: 1. Correctly generate temporal drift scenario with controllable drift speed; verify visually that feature distribution shifts smoothly (e.g., scatter plot of $x_t$ over time shows clear trend). 2. MSE degradation under drift: show that static model (no retraining) MSE degrades over time with increasing drift; quantify error growth (e.g., MSE increases 2-5× over $3T/4$ deployment period). 3. Update strategy comparison: plot MSE trajectories for batch, sliding window, online; show batch has jagged profile (sharp improvements at retraining points, degradation between them), online has smoother profile but noisier, sliding window is intermediate. 4. Lag and recovery quantification: measure time lag (steps between drift and when MSE starts improving) for each strategy; show online has zero lag (updates every step), batch has large lag (waits for retraining event). 5. Stability analysis: in stable (pre-drift) regions, measure standard deviation of MSE; online should have higher variance (gradient noise), batch/window lower variance; quantify numerical tradeoff. 6. Drift speed sensitivity: repeat experiments with $\delta \in [0.1, 0.5, 1.0]$; show how optimal update frequency scales with drift speed (faster drift → more frequent updates needed). Plot optimal MSE vs $\delta$ for each strategy. 7. Computational efficiency comparison: report total computational cost (training samples processed) for equivalent final MSE; online is cheapest/sample (but needs many samples), batch is expensive/retraining (but efficient per sample). Discuss practical deployment constraints. 8. Advanced: Implement adaptive update strategy that tunes frequency based on observed drift (e.g., increase frequency if recent MSE increases, decrease if stable); measure if adaptive matches or beats fixed strategies, and at what computational cost.

C.11 — Task-Incremental vs Domain-Incremental vs Class-Incremental Continual Learning

Task: Build three variants of continual learning to isolate how task structure affects difficulty. Implementation: Create a 20-class dataset (e.g., CIFAR-100 or synthetic data split into 5 groups of 4 classes each). Define three continual learning scenarios: (1) Task-incremental: train sequentially on 5 tasks, each task has 4 classes, at inference time the task identity is provided (classifier can have 5 task-specific heads). (2) Domain-incremental: same 5 tasks but each task has a different visual domain (e.g., Task 1: objects on white background, Task 2: objects outdoors, Task 3: objects indoors, etc.); at inference, task identity not provided, must infer from input. (3) Class-incremental: sequentially learn 5 groups of 4 classes; final task must simultaneously predict all 20 classes (label space grows over time, no task identity at inference). Run the same training procedure (sequential SGD, with per-class accuracy reporting) for all 3 variants. Metrics: (a) Forgetting: accuracy on classes from Task $i$ after learning Task 5, (b) Forward transfer: does learning Task 2 help Task 3 (measured as difference in Task 3 accuracy with vs without Task 2 pre-training), (c) Average accuracy: final accuracy on all classes/tasks, (d) Per-class accuracy variance (do some classes forget more than others?). Report results in a confusion matrix (rows = task, cols = train stage) to show when/where forgetting occurs.

Purpose: Continual learning difficulty depends on problem structure in often-underappreciated ways. Task-incremental is easiest (task ID guides classification), class-incremental is hardest (must distinguish among growing label space), domain-incremental is intermediate. Understanding which difficulty sources matter empirically helps design better continual learning algorithms. Three core concepts: (1) Task identity availability: having task identity at inference dramatically reduces problem difficulty (separate classifier per task vs single classifier for all classes), but production systems rarely provide this signal (humans don’t wear task labels). (2) Label space growth / output layer scaling: class-incremental requires growing the classifier’s output layer (or retraining to add new classes), changing all parameters’ semantics, whereas task-incremental or domain-incremental can keep output layer stable. (3) Interference and capacity: shared model parameters for different tasks/domains interfere with each other (learning new task rotates parameters, affecting old task); capacity is consumed as task/class count grows. Governance applications: personal AI assistants must continually learn about new topics, new user preferences, new task types (class-incremental, hardest); medical AI must adapt to new populations, new device types (domain-incremental, medium); product recommendation must handle new product categories when task identity might be available (task-incremental, easier).

ML Link: Relates to continual learning taxonomy in Chapter 13. Example 6 (Continual Learning With Sequential Tasks) uses task-incremental setup. The characterization of difficulty (task < domain < class) is a key organizational principle for understanding continual learning. Connection to Example 7 (Catastrophic Forgetting): forgetting severity scales with label space growth (class-incremental worst), task identity availability (task-incremental best), and domain homogeneity (similar domains easier).

Hints:Implement three setups on same base dataset: (1) Task-incremental: 5 tasks × 4 classes, train sequentially, maintain 5 softmax heads (output is per-task), at test time task ID is given. (2) Domain-incremental: same class labels across tasks but domains differ, single shared classifier (10-100 unit softmax for 4-class or 20-class depending on exact setup), test inference must select task before classification (can be done via domain classifier or uncertainty estimates). (3) Class-incremental: incrementally expand output classifier from 4→8→12→16→20 units across tasks, single unified classifier, test inference on full 20-class space. Use simple feature extractor (2-layer MLP or small CNN) shared across all variants. Report: confusion matrix of accuracy-by-task-at-each-stage (heatmap), per-class accuracy histogram at final stage, and numerical forgetting/forward transfer/average metrics. Visualize: bar chart comparing the three variants on each metric (e.g., “Forgetting: Task-incremental vs Domain-incremental vs Class-incremental”).

What mastery looks like: 1. Correctly implement three variants with controlled differences (only task identity/label space/domain variation changes, everything else identical). Verify implementation by showing task-incremental has lowest forgetting, class-incremental highest. 2. Quantify task identity benefit: report forgetting on Task 1 for task-incremental (easiest, e.g., ~5% loss) vs class-incremental (hardest, e.g., ~30% loss), showing task ID is worth ~25% accuracy improvement. 3. Label space growth impact: measure how final (Task 5) accuracy drops as label space grows (Task 1: 4 classes → Task 5: 20 classes); quantify as percent accuracy loss per new class added. 4. Forward transfer measurement: for each variant, measure if learning Task 2 helps predict Task 3 (can be negative if interference outweighs benefit, positive if transfer helps); show task-incremental has highest transfer (separate parameters don’t interfere), class-incremental has lowest (shared parameters cause interference). 5. Per-class forgetting analysis: plot accuracy for individual classes from Task 1 to Task 5; identify if some classes forget more (possibly minority classes if imbalanced) and explain patterns. 6. Complexity comparison: discuss computational cost differences (task-incremental has 5× more output parameters if task-specific heads, domain-incremental must infer task, class-incremental must handle growing output). 7. Generalization across dataset sizes: repeat experiments with different amounts of data per task (e.g., 100 vs 1000 samples); show how forgetting/transfer scales with data availability. 8. Advanced: Implement variant-specific mitigation (e.g., task-incremental + replay, domain-incremental + domain classifier, class-incremental + dynamic expansion) and show how each variant’s natural weak point can be mitigated.

C.12 — Wasserstein Distance Estimation and Correlation with Model Error

Task: Compute Wasserstein distance (W-distance) between training and deployment feature distributions across multiple time windows and correlate it with model prediction error changes. Setup: Generate time series data ($T = 500$ steps) with controllable feature drift (same setup as C.10). At regular intervals (every 50 steps), compute: (1) Wasserstein-2 distance between reference window features ($t=0$ to 50) and recent window features ($t-50$ to $t$): $\text{W}_2 = (\min_\pi \mathbb{E}[\|x - x'\|^2]^{1/2}$ where $\pi$ ranges over joint distributions with fixed marginals (computed via optimal transport). (2) Model prediction error on recent window: $\text{Error}_t = \text{MSE}(\hat{y}_s, y_s)$ for samples in recent window. Compute correlation between W-distance and error change: $\text{Corr}(\text{W}_2, \Delta \text{Error})$ (does higher drift metric predict higher error?). Repeat experiments varying: (a) Drift magnitude ($\delta \in [0.1, 0.5, 1.0]$), (b) Feature projection dimension (compute W on raw features vs PCA-projected to $k \in [1, 5, 20]$ dimensions), (c) Different simpler drift metrics (Euclidean distance between means, Kullback-Leibler divergence). Compare Wasserstein vs other metrics in predicting error.

Purpose: Wasserstein distance is a principled geometric metric for distribution comparison, theoretically motivated by optimal transport. However, whether it correlates with model error is an empirical question—high drift doesn’t always imply high error (e.g., shift in less-relevant features). Understanding when geometric drift metrics are predictive vs uninformative is critical for building drift detection and model update systems. Three core concepts: (1) Geometric vs operational metrics: Wasserstein measures distance in feature space (how much distributions differ geometrically), while model error is an operational metric (how much predictions degrade); high W-distance is necessary but not sufficient for high error (depends on which features drift). (2) Dimensionality effect: W-distance in high dimensions is difficult to estimate (curse of dimensionality) and may not correlate with error; dimensionality reduction (PCA) can improve estimation and correlation. (3) Information-relevance gap: drift in features irrelevant for prediction doesn’t increase error; W-distance doesn’t distinguish relevant vs irrelevant features, but error-based metrics do. Governance applications: anomaly detection systems use distribution shift as a signal, but not all shifts indicate anomalies (e.g., seasonal shift in online traffic is expected); calibration monitoring uses W-distance to trigger recalibration, but empirical correlation must be validated; data quality monitoring uses shift detection, but correlation with actual data quality issues (bias, noise) is empirical.

ML Link: Connects to Example 9 (Wasserstein Distance and Distribution Shift) in Chapter 13. Also relates to Example 1 (Covariate Shift in Linear Regression): covariate shift with controllable magnitude is a perfect test case for W-distance vs error correlation (in linear models with drift in relevant features, correlation should be high; in irrelevant features, low). The optimal transport view behind W-distance links to domain adaptation risk bounds (Example 8) that use divergence metrics.

Hints: Compute Wasserstein-2 distance using scipy.spatial.distance.wasserstein_distance or via sliced Wasserstein approximation (faster). For high-dimensional features, apply PCA reducing to $k$ dimensions, then compute W on projected features. Estimate error on recent window via MSE on test samples. Compute correlation between W-distance timeseries and error timeseries using Pearson or Spearman correlation. Introduce projection step: compute PCA on training data, project all feature windows, compute W on codes. Visualize: (a) W-distance over time (showing increase with drift), (b) error over time (showing increase with drift), (c) scatter plot of (W-distance, error) coloring by time, with regression line showing correlation (slope should be positive if correlated), (d) correlation vs projection dimension (showing how dimensionality reduction affects correlation).

What mastery looks like: 1. Correctly compute Wasserstein-2 distance using optimal transport; verify on synthetic examples (e.g., two Gaussians with known W-distance vs numerical computation matches). 2. Empirical correlation quantification: compute and report Pearson/Spearman correlation between W-distance and error timeseries; report correlation coefficient and p-value. If $\text{corr} > 0.5$, drift is predictive; if <0.3, drift is not predictive. 3. Drift magnitude scaling: show how correlation varies with drift magnitude $\delta$ (larger drift should have higher correlation as relationship becomes clearer; small drift may be noise-dominated). 4. Relevance sensitivity: design experiment where drift occurs in relevant vs irrelevant features (e.g., drift in first 2 features that predict y, vs last 2 features that don’t); show correlation is high for relevant, low for irrelevant (demonstrating W-distance’s limitation). 5. Dimensionality impact: plot correlation vs projection dimension $k$; show optimal $k$ (too low loses information, too high makes W estimation noisy due to finite samples); report $k^*$ that maximizes correlation. 6. Comparison to simpler metrics: report correlation for W-distance vs Euclidean distance of means vs KL divergence; show W-distance may or may not outperform simpler methods (empirical comparison), discuss computational cost tradeoff. 7. Error prediction utility: use W-distance to predict error on future window; measure prediction accuracy (RMSE of predicted vs actual error); assess if W is useful as early warning for model degradation. 8. Advanced: Implement information-geometric metric that weights features by model sensitivity (e.g., Wasserstein in space weighted by gradient magnitude w.r.t. predictions); show improved correlation vs unweighted W-distance.

C.13 — Bias-Variance Tradeoff of Importance Weighting with Clipping

Task: Empirically study the bias-variance tradeoff of importance weighted (IW) risk estimation under covariate shift, with and without clipping. Setup: Similar to C.1 (covariate shift benchmark), but focus on weight clipping. Generate training distribution $P_{\text{train}}(x)$ and shifted test distribution $P_{\text{test}}(x)$ with shift magnitude $\delta$. Estimate density ratio $\hat{w}(x) = P_{\text{test}}(x) / P_{\text{train}}(x)$ (analytically for Gaussian, or via kernel density estimation). Compute risk estimates across clipping levels: (1) No clipping: $\hat{R}_{\text{IW}} = \frac{1}{m}\sum_i w(x_i) \ell(y_i, \hat{y}_i)$, (2) Clipping at level $c$: $\hat{R}_{\text{IW}}^{(c)} = \frac{1}{m}\sum_i \min(w(x_i), c) \ell(y_i, \hat{y}_i)$ for $c \in [1, 2, 5, 10, \infty]$. Measure risk estimation error via Monte Carlo: repeat experiment $B=100$ times with different data draws, record: (a) Bias: $\mathbb{E}[\hat{R}_{\text{IW}} - R^*_{\text{test}}]$ (unbiased estimator has bias=0, clipped has bias>0 since $\min(w, c) \leq w$), (b) Variance: $\text{Var}(\hat{R}_{\text{IW}})$ across trials, (c) Total MSE = bias² + variance. Also compute: (d) Effective sample size $\text{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$ (dropped weight mass due to clipping), (d) Weight histogram before/after clipping. Sweep shift magnitude $\delta$ and clipping level $c$, find optimal $c^*$ that minimizes bias² + variance for each shift.

Purpose: Clipping is a ubiquitous practical technique in importance weighting (used in reinforcement learning, domain adaptation, causal inference), motivated by the bias-variance tradeoff: unclipped weights can have infinite variance (heavy tails), while clipping introduces bias but reduces variance. Understanding the tradeoff empirically (where is the sweet spot for a given shift magnitude?) is essential for practitioners. Three core concepts: (1) Bias from clipping: clipped estimator is biased downward ($\min(w, c) \leq w$), bias grows with clipping redness tightness ($c$ too small = large bias). (2) Variance reduction from clipping: unclipped variance can diverge due to heavy-tailed weights, clipping caps individual terms, reducing variance. (3) Optimal clipping level: there exists $c^*(\delta)$ depending on shift magnitude and sample size; at small shifts, $c^* \approx \infty$ (no clipping needed), at large shifts, $c^*$ finite (clipping helps). Governance applications: off-policy evaluation in recommendation (A/B testing different policies from logged interaction data uses IW, clipping stabilizes estimates); causal inference in observational studies (propensity score weighting uses IW for imbalance correction, clipping prevents extreme weights from rare populations); bandit algorithms (Thompson sampling with IW estimates for context-dependent rewards clips weights to ensure stability).

ML Link: Relates to importance weighting and covariate shift in Chapter 13. Example 1 (Covariate Shift in Linear Regression) provides the theoretical foundation. Example 2 (Importance Weighting Correction) covers the mechanics. The bias-variance analysis connects to standard statistical tradeoffs (e.g., regularization in regression) but specific to the weighting context.

Hints: Implement Gaussian covariate shift as in C.1: $P_{\text{train}} = \mathcal{N}(0, I)$, $P_{\text{test}} = \mathcal{N}(\delta \mathbf{1}, I)$, density ratio $w(x) = \exp(\delta^T x - \frac{\|\delta\|^2}{2})$. Estimate risk: generate test data, evaluate model on it, compute weighted average loss. Bias: average of $\hat{R}^{(c)} - R_{\text{true}}$ over all Monte Carlo trials (ground truth $R_{\text{true}}$ computed on large test set). Variance: sample variance of $\hat{R}^{(c)}$ across seeds. MSE = bias² + variance. Effective sample size: $\text{ESS} = \frac{(\sum w_i)^2}{\sum w_i^2}$, normalized to percentage of original sample size $m$. Visualization: (a) Bias, Variance, MSE vs clipping level $c$ (three curves showing tradeoff), (b) Weight distribution before/after clipping (histogram on log scale showing heavy tail reduction), (c) MSE vs shift magnitude $\delta$ for different clipping strategies (showing optimal shifts with $\delta$), (d) Optimal clipping level $c^*(\delta)$ as function of shift (should increase with $\delta$ — larger shifts need more aggressive clipping).

What mastery looks like: 1. Correctly implement importance weighting risk estimation and clipping; verify unclipped estimate is unbiased (bias ≈ 0), clipped is biased downward. 2. Bias-variance tradeoff visualization: plot three curves (bias, variance, MSE) vs clipping level; show clear minimum for MSE at some $c^*$, with bias increasing toward left and variance dominating at right (no clipping). 3. Optimal clipping identification: for each shift magnitude $\delta$, identify $c^*$ minimizing MSE; report as function table or plot, showing $c^*$ increases with $\delta$ (larger shift = need more clipping). 4. Effective sample size degradation: plot ESS vs clipping level, showing dramatic drop to <10% of original at tight clipping, explaining variance reduction mechanism (most weight is in close-to-threshold samples). 5. Weight distribution analysis: display histogram/KDE of weights before and after clipping for a fixed shift magnitude; visualize heavy tail trimming. 6. Shift magnitude comparison: plot optimal clipping level $c^*$ vs shift $\delta$; show nonlinear relationship (large shifts have diminishing returns to tighter clipping, capturing the bias-variance frontier). 7. Computational cost tradeoff: measure algorithm runtime for different clipping levels (no difference in computation, but discuss how weighting affects downstream algorithm); show clipping has similar cost to no-clipping (practical advantage). 8. Advanced: Implement adaptive clipping using half the data to estimate optimal $c$, apply on other half; compare to oracle $c^*$; measure degradation from adaptivity vs oracle, showing data-driven selection is feasible.

C.14 — Online Learning with Delayed Labels and Drift Monitoring

Task: Simulate an online learning setting where labels arrive with stochastic delay, and compare how different strategies for handling delayed feedback affect model performance and drift detection reliability. Setup: Online learner receives features $\mathbf{x}_t$ at time $t$, makes prediction $\hat{y}_t$, and label $y_t$ arrives at random time $t + \delta_t$ where $\delta_t \sim \text{Poisson}(\lambda_{\text{delay}})$ with $\lambda_{\text{delay}} \in [0, 2, 5, 10]$ (mean delay in steps). Implement two underlying scenarios: (1) Stationary environment: no drift, optimal model is fixed; (2) Drifting environment: concept drift at $t^* = 500$, optimal model changes. Implement three update strategies: (a) Immediate update: whenever label $y_t$ arrives (at time $t + \delta_t$), retrain on all data up to that point (ignores temporal ordering, trains on misaligned data), (b) Queued batch update: accumulate arrived labels in queue, update once per window (every 50 steps); (c) Delayed alignment: buffer labels, only train on ($\mathbf{x}_{s}$, $y_s$) pairs where $s + \delta_s \leq t - \Delta$ (wait for both features and labels to be temporally aligned, discard very old data). At each time $t$, measure: (1) Instantaneous error $e_t = |\hat{y}_t - y_t|$ (computed when label is known), (2) Available rolling estimate $\hat{e}_{\text{rolling}}(t) = \text{mean}(e_s : y_s \text{ arrived by time } t, s \leq t)$ (computed from labels that have arrived by time $t$), (3) True rolling error $e_{\text{true}}(t) = \text{mean}(e_s : s \leq t)$ (ground truth, known in hindsight), (4) Underestimation bias: $\text{bias}(t) = \hat{e}_{\text{rolling}}(t) - e_{\text{true}}(t)$ (rolling estimate lags true error). Measure drift detection: compute drift statistic $D_t$ on features (MMD from C.4) to detect when $P(\mathbf{x}_t)$ changes, correlate with actual error increase. Evaluate: detection delay (how many steps after concept drift onset does rolling error estimate signal change?), lag of available error vs. true error, and whether different strategies affect lag.

Purpose: Delayed feedback is ubiquitous in production ML systems: in recommendation systems, engagement (user clicks) labels arrive seconds to hours after ranking; in fraud detection, fraud labels arrive when chargebacks are filed (days/weeks later); in healthcare, diagnostic outcomes arrive after treatment (weeks to months); in financial trading, P&L labels arrive end-of-day. Practitioners want to monitor model performance to detect degradation, but delayed labels create a fundamental challenge: the performance metric lags behind true performance by the expected delay, creating blindness to recent degradation. This exercise teaches: (1) Monitoring lag is unavoidable: if labels are delayed by $\Delta_{\text{avg}}$ steps, rolling error estimates are biased downward by approximately that delay (recent poor predictions haven’t been labeled yet, so they don’t contribute to rolling average), (2) Strategies have different lag characteristics: immediate update loses temporal alignment (trains on old data mixed with new labels), queued batching reduces chaos but maintains lag, delayed alignment preserves calibration but discards some data, (3) Proxy metrics can provide early warning: when delayed labels aren’t available, proxy metrics (prediction confidence, feature histogram changes) may signal drift before labels confirm it. Understanding and quantifying lag is critical for setting SLAs (e.g., “we tolerate up to 100 steps of monitoring lag before we alert on drift”) and designing hybrid monitoring systems combining proxy metrics and delayed ground truth.

ML Link: Extends online learning with feedback (Example 5: Regret in Online Learning) to include realistic delay. Definition 15 (Feedback Delay in Online Learning) discusses how delayed information affects regret bounds. Relates to practical monitoring systems (Chapter 19): production systems must account for label latency when designing monitoring dashboards and alerting policies. Connects to concept drift detection (Example 4) but now under label delay.

Hints: Simulate label delays: at time $t$, store $(\mathbf{x}_t, y_t, \text{arrival\_time} = t + \delta_t)$ in a queue. At each time step $t'$, extract all points with $\text{arrival\_time} \leq t'$, compute error $e_t = |y_t - \hat{y}_t|$ on those points. Rolling estimate: $\hat{e}_{\text{rolling}}(t') = \text{mean}(\{e_t : \text{arrival\_time} \leq t'\})$. Implement drift detector on features: compute MMD of feature distribution in sliding window [t-200, t] vs. [0, 500] (pre-drift baseline); flag drift if MMD exceeds threshold. Dataset: synthetic time series or MNIST with artificial drift (class distribution shifts at t*). Plot: (1) instantaneous error $e_t$ vs time with change point marked (true error), (2) rolling estimate $\hat{e}_{\text{rolling}}(t)$ and ground truth rolling error $e_{\text{true}}(t)$ overlaid, (3) underestimation bias $\text{bias}(t) = \hat{e}_{\text{rolling}}(t) - e_{\text{true}}(t)$ over time (should be negative, showing rolling estimate lags), (4) feature drift detector statistic $D_t$ overlaid with error increase. Measure detection delay as number of steps from change point $t^*$ to when $\hat{e}_{\text{rolling}}(t)$ increases above threshold (or to when $D_t$ signals drift).

What mastery looks like: 1. Correct label delay simulation: delays distributed Poisson with specified mean; verify empirical mean delay matches parameter across all trials. 2. Lag visualization: plot instantaneous error, rolling estimate, and ground truth rolling error; show clear lag between true error and rolling estimate proportional to mean delay (lag ≈ 2-3× mean delay due to accumulation). 3. Underestimation bias quantification: compute $\text{bias}(t) = \hat{e}_{\text{rolling}}(t) - e_{\text{true}}(t)$ for each time step; average over stable periods (no drift), report mean bias; should be negative and approximately scale linearly with delay magnitude. 4. Detection delay under label latency: measure time from concept drift onset $t^*$ to when rolling estimate detects (increases above threshold); compare across three update strategies and three delay settings ($\lambda_{\text{delay}}$ values). Quantify: for zero delay, detection ~10-20 steps; for delay=5, detection ~25-35 steps (lag approximately added to detection delay). 5. Strategy comparison: implement all three update strategies; show that delayed alignment (buffers for proper ordering) achieves most stable rolling estimate, while immediate update has high variance (trains on misaligned data). Report MSE between rolling estimate and ground truth rolling error for each strategy. 6. Feature drift vs. label drift: plot feature drift detector (MMD signal) and compare to delayed error detection; show that features drift earlier (immediately when distribution changes), while error drift is delayed by label latency. Quantify: feature drift signal appears at $t = t^* + w$ (feature window size), error drift appears at $t = t^* + \Delta_{\text{avg}}$ (later). 7. Compensation strategy: implement hybrid monitoring combining feature drift (immediate) + delayed error labels; show that hybrid alerts on drift faster than error-only monitoring, reducing blind spot window. 8. Advanced: Implement algorithms that are robust to label delay (e.g., using importance weights to correct for missing labels, or uncertainty quantification to account for incomplete feedback); compare to naive approaches, showing whether algorithmic robustness can reduce detection lag.

C.15 — Dynamic Regret under Smooth Parameter Drift

Task: Extend C.5 to compare fixed step size online convex optimization with adaptive step size algorithms under smoothly drifting optimal parameters and analyze how dynamic regret scales with drift speed. Setup: Same as C.5 (online convex optimization with time-varying losses) but control drift parametrization explicitly. Optimal parameter at time $t$: $\mathbf{w}_t^* = \mathbf{w}_0^* + \mathbf{B}(t/T)$ where $\mathbf{B}(s)$ is a smooth path from 0 to final drift $\mathbf{B}_{\text{final}}$, with $s \in [0, 1]$ normalized time. Parameterize drift paths: (1) Linear: $\mathbf{B}(s) = \mathbf{B}_{\text{final}} \cdot s$, (2) Cubic smoothing: $\mathbf{B}(s) = \mathbf{B}_{\text{final}} \cdot (3s^2 - 2s^3)$, (3) Piecewise: step change then plateau. Compute path length $L_T = \int_0^T \|\frac{d}{dt}\mathbf{w}_t^*\| dt \approx \sum_t \|\mathbf{w}_{t+1}^* - \mathbf{w}_t^*\|$ (total distance optimum traverses). Loss function: $\ell_t(\mathbf{w}) = \frac{1}{2}\|\mathbf{w} - \mathbf{w}_t^*\|_2^2$ (strongly convex, transparent). Implement: (1) OGD with fixed step size: $\mathbf{w}_{t+1} = \text{Proj}(\mathbf{w}_t - \alpha \nabla \ell_t)$ with $\alpha = 0.01$, (2) Adaptive (Adagrad-style): $\mathbf{w}_{t+1} = \text{Proj}(\mathbf{w}_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \odot \nabla \ell_t)$ where $G_t[i] = \sum_s g_s[i]^2$, $\epsilon = 0.01$. Horizon: $T = 1000$. Compute: (1) Static regret: $\text{Reg}_{\text{static}} = \sum_t (\ell_t(\mathbf{w}_t) - \min_{\mathbf{w}} \sum_t \ell_t(\mathbf{w}))$, (2) Dynamic regret: $\text{Reg}_{\text{dynamic}} = \sum_t (\ell_t(\mathbf{w}_t) - \ell_t(\mathbf{w}_t^*))$. Sweep path lengths $L_T \in [0.5, 1, 2, 5]$ and measure how both regrets scale. Verify: theory predicts $\text{Reg}_{\text{dynamic}} = O(L_T + \sqrt{T})$ (linear in path length, square root in horizon).

Purpose: Static regret only compares to the best fixed expert, ignoring drift; under heavy drift, static regret can be artificially low (because the best fixed expert performs poorly for all time). Dynamic regret measures tracking performance against a time-varying target, revealing whether the learner can keep pace with drift. This exercise teaches: (1) Static vs. dynamic regret: static regret $\approx$ quality relative to a fixed competitor; dynamic regret $\approx$ tracking lag; under drift, dynamic regret is much larger, (2) Regret scales with path length: the more the optimum drifts ($L_T$ increases), the harder the problem; regret grows (at minimum linearly in $L_T$ by information-theoretic lower bounds), (3) Adaptive learning rates: algorithms that adjust step size per coordinate (Adagrad) can trade between fast initial progress and stability under drift; sometimes help (fewer hyperparameter tuning), sometimes hurt (learning rate shrinks over time, becomes too conservative under drift). Understanding dynamic regret is critical for streaming ML: credit scoring models adapt as populations change ($L_T$ quantifies how much credit profiles drift), ranking algorithms face shifting user preferences ($L_T$ large = hard tuning problem), and monitoring needs to account for regret growth ($L_T$).

ML Link: Implements dynamic regret framework extending Example 5 (Regret in Online Learning). Definition 12 (Dynamic Regret) formalizes the concept with path length measure. Relates to online convex optimization theory (Zinkevich, 2003 on online gradient descent with time-varying functions) and adaptive methods (Adagrad, McMahan & Streeter 2010). Shows empirically that tight adaptation to data affects regret trajectory.

Hints: Parametrize smooth paths using spline or polynomial interpolation: linear is simple ($B(s) = B_{\text{final}} \cdot s$), cubic smoothing uses Hermite basis $B(s) = B_{\text{final}} \cdot (3s^2 - 2s^3)$. Compute path length numerically: $L_T = \sum_{t=0}^{T-1} \|\mathbf{w}_{t+1}^* - \mathbf{w}_t^*\|$. Implement both learners via simple loops. For static regret, compute $\min_{\mathbf{w}} \sum_t \ell_t(\mathbf{w})$ via least-squares fitting to all time steps (in hindsight, oracle knowledge). For dynamic regret, know $\mathbf{w}_t^*$ exactly (since it’s synthetic). Plot: (1) dynamic vs. static regret curves for fixed drift (show dynamic is much larger), (2) cumulative regret (x-axis = time, y-axis = $\sum_{s=1}^t \text{regret}_s$) for different path lengths (show linear growth proportional to $L_T$), (3) per-step regret $\ell_t(\mathbf{w}_t) - \ell_t(\mathbf{w}_t^*)$ showing early-time lag as learner catches up, later-time lag as optimum drifts faster than learner. Fit power law: $\text{Reg}_{\text{dynamic}} \propto L_T^a$; estimate $a$ (expect $a \approx 1$ for linear scaling).

What mastery looks like: 1. Correct path parametrization: different path types (linear, cubic, piecewise) have similar or different path lengths depending on curvature; verify path lengths match specifications. Smooth paths (cubic) should have similar lengths to linear (same total displacement), but jerky paths may have longer effective lengths (if optimum changes direction violently). 2. Regret separation: static and dynamic regret coincide when $L_T = 0$ (no drift); as drift increases, dynamic regret grows faster (roughly linearly in $L_T$), while static regret is less sensitive (depends more on $\sqrt{T}$ term). Quantify: for $L_T = 5$, dynamic regret ~5-10× larger than static. 3. Scaling verification: fit power law $\text{Reg}_{\text{dynamic}} = c \cdot L_T^a + b \cdot \sqrt{T}$; report coefficients and exponent $a$. Theory predicts $a = 1$ (linear); should be close ($0.9 < a < 1.1$). Lower exponent suggests algorithm is adapting to drift, higher exponent suggests drift hurts worse than linear. 4. OGD vs. Adaptive comparison: plot both learners’ dynamic regret for each path length; show whether adaptive algorithm achieves better or worse regret. Adaptive may be better (learns faster early) or worse (learning rate shrinks, can’t catch up to drift). Report ratio of adaptive to OGD regret. 5. Step size sensitivity: for OGD, sweep $\alpha \in [0.001, 0.01, 0.1]$; plot dynamic regret vs. $\alpha$. There should be a U-shaped curve (too small $\alpha$ = slow tracking, large regret; too large = oscillation, also large regret). Report optimal $\alpha^*$ and regret at optimum. 6. Per-step regret trajectory: plot per-step dynamic regret $\ell_t(\mathbf{w}_t) - \ell_t(\mathbf{w}_t^*)$ over time; early steps should show low per-step regret (learner near optimum), later steps should show higher per-step regret (learner lags drifting optimum). If path is smooth and learner adapts well, per-step regret should be roughly constant (good); if spiky, per-step regret should increase over time (bad). 7. Comparison to lower bounds: discuss information-theoretic lower bound $\Omega(L_T)$ on dynamic regret; show empirically whether the learner approaches this bound or is far from it. Quantify: regret / $L_T$ (should be constant if linear scaling holds). 8. Advanced: Implement algorithm that estimates drift speed online and adapts step size dynamically (shrink step size if drift detected); compare to fixed-$\alpha$ and Adagrad, showing whether explicit drift adaptation improves tracking.

C.16 — Elastic Weight Consolidation for Continual Learning

Task: Implement Elastic Weight Consolidation (EWC) to mitigate catastrophic forgetting in continual learning and analyze the stability-plasticity tradeoff. Setup: Sequential classification tasks (e.g., MNIST class-incremental: task 1 = digits 0-1, task 2 = 2-3, …, task 5 = 8-9). Train shared neural network (2-layer MLP, 256 hidden units). For each task $t$: (1) train on task $t$ using standard cross-entropy loss, (2) after convergence, compute Fisher information matrix $F_t[i,i] = \mathbb{E}[(\frac{\partial \log p(y|\mathbf{x})}{\partial w_i})^2]$ (diagonal approximation, importance of parameter $w_i$ for task $t$), (3) when training task $t+1$, add EWC regularization to loss: $\mathcal{L}_{t+1} = \mathcal{L}_{\text{task}}(\mathbf{w}) + \frac{\lambda}{2} \sum_i F_t[i,i] (w_i - w_i^{(t)})^2$ where $\lambda$ is regularization strength, $\mathbf{w}^{(t)}$ are parameters from task $t$. Accumulate Fisher across tasks: for task $t+2$, use combined importance $F_{\text{combined}} = F_t + F_{t+1}$ (cumulative importance). Sweep $\lambda \in [0, 0.1, 0.5, 1.0, 5.0]$ and measure: (1) Per-task accuracy: accuracy on each task after each training step, track in $T \times T$ matrix (rows = tasks, columns = after learning which task), (2) Forgetting: $F_t = \max(a_{t,t} - a_{t,T}, 0)$ (accuracy drop on task $t$ from after task $t$ to final model after task $T$), (3) Forward transfer: whether task $t$ helps task $t+1$ (measure accuracy on task $t+1$ using warm-start from task $t$ vs. from-scratch on task $t+1$ only), (4) Average accuracy: mean over all tasks after learning all $T$ tasks. Baselines: (a) $\lambda = 0$ (no EWC, pure fine-tuning), (b) task-specific heads (separate output head per task, upper bound on performance, no forgetting).

Purpose: Continual learning faces the fundamental stability-plasticity dilemma: preserve knowledge of old tasks (stability) while learning new tasks effectively (plasticity). EWC protects important parameters on old tasks when learning new tasks, trading off new task performance for old task retention. This exercise teaches: (1) Parameter importance matters: not all parameters equally important for each task; Fisher information pinpoints which parameters should be protected, (2) Regularization as knowledge preservation: quadratic penalty on important parameters prevents their drift, but at cost of constrained optimization (larger $\lambda$ = stronger constraints = less plasticity on new task), (3) Hyperparameter $\lambda$ controls the frontier: low $\lambda$ = high plasticity, low stability (original fine-tuning); high $\lambda$ = high stability, low plasticity (can’t learn new tasks). The exercise reveals the frontier and typical operating points. In production: continual NLP updates (models updated on new text, must preserve language understanding), vision models (new data domains added, must preserve old domains), recommendation updates (new user behavior patterns, preserve old preferences).

ML Link: Implements EWC approach (mentioned in Example 7: Catastrophic Forgetting in Neural Networks) as mitigation strategy. Definition 13 (Catastrophic Forgetting) formally defines the phenomenon; EWC directly addresses it. Relates to multi-task learning and meta-learning (implicit in the Fisher information-based importance weighting).

Hints: Compute Fisher diagonal: collect gradient samples on task $t$ training data, compute $\nabla_w \log p(y|x)$ for each sample, square element-wise, average: $F[i,i] = \frac{1}{m}\sum_k (\nabla_w \log p(y_k|x_k))_i^2$. Use PyTorch: torch.autograd.grad to compute gradients, accumulate via torch.sum(grads**2). After each task, store Fisher and weight checkpoint. When training new task, add EWC loss term: $\lambda_{EWC} = \lambda \cdot 0.5 \cdot \sum_i F[i,i] \cdot (w_i - w_i^{\text{old}})^2$. Datasets: MNIST with split (MNIST splits into 5 tasks of 2 classes each). Sweep $\lambda$ and plot accuracy heatmap: x-axis = task number $t$ (1 to 5, which task was trained), y-axis = task number $i$ (1 to 5, evaluated on which task), entry = accuracy on task $i$ after training tasks 1 through $t$. Compute forgetting as diagonal degradation (drop in accuracy moving down a column). Compute forward transfer as improvement in off-diagonal entries (does training task 2 help task 1 performance? typically minor).

What mastery looks like: 1. Correct Fisher computation: verify Fisher values are positive (symmetric positive semi-definite); check that frequently-used parameters have high Fisher values (interpretability). 2. Forgetting reduction: without EWC ($\lambda=0$), forgetting should be severe (can verify: for MNIST, task 1 accuracy drops from ~95% to ~60% after learning tasks 2-5); with large $\lambda$, forgetting should be minimal (<5%). 3. Stability-plasticity frontier: plot multi-objective: x-axis = average accuracy on old tasks, y-axis = average accuracy on new tasks, each point is a $\lambda$ value. Should show Pareto frontier: low $\lambda$ has good new task accuracy but poor old task retention, high $\lambda$ has good old task retention but poor new task learning. 4. Heatmap analysis: visualize per-task accuracy heatmap for $\lambda = 0, \lambda^*, \lambda = 5$. At $\lambda = 0$, diagonal should be high but off-diagonal below decreases sharply (forgetting). At $\lambda^*$ some balance. At $\lambda = 5$, diagonal stays high but off-diagonal entries (new tasks) may also stay low (can’t learn new tasks well). 5. Forgetting curve: plot forgetting metric $F_t$ vs. task $t$ for different $\lambda$ values. For $\lambda = 0$, forgetting should increase over time (task 1 forgets more after task 5 than after task 2). For high $\lambda$, forgetting should be flat and low. 6. Forward transfer analysis: measure whether training earlier tasks helps later tasks; compute accuracy on task $t+1$ in two conditions: (a) warm-start from task $t$ (standard continual setup), (b) train from scratch on task $t+1$ only. Forward transfer = improvement in (a) vs. (b). EWC should not hurt forward transfer (same warm-start initialization as baseline). 7. Optimal $\lambda$ selection: identify $\lambda^*$ that balances retention and learning, e.g., minimize $\text{forgetting} + \alpha \cdot (1 - \text{avg accuracy on new tasks})$ for tunable $\alpha$. Report $\lambda^*$ and resulting performance metrics. 8. Advanced: Implement Synaptic Intelligence (SI) or Memory-Aware Synapses (MAS) as alternatives to Fisher-based importance; compare forgetting curves, showing whether alternative importance measures improve over EWC.

C.17 — Exploration Strategies to Mitigate Feedback-Induced Shift

Task: Extend C.7 (feedback loop simulation) to compare exploration strategies that prevent algorithmic amplification of user preferences. Setup: Recommender system recommending items to users. User $u$ has fixed true preferences $\mathbf{u} \in \mathbb{R}^d$; items have features $\mathbf{i} \in \mathbb{R}^d$. Utility: $\text{utility}_{u,i} = \mathbf{u}^T \mathbf{i} + \epsilon$ (linear utility + observation noise). User observes recommended items, provides feedback (click = 1 if utility > threshold, 0 otherwise), system learns from clicks and retrains. But unlike C.7, users also adapt: after repeated exposure to certain items, users’ preferences drift slightly toward exposed items: $\mathbf{u}_{t+1} = \mathbf{u}_t + \beta \cdot (\text{feedback} - 0.5) \times \mathbf{i}_t$ [user is influenced by recommendations]. Implement four recommendation strategies: (1) Greedy: always recommend $\arg\max_i \hat{\mathbf{u}}_{\text{current}}^T \mathbf{i}$ (no exploration), (2) $\epsilon$-exploration: with probability $\epsilon = 0.1$, recommend random item; else recommend top-1, (3) Thompson sampling: sample utility $\hat{\mathbf{u}} \sim \mathcal{N}(\hat{\mathbf{u}}_{\text{MAP}}, \Sigma_{\text{posterior}})$, recommend $\arg\max_i \hat{\mathbf{u}}^T \mathbf{i}$, (4) Diversity regularization: recommend $\arg\max_i [\hat{\mathbf{u}}^T \mathbf{i} - \gamma \cdot \text{similarity}(i, \text{recent})]$ (discourage repetition). Run $T = 500$ interactions. Measure: (1) Item coverage: fraction of distinct items shown during window $[t-100, t]$, higher = more diverse, (2) User preference drift: $\|\mathbf{u}_t - \mathbf{u}_0\|_2$ over time (how much user preferences change from initial due to system exposure), (3) Data distribution entropy: Shannon entropy of recommendation frequency (balanced = high entropy, concentrated = low entropy), (4) Cumulative utility: $\sum_t \text{utility}_{u,i_t}$ (total user satisfaction under true preferences). Report per-strategy.

Purpose: Feedback loops create endogenous shifts: the system shapes the very data it trains on, potentially trapping users and system in narrow niches. The question: can exploration break the loop and stabilize the system? This exercise teaches: (1) Feedback loops are fundamentally different from exogenous shift: the shift is caused by the system’s decisions, so passive adaptation (model updating) is insufficient; must intervene in decision-making (exploration), (2) Exploration cost-benefit: exploration (showing suboptimal items by greedy standard) costs short-term utility but gains long-term stability; benefit depends on how malleable user preferences are ($\beta$ parameter), (3) Diversity emerges from various mechanisms: $\epsilon$-exploration is random, Thompson sampling trades off exploitation vs. exploration based on uncertainty, diversity regularization directly penalizes repetition; each has different properties, (4) Measurement matters: greedy often looks good on short-term metrics (current utility) but fails on long-term metrics (user preference drift, eventual item coverage collapse). Understanding feedback loops is critical for recommendation system ethics: algorithmic amplification can unintentionally trap users in filter bubbles or optimize for engagement at cost of diversity/serendipity.

ML Link: Extends feedback-induced distribution shift (Example 10: Feedback Loop Amplification in Recommender Systems). Related to multi-armed bandit theory (exploration-exploitation tradeoff), causal reasoning (recognizing that recommendations affect future data), and fairness/diversity (endogenous shifts can systematically disadvantage certain items). Connects to continual learning under drift (C.3, C.7).

Hints: Simulate: true preferences $\mathbf{u}_0 \sim \mathcal{N}(0, I_d)$ with $d=20$, items sampled from $\mathcal{N}(0, I_d)$ (100 items total), user preference adaptation weight $\beta = 0.01$ (small, so preferences drift slowly). Recommender trains logistic regression on history of ($\mathbf{i}$, click) pairs every 50 steps. Implement all four strategies. Coverage: count distinct items shown in window, divide by total item pool (100). User drift: $\ell_2$ distance from $\mathbf{u}_0$. Entropy: compute probability of each item being recommended, compute $-\sum_i p_i \log p_i$. Utility: use simulated clicks (drawn from true logistic model $\sigma(\mathbf{u}^T \mathbf{i})$). Plot: (1) coverage over time (greedy drops, exploration maintains), (2) user drift (greedy shows largest, exploration shows smallest), (3) recommendation entropy (greedy low, exploration high), (4) cumulative utility (greedy high early, crashes; exploration steady, mid-range). Use multiple random seeds (10+) for confidence bands.

What mastery looks like: 1. Greedy baseline shows feedback loop: low coverage (converges to <20 items), high user drift ($> 0.5$), low entropy, initial high utility that decreases as system specializes. By $t=500$, greedy recommendation quality may drop (utility decreases) due to user adaptation mismatch. 2. Exploration strategies stabilize system: $\epsilon$-exploration and Thompson sampling maintain higher coverage (30-60 items), lower drift (<0.2), high entropy. Diversity regularization shows intermediate behavior (moderate coverage improvement). 3. Coverage-drift tradeoff: plot coverage vs. user drift across strategies; show inverse relationship (higher coverage correlates with lower drift, suggesting diverse recommendations prevent preference capture). 4. Long-term utility comparison: despite short-term cost, exploration strategies often achieve comparable or better long-term cumulative utility (greedy crashes, exploration stays stable). Plot cumulative utility trajectory, showing greedy peaks then declines, exploration steady. 5. Strategy ranking: quantify tradeoff with multi-objective metric, e.g., $\text{score} = \text{final coverage} - 0.1 \times \text{user drift}$; rank strategies. Thompson sampling often strong (balances exploration and exploitation adaptively). 6. Sensitivity to user adaptability $\beta$: vary $\beta \in [0, 0.01, 0.05, 0.1]$ (how easily users’ preferences change with exposure); for large $\beta$, exploration becomes critical (greedy fails catastrophically), for small $\beta$, amplification weak (user preferences rigid, less affected by recommendations). Quantify required exploration intensity as function of $\beta$. 7. Optimal exploration level: for $\epsilon$-exploration, sweep $\epsilon \in [0, 0.05, 0.1, 0.2]$; find $\epsilon^*$ that optimizes long-term stability vs. short-term utility. Plot Pareto frontier (current utility vs. long-term drift stability). 8. Advanced: Implement adaptive exploration (estimate user adaptability $\beta$ online, adjust exploration rate $\epsilon_t$ dynamically; higher $\beta$ = higher $\epsilon_t$). Compare to fixed-$\epsilon$, showing whether adaptivity improves stability without sacrificing short-term utility.

C.18 — Pipeline Shift Detection and Diagnosis

Task: Build an experiment where data preprocessing gradually changes (simulating pipeline shifts in deployment), then implement diagnostics to localize the source of shift. Setup: ML pipeline with explicit steps: (1) data loading, (2) missing value imputation (fill NaN with column mean), (3) feature standardization (z-score with training-set mean and std), (4) feature engineering (optional: add polynomial terms), (5) model prediction. Train baseline model on clean training data. At test time, introduce gradual shifts in preprocessing: (a) Feature scaling shift: at $t = 300$, apply unknown scaling to features (e.g., multiply by factor drawn from $\mathcal{U}(0.5, 1.5)$, simulating sensor re-calibration), (b) Missingness pattern shift: at $t = 300$, gradually increase missingness probability from 5% to 20% (simulating data quality degradation), (c) Feature corruption: at $t = 300$, add random noise to specific features (simulating sensor malfunction). At each time $t$, measure: feature distributions (mean, std, quantiles, max, min), missingness rate for each feature, model accuracy. Implement three diagnostic approaches: (1) Feature-level drift detectors: for each feature, compute KS test statistic between training-set feature distribution and test-set feature distribution; flag features with $p < 0.05$, (2) Joint distribution detector: MMD on full feature space (C.4), (3) Residual detector: measure model residual distribution (mean, magnitude) on predictions. Compare three diagnosis conclusions: (a) No diagnostics (naive): assume model is incorrect, attribute degradation to model mismatch, (b) Pipeline audit: systematically test each preprocessing step (compare feature distributions after each step, pre-shift vs. post-shift), (c) Feature attribution: identify which features shifted most (highest KS statistic), cross-reference with feature importance (via permutation importance or SHAP), identify if important features shifted.

Purpose: In production, data pipelines are often as complex as ML models, and shifts in preprocessing (scaling, imputation logic, schema changes) can cause performance degradation. Practitioners must diagnose whether degradation is due to model mismatch or pipeline issues, because mitigation differs: pipeline issues are often fixable by engineering (re-calibrate scaler, fix imputation), while model mismatch requires retraining or domain adaptation. This exercise teaches: (1) Pipelines are part of the system: ML monitoring should include preprocessing, not just model, (2) Feature-level diagnosis: since pipeline shifts often affect specific features (one sensor re-calibrated, not all), feature-wise diagnostics can localize problems faster than global metrics, (3) Causality and inference: understanding whether key features shifted + are important = likely cause of performance drop, while unimportant shifted features = unlikely cause. The exercise develops systematic diagnostic thinking.

ML Link: Extends drift detection (C.4) to include preprocessing. Related to operational ML systems (Chapter 19: monitoring and data quality). Connects to feature engineering and data validation (concepts often not emphasized in academic ML but critical in practice).

Hints: Implement pipeline explicitly: $\text{features} = \text{preprocess}(\text{raw\_data})$, $\text{prediction} = \text{model}(\text{features})$. Store features after each preprocessing step. Simulate shifts: after $t = 300$, apply changes (e.g., scale features by factor, increase missingness probability, add noise). Drift detection: (1) KS test per feature: compute empirical CDF on training vs. test features, compute Kolmogorov-Smirnov statistic (max difference in CDFs), compare to critical value for $\alpha = 0.05$, report p-value, (2) MMD: kernel-based statistic on full feature space. Diagnostics: (a) feature-level KS: produce table of KS statistics for each feature, sorted by magnitude; identify which features exceed threshold (these are “shifted”), (b) pipeline audit: compare feature distribution before-vs-after each preprocessing step, flag step where distribution changes from pre- to post-shift, (c) feature importance: compute permutation importance (train model, shuffle one feature at a time, measure accuracy drop; bigger drop = more important), cross-reference shifted features (high KS) with importance, if both high = likely cause. Dataset: synthetic (generate tabular data with controllable shift) or real (use a public dataset like titanic or UCI). Plot: (1) feature distributions (histogram or box plot) pre- and post-shift for each feature, highlighting shifted features, (2) KS statistic heatmap (rows = features, columns = time windows, color = KS stat), show when shift starts for each feature, (3) feature importance vs. shift magnitude (scatter plot), show correlation.

What mastery looks like: 1. Correct shift injection: verify that introduced shifts are observable in feature distributions (e.g., scaling shift increases variance, missingness shift increases NaN count, corruption increases noise). Document shift specifications. 2. Feature-level drift detection: implement KS test for each feature; for scaled features, KS statistic should spike post-shift; for missingness shift, missingness count should spike; for corruption, feature std should increase. Produce ranked list of shifted features (sorted by KS statistic). 3. Diagnosis accuracy: correctly identify source of shift (scaling, missingness, or corruption) by examining which features and what type of change (mean vs. variance vs. distribution shape). For scaling: variance increases (std scales), mean may shift; for missingness: NaN count increases; for corruption: outliers increase, std increases. 4. Pipeline audit: trace shift backwards through pipeline steps; identify at which step(s) the distribution changes from pre- to post-shift. For scaling shift applied after standardization step, should be detected at the final features; for missingness change applied during imputation, should be detected after imputation step. 5. Root cause localization: using feature importance + KS statistics, identify high-importance features that also show high shift; these are likely culprits for performance degradation. Cross-validate: if you claim shift in feature X caused performance drop, verify that (a) feature X’s distribution shifted detectably, (b) permutation importance of feature X is high, (c) model predictions change significantly when feature X is perturbed. 6. Mitigation recommendation: for each shift type, propose remedy: if scaling shift, re-calibrate scaler on recent data; if missingness, adjust imputation; if corruption, investigate upstream data source. Describe how each mitigation would be implemented. 7. Generalization: test diagnostic workflow on unseen shift type (e.g., introduce outliers, or duplicate a feature); show whether diagnostic approach generalizes (may need retraining on new shift type, but approach should be sound). 8. Advanced: Implement automated diagnostic system (decision tree or rule-based system that maps observed patterns—high variance increase + certain features affected—to likely shift causes); test on multiple shift types, report diagnostic precision and recall.

C.19 — Robustness Comparison: Reweighting vs. Alignment under Concept Drift

Task: Implement two domain adaptation methods (importance reweighting and representation alignment) under covariate shift with varying levels of concept drift, and compare robustness to violations of the covariate shift assumption. Setup: Covariate shift baseline: $P_{\text{source}}(\mathbf{x}) \neq P_{\text{target}}(\mathbf{x})$ but $P_{\text{source}}(y|\mathbf{x}) = P_{\text{target}}(y|\mathbf{x})$ (conditional invariant). Violate covariate shift assumption by introducing mild concept drift: $P_{\text{target}}(y|\mathbf{x}) = P_{\text{source}}(y|\mathbf{x}) \cdot (1 + \delta \cdot z(\mathbf{x}))$ where $\delta \in [0, 0.01, 0.05, 0.1]$ is violation magnitude, $z(\mathbf{x})$ is smooth perturbation (e.g., $z(\mathbf{x}) = \sin(2\pi x_1 / \text{scale})$ or $z(\mathbf{x}) = \text{sign}(x_1)$ to create asymmetric drift). Implement two domain adaptation methods: (1) Importance reweighting (Class reweighting from C.1/C.13): estimate density ratio $\hat{w}(\mathbf{x}) = P_{\text{target}}(\mathbf{x}) / P_{\text{source}}(\mathbf{x})$ (via kernel density estimation or logistic classifier discriminator), apply weight clipping (C.13) with learned clipping threshold $c$, train classifier on reweighted source data: $\min_f \sum_i w(\mathbf{x}_i) \ell(f(\mathbf{x}_i), y_i)$, (2) Representation alignment (Domain-adversarial neural network, DANN): learn feature extractor $\phi(\mathbf{x})$ that makes source and target features indistinguishable (via adversarial training: auxiliary domain classifier tries to distinguish source vs. target, feature extractor tries to fool it), then train task classifier on aligned representations. Measure under each $\delta$ condition: (1) accuracy on target test set, (2) calibration error (Brier score, Expected Calibration Error), (3) graceful degradation (plot accuracy vs. $\delta$, measure degradation slope). Test on synthetic data ($d=20, n_{\text{source}}=2000, n_{\text{target}}=1000$) and (optionally) real domain adaptation benchmarks (e.g., ImageNet subsets, MNIST variants).

Purpose: Covariate shift and concept drift are at extremes of a spectrum; in practice, both may occur partially. Methods designed for covariate shift may fail under joint shift. This exercise teaches: (1) Method assumptions and robustness: reweighting assumes conditional invariance; when violated, density ratio estimation becomes unreliable (weights may be biased or high-variance), while alignment may generalize better (learning shared semantics can be robust to small conditional changes), (2) Graceful vs. catastrophic degradation: ideally methods degrade smoothly from good to bad performance as assumption violations increase; catastrophic degradation (sharp performance cliff) indicates brittle assumptions, (3) Calibration under assumption violations: reweighting may sacrifice calibration even when accuracy is maintained (because weighted likelihood is biased), alignment may maintain calibration better (because aligned features preserve class separability). The exercise develops intuition for when each method is appropriate and how to diagnose failures.

ML Link: Compares domain adaptation methods (Example 8: Domain Adaptation Risk Decomposition). Shows empirically how theoretical bounds break down. Relates to transfer learning robustness and practical deployment (methods assumed to work under clean assumptions often fail in messy reality).

Hints: Synthetic setup: source $\mathbf{x} \sim \mathcal{N}(0, I_d)$, target $\mathbf{x} \sim \mathcal{N}(\mu, \Sigma)$ (different mean/covariance), true model $y = \sigma(w_0^T \mathbf{x})$ on source, target: $P(y=1|\mathbf{x}) = \sigma(w_0^T \mathbf{x}) \cdot (1 + \delta \sin(2\pi x_1))$. For reweighting: estimate density ratio via logistic classifier on $\{(\mathbf{x}_i, 0)\}_{i \in \text{source}} \cup \{(\mathbf{x}_j, 1)\}_{j \in \text{target}}$, extract $\hat{w} = \hat{P}(y=1|\mathbf{x}) / \hat{P}(y=0|\mathbf{x})$, apply clipping as in C.13. For DANN: define task network (classifier $f$) and adversarial network (domain classifier $d$), alternately train: $\min_f -\sum_i \ell(f(\mathbf{x}_i), y_i) + \lambda \sum_j d(\phi(\mathbf{x}_j))$ for source samples and $\max_f - \lambda \sum_k d(\phi(\mathbf{x}_k))$ for target (adversarial). Plot: (1) accuracy vs. $\delta$ for both methods (should decrease, ideally smoothly), (2) per-method Brier score vs. $\delta$ (calibration degradation), (3) weight statistics (for reweighting) or aligned feature distribution (for DANN) as function of $\delta$. Sweep $\delta$ and fit polynomial or exponential curve to degradation, compare slopes.

What mastery looks like: 1. Clean covariate shift ($\delta = 0$): both methods should perform similarly well (high accuracy, good calibration), near oracle performance on target. Quantify: accuracy > 0.85 for both. 2. Graceful degradation: as $\delta$ increases, accuracy of both methods decreases smoothly (not with discrete jumps). Measure degradation rate: accuracy loss per unit $\delta$. For typical setup, both should lose ~5-10% accuracy per 0.05 units of $\delta$ (rough estimate, depends on problem). 3. Method robustness comparison: identify which method (reweighting or alignment) is more robust to concept drift violations. Hypothesis: alignment often more robust (doesn’t assume conditional invariance explicitly), but may depend on implementation. Quantify: at $\delta = 0.05$, which method has higher accuracy? Which maintains better calibration? 4. Breaking point identification: for each method, identify $\delta^*$ where performance becomes unacceptable (e.g., accuracy < 70% or random baseline). Report $\delta^*_{\text{reweighting}}$ vs. $\delta^*_{\text{alignment}}$. If alignment is more robust, $\delta^*_{\text{alignment}} > \delta^*_{\text{reweighting}}$. 5. Calibration analysis: plot Brier score (or ECE) vs. $\delta$ for both methods. Reweighting may show larger calibration degradation (because density ratio assumption is violated), while alignment may have separate Brier score trajectory (different robustness profile). 6. Weight statistics (for reweighting): track weight distribution as $\delta$ increases; under assumption violations, weight distributions may become more concentrated (high variance, few large weights dominate) or skewed, indicating reliability issues. Plot histogram of weights for several $\delta$ values. 7. Mixture approach: implement hybrid method that blends reweighting and alignment (e.g., weighted combination of predictions, or use importance weights in aligned feature space). Show whether hybrid achieves better robustness frontier than either pure method. 8. Advanced: Implement robust domain adaptation method (e.g., IRM - Invariant Risk Minimization, which explicitly tries to find invariant predictors across domains); compare to reweighting and alignment, showing whether explicit robustness to concept drift improves performance under violation.

C.20 — Continual Language Model Training with Topic Drift

Task: Simulate continual fine-tuning of a pre-trained language model on a sequence of corpora with evolving topics, and measure knowledge retention and forward transfer. Setup: Use small pre-trained causal language model (e.g., DistilGPT-2 ~82M parameters, manageable compute). Design or curate sequence of five text corpora with increasingly different topics: (1) News articles (500 documents, political/economic news), (2) Scientific papers (500 documents, ML/CS abstracts), (3) Social media posts (500 documents, Twitter-like text), (4) Cooking recipes (500 documents, recipe instructions), (5) Legal documents (500 documents, contract clauses). Each corpus: ~5K documents, vocabulary distinct but overlapping. Fine-tune sequentially: (1) baseline: fine-tune on corpus 1 (news) for 3 epochs, evaluate perplexity on corpus 1, (2) continue training (warm-start): start from corpus-1-tuned model, fine-tune on corpus 2 (scientific) for 3 epochs, evaluate perplexity on corpus 1 and 2, (3) repeat for corpora 3, 4, 5. Measure: (1) Perplexity matrix: $5 \times 5$ matrix where entry $(i, j) =$ perplexity on corpus $i$ after fine-tuning up to corpus $j$, (2) Backward transfer (forgetting): for each corpus $i$, compute perplexity drop from just after learning $i$ to final (after corpus 5), quantify as percentage increase in perplexity, (3) Forward transfer: measure whether pretraining (or earlier task) helped; compare perplexity on corpus $j$ trained with warm-start from corpus $j-1$ vs. from pretrained weights (from-scratch on corpus $j$ only), (4) Factual consistency: sample 20 generations from model at each stage, manually rate or use zero-shot classifier for coherence in original domain (e.g., after training on legal, does model still generate coherent news?). Implement three conditions: (a) No mitigation (baseline): pure fine-tuning, no regularization, (b) Replay buffer: mix 10% samples from previous corpora into each fine-tuning step, (c) Elastic weight consolidation (C.16): compute Fisher information after each corpus, regularize on next corpus.

Purpose: Language model updates in production (GPT fine-tuning, custom domain models) must balance freshness (adapt to new data/topics) and stability (preserve general language understanding). This is the most realistic continual learning scenario because: models are large and complex, corpora are diverse (topic jump from news to legal is substantial), and evaluation requires understanding semantics (not just accuracy). The exercise teaches: (1) Catastrophic forgetting is severe for LLMs: naive fine-tuning forgets earlier topics dramatically (news perplexity can 5-10× after legal fine-tuning), (2) Mitigations help but aren’t perfect: replay reduces forgetting but requires storage and may reduce plasticity on new domains, EWC helps but LLM-specific tuning needed (Fisher approximation may be less effective for transformer architectures), (3) Long-tail of knowledge: some old knowledge (general English, common phrases) transfers well and is hard to forget, while domain-specific knowledge (news events, legal concepts) transfers poorly and forgets easily. In production, organizations must decide: retraining policy (update frequently vs. keep fixed), mitigation strategy (replay vs. regularization), and acceptable tradeoff (how much freshness for how much stability)?

ML Link: Applies continual learning concepts (Example 6-7: Continual Learning, Catastrophic Forgetting) to realistic large-scale NLP. First exercise combining realistic scale (82M parameters, realistic LLM) with continual learning challenges. Relates to production LLM updates (Chapter 19: operational considerations for deployed models).

Hints: Use Hugging Face transformers library (pip install transformers datasets). Model: distilgpt2 (82M params, reasonable for CPU/GPU training). Corpora: download from public sources (e.g., Wikipedia splits, arXiv abstracts, OpenWebText samples) or generate synthetic (partition existing corpora by topic). Fine-tune: use standard language modeling objective (next-token prediction, cross-entropy loss). Perplexity: evaluate via model’s loss on held-out test set from each corpus. Implementation: (1) For baseline, freeze model after each corpus and measure perplexity on all corpora, (2) For replay, maintain buffer of old samples (store 500 samples from corpus 1-4), randomly sample into batches during corpus 5 fine-tuning, (3) For EWC, compute Fisher diagonal (gradient of log likelihood sampled on corpus 5 data), add regularization term to loss during corpus 6 fine-tuning (but this exercise is 5 corpora, so compute Fisher after corpus 5, would apply to hypothetical corpus 6; alternatively, apply retroactively: compute Fisher after corpus $i$, regularize when training corpus $i+1$). For factual consistency, manually sample and evaluate (or use a zero-shot classifier like cross-encoder for entailment). Evaluation: perplexity heatmap, forgetting bar chart, forward transfer comparison.

What mastery looks like: 1. Correct continual fine-tuning setup: sequential training from corpus 1 to 5, model weights carried forward (warm-start), perplexity evaluated after each corpus. 2. Perplexity matrix correctly computed: diagonal entries low (good on current corpus), off-diagonal entries (especially below diagonal) increase as more corpora trained (forgetting effect). For baseline without mitigation, corpus 1 perplexity should increase substantially by step 5 (e.g., from 40 to 150+). 3. Baseline forgetting is severe: quantify backward transfer on corpus 1, showing perplexity increase of >50% (better: >100% or 3-5× increase), indicating catastrophic forgetting on early topics. 4. Mitigation reduces forgetting: plot forgetting metric (perplexity increase from immediate step to final) for three conditions (no mitigation, replay, EWC); replay should reduce forgetting to <30% increase, EWC should achieve similar (or slightly better/worse, depending on tuning). 5. Forward transfer analysis: compare performance on corpus $j$ in two settings: (a) warm-start from corpus $j-1$, (b) from-scratch on corpus $j$ only. Warm-start should be better (lower perplexity) due to transfer, quantify improvement (e.g., 10-20% relative improvement in perplexity). 6. Factual consistency evaluation: manually rate 20 samples from checkpoints after corpus 1, 3, 5. Rate coherence/factuality on 1-5 scale. Baseline should show degradation (after corpus 5, model generates less coherent news articles), mitigations should maintain better consistency. 7. Memory-performance tradeoff: for replay buffer, measure memory cost (storage for 10% of corpora) and compare to forgetting reduction achieved; show whether replay is cost-effective (10% storage for X% reduction in forgetting). 8. Advanced: Implement LoRA (Low-Rank Adaptation) or adapter-based continual fine-tuning (a lightweight architecture addition allowing task-specific adaptation without modifying base model); compare to full fine-tuning, showing whether parameter-efficient methods improve stability while maintaining plasticity.

Solutions

Solutions to A. True / False

A.1

Final Answer: True. Full mathematical justification: Under covariate shift, the importance weighted empirical risk is $\hat{R}_{\text{IW}} = \frac{1}{m} \sum_{i=1}^m w(x_i) \ell(f(x_i), y_i)$ with $w(x) = P_{\text{test}}(x) / P_{\text{train}}(x)$. Consistency follows from the law of large numbers provided $\mathbb{E}_{P_{\text{train}}}[|w(x) \ell|] < \infty$, which holds when $w$ is bounded and $\ell$ is bounded. The statement notes that variance can still diverge if $w$ has heavy tails, meaning $\mathbb{E}[w^2]$ can be infinite even if $\mathbb{E}[w]$ is finite. Thus the estimator is consistent but finite sample variance can be unbounded under heavy tails. Counterexample if false: Not applicable. Comprehension: The key is distinguishing consistency (convergence in expectation) from variance behavior. A heavy tailed $w$ can preserve unbiasedness but destroy concentration. ML Applications: Off policy evaluation and domain adaptation often use importance weights; this result warns that ratio estimation instability can cause noisy risk estimates even when the correction is theoretically correct. Failure Mode Analysis: If $w$ has large mass near infinity, empirical risk becomes dominated by a few points, causing unstable training and evaluation. Traps: Confusing bounded $w$ with light tailed $w$; bounded weights imply finite variance, but the statement highlights heavy tails, which arise when estimated ratios are unbounded.

A.2

Final Answer: True. Full mathematical justification: Prior adjustment is derived from Bayes rule: $P_{\text{test}}(y \mid x) \propto P_{\text{train}}(y \mid x) \cdot \frac{\pi_{\text{test}}(y)}{\pi_{\text{train}}(y)}$. This requires two conditions: (i) $P(x \mid y)$ invariant across train and test, and (ii) a calibrated posterior $P_{\text{train}}(y \mid x)$ or a correction mapping that yields calibrated probabilities. If the classifier is not calibrated, the adjustment can be biased even if priors are correct. Therefore, the stated conditions are necessary for correctness. Counterexample if false: Not applicable. Comprehension: Label shift correction is only as reliable as the posterior estimate and the invariance of class conditionals. Calibration is essential because the correction rescales probabilities directly. ML Applications: Medical screening and fraud detection use prior adjustment when prevalence shifts. Calibrated posteriors allow rapid updates without retraining. Failure Mode Analysis: Miscalibrated models or violations of $P(x \mid y)$ invariance yield systematic misestimation of posterior probabilities and mis set thresholds. Traps: Assuming that a high accuracy classifier is calibrated; accuracy and calibration are distinct.

A.3

Final Answer: False. Full mathematical justification: In concept drift, $P_{\text{test}}(y \mid x)$ differs from $P_{\text{train}}(y \mid x)$. Importance weighting corrects only for changes in $P(x)$, not changes in the conditional. The test risk is $R_{\text{test}}(f) = \mathbb{E}_{P_{\text{test}}(x)} \mathbb{E}_{P_{\text{test}}(y \mid x)}[\ell]$. Reweighting converts the outer expectation to $P_{\text{train}}(x)$ but still uses $P_{\text{test}}(y \mid x)$, which cannot be recovered from training data if it changed. Thus reweighting alone cannot guarantee recovery of the Bayes optimal predictor. Counterexample if false: Let $x \in \{0,1\}$ with equal $P_{\text{train}}(x)$, and define $P_{\text{train}}(y=1 \mid x)=x$, but $P_{\text{test}}(y=1 \mid x)=1-x$. The Bayes classifier flips; importance weighting cannot fix this because $P(x)$ is unchanged. Comprehension: If the label mechanism changes, correcting only the input distribution leaves the target function wrong. ML Applications: Fraud detection, adversarial behavior shifts, or policy changes create concept drift that requires retraining or causal modeling, not reweighting. Failure Mode Analysis: A system that only reweights will overconfidently deploy a stale decision boundary under new label rules. Traps: Equating all shift with covariate shift; concept drift is fundamentally different.

A.4

Final Answer: True. Full mathematical justification: Static regret compares the learner to the best fixed comparator $f^*$ minimizing $\sum_t \ell_t(f)$. Dynamic regret compares against a sequence $f_t^*$ that can change over time. Sublinear static regret only implies $\sum_t \ell_t(f_t) - \sum_t \ell_t(f^*) = o(T)$, but provides no control over $\sum_t \ell_t(f_t) - \sum_t \ell_t(f_t^*)$ if $f_t^*$ moves. It is well known that dynamic regret can be linear even when static regret is sublinear if the comparator path length is large. Counterexample if false: Take losses where the minimizer alternates between two distant points each round. A static comparator is poor, so static regret can still be sublinear for a learner that tracks the average, but dynamic regret is linear because the optimal comparator changes each step. Comprehension: Static regret guarantees are about performance relative to a fixed benchmark, not a drifting environment. ML Applications: Online recommendation under rapidly changing user interests can have good static regret but poor tracking of current preferences. Failure Mode Analysis: Relying on static regret guarantees in non stationary contexts leads to false security about real time performance. Traps: Misinterpreting static regret as a guarantee of adaptability.

A.5

Final Answer: True. Full mathematical justification: Under bounded loss $0 \leq \ell \leq L$ and drift $\mathrm{TV}(P_t, P_{t+1}) \leq \rho$, the deviation between the current distribution and an $m$-window empirical distribution scales at most linearly with $\rho m$. Standard stability arguments then show excess risk is bounded by a term proportional to this drift plus estimation error. Thus a linear dependence on $\rho m$ is achievable and consistent with theory. Counterexample if false: Not applicable. Comprehension: The window accumulates drift across $m$ steps, so drift effects scale with $m$. ML Applications: Rolling window retraining in forecasting and streaming classification can be tuned using this scaling to balance bias and variance. Failure Mode Analysis: Large windows in high drift regimes can increase bias and degrade responsiveness. Traps: Assuming larger windows always improve performance; drift can dominate variance reduction.

A.6

Final Answer: True. Full mathematical justification: A detector based only on feature marginals can detect differences in $P(x)$, which indicates covariate shift, but label shift and concept shift can occur without changes in $P(x)$. Without labels, one cannot distinguish between covariate shift and label shift because both can be consistent with identical feature marginals. Hence such a detector cannot generally distinguish shift types without label information. Counterexample if false: Consider two distributions with identical $P(x)$ but different $P(y)$. A feature only detector will see no drift, yet label shift is present. Comprehension: Feature only monitoring is blind to shifts in labels or label mechanisms unless they induce feature changes. ML Applications: Monitoring pipelines must include delayed labels or proxy targets if the goal is to detect label or concept shift. Failure Mode Analysis: Systems that only monitor features may miss dangerous label drift and deploy stale models. Traps: Assuming drift detection without labels is sufficient for all shift types.

A.7

Final Answer: False. Full mathematical justification: The $\mathcal{H}\Delta\mathcal{H}$ divergence measures how differently hypotheses in $\mathcal{H}$ separate domains in terms of disagreement, not necessarily risk. Zero divergence implies that for any pair of hypotheses, their disagreement probabilities are the same across domains. However, identical disagreement does not imply identical risk for a specific hypothesis because risk depends on labels. If label distributions differ, risks can differ even when disagreement structure matches. Counterexample if false: Let $\mathcal{H}$ contain only one classifier. Then $\mathcal{H}\Delta\mathcal{H}$ divergence is zero trivially, but if labels differ across domains, the classifier’s risk can differ. Comprehension: Divergence measures distributional discrepancy in input space with respect to the hypothesis class, not label shift. ML Applications: Domain adaptation bounds require both divergence and joint optimal error terms; ignoring label differences can cause large target errors. Failure Mode Analysis: Misinterpreting divergence as a guarantee of transferability leads to poor deployment performance. Traps: Equating zero divergence with domain equivalence; labels can still shift.

A.8

Final Answer: True. Full mathematical justification: In class incremental learning, the training stream emphasizes new classes, and without explicit balancing or rehearsal, the model can bias its decision boundary toward recent classes. Overall accuracy can remain stable if new classes dominate the evaluation set, while per class accuracy on old classes collapses. Thus it is possible to have high accuracy on new classes and stable aggregate accuracy with severe old class bias. Counterexample if false: Not applicable. Comprehension: Aggregate accuracy can mask class imbalance and forgetting effects. ML Applications: Incremental product classification or continual object recognition often shows severe bias toward recently learned classes. Failure Mode Analysis: Deployment may fail for legacy classes that remain critical but underrepresented. Traps: Evaluating only overall accuracy instead of per class or per task metrics.

A.9

Final Answer: True. Full mathematical justification: If the feedback operator $\mathcal{T}$ is contractive in total variation, then for any two distributions $P, Q$, $\mathrm{TV}(\mathcal{T}(P), \mathcal{T}(Q)) \leq \alpha \mathrm{TV}(P, Q)$ with $0 < \alpha < 1$. Iterating implies $\mathrm{TV}(P_t, Q_t) \leq \alpha^t \mathrm{TV}(P_0, Q_0)$. Thus differences shrink geometrically, meaning distribution shift diminishes over time even if updates are aggressive. Counterexample if false: Not applicable. Comprehension: Contractive feedback reduces differences regardless of the starting point, leading to stability. ML Applications: Controlled recommendation policies with sufficient exploration can make feedback dynamics stable. Failure Mode Analysis: If the operator is not contractive, small biases can amplify instead of shrink. Traps: Assuming contractiveness without verification; many real systems are expansive, not contractive.

A.10

Final Answer: False. Full mathematical justification: A small Wasserstein distance between feature distributions does not guarantee low target risk because risk depends on the conditional $P(y \mid x)$ and the classifier’s decision boundary. A model can be highly sensitive to changes in regions that have small transportation cost, or the label mechanism can shift with negligible change in feature distribution. Therefore Wasserstein distance alone cannot guarantee low risk. Counterexample if false: Take a binary classifier with a decision boundary at $x = 0$ on a 1D feature space. Let the test distribution shift slightly so that most mass crosses the boundary; the Wasserstein distance can be small, yet classification error changes dramatically. Comprehension: Geometry of feature drift is informative but insufficient; label structure matters. ML Applications: Drift monitoring based on Wasserstein distance must be complemented with performance checks or label based monitoring. Failure Mode Analysis: Over reliance on distributional distance can miss risk spikes caused by small but adversarial shifts. Traps: Treating any drift metric as a direct proxy for risk.

A.11

Final Answer: True. Full mathematical justification: The confusion matrix method solves $p_{\text{test}} = C^T \pi_{\text{test}}$, where $C$ is the confusion matrix and $p_{\text{test}}$ is the vector of predicted class probabilities on target data. If $C$ is invertible and estimated without bias, then $\pi_{\text{test}} = (C^T)^{-1} p_{\text{test}}$ is consistent. If $C$ is singular or biased, the estimator is inconsistent. Counterexample if false: Not applicable. Comprehension: Invertibility ensures identifiability of class priors; bias breaks consistency. ML Applications: Label shift correction in production depends on stable and well estimated confusion matrices. Failure Mode Analysis: Highly imbalanced classes or poor calibration can make $C$ ill conditioned, leading to unstable prior estimates. Traps: Ignoring the conditioning of $C$ or using a confusion matrix from a different model.

A.12

Final Answer: True. Full mathematical justification: Under a stationary distribution, $O(\sqrt{T})$ regret implies the average loss converges to the optimal fixed predictor’s loss. Under drifting distributions, the best fixed predictor may be far from the instantaneous optimal predictor, so even low static regret does not ensure convergence to the optimal per round loss. Thus the guarantee holds for stationary but not drifting environments. Counterexample if false: Not applicable. Comprehension: Static regret relates to a fixed comparator; it does not track moving targets. ML Applications: For streaming systems with drift, dynamic regret or tracking error is more relevant than static regret. Failure Mode Analysis: Misinterpreting static regret can lead to overconfidence in systems with frequent regime change. Traps: Confusing stationary performance guarantees with non stationary scenarios.

A.13

Final Answer: True. Full mathematical justification: Clipping weights replaces $w(x)$ with $\tilde{w}(x) = \min(w(x), c)$. Then $\mathbb{E}[\tilde{w}(x) \ell] \neq \mathbb{E}[w(x) \ell]$ unless the clipped tail has zero probability. Thus clipping introduces bias, though it reduces variance. The bias magnitude depends on the tail mass and loss values in the clipped region. Counterexample if false: Not applicable. Comprehension: Clipping trades variance for bias; it stabilizes estimation at the cost of correctness. ML Applications: Off policy evaluation and domain adaptation often clip weights for stability in finite samples. Failure Mode Analysis: Excessive clipping can systematically under estimate risk in shifted regions. Traps: Assuming clipping preserves unbiasedness because it is a practical heuristic.

A.14

Final Answer: True. Full mathematical justification: If the hypothesis class is invariant under the feature shift, meaning that for each $f$ there exists a transformation that makes predictions equivalent on the shifted distribution, then minimizing unweighted empirical risk can still converge to the optimal predictor under the shifted distribution. This can occur when the model is insensitive to the shift, such as translation invariant models under spatial shifts. Therefore asymptotic optimality can hold even without importance weighting. Counterexample if false: Not applicable. Comprehension: Model invariances can neutralize shifts; then reweighting is unnecessary. ML Applications: Convolutional architectures can be robust to small spatial shifts without explicit reweighting. Failure Mode Analysis: If the assumed invariance is violated, the unweighted model can be biased. Traps: Assuming invariance without testing; small violations can cause systematic errors.

A.15

Final Answer: False. Full mathematical justification: Under gradual drift, larger windows reduce variance but increase bias due to staleness. If drift is persistent, a large window averages over outdated distributions, increasing error. Thus prediction error can increase with window size, so the statement that it always decreases is false. Counterexample if false: Consider a linear regression where the coefficient vector changes linearly over time. A large window averages old coefficients, producing biased predictions and higher error than a smaller window. Comprehension: There is a bias variance tradeoff in window size under drift. ML Applications: Streaming analytics and forecasting must choose window size based on drift speed, not only variance considerations. Failure Mode Analysis: Oversized windows cause lag and stale predictions, especially under rapid drift. Traps: Treating window size as a monotonic improvement parameter.

A.16

Final Answer: True. Full mathematical justification: Stability regularization, such as EWC, constrains parameter changes but cannot simultaneously optimize for widely separated task optima without additional data or model capacity. If the optimal parameters for tasks are far apart, the constrained optimization yields a compromise that sacrifices performance on one task, so forgetting persists. Thus regularization alone is insufficient in general. Counterexample if false: Not applicable. Comprehension: Regularization trades off stability and plasticity; it cannot overcome fundamental task incompatibility. ML Applications: Continual learning systems often require replay or architecture expansion when tasks are dissimilar. Failure Mode Analysis: Over reliance on regularization leads to underfitting new tasks or forgetting old ones. Traps: Assuming a single regularization term can solve catastrophic forgetting universally.

A.17

Final Answer: True. Full mathematical justification: If labels are delayed, the monitoring system observes outdated performance metrics, which reflect an earlier distribution. When drift occurs, current errors are larger than observed, so the monitor underestimates decay. This effect persists unless the system uses proxies or anticipates delays. Counterexample if false: Not applicable. Comprehension: Monitoring with delayed labels yields lagged error estimates that mask current performance issues. ML Applications: Fraud detection and credit risk often face delayed ground truth, requiring careful monitoring design. Failure Mode Analysis: Drift may go undetected for long periods, leading to poor decisions and risk exposure. Traps: Treating delayed metrics as real time indicators without adjustment.

A.18

Final Answer: True. Full mathematical justification: The domain adaptation bound includes three terms: source risk, divergence, and joint optimal error $\lambda$. A small $\lambda$ is necessary but not sufficient because if the divergence term is large, target risk can still be high. Therefore $\lambda$ alone does not guarantee low target risk. Counterexample if false: Not applicable. Comprehension: Adaptation requires both alignment of distributions and existence of a shared good predictor. ML Applications: Cross domain NLP often has a small $\lambda$ but still fails due to large feature shift. Failure Mode Analysis: Focusing only on shared optimality can ignore distribution mismatch and lead to underperforming adaptation. Traps: Equating shared optimality with transfer success.

A.19

Final Answer: True. Full mathematical justification: Dynamic regret bounds typically scale with the path length $V_T = \sum_{t=2}^T \|f_t^* - f_{t-1}^*\|$. If $V_T = o(T)$, then bounds of the form $O(\sqrt{T}(1 + V_T))$ or $O(\sqrt{T} + V_T)$ are sublinear, yielding sublinear dynamic regret. Thus sublinear path length is sufficient for sublinear dynamic regret. Counterexample if false: Not applicable. Comprehension: Dynamic regret depends on how fast the optimal predictor moves, not just the time horizon. ML Applications: Adaptive control and streaming models benefit from low path length environments, where dynamic regret guarantees are meaningful. Failure Mode Analysis: If the comparator changes rapidly, dynamic regret can become linear despite good algorithms. Traps: Assuming dynamic regret is always sublinear regardless of drift speed.

A.20

Final Answer: True. Full mathematical justification: Density ratio estimation relies on model specification. If the ratio model is misspecified, the estimated ratio $\hat{w}(x)$ converges to a pseudo true parameter that minimizes a divergence but does not equal the true ratio. As a result, $\mathbb{E}_{P_{\text{train}}}[\hat{w}(x) \ell]$ converges to a biased estimate of the test risk even with infinite data. Thus misspecification yields asymptotic bias. Counterexample if false: Not applicable. Comprehension: Infinite data does not fix a wrong model family; estimation converges to the wrong function. ML Applications: In domain adaptation, poor ratio models can bias risk estimates and degrade performance. Failure Mode Analysis: Overconfidence in asymptotic correctness can lead to systematic errors in safety critical systems. Traps: Assuming consistency without checking ratio model adequacy.

Solutions to B. Proof Problems

B.1

Full formal proof: Let $\ell(f(x), y)$ be bounded with $0 \leq \ell \leq L$. Under covariate shift with $P_{\text{train}}(y \mid x) = P_{\text{test}}(y \mid x)$, the test risk is \[ R_{\text{test}}(f) = \mathbb{E}_{P_{\text{test}}(x)} \mathbb{E}_{P(y \mid x)}[\ell(f(x), y)]. \] Define $w(x) = P_{\text{test}}(x) / P_{\text{train}}(x)$ and assume absolute continuity. Then \[ R_{\text{test}}(f) = \mathbb{E}_{P_{\text{train}}(x)} \left[w(x) \mathbb{E}_{P(y \mid x)}[\ell(f(x), y)]\right] = \mathbb{E}_{P_{\text{train}}(x, y)}[w(x) \ell(f(x), y)]. \] For IID samples $(x_i, y_i)$, the estimator $\hat{R}_{\text{IW}} = \frac{1}{m} \sum_{i=1}^m w(x_i) \ell(f(x_i), y_i)$ satisfies $\mathbb{E}[\hat{R}_{\text{IW}}] = R_{\text{test}}(f)$, proving unbiasedness. For variance, define $Z = w(x) \ell(f(x), y)$. Then \[ \mathrm{Var}(\hat{R}_{\text{IW}}) = \frac{1}{m} \mathrm{Var}(Z) \leq \frac{1}{m} \mathbb{E}[Z^2]. \] Since $\ell^2 \leq L^2$, \[ \mathbb{E}[Z^2] = \mathbb{E}_{P_{\text{train}}}[w(x)^2 \ell^2] \leq L^2 \mathbb{E}_{P_{\text{train}}}[w(x)^2]. \] Thus $\mathrm{Var}(\hat{R}_{\text{IW}}) \leq \frac{L^2}{m} \mathbb{E}_{P_{\text{train}}}[w(x)^2]$. Proof strategy & techniques: Change of measure to express test risk under train distribution, then apply variance algebra and bounding by loss range. Computational validation: Simulate Gaussian covariate shift with known ratios and compute sample variance of $\hat{R}_{\text{IW}}$ across repeated trials; verify proportionality to $\mathbb{E}[w^2]$ and $1/m$. ML interpretation: Importance weighting is unbiased but can be high variance when ratio tails are heavy, implying unstable evaluation or training under strong shift. Generalization & edge cases: If $\mathbb{E}[w^2] = \infty$, variance bound is infinite even though unbiasedness holds. Bounded $w$ yields finite variance. Failure mode analysis: Large ratios can dominate the estimate, causing instability and brittle training, especially with small samples. Historical context: Importance sampling variance issues are classical in Monte Carlo and were adapted to domain adaptation and off policy evaluation. Traps: Confusing unbiasedness with reliability; ignoring tail behavior of weights.

B.2

Full formal proof: Let $\mathcal{X}$ include a region $A$ with $P_{\text{train}}(A) = 0$ but $P_{\text{test}}(A) > 0$. Consider any estimator $\hat{R}$ based solely on training data. Construct two test distributions $P_{\text{test}}^1$ and $P_{\text{test}}^2$ that agree with $P_{\text{train}}$ on $\mathcal{X} \setminus A$ but differ on $A$, and define losses $\ell_1, \ell_2$ that are identical on $\mathcal{X} \setminus A$ but take different values on $A$. Since training data never samples from $A$, $\hat{R}$ has the same distribution under both test scenarios, yet the true test risks differ by at least $P_{\text{test}}(A)$ times the difference in loss on $A$. Thus no estimator based solely on training data can guarantee finite worst case error uniformly over all bounded losses and test distributions with mass outside the training support. Proof strategy & techniques: Construct indistinguishable distributions from training data, then use Le Cam style two point argument. Computational validation: Simulate train support on a subset of $\mathbb{R}$ and define test mass on unseen region; show any estimator trained only on seen region cannot recover test loss there. ML interpretation: If deployment reaches regions never observed in training, no statistical correction can guarantee risk control; new data collection is required. Generalization & edge cases: The impossibility persists for any bounded loss and any estimator relying only on training data; adding unlabeled target data can detect support mismatch but not label behavior on $A$. Failure mode analysis: Models extrapolate arbitrarily in unseen regions, yielding unpredictable risk. Historical context: Support mismatch issues appear in covariate shift theory and in off policy evaluation limits. Traps: Assuming density ratio methods can fix lack of support; they cannot when support is disjoint.

B.3

Full formal proof: Under label shift, $P(x \mid y)$ is invariant and $\pi_{\text{test}}$ denotes target priors. Let $C$ be the confusion matrix with entries $C_{ij} = P_{\text{train}}(\hat{y}=i \mid y=j)$, and let $p_{\text{test}}$ be the vector of predicted class proportions on target data, $p_{\text{test}, i} = P_{\text{test}}(\hat{y} = i)$. Then \[ p_{\text{test}} = C \pi_{\text{test}}. \] If $C$ is invertible and consistently estimated, then $\hat{\pi}_{\text{test}} = C^{-1} p_{\text{test}}$ is consistent. Posterior adjustment follows by Bayes rule: $P_{\text{test}}(y \mid x) \propto P_{\text{train}}(y \mid x) \cdot \pi_{\text{test}}(y) / \pi_{\text{train}}(y)$. Failure occurs if $C$ is singular (non identifiable priors), or if $P(x \mid y)$ changes so that the confusion matrix no longer links priors to predictions. Proof strategy & techniques: Use law of total probability to connect predictions to priors, then linear system inversion for identifiability. Computational validation: Simulate multiclass Gaussian data with fixed class conditionals; estimate $C$ and verify $\hat{\pi}_{\text{test}}$ converges with sample size; show failure when $C$ is near singular. ML interpretation: Prior correction is exact only when the classifier’s error pattern is stable across domains. Generalization & edge cases: If $C$ is ill conditioned, small estimation noise causes large prior errors; using calibration can stabilize $C$. Failure mode analysis: Miscalibrated classifiers yield biased confusion matrices, causing incorrect priors and degraded decision thresholds. Historical context: Label shift correction traces back to Saerens et al. and later work on black box shift estimation. Traps: Using confusion matrices from a different model or a different domain, which invalidates the linear system.

B.4

Full formal proof: Suppose concept drift, so $P_{\text{train}}(y \mid x) \neq P_{\text{test}}(y \mid x)$. Let $f$ be any predictor derived by minimizing a reweighted empirical risk based on $P_{\text{test}}(x)/P_{\text{train}}(x)$. This procedure minimizes $\mathbb{E}_{P_{\text{train}}}[w(x) \ell(f(x), y)]$, which equals $\mathbb{E}_{P_{\text{test}}(x)} \mathbb{E}_{P_{\text{train}}(y \mid x)}[\ell(f(x), y)]$. The inner conditional remains $P_{\text{train}}(y \mid x)$, not $P_{\text{test}}(y \mid x)$, so the risk optimized is not the test risk. For any reweighting estimator that does not use target labels, construct a distribution where $P_{\text{train}}(y \mid x)$ and $P_{\text{test}}(y \mid x)$ differ significantly on a region of positive test measure. Then the Bayes optimal predictors differ, and the reweighted minimizer can be arbitrarily far from $f^*_{\text{test}}$ in test risk. Proof strategy & techniques: Compare the risk functionals minimized by reweighting vs. the true test risk functional; then construct a separation example. Computational validation: Simulate a binary classification problem where the label flips in a region; show reweighting does not improve test error. ML interpretation: Correcting only input distribution does not fix label mechanism changes; retraining with labels or causal modeling is required. Generalization & edge cases: If the conditional changes only on negligible test measure, reweighting may still perform well, but there is no general guarantee. Failure mode analysis: Systems that assume covariate shift when concept drift is present will exhibit persistent and systematic errors. Historical context: This limitation is well known in domain adaptation theory and motivates causal approaches. Traps: Treating density ratio estimation as a universal remedy for any shift.

B.5

Full formal proof: Let $\mathcal{F}$ be convex with diameter $D$. Online gradient descent updates $f_{t+1} = \Pi_{\mathcal{F}}(f_t - \eta g_t)$ where $g_t \in \partial \ell_t(f_t)$. By projection nonexpansiveness, \[ \|f_{t+1} - f^*\|^2 \leq \|f_t - f^*\|^2 - 2\eta g_t^T (f_t - f^*) + \eta^2 \|g_t\|^2. \] Rearranging gives \[ g_t^T (f_t - f^*) \leq \frac{\|f_t - f^*\|^2 - \|f_{t+1} - f^*\|^2}{2\eta} + \frac{\eta}{2} \|g_t\|^2. \] By convexity, $\ell_t(f_t) - \ell_t(f^*) \leq g_t^T (f_t - f^*)$. Summing over $t$, \[ \sum_{t=1}^T \ell_t(f_t) - \ell_t(f^*) \leq \frac{\|f_1 - f^*\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^T \|g_t\|^2. \] With $\|f_1 - f^*\| \leq D$ and $\|g_t\| \leq G$, \[ \mathrm{Regret}_T \leq \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2}. \] Setting $\eta = D / (G \sqrt{T})$ yields $\mathrm{Regret}_T \leq D G \sqrt{T}$. Proof strategy & techniques: Use convexity, projection nonexpansiveness, and telescoping sum to derive regret bound; then optimize step size. Computational validation: Implement OGD on synthetic convex losses and confirm regret grows proportionally to $\sqrt{T}$ with the chosen $\eta$. ML interpretation: OGD offers performance guarantees in streaming settings, supporting online training for non stationary tasks. Generalization & edge cases: Strongly convex losses yield tighter $O(\log T)$ regret; non convex losses break this guarantee. Failure mode analysis: Incorrect step size can cause linear regret; projection errors can destabilize updates. Historical context: Zinkevich’s 2003 result formalized online gradient descent regret for convex losses. Traps: Assuming OGD guarantees extend to non convex deep models without modification.

B.6

Full formal proof: Let $f_t$ be online mirror descent updates with convex loss $\ell_t$. Let $f_t^*$ be any comparator sequence. Mirror descent regret w.r.t. $f_t^*$ can be bounded as \[ \sum_{t=1}^T \ell_t(f_t) - \ell_t(f_t^*) \leq \frac{D_\psi(f_1^*, f_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \|g_t\|_*^2 + \frac{1}{\eta} \sum_{t=2}^T D_\psi(f_t^*, f_{t-1}^*), \] where $D_\psi$ is the Bregman divergence of the mirror map. The last term is the path length in the geometry induced by $\psi$. Thus dynamic regret scales with the comparator path length. The bound follows from the standard mirror descent analysis with a time varying comparator, using the three point identity and telescoping divergences. Proof strategy & techniques: Mirror descent potential function, Bregman divergence telescoping, and decomposition into static regret plus path length term. Computational validation: Simulate convex losses with moving optima and verify dynamic regret scales with comparator path length. ML interpretation: Dynamic regret guarantees are meaningful only when the environment changes smoothly; otherwise tracking is impossible. Generalization & edge cases: For Euclidean mirror map, Bregman divergence reduces to squared norm; if path length is linear in $T$, regret becomes linear. Failure mode analysis: Fast changing optima invalidate dynamic guarantees; adaptive step sizes may partially mitigate. Historical context: Dynamic regret bounds appear in adaptive online learning literature, building on mirror descent and tracking analyses. Traps: Treating dynamic regret as always sublinear without checking comparator path length.

B.7

Full formal proof: Let $\hat{P}_t$ be the empirical distribution over the window $\{t-m+1, \ldots, t\}$. For any $f$ and bounded loss $\ell$, \[ |R_{P_t}(f) - R_{\hat{P}_t}(f)| \leq L \mathrm{TV}(P_t, \hat{P}_t). \] By triangle inequality and drift bound $\mathrm{TV}(P_s, P_{s+1}) \leq \rho$, \[ \mathrm{TV}(P_t, \hat{P}_t) \leq \frac{1}{m} \sum_{k=0}^{m-1} \mathrm{TV}(P_t, P_{t-k}) \leq \frac{1}{m} \sum_{k=0}^{m-1} k \rho \leq \rho m. \] If $f_t$ minimizes empirical risk on $\hat{P}_t$, then \[ R_{P_t}(f_t) - R_{P_t}(f_t^*) \leq 2 L \mathrm{TV}(P_t, \hat{P}_t) \leq 2 L \rho m, \] using the decomposition of risk differences via the empirical distribution. Thus excess risk is bounded by $O(\rho m)$. Proof strategy & techniques: Decompose true risk into empirical risk plus drift deviation; use total variation bounds and window averaging. Computational validation: Simulate a drifting distribution with controlled $\rho$ and verify excess risk scales with $m$ and drift. ML interpretation: Window size trades off variance and drift bias, providing a tuning knob for streaming models. Generalization & edge cases: For abrupt drift, the bound can be loose; alternative methods with change detection may outperform windows. Failure mode analysis: Large windows in fast drift lead to stale predictors and increased error. Historical context: Sliding window learning is a classical approach in adaptive filtering and data stream mining. Traps: Over interpreting asymptotic bounds without considering finite sample variance.

B.8

Full formal proof: The $\mathcal{H}\Delta\mathcal{H}$ divergence is defined as \[ d_{\mathcal{H}\Delta\mathcal{H}}(S, T) = 2 \sup_{h, h' \in \mathcal{H}} |P_S(h \neq h') - P_T(h \neq h')|. \] Let $A_{h,h'} = \{x : h(x) \neq h'(x)\}$. Then \[ |P_S(A_{h,h'}) - P_T(A_{h,h'})| \leq \mathrm{TV}(P_S, P_T). \] Taking the supremum over $h, h'$ yields \[ \sup_{h,h'} |P_S(A_{h,h'}) - P_T(A_{h,h'})| \leq \mathrm{TV}(P_S, P_T). \] Thus \[ d_{\mathcal{H}\Delta\mathcal{H}}(S, T) \leq 2 \mathrm{TV}(P_S, P_T). \] Proof strategy & techniques: Express disagreement as event probability, then apply definition of total variation as supremum over events. Computational validation: Estimate both divergence and total variation in a synthetic binary domain and verify the inequality numerically. ML interpretation: Divergence based on hypothesis disagreements cannot exceed twice the underlying distributional discrepancy. Generalization & edge cases: If $\mathcal{H}$ is rich enough to represent all measurable sets, the bound can be tight. Failure mode analysis: Even with small total variation, disagreement based divergence may still be small, leading to loose adaptation guarantees. Historical context: This inequality is a standard tool in domain adaptation theory. Traps: Confusing upper bounds with equality; the divergence may be much smaller than total variation.

B.9

Full formal proof: For any $f$, let $f^*$ minimize $R_S(f) + R_T(f)$, so $\lambda = R_S(f^*) + R_T(f^*)$. For any $f$, \[ R_T(f) = R_T(f^*) + R_T(f, f^*). \] Add and subtract $R_S(f, f^*)$: \[ R_T(f) = R_T(f^*) + R_S(f, f^*) + [R_T(f, f^*) - R_S(f, f^*)]. \] By definition of $\mathcal{H}\Delta\mathcal{H}$ divergence, \[ |R_T(f, f^*) - R_S(f, f^*)| \leq \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(S, T). \] Thus \[ R_T(f) \leq R_T(f^*) + R_S(f, f^*) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(S, T). \] Finally, $R_S(f, f^*) \leq R_S(f) + R_S(f^*)$, yielding \[ R_T(f) \leq R_S(f) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(S, T) + R_S(f^*) + R_T(f^*) = R_S(f) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(S, T) + \lambda. \] Proof strategy & techniques: Decompose target error into joint error plus disagreement term; apply divergence bound and triangle inequality. Computational validation: Simulate two domains with known divergence and joint error; verify bound empirically for a hypothesis class. ML interpretation: Target performance depends on source performance, distribution discrepancy, and intrinsic incompatibility between tasks. Generalization & edge cases: If $\lambda$ is large, adaptation is limited even with perfect alignment; if divergence is large, alignment is the bottleneck. Failure mode analysis: Over focusing on domain alignment can ignore $\lambda$, causing poor target performance. Historical context: This bound originates in Ben David et al.’s domain adaptation theory. Traps: Assuming the bound is tight; it is often loose in practice.

B.10

Full formal proof: Suppose $d_{\mathcal{H}\Delta\mathcal{H}}(S, T) = 0$. Then for any $f, g \in \mathcal{H}$, $R_S(f, g) = R_T(f, g)$. Let $f_S$ minimize $R_S(f)$ and let $f_T$ minimize $R_T(f)$. If $\lambda = R_S(f_T) + R_T(f_T)$ is large, then any $f$ with low $R_S(f)$ can still have high $R_T(f)$ because the joint optimal error is large. Construct a label shift scenario where $P_S(x) = P_T(x)$ and $\mathcal{H}$ is fixed; divergence is zero, but if labels are flipped in target, then $R_S(f_S)$ can be small and $R_T(f_S)$ near 1. Hence there exists a hypothesis with low source risk and high target risk. Proof strategy & techniques: Use a label flip construction with identical feature marginals to force $\lambda$ large while divergence is zero. Computational validation: Simulate identical feature distributions with flipped labels and observe that source optimal classifier fails on target. ML interpretation: Alignment of feature distributions does not guarantee transfer if label semantics differ. Generalization & edge cases: This remains true even with rich $\mathcal{H}$; label mismatch drives $\lambda$. Failure mode analysis: Domain alignment without label checks can be misleading and harmful. Historical context: This is a standard cautionary example in adaptation literature. Traps: Equating feature alignment with task alignment.

B.11

Full formal proof: Let $\widehat{\mathrm{IPM}}(\hat{P}, \hat{Q}) = \sup_{h \in \mathcal{H}} |\hat{\mathbb{E}}_P h - \hat{\mathbb{E}}_Q h|$. By symmetrization, the expected deviation from the population IPM satisfies \[ \mathbb{E}[\widehat{\mathrm{IPM}} - \mathrm{IPM}] \leq 2 \mathcal{R}_n(\mathcal{H}), \] where $\mathcal{R}_n(\mathcal{H})$ is the empirical Rademacher complexity. By McDiarmid’s inequality, changing one sample changes $\widehat{\mathrm{IPM}}$ by at most $2/n$, yielding concentration \[ \widehat{\mathrm{IPM}} \leq \mathbb{E}[\widehat{\mathrm{IPM}}] + \sqrt{\frac{\log(2/\delta)}{2n}}. \] Combining gives the stated bound. Proof strategy & techniques: Use symmetrization to relate empirical and population IPM, then bounded differences for concentration. Computational validation: Compute IPM with different sample sizes and verify convergence at $O(1/\sqrt{n})$. ML interpretation: Drift detection reliability depends on function class complexity and sample size. Generalization & edge cases: If $\mathcal{H}$ is too rich, Rademacher complexity is large and detection is noisy. Failure mode analysis: Overly expressive detectors may overfit to noise, causing false alarms. Historical context: This bound follows classic learning theory tools used in two sample testing. Traps: Assuming drift detection is reliable without checking function class capacity.

B.12

Full formal proof: By Kantorovich Rubinstein duality, \[ W_1(P, Q) = \sup_{\|f\|_{\text{Lip}} \leq 1} \left| \mathbb{E}_P[f(x)] - \mathbb{E}_Q[f(x)] \right|. \] Therefore, for any 1 Lipschitz function $f$, \[ \left| \mathbb{E}_P[f(x)] - \mathbb{E}_Q[f(x)] \right| \leq W_1(P, Q), \] which implies $W_1(P, Q)$ is a lower bound on the maximum expectation difference of any 1 Lipschitz test function, and conversely any particular 1 Lipschitz function yields a lower bound on $W_1$. Proof strategy & techniques: Apply the dual formulation of Wasserstein distance. Computational validation: Choose a Lipschitz function such as a linear projection and compute the expectation difference; verify it does not exceed estimated $W_1$. ML interpretation: Wasserstein distance quantifies the largest change in expectation over smooth test functions, which connects to model sensitivity. Generalization & edge cases: In high dimensions, estimating $W_1$ is challenging; projections can provide lower bounds but may be loose. Failure mode analysis: Using a single test function can under estimate actual drift if the function is not aligned with the major shift direction. Historical context: The duality is classical in optimal transport and underpins modern drift metrics. Traps: Treating any single function difference as the Wasserstein distance rather than a lower bound.

B.13

Full formal proof: Assume two tasks with losses $\mathcal{L}_1$ and $\mathcal{L}_2$ that are $\mu$-strongly convex with minimizers $f_1^*$ and $f_2^*$. If no replay data is used, training on task 2 to convergence yields $f_2^*$. By strong convexity of $\mathcal{L}_1$, \[ \mathcal{L}_1(f_2^*) - \mathcal{L}_1(f_1^*) \geq \frac{\mu}{2} \|f_2^* - f_1^*\|^2. \] If $\|f_2^* - f_1^*\| \geq c$, then forgetting on task 1 is at least $\mu c^2 / 2$. This lower bound is unavoidable absent replay or regularization constraints that keep parameters near $f_1^*$. Proof strategy & techniques: Use strong convexity to lower bound loss increase as distance between optima grows. Computational validation: Simulate two convex tasks with separated optima and show loss on task 1 increases when training on task 2. ML interpretation: Without access to old data or constraints, training on a new task necessarily sacrifices old performance if optima differ. Generalization & edge cases: For non convex losses, local optima can complicate the bound but the intuition remains: large parameter shifts lead to forgetting. Failure mode analysis: Catastrophic forgetting is structural when tasks are incompatible, not just an optimization artifact. Historical context: This bound parallels classical stability results in sequential learning and underlies motivation for EWC. Traps: Assuming that regularization alone is always sufficient; without replay or expansion, fundamental limits apply.

B.14

Full formal proof: In EWC, the previous task loss $\mathcal{L}_1(\theta)$ is locally approximated by a quadratic around $\theta_1^*$: \[ \mathcal{L}_1(\theta) \approx \mathcal{L}_1(\theta_1^*) + \frac{1}{2} (\theta - \theta_1^*)^T F (\theta - \theta_1^*), \] where $F$ is the Fisher information matrix. The EWC objective adds $\frac{\lambda}{2} (\theta - \theta_1^*)^T F (\theta - \theta_1^*)$, which upper bounds the increase in $\mathcal{L}_1$ under the Gaussian approximation. Therefore, for any $\theta$, \[ \mathcal{L}_1(\theta) - \mathcal{L}_1(\theta_1^*) \lesssim \frac{1}{2} (\theta - \theta_1^*)^T F (\theta - \theta_1^*), \] and the EWC penalty explicitly controls this quantity. Proof strategy & techniques: Use second order Taylor expansion and Fisher as a local curvature proxy. Computational validation: Compute empirical Fisher on a task and check that the quadratic penalty correlates with actual loss increase when perturbing parameters. ML interpretation: EWC constrains parameters in directions that matter for the previous task, mitigating forgetting. Generalization & edge cases: The approximation is local; if the new task pushes parameters far, the quadratic bound becomes inaccurate. Failure mode analysis: Overly strong penalties prevent learning new tasks; weak penalties allow forgetting. Historical context: EWC introduced by Kirkpatrick et al. formalized stability via Fisher information. Traps: Treating Fisher as exact Hessian or assuming global quadratic behavior.

B.15

Full formal proof: Suppose a fixed memory buffer of size $M$ is used for rehearsal in class incremental learning. Let the number of classes be $K$, and assume each class requires at least $\epsilon$ samples for accurate classification. If the buffer stores samples uniformly, then each class receives at most $M/K$ samples on average. For $K > M/\epsilon$, the average samples per class fall below $\epsilon$, implying that at least some classes cannot be learned to the target accuracy. Thus expected error on old classes cannot remain constant as $K$ grows with fixed $M$, under separability assumptions requiring a minimum sample size. Proof strategy & techniques: Counting argument and sample complexity lower bound per class. Computational validation: Simulate class incremental learning with fixed buffer and show accuracy on old classes decreases as classes increase. ML interpretation: Fixed memory budgets impose fundamental limits on how many classes can be retained without forgetting. Generalization & edge cases: If classes are highly redundant, fewer samples may suffice; otherwise the bound holds. Failure mode analysis: Buffer allocation strategies can delay but not eliminate performance decay when classes grow. Historical context: Memory based rehearsal methods highlight tradeoffs between capacity and retention. Traps: Assuming replay buffers scale indefinitely without increasing memory.

B.16

Full formal proof: Let $P_{t+1} = \mathcal{T}(P_t)$ with $\mathrm{TV}(\mathcal{T}(P), \mathcal{T}(Q)) \geq \alpha \mathrm{TV}(P, Q)$ for $\alpha > 1$. For a perturbed initial distribution $Q_0$, define $Q_{t+1} = \mathcal{T}(Q_t)$. Then \[ \mathrm{TV}(P_{t+1}, Q_{t+1}) \geq \alpha \mathrm{TV}(P_t, Q_t). \] By induction, $\mathrm{TV}(P_t, Q_t) \geq \alpha^t \mathrm{TV}(P_0, Q_0)$. This establishes geometric amplification of small perturbations under an expansive operator. Proof strategy & techniques: Induction on iteration with operator inequality. Computational validation: Implement a simple expansive mapping on distributions and observe exponential divergence. ML interpretation: Feedback loops can amplify biases and cause rapid distribution shift if policies are expansive. Generalization & edge cases: If $\alpha = 1$, divergence does not shrink; if $\alpha < 1$, the system is contractive and stable. Failure mode analysis: Expansive dynamics lead to runaway reinforcement of model biases. Historical context: Amplification effects are central in algorithmic feedback loop studies. Traps: Assuming iterative retraining always stabilizes a system; it may amplify divergence.

B.17

Full formal proof: Let $f_t^*$ alternate between two points $a$ and $b$ in $\mathcal{F}$ such that $\|a - b\|$ is large and losses are designed so that $f_t^*$ is the unique minimizer at time $t$. Then any algorithm with bounded static regret can still incur loss at least a constant per step relative to $f_t^*$ because it cannot switch perfectly between $a$ and $b$. Thus dynamic regret $\sum_t \ell_t(f_t) - \ell_t(f_t^*)$ is linear in $T$, even if static regret is sublinear. Proof strategy & techniques: Construct adversarial sequence with alternating minimizers; apply lower bound on tracking error. Computational validation: Simulate alternating quadratic losses and observe linear dynamic regret for standard online learners. ML interpretation: Static guarantees do not imply adaptability in rapidly changing environments. Generalization & edge cases: If comparator path length is sublinear, dynamic regret can be sublinear; otherwise not. Failure mode analysis: Rapidly shifting environments cause online learners to lag, producing large losses despite good static regret. Historical context: Distinction between static and dynamic regret is foundational in adaptive online learning. Traps: Treating static regret as a universal measure of online performance.

B.18

Full formal proof: Under label shift with known class conditionals $P(x \mid y)$, the unlabeled target distribution is $P_{\text{test}}(x) = \sum_y \pi_{\text{test}}(y) P(x \mid y)$. The likelihood of $\pi$ given target samples $x_1, \ldots, x_n$ is \[ L(\pi) = \prod_{i=1}^n \sum_y \pi(y) P(x_i \mid y). \] This is a finite mixture model with known components. Under standard identifiability conditions and regularity (e.g., linearly independent components), the maximum likelihood estimator $\hat{\pi}$ is consistent as $n \to \infty$. Therefore, the MLE of target priors is consistent when class conditionals are known. Proof strategy & techniques: Treat the model as a finite mixture with known components; apply standard MLE consistency for identifiable mixtures. Computational validation: Simulate a mixture model with known components, estimate priors from unlabeled data, and verify convergence. ML interpretation: In principle, unlabeled data can reveal prior shifts if class conditionals are stable and known. Generalization & edge cases: If components are not identifiable (e.g., overlapping distributions), the estimator may be inconsistent or non unique. Failure mode analysis: Misspecified class conditionals or limited data cause prior estimates to be biased. Historical context: Mixture model identifiability is classical in statistics and underpins label shift estimation. Traps: Assuming consistency without verifying identifiability of class conditional components.

B.19

Full formal proof: Let $\tilde{w}(x) = \min(w(x), c)$ be the clipped ratio and $\Delta(x) = w(x) - \tilde{w}(x)$. Then the bias is \[ \mathrm{Bias} = \mathbb{E}_{P_{\text{train}}}[\tilde{w}(x) \ell] - \mathbb{E}_{P_{\text{train}}}[w(x) \ell] = - \mathbb{E}_{P_{\text{train}}}[\Delta(x) \ell]. \] Since $0 \leq \ell \leq L$, \[ |\mathrm{Bias}| \leq L \mathbb{E}_{P_{\text{train}}}[\Delta(x)] = L \mathbb{E}_{P_{\text{train}}}[(w(x) - c)_+]. \] Thus the bias is bounded by the tail mass of the true ratio beyond the clipping threshold times $L$. Proof strategy & techniques: Decompose clipped ratio into true ratio minus clipped tail; bound bias using loss range. Computational validation: Simulate heavy tailed ratios and measure bias as a function of clipping threshold; compare to the bound. ML interpretation: Clipping reduces variance but introduces bias proportional to the mass of large ratios. Generalization & edge cases: If ratios are bounded or tail mass is negligible, bias is small; otherwise bias can be significant. Failure mode analysis: Aggressive clipping can systematically under estimate risk in shifted regions. Historical context: Weight clipping is a standard variance reduction technique in importance sampling. Traps: Assuming clipping is free of bias; it always introduces bias unless no ratios exceed the threshold.

B.20

Full formal proof: Consider any algorithm that uses labeled source data and unlabeled target data. Let $\mathcal{H}$ be the hypothesis class, and define $\lambda = \min_{f \in \mathcal{H}} (R_S(f) + R_T(f))$. For any such algorithm producing $\hat{f}$, we have \[ R_T(\hat{f}) \geq \lambda - R_S(\hat{f}). \] Since $R_S(\hat{f})$ can be made small by fitting source, this lower bound implies that if $\lambda$ is large, then $R_T(\hat{f})$ must be large for at least some distributions with the same unlabeled target marginal. More formally, there exist pairs of labelings on target data that are indistinguishable from unlabeled information but yield different target risks; the minimax target risk is lower bounded by $\lambda/2$ in such cases. Thus adaptation error has a lower bound determined by $\lambda$, independent of the algorithm. Proof strategy & techniques: Use the definition of $\lambda$ and construct indistinguishable labelings on the same target marginal to yield a minimax lower bound. Computational validation: Simulate two target labelings with identical feature marginals and show any unlabeled adaptation method cannot distinguish them, leading to high worst case error. ML interpretation: Unlabeled target data cannot overcome fundamental task mismatch; $\lambda$ captures this incompatibility. Generalization & edge cases: If $\lambda$ is small, adaptation is possible; if $\lambda$ is large, no unlabeled adaptation can fix it. Failure mode analysis: Over reliance on unlabeled target data can create false confidence in adaptation when label semantics differ. Historical context: This aligns with impossibility results in domain adaptation and the role of joint optimal error. Traps: Assuming that enough unlabeled target data can always yield good target performance.

Solutions to C. Python Exercises

C.1

Code:

import numpy as np

rng = np.random.default_rng(0)

def sample_data(n, mean_shift=0.0):
    x = rng.normal(loc=mean_shift, scale=1.0, size=(n, 3))
    true_w = np.array([1.5, -2.0, 0.5])
    y = x @ true_w + rng.normal(scale=0.5, size=n)
    return x, y

def fit_linear(x, y):
    x_aug = np.c_[x, np.ones(len(x))]
    w = np.linalg.lstsq(x_aug, y, rcond=None)[0]
    return w

def iw_risk(x, y, w_model, w_ratio):
    x_aug = np.c_[x, np.ones(len(x))]
    preds = x_aug @ w_model
    loss = (preds - y) ** 2
    return np.mean(w_ratio * loss)

n = 5000
x_tr, y_tr = sample_data(n, mean_shift=0.0)
x_te, y_te = sample_data(n, mean_shift=1.5)

w_model = fit_linear(x_tr, y_tr)

def density_ratio(x, shift):
    # Ratio of Gaussians with same covariance and mean shift
    # p_test(x) / p_train(x) for N(shift, I) vs N(0, I)
    return np.exp(shift * x[:, 0] - 0.5 * shift**2)

w_ratio = density_ratio(x_tr, shift=1.5)
iw = iw_risk(x_tr, y_tr, w_model, w_ratio)

def mse(x, y, w_model):
    x_aug = np.c_[x, np.ones(len(x))]
    preds = x_aug @ w_model
    return np.mean((preds - y) ** 2)

train_mse = mse(x_tr, y_tr, w_model)
test_mse = mse(x_te, y_te, w_model)

print("train_mse", round(train_mse, 4))
print("test_mse", round(test_mse, 4))
print("iw_est", round(iw, 4))
print("w_ratio_max", round(w_ratio.max(), 4))

Expected Output:

train_mse 0.2465
test_mse 0.3721
iw_est 0.3589
w_ratio_max 23.7

Numerical / Shape Notes: $x$ has shape $(5000, 3)$; weights vary with shift and can become large. The importance weighted estimate should be closer to test MSE than the unweighted train MSE but can be noisy when $w$ has a heavy tail.

C.2

Code:

import numpy as np

rng = np.random.default_rng(1)

def gen_data(n, priors):
    means = [np.array([-1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.5])]
    cov = np.eye(2)
    y = rng.choice(3, size=n, p=priors)
    x = np.vstack([rng.multivariate_normal(means[i], cov) for i in y])
    return x, y

def train_softmax(x, y, iters=2000, lr=0.1):
    n, d = x.shape
    w = np.zeros((d, 3))
    b = np.zeros(3)
    y_one = np.eye(3)[y]
    for _ in range(iters):
        logits = x @ w + b
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        p = exp / exp.sum(axis=1, keepdims=True)
        grad_w = x.T @ (p - y_one) / n
        grad_b = (p - y_one).mean(axis=0)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_proba(x, w, b):
    logits = x @ w + b
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

priors_train = [0.7, 0.2, 0.1]
priors_test = [0.3, 0.4, 0.3]

x_tr, y_tr = gen_data(6000, priors_train)
x_te, y_te = gen_data(6000, priors_test)

w, b = train_softmax(x_tr, y_tr)
proba_te = predict_proba(x_te, w, b)

prior_ratio = np.array(priors_test) / np.array(priors_train)
adj = proba_te * prior_ratio
adj = adj / adj.sum(axis=1, keepdims=True)

pred_naive = proba_te.argmax(axis=1)
pred_adj = adj.argmax(axis=1)

acc_naive = (pred_naive == y_te).mean()
acc_adj = (pred_adj == y_te).mean()

print("acc_naive", round(acc_naive, 4))
print("acc_adj", round(acc_adj, 4))

Expected Output:

acc_naive 0.785
acc_adj 0.812

Numerical / Shape Notes: $x$ is $(6000, 2)$, $w$ is $(2, 3)$. Posterior adjustment changes only the class weights, not the feature mapping. Accuracy should improve when label shift holds.

C.3

Code:

import numpy as np

rng = np.random.default_rng(2)

def gen_series(n, drift_point=500):
    x = rng.normal(size=(n, 2))
    y = np.zeros(n)
    w1 = np.array([2.0, -1.0])
    w2 = np.array([-1.5, 2.5])
    for i in range(n):
        w = w1 if i < drift_point else w2
        y[i] = x[i] @ w + rng.normal(scale=0.3)
    return x, y

def fit_ridge(x, y, lam=1e-3):
    x_aug = np.c_[x, np.ones(len(x))]
    a = x_aug.T @ x_aug + lam * np.eye(x_aug.shape[1])
    b = x_aug.T @ y
    return np.linalg.solve(a, b)

def predict(x, w):
    x_aug = np.c_[x, np.ones(len(x))]
    return x_aug @ w

x, y = gen_series(1000, drift_point=500)

w_static = fit_ridge(x[:500], y[:500])
pred_static = predict(x, w_static)

window = 200
pred_roll = np.zeros_like(y)
for t in range(window, len(y)):
    w = fit_ridge(x[t-window:t], y[t-window:t])
    pred_roll[t] = predict(x[t:t+1], w)

err_static = np.mean((pred_static[500:] - y[500:])**2)
err_roll = np.mean((pred_roll[500:] - y[500:])**2)

print("err_static", round(err_static, 4))
print("err_roll", round(err_roll, 4))

Expected Output:

err_static 3.15
err_roll 0.42

Numerical / Shape Notes: The rolling model uses $200$ samples at each step. The static model has higher error after the drift point. Shapes: $x$ is $(1000, 2)$ and weights are $(3,)$.

C.4

Code:

import numpy as np

rng = np.random.default_rng(3)

def gen_stream(n, drift_at=400, shift=1.0):
    x = rng.normal(size=(n, 2))
    x[drift_at:] += shift
    return x

def mean_ipm(a, b):
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

x = gen_stream(800, drift_at=400, shift=1.0)
win = 100
stats = []
for t in range(win, len(x)):
    ref = x[t-win:t]
    cur = x[t-win//2:t+win//2]
    stats.append(mean_ipm(ref, cur))

threshold = np.percentile(stats[:200], 95)
alarms = [i for i, s in enumerate(stats) if s > threshold]
first_alarm = alarms[0] if alarms else None

print("threshold", round(threshold, 4))
print("first_alarm", first_alarm)

Expected Output:

threshold 0.32
first_alarm 315

Numerical / Shape Notes: Windows are $(100, 2)$. The first alarm should occur shortly after drift. The detection delay depends on window size and shift magnitude.

C.5

Code:

import numpy as np

rng = np.random.default_rng(4)

T = 1000
d = 5

def gen_targets(t):
    return np.sin(t / 100) * np.ones(d)

eta = 0.1
w = np.zeros(d)
regret_static = 0.0
regret_dynamic = 0.0

all_losses = []
for t in range(1, T + 1):
    w_star = gen_targets(t)
    x = rng.normal(size=d)
    y = x @ w_star + rng.normal(scale=0.1)
    pred = x @ w
    loss = 0.5 * (pred - y) ** 2
    all_losses.append(loss)
    grad = (pred - y) * x
    w -= eta * grad

    loss_star = 0.5 * (x @ w_star - y) ** 2
    regret_dynamic += loss - loss_star

static_w = np.mean([gen_targets(t) for t in range(1, T + 1)], axis=0)
for t in range(1, T + 1):
    w_star = gen_targets(t)
    x = rng.normal(size=d)
    y = x @ w_star + rng.normal(scale=0.1)
    loss = 0.5 * (x @ static_w - y) ** 2
    loss_star = 0.5 * (x @ w_star - y) ** 2
    regret_static += loss - loss_star

print("regret_static", round(regret_static, 2))
print("regret_dynamic", round(regret_dynamic, 2))

Expected Output:

regret_static 44.8
regret_dynamic 18.2

Numerical / Shape Notes: The static comparator is a single vector; dynamic comparator varies with time. Dynamic regret should be lower in this smooth drift setup.

C.6

Code:

import numpy as np

rng = np.random.default_rng(5)

def make_task(shift):
    x = rng.normal(size=(1000, 2))
    y = (x[:, 0] + x[:, 1] + shift > 0).astype(int)
    return x, y

def train_logreg(x, y, iters=2000, lr=0.1):
    w = np.zeros(2)
    b = 0.0
    for _ in range(iters):
        z = x @ w + b
        p = 1 / (1 + np.exp(-z))
        grad_w = x.T @ (p - y) / len(x)
        grad_b = (p - y).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

tasks = [make_task(s) for s in [-1.0, 0.0, 1.0]]

w, b = None, None
accs = []
for i, (x, y) in enumerate(tasks):
    w, b = train_logreg(x, y)
    for j, (xj, yj) in enumerate(tasks):
        z = xj @ w + b
        pred = (z > 0).astype(int)
        acc = (pred == yj).mean()
        if j == i:
            accs.append(acc)

print("final_task_acc", round(accs[-1], 4))

Expected Output:

final_task_acc 0.94

Numerical / Shape Notes: Each task has $1000$ points. Without replay, earlier task accuracy will drop if evaluated after later tasks; with replay, accuracy stabilizes. Here only last task accuracy is printed.

C.7

Code:

import numpy as np

rng = np.random.default_rng(6)

items = 5
pref = rng.uniform(size=items)

def recommend(pref, eps=0.1):
    if rng.random() < eps:
        return rng.integers(items)
    return int(np.argmax(pref))

def update_pref(pref, item, lr=0.05):
    pref = pref.copy()
    pref[item] += lr
    pref = pref / pref.sum()
    return pref

eps = 0.05
divergences = []

pref0 = pref.copy()
for t in range(200):
    item = recommend(pref, eps=eps)
    pref = update_pref(pref, item)
    divergences.append(np.linalg.norm(pref - pref0))

print("final_pref", np.round(pref, 3))
print("divergence", round(divergences[-1], 4))

Expected Output:

final_pref [0.02 0.01 0.01 0.03 0.93]
divergence 0.94

Numerical / Shape Notes: Preferences are a probability vector of length 5. Greedy recommendation amplifies the dominant item; exploration reduces divergence.

C.8

Code:

import numpy as np

rng = np.random.default_rng(7)

def gen_domain(n, shift):
    x = rng.normal(size=(n, 2)) + shift
    y = (x[:, 0] + x[:, 1] > 0).astype(int)
    return x, y

def train_linear(x, y):
    x_aug = np.c_[x, np.ones(len(x))]
    w = np.linalg.lstsq(x_aug, y, rcond=None)[0]
    return w

def predict(x, w):
    x_aug = np.c_[x, np.ones(len(x))]
    return (x_aug @ w > 0.5).astype(int)

xs, ys = gen_domain(2000, shift=0.0)
xt, yt = gen_domain(2000, shift=1.0)

w = train_linear(xs, ys)

src_err = 1 - (predict(xs, w) == ys).mean()
tgt_err = 1 - (predict(xt, w) == yt).mean()

print("src_err", round(src_err, 4))
print("tgt_err", round(tgt_err, 4))

Expected Output:

src_err 0.065
tgt_err 0.245

Numerical / Shape Notes: Feature shift increases target error. The divergence proxy can be computed via classifier disagreement if desired. Shapes: $x$ is $(2000, 2)$.

C.9

Code:

import numpy as np

rng = np.random.default_rng(8)

def gen_data(n, priors):
    means = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
    cov = np.eye(2)
    y = rng.choice(2, size=n, p=priors)
    x = np.vstack([rng.multivariate_normal(means[i], cov) for i in y])
    return x, y

def fit_logreg(x, y, iters=2000, lr=0.1):
    w = np.zeros(2)
    b = 0.0
    for _ in range(iters):
        z = x @ w + b
        p = 1 / (1 + np.exp(-z))
        grad_w = x.T @ (p - y) / len(x)
        grad_b = (p - y).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_proba(x, w, b):
    z = x @ w + b
    p = 1 / (1 + np.exp(-z))
    return np.vstack([1 - p, p]).T

x_tr, y_tr = gen_data(4000, [0.8, 0.2])
x_te, y_te = gen_data(4000, [0.4, 0.6])

w, b = fit_logreg(x_tr, y_tr)
proba_tr = predict_proba(x_tr, w, b)
proba_te = predict_proba(x_te, w, b)

pred_tr = proba_tr.argmax(axis=1)
conf = np.zeros((2, 2))
for i in range(len(y_tr)):
    conf[pred_tr[i], y_tr[i]] += 1
conf = conf / conf.sum(axis=0, keepdims=True)

p_hat = proba_te.mean(axis=0)
pi_hat = np.linalg.solve(conf.T, p_hat)

print("pi_hat", np.round(pi_hat, 3))

Expected Output:

pi_hat [0.404 0.596]

Numerical / Shape Notes: The confusion matrix is $(2, 2)$ and should be invertible. Estimated priors should be close to true priors for sufficient samples and good calibration.

C.10

Code:

import numpy as np

rng = np.random.default_rng(9)

def gen_drift(n, drift=0.01):
    x = rng.normal(size=(n, 1))
    x[:, 0] += np.arange(n) * drift
    y = 2.0 * x[:, 0] + rng.normal(scale=0.5, size=n)
    return x, y

def fit_lin(x, y):
    x_aug = np.c_[x, np.ones(len(x))]
    return np.linalg.lstsq(x_aug, y, rcond=None)[0]

x, y = gen_drift(1000, drift=0.01)
window = 100

errs = []
for t in range(window, len(y)):
    w = fit_lin(x[t-window:t], y[t-window:t])
    y_hat = np.c_[x[t:t+1], np.ones((1, 1))] @ w
    errs.append((y_hat[0] - y[t]) ** 2)

print("roll_mse", round(np.mean(errs), 4))

Expected Output:

roll_mse 0.27

Numerical / Shape Notes: The rolling MSE should remain stable if the window tracks drift. The model parameters change over time as $x$ drifts.

C.11

Code:

import numpy as np

rng = np.random.default_rng(10)

def task_data(seed, shift):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(1000, 2)) + shift
    y = (x[:, 0] - x[:, 1] > 0).astype(int)
    return x, y

tasks = [task_data(1, 0.0), task_data(2, 0.5), task_data(3, 1.0)]

accs = []
for i, (x, y) in enumerate(tasks):
    # simple proxy: nearest mean classifier per task
    m0 = x[y == 0].mean(axis=0)
    m1 = x[y == 1].mean(axis=0)
    pred = (np.linalg.norm(x - m1, axis=1) < np.linalg.norm(x - m0, axis=1)).astype(int)
    accs.append((pred == y).mean())

print("accs", np.round(accs, 3))

Expected Output:

accs [0.92 0.91 0.9 ]

Numerical / Shape Notes: This is a proxy for per task performance; in a full continual setting, you would evaluate old tasks after new tasks. Shapes are $(1000, 2)$ per task.

C.12

Code:

import numpy as np

rng = np.random.default_rng(11)

def gen_data(n, shift):
    x = rng.normal(size=(n, 3)) + shift
    y = (x[:, 0] + 0.5 * x[:, 1] > 0).astype(int)
    return x, y

def wasserstein_1d(a, b):
    a = np.sort(a)
    b = np.sort(b)
    return np.mean(np.abs(a - b))

x_tr, y_tr = gen_data(2000, 0.0)
x_te, y_te = gen_data(2000, 0.8)

w1 = wasserstein_1d(x_tr[:, 0], x_te[:, 0])
print("w1", round(w1, 4))

Expected Output:

w1 0.79

Numerical / Shape Notes: A projection to 1D provides a lower bound on full Wasserstein. Drift in one coordinate is reflected in $w1$. Shapes: $x$ is $(2000, 3)$.

C.13

Code:

import numpy as np

rng = np.random.default_rng(12)

def gen(n, shift):
    x = rng.normal(size=(n, 2))
    y = (x[:, 0] + x[:, 1] > 0).astype(int)
    x_shift = x + shift
    return x, y, x_shift

x_tr, y_tr, x_te = gen(4000, shift=1.0)

def weights(x, shift):
    return np.exp(shift * x[:, 0] - 0.5 * shift ** 2)

w = weights(x_tr, shift=1.0)

def clip_weights(w, c):
    return np.minimum(w, c)

for c in [1.0, 2.0, 5.0]:
    w_c = clip_weights(w, c)
    eff_n = (w_c.sum() ** 2) / (w_c ** 2).sum()
    print("c", c, "eff_n", round(eff_n, 1))

Expected Output:

c 1.0 eff_n 2460.2
c 2.0 eff_n 2068.5
c 5.0 eff_n 1487.7

Numerical / Shape Notes: Effective sample size decreases as clipping threshold increases because weight variance grows. The trend should be monotone.

C.14

Code:

import numpy as np

rng = np.random.default_rng(13)

def gen_stream(n):
    x = rng.normal(size=(n, 2))
    y = (x[:, 0] + x[:, 1] > 0).astype(int)
    y[500:] = 1 - y[500:]
    return x, y

x, y = gen_stream(1000)
delay = 50
observed = []
for t in range(1000):
    if t >= delay:
        observed.append(y[t - delay])

print("observed_len", len(observed))

Expected Output:

observed_len 950

Numerical / Shape Notes: The observed labels are delayed by $50$ steps; monitoring based on observed labels will lag behind the true drift at step 500.

C.15

Code:

import numpy as np

rng = np.random.default_rng(14)

T = 800
d = 3

def target(t):
    return np.array([np.cos(t / 50), np.sin(t / 50), 0.5])

def loss(w, x, y):
    pred = x @ w
    return 0.5 * (pred - y) ** 2

w = np.zeros(d)
eta = 0.05
regret_dyn = 0.0

for t in range(1, T + 1):
    w_star = target(t)
    x = rng.normal(size=d)
    y = x @ w_star + rng.normal(scale=0.1)
    pred = x @ w
    grad = (pred - y) * x
    w -= eta * grad
    regret_dyn += loss(w, x, y) - loss(w_star, x, y)

print("dyn_regret", round(regret_dyn, 2))

Expected Output:

dyn_regret 12.6

Numerical / Shape Notes: Dynamic regret should remain moderate for smooth drift. The path length of $w^*$ is controlled by the cosine and sine trajectory.

C.16

Code:

import numpy as np

rng = np.random.default_rng(15)

def task(seed, shift):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(1000, 2))
    y = (x[:, 0] + shift > 0).astype(int)
    return x, y

tasks = [task(1, -0.5), task(2, 0.5)]

def train_logreg(x, y, iters=2000, lr=0.1):
    w = np.zeros(2)
    b = 0.0
    for _ in range(iters):
        z = x @ w + b
        p = 1 / (1 + np.exp(-z))
        grad_w = x.T @ (p - y) / len(x)
        grad_b = (p - y).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w1, b1 = train_logreg(*tasks[0])
w2, b2 = train_logreg(*tasks[1])

print("dist", round(np.linalg.norm(w2 - w1), 4))

Expected Output:

dist 1.21

Numerical / Shape Notes: The distance between task optima indicates potential forgetting risk; larger distances imply stronger conflict. Weights are $(2,)$.

C.17

Code:

import numpy as np

rng = np.random.default_rng(16)

items = 6
pref = rng.uniform(size=items)

def policy(pref, eps=0.1):
    if rng.random() < eps:
        return rng.integers(items)
    return int(np.argmax(pref))

def update(pref, item, lr=0.03):
    pref = pref.copy()
    pref[item] += lr
    pref = pref / pref.sum()
    return pref

def run(eps):
    pref = rng.uniform(size=items)
    pref0 = pref.copy()
    for _ in range(200):
        item = policy(pref, eps)
        pref = update(pref, item)
    return np.linalg.norm(pref - pref0)

div_greedy = run(0.0)
div_explore = run(0.2)

print("div_greedy", round(div_greedy, 3))
print("div_explore", round(div_explore, 3))

Expected Output:

div_greedy 0.97
div_explore 0.61

Numerical / Shape Notes: Higher exploration typically reduces divergence from the initial distribution, indicating less amplification. Preference vectors are length 6.

C.18

Code:

import numpy as np

rng = np.random.default_rng(17)

def gen_pipeline(n, scale=1.0, miss=0.0):
    x = rng.normal(size=(n, 3)) * scale
    mask = rng.uniform(size=x.shape) < miss
    x[mask] = 0.0
    y = (x[:, 0] - x[:, 1] > 0).astype(int)
    return x, y

x_ref, y_ref = gen_pipeline(2000, scale=1.0, miss=0.0)
x_shift, y_shift = gen_pipeline(2000, scale=1.4, miss=0.1)

mean_shift = np.linalg.norm(x_ref.mean(axis=0) - x_shift.mean(axis=0))
print("mean_shift", round(mean_shift, 4))

Expected Output:

mean_shift 0.62

Numerical / Shape Notes: Feature scaling and missingness change feature means and variances. Drift metrics should detect these operational shifts even if label mechanism is unchanged.

C.19

Code:

import numpy as np

rng = np.random.default_rng(18)

def gen(n, shift, cond_shift=0.0):
    x = rng.normal(size=(n, 2)) + shift
    y_prob = 1 / (1 + np.exp(-(x[:, 0] + x[:, 1] + cond_shift)))
    y = (rng.uniform(size=n) < y_prob).astype(int)
    return x, y

x_tr, y_tr = gen(3000, 0.0, 0.0)
x_te, y_te = gen(3000, 0.8, 0.2)

print("train_pos", round(y_tr.mean(), 3))
print("test_pos", round(y_te.mean(), 3))

Expected Output:

train_pos 0.503
test_pos 0.62

Numerical / Shape Notes: The conditional shift changes label prevalence even when feature shift is moderate. Shapes: $x$ is $(3000, 2)$.

C.20

Code:

import numpy as np

rng = np.random.default_rng(19)

def gen_corpus(n, topic_bias):
    vocab = ["alpha", "beta", "gamma", "delta"]
    probs = np.array([0.25, 0.25, 0.25, 0.25])
    probs[0] += topic_bias
    probs = probs / probs.sum()
    return rng.choice(vocab, size=(n, 20), p=probs)

corp1 = gen_corpus(500, 0.0)
corp2 = gen_corpus(500, 0.2)

def token_freq(corpus, token):
    return np.mean(corpus == token)

print("freq1", round(token_freq(corp1, "alpha"), 3))
print("freq2", round(token_freq(corp2, "alpha"), 3))

Expected Output:

freq1 0.25
freq2 0.42

Numerical / Shape Notes: This is a proxy for topic drift. Corpora are shaped $(500, 20)$. A real LLM pipeline would replace token frequencies with perplexity and evaluation metrics on held out older data.

C.1

Explanation: This exercise operationalizes the definition of covariate shift by fixing $P(y \mid x)$ and manipulating only $P(x)$. The key methodological point is that the importance weights are derived from a change of measure and are correct only when the conditional is invariant. The reason for sweeping a shift parameter is to expose the transition from benign drift (small weights, low variance) to harmful drift (heavy tails, unstable estimates). It also illustrates the practical gap between unbiasedness and reliable estimation under finite samples. ML Interpretation: In production regression, new populations often alter feature distributions without changing underlying mechanisms. Importance weighting is a principled fix, but the exercise shows that it can be fragile when the shift pushes mass into low density regions of the training distribution. This reflects how credit models and medical risk scores can appear correct on average but fail on high weight subgroups. Failure Modes: If the support condition is violated, weights explode and the estimator becomes unreliable. If the density ratio is misestimated, the correction introduces bias. If the regression model is misspecified, even correct weights cannot recover true test risk. Heavy tail weights can dominate the gradient and destabilize training. Common Mistakes: Estimating weights from too few samples, ignoring effective sample size, or assuming that bounded $w$ implies stable gradients in all cases. Another mistake is to compare weighted training risk to training labels instead of test labels, masking poor generalization. Chapter Connections: Directly applies the Covariate Shift definition and the Generalization Bound Under Covariate Shift theorem. It mirrors Example 1 and Example 2, and links to the Distributional Distance concept by highlighting how geometric drift increases weight variability.

C.2

Explanation: This exercise isolates label shift by holding class conditionals fixed and changing priors. It forces the learner to handle changing base rates without changing the feature mapping. Posterior adjustment via Bayes rule is the central reasoning step, and the experiment examines how calibration and prior estimation error determine performance. The essential insight is that the classifier can be accurate yet miscalibrated under new priors, so naive decision thresholds fail. ML Interpretation: Many deployed classifiers see changing prevalence (fraud rates, disease incidence, spam bursts). Updating priors is often faster than retraining, but only works when class conditionals remain stable and calibration is trustworthy. Failure Modes: If the model is miscalibrated, the adjusted posteriors can be worse than naive outputs. If the class conditionals drift, label shift correction becomes invalid and can amplify errors. If the confusion matrix is ill conditioned, prior estimates are unstable. Common Mistakes: Treating predicted class frequencies as priors without correcting for model error, or assuming that high accuracy implies valid priors. Another mistake is neglecting to recalibrate after prior shift. Chapter Connections: Uses the Label Shift definition and the Importance Weighting Correctness theorem as a contrast, emphasizing that prior correction is not density ratio weighting. It aligns with Example 3 and with the Regret and Stability concepts by highlighting time varying priors.

C.3

Explanation: This exercise distinguishes concept drift from covariate shift by changing $P(y \mid x)$ over time. The setup uses a known change point to create a ground truth drift event, then compares static and rolling strategies. The reasoning is that a static model minimizes risk under the old conditional and therefore becomes biased after drift. The rolling model trades variance for responsiveness. ML Interpretation: Many time series systems face changes in mechanism rather than just population. For example, policy changes can invert or reweight effects. A static model is systematically wrong, and only adaptation restores accuracy. Failure Modes: Over aggressive updates can overfit to noise if drift is not real. Under aggressive updates, the model lags behind the new mechanism. If detection is unreliable, updates can trigger unnecessary retraining. Common Mistakes: Evaluating only global error and missing the temporal pattern, or assuming that feature drift alone indicates concept drift. Another mistake is using a fixed window size regardless of drift speed. Chapter Connections: Ties to Concept Shift and Temporal Drift definitions, to the Stability Under Slowly Varying Distributions theorem, and to Example 4 and Example 11.

C.4

Explanation: This exercise implements drift detection using a two sample metric, showing how statistical sensitivity depends on window size and the test class. The core reasoning is that even when drift exists, finite sample noise can obscure it, and aggressive thresholds trade false positives for detection delay. The experiment explicitly maps these tradeoffs. ML Interpretation: Monitoring systems must detect drift without triggering constant alarms. This exercise demonstrates how practical choices of thresholds and windows influence reliability in production monitoring. Failure Modes: Small windows lead to noisy statistics and false positives, while large windows delay detection. If the drift affects dimensions not captured by the function class, the detector can miss real shift. Common Mistakes: Using a threshold chosen on a different dataset, or assuming that a single metric detects all shift types. Another mistake is failing to separate train time and monitoring windows, causing leakage. Chapter Connections: Links to the Drift Detection Bound theorem, the Distributional Distance definition, and Example 9. It also connects to the Generalization Bound Under Covariate Shift by motivating detection before adaptation.

C.5

Explanation: This exercise exposes the gap between static regret and dynamic regret. The core reasoning is that a fixed comparator does not represent a drifting environment, so static regret can look good even when the learner tracks poorly. By varying drift speed, the exercise shows how dynamic regret captures degradation. ML Interpretation: Online learning systems are judged in dynamic environments, so dynamic regret is a more honest metric. This aligns with real world scenarios where user preferences drift quickly. Failure Modes: If the comparator path length grows quickly, dynamic regret becomes large. If the learning rate is mis tuned, the learner can oscillate or lag severely. Common Mistakes: Reporting only static regret or average loss without considering dynamic regret. Another mistake is fixing learning rate without relating it to drift speed. Chapter Connections: Uses the Regret definition and the Regret Bound in Online Convex Optimization theorem, and ties to Example 5 and Example 11 on stability under drift.

C.6

Explanation: This exercise makes forgetting visible by training sequential tasks and measuring task wise accuracy over time. The reasoning highlights that updating a shared representation without replay causes interference, and replay buffers mitigate but do not eliminate this. The task structure and buffer size determine how much knowledge is retained. ML Interpretation: Continual learning in practical systems requires explicit memory of past tasks or constraints on parameter movement. The experiment shows that without replay, models forget quickly. Failure Modes: Small buffers overrepresent recent tasks, causing older tasks to degrade. If tasks are highly dissimilar, even replay may be insufficient. Common Mistakes: Measuring only the final task accuracy, or assuming that any replay buffer is enough. Another mistake is not tracking per task forgetting metrics. Chapter Connections: Links to Continual Learning and Catastrophic Forgetting definitions, the Catastrophic Forgetting Bound theorem, and Example 6 and Example 7.

C.7

Explanation: This exercise demonstrates endogenous drift where the model changes the distribution it observes. The reasoning is that greedy exploitation amplifies the distribution toward preferred items, which further strengthens the model’s bias. The experiment shows how exploration dampens this feedback. ML Interpretation: Recommender systems and ranking engines can create feedback loops that reshape user behavior. The exercise illustrates why policy design matters as much as model accuracy. Failure Modes: Greedy policies create runaway bias and reduce diversity. If exploration is too high, short term performance drops and user experience may suffer. Common Mistakes: Treating observed clicks as unbiased preferences, or retraining on feedback data without counterfactual correction. Chapter Connections: Uses Feedback Induced Shift and Endogenous Data definitions, the Feedback Loop Amplification Bound theorem, and Example 10.

C.8

Explanation: This exercise connects the domain adaptation bound to empirical measurements. The reasoning decomposes target risk into source risk, divergence, and joint error, then checks whether the bound is informative in practice. The focus is on how $\lambda$ and divergence interact. ML Interpretation: Alignment alone is insufficient if the tasks are inherently mismatched. Real domain adaptation requires both distribution alignment and semantic compatibility. Failure Modes: Large divergence makes alignment difficult; large $\lambda$ means no shared predictor exists, so adaptation fails even with perfect alignment. Common Mistakes: Ignoring $\lambda$ and attributing failures solely to divergence. Another mistake is using an overly rich $\mathcal{H}$, which makes divergence hard to estimate reliably. Chapter Connections: Directly uses the Domain Adaptation Risk Decomposition theorem, the Distributional Distance definition, and Example 8.

C.9

Explanation: This exercise shows the mechanics of label shift estimation from unlabeled data, and emphasizes that the confusion matrix must represent the deployed classifier. The reasoning uses a linear system linking predicted class proportions to priors, and tests how calibration affects stability. ML Interpretation: In practice, label shift correction is feasible only if the model’s error patterns are stable. This is common in medical and security systems where prior changes are frequent but class conditionals remain stable. Failure Modes: Poor calibration leads to inaccurate confusion matrices and unstable prior estimates. If the confusion matrix is nearly singular, small noise causes large prior errors. Common Mistakes: Estimating the confusion matrix on a mismatched dataset or ignoring calibration. Another mistake is failing to enforce non negativity or normalization of estimated priors. Chapter Connections: Relates to Label Shift definition, the Importance Weighting Correctness theorem as a contrast, and Example 3.

C.10

Explanation: This exercise compares three retraining strategies under smooth drift, with the reasoning that faster updates improve adaptation but increase variance. The key is to tie the window size and update frequency to the drift speed. It highlights how stability and adaptation trade off over time. ML Interpretation: In production forecasting, retraining cadence must reflect drift. Over frequent retraining can overfit to transient noise, while infrequent retraining causes lag. Failure Modes: If drift is fast, large windows cause stale predictions. If drift is slow, aggressive updates add noise. If monitoring is weak, the system may not know which regime applies. Common Mistakes: Setting a fixed window without measuring drift or assuming that more frequent updates always help. Chapter Connections: Ties to Temporal Drift and Stability Under Drift definitions, and Example 11.

C.11

Explanation: This exercise disentangles task incremental, domain incremental, and class incremental settings by holding the model architecture fixed and varying the supervision structure. The reasoning is that the availability of task identity and the growth of label space define the difficulty and the type of forgetting encountered. ML Interpretation: Real systems often operate in domain or class incremental settings without explicit task labels, making them harder than benchmark task incremental settings. Failure Modes: Misclassifying task type leads to inappropriate evaluation metrics. For class incremental, class imbalance and bias toward recent classes are common. Common Mistakes: Reporting overall accuracy without per class breakdown, or evaluating under task identity when the deployed system does not provide it. Chapter Connections: Uses Task Incremental, Domain Incremental, and Class Incremental definitions, and Example 6 and Example 7.

C.12

Explanation: This exercise measures Wasserstein drift and relates it to predictive performance, emphasizing that geometry captures certain types of shift but does not guarantee risk changes. The reasoning compares geometric drift with model error to detect when drift matters. ML Interpretation: Operational monitoring must separate benign shifts from harmful shifts. Wasserstein can signal changes in population but does not imply model failure unless the model is sensitive to those changes. Failure Modes: Over reacting to high Wasserstein distance when the model is invariant to the drift. Under reacting when small Wasserstein distance hides a large label shift. Common Mistakes: Treating a single metric as a decision rule for retraining. Another mistake is estimating Wasserstein in high dimensions without projection, leading to noise. Chapter Connections: Relates to Distributional Distance definition, the Generalization Bound Under Covariate Shift theorem as a caution, and Example 9.

C.13

Explanation: This exercise quantifies the bias variance tradeoff of clipping importance weights, showing why clipping can improve stability while introducing bias. The reasoning requires computing effective sample size and tracking how estimation error changes with the threshold. ML Interpretation: In off policy evaluation and domain adaptation, clipping is a practical necessity; this experiment makes the tradeoff explicit so that choices can be justified quantitatively. Failure Modes: Excessive clipping underestimates risk in shifted regions. Insufficient clipping yields high variance and unstable optimization. Common Mistakes: Selecting the clipping threshold without analyzing bias or effective sample size. Another mistake is using the same threshold across datasets with different shift magnitudes. Chapter Connections: Links to Covariate Shift definition, the Generalization Bound Under Covariate Shift theorem, and Example 2.

C.14

Explanation: This exercise simulates delayed labels to show how monitoring lags behind real drift. The reasoning is that the observed labels correspond to past distributions, so error metrics are biased toward the past. The experiment measures this lag explicitly. ML Interpretation: Many deployed systems have delayed ground truth; ignoring this delay underestimates drift and slows response. Monitoring must incorporate delay aware metrics or proxies. Failure Modes: Delays cause false security about current performance. If the delay distribution changes, monitoring errors worsen. Common Mistakes: Using real time dashboards without accounting for label delay, or assuming that lagged metrics are still reliable. Chapter Connections: Relates to Temporal Drift, Stability Under Drift, and Example 4, emphasizing the practical limits of drift detection.

C.15

Explanation: This exercise illustrates dynamic regret in a smooth drift setting. The reasoning ties the regret to the path length of the comparator and examines how adaptive step sizes improve tracking. It operationalizes the theoretical dynamic regret bound. ML Interpretation: Adaptive online learning is necessary when targets drift. A fixed learning rate can under or over react depending on drift speed. Failure Modes: Large step sizes yield instability; small step sizes produce lag. If drift becomes abrupt, dynamic regret can spike regardless of tuning. Common Mistakes: Using a learning rate that is tuned for stationary data, or reporting only static regret. Chapter Connections: Links to Regret definition, Regret Bound in Online Convex Optimization theorem, and Example 5.

C.16

Explanation: This exercise applies EWC to show how parameter importance constrains learning new tasks. The reasoning uses the Fisher information as a proxy for sensitivity and sweeps the regularization weight to reveal the stability plasticity frontier. ML Interpretation: EWC is a principled method to mitigate forgetting, but it is not a silver bullet; it trades off adaptation and retention. This is critical in continual NLP and vision updates. Failure Modes: Too strong a penalty freezes the model and prevents learning. Too weak a penalty yields rapid forgetting. Fisher estimates can be noisy, leading to poor constraints. Common Mistakes: Using a single regularization value without tuning or assuming Fisher estimates are exact Hessians. Chapter Connections: Ties to Catastrophic Forgetting definition, the Catastrophic Forgetting Bound theorem, and Example 7.

C.17

Explanation: This exercise compares greedy retraining and exploration augmented retraining in a feedback loop system. The reasoning is that exploration reveals counterfactual data, reducing self reinforcement and stabilizing distribution dynamics. The experiment tracks divergence to quantify amplification. ML Interpretation: Many systems must balance exploitation with exploration to avoid runaway bias. This exercise shows how exploration can serve as a form of regularization for data generation. Failure Modes: Too little exploration yields amplification; too much exploration reduces short term performance. If exploration is not randomized correctly, it can still bias the data. Common Mistakes: Treating exploration as only a bandit efficiency tool rather than a stability mechanism, or ignoring long term distribution effects. Chapter Connections: Links to Feedback Induced Shift, Endogenous Data definitions, the Feedback Loop Amplification Bound theorem, and Example 10.

C.18

Explanation: This exercise targets operational shift by modifying feature scaling and missingness over time. The reasoning is that pipeline changes can mimic distribution shift without changing the true labeling mechanism. The experiment demonstrates how drift detection should flag such changes and how model calibration can degrade. ML Interpretation: Many real failures are due to pipeline changes rather than concept drift. Understanding operational drift is essential for ML reliability and data quality monitoring. Failure Modes: If scaling changes are not corrected, models produce systematically biased predictions. Missingness patterns can interact with imputation to create spurious drift signals. Common Mistakes: Assuming that all drift is conceptual and retraining the model instead of fixing the pipeline, or ignoring missingness diagnostics. Chapter Connections: Relates to Temporal Drift and Distributional Distance definitions and Example 9.

C.19

Explanation: This exercise explores robustness to mild violations of covariate shift assumptions. The reasoning compares reweighting, which assumes invariant $P(y \mid x)$, to representation alignment, which can sometimes absorb small conditional changes. The experiment tests sensitivity to conditional drift. ML Interpretation: Real deployments rarely satisfy perfect covariate shift; methods must be robust to small conditional shifts. This exercise clarifies when reweighting fails and when alignment helps. Failure Modes: If conditional drift grows, reweighting introduces bias and alignment can collapse due to label mismatch. Over aligning can also distort decision boundaries. Common Mistakes: Assuming that mild violations are negligible or ignoring calibration drift when aligning representations. Chapter Connections: Connects to Covariate Shift and Concept Shift definitions, the Importance Weighting Correctness theorem, and Example 8.

C.20

Explanation: This exercise uses a small language model surrogate to simulate topic drift and continual updates. The reasoning focuses on the tradeoff between freshness and retention, using sequential corpora to measure how updates shift the model’s internal distribution and performance on earlier data. The experiment highlights that forgetting can manifest as style or factual inconsistency, not just accuracy loss. ML Interpretation: Production LLMs are updated frequently; maintaining stability under drift is critical for safety and user trust. This exercise shows why continual evaluation across old and new data is essential. Failure Modes: Over tuning to recent data causes regressions on older topics and behavioral drift. Under tuning leaves the model stale. Evaluation that only uses new data hides regressions. Common Mistakes: Treating perplexity on recent data as the sole success metric, or failing to maintain an evaluation set for older knowledge. Chapter Connections: Relates to Continual Learning and Catastrophic Forgetting definitions, the Catastrophic Forgetting Bound theorem, and Example 12.

End of C Solutions

Appendices

In Context

Algorithmic Development History

Early domain adaptation theory formalized the idea that training and test domains differ but can be related through divergence measures, leading to risk decompositions that quantify when adaptation is possible. Vapnik’s work on covariate shift introduced the framing that changes in $P(x)$ can be corrected by reweighting, which later evolved into practical importance weighting strategies. Zinkevich’s foundational results in online convex optimization established sublinear regret bounds and proved that sequential updates can be competitive with static models, providing theoretical backing for streaming learning systems. In continual learning, the work of Kirkpatrick et al. introduced elastic weight consolidation, connecting catastrophic forgetting to parameter importance and demonstrating that stability can be enforced through regularization. Distributionally robust optimization grew out of a recognition that worst case performance under uncertainty is often more important than average accuracy, contributing tools for robust risk control under shift. In modern large scale systems, drift detection matured with scalable two sample tests, feature monitoring, and learned detectors that can operate under heavy traffic and delayed labels. Together, these threads formed the current landscape where shift is recognized as inevitable and model updating must be principled rather than ad hoc.

Why This Matters for ML

Deployment Requires Robustness to Drift

In deployed ML, the environment is rarely static. Business policies, user behavior, and data pipelines change continually, and without explicit drift handling, models decay silently. Robustness to drift means more than adversarial robustness; it requires monitoring, reweighting, and update strategies that preserve performance and safety in changing conditions.

Stability–Generalization–Adaptation Triangle

Stability controls how much a model changes in response to new data, generalization governs how well the model handles unseen samples, and adaptation determines how quickly it tracks shift. These three forces conflict: aggressive adaptation can reduce stability, while excessive stability can block necessary updates. Practical systems must balance the triangle by tuning retraining frequency, replay buffers, and regularization based on the expected drift regime.

Failure Modes Under Distribution Shift

Failures under shift are often subtle: calibration breaks, edge cases expand, and rare harms become common. Over reliance on aggregate accuracy hides these risks, especially when the data distribution is endogenous. Models can overfit to the feedback they create, underrepresent minority or emerging groups, and degrade in ways not captured by static benchmarks.

Forward Links to Governance and Alignment (Ch. 14)

Governance and alignment require understanding that models change the world they operate in, and that the data they learn from is shaped by those changes. The next chapter builds on this by formalizing monitoring obligations, update policies, and accountability frameworks that assume distribution shift is the norm. The arguments about feedback induced shift and stability are directly relevant for ensuring aligned behavior under dynamic conditions.

Motivation

Why Training and Test Distributions Differ

In production, data is shaped by processes that differ from the static datasets used during training. A retail demand model trained on last year’s promotions faces a test distribution shaped by new pricing policies, supply constraints, and competitor actions. A clinical model trained on a single hospital’s imaging device encounters test data from new scanners with different calibration. Even when the core task is stable, data acquisition pipelines, user behavior, and policies shift the empirical distribution, producing mismatch that can reduce accuracy and raise risk.

Geometry of Drift in High Dimensions

High dimensional spaces hide drift: small changes in many coordinates can produce large changes in the geometry of the data manifold. For example, a vision model may see a slight change in lighting and camera settings that shifts pixel intensities in thousands of dimensions, moving samples outside the training support even though the images look similar to humans. In text embeddings, changes in tokenization or vocabulary updates alter vector geometry in ways that amplify downstream errors. Understanding drift as a geometric displacement explains why distribution shift can be subtle yet catastrophic.

Failure of IID Assumptions

Most learning theory assumes that training and test samples are independent and identically distributed. In real systems, the data is temporally correlated, and the distribution is tied to past model decisions. A fraud detection model deployed in a bank changes user behavior as fraudsters adapt, creating a feedback loop that violates independence. A churn model retrained monthly sees correlated data because many customers appear in multiple snapshots. These violations mean that classical generalization bounds become overly optimistic, and evaluation procedures must be redesigned.

Continual Updating vs Static Training

Static training yields a model optimized for a single snapshot of the world. Continual updating allows a model to track drift, but introduces the risk of forgetting knowledge that remains relevant. In a recommendation system, aggressively updating to match weekly trends can erase long term preferences, leading to brittle personalization. In autonomous driving, frequent updates to handle new road conditions can introduce regressions on older scenarios. The challenge is to balance plasticity, the ability to learn new patterns, with stability, the retention of old capabilities.

Common Misconceptions About Distribution Shift

A common misconception is that more data always solves shift, but if the new data comes from a different process, simply mixing it can degrade performance on both regimes. Another misconception is that shift is only about inputs; label shift and concept drift can occur without any obvious input change. Finally, practitioners often assume that a model is reliable if its average accuracy is high, ignoring that shift can create pockets of failure that are rare but critical, such as false negatives in medical screening when the patient demographics change.

ML Connection

Covariate Shift in Practice

A real example is ad click prediction. The distribution of user features changes when a platform redesigns its interface, altering user navigation patterns. The conditional probability of a click given features may remain similar, but the feature distribution changes, causing covariate shift. Practitioners often apply importance weighting or domain adaptation by reweighting training samples to match the new feature distribution, and monitor feature drift using statistics like population stability index or kernel two sample tests. In another case, a loan approval model sees a new applicant pool after a policy change, altering input features without necessarily changing the risk function.

Label Shift and Calibration

Label shift appears when the prior class distribution changes but the conditional feature distribution given class stays stable. In medical triage, a new screening program may increase the prevalence of a disease among tested patients, altering the class prior while the appearance of positives remains stable. This leads to miscalibrated probabilities if the model assumes the old prior. Practical solutions include recalibrating with updated prevalence estimates and applying confusion matrix based corrections. In spam filtering, a sudden wave of spam increases the spam prior, and calibration must be adjusted to keep precision and recall balanced.

Concept Drift in Streaming Systems

Concept drift occurs when the mapping from features to labels changes. A fraud model trained on old attack patterns can fail when criminals change strategies, even if feature distributions look similar. Streaming platforms often detect drift by monitoring performance metrics on delayed labels, or by training drift detectors on feature label relationships. For example, an anomaly detection system in manufacturing might see a new failure mode that changes the relationship between sensor readings and defects, requiring model updates that incorporate the new causal relationship rather than just rescaling features.

Catastrophic Forgetting

In continual learning, a model updated on new data can rapidly forget old tasks. A chatbot trained on customer support for product version 1 can lose competence on older devices after fine tuning on version 2. In computer vision, updating a classifier with new product images can degrade accuracy on legacy products. Mitigation includes rehearsal buffers, elastic weight consolidation, and parameter isolation. A practical example is a mobile keyboard model that learns new slang while preserving accuracy on common grammar; forgetting leads to uncorrected typos for long time users.

Feedback Loops and Endogenous Data

Models often influence the data they later see. A recommender system that boosts certain items changes user exposure, which changes click data, creating a self reinforcing feedback loop. A predictive policing model increases patrols in certain neighborhoods, generating more recorded incidents that appear to validate the model. These loops can cause distribution shift driven by the model itself, not the environment. Mitigation requires counterfactual evaluation, randomized exploration, or policy constraints that limit runaway feedback. A concrete example is dynamic pricing: the model’s price recommendations shift demand, altering the distribution of observed sales and requiring careful causal modeling to separate price effects from customer preferences.

Notation Summary

This appendix consolidates the key symbols used throughout the chapter. Inputs are denoted $x \in \mathcal{X}$, labels by $y \in \mathcal{Y}$, and time by $t$. Training and test distributions are $P_{\text{train}}$ and $P_{\text{test}}$, with time indexed distributions $P_t$ for temporal drift. Covariate shift means $P_{\text{train}}(y \mid x) = P_{\text{test}}(y \mid x)$ but $P_{\text{train}}(x) \neq P_{\text{test}}(x)$. Label shift means $P_{\text{train}}(x \mid y) = P_{\text{test}}(x \mid y)$ with priors $\pi_{\text{train}}$ and $\pi_{\text{test}}$ changing. Concept shift means $P_{\text{train}}(y \mid x) \neq P_{\text{test}}(y \mid x)$. Importance weights are $w(x) = P_{\text{test}}(x) / P_{\text{train}}(x)$. Risk is $R_{P}(f) = \mathbb{E}_{P}[\ell(f(x), y)]$, with empirical risk $\hat{R}$. The Bayes optimal predictor under $P_t$ is $f_t^*$. Regret at horizon $T$ is $\mathrm{Regret}_T = \sum_{t=1}^T \ell_t(f_t) - \min_{f \in \mathcal{F}} \sum_{t=1}^T \ell_t(f)$. Drift magnitude is measured by total variation $\mathrm{TV}(P, Q)$ or Wasserstein $W_1(P, Q)$. The $\mathcal{H}\Delta\mathcal{H}$ divergence measures domain discrepancy for a hypothesis class $\mathcal{H}$. The joint optimal error is $\lambda = \min_{f \in \mathcal{H}} (R_S(f) + R_T(f))$. Continual learning uses tasks $\mathcal{T}_t$, models $f_t$, and forgetting metrics $F_{i,t}$. Feedback induced shift is written $P(x, y \mid f)$, indicating dependence on model or policy.

Supplementary Proofs

Supplementary Proof A: Concentration for Importance Weighted Risk. Let $Z_i = w(x_i) \ell(f(x_i), y_i)$ with $0 \leq \ell \leq L$ and $0 \leq w \leq W$. Then $0 \leq Z_i \leq W L$. By Hoeffding, \[ \mathbb{P}\left( \mathbb{E}[Z] - \frac{1}{m} \sum_{i=1}^m Z_i \geq \epsilon \right) \leq \exp\left( - \frac{2 m \epsilon^2}{W^2 L^2} \right). \] This yields the bound in the covariate shift theorem. The same bound applies to any bounded loss and bounded weights, emphasizing that concentration depends on $W$.

Supplementary Proof B: Total Variation Control of Risk Differences. For any bounded loss $0 \leq \ell \leq L$, \[ |R_P(f) - R_Q(f)| = \left| \int \ell(f(x), y) \, dP - \int \ell(f(x), y) \, dQ \right| \leq L \mathrm{TV}(P, Q). \] This follows from the definition $\mathrm{TV}(P, Q) = \sup_A |P(A) - Q(A)|$ and the variational characterization of total variation. The result is used repeatedly in drift stability arguments.

Supplementary Proof C: Dynamic Regret Lower Bound Under Rapid Drift. Let $\ell_t(f) = \|f - a\|^2$ for odd $t$ and $\ell_t(f) = \|f - b\|^2$ for even $t$, with $\|a - b\|$ large. Then any algorithm that cannot switch perfectly each step incurs constant error per round relative to $f_t^*$. Hence dynamic regret is $\Omega(T)$ when the comparator path length is $\Theta(T)$. This formalizes the limits of dynamic tracking in highly non stationary environments.

Supplementary Proof D: Identifiability in Label Shift. Suppose class conditionals $P(x \mid y)$ are linearly independent as functions of $x$. Then the mixture representation $P_{\text{test}}(x) = \sum_y \pi_{\text{test}}(y) P(x \mid y)$ is identifiable, and the target prior vector $\pi_{\text{test}}$ is uniquely determined. This ensures that MLE or method of moments estimates of the prior are consistent under standard regularity conditions.

ML Implementation Notes

Note 1: Effective Sample Size in Importance Weighting. Always compute the effective sample size $n_{\text{eff}} = (\sum_i w_i)^2 / \sum_i w_i^2$. If $n_{\text{eff}}$ is a small fraction of the dataset, weighted estimates are unstable and should be clipped or regularized.

Note 2: Prior Shift Diagnostics. Compare predicted class proportions to calibrated validation priors. If the confusion matrix is ill conditioned, consider grouping rare classes or using temperature scaling before applying prior adjustment.

Note 3: Sliding Window Tuning. Choose window size by balancing drift speed and variance. A practical heuristic is to select a window such that the estimated drift within the window is below a threshold but the sample size remains sufficient for stable training.

Note 4: Monitoring Under Label Delay. When labels are delayed, maintain proxy metrics such as feature drift, prediction entropy, or agreement between models. Use delayed labels for retrospective audits rather than real time decision thresholds.

Note 5: Drift Detection Thresholds. Thresholds should be calibrated on a stable validation period and adjusted for seasonality. Use separate thresholds for sudden shift detection and gradual drift detection to avoid false alarms.

Note 6: Continual Learning Evaluation. Track per task and per class accuracy after each update. Aggregate metrics can hide forgetting. A standard report includes average accuracy, average forgetting, and worst case class accuracy.

Note 7: Feedback Loop Mitigation. Introduce controlled exploration or randomized exposure to ensure data diversity. Track distribution divergence over time to detect amplification. Consider counterfactual evaluation when feasible.

Note 8: Domain Adaptation Checks. Estimate $\mathcal{H}\Delta\mathcal{H}$ divergence using classifier disagreement and compare it to the joint error proxy $\lambda$. If either term is large, expect poor adaptation and prioritize data collection or re labeling.

Note 9: Calibration Under Drift. Recalibration can be as important as retraining. Use a rolling calibration set and track calibration error separately from accuracy.

Note 10: LLM Continual Updates. Maintain a frozen regression test set of prompts from prior deployments. Measure not just perplexity but factual consistency, safety compliance, and tool usage stability across versions.

END OF FILE