Paper deep dive
A Quantitative Characterization of Forgetting in Post-Training
Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%
Last extracted: 3/22/2026, 6:32:15 AM
Summary
This paper provides a theoretical framework for understanding catastrophic forgetting in generative model post-training by modeling the process as a two-mode mixture (old and new tasks). It distinguishes between 'mass forgetting' (collapse of old mixture weight) and 'old-component drift' (parameter shift). The authors prove that forward-KL objectives (SFT) drive mass forgetting, while reverse-KL objectives (RL) preserve mass and limit drift via overlap-gated mechanisms. The study also analyzes the impact of replay and evaluates specific post-training methods like SDFT, TTT-Discover, and OAPL.
Entities (7)
Relation Signals (4)
Forward-KL → causes → Mass Forgetting
confidence 95% · forward-KL objectives trained on data from the new distribution drive the old weight to zero
Reverse-KL → prevents → Mass Forgetting
confidence 95% · reverse-KL objectives converge to the true target (thereby avoiding mass forgetting)
Replay → modifies → Forward-KL
confidence 90% · For forward-KL, replay must modify the training distribution to change the population optimum
SDFT → implements → Reverse-KL
confidence 85% · SDFT behaves like a reverse-KL update toward an evolving teacher distribution
Cypher Suggestions (2)
Identify training objectives that prevent mass forgetting. · confidence 95% · unvalidated
MATCH (o:Objective)-[:PREVENTS]->(f:ForgettingMechanism {name: 'Mass Forgetting'}) RETURN o.name
Find all post-training methods and their relationship to forgetting mechanisms. · confidence 90% · unvalidated
MATCH (m:Method)-[:PREVENTS|INDUCING]->(f:ForgettingMechanism) RETURN m.name, f.name
Abstract
Abstract: Continual post-training of generative models is widely used, yet a principled understanding of when and why forgetting occurs remains limited. We develop theoretical results under a two-mode mixture abstraction (representing old and new tasks), proposed by Chen et al. (2025) (arXiv:2510.18874), and formalize forgetting in two forms: (i) mass forgetting, where the old mixture weight collapses to zero, and (ii) old-component drift, where an already-correct old component shifts during training. For equal-covariance Gaussian modes, we prove that forward-KL objectives trained on data from the new distribution drive the old weight to zero, while reverse-KL objectives converge to the true target (thereby avoiding mass forgetting) and perturb the old mean only through overlap-gated misassignment probabilities controlled by the Bhattacharyya coefficient, yielding drift that decays exponentially with mode separation and a locally well-conditioned geometry with exponential convergence. We further quantify how replay interacts with these objectives. For forward-KL, replay must modify the training distribution to change the population optimum; for reverse-KL, replay leaves the population objective unchanged but prevents finite-batch old-mode starvation through bounded importance weighting. Finally, we analyze three recently proposed near-on-policy post-training methods, SDFT (arXiv:2601.19897), TTT-Discover (arXiv:2601.16175), and OAPL (arXiv:2602.19362), via the same lens and derive explicit conditions under which each retains old mass and exhibits overlap-controlled drift. Overall, our results show that forgetting can be precisely quantified based on the interaction between divergence direction, geometric behavioral overlap, sampling regime, and the visibility of past behavior during training.
Tags
Links
- Source: https://arxiv.org/abs/2603.12163v1
- Canonical: https://arxiv.org/abs/2603.12163v1
PDF not stored locally. Use the link above to view on the source site.
Full Text
233,972 characters extracted from source content.
A Quantitative Characterization of Forgetting in Post-Training Krishnakumar Balasubramanian1,2, Shiva Prasad Kasiviswanathan2 1Department of Statistics, University of California, Davis 2Amazon Abstract Continual post-training of generative models is widely used, yet a principled understanding of when and why forgetting occurs remains limited. We develop theoretical results under a two-mode mixture abstraction (representing old and new tasks), proposed by Chen et al. (2025), and formalize forgetting in two forms: (i) mass forgetting, where the old mixture weight collapses to zero, and (ii) old-component drift, where an already-correct old component shifts during training. For equal-covariance Gaussian modes, we prove that forward-KL objectives trained on data from the new distribution drive the old weight to zero, while reverse-KL objectives converge to the true target (thereby avoiding mass forgetting) and perturb the old mean only through overlap-gated misassignment probabilities controlled by the Bhattacharyya coefficient, yielding drift that decays exponentially with mode separation and a locally well-conditioned geometry with exponential convergence. We further quantify how replay interacts with these objectives. For forward-KL, replay must modify the training distribution to change the population optimum; for reverse-KL, replay leaves the population objective unchanged but prevents finite-batch "old-mode starvation" through bounded importance weighting. Finally, we analyze three recently proposed near-on-policy post-training methods, SDFT (Shenfeld et al., 2026), TTT-Discover (Yuksekgonul et al., 2026), and OAPL (Ritter et al., 2026), via the same lens and derive explicit conditions under which each retains old mass and exhibits overlap-controlled drift.
Overall, our results show that forgetting can be precisely quantified based on the interaction between divergence direction, geometric behavioral overlap, sampling regime, and the visibility of past behavior during training.

1 Introduction

Continual learning investigates how sequentially trained models can acquire new capabilities without erasing old ones, a process fundamentally challenged by catastrophic forgetting, where performance on earlier tasks rapidly degrades. While the literature contains many algorithmic responses (see Section 1.2), the mechanisms behind forgetting are less unified, especially for post-training pipelines in modern generative models whose "behavior" is best represented as a probability distribution over outputs. In this paper, by viewing training procedures as a divergence-minimization or distribution-matching step, we ask a basic question: Can we precisely quantify when a post-training procedure induces forgetting and when it does not? We aim to answer this question by studying the two-mode mixture model proposed by Chen et al. (2025) that abstracts a continual-learning step into "old" and "new" distributions. Let p_o and p_n denote the old and new data-generating distributions over an output space Y (the subscripts o and n denote old and new respectively). We define a true target mixture

p_α(y) = α p_o(y) + (1−α) p_n(y),  α ∈ [0,1],

which represents the ideal outcome of learning the new behavior while retaining an α fraction of the old behavior. We consider a learner model family that explicitly contains two components (one intended for the old mode and one for the new mode),

q_β(y) = β q_o(y) + (1−β) q_n(y),  β ∈ [0,1].

In this formulation, the parameters to be learned consist of the mixture weight β and the parameters governing the component distributions q_o and q_n of the model.
Throughout this paper, we assume that the component q_o has already been trained to approximate the old distribution p_o. Continual learning in this setup (and more broadly) typically refers to the process of learning the new distribution p_n while preserving the previously learned behavior encoded by p_o. In the mixture formulation above, this corresponds to updating the parameters of the component q_n and the mixture weight β so that the learned model q_β approximates the target mixture p_α, while ensuring that the component q_o continues to represent the previously learned distribution p_o.

We distinguish two distinct forms of forgetting in this continual learning setup. Mass forgetting corresponds to the collapse of mixture mass on the old mode, whereas old-component drift occurs when the model retains nonzero mass on the old mode but its parameters move away from the true old distribution:

(i) Mass Forgetting (Mass Collapse): This occurs when the optimal mixture weight satisfies β⋆ = 0, meaning that the learned model places zero mass on the old mode. Equivalently, the mixture weight on the old component undergoes mass collapse. We show that this can arise even when q_o(y) = p_o(y) and q_n(y) = p_n(y), i.e., when the learner is given the correct forms of both the old and new distributions. In this setting the only learnable parameter is the mixing proportion β. Thus β⋆ = 0 (instead of the desired β⋆ = α) represents a strong form of forgetting in which the learned model discards the old behavior despite having access to its exact distribution.

(ii) Old-Component Drift: This occurs when, during continual training, the parameters of the learned old component q_o drift away from the true old distribution p_o.
In this case the model may still allocate nontrivial mass to the old mode (i.e., β need not collapse to zero), but the parameters governing the old component shift so that the learned distribution no longer faithfully represents the original behavior. For example, in a location family this corresponds to the mean parameter of the old component drifting away from the true old mean. This setting is intentionally minimal (Appendix D shows that similar conclusions hold for finite-mixture models): there are only two modes, the model family is expressive enough to represent both, and yet the aforementioned forms of forgetting can still occur. The benefit of this simplicity is that it enables exact decompositions and sharp theorems that cleanly separate objective-driven forgetting from representational limits.

A central theme in this work, motivated by modern post-training pipelines, is the contrast between training objectives based on the forward and reverse KL divergences, given respectively by

min_θ KL(p ‖ q_θ)  and  min_θ KL(q_θ ‖ p).

The forward KL is the population analogue of maximum likelihood on a "data" distribution p. In the context of the model above, forward KL corresponds to SFT-based training with only new data (i.e., p = p_n). The reverse-KL objective is the population analogue of matching the model to a target distribution p under on-policy sampling from q_θ (a common lens for KL-regularized policy improvement and RL-style updates). We consider the case where p_o = N(μ_o, Σ) and p_n = N(μ_n, Σ) with separation δ := ‖μ_n − μ_o‖_{Σ^{-1}}. The learner model q is parameterized as a two-component mixture with weight β and component means (m_o, m_n), with both components sharing covariance Σ. In this setting we can explicitly analyze the dynamics of forgetting under the different training objectives described above.
To isolate the effect of learning the mixture weight, we further set m_o = μ_o and m_n = μ_n in the learner model and optimize only over β. Our first main result shows that, in this setting, the new-data-only forward-KL SFT training objective

L_SFT(β) := KL(p_n ‖ q_β)

is strictly increasing in β (even when the component shapes are already correct), so β⋆ = 0 is the unique population minimizer. Moreover, under logit-parameterized gradient flow, the trajectory β(t) (corresponding to the population training objective) decreases monotonically to 0. The analysis reveals an intuitive mechanism: the gradient is given by the difference between the current old mass β and the expected old responsibility (i.e., the average posterior probability, under a given sampling distribution, that an observation is assigned to the old component) under the new data distribution. When the modes are well separated, this responsibility is exponentially small, so the update effectively reduces to repeatedly shrinking β until the old mode vanishes.

Our second main result considers reinforcement learning with a reverse-KL objective

L_RL(β, m_o, m_n) := KL(q_{β, m_o, m_n} ‖ p_α),

which corresponds to KL-regularized on-policy RL updates toward a target distribution that explicitly retains the old behavior (see Appendix A for details). When the old component is already correct (e.g., m_o = μ_o), the gradient with respect to the old parameters admits an exact decomposition: the only terms capable of moving the old mode arise from misassignment probabilities, i.e., responsibilities that incorrectly attribute an old-mode sample to the new component and vice versa under the target. These misassignment probabilities are controlled by an overlap quantity (the Bhattacharyya coefficient), yielding bounds that decay exponentially with the squared Mahalanobis distance between the means for equal-covariance Gaussians.
Consequently, in the well-separated regime, reverse-KL updates can meaningfully adjust the new mode while perturbing the old mode only through exponentially small overlap effects. Finally, a local Polyak–Łojasiewicz (PL) analysis shows that, in sufficiently separated regimes, the reverse-KL objective exhibits a favorable local geometry that implies exponential convergence under gradient flow.

| Method | Prevents Mass Forgetting? | Controls Old-Component Drift? |
| --- | --- | --- |
| Forward-KL (SFT) | ✗ (Theorem 2.2) | ✓ (but unimportant as mass collapses) |
| Reverse-KL (RL) | ✓ (Theorem 2.3) | ✓ Exponentially small in δ (Theorem 2.3) |
| SDFT (Shenfeld et al., 2026) | ✓ If demonstrator strength is > 0 (Theorem 3.1(A)) | ✓ Finite total drift⋆ (Theorem 3.1(B)) |
| TTT-Discover (Yuksekgonul et al., 2026) | ✓ If anchor sufficiently strong; collapses if anchor too weak (Theorem 3.2(A)) | ✓ Exponentially small in δ (Theorem 3.2(B)) |
| OAPL (Ritter et al., 2026; Brantley et al., 2025) | Partial: bounded by old-mode weight of the frozen reference policy (Theorem 3.3(A)) | ✓ Exponentially small in δ (Theorem 3.3(B)) |

Table 1: Summary of forgetting behavior across training objectives. "Prevents Mass Forgetting" means the population optimum satisfies β⋆ > 0. "Controls Old-Component Drift" means the gradient ‖∇_{m_o} L‖ is provably small when m_o = μ_o. Mode separation δ = ‖μ_n − μ_o‖_{Σ^{-1}}. ⋆: We show it is exponentially small under additional assumptions; see Remark 3.2.

Effect of Replay on SFT and RL. We also examine the effect of replay and quantify how it interacts with forward- and reverse-KL objectives in fundamentally different ways. For forward-KL (SFT), replay prevents mass forgetting only when it enters the training distribution (i.e., the numerator of the objective): mixing a λ fraction of old samples into the data shifts the population optimum to retain β⋆ = λ.
In contrast, mixing old samples only on the model side leaves the learned parameter collapsing to β⋆ = 0 and merely imposes an external retention floor. For reverse-KL (KL-regularized RL), replay does not alter the population objective but instead addresses a finite-batch failure mode. By ensuring that old-mode samples appear in minibatches with high probability and using bounded importance weights, replay preserves the same reverse-KL gradient in expectation while preventing stochastic "old-mode starvation" that can otherwise mimic new-only updates. Together, these results show that replay plays fundamentally different roles in the two settings: for SFT it modifies the population objective, whereas for reverse-KL methods it improves the stochastic optimization dynamics.

Near-on-policy Methods. We next consider three recent near-on-policy post-training methods, namely SDFT (Shenfeld et al., 2026), TTT-Discover (Yuksekgonul et al., 2026), and OAPL (Ritter et al., 2026; Brantley et al., 2025). Our mixture-model analysis reveals a sharp difference between these algorithms. SDFT behaves like a reverse-KL update toward an evolving teacher distribution generated from the model itself based on a demonstrator. It avoids mass forgetting if the demonstrator is strong enough, while also avoiding old-component drift. TTT-Discover's entropic objective is intrinsically mode-seeking: without a sufficiently strong KL anchor it can still collapse mass onto the higher-reward mode, although the drift of an already-correct old mode remains overlap-gated and decays exponentially with separation. OAPL behaves differently because its target is an exponential tilt of a frozen reference policy: it can only preserve or reweight modes already present in that reference, but its parametric updates are likewise geometrically local, with cross-mode influence controlled by exponentially small overlap terms.
Together, these results show that these three methods inherit the stability of on-policy learning, but forgetting and retention are governed by different mechanisms. A summary of our results and conclusions is provided in Table 1.

1.1 Intuition via Disjoint-support Case

As a prelude to the results in the rest of the paper, in this section we study the limiting case where each "mode" lives on a separate region of the sample space. The core intuition behind our general results is well captured by this simplified setup.

Definition 1.1. Let (Y, F, μ) be a measurable space with reference measure μ. Assume there exist measurable sets A_o, A_n ⊆ Y forming a partition: A_o ∩ A_n = ∅, A_o ∪ A_n = Y. Assume the component densities satisfy

p_o(y) = 0 = q_o(y) for y ∉ A_o,  p_n(y) = 0 = q_n(y) for y ∉ A_n,

where q_o(·) and q_n(·) are the (model) component densities. Define the mixtures

p_α(y) = α p_o(y) + (1−α) p_n(y),  q_β(y) = β q_o(y) + (1−β) q_n(y),

and note that on A_o we have p_α = α p_o and q_β = β q_o, and similarly on A_n.

In this disjoint-support limit, both forward and reverse KL admit exact decompositions into a mixture-weight term and within-mode terms. This makes the key point transparent: if training uses new-only data (p = p_n), then the forward-KL objective contains an explicit penalty that is strictly increasing in the old-mode weight β, and thus the unique optimizer is β⋆ = 0. Crucially, this collapse occurs regardless of how well the old component is modeled, because the new-only objective has no incentive to allocate probability mass to a region it never observes. This provides a clean caricature of catastrophic forgetting as mass collapse driven by off-policy training.

Lemma 1.1 (Exact KL decompositions under disjoint supports).
Under the disjoint-support assumption,

KL(p_α ‖ q_β) = α log(α/β) + (1−α) log((1−α)/(1−β)) + α KL(p_o ‖ q_o) + (1−α) KL(p_n ‖ q_n),

KL(q_β ‖ p_α) = β log(β/α) + (1−β) log((1−β)/(1−α)) + β KL(q_o ‖ p_o) + (1−β) KL(q_n ‖ p_n).

Proof of Lemma 1.1. We prove the first identity; the second is analogous. Split the integral over A_o ∪ A_n:

KL(p_α ‖ q_β) = ∫ p_α log(p_α/q_β) dμ = ∫_{A_o} α p_o log(α p_o/(β q_o)) dμ + ∫_{A_n} (1−α) p_n log((1−α) p_n/((1−β) q_n)) dμ.

Inside each region, expand the log: log(α p_o/(β q_o)) = log(α/β) + log(p_o/q_o), and similarly for the new mode. Using ∫_{A_o} p_o dμ = ∫_{A_n} p_n dμ = 1 yields the stated decomposition. ∎

Remark 1.1 (Exact mode locality for shape parameters). If q_o(·) and q_n(·) have separate parameter vectors (say θ_o and θ_n), Lemma 1.1 implies that, holding mixture weights fixed, minimizing either KL with respect to θ_o depends only on the old-mode divergence, and similarly for θ_n. In particular, if q_o = p_o, then ∇_{θ_o} KL(q_β ‖ p_α) = 0 and ∇_{θ_o} KL(p_α ‖ q_β) = 0: the old mode is exactly stationary while the new mode can be updated.

Remark 1.2 (Why Weights Can Still Collapse Under Forward KL). Even in the disjoint-support case, if the training distribution is new-only, i.e., p = p_n, then for the forward-KL objective KL(p_n ‖ q_β) the decomposition reduces to

KL(p_n ‖ q_β) = log(1/(1−β)) + KL(p_n ‖ q_n),

which is strictly increasing in β. Thus, optimizing forward KL on new-only data drives β → 0 (all mass on the new mode), regardless of how well the old mode was modeled. This is a clean caricature of catastrophic forgetting by mass collapse. By contrast, for reverse KL, the weight term β log(β/α) + (1−β) log((1−β)/(1−α)) penalizes moving β far from α.
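Lemma 1.1 and Remark 1.2 can be sanity-checked on a toy discrete example (a minimal sketch; the six-point support and all probability values below are invented for illustration):

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence: sum p log(p/q), with the convention 0 log 0 = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Disjoint supports: old mode lives on indices 0-2, new mode on indices 3-5.
p_o = np.array([0.5, 0.3, 0.2, 0.0, 0.0, 0.0])
q_o = np.array([0.4, 0.4, 0.2, 0.0, 0.0, 0.0])
p_n = np.array([0.0, 0.0, 0.0, 0.6, 0.3, 0.1])
q_n = np.array([0.0, 0.0, 0.0, 0.5, 0.25, 0.25])

alpha, beta = 0.3, 0.45
p_mix = alpha * p_o + (1 - alpha) * p_n
q_mix = beta * q_o + (1 - beta) * q_n

# Lemma 1.1: forward KL = mixture-weight term + within-mode terms (exact).
lhs = kl(p_mix, q_mix)
rhs = (alpha * np.log(alpha / beta)
       + (1 - alpha) * np.log((1 - alpha) / (1 - beta))
       + alpha * kl(p_o, q_o) + (1 - alpha) * kl(p_n, q_n))
assert abs(lhs - rhs) < 1e-12

# Remark 1.2: new-only forward KL equals log(1/(1-beta)) + KL(p_n || q_n) ...
assert abs(kl(p_n, q_mix) - (np.log(1 / (1 - beta)) + kl(p_n, q_n))) < 1e-12

# ... which is strictly increasing in beta, so it is minimized at beta = 0.
vals = [np.log(1 / (1 - b)) + kl(p_n, q_n) for b in (0.0, 0.2, 0.4, 0.6)]
assert all(v1 < v2 for v1, v2 in zip(vals, vals[1:]))
```

The weight term and the within-mode terms separate exactly because each mixture restricts to a single component on each support region.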
1.2 Related Works

From a methodological perspective, continual-learning methods can be broadly categorized by the mechanism they use to maintain exposure to previously learned behaviors (i.e., information about p_o(y) in our setup) during training. In practice, this is done either through an explicit memory oracle that stores and replays past data or through a sampling oracle capable of generating on-policy data from the current model. We refer to Wang et al. (2024) and Shi et al. (2025) for recent surveys. Below we provide an admittedly incomplete overview of a few related works on continual learning, with a focus on methods for generative models and general theoretical results.

Replay-based methods (Schaul et al., 2015; Lopez-Paz and Ranzato, 2017) assume access to a memory oracle that stores samples from previous tasks to be used in the current training process. A special case of replay-based methods are regularization-based methods, where the model parameters are stored (instead of the training data) and used as a regularizer when training on new tasks (Kirkpatrick et al., 2017; Li and Hoiem, 2017; Schwarz et al., 2018). Theoretical lower bounds on the amount of memory required for continual learning were established by Chen et al. (2022). In the context of large generative models, replay-based methods typically involve re-training a pre-trained model and hence may be inefficient (or even infeasible if the pre-trained model is closed-weights). Nevertheless, several works have proposed algorithms to efficiently use replay methods in the context of such large models (Shin et al., 2017). In the context of large pre-trained generative models, on-policy approaches that continually sample from the current model and train on the resulting data are widely used.
Such methods appear in both reinforcement learning and supervised fine-tuning settings, for example in on-policy distillation and policy-improvement style updates (Tajwar et al., 2024; Lu and T.M.Lab, 2025; Zhao et al., 2026a; Chen et al., 2025; Shenfeld et al., 2025). Related ideas also arise in mid-training procedures that bridge pre-training and post-training distributions (Liu et al., 2025), as well as in self-distillation methods that iteratively train a model on samples generated by its own policy (Shenfeld et al., 2026; Zhao et al., 2026b; Hübotter et al., 2026; Penaloza et al., 2026). Earlier work explored connections between reinforcement learning and distribution matching for language-model fine-tuning (Korbak et al., 2022), and recent methods such as OAPL construct improvement targets relative to a lagged reference policy (Ritter et al., 2026; Brantley et al., 2025). Theoretical results on forgetting under (overparameterized) linear models have been studied recently by many authors, including Evron et al. (2022); Lin et al. (2023); Li et al. (2025); Ding et al. (2024); Deng et al. (2025); Banayeeanzade et al. (2025); Karpel et al. (2026). The linear classification setting was further analyzed by Evron et al. (2023). PAC-Bayes bounds for continual learning were established recently by Friedman and Meir (2026), and gradient-descent dynamics in continual learning problems were studied by Bennani et al. (2021); Doan et al. (2021); Karakida and Akaho (2022); Cai and Diakonikolas (2025); Taheri et al. (2025); Graldi et al. (2025). In contrast to these works, our results aim to provide a principled understanding of practical post-training methods used for generative models, in particular forward- and reverse-KL based fine-tuning. Perhaps the closest to our work is Chan et al. (2022), which studies the role of forward and reverse KL divergences in approximate policy iteration and analyzes their policy-improvement properties in reinforcement learning.
In contrast, our work focuses on continual learning and forgetting in generative-model post-training, where the forward- and reverse-KL objectives arise from SFT and RL-style updates, and we quantify how these objectives induce or prevent forgetting through a distributional mixture model.

2 Forgetting in Forward and Reverse-KL Objectives

A key aspect of our analysis is the difference between forward- and reverse-KL objectives and how they affect continual-learning dynamics. We emphasize that many of the forthcoming results assume a mixture of two Gaussians for simplicity of exposition; extensions to finite mixtures and strongly log-concave densities are provided in Appendices D and E respectively. Similarly, an extension to a class of f-divergences is provided in Appendix C. All proofs are presented in Appendix B.

2.1 Forgetting in Two-component Mixture Model

In this section, we describe the minimalist mixture model, also considered by Chen et al. (2025) for their empirical observations. Fix d ∈ ℕ and a positive definite covariance matrix Σ ∈ ℝ^{d×d}. For μ ∈ ℝ^d, denote the Gaussian density by

φ_Σ(y; μ) := (2π)^{−d/2} |Σ|^{−1/2} exp(−(1/2)(y − μ)ᵀ Σ^{−1} (y − μ)).

We use a shared covariance Σ (with bounded spectrum) for both modes so that separation and overlap are controlled purely by the means and the mixture weights. Let p_o(y) := φ_Σ(y; μ_o) and p_n(y) := φ_Σ(y; μ_n) with μ_o ≠ μ_n. These densities represent the pre-existing (old) behavior distribution and the newly learned (new) behavior distribution, respectively. Define the Mahalanobis separation as

δ := ‖μ_n − μ_o‖_{Σ^{-1}} = sqrt((μ_n − μ_o)ᵀ Σ^{−1} (μ_n − μ_o)).

The scalar δ is dimensionless and will quantitatively govern overlap quantities (and hence misassignment and forgetting rates) through exponential-in-δ² bounds.
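The separation δ and the overlap it controls can be computed concretely (a sketch with illustrative numbers; the 2-D means and covariance below are arbitrary choices). The grid integral of sqrt(p_o p_n) recovers the closed form exp(−δ²/8) derived later for equal-covariance Gaussians:

```python
import numpy as np

# Illustrative 2-D instance: shared covariance, distinct means.
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
mu_o = np.array([0.0, 0.0])
mu_n = np.array([2.0, 1.0])

# Mahalanobis separation delta = sqrt((mu_n - mu_o)^T Sigma^{-1} (mu_n - mu_o)).
diff = mu_n - mu_o
delta = float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))

def density(pts, mu):
    """Equal-covariance Gaussian density phi_Sigma(y; mu) on an array of points."""
    d = pts - mu
    quad = np.einsum('...i,ij,...j->...', d, np.linalg.inv(Sigma), d)
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

# Grid integration of the overlap integral  int sqrt(p_o(y) p_n(y)) dy.
xs = np.linspace(-6.0, 8.0, 701)
ys = np.linspace(-6.0, 7.0, 651)
X, Y = np.meshgrid(xs, ys, indexing='ij')
pts = np.stack([X, Y], axis=-1)
cell = (xs[1] - xs[0]) * (ys[1] - ys[0])
overlap = float(np.sum(np.sqrt(density(pts, mu_o) * density(pts, mu_n))) * cell)

# The overlap matches the closed form exp(-delta^2 / 8) (Bhattacharyya coefficient).
assert abs(overlap - np.exp(-delta ** 2 / 8)) < 1e-4
```

Shrinking the covariance (or moving the means apart) increases δ and drives the overlap, and hence the misassignment rates below, toward zero exponentially fast.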
Fix a target mixture weight α ∈ (0,1) and define

p_α(y) := α p_o(y) + (1−α) p_n(y),  (2.1)

with target responsibilities s_o(y) := α p_o(y)/p_α(y) and s_n(y) := 1 − s_o(y). The mixture p_α formalizes the desired post-training outcome: retain an α-fraction of the old distribution while incorporating a (1−α)-fraction of the new distribution. We now introduce the learner model family used in post-training. For parameters β ∈ (0,1) and means m_o, m_n ∈ ℝ^d, define

q_{β, m_o, m_n}(y) := β φ_Σ(y; m_o) + (1−β) φ_Σ(y; m_n).  (2.2)

Here β encodes how much probability mass the model allocates to the old mode, while m_o and m_n control the within-mode locations. Define the model responsibilities (posterior component probabilities under q):

r_o(y) := β φ_Σ(y; m_o)/q_{β, m_o, m_n}(y),  r_n(y) := 1 − r_o(y).  (2.3)

The responsibilities act as soft assignments of a sample y to the old versus new component and will serve as the gate through which overlap induces cross-mode gradient effects. We now introduce two notions of forgetting: (a) mass forgetting (mass collapse of the old mixture weight) and (b) old-component drift (distortion of the old component itself).

Definition 2.1 (Mass Forgetting). Assume that the "old" mean μ_o and the "new" mean μ_n are available to the model, i.e.,

q_β(y) := β p_o(y) + (1−β) p_n(y),  β ∈ [0,1],

where β is the learnable parameter. We say that a training objective exhibits mass forgetting if its optimal solution satisfies β⋆ = 0. Minimizing such an objective therefore leads the learned model to assign zero mixture mass to the old component, even in this favorable setting. Equivalently, the learned model reduces to q_{β⋆}(y) = p_n(y), so the model no longer represents the old distribution p_o(y) despite it being available in the model class.
Definition 2.2 (ε-Bounded Drift of the Old Component). Suppose that the old mean is set correctly (i.e., m_o = μ_o). We say that a training objective L(β, m_o, m_n) exhibits ε-bounded drift of the old component if

‖∇_{m_o} L(β, m_o = μ_o, m_n)‖ ≤ ε,

for some problem-dependent quantity ε that tends to 0 in the regime of interest. This certifies that, at the correct old mean, the objective exerts only a small update signal on the old component, so gradient-based optimization can induce at most ε-scale drift of the old distribution while learning the remaining parameters.

This notion captures a form of "retention", as opposed to the forgetting captured by Definition 2.1: if this definition is violated, although the mixture weight on the old component may remain nonzero, updates induced by the objective can gradually shift the parameters of the old distribution away from the true old behavior. A desirable training objective for continual learning should therefore avoid mass forgetting in the sense of Definition 2.1 and, at the same time, induce only vanishingly small drift of the old distribution in the sense of Definition 2.2.

Bhattacharyya Overlap and a Responsibility Bound. The Bhattacharyya coefficient (Bhattacharyya, 1943) between densities f, g is defined as

BC(f, g) := ∫_{ℝ^d} sqrt(f(y) g(y)) dy ∈ (0, 1].

Lemma 2.1 (Posterior Leakage Bound via Bhattacharyya Coefficient). Let f, g be densities and let w ∈ (0,1). Define the mixture h(y) := w f(y) + (1−w) g(y) and the responsibility r_f(y) := w f(y)/h(y). Then

E_{Y∼g}[r_f(Y)] ≤ (1/2) sqrt(w/(1−w)) BC(f, g),  E_{Y∼f}[1 − r_f(Y)] ≤ (1/2) sqrt((1−w)/w) BC(f, g).

Remark 2.1 (Bhattacharyya Coefficient for Equal-covariance Gaussians). Let f(y) = φ_Σ(y; μ_1) and g(y) = φ_Σ(y; μ_2) with Σ ≻ 0. Then

BC(f, g) = exp(−(1/8) ‖μ_1 − μ_2‖²_{Σ^{-1}}).
To see this, first note that a completion-of-squares computation yields

sqrt(f(y) g(y)) = φ_Σ(y; (μ_1 + μ_2)/2) exp(−(1/8) ‖μ_1 − μ_2‖²_{Σ^{-1}}).

Integrating over y gives the claim. The Bhattacharyya coefficient BC(f, g) provides a symmetric proxy for mode overlap and converts geometric separation into quantitative bounds on posterior misassignment. Lemma 2.1 shows that the leakage probabilities E_{Y∼g}[r_f(Y)] and E_{Y∼f}[1 − r_f(Y)] (the chances that samples from one mode receive posterior responsibility from the other in the mixture) are bounded by a simple prefactor times BC(f, g). In our setting, instances such as (f, g) = (p_o, p_n) with w = β control E_{p_n}[r_o(Y)] (new samples incorrectly assigned to the old component), while w = α controls E_{p_o}[1 − s_o(Y)] (old samples attributed to the new component under the true target). These leakage terms act as the gates through which forgetting occurs. Under forward-KL training on new-only data, the logit gradient has the form β − E_{p_n}[r_o(Y)], so the exponentially small leakage term leaves a net push toward β ↓ 0. Under reverse-KL training to a true target, old-parameter updates are proportional to misassignment probabilities such as E_{p_o}[1 − r_o(Y)] and E_{p_o}[1 − s_o(Y)], so drift is suppressed when overlap is small. Finally, Remark 2.1 makes this explicit for equal-covariance Gaussians, giving BC(p_o, p_n) = exp(−δ²/8) and hence exponential decay of cross-mode influence with Mahalanobis separation δ.

2.2 Forward-KL SFT Exhibits Mass Forgetting

We now analyze the behavior of the forward-KL objective in the two-mode mixture model when training is performed using only new data. The following theorem shows that, in this setting, the objective drives the mixture weight on the old mode to collapse, resulting in strong forgetting.
Theorem 2.2 (Mass Forgetting in Forward-KL SFT). Consider the target model in (2.1) with $\mu_o\neq\mu_n$. Fix the learner model (see (2.2)) means at the true old/new means and define $q_\beta(y):=q_{\beta,\mu_o,\mu_n}(y)=\beta\,p_o(y)+(1-\beta)\,p_n(y)$ for $\beta\in[0,1]$. Consider $L_{SFT}(\beta):=KL(p_n\,\|\,q_\beta)$. Then:

1. $L_{SFT}(0)=0$ and $L_{SFT}(\beta)>0$ for every $\beta\in(0,1]$. Hence $\beta=0$ is the unique global minimizer over $[0,1]$.

2. $L_{SFT}$ is strictly increasing on $[0,1]$.

3. Let $\phi\in\mathbb{R}$ be a logit with $\beta=\sigma(\phi)$ and consider the gradient flow $\dot\phi(t)=-\frac{d}{d\phi}L_{SFT}(\sigma(\phi(t)))$. Then $\phi(t)$ is strictly decreasing and $\beta(t)=\sigma(\phi(t))$ satisfies $\beta(t)\downarrow0$ as $t\to\infty$. Moreover, recalling the model responsibility in (2.3), the logit gradient has the explicit form
$$\frac{d}{d\phi}L_{SFT}(\sigma(\phi))=\beta-\mathbb{E}_{Y\sim p_n}[r_o(Y)],\quad\text{so}\quad \dot\phi(t)=\mathbb{E}_{p_n}[r_o(Y)]-\beta(t)<0. \tag{2.4}$$
Finally, the leakage bound (Lemma 2.1 together with Remark 2.1) yields
$$\mathbb{E}_{Y\sim p_n}[r_o(Y)]\le\frac{1}{2}\sqrt{\frac{\beta}{1-\beta}}\exp\!\Big(-\frac{\delta^2}{8}\Big), \tag{2.5}$$
and hence $\dot\phi(t)\le\frac{1}{2}\sqrt{\frac{\beta(t)}{1-\beta(t)}}\,e^{-\delta^2/8}-\beta(t)$.

Remark 2.2. We emphasize that, except for the explicit bound in (2.5), the above result does not rely on the Gaussian assumption; it holds for any pair of distinct densities $p_o$ and $p_n$.

The above result shows a strong form of forgetting: when the forward-KL objective is optimized on new-only data, the optimal mixture weight places zero mass on the old mode. The mechanism is transparent in the logit-gradient expression (2.4): the update compares the current mass β on the old mode with the probability that new data are assigned to that mode, and since this assignment probability is exponentially small in the separation δ, the update consistently pushes β downward.
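The collapse mechanism is easy to observe numerically. The sketch below (hypothetical 1-D parameters, not from the paper) runs Euler steps of the logit gradient flow (2.4) with a Monte-Carlo estimate of $\mathbb{E}_{p_n}[r_o(Y)]$, and checks the leakage bound (2.5) at $\beta=0.5$:

```python
import numpy as np

# Minimal numerical check of Theorem 2.2 (hypothetical 1-D parameters):
# forward-KL SFT on new-only data drives the old-mode weight beta toward zero.
rng = np.random.default_rng(0)

mu_o, mu_n, sigma = -3.0, 3.0, 1.0
delta = abs(mu_n - mu_o) / sigma            # Mahalanobis separation

def phi_pdf(y, mu):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

Y = rng.normal(mu_n, sigma, size=200_000)   # samples from the new mode p_n

phi = 0.0                                   # logit, beta = sigmoid(phi) starts at 0.5
lr = 0.5
for _ in range(200):
    beta = 1.0 / (1.0 + np.exp(-phi))
    r_o = beta * phi_pdf(Y, mu_o) / (beta * phi_pdf(Y, mu_o) + (1 - beta) * phi_pdf(Y, mu_n))
    grad = beta - r_o.mean()                # logit gradient from (2.4)
    phi -= lr * grad                        # Euler step of the gradient flow

beta_final = 1.0 / (1.0 + np.exp(-phi))
print(f"beta after 200 steps: {beta_final:.4f}")   # collapses toward 0

# Leakage bound (2.5) at beta = 0.5.
beta = 0.5
r_o = beta * phi_pdf(Y, mu_o) / (beta * phi_pdf(Y, mu_o) + (1 - beta) * phi_pdf(Y, mu_n))
bound = 0.5 * np.sqrt(beta / (1 - beta)) * np.exp(-delta ** 2 / 8)
print("leakage bound holds:", r_o.mean() <= bound + 1e-6)
```

With $\delta=6$ the leakage term is of order $e^{-4.5}$, so the update pressure on β is dominated by the $-\beta$ term, exactly as the theorem predicts.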
Thus even though the model family explicitly contains the correct old component, the forward-KL objective provides no incentive to retain it once the training distribution contains only new data, illustrating a strong form of forgetting.

2.2.1 When Does Replay Prevent Forgetting under Forward-KL SFT?

Theorem 2.2 established a population-level strong-forgetting phenomenon for forward-KL SFT: when the training distribution is new-only ($p_n$), the forward-KL objective $L_{SFT}(\beta)=KL(p_n\|q_\beta)$ is uniquely minimized at $\beta_\star=0$ even though the model class contains the correct old component $p_o$. A natural next question is whether replay-style modifications can mitigate this collapse. The key structural point is that forward-KL has the form $KL(\text{data}\,\|\,\text{model})$, so its population optimizer is determined by the data distribution. Consequently, replay can only prevent strong forgetting at the population level if it changes the effective training distribution. We therefore distinguish two canonical interventions: (i) denominator replay, which mixes the old mode into the model side while keeping the data distribution new-only, and (ii) numerator replay, which mixes old samples into the data side. Only (ii) changes the population optimum.

Lemma (Replay under Forward-KL SFT). Fix $\lambda\in(0,1)$.

(A) Define the replay-mixed learner model as
$$\tilde q_{\beta,\lambda}(y):=(1-\lambda)\,q_\beta(y)+\lambda\,p_o(y).$$
Then $\tilde q_{\beta,\lambda}=q_{\tilde\beta}$ with $\tilde\beta=\lambda+(1-\lambda)\beta\in[\lambda,1]$, and the optimization problem $\min_{\beta\in[0,1]}KL(p_n\,\|\,\tilde q_{\beta,\lambda})$ has the unique minimizer $\beta_\star=0$ (equivalently $\tilde\beta_\star=\lambda$).

(B) Define the replay-mixed data distribution
$$\tilde p_\lambda(y):=(1-\lambda)\,p_n(y)+\lambda\,p_o(y).$$
Then $\min_{\beta\in[0,1]}KL(\tilde p_\lambda\,\|\,q_\beta)$ has the unique minimizer $\beta_\star=\lambda$, and the minimum value is 0.

Part (A) shows that denominator replay does not change the directional preference of forward-KL; it only restricts the model family.
Indeed, $\tilde q_{\beta,\lambda}$ is not a new family of distributions: it is exactly the original mixture family under the affine reparameterization $\tilde\beta=\lambda+(1-\lambda)\beta$, with the constraint $\tilde\beta\in[\lambda,1]$. Therefore the optimization reduces to minimizing $KL(p_n\|q_{\tilde\beta})$ over $\tilde\beta\in[\lambda,1]$. Since the map $\tilde\beta\mapsto KL(p_n\|q_{\tilde\beta})$ is strictly increasing (Theorem 2.2), the optimizer chooses the smallest admissible old mass, namely $\tilde\beta_\star=\lambda$, which corresponds to $\beta_\star=0$. Operationally, any nonzero old mass present in the deployed distribution $\tilde q_{\beta_\star,\lambda}=q_\lambda$ is therefore a hard-coded floor imposed by the replay mixing, rather than something learned from new-only data.

Part (B) is qualitatively different because numerator replay changes the training distribution itself. When the data distribution is $\tilde p_\lambda=(1-\lambda)p_n+\lambda p_o$, the model class contains a member that matches it exactly, namely $q_\beta$ with $\beta=\lambda$. Thus the forward-KL can attain its global minimum value 0 at $\beta_\star=\lambda$, and uniqueness follows because two distinct mixture weights cannot represent the same density when $p_o\not\equiv p_n$. In other words, forward-KL performs a mode-covering projection of the data mixture onto the model family, and it necessarily retains whatever fraction of old data is present in the training distribution.

2.3 Reverse-KL RL Avoids Mass Forgetting and Controls Old-Component Drift

We now verify that the reverse-KL objective is correctly aligned with the intended true target at the parameter level. We first show that the reverse-KL objective is consistent with respect to the target distribution. In particular, when the learner parameters match the target mixture, i.e., when the mixture weight equals the target weight and the new-mode mean equals the true new mean, $(\beta,m_n)=(\alpha,\mu_n)$, then the model distribution $q_{\beta,m_n}$ coincides exactly with the target $p_\alpha$.
At this point the reverse-KL divergence vanishes and the gradients with respect to both parameters are zero, so the point is stationary. Moreover, since the KL divergence is nonnegative and equals zero only when the two distributions coincide, this parameter choice is in fact a global minimizer of the objective. Thus reverse-KL RL is aligned with the target at the population level: the optimal solution preserves the correct mixture mass on the old mode and therefore avoids mass forgetting.

Theorem (Consistency of Reverse-KL RL). Consider the target model in (2.1) and the learner model family in (2.2) with $m_o=\mu_o$. Let
$$L(\beta,m_n):=KL\big(q_{\beta,m_n}\,\|\,p_\alpha\big)=\int_{\mathbb{R}^d}q_{\beta,m_n}(y)\,\log\frac{q_{\beta,m_n}(y)}{p_\alpha(y)}\,dy.$$
Then $L$ is continuously differentiable on $(0,1)\times\mathbb{R}^d$, and its gradients are
$$\partial_\beta L(\beta,m_n)=\int_{\mathbb{R}^d}\big(\varphi_\Sigma(y;\mu_o)-\varphi_\Sigma(y;m_n)\big)\,\log\frac{q_{\beta,m_n}(y)}{p_\alpha(y)}\,dy, \tag{2.6}$$
$$\nabla_{m_n}L(\beta,m_n)=(1-\beta)\int_{\mathbb{R}^d}\varphi_\Sigma(y;m_n)\,\Sigma^{-1}(y-m_n)\,\log\frac{q_{\beta,m_n}(y)}{p_\alpha(y)}\,dy. \tag{2.7}$$
In particular, the point $(\beta,m_n)=(\alpha,\mu_n)$ is stationary:
$$\partial_\beta L(\beta,m_n)\big|_{(\beta,m_n)=(\alpha,\mu_n)}=0,\qquad \nabla_{m_n}L(\beta,m_n)\big|_{(\beta,m_n)=(\alpha,\mu_n)}=0.$$
Moreover, since $KL(\cdot\|\cdot)\ge0$ with equality iff the two densities coincide a.e., $(\alpha,\mu_n)$ is a global minimizer of $L$ and achieves $L(\alpha,\mu_n)=0$.

We next bound the drift of the old distribution's parameters while the new distribution is being learned.

Theorem (Bounding Drift of the Old Distribution in Reverse-KL RL). Consider the learner model family in (2.2) and the reverse-KL objective $L_{RL}(\beta,m_o,m_n):=KL(q_{\beta,m_o,m_n}\,\|\,p_\alpha)$. Assume the old mean is already correct: $m_o=\mu_o$ (with $m_n$ arbitrary).
Recalling (2.3), let $r_o(y)$ be the learner responsibility under $q_{\beta,\mu_o,m_n}$ and let $s_o(y)$ be the target responsibility under $p_\alpha$. Then the gradient with respect to the old mean admits the exact decomposition
$$\nabla_{m_o}L_{RL}(\beta,\mu_o,m_n)=\beta\,\Sigma^{-1}\big(\varepsilon_q(\beta,m_n)\,(m_n-\mu_o)-\varepsilon_p(\alpha)\,(\mu_n-\mu_o)\big), \tag{2.8}$$
where the misassignment probabilities are
$$\varepsilon_q(\beta,m_n):=\mathbb{E}_{Y\sim p_o}[1-r_o(Y)]\quad\text{and}\quad \varepsilon_p(\alpha):=\mathbb{E}_{Y\sim p_o}[1-s_o(Y)].$$
Moreover, these satisfy the explicit Gaussian overlap bounds
$$\varepsilon_q(\beta,m_n)\le\frac{1}{2}\sqrt{\frac{1-\beta}{\beta}}\exp\!\Big(-\frac{1}{8}\|m_n-\mu_o\|_{\Sigma^{-1}}^2\Big),\qquad \varepsilon_p(\alpha)\le\frac{1}{2}\sqrt{\frac{1-\alpha}{\alpha}}\exp\!\Big(-\frac{\delta^2}{8}\Big), \tag{2.9}$$
and hence (with $\|\cdot\|_2$ denoting the operator norm),
$$\|\nabla_{m_o}L_{RL}(\beta,\mu_o,m_n)\|\le\beta\,\|\Sigma^{-1}\|_2\big(\varepsilon_q(\beta,m_n)\,\|m_n-\mu_o\|+\varepsilon_p(\alpha)\,\|\mu_n-\mu_o\|\big).$$

The above result shows that when optimizing reverse-KL toward a target that explicitly retains the old mode, the learning signal acting on the old parameters is tightly controlled. In particular, when the old mean is already correct, the gradient admits an exact decomposition in which the only terms capable of moving the old mode arise from the misassignment probabilities $(1-r_o)$ and $(1-s_o)$ under old-mode samples. The theorem further shows that these misassignment probabilities are governed by Gaussian overlap quantities (Bhattacharyya coefficients) and decay exponentially with the squared Mahalanobis separation, scaling as $\exp(-\delta^2/8)$ in the equal-covariance case. Consequently, in well-separated regimes the reverse-KL objective exerts only an exponentially small update pressure on an already-correct old component, so optimization can adjust the new mode while perturbing the old mode only through negligible overlap effects.
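The overlap gate can be checked directly by Monte Carlo. The sketch below (hypothetical 1-D parameters) estimates the target misassignment probability $\varepsilon_p(\alpha)=\mathbb{E}_{p_o}[1-s_o(Y)]$ for increasing separation δ and compares it against the Bhattacharyya-style bound in (2.9):

```python
import numpy as np

# Monte-Carlo check (1-D, hypothetical parameters) that the misassignment
# probability eps_p(alpha) = E_{p_o}[1 - s_o(Y)] decays like exp(-delta^2/8).
rng = np.random.default_rng(1)
alpha, sigma = 0.3, 1.0

def gaussian(y, mu):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

results = []
for delta in [2.0, 4.0, 6.0]:
    mu_o, mu_n = 0.0, delta * sigma              # Mahalanobis separation = delta
    Y = rng.normal(mu_o, sigma, size=400_000)    # samples from the old mode p_o
    s_o = alpha * gaussian(Y, mu_o) / (alpha * gaussian(Y, mu_o) + (1 - alpha) * gaussian(Y, mu_n))
    eps_p = np.mean(1.0 - s_o)
    bound = 0.5 * np.sqrt((1 - alpha) / alpha) * np.exp(-delta ** 2 / 8)
    results.append((eps_p, bound))
    print(f"delta={delta}: eps_p={eps_p:.2e}  bound={bound:.2e}  ok={eps_p <= bound}")
```

The estimated misassignment drops sharply with δ, matching the exponential gate that suppresses old-mode drift in the theorem.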
The preceding two theorems together quantify precisely that the reverse-KL objective is naturally aligned with the desired continual-learning outcome: it favors a solution that simultaneously preserves the old mode and correctly represents both the new mode and the mixing proportions.

2.3.1 A Local Exponential Rate for Reverse-KL Minimization

In this subsection we prove a local (near-optimum) exponential convergence rate for reverse-KL minimization when $m_o$ is fixed at $\mu_o$ and we optimize over $(\beta,m_n)$ by gradient flow. Because β is constrained to $[0,1]$, we parameterize $\beta=\sigma(\phi)$ using the logit $\phi\in\mathbb{R}$.

Theorem 2.3.1 (Explicit Local PL Bound and Exponential Convergence for Reverse-KL Gradient Flow). Consider the reverse-KL objective $L(\phi,m):=KL(q_{\phi,m}\|p_\alpha)$, where
$$q_{\phi,m}(y):=\beta(\phi)\,\varphi_\Sigma(y;\mu_o)+(1-\beta(\phi))\,\varphi_\Sigma(y;m),\qquad \beta(\phi)=\frac{1}{1+e^{-\phi}},$$
with $m_o=\mu_o$ fixed, and let
$$\theta:=(\phi,m)\in\mathbb{R}^{d+1},\qquad \theta_\star:=(\phi_\star,m_\star),\qquad \phi_\star=\log\frac{\alpha}{1-\alpha},\qquad m_\star=\mu_n.$$
Then $H_\star:=\nabla^2L(\theta_\star)\succ0$ and $\mu_\star:=\lambda_{\min}(H_\star)>0$. Fix $r_0>0$ and let $K=B_{r_0}(\theta_\star)$. Let $L_H:=L_H(K)$ be the Hessian-Lipschitz constant given by Lemma B.3. Define
$$\rho:=\min\Big\{r_0,\frac{\mu_\star}{2L_H}\Big\},\qquad \varepsilon_{loc}:=\frac{\mu_\star}{8}\rho^2.$$
Then:

1. For every $\theta\in B_\rho(\theta_\star)$,
$$\nabla^2L(\theta)\succeq\frac{\mu_\star}{2}I. \tag{2.10}$$
Consequently, for every $\theta\in B_\rho(\theta_\star)$,
$$\|\nabla L(\theta)\|^2\ge\mu_\star\,L(\theta),\qquad L(\theta)\ge\frac{\mu_\star}{4}\|\theta-\theta_\star\|^2. \tag{2.11}$$

2. Let $\theta(t)$ solve the Euclidean gradient flow $\dot\theta(t)=-\nabla L(\theta(t))$. If $\theta(0)\in B_\rho(\theta_\star)$ and $L(\theta(0))\le\varepsilon_{loc}$, then $\theta(t)\in B_\rho(\theta_\star)$ for all $t\ge0$, and
$$L(\theta(t))\le L(\theta(0))\,e^{-\mu_\star t}\qquad\forall\,t\ge0. \tag{2.12}$$
Moreover,
$$\|\theta(t)-\theta_\star\|\le\frac{2}{\sqrt{\mu_\star}}\sqrt{L(\theta(0))}\,e^{-\mu_\star t/2}\qquad\forall\,t\ge0.$$
(2.13)

Theorem 2.3.1 says that once the reverse-KL dynamics enter a neighborhood in which the local curvature stays uniformly positive, the loss behaves like a strongly convex function and gradient flow converges exponentially fast to the optimum. The locality is quantified explicitly by ρ and $\varepsilon_{loc}$: the radius is determined by the curvature at the optimum $\mu_\star$ and by the local Hessian-Lipschitz constant $L_H$, while the admissible sublevel threshold is $\varepsilon_{loc}=\mu_\star\rho^2/8$. Thus the same Hessian curvature quantity that measures local identifiability also determines the exponential convergence rate and the size of the certified local basin.

2.4 Replay Improves Reverse-KL Methods in Practice

Our earlier population-level reverse-KL analysis suggests that KL-regularized on-policy updates are naturally aligned with retention: when the true target $p_\alpha$ explicitly includes the old mode, reverse-KL gradients that would move an already-correct old component are overlap-gated and become exponentially small as the separation between modes increases. However, this population-level picture does not by itself rule out a purely stochastic failure mode. In particular, when the current old-mode weight β is small, a minibatch drawn from the current model $q_{\beta,m_n}$ may contain no old-mode samples, causing the stochastic update in that iteration to behave effectively as a "new-only" update.

A simple semi-on-policy remedy is to inject a small replay fraction of old-mode samples into the behavior distribution used to construct minibatches. Importantly, this can be done while still estimating the same reverse-KL gradient in expectation via bounded importance weighting. This intervention simultaneously (i) guarantees that old-mode samples appear in minibatches regardless of the current value of β, and (ii) avoids the high-variance pathologies typically associated with general off-policy importance sampling, since the resulting importance ratios remain uniformly bounded.
Replay-Mixed Sampling with Bounded Importance Weights. Fix a replay rate $\lambda\in(0,1)$ and define
$$b_{\lambda,\beta,m_n}(y):=(1-\lambda)\,q_{\beta,m_n}(y)+\lambda\,p_o(y),\qquad w_\lambda(y):=\frac{q_{\beta,m_n}(y)}{b_{\lambda,\beta,m_n}(y)}. \tag{2.14}$$
A minibatch is formed by drawing $Y_1,\dots,Y_N\overset{\text{i.i.d.}}{\sim}b_{\lambda,\beta,m_n}$.

Lemma 2.4. Fix $\lambda\in(0,1)$, $\beta\in(0,1)$, and $m_n\in\mathbb{R}^d$. Let $b_{\lambda,\beta,m_n}$ and $w_\lambda$ be defined as in (2.14), and let $Y_1,\dots,Y_N\overset{\text{i.i.d.}}{\sim}b_{\lambda,\beta,m_n}$.

(A) There exists $\tilde\beta\in(0,1)$ such that
$$b_{\lambda,\beta,m_n}(y)=q_{\tilde\beta,m_n}(y),\qquad \tilde\beta=\lambda+(1-\lambda)\beta\ \ge\ \lambda.$$
Moreover, the importance ratio is uniformly bounded:
$$0\le w_\lambda(y)\le\frac{1}{1-\lambda}\qquad\forall\,y\in\mathbb{R}^d.$$
Consequently, for any integrable $h:\mathbb{R}^d\to\mathbb{R}^k$,
$$\mathbb{E}_{Y\sim b_{\lambda,\beta,m_n}}[w_\lambda(Y)\,h(Y)]=\mathbb{E}_{Y\sim q_{\beta,m_n}}[h(Y)], \tag{2.15}$$
and if additionally $\mathbb{E}_{b_{\lambda,\beta,m_n}}[\|h(Y)\|^2]<\infty$, then
$$\mathbb{E}_{Y\sim b_{\lambda,\beta,m_n}}\big[\|w_\lambda(Y)\,h(Y)\|^2\big]\le\frac{1}{(1-\lambda)^2}\,\mathbb{E}_{Y\sim b_{\lambda,\beta,m_n}}\big[\|h(Y)\|^2\big]. \tag{2.16}$$

(B) Under the mixture generative model of $b_{\lambda,\beta,m_n}=q_{\tilde\beta,m_n}$, let $Z_i\in\{0,1\}$ be the latent indicator that $Y_i$ came from the old component. Then $Z_i\sim\mathrm{Bernoulli}(\tilde\beta)$ with $\tilde\beta\ge\lambda$, hence
$$\Pr\Big(\sum_{i=1}^N Z_i=0\Big)=(1-\tilde\beta)^N=\big((1-\lambda)(1-\beta)\big)^N\le(1-\lambda)^N, \tag{2.17}$$
and a multiplicative Chernoff bound implies
$$\Pr\Big(\sum_{i=1}^N Z_i\le\frac{\lambda}{2}N\Big)\le\exp\!\Big(-\frac{\lambda}{8}N\Big). \tag{2.18}$$

Part (A) formalizes that replay-mixing does not change the reverse-KL population objective: any quantity that can be expressed as an expectation under the current model $q_{\beta,m_n}$ (in particular, standard score-form expressions for $\nabla_\theta KL(q_{\beta,m_n}\|p_\alpha)$) can be estimated unbiasedly from replay-mixed samples by weighting with $w_\lambda$.
At the same time, the correction is benign: the pointwise dominance $b_{\lambda,\beta,m_n}\ge(1-\lambda)q_{\beta,m_n}$ forces a uniform weight bound $w_\lambda\le(1-\lambda)^{-1}$, which immediately yields the second-moment control (2.16) and prevents the high-variance failure modes of general importance sampling.

Part (B) addresses the stochastic "old-mode starvation" pathology directly. Even if β is tiny, replay ensures that the effective old-component probability under the behavior distribution satisfies $\tilde\beta\ge\lambda$, so the probability that a minibatch contains no old samples decays as $(1-\lambda)^N$, independently of β. Thus replay-mixing decouples old-mode visibility in stochastic gradients from the current policy's old weight. Combined with our earlier overlap-gated reverse-KL gradient identities (which show that, at the population level, the old mode is only weakly perturbed when it is already correct), this yields a minimalistic mechanism by which a method can be "not exactly on-policy" yet still exhibit on-policy-like resistance to catastrophic forgetting in finite-batch optimization.

Recent work on KL-regularized on-policy RL (Shah et al., 2025; Tang and Munos, 2025) emphasizes that how the KL term is estimated and differentiated matters: common surrogate/stop-gradient constructions can yield biased gradients or even optimize a qualitatively different objective than intended. Lemma 2.4 is complementary to these estimator-correctness issues. Even if one uses a correct (on-policy) gradient expression of the form $\nabla_\theta J(\theta)=\mathbb{E}_{Y\sim q_\theta}[h_\theta(Y)]$, minibatches drawn purely from $q_\theta$ can still suffer a coverage problem: low-probability regions (e.g., previously learned behaviors/modes) may be absent, making the step behave effectively like "new-only" optimization. Replay-mixing fixes this by sampling from $b_{\lambda,\theta}=(1-\lambda)q_\theta+\lambda p_o$ and using the importance ratio $w_\lambda=q_\theta/b_{\lambda,\theta}$ to recover unbiased $q_\theta$-expectations, while keeping $w_\lambda\le(1-\lambda)^{-1}$ uniformly.
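A minimal numerical sketch of this mechanism (hypothetical 1-D Gaussian mixture; the values of `lam`, `beta`, and the test function `h` are illustrative choices, not from the paper):

```python
import numpy as np

# Replay-mixed sampling (2.14): draw from b = (1-lam) q + lam p_o, reweight by
# w = q/b. Checks: w <= 1/(1-lam), the weighted mean is (nearly) unbiased for
# E_q[h], and no-old-sample minibatches vanish at rate (1-lam)^N.
rng = np.random.default_rng(2)
mu_o, mu_n, sigma = 0.0, 5.0, 1.0
beta, lam, N = 0.02, 0.1, 64                 # tiny old-mode weight, 10% replay

def gaussian(y, mu):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def q_pdf(y):   # current model q_{beta, m_n}
    return beta * gaussian(y, mu_o) + (1 - beta) * gaussian(y, mu_n)

def b_pdf(y):   # replay-mixed behavior distribution
    return (1 - lam) * q_pdf(y) + lam * gaussian(y, mu_o)

# Sample from b: old component with prob lam + (1-lam)*beta, else new component.
beta_tilde = lam + (1 - lam) * beta
M = 500_000
Z = rng.random(M) < beta_tilde
Y = np.where(Z, rng.normal(mu_o, sigma, M), rng.normal(mu_n, sigma, M))
w = q_pdf(Y) / b_pdf(Y)

h = np.cos(Y)                                # arbitrary test function
print("max weight    :", w.max(), "<=", 1 / (1 - lam))
print("weighted mean :", np.mean(w * h))     # estimates E_q[cos(Y)]

# Closed form for a Gaussian mixture: E[cos(mu + sigma Z)] = exp(-sigma^2/2) cos(mu).
truth = beta * np.exp(-0.5) * np.cos(mu_o) + (1 - beta) * np.exp(-0.5) * np.cos(mu_n)
print("closed form   :", truth)

# Starvation probability (2.17): a minibatch of size N contains no old samples.
p_starve = (1 - beta_tilde) ** N
print("P(no old in N):", p_starve)           # tiny even though beta is tiny
```

Note how the weight cap $(1-\lambda)^{-1}$ and the starvation rate $(1-\lambda)^N$ depend on λ but not on the (tiny) current old weight β, which is exactly the decoupling the lemma certifies.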
Thus, replay-mixing provides a principled way to reuse replay/stale samples and stabilize gradient estimation without the unbounded importance weights that typically make off-policy correction high-variance. It can alleviate coverage/variance pathologies, while the KL-estimator bias mechanisms studied in Shah et al. (2025) and Tang and Munos (2025) must still be addressed by choosing an appropriate KL-gradient estimator.

2.5 Summarizing Consequences for SFT and RL Post-Training

In our two-mode Gaussian mixture model, SFT corresponds to minimizing a forward-KL objective of the form $KL(\text{data}\,\|\,\text{model})$. When the training distribution is new-only ($p_n$), this objective exhibits a population-level pathology: although the model family explicitly contains the correct old component $p_o$, the unique minimizer of $L_{SFT}(\beta)=KL(p_n\|q_\beta)$ collapses the old mixture weight to $\beta_\star=0$. The logit-gradient formula $\dot\phi=\mathbb{E}_{p_n}[r_o(Y)]-\beta$ makes the mechanism transparent: updates compare the current old mass β with the frequency with which new data are (mis)assigned to the old component, and this leakage probability is exponentially small in the separation δ.

Replay interacts with SFT in an asymmetric manner. Mixing old samples only on the model side (denominator replay) does not alter the population objective and therefore still drives the trainable parameter to $\beta_\star=0$, with any apparent "retention" arising solely from the externally imposed floor $\tilde\beta\ge\lambda$. In contrast, incorporating replay into the data distribution (numerator replay) genuinely changes the forward-KL objective: the population optimum then recovers the replay fraction exactly ($\beta_\star=\lambda$). In this sense, numerator replay provides the minimal rehearsal mechanism by which SFT can avoid strong forgetting in this toy setting.
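This asymmetry is easy to reproduce numerically. The sketch below (hypothetical 1-D parameters) fits the old-mode weight by maximum likelihood, which is equivalent to minimizing the empirical forward KL up to an additive constant, under the two replay variants:

```python
import numpy as np

# Numerator vs. denominator replay under forward KL (hypothetical 1-D setup).
# Fitting beta by maximum likelihood == minimizing KL(data || model) + const.
rng = np.random.default_rng(3)
mu_o, mu_n, sigma, lam = -4.0, 4.0, 1.0, 0.2

def gaussian(y, mu):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def fit_beta(Y, model):
    # model(b, Y) returns the deployed density; grid-search the MLE of beta.
    betas = np.linspace(1e-3, 1 - 1e-3, 500)
    lls = [np.mean(np.log(model(b, Y))) for b in betas]
    return betas[int(np.argmax(lls))]

n = 50_000
Y_new = rng.normal(mu_n, sigma, n)                       # new-only data
mix = rng.random(n) < lam                                # numerator-replay data
Y_mix = np.where(mix, rng.normal(mu_o, sigma, n), rng.normal(mu_n, sigma, n))

q = lambda b, Y: b * gaussian(Y, mu_o) + (1 - b) * gaussian(Y, mu_n)
q_denom = lambda b, Y: (1 - lam) * q(b, Y) + lam * gaussian(Y, mu_o)

beta_plain = fit_beta(Y_new, q)          # new-only data, plain model  -> ~0
beta_denom = fit_beta(Y_new, q_denom)    # denominator replay          -> ~0 (floor is lam)
beta_numer = fit_beta(Y_mix, q)          # numerator replay            -> ~lam
print(beta_plain, beta_denom, beta_numer)
```

Only the numerator-replay fit recovers a nonzero learned old weight, matching the population-level lemma: denominator replay leaves the trainable parameter at the boundary and merely hard-codes an old-mass floor of λ into the deployed mixture.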
By contrast, RL-style post-training with KL regularization can be naturally expressed as minimizing a reverse-KL objective $KL(q\,\|\,p_\alpha)$ toward a true target $p_\alpha=\alpha p_o+(1-\alpha)p_n$ that explicitly preserves the old behavior. In this setting the learned parameters are aligned with retention: the matching parameters $(\beta,m_n)=(\alpha,\mu_n)$ are stationary and globally minimize the reverse KL, while the learning signal for the old mean at $m_o=\mu_o$ is provably gated by misassignment probabilities that decay exponentially with mode separation. This formalizes the mode-locality intuition: reverse-KL updates can adapt the new mode while perturbing an already-correct old mode only through exponentially small overlap regions.

In practice, however, purely on-policy stochastic optimization can suffer a finite-batch starvation failure when β becomes small, since minibatches sampled from the current model may contain no old-mode samples. A semi-on-policy replay mixture resolves this issue without altering the population objective. Specifically, mixing a small fraction λ of true old samples into the behavior distribution and applying bounded importance weights yields an unbiased estimator of the same reverse-KL gradient, while guaranteeing $\Omega(\lambda N)$ old-mode samples per minibatch with high probability.

3 Forgetting in Near-on-policy Algorithms

Based on the quantification established in Section 2, we now study forgetting in three recently proposed algorithms: Self-Distillation Fine-Tuning (SDFT) (Shenfeld et al., 2026), T-Discover (Yuksekgonul et al., 2026), and OAPL (Ritter et al., 2026; Brantley et al., 2025).
All three are near on-policy in the sense that their update targets are constructed from distributions closely tied to the model's own behavior, but they do so in different ways: SDFT distills the current policy toward an evolving teacher induced by demonstrations, T-Discover reweights samples drawn from the current policy using an entropic reward objective together with a KL anchor, and OAPL samples from a lagged reference policy and defines both its improvement target and its regression objective relative to that same frozen reference. In this sense, demonstrations (via distillation) and rewards (via RL-style tilting) are two practical mechanisms for constructing on-policy update targets that keep training supported on the model's current distribution. The following subsections show that these design choices lead to distinct forgetting profiles in our mixture model: SDFT behaves most like a reverse-KL update with an evolving teacher, T-Discover balances reward-driven discovery against KL-based retention, and OAPL preserves or reweights only the modes already present in its frozen reference while remaining geometrically local.

3.1 Mixture-Model Analysis of SDFT with Demonstrations and EMA Teachers

Self-Distillation Fine-Tuning (SDFT) (Shenfeld et al., 2026) updates a student policy on-policy by sampling from the student and minimizing a reverse-KL objective to a demonstration-conditioned teacher. In the notation of Shenfeld et al. (2026), the student is $\pi_{\theta_t}(\cdot\mid x)$, while the teacher is the same model additionally conditioned on an expert demonstration $c$, yielding $\pi_{\bar\theta_t}(\cdot\mid x,c)$. Thus each step minimizes
$$D_{KL}\big(\pi_{\theta_t}(\cdot\mid x)\,\|\,\pi_{\bar\theta_t}(\cdot\mid x,c)\big),$$
taking gradients only with respect to the student parameters $\theta_t$ while treating the teacher as fixed within that step.
The key role of the demonstration $c$ is to bias the teacher distribution toward a desirable behavior: conditioning on $c$ changes how much probability mass the teacher assigns to "old" behavior versus "new" behavior, and it shifts the teacher's preferred new behavior in a direction suggested by the demonstration. Across steps, Shenfeld et al. (2026) proposes an EMA teacher (updating $\bar\theta_t$ by an exponential moving average of $\theta_t$) to stabilize the on-policy feedback loop and avoid chasing a rapidly moving target.

Mixture Abstraction of Demonstration-Guided SDFT. We model the SDFT process in our two-mode Gaussian mixture abstraction from Section 2.1. As before, the old behavior is represented by the fixed Gaussian $p_o(y):=\varphi_\Sigma(y;\mu_o)$. At each iteration $t$, the teacher distribution is summarized by a teacher state $\tilde\nu_t=(\alpha_t,\nu_t)\in(0,1)\times\mathbb{R}^d$, where $\alpha_t\in(0,1)$ denotes the amount of old-mode mass retained by the teacher and $\nu_t$ is the current location of the teacher's new mode. Similarly, the student distribution is summarized by the student state $\tilde m_t=(\beta_t,m_t)\in(0,1)\times\mathbb{R}^d$, where $\beta_t$ is the student's old-mode mass and $m_t$ is the student's new-mode location. The per-step SDFT objective is then the reverse KL from the student to the current teacher mixture.

To model the fact that the demonstration $c$ provides a directional signal (a "vector closer to truth"), we introduce a fixed demonstration anchor state $\tilde\nu(c):=(\alpha_c,\nu_c)\in(0,1)\times\mathbb{R}^d$, which should be interpreted as the mixture summary of the demonstration-conditioned teacher preference induced by $c$. We allow the teacher to be updated both by EMA smoothing toward the student and by an explicit pull toward $\tilde\nu(c)$. A single scalar $\lambda\in[0,1]$ controls the demonstrator strength: $\lambda=0$ means the teacher is purely an EMA of the student, while larger λ makes the teacher track the demonstration anchor more strongly; see also Remark 3.1.
Definition 3.1 (EMA+demonstrator SDFT dynamics in the mixture model). Fix $\Sigma\succ0$ and $p_o(y)=\varphi_\Sigma(y;\mu_o)$. For teacher parameters $(\alpha,\nu)\in(0,1)\times\mathbb{R}^d$ define
$$p_{\alpha,\nu}(y):=\alpha\,p_o(y)+(1-\alpha)\,\varphi_\Sigma(y;\nu),$$
and for student parameters $(\beta,m)\in(0,1)\times\mathbb{R}^d$ define
$$q_{\beta,m}(y):=\beta\,p_o(y)+(1-\beta)\,\varphi_\Sigma(y;m).$$
Define the phasewise reverse-KL loss
$$L(\beta,m;\alpha,\nu):=KL\big(q_{\beta,m}\,\|\,p_{\alpha,\nu}\big).$$
Let $\tilde\nu(c)=(\alpha_c,\nu_c)\in(0,1)\times\mathbb{R}^d$ denote a fixed demonstration anchor. We write the teacher and student states as
$$\tilde\nu_t:=(\alpha_t,\nu_t)\in(0,1)\times\mathbb{R}^d,\qquad \tilde m_t:=(\beta_t,m_t)\in(0,1)\times\mathbb{R}^d.$$
Given parameters $\gamma>0$ (student step size), $\zeta\in(0,1]$ (EMA rate), and $\lambda\in[0,1]$ (demonstrator strength), we say that $(\tilde\nu_t,\tilde m_t)_{t\ge0}$ follow EMA+demonstrator SDFT dynamics if, for all $t\ge0$,
$$\tilde m_{t+1}=\tilde m_t-\gamma\,\nabla_{(\beta,m)}L(\beta_t,m_t;\alpha_t,\nu_t), \tag{3.1}$$
$$\tilde\nu_{t+1}=(1-\zeta)\,\tilde\nu_t+\zeta\big((1-\lambda)\,\tilde m_{t+1}+\lambda\,\tilde\nu(c)\big). \tag{3.2}$$
The update (3.1) is a reverse-KL tracking step of the student toward the current teacher, and (3.2) is an EMA teacher update that interpolates between the updated student and the demonstration anchor.

Theorem 3.1. Assume the teacher iterates remain in the compact set
$$K:=\big\{(\alpha,\nu)\in(0,1)\times\mathbb{R}^d:\ \alpha\in[\underline\alpha,\bar\alpha],\ \|\nu-\mu_o\|\le R_\nu,\ \|\nu-\mu_o\|_{\Sigma^{-1}}\ge\underline\delta\big\}$$
for some $0<\underline\alpha\le\bar\alpha<1$, $R_\nu<\infty$, and $\underline\delta>0$, and assume the demonstration anchor $\tilde\nu(c)=(\alpha_c,\nu_c)\in K$. For each $y=(\alpha,\nu)\in K$ define
$$F_y(x):=KL\big(q_x\,\|\,p_y\big),\qquad x=(\beta,m),\quad q_x:=q_{\beta,m},\quad p_y:=p_{\alpha,\nu}.$$
Then $F_y$ is minimized at $x=y$ and $\nabla^2F_y(y)\succ0$. Consequently, by continuity and compactness, there exist constants $\mu>0$, $L_H<\infty$, and $r_0>0$ such that
$$\lambda_{\min}\big(\nabla^2F_y(y)\big)\ge\mu\qquad\forall\,y\in K,$$
and $\nabla^2F_y(\cdot)$ is $L_H$-Lipschitz on $B_{r_0}(y)$ uniformly in $y\in K$. Define
$$\rho:=\min\Big\{r_0,\frac{\mu}{2L_H}\Big\},\qquad M:=\sup_{\substack{y\in K,\ \|x-y\|\le\rho}}\lambda_{\max}\big(\nabla^2F_y(x)\big)<\infty.$$
Assume $\|\tilde m_0-\tilde\nu_0\|\le\rho$ and $\|\tilde\nu_0-\tilde\nu(c)\|\le\rho$, and run the EMA+demonstrator SDFT dynamics of Definition 3.1 with step size $0<\gamma\le1/M$. Then the following hold.

(A) Stability and geometric tracking. Let $q:=1-\frac{\gamma\mu}{2}\in(0,1)$. Then, for all $t\ge0$,
$$\|\tilde m_{t+1}-\tilde\nu_t\|\le q\,\|\tilde m_t-\tilde\nu_t\|. \tag{3.3}$$
Moreover, the teacher error to the demonstration anchor satisfies
$$\|\tilde\nu_{t+1}-\tilde\nu(c)\|\le(1-\zeta\lambda)\,\|\tilde\nu_t-\tilde\nu(c)\|+\zeta(1-\lambda)\,\|\tilde m_{t+1}-\tilde\nu_t\|. \tag{3.4}$$
In particular, if $\lambda>0$, then $\tilde\nu_t\to\tilde\nu(c)$ and $\tilde m_t\to\tilde\nu(c)$, and both sequences remain in the ρ-neighborhood where the above curvature bounds apply.

(B) Accumulated old-component drift. Let
$$\tilde L(\beta,m_o,m_n;\alpha,\nu):=KL\big(q_{\beta,m_o,m_n}\,\|\,p_{\alpha,\nu}\big),\qquad q_{\beta,m_o,m_n}(y):=\beta\,\varphi_\Sigma(y;m_o)+(1-\beta)\,\varphi_\Sigma(y;m_n).$$
There exists a finite constant $L_{old}$, depending only on $K,\Sigma,\rho$, such that along the trajectory,
$$\big\|\nabla_{m_o}\tilde L(\beta_t,\mu_o,m_t;\alpha_t,\nu_t)\big\|\le L_{old}\big(\|\tilde m_t-\tilde\nu_t\|+\|\tilde\nu_t-\tilde\nu(c)\|\big). \tag{3.5}$$
If $\lambda>0$, then the right-hand side is summable over $t$, and in particular
$$\sum_{t=0}^{\infty}\big\|\nabla_{m_o}\tilde L(\beta_t,\mu_o,m_t;\alpha_t,\nu_t)\big\|<\infty, \tag{3.6}$$
so the total update pressure on an already-correct old mean is finite (no accumulated drift). Moreover, the student converges to the demonstration-induced equilibrium state: $(\beta_t,m_t)\to(\alpha_c,\nu_c)$ as $t\to\infty$. In particular, for any desired target state $\tilde\nu_\star=(\alpha_\star,\nu_\star)\in\mathbb{R}^{d+1}$ (e.g., $\tilde\nu_\star=(\alpha,\mu_n)$), one has the exact limit identity
$$\lim_{t\to\infty}\big\|(\beta_t,m_t)-\tilde\nu_\star\big\|=\big\|\tilde\nu(c)-\tilde\nu_\star\big\|.$$
(3.7)

Consequently, if the demonstration anchor is $\varepsilon_{demo}$-accurate in the sense that $\|\tilde\nu(c)-\tilde\nu_\star\|\le\varepsilon_{demo}$, then $\limsup_{t\to\infty}\|(\beta_t,m_t)-\tilde\nu_\star\|\le\varepsilon_{demo}$.

Remark 3.1 (Connection to SDFT assumptions). In Shenfeld et al. (2026, Section 3.2), the demonstration-conditioned teacher $\pi(\cdot\mid x,c)$ is argued to be a good target when (i) it is approximately reward-optimal and (ii) it has minimal deviation from the current student in KL. Concretely, the paper motivates conditions of the form
$$\underbrace{\mathbb{E}_{y\sim\pi(\cdot\mid x,c)}[r(y,x)]\approx\mathbb{E}_{y\sim\pi^\star_{k+1}(\cdot\mid x)}[r(y,x)]}_{\text{(1) Optimality}},\qquad \underbrace{KL\big(\pi(\cdot\mid x,c)\,\|\,\pi_k(\cdot\mid x)\big)\approx KL\big(\pi^\star_{k+1}(\cdot\mid x)\,\|\,\pi_k(\cdot\mid x)\big)}_{\text{(2) Minimal deviation}}.$$
Our mixture-model theorem instantiates these two requirements as explicit, checkable assumptions on the teacher state and its dynamics. Condition (1) corresponds to assuming the demonstration anchor $\tilde\nu(c)=(\alpha_c,\nu_c)$ is close to the desired target $(\alpha,\mu_n)$ (up to a controllable approximation error): the teacher retains nontrivial old mass ($\alpha_c$ bounded away from 0) while pointing its "new" component toward the correct new behavior ($\nu_c$ near $\mu_n$). Condition (2) corresponds to our tracking regime: the EMA recursion keeps the evolving teacher $\tilde\nu_t$ close to the student $\tilde m_t$, and we assume the iterates remain in a uniform local neighborhood $\|\tilde m_t-\tilde\nu_t\|\le\rho$ where the phasewise reverse-KL loss is uniformly well-conditioned (strongly convex/smooth). In particular, local curvature implies that "teacher close to student" in parameter space is equivalent (up to constants) to "small KL deviation" in distribution space:
$$KL\big(q_{\tilde m_t}\,\|\,p_{\tilde\nu_t}\big)=\Theta\big(\|\tilde m_t-\tilde\nu_t\|^2\big).$$
Under these two ingredients, the student can improve toward the demonstrated behavior while avoiding accumulated old-mode drift, because the only update channel acting on an already-correct old component is overlap-gated and becomes summable along the exponentially contracting student–teacher lag.

Remark 3.2 (Exponential Dependence on Mode Separation). If along the EMA+demonstrator trajectory the teacher remains uniformly separated from the old mode,
$$\inf_{t\ge0}\|\nu_t-\mu_o\|_{\Sigma^{-1}}\ \ge\ \underline\delta>0,$$
and the tracking tube is chosen small enough that the student new mean also stays separated, e.g.
$$\sup_{t\ge0}\|m_t-\nu_t\|_{\Sigma^{-1}}\ \le\ \|\Sigma^{-1/2}\|_2\,\sup_{t\ge0}\|\tilde m_t-\tilde\nu_t\|\ \le\ \frac{\underline\delta}{2}\ \Rightarrow\ \|m_t-\mu_o\|_{\Sigma^{-1}}\ge\underline\delta/2,$$
and if $\beta_t,\alpha_t$ are bounded away from 0 and 1 (as ensured by $K$ and the tracking assumption), then there is a constant $C_{sep}<\infty$ (depending only on $K,\Sigma$ and the tracking radius) such that for all $t$,
$$\big\|\nabla_{m_o}\tilde L(\beta_t,\mu_o,m_t;\alpha_t,\nu_t)\big\|\ \le\ C_{sep}\,\exp\!\Big(-\frac{\underline\delta_{eff}^2}{8}\Big)\,\|\tilde m_t-\tilde\nu_t\|,\qquad \underline\delta_{eff}:=\underline\delta-\|\Sigma^{-1/2}\|_2\,\rho. \tag{3.8}$$
Combining (3.8) with the geometric tracking bound $\|\tilde m_t-\tilde\nu_t\|\le\kappa^t\|\tilde m_0-\tilde\nu_0\|$ yields the refined summability estimate
$$\sum_{t=0}^{\infty}\big\|\nabla_{m_o}\tilde L(\beta_t,\mu_o,m_t;\alpha_t,\nu_t)\big\|\ \le\ \frac{C_{sep}}{1-\kappa}\,\exp\!\Big(-\frac{\underline\delta_{eff}^2}{8}\Big)\,\|\tilde m_0-\tilde\nu_0\|.$$
Thus, in well-separated regimes, the total update pressure on an already-correct old mean is not only finite but exponentially small in the Mahalanobis separation between the old mode and the (teacher/student) new mode.

Theorem 3.1 shows that SDFT's two stabilizers, (a) reverse-KL tracking and (b) an EMA teacher anchored by demonstrations, jointly prevent mass forgetting and control old-component drift.
Roughly, Part (A) ensures convergence to the anchor state $(\alpha_c,\nu_c)$ and thus preserves nonzero old-mode mass, while Part (B) shows that the gradient acting on an already-correct old mean is summable, ruling out accumulated drift. In particular,
• Part (A) formalizes the on-policy effect: because each phasewise reverse-KL objective is locally strongly convex around the current teacher optimum, a single gradient step contracts the student toward the current teacher by a uniform factor $q<1$, echoing the mode-locality of our static reverse-KL analysis. The teacher recursion then mediates the evolution of the target itself: the EMA term smooths the teacher toward the updated student, while the anchor term pulls it toward the demonstrator summary $\tilde\nu(c)$, with $\lambda$ quantifying the strength of this pull. When $\lambda>0$, the target cannot wander indefinitely: the teacher is attracted to $\tilde\nu(c)$ and, since the student tracks the teacher, both sequences converge to the same demonstrated state $(\alpha_c,\nu_c)$.
• Part (B) converts this tracking picture into a continual-learning guarantee: the update signal on an already-correct old mean is Lipschitz in the student-teacher lag (and the teacher-anchor mismatch), hence it decays along the trajectory and is summable, ruling out accumulated drift of the old distribution.
Finally, the limit characterization (3.7) makes the consistency story explicit: the asymptotic distance to any desired target $\tilde\nu^\star$ is exactly the demonstrator's mismatch $\|\tilde\nu(c)-\tilde\nu^\star\|$, so an approximately correct demonstration yields an approximately correct limit.

3.2 Mixture-Model Analysis of TTT-Discover
TTT-Discover (Yuksekgonul et al., 2026) updates a model during test-time search on a single hard instance, rather than freezing the policy and relying only on prompting or search heuristics.
Its key ingredient is an entropic objective that reweights on-policy samples by $\exp(\eta r)$, thereby favoring high-reward outputs, together with an explicit KL penalty to a fixed reference policy that limits how far the policy can drift. From the perspective of continual learning, this creates a natural tension between discovery (shifting mass toward high-reward behaviors) and retention (preserving previously learned behaviors encoded in the reference). In our two-mode mixture model, this raises two concrete questions. First, can the entropic objective itself cause strong forgetting, i.e. collapse of the old-mode mass $\beta$? Second, when the old mode is already correctly represented, does the objective exert a nontrivial drift signal on the old mean, or is this weak forgetting suppressed by overlap? The next lemma and theorem answer these questions first in a disjoint-support idealization, and then in the Gaussian mixture setting of the paper.

A KL-anchored Entropic Objective. Let $r:\mathbb{R}^d\to\mathbb{R}$ be a measurable reward and fix an entropic parameter $\eta>0$. For any density $q$, define
$$J_\eta(q):=\log\mathbb{E}_{Y\sim q}\big[e^{\eta r(Y)}\big].$$
Fix a reference density $q_0$ and a KL coefficient $\lambda_{\mathrm{ref}}\ge 0$, and define
$$\mathcal{L}_{\eta,\lambda_{\mathrm{ref}}}(q):=J_\eta(q)-\lambda_{\mathrm{ref}}\,\mathrm{KL}(q\,\|\,q_0). \qquad (3.9)$$

Two-Mode Mixture Family. To analyze the objective (3.9), we first restrict the learner to a two-mode mixture family
$$q_\beta(y):=\beta\,p_o(y)+(1-\beta)\,p_n(y),\qquad \beta\in[0,1].$$
We take the reference density to be a fixed mixture $q_0=q_{\beta_0}$ for some $\beta_0\in(0,1)$, which serves as the KL anchor in (3.9).

Lemma 3.1 (Disjoint-support Intuition for TTT-Discover). Fix $\eta>0$ and $\lambda_{\mathrm{ref}}\ge 0$. Consider the disjoint-support setting from Definition 1.1 and assume the reward is constant on each region:
$$r(y)=u_o\ \text{for } y\in A_o,\qquad r(y)=u_n\ \text{for } y\in A_n, \qquad (3.10)$$
for constants $u_o,u_n\in\mathbb{R}$. Then:
1. For every $\beta\in[0,1]$,
$$J_\eta(q_\beta)=\log\big(\beta e^{\eta u_o}+(1-\beta)e^{\eta u_n}\big),\qquad \mathrm{KL}(q_\beta\,\|\,q_{\beta_0})=\beta\log\frac{\beta}{\beta_0}+(1-\beta)\log\frac{1-\beta}{1-\beta_0}. \qquad (3.11)$$
2. If $\lambda_{\mathrm{ref}}=0$ and $u_n>u_o$, then $\beta^\star=0$ is the unique maximizer of $\beta\mapsto\mathcal{L}_{\eta,0}(q_\beta)$. (Symmetrically, if $u_o>u_n$, then $\beta^\star=1$.)
3. If $\lambda_{\mathrm{ref}}>0$ and $u_o\ne u_n$, then $\beta\mapsto\mathcal{L}_{\eta,\lambda_{\mathrm{ref}}}(q_\beta)$ is strictly concave on $(0,1)$ and has a unique maximizer $\beta^\star\in(0,1)$. If $u_o=u_n$, then the unique maximizer is $\beta^\star=\beta_0$.

The disjoint-support case cleanly isolates the basic mechanism. Without the KL anchor ($\lambda_{\mathrm{ref}}=0$), the entropic utility is purely mode-seeking: it places all mass on whichever mode has higher reward, so if the new mode is preferred then strong forgetting occurs through $\beta^\star=0$. Once $\lambda_{\mathrm{ref}}>0$, however, the KL term forces the optimizer to remain in the interior whenever the reference retains both modes, so mass collapse is ruled out in this idealized setting. Thus, in the absence of overlap, retention is controlled entirely by whether the reference policy itself preserves old mass.

Now considering the two-mode Gaussian mixture setting from Section 2.1, we return to the full parametric mixture family
$$q_{\beta,m_o,m_n}(y)=\beta\,\varphi_\Sigma(y;m_o)+(1-\beta)\,\varphi_\Sigma(y;m_n),$$
in which the component means are also learned.

Theorem 3.2 (TTT-Discover in the Gaussian mixture setting). Consider the target model in (2.1). Fix the learner model (see (2.2)) means at the true old/new means and define $q_\beta(y):=q_{\beta,\mu_o,\mu_n}(y)=\beta\,p_o(y)+(1-\beta)\,p_n(y)$ for $\beta\in[0,1]$, and $q_0:=q_{\beta_0}$ for some $\beta_0\in(0,1)$. Let $\gamma:=\Phi(-\delta/2)$, where $\Phi$ is the standard normal CDF, and define $\kappa:=1-2\gamma\in(0,1)$.
Define the Bayes partition
$$A_n:=\Big\{y\in\mathbb{R}^d:\ (\mu_n-\mu_o)^\top\Sigma^{-1}\Big(y-\frac{\mu_o+\mu_n}{2}\Big)\ge 0\Big\},\qquad A_o:=\mathbb{R}^d\setminus A_n, \qquad (3.12)$$
and let the reward be the corresponding two-level step function
$$r(y)=u_o\ \text{for } y\in A_o,\qquad r(y)=u_n\ \text{for } y\in A_n. \qquad (3.13)$$
Then the following hold.
(A) Suppose $u_n>u_o$. Define $D(\beta):=\mathrm{KL}(q_\beta\,\|\,q_{\beta_0})$,
$$J_\eta(q_\beta)=\log\big(e^{\eta u_o}(\gamma+\kappa\beta)+e^{\eta u_n}(1-\gamma-\kappa\beta)\big),$$
and let
$$\lambda_{\mathrm{crit}}^{(\mathrm{new})}:=\frac{-J_\eta'(q_\beta)\big|_{\beta=0}}{-D'(0)}=\frac{\kappa\big(e^{\eta u_n}-e^{\eta u_o}\big)}{\big(\gamma e^{\eta u_o}+(1-\gamma)e^{\eta u_n}\big)\,(-D'(0))}>0. \qquad (3.14)$$
Then:
• if $0\le\lambda_{\mathrm{ref}}\le\lambda_{\mathrm{crit}}^{(\mathrm{new})}$, the unique maximizer of $\beta\mapsto\mathcal{L}_{\eta,\lambda_{\mathrm{ref}}}(q_\beta)$ is $\beta^\star=0$;
• if $\lambda_{\mathrm{ref}}>\lambda_{\mathrm{crit}}^{(\mathrm{new})}$, the unique maximizer satisfies $\beta^\star\in(0,\beta_0)$.
Thus, unlike the disjoint-support setting, a positive KL anchor does not automatically prevent strong forgetting: the anchor must be sufficiently strong.
(B) Consider now the learner family in (2.2) and define
$$J_\eta(\beta,m_o,m_n):=\log\mathbb{E}_{Y\sim q_{\beta,m_o,m_n}}\big[e^{\eta r(Y)}\big].$$
At the correct old mean $m_o=\mu_o$, the gradient satisfies
$$\nabla_{m_o}J_\eta(\beta,\mu_o,m_n)=\beta\,(w_n-w_o)\,\frac{\varphi(\delta/2)}{\delta}\,\Sigma^{-1}(\mu_n-\mu_o), \qquad (3.15)$$
where $\varphi(t)=(2\pi)^{-1/2}e^{-t^2/2}$ is the standard normal density and
$$w_o:=\frac{e^{\eta u_o}}{\mathbb{E}_{Y\sim q_{\beta,\mu_o,m_n}}[e^{\eta r(Y)}]},\qquad w_n:=\frac{e^{\eta u_n}}{\mathbb{E}_{Y\sim q_{\beta,\mu_o,m_n}}[e^{\eta r(Y)}]}.$$
If moreover $|u_o|,|u_n|\le R$, then
$$\big\|\nabla_{m_o}J_\eta(\beta,\mu_o,m_n)\big\|\ \le\ \beta\,\big(e^{2\eta R}-e^{-2\eta R}\big)\,\frac{1}{\sqrt{2\pi}}\,\frac{e^{-\delta^2/8}}{\delta}\,\big\|\Sigma^{-1}(\mu_n-\mu_o)\big\|. \qquad (3.16)$$
In particular, the old-mean learning signal is exponentially small in the Mahalanobis separation $\delta$.
Furthermore, if the full KL-anchored objective is evaluated at a synchronized point $q_0=q_{\beta,\mu_o,m_n}$, then the KL term has zero gradient in $m_o$, so the same bound applies to the full TTT-style objective at that point.

Theorem 3.2 shows that the disjoint-support intuition survives qualitatively in the overlapping Gaussian setting, but with an important refinement. As in the idealized case, the unanchored entropic utility is intrinsically mode-seeking and therefore reallocates mass toward the higher-reward mode, potentially causing strong forgetting through collapse of the old-mode weight. However, unlike the disjoint-support regime, a positive KL anchor does not automatically rule out this collapse, because the Gaussian components have common support and the KL penalty remains finite at the boundary $\beta=0$. Instead, retention requires the anchor to be sufficiently strong, and the threshold (3.14) makes this tradeoff explicit: the reward tilt promotes discovery, while the KL anchor must be large enough to counterbalance that tendency and preserve memory. At the same time, part (B) shows that the weak-forgetting story is much more favorable. Even in regimes where mass collapse is still possible, the learning signal on an already-correct old mean is highly localized: once the old mode is correctly positioned, only samples near the Bayes decision boundary generate nontrivial update pressure, and that pressure decays exponentially in the Mahalanobis separation $\delta$. This mirrors the reverse-KL locality results proved earlier in the paper, where overlap similarly controls the extent to which an old mode can be perturbed. Thus, in the Gaussian mixture model, TTT-style entropic objectives cleanly separate two effects: the reward term drives discovery by reallocating mixture mass, while the geometry of overlap controls the drift of already-correct old parameters.
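The trichotomy of Lemma 3.1 is easy to verify numerically in the disjoint-support case, where both $J_\eta$ and the KL term have closed forms. A small sketch (our own check; the grid search is a stand-in for solving the exact first-order condition):

```python
import math

def entropic_objective(beta, eta, u_o, u_n, lam, beta0):
    """L_{eta,lam}(q_beta) in the disjoint-support setting of Lemma 3.1:
    J_eta has a closed form and KL(q_beta || q_beta0) is a Bernoulli KL."""
    J = math.log(beta * math.exp(eta * u_o) + (1 - beta) * math.exp(eta * u_n))
    eps = 1e-12  # guard the 0*log(0) endpoint terms
    D = beta * math.log(max(beta, eps) / beta0) \
        + (1 - beta) * math.log(max(1 - beta, eps) / (1 - beta0))
    return J - lam * D

def argmax_beta(eta, u_o, u_n, lam, beta0, grid=10001):
    """Brute-force maximizer of beta -> L_{eta,lam}(q_beta) on a uniform grid."""
    betas = [i / (grid - 1) for i in range(grid)]
    return max(betas, key=lambda b: entropic_objective(b, eta, u_o, u_n, lam, beta0))
```

Without the anchor the maximizer collapses to $\beta^\star=0$ (strong forgetting); any positive anchor keeps it in the interior; equal rewards recover the reference weight $\beta_0$.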
We further provide an exact characterization of $\beta^*$ for both the disjoint case and the Gaussian case in a proposition in Section B.11.

3.3 Mixture-Model Analysis of OAPL
We next consider the OAPL algorithm (Brantley et al., 2025; Ritter et al., 2026), which uses a frozen reference policy $q_0$ (typically a lagged inference engine) both to generate samples and to define the update target. In distribution space, the corresponding KL-regularized improvement problem has the closed-form optimizer
$$q^*(y)=\frac{1}{Z}\,q_0(y)\,e^{r(y)/\tau},\qquad Z:=\mathbb{E}_{Y\sim q_0}\big[e^{r(Y)/\tau}\big],$$
and OAPL fits this target by a squared "advantage-matching" regression under the same reference measure. This makes OAPL technically off-policy relative to the current trainable parameters, but still self-consistent: the sampling distribution and the optimization target are defined from the same reference policy.

OAPL Setup (Single Phase). Fix a frozen reference mixture
$$q_0(y):=q_{\beta_0,\mu_o,\mu_n}(y)=\beta_0\,p_o(y)+(1-\beta_0)\,p_n(y),\qquad \beta_0\in(0,1), \qquad (3.17)$$
where $p_o(y):=\varphi_\Sigma(y;\mu_o)$, $p_n(y):=\varphi_\Sigma(y;\mu_n)$, $\mu_o\ne\mu_n$, $\Sigma\succ 0$. Let $r:\mathbb{R}^d\to\mathbb{R}$ be measurable and $\tau>0$. Define
$$V^*:=\tau\log\mathbb{E}_{Y\sim q_0}\big[e^{r(Y)/\tau}\big],\qquad A^*(y):=r(y)-V^*,\qquad q^*(y):=q_0(y)\,e^{A^*(y)/\tau}.$$
For the parametric family
$$q_{\beta,m_n}(y):=\beta\,p_o(y)+(1-\beta)\,\varphi_\Sigma(y;m_n),$$
the OAPL population regression objective is
$$J(\beta,m_n):=\mathbb{E}_{Y\sim q_0}\Big[\Big(\tau\log\frac{q_{\beta,m_n}(Y)}{q_0(Y)}-A^*(Y)\Big)^2\Big].$$
We first analyze the disjoint-support case, before analyzing the Gaussian mixture model.

Lemma 3.2 (Disjoint-support OAPL Only Reweights Existing Mode Mass). Consider the disjoint-support setting from Definition 1.1 and the step-wise constant rewards as in (3.10). Then the OAPL target $q^*$ is again a two-component mixture with unchanged components,
$$q^*(y)=\beta^*\,p_o(y)+(1-\beta^*)\,p_n(y),\qquad \beta^*=\frac{\beta_0\,e^{r_o/\tau}}{\beta_0\,e^{r_o/\tau}+(1-\beta_0)\,e^{r_n/\tau}}. \qquad (3.18)$$
In particular, if $\beta_0\in(0,1)$ and $r_o,r_n$ are finite, then $\beta^*\in(0,1)$.

In the disjoint-support idealization, OAPL cannot create or destroy modes; it can only reweight the mass already present in the reference. The explicit formula (3.18) shows that old-mode mass is retained whenever the reference already assigns nonzero mass to it. Equally importantly, if the reference has already collapsed the old mode (e.g. $\beta_0=0$), then OAPL cannot recover it in this idealization.

Theorem 3.3 (OAPL in the Gaussian Mixture Model). Let the reference mixture $q_0$ be given by (3.17), and assume the reward is the two-level step function as in (3.13) based on the Bayes partition in (3.12). Let $r_o^{(0)}(y):=\beta_0\,p_o(y)/q_0(y)$ denote the old responsibility under the frozen reference. Then the following hold.
(A) The expected old responsibility of the target $q^*$ is (with $\gamma=\Phi(-\delta/2)$ as in Theorem 3.2)
$$\mathbb{E}_{Y\sim q^*}\big[r_o^{(0)}(Y)\big]=\frac{\beta_0\big((1-\gamma)e^{r_o/\tau}+\gamma e^{r_n/\tau}\big)}{\beta_0\big((1-\gamma)e^{r_o/\tau}+\gamma e^{r_n/\tau}\big)+(1-\beta_0)\big(\gamma e^{r_o/\tau}+(1-\gamma)e^{r_n/\tau}\big)}. \qquad (3.19)$$
In particular, if $\beta_0\in(0,1)$ and $r_o,r_n$ are finite, then $\mathbb{E}_{Y\sim q^*}[r_o^{(0)}(Y)]\in(0,1)$, so the OAPL target cannot completely forget the old mode as long as the reference retains nonzero old mass.
(B) For the parametric objective $J(\beta,m_n)$ above, the gradient with respect to the new mean satisfies
$$\nabla_{m_n}J(\beta,m_n)=2\tau\,\mathbb{E}_{Y\sim q_0}\big[\Delta_{\beta,m_n}(Y)\,r_n^{(\beta,m_n)}(Y)\,\Sigma^{-1}(Y-m_n)\big],\qquad \Delta_{\beta,m_n}(y):=\tau\log\frac{q_{\beta,m_n}(y)}{q_0(y)}-A^*(y). \qquad (3.20)$$
Hence the influence of old-mode samples on the new-mean update is gated by the new-component responsibility $r_n^{(\beta,m_n)}$. If moreover $|r(y)|\le R$ for all $y$, then at the synchronized point $(\beta,m_n)=(\beta_0,\mu_n)$,
$$\Big\|2\tau\,\beta_0\,\mathbb{E}_{Y\sim p_o}\big[A^*(Y)\,r_n^{(\beta_0,\mu_n)}(Y)\,\Sigma^{-1}(Y-\mu_n)\big]\Big\|\ \le\ 4\tau R\,\beta_0\,\sqrt{M_{o\to n}\,\varepsilon_{o\to n}^{\mathrm{ref}}}, \qquad (3.21)$$
where
$$M_{o\to n}:=\mathbb{E}_{Y\sim p_o}\big[\|\Sigma^{-1}(Y-\mu_n)\|^2\big]=\mathrm{tr}(\Sigma^{-1})+(\mu_o-\mu_n)^\top\Sigma^{-2}(\mu_o-\mu_n),$$
and
$$\varepsilon_{o\to n}^{\mathrm{ref}}:=\mathbb{E}_{Y\sim p_o}\big[r_n^{(\beta_0,\mu_n)}(Y)\big]\ \le\ \frac{1}{2}\sqrt{\frac{1-\beta_0}{\beta_0}}\,\exp\!\Big(-\frac{1}{8}\|\mu_n-\mu_o\|_{\Sigma^{-1}}^2\Big).$$
Thus the old-mode contribution to the update of the new mean is exponentially small in the separation.

Part (A) is the Gaussian analogue of the disjoint-support mass-reweighting formula. Because the Gaussian components overlap, the target $q^*$ is no longer exactly a mixture in the same family, but its expected old responsibility admits the explicit formula (3.19). This shows that, unlike purely new-only forward-KL SFT, OAPL cannot entirely discard the old mode unless the frozen reference has already done so. Part (B) shows that OAPL is also geometrically local: the contribution of old-mode samples to the update of the new mean is suppressed by the overlap factor $\varepsilon_{o\to n}^{\mathrm{ref}}$, which decays exponentially with the Mahalanobis separation. Thus OAPL combines two stabilizing features in this toy model: the reference anchor retains a positive amount of old mass, and the parametric update remains localized because cross-mode influence is exponentially small.

4 Conclusion
Within a mixture-model framework, we analyze how two predominant continual-learning mechanisms, namely on-policy sampling and replay-based memory access to past behavior, mitigate catastrophic forgetting.
In the absence of these mechanisms, the effective training distribution becomes dominated by new information, making forgetting difficult to avoid even when the model class can represent both old and new behaviors. Our analysis shows that forward-KL objectives trained on new-only data naturally induce mass forgetting of the old behavior, whereas reverse-KL-style updates aligned with on-policy targets retain old modes and produce only overlap-controlled drift. Replay interacts with these objectives in fundamentally different ways: under forward-KL it must modify the training distribution to prevent mass forgetting, while under reverse-KL it primarily stabilizes stochastic optimization by ensuring persistent visibility of past behavior. These insights extend to several modern near-on-policy post-training methods, including SDFT, TTT-Discover, and OAPL, whose updates can be interpreted through the same mixture-model perspective. Future work includes extending this analysis to high-dimensional generative models where behaviors correspond to richer semantic modes rather than mixture models. Another promising direction is to design principled post-training algorithms that explicitly balance exploration of new behaviors with retention of past capabilities using theoretically grounded sampling or memory mechanisms, built upon the insights derived in this work.

References
Banayeeanzade et al. (2025) M. Banayeeanzade, M. Soltanolkotabi, and M. Rostami. Theoretical Insights into Overparameterized Models in Multi-Task and Replay-Based Continual Learning. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=4zGPT0ZwnU. Bennani et al. (2021) M. A. Bennani, T. Doan, and M. Sugiyama. Generalisation Guarantees For Continual Learning With Orthogonal Gradient Descent. 2021. URL https://openreview.net/forum?id=hecuSLbL_vC. Bhattacharyya (1943) A. K. Bhattacharyya.
On a measure of divergence between two statistical populations defined by their probability distribution. Bulletin of the Calcutta Mathematical Society, 35:99–110, 1943. Brantley et al. (2025) K. Brantley, M. Chen, Z. Gao, J. D. Lee, W. Sun, W. Zhan, and X. Zhang. Accelerating RL for LLM Reasoning with Optimal Advantage Regression. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=T1V8BJO0iG. Cai and Diakonikolas (2025) X. Cai and J. Diakonikolas. Last Iterate Convergence of Incremental Methods as a Model of Forgetting. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=mSGcDhQPwm. Chan et al. (2022) A. Chan, H. Silva, S. Lim, T. Kozuno, A. R. Mahmood, and M. White. Greedification operators for policy optimization: Investigating forward and reverse KL divergences. Journal of Machine Learning Research, 23(253):1–79, 2022. Chen et al. (2025) H. Chen, N. Razin, K. Narasimhan, and D. Chen. Retaining by doing: The role of on-policy data in mitigating forgetting. arXiv preprint arXiv:2510.18874, 2025. Chen et al. (2010) L. H. Chen, L. Goldstein, and Q.-M. Shao. Normal approximation by Stein’s method. Springer Science & Business Media, 2010. Chen et al. (2022) X. Chen, C. Papadimitriou, and B. Peng. Memory bounds for continual learning. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 519–530. IEEE, 2022. Deng et al. (2025) J. Deng, Q. Wu, P. Ju, S. Lin, Y. Liang, and N. Shroff. Unlocking the Power of Rehearsal in Continual Learning: A Theoretical Perspective. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=p6nhzZ9ilZ. Ding et al. (2024) M. Ding, K. Ji, D. Wang, and J. Xu. Understanding Forgetting in Continual Learning with Linear Regression. In International Conference on Machine Learning, pages 10978–11001. PMLR, 2024. Doan et al. (2021) T. 
Doan, M. A. Bennani, B. Mazoure, G. Rabusseau, and P. Alquier. A theoretical analysis of catastrophic forgetting through the NTK overlap matrix. In International Conference on Artificial Intelligence and Statistics, pages 1072–1080. PMLR, 2021. Evron et al. (2022) I. Evron, E. Moroshko, R. Ward, N. Srebro, and D. Soudry. How catastrophic can catastrophic forgetting be in linear regression? In Conference on Learning Theory, pages 4028–4079. PMLR, 2022. Evron et al. (2023) I. Evron, E. Moroshko, G. Buzaglo, M. Khriesh, B. Marjieh, N. Srebro, and D. Soudry. Continual learning in linear classification on separable data. In International Conference on Machine Learning, pages 9440–9484. PMLR, 2023. Friedman and Meir (2026) L. Friedman and R. Meir. PAC-Bayes bounds for cumulative loss in Continual Learning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=hWw269fPov. Graldi et al. (2025) J. Graldi, A. Breccia, G. Lanzillotta, T. Hofmann, and L. Noci. The importance of being lazy: Scaling limits of continual learning. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=edhBkkYS8R. Hübotter et al. (2026) J. Hübotter, F. Lübeck, L. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrin, et al. Reinforcement Learning via Self-Distillation. arXiv preprint arXiv:2601.20802, 2026. Karakida and Akaho (2022) R. Karakida and S. Akaho. Learning curves for continual learning in neural networks: Self-knowledge transfer and forgetting. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=tFgdrQbbaa. Karpel et al. (2026) G. Karpel, E. Moroshko, R. Levinstein, R. Meir, D. Soudry, and I. Evron. Optimal $L_2$ Regularization in High-dimensional Continual Linear Regression. arXiv preprint arXiv:2601.13844, 2026. Kirkpatrick et al. (2017) J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J.
Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming Catastrophic Forgetting in Neural Networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017. Korbak et al. (2022) T. Korbak, H. Elsahar, G. Kruszewski, and M. Dymetman. On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting. Advances in Neural Information Processing Systems, 35:16203–16220, 2022. Li et al. (2025) H. Li, S. Lin, L. Duan, Y. Liang, and N. Shroff. Theory on Mixture-of-Experts in Continual Learning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=7XgKAabsPp. Li and Hoiem (2017) Z. Li and D. Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017. Lin et al. (2023) S. Lin, P. Ju, Y. Liang, and N. Shroff. Theory on forgetting and generalization of continual learning. In International Conference on Machine Learning, pages 21078–21100. PMLR, 2023. Liu et al. (2025) E. Liu, G. Neubig, and C. Xiong. Midtraining Bridges Pretraining and Posttraining Distributions. arXiv preprint arXiv:2510.14865, 2025. Lopez-Paz and Ranzato (2017) D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30, 2017. Lu and T.M.Lab (2025) K. Lu and T.M.Lab. On-Policy Distillation. Thinking Machines Lab: Connectionism, 2025. URL https://thinkingmachines.ai/blog/on-policy-distillation. Penaloza et al. (2026) E. Penaloza, D. Vattikonda, N. Gontier, A. Lacoste, L. Charlin, and M. Caccia. Privileged Information Distillation for Language Models. arXiv preprint arXiv:2602.04942, 2026. Ritter et al. (2026) D. Ritter, O. Oertell, B. Guo, J. Chang, K. Brantley, and W. Sun. LLMs Can Learn to Reason Via Off-Policy RL. arXiv preprint arXiv:2602.19362, 2026. 
Schaul et al. (2015) T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015. Schwarz et al. (2018) J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell. Progress & compress: A scalable framework for continual learning. In International conference on machine learning, pages 4528–4537. PMLR, 2018. Shah et al. (2025) V. Shah, J. Obando-Ceron, V. Jain, B. Bartoldson, B. Kailkhura, S. Mittal, G. Berseth, P. S. Castro, Y. Bengio, N. Malkin, et al. A Comedy of Estimators: On KL Regularization in RL Training of LLMs. arXiv preprint arXiv:2512.21852, 2025. Shenfeld et al. (2025) I. Shenfeld, J. Pari, and P. Agrawal. RL’s razor: Why online reinforcement learning forgets less. arXiv preprint arXiv:2509.04259, 2025. Shenfeld et al. (2026) I. Shenfeld, M. Damani, J. Hübotter, and P. Agrawal. Self-Distillation Enables Continual Learning. arXiv preprint arXiv:2601.19897, 2026. Shi et al. (2025) H. Shi, Z. Xu, H. Wang, W. Qin, W. Wang, Y. Wang, Z. Wang, S. Ebrahimi, and H. Wang. Continual learning of large language models: A comprehensive survey. ACM Computing Surveys, 58(5):1–42, 2025. Shin et al. (2017) H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017. Taheri et al. (2025) H. Taheri, A. Ghosh, and A. Mazumdar. On the Theory of Continual Learning with Gradient Descent for Neural Networks. arXiv preprint arXiv:2510.05573, 2025. Tajwar et al. (2024) F. Tajwar, A. Singh, A. Sharma, R. Rafailov, J. Schneider, T. Xie, S. Ermon, C. Finn, and A. Kumar. Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=bWNPx6t0sF. Tang and Munos (2025) Y. Tang and R. Munos. On a few pitfalls in KL divergence gradient estimation for RL. arXiv preprint arXiv:2506.09477, 2025. 
Wang et al. (2024) L. Wang, X. Zhang, H. Su, and J. Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE transactions on pattern analysis and machine intelligence, 46(8):5362–5383, 2024. Yuksekgonul et al. (2026) M. Yuksekgonul, D. Koceja, X. Li, F. Bianchi, J. McCaleb, X. Wang, J. Kautz, Y. Choi, J. Zou, C. Guestrin, et al. Learning to discover at test time. arXiv preprint arXiv:2601.16175, 2026. Zhao et al. (2026a) A. Zhao, Z. Chen, J. Tong, Y. Fan, F. Ye, S. Li, Y. Ma, W. Li, and X. Shen. On-Policy Supervised Fine-Tuning for Efficient Reasoning. arXiv preprint arXiv:2602.13407, 2026a. Zhao et al. (2026b) S. Zhao, Z. Xie, M. Liu, J. Huang, G. Pang, F. Chen, and A. Grover. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models. arXiv preprint arXiv:2601.18734, 2026b.

Appendix
Appendix A RL-based Post-training and Reverse-KL Minimization
In KL-regularized RL (the standard trust-region formulation of RL), the policy update is posed as
$$\max_\pi\ \mathbb{E}_\pi[r]-\tau\,\mathrm{KL}(\pi\,\|\,\pi_{\mathrm{ref}}),$$
which penalizes deviation from a reference policy $\pi_{\mathrm{ref}}$ while improving reward. The unique optimizer is the exponential tilt
$$\pi^*(y)=\frac{1}{Z}\,\pi_{\mathrm{ref}}(y)\,e^{r(y)/\tau},\qquad Z:=\mathbb{E}_{Y\sim\pi_{\mathrm{ref}}}\big[e^{r(Y)/\tau}\big].$$
Assuming $Z<\infty$, this distribution is well defined. Moreover, for any $\pi\ll\pi_{\mathrm{ref}}$ one has the exact identity
$$\tau\log Z-\big(\mathbb{E}_\pi[r]-\tau\,\mathrm{KL}(\pi\,\|\,\pi_{\mathrm{ref}})\big)=\tau\,\mathrm{KL}(\pi\,\|\,\pi^*).$$
Thus maximizing the KL-regularized RL objective is equivalent (up to the additive constant $\tau\log Z$) to minimizing $\mathrm{KL}(\pi\,\|\,\pi^*)$ to the reward-tilted target $\pi^*$; see Korbak et al. (2022) for more details.
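On a finite support, both the exponential tilt and the identity above can be checked exactly. A self-contained sketch (the distributions and numbers are arbitrary illustrations, not from the paper):

```python
import math

def tilt(pref, r, tau):
    """Exponential tilt pi*(y) = pref(y) * exp(r(y)/tau) / Z on a finite support."""
    w = [p * math.exp(ri / tau) for p, ri in zip(pref, r)]
    Z = sum(w)
    return [wi / Z for wi in w], Z

def kl(p, q):
    """KL(p || q) for finite distributions (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(p, pref, r, tau):
    """KL-regularized RL objective E_p[r] - tau * KL(p || pref)."""
    return sum(pi * ri for pi, ri in zip(p, r)) - tau * kl(p, pref)
```

For any competitor $\pi$, the gap $\tau\log Z$ minus the objective equals $\tau\,\mathrm{KL}(\pi\,\|\,\pi^*)$, so $\pi^*$ is the unique maximizer.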
A concrete example in our two-mode Gaussian setting (defined in Section 2.1) is obtained by choosing
$$\pi_{\mathrm{ref}}(y):=q_{\beta_0}(y)=\beta_0\,p_o(y)+(1-\beta_0)\,p_n(y),\qquad \beta_0\in(0,1),$$
and defining the reward
$$r(y):=\tau\log\frac{p_\alpha(y)}{q_{\beta_0}(y)}=\tau\log\frac{\alpha\,p_o(y)+(1-\alpha)\,p_n(y)}{\beta_0\,p_o(y)+(1-\beta_0)\,p_n(y)}.$$
This reward can be interpreted as a log-density correction: it assigns positive reward to outputs that are underweighted by the reference policy relative to the desired true target $p_\alpha$, and negative reward to outputs that are overweighted, so that the KL-regularized RL update exactly tilts the reference policy toward $p_\alpha$. In this case, we have
$$Z=\mathbb{E}_{Y\sim\pi_{\mathrm{ref}}}\big[e^{r(Y)/\tau}\big]=\int q_{\beta_0}(y)\,\frac{p_\alpha(y)}{q_{\beta_0}(y)}\,dy=\int p_\alpha(y)\,dy=1,$$
so the tilted optimizer becomes
$$\pi^*(y)=\frac{1}{Z}\,\pi_{\mathrm{ref}}(y)\,e^{r(y)/\tau}=q_{\beta_0}(y)\,\frac{p_\alpha(y)}{q_{\beta_0}(y)}=p_\alpha(y)=\alpha\,p_o(y)+(1-\alpha)\,p_n(y).$$
Thus, with this choice of reference policy and reward, the KL-regularized RL solution exactly recovers the Gaussian mixture target.

Appendix B Proofs
We state a few results used in the proofs.
Lemma B.1 (Gaussian Stein identity (Chen et al., 2010)). Let $Y\sim\mathcal{N}(\mu,\Sigma)$ with $\Sigma\succ 0$, and let $g:\mathbb{R}^d\to\mathbb{R}$ be continuously differentiable with $\mathbb{E}[\|\nabla g(Y)\|]<\infty$. Then
$$\mathbb{E}\big[\Sigma^{-1}(Y-\mu)\,g(Y)\big]=\mathbb{E}\big[\nabla g(Y)\big].$$
Lemma B.2 (Fourth moment of a Gaussian quadratic form). Let $Z\sim\mathcal{N}(0,I_d)$ and let $A\succeq 0$ be symmetric. Then
$$\mathbb{E}\big[(Z^\top AZ)^2\big]\;=\;2\,\mathrm{tr}(A^2)+\big(\mathrm{tr}(A)\big)^2.$$
Proof. This is standard; one way to establish the result is to diagonalize $A=U^\top\Lambda U$ with $\Lambda=\mathrm{diag}(\lambda_1,\dots,\lambda_d)$ and $UZ\overset{d}{=}Z$. Then $Z^\top AZ=\sum_i\lambda_iZ_i^2$ and
$$\mathbb{E}\Big[\Big(\sum_i\lambda_iZ_i^2\Big)^2\Big]=\sum_i\lambda_i^2\,\mathbb{E}[Z_i^4]+\sum_{i\ne j}\lambda_i\lambda_j\,\mathbb{E}[Z_i^2]\,\mathbb{E}[Z_j^2]=3\sum_i\lambda_i^2+\sum_{i\ne j}\lambda_i\lambda_j=2\sum_i\lambda_i^2+\Big(\sum_i\lambda_i\Big)^2,$$
establishing the result. ∎

B.1 Proof of Lemma 2.1
Proof of Lemma 2.1. For the first bound,
$$\mathbb{E}_g[r_f(Y)]=\int g(y)\,\frac{wf(y)}{wf(y)+(1-w)g(y)}\,dy=\int\frac{wf(y)g(y)}{wf(y)+(1-w)g(y)}\,dy.$$
Using $a+b\ge 2\sqrt{ab}$ with $a=wf(y)$ and $b=(1-w)g(y)$ yields
$$wf(y)+(1-w)g(y)\ \ge\ 2\sqrt{w(1-w)f(y)g(y)},$$
so the integrand is at most
$$\frac{wf(y)g(y)}{2\sqrt{w(1-w)f(y)g(y)}}=\frac{1}{2}\sqrt{\frac{w}{1-w}}\,\sqrt{f(y)g(y)}.$$
Integrating gives the stated inequality, and the second inequality follows by symmetry. ∎

B.2 Proof of Theorem 2.2
Proof of Theorem 2.2. Define the likelihood ratio $X(y):=p_o(y)/p_n(y)$. Then for every $y$,
$$q_\beta(y)=\beta\,p_o(y)+(1-\beta)\,p_n(y)=p_n(y)\big((1-\beta)+\beta X(y)\big),$$
so
$$L_{\mathrm{SFT}}(\beta)=\mathrm{KL}(p_n\,\|\,q_\beta)=\mathbb{E}_{Y\sim p_n}\Big[\log\frac{p_n(Y)}{q_\beta(Y)}\Big]=\mathbb{E}_{p_n}\big[-\log\big((1-\beta)+\beta X(Y)\big)\big].$$
Since $\mathbb{E}_{p_n}[X(Y)]=\int p_o=1$ and $-\log$ is strictly convex, Jensen's inequality gives
$$L_{\mathrm{SFT}}(\beta)\ \ge\ -\log\big((1-\beta)+\beta\,\mathbb{E}[X(Y)]\big)=-\log(1)=0,$$
with strict inequality for $\beta>0$ because $X$ is not a.s. constant under $p_n$ when $\mu_o\ne\mu_n$. Thus $\beta=0$ is the unique minimizer. Differentiate under the expectation (justified since $(1-\beta)+\beta X(Y)\ge 1-\beta>0$ and Gaussians have all moments):
$$L_{\mathrm{SFT}}'(\beta)=-\mathbb{E}_{p_n}\Big[\frac{X(Y)-1}{(1-\beta)+\beta X(Y)}\Big],\qquad L_{\mathrm{SFT}}''(\beta)=\mathbb{E}_{p_n}\Big[\frac{(X(Y)-1)^2}{\big((1-\beta)+\beta X(Y)\big)^2}\Big].$$
Since $X$ is not a.s. constant, $L_{\mathrm{SFT}}''(\beta)>0$ for all $\beta\in(0,1)$, hence $L_{\mathrm{SFT}}'$ is strictly increasing. Also $L_{\mathrm{SFT}}'(0)=-\mathbb{E}[X(Y)-1]=0$, so $L_{\mathrm{SFT}}'(\beta)>0$ for all $\beta\in(0,1)$. Therefore $L_{\mathrm{SFT}}$ is strictly increasing on $[0,1]$. Now parameterize $\beta=\sigma(\phi)$.
Since $d\beta/d\phi=\beta(1-\beta)>0$,
$$\frac{d}{d\phi}L_{\mathrm{SFT}}(\sigma(\phi))=\beta(1-\beta)\,L_{\mathrm{SFT}}'(\beta)\ >\ 0\qquad \forall\,\beta\in(0,1),$$
so along logit gradient flow $\dot\phi=-\frac{d}{d\phi}L_{\mathrm{SFT}}(\sigma(\phi))$ we have $\dot\phi<0$ whenever $\beta\in(0,1)$. Thus $\phi(t)$ is strictly decreasing and $\beta(t)$ is strictly decreasing, hence the limit $\beta(t)\to\beta_\infty\in[0,1]$ exists. If $\beta_\infty>0$, then $L_{\mathrm{SFT}}'(\beta_\infty)>0$, so $\frac{d}{d\phi}L_{\mathrm{SFT}}(\sigma(\phi(t)))$ stays bounded below by a positive constant for large $t$, contradicting $\phi(t)$ having a finite limit. Hence $\beta_\infty=0$.
To obtain (2.4), note that $\partial_\phi\log q_\beta(y)=r_o(y)-\beta$ for any two-component mixture (with $\beta=\sigma(\phi)$). Since $L_{\mathrm{SFT}}(\beta)=\mathbb{E}_{p_n}[\log p_n-\log q_\beta]$,
$$\frac{d}{d\phi}L_{\mathrm{SFT}}(\sigma(\phi))=-\mathbb{E}_{p_n}\big[\partial_\phi\log q_\beta(Y)\big]=\beta-\mathbb{E}_{p_n}[r_o(Y)],$$
yielding (2.4). Finally, (2.5) follows from Lemma 2.1 applied to the mixture $q_\beta$ with $w=\beta$, $f=p_o$, $g=p_n$, combined with $\mathrm{BC}(p_o,p_n)=e^{-\delta^2/8}$. ∎

B.3 Proof of Lemma 2.2.1
Proof of Lemma 2.2.1. (A) Expand
$$\tilde q_{\beta,\lambda}=(1-\lambda)\big(\beta\,p_o+(1-\beta)\,p_n\big)+\lambda\,p_o=\big(\lambda+(1-\lambda)\beta\big)p_o+(1-\lambda)(1-\beta)\,p_n.$$
Set $\tilde\beta:=\lambda+(1-\lambda)\beta$ so that $\tilde q_{\beta,\lambda}=q_{\tilde\beta}$. Thus
$$\min_{\beta\in[0,1]}\mathrm{KL}(p_n\,\|\,\tilde q_{\beta,\lambda})=\min_{\tilde\beta\in[\lambda,1]}\mathrm{KL}(p_n\,\|\,q_{\tilde\beta}).$$
It remains to show $\gamma\mapsto F(\gamma):=\mathrm{KL}(p_n\,\|\,q_\gamma)$ is strictly increasing on $(0,1)$. Differentiating under the integral:
$$F(\gamma)=\int p_n(y)\log\frac{p_n(y)}{(1-\gamma)p_n(y)+\gamma p_o(y)}\,dy,$$
$$F'(\gamma)=-\int p_n(y)\,\frac{p_o(y)-p_n(y)}{q_\gamma(y)}\,dy,\qquad F''(\gamma)=\int p_n(y)\,\frac{(p_o(y)-p_n(y))^2}{q_\gamma(y)^2}\,dy.$$
Since $p_o\not\equiv p_n$ (distinct means), $F''(\gamma)>0$ on $(0,1)$, so $F'$ is strictly increasing. Also $F'(0)=-\int(p_o-p_n)=0$, hence $F'(\gamma)>0$ for all $\gamma\in(0,1)$ and $F$ is strictly increasing. Therefore the minimizer over $[\lambda,1]$ is uniquely $\tilde\beta^\star=\lambda$, i.e. $\beta^\star=0$.
(B) Observe that $q_\lambda(y) = \lambda p_o(y) + (1-\lambda) p_n(y) = \tilde p_\lambda(y)$, hence $\mathrm{KL}(\tilde p_\lambda\|q_\lambda) = 0$ and $\beta = \lambda$ is a global minimizer. If $\mathrm{KL}(\tilde p_\lambda\|q_\beta) = 0$, then $q_\beta = \tilde p_\lambda$ a.e., i.e. $(\beta - \lambda)\bigl(p_o(y) - p_n(y)\bigr) = 0$ a.e. Since $p_o - p_n$ is not zero a.e., this forces $\beta = \lambda$, proving uniqueness. ∎

B.4 Proof of Theorem 2.3

Proof of Theorem 2.3. We start with a general derivative identity for $\mathrm{KL}(q_\theta\|p)$ when only $q_\theta$ depends on $\theta$. Let $p$ be a fixed strictly positive density, and let $\{q_\theta : \theta\in\Theta\}$ be a $C^1$ family of densities such that $\int q_\theta(y)\,dy = 1$ for all $\theta$ and differentiation under the integral is justified. Then
$$\nabla_\theta\mathrm{KL}(q_\theta\|p) = \int_{\mathbb{R}^d}\bigl(\nabla_\theta q_\theta(y)\bigr)\log\frac{q_\theta(y)}{p(y)}\,dy. \qquad \text{(B.1)}$$
Indeed, writing $\mathrm{KL}(q_\theta\|p) = \int q_\theta\log(q_\theta/p)$ and differentiating,
$$\nabla_\theta\mathrm{KL}(q_\theta\|p) = \int(\nabla_\theta q_\theta)\log\frac{q_\theta}{p}\,dy + \int q_\theta\,\nabla_\theta\Bigl(\log\frac{q_\theta}{p}\Bigr)\,dy = \int(\nabla_\theta q_\theta)\log\frac{q_\theta}{p}\,dy + \int\nabla_\theta q_\theta\,dy.$$
The last integral vanishes because $\int\nabla_\theta q_\theta\,dy = \nabla_\theta\int q_\theta\,dy = \nabla_\theta 1 = 0$, yielding (B.1).

For the present Gaussian-mixture family, the required interchange of derivative and integral holds by dominated convergence: $\nabla_\theta q_{\beta,m_n}(y)$ is a Gaussian density times a polynomial in $y$, while $\log(q_{\beta,m_n}(y)/p_\alpha(y))$ grows at most quadratically in $\|y\|$ because both numerator and denominator are mixtures of equal-covariance Gaussians; hence the integrand is dominated by an integrable function.

We next compute $\partial_\beta q_{\beta,m_n}$ and $\nabla_{m_n} q_{\beta,m_n}$. By definition,
$$q_{\beta,m_n}(y) = \beta\,\varphi_\Sigma(y;\mu_o) + (1-\beta)\,\varphi_\Sigma(y;m_n),$$
so
$$\partial_\beta q_{\beta,m_n}(y) = \varphi_\Sigma(y;\mu_o) - \varphi_\Sigma(y;m_n).$$
Also, using the standard Gaussian derivative identity $\nabla_m\varphi_\Sigma(y;m) = \varphi_\Sigma(y;m)\,\Sigma^{-1}(y-m)$, we have
$$\nabla_{m_n} q_{\beta,m_n}(y) = (1-\beta)\,\varphi_\Sigma(y;m_n)\,\Sigma^{-1}(y-m_n).$$
Substituting these into (B.1) yields (2.6)–(2.7). Evaluating the aforementioned partials at $(\beta,m_n) = (\alpha,\mu_n)$, we have pointwise equality of densities:
$$q_{\alpha,\mu_n}(y) = \alpha\,\varphi_\Sigma(y;\mu_o) + (1-\alpha)\,\varphi_\Sigma(y;\mu_n) = p_\alpha(y) \qquad \text{for all } y,$$
hence $\log\bigl(q_{\alpha,\mu_n}(y)/p_\alpha(y)\bigr) = \log 1 = 0$ for all $y$. Plugging this into (2.6)–(2.7) yields $\partial_\beta L(\alpha,\mu_n) = 0$ and $\nabla_{m_n} L(\alpha,\mu_n) = 0$. Finally, $\mathrm{KL}(\cdot\|\cdot)\ge 0$ always and equals $0$ iff $q = p$ a.e., so $L(\alpha,\mu_n) = 0$ and $(\alpha,\mu_n)$ is a global minimizer. ∎

B.5 Proof of Theorem 2.3

Proof of Theorem 2.3. Write $q(y) := q_{\beta,m_o,m_n}(y)$ and $p(y) := p_\alpha(y)$. The standard identity (B.1) above yields
$$\nabla_{m_o}\mathrm{KL}(q\|p) = \int\bigl(\nabla_{m_o} q(y)\bigr)\log\frac{q(y)}{p(y)}\,dy,$$
since $\int\nabla_{m_o} q = 0$. Moreover $\nabla_{m_o} q(y) = \beta\,\varphi_\Sigma(y;m_o)\,\Sigma^{-1}(y-m_o)$. Evaluating at $m_o = \mu_o$ gives
$$\nabla_{m_o} L_{\mathrm{RL}}(\beta,\mu_o,m_n) = \beta\,\Sigma^{-1}\,E_{Y\sim p_o}\Bigl[(Y-\mu_o)\log\frac{q(Y)}{p(Y)}\Bigr].$$
Apply Stein's identity (Lemma B.1) with $g(y) = \log\frac{q(y)}{p(y)}$ to obtain
$$\nabla_{m_o} L_{\mathrm{RL}}(\beta,\mu_o,m_n) = \beta\,E_{p_o}\Bigl[\nabla_y\log\frac{q(Y)}{p(Y)}\Bigr] = \beta\,E_{p_o}\bigl[\nabla_y\log q(Y) - \nabla_y\log p(Y)\bigr].$$
For equal-covariance Gaussian mixtures, the score is responsibility-weighted:
$$\nabla_y\log q(y) = -\Sigma^{-1}\bigl(y - (r_o(y) m_o + r_n(y) m_n)\bigr), \qquad \nabla_y\log p(y) = -\Sigma^{-1}\bigl(y - (s_o(y)\mu_o + s_n(y)\mu_n)\bigr).$$
Substituting $m_o = \mu_o$ and subtracting yields
$$\nabla_y\log q(y) - \nabla_y\log p(y) = \Sigma^{-1}\bigl((1-r_o(y))(m_n - \mu_o) - (1-s_o(y))(\mu_n - \mu_o)\bigr).$$
Taking $E_{p_o}$ gives (2.8) with $\varepsilon_q = E_{p_o}[1 - r_o]$ and $\varepsilon_p = E_{p_o}[1 - s_o]$.
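The drift in (2.8) is gated by the misassignment probabilities $\varepsilon_q$ and $\varepsilon_p$, which Lemma 2.1 controls through the Bhattacharyya coefficient $\mathrm{BC}(p_o, p_n) = e^{-\delta^2/8}$. A quick Monte Carlo check of the $\varepsilon_p$ bound in one dimension; all parameter values here are illustrative, not from the paper:

```python
import math, random

random.seed(0)
alpha, mu_o, mu_n, sigma = 0.5, 0.0, 3.0, 1.0
delta = abs(mu_n - mu_o) / sigma        # Mahalanobis separation of the two modes

# eps_p = E_{p_o}[1 - s_o(Y)]: mass of old-mode samples misassigned to the new mode
N = 200_000
acc = 0.0
for _ in range(N):
    y = random.gauss(mu_o, sigma)
    log_ratio = (-(y - mu_o) ** 2 + (y - mu_n) ** 2) / (2 * sigma ** 2)  # log(p_o/p_n)
    acc += (1 - alpha) / ((1 - alpha) + alpha * math.exp(log_ratio))
eps_p = acc / N

# Lemma 2.1 (symmetric form) with w = alpha, f = p_o, g = p_n:
# eps_p <= (1/2) sqrt((1-alpha)/alpha) * BC, with BC = exp(-delta^2/8)
bound = 0.5 * math.sqrt((1 - alpha) / alpha) * math.exp(-delta ** 2 / 8)
assert 0.0 < eps_p < bound
```

The exponential decay of the bound in $\delta^2$ is exactly the overlap-gated drift mechanism: well-separated modes make $\varepsilon_p$ (and likewise $\varepsilon_q$) negligible.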
Finally, (2.9) follows from Lemma 2.1:
• For $\varepsilon_q = E_{p_o}[1 - r_o(Y)]$, apply Lemma 2.1 to the model mixture $q_{\beta,\mu_o,m_n}$ with $w = \beta$, $f = \varphi_\Sigma(\cdot;\mu_o)$, $g = \varphi_\Sigma(\cdot;m_n)$.
• For $\varepsilon_p = E_{p_o}[1 - s_o(Y)]$, apply Lemma 2.1 to the target mixture $p_\alpha$ with $w = \alpha$, $f = p_o$, $g = p_n$. ∎

B.6 Proof of Theorem 2.3.1

We start by showing the Lipschitz continuity of the Hessian.

Lemma B.3 (Local Hessian-Lipschitzness of the reverse-KL objective). Let
$$L(\varphi,m) := \mathrm{KL}(q_{\varphi,m}\|p_\alpha), \qquad q_{\varphi,m}(y) := \beta(\varphi)\,\varphi_\Sigma(y;\mu_o) + (1-\beta(\varphi))\,\varphi_\Sigma(y;m), \qquad \beta(\varphi) = \frac{1}{1+e^{-\varphi}},$$
with $m_o = \mu_o$ fixed. Fix any compact set
$$K := \bigl\{(\varphi,m):\ |\varphi - \varphi^\star|\le r,\ \|m - m^\star\|\le r\bigr\}\subset\mathbb{R}\times\mathbb{R}^d,$$
where $\varphi^\star = \log\frac{\alpha}{1-\alpha}$ and $m^\star = \mu_n$. Then $L$ is $C^3$ on $K$. Consequently, there exists a finite constant $L_H(K) < \infty$ such that
$$\|\nabla^2 L(\theta) - \nabla^2 L(\theta')\|_2 \le L_H(K)\,\|\theta - \theta'\| \qquad \forall\,\theta,\theta'\in K.$$
In particular, the Hessian-Lipschitz assumption used to quantify the local PL region holds on every compact neighborhood bounded away from $\beta\in\{0,1\}$.

Proof of Lemma B.3. Write $\theta = (\varphi,m)$ and $q_\theta = q_{\varphi,m}$. We first show that derivatives of the integrand in the KL objective admit a uniform integrable envelope on $K$. Since $K$ is compact and $\beta(\varphi)$ is continuous, there exist constants
$$0 < \underline\beta \le \beta(\varphi) \le \bar\beta < 1 \qquad \forall(\varphi,m)\in K.$$
Moreover, the set of means $\{m:\ (\varphi,m)\in K\}$ is compact. For each multi-index $\nu$ with $|\nu|\le 3$, the derivatives $\partial_\theta^\nu q_\theta(y)$ are finite linear combinations of terms of the form
$$P_\nu(y,m)\,\varphi_\Sigma(y;\mu_o) \qquad\text{or}\qquad Q_\nu(y,m)\,\varphi_\Sigma(y;m),$$
where $P_\nu, Q_\nu$ are polynomials in $y$ whose coefficients depend continuously on $m$.
Since $m$ ranges over a compact set, there exist constants $C_\nu, c_\nu > 0$ such that
$$|\partial_\theta^\nu q_\theta(y)| \le C_\nu(1 + \|y\|^3)e^{-c_\nu\|y\|^2} \qquad \forall\theta\in K,\ \forall y\in\mathbb{R}^d,\ \forall|\nu|\le 3. \qquad \text{(B.2)}$$
We next control the logarithmic factor. Because $\underline\beta > 0$, we have the pointwise lower bound
$$q_\theta(y) \ge \underline\beta\,\varphi_\Sigma(y;\mu_o) \qquad \forall\theta\in K,\ \forall y\in\mathbb{R}^d.$$
Similarly, $p_\alpha(y)\ge\alpha\,\varphi_\Sigma(y;\mu_o)$. For equal-covariance Gaussian mixtures, the quadratic terms in $\log q_\theta(y)$ and $\log p_\alpha(y)$ cancel, and the remaining difference grows at most linearly in $\|y\|$. Hence there exists a constant $C_{\log} > 0$ such that
$$\sup_{\theta\in K}\Bigl|\log\frac{q_\theta(y)}{p_\alpha(y)}\Bigr| \le C_{\log}(1 + \|y\|) \qquad \forall y\in\mathbb{R}^d. \qquad \text{(B.3)}$$
Now write the KL objective as
$$L(\theta) = \int_{\mathbb{R}^d} q_\theta(y)\,\log\frac{q_\theta(y)}{p_\alpha(y)}\,dy.$$
Differentiating with respect to $\theta$ up to third order produces finite sums of products of derivatives of $q_\theta$, powers of $q_\theta^{-1}$, and the factor $\log(q_\theta/p_\alpha)$. Using the lower bound $q_\theta(y)\ge\underline\beta\,\varphi_\Sigma(y;\mu_o)$, the derivative envelope (B.2), and the logarithmic bound (B.3), each derivative of the integrand up to order $3$ is dominated by a function of the form $C(1+\|y\|^M)e^{-c\|y\|^2}$ for some constants $C, M, c > 0$ independent of $\theta\in K$. This envelope is integrable on $\mathbb{R}^d$. Therefore differentiation under the integral sign is justified up to third order by dominated convergence, so $L\in C^3(K)$.

Finally, since $L\in C^3(K)$, the third derivative $\nabla^3 L$ is continuous on the compact set $K$, and hence bounded: $M_3 := \sup_{\theta\in K}\|\nabla^3 L(\theta)\|_{\mathrm{op}} < \infty$. The mean value theorem in Banach spaces then implies
$$\|\nabla^2 L(\theta) - \nabla^2 L(\theta')\|_2 \le M_3\,\|\theta - \theta'\| \qquad \forall\,\theta,\theta'\in K.$$
Thus the Hessian is Lipschitz on $K$ with $L_H(K) := M_3$. ∎

Proof of Theorem 2.3.1. We first prove that $H_\star := \nabla^2 L(\theta^\star)\succ 0$ and $\mu_\star := \lambda_{\min}(H_\star) > 0$.
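Before the analytic argument, the Fisher representation $H_\star = E_{p_\alpha}[s(Y)s(Y)^\top]$ derived below can be estimated by Monte Carlo to confirm positive-definiteness in a small instance. A sketch for $d = 1$, where $\theta = (\varphi, m_n)$ and $H_\star$ is a $2\times 2$ matrix; all numerical values are illustrative assumptions:

```python
import math, random

random.seed(1)
alpha, mu_o, mu_n, sigma = 0.4, 0.0, 3.0, 1.0

def phi_dens(y, m):
    # N(m, sigma^2) density
    return math.exp(-0.5 * ((y - m) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Monte Carlo estimate of the Fisher matrix H* = E_{p_alpha}[s s^T]
a = b = c = 0.0
N = 100_000
for _ in range(N):
    from_old = random.random() < alpha
    y = random.gauss(mu_o if from_old else mu_n, sigma)
    ro = alpha * phi_dens(y, mu_o) / (alpha * phi_dens(y, mu_o) + (1 - alpha) * phi_dens(y, mu_n))
    s1 = ro - alpha                          # score w.r.t. the logit parameter
    s2 = (1 - ro) * (y - mu_n) / sigma ** 2  # score w.r.t. the new mean m_n
    a += s1 * s1; b += s1 * s2; c += s2 * s2
a, b, c = a / N, b / N, c / N

# smallest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]]
lam_min = (a + c - math.sqrt((a - c) ** 2 + 4 * b * b)) / 2
assert lam_min > 0.0
```

A strictly positive `lam_min` is the numerical counterpart of $\mu_\star > 0$, the constant that drives the local PL inequality and the exponential convergence rates below.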
Write $\theta = (\varphi,m)\in\mathbb{R}\times\mathbb{R}^d$, $q_\theta := q_{\varphi,m}$, and $\theta^\star = (\varphi^\star, m^\star)$, so that $q_{\theta^\star} = p_\alpha$. Since
$$L(\theta) = \mathrm{KL}(q_\theta\|p_\alpha) = \mathrm{KL}(q_\theta\|q_{\theta^\star}),$$
we first derive the Fisher representation of the Hessian at $\theta^\star$. Let
$$\ell_\theta(y) := \log q_\theta(y), \qquad s_\theta(y) := \nabla_\theta\ell_\theta(y) = \nabla_\theta\log q_\theta(y).$$
Then
$$L(\theta) = \int q_\theta(y)\bigl(\ell_\theta(y) - \ell_{\theta^\star}(y)\bigr)\,dy.$$
Differentiating with respect to $\theta$, and using $\nabla_\theta q_\theta = q_\theta s_\theta$, gives
$$\nabla_\theta L(\theta) = \int q_\theta(y)s_\theta(y)\bigl(\ell_\theta(y) - \ell_{\theta^\star}(y)\bigr)\,dy + \int q_\theta(y)\nabla_\theta\ell_\theta(y)\,dy.$$
Since $q_\theta\nabla_\theta\ell_\theta = \nabla_\theta q_\theta$ and $\int q_\theta(y)\,dy = 1$, the second term vanishes:
$$\int q_\theta(y)\nabla_\theta\ell_\theta(y)\,dy = \int\nabla_\theta q_\theta(y)\,dy = \nabla_\theta\int q_\theta(y)\,dy = \nabla_\theta 1 = 0.$$
Hence
$$\nabla_\theta L(\theta) = \int q_\theta(y)s_\theta(y)\bigl(\ell_\theta(y) - \ell_{\theta^\star}(y)\bigr)\,dy.$$
In particular, at $\theta = \theta^\star$, the logarithmic factor vanishes pointwise, so $\nabla_\theta L(\theta^\star) = 0$.

We next differentiate once more. Set $h_\theta(y) := \ell_\theta(y) - \ell_{\theta^\star}(y)$, so that
$$\nabla_\theta L(\theta) = \int q_\theta(y)s_\theta(y)h_\theta(y)\,dy.$$
Differentiating and evaluating at $\theta = \theta^\star$, every term containing the factor $h_\theta(y)$ vanishes because $h_{\theta^\star}(y) = 0$. The only surviving term comes from differentiating $h_\theta$, and since $\nabla_\theta h_\theta(y) = \nabla_\theta\ell_\theta(y) = s_\theta(y)$, we obtain
$$\nabla^2_\theta L(\theta^\star) = \int q_{\theta^\star}(y)\,s_{\theta^\star}(y)s_{\theta^\star}(y)^\top\,dy = E_{Y\sim q_{\theta^\star}}\bigl[s_{\theta^\star}(Y)s_{\theta^\star}(Y)^\top\bigr].$$
Because $q_{\theta^\star} = p_\alpha$, this yields the Fisher representation
$$H_\star = \nabla^2 L(\theta^\star) = E_{Y\sim p_\alpha}\bigl[s(Y)s(Y)^\top\bigr],$$
where $s(Y) = s_{\theta^\star}(Y)$. For the present two-component model, the score vector is
$$s(Y) = \begin{pmatrix} r_o^\star(Y) - \alpha \\ r_n^\star(Y)\,\Sigma^{-1}(Y-\mu_n)\end{pmatrix}, \qquad r_o^\star(y) := \frac{\alpha\,\varphi_\Sigma(y;\mu_o)}{p_\alpha(y)}, \qquad r_n^\star(y) := 1 - r_o^\star(y).$$
We now prove that $H_\star$ is positive definite. Let $v = (u,a)\in\mathbb{R}\times\mathbb{R}^d$.
Using the Fisher representation,
$$v^\top H_\star v = E_{Y\sim p_\alpha}\Bigl[\bigl(u(r_o^\star(Y) - \alpha) + a^\top r_n^\star(Y)\Sigma^{-1}(Y-\mu_n)\bigr)^2\Bigr].$$
Define
$$g_v(y) := u\bigl(r_o^\star(y) - \alpha\bigr) + a^\top r_n^\star(y)\Sigma^{-1}(y-\mu_n).$$
Then $v^\top H_\star v = E_{Y\sim p_\alpha}[g_v(Y)^2]\ge 0$.

Suppose now that $v^\top H_\star v = 0$. Then $g_v(Y) = 0$ for $p_\alpha$-almost every $Y$. Since $p_\alpha$ is a strictly positive continuous density on $\mathbb{R}^d$, every nonempty open set has positive $p_\alpha$-measure. Because $g_v$ is continuous, if there existed $y_0\in\mathbb{R}^d$ with $g_v(y_0)\ne 0$, then by continuity there would exist an open neighborhood $U\ni y_0$ on which $g_v$ is bounded away from $0$, implying
$$E_{Y\sim p_\alpha}[g_v(Y)^2] \ge \int_U g_v(y)^2 p_\alpha(y)\,dy > 0,$$
a contradiction. Therefore $g_v(y) = 0$ for all $y\in\mathbb{R}^d$.

We first show that $u = 0$. Evaluating at $y = \mu_n$, the second term vanishes, so
$$0 = g_v(\mu_n) = u\bigl(r_o^\star(\mu_n) - \alpha\bigr).$$
Now
$$r_o^\star(\mu_n) = \frac{\alpha\,\varphi_\Sigma(\mu_n;\mu_o)}{\alpha\,\varphi_\Sigma(\mu_n;\mu_o) + (1-\alpha)\varphi_\Sigma(\mu_n;\mu_n)}.$$
Since $\mu_o\ne\mu_n$, $\varphi_\Sigma(\mu_n;\mu_o) < \varphi_\Sigma(\mu_n;\mu_n)$, hence $r_o^\star(\mu_n) < \alpha$, so $r_o^\star(\mu_n) - \alpha\ne 0$. Therefore $u = 0$.

With $u = 0$, the identity $g_v(y) = 0$ becomes
$$a^\top r_n^\star(y)\Sigma^{-1}(y-\mu_n) = 0 \qquad \forall y\in\mathbb{R}^d.$$
Because $r_n^\star(y) > 0$ for all $y\in\mathbb{R}^d$ (both Gaussian components are strictly positive and $1-\alpha > 0$), we may divide by $r_n^\star(y)$ and obtain
$$a^\top\Sigma^{-1}(y-\mu_n) = 0 \qquad \forall y\in\mathbb{R}^d.$$
Taking $y = \mu_n + \Sigma a$, we get $0 = a^\top\Sigma^{-1}(\Sigma a) = a^\top a = \|a\|^2$, so $a = 0$. Thus $v = (u,a) = 0$ is the only vector satisfying $v^\top H_\star v = 0$. Therefore $H_\star$ is positive definite, and consequently $\mu_\star := \lambda_{\min}(H_\star) > 0$.

We next prove the explicit lower bound on the Hessian inside the ball $B_\rho(\theta^\star)$. Fix $\theta\in B_\rho(\theta^\star)$. By Weyl's inequality,
$$\lambda_{\min}\bigl(\nabla^2 L(\theta)\bigr) \ge \lambda_{\min}\bigl(\nabla^2 L(\theta^\star)\bigr) - \|\nabla^2 L(\theta) - \nabla^2 L(\theta^\star)\|_2.$$
Since $\lambda_{\min}(\nabla^2 L(\theta^\star)) = \mu_\star$ and the Hessian is $L_H$-Lipschitz on $K$, we obtain
$$\lambda_{\min}\bigl(\nabla^2 L(\theta)\bigr) \ge \mu_\star - L_H\|\theta - \theta^\star\|.$$
Because $\theta\in B_\rho(\theta^\star)$ and $\rho\le\mu_\star/(2L_H)$,
$$L_H\|\theta - \theta^\star\| \le L_H\rho \le \frac{\mu_\star}{2},$$
hence
$$\lambda_{\min}\bigl(\nabla^2 L(\theta)\bigr) \ge \mu_\star - \frac{\mu_\star}{2} = \frac{\mu_\star}{2}.$$
This proves (2.10).

We next derive the two local inequalities in (2.11). Fix $\theta\in B_\rho(\theta^\star)$ and write $d := \theta - \theta^\star$. Since $\nabla L(\theta^\star) = 0$ and $L(\theta^\star) = 0$, Taylor's theorem with integral remainder gives
$$L(\theta) = \int_0^1(1-s)\,d^\top\nabla^2 L(\theta^\star + sd)\,d\,ds.$$
Because the whole line segment $\theta^\star + sd$ lies in $B_\rho(\theta^\star)$, the Hessian lower bound (2.10) applies throughout the segment, so
$$L(\theta) = \int_0^1(1-s)\,d^\top\nabla^2 L(\theta^\star + sd)\,d\,ds \ge \int_0^1(1-s)\,\frac{\mu_\star}{2}\|d\|^2\,ds = \frac{\mu_\star}{4}\|d\|^2.$$
This proves the quadratic-growth inequality.

To prove the PL inequality, we use strong convexity in the form
$$L(\theta^\star) \ge L(\theta) + \langle\nabla L(\theta),\theta^\star - \theta\rangle + \frac{\mu_\star}{4}\|\theta^\star - \theta\|^2,$$
which holds because $L$ is $\mu_\star/2$-strongly convex on $B_\rho(\theta^\star)$. Since $L(\theta^\star) = 0$, this becomes
$$L(\theta) \le \langle\nabla L(\theta),\theta - \theta^\star\rangle - \frac{\mu_\star}{4}\|\theta - \theta^\star\|^2.$$
By the Cauchy-Schwarz inequality,
$$L(\theta) \le \|\nabla L(\theta)\|\,\|\theta - \theta^\star\| - \frac{\mu_\star}{4}\|\theta - \theta^\star\|^2.$$
The right-hand side is a quadratic function of $t := \|\theta - \theta^\star\|$, namely $\|\nabla L(\theta)\|\,t - \frac{\mu_\star}{4}t^2$, whose maximum over $t\ge 0$ is attained at $t = 2\|\nabla L(\theta)\|/\mu_\star$ and equals $\frac{1}{\mu_\star}\|\nabla L(\theta)\|^2$. Therefore
$$L(\theta) \le \frac{1}{\mu_\star}\|\nabla L(\theta)\|^2,$$
which is equivalent to $\|\nabla L(\theta)\|^2 \ge \mu_\star L(\theta)$. This proves the local PL bound.

We now turn to the gradient-flow estimates. Let $\theta(t)$ solve $\dot\theta(t) = -\nabla L(\theta(t))$. By the chain rule,
$$\frac{d}{dt}L(\theta(t)) = \langle\nabla L(\theta(t)),\dot\theta(t)\rangle = -\|\nabla L(\theta(t))\|^2 \le 0.$$
Thus $L(\theta(t))$ is nonincreasing along the flow. We next prove that the trajectory stays inside $B_\rho(\theta^\star)$.
Assume $\theta(0)\in B_\rho(\theta^\star)$ and $L(\theta(0))\le\varepsilon_{\mathrm{loc}} = \mu_\star\rho^2/8$. On the boundary of the ball, the quadratic-growth bound gives
$$L(\theta) \ge \frac{\mu_\star}{4}\rho^2 = 2\varepsilon_{\mathrm{loc}} \qquad \text{whenever } \|\theta - \theta^\star\| = \rho.$$
Since $L(\theta(t))\le L(\theta(0))\le\varepsilon_{\mathrm{loc}}$ for all $t\ge 0$, the trajectory can never reach a point with $\|\theta(t) - \theta^\star\| = \rho$. Hence $\theta(t)\in B_\rho(\theta^\star)$ for all $t\ge 0$.

Because the entire trajectory remains in $B_\rho(\theta^\star)$, the local PL inequality applies for all $t\ge 0$. Combining it with the energy identity yields
$$\frac{d}{dt}L(\theta(t)) = -\|\nabla L(\theta(t))\|^2 \le -\mu_\star L(\theta(t)).$$
Grönwall's inequality then gives $L(\theta(t))\le L(\theta(0))\,e^{-\mu_\star t}$, which is (2.12). Finally, the quadratic-growth inequality implies $L(\theta(t))\ge\frac{\mu_\star}{4}\|\theta(t) - \theta^\star\|^2$. Combining this with (2.12) yields
$$\frac{\mu_\star}{4}\|\theta(t) - \theta^\star\|^2 \le L(\theta(0))\,e^{-\mu_\star t},$$
and therefore
$$\|\theta(t) - \theta^\star\| \le \frac{2}{\sqrt{\mu_\star}}\sqrt{L(\theta(0))}\,e^{-\mu_\star t/2},$$
which is (2.13). ∎

B.7 Proof of Lemma 2.4

Proof of Lemma 2.4. We establish both parts of the statement in order.

Part (A). First expand the replay mixture using $q_{\beta,m_n} = \beta p_o + (1-\beta)\varphi_\Sigma(\cdot;m_n)$:
$$b_{\lambda,\beta,m_n} = (1-\lambda)\bigl(\beta p_o + (1-\beta)\varphi_\Sigma(\cdot;m_n)\bigr) + \lambda p_o = \bigl(\lambda + (1-\lambda)\beta\bigr)p_o + (1-\lambda)(1-\beta)\varphi_\Sigma(\cdot;m_n).$$
Set $\tilde\beta := \lambda + (1-\lambda)\beta\in(0,1)$, so that $1-\tilde\beta = (1-\lambda)(1-\beta)$, which proves $b_{\lambda,\beta,m_n} = q_{\tilde\beta,m_n}$ and $\tilde\beta\ge\lambda$.

Next, observe that $b_{\lambda,\beta,m_n} = (1-\lambda)q_{\beta,m_n} + \lambda p_o \ge (1-\lambda)q_{\beta,m_n}$ pointwise. Therefore for all $y$,
$$0 \le w_\lambda(y) = \frac{q_{\beta,m_n}(y)}{b_{\lambda,\beta,m_n}(y)} \le \frac{q_{\beta,m_n}(y)}{(1-\lambda)q_{\beta,m_n}(y)} = \frac{1}{1-\lambda}.$$
For unbiasedness, let $h$ be integrable under $q_{\beta,m_n}$ (equivalently, $w_\lambda h$ integrable under $b_{\lambda,\beta,m_n}$).
Then
$$E_{b_{\lambda,\beta,m_n}}\bigl[w_\lambda(Y)h(Y)\bigr] = \int b_{\lambda,\beta,m_n}(y)\,\frac{q_{\beta,m_n}(y)}{b_{\lambda,\beta,m_n}(y)}\,h(y)\,dy = \int q_{\beta,m_n}(y)h(y)\,dy = E_{q_{\beta,m_n}}[h(Y)],$$
which is (2.15). Finally, if $E_b[\|h(Y)\|^2] < \infty$, then using the uniform bound $w_\lambda^2\le(1-\lambda)^{-2}$,
$$E_{b_{\lambda,\beta,m_n}}\bigl[\|w_\lambda(Y)h(Y)\|^2\bigr] \le \frac{1}{(1-\lambda)^2}\,E_{b_{\lambda,\beta,m_n}}\bigl[\|h(Y)\|^2\bigr],$$
which is (2.16).

Part (B). Under $b_{\lambda,\beta,m_n} = q_{\tilde\beta,m_n}$, the standard mixture generative model yields latent indicators $Z_i\overset{\text{i.i.d.}}{\sim}\mathrm{Bernoulli}(\tilde\beta)$ with $\tilde\beta\ge\lambda$. Therefore
$$\Pr\Bigl(\sum_{i=1}^N Z_i = 0\Bigr) = (1-\tilde\beta)^N,$$
and substituting $1-\tilde\beta = (1-\lambda)(1-\beta)$ gives (2.17) and the upper bound $(1-\tilde\beta)^N\le(1-\lambda)^N$.

For (2.18), let $S := \sum_{i=1}^N Z_i\sim\mathrm{Binomial}(N,\tilde\beta)$, so $E[S] = N\tilde\beta\ge N\lambda$. The multiplicative Chernoff bound implies
$$\Pr\bigl(S\le(1-\delta)E[S]\bigr) \le \exp\Bigl(-\frac{\delta^2}{2}E[S]\Bigr).$$
Taking $\delta = 1/2$ yields $\Pr(S\le\frac{1}{2}N\tilde\beta)\le\exp(-N\tilde\beta/8)\le\exp(-N\lambda/8)$. Since $\frac{\lambda}{2}N\le\frac{\tilde\beta}{2}N$, we have
$$\Pr\Bigl(S\le\frac{\lambda}{2}N\Bigr) \le \Pr\Bigl(S\le\frac{\tilde\beta}{2}N\Bigr) \le \exp\Bigl(-\frac{\lambda}{8}N\Bigr),$$
which is (2.18). ∎

B.8 Proof of Theorem 3.1

Proof of Theorem 3.1. Fix $y = (\alpha,\nu)\in K$ and write $F_y(x) = \mathrm{KL}(q_x\|p_y)$ for $x = (\beta,m)$. Because $q_y\equiv p_y$, we have $F_y(y) = 0$, and since KL is nonnegative, $y$ is a global minimizer of $F_y$. As established earlier for equal-covariance Gaussian mixtures, $F_y$ is $C^2$ in a neighborhood of $y$, and the Hessian at $y$ equals the Fisher information of the parameterization $x\mapsto q_x$ under $Y\sim p_y$, hence $\nabla^2 F_y(y)\succ 0$. By continuity of $y\mapsto\nabla^2 F_y(y)$ and compactness of $K$, there exists $\mu > 0$ with $\lambda_{\min}(\nabla^2 F_y(y))\ge\mu$ uniformly over $y\in K$.
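As a brief numerical aside on Lemma 2.4(B) above: the old-mode starvation probability under replay can be simulated and compared against the $\exp(-\lambda N/8)$ bound. A sketch with illustrative values of $\lambda$, $\beta$, and $N$ (not from the paper):

```python
import math, random

random.seed(2)
lam, beta, N = 0.2, 0.1, 200
beta_tilde = lam + (1 - lam) * beta      # replay-inflated old weight (Lemma 2.4(A))

# estimate Pr(S <= lam*N/2) for S ~ Binomial(N, beta_tilde)
trials = 20_000
hits = 0
for _ in range(trials):
    s = sum(1 for _ in range(N) if random.random() < beta_tilde)
    hits += (s <= lam * N / 2)
p_starve = hits / trials

bound = math.exp(-lam * N / 8)           # the (2.18)-style starvation bound
assert p_starve <= bound
```

With these values $E[S] = N\tilde\beta = 56$ while the starvation threshold is $\lambda N/2 = 20$, so the empirical frequency is far below the already-small Chernoff bound, matching the claim that replay prevents finite-batch old-mode starvation.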
Similarly, by smooth dependence of $F_y$ on $(x,y)$ and compactness, there exist $r_0 > 0$ and $L_H < \infty$ such that the Hessian is $L_H$-Lipschitz on $B_{r_0}(y)$ uniformly in $y\in K$. Define $\rho = \min\{r_0,\ \mu/(2L_H)\}$. Then for any $y\in K$ and any $x\in B_\rho(y)$, Weyl's inequality gives
$$\lambda_{\min}\bigl(\nabla^2 F_y(x)\bigr) \ge \lambda_{\min}\bigl(\nabla^2 F_y(y)\bigr) - \|\nabla^2 F_y(x) - \nabla^2 F_y(y)\|_2 \ge \mu - L_H\|x - y\| \ge \frac{\mu}{2}.$$
Since the set $\{(x,y): y\in K,\ \|x-y\|\le\rho\}$ is compact and $(x,y)\mapsto\nabla^2 F_y(x)$ is continuous, there exists $M < \infty$ such that $\lambda_{\max}(\nabla^2 F_y(x))\le M$ uniformly on that set.

We now prove Part (A). Fix $t\ge 0$ and write $y_t = \tilde\nu_t$ and $x_t = \tilde m_t$. Assume inductively that $x_t\in B_\rho(y_t)$. Since $y_t$ is the minimizer of $F_{y_t}$, we have $\nabla F_{y_t}(y_t) = 0$. The mean-value formula for gradients yields
$$\nabla F_{y_t}(x_t) - \nabla F_{y_t}(y_t) = \Bigl(\int_0^1\nabla^2 F_{y_t}\bigl(y_t + s(x_t - y_t)\bigr)\,ds\Bigr)(x_t - y_t).$$
Define $A_t := \int_0^1\nabla^2 F_{y_t}(y_t + s(x_t - y_t))\,ds$. Because the segment $\{y_t + s(x_t - y_t): s\in[0,1]\}\subset B_\rho(y_t)$, we have
$$\lambda_{\min}(A_t)\ge\frac{\mu}{2}, \qquad \lambda_{\max}(A_t)\le M.$$
Using the student update $x_{t+1} = x_t - \gamma\nabla F_{y_t}(x_t)$ with $0 < \gamma\le 1/M$ and $\nabla F_{y_t}(y_t) = 0$,
$$x_{t+1} - y_t = x_t - y_t - \gamma\bigl(\nabla F_{y_t}(x_t) - \nabla F_{y_t}(y_t)\bigr) = (I - \gamma A_t)(x_t - y_t).$$
Since $A_t$ is symmetric with spectrum in $[\mu/2, M]$, we have
$$\|I - \gamma A_t\|_2 = \max_{\lambda\in[\mu/2,M]}|1 - \gamma\lambda| \le 1 - \frac{\gamma\mu}{2} =: q,$$
which proves (3.3).

We next derive (3.4) from the teacher update. Let $\tilde\nu(c)$ be fixed. From (3.2),
$$\tilde\nu_{t+1} - \tilde\nu(c) = (1-\zeta)\bigl(\tilde\nu_t - \tilde\nu(c)\bigr) + \zeta(1-\lambda)\bigl(\tilde m_{t+1} - \tilde\nu(c)\bigr).$$
Rewrite $\tilde m_{t+1} - \tilde\nu(c) = (\tilde m_{t+1} - \tilde\nu_t) + (\tilde\nu_t - \tilde\nu(c))$ to get
$$\tilde\nu_{t+1} - \tilde\nu(c) = (1 - \zeta\lambda)\bigl(\tilde\nu_t - \tilde\nu(c)\bigr) + \zeta(1-\lambda)\bigl(\tilde m_{t+1} - \tilde\nu_t\bigr).$$
Taking norms yields (3.4). If $\lambda > 0$, then $1 - \zeta\lambda\in(0,1)$ and (3.3) implies $\|\tilde m_{t+1} - \tilde\nu_t\|\to 0$; thus the recursion (3.4) implies $\|\tilde\nu_t - \tilde\nu(c)\|\to 0$, and hence also $\|\tilde m_t - \tilde\nu(c)\|\to 0$.

We now prove Part (B). Define $G(\beta,m;\alpha,\nu) := \nabla_{m_o}\tilde L(\beta,\mu_o,m;\alpha,\nu)$. At the teacher-matched point $(\beta,m) = (\alpha,\nu)$, we have $q_{\alpha,\mu_o,\nu}\equiv p_{\alpha,\nu}$, hence $\tilde L(\alpha,\mu_o,\nu;\alpha,\nu) = 0$, and differentiating shows $G(\alpha,\nu;\alpha,\nu) = 0$. The map $(\beta,m,\alpha,\nu)\mapsto G(\beta,m;\alpha,\nu)$ is continuous and $C^1$ on the compact set
$$\mathcal{C} := \bigl\{(\beta,m,\alpha,\nu):\ (\alpha,\nu)\in K,\ \|(\beta,m) - (\alpha,\nu)\|\le\rho\bigr\},$$
so its Jacobian with respect to $(\beta,m,\alpha,\nu)$ is bounded on $\mathcal{C}$. Thus there exists $L_{\mathrm{old}} < \infty$ such that for all $(\beta,m,\alpha,\nu)\in\mathcal{C}$,
$$\|G(\beta,m;\alpha,\nu) - G(\alpha,\nu;\alpha,\nu)\| \le L_{\mathrm{old}}\bigl(\|(\beta,m) - (\alpha,\nu)\| + \|(\alpha,\nu) - \tilde\nu(c)\|\bigr).$$
Using $G(\alpha,\nu;\alpha,\nu) = 0$ and substituting $(\beta,m,\alpha,\nu) = (\beta_t, m_t, \alpha_t, \nu_t)$ yields (3.5). If $\lambda > 0$, then $\|\tilde m_t - \tilde\nu_t\|\to 0$ and $\|\tilde\nu_t - \tilde\nu(c)\|\to 0$, so the right-hand side of (3.5) is summable, proving (3.6).

Finally, since $\tilde m_t = (\beta_t, m_t)\to\tilde\nu(c) = (\alpha_c,\nu_c)$, let $\tilde\nu^\star\in\mathbb{R}^{d+1}$ be any target state and note that for all $t$,
$$\|\tilde m_t - \tilde\nu^\star\| \le \|\tilde m_t - \tilde\nu(c)\| + \|\tilde\nu(c) - \tilde\nu^\star\|,$$
and also, by the reverse triangle inequality,
$$\|\tilde m_t - \tilde\nu^\star\| \ge \|\tilde\nu(c) - \tilde\nu^\star\| - \|\tilde m_t - \tilde\nu(c)\|.$$
Taking $\limsup$ in the first inequality and $\liminf$ in the second, and using $\|\tilde m_t - \tilde\nu(c)\|\to 0$, yields the limit identity (3.7). ∎

B.9 Proof of Lemma 3.1

Proof of Lemma 3.1. We prove the three statements in order.

Case 1: Compute $J_\eta(q_\beta)$ and $\mathrm{KL}(q_\beta\|q_{\beta_0})$. Under the disjoint-support assumption, $q_\beta = \beta p_o$ on $A_o$ and $q_\beta = (1-\beta)p_n$ on $A_n$.
Since the reward is constant on each region,
$$E_{Y\sim q_\beta}[e^{\eta r(Y)}] = \int_{A_o} q_\beta(y)e^{\eta u_o}\,dy + \int_{A_n} q_\beta(y)e^{\eta u_n}\,dy = \beta e^{\eta u_o}\int_{A_o} p_o(y)\,dy + (1-\beta)e^{\eta u_n}\int_{A_n} p_n(y)\,dy = \beta e^{\eta u_o} + (1-\beta)e^{\eta u_n},$$
which proves the first identity in (3.11) after taking logarithms. For the KL term, on $A_o$ we have
$$\log\frac{q_\beta(y)}{q_{\beta_0}(y)} = \log\frac{\beta p_o(y)}{\beta_0 p_o(y)} = \log\frac{\beta}{\beta_0},$$
and on $A_n$ we have
$$\log\frac{q_\beta(y)}{q_{\beta_0}(y)} = \log\frac{(1-\beta)p_n(y)}{(1-\beta_0)p_n(y)} = \log\frac{1-\beta}{1-\beta_0}.$$
Therefore
$$\mathrm{KL}(q_\beta\|q_{\beta_0}) = \int_{A_o}\beta p_o(y)\log\frac{\beta}{\beta_0}\,dy + \int_{A_n}(1-\beta)p_n(y)\log\frac{1-\beta}{1-\beta_0}\,dy,$$
which simplifies to the second identity in (3.11).

Case 2: Unanchored case $\lambda_{\mathrm{ref}} = 0$. Set $a := e^{\eta u_o}$, $b := e^{\eta u_n}$. Then $J_\eta(q_\beta) = \log\bigl(b + \beta(a-b)\bigr)$. This is the logarithm of an affine function of $\beta$, so its monotonicity is determined by the sign of $a-b$: if $u_n > u_o$, then $b > a$, so $a-b < 0$ and the affine term is strictly decreasing in $\beta$, hence the unique maximizer is $\beta^\star = 0$; if $u_o > u_n$, the same argument gives $\beta^\star = 1$; if $u_o = u_n$, then $a = b$ and $J_\eta$ is constant.

Case 3: Anchored case $\lambda_{\mathrm{ref}} > 0$. Let
$$F(\beta) := J_\eta(q_\beta) = \log\bigl(b + \beta(a-b)\bigr), \qquad G(\beta) := \mathrm{KL}(q_\beta\|q_{\beta_0}), \qquad H(\beta) := F(\beta) - \lambda_{\mathrm{ref}} G(\beta).$$
For $\beta\in(0,1)$,
$$F'(\beta) = \frac{a-b}{b+\beta(a-b)}, \qquad F''(\beta) = -\frac{(a-b)^2}{(b+\beta(a-b))^2},$$
and
$$G'(\beta) = \log\frac{\beta}{\beta_0} - \log\frac{1-\beta}{1-\beta_0}, \qquad G''(\beta) = \frac{1}{\beta} + \frac{1}{1-\beta}.$$
If $u_o\ne u_n$, then $a\ne b$, so $F''(\beta) < 0$ on $(0,1)$, while $G''(\beta) > 0$ on $(0,1)$. Thus
$$H''(\beta) = F''(\beta) - \lambda_{\mathrm{ref}} G''(\beta) < 0 \qquad \forall\beta\in(0,1),$$
so $H$ is strictly concave. To show the maximizer is interior, note that $F'(\beta)$ remains finite on $[0,1]$, whereas
$$\lim_{\beta\downarrow 0} G'(\beta) = -\infty, \qquad \lim_{\beta\uparrow 1} G'(\beta) = +\infty.$$
Hence
$$\lim_{\beta\downarrow 0} H'(\beta) = +\infty, \qquad \lim_{\beta\uparrow 1} H'(\beta) = -\infty.$$
By continuity of $H'$, there exists $\beta^\star\in(0,1)$ with $H'(\beta^\star) = 0$, and by strict concavity this $\beta^\star$ is unique. Finally, if $u_o = u_n$, then $F$ is constant in $\beta$, so maximizing $H$ is equivalent to minimizing $G$. Since $G$ is strictly convex and $G'(\beta_0) = 0$, its unique minimizer is $\beta_0$. ∎

B.10 Proof of Theorem 3.2

Proof of Theorem 3.2. We prove parts (A) and (B) separately.

Proof of Part (A): We start by computing the region probabilities under $p_o$ and $p_n$. Let $\Delta := \mu_n - \mu_o$ and define
$$T(Y) := \Delta^\top\Sigma^{-1}\Bigl(Y - \frac{\mu_o+\mu_n}{2}\Bigr).$$
By definition, $A_n = \{T(Y)\ge 0\}$ and $A_o = \{T(Y) < 0\}$. If $Y\sim p_o = \mathcal{N}(\mu_o,\Sigma)$, then
$$E[T(Y)] = \Delta^\top\Sigma^{-1}\Bigl(\mu_o - \frac{\mu_o+\mu_n}{2}\Bigr) = -\frac{1}{2}\Delta^\top\Sigma^{-1}\Delta = -\frac{\delta^2}{2},$$
and $\mathrm{Var}(T(Y)) = \Delta^\top\Sigma^{-1}\Sigma\Sigma^{-1}\Delta = \delta^2$. Therefore
$$\Pr_{Y\sim p_o}(A_n) = \Pr\bigl(\mathcal{N}(-\delta^2/2,\,\delta^2)\ge 0\bigr) = \Phi\Bigl(-\frac{\delta}{2}\Bigr) = \gamma.$$
Hence $\Pr_{p_o}(A_o) = 1-\gamma$. Similarly, if $Y\sim p_n = \mathcal{N}(\mu_n,\Sigma)$, then $E[T(Y)] = +\delta^2/2$ and $\mathrm{Var}(T(Y)) = \delta^2$, so
$$\Pr_{p_n}(A_n) = \Phi\Bigl(\frac{\delta}{2}\Bigr) = 1-\gamma, \qquad \Pr_{p_n}(A_o) = \gamma.$$

We next compute $J_\eta(q_\beta)$ and its derivatives. Since $q_\beta = \beta p_o + (1-\beta)p_n$, the probability of $A_o$ under $q_\beta$ is
$$\Pr_{q_\beta}(A_o) = \beta(1-\gamma) + (1-\beta)\gamma = \gamma + \beta(1-2\gamma) = \gamma + \kappa\beta.$$
Thus $\Pr_{q_\beta}(A_n) = 1 - \gamma - \kappa\beta$. Because $r(y) = u_o$ on $A_o$ and $r(y) = u_n$ on $A_n$,
$$E_{Y\sim q_\beta}[e^{\eta r(Y)}] = e^{\eta u_o}(\gamma + \kappa\beta) + e^{\eta u_n}(1 - \gamma - \kappa\beta),$$
which proves the formula for $J_\eta(q_\beta)$. Differentiating gives
$$J_\eta'(q_\beta) = \frac{\kappa(e^{\eta u_o} - e^{\eta u_n})}{e^{\eta u_o}(\gamma+\kappa\beta) + e^{\eta u_n}(1-\gamma-\kappa\beta)},$$
and
$$J_\eta''(q_\beta) = -\frac{\kappa^2(e^{\eta u_o} - e^{\eta u_n})^2}{\bigl(e^{\eta u_o}(\gamma+\kappa\beta) + e^{\eta u_n}(1-\gamma-\kappa\beta)\bigr)^2} < 0$$
whenever $u_o\ne u_n$. Next, we analyze the KL anchor. Let $h(y) := p_o(y) - p_n(y)$.
Then $q_\beta(y) = q_{\beta_0}(y) + (\beta - \beta_0)h(y)$ and $q_\beta'(y) = h(y)$. Define
$$D(\beta) = \mathrm{KL}(q_\beta\|q_{\beta_0}) = \int q_\beta(y)\log\frac{q_\beta(y)}{q_{\beta_0}(y)}\,dy.$$
Using $\int h(y)\,dy = 0$ and differentiating under the integral sign,
$$D'(\beta) = \int h(y)\log\frac{q_\beta(y)}{q_{\beta_0}(y)}\,dy, \qquad D''(\beta) = \int\frac{h(y)^2}{q_\beta(y)}\,dy.$$
Because $q_\beta(y) > 0$ for all $y$ and $h\not\equiv 0$, we have $D''(\beta) > 0$ for $\beta\in(0,1)$. Thus $D$ is strictly convex. Also $D'(\beta_0) = 0$, so strict convexity implies
$$D'(\beta) < 0 \text{ for } \beta < \beta_0, \qquad D'(\beta) > 0 \text{ for } \beta > \beta_0.$$
In particular, $D'(0) < 0$.

Finally, we characterize the maximizer of the anchored objective. Assume $u_n > u_o$, so that $J_\eta'(q_\beta) < 0$ for all $\beta\in[0,1]$. Define
$$H(\beta) := \mathcal{L}_{\eta,\lambda_{\mathrm{ref}}}(q_\beta) = J_\eta(q_\beta) - \lambda_{\mathrm{ref}} D(\beta).$$
Since $J_\eta'' < 0$ and $D'' > 0$, we have
$$H''(\beta) = J_\eta''(q_\beta) - \lambda_{\mathrm{ref}} D''(\beta) < 0 \qquad \forall\beta\in(0,1),$$
so $H$ is strictly concave. Now
$$H'(0) = J_\eta'(q_\beta)\big|_{\beta=0} - \lambda_{\mathrm{ref}} D'(0).$$
Because $J_\eta'(q_\beta)|_{\beta=0} < 0$ and $D'(0) < 0$, define
$$\lambda_{\mathrm{crit}}^{(\mathrm{new})} := \frac{-J_\eta'(q_\beta)|_{\beta=0}}{-D'(0)} > 0.$$
If $0\le\lambda_{\mathrm{ref}}\le\lambda_{\mathrm{crit}}^{(\mathrm{new})}$, then $H'(0)\le 0$. Since $H'$ is strictly decreasing, we have $H'(\beta) < 0$ for all $\beta\in(0,1)$, so the unique maximizer is $\beta^\star = 0$. If $\lambda_{\mathrm{ref}} > \lambda_{\mathrm{crit}}^{(\mathrm{new})}$, then $H'(0) > 0$. At $\beta = \beta_0$ we have $D'(\beta_0) = 0$, so $H'(\beta_0) = J_\eta'(q_{\beta_0}) < 0$. By continuity of $H'$ and strict concavity of $H$, there is a unique root of $H'$ in $(0,\beta_0)$, which is the unique maximizer of $H$. This proves part (A).

Proof of Part (B): Consider the full family
$$q_{\beta,m_o,m_n}(y) = \beta\,\varphi_\Sigma(y;m_o) + (1-\beta)\,\varphi_\Sigma(y;m_n),$$
and define $J_\eta(\beta,m_o,m_n) = \log E_{Y\sim q_{\beta,m_o,m_n}}[e^{\eta r(Y)}]$. Let
$$M := E_{Y\sim q_{\beta,m_o,m_n}}[e^{\eta r(Y)}], \qquad w_\eta^{(q)}(y) := \frac{e^{\eta r(y)}}{M}.$$
By the standard score identity,
$$\nabla_\theta J_\eta(q_\theta) = E_{Y\sim q_\theta}\bigl[w_\eta^{(q_\theta)}(Y)\,\nabla_\theta\log q_\theta(Y)\bigr].$$
For the old mean parameter,
$$\nabla_{m_o}\log q_{\beta,m_o,m_n}(y) = r_o(y)\,\Sigma^{-1}(y-m_o), \qquad r_o(y) := \frac{\beta\,\varphi_\Sigma(y;m_o)}{q_{\beta,m_o,m_n}(y)}.$$
Hence, at $m_o = \mu_o$,
$$\nabla_{m_o} J_\eta(\beta,\mu_o,m_n) = E_{Y\sim q_{\beta,\mu_o,m_n}}\bigl[w_\eta^{(q)}(Y)\,r_o(Y)\,\Sigma^{-1}(Y-\mu_o)\bigr].$$
Using $q_{\beta,\mu_o,m_n}(y)\,r_o(y) = \beta\,\varphi_\Sigma(y;\mu_o) = \beta\,p_o(y)$, this becomes
$$\nabla_{m_o} J_\eta(\beta,\mu_o,m_n) = \beta\,E_{Y\sim p_o}\bigl[w_\eta^{(q)}(Y)\,\Sigma^{-1}(Y-\mu_o)\bigr].$$
Since $r$ is constant on $A_o$ and $A_n$, the weight is constant on each region: $w_\eta^{(q)}(y) = w_o$ on $A_o$ and $w_\eta^{(q)}(y) = w_n$ on $A_n$. Thus
$$\nabla_{m_o} J_\eta(\beta,\mu_o,m_n) = \beta\Bigl(w_o\,E_{p_o}\bigl[\Sigma^{-1}(Y-\mu_o)\mathbf{1}\{Y\in A_o\}\bigr] + w_n\,E_{p_o}\bigl[\Sigma^{-1}(Y-\mu_o)\mathbf{1}\{Y\in A_n\}\bigr]\Bigr).$$
Since $E_{p_o}[\Sigma^{-1}(Y-\mu_o)] = 0$, the two truncated expectations are negatives of one another, giving
$$\nabla_{m_o} J_\eta(\beta,\mu_o,m_n) = \beta(w_n - w_o)\,E_{p_o}\bigl[\Sigma^{-1}(Y-\mu_o)\mathbf{1}\{Y\in A_n\}\bigr].$$
By Lemma B.4, we have
$$E_{p_o}\bigl[\Sigma^{-1}(Y-\mu_o)\mathbf{1}\{Y\in A_n\}\bigr] = \frac{\phi(\delta/2)}{\delta}\,\Sigma^{-1}(\mu_n-\mu_o),$$
where $\phi(t) = (2\pi)^{-1/2}e^{-t^2/2}$. Substituting proves (3.15).

Finally, if $|u_o|, |u_n|\le R$, then $e^{\eta r(y)}\in[e^{-\eta R}, e^{\eta R}]$ pointwise, so
$$M\in[e^{-\eta R}, e^{\eta R}], \qquad w_o, w_n\in[e^{-2\eta R}, e^{2\eta R}],$$
and therefore $|w_n - w_o|\le e^{2\eta R} - e^{-2\eta R}$. Using $\phi(\delta/2) = (2\pi)^{-1/2}e^{-\delta^2/8}$ in (3.15) yields (3.16).

If the full objective is evaluated at a synchronized point $q_0 = q_{\beta,\mu_o,m_n}$, then
$$\nabla_{m_o}\mathrm{KL}(q_{\beta,m_o,m_n}\|q_0)\big|_{m_o=\mu_o} = 0,$$
because $\log(q/q_0)\equiv 0$ at that point and $\int\nabla_{m_o} q = 0$. Hence the same bound applies to the full objective.
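The truncated-moment identity of Lemma B.4 invoked above reduces, in one dimension with $\Sigma = \sigma^2$, to $E_{p_o}[(Y-\mu_o)\mathbf{1}\{Y\ge(\mu_o+\mu_n)/2\}] = \sigma\,\phi(\delta/2)$, a consequence of $E[Z\,\mathbf{1}\{Z\ge a\}] = \phi(a)$ for standard normal $Z$. A Monte Carlo sanity check with illustrative parameters:

```python
import math, random

random.seed(3)
mu_o, mu_n, sigma = 0.0, 2.0, 1.0
delta = abs(mu_n - mu_o) / sigma
mid = (mu_o + mu_n) / 2                  # Bayes boundary between the two modes

# Monte Carlo estimate of E_{p_o}[(Y - mu_o) 1{Y >= mid}]
N = 400_000
acc = 0.0
for _ in range(N):
    y = random.gauss(mu_o, sigma)
    if y >= mid:
        acc += y - mu_o
est = acc / N

# closed form: sigma * phi(delta/2), with phi the standard normal density
closed = sigma * math.exp(-(delta / 2) ** 2 / 2) / math.sqrt(2 * math.pi)
assert abs(est - closed) < 0.01
```

The factor $\phi(\delta/2) = (2\pi)^{-1/2}e^{-\delta^2/8}$ is exactly the overlap gate in (3.16): the drift of the old mean decays exponentially in the squared mode separation.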
∎

Lemma B.4 (Truncated Gaussian moment along the Bayes halfspace). In the setting of Theorem 3.2, we have
$$E\bigl[\Sigma^{-1}(Y-\mu_o)\,\mathbf{1}\{Y\in A_n\}\bigr] = \frac{\phi(\delta/2)}{\delta}\,\Sigma^{-1}(\mu_n-\mu_o),$$
where $\phi(t) := (2\pi)^{-1/2}e^{-t^2/2}$ is the standard normal density.

Proof. The proof is based on routine moment computations. Write $\Delta := \mu_n - \mu_o$ and $X := Y - \mu_o$. Then $X\sim\mathcal{N}(0,\Sigma)$ and
$$\Sigma^{-1}(Y-\mu_o)\mathbf{1}\{Y\in A_n\} = \Sigma^{-1}X\,\mathbf{1}\{Y\in A_n\}.$$
We start by rewriting the truncation event. By definition of $A_n$,
$$Y\in A_n \iff \Delta^\top\Sigma^{-1}\Bigl(Y - \frac{\mu_o+\mu_n}{2}\Bigr)\ge 0 \iff \Delta^\top\Sigma^{-1}\Bigl(X - \frac{\Delta}{2}\Bigr)\ge 0 \iff \Delta^\top\Sigma^{-1}X \ge \frac{1}{2}\Delta^\top\Sigma^{-1}\Delta = \frac{\delta^2}{2}.$$
Next we whiten the Gaussian. Let $Z := \Sigma^{-1/2}X$. Then $Z\sim\mathcal{N}(0, I_d)$ and $X = \Sigma^{1/2}Z$. Hence $\Sigma^{-1}X = \Sigma^{-1}\Sigma^{1/2}Z = \Sigma^{-1/2}Z$. Also define $b := \Sigma^{-1/2}\Delta\in\mathbb{R}^d$. Then $\|b\| = \sqrt{b^\top b} = \sqrt{\Delta^\top\Sigma^{-1}\Delta} = \delta$, and
$$\Delta^\top\Sigma^{-1}X = \Delta^\top\Sigma^{-1}\Sigma^{1/2}Z = (\Sigma^{-1/2}\Delta)^\top Z = b^\top Z.$$
Therefore, we have the equivalence $Y\in A_n \iff b^\top Z\ge\delta^2/2$. Combining these identities gives
$$E\bigl[\Sigma^{-1}(Y-\mu_o)\,\mathbf{1}\{Y\in A_n\}\bigr] = \Sigma^{-1/2}\,E\bigl[Z\,\mathbf{1}\{b^\top Z\ge\delta^2/2\}\bigr]. \qquad \text{(B.4)}$$
We now reduce the above to a one-dimensional truncated normal moment. Let $u := b/\delta$, so $\|u\| = 1$ and $b^\top Z = \delta\,u^\top Z$. Define the scalar random variable $U := u^\top Z$. Since $Z\sim\mathcal{N}(0,I_d)$ and $\|u\| = 1$, we have $U\sim\mathcal{N}(0,1)$. Moreover,
$$\{b^\top Z\ge\delta^2/2\} = \{\delta U\ge\delta^2/2\} = \{U\ge\delta/2\}.$$
Now decompose $Z$ into its component along $u$ and its orthogonal remainder:
$$Z = uU + V, \qquad V := Z - uU = (I - uu^\top)Z.$$
We claim $V$ is independent of $U$ and satisfies $E[V] = 0$. Indeed, $(U,V)$ is jointly Gaussian (as an affine image of the Gaussian vector $Z$), and
$$\mathrm{Cov}(U,V) = E[UV^\top] = E\bigl[(u^\top Z)\,Z^\top(I - uu^\top)\bigr] = u^\top E[ZZ^\top](I - uu^\top) = u^\top I_d(I - uu^\top) = u^\top - u^\top uu^\top = u^\top - u^\top = 0.$$
For jointly Gaussian random variables, zero covariance implies independence, hence $U$ and $V$ are independent. Also $E[V] = (I - uu^\top)E[Z] = 0$. Therefore,
$$E\bigl[Z\,\mathbf{1}\{U\ge\delta/2\}\bigr] = E\bigl[(uU + V)\,\mathbf{1}\{U\ge\delta/2\}\bigr] = u\,E\bigl[U\,\mathbf{1}\{U\ge\delta/2\}\bigr] + E\bigl[V\,\mathbf{1}\{U\ge\delta/2\}\bigr].$$
Using independence of $V$ and $U$ and $E[V] = 0$,
$$E\bigl[V\,\mathbf{1}\{U\ge\delta/2\}\bigr] = E\bigl[E[V\,\mathbf{1}\{U\ge\delta/2\}\mid U]\bigr] = E\bigl[\mathbf{1}\{U\ge\delta/2\}\,E[V\mid U]\bigr] = E\bigl[\mathbf{1}\{U\ge\delta/2\}\,E[V]\bigr] = 0.$$
Thus
$$E\bigl[Z\,\mathbf{1}\{U\ge\delta/2\}\bigr] = u\,E\bigl[U\,\mathbf{1}\{U\ge\delta/2\}\bigr]. \qquad \text{(B.5)}$$
We now compute the scalar truncated moment. Since $U\sim\mathcal{N}(0,1)$ with density $\phi$, we have
$$E\bigl[U\,\mathbf{1}\{U\ge a\}\bigr] = \int_a^\infty u\,\phi(u)\,du \qquad \text{for any } a\in\mathbb{R}.$$
We now compute this integral explicitly. Recall $\phi(u) = (2\pi)^{-1/2}e^{-u^2/2}$, so
$$\frac{d}{du}\phi(u) = (2\pi)^{-1/2}\frac{d}{du}\bigl(e^{-u^2/2}\bigr) = (2\pi)^{-1/2}(-u)e^{-u^2/2} = -u\,\phi(u).$$
Hence $u\,\phi(u) = -\phi'(u)$, and therefore
$$\int_a^\infty u\,\phi(u)\,du = -\int_a^\infty\phi'(u)\,du = -\Bigl(\lim_{u\to\infty}\phi(u) - \phi(a)\Bigr) = \phi(a),$$
since $\lim_{u\to\infty}\phi(u) = 0$. Taking $a = \delta/2$ yields
$$E\bigl[U\,\mathbf{1}\{U\ge\delta/2\}\bigr] = \phi(\delta/2). \qquad \text{(B.6)}$$
Substituting (B.6) into (B.5) gives
$$E\bigl[Z\,\mathbf{1}\{b^\top Z\ge\delta^2/2\}\bigr] = u\,\phi(\delta/2) = \frac{b}{\delta}\,\phi(\delta/2).$$
Plugging this into (B.4) yields
$$E\bigl[\Sigma^{-1}(Y-\mu_o)\,\mathbf{1}\{Y\in A_n\}\bigr] = \Sigma^{-1/2}\Bigl(\frac{b}{\delta}\,\phi(\delta/2)\Bigr) = \frac{\phi(\delta/2)}{\delta}\,\Sigma^{-1/2}b = \frac{\phi(\delta/2)}{\delta}\,\Sigma^{-1/2}\Sigma^{-1/2}\Delta = \frac{\phi(\delta/2)}{\delta}\,\Sigma^{-1}\Delta.$$
Recalling that $\Delta = \mu_n - \mu_o$ completes the proof. ∎

B.11 Exact Characterization of the Optimal Mixture Weight for TTT-Discover

Proposition 1 (Exact characterization of the optimal mixture weight for the TTT-style objective). Fix $\eta > 0$, $\lambda_{\mathrm{ref}}\ge 0$, and a reference weight $\beta_0\in(0,1)$. Let
$$a := e^{\eta u_o}, \qquad b := e^{\eta u_n},$$
where $u_o, u_n\in\mathbb{R}$ are the old- and new-side reward levels.

(A) Disjoint-support case.
Assume the disjoint-support setting of Lemma 3.1. Define
\[
H_{\mathrm{disc}}(\beta):=\mathcal{L}_{\eta,\lambda_{\mathrm{ref}}}(q_\beta)=\log\big(\beta a+(1-\beta)b\big)-\lambda_{\mathrm{ref}}\Big[\beta\log\frac{\beta}{\beta_0}+(1-\beta)\log\frac{1-\beta}{1-\beta_0}\Big].
\]
Then the maximizer $\beta^\star\in[0,1]$ is characterized as follows:
\[
\beta^\star=\begin{cases}
0, & \lambda_{\mathrm{ref}}=0,\; b>a,\\[2pt]
1, & \lambda_{\mathrm{ref}}=0,\; a>b,\\[2pt]
\text{any }\beta\in[0,1], & \lambda_{\mathrm{ref}}=0,\; a=b,\\[2pt]
\text{unique solution in }(0,1)\text{ of }\ \dfrac{a-b}{\beta a+(1-\beta)b}=\lambda_{\mathrm{ref}}\Big(\log\dfrac{\beta}{\beta_0}-\log\dfrac{1-\beta}{1-\beta_0}\Big), & \lambda_{\mathrm{ref}}>0.
\end{cases}
\]
In particular, when $\lambda_{\mathrm{ref}}>0$ and $a=b$, the unique maximizer is $\beta^\star=\beta_0$.

(B) Gaussian case. Assume the Gaussian setting of Theorem 3.2. Define
\[
\gamma:=\Phi\Big(-\frac{\delta}{2}\Big),\qquad \kappa:=1-2\gamma,
\]
and
\[
H_{\mathrm{gauss}}(\beta):=\mathcal{L}_{\eta,\lambda_{\mathrm{ref}}}(q_\beta)=\log\big(a(\gamma+\kappa\beta)+b(1-\gamma-\kappa\beta)\big)-\lambda_{\mathrm{ref}}\,D(\beta),
\qquad D(\beta):=\mathrm{KL}(q_\beta\,\|\,q_{\beta_0}).
\]
Then the maximizer $\beta^\star\in[0,1]$ is characterized as follows:
• If $\lambda_{\mathrm{ref}}=0$, then $\beta^\star=0$ if $b>a$; $\beta^\star=1$ if $a>b$; and any $\beta\in[0,1]$ is optimal if $a=b$.
• If $\lambda_{\mathrm{ref}}>0$ and $a=b$, then $\beta^\star=\beta_0$.
• If $b>a$, define
\[
\lambda_{\mathrm{crit}}^{(\mathrm{new})}:=\frac{\kappa(b-a)}{\big(a\gamma+b(1-\gamma)\big)\,(-D'(0))},
\qquad
D'(\beta)=\int_{\mathbb{R}^d}\big(p_o(y)-p_n(y)\big)\log\frac{q_\beta(y)}{q_{\beta_0}(y)}\,dy.
\]
Then $\beta^\star=0$ for $0\le\lambda_{\mathrm{ref}}\le\lambda_{\mathrm{crit}}^{(\mathrm{new})}$, and for $\lambda_{\mathrm{ref}}>\lambda_{\mathrm{crit}}^{(\mathrm{new})}$ it is the unique solution in $(0,\beta_0)$ of
\[
\frac{\kappa(a-b)}{a(\gamma+\kappa\beta)+b(1-\gamma-\kappa\beta)}=\lambda_{\mathrm{ref}}\,D'(\beta).
\]
• If $a>b$, define
\[
\lambda_{\mathrm{crit}}^{(\mathrm{old})}:=\frac{\kappa(a-b)}{\big(a(1-\gamma)+b\gamma\big)\,D'(1)}.
\]
Then $\beta^\star=1$ for $0\le\lambda_{\mathrm{ref}}\le\lambda_{\mathrm{crit}}^{(\mathrm{old})}$, and for $\lambda_{\mathrm{ref}}>\lambda_{\mathrm{crit}}^{(\mathrm{old})}$ it is the unique solution in $(\beta_0,1)$ of the same first-order equation.

Proof.
We treat the disjoint-support and Gaussian cases separately.

Proof of Part (A): By Lemma 3.1, the objective is exactly
\[
H_{\mathrm{disc}}(\beta)=\log\big(\beta a+(1-\beta)b\big)-\lambda_{\mathrm{ref}}\Big[\beta\log\frac{\beta}{\beta_0}+(1-\beta)\log\frac{1-\beta}{1-\beta_0}\Big].
\]
First consider $\lambda_{\mathrm{ref}}=0$. Then $H_{\mathrm{disc}}(\beta)=\log(\beta a+(1-\beta)b)$, the logarithm of an affine function of $\beta$. Hence:
• if $b>a$, it is strictly decreasing, so $\beta^\star=0$;
• if $a>b$, it is strictly increasing, so $\beta^\star=1$;
• if $a=b$, it is constant, so every $\beta\in[0,1]$ is optimal.
Now assume $\lambda_{\mathrm{ref}}>0$. Differentiate on $(0,1)$:
\[
H_{\mathrm{disc}}'(\beta)=\frac{a-b}{\beta a+(1-\beta)b}-\lambda_{\mathrm{ref}}\Big(\log\frac{\beta}{\beta_0}-\log\frac{1-\beta}{1-\beta_0}\Big).
\]
Differentiating once more gives
\[
H_{\mathrm{disc}}''(\beta)=-\frac{(a-b)^2}{(\beta a+(1-\beta)b)^2}-\lambda_{\mathrm{ref}}\Big(\frac{1}{\beta}+\frac{1}{1-\beta}\Big)<0\qquad\forall\beta\in(0,1),
\]
so $H_{\mathrm{disc}}$ is strictly concave on $(0,1)$. Moreover,
\[
\lim_{\beta\downarrow 0}H_{\mathrm{disc}}'(\beta)=+\infty,\qquad\lim_{\beta\uparrow 1}H_{\mathrm{disc}}'(\beta)=-\infty,
\]
because the logarithmic term diverges while the first term remains finite at the endpoints. Hence, by continuity of $H_{\mathrm{disc}}'$, there exists a unique $\beta^\star\in(0,1)$ with $H_{\mathrm{disc}}'(\beta^\star)=0$. This gives the stated first-order equation. If $a=b$, then the first term in $H_{\mathrm{disc}}'$ vanishes identically, so the unique solution satisfies
\[
\log\frac{\beta}{\beta_0}=\log\frac{1-\beta}{1-\beta_0},
\]
which is equivalent to $\beta=\beta_0$.

Proof of Part (B): By Theorem 3.2, the Gaussian objective can be written as
\[
H_{\mathrm{gauss}}(\beta)=\log\big(a(\gamma+\kappa\beta)+b(1-\gamma-\kappa\beta)\big)-\lambda_{\mathrm{ref}}D(\beta),
\]
with
\[
D'(\beta)=\int_{\mathbb{R}^d}\big(p_o(y)-p_n(y)\big)\log\frac{q_\beta(y)}{q_{\beta_0}(y)}\,dy,\qquad
D''(\beta)=\int_{\mathbb{R}^d}\frac{(p_o(y)-p_n(y))^2}{q_\beta(y)}\,dy>0.
\]
Therefore $D$ is strictly convex, $D'(\beta_0)=0$, $D'(0)<0$, and $D'(1)>0$.

First consider $\lambda_{\mathrm{ref}}=0$. Then $H_{\mathrm{gauss}}(\beta)=\log\big(a(\gamma+\kappa\beta)+b(1-\gamma-\kappa\beta)\big)$, whose derivative is
\[
H_{\mathrm{gauss}}'(\beta)=\frac{\kappa(a-b)}{a(\gamma+\kappa\beta)+b(1-\gamma-\kappa\beta)}.
\]
Since $\kappa>0$, the sign is the sign of $a-b$. Thus:
• if $b>a$, the derivative is strictly negative and $\beta^\star=0$;
• if $a>b$, the derivative is strictly positive and $\beta^\star=1$;
• if $a=b$, the derivative is zero and every $\beta$ is optimal.
Now assume $\lambda_{\mathrm{ref}}>0$. If $a=b$, then the logarithmic term is constant in $\beta$, so maximizing $H_{\mathrm{gauss}}$ is equivalent to minimizing $D$. Since $D$ is strictly convex and $D'(\beta_0)=0$, its unique minimizer is $\beta_0$; hence $\beta^\star=\beta_0$.

Next suppose $b>a$. Then
\[
H_{\mathrm{gauss}}'(\beta)=\frac{\kappa(a-b)}{a(\gamma+\kappa\beta)+b(1-\gamma-\kappa\beta)}-\lambda_{\mathrm{ref}}D'(\beta),
\]
and
\[
H_{\mathrm{gauss}}''(\beta)=-\frac{\kappa^2(a-b)^2}{\big(a(\gamma+\kappa\beta)+b(1-\gamma-\kappa\beta)\big)^2}-\lambda_{\mathrm{ref}}D''(\beta)<0.
\]
So $H_{\mathrm{gauss}}$ is strictly concave, hence has at most one maximizer in $(0,1)$. Evaluate the derivative at $\beta=0$:
\[
H_{\mathrm{gauss}}'(0)=\frac{\kappa(a-b)}{a\gamma+b(1-\gamma)}-\lambda_{\mathrm{ref}}D'(0).
\]
Since $a-b<0$ and $D'(0)<0$, this equals
\[
H_{\mathrm{gauss}}'(0)=-\frac{\kappa(b-a)}{a\gamma+b(1-\gamma)}+\lambda_{\mathrm{ref}}\,(-D'(0)).
\]
Therefore $H_{\mathrm{gauss}}'(0)\le 0$ exactly when
\[
\lambda_{\mathrm{ref}}\le\frac{\kappa(b-a)}{\big(a\gamma+b(1-\gamma)\big)(-D'(0))}=\lambda_{\mathrm{crit}}^{(\mathrm{new})}.
\]
If this holds, then because $H_{\mathrm{gauss}}'$ is strictly decreasing on $(0,1)$, we have $H_{\mathrm{gauss}}'(\beta)<0$ for all $\beta\in(0,1)$, so the unique maximizer is $\beta^\star=0$. If instead $\lambda_{\mathrm{ref}}>\lambda_{\mathrm{crit}}^{(\mathrm{new})}$, then $H_{\mathrm{gauss}}'(0)>0$. At $\beta=\beta_0$, since $D'(\beta_0)=0$,
\[
H_{\mathrm{gauss}}'(\beta_0)=\frac{\kappa(a-b)}{a(\gamma+\kappa\beta_0)+b(1-\gamma-\kappa\beta_0)}<0.
\]
By continuity and strict monotonicity of $H_{\mathrm{gauss}}'$, there is a unique root in $(0,\beta_0)$, and that root is the unique maximizer. It satisfies exactly the first-order equation
\[
\frac{\kappa(a-b)}{a(\gamma+\kappa\beta)+b(1-\gamma-\kappa\beta)}=\lambda_{\mathrm{ref}}D'(\beta).
\]
The case $a>b$ is symmetric. Now the entropic derivative is positive, and the threshold is determined by the right boundary:
\[
\lambda_{\mathrm{crit}}^{(\mathrm{old})}=\frac{H_{\mathrm{gauss}}'(\beta)\big|_{\lambda_{\mathrm{ref}}=0,\ \beta=1}}{D'(1)}=\frac{\kappa(a-b)}{\big(a(1-\gamma)+b\gamma\big)D'(1)}.
\]
If $\lambda_{\mathrm{ref}}\le\lambda_{\mathrm{crit}}^{(\mathrm{old})}$, then $H_{\mathrm{gauss}}'(1)\ge 0$, and since $H_{\mathrm{gauss}}'$ is strictly decreasing, $H_{\mathrm{gauss}}'(\beta)>0$ on $(0,1)$, so the unique maximizer is $\beta^\star=1$. If $\lambda_{\mathrm{ref}}>\lambda_{\mathrm{crit}}^{(\mathrm{old})}$, then $H_{\mathrm{gauss}}'(1)<0$ while
\[
H_{\mathrm{gauss}}'(\beta_0)=\frac{\kappa(a-b)}{a(\gamma+\kappa\beta_0)+b(1-\gamma-\kappa\beta_0)}>0,
\]
so the unique maximizer lies in $(\beta_0,1)$ and is characterized by the same first-order equation. This completes the proof. ∎

B.12 Proof of Lemma 3.2

Proof of Lemma 3.2. Under disjoint support, we have
\[
q_0(y)=\beta_0\,p_o(y)\ \text{ for }y\in A_o,\qquad q_0(y)=(1-\beta_0)\,p_n(y)\ \text{ for }y\in A_n.
\]
Since the reward is constant on each region,
\[
q^*(y)=\frac{1}{Z}\,q_0(y)\,e^{r(y)/\tau}=
\begin{cases}
\dfrac{1}{Z}\,\beta_0\,e^{r_o/\tau}\,p_o(y), & y\in A_o,\\[6pt]
\dfrac{1}{Z}\,(1-\beta_0)\,e^{r_n/\tau}\,p_n(y), & y\in A_n.
\end{cases}
\]
Therefore $q^*$ is again a two-component mixture with the same components:
\[
q^*(y)=\beta^*p_o(y)+(1-\beta^*)p_n(y),\qquad
\beta^*=\frac{\beta_0\,e^{r_o/\tau}}{Z},\quad 1-\beta^*=\frac{(1-\beta_0)\,e^{r_n/\tau}}{Z}.
\]
Because $q^*$ is a probability density,
\[
1=\int q^*(y)\,dy=\frac{\beta_0\,e^{r_o/\tau}}{Z}+\frac{(1-\beta_0)\,e^{r_n/\tau}}{Z},
\quad\text{so}\quad Z=\beta_0\,e^{r_o/\tau}+(1-\beta_0)\,e^{r_n/\tau}.
\]
Substituting into the expression for $\beta^*$ gives
\[
\beta^*=\frac{\beta_0\,e^{r_o/\tau}}{\beta_0\,e^{r_o/\tau}+(1-\beta_0)\,e^{r_n/\tau}},
\]
which is (3.18). Since $\beta_0\in(0,1)$ and $r_o,r_n$ are finite, all factors are strictly positive, so $\beta^*\in(0,1)$. ∎

B.13 Proof of Theorem 3.3

Proof of Theorem 3.3. We prove parts (A) and (B) separately.

Proof of Part (A): We start by computing the exact expected old responsibility under the target $q^*$.
Recall
\[
q^*(y)=\frac{1}{Z}\,q_0(y)\,e^{r(y)/\tau},\qquad r_o^{(0)}(y)=\frac{\beta_0\,p_o(y)}{q_0(y)}.
\]
Then
\[
\mathbb{E}_{Y\sim q^*}\big[r_o^{(0)}(Y)\big]=\int q^*(y)\,\frac{\beta_0\,p_o(y)}{q_0(y)}\,dy=\frac{\beta_0}{Z}\int p_o(y)\,e^{r(y)/\tau}\,dy.
\tag{B.7}
\]
We therefore compute the two Gaussian integrals
\[
I_o:=\int p_o(y)\,e^{r(y)/\tau}\,dy,\qquad I_n:=\int p_n(y)\,e^{r(y)/\tau}\,dy.
\]
Under $Y\sim p_o=\mathcal{N}(\mu_o,\Sigma)$, the Bayes region $A_n$ is entered with probability $\Pr_{p_o}(A_n)=\Phi(-\delta/2)=\gamma$, so $\Pr_{p_o}(A_o)=1-\gamma$. Hence
\[
I_o=e^{r_o/\tau}\,\Pr_{p_o}(A_o)+e^{r_n/\tau}\,\Pr_{p_o}(A_n)=(1-\gamma)\,e^{r_o/\tau}+\gamma\,e^{r_n/\tau}.
\]
Similarly, under $Y\sim p_n=\mathcal{N}(\mu_n,\Sigma)$, $\Pr_{p_n}(A_o)=\gamma$ and $\Pr_{p_n}(A_n)=1-\gamma$, so
\[
I_n=\gamma\,e^{r_o/\tau}+(1-\gamma)\,e^{r_n/\tau}.
\]
Now compute the normalizer: $Z=\mathbb{E}_{Y\sim q_0}[e^{r(Y)/\tau}]=\beta_0 I_o+(1-\beta_0)I_n$. Substituting $I_o$ and $Z$ into (B.7) yields
\[
\mathbb{E}_{Y\sim q^*}\big[r_o^{(0)}(Y)\big]=\frac{\beta_0\big((1-\gamma)e^{r_o/\tau}+\gamma e^{r_n/\tau}\big)}{\beta_0\big((1-\gamma)e^{r_o/\tau}+\gamma e^{r_n/\tau}\big)+(1-\beta_0)\big(\gamma e^{r_o/\tau}+(1-\gamma)e^{r_n/\tau}\big)},
\]
which is (3.19). All terms in the numerator and denominator are strictly positive, so the ratio lies in $(0,1)$.

Proof of Part (B): We start by computing the gradient identity and overlap bound. Define
\[
\Delta_{\beta,m_n}(y):=\tau\log\frac{q_{\beta,m_n}(y)}{q_0(y)}-A^*(y),\qquad
J(\beta,m_n)=\mathbb{E}_{Y\sim q_0}\big[\Delta_{\beta,m_n}(Y)^2\big].
\]
Since $q_0$ and $A^*$ are fixed, differentiation under the expectation gives
\[
\nabla_{m_n}J(\beta,m_n)=2\,\mathbb{E}_{Y\sim q_0}\big[\Delta_{\beta,m_n}(Y)\,\nabla_{m_n}\Delta_{\beta,m_n}(Y)\big]=2\tau\,\mathbb{E}_{Y\sim q_0}\big[\Delta_{\beta,m_n}(Y)\,\nabla_{m_n}\log q_{\beta,m_n}(Y)\big].
\]
For $q_{\beta,m_n}(y)=\beta\,p_o(y)+(1-\beta)\,\varphi_\Sigma(y;m_n)$, only the new component depends on $m_n$, and using $\nabla_m\varphi_\Sigma(y;m)=\varphi_\Sigma(y;m)\,\Sigma^{-1}(y-m)$, we obtain
\[
\nabla_{m_n}\log q_{\beta,m_n}(y)=\frac{(1-\beta)\,\varphi_\Sigma(y;m_n)}{q_{\beta,m_n}(y)}\,\Sigma^{-1}(y-m_n)=r_n^{(\beta,m_n)}(y)\,\Sigma^{-1}(y-m_n),
\]
which proves (3.20). Now specialize to the synchronized point $(\beta,m_n)=(\beta_0,\mu_n)$. There $q_{\beta,m_n}=q_0$, so
\[
\Delta_{\beta_0,\mu_n}(y)=\tau\log\frac{q_0(y)}{q_0(y)}-A^*(y)=-A^*(y).
\]
Thus the contribution of old-mode samples to the gradient is
\[
-2\tau\,\beta_0\,\mathbb{E}_{Y\sim p_o}\big[A^*(Y)\,r_n^{(\beta_0,\mu_n)}(Y)\,\Sigma^{-1}(Y-\mu_n)\big].
\]
If $|r(y)|\le R$, then $|V^*|=\tau\big|\log\mathbb{E}_{q_0}[e^{r/\tau}]\big|\le R$ because $e^{r/\tau}\in[e^{-R/\tau},e^{R/\tau}]$, so $|A^*(y)|=|r(y)-V^*|\le 2R$. By Cauchy–Schwarz and the fact that $0\le r_n\le 1$ implies $r_n^2\le r_n$,
\[
\Big\|\mathbb{E}_{Y\sim p_o}\big[A^*(Y)\,r_n^{(\beta_0,\mu_n)}(Y)\,\Sigma^{-1}(Y-\mu_n)\big]\Big\|
\le\sqrt{\mathbb{E}_{Y\sim p_o}\big[A^*(Y)^2\,\|\Sigma^{-1}(Y-\mu_n)\|^2\big]}\,\sqrt{\mathbb{E}_{Y\sim p_o}\big[r_n^{(\beta_0,\mu_n)}(Y)\big]}
\le 2R\,\sqrt{M_{o\to n}}\,\sqrt{\varepsilon_{o\to n}^{\mathrm{ref}}},
\]
where $M_{o\to n}:=\mathbb{E}_{Y\sim p_o}\big[\|\Sigma^{-1}(Y-\mu_n)\|^2\big]$. Multiplying by $2\tau\beta_0$ gives (3.21). Finally, the reference leakage bound follows directly from Lemma 2.1 applied to the mixture $q_0=\beta_0 p_o+(1-\beta_0)p_n$, together with Remark 2.1:
\[
\varepsilon_{o\to n}^{\mathrm{ref}}=\mathbb{E}_{Y\sim p_o}\big[r_n^{(\beta_0,\mu_n)}(Y)\big]=\mathbb{E}_{Y\sim p_o}\big[1-r_o^{(\beta_0,\mu_n)}(Y)\big]\le\frac12\sqrt{\frac{1-\beta_0}{\beta_0}}\exp\Big(-\frac18\,\|\mu_n-\mu_o\|_{\Sigma^{-1}}^2\Big).
\]
∎

Appendix C Extension to f-divergences

f-divergence. Let $f:(0,\infty)\to\mathbb{R}$ be convex. Given distributions $P,Q$ with densities $p,q$ w.r.t. a common reference measure $\mu$ and with $P\ll Q$, define the Csiszár–Morimoto f-divergence
\[
D_f(P\|Q):=\int q(y)\,f\Big(\frac{p(y)}{q(y)}\Big)\,d\mu(y).
\]
Affine invariance. If $\tilde f(t)=f(t)+a(t-1)$ for any $a\in\mathbb{R}$, then $D_{\tilde f}(P\|Q)=D_f(P\|Q)$ because $\int q\,\big(\tfrac{p}{q}-1\big)\,d\mu=\int(p-q)\,d\mu=0$.
Thus we may assume without loss of generality that $f(1)=0$ and $f'(1)=0$ whenever $f$ is differentiable at $1$.

Adjoint (reverse) generator. Define the adjoint generator
\[
f^\diamond(t):=t\,f(1/t),\qquad t>0.
\]
We start with the following standard adjoint result, with the proof provided for completeness.

Lemma C.1 (Adjoint identity). Assume $P\ll Q$ and $Q\ll P$ with densities $p,q$. Then
\[
D_f(P\|Q)=\int p(y)\,f^\diamond\Big(\frac{q(y)}{p(y)}\Big)\,d\mu(y)=D_{f^\diamond}(Q\|P).
\]
Moreover, if $f$ is convex then $f^\diamond$ is convex; if $f$ is strictly convex then $f^\diamond$ is strictly convex.

Proof. Using $D_f(P\|Q)=\int q\,f(p/q)\,d\mu$ and multiplying and dividing by $p$,
\[
\int q\,f(p/q)\,d\mu=\int p\,\Big(\frac{q}{p}\Big)f\Big(\frac{p}{q}\Big)\,d\mu=\int p\,f^\diamond(q/p)\,d\mu.
\]
The equality $D_f(P\|Q)=D_{f^\diamond}(Q\|P)$ is immediate from the definitions. Convexity/strict convexity of $f^\diamond$ follows from standard perspective-transform facts: $t\mapsto t\,f(1/t)$ preserves convexity on $(0,\infty)$. ∎

When taking gradients of f-divergences, the quantity
\[
\kappa_f(t):=t\,f''(t),\qquad t>0,
\]
plays a central role. For convex twice differentiable $f$, $\kappa_f(t)\ge 0$. We now generalize Theorem 2.2 from KL to the family of f-divergences.

Theorem C.1. Assume $f:(0,\infty)\to\mathbb{R}$ is twice continuously differentiable and strictly convex, normalized so that $f(1)=f'(1)=0$. Let $f^\diamond(t)=t\,f(1/t)$ and $\kappa_f(t)=t\,f''(t)$.

(A) Fix $m_o=\mu_o$ and $m_n=\mu_n$ and define $q_\beta=\beta p_o+(1-\beta)p_n$. Consider
\[
L_{\mathrm{SFT}}^f(\beta):=D_f(p_n\|q_\beta),\qquad\beta\in[0,1].
\]
Then $L_{\mathrm{SFT}}^f(0)=0$ and $L_{\mathrm{SFT}}^f(\beta)>0$ for every $\beta\in(0,1]$, hence $\beta=0$ is the unique global minimizer. Moreover, $L_{\mathrm{SFT}}^f$ is strictly increasing on $[0,1]$, and logit gradient flow $\dot\varphi=-\frac{d}{d\varphi}L_{\mathrm{SFT}}^f(\sigma(\varphi))$ satisfies $\beta(t)=\sigma(\varphi(t))\downarrow 0$.

(B) Fix $\alpha\in(0,1)$ and consider
\[
L_{\mathrm{RL}}^f(\beta,m_o,m_n):=D_f\big(q_{\beta,m_o,m_n}\,\|\,p_\alpha\big).
\]
Assume $m_o=\mu_o$ and define the density ratio $w(y):=q_{\beta,\mu_o,m_n}(y)/p_\alpha(y)$. Then the gradient w.r.t. the old mean admits the exact decomposition
\[
\nabla_{m_o}L_{\mathrm{RL}}^f(\beta,\mu_o,m_n)=\beta\,\Sigma^{-1}\big(A_f(\beta,m_n)\,(m_n-\mu_o)-B_f(\alpha,\beta,m_n)\,(\mu_n-\mu_o)\big),
\]
where
\[
A_f(\beta,m_n):=\mathbb{E}_{Y\sim p_o}\big[\kappa_f(w(Y))\,(1-r_o(Y))\big],\qquad
B_f(\alpha,\beta,m_n):=\mathbb{E}_{Y\sim p_o}\big[\kappa_f(w(Y))\,(1-s_o(Y))\big].
\]
If, in addition, $f$ has bounded curvature $0\le\kappa_f(t)\le C_f$ for all $t>0$, then
\[
\big\|\nabla_{m_o}L_{\mathrm{RL}}^f(\beta,\mu_o,m_n)\big\|\le\beta\,C_f\,\|\Sigma^{-1}\|_2\big(\varepsilon_q(\beta,m_n)\,\|m_n-\mu_o\|+\varepsilon_p(\alpha)\,\|\mu_n-\mu_o\|\big),
\]
where $\varepsilon_q=\mathbb{E}_{p_o}[1-r_o]$ and $\varepsilon_p=\mathbb{E}_{p_o}[1-s_o]$ satisfy the same Gaussian overlap bounds as in (2.9).

Proof of Theorem C.1.

Proof of Part (A). By Lemma C.1,
\[
L_{\mathrm{SFT}}^f(\beta)=D_f(p_n\|q_\beta)=\mathbb{E}_{Y\sim p_n}\Big[f^\diamond\Big(\frac{q_\beta(Y)}{p_n(Y)}\Big)\Big].
\]
Write $X(Y):=p_o(Y)/p_n(Y)$ and $Z_\beta:=(1-\beta)+\beta X(Y)$, so that $q_\beta(Y)/p_n(Y)=Z_\beta$ and $\mathbb{E}[Z_\beta]=1$. Since $f^\diamond$ is strictly convex and minimized uniquely at $1$ (by normalization), Jensen's inequality yields
\[
L_{\mathrm{SFT}}^f(\beta)=\mathbb{E}[f^\diamond(Z_\beta)]\ge f^\diamond(\mathbb{E}[Z_\beta])=f^\diamond(1)=0,
\]
with strict inequality for $\beta>0$ because $X(Y)$ is nonconstant when $\mu_o\ne\mu_n$. Hence $\beta=0$ is the unique minimizer. For monotonicity, note that for each fixed $y$, the map $\beta\mapsto f^\diamond((1-\beta)+\beta X(y))$ is convex in $\beta$ (because $f^\diamond$ is convex and $(1-\beta)+\beta X(y)$ is affine). Thus $L_{\mathrm{SFT}}^f$ is convex on $[0,1]$. Moreover, for any $0<\beta_1<\beta_2\le 1$, convexity implies the secant slope is nondecreasing:
\[
\frac{L_{\mathrm{SFT}}^f(\beta_2)-L_{\mathrm{SFT}}^f(\beta_1)}{\beta_2-\beta_1}\ \ge\ \frac{L_{\mathrm{SFT}}^f(\beta_1)-L_{\mathrm{SFT}}^f(0)}{\beta_1-0}=\frac{L_{\mathrm{SFT}}^f(\beta_1)}{\beta_1}\ >\ 0,
\]
so $L_{\mathrm{SFT}}^f(\beta_2)>L_{\mathrm{SFT}}^f(\beta_1)$ and $L_{\mathrm{SFT}}^f$ is strictly increasing.
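The forward-KL special case of this monotonicity ($f=f_{\mathrm{KL}}$) is easy to check numerically. The sketch below evaluates $\beta\mapsto\mathrm{KL}(p_n\|q_\beta)$ by quadrature for one-dimensional unit-variance modes; the means and the $\beta$ grid are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu_o, mu_n = -3.0, 3.0  # illustrative well-separated old/new modes

def forward_kl(beta):
    # KL(p_n || q_beta) with q_beta = beta*p_o + (1-beta)*p_n, unit variance
    def integrand(y):
        pn = norm.pdf(y, mu_n)
        q = beta * norm.pdf(y, mu_o) + (1 - beta) * pn
        return pn * np.log(pn / q)
    val, _ = quad(integrand, -20, 20)
    return val

betas = np.linspace(0.0, 0.9, 10)
vals = [forward_kl(b) for b in betas]
assert abs(vals[0]) < 1e-8        # L_SFT(0) = 0
assert all(np.diff(vals) > 0)     # strictly increasing in beta
```

Consistent with Part (A), the loss vanishes only at $\beta=0$ and grows monotonically as old-mode mass is added.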
The logit-gradient-flow claim follows exactly as in the KL case: since $\beta'(\varphi)=\beta(1-\beta)>0$, we have $\frac{d}{d\varphi}L_{\mathrm{SFT}}^f(\sigma(\varphi))>0$ for $\beta\in(0,1)$, hence $\varphi(t)$ decreases and $\beta(t)\downarrow 0$.

Proof of Part (B). Let $q=q_{\beta,m_o,m_n}$, $p=p_\alpha$, and $w=q/p$. Write
\[
D_f(q\|p)=\int p(y)\,f(w(y))\,dy.
\]
Since $p$ does not depend on $m_o$,
\[
\nabla_{m_o}D_f(q\|p)=\int p(y)\,f'(w(y))\,\nabla_{m_o}w(y)\,dy=\int f'(w(y))\,\nabla_{m_o}q(y)\,dy.
\]
At $m_o=\mu_o$, $\nabla_{m_o}q(y)=\beta\,p_o(y)\,\Sigma^{-1}(y-\mu_o)$, so
\[
\nabla_{m_o}L_{\mathrm{RL}}^f(\beta,\mu_o,m_n)=\beta\,\Sigma^{-1}\,\mathbb{E}_{Y\sim p_o}\big[f'(w(Y))\,(Y-\mu_o)\big].
\]
Apply Stein's identity (Lemma B.1) with $g(y)=f'(w(y))$ to get
\[
\nabla_{m_o}L_{\mathrm{RL}}^f(\beta,\mu_o,m_n)=\beta\,\mathbb{E}_{p_o}\big[\nabla_y f'(w(Y))\big]=\beta\,\mathbb{E}_{p_o}\big[f''(w(Y))\,\nabla_y w(Y)\big].
\]
Since $\nabla w=w\,(\nabla\log q-\nabla\log p)$, this becomes
\[
\nabla_{m_o}L_{\mathrm{RL}}^f(\beta,\mu_o,m_n)=\beta\,\mathbb{E}_{p_o}\big[\kappa_f(w(Y))\,(\nabla\log q(Y)-\nabla\log p(Y))\big].
\]
The score difference $\nabla\log q-\nabla\log p$ is exactly the same as in the KL proof, yielding the stated decomposition with $A_f=\mathbb{E}_{p_o}[\kappa_f(w)(1-r_o)]$ and $B_f=\mathbb{E}_{p_o}[\kappa_f(w)(1-s_o)]$. If $\kappa_f\le C_f$, then $A_f\le C_f\,\varepsilon_q$ and $B_f\le C_f\,\varepsilon_p$, giving the displayed bound. The overlap bounds on $\varepsilon_q$ and $\varepsilon_p$ are identical to the KL case and follow from Lemma 2.1 and Remark 2.1. ∎

Remark C.1 (What Changes from KL to General f?). Relative to KL (where $\kappa_f\equiv 1$), the only new factor in the old-mean gradient is the curvature weight $\kappa_f(w)=w\,f''(w)$ applied to the score difference. When $\kappa_f$ is bounded, the qualitative message is unchanged: the old-mean drift remains controlled by overlap/misassignment probabilities, which are exponentially small in the separation for Gaussians. If $\kappa_f$ is unbounded, the exact decomposition still holds, but quantitative bounds must track the distribution of $w=q/p_\alpha$ under $Y\sim p_o$.

Remark C.2 (On the Bounded-curvature Assumption). The bounded-curvature condition
\[
\sup_{t>0}\kappa_f(t)=\sup_{t>0}\big(t\,f''(t)\big)<\infty
\]
is a convenient way to ensure that the overlap-gated terms appearing in Theorem C.1(B) can be upper bounded purely by misassignment probabilities (times geometric factors), without having to track additional tail behavior of the density ratio $w(\cdot)=q(\cdot)/p_\alpha(\cdot)$. It holds for several standard, smooth f-divergences with "log-like" curvature, for example:
• KL: $f_{\mathrm{KL}}(t)=t\log t-(t-1)$ gives $\kappa_f(t)\equiv 1$.
• Jensen–Shannon: one generator is $f_{\mathrm{JS}}(t)=t\log t-(t+1)\log\big(\frac{t+1}{2}\big)$, for which $f_{\mathrm{JS}}''(t)=\frac{1}{t(t+1)}$ and hence $\kappa_{f_{\mathrm{JS}}}(t)=\frac{1}{t+1}\le 1$.
• Triangular discrimination: one generator is $f_\triangle(t)=\frac{(t-1)^2}{t+1}$, for which $f_\triangle''(t)=\frac{8}{(t+1)^3}$ and hence $\kappa_{f_\triangle}(t)=\frac{8t}{(t+1)^3}\le\frac{32}{27}$.
By contrast, the condition fails for many popular divergences whose curvature blows up either as $t\downarrow 0$ or $t\uparrow\infty$, e.g.
• Squared Hellinger: $f(t)=(\sqrt t-1)^2$ gives $\kappa_f(t)=\frac{1}{2\sqrt t}$ (unbounded as $t\downarrow 0$).
• Pearson $\chi^2$: $f(t)=(t-1)^2$ gives $\kappa_f(t)=2t$ (unbounded as $t\uparrow\infty$).
• Neyman $\chi^2$: $f(t)=\frac{(1-t)^2}{t}$ gives $\kappa_f(t)=2/t^2$ (unbounded as $t\downarrow 0$).
• Power/$\alpha$-divergences: for $f_\alpha(t)=\frac{t^\alpha-\alpha(t-1)-1}{\alpha(\alpha-1)}$ one has $\kappa_{f_\alpha}(t)=t^{\alpha-1}$, which is unbounded for every $\alpha\ne 1$ (as $t\downarrow 0$ when $\alpha<1$ and as $t\uparrow\infty$ when $\alpha>1$); the only bounded-curvature member of this family is the $\alpha\to 1$ limit (KL).
Finally, some widely used discrepancies (e.g. total variation with $f(t)=\frac12|t-1|$) are not $C^2$ and therefore fall outside the present smooth framework.
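The closed-form curvature weights listed above can be cross-checked against a finite-difference evaluation of $\kappa_f(t)=t\,f''(t)$; a small sketch (the step size and $t$-grid are arbitrary choices):

```python
import numpy as np

def kappa(f, t, h=3e-4):
    # curvature weight kappa_f(t) = t * f''(t), via a central second difference
    return t * (f(t + h) - 2 * f(t) + f(t - h)) / h**2

f_kl = lambda t: t * np.log(t) - (t - 1)
f_js = lambda t: t * np.log(t) - (t + 1) * np.log((t + 1) / 2)
f_tri = lambda t: (t - 1) ** 2 / (t + 1)

ts = np.linspace(0.05, 50, 500)
assert np.allclose([kappa(f_kl, t) for t in ts], 1.0, atol=1e-3)           # KL: kappa == 1
assert np.allclose([kappa(f_js, t) for t in ts], 1 / (ts + 1), atol=1e-3)  # JS: 1/(t+1)
assert max(kappa(f_tri, t) for t in ts) <= 32 / 27 + 1e-6                  # triangular: <= 32/27
```

The KL and Jensen–Shannon weights match their stated closed forms, and the triangular-discrimination weight stays under its $32/27$ ceiling across the grid.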
In all these cases, the exact decomposition of Theorem C.1(B) still holds, but one must bound the weighted terms $\mathbb{E}_{p_o}[\kappa_f(w(Y))(1-r_o(Y))]$ and $\mathbb{E}_{p_o}[\kappa_f(w(Y))(1-s_o(Y))]$ by exploiting additional structure (e.g. explicit tail control of $w$ under $p_o$, clipping/regularization, or refined overlap estimates).

Appendix D Finite K-mode Gaussian mixtures

The two-mode analysis makes the separation between forward- and reverse-KL especially transparent, but real models are typically multi-modal. We therefore extend the picture to a finite K-component Gaussian mixture with shared covariance. The goal is twofold: first, to show that the mode-locality property of reverse-KL persists in the multi-mode setting, with gradients on a matched mode controlled by pairwise overlaps; and second, to show that forward-KL trained on a subset of modes still induces exact weight collapse on the complement. Together, these results demonstrate that the qualitative forward-vs.-reverse-KL contrast is not an artifact of the $K=2$ case.

Lemma D.1 (Linear Independence of Fixed-covariance Gaussian Translates). Fix $\Sigma\succ 0$ and distinct means $\mu_1,\dots,\mu_K\in\mathbb{R}^d$. If coefficients $c_1,\dots,c_K\in\mathbb{R}$ satisfy
\[
\sum_{k=1}^K c_k\,\varphi_\Sigma(y;\mu_k)=0\qquad\forall\,y\in\mathbb{R}^d,
\]
then $c_1=\cdots=c_K=0$.

Proof. Take Fourier transforms. The Fourier transform of $\varphi_\Sigma(\cdot;\mu_k)$ equals $e^{-it^\top\mu_k}\,e^{-\frac12 t^\top\Sigma t}$. Hence the assumption implies
\[
0=\sum_{k=1}^K c_k\,e^{-it^\top\mu_k}\,e^{-\frac12 t^\top\Sigma t}=e^{-\frac12 t^\top\Sigma t}\sum_{k=1}^K c_k\,e^{-it^\top\mu_k}\qquad\forall\,t\in\mathbb{R}^d.
\]
Since $e^{-\frac12 t^\top\Sigma t}>0$, we have $\sum_k c_k\,e^{-it^\top\mu_k}=0$ for all $t$. Choose a vector $v\in\mathbb{R}^d$ such that the scalars $a_k:=v^\top\mu_k$ are pairwise distinct (this holds for all $v$ outside a finite union of hyperplanes). Then for all $s\in\mathbb{R}$,
\[
0=\sum_{k=1}^K c_k\,e^{-isa_k}.
\]
Differentiating $n=0,\dots,K-1$ times at $s=0$ gives the Vandermonde system $\sum_k c_k(-ia_k)^n=0$, whose coefficient matrix is invertible since the $a_k$ are distinct. Hence $c_k=0$ for all $k$. ∎

Lemma D.2 (Pairwise Responsibility Upper Bound in a K-mixture). Let $q(y)=\sum_{\ell=1}^K\beta_\ell f_\ell(y)$ be a mixture of densities with weights $\beta_\ell>0$. Define responsibilities $r_j(y)=\beta_j f_j(y)/q(y)$. Then for any $j\ne k$ and all $y$,
\[
r_j(y)\le\frac{\beta_j f_j(y)}{\beta_j f_j(y)+\beta_k f_k(y)}.
\]
Consequently, for $Y\sim f_k$,
\[
\mathbb{E}[r_j(Y)]\le\frac12\sqrt{\frac{\beta_j}{\beta_k}}\;\mathrm{BC}(f_j,f_k).
\]

Proof. The pointwise bound follows since $q(y)\ge\beta_j f_j(y)+\beta_k f_k(y)$. For the expectation, apply Lemma 2.1 to the two-component mixture $(\beta_j f_j+\beta_k f_k)/(\beta_j+\beta_k)$ and use the definition of the Bhattacharyya coefficient. ∎

Theorem D.1. Fix $\Sigma\succ 0$, distinct means $\mu_1,\dots,\mu_K\in\mathbb{R}^d$, and weights $\alpha\in\Delta^{K-1}$ with $\alpha_k>0$. Define the target mixture
\[
p(y):=\sum_{k=1}^K\alpha_k\,\varphi_\Sigma(y;\mu_k),\qquad s_k(y):=\frac{\alpha_k\,\varphi_\Sigma(y;\mu_k)}{p(y)}.
\]
Let the model be
\[
q(y):=\sum_{k=1}^K\beta_k\,\varphi_\Sigma(y;m_k),\qquad r_k(y):=\frac{\beta_k\,\varphi_\Sigma(y;m_k)}{q(y)},\qquad\beta\in\Delta^{K-1},\ \beta_k>0.
\]
(A) Let $T\subset\{1,\dots,K\}$ be nonempty and define
\[
p_T(y):=\sum_{k\in T}\tilde\alpha_k\,\varphi_\Sigma(y;\mu_k),\qquad\tilde\alpha_k:=\frac{\alpha_k}{\sum_{j\in T}\alpha_j}.
\]
Fix $m_k=\mu_k$ for all $k$ and optimize only $\beta$. Then $\mathrm{KL}(p_T\|q_\beta)$ has the unique minimizer
\[
\beta_k^\star=\begin{cases}\tilde\alpha_k,&k\in T,\\ 0,&k\notin T.\end{cases}
\]
(B) Fix $k$ and assume $m_k=\mu_k$. Then
\[
\nabla_{m_k}\mathrm{KL}(q\|p)=\beta_k\,\Sigma^{-1}\Big(\sum_{j\ne k}\varepsilon^{(q)}_{k\to j}\,(m_j-\mu_k)-\sum_{j\ne k}\varepsilon^{(p)}_{k\to j}\,(\mu_j-\mu_k)\Big),
\]
where
\[
\varepsilon^{(q)}_{k\to j}:=\mathbb{E}_{Y\sim\mathcal{N}(\mu_k,\Sigma)}[r_j(Y)],\qquad\varepsilon^{(p)}_{k\to j}:=\mathbb{E}_{Y\sim\mathcal{N}(\mu_k,\Sigma)}[s_j(Y)].
\]
Moreover,
\[
\varepsilon^{(q)}_{k\to j}\le\frac12\sqrt{\frac{\beta_j}{\beta_k}}\,\exp\Big(-\frac18\,\|m_j-\mu_k\|_{\Sigma^{-1}}^2\Big),\qquad
\varepsilon^{(p)}_{k\to j}\le\frac12\sqrt{\frac{\alpha_j}{\alpha_k}}\,\exp\Big(-\frac18\,\|\mu_j-\mu_k\|_{\Sigma^{-1}}^2\Big).
\]

Proof. Proof of Part (A): If $\mathrm{KL}(p_T\|q_\beta)=0$, then $q_\beta\equiv p_T$ almost everywhere. By Lemma D.1, the Gaussian translates are linearly independent, so coefficients must match exactly, giving the stated $\beta^\star$. Proof of Part (B): Differentiate $\mathrm{KL}(q\|p)$ w.r.t. $m_k$ and apply the same score-difference computation as in the two-mode case. Using $\sum_\ell r_\ell=1$ and $\sum_\ell s_\ell=1$, collect terms to obtain the exact decomposition. The exponential bounds follow from Lemma D.2 and Remark 2.1. ∎

Remark D.1 (Multi-mode Forgetting: Local vs. Global Effects). The multi-mode result reveals how the two forms of forgetting introduced earlier, mass forgetting and component drift, extend beyond the two-mode setting.

Part (A) characterizes mass forgetting under forward-KL objectives. When training data come only from a subset of modes $T$, the forward-KL objective $\mathrm{KL}(p_T\|q_\beta)$ is minimized exactly by allocating mixture mass only to those observed modes. All components $k\notin T$ must receive zero optimal weight, $\beta_k^\star=0$. Thus forward-KL induces mass collapse: any behavior not represented in the training distribution is eliminated at the population optimum. This formalizes catastrophic forgetting in the mixture model as a global reallocation of mixture mass driven by the support of the data distribution.

Part (B) characterizes component drift under reverse-KL objectives. If a component $k$ is already correctly placed ($m_k=\mu_k$), its gradient depends only on pairwise overlaps with the remaining modes through the misassignment probabilities $\varepsilon^{(q)}_{k\to j}$ and $\varepsilon^{(p)}_{k\to j}$. These quantities measure how often samples from mode $k$ are attributed to another mode $j$ under the model or the target. When the modes are well separated, the overlap bounds show that these probabilities decay exponentially in the Mahalanobis separation between modes.
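The overlap bound of Lemma D.2 behind this decay can be checked by Monte Carlo for a small one-dimensional mixture (the weights, means, mode indices, and sample size below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mus = np.array([-4.0, 0.0, 4.0])    # three well-separated unit-variance modes
betas = np.array([0.2, 0.5, 0.3])   # mixture weights (sum to 1)

def phi(y, mu):
    # standard-normal density shifted to mu
    return np.exp(-(y - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

k, j = 1, 2                          # sample from mode k, measure responsibility of mode j
Y = rng.normal(mus[k], 1.0, size=200_000)
q = sum(b * phi(Y, m) for b, m in zip(betas, mus))
r_j = betas[j] * phi(Y, mus[j]) / q

# For unit-variance Gaussians, BC(f_j, f_k) = exp(-(mu_j - mu_k)^2 / 8) exactly.
bc = np.exp(-(mus[j] - mus[k]) ** 2 / 8)
bound = 0.5 * np.sqrt(betas[j] / betas[k]) * bc
assert r_j.mean() <= bound
```

The empirical misassignment probability sits comfortably below the $\tfrac12\sqrt{\beta_j/\beta_k}\,\mathrm{BC}$ ceiling, as the lemma predicts.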
Consequently, the update signal acting on an already-correct component becomes exponentially small, so reverse-KL updates can adjust other modes while inducing only negligible drift on matched ones. In continual-learning terms, this means that previously learned behaviors (represented by correctly matched components) are locally protected. Taken together, the multi-mode analysis reinforces the qualitative contrast observed in the two-mode case. Reverse-KL objectives exhibit mode-local updates that protect matched components up to exponentially small overlap effects, whereas forward-KL objectives induce global mass reallocation, collapsing components absent from the training data.

Appendix E From Gaussians to (Strongly) Log-Concave Location Families

This section shows that the two main mechanisms from the Gaussian analysis persist for much broader log-concave component families. The key difference is that in the Gaussian case the mixture-score differences become constants (linear scores), whereas for general log-concave components we obtain overlap-controlled bounds that depend on (i) a smoothness constant for the score map and (ii) an overlap quantity such as the Bhattacharyya coefficient.

Log-concave Location Family. Let $V:\mathbb{R}^d\to\mathbb{R}$ be $C^2$ and convex, and define the log-concave density
\[
\rho(x)=\frac{1}{Z}\exp(-V(x)),\qquad Z:=\int_{\mathbb{R}^d}e^{-V(x)}\,dx<\infty.
\]
For $\mu\in\mathbb{R}^d$, let the location-shift density be $\rho_\mu(y):=\rho(y-\mu)$. Then each $\rho_\mu$ is log-concave and strictly positive.

Old/New Components and Mixtures. Fix $\mu_o,\mu_n\in\mathbb{R}^d$ and $\alpha\in(0,1)$, and define
\[
p_\alpha(y)=\alpha\,\rho_{\mu_o}(y)+(1-\alpha)\,\rho_{\mu_n}(y).
\]
For parameters $\beta\in(0,1)$ and $m_n\in\mathbb{R}^d$ (with $m_o$ fixed to $\mu_o$), define
\[
q_{\beta,m_n}(y)=\beta\,\rho_{\mu_o}(y)+(1-\beta)\,\rho_{m_n}(y),\qquad
L_{\mathrm{RL}}(\beta,m_n):=\mathrm{KL}\big(q_{\beta,m_n}\,\|\,p_\alpha\big).
\]
Define responsibilities
\[
r_o(y):=\frac{\beta\,\rho_{\mu_o}(y)}{q_{\beta,m_n}(y)},\qquad s_o(y):=\frac{\alpha\,\rho_{\mu_o}(y)}{p_\alpha(y)}.
\]
We start with the following standard identity, with the proof provided for completeness.

Lemma E.1 (Integration-by-parts Identity for Location Parameters). Let $\rho$ be as above and assume additionally that $\rho\in C^1$ and that
\[
\lim_{R\to\infty}\int_{\|y\|=R}\rho_\mu(y)\,|g(y)|\,dS(y)=0
\]
for every $\mu$ and every $C^1$ function $g$ appearing below (this holds, e.g., if $g$ has at most polynomial growth and $\rho_\mu$ has at least exponential tails, which is true for many log-concave families used in practice). Then for any $C^1$ function $g:\mathbb{R}^d\to\mathbb{R}$ with suitable integrability,
\[
\nabla_\mu\int_{\mathbb{R}^d}\rho_\mu(y)\,g(y)\,dy=\int_{\mathbb{R}^d}\rho_\mu(y)\,\nabla_y g(y)\,dy.
\]

Proof. Because $\rho_\mu(y)=\rho(y-\mu)$, we have $\nabla_\mu\rho_\mu(y)=-\nabla_y\rho_\mu(y)$. Differentiating under the integral (justified by dominated convergence under the integrability assumptions) yields
\[
\nabla_\mu\int\rho_\mu(y)\,g(y)\,dy=\int\big(\nabla_\mu\rho_\mu(y)\big)g(y)\,dy=-\int\big(\nabla_y\rho_\mu(y)\big)g(y)\,dy.
\]
Integrate by parts on $\mathbb{R}^d$:
\[
-\int\big(\nabla_y\rho_\mu\big)g\,dy=\int\rho_\mu\,\nabla_y g\,dy-\lim_{R\to\infty}\int_{\|y\|=R}\rho_\mu(y)\,g(y)\,n(y)\,dS(y),
\]
where $n(y)$ is the outward normal. The boundary term vanishes by assumption, giving the claim. ∎

Lemma E.2 (Bhattacharyya Coefficient Bound for Strongly Log-concave Shifts). Assume $V$ is $m$-strongly convex for some $m>0$, i.e. $\nabla^2V(x)\succeq mI$ for all $x$. Then for any $\mu_1,\mu_2\in\mathbb{R}^d$,
\[
\mathrm{BC}(\rho_{\mu_1},\rho_{\mu_2}):=\int_{\mathbb{R}^d}\sqrt{\rho_{\mu_1}(y)\,\rho_{\mu_2}(y)}\,dy\ \le\ \exp\Big(-\frac{m}{8}\,\|\mu_1-\mu_2\|^2\Big).
\]

Proof. Let $\Delta:=\mu_1-\mu_2$ and write $\rho_\mu(y)=Z^{-1}\exp(-V(y-\mu))$. Then
\[
\sqrt{\rho_{\mu_1}(y)\,\rho_{\mu_2}(y)}=\frac{1}{Z}\exp\Big(-\frac{V(y-\mu_1)+V(y-\mu_2)}{2}\Big).
\]
Apply the strong convexity midpoint inequality (equivalent to $\nabla^2V\succeq mI$): for all $u,v$,
\[
V(u)+V(v)\ \ge\ 2V\Big(\frac{u+v}{2}\Big)+\frac{m}{4}\|u-v\|^2.
\]
With $u=y-\mu_1$ and $v=y-\mu_2$, we have $(u+v)/2=y-(\mu_1+\mu_2)/2$ and $u-v=\mu_2-\mu_1=-\Delta$, so
\[
\frac{V(y-\mu_1)+V(y-\mu_2)}{2}\ \ge\ V\Big(y-\frac{\mu_1+\mu_2}{2}\Big)+\frac{m}{8}\|\Delta\|^2.
\]
Therefore
\[
\sqrt{\rho_{\mu_1}(y)\,\rho_{\mu_2}(y)}\le\frac{1}{Z}\exp\Big(-V\Big(y-\frac{\mu_1+\mu_2}{2}\Big)\Big)\exp\Big(-\frac{m}{8}\|\Delta\|^2\Big)=\rho_{(\mu_1+\mu_2)/2}(y)\,e^{-m\|\Delta\|^2/8}.
\]
Integrating over $y$ and using $\int\rho_{(\mu_1+\mu_2)/2}(y)\,dy=1$ yields the bound. ∎

Theorem E.1. Assume $\rho$ is a $C^2$ log-concave density of the form $\rho(x)\propto e^{-V(x)}$ with $V$ convex. Fix $\mu_o\ne\mu_n$ and $\alpha\in(0,1)$, and define $p_\alpha$ and $q_{\beta,m_n}$ as above.

(A) For $\beta\in[0,1]$, define $q_\beta^{\mathrm{new}}:=\beta\,\rho_{\mu_o}+(1-\beta)\,\rho_{\mu_n}$ and
\[
L_{\mathrm{SFT}}(\beta):=\mathrm{KL}\big(\rho_{\mu_n}\,\|\,q_\beta^{\mathrm{new}}\big).
\]
Then $L_{\mathrm{SFT}}(0)=0$ and $L_{\mathrm{SFT}}(\beta)>0$ for every $\beta\in(0,1]$; moreover $L_{\mathrm{SFT}}$ is strictly increasing on $[0,1]$. In particular, the unique minimizer is $\beta^\star=0$.

(B) Assume additionally that $\nabla V$ is $L$-Lipschitz (equivalently, $\nabla^2V(x)\preceq LI$ for all $x$), so that the score map $u(y;\mu):=\nabla_y\log\rho_\mu(y)=-\nabla V(y-\mu)$ satisfies the uniform Lipschitz property
\[
\|u(y;\mu_1)-u(y;\mu_2)\|\le L\,\|\mu_1-\mu_2\|\qquad\forall\,y,\mu_1,\mu_2.
\]
Then $L_{\mathrm{RL}}(\beta,m_n)=\mathrm{KL}(q_{\beta,m_n}\|p_\alpha)$ is differentiable and its gradient with respect to the old location parameter $m_o$ (evaluated at $m_o=\mu_o$) obeys the bound
\[
\big\|\nabla_{m_o}\mathrm{KL}\big(q_{\beta,\mu_o,m_n}\,\|\,p_\alpha\big)\big\|\ \le\ \beta\,L\big(\varepsilon_q(\beta,m_n)\,\|m_n-\mu_o\|+\varepsilon_p(\alpha)\,\|\mu_n-\mu_o\|\big),
\]
where the misassignment probabilities are
\[
\varepsilon_q(\beta,m_n):=\mathbb{E}_{Y\sim\rho_{\mu_o}}\big[1-r_o(Y)\big],\qquad\varepsilon_p(\alpha):=\mathbb{E}_{Y\sim\rho_{\mu_o}}\big[1-s_o(Y)\big].
\]
Moreover, for any densities (no log-concavity needed), Lemma 2.1 implies the overlap bounds
\[
\varepsilon_q(\beta,m_n)\le\frac12\sqrt{\frac{1-\beta}{\beta}}\;\mathrm{BC}(\rho_{\mu_o},\rho_{m_n}),\qquad\varepsilon_p(\alpha)\le\frac12\sqrt{\frac{1-\alpha}{\alpha}}\;\mathrm{BC}(\rho_{\mu_o},\rho_{\mu_n}).
\]
If, in addition, $V$ is $m$-strongly convex for some $m>0$, then by Lemma E.2,
\[
\varepsilon_q(\beta,m_n)\le\frac12\sqrt{\frac{1-\beta}{\beta}}\exp\Big(-\frac{m}{8}\|m_n-\mu_o\|^2\Big),\qquad\varepsilon_p(\alpha)\le\frac12\sqrt{\frac{1-\alpha}{\alpha}}\exp\Big(-\frac{m}{8}\|\mu_n-\mu_o\|^2\Big),
\]
so the old-location gradient is exponentially small in the separation when modes are well separated.

(C) Let $L(\beta,m_n)=\mathrm{KL}(q_{\beta,m_n}\|p_\alpha)$ with $m_o=\mu_o$ fixed. Then $(\beta,m_n)=(\alpha,\mu_n)$ satisfies $q_{\alpha,\mu_n}\equiv p_\alpha$, hence
\[
\partial_\beta L(\beta,m_n)\big|_{(\beta,m_n)=(\alpha,\mu_n)}=0,\qquad
\nabla_{m_n}L(\beta,m_n)\big|_{(\beta,m_n)=(\alpha,\mu_n)}=0,
\]
and $L(\alpha,\mu_n)=0$.

Proof. Proof of Part (A): Write $q_\beta^{\mathrm{new}}=(1-\beta)\rho_{\mu_n}+\beta\rho_{\mu_o}$ and define the likelihood ratio $X(y):=\rho_{\mu_o}(y)/\rho_{\mu_n}(y)$. Under $Y\sim\rho_{\mu_n}$ we have $\mathbb{E}[X(Y)]=\int\rho_{\mu_o}=1$. Then
\[
\mathrm{KL}(\rho_{\mu_n}\|q_\beta^{\mathrm{new}})=\mathbb{E}_{\rho_{\mu_n}}\Big[\log\frac{\rho_{\mu_n}(Y)}{(1-\beta)\rho_{\mu_n}(Y)+\beta\rho_{\mu_o}(Y)}\Big]=-\mathbb{E}\big[\log\big((1-\beta)+\beta X(Y)\big)\big].
\]
By concavity of $\log$,
\[
\mathbb{E}\big[\log\big((1-\beta)+\beta X(Y)\big)\big]\le\log\big((1-\beta)+\beta\,\mathbb{E}[X(Y)]\big)=\log(1)=0,
\]
with strict inequality for every $\beta>0$ because $X(Y)$ is non-constant when $\mu_o\ne\mu_n$ (two distinct shifts of a positive density cannot coincide a.e.). Therefore $L_{\mathrm{SFT}}(0)=0$ and $L_{\mathrm{SFT}}(\beta)>0$ for $\beta>0$. To show strict increase: the map $g(\beta):=\mathbb{E}[\log((1-\beta)+\beta X)]$ is strictly concave in $\beta$ whenever $X$ is non-degenerate, since $\log$ is strictly concave and $(1-\beta)+\beta X$ is affine in $\beta$ with nonzero randomness. Because $g(0)=0$ and $g(\beta)<0$ for all $\beta>0$, strict concavity implies $g$ is strictly decreasing on $[0,1]$. Hence $L_{\mathrm{SFT}}(\beta)=-g(\beta)$ is strictly increasing.

Proof of Part (B): Let $p=p_\alpha$ and $q=q_{\beta,m_n}$, and denote $m_o=\mu_o$.
Using the general identity $\nabla_\theta \mathrm{KL}(q_\theta\,\|\,p) = \int (\nabla_\theta q_\theta)\log(q_\theta/p)$ (as in Theorem 2.3), together with $\nabla_{m_o}\rho_{m_o}(y) = -\nabla_y \rho_{m_o}(y)$ and Lemma E.1, one obtains

$$\nabla_{m_o}\mathrm{KL}(q\,\|\,p) = \beta\int \rho_{\mu_o}(y)\,\nabla_y \log\frac{q(y)}{p(y)}\,dy = \beta\,\mathbb{E}_{Y\sim\rho_{\mu_o}}\!\left[\nabla_y\log\frac{q(Y)}{p(Y)}\right].$$

Next expand the mixture scores:

$$\nabla_y \log q(y) = r_o(y)\,u(y;\mu_o) + (1-r_o(y))\,u(y;m_n),\qquad \nabla_y \log p(y) = s_o(y)\,u(y;\mu_o) + (1-s_o(y))\,u(y;\mu_n).$$

Subtracting gives

$$\nabla_y\log\frac{q(y)}{p(y)} = (1-r_o(y))\big(u(y;m_n)-u(y;\mu_o)\big) - (1-s_o(y))\big(u(y;\mu_n)-u(y;\mu_o)\big).$$

Taking norms and using the Lipschitz bound on $u$ yields

$$\begin{aligned}
\left\|\nabla_{m_o}\mathrm{KL}(q\,\|\,p)\right\| &\le \beta\,\mathbb{E}_{\rho_{\mu_o}}\!\left[(1-r_o(Y))\,\|u(Y;m_n)-u(Y;\mu_o)\|\right] + \beta\,\mathbb{E}_{\rho_{\mu_o}}\!\left[(1-s_o(Y))\,\|u(Y;\mu_n)-u(Y;\mu_o)\|\right]\\
&\le \beta\,L\left(\mathbb{E}[1-r_o(Y)]\,\|m_n-\mu_o\| + \mathbb{E}[1-s_o(Y)]\,\|\mu_n-\mu_o\|\right),
\end{aligned}$$

which is the stated inequality with $\varepsilon_q$ and $\varepsilon_p$. The bounds on $\varepsilon_q$ and $\varepsilon_p$ in terms of $\mathrm{BC}$ follow directly from Lemma 2.1. Under strong convexity, apply Lemma E.2 to the relevant shifted pairs.

Proof of Part (C): At $(\beta,m_n) = (\alpha,\mu_n)$ we have $q_{\alpha,\mu_n} \equiv p_\alpha$, so $\log(q/p) \equiv 0$. Therefore any gradient formula of the form $\nabla_\theta \mathrm{KL}(q_\theta\,\|\,p) = \int (\nabla_\theta q_\theta)\log(q_\theta/p)$ evaluates to zero. Also $\mathrm{KL}(q\,\|\,p) \ge 0$ with equality iff $q = p$ a.e., so $L(\alpha,\mu_n) = 0$. ∎

Remark E.1 (Relation to the Gaussian bounds). For Gaussians with covariance $\Sigma$, one may take $V(x) = \frac12 x^\top \Sigma^{-1} x$, so $m = \lambda_{\min}(\Sigma^{-1})$ and $L = \lambda_{\max}(\Sigma^{-1})$. Then Lemma E.2 recovers the familiar Gaussian overlap decay $\mathrm{BC} \le \exp(-\|\mu_1-\mu_2\|_{\Sigma^{-1}}^2/8)$ up to replacing the Mahalanobis norm by its spectral bounds.
Moreover the Gaussian score is linear, which strengthens Part (B) from a bound to the exact identity derived earlier.

E.1 Local PL Geometry and Exponential Convergence for Strongly Log-concave Mixtures

We also provide a qualitative local PL condition, along with exponential convergence, in this case. We start with the following standard result, whose proof we include for completeness.

Lemma E.3 (Fisher identity for strongly log-concave location families). Let $\rho(x) = Z^{-1}e^{-V(x)}$ on $\mathbb{R}^d$ with $V \in C^2$ and $\int e^{-V} < \infty$. Assume integration by parts is valid (e.g. $V$ grows superlinearly so that boundary terms vanish). If $X \sim \rho$, then

$$\mathbb{E}\!\left[\nabla V(X)\,\nabla V(X)^\top\right] = \mathbb{E}\!\left[\nabla^2 V(X)\right].$$

In particular, if $\nabla^2 V(x) \succeq mI$ for all $x$ (i.e. $V$ is $m$-strongly convex), then $\mathbb{E}[\nabla V(X)\,\nabla V(X)^\top] \succeq mI$.

Proof. For each $i,j$, apply integration by parts with density $\rho$:

$$0 = \int \partial_i\!\left(e^{-V(x)}\,\partial_j V(x)\right)dx = \int e^{-V(x)}\left(\partial_{ij}V(x) - \partial_i V(x)\,\partial_j V(x)\right)dx.$$

Divide by $Z$ to obtain $\mathbb{E}[\partial_{ij}V(X)] = \mathbb{E}[\partial_i V(X)\,\partial_j V(X)]$. ∎

Theorem E.2. Let $\rho(x) = Z^{-1}e^{-V(x)}$ on $\mathbb{R}^d$, where $V \in C^3$ is $m$-strongly convex ($\nabla^2 V \succeq mI$) and $L$-smooth ($\nabla^2 V \preceq LI$). Assume $\mathbb{E}_{X\sim\rho}\!\left[\|\nabla V(X)\|^4\right] < \infty$. For $\mu \in \mathbb{R}^d$ define the shifted density $\rho_\mu(y) := \rho(y-\mu)$. Fix $\mu_o \neq \mu_n$ and $\alpha \in (0,1)$, and define

$$p_\alpha(y) := \alpha\,\rho_{\mu_o}(y) + (1-\alpha)\,\rho_{\mu_n}(y).$$

Fix $m_o = \mu_o$ and parameterize the model by $\theta = (\phi,m) \in \mathbb{R}\times\mathbb{R}^d$:

$$\beta(\phi) = \sigma(\phi),\qquad q_\theta(y) = \beta(\phi)\,\rho_{\mu_o}(y) + (1-\beta(\phi))\,\rho_m(y),\qquad L(\theta) = \mathrm{KL}(q_\theta\,\|\,p_\alpha).$$

Let $\theta^\star = (\phi^\star,m^\star)$ with $\phi^\star = \log\frac{\alpha}{1-\alpha}$ and $m^\star = \mu_n$, so $q_{\theta^\star} \equiv p_\alpha$ and $L(\theta^\star) = 0$.
(A) $L$ is $C^2$ in a neighborhood of $\theta^\star$ and

$$H^\star := \nabla^2 L(\theta^\star) = \mathbb{E}_{Y\sim p_\alpha}\!\left[s(Y)\,s(Y)^\top\right],$$

where the score vector is

$$s(Y) = \begin{pmatrix} r_o^\star(Y) - \alpha\\[2pt] r_n^\star(Y)\,\nabla V(Y-\mu_n)\end{pmatrix},\qquad r_o^\star(y) = \frac{\alpha\,\rho_{\mu_o}(y)}{p_\alpha(y)},\quad r_n^\star = 1 - r_o^\star.$$

(B) Let $\Delta := \mu_n - \mu_o$ and define the overlap proxy

$$\rho_{\mathrm{sep}} := \exp\!\left(-\frac{m}{8}\,\|\Delta\|^2\right),$$

which upper bounds $\mathrm{BC}(\rho_{\mu_o},\rho_{\mu_n})$ by Lemma E.2. Define

$$\varepsilon_{o\to n} := \frac12\sqrt{\frac{1-\alpha}{\alpha}}\,\rho_{\mathrm{sep}},\qquad \varepsilon_{n\to o} := \frac12\sqrt{\frac{\alpha}{1-\alpha}}\,\rho_{\mathrm{sep}},\qquad v := \sqrt{\alpha(1-\alpha)}\,\rho_{\mathrm{sep}}.$$

Let $G_2 := \mathbb{E}_{Y\sim p_\alpha}[\|\nabla V(Y-\mu_n)\|^2]$ and $G_4 := \mathbb{E}_{X\sim\rho}[\|\nabla V(X)\|^4]$. Then

$$\lambda_{\min}(H^\star) \;\ge\; \min\left\{\alpha(1-\alpha) - v,\;\; (1-\alpha)\,m - 2(1-\alpha)\sqrt{\varepsilon_{n\to o}\,G_4}\right\} \;-\; 3\sqrt{v\,G_2}.$$

In particular, for $\|\Delta\|$ large enough (so that the right-hand side is positive), $\lambda_{\min}(H^\star) > 0$.

(C) If $\mu^\star := \lambda_{\min}(H^\star) > 0$, then there exists $\varepsilon > 0$ such that on the sublevel set $\{\theta : L(\theta) \le \varepsilon\}$ the Polyak–Łojasiewicz inequality holds:

$$\|\nabla L(\theta)\|^2 \;\ge\; \frac{\mu^\star}{2}\,L(\theta).$$

Consequently, any gradient-flow solution $\dot\theta(t) = -\nabla L(\theta(t))$ with $L(\theta(0)) \le \varepsilon$ satisfies the exponential rate

$$L(\theta(t)) \le L(\theta(0))\,e^{-(\mu^\star/2)\,t}\qquad \forall\, t \ge 0.$$

Proof. Proof of Part (A): The Fisher/Hessian identity at $\theta^\star$ is standard for smooth parametric families: since $q_{\theta^\star} = p_\alpha$, one has $L(\theta) = \mathrm{KL}(q_\theta\,\|\,q_{\theta^\star})$ and hence $\nabla^2 L(\theta^\star) = \mathbb{E}_{q_{\theta^\star}}\!\left[\nabla\log q_{\theta^\star}\,\nabla\log q_{\theta^\star}^\top\right]$ (gradients taken in $\theta$ at $\theta^\star$), provided differentiation under the integral is justified (here ensured by smoothness and log-concave tails). The score components follow from: (i) $\partial_\phi \log q_\theta = r_o - \beta(\phi)$ for a two-component mixture, with $\beta(\phi^\star) = \alpha$ at $\theta^\star$; (ii) $\nabla_m \log q_\theta(y) = r_n(y)\,\nabla_m \log\rho_m(y)$ and $\nabla_m \log\rho_m(y) = \nabla V(y-m)$ for location shifts $\rho_m(y) = Z^{-1}e^{-V(y-m)}$.
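As a numerical sanity check (not part of the paper), the two score components just derived can be verified by finite differences for a 1D standard Gaussian location family, where $V(x) = x^2/2$ and hence $\nabla V(x) = x$; the values of $\phi$, $m$, and the test points below are illustrative:

```python
import numpy as np

# Finite-difference check of the score components of a two-component location
# mixture q_theta = beta(phi) * rho_0 + (1 - beta(phi)) * rho_m with rho = N(0, 1):
#   d/dphi log q_theta = r_o - beta(phi),   d/dm log q_theta = r_n * (y - m).
def gauss(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def log_q(phi, m, yv):
    beta = 1.0 / (1.0 + np.exp(-phi))
    return np.log(beta * gauss(yv) + (1 - beta) * gauss(yv - m))

phi, m = 0.3, 4.0                       # illustrative parameter values
beta = 1.0 / (1.0 + np.exp(-phi))
yv = np.array([-1.0, 0.5, 2.0, 3.5, 5.0])
r_o = beta * gauss(yv) / (beta * gauss(yv) + (1 - beta) * gauss(yv - m))
r_n = 1 - r_o

h = 1e-6                                # central finite differences
d_phi = (log_q(phi + h, m, yv) - log_q(phi - h, m, yv)) / (2 * h)
d_m = (log_q(phi, m + h, yv) - log_q(phi, m - h, yv)) / (2 * h)

assert np.allclose(d_phi, r_o - beta, atol=1e-6)
assert np.allclose(d_m, r_n * (yv - m), atol=1e-6)
```

Both identities hold at every test point, including points deep in either mode where one responsibility is numerically zero.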
Proof of Part (B): As in the Gaussian proof of Theorem 2.3.1, write

$$H^\star = \begin{pmatrix} A & B^\top\\ B & C \end{pmatrix}.$$

Introduce the latent label $Z \in \{0,1\}$ with $\Pr(Z=1) = \alpha$ and $Y\,|\,Z=1 \sim \rho_{\mu_o}$, $Y\,|\,Z=0 \sim \rho_{\mu_n}$; then $r_o^\star(Y) = \mathbb{E}[Z\,|\,Y]$ and $e := r_o^\star - Z$ satisfies $\mathbb{E}[e^2] = \mathbb{E}[\mathrm{Var}(Z\,|\,Y)] = \mathbb{E}[r_o^\star(1-r_o^\star)]$.

(i) Bound $A$. Exactly as before,

$$A = \mathrm{Var}(r_o^\star) = \alpha(1-\alpha) - \mathbb{E}[e^2].$$

Using $r(1-r) \le 1-r$ and $r(1-r) \le r$, we obtain $\mathbb{E}[e^2] \le \alpha\,\mathbb{E}[1-r_o^\star\,|\,Z=1] + (1-\alpha)\,\mathbb{E}[r_o^\star\,|\,Z=0]$. By Lemma 2.1 applied to the two-component mixture $p_\alpha$ and Lemma E.2 (via $\mathrm{BC} \le \rho_{\mathrm{sep}}$), these conditional expectations are bounded by $\varepsilon_{o\to n}$ and $\varepsilon_{n\to o}$ respectively, yielding $\mathbb{E}[e^2] \le v$ and hence $A \ge \alpha(1-\alpha) - v$.

(ii) Bound $\lambda_{\min}(C)$. Here $s_m = r_n^\star(Y)\,\nabla V(Y-\mu_n) = (1-r_o^\star)\,\nabla V(Y-\mu_n)$, so

$$C = \mathbb{E}\!\left[(1-r_o^\star)^2\,\nabla V(Y-\mu_n)\,\nabla V(Y-\mu_n)^\top\right] \succeq (1-\alpha)\,\mathbb{E}\!\left[(1-r_o^\star)^2\,G(Y)\,G(Y)^\top \,\middle|\, Z=0\right],$$

where $G(Y) := \nabla V(Y-\mu_n)$. On $\{Z=0\}$, $(1-r_o^\star)^2 \ge 1 - 2r_o^\star$, hence

$$\mathbb{E}\!\left[(1-r_o^\star)^2\,G G^\top \,\middle|\, Z=0\right] \succeq \mathbb{E}\!\left[G G^\top \,\middle|\, Z=0\right] - 2\,\mathbb{E}\!\left[r_o^\star\,G G^\top \,\middle|\, Z=0\right].$$

Now $Y - \mu_n \sim \rho$ given $Z=0$, so $\mathbb{E}[G G^\top\,|\,Z=0] = \mathbb{E}_{X\sim\rho}[\nabla V(X)\,\nabla V(X)^\top] \succeq mI$ by Lemma E.3. Also $\mathbb{E}[r_o^\star\,G G^\top\,|\,Z=0]$ is PSD and

$$\left\|\mathbb{E}[r_o^\star\,G G^\top\,|\,Z=0]\right\|_2 \le \mathbb{E}[r_o^\star\,\|G\|^2\,|\,Z=0] \le \sqrt{\mathbb{E}[r_o^\star\,|\,Z=0]}\,\sqrt{\mathbb{E}[\|G\|^4\,|\,Z=0]} \le \sqrt{\varepsilon_{n\to o}\,G_4},$$

using $(r_o^\star)^2 \le r_o^\star$ and the definition of $G_4$. Therefore $\lambda_{\min}(C) \ge (1-\alpha)m - 2(1-\alpha)\sqrt{\varepsilon_{n\to o}\,G_4}$.

(iii) Bound $\|B\|_2$. Now $B = \mathbb{E}[(r_o^\star - \alpha)(1-r_o^\star)\,G(Y)]$. Repeating the algebra from the Gaussian case shows

$$B = \mathbb{E}[\Delta(Y)\,G(Y)],\qquad |\Delta(Y)| \le 3\,|e(Y)|.$$

Hence $\|B\|_2 \le 3\,\mathbb{E}[|e|\,\|G\|] \le 3\sqrt{\mathbb{E}[e^2]}\sqrt{\mathbb{E}[\|G\|^2]} \le 3\sqrt{v\,G_2}$.

(iv) Conclude. As before, Weyl's inequality yields $\lambda_{\min}(H^\star) \ge \min\{A,\lambda_{\min}(C)\} - \|B\|_2$, giving the stated bound.
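The lower bound of Part (B) can be checked against a direct quadrature computation of $H^\star$ in the 1D standard-Gaussian case (a numerical sketch, not from the paper; the choices $\alpha = 1/2$ and $\Delta = 10$ are illustrative, picked large enough that the bound is positive, and the constants follow the reconstructed definitions of $\varepsilon_{n\to o}$, $v$, $G_2$, $G_4$ above):

```python
import numpy as np

# Quadrature check of the lambda_min(H*) lower bound for rho = N(0, 1), i.e.
# V(x) = x^2/2, so m = 1 and grad V(x) = x.
alpha, mu_o, mu_n, m = 0.5, 0.0, 10.0, 1.0
delta = mu_n - mu_o

y = np.linspace(mu_o - 8.0, mu_n + 8.0, 260001)
dy = y[1] - y[0]
rho_o = np.exp(-(y - mu_o)**2 / 2) / np.sqrt(2 * np.pi)
rho_n = np.exp(-(y - mu_n)**2 / 2) / np.sqrt(2 * np.pi)
p = alpha * rho_o + (1 - alpha) * rho_n

# Lemma E.2: the Bhattacharyya coefficient is dominated by rho_sep.
rho_sep = np.exp(-m * delta**2 / 8)
bc = np.sum(np.sqrt(rho_o * rho_n)) * dy
assert bc <= rho_sep + 1e-8

# Responsibility r_o*(y), computed stably through the log-likelihood ratio.
t = np.log((1 - alpha) / alpha) + delta * (y - (mu_o + mu_n) / 2)
r_o = 1.0 / (1.0 + np.exp(np.clip(t, -700.0, 700.0)))

# H* = E_{p_alpha}[s s^T] with s(y) = (r_o*(y) - alpha, (1 - r_o*(y)) (y - mu_n)).
s1 = r_o - alpha
s2 = (1 - r_o) * (y - mu_n)
H = np.array([[np.sum(p * s1 * s1), np.sum(p * s1 * s2)],
              [np.sum(p * s2 * s1), np.sum(p * s2 * s2)]]) * dy
lam_min = np.linalg.eigvalsh(H)[0]

# Part (B) lower bound, with G2 and G4 in closed form for N(0, 1).
eps_no = 0.5 * np.sqrt(alpha / (1 - alpha)) * rho_sep
v = np.sqrt(alpha * (1 - alpha)) * rho_sep
G2 = alpha * (delta**2 + 1) + (1 - alpha)   # E_{p_alpha}[(Y - mu_n)^2]
G4 = 3.0                                    # E[X^4] for X ~ N(0, 1)
bound = min(alpha * (1 - alpha) - v,
            (1 - alpha) * m - 2 * (1 - alpha) * np.sqrt(eps_no * G4)) - 3 * np.sqrt(v * G2)

print(f"lambda_min(H*) = {lam_min:.4f}, lower bound = {bound:.4f}")
assert bound > 0 and lam_min >= bound
```

At this separation $\lambda_{\min}(H^\star)$ is essentially $\alpha(1-\alpha) = 0.25$ (the $\phi$-block), and the reconstructed bound sits just below it, as expected from the exponentially small overlap terms.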
Proof of Part (C): If $\mu^\star = \lambda_{\min}(H^\star) > 0$, continuity of the Hessian implies that in some neighborhood $U$ of $\theta^\star$, $\nabla^2 L(\theta) \succeq \frac{\mu^\star}{2} I$, i.e. $L$ is $(\mu^\star/2)$-strongly convex on $U$. Strong convexity implies the PL inequality $\|\nabla L(\theta)\|^2 \ge (\mu^\star/2)\,(L(\theta) - L(\theta^\star)) = (\mu^\star/2)\,L(\theta)$ on $U$. Choose $\varepsilon > 0$ so that the sublevel set $\{L \le \varepsilon\} \subset U$. Along the gradient flow,

$$\frac{d}{dt} L(\theta(t)) = -\|\nabla L(\theta(t))\|^2,$$

and combining with the PL inequality and integrating yields $L(\theta(t)) \le L(\theta(0))\,e^{-(\mu^\star/2)\,t}$ for all $t \ge 0$. ∎
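Part (C) can be illustrated by discretizing the gradient flow on $L(\theta) = \mathrm{KL}(q_\theta\,\|\,p_\alpha)$ for a 1D Gaussian instance and watching the loss decay geometrically (a numerical sketch outside the paper; the step size, initialization, grid, and the separation $\Delta = 4$ are illustrative, and gradients are finite differences on a quadrature approximation of the KL):

```python
import numpy as np

# Gradient descent as a discretized gradient flow on L(theta) = KL(q_theta || p_alpha)
# for rho = N(0, 1), theta = (phi, m), beta(phi) = sigmoid(phi).
alpha, mu_o, mu_n = 0.5, 0.0, 4.0
y = np.linspace(-8.0, 12.0, 40001)
dy = y[1] - y[0]

def gauss(mu):
    return np.exp(-(y - mu)**2 / 2) / np.sqrt(2 * np.pi)

rho_o = gauss(mu_o)
p = alpha * rho_o + (1 - alpha) * gauss(mu_n)

def L(theta):
    phi, mm = theta
    beta = 1.0 / (1.0 + np.exp(-phi))
    q = beta * rho_o + (1 - beta) * gauss(mm)
    return np.sum(q * np.log(q / p)) * dy

def grad(theta, h=1e-5):                 # central finite differences
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = h
        g[i] = (L(theta + e) - L(theta - e)) / (2 * h)
    return g

theta = np.array([0.8, mu_n + 1.0])      # perturbed start near theta* = (logit(1/2), mu_n)
losses = [L(theta)]
for _ in range(400):
    theta -= 0.1 * grad(theta)
    losses.append(L(theta))

print(f"L(0) = {losses[0]:.4f}, L(200) = {losses[200]:.2e}, L(400) = {losses[-1]:.2e}")
assert losses[-1] < losses[200] < losses[0]   # steady decrease along the flow
assert losses[-1] < 1e-3 * losses[0]          # geometric convergence toward 0
assert abs(theta[1] - mu_n) < 0.05            # m(t) -> mu_n: the old weight recovers, no collapse
```

The loss shrinks by several orders of magnitude at an approximately constant per-step ratio, which is the discrete analogue of the $e^{-(\mu^\star/2)t}$ rate; starting outside the PL sublevel set, or with an overlap so large that $\lambda_{\min}(H^\star)$ is not certifiably positive, this behavior is no longer guaranteed.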