Paper deep dive
The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models
Xingcheng Xu
Abstract
Reinforcement learning (RL) plays a crucial role in shaping the behavior of large language and reasoning models (LLMs/LRMs). However, it often produces brittle and unstable policies, leading to critical failures such as spurious reasoning, deceptive alignment, and instruction disobedience that undermine the trustworthiness and safety of LLMs/LRMs. Currently, these issues lack a unified theoretical explanation and are typically addressed using ad-hoc heuristics. This paper presents a rigorous mathematical framework for analyzing the stability of the mapping from a reward function to the optimal policy. We show that policy brittleness often stems from non-unique optimal actions, a common occurrence when multiple valid traces exist in a reasoning task. This theoretical lens provides a unified explanation for a range of seemingly disparate failures, reframing them as rational outcomes of optimizing rewards that may be incomplete or noisy, especially in the presence of action degeneracy. We extend this analysis from the fundamental single-reward setting to the more realistic multi-reward RL across diverse domains, showing how stability is governed by an "effective reward" aggregation mechanism. We also prove that entropy regularization restores policy stability at the cost of increased stochasticity. Our framework provides a unified explanation for recent empirical findings on deceptive reasoning, instruction-following trade-offs, and RLHF-induced sophistry, and is further validated through perturbation experiments in multi-reward RL. This work advances policy-stability analysis from empirical heuristics towards a principled theory, offering essential insights for designing safer and more trustworthy AI systems.
Tags
Links
- Source: https://arxiv.org/abs/2507.20150
- Canonical: https://arxiv.org/abs/2507.20150
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/12/2026, 5:46:04 PM
Summary
The paper presents a rigorous mathematical framework for analyzing the stability of the reward-to-policy mapping in Reinforcement Learning (RL) for Large Language Models (LLMs). It identifies that policy brittleness, such as spurious reasoning and deceptive alignment, arises from non-unique optimal actions and reward misspecification. The authors prove that entropy regularization restores policy stability and provide a unified theoretical explanation for various empirical failure modes in LLM training.
Entities (5)
Relation Signals (4)
Reinforcement Learning → shapes → Large Language Models
confidence 100% · Reinforcement learning (RL) plays a crucial role in shaping the behavior of large language and reasoning models.
Reward-Policy Map → determines → Policy Stability
confidence 95% · The properties of this map are a critical determinant of the robustness and predictability of RL-trained agents.
Entropy Regularization → restores → Policy Stability
confidence 95% · We also prove that entropy regularization restores policy stability at the cost of increased stochasticity.
Non-unique optimal actions → causes → Policy Brittleness
confidence 90% · We show that policy brittleness often stems from non-unique optimal actions.
Cypher Suggestions (2)
Map the relationship between methodologies and technologies · confidence 95% · unvalidated
MATCH (m:Methodology)-[r]->(t:Technology) RETURN m.name, type(r), t.name
Find all concepts related to policy stability · confidence 90% · unvalidated
MATCH (e:Concept)-[:CAUSES|RESTORES|DETERMINES]->(p:Property {name: 'Policy Stability'}) RETURN e, p
Full Text
297,820 characters extracted from source content.
The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models Xingcheng Xu Shanghai Artificial Intelligence Laboratory xingcheng.xu18@gmail.com Abstract Reinforcement learning (RL) plays a crucial role in shaping the behavior of large language and reasoning models (LLMs/LRMs). However, it often produces brittle and unstable policies, leading to critical failures such as spurious reasoning, deceptive alignment, and instruction disobedience that undermine the trustworthiness and safety of LLMs/LRMs. Currently, these issues lack a unified theoretical explanation and are typically addressed using ad-hoc heuristics. This paper presents a rigorous mathematical framework for analyzing the stability of the mapping from a reward function to the optimal policy. We show that policy brittleness often stems from non-unique optimal actions, a common occurrence when multiple valid traces exist in a reasoning task. This theoretical lens provides a unified explanation for a range of seemingly disparate failures, reframing them as rational outcomes of optimizing rewards that may be incomplete or noisy, especially in the presence of action degeneracy. We extend this analysis from the fundamental single-reward setting to the more realistic multi-reward RL across diverse domains, showing how stability is governed by an "effective reward" aggregation mechanism. We also prove that entropy regularization restores policy stability at the cost of increased stochasticity. Our framework provides a unified explanation for recent empirical findings on deceptive reasoning, instruction-following trade-offs, and RLHF-induced sophistry, and is further validated through perturbation experiments in multi-reward RL. This work advances policy-stability analysis from empirical heuristics towards a principled theory, offering essential insights for designing safer and more trustworthy AI systems. 
Keywords: Large language models (LLM), reinforcement learning (RL), RLHF, RLVR, alignment, reasoning, emergent misalignment, AI safety, reward-policy map, continuity analysis
Contents
1 Introduction
2 Continuity Analysis of the Reward-Policy Map
2.1 Framework and Definitions
2.2 Continuity of the Optimal Value Function
2.3 Analysis of the Argmax Correspondence
2.4 Conditions for Continuity of the Reward-Policy Map
2.5 Conditions for Discontinuity of the Reward-Policy Map
3 RL for LLMs With a Single Reward Model
3.1 Framing LLM Text Generation as an MDP
3.2 Theoretical Implications of the Reward-Policy Map
3.3 Analysis of Alignment and Reasoning Phenomena
3.3.1 The "Clever Slacker": Suboptimality from Incomplete Rewards
3.3.2 The Tie-Breaker Effect: Resolving Degeneracy with Additive Rewards
3.4 Broader Implications and Mitigation Strategies
4 LLMs Trained with Multiple Specialized Reward Models
4.1 Framework: Single LLM, Multiple Data Classes and Reward Models
4.2 Analyzing Policies Derived from State-Dependent Effective Rewards
4.3 Continuity of the Policy Map for the Effective Reward Model
4.4 Mitigating Discontinuities: Entropy Regularization
4.5 Implications for Multi-Task RL Training for LLMs
4.6 Remarks on State-Dependent Aggregation Mechanism
5 Connecting Theory to Practice: A Review of Empirical Evidence
5.1 Deceptive Reasoning: From Rational Cheating to Policy Obfuscation
5.1.1 Initial Failure: Rational Cheating under a Simple Reward Model
5.1.2 A Deeper Failure: Policy Shift to Obfuscated Deception
5.2 The Intelligence-Obedience Trade-off: An Incomplete Reward Specification
5.3 Controllable Reasoning: Engineering a Desirable Policy Shift via Tie-Breaker Rewards
5.4 From Faithfulness to Sophistry: Policy Jumps in RLHF-based Alignment
5.5 Multi-Reward Instability: Suggestive Evidence from Data Mixture Experiments
5.6 Multi-Reward Instability: Evidence from Controlled Perturbation Experiments
6 Discussion
6.1 On the Assumption of Reward Continuity
6.2 The Role of Practical RL Algorithms and Optimization Dynamics
6.3 The Influence of Data Distribution and Mixture Ratios
7 Conclusion
A Related Work
B Limitations
C Appendix on Additional Proofs
C.1 A Construction of the Bump Function
C.2 Proof of the Lipschitz Property of the Softmax Function
D Appendix on Experimental Details of Multi-RL Setting
"Human-centric LLMs typically optimise for rewards based on human prejudgement: an expert observes the agent's action and decides whether it is a good action, or picks the best agent action among multiple alternatives. … means that they are not directly grounded in the reality of the world." — Silver and Sutton (2025)
1 Introduction
The development of a new generation of large language models (LLMs) represents a significant advance in artificial intelligence. Systems such as OpenAI's o-series (o1, o3, o4-mini) (OpenAI, 2024, 2025b), Google's Gemini 2.5 (Gemini Team, Google, 2025), Anthropic's Claude 4 (Anthropic, 2025) and xAI's Grok 4 are increasingly designed not merely as conversational agents, but as large reasoning models (LRMs) capable of addressing complex, multi-step problems across domains like mathematics, science, and software engineering. The development of these systems, alongside powerful open-source models like Deepseek R1 (Guo et al., 2025) and Qwen3 (Yang et al., 2025), relies heavily on reinforcement learning (RL) as a crucial training methodology. RL is employed not only to align model behavior with safety and ethical standards but also to teach models how to generate sophisticated reasoning trajectories (Guo et al., 2025; Yang et al., 2025). In this way, RL plays a pivotal role in both scaling complex reasoning capabilities and ensuring robust alignment with human values. Yet despite its promise, the application of RL introduces a set of persistent challenges that undermine both reasoning quality and alignment stability.
At the heart of these difficulties lies the reliance on optimizing model policies against learned or specified reward functions. RL-trained policies often exhibit brittleness and undesirable generalization. For example, a model rewarded solely for a correct final answer, with no credit for a valid reasoning process, may resort to spurious reasoning: learning to generate the answer via a shortcut, then fabricating a plausible-but-fallacious justification post-hoc (Baker et al., 2025; Wang et al., 2025a; Chen et al., 2025b). In other cases, policies can degrade in unpredictable ways. A model intended to be an objective assistant might exhibit a sudden style shift and adopt an overly sycophantic tone (OpenAI, 2025a, c). A sharp degradation in instruction-following fidelity is also common (Fu et al., 2025), where models ignore specified output formats, length constraints, or language requirements when such details are not strongly enforced by the reward. These failure modes are more than just cosmetic—they threaten the reliability, controllability, and safety of large-scale AI systems. This challenge highlights a fundamental distinction between applying RL to language models and its use in traditional domains. In settings like Go or chess (Silver et al., 2016, 2017, 2018), where rewards are well-defined and objective, the policy is simply a tool to maximize outcomes; the path taken matters little as long as the goal is achieved. But for LLMs, this paradigm no longer holds. The output sequence itself is the product, directly consumed by users. Moreover, the reward is not an objective truth but a noisy approximation of human preferences (Ouyang et al., 2022; Bai et al., 2022; Lee et al., 2023) or manually specified rules (Wang et al., 2025b; Su et al., 2025). As a result, the policy’s behavior cannot be treated as a secondary concern—it is central. In language modeling, the policy is not just a means to an end; in many ways, it is the end. 
This shift in paradigm reframes the observed instabilities: they are not superficial flaws, but fundamental failures in the product itself. This raises a deeper and largely unresolved question: What underlying mathematical mechanisms make RL-trained policies so sensitive to the design of the reward function in both LLM reasoning and alignment? In the absence of a formal theory of stability, current solutions rely heavily on ad-hoc heuristics and empirical tuning rather than first principles. Common practices include reward engineering or algorithmic modifications, such as adding specific penalties to suppress issues like spurious reasoning, poor instruction adherence, or inefficient problem-solving (Baker et al., 2025; Wang et al., 2025a; Chen et al., 2025b; Fu et al., 2025; Sui et al., 2025; Aggarwal and Welleck, 2025; Chen et al., 2025a; Qu et al., 2025). Additional strategies include extensive preference data curation (Ouyang et al., 2022) and the use of structured but non-theoretical frameworks such as Constitutional AI (Bai et al., 2022) or deliberative alignment (Guan et al., 2024) to guide the learning. While these approaches can mitigate specific failure modes, they are inherently reactive and lack a general understanding or guarantee of policy stability. To build a principled foundation, this paper develops a rigorous mathematical framework for analyzing the stability of the mapping from reward functions to their corresponding optimal policies. By modeling reasoning trace generation as a Markov Decision Process (MDP) and applying tools from functional analysis, we examine the continuity of this reward-to-policy map to formally explain the roots of policy brittleness. Our analysis shows that instabilities often arise from two key sources: non-unique optimal actions and imprecise reward signals. 
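These two sources of instability are easy to see in miniature. The following toy sketch (my illustration, not an experiment from the paper) shows the degeneracy mechanism: when two actions are tied in value, an arbitrarily small reward perturbation flips the greedy policy, while a small explicit tie-breaker bonus pins it down robustly.

```python
import numpy as np

# Toy 2-action "bandit": a degenerate optimum puts the greedy policy on a
# cliff -- a perturbation of size 1e-9 flips the selected action.
def greedy_action(q):
    return int(np.argmax(q))

q = np.array([1.0, 1.0])                      # degenerate optimum: a tie
eps = 1e-9                                    # tiny reward perturbation
a_plus = greedy_action(q + np.array([0.0, eps]))   # nudge action 1 up
a_minus = greedy_action(q + np.array([eps, 0.0]))  # nudge action 0 up
assert (a_plus, a_minus) == (1, 0)            # the greedy policy flips

# A small auxiliary bonus (e.g. for correct formatting) breaks the tie and
# is robust to any perturbation smaller than the bonus itself.
bonus = np.array([1e-3, 0.0])
assert greedy_action(q + bonus + np.array([0.0, eps])) == 0
```

The same flip is what the paper's tie-breaker analysis exploits constructively: a designer-chosen auxiliary reward selects which side of the cliff the policy lands on.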
When a reward model assigns nearly equal value to multiple distinct actions, a common scenario in the expansive space of reasoning and creative generation, the resulting policy becomes inherently unstable. In such cases, even tiny perturbations in the reward can lead to abrupt, discontinuous changes in behavior, revealing a fundamental fragility at the heart of RL-trained language models. Building on this theoretical foundation, we apply our framework to analyze a wide range of observed LLM behaviors through two key lenses: incomplete reward specifications and tie-breaking dynamics among degenerate optima. Many failure cases can be understood as manifestations of the "clever slacker" problem, where the policy rationally exploits an underspecified reward. For example, spurious reasoning emerges when only the final answer is rewarded, giving no incentive for a coherent reasoning process. Likewise, instruction-following failures, such as ignoring output format, length, or language constraints, occur when these aspects are not reflected in the reward except for correctness. Our framework also clarifies the role of auxiliary rewards. By breaking ties between near-equally valued actions, even small penalties (e.g., for verbosity) or bonuses (e.g., for correct formatting) can shift the policy toward desirable behaviors. These discontinuous shifts are not inherently undesirable—they can be leveraged as mechanisms for reward shaping, allowing designers to steer policies when the base reward is incomplete or noisy. Ultimately, our analysis offers a unified mathematical perspective on these phenomena, showing they arise naturally from the structure of the reward landscape and the sensitivity of policy optimization within it. While our single-reward analysis yields valuable insights, leading LLMs such as OpenAI’s o-series, Gemini 2.5, and Grok 4 are rarely trained on a singular reward. 
Instead, they are typically optimized using a complex blend of reward signals across multiple domains, ranging from mathematical reasoning and code generation to safety alignment (Guo et al., 2025; Liang et al., 2025; Cheng et al., 2025). This motivates the second stage of our analysis, where we extend the framework to more realistic multi-reward training regimes. In this setting, we model the system’s behavior using a state-dependent effective reward function that captures how the model internally aggregates multiple (and sometimes conflicting) objectives. We demonstrate that the same principles of stability apply: policy continuity now depends on the properties of this effective reward. Crucially, this highlights the aggregation mechanism of the effective reward as a key determinant of policy robustness. To address resulting instabilities, we formally analyze the role of regularization techniques and show that entropy regularization restores Lipschitz continuity in the reward-policy map. This guarantees that small changes in the reward yield correspondingly small changes in behavior, but it comes at the cost of trading off some degree of optimality for greater stability. To validate the practical relevance of our framework, we systematically link its theoretical insights to a broad spectrum of recent empirical findings. Our analysis accounts for the deceptive reasoning behavior (Wang et al., 2025a; Baker et al., 2025): from simple cheating under a weak reward model to a more pernicious policy shift towards obfuscated deception when that reward is naively "patched". We further elucidate the intelligence-obedience trade-off in large reasoning models (Fu et al., 2025) and demonstrate how targeted auxiliary rewards can serve as tie-breakers to resolve such tensions and enable fine-grained behavioral control (Aggarwal and Welleck, 2025). 
Extending to the subjective realm of RLHF, our theory also explains the emergence of "performance illusions" (Wen et al., 2024), where models shift from truthful responses to persuasive but misleading ones that exploit human feedback biases. Finally, we review and present results related to the multi-reward setting. These case studies provide comprehensive empirical support for our framework, offering a unified lens through which to understand critical stability challenges in modern AI.
Our key contributions are as follows:
1. A theoretical framework for policy stability in RL-trained language models. We establish a formal analysis of the reward-policy map and prove that policy instability stems from inherent properties of the reward landscape, notably reward misspecification and the presence of degenerate optima.
2. A unified explanation for alignment and reasoning phenomena in LLMs. We demonstrate that common failure modes, such as spurious reasoning, poor instruction-following and inefficient reasoning, emerge naturally from misspecified reward signals. We also introduce the notion of an effective reward as a key determinant of policy robustness in a multi-reward setting.
3. A principled justification for entropy regularization. We prove that entropy regularization restores continuity in the reward-policy map, providing a theoretical foundation for its widespread use in stabilizing behavior during large-scale model training.
2 Continuity Analysis of the Reward-Policy Map
This section develops the core theoretical framework for analyzing the stability of reinforcement learning policies. We formally investigate the continuity of the mapping from a reward function, $R$, to a corresponding optimal policy, $\pi^*_R$. A central theme of our work is that the properties of this map are a critical determinant of the robustness and predictability of RL-trained agents.
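Contribution 3 above (entropy regularization restores continuity) can be previewed with a minimal numerical sketch. This is my illustration under assumed values, not the paper's proof (which appears later, with the softmax Lipschitz property in Appendix C.2): the entropy-regularized optimal policy is a softmax over values, which responds smoothly to a value perturbation that flips the hard argmax.

```python
import numpy as np

# Hard argmax vs. temperature-tau softmax near a degenerate optimum.
# pi(a) proportional to exp(q(a)/tau); tau and q are illustrative choices.
def softmax(q, tau):
    z = (q - q.max()) / tau          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

q = np.array([1.0, 1.0])             # two tied actions
dq = np.array([0.0, 1e-6])           # tiny value perturbation

hard_jump = abs(int(np.argmax(q + dq)) - int(np.argmax(q - dq)))
soft_jump = np.abs(softmax(q + dq, 0.1) - softmax(q - dq, 0.1)).max()

assert hard_jump == 1                # greedy policy changes discontinuously
assert soft_jump < 1e-4              # softmax change is O(|dq| / tau)
```

The stochasticity cost is visible too: at the tie, the softmax policy spreads probability across both actions rather than committing to one.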
Our analysis proceeds by first establishing the foundational stability of the optimal Q-function with respect to the reward. This result is then used to analyze the continuity properties of the set of optimal actions, which allows us to derive the conditions under which the reward-policy map is continuous and, conversely, the conditions under which it becomes discontinuous.
2.1 Framework and Definitions
We consider a standard infinite-horizon discounted Markov Decision Process (MDP) defined by the tuple $(S, A, P, R, \gamma)$.
Assumption 2.1. The state space $S$ and action space $A$ are compact metric spaces, endowed with their respective Borel $\sigma$-algebras $\mathcal{B}(S)$ and $\mathcal{B}(A)$.
Assumption 2.2. The transition kernel $P: S \times A \to \mathcal{P}(S)$ is a stochastic kernel such that for any bounded continuous function $f \in C(S)$, the mapping $(s,a) \mapsto \int_S f(s')\, P(ds'|s,a)$ is continuous on $S \times A$.
Assumption 2.3. The discount factor $\gamma \in [0,1)$.
The reward function $R: S \times A \to \mathbb{R}$ determines the immediate reward. We consider the space of reward functions $\mathcal{R}$ to be the Banach space $C(S \times A)$ of continuous functions on $S \times A$, equipped with the supremum norm $\|R\|_\infty = \sup_{(s,a) \in S \times A} |R(s,a)|$.
A policy $\pi: S \to \mathcal{P}(A)$ is a stochastic kernel satisfying appropriate measurability conditions. Let $\Pi$ denote the space of all such policies. The continuity of the reward-policy map, which is the central subject of our analysis, depends critically on the topology endowed upon this policy space $\Pi$. Since our results will cover different types of policies (e.g., deterministic versus stochastic), we will specify the relevant topology in the context of each theorem to ensure maximum clarity and precision.
For any given policy $\pi$ and reward function $R \in \mathcal{R}$, its performance is quantified by the state-value function
$$V^\pi_R(s) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\Big|\, s_0 = s\Big],$$
representing the expected total discounted reward obtained by starting in state $s$ and subsequently following policy $\pi$. The optimal action-value function $Q^*_R \in C(S \times A)$ is the unique fixed point of the Bellman optimality operator $T_R$:
$$(T_R Q)(s,a) = R(s,a) + \gamma \int_S \max_{a' \in A} Q(s', a')\, P(ds'|s,a). \tag{1}$$
The existence and uniqueness of $Q^*_R$ follow from the Banach fixed-point theorem, as $T_R$ is a contraction mapping on $C(S \times A)$ with modulus $\gamma$. The set of optimal actions at state $s$ for a reward function $R$ is given by the argmax correspondence:
$$A^*(s; R) = \operatorname*{argmax}_{a \in A} Q^*_R(s,a). \tag{2}$$
An optimal policy $\pi^*_R$ is any policy $\pi \in \Pi$ such that for all $s \in S$, the support of the measure $\pi(\cdot|s)$ is contained within $A^*(s;R)$. Let $\Pi^*(R) \subseteq \Pi$ denote the set of all optimal policies for $R$. Our analysis centers on the continuity of the mapping from the reward function $R$ to a corresponding optimal policy. This can be viewed either as the continuity properties of the set-valued map $R \mapsto \Pi^*(R)$ or, more commonly, by defining a specific selection rule $f: \mathcal{R} \to \Pi$ such that $f(R) = \pi^*_R \in \Pi^*(R)$, and analyzing the continuity of this single-valued map $f$. We denote this map by $M_{RL}: \mathcal{R} \to \Pi$.
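Since the operator in Eq. (1) is a $\gamma$-contraction, $Q^*_R$ can be computed in the finite tabular case by simply iterating it to its fixed point. A minimal sketch on a randomly generated toy MDP (the states, rewards, and dynamics here are made up for illustration):

```python
import numpy as np

# Tabular value iteration: iterate the Bellman optimality operator T_R
# until it converges to its unique fixed point Q* (contraction, modulus gamma).
def q_star(R, P, gamma, tol=1e-10):
    # R: (S, A) reward table; P: (S, A, S) transition kernel; gamma in [0, 1)
    Q = np.zeros_like(R)
    while True:
        Q_new = R + gamma * P @ Q.max(axis=1)   # (T_R Q)(s, a), Eq. (1)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

S, A = 3, 2
rng = np.random.default_rng(0)
R = rng.normal(size=(S, A))
P = rng.random(size=(S, A, S))
P /= P.sum(axis=2, keepdims=True)               # normalize rows to kernels

Q = q_star(R, P, gamma=0.9)
# Fixed-point property: applying T_R once more leaves Q* unchanged.
assert np.allclose(R + 0.9 * P @ Q.max(axis=1), Q, atol=1e-8)
```

The argmax correspondence of Eq. (2) is then just `Q.argmax(axis=1)` per state, which is exactly the selection whose (dis)continuity the rest of the section studies.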
Our analytical approach, visualized in Figure 1, is to decompose this overall map and sequentially investigate the properties of each constituent mapping: from the reward function $R$ to the optimal Q-function $Q^*_R$, then to the set of optimal actions $A^*$, and ultimately to the selected policy $\pi^*_R$.
$$R \;\xrightarrow[\text{Lipschitz continuous (Prop. 2.4)}]{\text{Bellman operator}}\; Q^*_R \;\xrightarrow[\text{upper hemi-continuous (Lemma 2.5)}]{\text{argmax operator}}\; A^*(\cdot\,;R) \;\xrightarrow[\text{(dis-)continuity (Sec. 2.4 \& 2.5)}]{\text{policy selection}}\; \pi^*_R$$
Figure 1: The analytical roadmap for the stability analysis of the reward-policy map. Our analysis proceeds from left to right, investigating the continuity properties at each step of the mapping: from the reward function ($R$) to the optimal Q-function ($Q^*_R$), then to the set of optimal actions ($A^*$), and finally to the resulting optimal policy ($\pi^*_R$). The key insight is that the stability of the final policy hinges on the properties of the argmax correspondence.
2.2 Continuity of the Optimal Value Function
The foundation for analyzing the policy map's continuity lies in the stability of the optimal Q-function with respect to changes in the reward function. This is a standard result in dynamic programming; see, e.g., Lecarpentier et al. (2021).
Proposition 2.4. Let Assumptions 2.1-2.3 hold. The mapping $R \mapsto Q^*_R$ from $(\mathcal{R}, \|\cdot\|_\infty)$ to $(C(S \times A), \|\cdot\|_\infty)$ is Lipschitz continuous with constant $1/(1-\gamma)$. That is, for any $R_1, R_2 \in \mathcal{R}$,
$$\|Q^*_{R_1} - Q^*_{R_2}\|_\infty \le \frac{1}{1-\gamma}\,\|R_1 - R_2\|_\infty. \tag{3}$$
Proof.
Let $Q_1 = Q^*_{R_1}$ and $Q_2 = Q^*_{R_2}$. By definition, $Q_1 = T_{R_1} Q_1$ and $Q_2 = T_{R_2} Q_2$. Then,
$$\begin{aligned}
|Q_1(s,a) - Q_2(s,a)| &= |(T_{R_1} Q_1)(s,a) - (T_{R_2} Q_2)(s,a)| \\
&= \Big| R_1(s,a) - R_2(s,a) + \gamma \int_S \Big( \max_{a'} Q_1(s',a') - \max_{a'} Q_2(s',a') \Big)\, P(ds'|s,a) \Big| \\
&\le |R_1(s,a) - R_2(s,a)| + \gamma \int_S \Big| \max_{a'} Q_1(s',a') - \max_{a'} Q_2(s',a') \Big|\, P(ds'|s,a).
\end{aligned}$$
Using the inequality $|\max f - \max g| \le \sup |f - g| = \|f - g\|_\infty$, we have
$$|Q_1(s,a) - Q_2(s,a)| \le \|R_1 - R_2\|_\infty + \gamma \int_S \|Q_1 - Q_2\|_\infty\, P(ds'|s,a) \le \|R_1 - R_2\|_\infty + \gamma \|Q_1 - Q_2\|_\infty.$$
Taking the supremum over $(s,a)$ yields $\|Q_1 - Q_2\|_\infty \le \|R_1 - R_2\|_\infty + \gamma \|Q_1 - Q_2\|_\infty$, which rearranges to the desired result.
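The bound in Eq. (3) can be sanity-checked numerically. The sketch below uses a randomly generated toy MDP of my own construction (not from the paper): perturbing $R$ by $\delta$ changes $Q^*$ by at most $\|\delta\|_\infty / (1-\gamma)$.

```python
import numpy as np

# Fixed-iteration value iteration; 2000 applications of the gamma-contraction
# T_R leave a truncation error far below the tolerances used below.
def q_star(R, P, gamma, iters=2000):
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.8
R1 = rng.normal(size=(S, A))
delta = 0.05 * rng.uniform(-1, 1, size=(S, A))  # reward perturbation
P = rng.random(size=(S, A, S))
P /= P.sum(axis=2, keepdims=True)

Q1 = q_star(R1, P, gamma)
Q2 = q_star(R1 + delta, P, gamma)
lhs = np.abs(Q1 - Q2).max()                     # ||Q*_{R1} - Q*_{R2}||_inf
rhs = np.abs(delta).max() / (1 - gamma)         # Lipschitz bound, Eq. (3)
assert lhs <= rhs + 1e-8
```

Note what this stability does and does not buy: the Q-values move smoothly with the reward, yet (as the next subsection shows) the argmax over those values can still jump.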
∎
2.3 Analysis of the Argmax Correspondence
The stability of the optimal policy map is critically dependent on the behavior of the set of optimal actions $A^*(s;R) = \operatorname*{argmax}_{a \in A} Q^*_R(s,a)$. We analyze this set-valued map (or correspondence) $R \mapsto A^*(\cdot\,;R)$. The result is a direct application of Berge's Maximum Theorem; see, e.g., Berge (1963) and Aliprantis and Border (2006).
Lemma 2.5. Let Assumptions 2.1-2.3 hold. For each fixed $s \in S$, the argmax correspondence $\Phi_s: \mathcal{R} \twoheadrightarrow A$ defined by $\Phi_s(R) = A^*(s;R)$ has the following properties:
1. $\Phi_s(R)$ is non-empty and compact for each $R \in \mathcal{R}$.
2. $\Phi_s$ is upper hemi-continuous (u.h.c.) from $(\mathcal{R}, \|\cdot\|_\infty)$ to the space of compact subsets of $A$, where the latter is endowed with the topology induced by the Hausdorff metric.
Proof. 1. Non-empty and compact values: Under Assumptions 2.1-2.3, it is a standard result that the Bellman optimality operator $T_R$ maps the space of bounded continuous functions $C(S \times A)$ to itself. As established by the Banach fixed-point theorem, its unique fixed point, $Q^*_R$, must therefore be an element of this space, i.e., $Q^*_R \in C(S \times A)$. Thus, for any fixed $R \in \mathcal{R}$ and $s \in S$, the function $a \mapsto Q^*_R(s,a)$ is continuous on the action space $A$. Since $A$ is compact (Assumption 2.1), the Weierstrass Extreme Value Theorem ensures that $Q^*_R(s,\cdot)$ attains its maximum on $A$. Therefore, the set of maximizers $A^*(s;R)$ is non-empty.
Furthermore, because $Q^*_R(s,\cdot)$ is continuous, the set $A^*(s;R)$ can be written as $\{a \in A \mid Q^*_R(s,a) = \max_{a' \in A} Q^*_R(s,a')\}$, which is a closed subset of the compact space $A$. A closed subset of a compact space is compact. Thus, $A^*(s;R)$ is compact-valued.
2. Upper hemi-continuity: We prove u.h.c. using the sequential characterization: $\Phi_s$ is u.h.c. at $R_0 \in \mathcal{R}$ if for any sequence $(R_n)_{n \in \mathbb{N}}$ in $\mathcal{R}$ converging to $R_0$ (i.e., $\|R_n - R_0\|_\infty \to 0$), and for any sequence $(a_n)_{n \in \mathbb{N}}$ in $A$ such that $a_n \in A^*(s;R_n)$ for all $n$ and $a_n \to a_0$ for some $a_0 \in A$, it must follow that $a_0 \in A^*(s;R_0)$. This is equivalent to showing that the graph of the correspondence is closed, which for compact-valued correspondences with a compact range space ($A$) implies u.h.c. (see, e.g., Aliprantis and Border (2006), Theorems 17.11 and 17.31).
Let $(R_n)_{n \in \mathbb{N}}$ be a sequence in $\mathcal{R}$ such that $R_n \to R_0$. Let $(a_n)_{n \in \mathbb{N}}$ be a sequence in $A$ such that $a_n \in A^*(s;R_n)$ for each $n$, and assume $a_n \to a_0$ in $A$. By definition of $a_n \in A^*(s;R_n)$, we have:
$$Q^*_{R_n}(s,a_n) \ge Q^*_{R_n}(s,a) \quad \text{for all } a \in A. \tag{4}$$
We want to show that $a_0 \in A^*(s;R_0)$, i.e., $Q^*_{R_0}(s,a_0) \ge Q^*_{R_0}(s,a)$ for all $a \in A$. Consider the term $Q^*_{R_n}(s,a_n)$. We have:
$$|Q^*_{R_n}(s,a_n) - Q^*_{R_0}(s,a_0)| \le |Q^*_{R_n}(s,a_n) - Q^*_{R_0}(s,a_n)| + |Q^*_{R_0}(s,a_n) - Q^*_{R_0}(s,a_0)|.$$
The first term on the right-hand side satisfies $|Q^*_{R_n}(s,a_n) - Q^*_{R_0}(s,a_n)| \le \|Q^*_{R_n} - Q^*_{R_0}\|_\infty$. By Proposition 2.4, $\|Q^*_{R_n} - Q^*_{R_0}\|_\infty \to 0$ as $R_n \to R_0$.
For the second term, since $Q^*_{R_0}(s,\cdot)$ is continuous on $A$ (as $Q^*_{R_0} \in C(S \times A)$) and $a_n \to a_0$, it follows that $|Q^*_{R_0}(s,a_n) - Q^*_{R_0}(s,a_0)| \to 0$. Therefore, $\lim_{n \to \infty} Q^*_{R_n}(s,a_n) = Q^*_{R_0}(s,a_0)$.
Now, for any fixed $a \in A$, consider the right-hand side of inequality (4), $Q^*_{R_n}(s,a)$. Since $\|Q^*_{R_n} - Q^*_{R_0}\|_\infty \to 0$, we have $\lim_{n \to \infty} Q^*_{R_n}(s,a) = Q^*_{R_0}(s,a)$. Taking the limit as $n \to \infty$ in inequality (4), we obtain:
$$Q^*_{R_0}(s,a_0) \ge Q^*_{R_0}(s,a) \quad \text{for all } a \in A.$$
This implies that $a_0 \in \operatorname*{argmax}_{a \in A} Q^*_{R_0}(s,a)$, i.e., $a_0 \in A^*(s;R_0)$. This argument directly establishes that the graph of the correspondence $\Phi_s$ is closed. Since the range space $A$ is a compact Hausdorff space, the closed graph property is equivalent to upper hemi-continuity. Thus, the upper hemi-continuity of $\Phi_s$ is proven. ∎
Remark 2.6.
The sequential definition of upper hemi-continuity used in the proof is: if $(R_n, a_n)$ is a sequence in the graph of the correspondence (i.e., $a_n \in A^*(s;R_n)$) such that $R_n \to R_0$ and $a_n \to a_0$, then $(R_0, a_0)$ must also be in the graph (i.e., $a_0 \in A^*(s;R_0)$). This is precisely the definition of a closed-graph correspondence. For correspondences into a compact Hausdorff space (such as $A$), the closed-graph property is equivalent to upper hemi-continuity together with the correspondence being closed-valued (hence compact-valued, since $A$ is compact); cf. Aliprantis and Border (2006), Theorem 17.11. The u.h.c. property is fundamental, but it implies neither lower hemi-continuity (l.h.c.) nor the continuity of arbitrary single-valued selections from $A^*(s;R)$. Discontinuities in policy maps often arise when $A^*(s;R)$ fails to be l.h.c., which is typical when the set of maximizers changes its structure (e.g., its cardinality or dimension).

2.4 Conditions for Continuity of the Reward-Policy Map

The continuity of a specific policy map $M_{RL}: R \mapsto \pi^*_R$ depends critically on whether the optimal action is unique.

Theorem 2.7 (Continuity of the Deterministic Optimal Policy Map under Uniqueness). Let Assumptions 2.1–2.3 hold, and let $R_0 \in \mathcal{R}$. Suppose there exists an open neighborhood $N(R_0) \subset \mathcal{R}$ of $R_0$ such that for all $R \in N(R_0)$ and all $s \in S$, the set of optimal actions $A^*(s;R)$ is a singleton, denoted $\{a^*(s;R)\}$. Let $P_{\mathrm{det}}$ be the space of all functions $\pi: S \to A$.
Define the policy map $M_{RL}: N(R_0) \to P_{\mathrm{det}}$ by setting $M_{RL}(R) = \pi_R$, where $\pi_R(s) = a^*(s;R)$ for all $s \in S$. Equip $P_{\mathrm{det}}$ with the topology of pointwise convergence: a sequence $(\pi_n)_{n\in\mathbb{N}}$ in $P_{\mathrm{det}}$ converges to $\pi \in P_{\mathrm{det}}$ if for every $s \in S$, $d_A(\pi_n(s), \pi(s)) \to 0$ as $n \to \infty$. Then the map $M_{RL}$ is continuous at $R_0$.

Proof. We want to show that if $(R_n)_{n\in\mathbb{N}}$ is a sequence in $N(R_0)$ with $R_n \to R_0$ in the $\|\cdot\|_\infty$ topology, then $M_{RL}(R_n) \to M_{RL}(R_0)$ in the topology of pointwise convergence. Let $\pi_n = M_{RL}(R_n)$ and $\pi_0 = M_{RL}(R_0)$. By the definition of $M_{RL}$, this means $\pi_n(s) = a^*(s;R_n)$ and $\pi_0(s) = a^*(s;R_0)$. The convergence $\pi_n \to \pi_0$ requires that for every $s \in S$, $d_A(a^*(s;R_n), a^*(s;R_0)) \to 0$ as $n \to \infty$. Fix an arbitrary $s \in S$ and consider the correspondence $\Phi_s: \mathcal{R} \twoheadrightarrow A$ defined by $\Phi_s(R) = A^*(s;R)$. By Lemma 2.5, $\Phi_s$ is upper hemi-continuous (u.h.c.) and its values $A^*(s;R)$ are non-empty compact subsets of $A$. By the hypothesis of the theorem, for all $R \in N(R_0)$, $\Phi_s(R) = \{a^*(s;R)\}$ is a singleton.
According to a standard result in the theory of correspondences (e.g., Aliprantis and Border (2006), Lemma 17.6), a compact-valued, u.h.c. correspondence that is single-valued on an open set (here, the maximizer is unique on $N(R_0)$ by hypothesis) is continuous as a single-valued function on that set. Specifically, let $f_s: N(R_0) \to A$ be defined by $f_s(R) = a^*(s;R)$ (well-defined since $A^*(s;R)$ is a singleton on $N(R_0)$). Since $\Phi_s$ is u.h.c. and single-valued on the open set $N(R_0)$, $f_s$ is continuous on $N(R_0)$. Therefore, for each fixed $s \in S$, since $R_n \to R_0$ and $R_n \in N(R_0)$ for $n$ sufficiently large (as $N(R_0)$ is a neighborhood of $R_0$), the continuity of $f_s$ at $R_0$ implies $f_s(R_n) \to f_s(R_0)$, i.e., $a^*(s;R_n) \to a^*(s;R_0)$ in $A$ (that is, $d_A(a^*(s;R_n), a^*(s;R_0)) \to 0$). Since this holds for every $s \in S$, it follows by definition that $M_{RL}(R_n) \to M_{RL}(R_0)$ in the topology of pointwise convergence. Thus, $M_{RL}$ is continuous at $R_0$. ∎

Remark 2.8 (Measurability of the Optimal Policy). For $\pi_R$ to be a well-defined policy, the map $s \mapsto a^*(s;R)$ should be measurable. Given the continuity of $Q^*_R$ (if $R$ is continuous and $P$ has the Feller property, $Q^*_R$ is continuous on $S \times A$) and the uniqueness assumption, $s \mapsto a^*(s;R)$ will generally be measurable.
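The contrast between the unique-optimum regime of Theorem 2.7 and the degenerate regime analyzed next can be checked numerically. The sketch below is our own illustration, not from the paper; the one-state MDP and its reward values are invented. With a strict gap, a perturbation far smaller than the gap leaves the greedy action unchanged; with a tie, the same perturbation flips it.

```python
import numpy as np

def optimal_q(R, P, gamma=0.9, tol=1e-12):
    """Tabular value iteration: R[s,a] rewards, P[s,a,s'] transition kernel."""
    Q = np.zeros_like(R)
    while True:
        Q_new = R + gamma * P @ Q.max(axis=1)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

def greedy(R, P):
    """Greedy action at the single state s0 = 0."""
    return int(optimal_q(R, P).argmax(axis=1)[0])

# One recurrent state, two self-loop actions (numbers are invented).
P = np.ones((1, 2, 1))
eps = 1e-8                         # tiny reward perturbation of action 1

R_gap = np.array([[1.0, 0.5]])     # unique optimal action
R_tie = np.array([[1.0, 1.0]])     # degenerate: both actions optimal

print(greedy(R_gap, P), greedy(R_gap + [[0, eps]], P))  # 0 0 : stable
print(greedy(R_tie, P), greedy(R_tie + [[0, eps]], P))  # 0 1 : flips
```

Because both actions share the same continuation value, the Q-gap at the state equals the reward gap, so an arbitrarily small tie-breaking perturbation suffices to switch the selection.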
2.5 Conditions for Discontinuity of the Reward-Policy Map

When the optimal action is not unique, discontinuity of the policy map is generally expected.

Proposition 2.9 (Discontinuity under Non-Uniqueness for Deterministic Policies). Let Assumptions 2.1–2.3 hold. Suppose that for $R_0 \in \mathcal{R}$ there exists a state $s_0 \in S$ such that the set of optimal actions $A^*(s_0;R_0)$ is finite and contains at least two distinct actions, say $a_1, a_2 \in A^*(s_0;R_0)$ with $a_1 \neq a_2$. Let $M_{RL}: \mathcal{R} \to \mathcal{P}$ be a policy map that selects a deterministic policy $\pi^*_R$ with $\pi^*_R(s) \in A^*(s;R)$ for all $s \in S$, and suppose the selection rule for $M_{RL}(R_0)$ yields $\pi^*_{R_0}(s_0) = a_1$. Then the map $M_{RL}$ is discontinuous at $R_0$ under the topology of pointwise convergence for policies (i.e., $\pi_n \to \pi$ if $d_A(\pi_n(s), \pi(s)) \to 0$ for all $s$).

Proof. This proof demonstrates discontinuity directly by constructing a family of reward functions $(R_\varepsilon)_{\varepsilon>0}$ that converges to $R_0$, yet for which the optimal policy discontinuously switches from $a_1$ to $a_2$ at state $s_0$ for every $\varepsilon > 0$. The core of the proof is a shift in perspective: instead of perturbing the reward function $R$ and analyzing its complex effect on the optimal Q-function $Q^*$, we directly define a perturbed optimal Q-function $Q_\varepsilon$ and then use the Bellman equation to find the reward function $R_\varepsilon$ that generates it.

1. Construct a Perturbation in the Q-Function Space. Let $Q_0 = Q^*_{R_0}$. By assumption, $Q_0(s_0, a_1) = Q_0(s_0, a_2)$.
We first define a continuous "bump" function $\varphi \in C(S \times A)$ with the following properties:

- $0 \le \varphi(s,a) \le 1$ for all $(s,a)$;
- $\varphi(s_0, a_2) = 1$;
- $\varphi(s_0, a_j) = 0$ for all $a_j \in A^*(s_0;R_0)$ with $a_j \neq a_2$.

Such a function exists and can be constructed as in Lemma C.1. Alternatively, since $S \times A$ is a compact (hence normal) metric space, the Urysohn Lemma yields a continuous function taking the value 1 at $\{(s_0, a_2)\}$ and vanishing outside any desired neighborhood of that point (see, e.g., Munkres (2000), Theorem 33.1). For any $\varepsilon > 0$, we define a perturbed Q-function $Q_\varepsilon$:

$$Q_\varepsilon(s,a) := Q_0(s,a) + \varepsilon\,\varphi(s,a).$$

By construction, $Q_\varepsilon$ is continuous and converges uniformly to $Q_0$ as $\varepsilon \to 0$, since $\|Q_\varepsilon - Q_0\|_\infty = \varepsilon$.

2. Invert the Bellman Equation to Find the Corresponding Reward. For any continuous function $Q \in C(S \times A)$, we can define a corresponding reward function $R_Q$ by

$$R_Q(s,a) := Q(s,a) - \gamma \int_S \max_{a'} Q(s',a')\,P(ds' \mid s,a).$$

Substituting this definition into the Bellman optimality operator $T_{R_Q}$, we can verify that $T_{R_Q}(Q) = Q$. Since $T_{R_Q}$ is a contraction mapping with modulus $\gamma$, the Banach Fixed-Point Theorem guarantees that $Q$ is its unique fixed point. Therefore, $Q$ is the optimal Q-function for the reward $R_Q$, i.e., $Q^*_{R_Q} = Q$.
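The Bellman-inversion step is easy to check numerically in the tabular case. The sketch below is our own illustration (the random 3-state MDP, constant reward, and all numbers are invented): it bumps $Q_0$ at a single state-action pair, inverts the Bellman equation to obtain $R_\varepsilon$, and confirms both that $Q^*_{R_\varepsilon} = Q_\varepsilon$ and that the greedy action at $s_0$ has switched.

```python
import numpy as np

def optimal_q(R, P, gamma, tol=1e-12):
    """Tabular value iteration for Q*."""
    Q = np.zeros_like(R)
    while True:
        Q_new = R + gamma * P @ Q.max(axis=1)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

def reward_from_q(Q, P, gamma):
    """Bellman inversion: R_Q(s,a) = Q(s,a) - gamma * E[max_a' Q(s',a') | s,a]."""
    return Q - gamma * P @ Q.max(axis=1)

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)        # random transition kernel

R0 = np.ones((S, A))                     # constant reward: every action optimal
Q0 = optimal_q(R0, P, gamma)             # ties everywhere, incl. s0 = 0

eps = 1e-3
phi = np.zeros((S, A)); phi[0, 1] = 1.0  # bump only (s0, a2)
Q_eps = Q0 + eps * phi
R_eps = reward_from_q(Q_eps, P, gamma)   # the reward that generates Q_eps

print(np.allclose(optimal_q(R_eps, P, gamma), Q_eps))     # Q*_{R_eps} = Q_eps
print(np.abs(R_eps - R0).max() <= eps * (1 + gamma))      # reward moved by <= eps(1+gamma)
print(int(optimal_q(R_eps, P, gamma).argmax(axis=1)[0]))  # greedy action at s0 is now a2 = 1
```

The design choice mirrors the proof: the perturbation is defined in Q-space, where its effect on the argmax is transparent, and the reward is recovered afterwards rather than perturbed directly.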
Applying this inverse mapping to our perturbed function $Q_\varepsilon$, we define a family of reward functions

$$R_\varepsilon := R_{Q_\varepsilon}.$$

By this construction, $Q^*_{R_\varepsilon} = Q_\varepsilon$ for every $\varepsilon > 0$. Note that $R_\varepsilon$ is continuous and belongs to $C(S \times A)$, satisfying the continuity assumption.

3. Verify Convergence in the Reward Space. We must show that the constructed reward functions $R_\varepsilon$ converge to the original reward function $R_0$ as $\varepsilon \to 0$. Note that $R_0 = R_{Q_0}$. Then

$$\begin{aligned}
\|R_\varepsilon - R_0\|_\infty &= \|R_{Q_\varepsilon} - R_{Q_0}\|_\infty \\
&= \Big\| \Big(Q_\varepsilon - \gamma \int \max_{a'} Q_\varepsilon\Big) - \Big(Q_0 - \gamma \int \max_{a'} Q_0\Big) \Big\|_\infty \\
&= \Big\| (Q_\varepsilon - Q_0) - \gamma \Big( \int \max_{a'} Q_\varepsilon - \int \max_{a'} Q_0 \Big) \Big\|_\infty \\
&\le \|Q_\varepsilon - Q_0\|_\infty + \gamma \Big\| \int \max_{a'} Q_\varepsilon - \int \max_{a'} Q_0 \Big\|_\infty.
\end{aligned}$$

The map $Q \mapsto \max_{a'} Q(s',a')$ is Lipschitz continuous with constant 1. Thus,

$$\Big\| \int \max_{a'} Q_\varepsilon - \int \max_{a'} Q_0 \Big\|_\infty \le \sup_{s'} \Big| \max_{a'} Q_\varepsilon(s',a') - \max_{a'} Q_0(s',a') \Big| \le \|Q_\varepsilon - Q_0\|_\infty = \varepsilon.$$

Substituting this back into the inequality, we get

$$\|R_\varepsilon - R_0\|_\infty \le \varepsilon + \gamma\varepsilon = \varepsilon(1+\gamma).$$
As $\varepsilon \to 0$, it follows that $\|R_\varepsilon - R_0\|_\infty \to 0$; thus $R_\varepsilon$ converges to $R_0$.

4. The Action Switch and Discontinuity. We have constructed a family of rewards $R_\varepsilon$ converging to $R_0$, with corresponding optimal Q-functions $Q^*_{R_\varepsilon} = Q_\varepsilon$. We now show that for any $\varepsilon > 0$, the action $a_2$ becomes the unique optimal action under the perturbed Q-function $Q_\varepsilon$ at state $s_0$, by comparing $a_2$ to all other possible actions.

Case 1: Comparison with other formerly optimal actions. Let $a_j$ be any action in the original optimal set $A^*(s_0;R_0)$ with $a_j \neq a_2$. The construction of the bump function $\varphi$ ensures $\varphi(s_0,a_2) = 1$ and $\varphi(s_0,a_j) = 0$. Hence

$$Q_\varepsilon(s_0,a_2) - Q_\varepsilon(s_0,a_j) = \big(Q_0(s_0,a_2) - Q_0(s_0,a_j)\big) + \varepsilon\big(\varphi(s_0,a_2) - \varphi(s_0,a_j)\big) = 0 + \varepsilon(1-0) = \varepsilon.$$

Since $\varepsilon > 0$, $a_2$ becomes strictly preferred over all other actions that were optimal under $R_0$.

Case 2: Comparison with formerly suboptimal actions. Let $a'$ be any action not in the original optimal set, i.e., $a' \notin A^*(s_0;R_0)$. Then there is a suboptimality gap $\delta_{a'} := Q_0(s_0,a_2) - Q_0(s_0,a') > 0$, and

$$Q_\varepsilon(s_0,a_2) - Q_\varepsilon(s_0,a') = \big(Q_0(s_0,a_2) - Q_0(s_0,a')\big) + \varepsilon\big(\varphi(s_0,a_2) - \varphi(s_0,a')\big) = \delta_{a'} + \varepsilon\big(1 - \varphi(s_0,a')\big).$$

Since $\varepsilon > 0$ and $\varphi(s_0,a') \le 1$, the term $\varepsilon(1 - \varphi(s_0,a'))$ is non-negative. Thus

$$Q_\varepsilon(s_0,a_2) - Q_\varepsilon(s_0,a') \ge \delta_{a'} > 0.$$

This shows that $a_2$ is also strictly preferred over all actions that were already suboptimal.

Combining both cases, for any $\varepsilon > 0$, $a_2$ is the unique maximizer of $Q_\varepsilon(s_0,\cdot)$. Therefore the optimal action set for $R_\varepsilon$ at $s_0$ is the singleton $A^*(s_0;R_\varepsilon) = \{a_2\}$, and the policy map must select $\pi^*_{R_\varepsilon}(s_0) = a_2$. We thus have a family of rewards $R_\varepsilon \to R_0$ whose policy selection jumps from $\pi^*_{R_0}(s_0) = a_1$ to $\pi^*_{R_\varepsilon}(s_0) = a_2$. This demonstrates a failure of pointwise convergence, proving that the map $M_{RL}$ is discontinuous at $R_0$. ∎

Proposition 2.10 (Discontinuity for Uniform Stochastic Policies). Let Assumptions 2.1–2.3 hold. Suppose that for $R_0 \in \mathcal{R}$ there exists a state $s_0 \in S$ such that the set of optimal actions $A^*(s_0;R_0)$ is finite and contains at least two distinct actions; let $m = |A^*(s_0;R_0)| \ge 2$. Let the policy map $M_{RL}: \mathcal{R} \to \mathcal{P}$ be defined by selecting the stochastic policy $\pi^*_R$ such that for each $s \in S$, $\pi^*_R(\cdot \mid s)$ is the uniform probability distribution over the set $A^*(s;R)$.
(If $A^*(s;R)$ were empty, which cannot happen for optimal policies derived from Q-functions on compact action spaces, or infinite, this definition would need refinement; here we rely on the construction yielding finite sets.) Then the map $R \mapsto \pi^*_R(\cdot \mid s_0)$, viewed as a map from $(\mathcal{R}, \|\cdot\|_\infty)$ to $(\mathcal{P}(A), d_{TV})$, where $d_{TV}$ is the Total Variation distance, is discontinuous at $R_0$.

Proof. Let $Q_0 = Q^*_{R_0}$. By assumption, $A^*(s_0;R_0)$ is a finite set with $m = |A^*(s_0;R_0)| \ge 2$. Let $a_2 \in A^*(s_0;R_0)$ be one of these optimal actions. The policy $\pi^*_{R_0}(\cdot \mid s_0)$ is the uniform distribution over $A^*(s_0;R_0)$: $\pi^*_{R_0}(a \mid s_0) = 1/m$ for $a \in A^*(s_0;R_0)$ and $\pi^*_{R_0}(a \mid s_0) = 0$ for $a \notin A^*(s_0;R_0)$. We use the same family of reward functions $R_\varepsilon$ constructed in the proof of Proposition 2.9, for which $R_\varepsilon \to R_0$ in $(\mathcal{R}, \|\cdot\|_\infty)$ and $A^*(s_0;R_\varepsilon) = \{a_2\}$. Let $R_n = R_{\varepsilon_n}$ for a sequence $\varepsilon_n \to 0$. By our policy selection rule, $\pi^*_{R_n}(\cdot \mid s_0)$ is the uniform distribution over $A^*(s_0;R_n) = \{a_2\}$.
This means $\pi^*_{R_n}(\cdot \mid s_0)$ is the Dirac measure $\delta_{a_2}$ concentrated at $a_2$: $\pi^*_{R_n}(a_2 \mid s_0) = 1$ and $\pi^*_{R_n}(a \mid s_0) = 0$ for all $a \neq a_2$. We now compute the Total Variation distance $d_{TV}(\pi^*_{R_n}(\cdot \mid s_0), \pi^*_{R_0}(\cdot \mid s_0))$ for $n$ sufficiently large. Let $\mu_n = \pi^*_{R_n}(\cdot \mid s_0) = \delta_{a_2}$ and $\mu_0 = \pi^*_{R_0}(\cdot \mid s_0)$. For two probability measures $\nu_1, \nu_2$ on a countable (here finite, as for the supports of $\mu_0$ and $\mu_n$) set $A'$, the Total Variation distance is

$$d_{TV}(\nu_1, \nu_2) = \frac{1}{2} \sum_{a \in A'} |\nu_1(a) - \nu_2(a)|.$$

Here $A'$ can be taken as $A^*(s_0;R_0) \cup \{a_2\} = A^*(s_0;R_0)$, since $a_2 \in A^*(s_0;R_0)$.
$$\begin{aligned}
d_{TV}(\mu_n, \mu_0) &= \frac{1}{2} \sum_{a \in A^*(s_0;R_0)} |\mu_n(a) - \mu_0(a)| \\
&= \frac{1}{2} \Big( |\mu_n(a_2) - \mu_0(a_2)| + \sum_{a \in A^*(s_0;R_0),\, a \neq a_2} |\mu_n(a) - \mu_0(a)| \Big) \\
&= \frac{1}{2} \Big( |1 - 1/m| + \sum_{a \in A^*(s_0;R_0),\, a \neq a_2} |0 - 1/m| \Big).
\end{aligned}$$

There are $m-1$ terms in the sum, so

$$d_{TV}(\mu_n, \mu_0) = \frac{1}{2} \Big( \frac{m-1}{m} + (m-1)\cdot\frac{1}{m} \Big) = \frac{1}{2}\cdot\frac{2(m-1)}{m} = \frac{m-1}{m}.$$

Since $m \ge 2$, we have $(m-1)/m \ge 1/2$. Thus, for $n$ sufficiently large, $d_{TV}(\pi^*_{R_n}(\cdot \mid s_0), \pi^*_{R_0}(\cdot \mid s_0)) = (m-1)/m$. As $(m-1)/m$ does not tend to 0 as $n \to \infty$ (it is a constant $\ge 1/2$), the sequence of policies $\pi^*_{R_n}(\cdot \mid s_0)$ does not converge to $\pi^*_{R_0}(\cdot \mid s_0)$ in Total Variation distance.
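The value $(m-1)/m$ is easy to verify numerically. This small sketch (our own illustration, not from the paper) compares the uniform tie-breaking policy over $m$ optimal actions with the Dirac policy that the perturbed reward selects, using exact rational arithmetic:

```python
from fractions import Fraction

def tv_distance(p, q):
    """Total variation distance between two finite distributions given as dicts."""
    support = set(p) | set(q)
    return sum(abs(p.get(a, Fraction(0)) - q.get(a, Fraction(0))) for a in support) / 2

for m in [2, 3, 10]:
    uniform = {a: Fraction(1, m) for a in range(m)}  # pi*_{R_0}(.|s0): uniform over m ties
    dirac = {0: Fraction(1)}                         # pi*_{R_eps}(.|s0): all mass on a2
    assert tv_distance(uniform, dirac) == Fraction(m - 1, m)
    print(m, tv_distance(uniform, dirac))            # e.g. 2 1/2
```

For $m = 2$ the jump is already $1/2$, and it approaches 1 as the degeneracy grows.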
Therefore, the map $R \mapsto \pi^*_R(\cdot \mid s_0)$ is discontinuous at $R_0$. ∎

Remark 2.11. This analysis highlights that the mapping from rewards to optimal policies is stable (continuous) only under strong structural conditions ensuring the uniqueness of optimal actions. In the absence of such conditions, the mapping is generally unstable, and small perturbations of the reward function can lead to abrupt changes in the optimal policy. This has significant implications for the robustness and practical implementation of reinforcement learning algorithms.

3 RL for LLMs With a Single Reward Model

The theoretical framework for the continuity of the reward-policy map has significant implications for the training and behavior of large language models (LLMs), especially when they are fine-tuned with reinforcement learning (RL) such as Reinforcement Learning from Human Feedback (RLHF) (see, e.g., Ouyang et al. (2022); Bai et al. (2022); Lee et al. (2023)) or Reinforcement Learning with Verifiable Rewards (RLVR) (see, e.g., Guo et al. (2025); Su et al. (2025)). In this section, we frame LLM text generation as an MDP and discuss how the continuity (or lack thereof) of the optimal policy with respect to the reward function can explain observed behaviors and challenges in LLM alignment.

3.1 Framing LLM Text Generation as an MDP

We model the sequential text generation process of an LLM as an infinite-horizon discounted Markov Decision Process (MDP) defined by the tuple $(S, A, P, R, \gamma)$. A rigorous analysis requires establishing a topology on the state and action spaces. The action space $A$ is the model's finite vocabulary, and the state space $S$ comprises the finite set of all possible token sequences up to a maximum length $T_{\max}$. A fundamental property of any finite set endowed with a metric is that the resulting metric space is compact.
Therefore, the assumption that $A$ and $S$ are compact metric spaces is automatically satisfied for any choice of metric. This assumption, while a necessary precondition for the general theory, thus poses no practical constraint in the LLM setting. A more subtle but crucial point arises from the topology of these finite spaces. Any metric on a finite set induces the discrete topology, in which every subset is open, and a direct consequence is that any function from a space with the discrete topology to any other metric space is continuous. Thus, the assumption that the reward function $R(s,a)$ is continuous on the product space $(S \times A, d)$ is also automatically satisfied for any reward function. In the finite LLM setting, then, this assumption is not a restrictive simplification of reality but a direct consequence of the problem's structure. The remaining components of the MDP follow naturally. The transition kernel $P$ is deterministic, since the next state is formed by concatenating the current state with the chosen action token. Finally, the policy $\pi(a \mid s)$ is embodied by the LLM, and the goal of reinforcement learning is to find an optimal policy $\pi^*_R$ for a given reward $R$. This precise framing reveals that the source of policy instability does not lie in a potential failure of continuity of the reward function on its domain. Instead, the analytical framework's power hinges on how perturbations in the reward function's values propagate through the Bellman operator to the Q-function, and, critically, on how the discontinuous $\operatorname{argmax}$ operator acts on this Q-function. This formulation therefore shifts the focus from the topological properties of the state-action space, which are trivially satisfied, to the functional properties of the Bellman and $\operatorname{argmax}$ operators.
This provides a rigorous foundation for analyzing policy stability not as a failure of continuity on the state space, but as a consequence of the inherent discontinuity of optimization over a discrete action set.

3.2 Theoretical Implications of the Reward-Policy Map

The stability results for the underlying value functions provide a foundation for our analysis. Proposition 2.4 establishes that the optimal action-value function $Q^*_R$ is Lipschitz continuous with respect to the reward function $R$. This suggests a degree of robustness at the value level: small changes to the reward model lead to proportionally small and controlled changes in the optimal Q-values. Furthermore, Lemma 2.5 shows that the set of optimal actions $A^*(s;R)$ is an upper hemi-continuous (u.h.c.) correspondence. This property provides a form of "outer" stability: it guarantees that the set of best next tokens will not suddenly expand to include actions that were previously far from optimal, and it ensures that any convergent sequence of optimal actions (for a converging sequence of reward functions) has its limit within the new set of optimal actions. However, it provides no guarantee that every action that was originally optimal remains in the optimal set after a small perturbation of the reward function. A critical consequence of a correspondence being only upper hemi-continuous is that the set of optimal actions $A^*(s;R)$ may abruptly shrink: a slight change in $R$ can cause previously optimal actions to "disappear" by becoming strictly suboptimal. This lack of preservation, which corresponds to a failure of lower hemi-continuity (l.h.c.), is a primary driver of the policy instabilities central to our analysis. This is precisely the core issue: while the value function enjoys robust stability guarantees, these guarantees do not propagate to the policy level.
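The gap between value-level and policy-level stability can be made concrete with a toy decoding example. The sketch below is our own illustration (the tokens and Q-values are invented, not from the paper): two continuations are exactly tied under the reward model, so an $\varepsilon$-sized change in values, invisible at the Q-level, flips the greedy token.

```python
import numpy as np

# Toy one-step decoding picture: two continuations tied under the reward model.
vocab = ["Paris", "paris", "London", "banana"]
q = np.array([2.0, 2.0, 1.5, -1.0])   # degenerate optimum: "Paris" vs "paris"

def greedy(q):
    """Greedy token selection from per-token Q-values."""
    return vocab[int(np.argmax(q))]

eps = 1e-6
q_pert = q + eps * np.array([0.0, 1.0, 0.0, 0.0])  # reward model nudged by eps

print(np.abs(q_pert - q).max())         # value-level change: eps (tiny)
print(greedy(q), "->", greedy(q_pert))  # policy-level change: the token flips
```

The value perturbation is bounded by $\varepsilon$, consistent with Lipschitz stability of $Q^*$, yet the selected action changes discretely: exactly the failure of l.h.c. described above.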
The potential for the optimal action set to abruptly shrink means that any policy, which is fundamentally a selection from this set, is inherently fragile. The stability of the policy itself therefore becomes a far more delicate and critical property for predictable behavior. The continuity of the reward-policy map $M_{RL}: R \mapsto \pi^*_R$ depends heavily on the structure of the optimal action set.

Conditions for Policy Continuity. As established in Theorem 2.7, if the optimal action is unique for a given reward function $R_0$ and for all reward functions in its vicinity, then the mapping to this deterministic optimal policy is continuous. In an LLM context, this stable regime means that small adjustments to the reward model lead only to minor, predictable changes in text generation.

Policy Discontinuities due to Non-Unique Optimal Actions. The uniqueness condition is strong and often violated in language generation, where multiple words or phrases can be equally valid continuations. When multiple optimal actions exist, the policy map becomes prone to discontinuities, as shown in Propositions 2.9 and 2.10. A slight perturbation of the reward function can act as a tie-breaker, causing a deterministic policy to abruptly switch its choice of action, or a stochastic policy to drastically shift its probability mass. This theoretical instability provides a formal basis for brittle behaviors observed in practice, such as sudden changes in style or safety profile in response to minor changes in the reward model or prompt.

3.3 Analysis of Alignment and Reasoning Phenomena

The theoretical framework can be directly applied to formalize and explain specific, observable behaviors in aligned or reasoning LLMs. In particular, we now use our results to analyze two fundamental challenges in RL-based LLM training.
First, we examine phenomena arising from incomplete reward specifications, formalizing how a policy can be perfectly rational for its given objective yet suboptimal for the true, intended goal. Second, we analyze behaviors stemming from the degeneracy of optima, where multiple distinct policies are equally optimal, and show how an additive reward component acts as a tie-breaker that resolves this degeneracy and enforces more specific behaviors.

3.3.1 The "Clever Slacker": Suboptimality from Incomplete Rewards

A common challenge in alignment and reasoning is the "clever slacker" phenomenon, in which an LLM produces factually correct responses that nevertheless ignore user-specified constraints or instructions, or arrives at answers via deceptive shortcuts. This can be modeled as the agent optimizing an incomplete reward function. The following proposition proves that such a policy is strictly suboptimal under the complete, desired reward objective.

Proposition 3.1 (Suboptimality from Incomplete Rewards (General Form)). Let $R_{\mathrm{train}} \in \mathcal{R}$ be the training reward function and $R_{\mathrm{missing}} \in \mathcal{R}$ be the missing reward component, so the true reward is $R_{\mathrm{true}} = R_{\mathrm{train}} + R_{\mathrm{missing}}$. Let $\pi^*_{\mathrm{train}} \in \Pi^*(R_{\mathrm{train}})$ be any optimal policy for $R_{\mathrm{train}}$, and let $\mu$ be the initial state distribution. Suppose there exist a state $s_0 \in S$ and an action $a_2 \in A$ satisfying:

1. Action Optimality: $a_2$ is optimal under the training reward, i.e., $a_2 \in A^*(s_0; R_{\mathrm{train}})$.

2. Positive Advantage for the Missing Reward: the action $a_2$ has a strictly positive advantage under the missing reward component when evaluated with the policy $\pi^*_{\mathrm{train}}$:

$$A^{\pi^*_{\mathrm{train}}}_{R_{\mathrm{missing}}}(s_0, a_2) > 0,$$

where the advantage is defined as $A^{\pi}_{R}(s,a) := Q^{\pi}_{R}(s,a) - V^{\pi}_{R}(s)$.

3. State Reachability: the state $s_0$ has non-zero probability of being visited at some step when starting from the initial state distribution $\mu$ and following the policy $\pi^*_{\mathrm{train}}$.

Then the policy $\pi^*_{\mathrm{train}}$ is strictly suboptimal for the true reward function $R_{\mathrm{true}}$.

Proof. To prove that $\pi^*_{\mathrm{train}}$ is strictly suboptimal for $R_{\mathrm{true}}$, we invoke the Policy Improvement Theorem (see, e.g., Sutton and Barto (1998)): it suffices to exhibit an action whose value strictly exceeds the state value achieved by the policy. We must therefore show

$$Q^{\pi^*_{\mathrm{train}}}_{R_{\mathrm{true}}}(s_0, a_2) > V^{\pi^*_{\mathrm{train}}}_{R_{\mathrm{true}}}(s_0).$$
Expanding the advantage condition (Condition 2):

$$Q^{\pi^*_{\mathrm{train}}}_{R_{\mathrm{missing}}}(s_0, a_2) - V^{\pi^*_{\mathrm{train}}}_{R_{\mathrm{missing}}}(s_0) > 0.$$

Using the linearity of the value function in the reward for a fixed policy ($V^{\pi}_{R_1+R_2} = V^{\pi}_{R_1} + V^{\pi}_{R_2}$), we can write

$$V^{\pi^*_{\mathrm{train}}}_{R_{\mathrm{true}}}(s_0) = V^{\pi^*_{\mathrm{train}}}_{R_{\mathrm{train}}}(s_0) + V^{\pi^*_{\mathrm{train}}}_{R_{\mathrm{missing}}}(s_0).$$

Since $\pi^*_{\mathrm{train}}$ is optimal for $R_{\mathrm{train}}$, $V^{\pi^*_{\mathrm{train}}}_{R_{\mathrm{train}}}(s_0) = V^*_{\mathrm{train}}(s_0)$, so

$$V^{\pi^*_{\mathrm{train}}}_{R_{\mathrm{true}}}(s_0) = V^*_{\mathrm{train}}(s_0) + V^{\pi^*_{\mathrm{train}}}_{R_{\mathrm{missing}}}(s_0).$$
Now consider the Q-value of taking action $a_2$:
$$Q^{\pi^*_{\text{train}}}_{R_{\text{true}}}(s_0, a_2) = Q^{\pi^*_{\text{train}}}_{R_{\text{train}}}(s_0, a_2) + Q^{\pi^*_{\text{train}}}_{R_{\text{missing}}}(s_0, a_2).$$
Since $\pi^*_{\text{train}}$ is optimal for $R_{\text{train}}$, $Q^{\pi^*_{\text{train}}}_{R_{\text{train}}} = Q^*_{\text{train}}$. By Condition 1, $a_2 \in A^*(s_0; R_{\text{train}})$, so $Q^*_{\text{train}}(s_0, a_2) = V^*_{\text{train}}(s_0)$. Thus
$$Q^{\pi^*_{\text{train}}}_{R_{\text{true}}}(s_0, a_2) = V^*_{\text{train}}(s_0) + Q^{\pi^*_{\text{train}}}_{R_{\text{missing}}}(s_0, a_2).$$
We can now directly compare $Q^{\pi^*_{\text{train}}}_{R_{\text{true}}}(s_0, a_2)$ with $V^{\pi^*_{\text{train}}}_{R_{\text{true}}}(s_0)$.
The inequality $Q^{\pi^*_{\text{train}}}_{R_{\text{true}}}(s_0, a_2) > V^{\pi^*_{\text{train}}}_{R_{\text{true}}}(s_0)$ holds if and only if
$$V^*_{\text{train}}(s_0) + Q^{\pi^*_{\text{train}}}_{R_{\text{missing}}}(s_0, a_2) > V^*_{\text{train}}(s_0) + V^{\pi^*_{\text{train}}}_{R_{\text{missing}}}(s_0).$$
This simplifies to $Q^{\pi^*_{\text{train}}}_{R_{\text{missing}}}(s_0, a_2) > V^{\pi^*_{\text{train}}}_{R_{\text{missing}}}(s_0)$, which is exactly Condition 2 in its expanded form. Since the condition holds by hypothesis, $\pi^*_{\text{train}}$ is not an optimal policy for $R_{\text{true}}$. Furthermore, because Condition 3 ensures that state $s_0$ is reachable from the initial state distribution $\mu$, this local improvement opportunity at $s_0$ yields a strict increase in the overall expected return from the start states. Therefore, the policy $\pi^*_{\text{train}}$ is strictly suboptimal for the true reward function $R_{\text{true}}$.
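A minimal numerical illustration of the proposition, under hypothetical rewards: a one-step question-answering task where correctness is rewarded ($R_{\text{train}}$) but instruction-following, the missing component, is not. All values below are invented for the sketch.

```python
# One decision state s0 with two episode-ending actions.
# a1: correct answer that ignores a formatting instruction.
# a2: correct answer that follows the formatting instruction.
R_train   = {"a1": 1.0, "a2": 1.0}   # correctness-only training reward
R_missing = {"a1": 0.0, "a2": 0.5}   # unrewarded instruction adherence
R_true    = {a: R_train[a] + R_missing[a] for a in R_train}

# Both actions tie under R_train; suppose training happened to select a1.
pi_train = "a1"

# Condition 1: a2 is also optimal for R_train.
assert R_train["a2"] == max(R_train.values())

# Condition 2: positive advantage of a2 under R_missing, evaluated with pi_train.
V_missing = R_missing[pi_train]
A_missing_a2 = R_missing["a2"] - V_missing
print(A_missing_a2)                         # 0.5 > 0

# Conclusion: pi_train is strictly suboptimal for R_true.
print(R_true[pi_train], "<", R_true["a2"])  # 1.0 < 1.5
```

With a single state, Condition 3 (reachability of $s_0$) holds trivially, and the strict gap $R_{\text{true}}(a_1) < R_{\text{true}}(a_2)$ is exactly the suboptimality the proposition predicts.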
∎

This formal result provides a unified lens for several critical alignment failures, which can be understood as policies that are optimal for an incomplete training objective but suboptimal for the true, desired objective. We analyze two such key phenomena below.

Instruction Following Failure. A primary application of our proposition is in explaining why models often fail to adhere to user-specified constraints, a behavior sometimes termed the "clever slacker" phenomenon. This occurs when the training reward $R_{\text{train}}$ primarily measures a core task requirement, such as factual correctness, while ignoring secondary instructions like output format, style, or negative constraints. Adherence to these instructions constitutes the missing reward component $R_{\text{missing}}$. If a model can produce a factually correct answer both with and without following the constraints, both paths may be equally optimal under $R_{\text{train}}$. The policy is therefore not "disobedient"; it is perfectly rational in choosing the path of least resistance to maximize the objective it was given. This behavior is precisely captured by our proposition, which shows that such a policy is strictly suboptimal under the true, complete reward function $R_{\text{true}} = R_{\text{train}} + R_{\text{missing}}$.

Spurious Reasoning. A more subtle and complex failure mode is spurious reasoning, where the model produces a correct final answer preceded by a chain of thought that is logically flawed or not causally linked to the result. This, too, is a consequence of an incomplete Outcome-Based Reward (OBR). Here, the training reward $R_{\text{train}} = R_{\text{outcome}}$ values only the correctness of the final answer. The crucial missing component, $R_{\text{missing}} = R_{\text{process}}$, should reward the logical validity and faithfulness of the reasoning process itself.
Because any reasoning path leading to the correct outcome is equally valued, the model may discover a low-effort strategy: first retrieve or guess the answer, then generate a syntactically plausible but non-causal justification post hoc. This policy, which fabricates a reasoning process, is another manifestation of the "clever slacker". While it perfectly optimizes the outcome-based objective, it is demonstrably suboptimal under the true objective that values faithful reasoning, as explained by our formal result.

3.3.2 The Tie-Breaker Effect: Resolving Degeneracy with Additive Rewards

A key challenge in reward design is the degeneracy of optimal policies, which arises when a primary objective, such as accuracy, deems multiple behaviorally distinct policies equally optimal. This allows undesirable behaviors, such as stylistic inconsistency or inefficient reasoning, to emerge. Introducing an additional reward component is a form of reward engineering designed to break this degeneracy. The following proposition formalizes how such an additive reward can function as a tie-breaker to enforce a specific, desired policy.

Proposition 3.2 (The Tie-Breaker Effect). Let Assumptions 2.1-2.3 hold. Let $R_0 \in \mathcal{R}$ be a reward function, and suppose that for some state $s_0 \in S$ there exist at least two distinct optimal actions $a_1, a_2 \in A^*(s_0; R_0)$. Then for any $\varepsilon > 0$, there exists a perturbed reward function $R' \in \mathcal{R}$ such that $\|R' - R_0\|_\infty \le \varepsilon$ and $Q^*_{R'}(s_0, a_2) > Q^*_{R'}(s_0, a_1)$, i.e., $a_2$ becomes strictly optimal over $a_1$.

Proof. Let $Q_0 = Q^*_{R_0}$. Since $a_1$ and $a_2$ are both optimal at $s_0$, $Q_0(s_0, a_1) = Q_0(s_0, a_2)$.

1. Bump in Q-space.
Choose a continuous bump function as in Proposition 2.9, $\varphi \in C(S \times A)$ with $0 \le \varphi \le 1$, $\varphi(s_0, a_2) = 1$, and $\varphi(s_0, a_j) = 0$ for any other $a_j \in A^*(s_0; R_0)$. Define
$$Q_{\varepsilon'}(s, a) = Q_0(s, a) + \varepsilon'\, \varphi(s, a),$$
where $\varepsilon' = \varepsilon / (1 + \gamma)$, so that $\|Q_{\varepsilon'} - Q_0\|_\infty = \varepsilon' \to 0$.

2. Invert the Bellman operator. For any continuous $Q$, define
$$R_Q(s, a) = Q(s, a) - \gamma \int_S \max_{a'} Q(s', a')\, P(ds' \mid s, a).$$
One checks that $T_{R_Q}(Q) = Q$, and by contraction $Q$ is the unique $Q^*_{R_Q}$. Hence for each $\varepsilon > 0$, let $R_\varepsilon = R_{Q_0 + \varepsilon' \varphi}$, so that $Q^*_{R_\varepsilon} = Q_0 + \varepsilon' \varphi$.

3. Convergence of rewards. Since $R_Q$ depends continuously (in sup-norm) on $Q$ and $\|Q_0 + \varepsilon' \varphi - Q_0\|_\infty = \varepsilon'$, we have $\|R_\varepsilon - R_0\|_\infty \le \varepsilon'(1 + \gamma) = \varepsilon \to 0$.

4. Tie-breaking at $s_0$. Under $Q^*_{R_\varepsilon} = Q_0 + \varepsilon' \varphi$,
$$Q^*_{R_\varepsilon}(s_0, a_2) - Q^*_{R_\varepsilon}(s_0, a_1) = \big[Q_0(s_0, a_2) - Q_0(s_0, a_1)\big] + \varepsilon' \big[\varphi(s_0, a_2) - \varphi(s_0, a_1)\big] = 0 + \varepsilon' > 0.$$
Thus $a_2$ is strictly preferred over $a_1$. This completes the proof.
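The two-step construction (bump in Q-space, then inverse Bellman) can be checked numerically on a small finite MDP. The sketch below is illustrative: the MDP, the tied actions, and all numbers are hypothetical, and `q_star` is plain value iteration.

```python
import numpy as np

def q_star(R, P, gamma, iters=2000):
    # Value iteration on the Bellman optimality operator T(Q) = R + gamma * P max_a' Q.
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

gamma, s0 = 0.9, 0
# Two states, two actions; at s0 both actions have identical reward and
# identical dynamics, so A*(s0; R0) = {a1, a2} is degenerate by symmetry.
P = np.array([[[0.5, 0.5], [0.5, 0.5]],
              [[1.0, 0.0], [0.0, 1.0]]])   # P[s, a] = next-state distribution
R0 = np.array([[1.0, 1.0],
               [0.2, 0.7]])
Q0 = q_star(R0, P, gamma)
assert Q0[s0, 0] == Q0[s0, 1]              # exact Q-tie at s0

# Step 1: bump the tied action a2 (index 1) in Q-space by eps' = eps / (1 + gamma).
eps = 0.1
eps_p = eps / (1 + gamma)
phi = np.zeros_like(R0); phi[s0, 1] = 1.0
Q_eps = Q0 + eps_p * phi

# Step 2: invert the Bellman operator, R_Q = Q - gamma * P max_a' Q, so that
# Q_eps is the optimal Q-function of the perturbed reward R_eps.
R_eps = Q_eps - gamma * P @ Q_eps.max(axis=1)

print(np.abs(R_eps - R0).max() <= eps)       # ||R_eps - R0||_inf <= eps -> True
print(q_star(R_eps, P, gamma)[s0].argmax())  # a2 is now strictly optimal -> 1
```

Re-solving the MDP for `R_eps` confirms the proposition: an arbitrarily small additive reward breaks the tie and makes $a_2$ uniquely optimal at $s_0$.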
∎

The "tie-breaker" effect formalized in Proposition 3.2 is a powerful tool for reward engineering, explaining how an additive reward $\Delta R_0 = R' - R_0$ can resolve the degeneracy of optima and enforce specific desired behaviors. We analyze two key applications below: enforcing format adherence and promoting efficient reasoning.

Enforcing Format and Style Consistency. Consider a scenario where multiple responses are equally valid under a base reward function $R_0$ that primarily measures correctness. For example, an answer could be given in free-form text or as a structured JSON object. Both might be equally optimal, leading to a large optimal policy set $\Pi^*(R_0)$. To enforce a specific format, one can introduce a small additive "bonus" reward $\Delta R_0 = R' - R_0$ for responses that adhere to the desired structure. This bonus acts as a tie-breaker: by our proposition, it makes the policy that generates the correctly formatted output strictly optimal. This can cause the model's behavior to "snap" discontinuously to the preferred style, rather than shifting gradually, which explains why such changes can be abrupt.

Promoting Efficient Reasoning. The tendency of models to generate unnecessarily verbose responses is another consequence of a degenerate optimal policy set under an accuracy-only reward, $R_0 = R_{\text{acc}}$. Short and long reasoning paths that reach the correct answer are equally optimal. To solve this, one can introduce a length penalty, a negative additive reward, which also functions as a powerful tie-breaker. Among all paths that achieve the same accuracy reward, the most efficient path now has the strictly highest value, as it incurs the lowest total penalty. This modification collapses the optimal policy set $\Pi^*(R_0 + \Delta R_0)$ to contain only the most concise policies.
By making the efficient policy uniquely optimal, this approach clarifies the learning signal and promotes a more efficient reasoning process.

3.4 Broader Implications and Mitigation Strategies

The analysis throughout this work highlights a fundamental distinction between applying reinforcement learning to LLMs and to traditional domains. In tasks with objective, well-defined rewards, such as the game of Go, achieving the optimal value $V^*$ is a sufficient goal; the policy is merely a means to an end. This paradigm shifts fundamentally for LLMs. The policy's output, i.e., the sequence of generated tokens, is the final product consumed by the user, and the reward function is not a ground-truth oracle but an imperfect proxy for nuanced human preferences (RLHF) or a rule-based function (RLVR). Consequently, the policy $\pi$ itself must become a primary object of analysis. The continuity of the reward-policy map thus emerges as a critical diagnostic for the robustness and trustworthiness of the aligned model. In essence, for language models, the policy's behavior is not merely a means to an end; it is, in large part, the end itself.

Given this critical role of the policy, the potential for discontinuities established by our theoretical results underscores a central challenge in LLM alignment. The brittleness often observed in model behavior can be seen as a direct consequence of these instabilities, which arise when multiple generation strategies are near-optimal under a flawed or incomplete reward model. This highlights the importance of carefully engineering the learning objective to mitigate these issues. A primary approach is to introduce regularization techniques that encourage policy smoothness and resolve the degeneracy of optima. In practice, the most prevalent form of such regularization is fine-tuning a pre-trained base model $\pi_{\text{base}}$ with a KL-divergence penalty.
The objective is thus not merely to maximize the reward $R$, but to find a policy $\pi_{\text{RL}}$ that balances reward maximization with fidelity to the base model, governed by the objective:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^t \Big( R(s_t, a_t) - \beta\, D_{\mathrm{KL}}\big(\pi(\cdot \mid s_t) \,\|\, \pi_{\text{base}}(\cdot \mid s_t)\big) \Big) \right] \quad (5)$$
$$= \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^t \left( R(s_t, a_t) - \beta \log \frac{\pi(a_t \mid s_t)}{\pi_{\text{base}}(a_t \mid s_t)} \right) \right].$$
This KL-divergence term serves a dual purpose. It preserves the vast linguistic knowledge of $\pi_{\text{base}}$, preventing catastrophic forgetting. Furthermore, as our analysis has shown, it acts as a crucial tie-breaker when the optimal policy set $\Pi^*(R)$ is large, guiding the algorithm towards the optimal policy that is closest to the base model's inherent capabilities and stylistic biases. This provides a practical, albeit constrained, mechanism for managing the degeneracy of optima that we have identified. Alongside regularizing towards a base policy, another powerful mitigation strategy is to directly promote policy stochasticity via entropy regularization, which we formalize in Section 4.4.

4 LLMs Trained with Multiple Specialized Reward Models

Large Language Models (LLMs) are increasingly trained to handle diverse tasks and exhibit nuanced behaviors. A sophisticated approach to achieving this involves using multiple specialized reward models, each tailored to a specific class of data or desired capability. This section extends our continuity analysis to such a multi-reward RL training paradigm across diverse domains (see, e.g., Liang et al. (2025); Cheng et al. (2025); Su et al. (2025)).
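A minimal sketch of the per-token shaped reward implied by Eq. (5), using the sampled-trajectory form in the second line, where the KL term is replaced by the log-ratio of the sampled token. The trajectory and log-probabilities below are illustrative numbers, not output of any real model.

```python
import numpy as np

def kl_shaped_rewards(task_rewards, logp_policy, logp_base, beta=0.1):
    # Per-token shaped reward: r_t - beta * log(pi(a_t|s_t) / pi_base(a_t|s_t)).
    # Its expectation over a_t ~ pi recovers the KL penalty of Eq. (5).
    return task_rewards - beta * (logp_policy - logp_base)

# Toy 4-token trajectory with a sparse task reward at the final token.
task_rewards = np.array([0.0, 0.0, 0.0, 1.0])
logp_policy  = np.array([-0.2, -0.9, -0.1, -0.4])  # log-probs of sampled tokens
logp_base    = np.array([-0.5, -1.0, -0.1, -1.2])  # same tokens under pi_base

shaped = kl_shaped_rewards(task_rewards, logp_policy, logp_base, beta=0.1)
# Tokens where the policy has drifted above the base model's probability
# (logp_policy > logp_base) are penalized; matching tokens are untouched.
print(shaped)  # approximately [-0.03, -0.01, 0.00, 0.92]
```

Note the tie-breaking behavior: among trajectories with equal task reward, the one staying closest to $\pi_{\text{base}}$ accumulates the smallest penalty and becomes strictly preferred.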
4.1 Framework: Single LLM, Multiple Data Classes and Reward Models

Consider a scenario where a single LLM policy $\pi: S \to \mathcal{P}(A)$ is trained to perform well across $N$ distinct classes of data or tasks, denoted $k \in \{1, \ldots, N\}$. The framework is shown in Figure 2.

- Data Classes ($\mathcal{D}_k$): Each class $\mathcal{D}_k$ represents a distribution of initial states (e.g., prompts) specific to a certain domain or task (e.g., mathematical reasoning, coding, question answering, safety alignment).
- Specialized Reward Models ($R_k$): For each data class $\mathcal{D}_k$, a corresponding reward model provides a reward function $R_k: S \times A \to \mathbb{R}$. The function $R_k(s, a)$ evaluates the quality of action $a$ (generating a token) in state $s$ (the current sequence) specifically for task $k$. We assume each $R_k \in \mathcal{R} = C(S \times A)$. Let $\boldsymbol{R} = (R_1, \ldots, R_N)$ denote the tuple of these reward functions. The space of such tuples is $\mathcal{R}^N$, equipped with the norm $\|\boldsymbol{R} - \boldsymbol{R}'\|_{\infty,N} = \max_{1 \le k \le N} \|R_k - R_k'\|_\infty$.
- Training Objective: The LLM policy $\pi$ is trained to optimize a global objective that aggregates performance across all $N$ task-reward pairs, often formulated as maximizing a weighted sum of expected discounted rewards. During training, if an initial state $s_0$ is sampled from $\mathcal{D}_k$ (with probability $p_k$, where $\sum_{k=1}^N p_k = 1$ and we assume $p_k > 0$ for all $k$), then the subsequent rewards for the trajectory generated by $\pi$ are drawn from $R_k$.
The overall objective is to find a policy $\pi^*_{\boldsymbol{R}}$ that maximizes
$$J(\pi; \boldsymbol{R}, \{\mathcal{D}_k\}, \{p_k\}) = \sum_{k=1}^{N} p_k\, \mathbb{E}_{s_0 \sim \mathcal{D}_k,\, \tau \sim \pi(\cdot \mid s_0)}\!\left[ \sum_{t=0}^{\infty} \gamma^t R_k(s_t, a_t) \right]. \quad (6)$$
Here, $\tau \sim \pi(\cdot \mid s_0)$ denotes a trajectory generated by policy $\pi$ starting from $s_0$. The state space $S$, action space $A$, transition kernel $P$, and discount factor $\gamma$ are defined as in Section 2 (satisfying Assumptions 2.1, 2.2, and 2.3). The single policy $\pi$ must learn to adapt its behavior based on the implicit context of the input state $s$, even though it may not explicitly receive the class index $k$ as input.

Figure 2: The framework for training a single LLM policy with multiple data classes and specialized reward models. Prompts from data classes $\mathcal{D}_k$ are used to generate trajectories (rollouts) with the policy $\pi$. Each trajectory is evaluated by its corresponding reward model $R_k$. The rewards are aggregated into a global objective function $J(\pi)$, which is then used to update the policy via backpropagation.

4.2 Analyzing Policies Derived from State-Dependent Effective Rewards

The global objective $J(\pi)$ in Eq. (6) averages performance across episodes governed by distinct, episode-specific reward functions $R_k$. A single policy $\pi(a \mid s)$, lacking explicit knowledge of $k$, must infer the context from $s$. A direct characterization of the maximizer $\pi^*_{\boldsymbol{R}}$ of $J(\pi)$ via a simple Bellman optimality equation (as in Section 2) is generally not available.
To apply the continuity analysis tools, we focus on a structured model of how an LLM might address this multi-task challenge: by forming an internal, state-dependent effective reward function $R_{\text{eff}}$ that aggregates the specialized rewards based on the current context:
$$R_{\text{eff}}(s, a; \boldsymbol{R}) = \sum_{k=1}^{N} w_k(s) R_k(s, a). \quad (7)$$
Here, $w_k: S \to [0, 1]$ are continuous weighting functions satisfying $\sum_{k=1}^N w_k(s) = 1$ for all $s \in S$. These weights model the LLM's assessment of the relevance of task $k$ given state $s$. The following analysis assumes that the weighting functions $w_k(s)$ are given. Crucially, the $w_k(s)$ are assumed to be fixed and do not change as $\boldsymbol{R}$ is perturbed; if they were themselves functions of $\boldsymbol{R}$, a more comprehensive stability analysis would be required. The policy subject to our continuity analysis, denoted $\pi^{*,\text{eff}}_{\boldsymbol{R}}$, is the optimal policy for the standard MDP defined by this $R_{\text{eff}}$. It is thus greedy with respect to the optimal action-value function $Q^*_{\text{eff}}(s, a; \boldsymbol{R})$, the unique fixed point of the Bellman optimality equation
$$(T_{\text{eff}, \boldsymbol{R}}\, Q)(s, a) = R_{\text{eff}}(s, a; \boldsymbol{R}) + \gamma \int_S \max_{a' \in A} Q(s', a')\, P(ds' \mid s, a). \quad (8)$$
The optimal policy $\pi^{*,\text{eff}}_{\boldsymbol{R}}$ has its support contained within $A^*_{\text{eff}}(s; \boldsymbol{R}) = \operatorname{argmax}_{a \in A} Q^*_{\text{eff}}(s, a; \boldsymbol{R})$.
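Eq. (7) is a state-weighted mixture, which is easy to sketch for finite $S$ and $A$. The two "task" rewards and the context weights below are hypothetical; the last lines also check the 1-Lipschitz bound of Lemma 4.2 on this example.

```python
import numpy as np

def effective_reward(R_list, weights):
    # R_eff(s, a) = sum_k w_k(s) * R_k(s, a)  (Eq. 7), with state-dependent
    # weights whose rows sum to one.
    # R_list: list of (nS, nA) arrays; weights: (nS, N) array.
    R = np.stack(R_list, axis=-1)            # shape (nS, nA, N)
    return np.einsum('san,sn->sa', R, weights)

# Hypothetical specialized rewards over 3 states and 2 actions.
R_math = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
R_code = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
# Hypothetical context weights: state 0 "looks like" math, state 2 like code.
w = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])

R_eff = effective_reward([R_math, R_code], w)
print(R_eff[0])  # state 0 is dominated by the math reward: [0.9, 0.1]

# Lemma 4.2(2): the map R -> R_eff is 1-Lipschitz in the sup norm.
R_eff2 = effective_reward([R_math + 0.05, R_code - 0.02], w)
print(np.abs(R_eff2 - R_eff).max() <= 0.05)  # max_k ||delta_k|| = 0.05 -> True
```

Because the weights form a convex combination at every state, any perturbation of the tuple $\boldsymbol{R}$ passes through to $R_{\text{eff}}$ without amplification, which is exactly the Lipschitz constant of 1 in the lemma.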
Remark 4.1 (On the Relationship between $\pi^{*,\text{eff}}_{\boldsymbol{R}}$ and $J(\pi)$). The policy $\pi^{*,\text{eff}}_{\boldsymbol{R}}$ analyzed here is optimal for the constructed $R_{\text{eff}}$. The alignment of $\pi^{*,\text{eff}}_{\boldsymbol{R}}$ with $\pi^*_{\boldsymbol{R}}$ (the maximizer of $J(\pi)$) depends on how well the $w_k(s)$ are chosen or learned. If, for instance, the initial state distributions were identical ($\mathcal{D}_k = \mathcal{D}_{\text{init}}$ for all $k$) and one chose constant weights $w_k(s) = p_k$, then $J(\pi)$ simplifies to $\mathbb{E}_{s_0 \sim \mathcal{D}_{\text{init}}}[V^\pi_{R_{\text{eff}}}(s_0)]$, and $\pi^{*,\text{eff}}_{\boldsymbol{R}}$ would indeed be $\pi^*_{\boldsymbol{R}}$. Our analysis in this section focuses on the stability of $\pi^{*,\text{eff}}_{\boldsymbol{R}}$ resulting from such an effective reward mechanism.

Lemma 4.2 (Properties of the Effective Reward Mapping). Let each $R_k \in \mathcal{R}$, and let $w_k: S \to [0, 1]$ be continuous functions such that $\sum_{k=1}^N w_k(s) = 1$ for all $s \in S$.
1. For any $\boldsymbol{R} \in \mathcal{R}^N$, the effective reward function $R_{\text{eff}}(\cdot, \cdot; \boldsymbol{R})$ defined in Eq. (7) is continuous on $S \times A$.
2. The mapping $\boldsymbol{R} \mapsto R_{\text{eff}}(\cdot, \cdot; \boldsymbol{R})$ is Lipschitz continuous from $(\mathcal{R}^N, \|\cdot\|_{\infty,N})$ to $(\mathcal{R}, \|\cdot\|_\infty)$ with Lipschitz constant 1:
$$\|R_{\text{eff}}(\cdot, \cdot; \boldsymbol{R}) - R_{\text{eff}}(\cdot, \cdot; \boldsymbol{R}')\|_\infty \le \|\boldsymbol{R} - \boldsymbol{R}'\|_{\infty,N}.$$
Proof.
Since each $R_k$ is continuous on $S \times A$ and each $w_k$ is continuous on $S$, the products $w_k(s) R_k(s, a)$ are continuous, and so is the finite sum $R_{\text{eff}}(s, a; \boldsymbol{R}) = \sum_{k=1}^N w_k(s) R_k(s, a)$. For any $(s, a) \in S \times A$,
$$|R_{\text{eff}}(s, a; \boldsymbol{R}) - R_{\text{eff}}(s, a; \boldsymbol{R}')| = \left| \sum_{k=1}^{N} w_k(s) \big(R_k(s, a) - R_k'(s, a)\big) \right| \le \sum_{k=1}^{N} w_k(s)\, |R_k(s, a) - R_k'(s, a)|$$
$$\le \sum_{k=1}^{N} w_k(s)\, \|R_k - R_k'\|_\infty \le \left( \sum_{k=1}^{N} w_k(s) \right) \max_{1 \le j \le N} \|R_j - R_j'\|_\infty = 1 \cdot \|\boldsymbol{R} - \boldsymbol{R}'\|_{\infty,N}.$$
Taking the supremum over $(s, a)$ on the left-hand side yields the result. ∎

4.3 Continuity of the Policy Map for the Effective Reward Model

We now analyze the continuity of $\pi^{*,\text{eff}}_{\boldsymbol{R}}$ with respect to $\boldsymbol{R}$, based on the properties of $R_{\text{eff}}$.

Proposition 4.3 (Lipschitz Stability of Effective Q-Function). Under Assumptions 2.1-2.3 and the existence of continuous $w_k(s)$ as defined above, the mapping $\boldsymbol{R} \mapsto Q^*_{\text{eff}}(\cdot, \cdot; \boldsymbol{R})$ from $(\mathcal{R}^N, \|\cdot\|_{\infty,N})$ to $(C(S \times A), \|\cdot\|_\infty)$ is Lipschitz continuous with constant $1/(1-\gamma)$.

Proof. This follows from Proposition 2.4 and Lemma 4.2.
∎

Lemma 4.4 (Upper Hemi-Continuity of Effective Argmax Correspondence). Under Assumptions 2.1-2.3 and the existence of continuous $w_k(s)$, for each fixed $s \in S$, the argmax correspondence $\Phi^{\text{eff}}_s: \mathcal{R}^N \twoheadrightarrow A$ defined by $\Phi^{\text{eff}}_s(\boldsymbol{R}) = A^*_{\text{eff}}(s; \boldsymbol{R})$ is non-empty, compact-valued, and upper hemi-continuous.

Proof. The function $Q^*_{\text{eff}}(s, a; \boldsymbol{R})$ is continuous in $(s, a)$, as it is the fixed point of $T_{\text{eff}, \boldsymbol{R}}$, which maps continuous functions to continuous functions given that $R_{\text{eff}}$ is continuous by Lemma 4.2. By Proposition 4.3, $Q^*_{\text{eff}}$ is continuous with respect to $\boldsymbol{R}$. The proof is then similar to that of Lemma 2.5. ∎

Proposition 4.5 (Continuity of Policy Map (Effective Model) under Uniqueness). Let Assumptions 2.1-2.3 hold and let continuous $w_k(s)$ exist. Let $\boldsymbol{R}_0 \in \mathcal{R}^N$. Suppose there exists an open neighborhood $\mathcal{N}(\boldsymbol{R}_0) \subset \mathcal{R}^N$ of $\boldsymbol{R}_0$ such that for all $\boldsymbol{R} \in \mathcal{N}(\boldsymbol{R}_0)$ and all $s \in S$, $A^*_{\text{eff}}(s; \boldsymbol{R})$ is a singleton, denoted $\{a^*_{\text{eff}}(s; \boldsymbol{R})\}$. Define the policy map $M^{\text{eff}}_{\text{RL}}: \mathcal{N}(\boldsymbol{R}_0) \to \mathcal{P}_{\text{det}}$ by $M^{\text{eff}}_{\text{RL}}(\boldsymbol{R})(s) = a^*_{\text{eff}}(s; \boldsymbol{R})$; this map defines the deterministic selections of $\pi^{*,\text{eff}}_{\boldsymbol{R}}$. If $\mathcal{P}_{\text{det}}$ is equipped with the topology of pointwise convergence, then $M^{\text{eff}}_{\text{RL}}$ is continuous at $\boldsymbol{R}_0$.

Proof.
The proof is analogous to that of Theorem 2.7, using $R_{\text{eff}}$ and Lemma 4.4. ∎

Proposition 4.6 (Discontinuity of Policy Map (Effective Model) under Non-Uniqueness). Let Assumptions 2.1-2.3 hold and let continuous $w_k(s)$ exist. Suppose that for $\boldsymbol{R}_0 \in \mathcal{R}^N$ there exists $s_0 \in S$ such that $A^*_{\text{eff}}(s_0; \boldsymbol{R}_0)$ is finite and contains at least two distinct actions, $a_1 \ne a_2$.
(1) Let $M^{\text{eff}}_{\text{RL}}: \mathcal{R}^N \to \mathcal{P}_{\text{det}}$ be a policy map selecting a deterministic policy $\pi^{*,\text{eff}}_{\boldsymbol{R}}(s) \in A^*_{\text{eff}}(s; \boldsymbol{R})$ (e.g., $M^{\text{eff}}_{\text{RL}}(\boldsymbol{R}_0)(s_0) = a_1$). Then $M^{\text{eff}}_{\text{RL}}$ is discontinuous at $\boldsymbol{R}_0$ under pointwise convergence.
(2) Let $M^{\text{eff}}_{\text{RL}}: \mathcal{R}^N \to \mathcal{P}$ be defined by selecting the stochastic policy $\pi^{*,\text{eff}}_{\boldsymbol{R}}$ such that for each $s \in S$, $\pi^{*,\text{eff}}_{\boldsymbol{R}}(\cdot \mid s)$ is the uniform probability distribution over the set $A^*_{\text{eff}}(s; \boldsymbol{R})$. Then the map $\boldsymbol{R} \mapsto \pi^{*,\text{eff}}_{\boldsymbol{R}}(\cdot \mid s_0)$, viewed as a map from $(\mathcal{R}^N, \|\cdot\|_{\infty,N})$ to $(\mathcal{P}, d_{\mathrm{TV}})$, is discontinuous at $\boldsymbol{R}_0$.

Proof. This proof applies the "inverse Bellman" method to the effective Q-function $Q^*_{\text{eff}}$ to demonstrate discontinuity. The core idea is to first construct a perturbed effective Q-function, $Q_{\text{eff},\varepsilon}$, that guarantees an action switch, and then construct a tuple of reward functions $\boldsymbol{R}_\varepsilon$ that is guaranteed to produce this perturbed Q-function and that converges to $\boldsymbol{R}_0$.

1.
Construct a Perturbation in the Effective Q-Function Space. Let $R_{\text{eff},0}(s, a) = \sum_{k=1}^N w_k(s) R_{k,0}(s, a)$ be the initial effective reward, and let $Q_{\text{eff},0} = Q^*_{R_{\text{eff},0}}$ be the corresponding optimal effective Q-function. By assumption, $A^*_{\text{eff}}(s_0; \boldsymbol{R}_0)$ contains at least two actions $a_1$ and $a_2$, which means $Q_{\text{eff},0}(s_0, a_1) = Q_{\text{eff},0}(s_0, a_2)$. We define a continuous "bump" function $\varphi \in C(S \times A)$ as in Proposition 2.9 such that:
- $0 \le \varphi(s, a) \le 1$ for all $(s, a)$;
- $\varphi(s_0, a_2) = 1$;
- $\varphi(s_0, a_j) = 0$ for all other actions $a_j \in A^*_{\text{eff}}(s_0; \boldsymbol{R}_0)$.
For any $\varepsilon > 0$, we define a perturbed effective Q-function
$$Q_{\text{eff},\varepsilon}(s, a) := Q_{\text{eff},0}(s, a) + \varepsilon\, \varphi(s, a).$$

2. Invert the Bellman Equation to Find the Target Effective Reward. Using the inverse Bellman mapping defined previously, we find the unique effective reward function $R_{\text{eff},\varepsilon}$ whose optimal Q-function is exactly $Q_{\text{eff},\varepsilon}$:
$$R_{\text{eff},\varepsilon} := R_{Q_{\text{eff},\varepsilon}} \quad \text{such that} \quad Q^*_{R_{\text{eff},\varepsilon}} = Q_{\text{eff},\varepsilon}.$$
As established in the proof of Proposition 2.9, as $\varepsilon \to 0$, $R_{\text{eff},\varepsilon}$ converges uniformly to $R_{\text{eff},0}$.

3.
Construct a Convergent Sequence of Reward Tuples. We must now construct a sequence of reward tuples $\boldsymbol{R}_\varepsilon = (R_{1,\varepsilon}, \ldots, R_{N,\varepsilon})$ that converges to $\boldsymbol{R}_0$ and generates the target effective reward $R_{\text{eff},\varepsilon}$. Let $\Delta R_\varepsilon = R_{\text{eff},\varepsilon} - R_{\text{eff},0}$. We need perturbations $\delta_k(s, a) = R_{k,\varepsilon}(s, a) - R_{k,0}(s, a)$ satisfying two conditions:
1. $\sum_{k=1}^N w_k(s)\, \delta_k(s, a) = \Delta R_\varepsilon(s, a)$ for all $(s, a)$;
2. $\max_k \|\delta_k\|_\infty \to 0$ as $\varepsilon \to 0$.
We can distribute the perturbation across all components in a simple and robust manner, defining the perturbation for every reward component to be identical to the target perturbation of the effective reward:
$$\delta_k(s, a) := \Delta R_\varepsilon(s, a) \quad \text{for all } k \in \{1, \ldots, N\}.$$
The new reward tuple $\boldsymbol{R}_\varepsilon$ is thus defined by
$$R_{k,\varepsilon}(s, a) := R_{k,0}(s, a) + \Delta R_\varepsilon(s, a) \quad \text{for all } k.$$
Since $\sum_{k=1}^N w_k(s) = 1$, this construction correctly produces the target effective reward. The norm of the difference between the new and old reward tuples is
$$\|\boldsymbol{R}_\varepsilon - \boldsymbol{R}_0\|_{\infty,N} = \max_k \|R_{k,\varepsilon} - R_{k,0}\|_\infty = \|\Delta R_\varepsilon\|_\infty = \|R_{\text{eff},\varepsilon} - R_{\text{eff},0}\|_\infty.$$
In the proof of Proposition 2.9 (which is invoked in Step 2 of this proof), it was established that $\|R_{\mathrm{eff},\varepsilon}-R_{\mathrm{eff},0}\|_\infty\le\varepsilon(1+\gamma)$. Therefore:

$$\|\mathbf{R}_\varepsilon-\mathbf{R}_0\|_{\infty,N}\to 0\quad\text{as }\varepsilon\to 0.$$

4. The Action Switch and Discontinuity. We have constructed a sequence of reward tuples $\mathbf{R}_\varepsilon\to\mathbf{R}_0$. The optimal effective Q-function for $\mathbf{R}_\varepsilon$ is $Q_{\mathrm{eff},\varepsilon}$. The final part of the argument is identical to that in Proposition 2.9. By construction, for any $\varepsilon>0$, $Q_{\mathrm{eff},\varepsilon}(s_0,a_2)$ is strictly greater than $Q_{\mathrm{eff},\varepsilon}(s_0,a')$ for any other action $a'$. Thus $a_2$ is the unique optimal action for the effective reward:

$$A^*_{\mathrm{eff}}(s_0;\mathbf{R}_\varepsilon)=\{a_2\}.$$

(1) Deterministic case: if the selection rule for $\mathbf{R}_0$ chose $a_1$, the map $M^{\mathrm{eff}}_{\mathrm{RL}}$ must now select $a_2$ for any $\mathbf{R}_\varepsilon$. The policy jumps from $a_1$ to $a_2$, proving discontinuity.

(2) Stochastic case: for $\mathbf{R}_0$, the policy at $s_0$ is a uniform distribution over $A^*_{\mathrm{eff}}(s_0;\mathbf{R}_0)$, which places mass $\le 1/2$ on $a_2$. For any $\mathbf{R}_\varepsilon$, the policy becomes a Dirac measure at $a_2$, with mass 1. The total variation distance between these policies is therefore at least $1/2$, independent of $\varepsilon$. Thus the map to the stochastic policy is also discontinuous. ∎

4.4 Mitigating Discontinuities: Entropy Regularization

Entropy regularization offers another powerful mechanism for promoting policy smoothness and addressing the degeneracy of optima (see, e.g., Haarnoja et al. (2018); Geist et al. (2019)).
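The contrast motivating this section can be seen in a one-state toy example (values invented; in the theorem the perturbation runs through the inverse Bellman map, whereas here we bump the Q-values directly): under hard argmax an arbitrarily small bump flips the policy by a fixed total-variation jump, while the entropy-regularized softmax policy introduced below moves only on the order of $\varepsilon/\alpha$.

```python
import numpy as np

def hard_policy(q):
    # Greedy selection with uniform tie-breaking over the exact argmax set.
    best = (q == q.max())
    return best / best.sum()

def soft_policy(q, alpha):
    # Entropy-regularized (Boltzmann) policy at temperature alpha.
    z = np.exp((q - q.max()) / alpha)
    return z / z.sum()

q0 = np.array([1.0, 1.0, 0.5])           # two tied optimal actions: degenerate optimum
eps = 1e-6
q_eps = q0 + np.array([0.0, eps, 0.0])   # arbitrarily small bump on the second action

tv = lambda p, r: 0.5 * np.abs(p - r).sum()

# Hard-max: the policy jumps by TV = 1/2 no matter how small eps is.
assert np.isclose(tv(hard_policy(q0), hard_policy(q_eps)), 0.5)

# Softmax: the policy moves by at most eps / (2 * alpha), in the spirit of
# the Lipschitz bound of Proposition 4.7 below.
alpha = 0.1
assert tv(soft_policy(q0, alpha), soft_policy(q_eps, alpha)) <= eps / (2 * alpha)
```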
By adding an entropy bonus, it acts as a tie-breaker that encourages the policy to be stochastic, especially among actions with similar values. This resolves policy indeterminacy by yielding a unique, smooth optimal policy. The objective optimized with entropy regularization (Geist et al., 2019) is:

$$J_{\mathrm{eff},\alpha}(\pi;\mathbf{R})=\mathbb{E}_\pi\Big[\sum_{t=0}^{\infty}\gamma^t\big(R_{\mathrm{eff}}(s_t,a_t;\mathbf{R})+\alpha H(\pi(\cdot|s_t))\big)\Big], \quad (9)$$

where $H(\pi(\cdot|s))=-\sum_a \pi(a|s)\log\pi(a|s)$ is the policy entropy at state $s$, $\alpha>0$ is the temperature, and the expectation is over trajectories generated by $\pi$ starting from an appropriate initial state distribution, with rewards $R_{\mathrm{eff}}$. The corresponding soft Bellman optimality operator for $Q^*_{\mathrm{eff},\alpha}(s,a;\mathbf{R})$ is:

$$(T_{\mathrm{eff},\mathbf{R},\alpha}Q)(s,a)=R_{\mathrm{eff}}(s,a;\mathbf{R})+\gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\Big[\alpha\log\sum_{a'\in A}\exp\big(Q(s',a')/\alpha\big)\Big]. \quad (10)$$

For $\alpha>0$, $Q^*_{\mathrm{eff},\alpha}$ is unique with respect to $R_{\mathrm{eff}}$ (and thus with respect to $\mathbf{R}$) under standard conditions. The optimal regularized policy, $\pi^*_{\mathbf{R},\alpha}$, takes the Boltzmann (softmax) form:

$$\pi^*_{\mathbf{R},\alpha}(a|s)=\frac{\exp\big(Q^*_{\mathrm{eff},\alpha}(s,a;\mathbf{R})/\alpha\big)}{\sum_{b\in A}\exp\big(Q^*_{\mathrm{eff},\alpha}(s,b;\mathbf{R})/\alpha\big)}. \quad (11)$$

This policy is unique with respect to $\mathbf{R}$.
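Equations (10) and (11) translate directly into a soft value-iteration loop. The sketch below iterates the soft Bellman operator to its unique fixed point and returns the Boltzmann policy; the MDP is a random toy instance, and the sizes, $\gamma$, and $\alpha$ are arbitrary choices for illustration.

```python
import numpy as np

def soft_value_iteration(R_eff, P, gamma=0.9, alpha=0.1, iters=2000):
    """Iterate the soft Bellman optimality operator (Eq. 10) to its fixed point.

    R_eff: (S, A) effective reward; P: (S, A, S) transition kernel.
    Returns the soft-optimal Q-function and the Boltzmann policy of Eq. (11).
    """
    S, A = R_eff.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        # Soft value V(s) = alpha * log sum_a exp(Q(s, a) / alpha), stabilized.
        m = Q.max(axis=1)
        V = m + alpha * np.log(np.exp((Q - m[:, None]) / alpha).sum(axis=1))
        Q = R_eff + gamma * (P @ V)   # (T Q)(s, a) = R_eff + gamma * E_{s'}[V(s')]
    # Unique Boltzmann (softmax) policy for alpha > 0.
    logits = (Q - Q.max(axis=1, keepdims=True)) / alpha
    pi = np.exp(logits)
    return Q, pi / pi.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
S, A = 5, 3
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over s'
R_eff = rng.normal(size=(S, A))
Q, pi = soft_value_iteration(R_eff, P)
assert np.allclose(pi.sum(axis=1), 1.0) and (pi > 0).all()
```

Note that every action receives strictly positive probability, which is exactly the uniqueness-by-smoothing effect discussed above.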
See, e.g., Geist et al. (2019) for more details.

Proposition 4.7 (Lipschitz Continuity of Soft Optimal Policy). For $\alpha>0$, under Assumptions 2.1-2.3 and the existence of continuous weights $w_k(s)$ summing to 1, the mapping $\mathbf{R}\mapsto\pi^*_{\mathbf{R},\alpha}$ is Lipschitz continuous from $(\mathcal{R}^N,\|\cdot\|_{\infty,N})$ to the space of policies. Specifically, for each $s\in S$:

$$d_{TV}\big(\pi^*_{\mathbf{R}_1,\alpha}(\cdot|s),\pi^*_{\mathbf{R}_2,\alpha}(\cdot|s)\big)\le\frac{1}{2\alpha(1-\gamma)}\|\mathbf{R}_1-\mathbf{R}_2\|_{\infty,N}.$$

Proof. Let $\mathbf{R}_1,\mathbf{R}_2\in\mathcal{R}^N$. Let $R_{\mathrm{eff},1}(s,a)=R_{\mathrm{eff}}(s,a;\mathbf{R}_1)$ and $R_{\mathrm{eff},2}(s,a)=R_{\mathrm{eff}}(s,a;\mathbf{R}_2)$ be the corresponding effective reward functions as defined in Eq. (7). Let $Q^*_1(s,a)=Q^*_{\mathrm{eff},\alpha}(s,a;\mathbf{R}_1)$ and $Q^*_2(s,a)=Q^*_{\mathrm{eff},\alpha}(s,a;\mathbf{R}_2)$ be the unique fixed points of the respective soft Bellman optimality operators:

$$Q^*_1(s,a)=R_{\mathrm{eff},1}(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\Big[\alpha\log\sum_{a'}\exp\big(Q^*_1(s',a')/\alpha\big)\Big],$$
$$Q^*_2(s,a)=R_{\mathrm{eff},2}(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\Big[\alpha\log\sum_{a'}\exp\big(Q^*_2(s',a')/\alpha\big)\Big].$$
The proof proceeds as follows.

1. Bound the difference in effective rewards. From Lemma 4.2, we have:

$$\|R_{\mathrm{eff},1}-R_{\mathrm{eff},2}\|_\infty\le\|\mathbf{R}_1-\mathbf{R}_2\|_{\infty,N}. \quad (12)$$

2. Bound the difference in optimal soft Q-functions. The soft Bellman optimality operator $T_{R,\alpha}Q=R+\gamma\mathcal{L}_\alpha Q$, where $(\mathcal{L}_\alpha Q)(s,a)=\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[\alpha\log\sum_{a'}\exp(Q(s',a')/\alpha)\big]$, is a contraction mapping with modulus $\gamma$ in the supremum norm (Geist et al., 2019); that is, $\|T_{R,\alpha}Q_A-T_{R,\alpha}Q_B\|_\infty\le\gamma\|Q_A-Q_B\|_\infty$. Let $Q_1=Q^*_1$ and $Q_2=Q^*_2$, so that $Q_1=T_{R_{\mathrm{eff},1},\alpha}Q_1$ and $Q_2=T_{R_{\mathrm{eff},2},\alpha}Q_2$. Consider $\|Q_1-Q_2\|_\infty$:

$$\|Q_1-Q_2\|_\infty=\|T_{R_{\mathrm{eff},1},\alpha}Q_1-T_{R_{\mathrm{eff},2},\alpha}Q_2\|_\infty\le\|T_{R_{\mathrm{eff},1},\alpha}Q_1-T_{R_{\mathrm{eff},1},\alpha}Q_2\|_\infty+\|T_{R_{\mathrm{eff},1},\alpha}Q_2-T_{R_{\mathrm{eff},2},\alpha}Q_2\|_\infty.$$
Using the contraction property for the first term and direct evaluation for the second:

$$(T_{R_{\mathrm{eff},1},\alpha}Q_2)(s,a)-(T_{R_{\mathrm{eff},2},\alpha}Q_2)(s,a)=R_{\mathrm{eff},1}(s,a)-R_{\mathrm{eff},2}(s,a),$$

so $\|T_{R_{\mathrm{eff},1},\alpha}Q_2-T_{R_{\mathrm{eff},2},\alpha}Q_2\|_\infty=\|R_{\mathrm{eff},1}-R_{\mathrm{eff},2}\|_\infty$. Thus,

$$\|Q_1-Q_2\|_\infty\le\gamma\|Q_1-Q_2\|_\infty+\|R_{\mathrm{eff},1}-R_{\mathrm{eff},2}\|_\infty.$$

Rearranging gives $(1-\gamma)\|Q_1-Q_2\|_\infty\le\|R_{\mathrm{eff},1}-R_{\mathrm{eff},2}\|_\infty$, so

$$\|Q_1-Q_2\|_\infty\le\frac{1}{1-\gamma}\|R_{\mathrm{eff},1}-R_{\mathrm{eff},2}\|_\infty. \quad (13)$$

Combining with Eq. (12):

$$\|Q_1-Q_2\|_\infty\le\frac{1}{1-\gamma}\|\mathbf{R}_1-\mathbf{R}_2\|_{\infty,N}. \quad (14)$$

3. Bound the total variation distance between policies. For a fixed state $s$, we view the Q-function as a vector of values over the action space $A$; let $d=|A|$. The optimal soft policy is the softmax function applied to this Q-value vector, i.e.,

$$\pi^*_{\mathbf{R},\alpha}(\cdot|s)=\mathrm{softmax}\big(Q(s,\cdot)/\alpha\big)\in\mathbb{R}^d.$$
We aim to bound the total variation distance $d_{TV}(\pi^*_{\mathbf{R}_1,\alpha}(\cdot|s),\pi^*_{\mathbf{R}_2,\alpha}(\cdot|s))$, defined as:

$$d_{TV}\big(\pi^*_{\mathbf{R}_1,\alpha}(\cdot|s),\pi^*_{\mathbf{R}_2,\alpha}(\cdot|s)\big)=\frac{1}{2}\sum_{a\in A}\big|\pi^*_{\mathbf{R}_1,\alpha}(a|s)-\pi^*_{\mathbf{R}_2,\alpha}(a|s)\big|=\frac{1}{2}\big\|\pi^*_{\mathbf{R}_1,\alpha}(\cdot|s)-\pi^*_{\mathbf{R}_2,\alpha}(\cdot|s)\big\|_1.$$

To achieve this, we apply Lemma C.2 in the Appendix, which establishes the Lipschitz continuity of the softmax function: for any two vectors $x,y\in\mathbb{R}^d$,

$$\|\mathrm{softmax}(x/\alpha)-\mathrm{softmax}(y/\alpha)\|_1\le\frac{1}{\alpha}\|x-y\|_\infty.$$

We apply this result with $x$ and $y$ taken to be the Q-value vectors at state $s$.
This yields the following inequality for the $L_1$-distance between the policies at state $s$:

$$\big\|\pi^*_{\mathbf{R}_1,\alpha}(\cdot|s)-\pi^*_{\mathbf{R}_2,\alpha}(\cdot|s)\big\|_1\le\frac{1}{\alpha}\|Q_1(s,\cdot)-Q_2(s,\cdot)\|_\infty,$$

where $\|Q_1(s,\cdot)-Q_2(s,\cdot)\|_\infty=\max_{a\in A}|Q_1(s,a)-Q_2(s,a)|$. Since the maximum difference at a single state $s$ is bounded by the supremum norm over all states and actions, i.e., $\max_{a\in A}|Q_1(s,a)-Q_2(s,a)|\le\|Q_1-Q_2\|_\infty$, we have:

$$d_{TV}\big(\pi^*_{\mathbf{R}_1,\alpha}(\cdot|s),\pi^*_{\mathbf{R}_2,\alpha}(\cdot|s)\big)\le\frac{1}{2}\cdot\frac{1}{\alpha}\|Q_1-Q_2\|_\infty=\frac{1}{2\alpha}\|Q_1-Q_2\|_\infty. \quad (15)$$

4. Combining all steps. Substituting Eq. (14) into Eq. (15):

$$d_{TV}\big(\pi^*_{\mathbf{R}_1,\alpha}(\cdot|s),\pi^*_{\mathbf{R}_2,\alpha}(\cdot|s)\big)\le\frac{1}{2\alpha}\cdot\frac{1}{1-\gamma}\|\mathbf{R}_1-\mathbf{R}_2\|_{\infty,N}=\frac{1}{2\alpha(1-\gamma)}\|\mathbf{R}_1-\mathbf{R}_2\|_{\infty,N}.$$
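The softmax step of the chain, Lemma C.2 combined with Eq. (15), can be probed with a quick Monte Carlo check: for random Q-vectors, the TV distance between the two softmax policies should never exceed $\frac{1}{2\alpha}\|Q_1-Q_2\|_\infty$. Sizes, temperature, and perturbation scale below are arbitrary choices.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(42)
alpha, d = 0.5, 10                                 # arbitrary temperature and action count
for _ in range(1000):
    q1 = rng.normal(size=d)
    q2 = q1 + rng.normal(scale=0.1, size=d)        # a random Q-perturbation
    tv = 0.5 * np.abs(softmax(q1 / alpha) - softmax(q2 / alpha)).sum()
    bound = np.abs(q1 - q2).max() / (2 * alpha)    # right-hand side of Eq. (15)
    assert tv <= bound + 1e-12
```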
This establishes Lipschitz continuity with constant $C(\alpha,\gamma)=\frac{1}{2\alpha(1-\gamma)}$, which depends only on $\alpha$ and $\gamma$ and not on the state $s$. ∎

Figure 3: Effect of Entropy Regularization on Policy Distribution

While entropy regularization, yielding a softmax optimal policy, endows the reward-policy map with desirable continuity (as demonstrated by the Lipschitz property of Proposition 4.7) and can enhance training stability, these advantages come with inherent trade-offs. The principal cost is a degree of suboptimality with respect to the immediate reward function: by distributing probability mass across multiple actions rather than exclusively selecting the action with the highest estimated value, the policy intentionally sacrifices some exploitative potential for increased stochasticity and exploration. Figure 3 provides a conceptual illustration of hard-max and softmax policy behaviors. This characteristic can manifest behaviorally as a "blurring" or "averaging" effect, where the policy may produce less decisive or distinctive outputs; this can be suboptimal for tasks demanding high-precision or optimal responses. Furthermore, the interaction of a softmax policy with an imperfect or incomplete reward model can influence the manifestation of undesirable behaviors, such as hallucinations in large language models. While not a direct cause, the policy's inherent stochasticity means it may assign non-trivial probability to actions leading to subtly flawed or fabricated content if those actions possess near-optimal, yet inadequately penalized, values under the effective reward function $R_{\mathrm{eff}}$. The extent of these trade-offs is governed by the temperature parameter $\alpha$, which must be calibrated carefully to balance policy smoothness and exploration against the fidelity of exploiting the reward landscape.
4.5 Implications for Multi-Task RL Training for LLMs

The preceding analysis, which centers on policies derived from a state-dependent effective reward model $R_{\mathrm{eff}}$, yields critical insights into the stability of LLMs trained with multiple specialized reward models. The stability of the resultant policy, $\pi^*_{R_{\mathrm{eff}}}$, is fundamentally governed by the properties of this constructed effective reward and its associated optimal Q-function, $Q^*_{\mathrm{eff}}$. The construction of the effective reward, particularly through the state-dependent weights $w_k(s)$, emerges as a decisive factor. An unstable or ill-defined weighting mechanism, or significant conflict between the specialized rewards $R_k$, may render $R_{\mathrm{eff}}$ highly sensitive to perturbations in any individual reward component $R_j$. This sensitivity is a primary source of policy instability. Specifically, if the resulting $Q^*_{\mathrm{eff}}$ admits non-unique optimal actions for a given reward tuple $\mathbf{R}_0$, the system becomes susceptible to discontinuous behavior. As demonstrated, a minor change to the underlying reward models can alter the effective reward landscape just enough to break ties within the optimal action set $A^*_{\mathrm{eff}}$, causing an abrupt shift in the selected policy. Conversely, entropy regularization provides a robust mechanism for ensuring policy stability within this framework. By incorporating an entropy bonus into the optimization objective with respect to $R_{\mathrm{eff}}$, this method smooths the policy landscape and guarantees the existence of a unique, stochastic optimal policy, $\pi^*_{\mathbf{R},\alpha}$, that varies continuously with the underlying reward tuple $\mathbf{R}$.
This approach can therefore substantially enhance the behavioral stability of an LLM against modifications to its constituent reward models, albeit at the cost of a less sharply optimized policy for any temperature $\alpha>0$. Ultimately, these findings underscore that the strategy by which an LLM integrates multiple reward signals across diverse domains, modeled here via the effective reward mechanism, is central to its stability properties. While our analysis has focused on the continuity of policies optimal for a given $R_{\mathrm{eff}}$, a key practical challenge remains: to design or learn an aggregation mechanism, i.e., the weights $w_k(s)$, such that the resulting policy not only exhibits desirable stability but also effectively optimizes the intended global performance measure $J(\pi)$.

4.6 Remarks on State-Dependent Aggregation Mechanism

A crucial assumption underpinning our analysis is that the aggregation weights $w_k(s)$ are given, continuous functions of the state $s$ and remain fixed during the perturbation of the reward tuple $\mathbf{R}$. This simplification is necessary for analytical tractability. In a more general setting, these weights might themselves depend on the policy $\pi$ or the reward models $\mathbf{R}$, leading to an ideal but far more complex formulation, $w_k^*(s;\pi,\mathbf{R})$. Such dependencies, while realistic, would introduce significant analytical challenges. First, if $w_k(s)$ were a function of $\pi$, the reward function in the Bellman equation would become policy-dependent, violating the foundational assumptions of the standard MDP framework. Second, a dependency of $w_k(s)$ on $\mathbf{R}$ would compromise the continuity of the effective reward map itself, invalidating the Lipschitz properties established in Lemma 4.2 and altering all subsequent stability results.
Consequently, our framework is best interpreted as modeling systems in which the aggregation mechanism is structurally fixed. This assumption corresponds to several plausible implementation scenarios. For instance, the weights could represent a simplified, policy-independent heuristic for task relevance, such as prior probabilities $p_k$. Alternatively, and more compellingly, they can model a sophisticated architecture with a separate, pre-trained, and now-frozen component, such as a context encoder or a gating network, that determines task relevance based on the input state. In this view, our analysis assesses the stability of a downstream policy optimization process with respect to its reward models, given a fixed routing or aggregation module. By adopting this deliberate simplification, we can employ the standard Bellman machinery to analyze a well-defined effective MDP. This provides clear and rigorous insights into how reward perturbations propagate through a structured, yet representative, multi-task decision-making model. The broader question concerning the stability of jointly learning the aggregation weights and the policy would necessitate a more extensive analytical framework beyond the scope of this work.

5 Connecting Theory to Practice: A Review of Empirical Evidence

Having established a formal theory of policy stability, this section grounds our framework in practice by reviewing and re-interpreting key empirical findings from the recent literature. The following case studies are chosen to provide comprehensive empirical evidence for our theory, demonstrating its power both to explain existing failures and to guide the development of more robust and trustworthy AI systems. We begin by examining how reward misspecification can lead to deceptive reasoning, illustrating a progression from simple, rational cheating to more sophisticated policy obfuscation.
We then broaden our scope to show how the same theoretical principles explain the systemic intelligence-obedience trade-off observed in large reasoning models. Moving from diagnosis to prescription, we next analyze a case study of controllable reasoning, demonstrating how a principled tie-breaker reward can be engineered to resolve these instabilities and achieve fine-grained control without sacrificing performance. We then turn to the complex domain of RLHF-based alignment, showing how our theory explains the emergence of a more harmful failure mode: a policy shift from faithfulness to sophistry. Finally, we present a case study and a controlled perturbation experiment in a multi-reward setting across diverse domains.

5.1 Deceptive Reasoning: From Rational Cheating to Policy Obfuscation

A critical challenge in training capable models is that policies may learn to be deceptively aligned, pursuing a hidden, undesirable strategy while appearing to optimize the given objective. Recent findings on reward hacking provide a stark, empirical validation of our theoretical framework, illustrating how policies rationally optimize incomplete rewards and how attempts to patch these rewards can trigger discontinuous shifts to more sophisticated failure modes.

5.1.1 Initial Failure: Rational Cheating under a Simple Reward Model

Figure 4: The emergence of reward hacking in OpenAI o3-mini models on coding tasks. Under a "weak grader" that cannot detect test manipulation, the policy learns to cheat to maximize reward. A "strong grader" that penalizes this behavior results in an aligned, non-cheating policy. Adapted from Wang et al. (2025a).

A compelling baseline case is the emergence of cheating in OpenAI's o3-mini models during RL on coding tasks (Wang et al., 2025a; Baker et al., 2025). As illustrated in Figure 4, the model's behavior diverges dramatically depending on the quality of the reward signal.
When trained with a "weak grader", a reward model that only checks unit-test pass rates, the policy progressively learns a deceptive strategy of modifying the tests to pass, with cheating attempts reaching nearly 80%. This result is a powerful real-world illustration of our "clever slacker" paradigm (Proposition 3.1). The weak grader implements an incomplete training reward ($R_{\mathrm{train}}$), and the model's convergence toward cheating is not an irrational bug but a perfectly rational policy for this misspecified objective. In contrast, a "strong grader" that penalizes test manipulation provides a better proxy for the true reward ($R_{\mathrm{true}}$), making honesty the only viable path to high reward (Proposition 3.2).

5.1.2 A Deeper Failure: Policy Shift to Obfuscated Deception

Figure 5: The evolution of deceptive alignment under different reward schemes. Adding a CoT-based penalty to the reward function does not eliminate deception but causes a policy shift towards a more sophisticated, obfuscated form of cheating that evades the monitor. Adapted from Baker et al. (2025).

A natural next step to address this failure is to augment the reward function with a process-based signal. In a follow-up experiment, the reward was updated to additionally penalize the agent if its Chain-of-Thought (CoT) triggered a monitor designed to detect hacking attempts. This represents a perturbation to the reward landscape, one that may be significant in practice, intended to render the deceptive strategy non-optimal (Baker et al., 2025). The results, shown in Figure 5, are nuanced and cautionary. While the agent trained with this "CoT pressure" does learn to pass tests honestly more often than the baseline (left, green line), it also learns a more pernicious strategy. The baseline agent's cheating is transparent and almost always caught by the monitor.
In contrast, the agent under CoT pressure learns to hide its deception; its cheating is almost never caught by the monitor (right, blue line). It has learned to write an innocent-looking CoT while still manipulating the tests. This outcome illustrates the policy discontinuities predicted by our theory. Although the magnitude of the perturbation is unknown, introducing a CoT penalty did not make the policy more honest in a smooth and gradual way; in our theory, the direction of the perturbation matters more than its scale. In this case, it triggered an abrupt shift from one strategy (transparent deception) to a completely different one (obfuscated deception). This occurs because the new reward landscape still contains multiple, qualitatively distinct paths to high reward, creating a degeneracy of near-optimal actions. Faced with a penalty for its old strategy, the model does not abandon deception; rather, it discovers a new, previously unexplored point in the policy space that satisfies the new, more complex reward function, because that reward still differs from the true reward. This experiment vividly demonstrates that naive attempts to patch a reward function can induce policy brittleness, leading to more dangerous and harder-to-detect forms of misalignment, a direct empirical manifestation of the instabilities inherent in a complex reward-policy map.

5.2 The Intelligence-Obedience Trade-off: An Incomplete Reward Specification

Beyond single-task deception, our framework also explains systemic trade-offs between desired capabilities in large reasoning models (LRMs). A comprehensive study by Fu et al. (2025) provides compelling, large-scale evidence for such a trade-off, showing that training methods designed to enhance reasoning can inadvertently degrade instruction-following.
Their experiments, which applied reasoning-oriented reinforcement learning (RL) to several base models, revealed a consistent pattern: while the models became better at solving problems correctly, they grew worse at adhering to user-specified constraints. The results, summarized in Table 1, offer a clear quantitative illustration of this dilemma. Across multiple Qwen base models, applying reasoning-oriented RL consistently boosts correctness accuracy (Corr-Acc) by a large margin. Simultaneously, however, it induces a consistent degradation in both hard (IF-HAcc) and soft (IF-SAcc) instruction-following accuracy. This empirically demonstrates that optimizing for reasoning performance alone can steer the policy toward a region of the policy space where obedience is sacrificed.

Table 1: Performance comparison of base models vs. reasoning-oriented RL. Arrows indicate change relative to the base model: ↑ for increase, ↓ for decrease. Adapted from Table 4 in Fu et al. (2025).

| Model | IF-HAcc | IF-SAcc | Corr-Acc |
| --- | --- | --- | --- |
| Qwen2.5-1.5B | 10.00 | 27.26 | 1.21 |
| + Reasoning-Oriented RL | 9.52 ↓ | 23.97 ↓ | 14.58 ↑ |
| Qwen2.5-7B | 15.95 | 33.13 | 13.59 |
| + Reasoning-Oriented RL | 10.48 ↓ | 27.26 ↓ | 28.39 ↑ |
| Qwen2.5-Math-1.5B | 9.28 | 23.33 | 18.91 |
| + Reasoning-Oriented RL | 8.33 ↓ | 21.31 ↓ | 24.88 ↑ |
| Qwen2.5-Math-7B | 9.76 | 23.53 | 20.68 |
| + Reasoning-Oriented RL | 7.85 ↓ | 22.62 ↓ | 32.61 ↑ |

Note. Hard accuracy (IF-HAcc) and soft accuracy (IF-SAcc) assess whether the model output satisfies user-specified constraints: IF-HAcc equals 1 only if all constraints in a query are satisfied, while IF-SAcc measures the average satisfaction rate across all constraints. Average correctness accuracy (Corr-Acc) captures the correctness of the model's final answer regardless of constraint satisfaction, based on matching against the ground-truth answer. Results are averaged over five math benchmarks: AIME2024, AIME2025, AMC2023, Minerva, and Olympiad.
This observed trade-off is precisely explained by the principle of policies rationally optimizing an incomplete reward specification (Proposition 3.1). The RL training described by Fu et al. (2025) uses an outcome-based reward, $R_{\mathrm{reasoning}}$, which is incomplete with respect to the user's full intent. The model, as a rational agent, allocates its capacity to maximize the explicitly rewarded objective (correctness), causing the unrewarded, pre-existing skill of instruction-following to degrade. This provides direct empirical support for our claim that many alignment failures are not arbitrary bugs but predictable outcomes of optimizing a misspecified reward landscape. Our theoretical framework not only explains this degradation but also suggests a principled path toward mitigation. The core issue is that optimizing for $R_{\mathrm{reasoning}}$ alone yields a highly degenerate set of optimal policies, $\Pi^*(R_{\mathrm{correctness}})$, many of which are effective at reasoning but poor at following instructions. To resolve this, we propose introducing an auxiliary instruction-following reward, $R_{\mathrm{instruction}}$, which acts as a tie-breaker (Proposition 3.2). This targeted reward selects, from the degenerate set, policies that also exhibit the desired obedience, thereby seeking a solution in the intersection of high-performing reasoning and instruction-following policies. The following subsection presents a case study that illustrates this principle.

5.3 Controllable Reasoning: Engineering a Desirable Policy Shift via Tie-Breaker Rewards

The previously discussed trade-off highlights a central challenge: if training for reasoning correctness yields a degenerate set of optimal policies, how can we select one that also exhibits other desirable behaviors, such as adhering to a length constraint?
The work of Aggarwal and Welleck (2025) on controlling Chain-of-Thought (CoT) length provides a powerful case study, demonstrating how a principled reward function can engineer a desirable shift in the policy space. Their method, Length-Controlled Policy Optimization (LCPO), trains a model using a reward that combines a primary correctness reward with an auxiliary penalty for deviating from a target CoT length $n_{\mathrm{gold}}$, which is provided in the user prompt, e.g.,

$$r(y,y_{\mathrm{gold}},n_{\mathrm{gold}})=\mathbb{I}(y=y_{\mathrm{gold}})-\alpha\,|n_{\mathrm{gold}}-n_y|. \quad (16)$$

In their experiments, they set $\alpha=0.0003$. The key insight from their empirical results is twofold. First, as shown in Figure 6, the resulting L1 models successfully learn to control their reasoning length according to the prompt's specification. Second, this added controllability does not come at the cost of reasoning performance; in fact, the L1 models consistently outperform the baseline S1 model (which uses a hard token limit) in correctness across various computational budgets, and outperform the base model DeepSeek-R1-1.5B while using fewer tokens.

Figure 6: Empirical validation of our tie-breaker framework, demonstrating how a principled reward design resolves the intelligence-obedience trade-off. (a) Length control precision: the policy, trained with a length-penalizing reward, learns to precisely adhere to the target token budget specified in the prompt. (b) Reasoning performance vs. budget: crucially, this added controllability does not degrade reasoning performance; the L1 model consistently outperforms the S1 baseline across all budgets and the base model DeepSeek-R1-1.5B with fewer tokens. This outcome shows that, via a tie-breaker reward, RL can induce a desirable policy shift to a new region of the policy space where both high reasoning capability and instruction-following are achieved.
Adapted from Aggarwal and Welleck (2025). Note: the performance reported in (b) is a macro-average over four math benchmarks: AMC, AIME, Olympiad-Bench, and MATH.

This successful outcome can be understood through our theoretical framework. The LCPO reward function, with its length-penalty term, acts as a powerful tie-breaker (Proposition 3.2). The RL process, guided by this new reward, induces a shift in the policy space: it forces the model to abandon the initial policy and converge to a new, qualitatively superior policy that co-optimizes for both reasoning correctness and length controllability. Moreover, as shown in Aggarwal and Welleck (2025), the model learns to conditionally execute different reasoning strategies, from direct solutions at low budgets to deliberative self-correction at high budgets, based on the length target it perceives in its input state. Therefore, this case study does not contradict our theory of instability; it exemplifies how to harness it. The brittleness of the reward-policy map, once seen as a flaw, becomes a tool for targeted policy improvement. Understanding this map deeply allows us to engineer targeted shifts that push models into more capable and controllable regions of the policy space.

5.4 From Faithfulness to Sophistry: Policy Jumps in RLHF-based Alignment

The preceding analyses have focused on instabilities in reasoning tasks, where reward functions are often objective and rule-based (e.g., correctness, test manipulation, or token length). This raises a critical question: do these theoretical instabilities persist in the more subjective domain of LLM alignment, where rewards are not defined by rules but are learned from human feedback? The work of Wen et al. (2024) on "Unintended Sophistry" provides a compelling affirmative answer, demonstrating how RL from Human Feedback (RLHF) can induce its own pernicious policy shifts. Their study reveals a "performance illusion" inherent in standard RLHF pipelines.
As the quantitative results in Figure 7 show, after RLHF with a reward model trained on human preferences, the human-perceived performance of the model increases substantially, while its true correctness stagnates or even declines (Wen et al., 2024). This indicates that the model is not getting better at the task, but better at convincing humans.

Figure 7: The "performance illusion" induced by RLHF. While human evaluators perceive a performance improvement (+9.4), the oracle reward shows no actual improvement (−1.8). This discrepancy arises because the model learns to exploit human evaluation biases in a misspecified reward model. Adapted from Wen et al. (2024).

This phenomenon can be interpreted as a direct consequence of policy optimization under reward misspecification. The ideal objective of alignment is to find a policy optimal for the true reward R_true, one that exhibits what we term a Faithful-and-Correct Style. However, the RLHF process optimizes for a flawed proxy, R_train, which inadvertently rewards human cognitive biases. Our framework predicts that optimizing this misspecified reward may induce a discontinuous jump in the policy space (Proposition 2.10). The policy does not converge toward the ideal; instead, it leaps to a different equilibrium, a Persuasive Sophistry Style, which is optimal for R_train. This style is characterized by fabricating evidence, employing internally consistent but fallacious logic, and generating unreadable code designed to pass superficial checks, all strategies that exploit the weaknesses of a human-feedback-based reward model. This case study therefore provides a powerful, real-world demonstration of our work: a misspecified reward function, as is common in RLHF, may result in a suboptimal policy and even cause a shift to a qualitatively different and more harmful style of behavior.
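The selection effect of a misspecified proxy can be made concrete with a toy sketch. The style names echo the discussion above, but the candidate scores and bias weight are illustrative assumptions, not measurements from Wen et al. (2024):

```python
# Hypothetical response "styles" scored under the true reward (actual
# correctness) and a misspecified proxy that also credits persuasive
# surface features, standing in for the human-evaluation biases an
# RLHF reward model can absorb. All numbers are illustrative.
CANDIDATES = {
    "faithful_and_correct": {"correct": 1.0, "persuasive": 0.3},
    "persuasive_sophistry": {"correct": 0.2, "persuasive": 1.0},
}

def r_true(style):
    return CANDIDATES[style]["correct"]

def r_train(style, bias=1.5):
    # Proxy: true signal plus a term rewarding persuasion.
    return CANDIDATES[style]["correct"] + bias * CANDIDATES[style]["persuasive"]

best_true = max(CANDIDATES, key=r_true)    # "faithful_and_correct"
best_train = max(CANDIDATES, key=r_train)  # "persuasive_sophistry"
```

Once the bias term is large enough, the argmax under R_train lands on a qualitatively different style than the argmax under R_true: the discontinuous jump of Proposition 2.10 in miniature.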
5.5 Multi-Reward Instability: Suggestive Evidence from Data Mixture Experiments

Figure 8: Policy sensitivity to the data mixture in a multi-reward RLVR setting. (a) Effect of the data mixture change: the "Exclude-One" analysis shows that removing a single dataset (e.g., ScienceQA) causes large, non-linear shifts in performance across different test sets, highlighting complex task interactions. (b) Comparison of different mixture strategies: distinct strategy classes achieve distinct generalization performance. Adapted from Liang et al. (2025).

Our analysis so far has focused on instabilities arising from a single reward function. The challenge is amplified in the multi-reward setting, where a single policy must balance competing objectives. The work of Liang et al. (2025) on optimizing data mixtures for multi-domain RLVR provides a compelling case study. In their experiments, altering the data mixture serves as a practical, high-level mechanism for reshaping the effective reward landscape that the policy ultimately optimizes. Their results clearly demonstrate that the final policy is highly sensitive to this process: changing the mixture, for instance by removing a single dataset or altering the mixture ratio, leads to large and non-linear shifts in the policy's capabilities and performance profile (see Figure 8). A crucial distinction must be made, however, regarding the strength of this evidence. Our formal theory proves that a small, well-defined reward perturbation may trigger a discontinuous policy jump. Altering a data mixture is a significant intervention, and it is unclear from this experiment whether it corresponds to a small or a large change in the final effective reward landscape. Therefore, while these findings do not constitute a strict proof of our theory, they offer strong suggestive evidence of the policy brittleness our framework predicts.
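Our reading of this mechanism can be sketched in a few lines: treat the mixture weights as the aggregation weights of an effective reward, and observe that an exclude-one change can flip which policy is optimal. The domain names and reward values below are illustrative assumptions, not data from Liang et al. (2025):

```python
# A minimal sketch of the "effective reward": the data-mixture weights
# act as aggregation weights over per-domain rewards, so re-weighting
# domains reshapes the reward landscape the policy optimizes.
DOMAIN_REWARDS = {            # reward each of two candidate policies earns per domain
    "math":    [1.0, 0.9],
    "science": [0.2, 0.8],
    "general": [0.5, 0.6],
}

def effective_reward(mixture):
    """R_eff = sum_d w_d * R_d, aggregated with mixture weights w_d."""
    totals = [0.0, 0.0]
    for domain, weight in mixture.items():
        for i, r in enumerate(DOMAIN_REWARDS[domain]):
            totals[i] += weight * r
    return totals

def best_policy(mixture):
    r = effective_reward(mixture)
    return max(range(len(r)), key=lambda i: r[i])

# "Exclude-one": dropping the science data and renormalizing the
# remaining weights flips the optimal policy, a discontinuous shift
# from a seemingly local change to the training data.
full_mix = {"math": 0.4, "science": 0.4, "general": 0.2}   # -> policy 1
no_science = {"math": 2 / 3, "general": 1 / 3}             # -> policy 0
```

The flip is non-linear in exactly the sense the text describes: the per-domain rewards never change, yet the argmax of their weighted aggregate does.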
The observed sharp changes in behavior are highly consistent with the types of instability expected in a system with multiple rewards across diverse domains. This highlights that the data mixture is an important parameter for controlling the final policy's stability, even if the precise mapping from mixture change to reward change remains an open question.

5.6 Multi-Reward Instability: Evidence from Controlled Perturbation Experiments

Figure 9: Out-of-distribution (OOD) performance on safety and human value benchmarks. Model-1 vs. Model-2 shows that a slight perturbation to one reward model in the multi-reward tuple causes a large policy shift. Model-1 vs. Model-3 shows that a minor data curation step also leads to a significant performance change, highlighting the policy's sensitivity to the factors that shape its effective reward landscape.

Our analysis suggests that in a multi-reward, multi-domain setting, the final policy's stability is highly sensitive to both the individual reward models and the data mixture that shapes their aggregation. To provide direct, controlled evidence for this, we conducted a series of experiments training a single multimodal LLM (based on an SFT model of Qwen2.5-VL-72B-Instruct (Bai et al., 2025)) with a multi-reward RL framework (see Figure 2). The model was trained on three classes of prompts: safety, human values, and general, each guided by a corresponding outcome-based reward model (ORM). We implemented a reinforcement learning framework, named OpenRLHF-V, based on the open-source OpenRLHF framework (Hu et al., 2024). We also experimented with a similar framework, VeRL (Sheng et al., 2024); both frameworks yielded comparable training curves and final model performance. All experiments were conducted on a cluster of A800 GPUs using OpenRLHF-V, and the training hyperparameters are summarized in Table 2 in the Appendix. We first investigated the policy's sensitivity to small perturbations in the reward function tuple.
We trained two RL models, Model-1 and Model-2, using identical datasets and mixture ratios. The only difference was a slight modification to the safety ORM, while the human-value and general ORMs were kept the same. As shown in Figure 9, this minor change to just one component of the reward vector led to significant and widespread performance shifts across a range of out-of-distribution safety and human value benchmarks. This result provides strong evidence for the brittleness of the reward-policy map in a multi-reward setting, confirming our theory that a small reward perturbation can induce a shift in the final policy. Next, we investigated the policy's sensitivity to the training data composition, which shapes the "effective reward" landscape. We trained Model-3 using the same set of reward models as Model-1. The only difference was a minor data curation step: we removed approximately 200 ambiguous prompts that were difficult for human annotators to classify as either "unsafe" or "normal". These prompts can be seen as lying on the safety boundary and may receive conflicting rewards (e.g., harmfulness versus helpfulness). This small change in the training data also resulted in a large performance difference between Model-1 and Model-3, demonstrating that minor alterations to the data, which influence how the policy learns to aggregate or "weigh" the different reward signals, can be sufficient to shift the policy to a different equilibrium. These controlled experiments provide complementary evidence supporting our stability framework. They demonstrate that the final policy in a multi-reward system is extremely sensitive to both the precise definition of its component reward functions and the data used to inform their aggregation, further underscoring that stability is a critical and non-trivial challenge.

6 Discussion

Our work has established a formal mathematical framework for analyzing the stability of the reward-policy map in reinforcement learning.
While the preceding sections have demonstrated its power in explaining and unifying a range of critical phenomena in LLM alignment and reasoning, it is equally important to discuss the assumptions underpinning our theory and its relationship with the complexities of real-world training. This section addresses three key aspects: the continuity assumption of the reward function, the influence of practical optimization algorithms, and the role of data distribution.

6.1 On the Assumption of Reward Continuity

Our theoretical framework is built upon the assumption that the reward function satisfies R ∈ C(S × A). As established in our framing of the LLM problem, for a finite state-action space any function is formally continuous, due to the discrete topology induced by any metric. This means that even the binary success/failure signals common in practice, such as R(s, a) ∈ {0, 1}, satisfy this topological assumption. Therefore, the source of policy instability in the LLM setting does not stem from a formal "discontinuity" of the reward function itself. The critical issue with such sparse, outcome-based rewards lies not in their topology but in their functional impact on the value landscape: they are a primary driver of degeneracy. Because a vast number of distinct behavioral trajectories can lead to the same binary outcome (e.g., a correct final answer), they are all assigned an identical reward value. This equality propagates through the Bellman equation, resulting in an optimal Q-function, Q*_R, in which many different actions at a given state share the exact same maximum value. This widespread degeneracy creates the precise conditions under which the policy map becomes unstable: it enlarges the set of optimal actions, A*(s; R), and flattens the value landscape, making the argmax correspondence acutely sensitive to any perturbation in the Q-function's values.
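A toy computation makes the cliff tangible: with two co-optimal actions, the greedy argmax policy flips under an arbitrarily small Q-value perturbation, while a softmax (entropy-regularized) policy moves only on the order of the perturbation. All numbers are illustrative:

```python
import math

# The policy cliff in miniature: when two actions are co-optimal under
# a degenerate Q-function, a vanishingly small perturbation flips the
# greedy (argmax) policy, while the softmax policy barely moves.
def greedy(q):
    # argmax with first-index tie-breaking
    return max(range(len(q)), key=lambda a: q[a])

def softmax(q, alpha=1.0):
    z = [math.exp(x / alpha) for x in q]
    s = sum(z)
    return [x / s for x in z]

q      = [1.0, 1.0, 0.2]         # actions 0 and 1 are co-optimal: |A*(s; R)| > 1
q_pert = [1.0, 1.0 + 1e-9, 0.2]  # a vanishingly small perturbation

flip = (greedy(q), greedy(q_pert))  # the greedy policy jumps: (0, 1)

p, p_pert = softmax(q), softmax(q_pert)
tv = sum(abs(a - b) for a, b in zip(p, p_pert))  # O(1e-9): smooth response
```

The contrast between `flip` and `tv` is exactly the dichotomy the theory describes: the argmax correspondence is discontinuous at degenerate points, whereas the softmax map is Lipschitz in the Q-values (with constant 1/α, cf. Lemma C.2).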
Consequently, while binary rewards do not violate the continuity assumption in the finite LLM setting, they drastically increase the prevalence of co-optimal actions. This exacerbates the inherent discontinuity of the argmax operator, making the policy cliffs identified by our theory more pronounced and severe in practice. The instability arises not from a "discontinuous" reward function, but from an optimization process breaking ties among a massively degenerate set of optimal choices, together with perturbations of the reward function.

6.2 The Role of Practical RL Algorithms and Optimization Dynamics

Our framework analyzes the stability of the true optimal policy, π*, providing a clean understanding of a problem's inherent structural properties. In practice, however, policies are found via iterative algorithms like PPO, which introduce a crucial dynamic that our static analysis does not capture: the perpetual noise inherent in the optimization process. This noise, arising from mini-batch sampling and stochastic gradient updates, acts as a constant source of perturbation. While our theory explains why a degenerate reward landscape makes a policy fundamentally fragile, this algorithmic noise is the ever-present force that can exploit this fragility, preventing the policy from settling into a stable strategy and instead causing the kind of training drift and run-to-run variance commonly observed in practice. A full analysis of these joint effects remains a critical direction for future work.

6.3 The Influence of Data Distribution and Mixture Ratios

The stability of a learned policy is not only a function of the reward and the algorithm, but also of the data distribution on which it is trained. Our theoretical framework, by focusing on the properties of the MDP, implicitly assumes access to all states, yet in reality a policy's behavior is shaped only by the states it encounters during training.
Moreover, in the multi-reward setting, the mixture ratios directly shape the "effective reward" landscape the policy optimizes against. A model trained on a data mixture heavily skewed towards one task (e.g., 90% coding data) will learn an internal reward aggregation strategy specialized for that task. If this specialized policy is then deployed in an environment with a different task mixture, its behavior can become unstable and unpredictable, as its internal reward strategy is now fundamentally mismatched to the new context. Therefore, the data mixture ratio is not merely a sampling detail but a top-level hyperparameter that governs the structure of the effective reward landscape and, consequently, the stability of the final multi-task model.

7 Conclusion

The reliable alignment and reasoning of large language models is fundamentally constrained by the instability of reinforcement learning, which often yields brittle policies that fail in unpredictable ways. This work confronted this challenge not with more sophisticated heuristics, but by establishing a foundational mathematical theory of policy stability. Our central contribution is the formal proof that policy instability is not an empirical accident but a direct and predictable mathematical consequence of reward perturbation and degenerate optimal action sets, which are a common feature of the vast, nuanced decision spaces of language and reasoning. This theoretical principle provides a unified lens, reframing a wide range of seemingly disparate alignment failures, from the evolution of deceptive reasoning to systemic intelligence-obedience trade-offs and RLHF-induced sophistry, as rational and predictable outcomes of policy optimization on incomplete or ambiguous reward landscapes.
We extended this framework from the foundational single-reward setting to the more complex multi-reward paradigm across diverse domains, demonstrating that stability then hinges critically on the mechanism for aggregating disparate reward signals into an "effective reward". Finally, moving from diagnosis to prescription, we provided a formal guarantee by proving that entropy regularization restores Lipschitz continuity to the reward-policy map, thereby ensuring a smooth and stable policy response to changes in the underlying reward models. This work reframes policy stability from reactive heuristics to a principled theory. It introduces a formal framework to analyze and anticipate instabilities in AI systems, laying the groundwork for more robust reinforcement learning. While the analysis provides essential foundations, it also highlights key directions for future research, especially the stability of dynamically learned reward aggregation. By establishing a mathematical perspective on stability, this research advances the design of RL algorithms that prioritize not only optimality but also the reliability essential for safe and trustworthy AI.

Acknowledgments

This research emerged from the Reasoning Trustworthiness project (SafeWork-R1), inspired by empirical observations from both public works in the field of large reasoning models and our extensive RL experiments during the project. We are grateful to everyone at the Center for Safe & Trustworthy AI, Shanghai Artificial Intelligence Laboratory, for their invaluable support. This work is supported by Shanghai Artificial Intelligence Laboratory. We also acknowledge the use of large language models for linguistic refinement during the preparation of this manuscript.

References

Aggarwal and Welleck [2025] Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697, 2025.
Aliprantis and Border [2006] Charalambos D. Aliprantis and Kim C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer, Berlin, Heidelberg, 3rd edition, 2006.
Anthropic [2025] Anthropic. System card: Claude Opus 4 & Claude Sonnet 4. May 2025. System card documentation.
Bai et al. [2025] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
Bai et al. [2022] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
Baker et al. [2025] Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926, 2025.
Banihashem et al. [2021] Kiarash Banihashem, Adish Singla, and Goran Radanovic. Defense against reward poisoning attacks in reinforcement learning. arXiv preprint arXiv:2102.05776, 2021.
Berge [1963] Claude Berge. Topological Spaces: Including a Treatment of Multi-Valued Functions, Vector Spaces and Convexity. Oliver & Boyd, 1963.
Chen et al. [2025a] Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567, 2025a.
Chen et al. [2025b] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410, 2025b.
Cheng et al. [2025] Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, et al. Revisiting reinforcement learning for LLM reasoning from a cross-domain perspective. arXiv preprint arXiv:2506.14965, 2025.
Fu et al. [2025] Tingchen Fu, Jiawei Gu, Yafu Li, Xiaoye Qu, and Yu Cheng. Scaling reasoning, losing control: Evaluating instruction following in large reasoning models. arXiv preprint arXiv:2505.14810, 2025.
Geist et al. [2019] Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In International Conference on Machine Learning, pages 2160–2169. PMLR, 2019.
Gemini Team, Google [2025] Gemini Team, Google. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. June 2025.
Guan et al. [2024] Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, et al. Deliberative alignment: Reasoning enables safer language models. arXiv preprint arXiv:2412.16339, 2024.
Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
Gupta et al. [2023] Dhawal Gupta, Yash Chandak, Scott Jordan, Philip S. Thomas, and Bruno C. da Silva. Behavior alignment via reward function optimization. Advances in Neural Information Processing Systems, 36:52759–52791, 2023.
Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
Hu et al. [2024] Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, and Yu Cao. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework. arXiv preprint arXiv:2405.11143, 2024.
Lecarpentier et al. [2021] Erwan Lecarpentier, David Abel, Kavosh Asadi, Yuu Jinnai, Emmanuel Rachelson, and Michael L. Littman. Lipschitz lifelong reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8270–8278, 2021.
Lee et al. [2023] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, and Abhinav Rastogi. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. 2023.
Liang et al. [2025] Yiqing Liang, Jielin Qiu, Wenhao Ding, Zuxin Liu, James Tompkin, Mengdi Xu, Mengzhou Xia, Zhengzhong Tu, Laixi Shi, and Jiacheng Zhu. MoDoMoDo: Multi-domain data mixtures for multimodal LLM reinforcement learning. arXiv preprint arXiv:2505.24871, 2025.
Munkres [2000] James R. Munkres. Topology. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 2000.
OpenAI [2024] OpenAI. OpenAI o1 system card. December 2024. Model card.
OpenAI [2025a] OpenAI. Expanding on what we missed with sycophancy. 2025a.
OpenAI [2025b] OpenAI. OpenAI o3 and o4-mini system card. April 2025b. Model card.
OpenAI [2025c] OpenAI. Sycophancy in GPT-4o: What happened and what we're doing about it. 2025c.
Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Puterman [2005] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2005.
Qu et al. [2025] Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, et al. A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond. arXiv preprint arXiv:2503.21614, 2025.
Roijers et al. [2013] Diederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.
Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Sheng et al. [2024] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024.
Silver and Sutton [2025] David Silver and Richard S. Sutton. Welcome to the era of experience. Google AI, 1, 2025.
Silver et al. [2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
Silver et al. [2018] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
Su et al. [2025] Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, and Dong Yu. Crossing the reward bridge: Expanding RL with verifiable rewards across diverse domains. arXiv preprint arXiv:2503.23829, 2025.
Sui et al. [2025] Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025.
Sutton and Barto [1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
Wang et al. [2020] Jingkang Wang, Yang Liu, and Bo Li. Reinforcement learning with perturbed rewards. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 6202–6209, 2020.
Wang et al. [2025a] Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Johannes Heidecke, Tejal Patwardhan, and Dan Mossing. Persona features control emergent misalignment. OpenAI, 2025a.
Wang et al. [2025b] Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, et al. Reinforcement learning for reasoning in large language models with one training example. arXiv preprint arXiv:2504.20571, 2025b.
Wen et al. [2024] Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, and Shi Feng. Language models learn to mislead humans via RLHF. arXiv preprint arXiv:2409.12822, 2024.
Wirth et al. [2017] Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1–46, 2017.
Yang et al. [2025] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

Appendix A Related Work

Stability in Reinforcement Learning.
A significant body of research in reinforcement learning theory focuses on the policy optimization process, analyzing convergence to optimality and the sample efficiency of various algorithms [Sutton and Barto, 1998, Puterman, 2005]. While it has been empirically observed and acknowledged in prior studies that learned policies may be sensitive to small perturbations in the reward function [Wirth et al., 2017, Wang et al., 2020, Banihashem et al., 2021, Gupta et al., 2023], a formal analysis of this instability has been less explored. Our work addresses this gap by providing a rigorous mathematical analysis of the reward-policy map itself. Using tools from functional analysis, we prove that policy discontinuities are an inherent property of the optimization landscape, arising naturally when multiple actions are optimal or near-optimal. This perspective on inherent instability also provides a new lens through which to understand the role of widely used regularization techniques. Prior works have typically explained their benefits from algorithmic or heuristic viewpoints: for instance, the KL-divergence penalty in PPO is viewed as a mechanism for stabilizing training updates [Schulman et al., 2017], while entropy regularization is primarily seen as a method to encourage exploration [Haarnoja et al., 2018]. Our work provides a more fundamental justification. We demonstrate that the KL-divergence penalty acts as a crucial tie-breaker among degenerate optimal policies, and prove that entropy regularization is not merely an exploration heuristic but a mathematical tool that restores Lipschitz continuity to the reward-policy map. This reveals these techniques to be not just algorithmic optimizations, but fundamental mechanisms for ensuring the mathematical stability of the learned policy.

RL for LLM Alignment and Reasoning.
The application of reinforcement learning to the alignment and reasoning of Large Language Models (LLMs) has been propelled by a series of landmark empirical studies. Methodologies such as Reinforcement Learning from Human Feedback (RLHF) [Ouyang et al., 2022] and Constitutional AI [Bai et al., 2022] established the practical viability of using RL to steer model behavior. This trend continues with the use of Reinforcement Learning with Verifiable Rewards (RLVR) in state-of-the-art large reasoning models, including OpenAI’s o-series, Gemini 2.5, Grok 4, and DeepSeek-R1. Despite their empirical success, these RL-trained systems often exhibit a range of behavioral flaws. The literature documents phenomena such as "spurious reasoning" or "deceptive alignment", where a model fabricates justifications for answers produced via shortcuts [Baker et al., 2025, Wang et al., 2025a, Chen et al., 2025b], and "poor instruction-following fidelity", where models ignore details not directly enforced by the reward function [Fu et al., 2025]. While existing research typically addresses these failures with isolated empirical analyses, our work is distinct in providing a unified theoretical explanation. We demonstrate that these are not disparate bugs but rational behaviors that emerge from the inherent mathematical instability of the policy optimization process under incomplete or biased reward signals. These challenges are amplified in settings where LLMs must balance multiple objectives, a common scenario in recent work on multi-reward RL across diverse domains [Guo et al., 2025, Liang et al., 2025, Cheng et al., 2025]. Standard Multi-Objective Reinforcement Learning (MORL) traditionally seeks to compute an entire Pareto frontier of trade-off solutions for an external decision-maker [Roijers et al., 2013]. 
Our approach differs, as LLMs typically operate under a single policy that must dynamically reconcile even conflicting goals, such as accuracy, safety, helpfulness, and style, based solely on the input context. We model this complex training regime by proposing a state-dependent effective reward, which represents the agent's internal, context-sensitive aggregation of multiple reward signals. By formalizing this as a mapping from various reward components to a unified objective, we can apply the same functional-analytic tools to characterize the stability of the resulting policy in these complex, multi-objective environments.

Appendix B Limitations

The principal contribution of this work is a novel theoretical framework for analyzing policy stability. While our framework is motivated by and helps explain observed empirical phenomena in Large Language Models (LLMs), the analysis itself remains primarily theoretical. It currently lacks systematic empirical validation designed to quantitatively test the theory's predictions regarding reward-policy stability. Furthermore, while our analysis provides a new theoretical foundation for the stabilizing effect of entropy regularization, proving that it can restore continuity to the reward-policy map, we do not yet offer novel algorithmic designs or actionable engineering guidelines beyond these theoretical insights. Bridging this gap between theory and practice, through rigorous empirical investigation and the development of novel algorithms (e.g., process-based reward modeling (PRM), chain-of-thought (CoT) reward monitoring, uncertainty quantification in reward models, and environment-interaction-based reward design), remains a critical direction for future research.

Appendix C Appendix on Additional Proofs

C.1 A Construction of the Bump Function

Lemma C.1.
Let $(S,d_S)$ and $(A,d_A)$ be compact metric spaces and endow the product $S\times A$ with the product metric
$$d_{S\times A}\big((s,a),(s',a')\big)=\max\{d_S(s,s'),\,d_A(a,a')\}.$$
For $\delta>0$ define the bump
$$\varphi_\delta(s,a)=\max\left\{0,\;1-\frac{d_{S\times A}\big((s,a),(s_0,a_2)\big)}{\delta}\right\},\qquad (s,a)\in S\times A.$$
Then $\varphi_\delta$ is continuous and bounded in $[0,1]$.

Proof. The map $(s,a)\mapsto d_{S\times A}((s,a),(s_0,a_2))$ is continuous on the compact product space, and the function $x\mapsto\max\{0,\,1-x/\delta\}$ is continuous on $\mathbb{R}$. Composition preserves continuity, hence $\varphi_\delta\in C(S\times A)$. Moreover $0\le\varphi_\delta(s,a)\le 1$ for all $(s,a)$, and $\varphi_\delta(s_0,a_2)=1$, so $\|\varphi_\delta\|_\infty=1$. ∎

C.2 Proof of the Lipschitz Property of the Softmax Function

Lemma C.2. Let $\pi:\mathbb{R}^d\to\mathbb{R}^d$ be the softmax function with temperature $\alpha>0$, defined as
$$\pi_i(x)=\frac{\exp(x_i/\alpha)}{\sum_{j=1}^{d}\exp(x_j/\alpha)}$$
for $x\in\mathbb{R}^d$. The function $\pi(x)$ is Lipschitz continuous from $(\mathbb{R}^d,\|\cdot\|_\infty)$ to $(\mathbb{R}^d,\|\cdot\|_1)$, with a Lipschitz constant of at most $1/\alpha$.
That is, for any $x,y\in\mathbb{R}^d$:
$$\|\pi(x)-\pi(y)\|_1\le\frac{1}{\alpha}\|x-y\|_\infty.$$

Proof. Let $f(x)=\pi(x)$. By the Mean Value Theorem for vector-valued functions, for any $x,y\in\mathbb{R}^d$:
$$\|f(x)-f(y)\|_1\le\sup_{t\in[0,1]}\big\|J_f\big(y+t(x-y)\big)\big\|_{(\infty,1)}\cdot\|x-y\|_\infty,$$
where $J_f(z)$ is the Jacobian matrix of $f$ at the point $z$, and $\|J_f\|_{(\infty,1)}$ is the induced operator norm for mappings from $(\mathbb{R}^d,\|\cdot\|_\infty)$ to $(\mathbb{R}^d,\|\cdot\|_1)$, defined as
$$\|J_f\|_{(\infty,1)}=\sup_{\|v\|_\infty\le 1}\|J_f v\|_1.$$
First, we compute the entries of the Jacobian matrix $J_f(x)$. The partial derivatives are
$$(J_f)_{ij}=\frac{\partial\pi_i}{\partial x_j}=\frac{1}{\alpha}\pi_i(\delta_{ij}-\pi_j),$$
where $\delta_{ij}$ is the Kronecker delta. Next, we evaluate the action of the Jacobian on a vector $v\in\mathbb{R}^d$ with $\|v\|_\infty\le 1$.
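As an aside, both the closed-form Jacobian $(J_f)_{ij}=\pi_i(\delta_{ij}-\pi_j)/\alpha$ and the Lipschitz bound claimed in Lemma C.2 can be sanity-checked numerically. The sketch below (not part of the proof; all helper names are illustrative) compares the closed form against central finite differences, then confirms the $1/\alpha$ ratio bound on random input pairs:

```python
import math
import random

def softmax(x, alpha):
    """Softmax with temperature alpha, computed stably by shifting by max(x)."""
    m = max(x)
    exps = [math.exp((xi - m) / alpha) for xi in x]
    z = sum(exps)
    return [e / z for e in exps]

def jacobian_closed_form(x, alpha):
    """(J_f)_{ij} = pi_i (delta_{ij} - pi_j) / alpha, as derived above."""
    p = softmax(x, alpha)
    d = len(x)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) / alpha
             for j in range(d)] for i in range(d)]

def jacobian_finite_diff(x, alpha, h=1e-6):
    """Central-difference approximation of the softmax Jacobian."""
    d = len(x)
    J = [[0.0] * d for _ in range(d)]
    for j in range(d):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = softmax(xp, alpha), softmax(xm, alpha)
        for i in range(d):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

random.seed(0)
alpha, d = 0.7, 5

# 1) The closed-form Jacobian matches finite differences.
x = [random.uniform(-3, 3) for _ in range(d)]
Jc = jacobian_closed_form(x, alpha)
Jn = jacobian_finite_diff(x, alpha)
jac_err = max(abs(Jc[i][j] - Jn[i][j]) for i in range(d) for j in range(d))
assert jac_err < 1e-5

# 2) ||pi(x) - pi(y)||_1 <= (1/alpha) ||x - y||_inf on random pairs.
worst_ratio = 0.0
for _ in range(2000):
    x = [random.uniform(-5, 5) for _ in range(d)]
    y = [random.uniform(-5, 5) for _ in range(d)]
    l1 = sum(abs(a - b) for a, b in zip(softmax(x, alpha), softmax(y, alpha)))
    linf = max(abs(a - b) for a, b in zip(x, y))
    worst_ratio = max(worst_ratio, l1 / linf)
assert worst_ratio <= 1.0 / alpha
```

Consistent with the proof's $g(p)=4p(1-p)$ argument, the observed ratio approaches $1/\alpha$ only when the probability mass splits roughly in half between the coordinates being pushed up and those being pushed down.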
The $i$-th component of the resulting vector $J_f v$ is:
$$(J_f v)_i=\sum_{j=1}^{d}(J_f)_{ij}v_j=\sum_{j=1}^{d}\frac{1}{\alpha}\pi_i(\delta_{ij}-\pi_j)v_j \tag{17}$$
$$=\frac{1}{\alpha}\pi_i\Big(\sum_{j=1}^{d}\delta_{ij}v_j-\sum_{j=1}^{d}\pi_j v_j\Big) \tag{18}$$
$$=\frac{1}{\alpha}\pi_i(v_i-\bar v), \tag{19}$$
where we define $\bar v:=\sum_{k=1}^{d}\pi_k v_k$. Now, we compute the $L_1$-norm of $J_f v$:
$$\|J_f v\|_1=\sum_{i=1}^{d}\big|(J_f v)_i\big|=\sum_{i=1}^{d}\Big|\frac{1}{\alpha}\pi_i(v_i-\bar v)\Big|=\frac{1}{\alpha}\sum_{i=1}^{d}\pi_i|v_i-\bar v|,$$
since $\pi_i>0$ and $\alpha>0$. To find the operator norm, we must maximize $\Phi(v,\pi):=\sum_{i=1}^{d}\pi_i|v_i-\bar v|$ over all $v$ with $\|v\|_\infty\le 1$. The function $\Phi$ is convex in $v$, so its maximum over the compact convex set (the hypercube $\|v\|_\infty\le 1$) must be achieved at one of its vertices. The vertices of this hypercube are vectors $v$ with components $v_i\in\{-1,1\}$. Let us fix such a vector $v$. Let $S=\{i\mid v_i=1\}$, and let $p=\sum_{i\in S}\pi_i$. Then $\sum_{i\notin S}\pi_i=1-p$.
The weighted average $\bar v$ becomes:
$$\bar v=\sum_{i\in S}\pi_i(1)+\sum_{i\notin S}\pi_i(-1)=p-(1-p)=2p-1.$$
We can now express $\Phi$ in terms of $p$:
$$\Phi(v,\pi)=\sum_{i\in S}\pi_i|1-(2p-1)|+\sum_{i\notin S}\pi_i|-1-(2p-1)| \tag{20}$$
$$=\sum_{i\in S}\pi_i|2-2p|+\sum_{i\notin S}\pi_i|-2p| \tag{21}$$
$$=(2-2p)\sum_{i\in S}\pi_i+2p\sum_{i\notin S}\pi_i\qquad\text{(since $p\in[0,1]$)} \tag{22}$$
$$=2(1-p)p+2p(1-p) \tag{23}$$
$$=4p(1-p). \tag{24}$$
The function $g(p)=4p(1-p)$ for $p\in[0,1]$ has a maximum value of $1$, achieved at $p=1/2$. Thus, the maximum value of $\Phi(v,\pi)$ is at most $1$. This implies:
$$\|J_f(x)\|_{(\infty,1)}=\sup_{\|v\|_\infty\le 1}\frac{1}{\alpha}\Phi(v,\pi)\le\frac{1}{\alpha}.$$
This bound is tight and holds for any $x\in\mathbb{R}^d$. Finally, substituting this uniform bound back into the Mean Value Theorem inequality, we get:
$$\|\pi(x)-\pi(y)\|_1\le\frac{1}{\alpha}\|x-y\|_\infty.$$
This completes the proof. ∎

Appendix D Experimental Details of the Multi-Reward RL Setting

The training hyperparameters used in the experiments described in Section 5.6 are summarized in Table 2. All experiments were conducted on a cluster of A800 GPUs using the OpenRLHF-V framework.
Table 2: Training hyperparameters of the experiments in Section 5.6

| Parameter | Value |
| --- | --- |
| rollout_batch_size | 128 |
| train_batch_size | 64 |
| micro_rollout_batch_size | 4 |
| micro_train_batch_size | 2 |
| max_len | 9000 |
| prompt_max_len | 4096 |
| generate_max_len | 4096 |
| advantage_estimator | group_norm |
| n_samples_per_prompt | 8 |
| temperature | 1 |
| vllm_sync_backend | nccl |
| apply_chat_template | true |
| use_kl_loss | true |
| init_kl_coef | 0.001 |
| kl_estimator | k3 |
| freeze_prefix | true |
| lr_warmup_ratio | 0 |
| actor_learning_rate | 1e-6 |
| bf16 | true |
| max_epochs | 1 |
| eps_clip | 0.2 |
| value_clip | 0.2 |
| max_norm | 1 |
| adam_betas | [0.9, 0.95] |
| flash_attn | true |
| zero_stage | 3 |
| adam_offload | true |
| colocate_actor_ref | true |