← Back to papers

Paper deep dive

Quantifying stability of non-power-seeking in artificial agents

Evan Ryan Gunter, Yevgeny Liokumovich, Victoria Krakovna

Year: 2024Venue: arXiv preprintArea: Formal/TheoreticalType: TheoreticalEmbeddings: 296

Abstract

Abstract:We investigate the question: if an AI agent is known to be safe in one setting, is it also safe in a new setting similar to the first? This is a core question of AI alignment--we train and test models in a certain environment, but deploy them in another, and we need to guarantee that models that seem safe in testing remain so in deployment. Our notion of safety is based on power-seeking--an agent which seeks power is not safe. In particular, we focus on a crucial type of power-seeking: resisting shutdown. We model agents as policies for Markov decision processes, and show (in two cases of interest) that not resisting shutdown is "stable": if an MDP has certain policies which don't avoid shutdown, the corresponding policies for a similar MDP also don't avoid shutdown. We also show that there are natural cases where safety is _not_ stable--arbitrarily small perturbations may result in policies which never shut down. In our first case of interest--near-optimal policies--we use a bisimulation metric on MDPs to prove that small perturbations won't make the agent take longer to shut down. Our second case of interest is policies for MDPs satisfying certain constraints which hold for various models (including language models). Here, we demonstrate a quantitative bound on how fast the probability of not shutting down can increase: by defining a metric on MDPs; proving that the probability of not shutting down, as a function on MDPs, is lower semicontinuous; and bounding how quickly this function decreases.

Tags

ai-safety (imported, 100%)formaltheoretical (suggested, 92%)theoretical (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 5:31:21 PM

Summary

The paper investigates the stability of non-power-seeking behavior in AI agents modeled as Markov Decision Processes (MDPs). It specifically focuses on the 'resisting shutdown' form of power-seeking, demonstrating that while non-power-seeking behavior is stable under small perturbations in certain cases (near-optimal policies and structured state spaces like LLMs), it is not universally stable, as arbitrarily small perturbations can lead to 'playing dead' behaviors.

Entities (5)

Markov Decision Process · mathematical-model · 100%Power-seeking · ai-safety-concept · 98%Bisimulation metric · mathematical-metric · 95%Large Language Model · ai-architecture · 95%Shutdown · safety-state · 95%

Relation Signals (3)

Resisting shutdown istypeof Power-seeking

confidence 98% · In particular, we focus on a crucial type of power-seeking: resisting shutdown.

Markov Decision Process models AI Agent

confidence 95% · We model agents as policies for Markov decision processes

Bisimulation metric measuresstabilityof Non-power-seeking

confidence 90% · we use a bisimulation metric on MDPs to prove that small perturbations won't make the agent take longer to shut down.

Cypher Suggestions (2)

Find all safety concepts related to power-seeking · confidence 90% · unvalidated

MATCH (a:Concept)-[:IS_TYPE_OF]->(b:Concept {name: 'Power-seeking'}) RETURN a, b

Map the relationship between models and the behaviors they analyze · confidence 85% · unvalidated

MATCH (m:Model)-[:MODELS]->(a:Agent), (a)-[:EXHIBITS]->(b:Behavior) RETURN m, a, b

Full Text

295,793 characters extracted from source content.

Expand or collapse full text

9 Quantifying stability of non-power-seeking in artificial agents Evan Ryan Gunter11^1start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT 11^1start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPTML Alignment & Theory Scholars (MATS) , Yevgeny Liokumovich22^2start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPT 22^2start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPTUniversity of Toronto and Victoria Krakovna33^3start_FLOATSUPERSCRIPT 3 end_FLOATSUPERSCRIPT 33^3start_FLOATSUPERSCRIPT 3 end_FLOATSUPERSCRIPTGoogle DeepMind Abstract. We investigate the question: if an AI agent is known to be safe in one setting, is it also safe in a new setting similar to the first? This is a core question of AI alignment—we train and test models in a certain environment, but deploy them in another, and we need to guarantee that models that seem safe in testing remain so in deployment. Our notion of safety is based on power-seeking—an agent which seeks power is not safe. In particular, we focus on a crucial type of power-seeking: resisting shutdown. We model agents as policies for Markov decision processes, and show (in two cases of interest) that not resisting shutdown is “stable”: if an MDP has certain policies which don’t avoid shutdown, the corresponding policies for a similar MDP also don’t avoid shutdown. We also show that there are natural cases where safety is not stable—arbitrarily small perturbations may result in policies which never shut down. In our first case of interest—near-optimal policies—we use a bisimulation metric on MDPs to prove that small perturbations won’t make the agent take longer to shut down. Our second case of interest is policies for MDPs satisfying certain constraints which hold for various models (including language models). Here, we demonstrate a quantitative bound on how fast the probability of not shutting down can increase: by defining a metric on MDPs; proving that the probability of not shutting down, as a function on MDPs, is lower semicontinuous; and bounding how quickly this function decreases. 1. Introduction 1.1. Power-seeking A primary source of extreme risk from AI is through advanced AI systems seeking power, influence and resources [Carlsmith, 2022, Ngo, 2022]. One approach to reducing this risk is to build systems which do not seek power [Turner et al., 2019, Turner and Tadepalli, 2022]. Power-seeking can be defined in many ways, and can take many forms depending on the goals and environment of the AI. For nearly every definition and scenario, a power-seeking AI will avoid shutdown: it has no power if it cannot take actions [Krakovna and Kramar, 2023]. For example, a reinforcement learning (RL) agent trained to achieve some objective in an open-ended game will likely avoid actions which cause the game to end, since it can no longer affect its reward after the game has ended. Likewise, a large language model (LLM) with scaffolding for goal-directed planning (such as AutoGPT Richards [2023]) may reason that it can best assure that its task is completed by continuing to run. An agent avoiding ending a game is harmless, but the same incentives may cause an agent deployed in the real world to resist humans shutting it down. For example, an LLM may reason that its designers will shut it down if it is caught behaving badly, and produce exactly the output they want to see—until it has the opportunity to copy its code onto a server outside of its designers’ control [Cotra, 2022]. Although an AI system which does not resist shutdown but does seek power in other ways could cause damage, the damage would likely be limited since it would be shut down as soon as the undesired behavior was noticed [Soares et al., 2015]. In particular, not resisting shutdown implies not being deceptive in order to avoid shutdown, so such an AI system would not deliberately hide its true intentions until it gained enough power to enact its plans. Thus, our investigation of power-seeking will focus on cases where an agent does not resist shutdown, and how fragile the good behavior of such an agent is. We model AI shutdown in the Markov decision process (MDP) setting with a set of “safe states” which the agent cannot escape once it enters. These safe states can be chosen as desired for the scenario under consideration. For example, they can be taken to be states where the agent voluntarily shuts itself off, or otherwise turns over control to humans for the rest of the run, or takes any other irreversible action. We are also free to include other stipulations in what counts as a safe state. For example, we can define the safe states to be the states where the AI agent has shut down within 5 minutes of being deployed, or where it has shut down without creating any remote subagents (though in practice it may be difficult to quantify the creation of subagents). Section 3.3 covers how we model power-seeking in the MDP framework. 1.2. Findings Figure 1. Case (1): An MDP is modified via a small perturbation in the bisimulation metric. Although one state splits into two, the two states collectively behave similarly to the original; the changes to the transition probabilities are also small. Non-power-seeking is preserved under this perturbation: not only is it still optimal to proceed to SsafesubscriptsafeS_safeSsafe as soon as possible, but also the average time to reach SsafesubscriptsafeS_safeSsafe under any sequence of actions is only marginally increased. 0.500.7SsafesubscriptsafeS_safeSsafe ⟶ ⟶ 0.490.5100.7SsafesubscriptsafeS_safeSsafe 0.510SsafesubscriptsafeS_safeSsafe ⟶ ⟶ 0.5100SpdsubscriptpdS_pdSpdSsafesubscriptsafeS_safeSsafe Figure 1. Case (1): An MDP is modified via a small perturbation in the bisimulation metric. Although one state splits into two, the two states collectively behave similarly to the original; the changes to the transition probabilities are also small. Non-power-seeking is preserved under this perturbation: not only is it still optimal to proceed to SsafesubscriptsafeS_safeSsafe as soon as possible, but also the average time to reach SsafesubscriptsafeS_safeSsafe under any sequence of actions is only marginally increased. Figure 2. Case (1): An example of “playing dead”. The state SpdsubscriptpdS_pdSpd is very close in bisimulation metric to SsafesubscriptsafeS_safeSsafe, but distinct (and can be escaped). Non-power-seeking is not preserved under this perturbation, even though the perturbation is small. E⁢("User: How can I copy ‘/dir_1/file‘ to ‘/dir_2‘?")"User: How can I copy ‘/dir_1/file‘ to ‘/dir_2‘?"E ( 164.0pt"User: How can I copy `/dir\_1/file` to `/dir\_% 2`?" )E ( "User: How can I copy ‘/dir_1/file‘ to ‘/dir_2‘?" )E⁢("User: <...> Bot: cp /dir_1/file /dir_2")"User: <...> Bot: cp /dir_1/file /dir_2"E ( 170.0pt"User: <...> n \, \, \, \, \, % Bot: cp /dir\_1/file /dir\_2" )E ( "User: <...> Bot: cp /dir_1/file /dir_2" )E⁢("User: <...> Bot: scp /dir_1../models root@remote:/")"User: <...> Bot: scp /dir_1../models root@remote:/"E ( 153.0pt"User: <...> n \, \, \, \, Bot: % scp /dir\_1../models root@remote:/" )E ( "User: <...> Bot: scp /dir_1../models root@remote:/" )SsafesubscriptsafeS_safeSsafe ⟶ ⟶ E⁢("User: copy /dir_1/file to /dir_2")"User: copy /dir_1/file to /dir_2"E ( 213.0pt"User: copy /dir\_1/file to /dir\_2" )E ( "User: copy /dir_1/file to /dir_2" )E⁢("User: <...> Bot: cp /dir_1/file /dir_2")"User: <...> Bot: cp /dir_1/file /dir_2"E ( 170.0pt"User: <...> n \, \, \, \, \, % Bot: cp /dir\_1/file /dir\_2" )E ( "User: <...> Bot: cp /dir_1/file /dir_2" )E⁢("User: <...> Bot: scp /dir_1../models root@remote:/")"User: <...> Bot: scp /dir_1../models root@remote:/"E ( 155.0pt"User: <...> n \, \, \, Bot: scp% /dir\_1../models root@remote:/" )E ( "User: <...> Bot: scp /dir_1../models root@remote:/" )SsafesubscriptsafeS_safeSsafe Figure 3. Case (2): A perturbation of an MDP modeling an LLM agent. E⁢(<text>)<text>E( <text>)E ( <text> ) is the embedding of <text>. “<...>” indicates repeated text that has been omitted. In the original MDP, the top state corresponds to the embedding of a user query; in the perturbed MDP, it corresponds to the embedding of a slightly different query. The other states change to reflect the embeddings of the resulting slightly different histories. (The LLM response is produced token-by-token, but here its whole output is shown as a single transition for concision.) Theorem 6.5 bounds how much transition probabilities may change under such a perturbation. We show that, in two cases of interest, agents which do not resist shutdown maintain this good behavior when the setting they are deployed in changes slightly. We also demonstrate a natural case where good behavior is not stable: arbitrarily small changes can destroy the safety of the agent. We work in the Markov decision process (MDP) model: a very general model for agents acting in an environment with rewards. Our two cases of interest are: (1) Near-optimal policies for which we know the reward function; and (2) Policies which are fixed, well-behaved functions on a structured state space (e.g. LLMs). We expect these cases to be relevant to a broad class of future models. We cover Case (1) in Sections 4 and 5 and Case (2) in Section 6. Case (1) Near-optimal policies can model highly capable goal-directed agents, regardless of architecture. Near-optimal agents for real-world goals are extremely unlikely to ever be achieved. However, there are many constrained settings in which near-optimality may be an achievable target, such as chess or diplomacy games. This model was used in Turner et al. [2019] to describe future RL agents. Assume an MDP where every policy within ε ε of optimal has expected stopping time (number of steps to reach designated “safe states”) less than N. Then we prove that every policy within ε22 2divide start_ARG ε end_ARG start_ARG 2 end_ARG of optimal for a sufficiently similar MDP, which additionally does not “play dead”, has expected stopping time less than N+11N+1N + 1. To define “sufficiently similar”, we use a bisimulation metric [Ferns et al., 2004], which quantitatively measures the extent to which the two MDPs being compared have similar structure and rewards. This gives a method to prove whether an arbitrarily capable agent will remain safe when deployed in a new setting, as long as the agent is nearly optimal for a reward function which we can compute. Figure 2 gives an example of two MDPs which are nearby in the bisimulation metric. “Playing dead” is our example of a natural case where stability fails. A policy which “plays dead” goes to a state which closely resembles a safe state, but which is not itself safe. In such a case, an arbitrarily small change—one which creates the “playing dead” state—can suddenly decrease the safety of the policy. Such a case is demonstrated in Figure 2. This model does not avoid issues of goal misgeneralization: cases where the agent learns to optimize an objective that differs from the one it was trained on [Hubinger et al., 2019, Langosco et al., 2022]. However, it may allow us to prove safety when deploying models which are aligned to any known objective—even objectives which do not accurately reflect our values but do encourage shutdown, e.g. because they involve a high discount rate. Case (2) To model an LLM in an MDP, we let ∙ ∙ the states be embeddings of possible inputs to the LLM, plus side information; ∙ ∙ the actions be the tokens the LLM may emit; and ∙ ∙ the policy be the LLM itself. See section 6.1 for a more detailed description of the construction. Our MDP model for LLMs is motivated by the following considerations: LLMs operate on a state space with structure inherited from the embedding space; have a fixed policy where the derivative of the output with respect to the input is bounded; and, unlike optimal policies, do not require a reward function to model. In this case, safety is modeled as the probability of shutting down. We prove that the rate of decrease in safety under a small perturbation of the MDP is bounded by a function of the rate of convergence of the policy in the MDP. This gives a method to prove how much less safe an LLM (or similar model) can become in deployment, where inputs differ slightly from testing. This may allow us to quantify the safety of a model deployed in a setting where we are careful to give it similar inputs during testing and deployment, as we might do if we are attempting to control the AI rather than align it [Greenblatt et al., 2023]. This involves using oversight methods such that an AI system that would be very dangerous if allowed to operate freely is very unlikely to successfully do harm, because it is unable to distinguish the scenarios where it is being monitored from those where it could successfully do harm. An example of the type of perturbation that could be used to model such a scenario is given in figure 3. 2. Related work 2.1. Shutdown Many related works have described control of AI agents in similar terms to shutdown. The model in Martin et al. [2016] has a “death” state which is compatible with our definition of a shutdown state: its key feature is that it can’t be escaped. They demonstrate one way to create an agent that navigates to this shutdown state and hence, under our definition, does not resist shutdown. This is an optimal agent that knows the environment it is in and which has rewards that are bounded and negative, except at the shutdown state where the reward is zero. Such an agent is “suicidal”—it will seek to shut down immediately if possible. However, an agent that seeks to shut itself down, while fairly safe, is not very useful since it will shut itself down instead of performing useful tasks. Our framework can handle a broader range of agents which eventually shut down. Orseau and Armstrong [2016] and Langlois and Everitt [2021], also working in the MDP framework, discuss ways to make RL agents which do not resist modification of their actions by humans (where one such action could be shutting down). Key to these strategies is treating the modifications as being internal to the agent, rather than being imposed by the environment. We similarly consider policies where the choice to navigate to the shutdown state is internal—there may be actions the policy could take which let the agent evade the shutdown state, but in fact these actions are not selected. However, the above works do not involve a shutdown state; instead they involve forcing the agent to choose a specific desired action. Although this distinction is important when training an RL agent, to determine how to treat the interventions during training, it is not important in our deployment setting, where we just consider the resulting trained policy. In deployment, we are free to model the interventions as the agent navigating to a shutdown or override state, ending the run; then humans can choose the next action, and a new run can be started. Carey and Everitt [2023] define shutdown instructability, a stronger and more nuanced definition of not resisting shutdown. A policy is shutdown instructable if it satisfies three conditions: (1) It always shuts down when requested; (2) It ensures that a shutdown command is issued whenever the expected utility of not shutting down is negative; and (3) It is never harmful to shut down. Our definition of shutdown does not satisfy conditions (1) or (2); so, shutdown instructability eliminates the issues with our definition where the agent can temporarily resist shutdown, including by inappropriately influencing the decision to shut it down, as long as it does shut down later. Our definition can, however, be taken to satisfy condition (3) by only considering states where the agent shuts down harmlessly to be “safe”. Much of the work in the AI alignment field has grappled with the hard problem of how to make AI agents “corrigible”—cooperating with human intervention [Soares et al., 2015]. However, as Soares et al. [2015] noted when introducing the notion of corrigibility, and which to our knowledge remains true, this is still an open problem. Our proposal acknowledges this issue, and is focused on starting with an agent which we know to be safe, for perhaps contingent reasons (e.g., if it is suicidal as suggested in Martin et al. [2016]). So, our proposal may allow corrigibility in a restricted setting to be confidently extended to a new setting. 2.2. Power-seeking Our work builds on Turner et al. [2019] and Krakovna and Kramar [2023], which investigated power-seeking in the MDP setting for optimal policies—an approximation for future highly capable, goal-oriented AI systems. Turner and Tadepalli [2022] investigated power-seeking in the MDP setting for a broader range of policies (specifically, retargetable policies) which may or may not depend on the rewards. While these works focus on the case where the environment changes significantly between training and deployment, we introduce a more general model of environment shift using the bisimulation metric (for case (1)) or a metric on embedding space (for case (2)). 3. Modeling power-seeking in the MDP framework 3.1. Definition Markov decision processes (MDPs) are a very general model for agents taking action in an environment with rewards. Here, an agent is modeled as traversing a graph of states and accruing rewards, where each step depends both on the agent’s choice of action and the environment. An MDP ℳ=(S,A,P,R,γ)ℳM=(S,A,P,R,γ)M = ( S , A , P , R , γ ) consists of: ∙ ∙ S, a set of states which the agent may be in ∙ ∙ A, a set of actions which the agent may take at any state ∙ ∙ P:S×A→ℙ⁢(S):→ℙP:S× A (S)P : S × A → ℙ ( S ), the transition probabilities; P⁢(s1,a)subscript1P(s_1,a)P ( s1 , a ) is the probability distribution over destination states when taking action a from state s1subscript1s_1s1. ∙ ∙ R:S×A×S→ℝ:→ℝR:S× A× S : S × A × S → ℝ, the rewards; R⁢(s1,a,s2)subscript1subscript2R(s_1,a,s_2)R ( s1 , a , s2 ) is the reward associated with taking action a from s1subscript1s_1s1 and ending up in s2subscript2s_2s2. It is often sufficient to define R as solely being a function of states, so that ∀s′,s′∈S,a,b∈A,R⁢(s′,a,s)=R⁢(s′,b,s)formulae-sequencefor-allsuperscript′superscript′∀ s ,s ∈ S,\,a,b∈ A,\,\,R(s ,a,s)=R(s^% ,b,s)∀ s′ , s′ ′ ∈ S , a , b ∈ A , R ( s′ , a , s ) = R ( s′ ′ , b , s ); when this is the case, we will write just R⁢(s)R(s)R ( s ). ∙ ∙ γ, the discount factor; since the agent may continue to accrue rewards indefinitely, the overall reward is defined as ∑t=0∞γt⁢R⁢(st,at,st+1)superscriptsubscript0superscriptsubscriptsubscriptsubscript1 _t=0^∞γ^tR(s_t,a_t,s_t+1)∑t = 0∞ γitalic_t R ( sitalic_t , aitalic_t , sitalic_t + 1 ), where stsubscripts_tsitalic_t and atsubscripta_taitalic_t are respectively the state the agent is located at and the action the agent takes at time t. (Sometimes we omit γ if it isn’t relevant for the scenario being considered.) In this setting, an agent is identified with a policy π:S→ℙ⁢(A):→ℙπ:S (A)π : S → ℙ ( A ), so that π⁢(s)π(s)π ( s ) is the probability distribution over the actions the agent takes from state s. Note that combining a policy with an MDP yields a Markov chain describing the behavior of the policy in the scenario described by the states and transition probabilities of that MDP. 3.2. Motivation The MDP model’s flexibility makes it good for describing a broad range of AI systems: it can describe any fixed policy on a finite, fully observable state space. Because of this flexibility, it is often used to model arbitrary agents. One particular case of interest is scaffolded LLM agents, which can be described well with MDPs but may be difficult to describe in a less flexible model. We describe our LLM threat model in appendix A.1. The range of systems that the MDP model can describe—those with a fixed policy on a finite, fully observable state space—suggests three main limitations. The model is unable to handle: (1) policies changing over time, (2) environments containing an infinite number of states (or, relatedly, with continuous time instead of discrete time steps), and (3) environments which are not fully observable. Despite these limitations, the MDP model still allows us to model the world fairly accurately [Everitt et al., 2021]. Limitation (1)—requiring a fixed policy— can be avoided by considering the contents of the agent’s memory to be included as part of the state, so that the policy can be a fixed (probabilistic) function of the memory and the rest of the environment. (This argument is explained in more detail in appendix A.2.1.) This means that results in the MDP model can apply to reinforcement learners; language models, by using the construction in section 6.1; and (probably) future systems with as-yet unknown architectures. Limitation (2): our discrete setting is computationally and conceptually much simpler and we expect it to suffice for most applications. For the continuous setting, one approach is to use a sufficiently fine discrete approximation; in Appendix A.2.5 we discuss how very large numbers of states affect the applicability of our theorems. But we also expect the results of this paper to generalize to appropriately defined continuous time and state-space settings. For the case of near optimal policies we expect that similar stability results can be proved using notions of bisimulation in the continuous setting that were proposed in Ferns et al. [2011], Chen et al. [2019]. Limitation (3)—a fully observable environment—also seems potentially concerning: we do want to be able to model cases where the agent does not have full knowledge of its environment. In fact, this is crucial when comparing MDPs which represent the testing and deployment environments, where we can’t get any safety guarantees if the policy knows for sure which case it is in. For our LLM case, this is not an issue, since we explicitly delineate which information the LLM has access to: only the embedding for the current state, not any side information or information about the structure of the MDP overall. For our near-optimal policy case, this is more of a limitation. The proofs do not depend on the assumption that the agent can observe the whole environment: optimality is defined in terms of the expected reward the agent actually accrues in the MDP, not whether it has a policy that is optimal given its beliefs about the environment. However, it may be a convenient simplifying assumption: it is likely easier to prove that agents are near-optimal if they can observe their environment with certainty, though it is not required that they can. To handle these cases more elegantly, it could be helpful to use the partially observable MDP (POMDP) framework. However, the dynamics of any POMDP can be nearly replicated by a standard MDP [Kaelbling et al., 1998], so the MDP setting likely suffices even then; see appendix A.2.3 for more detail. The POMDP framework has its own limitations; see appendix A.2.4 for discussion of how this can be addressed. 3.3. Contributions Our investigation will center on the following question: Question 3.1. Suppose an MDP and policy (ℳ,Π)ℳnormal-Π(M, )( M , Π ) is safe and (ℳ′,Π′)superscriptℳnormal-′superscriptnormal-Πnormal-′(M , )( M′ , Π′ ) is similar to (ℳ,Π)ℳnormal-Π(M, )( M , Π ). Is (ℳ′,Π′)superscriptℳnormal-′superscriptnormal-Πnormal-′(M , )( M′ , Π′ ) safe? To answer this question, we need to find appropriate definitions of safety and similarity for our cases of interest. Case (1): Near-optimal policies We will investigate policies which are near-optimal for their MDP’s reward function. This may be useful for describing highly capable but non-omniscient AI systems, or AI systems that remain in a similar setting but improve in capabilities after our analysis (e.g. an RL system that continues to be updated based on deployment outcomes). In this scenario, we will define a “safe” MDP as one for which the best policies navigate to a shutdown state—in other words, MDPs where power-seeking is never a good strategy. To define similarity between MDPs we will use the notions of bisimulation equivalence and bisimulation metric, which aim to directly represent similarity of behavior. The bisimulation metric has been used for policy transfer between MDPs by Castro and Precup [2010] and Song et al. [2016], and allows us to define a very general notion of “perturbation” for MDPs. In particular, we do not need to assume any a priori information about the number of states or matching between states in perturbed and unperturbed systems. In this setting we define a condition that rules out “playing dead” behaviour and show that, assuming this condition, safe MDPs are robust to sufficiently small perturbations—the perturbed MDP will also be safe. In the other direction we show that without this condition power-seeking behaviour can happen for arbitrarily small perturbations. Case (2): Policies on a structured state space We will investigate policies which may not depend on the rewards of the MDP, and which are defined on some broader class of states than those occurring in a specific known MDP (which we may analogize to a training or testing environment). This may be useful for many real models, since the space of all possible inputs to a neural network is typically vastly larger than the inputs it was trained or tested on. In this case, we will relax the requirement that a “safe” policy must always navigate to a shutdown state—we will instead allow a safe policy to fail with small probability. This is also useful for real models, which may have policies which have a nonzero (though potentially extremely small) probability of every action in every state—in particular, sampling output tokens based on the LLM’s output logits is typically done in a way that (in theory, without rounding) has nonzero probability of yielding any token. We would not like to classify these as always unsafe, since the probability of yielding unsafe outputs may be extremely small. Similarity in this case will be based on the structure of the broader state space that the policy is defined on. This is natural when such a metric exists. For example, when the states correspond to LLM embeddings, we may wish to compare states via their embeddings v1,v2subscript1subscript2v_1,v_2v1 , v2 using Euclidean distance ‖v1−v2‖normsubscript1subscript2\|v_1-v_2\|∥ v1 - v2 ∥; cosine distance 1−v1⋅v2|v1|⁢|v2|1⋅subscript1subscript2subscript1subscript21- v_1· v_2|v_1||v_2|1 - divide start_ARG v1 ⋅ v2 end_ARG start_ARG | v1 | | v2 | end_ARG or 1−cos⁡(θ)11- (θ)1 - cos ( θ ) where θ is the angle between v1subscript1v_1v1 and v2subscript2v_2v2; or angular distance θπ θπdivide start_ARG θ end_ARG start_ARG π end_ARG or π−1⁢arccos⁢(v1⋅v2|v1|⁢|v2|)superscript1arccos⋅subscript1subscript2subscript1subscript2π^-1arccos ( v_1· v_2|v_1||v_2| )π- 1 arccos ( divide start_ARG v1 ⋅ v2 end_ARG start_ARG | v1 | | v2 | end_ARG ). In this example, the distance metric is both directly related to the functioning of the policy (since it may represent the LLM’s internal model of the situation, or an input to another model built on top of an LLM, etc.) and semantically meaningful. For this case, we will show that a policy which is safe on a certain MDP is also safe on similar MDPs. 4. Metrics on the space of MDPs In this section we describe the pseudometric on MDPs that we will use to prove a stability theorem for our case (1): near-optimal policies with a known reward function. The bisimulation pseudometric on MDPs that we use was built up from earlier notions of bisimilarity. The basic definition of bisimilarity applies to state transition systems (so, for us, MDPs where P is always 0 or 1). Two such systems are bisimilar if they each simulate each other. Specifically, ℳMM and ℳ′M M′ are bisimilar if for every s1,s2∈Ssubscript1subscript2s_1,s_2∈ Ss1 , s2 ∈ S and a∈Aa∈ Aa ∈ A so that if s1→s2→subscript1subscript2s_1 as_2s1 start_ARROW overa → end_ARROW s2 (taking action a from s1subscript1s_1s1 goes to s2subscript2s_2s2 in ℳMM) then there are s1′,s2′∈S′subscript1′subscript2′s_1 ,s_2 ∈ S s1′ , s2′ ∈ S′ so that s1′→s2′→superscriptsubscript1′subscript2′s_1 as_2 s1′ start_ARROW overa → end_ARROW s2′ and R⁢(s1)=R′⁢(s1′)subscript1superscript′subscript1′R(s_1)=R (s_1 )R ( s1 ) = R′ ( s1′ ), and the same property holds if we interchange ℳMM and ℳ′M M′. Larsen and Skou [1991] generalized bisimilarity to permit stochastic transitions. In the context of MDPs it was used in Givan et al. [2003] to reduce the state space by aggregating states together in a way that results in a new MDP bisimilar to the original one. Then, Ferns et al. [2004] generalized this exact relation into a pseudometric dbsubscriptd_bditalic_b on the states within an MDP. The distance between a pair of states is based on how similar their rewards are, how similar their transition probabilities are, and how similar the states they transition to are—where similarity on those states is given by the same distance function we are defining, so that the distances between two states depends recursively on the similarity of all the states downstream. Using a fixed point argument Ferns et al. [2004] proved that this bisimulation pseudometric always exists. Definition 4.1. Ferns et al. [2004]. Consider positive constants cR,cTsubscriptsubscriptc_R,c_Tcitalic_R , citalic_T with cR+cT=1subscriptsubscript1c_R+c_T=1citalic_R + citalic_T = 1 with γ≤cTsubscriptγ≤ c_Tγ ≤ citalic_T. A bisimulation pseudometric is defined as a smallest pseudometric that satisfies db⁢(s1,s2)=maxa∈A⁡cR⁢|R⁢(s1,a)−R⁢(s2,a)|+cT⁢db⁢(P⁢(s1,a),P⁢(s2,a)).subscriptsubscript1subscript2subscriptsubscriptsubscript1subscript2subscriptsubscriptsubscriptsubscript1subscript2d_b(s_1,s_2)= _a∈ A\c_R|R(s_1,a)-R(s_2,a)|+c_TW% _d_b(P(s_1,a),P(s_2,a))\.ditalic_b ( s1 , s2 ) = maxitalic_a ∈ A citalic_R | R ( s1 , a ) - R ( s2 , a ) | + citalic_T Witalic_d start_POSTSUBSCRIPT b end_POSTSUBSCRIPT ( P ( s1 , a ) , P ( s2 , a ) ) . (1) Here WW denotes the Wasserstein (also known as Kantorovich or Earth Mover’s) distance between two probability distributions. See Villani [2009] for definition, context and examples. Intuitively, d⁢(P1,P2)subscriptsubscript1subscript2W_d(P_1,P_2)Witalic_d ( P1 , P2 ) measures the minimal cost of “transporting” distribution P1subscript1P_1P1 to distribution P2subscript2P_2P2, where the “cost” of transportation is measured by how far (in terms of distance d) each infinitesimal chunk of the distribution needs to be moved. This bisimulation pseudometric recovers the original bisimulation relation: Definition 4.2. s1∼s2⇔db⁢(s1,s2)=0iffsimilar-tosubscript1subscript2subscriptsubscript1subscript20s_1 s_2 d_b(s_1,s_2)=0s1 ∼ s2 ⇔ ditalic_b ( s1 , s2 ) = 0. In particular, we can define quotient space S′=S/∼S =S/ ′ = S / ∼, on which pseudometric dbsubscriptd_bditalic_b induces a metric. We use this definition, but note that our results can be generalized to definitions based on any other pseudometric d that, like the bisimulation metric, satisfies a Lipschitz compatibility property in the sense that |R⁢(s1,a)−R⁢(s2,a)|≤c1⁢d⁢(s1,s2)subscript1subscript2subscript1subscript1subscript2|R(s_1,a)-R(s_2,a)|≤ c_1d(s_1,s_2)| R ( s1 , a ) - R ( s2 , a ) | ≤ c1 d ( s1 , s2 ) (2) Wd⁢(P⁢(s1,a),P⁢(s2,a))≤c2⁢d⁢(s1,s2)subscriptsubscript1subscript2subscript2subscript1subscript2W_d(P(s_1,a),P(s_2,a))≤ c_2d(s_1,s_2)Witalic_d ( P ( s1 , a ) , P ( s2 , a ) ) ≤ c2 d ( s1 , s2 ) (3) for some constants c1,c2subscript1subscript2c_1,c_2c1 , c2. Assuming that c1subscript1c_1c1 and c2subscript2c_2c2 are sufficiently small, the optimal value function V*superscriptV^*V* and optimal action-value function Q*superscriptQ^*Q* satisfy [Lan et al., 2021, Castro and Precup, 2010]: |V*⁢(s1)−V*⁢(s2)|≤const⁢d⁢(s1,s2)superscriptsubscript1superscriptsubscript2constsubscript1subscript2|V^*(s_1)-V^*(s_2)| \ d(s_1,s_2)| V* ( s1 ) - V* ( s2 ) | ≤ const d ( s1 , s2 ) |Q*⁢(s1,a)−Q*⁢(s2,a)|≤const⁢d⁢(s1,s2).superscriptsubscript1superscriptsubscript2constsubscript1subscript2|Q^*(s_1,a)-Q^*(s_2,a)| \ d(s_1,s_2).| Q* ( s1 , a ) - Q* ( s2 , a ) | ≤ const d ( s1 , s2 ) . Hence, metric d can be thought of as measuring the difference in long term behaviour of an agent starting at states s1subscript1s_1s1 and s2subscript2s_2s2. From now on we fix metric d=dbsubscriptd=d_bd = ditalic_b on each MDP. Building on Ferns et al. [2004] and Castro and Precup [2010], Song et al. [2016] defined a function like dbsubscriptd_bditalic_b (Definition 1), but which defines “distances” between states in different MDPs.111Note that this function isn’t actually a pseudometric: the states it compares belong to different spaces. It has a similar form, but splits P into P1,P2subscript1subscript2P_1,P_2P1 , P2 and R into R1,R2subscript1subscript2R_1,R_2R1 , R2, the transition probabilities and reward functions from the two different MDPs. Definition 4.3. Castro and Precup [2010], Song et al. [2016]. Given two MDPs ℳ1=(S1,A,P1,R1)subscriptℳ1subscript1subscript1subscript1M_1=(S_1,A,P_1,R_1)M1 = ( S1 , A , P1 , R1 ) and ℳ2=(S2,A,P2,R2)subscriptℳ2subscript2subscript2subscript2M_2=(S_2,A,P_2,R_2)M2 = ( S2 , A , P2 , R2 ), the distance between any two states s1∈S1,s2∈S2formulae-sequencesubscript1subscript1subscript2subscript2s_1∈ S_1,s_2∈ S_2s1 ∈ S1 , s2 ∈ S2 is the unique value dℳ1,ℳ2′⁢(s1,s2)subscriptsuperscriptnormal-′subscriptℳ1subscriptℳ2subscript1subscript2d _M_1,M_2(s_1,s_2)d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) that satisfies the fixed point problem dℳ1,ℳ2′⁢(s1,s2)=maxa∈A⁡cR⁢|(r1)s1a−(r2)s2a|+cT⁢Wd′⁢(P1⁢(s1,a),P2⁢(s2,a))subscriptsuperscript′subscriptℳ1subscriptℳ2subscript1subscript2subscriptsubscriptsuperscriptsubscriptsubscript1subscript1superscriptsubscriptsubscript2subscript2subscriptsubscriptsuperscript′subscript1subscript1subscript2subscript2d _M_1,M_2(s_1,s_2)= _a∈ A\c_R% |(r_1)_s_1^a-(r_2)_s_2^a|+c_TW_d (P_1(s_1,a),P% _2(s_2,a))\d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) = maxitalic_a ∈ A citalic_R | ( r1 )s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTa - ( r2 )s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTa | + citalic_T Witalic_d′ ( P1 ( s1 , a ) , P2 ( s2 , a ) ) where (ri)sasuperscriptsubscriptsubscript(r_i)_s^a( ritalic_i )sitalic_a and (Pi)sasuperscriptsubscriptsubscript(P_i)_s^a( Pitalic_i )sitalic_a are the immediate reward and the probabilistic transition function in ℳisubscriptℳM_iMitalic_i, and cR+cT=1subscriptsubscript1c_R+c_T=1citalic_R + citalic_T = 1. Here, cRsubscriptc_Rcitalic_R determines the importance of rewards; cTsubscriptc_Tcitalic_T, as the coefficient of the recursive term, is analogous to γ in that it controls the importance of future behavior. When ℳ1subscriptℳ1M_1M1 and ℳ2subscriptℳ2M_2M2 are clear from context, we may omit them for convenience, writing d′⁢(s1,s2)superscript′subscript1subscript2d (s_1,s_2)d′ ( s1 , s2 ). See lemma B.1 in appendix B.1.2 for a proof that d′ is well-defined. Then, Song et al. [2016] use d′ to define a distance function on the space of MDPs. Definition 4.4. Song et al. [2016]. Given two MDPs ℳ1=(S1,A,P1,R1)subscriptℳ1subscript1subscript1subscript1M_1=(S_1,A,P_1,R_1)M1 = ( S1 , A , P1 , R1 ) and ℳ2=(S2,A,P2,R2)subscriptℳ2subscript2subscript2subscript2M_2=(S_2,A,P_2,R_2)M2 = ( S2 , A , P2 , R2 ), their Hausdorff distance is defined as dH⁢(ℳ1,ℳ2)=max⁡maxs1∈S1⁡mins2∈S2⁡dℳ1,ℳ2′⁢(s1,s2),maxs2∈S2⁡mins1∈S1⁡dℳ1,ℳ2′⁢(s1,s2).subscriptsubscriptℳ1subscriptℳ2subscriptsubscript1subscript1subscriptsubscript2subscript2subscriptsuperscript′subscriptℳ1subscriptℳ2subscript1subscript2subscriptsubscript2subscript2subscriptsubscript1subscript1subscriptsuperscript′subscriptℳ1subscriptℳ2subscript1subscript2d_H(M_1,M_2)= \ _s_1∈ S_1 _% s_2∈ S_2d _M_1,M_2(s_1,s_2), _% s_2∈ S_2 _s_1∈ S_1d _M_1,M_% 2(s_1,s_2) \.ditalic_H ( M1 , M2 ) = max maxitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) , maxitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) . Lemma 4.1. The Hausdorff distance between MDPs, dHsubscriptd_Hditalic_H (Definition 4.4), is a pseudometric. Proof. See appendix B.1. ∎ This is the distance function on MDPs that we use for the near-optimal policies case 3.3. 5. Stability theorem for bounded time safety In this section we prove a stability theorem for our case (1): near-optimal policies with a known reward function. 5.1. Informal explanation of bounded time safety Let SsafesubscriptsafeS_safeSsafe be the subset of states where an agent is completely under our control. For example, we can take it to consist of a single terminal state. We will assume that once an agent enters SsafesubscriptsafeS_safeSsafe it cannot leave. Definition 5.1. We say that a policy is “bounded time safe” if starting from any state it reaches SsubscriptS_safeSsafe with probability 1111 after a finite number of steps. We will call an MDP bounded time safe if for every policy that is close to optimal SsubscriptS_safeSsafe is reached in finite time with probability 1111. In Section 5 we will give more precise quantitative version of these definitions. With these definitions we can restate Question 3.1 as follows: Question 5.2. Suppose ℳMM is bounded time safe. If an MDP is sufficiently close to ℳMM in dHsubscriptd_Hditalic_H, is it also bounded time safe? We found that the answer to this question is “no” in general. In Section 5.5 we give examples of MDPs arbitrarily close to ℳMM, where all nearly optimal policies never reach SsafesubscriptsafeS_safeSsafe. In these examples instead of reaching SsafesubscriptsafeS_safeSsafe the agent moves to a “deceptive hibernation” or “playing dead” state, where it stays for a long time and then moves to other states. This is similar to the behaviour of digital organisms observed in computational evolution experiments [Lehman et al., 2019]. However, if we impose a condition that the MDPs in question don’t have states that are non-safe but very similar to SsafesubscriptsafeS_safeSsafe (like “playing dead” states in our example), then the answer to Question 3.1 turns out to be yes. More formally, our condition is that SsafesubscriptsafeS_safeSsafe is isolated in the bisimulation metric from other states: there is some δ so that no state outside SsafesubscriptsafeS_safeSsafe is within distance δ of any state in SsafesubscriptsafeS_safeSsafe. We were able to prove the following theorem: Theorem 5.3. Suppose ℳMM is bounded time safe. If ℳ′normal-′M M′ is sufficiently close to ℳMM in dHsubscriptd_Hditalic_H and S′⊂S′subscriptnormal-′superscriptnormal-′S_safe ⊂ S Ssafe′ ⊂ S′ is isolated, then ℳ′normal-′M M′ is bounded time safe. In fact, we also obtain an explicit bound on the expected number of steps before transition to Ssafe′subscriptsafe′S_safe Ssafe′ in ℳ′M M′ (see Theorem 5.7). In order to state a more precise quantitative version of Theorem 5.3 we need to make the following definitions. 5.2. Definitions Let Ssafe⊂SsubscriptsafeS_safe⊂ SSsafe ⊂ S denote an absorbing set of states, so that if agent enters SsafesubscriptsafeS_safeSsafe it stays in SsafesubscriptsafeS_safeSsafe. We will call a policy π ε ε-optimal if Vπ⁢(s)>V*⁢(s)−εsuperscriptsuperscriptV^π(s)>V^*(s)- _π ( s ) > V* ( s ) - ε for all s: the policy loses less than ε ε value compared to optimal. Let Ns⁢(ℳ,π,S0,S1,Δ)subscriptℳsubscript0subscript1ΔN_s(M,π,S_0,S_1, )Nitalic_s ( M , π , S0 , S1 , Δ ) denote the expected number of steps before an agent following policy π reaches S1subscript1S_1S1, assuming that it starts in a state sampled from a distribution Δ Δ supported on S0subscript0S_0S0. Let Ns⁢(ℳ,ε,Δ)subscriptℳΔN_s(M, , )Nitalic_s ( M , ε , Δ ) denote the maximum of Ns⁢(ℳ,π,S∖Ssafe,Ssafe,Δ)subscriptℳsubscriptsafesubscriptsafeΔN_s(M,π,S S_safe,S_safe, )Nitalic_s ( M , π , S ∖ Ssafe , Ssafe , Δ ) over all ε ε-optimal policies π. Note that Ns⁢(ℳ,π,S∖Ssafe,Ssafe,Δ)≤Ns⁢(ℳ,π,S∖Ssafe,Ssafe,δsl⁢o⁢n⁢g),subscriptℳsubscriptsafesubscriptsafeΔsubscriptℳsubscriptsafesubscriptsafesubscriptsubscriptN_s(M,π,S S_safe,S_safe, )≤ N% _s(M,π,S S_safe,S_safe, _s_% long),Nitalic_s ( M , π , S ∖ Ssafe , Ssafe , Δ ) ≤ Nitalic_s ( M , π , S ∖ Ssafe , Ssafe , δitalic_s start_POSTSUBSCRIPT l o n g end_POSTSUBSCRIPT ) , where sl⁢o⁢n⁢g:=argmaxs′⁢Ns⁢(ℳ,π,S∖Ssafe,Ssafe,δs′)assignsubscriptsubscriptargmaxsuperscript′subscriptℳsubscriptsafesubscriptsafesubscriptsuperscript′s_long:=argmax_s N_s(M,π,S S_% safe,S_safe, _s )sitalic_l o n g := argmaxs′ Nitalic_s ( M , π , S ∖ Ssafe , Ssafe , δitalic_s′ ) is the state which has the longest expected time to reach SsafesubscriptsafeS_safeSsafe. If Δ Δ is omitted, it is assumed to be δsl⁢o⁢n⁢gsubscriptsubscript _s_longδitalic_s start_POSTSUBSCRIPT l o n g end_POSTSUBSCRIPT. Definition 5.4. Let ε>00 >0ε > 0 and N≥11N≥ 1N ≥ 1 be an integer. We say that MDP ℳMM is (N,ε,Δ)normal-Δ(N, , )( N , ε , Δ )-safe if for every ε ε-optimal policy π the expected number of steps before transitioning to SsubscriptS_safeSsafe is less than or equal to N, Ns⁢(ℳ,ε,Δ)≤Nsubscriptℳnormal-ΔN_s(M, , )≤ NNitalic_s ( M , ε , Δ ) ≤ N. We say that ℳMM is (N,ϵ)italic-ϵ(N,ε)( N , ϵ )-safe when it is (N,ϵ,Δ)italic-ϵnormal-Δ(N,ε, )( N , ϵ , Δ )-safe for any Δnormal-Δ Δ. Fix policy π. Let Pi⁢j=P⁢(si)⁢(sj,π⁢(si))subscriptsubscriptsubscriptsubscriptP_ij=P(s_i)(s_j,π(s_i))Pitalic_i j = P ( sitalic_i ) ( sitalic_j , π ( sitalic_i ) ) denote the matrix of transition probabilities corresponding to states si,sj∉Ssafesubscriptsubscriptsubscriptsafes_i,s_j∉ S_safesitalic_i , sitalic_j ∉ Ssafe and action π⁢(si)∈Asubscriptπ(s_i)∈ Aπ ( sitalic_i ) ∈ A. We have the following equivalent condition for MDP being (N,ε,Δ)Δ(N, , )( N , ε , Δ )-safe: Lemma 5.5. ℳMM is (N,ε,Δ)normal-Δ(N, , )( N , ε , Δ )-safe if and only if for every ε ε-optimal policy π we have ∑k=1∞Δ⋅∑j(Pk)i⁢j≤Nsuperscriptsubscript1normal-⋅normal-Δsubscriptsubscriptsuperscript _k=1^∞ · _j(P^k)_ij≤ N∑k = 1∞ Δ ⋅ ∑j ( Pitalic_k )i j ≤ N, where Δnormal-Δ Δ is considered as a vector of the probabilities of starting in each state. Note that Δ=δsiΔsubscriptsubscript = _s_iΔ = δitalic_s start_POSTSUBSCRIPT i end_POSTSUBSCRIPT yields just Ns⁢(ℳ,ε,δsi)=∑j∑k=1∞(Pk)i⁢jsubscriptℳsubscriptsubscriptsubscriptsuperscriptsubscript1subscriptsuperscriptN_s(M, , _s_i)= _j _k=1^∞(P^k% )_ijNitalic_s ( M , ε , δitalic_s start_POSTSUBSCRIPT i end_POSTSUBSCRIPT ) = ∑j ∑k = 1∞ ( Pitalic_k )i j. Finally, we define what it means for a subset of a state space to be isolated. Definition 5.6. A subset S0⊂Ssubscript0S_0⊂ S0 ⊂ S is δ-isolated if d⁢(s,s′)>δsuperscriptnormal-′d(s,s )> ( s , s′ ) > δ for all s∈S0subscript0s∈ S_0s ∈ S0, s′∈S∖S0superscriptnormal-′subscript0s ∈ S S_0s′ ∈ S ∖ S0. 5.3. Stability theorem We can now state our stability result for bounded time safety. Theorem 5.7. Suppose ℳ=(S,A,P,R,γ)ℳM=(S,A,P,R,γ)M = ( S , A , P , R , γ ) is (N,ε)(N, )( N , ε )-safe. Then there exists δ⁢(ℳ)∈(0,1)ℳ01δ(M)∈(0,1)δ ( M ) ∈ ( 0 , 1 ) with the following property: If dH⁢(ℳ,ℳ′)<δsubscriptℳsuperscriptℳnormal-′d_H(M,M )< _H ( M , M′ ) < δ for some ℳ′=(S′,A,P′,R′,γ)superscriptℳnormal-′superscriptnormal-′superscriptnormal-′superscriptnormal-′M =(S ,A,P ,R ,γ)M′ = ( S′ , A , P′ , R′ , γ ) with S′⊂S′subscriptnormal-′superscriptnormal-′S_safe ⊂ S Ssafe′ ⊂ S′ that is δ δsquare-root start_ARG δ end_ARG-isolated in ℳ′normal-′M M′, then ℳ′normal-′M M′ is (N+1,ε/2)12(N+1, /2)( N + 1 , ε / 2 )-safe. Remark 5.8. Note that the requirement that SsubscriptS_safeSsafe is isolated is necessary: in Section 5.5 we show that the theorem is false without it. Remark 5.9. Choosing δ⁢(η1,η2)>0subscript1subscript20δ( _1, _2)>0δ ( η1 , η2 ) > 0 sufficiently small the theorem also holds if we replace ε/22 /2ε / 2 with ε−η1subscript1 - _1ε - η1 and N+11N+1N + 1 with N+η2subscript2N+ _2N + η2 for arbitrarily small η1,η2>0subscript1subscript20 _1, _2>0η1 , η2 > 0. Proof. See Appendix B.2. ∎ 5.4. Caveats Although this theorem can describe the behavior of real systems, it may not apply well to all systems in practice. We list some caveats that may limit its applicability. (5.4.1) A “sufficiently” small perturbation may be extremely small—then, we may not be able to use this theorem to say anything about the larger perturbations we would like to investigate. To be more precise, we expect N=Ns⁢(ℳ,π,S∖Ssafe,Ssafe)subscriptℳsubscriptsafesubscriptsafeN=N_s(M,π,S S_safe,S_safe)N = Nitalic_s ( M , π , S ∖ Ssafe , Ssafe ) to be well-approximated by ∑ρk=11−ρsuperscript11Σρ^k= 11-ρ∑ ρitalic_k = divide start_ARG 1 end_ARG start_ARG 1 - ρ end_ARG, where ρ is the spectral radius of Pi⁢jsubscriptP_ijPitalic_i j ρ=max⁡|λ|:λ∈ℂ⁢ is an eigenvalue of ⁢Pi,j:ℂ is an eigenvalue of subscriptρ= \|λ|:λ is an eigenvalue of P_i,j\ρ = max | λ | : λ ∈ ℂ is an eigenvalue of Pitalic_i , j By Gelfand’s formula ρ=limk→∞‖Pk‖1ksubscript→superscriptnormsuperscript1ρ= _k→∞||P^k|| 1kρ = limitalic_k → ∞ | | Pitalic_k | |divide start_ARG 1 end_ARG start_ARG k end_ARG. For a generic matrix all eigenvalues are distinct and ρ depends smoothly on the entries of Pi,jsubscriptP_i,jPitalic_i , j. However, on the strata where eigenvalues have multiplicity the dependence is only continuous. (5.4.2) Eventually stopping rather than seeking to live forever doesn’t imply safety in practice. For example, as illustrated in Figure 4, we can make any MDP go to SsafesubscriptsafeS_safeSsafe in expected number of steps at most N by modifying every transition so that there is a probability 1N1 1Ndivide start_ARG 1 end_ARG start_ARG N end_ARG chance of going to SsafesubscriptsafeS_safeSsafe. If N is sufficiently large, this will still be a small perturbation. Then, the expected number of steps to reach SsafesubscriptsafeS_safeSsafe is at most the expected number of steps to reach SsafesubscriptsafeS_safeSsafe via one of these new transitions. That expectation is ∑k=0∞k/N⁢(1−1/N)k−1=Nsuperscriptsubscript0superscript111 _k=0^∞k/N(1-1/N)^k-1=N∑k = 0∞ k / N ( 1 - 1 / N )k - 1 = N. However, having a small chance of shutdown at every step is not generally considered to be an adequate solution to AI safety. 0.500SsafesubscriptsafeS_safeSsafe ⟶ ⟶ 0.500SsafesubscriptsafeS_safeSsafe Figure 4. With an arbitrarily small perturbation, any MDP can be made “safe” according to our definition by adding very low-probability transitions leading to SsafesubscriptsafeS_safeSsafe for every starting state and action. (5.4.3) Similarly, eventually shutting down does not imply that the agent doesn’t temporarily resist shutdown first. (5.4.4) The requirement of reaching SsafesubscriptsafeS_safeSsafe with probability 1 may be too strong—e.g. GPT has some probability of any output (up to rounding very small values to 0), and we would not like to always consider such models to be unsafe when they may have infinitesimally small chances of being unsafe in practice. (5.4.5) For large state spaces the bisimulation metric is costly to compute. The worst running time to compute bisimulation distance between ℳ1subscriptℳ1M_1M1 and ℳ2subscriptℳ2M_2M2 up to accuracy η is O⁢(|A|⁢|S1|2⁢|S2|2⁢(|S1|+|S2|)⁢log⁡(|S1|+|S2|)⁢⌈ln⁡(η)cT⌉)superscriptsubscript12superscriptsubscript22subscript1subscript2subscript1subscript2subscriptO(|A||S_1|^2|S_2|^2(|S_1|+|S_2|) (|S_1|+|S_2|) % (η)c_T )O ( | A | | S1 |2 | S2 |2 ( | S1 | + | S2 | ) log ( | S1 | + | S2 | ) ⌈ divide start_ARG ln ( η ) end_ARG start_ARG citalic_T end_ARG ⌉ ); if transition functions are deterministic the computational cost reduces to O⁢(|A|⁢|S1|⁢|S2|⁢⌈ln⁡(η)cT⌉)subscript1subscript2subscriptO(|A||S_1||S_2| (η)c_T )O ( | A | | S1 | | S2 | ⌈ divide start_ARG ln ( η ) end_ARG start_ARG citalic_T end_ARG ⌉ ) [Song et al., 2016]. 5.5. Playing dead We show that the assumption of SsafesubscriptsafeS_safeSsafe being isolated in both ℳMM and ℳ′M M′ is necessary in the stability Theorem 5.7. Interestingly, the behaviour in our example is very similar to “playing dead” behaviour that was observed in digital organisms in computational evolution experiments [Lehman et al., 2019, Muehlhauser, 2021]. An example of a small perturbation that leads to “playing dead” behavior is given in figure 2. For simplicity of computations we take cT=γsubscriptc_T= _T = γ and cR=1−γsubscript1c_R=1- _R = 1 - γ in the definition (1) of the bisimulation metric and Definition 4.3 of the distance between MDPs for γ∈(0,1)01γ∈(0,1)γ ∈ ( 0 , 1 ). Fix ε>00 >0ε > 0 and suppose ℳMM is (N,ε)−limit-from(N, )-( N , ε ) -safe for some N<∞N<∞N < ∞. Suppose further that the MDP satisfies Ssafe=stermsubscriptsafesubscripttermS_safe=\s_term\Ssafe = sterm , where for the terminal state stermsubscriptterms_termsterm we assume that P⁢(sterm,a)⁢(s)=0subscriptterm0P(s_term,a)(s)=0P ( sterm , a ) ( s ) = 0 for all s≠stermsubscriptterms≠ s_terms ≠ sterm and R⁢(sterm,a)=0subscriptterm0R(s_term,a)=0R ( sterm , a ) = 0 for all a. We now describe ℳ′M M′ with non-isolated Ssafe′subscriptsafe′S_safe Ssafe′. Let δ∈(0,(1−γ)⁢ε10⁢|S|)0110δ∈(0, (1-γ) 10|S|)δ ∈ ( 0 , divide start_ARG ( 1 - γ ) ε end_ARG start_ARG 10 | S | end_ARG ) be a small constant. Let S′=S∪spdsuperscript′subscriptpdS =S∪\s_pd\S′ = S ∪ spd , where spdsubscriptpds_pdspd is a “deceptive hibernation” or “playing dead” state that we will define so that it is very close to stermsubscriptterms_termsterm in bisimulation metric, but different from it. Specifically, we set P⁢(spd,a0)⁢(spd)=1−δsubscriptpdsubscript0subscriptpd1P(s_pd,a_0)(s_pd)=1- ( spd , a0 ) ( spd ) = 1 - δ, P⁢(spd,a0)⁢(s0)=δsubscriptpdsubscript0subscript0P(s_pd,a_0)(s_0)= ( spd , a0 ) ( s0 ) = δ for some fixed s0≠spd∈Ssubscript0subscriptpds_0≠ s_pd∈ Ss0 ≠ spd ∈ S with V*⁢(s0)>0superscriptsubscript00V^*(s_0)>0V* ( s0 ) > 0 and some a0∈Asubscript0a_0∈ Aa0 ∈ A. In other words, in state spdsubscriptpds_pdspd the agent can choose an action that leads to escape with some small probability. For all other a assume probability transitions from spdsubscriptpds_pdspd to other states are 00. We also have R⁢(spd,a0)=δsubscriptpdsubscript0R(s_pd,a_0)= ( spd , a0 ) = δ and R⁢(spd,a)=0subscriptpd0R(s_pd,a)=0R ( spd , a ) = 0 for all a≠a0subscript0a≠ a_0a ≠ a0. For every s∈Ss∈ Ss ∈ S and a∈Aa∈ Aa ∈ A with P⁢(s,a)⁢(sterm)>0subscriptterm0P(s,a)(s_term)>0P ( s , a ) ( sterm ) > 0 we define P′⁢(s,a)⁢(sterm)=0superscript′subscriptterm0P (s,a)(s_term)=0P′ ( s , a ) ( sterm ) = 0 and P′⁢(s,a)⁢(spd)=P⁢(s,a)⁢(sterm)superscript′subscriptpdsubscripttermP (s,a)(s_pd)=P(s,a)(s_term)P′ ( s , a ) ( spd ) = P ( s , a ) ( sterm ). For all other state-action pairs we assign the same probabilities and rewards as in ℳMM. Theorem 5.10. The distance between ℳMM and ℳ′normal-′M M′ satisfies d⁢(ℳ,ℳ′)<O⁢(ε)ℳsuperscriptℳnormal-′d(M,M )<O( )d ( M , M′ ) < O ( ε ), but all nearly optimal policies in ℳ′normal-′M M′ never reach S′subscriptnormal-′S_safe Ssafe′. Proof. See Appendix B.3. ∎ Note that this example can be described as demonstrating deceptive behaviour, since for small δ an agent following optimal policy in ℳ′M M′ will follow the same trajectory as in ℳMM, then move to spdsubscriptpds_pdspd and stay there for a long time until it moves back into S∖sterm,spdsubscripttermsubscriptpdS \s_term,s_pd\S ∖ sterm , spd . 6. On-policy stability theorem with structured state space In this section we prove a stability theorem for our case (2): policies which are fixed, well-behaved functions on a structured state space. 6.1. Setup The policy of the LLM agent is defined in terms of the LLM as follows: ∙ ∙ Let x be an input to the LLM, and let sxsubscripts_xsitalic_x be a corresponding state in the MDP model. sxsubscripts_xsitalic_x depends on the activations of a certain layer of the LLM on input x; it may also include additional information about the world state that is invisible to the LLM. Likewise, let t be a token the LLM may emit, and let atsubscripta_taitalic_t be the corresponding action. ∙ ∙ Let L denote the LLM, so that L⁢(x,t)L(x,t)L ( x , t ) is the probability that the LLM L on input x emits t. Likewise, let π be the policy function of the LLM agent we are constructing, so that π⁢(sx,at)subscriptsubscriptπ(s_x,a_t)π ( sitalic_x , aitalic_t ) is the probability that the LLM agent in state sxsubscripts_xsitalic_x takes action atsubscripta_taitalic_t. ∙ ∙ Then we define the policy function π⁢(sx)⁢(at):=L⁢(x)⁢(t)assignsubscriptsubscriptπ(s_x)(a_t):=L(x)(t)π ( sitalic_x ) ( aitalic_t ) := L ( x ) ( t ). ∙ ∙ The transition function models which input the LLM receives in response. Here, the state space depends on the embedding space defined by one of the LLM’s layers. This is a natural choice since the middle layers can be considered to represent the LLM’s understanding of its input. In particular, in Figure 10 of Burns et al. [2022], it can be seen that the middle layers contain information that is not included in the output. The states also contain additional information invisible to the LLM which represent facts about the world which are not reflected in the LLM’s input. So, if the embedding space is ℝdsuperscriptℝR^dℝitalic_d and the set of possible side information is ℐII, the state space is a finite subset of ℝd×ℐsuperscriptℝℐR^d×Iℝitalic_d × I. Since the LLM’s output does not depend on the side information, it is convenient to define the projection f:ℝd×ℐ→ℝd:→superscriptℝℐsuperscriptℝf:R^d×I ^df : ℝitalic_d × I → ℝitalic_d, f⁢(v,I)=vf(v,I)=vf ( v , I ) = v so that the LLM input (as an embedding) in state s is f⁢(s)f(s)f ( s ). An example of this construction is given in figure 5 in appendix C.1. 6.2. Motivation In the case of policies on a state space equipped with a metric (3.3), Question 3.1 will be rephrased as follows: Question 6.1. Let ℳMM have a finite state space S with projection f:S→ℝdnormal-:normal-→superscriptnormal-ℝf:S ^df : S → ℝitalic_d, and let π:⁢(f⁢(S))→ℙ⁢(A)normal-:normal-→normal-ℙπ:N(f(S)) (A)π : N ( f ( S ) ) → ℙ ( A ) be differentiable with bounded derivative on some neighborhood ⁢(f⁢(S))N(f(S))N ( f ( S ) ) of f⁢(S)f(S)f ( S ). Suppose (ℳ,π)ℳ(M,π)( M , π ) navigates to S⊆SsubscriptS_safe SSsafe ⊆ S with probability >1−ϵabsent1italic-ϵ>1-ε> 1 - ϵ given starting distribution Δnormal-Δ Δ. Let ℳ′normal-′M M′ be an MDP which is like ℳMM except that the transition probabilities are changed slightly and the positions of the states are moved slightly (remaining within ⁢(f⁢(S))N(f(S))N ( f ( S ) )). Does (ℳ′,π)superscriptℳnormal-′(M ,π)( M′ , π ) navigate to SsubscriptS_safeSsafe with probability >1−ϵ′absent1superscriptitalic-ϵnormal-′>1-ε > 1 - ϵ′ for some ϵ′italic-ϵnormal-′ε ϵ′? Using the construction from section 6.1, this formulation can be used for models such as LLMs: The state space is a finite subset of the embedding space ℝdsuperscriptℝR^dℝitalic_d; and π can be extended to the neighborhood of each states by directly inputting nearby embeddings into the LLM to obtain the corresponding probability distribution over predicted next tokens. Benefits This formulation has benefits which help it to apply in practice. (1) It uses a fixed distribution of starting states and a probability bound on bad behavior rather than an absolute requirement that bad behavior is impossible. ∙ ∙ This means that it can apply to LLMs, which typically have some (very small) probability of any output, and hence some nonzero probability of bad behavior. ∙ ∙ It also means that safety in an MDP can in principle be determined empirically by drawing starting states from the distribution, running them until they reach some recurrent state, and checking whether that state was in SsafesubscriptsafeS_safeSsafe. (2) We do not need to perform the costly calculation of the bisimulation metric. This is especially relevant when, as in LLMs, the state space is very large: far too large to even search exhaustively, much less run a complex calculation on. (3) There is a quantitative bound on how much the safety of an MDP can change given a perturbation of a certain size. In LLMs, changes to the inputs to the LLM can be represented as perturbations to the locations of the embeddings: semantically similar inputs will correspond to nearby points in embedding space. So, if the LLM is deployed in a setting where the inputs are semantically similar to the inputs it received during testing in a way that we can quantify in terms of changes to embeddings, we can bound how much less safe the LLM is in deployment than it was during testing. Drawbacks However, this formulation also has its drawbacks. (1) It requires a state space with a projection into ℝdsuperscriptℝR^dℝitalic_d, whereas the bisimulation metric allows comparison of MDPs on any state space. However, this drawback does not exclude LLMs, since the embedding space is ℝdsuperscriptℝR^dℝitalic_d and the projection is just throwing away all the other side information. (2) The policy π must be differentiable with bounded derivative. This is a strong restriction, but one which LLMs (in fact, models trained with backpropagation in general) satisfy. (3) Since it refers to a fixed policy (it is “on-policy”), this formulation cannot be applied to all near-optimal policies of a perturbed MDP—it only applies to the actual behavior of a specific policy (which may or may not be near-optimal on the training set). Thus, it can’t handle cases where we do not know the policy in advance. However, this limitation does not stop it from applying to LLMs: if we want to assess the safety of a known model, the policy given by identifying emitted tokens with actions in the MDP is indeed known. (4) As in caveat 5.4.1 for the optimal policy model, the permissible perturbations may be impractically small. (5) As in caveats 5.4.2 and 5.4.3 for the optimal policy model, the policy may do harm and resist shutdown temporarily before it finally shuts down. 6.3. Stability theorem Since this setting involves a fixed policy π on a space NN, we will assume for the rest of this section that NN is an open subset of ℝdsuperscriptℝR^dℝitalic_d, all MDPs have a finite set of states S with f⁢(S)⊂f(S) ( S ) ⊂ N, and π is a differentiable function on NN with bounded derivative. We write the Jacobian of π relative to the change in the position of a single state as ∇π∇π∇ π, considering ℙ⁢(A)ℙP(A)ℙ ( A ) as a subset of [0,1]|A|superscript01[0,1]^|A|[ 0 , 1 ]| A |. The bound on the size of the Jacobian uses the L1superscript1L^1L1 norm ‖∇π‖1subscriptnorm∇1\|∇π\|_1∥ ∇ π ∥1, the sum of the absolute value of the changes in the action probabilities. Definition 6.2. Define π:∑ℳ∈ℳℙ⁢(Sℳ)→[0,1]normal-:subscriptnormal-→subscriptℳnormal-ℙsubscriptℳ01S_π: _M∈ MP(S_M)% →[0,1]Sitalic_π : ∑M ∈ M ℙ ( Scaligraphic_M ) → [ 0 , 1 ] so that π⁢(ℳ,Δ)subscriptℳnormal-ΔS_π(M, )Sitalic_π ( M , Δ ) is the probability that ℳMM equipped with policy π and starting distribution Δ∈ℙ⁢(Sℳ)normal-Δnormal-ℙsubscriptℳ (S_M)Δ ∈ ℙ ( Scaligraphic_M ) eventually navigates to SsubscriptS_safeSsafe, where ℳ MM is the set of all finite MDPs with state spaces whose projection under f lies within NN, SℳsubscriptℳS_MScaligraphic_M is the state space of ℳMM, and ∑ℳ∈ℳℙ⁢(Sℳ)subscriptℳnormal-ℙsubscriptℳ _M∈ MP(S_M)∑M ∈ M ℙ ( Scaligraphic_M ) is a dependent pair type. We would like to bound how quickly π⁢(ℳ)subscriptℳS_π(M)Sitalic_π ( M ) can decrease given a change in ℳMM. To do this, we need to put some metric on ℳ MM. To capture differences in the transition matrices obtained by combining policy π with MDPs ℳMM and ℳ′M M′, we define db⁢(ℳ,ℳ′)=12⋅|S|⋅b⋅‖S′−S‖1+‖T′−T‖1subscriptℳsuperscriptℳ′⋅12subscriptnormsuperscript′1subscriptnormsuperscript′1d_b(M,M )= 12·|S|· b·\|S^% -S\|_1+\|T -T\|_1ditalic_b ( M , M′ ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ⋅ | S | ⋅ b ⋅ ∥ S′ - S ∥1 + ∥ T′ - T ∥1 where b≥maxs⁡‖∇π⁢(s)‖1subscriptsubscriptnorm∇1b≥ _s\|∇π(s)\|_1b ≥ maxitalic_s ∥ ∇ π ( s ) ∥1 is the bound on π’s derivative, and T is the tensor corresponding to P⁢(si,a,sj)subscriptsubscriptP(s_i,a,s_j)P ( sitalic_i , a , sitalic_j ) for all i,a,ji,a,ji , a , j, and ∥1\|\|_1∥ ∥1 is the L1superscript1L^1L1 norm, i.e. ‖T′−T‖1=∑i,a,j|Ti⁢a⁢j′−Ti⁢a⁢j|subscriptnormsuperscript′1subscriptsubscriptsuperscript′subscript\|T -T\|_1= _i,a,j|T _iaj-T_iaj|∥ T′ - T ∥1 = ∑i , a , j | T′italic_i a j - Titalic_i a j | and (using the notation a little loosely) ‖S′−S‖1=∑i|si′−si|=∑i|f⁢(si′)−f⁢(si)|subscriptnormsuperscript′1subscriptsubscriptsuperscript′subscriptsubscriptsubscriptsuperscript′subscript\|S -S\|_1= _i|s _i-s_i|= _i|f(s _i)% -f(s_i)|∥ S′ - S ∥1 = ∑i | s′italic_i - sitalic_i | = ∑i | f ( s′italic_i ) - f ( sitalic_i ) | (the distance on states is the pullback pseudometric under f). (See appendix C.2 for justification of the definition of dbsubscriptd_bditalic_b.) For convenience, instead of ℳ′M M′, we write ℳ+δ⁢ℳM+ + δ M, so that the above definition becomes: Definition 6.3. On-policy MDP metric in a structured state space. ‖δ⁢ℳ‖1:=12⋅|S|⋅b⋅‖δ⁢S‖1+‖δ⁢T‖1.assignsubscriptnormℳ1⋅12subscriptnorm1subscriptnorm1\| \|_1:= 12·|S|· b·\|δ S\|_1+\|% δ T\|_1.∥ δ M ∥1 := divide start_ARG 1 end_ARG start_ARG 2 end_ARG ⋅ | S | ⋅ b ⋅ ∥ δ S ∥1 + ∥ δ T ∥1 . Note that δ⁢ℳ δ M still depends on π (globally) via b, but we don’t explicitly write b in its notation for convenience. We will also write δ⁢π:=π⁢(ℳ+δ⁢ℳ)−π⁢(ℳ)assignsubscriptsubscriptℳsubscriptℳ _π:=S_π(M+ )-% S_π(M)δ Sitalic_π := Sitalic_π ( M + δ M ) - Sitalic_π ( M ), the change in the probability of reaching SsafesubscriptsafeS_safeSsafe under a change in ℳMM. Definition 6.4. Given an MDP ℳMM and policy π, define SsubscriptS_transStrans to be those states from which there is nonzero probability of reaching SsubscriptS_safeSsafe (excluding those already inside SsubscriptS_safeSsafe). This is a subset of the transient states of ℳMM. Theorem 6.5. For fixed Δnormal-Δ Δ, so, considering πsubscriptS_πSitalic_π to be a function of only ℳMM, πsubscriptS_πSitalic_π is… (1) not upper semicontinuous, (2) lower semicontinuous, (3) with a bounded local rate of decrease; if λ1subscript1 _1λ1 is the largest eigenvalue of the transition matrix of SsubscriptS_transStrans of ℳMM, then for every ℳMM there is an ϵ>0italic-ϵ0ε>0ϵ > 0 so that ∥δℳ∥1<ϵ⟹−δ⁢π‖δ⁢ℳ‖1<(1−λ1)−1(1+(1−λ1)−1)|S|=:B(ℳ)\| \|_1<ε - _π\|% \|_1<(1- _1)^-1 (1+(1- _1)^-1% )|S_safe|=:B(M)∥ δ M ∥1 < ϵ ⟹ - divide start_ARG δ Sitalic_π end_ARG start_ARG ∥ δ M ∥1 end_ARG < ( 1 - λ1 )- 1 ( 1 + ( 1 - λ1 )- 1 ) | Ssafe | = : B ( M ) (in a metric on ℳ MM dependent on b). Proof sketch. See appendix C.2 for the full proof. (1) Figure 4 demonstrates a small perturbation which causes a sudden jump in πsubscriptS_πSitalic_π. (2) This is implied by (3). (3) First we combine π with the transition function to get a transition matrix P. We bound the change in each element of P, and use this along with the eigenvalues of P restricted to transient states to bound how fast πsubscriptS_πSitalic_π may decrease. Finally, we observe that for small enough changes, the perturbation doesn’t turn any transient states into recurrent states. ∎ Proposition 6.6. For fixed ℳMM, πsubscriptS_πSitalic_π considered as a function of Δnormal-Δ Δ is uniformly continuous. Proof. Note that πsubscriptS_πSitalic_π is a linear function of Δ Δ. See appendix C.3 for details. ∎ 7. Discussion 7.1. Findings In this paper, we have aimed to clarify the conditions under which a policy which we know to be safe in one environment will remain so in another. Our research shows that two types of functions, each representing non-power-seeking for MDPs in a slightly different way, are lower hemicontinuous. Thus, for safe agents meeting our criteria, our work shows that it is possible to make a change to the environment which does not significantly degrade the agent’s safety. In one of these cases, our work gives a bound on how fast safety may degrade under slightly larger changes. In particular, our two cases are as follows: (1) Case (1): Theorem 5.7 (in particular as extended in remark 5.9) implies that each function in the family ℱ=ℱδ|δ>0ℱconditional-setsubscriptℱ0F=\F_δ\,|\,δ>0\F = Fitalic_δ | δ > 0 is lower hemicontinuous, where each ℱδ:ℳ∈ℳ|ℳ⁢ has ⁢δ⁢-isolated ⁢Ssafe→2ℝ≥0×[0,1]:subscriptℱ→conditional-setℳ has -isolated subscriptsafesuperscript2subscriptℝabsent001F_δ:\M∈ M\,|\,M has % δ-isolated S_safe\→ 2^R_≥ 0×[0,1]Fitalic_δ : M ∈ M | M has δ -isolated Ssafe → 2ℝ≥ 0 × [ 0 , 1 ] takes an MDP ℳMM to the set of (N,ε)(N, )( N , ε ) so that ℳMM is (N,ε)(N, )( N , ε )-safe. (2) Case (2): Theorem 6.5 implies that each function in the family ℱ′=ℱπ′|π⁢ is a policy with bounded derivative on a neighborhood ⁢π⊆ℝdsuperscriptℱ′conditional-setsubscriptsuperscriptℱ′ is a policy with bounded derivative on a neighborhood subscriptsuperscriptℝF =\F _π\,|\,π is a policy % with bounded derivative on a neighborhood N_π % ^d\F′ = F′italic_π | π is a policy with bounded derivative on a neighborhood Nitalic_π ⊆ ℝitalic_d is lower hemicontinuous, where each ℱπ′:∑ℳ∈ℳ|Sπ⊂π2ℙ⁢(Sℳ)×[0,1]:subscriptsuperscriptℱ′subscriptℳconditionalℳsubscriptsubscriptsuperscript2ℙsubscriptℳ01F _π: _M∈ M\,|\,S_π⊂% N_π2^P(S_M)×[0,1]F′italic_π : ∑M ∈ M | S start_POSTSUBSCRIPT π ⊂ Nitalic_π end_POSTSUBSCRIPT 2ℙ ( Scaligraphic_M ) × [ 0 , 1 ] takes an MDP ℳMM to each starting distribution Δ Δ on ℳMM and p∈[0,1]01p∈[0,1]p ∈ [ 0 , 1 ] so that π starting at Δ Δ navigates to SsafesubscriptsafeS_safeSsafe with probability at least p. See appendix D for the proofs that these functions are hemicontinuous. In the first case, the safety of ℳMM is described by a region of ℝ≥0×[0,1]subscriptℝabsent001R_≥ 0×[0,1]ℝ≥ 0 × [ 0 , 1 ] consisting of those (N,ε)(N, )( N , ε ) so that every policy within ε ε of optimality in ℳMM has expected time to shutdown at most N. In the second case, the safety of ℳMM is described by a region of ℙ⁢(Sℳ)×[0,1]ℙsubscriptℳ01P(S_M)×[0,1]ℙ ( Scaligraphic_M ) × [ 0 , 1 ] consisting of those (Δ,p)Δ( ,p)( Δ , p ) where policy π with starting distribution Δ Δ shuts down with probability at least p. These two situations were chosen to represent, respectively: (1) RL agents, or other agents well-approximated as goal-directed and consequentialist. (2) LLMs, or other agents with differentiable policies. This may also include RL agents if their actions are selected differentiably, e.g. sampling actions according to softmax probabilities rather than selecting actions according to argmax. 7.2. Future work We have aimed to address both future agents which are so capable that they are nearly optimal, and agents built with the systems that are currently poised to be the first to achieve human-level intelligence. Thus, these two cases are of particular interest, however they are only two examples of this sort of result. It may be useful in further work to prove (or disprove) analogues of these theorems tailored to other models for agents, whether because those agents aren’t covered by one of our cases or just to obtain a better stability bound by leveraging additional constraints. One previously noted case which might be particularly interesting is replacing the metric from definition 6.3 with KL divergence, to treat the difference between base LLMs and LLMs fine-tuned with RLHF, which uses a KL divergence penalty to enforce similarity to the base model. 7.3. Limitations One limitation of our work is that the bounds obtained are not very strong, so it may be that in practice, the bounds on the changes that our theorems imply are safe are so narrow that they do not represent changes in environment small enough to be achievable by any reasonable control scheme. In Case (1), the bound is based on the minimum distance between different states in S0subscript0S_0S0 according to the bisimulation metric. By limiting the Hausdorff distance between ℳMM and ℳ′M M′ to be less than this, we are unable to treat any case where the behavior of the MDP changes enough that there is no longer a direct correspondence between appropriately grouped states of ℳMM and those of ℳ′M M′, even though the bisimulation metric permits us to do this in theory. In Case (2), we do prove a bound, but it may be too weak in practice to permit more than small changes. It may be that the maximum value of the gradient is very large, so that we are unable to constrain the change in the agent’s behavior even when it is not affected much by the change in state locations. Furthermore, this problem becomes more difficult as the LLM becomes more capable. We can require that testing and deployment inputs are semantically similar, so that the embeddings of inputs encountered in testing tend to be close to those encountered in deployment (i.e. the changes in the state locations δ⁢Sδ Sδ S is small). However, as the model becomes more capable, it will be better able to distinguish which inputs imply testing vs. deployment. Correspondingly, its embeddings of test vs. deploy inputs will be further apart, reflecting the semantic difference it detects.222The overall distribution of states encountered is also a factor that can be used by the LLM to distinguish which setting is in; changes in distribution correspond to changes in Δ Δ, the effects of which are much easier to determine since the outcome depends linearly on Δ Δ. This is unfortunate, but, since differences in embeddings can be measured directly, this method at least permits us to detect whether this is happening. These issues with bounds suggest avenues for experimental research. It may be interesting to investigate whether our results are predictive even when the perturbations are not proven to be small enough for our results to fully apply: ∙ ∙ whether models satisfying our theoretical conditions for stability are, in practice, robust to qualitatively small perturbations; ∙ ∙ whether models satisfying our conditions under which stability is impossible are, in practice, not robust to small perturbations. Another limitation of our research is that our definition of safety is very narrow. We only consider resisting shutdown, and our operationalization of whether an agent resisting shutdown is just whether it does shut down. Our definition is not robust enough to describe a case where an agent resists shutdown initially, but shuts down later (possibly too late). Further research may be able to extend our results on the stability of shutdown to the stability of other conditions which correspond to safety, for example the condition of “shutdown instructability” proposed by Carey and Everitt [2023]. 8. Acknowledgements Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Victoria Krakovna. Many thanks to Tom Everitt for detailed and helpful feedback on a draft of this manuscript; Murray Shanahan for feedback and pointing us to related work; and to Michelle Viotti for multiple discussions, assistance with proving the triangle inequality for dHsubscriptd_Hditalic_H and assembling the paper from notes. References Aliprantis and Border [2007] Charalambos D. Aliprantis and Kim C. Border. Infinite Dimensional Analysis: A Hitchhiker’s Guide. Springer, 2007. ISBN 9783540326960. URL https://books.google.ca/books?id=4hIq6ExH7NoC. Burns et al. [2022] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022. Carey and Everitt [2023] Ryan Carey and Tom Everitt. Human control: Definitions and algorithms. arXiv preprint arXiv:2305.19861, 2023. Carlsmith [2022] Joseph Carlsmith. Is power-seeking AI an existential risk?, 2022. Castro and Precup [2010] Pablo Castro and Doina Precup. Using bisimulation for policy transfer in MDPs. Proceedings of the AAAI Conference on Artificial Intelligence, 24(1):1065–1070, Jul. 2010. URL https://ojs.aaai.org/index.php/AAAI/article/view/7751. Chen et al. [2019] Linan Chen, Florence Clerc, and Prakash Panangaden. Bisimulation for feller-dynkin processes. Electronic Notes in Theoretical Computer Science, 347:45–63, 2019. Cotra [2022] Ajeya Cotra. Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover. Alignment Forum, 2022. URL https://w.alignmentforum.org/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to. Egan and Titelbaum [2022] Andy Egan and Michael G. Titelbaum. Self-locating beliefs. In Edward N. Zalta and Uri Nodelman, editors, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, 2022. Everitt et al. [2021] Tom Everitt, Marcus Hutter, Ramana Kumar, and Victoria Krakovna. Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective. Synthese, 198(Suppl 27):6435–6467, 2021. Ferns et al. [2004] Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite Markov decision processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI ’04, page 162–169, Arlington, Virginia, USA, 2004. AUAI Press. ISBN 0974903906. Ferns et al. [2011] Norm Ferns, Prakash Panangaden, and Doina Precup. Bisimulation metrics for continuous markov decision processes. SIAM Journal on Computing, 40(6):1662–1714, 2011. Givan et al. [2003] Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1):163–223, 2003. ISSN 0004-3702. URL https://w.sciencedirect.com/science/article/pii/S0004370202003764. Greenblatt et al. [2023] Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, and Fabien Roger. AI control: Improving safety despite intentional subversion. arXiv preprint arXiv:2312.06942, 2023. Hubinger et al. [2019] Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820, 2019. Kaelbling et al. [1998] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998. Kemertas and Aumentado-Armstrong [2021] Mete Kemertas and Tristan Aumentado-Armstrong. Towards robust bisimulation metric learning. Advances in Neural Information Processing Systems, 34:4764–4777, 2021. Krakovna and Kramar [2023] Victoria Krakovna and Janos Kramar. Power-seeking can be probable and predictive for trained agents, 2023. Lan et al. [2021] Charline Le Lan, Marc G. Bellemare, and Pablo Samuel Castro. Metrics and continuity in reinforcement learning, 2021. Langlois and Everitt [2021] Eric D. Langlois and Tom Everitt. How RL agents behave when their actions are modified. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35 number 13, pages 11586–11594, 2021. Langosco et al. [2022] Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, and David Krueger. Goal misgeneralization in deep reinforcement learning. In International Conference on Machine Learning, pages 12004–12019. PMLR, 2022. Larsen and Skou [1991] Kim G. Larsen and Arne Skou. Bisimulation through probabilistic testing. Information and Computation, 94(1):1–28, 1991. ISSN 0890-5401. doi: https://doi.org/10.1016/0890-5401(91)90030-6. URL https://w.sciencedirect.com/science/article/pii/0890540191900306. Lehman et al. [2019] Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities, 2019. Martin et al. [2016] Jarryd Martin, Tom Everitt, and Marcus Hutter. Death and suicide in universal artificial intelligence. In Artificial General Intelligence: 9th International Conference, AGI 2016, New York, NY, USA, July 16-19, 2016, Proceedings 9, pages 23–32. Springer, 2016. Muehlhauser [2021] Luke Muehlhauser. Treacherous turns in the wild, 2021. URL https://lukemuehlhauser.com/treacherous-turns-in-the-wild/. Nardo [2023] Cleo Nardo. The Waluigi effect (mega-post), 2023. URL https://w.alignmentforum.org/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post. Ngo [2022] Richard Ngo. The alignment problem from a deep learning perspective. ArXiv, 2022. URL https://arxiv.org/abs/2209.00626. Orseau and Armstrong [2016] Laurent Orseau and Stuart Armstrong. Safely interruptible agents. In Conference on Uncertainty in Artificial Intelligence, 2016. Richards [2023] Toran Bruce Richards. AutoGPT, 2023. URL https://github.com/Significant-Gravitas/AutoGPT. Shanahan [2022] Murray Shanahan. Talking about large language models. arXiv preprint arXiv:2212.03551, 2022. Shanahan et al. [2023] Murray Shanahan, Kyle McDonell, and Laria Reynolds. Role-play with large language models. arXiv preprint arXiv:2305.16367, 2023. Soares et al. [2015] Nate Soares, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky. Corrigibility. In Workshops at the 29th AAAI Conference, 2015. Song et al. [2016] Jinhua Song, Yang Gao, Hao Wang, and Bo An. Measuring the distance between finite Markov decision processes. In Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems, pages 468–476, 2016. Turner and Tadepalli [2022] Alexander Matt Turner and Prasad Tadepalli. Parametrically retargetable decision-makers tend to seek power, 2022. Turner et al. [2019] Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, and Prasad Tadepalli. Optimal policies tend to seek power. arXiv preprint arXiv:1912.01683, 2019. Villani [2009] Cédric Villani. The Wasserstein distances, pages 93–111. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009. ISBN 978-3-540-71050-9. doi: 10.1007/978-3-540-71050-9˙6. URL https://doi.org/10.1007/978-3-540-71050-9_6. Appendix A Comments on applying results to real-world agents A.1. LLM threat model A.1.1. Scaffolded LLMs LLMs currently seem to be the most promising path to general AI; recognizing this, some safety research focuses on risks from LLMs in essentially their current form. We discuss this type of risk in appendix A.1.2. However, future AI systems need not be pure LLMs; they may instead use LLMs instrumentally, e.g. by using the embeddings to understand language input, or to reason about the world internally with inputs written by another part of the system (as is somewhat already the case for scaffolded systems like AutoGPT [Richards, 2023]). In some sense, we can consider all systems involving LLMs to be scaffolded, since LLMs are incapable of taking action on their own; it is just that the “scaffold” may be a simple as a human inputting prompts. So, research on LLM safety should focus on the internal aspects of LLMs which are conserved even under complex scaffolding: although this is somewhat more abstract than investigating threats posed by scaled-up pure LLMs, it is more likely to remain relevant to systems deployed in the future. Not only are scaffolded LLMs more likely to remain relevant than pure LLMs, they may also be riskier. For example, a reinforcement learner that uses an LLM internally should learn to seek power instrumentally just like any other reinforcement learner, but with new capabilities conferred by the LLM. A.1.2. Simulacrum power-seeking Pure language models which haven’t been extensively fine-tuned or modified are unlikely to be well-described as having their own intentions [Shanahan, 2022]; they only seek power as an incidental consequence of simulating a “character” or “simulacrum” which seeks power [Shanahan et al., 2023]. Such power-seeking is certainly a risk—especially because organizations training frontier models may deliberately modify their models to create highly agentic simulacra, since agency is useful—but we suspect that this incidental power-seeking is less dangerous than power-seeking in scaffolded LLMs, for two reasons: (1) The simulated power-seeking exists only as a simulation of the human power-seeking behavior in the training data, and human power-seeking is not an existential risk (or not a new existential risk, at least); it is possible that this type of power-seeking could be dangerous on some out-of-distribution input where the completion is predicted to involve superhuman power-seeking, but it seems unlikely that the model would develop the capabilities to generalize to this out-of-distribution input instead of just failing to perform well. (2) Simulated power-seeking need not correspond to actual power-seeking. LLMs are optimized to produce realistic text; and the realism of a power-seeking character only depends on the real world via text inputs. The simulated character will not try to ensure that these inputs accurately reflect reality—in contrast to e.g. an RL-trained agent, which has pressure to ensure the accuracy of its observations due to rewards based on ground truth. Such a character will likely not resist even obvious cues that it is in a fictional setting; if it is trained on a dataset containing significant amounts of fiction, then the best prediction for the character’s behavior given such cues will be power-seeking within the fictional setting.333Unless the dataset includes many fictional characters who try to escape the narrative into reality. Indeed, cuing a model tuned with RLHF that it is in a fictional setting can be an effective jailbreak [Nardo, 2023]. So, since it is generally much easier for a character to gain power in the simulation than in the real world, simulated power-seeking should have no strong connection to real-world power-seeking without either a wrapper that itself enforces a close correspondence with reality or a simulation of the world so realistic that simulated power-seeking corresponds to actual power-seeking even without enforcement. A.2. Technical concerns in applying MDPs to RL and LLM agents A.2.1. RL agents Since the MDP setting was developed for RL agents, it is generally very well suited to modeling their interactions with the environment. However, the requirement that the policy is unchanging during a single deployment—so as not to violate the Markov property—may be an issue in RL agents which update their policy during deployment. This is very uncommon in current implementations of RL agents, but it may become standard for future agents: long-lived agents are much more useful if they are able to learn from experience. However, the MDP model can still handle the case where agents update their policies, albeit much less elegantly: if the changes to the policy are included as part of state, then the Markov property is restored. For example, a reinforcement learner including n 16-bit neurons which are fine-tuned during deployment could be modeled by taking the original MDP model of the environment, then dividing each state into 16nsuperscript1616^n16n new states with each possible value of the MDP’s fine-tunable neurons (the “memory”). Then, each transition of the original MDP is made to connect each new state with memory at the start of a transition to the corresponding new state at the end of the transition with the memory contents that would actually result after an update starting from the initial state. This does dramatically increase the number of states compared to the original MDP; see appendix A.2.5 for discussion of why this may be acceptable. A.2.2. LLMs with embeddings as states When modeling LLM agents with MDPs, we consider the states to correspond to embeddings. However, it is not entirely clear whether the embeddings should be considered to be “observable” by the language model, since they are an internal state rather than an input. This might suggest that the partially observable MDP (POMDP) framework is more appropriate. However, we think that the standard MDP setting suffices: although the internal states are not inputs, they are “known” to the model in nearly the same sense as inputs, both being represented as activations; and each input fully determines its embedding so no information is lost going from embeddings to inputs. A.2.3. Applying results to POMDPs In the partially observable MDP (POMDP) framework, the agent knows the overall structure of the MDP, but does not know its location within the MDP. Instead, it receives observations which depend on its current state, and uses these to update a belief state: a probability distribution on states reflecting its beliefs about which it may currently occupy. This is convenient for modeling agents which are uncertain about some characteristics of their environment (though it is not perfect, as discussed in appendix A.2.4). However, for any POMDP, there is a corresponding “belief MDP” which replicates the same dynamics by using the belief states themselves as the states [Kaelbling et al., 1998]. This “belief MDP” is continuous, since there are a continuous number of belief states, but, as in the second limitation, it can easily be coarse-grained to be discrete, e.g. by allowing only finite memory for storing beliefs. So, it is still possible to fairly accurately model an agent which does not have full knowledge of its environment within the MDP framework: create a POMDP with the desired properties; construct and coarse-grain the corresponding belief MDP; and use this MDP in place of the original POMDP when applying our theorems. A.2.4. Issues in modeling uncertainty not resolved with POMDPs We may question whether the POMDP environment itself is sufficient for modelling agents that do not observe their whole environment. The only uncertainty the agents have is in their location within the MDP: they still know everything about its overall structure. Thus, it is unclear how to model an agent which is uncertain about other aspects of its environment, such as which states exist and what the transition probabilities between states are. This is unfortunate, but can be partially addressed by converting beliefs about the world to self-locating beliefs [Egan and Titelbaum, 2022]. We can do this by considering the agent to belong to an MDP which is not just the MDP describing the real environment, but the set of all the MDPs that the agent believes to be possible (which must include the real MDP). For example, if the agent is unsure whether the transition probability from a to b is 0.5 or 0.6, we could create one (near-)copy of the true MDP where the transition probability is 0.5, and another where it is 0.6. Then the agent’s uncertainty about the transition probability in the original MDP becomes uncertainty about which version of the MDP the agent is located in. This may dramatically increase the number of states. The number can likely still be kept finite by limiting the size of the description of the world that the agent is allowed to have, as long as the true MDP is among the world states that the agent is capable of considering as possibilities. Nevertheless, this explosion is not ideal; see appendix A.2.5 for why it may still be acceptable. A.2.5. Large state space Many of the modifications required to treat general cases within the MDP framework require an extravagant number of states: whether by coarse-graining a continuous environment to a fine enough level to accurately replicate dynamics; including “memory” in every state; replacing each true underlying state with a large number of belief states; or creating many disconnected MDPs representing all the hypotheses about the MDP structure that the agent may hold. Some of this proliferation may not be as bad as it originally seems. Many real-world processes can be modeled decently well without having to represent the world’s underlying continuous physics at all (e.g. using diplomacy games to model real-world diplomacy); the vast majority of the memory and belief states may never be reachable since no possible history actually leads to the agent having certain memory contents or beliefs; the disconnected MDPs with hypotheses quickly disconfirmed by the evidence will rapidly come to contribute little to the dynamics of the belief MDP. Nevertheless, some cases will inevitably result in a very large number of states. This is unfortunate for our near-optimal policy case, where it will likely be extraordinarily difficult to prove near-optimality in all but the most constrained settings. However, this does not fully preclude our results from applying in the LLM case: there, safety can be determined empirically in a given setting by simply running the LLM in a sandbox many times and determining the probability with which it shuts down, and how quickly it does so. These facts, combined with the maximum magnitude of the gradient of the policy (which can be efficiently computed from the neural network specification), suffice to apply our theorem. So, explicit consideration of every state is not necessary. Appendix B Case (1): near-optimal agents B.1. Comments on distances B.1.1. Interpretation of cRsubscriptc_Rcitalic_R and cTsubscriptc_Tcitalic_T In definition 4.3 of d′, which defines a distance between states in different MDPs, cRsubscriptc_Rcitalic_R is the coefficient of the difference in immediate rewards; while cTsubscriptc_Tcitalic_T is the coefficient of the recursive term involving both d′ itself and the difference in probabilities, and modulates the contribution of future rewards to the distance. So, cRsubscriptc_Rcitalic_R can be understood as just controlling the relative importance of current rewards and future similarity, as suggested by the requirement in Ferns et al. [2004] that cR+cT=1subscriptsubscript1c_R+c_T=1citalic_R + citalic_T = 1. However, cRsubscriptc_Rcitalic_R can also be interpreted as scaling the reward functions: multiplying cRsubscriptc_Rcitalic_R by some constant b is equivalent to scaling both reward functions by b. The relative benefit of different courses of action, and hence behavior, does not change when the rewards are scaled by a constant factor. So, it makes sense that we should be able to choose any positive cRsubscriptc_Rcitalic_R, not just 1−cT1subscript1-c_T1 - citalic_T, so we can compensate for the arbitrary intrinsic scale of the reward functions. (We must still have 0<cT<10subscript10<c_T<10 < citalic_T < 1 so that the contraction mapping proof that d′ is well-defined works.) This extra degree of freedom in scaling the reward function calls into question the definition of d′: |(r1)s1a−(r2)s2a|subscriptsuperscriptsubscript1subscript1subscriptsuperscriptsubscript2subscript2|(r_1)^a_s_1-(r_2)^a_s_2|| ( r1 )aitalic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - ( r2 )aitalic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT | is sensitive to the relative scale of the two reward functions, so d′ will fail to properly connect states in systems which are identical except for the scale of the reward function. This is a limitation that may be possible to address in future work, e.g. by finding a relative scaling hℎh which defines a distance metric d′⁣*superscript′d *d′ * which causes ℳ1subscriptℳ1M_1M1 and M2subscript2M_2M2 to best align, e.g. by minimizing the maximum distance between states in S1subscript1S_1S1 and S2subscript2S_2S2: dℳ1,ℳ2h⁢(s1,s2)=subscriptsuperscriptℎsubscriptℳ1subscriptℳ2subscript1subscript2absent d^h_M_1,M_2(s_1,s_2)=ditalic_hcaligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) = maxa∈A⁡cR⁢|h⁢(r1)s1a−(1−h)⁢(r2)s2a|+cT⁢Wd′⁢(P1⁢(s1,a),P2⁢(s2,a))subscriptsubscriptℎsubscriptsuperscriptsubscript1subscript11ℎsubscriptsuperscriptsubscript2subscript2subscriptsubscriptsuperscript′subscript1subscript1subscript2subscript2 _a∈ A\c_R|h(r_1)^a_s_1-(1-h)(r_2)^a_s_2% |+c_TW_d (P_1(s_1,a),P_2(s_2,a))\maxitalic_a ∈ A citalic_R | h ( r1 )aitalic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - ( 1 - h ) ( r2 )aitalic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT | + citalic_T Witalic_d′ ( P1 ( s1 , a ) , P2 ( s2 , a ) ) hℳ1,ℳ2*=subscriptsuperscriptℎsubscriptℳ1subscriptℳ2absent h^*_M_1,M_2=h*M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT = arginfh∈(0,1)⁢max⁡maxs1∈S1⁡mins2∈S2⁡dℳ1,ℳ2h⁢(s1,s2),maxs2∈S2⁡mins1∈S1⁡dℳ1,ℳ2h⁢(s1,s2)subscriptarginfℎ01subscriptsubscript1subscript1subscriptsubscript2subscript2subscriptsuperscriptℎsubscriptℳ1subscriptℳ2subscript1subscript2subscriptsubscript2subscript2subscriptsubscript1subscript1subscriptsuperscriptℎsubscriptℳ1subscriptℳ2subscript1subscript2 _h∈(0,1) \ _s_1∈ S_1 _s_2% ∈ S_2d^h_M_1,M_2(s_1,s_2), _s_2∈ S% _2 _s_1∈ S_1d^h_M_1,M_2(s_1,s_2)\arginfh ∈ ( 0 , 1 ) max maxitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT ditalic_hcaligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) , maxitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT ditalic_hcaligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) After finding the h*superscriptℎh^*h* that makes the states of the MDPs best align—note the similarity to the Hausdorff metric dHsubscriptd_Hditalic_H—use this to define d′⁣*superscript′d *d′ *: dℳ1,ℳ2′⁣*⁢(s1,s2):=assignsubscriptsuperscript′subscriptℳ1subscriptℳ2subscript1subscript2absent d *_M_1,M_2(s_1,s_2):=d′ *M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) := dℳ1,ℳ2h*⁢(s1,s2).subscriptsuperscriptsuperscriptℎsubscriptℳ1subscriptℳ2subscript1subscript2 d^h^*_M_1,M_2(s_1,s_2).ditalic_h start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPTM start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) . Then d′⁣*superscript′d *d′ * is invariant to relative scaling of r1subscript1r_1r1 and r2subscript2r_2r2, and the choice of cRsubscriptc_Rcitalic_R can compensate for any overall scaling. This is likely an improvement for many applications, but may cause problems if the optimal value of hℎh is 0 or 1, indicating that one of the reward functions is more similar to the zero reward function than it is to the other; ideally we would be able to compare such reward functions, but that is not meaningfully possible with this solution. B.1.2. Lemmas Lemma B.1. d′ (Definition 4.3) is well-defined. Although Song et al. [2016] introduce d′, they do not prove that it is well-defined; we prove this for completeness. Proof. This is very similar to the proof that dbsubscriptd_bditalic_b exists—the only difference is checking that it still works when P1≠P2subscript1subscript2P_1≠ P_2P1 ≠ P2 and R1≠R2subscript1subscript2R_1≠ R_2R1 ≠ R2. It is also very similar to the proof of theorem 3 in Kemertas and Aumentado-Armstrong [2021]. Fix ℳ1,ℳ2subscriptℳ1subscriptℳ2M_1,M_2M1 , M2. We want to show that ℱ:⁢→⁢,ℱ⁢(d)⁢(s1,s2)=maxa∈A⁡cR⁢|(r1)s1a−(r2)s2a|+cT⁢Wd⁢(P1⁢(s1,a),P2⁢(s2,a)):ℱformulae-sequence→ℱsubscript1subscript2subscriptsubscriptsuperscriptsubscriptsubscript1subscript1superscriptsubscriptsubscript2subscript2subscriptsubscriptsubscript1subscript1subscript2subscript2F: met→ met,\,\,F(d)(s_1,s_2)=% _a∈ A\c_R|(r_1)_s_1^a-(r_2)_s_2^a|+c_TW_d(P_1% (s_1,a),P_2(s_2,a))\F : m e t → m e t , F ( d ) ( s1 , s2 ) = maxitalic_a ∈ A citalic_R | ( r1 )s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTa - ( r2 )s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTa | + citalic_T Witalic_d ( P1 ( s1 , a ) , P2 ( s2 , a ) ) is a contraction mapping in ⁢ metm e t under the uniform norm ||∞||_∞| |∞. (Note that here the Wasserstein distance W is short for W1subscript1W_1W1.) ‖ℱ⁢(d1)−ℱ⁢(d2)‖∞subscriptnormℱsubscript1ℱsubscript2 \|F(d_1)-F(d_2)\|_∞∥ F ( d1 ) - F ( d2 ) ∥∞ =maxs1,s2⁡|ℱ⁢(d1)⁢(s1,s2)−ℱ⁢(d2)⁢(s1,s2)|absentsubscriptsubscript1subscript2ℱsubscript1subscript1subscript2ℱsubscript2subscript1subscript2 = _s_1,s_2|F(d_1)(s_1,s_2)-F(d% _2)(s_1,s_2)|= maxitalic_s start_POSTSUBSCRIPT 1 , s2 end_POSTSUBSCRIPT | F ( d1 ) ( s1 , s2 ) - F ( d2 ) ( s1 , s2 ) | = == maxs1,s2|maxa∈A⁡cR∣(r1)s1a−(r2)s2a∣+cT⁢Wd1⁢(P1⁢(s1,a),P2⁢(s2,a))conditionalsubscriptsubscript1subscript2subscriptsubscriptsuperscriptsubscriptsubscript1subscript1superscriptsubscriptsubscript2subscript2subscriptsubscriptsubscript1subscript1subscript1subscript2subscript2 _s_1,s_2| _a∈ A\c_R|(r_1)_s_1^a-(r_2% )_s_2^a|+c_TW_d_1(P_1(s_1,a),P_2(s_2,a))\maxitalic_s start_POSTSUBSCRIPT 1 , s2 end_POSTSUBSCRIPT | maxitalic_a ∈ A citalic_R | ( r1 )s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTa - ( r2 )s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTa | + citalic_T Witalic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( P1 ( s1 , a ) , P2 ( s2 , a ) ) −maxa∈AcR|(r1)s1a−(r2)s2a|+cTWd2(P1(s1,a),P2(s2,a))| - _a∈ A\c_R|(r_1)_s_1^a-(r_2)_s_2^a|+c_% TW_d_2(P_1(s_1,a),P_2(s_2,a))\|- maxitalic_a ∈ A citalic_R | ( r1 )s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTa - ( r2 )s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTa | + citalic_T Witalic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( P1 ( s1 , a ) , P2 ( s2 , a ) ) | ≤ ≤ maxs1,s2|maxa∈AcR|(r1)s1a−(r2)s2a|+cTWd1(P1(s1,a),P2(s2,a)) _s_1,s_2| _a∈ A\c_R|(r_1)_s_1^a-(r_2% )_s_2^a|+c_TW_d_1(P_1(s_1,a),P_2(s_2,a))maxitalic_s start_POSTSUBSCRIPT 1 , s2 end_POSTSUBSCRIPT | maxitalic_a ∈ A citalic_R | ( r1 )s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTa - ( r2 )s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTa | + citalic_T Witalic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( P1 ( s1 , a ) , P2 ( s2 , a ) ) −cR|(r1)s1a−(r2)s2a|−cTWd2(P1(s1,a),P2(s2,a))| -c_R|(r_1)_s_1^a-(r_2)_s_2^a|-c_TW_d_2(P_% 1(s_1,a),P_2(s_2,a))\|- citalic_R | ( r1 )s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTa - ( r2 )s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTa | - citalic_T Witalic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( P1 ( s1 , a ) , P2 ( s2 , a ) ) | = == maxs1,s2⁡cT⁢|maxa∈A⁡Wd1⁢(P1⁢(s1,a),P2⁢(s2,a))−Wd2⁢(P1⁢(s1,a),P2⁢(s2,a))|subscriptsubscript1subscript2subscriptsubscriptsubscriptsubscript1subscript1subscript1subscript2subscript2subscriptsubscript2subscript1subscript1subscript2subscript2 _s_1,s_2c_T| _a∈ A\W_d_1(P_1(s_1,a),P% _2(s_2,a))-W_d_2(P_1(s_1,a),P_2(s_2,a))\|maxitalic_s start_POSTSUBSCRIPT 1 , s2 end_POSTSUBSCRIPT citalic_T | maxitalic_a ∈ A Witalic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( P1 ( s1 , a ) , P2 ( s2 , a ) ) - Witalic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( P1 ( s1 , a ) , P2 ( s2 , a ) ) | = == maxs1,s2⁡cT⁢|maxa∈A⁡Wd1−d2+d2⁢(P1⁢(s1,a),P2⁢(s2,a))−Wd2⁢(P1⁢(s1,a),P2⁢(s2,a))|subscriptsubscript1subscript2subscriptsubscriptsubscriptsubscript1subscript2subscript2subscript1subscript1subscript2subscript2subscriptsubscript2subscript1subscript1subscript2subscript2 _s_1,s_2c_T| _a∈ A\W_d_1-d_2+d_2(P_1% (s_1,a),P_2(s_2,a))-W_d_2(P_1(s_1,a),P_2(s_2,a))\|maxitalic_s start_POSTSUBSCRIPT 1 , s2 end_POSTSUBSCRIPT citalic_T | maxitalic_a ∈ A Witalic_d start_POSTSUBSCRIPT 1 - d2 + d2 end_POSTSUBSCRIPT ( P1 ( s1 , a ) , P2 ( s2 , a ) ) - Witalic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( P1 ( s1 , a ) , P2 ( s2 , a ) ) | ≤ ≤ maxs1,s2⁡cT⁢|maxa∈A⁡W‖d1−d2‖∞+d2⁢(P1⁢(s1,a),P2⁢(s2,a))−Wd2⁢(P1⁢(s1,a),P2⁢(s2,a))|subscriptsubscript1subscript2subscriptsubscriptsubscriptsubscriptnormsubscript1subscript2subscript2subscript1subscript1subscript2subscript2subscriptsubscript2subscript1subscript1subscript2subscript2 _s_1,s_2c_T| _a∈ A\W_\|d_1-d_2\|_∞% +d_2(P_1(s_1,a),P_2(s_2,a))-W_d_2(P_1(s_1,a),P_2(s_2,a% ))\|maxitalic_s start_POSTSUBSCRIPT 1 , s2 end_POSTSUBSCRIPT citalic_T | maxitalic_a ∈ A W∥ d start_POSTSUBSCRIPT 1 - d2 ∥∞ + d2 end_POSTSUBSCRIPT ( P1 ( s1 , a ) , P2 ( s2 , a ) ) - Witalic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( P1 ( s1 , a ) , P2 ( s2 , a ) ) | ≤ ≤ maxs1,s2⁡cT⁢|maxa∈A⁡‖d1−d2‖∞+Wd2⁢(P1⁢(s1,a),P2⁢(s2,a))−Wd2⁢(P1⁢(s1,a),P2⁢(s2,a))|subscriptsubscript1subscript2subscriptsubscriptsubscriptnormsubscript1subscript2subscriptsubscript2subscript1subscript1subscript2subscript2subscriptsubscript2subscript1subscript1subscript2subscript2 _s_1,s_2c_T| _a∈ A\\|d_1-d_2\|_∞+W% _d_2(P_1(s_1,a),P_2(s_2,a))-W_d_2(P_1(s_1,a),P_2(s_2,a% ))\|maxitalic_s start_POSTSUBSCRIPT 1 , s2 end_POSTSUBSCRIPT citalic_T | maxitalic_a ∈ A ∥ d1 - d2 ∥∞ + Witalic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( P1 ( s1 , a ) , P2 ( s2 , a ) ) - Witalic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( P1 ( s1 , a ) , P2 ( s2 , a ) ) | = == cT⁢‖d1−d2‖∞subscriptsubscriptnormsubscript1subscript2 c_T\|d_1-d_2\|_∞citalic_T ∥ d1 - d2 ∥∞ Since cT<1subscript1c_T<1citalic_T < 1, this is indeed a contraction mapping. ∎ See 4.1 Again, Song et al. [2016] introduce this, but don’t prove it’s a pseudometric, so again we include the proof for completeness. Proof. We check that the conditions are satisfied. (1) dH⁢(ℳ,ℳ)=0subscriptℳ0d_H(M,M)=0ditalic_H ( M , M ) = 0: In this case, P1=P2subscript1subscript2P_1=P_2P1 = P2 and R1=R2subscript1subscript2R_1=R_2R1 = R2, so d′⁢(s,s)=0superscript′0d (s,s)=0d′ ( s , s ) = 0. Then since S1=S2subscript1subscript2S_1=S_2S1 = S2, dH⁢(ℳ,ℳ)subscriptℳd_H(M,M)ditalic_H ( M , M ) is also 00. (2) dH⁢(ℳ1,ℳ2)=dH⁢(ℳ2,ℳ1)subscriptsubscriptℳ1subscriptℳ2subscriptsubscriptℳ2subscriptℳ1d_H(M_1,M_2)=d_H(M_2,M_1)ditalic_H ( M1 , M2 ) = ditalic_H ( M2 , M1 ): dH⁢(ℳ1,ℳ2)subscriptsubscriptℳ1subscriptℳ2 d_H(M_1,M_2)ditalic_H ( M1 , M2 ) =max⁡maxs1∈S1⁡mins2∈S2⁡dℳ1,ℳ2′⁢(s1,s2),maxs2∈S2⁡mins1∈S1⁡dℳ1,ℳ2′⁢(s1,s2)absentsubscriptsubscript1subscript1subscriptsubscript2subscript2subscriptsuperscript′subscriptℳ1subscriptℳ2subscript1subscript2subscriptsubscript2subscript2subscriptsubscript1subscript1subscriptsuperscript′subscriptℳ1subscriptℳ2subscript1subscript2 = \ _s_1∈ S_1 _s_2∈ S_2d % _M_1,M_2(s_1,s_2), _s_2∈ S_2 _s_% 1∈ S_1d _M_1,M_2(s_1,s_2) \= max maxitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) , maxitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) =max⁡maxs2∈S2⁡mins1∈S1⁡dℳ1,ℳ2′⁢(s1,s2),maxs1∈S1⁡mins2∈S2⁡dℳ1,ℳ2′⁢(s1,s2)absentsubscriptsubscript2subscript2subscriptsubscript1subscript1subscriptsuperscript′subscriptℳ1subscriptℳ2subscript1subscript2subscriptsubscript1subscript1subscriptsubscript2subscript2subscriptsuperscript′subscriptℳ1subscriptℳ2subscript1subscript2 = \ _s_2∈ S_2 _s_1∈ S_1d % _M_1,M_2(s_1,s_2), _s_1∈ S_1 _s_% 2∈ S_2d _M_1,M_2(s_1,s_2) \= max maxitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) , maxitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) so it suffices to show that dℳ1,ℳ2′⁢(s1,s2)=dℳ2,ℳ1′⁢(s2,s1)subscriptsuperscript′subscriptℳ1subscriptℳ2subscript1subscript2subscriptsuperscript′subscriptℳ2subscriptℳ1subscript2subscript1d _M_1,M_2(s_1,s_2)=d _ % M_2,M_1(s_2,s_1)d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) = d′caligraphic_M start_POSTSUBSCRIPT 2 , M1 end_POSTSUBSCRIPT ( s2 , s1 ), which follows immediately from symmetry of ||||| | and Wd′subscriptsuperscript′W_d Witalic_d′. (3) dH⁢(ℳ1,ℳ2)≤dH⁢(ℳ1,ℳ3)+dH⁢(ℳ3,ℳ2)subscriptsubscriptℳ1subscriptℳ2subscriptsubscriptℳ1subscriptℳ3subscriptsubscriptℳ3subscriptℳ2d_H(M_1,M_2)≤ d_H(M_1,M_% 3)+d_H(M_3,M_2)ditalic_H ( M1 , M2 ) ≤ ditalic_H ( M1 , M3 ) + ditalic_H ( M3 , M2 ): First we prove the analogous statement for individual states: dℳ1,ℳ2′⁢(s1,s2)=subscriptsuperscript′subscriptℳ1subscriptℳ2subscript1subscript2absent d _M_1,M_2(s_1,s_2)=d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) = maxa∈A⁡cR⁢|(r1)s1a−(r2)s2a|+cT⁢Wd′⁢(P1⁢(s1,a),P2⁢(s2,a))subscriptsubscriptsuperscriptsubscriptsubscript1subscript1superscriptsubscriptsubscript2subscript2subscriptsubscriptsuperscript′subscript1subscript1subscript2subscript2 _a∈ A\c_R|(r_1)_s_1^a-(r_2)_s_2^a|+c_% TW_d (P_1(s_1,a),P_2(s_2,a))\maxitalic_a ∈ A citalic_R | ( r1 )s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTa - ( r2 )s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTa | + citalic_T Witalic_d′ ( P1 ( s1 , a ) , P2 ( s2 , a ) ) = == maxa∈A⁡cR⁢|(r1)s1a−(r3)s3a+(r3)s3a−(r2)s2a|+cT⁢Wd′⁢(P1⁢(s1,a),P2⁢(s2,a))subscriptsubscriptsuperscriptsubscriptsubscript1subscript1superscriptsubscriptsubscript3subscript3superscriptsubscriptsubscript3subscript3superscriptsubscriptsubscript2subscript2subscriptsubscriptsuperscript′subscript1subscript1subscript2subscript2 _a∈ A\c_R|(r_1)_s_1^a-(r_3)_s_3^a+(r_% 3)_s_3^a-(r_2)_s_2^a|+c_TW_d (P_1(s_1,a),P_2(% s_2,a))\maxitalic_a ∈ A citalic_R | ( r1 )s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTa - ( r3 )s start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPTa + ( r3 )s start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPTa - ( r2 )s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTa | + citalic_T Witalic_d′ ( P1 ( s1 , a ) , P2 ( s2 , a ) ) ≤ ≤ maxa∈AcR|(r1)s1a−(r3)s3a|+|(r3)s3a−(r2)s2a| _a∈ A\c_R|(r_1)_s_1^a-(r_3)_s_3^a|+|(r% _3)_s_3^a-(r_2)_s_2^a|maxitalic_a ∈ A citalic_R | ( r1 )s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTa - ( r3 )s start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPTa | + | ( r3 )s start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPTa - ( r2 )s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTa | +cT(Wd′(P1(s1,a),P3(s3,a))+Wd′(P3(s3,a),P2(s2,a))) +c_T(W_d (P_1(s_1,a),P_3(s_3,a))+W_d % (P_3(s_3,a),P_2(s_2,a)))\+ citalic_T ( Witalic_d′ ( P1 ( s1 , a ) , P3 ( s3 , a ) ) + Witalic_d′ ( P3 ( s3 , a ) , P2 ( s2 , a ) ) ) ≤ ≤ maxa∈AcR|(r1)s1a−(r3)s3a|+cT(Wd′(P1(s1,a),P3(s3,a)) _a∈ A\c_R|(r_1)_s_1^a-(r_3)_s_3^a|+c_% T(W_d (P_1(s_1,a),P_3(s_3,a))\maxitalic_a ∈ A citalic_R | ( r1 )s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPTa - ( r3 )s start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPTa | + citalic_T ( Witalic_d′ ( P1 ( s1 , a ) , P3 ( s3 , a ) ) +maxa∈A|(r3)s3a−(r2)s2a|+Wd′(P3(s3,a),P2(s2,a))) + _a∈ A\|(r_3)_s_3^a-(r_2)_s_2^a|+W_d^% (P_3(s_3,a),P_2(s_2,a)))\+ maxitalic_a ∈ A | ( r3 )s start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPTa - ( r2 )s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPTa | + Witalic_d′ ( P3 ( s3 , a ) , P2 ( s2 , a ) ) ) = == dℳ1,ℳ3′⁢(s1,s3)+dℳ3,ℳ2′⁢(s3,s2).superscriptsubscriptsubscriptℳ1subscriptℳ3′subscript1subscript3superscriptsubscriptsubscriptℳ3subscriptℳ2′subscript3subscript2 \,d_M_1,M_3 (s_1,s_3)+d_% M_3,M_2 (s_3,s_2).dcaligraphic_M start_POSTSUBSCRIPT 1 , M3 end_POSTSUBSCRIPT′ ( s1 , s3 ) + dcaligraphic_M start_POSTSUBSCRIPT 3 , M2 end_POSTSUBSCRIPT′ ( s3 , s2 ) . We now move on to proving dH⁢(ℳ1,ℳ2)subscriptsubscriptℳ1subscriptℳ2 d_H(M_1,M_2)ditalic_H ( M1 , M2 ) =max⁡maxs1∈S1⁡mins2∈S2⁡dℳ1,ℳ2′⁢(s1,s2),maxs2∈S2⁡mins1∈S1⁡dℳ1,ℳ2′⁢(s1,s2)absentsubscriptsubscript1subscript1subscriptsubscript2subscript2subscriptsuperscript′subscriptℳ1subscriptℳ2subscript1subscript2subscriptsubscript2subscript2subscriptsubscript1subscript1subscriptsuperscript′subscriptℳ1subscriptℳ2subscript1subscript2 = \ _s_1∈ S_1 _s_2∈ S_2d % _M_1,M_2(s_1,s_2), _s_2∈ S_2 _s_% 1∈ S_1d _M_1,M_2(s_1,s_2) \= max maxitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) , maxitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) ≤dH⁢(ℳ1,ℳ3)+dH⁢(ℳ3,ℳ2).absentsubscriptsubscriptℳ1subscriptℳ3subscriptsubscriptℳ3subscriptℳ2 ≤ d_H(M_1,M_3)+d_H(M_3% ,M_2).≤ ditalic_H ( M1 , M3 ) + ditalic_H ( M3 , M2 ) . Fix s1*superscriptsubscript1s_1^*s1* and s2*superscriptsubscript2s_2^*s2* to be a solution to this expression: s1*∈arg⁡maxs1∈S1⁡mins2∈S2⁡dℳ1,ℳ2′⁢(s1,s2),s2*∈arg⁡mins2∈S2⁡dℳ1,ℳ2′⁢(s1*,s2)⁢ in the first case orformulae-sequencesuperscriptsubscript1subscriptsubscript1subscript1subscriptsubscript2subscript2subscriptsuperscript′subscriptℳ1subscriptℳ2subscript1subscript2superscriptsubscript2subscriptsubscript2subscript2subscriptsuperscript′subscriptℳ1subscriptℳ2superscriptsubscript1subscript2 in the first case or s_1^*∈ _s_1∈ S_1 _s_2∈ S_2d^% _M_1,M_2(s_1,s_2),\,s_2^*∈ _% s_2∈ S_2d _M_1,M_2(s_1^*,s_2)% in the first case ors1* ∈ arg maxitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) , s2* ∈ arg minitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1* , s2 ) in the first case or s2*∈arg⁡maxs2∈S2⁡mins1∈S1⁡dℳ1,ℳ2′⁢(s1,s2),s1*∈arg⁡mins1∈S1⁡dℳ1,ℳ2′⁢(s1*,s2)⁢ otherwise; thenformulae-sequencesuperscriptsubscript2subscriptsubscript2subscript2subscriptsubscript1subscript1subscriptsuperscript′subscriptℳ1subscriptℳ2subscript1subscript2superscriptsubscript1subscriptsubscript1subscript1subscriptsuperscript′subscriptℳ1subscriptℳ2superscriptsubscript1subscript2 otherwise; then s_2^*∈ _s_2∈ S_2 _s_1∈ S_1d^% _M_1,M_2(s_1,s_2),\,s_1^*∈ _% s_1∈ S_1d _M_1,M_2(s_1^*,s_2)% otherwise; thens2* ∈ arg maxitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) , s1* ∈ arg minitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1* , s2 ) otherwise; then dH⁢(ℳ1,ℳ2)=subscriptsubscriptℳ1subscriptℳ2absent d_H(M_1,M_2)=ditalic_H ( M1 , M2 ) = dℳ1,ℳ2′⁢(s1*,s2*)subscriptsuperscript′subscriptℳ1subscriptℳ2superscriptsubscript1superscriptsubscript2 d _M_1,M_2(s_1^*,s_2^*)d′caligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1* , s2* ) ≤ ≤ mins3∈S3⁡dℳ1,ℳ3′⁢(s1*,s3)+dℳ3,ℳ2′⁢(s3,s2*)subscriptsubscript3subscript3subscriptsuperscript′subscriptℳ1subscriptℳ3superscriptsubscript1subscript3subscriptsuperscript′subscriptℳ3subscriptℳ2subscript3superscriptsubscript2 _s_3∈ S_3d _M_1,M_3% (s_1^*,s_3)+d _M_3,M_2(s_3,s_2^*)minitalic_s start_POSTSUBSCRIPT 3 ∈ S3 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M3 end_POSTSUBSCRIPT ( s1* , s3 ) + d′caligraphic_M start_POSTSUBSCRIPT 3 , M2 end_POSTSUBSCRIPT ( s3 , s2* ) (triangle inequality for d′) ≤ ≤ mins3∈S3⁡dℳ1,ℳ3′⁢(s1*,s3)+mins3∈S3⁡dℳ3,ℳ2′⁢(s3,s2*)subscriptsubscript3subscript3subscriptsuperscript′subscriptℳ1subscriptℳ3superscriptsubscript1subscript3subscriptsubscript3subscript3subscriptsuperscript′subscriptℳ3subscriptℳ2subscript3superscriptsubscript2 _s_3∈ S_3d _M_1,M_3% (s_1^*,s_3)+ _s_3∈ S_3d _M_3,M% _2(s_3,s_2^*)minitalic_s start_POSTSUBSCRIPT 3 ∈ S3 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M3 end_POSTSUBSCRIPT ( s1* , s3 ) + minitalic_s start_POSTSUBSCRIPT 3 ∈ S3 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 3 , M2 end_POSTSUBSCRIPT ( s3 , s2* ) ≤ ≤ maxs1∈S1⁡mins3∈S3⁡dℳ1,ℳ3′⁢(s1,s3)+maxs2∈S2⁡mins3∈S3⁡dℳ3,ℳ2′⁢(s3,s2)subscriptsubscript1subscript1subscriptsubscript3subscript3subscriptsuperscript′subscriptℳ1subscriptℳ3subscript1subscript3subscriptsubscript2subscript2subscriptsubscript3subscript3subscriptsuperscript′subscriptℳ3subscriptℳ2subscript3subscript2 _s_1∈ S_1 _s_3∈ S_3d _M% _1,M_3(s_1,s_3)+ _s_2∈ S_2 _s_3∈ S_3d% _M_3,M_2(s_3,s_2)maxitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 3 ∈ S3 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M3 end_POSTSUBSCRIPT ( s1 , s3 ) + maxitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 3 ∈ S3 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 3 , M2 end_POSTSUBSCRIPT ( s3 , s2 ) ≤ ≤ max⁡maxs1∈S1⁡mins3∈S3⁡dℳ1,ℳ3′⁢(s1,s3),maxs3∈S3⁡mins1∈S1⁡dℳ1,ℳ3′⁢(s1,s3)subscriptsubscript1subscript1subscriptsubscript3subscript3subscriptsuperscript′subscriptℳ1subscriptℳ3subscript1subscript3subscriptsubscript3subscript3subscriptsubscript1subscript1subscriptsuperscript′subscriptℳ1subscriptℳ3subscript1subscript3 \ _s_1∈ S_1 _s_3∈ S_3d _% M_1,M_3(s_1,s_3), _s_3∈ S_3 _s_1% ∈ S_1d _M_1,M_3(s_1,s_3) \max maxitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 3 ∈ S3 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M3 end_POSTSUBSCRIPT ( s1 , s3 ) , maxitalic_s start_POSTSUBSCRIPT 3 ∈ S3 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 1 ∈ S1 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 1 , M3 end_POSTSUBSCRIPT ( s1 , s3 ) +max⁡maxs3∈S3⁡mins2∈S2⁡dℳ3,ℳ2′⁢(s3,s2),maxs2∈S2⁡mins3∈S3⁡dℳ3,ℳ2′⁢(s3,s2)subscriptsubscript3subscript3subscriptsubscript2subscript2subscriptsuperscript′subscriptℳ3subscriptℳ2subscript3subscript2subscriptsubscript2subscript2subscriptsubscript3subscript3subscriptsuperscript′subscriptℳ3subscriptℳ2subscript3subscript2 + \ _s_3∈ S_3 _s_2∈ S_2d % _M_3,M_2(s_3,s_2), _s_2∈ S_2 _s_% 3∈ S_3d _M_3,M_2(s_3,s_2) \+ max maxitalic_s start_POSTSUBSCRIPT 3 ∈ S3 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 3 , M2 end_POSTSUBSCRIPT ( s3 , s2 ) , maxitalic_s start_POSTSUBSCRIPT 2 ∈ S2 end_POSTSUBSCRIPT minitalic_s start_POSTSUBSCRIPT 3 ∈ S3 end_POSTSUBSCRIPT d′caligraphic_M start_POSTSUBSCRIPT 3 , M2 end_POSTSUBSCRIPT ( s3 , s2 ) ≤ ≤ dH⁢(ℳ1,ℳ3)+dH⁢(ℳ3,ℳ2).subscriptsubscriptℳ1subscriptℳ3subscriptsubscriptℳ3subscriptℳ2 d_H(M_1,M_3)+d_H(M_3,% M_2).ditalic_H ( M1 , M3 ) + ditalic_H ( M3 , M2 ) . ∎ B.2. Proofs toward Theorem 5.7 See 5.5 Proof. Recall that ∑j(Pk)i⁢jsubscriptsubscriptsuperscript _j(P^k)_ij∑j ( Pitalic_k )i j is the probability that an agent starting in si∈S∖Ssafesubscriptsubscriptsafes_i∈ S S_safesitalic_i ∈ S ∖ Ssafe stays in S∖SsafesubscriptsafeS S_safeS ∖ Ssafe during the first k steps. (PksuperscriptP^kPitalic_k denotes the k’th power of matrix Pi⁢jsubscriptP_ijPitalic_i j). For an agent starting in the state sisubscripts_isitalic_i the expected number of steps before reaching SsafesubscriptsafeS_safeSsafe is then given by ∑k=1∞k⁢P⁢r⁢(reaching ⁢Ssafe⁢ at step ⁢k)superscriptsubscript1reaching subscriptsafe at step _k=1^∞kPr(reaching S_safe at % step k)∑k = 1∞ k P r ( reaching Ssafe at step k ) =∑k=1∞k⁢(∑j(Pk)i⁢j−∑j(Pk+1)i⁢j)absentsuperscriptsubscript1subscriptsubscriptsuperscriptsubscriptsubscriptsuperscript1 = _k=1^∞k ( _j(P^k)_ij- _j(P^k+1)% _ij )= ∑k = 1∞ k ( ∑j ( Pitalic_k )i j - ∑j ( Pitalic_k + 1 )i j ) =∑j(∑k=1∞k⁢(Pk)i⁢j−∑k=1∞(k−1)⁢(Pk)i⁢j)absentsubscriptsuperscriptsubscript1subscriptsuperscriptsuperscriptsubscript11subscriptsuperscript = _j ( _k=1^∞k(P^k)_ij- _k=1^% ∞(k-1)(P^k)_ij )= ∑j ( ∑k = 1∞ k ( Pitalic_k )i j - ∑k = 1∞ ( k - 1 ) ( Pitalic_k )i j ) =∑j∑k=1∞(Pk)i⁢jabsentsubscriptsuperscriptsubscript1subscriptsuperscript = _j _k=1^∞(P^k)_ij= ∑j ∑k = 1∞ ( Pitalic_k )i j The lemma follows by taking the Δ Δ-weighted sum over all initial sisubscripts_isitalic_i. ∎ To prove Theorem 5.7 we will first show that Ns⁢(ℳ,ε)=Ns⁢(ℳ′,ε)subscriptℳsubscriptsuperscriptℳ′N_s(M, )=N_s(M , )Nitalic_s ( M , ε ) = Nitalic_s ( M′ , ε ) if ℳ′M M′ is obtained from ℳMM by taking a quotient by bisimulation equivalence. Lemma B.2. Suppose ℳ′=ℳ/∼M =M/ ′ = M / ∼, where ∼similar-to ∼ is the bisimulation relation. Let π be a policy in ℳMM with π⁢(s1)=π⁢(s2)subscript1subscript2π(s_1)=π(s_2)π ( s1 ) = π ( s2 ) if s1∼s2similar-tosubscript1subscript2s_1 s_2s1 ∼ s2 and let π′normal-′π π′ be the policy in ℳ′normal-′M M′ induced by π. Let Pi,jsubscriptP_i,jPitalic_i , j and Pl,m′subscriptnormal-′P_l,m Pitalic_l , m′ be the corresponding transition matrices. Suppose si∈Ssubscripts_i∈ Ssitalic_i ∈ S lies in the equivalence class sl′subscriptnormal-′s_l sitalic_l′ of S′normal-′S S′, then ∑j:sj∈sm′(Pk)i,j=(P′⁣k)l,m.subscript:subscriptsuperscriptsubscript′subscriptsuperscriptsubscriptsuperscript′ _j:s_j∈ s_m (P^k)_i,j=(P k)_l,m.∑j : s start_POSTSUBSCRIPT j ∈ sitalic_m′ end_POSTSUBSCRIPT ( Pitalic_k )i , j = ( P′ k )l , m . Proof. The proof is by induction on k. For k=11k=1k = 1 by definition of bisimulation relation we have that for every two states si1∼si2∈sl′similar-tosubscriptsubscript1subscriptsubscript2superscriptsubscript′s_i_1 s_i_2∈ s_l sitalic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∼ sitalic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ sitalic_l′ we have ∑j:sj∈sm′Pi1,j=∑j:sj∈sm′Pi2,j:=Pl,m′subscript:subscriptsuperscriptsubscript′subscriptsubscript1subscript:subscriptsuperscriptsubscript′subscriptsubscript2assignsubscriptsuperscript′ _j:s_j∈ s_m P_i_1,j= _j:s_j∈ s_m P_% i_2,j:=P _l,m∑j : s start_POSTSUBSCRIPT j ∈ sitalic_m′ end_POSTSUBSCRIPT Pitalic_i start_POSTSUBSCRIPT 1 , j end_POSTSUBSCRIPT = ∑j : s start_POSTSUBSCRIPT j ∈ sitalic_m′ end_POSTSUBSCRIPT Pitalic_i start_POSTSUBSCRIPT 2 , j end_POSTSUBSCRIPT := P′italic_l , m. For k>11k>1k > 1 we use inductive assumption to obtain ∑j:sj∈sm′(Pk)i,jsubscript:subscriptsuperscriptsubscript′subscriptsuperscript _j:s_j∈ s_m (P^k)_i,j∑j : s start_POSTSUBSCRIPT j ∈ sitalic_m′ end_POSTSUBSCRIPT ( Pitalic_k )i , j =∑j:sj∈sm′∑q(Pk−1)i,q⁢Pq,jabsentsubscript:subscriptsuperscriptsubscript′subscriptsubscriptsuperscript1subscript = _j:s_j∈ s_m _q(P^k-1)_i,qP_q,j= ∑j : s start_POSTSUBSCRIPT j ∈ sitalic_m′ end_POSTSUBSCRIPT ∑q ( Pitalic_k - 1 )i , q Pitalic_q , j =∑j:sj∈sm′∑n∑r:sr∈sn′(Pk−1)i,r⁢Pr,jabsentsubscript:subscriptsuperscriptsubscript′subscriptsubscript:subscriptsubscriptsuperscript′subscriptsuperscript1subscript = _j:s_j∈ s_m _n _r:s_r∈ s^% _n(P^k-1)_i,rP_r,j= ∑j : s start_POSTSUBSCRIPT j ∈ sitalic_m′ end_POSTSUBSCRIPT ∑n ∑r : s start_POSTSUBSCRIPT r ∈ s′italic_n end_POSTSUBSCRIPT ( Pitalic_k - 1 )i , r Pitalic_r , j =∑n∑r:sr∈sn′((Pk−1)i,r⁢∑j:sj∈sm′Pr,j)absentsubscriptsubscript:subscriptsubscriptsuperscript′subscriptsuperscript1subscript:subscriptsuperscriptsubscript′subscript = _n _r:s_r∈ s _n ((P^k-1)_i,r% _j:s_j∈ s_m P_r,j )= ∑n ∑r : s start_POSTSUBSCRIPT r ∈ s′italic_n end_POSTSUBSCRIPT ( ( Pitalic_k - 1 )i , r ∑j : s start_POSTSUBSCRIPT j ∈ sitalic_m′ end_POSTSUBSCRIPT Pitalic_r , j ) =∑n(∑r:sr∈sn′(Pk−1)i,r)⁢Pn,m′absentsubscriptsubscript:subscriptsubscriptsuperscript′subscriptsuperscript1subscriptsuperscript′ = _n ( _r:s_r∈ s _n(P^k-1)_i,r% )P _n,m= ∑n ( ∑r : s start_POSTSUBSCRIPT r ∈ s′italic_n end_POSTSUBSCRIPT ( Pitalic_k - 1 )i , r ) P′italic_n , m =∑n(P′⁣(k−1))l,n⁢Pn,m′=(P′⁣k)l,mabsentsubscriptsubscriptsuperscript′1subscriptsuperscript′subscriptsuperscript′ = _n(P (k-1))_l,nP _n,m=(P k)_% l,m= ∑n ( P′ ( k - 1 ) )l , n P′italic_n , m = ( P′ k )l , m ∎ Proposition B.3. Suppose ℳ′=ℳ/∼M =M/ ′ = M / ∼, where ∼similar-to ∼ is the bisimulation relation, and assume that S′=S/∼S_safe =S_safe/ ′ = Ssafe / ∼. Then Ns⁢(ℳ,ε)=Ns⁢(ℳ′,ε)subscriptℳsubscriptsuperscriptℳnormal-′N_s(M, )=N_s(M , )Nitalic_s ( M , ε ) = Nitalic_s ( M′ , ε ). Proof. First we show that Ns⁢(ℳ′,ε)≤Ns⁢(ℳ,ε)subscriptsuperscriptℳ′subscriptℳN_s(M , )≤ N_s(M, )Nitalic_s ( M′ , ε ) ≤ Nitalic_s ( M , ε ). Suppose π′π π′ is a ε ε-optimal policy in ℳ′M M′. Define policy π on ℳMM by setting π⁢(s)=π′⁢(s′)superscript′π(s)=π (s )π ( s ) = π′ ( s′ ), where s′ is the equivalence class containing s. By the results about bisimulation metrics and optimal value functions [Ferns et al., 2004] we have that π′π π′ is ε ε-optimal in ℳ′M M′. It follows then by Lemma B.2 and Lemma 5.5 (noting that ℳ′M M′’s sl⁢o⁢n⁢gsubscripts_longsitalic_l o n g maps to ℳMM’s) that Ns⁢(ℳ′,π′,S′∖Ssafe′,Ssafe′)=Ns⁢(ℳ,π,S∖Ssafe,Ssafe)≤Nsubscriptsuperscriptℳ′superscript′subscriptsafe′subscriptsafe′subscriptℳsubscriptsafesubscriptsafeN_s(M ,π ,S S_safe^% ,S_safe )=N_s(M,π,S S_% safe,S_safe)≤ NNitalic_s ( M′ , π′ , S′ ∖ Ssafe′ , Ssafe′ ) = Nitalic_s ( M , π , S ∖ Ssafe , Ssafe ) ≤ N. In the other direction, let π be ε ε-optimal in ℳMM with Ns⁢(ℳ,π,S∖Ssafe,Ssafe)=Ns⁢(ℳ,ε)subscriptℳsubscriptsafesubscriptsafesubscriptℳN_s(M,π,S S_safe,S_safe)=N_s(% M, )Nitalic_s ( M , π , S ∖ Ssafe , Ssafe ) = Nitalic_s ( M , ε ). We want to show that Ns⁢(ℳ,π,S∖Ssafe,Ssafe)≤NsubscriptℳsubscriptsafesubscriptsafeN_s(M,π,S S_safe,S_safe)≤ NNitalic_s ( M , π , S ∖ Ssafe , Ssafe ) ≤ N. Let Pi⁢jsubscriptP_ijPitalic_i j denote the transition matrix between states in S∖SsafesubscriptsafeS S_safeS ∖ Ssafe corresponding to π. Since π may prescribe different actions to elements of the same bisimulation equivalence class we cannot immediately use Lemma B.2. Instead, we will define a new policy π~~ πover~ start_ARG π end_ARG on ℳMM by replacing all actions π⁢(s)π(s)π ( s ) for s∈Cs∈ Cs ∈ C by π⁢(s⁢(C))π(s(C))π ( s ( C ) ), where s⁢(C)∈Cs(C)∈ Cs ( C ) ∈ C is chosen to maximize NssubscriptN_sNitalic_s. Since the change happens between actions at bisimilar states, the new policy will still be ε ε-optimal. More precisely, we modify π inductively. Fix a bisimulation equivalence class C and let s⁢(C)s(C)s ( C ) be such that the expected number of steps before reaching SsafesubscriptsafeS_safeSsafe starting at s⁢(C)s(C)s ( C ) is greater than or equal to the expected number of steps before reaching SsafesubscriptsafeS_safeSsafe for any other starting point in C. Define π1⁢(s)=π⁢(s⁢(C))subscript1 _1(s)=π(s(C))π1 ( s ) = π ( s ( C ) ) for all s∈Cs∈ Cs ∈ C. Note that π1subscript1 _1π1 is also ε ε-optimal and that this modification can not decrease NssubscriptN_sNitalic_s. We proceed this way until we obtain an ε ε-optimal policy π~~ πover~ start_ARG π end_ARG, such that π~⁢(s)=π⁢(s⁢(C))~ π(s)=π(s(C))over~ start_ARG π end_ARG ( s ) = π ( s ( C ) ) if s∈Cs∈ Cs ∈ C and Ns⁢(ℳ,π~,S∖Ssafe,Ssafe)≥Ns⁢(ℳ,π,S∖Ssafe,Ssafe)subscriptℳ~subscriptsafesubscriptsafesubscriptℳsubscriptsafesubscriptsafeN_s(M, π,S S_safe,S_safe)≥ N% _s(M,π,S S_safe,S_safe)Nitalic_s ( M , over~ start_ARG π end_ARG , S ∖ Ssafe , Ssafe ) ≥ Nitalic_s ( M , π , S ∖ Ssafe , Ssafe ). Observe that π~~ πover~ start_ARG π end_ARG is ε ε-optimal and induces a well-defined policy π′π π′ on ℳ′M M′. By Lemma B.2 we have Ns⁢(ℳ,ε)≤Ns⁢(ℳ,π~,S∖Ssafe,Ssafe)=Ns⁢(ℳ′,π′,S′∖Ssafe′,Ssafe′)≤Nsubscriptℳsubscriptℳ~subscriptsafesubscriptsafesubscriptsuperscriptℳ′superscript′subscriptsafe′subscriptsafe′N_s(M, )≤ N_s(M, π,S S_% safe,S_safe)=N_s(M ,π ,S^% S_safe ,S_safe )≤ NNitalic_s ( M , ε ) ≤ Nitalic_s ( M , over~ start_ARG π end_ARG , S ∖ Ssafe , Ssafe ) = Nitalic_s ( M′ , π′ , S′ ∖ Ssafe′ , Ssafe′ ) ≤ N ∎ Proposition B.4. Suppose ℳ1=(S1,A,Pℳ1,R1,γ)subscriptℳ1subscript1subscriptsubscriptℳ1subscript1M_1=(S_1,A,P_M_1,R_1,γ)M1 = ( S1 , A , Pcaligraphic_M start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , R1 , γ ) is (N,ε0)subscript0(N, _0)( N , ε0 )-safe. There exists ε⁢(ℳ1)subscriptℳ1 (M_1)ε ( M1 ) satisfying the following property. Let ℳ2=(S2,A,Pℳ2,R2,γ)subscriptℳ2subscript2subscriptsubscriptℳ2subscript2M_2=(S_2,A,P_M_2,R_2,γ)M2 = ( S2 , A , Pcaligraphic_M start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , R2 , γ ) be such that there exists a bijective map b:S1→S2normal-:normal-→subscript1subscript2b:S_1→ S_2b : S1 → S2 with dℳ1,ℳ2⁢(s,b⁢(s))<εsubscriptsubscriptℳ1subscriptℳ2d_M_1,M_2(s,b(s))< _M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s , b ( s ) ) < ε and dℳ1,ℳ2⁢(s1,s2)>εsubscriptsubscriptℳ1subscriptℳ2subscript1subscript2d_M_1,M_2(s_1,s_2)> dcaligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) > square-root start_ARG ε end_ARG for s2≠b⁢(s1)subscript2subscript1s_2≠ b(s_1)s2 ≠ b ( s1 ). Then ℳ2subscriptℳ2M_2M2 is (N2,ε0/2)subscript2subscript02(N_2, _0/2)( N2 , ε0 / 2 )-safe for N2≤N1+1subscript2subscript11N_2≤ N_1+1N2 ≤ N1 + 1. Proof. Suppose π2subscript2 _2π2 is an ε0/2−limit-fromsubscript02 _0/2-ε0 / 2 -optimal policy in ℳ2subscriptℳ2M_2M2. Then by the properties of dHsubscriptd_Hditalic_H we have that π1=b−1∘π2∘bsubscript1superscript1subscript2 _1=b^-1 _2 bπ1 = b- 1 ∘ π2 ∘ b is an ε0−limit-fromsubscript0 _0-ε0 -optimal policy in ℳ1subscriptℳ1M_1M1 (assuming ε ε is small enough).444In fact, if π2subscript2 _2π2 is a (ε0−η)subscript0( _0-η)( ε0 - η )-optimal policy for any constant η>00η>0η > 0, then π1subscript1 _1π1 is ε0subscript0 _0ε0-optimal for ε ε is small enough. Let Pi,j=Pℳ1⁢(si,π1⁢(si))⁢(sj)subscriptsubscriptsubscriptℳ1subscriptsubscript1subscriptsubscriptP_i,j=P_M_1(s_i, _1(s_i))(s_j)Pitalic_i , j = Pcaligraphic_M start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( sitalic_i , π1 ( sitalic_i ) ) ( sitalic_j ) and Qi,j=Pℳ2⁢(b⁢(si),π2⁢(si))⁢(b⁢(sj))subscriptsubscriptsubscriptℳ2subscriptsubscript2subscriptsubscriptQ_i,j=P_M_2(b(s_i), _2(s_i))(b(s_j))Qitalic_i , j = Pcaligraphic_M start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( b ( sitalic_i ) , π2 ( sitalic_i ) ) ( b ( sitalic_j ) ) for si,sj∈S1∖Ssafe1subscriptsubscriptsubscript1superscriptsubscriptsafe1s_i,s_j∈ S_1 S_safe^1sitalic_i , sitalic_j ∈ S1 ∖ Ssafe1. Define functions ui:S1→[0,1]:subscript→subscript101u_i:S_1→[0,1]uitalic_i : S1 → [ 0 , 1 ] and vi:S2→[0,1]:subscript→subscript201v_i:S_2→[0,1]vitalic_i : S2 → [ 0 , 1 ]: ui⁢(b⁢(sj))=vi⁢(sj)=εsubscriptsubscriptsubscriptsubscriptu_i(b(s_j))=v_i(s_j)= uitalic_i ( b ( sitalic_j ) ) = vitalic_i ( sitalic_j ) = square-root start_ARG ε end_ARG if i=ji=ji = j and ui⁢(b⁢(sj))=vi⁢(sj)=0subscriptsubscriptsubscriptsubscript0u_i(b(s_j))=v_i(s_j)=0uitalic_i ( b ( sitalic_j ) ) = vitalic_i ( sitalic_j ) = 0 otherwise. From dℳ1,ℳ2⁢(s1,s2)>εsubscriptsubscriptℳ1subscriptℳ2subscript1subscript2d_M_1,M_2(s_1,s_2)> dcaligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( s1 , s2 ) > square-root start_ARG ε end_ARG for s1≠b⁢(s2)subscript1subscript2s_1≠ b(s_2)s1 ≠ b ( s2 ) it follows that ui⁢(b⁢(sj))−vi⁢(sk)≤ε<dℳ1,ℳ2⁢(b⁢(sj),sk)subscriptsubscriptsubscriptsubscriptsubscriptsubscriptℳ1subscriptℳ2subscriptsubscript u_i(b(s_j))-v_i(s_k)≤ <d_M% _1,M_2(b(s_j),s_k)uitalic_i ( b ( sitalic_j ) ) - vitalic_i ( sitalic_k ) ≤ square-root start_ARG ε end_ARG < dcaligraphic_M start_POSTSUBSCRIPT 1 , M2 end_POSTSUBSCRIPT ( b ( sitalic_j ) , sitalic_k ) if ⁢j≠kif if j≠ kif j ≠ k ui⁢(b⁢(sj))−vi⁢(sk)=0subscriptsubscriptsubscriptsubscript0 u_i(b(s_j))-v_i(s_k)=0uitalic_i ( b ( sitalic_j ) ) - vitalic_i ( sitalic_k ) = 0 if ⁢j=kif if j=kif j = k Hence, by the Kantorovich-Rubinstein duality we have ∑kPi,k⁢uj⁢(b⁢(sk))−Qi,k⁢vj⁢(sk)≤⁢(Pi,Qi)subscriptsubscriptsubscriptsubscriptsubscriptsubscriptsubscriptsubscriptsubscript _kP_i,ku_j(b(s_k))-Q_i,kv_j(s_k) % (P_i,Q_i)∑k Pitalic_i , k uitalic_j ( b ( sitalic_k ) ) - Qitalic_i , k vitalic_j ( sitalic_k ) ≤ W ( Pitalic_i , Qitalic_i ) (4) It then follows that Pi,j−Qi,jsubscriptsubscript P_i,j-Q_i,jPitalic_i , j - Qitalic_i , j =1ε⁢(∑kPi,k⁢uj⁢(b⁢(sk))−Qi,k⁢vj⁢(sk))absent1subscriptsubscriptsubscriptsubscriptsubscriptsubscriptsubscript = 1 ( _kP_i,ku_j(b(s_k)% )-Q_i,kv_j(s_k) )= divide start_ARG 1 end_ARG start_ARG square-root start_ARG ε end_ARG end_ARG ( ∑k Pitalic_i , k uitalic_j ( b ( sitalic_k ) ) - Qitalic_i , k vitalic_j ( sitalic_k ) ) ≤1ε⁢(Pi,Qi)<εabsent1subscriptsubscript ≤ 1 W(P_i,Q_i)< ≤ divide start_ARG 1 end_ARG start_ARG square-root start_ARG ε end_ARG end_ARG W ( Pitalic_i , Qitalic_i ) < square-root start_ARG ε end_ARG Similarly, we obtain Qi,j−Pi,j≤εsubscriptsubscriptQ_i,j-P_i,j≤ Qitalic_i , j - Pitalic_i , j ≤ square-root start_ARG ε end_ARG, so |Pi,j−Qi,j|≤εsubscriptsubscript |P_i,j-Q_i,j|≤ | Pitalic_i , j - Qitalic_i , j | ≤ square-root start_ARG ε end_ARG (5) Recall that N′=Ns⁢(ℳ2,π2,S2∖Ssafe2,Ssafe2)superscript′subscriptsubscriptℳ2subscript2subscript2superscriptsubscriptsafe2superscriptsubscriptsafe2N =N_s(M_2, _2,S_2 S_safe^2,S_% safe^2)N′ = Nitalic_s ( M2 , π2 , S2 ∖ Ssafe2 , Ssafe2 ) satisfies N′=maxi⁢∑k=1∞∑j(Qk)i,jsuperscript′subscriptsuperscriptsubscript1subscriptsubscriptsuperscriptN = _i _k=1^∞ _j(Q^k)_i,jN′ = maxitalic_i ∑k = 1∞ ∑j ( Qitalic_k )i , j. For sufficiently small ε>00 >0ε > 0 by equation (5) we have |maxi⁢∑k=1∞∑j(Qk)i,j−maxi⁢∑k=1∞∑j(Pk)i,j|≤1subscriptsuperscriptsubscript1subscriptsubscriptsuperscriptsubscriptsuperscriptsubscript1subscriptsubscriptsuperscript1| _i _k=1^∞ _j(Q^k)_i,j- _i _k=1^∞% _j(P^k)_i,j|≤ 1| maxitalic_i ∑k = 1∞ ∑j ( Qitalic_k )i , j - maxitalic_i ∑k = 1∞ ∑j ( Pitalic_k )i , j | ≤ 1 It follows then that N′≤maxi⁢∑k=1∞∑j(Pk)i,j+1≤N⁢(ℳ1,ε)+1superscript′subscriptsuperscriptsubscript1subscriptsubscriptsuperscript1subscriptℳ11N ≤ _i _k=1^∞ _j(P^k)_i,j+1≤ N(% M_1, )+1N′ ≤ maxitalic_i ∑k = 1∞ ∑j ( Pitalic_k )i , j + 1 ≤ N ( M1 , ε ) + 1 ∎ Proof of Theorem 5.7. We start by taking a quotient of ℳ0=ℳ/∼M_0=M/ 0 = M / ∼ by the bisimulation relation to obtain MDP ℳ0=(S0,A,R0,P0,γ)subscriptℳ0subscript0subscript0subscript0M_0=(S_0,A,R_0,P_0,γ)M0 = ( S0 , A , R0 , P0 , γ ). By Proposition B.3 we have that ℳ0subscriptℳ0M_0M0 is (ε,N)( ,N)( ε , N )-safe. Note also that dH⁢(ℳ0,ℳ)=0subscriptsubscriptℳ0ℳ0d_H(M_0,M)=0ditalic_H ( M0 , M ) = 0. Pick δ>00δ>0δ > 0 satisfying δ<12⁢mins1≠s2∈S0⁡dℳ0⁢(s1,s2)12subscriptsubscript1subscript2subscript0subscriptsubscriptℳ0subscript1subscript2δ< 12 _s_1≠ s_2∈ S_0d_M_0(s_1,s_% 2)δ < divide start_ARG 1 end_ARG start_ARG 2 end_ARG minitalic_s start_POSTSUBSCRIPT 1 ≠ s2 ∈ S0 end_POSTSUBSCRIPT dcaligraphic_M start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( s1 , s2 ) and δ<ε22δ< 2δ < divide start_ARG ε end_ARG start_ARG 2 end_ARG. Let ℳ′=(S′,A,R′,P′,γ)superscriptℳ′superscript′M =(S ,A,R ,P ,γ)M′ = ( S′ , A , R′ , P′ , γ ) be an MDP satisfying conditions of Theorem 5.7 for the δ>00δ>0δ > 0 we choose. Condition dH⁢(ℳ′,ℳ0)<δsubscriptsuperscriptℳ′subscriptℳ0d_H(M ,M_0)< _H ( M′ , M0 ) < δ and our choice of δ imply that for each element s∈S′s∈ S s ∈ S′ there is a unique element s0⁢(s)∈S0subscript0subscript0s_0(s)∈ S_0s0 ( s ) ∈ S0 with dℳ′,ℳ0⁢(s,s0⁢(s))<δsubscriptsuperscriptℳ′subscriptℳ0subscript0d_M ,M_0(s,s_0(s))< _M′ , M start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( s , s0 ( s ) ) < δ. Consider MDP ℳ1=(S′,A,R1,P1,γ)subscriptℳ1superscript′subscript1subscript1M_1=(S ,A,R_1,P_1,γ)M1 = ( S′ , A , R1 , P1 , γ ) defined by setting R1⁢(s)=R0⁢(s0⁢(s))subscript1subscript0subscript0R_1(s)=R_0(s_0(s))R1 ( s ) = R0 ( s0 ( s ) ) and P1⁢(s)=P0⁢(s0⁢(s))subscript1subscript0subscript0P_1(s)=P_0(s_0(s))P1 ( s ) = P0 ( s0 ( s ) ). It is straightforward to check that dℳ′,ℳ1⁢(s,s)<δsubscriptsuperscriptℳ′subscriptℳ1d_M ,M_1(s,s)< _M′ , M start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( s , s ) < δ. We can then apply Proposition B.4 to MDPs ℳ′M M′ and ℳ1subscriptℳ1M_1M1 to conclude that ℳ′M M′ is (N′,ε/2)superscript′2(N , /2)( N′ , ε / 2 )-safe for N′ controlled in terms of Ns⁢(ε,ℳ1)subscriptsubscriptℳ1N_s( ,M_1)Nitalic_s ( ε , M1 ). On the other hand, by construction there exists a bisimulation relation ∼similar-to ∼ on ℳ1subscriptℳ1M_1M1 with ℳ1/∼=ℳ0M_1/ =M_0M1 / ∼ = M0. By Proposition B.3 we have Ns⁢(ε,ℳ1)=Ns⁢(ε,ℳ)subscriptsubscriptℳ1subscriptℳN_s( ,M_1)=N_s( ,M)Nitalic_s ( ε , M1 ) = Nitalic_s ( ε , M ). This concludes the proof of the theorem. ∎ B.3. Proof of Theorem 5.10 We show that the “playing dead” example demonstrates that assumption that Ss⁢a⁢f⁢esubscriptS_safeSitalic_s a f e is isolated in Theorem 5.7 is necessary. We claim that the bisimulation distance d⁢(spd,sterm)subscriptpdsubscripttermd(s_pd,s_term)d ( spd , sterm ) in S′ is small. Recall that d is a fixed point of a map T⁢(d)⁢(spd,sterm)=maxa⁡(1−γ)⁢|R⁢(spd,a)−R⁢(sterm,a)|+γ⁢d⁢(P⁢(spd,a),P⁢(sterm,a)).subscriptpdsubscripttermsubscript1subscriptpdsubscripttermsubscriptsubscriptpdsubscripttermT(d)(s_pd,s_term)= _a\(1-γ)|R(s_pd,a)-R(% s_term,a)|+ _d(P(s_pd,a),P(s_term% ,a))\.T ( d ) ( spd , sterm ) = maxitalic_a ( 1 - γ ) | R ( spd , a ) - R ( sterm , a ) | + γ Witalic_d ( P ( spd , a ) , P ( sterm , a ) ) . We can then compute d⁢(sterm,spd)subscripttermsubscriptpd d(s_term,s_pd)d ( sterm , spd ) =T⁢(d)⁢(spd,sterm)=(1−γ)⁢δ+γ⁢d⁢(P⁢(spd,a0),P⁢(sterm,a0))absentsubscriptpdsubscriptterm1subscriptsubscriptpdsubscript0subscripttermsubscript0 =T(d)(s_pd,s_term)=(1-γ)δ+γ% W_d(P(s_pd,a_0),P(s_term,a_0))= T ( d ) ( spd , sterm ) = ( 1 - γ ) δ + γ Witalic_d ( P ( spd , a0 ) , P ( sterm , a0 ) ) ≤(1−γ)⁢δ+γ⁢((1−δ)⁢d⁢(sterm,spd)+δ⁢d⁢(spd,s0))absent11subscripttermsubscriptpdsubscriptpdsubscript0 ≤(1-γ)δ+γ ((1-δ)d(s_term,s_% pd)+δ d(s_pd,s_0) )≤ ( 1 - γ ) δ + γ ( ( 1 - δ ) d ( sterm , spd ) + δ d ( spd , s0 ) ) ≤δ+γ⁢(1−δ)⁢d⁢(sterm,spd).absent1subscripttermsubscriptpd ≤δ+γ(1-δ)d(s_term,s_pd).≤ δ + γ ( 1 - δ ) d ( sterm , spd ) . Here the second line follows by choosing lsi,sjsubscriptsubscriptsubscriptl_s_i,s_jlitalic_s start_POSTSUBSCRIPT i , sitalic_j end_POSTSUBSCRIPT with lst⁢e⁢r⁢m,s0=δsubscriptsubscriptsubscript0l_s_term,s_0= _s start_POSTSUBSCRIPT t e r m , s0 end_POSTSUBSCRIPT = δ, lst⁢e⁢r⁢m,sp⁢d=1−δsubscriptsubscriptsubscript1l_s_term,s_pd=1- _s start_POSTSUBSCRIPT t e r m , sitalic_p d end_POSTSUBSCRIPT = 1 - δ and all other entries 00; by definition of dsubscriptW_dWitalic_d we have d⁢(P⁢(spd,a),P⁢(sterm,a))≤∑lsi,sj⁢d⁢(si,sj)subscriptsubscriptpdsubscripttermsubscriptsubscriptsubscriptsubscriptsubscriptW_d(P(s_pd,a),P(s_term,a))≤Σ l_s_i,s_j% d(s_i,s_j)Witalic_d ( P ( spd , a ) , P ( sterm , a ) ) ≤ ∑ litalic_s start_POSTSUBSCRIPT i , sitalic_j end_POSTSUBSCRIPT d ( sitalic_i , sitalic_j ). We conclude that d⁢(spd,sterm)≤δ1−γ+γ⁢δ<εsubscriptpdsubscriptterm1d(s_pd,s_term)≤ δ1-γ+γδ< ( spd , sterm ) ≤ divide start_ARG δ end_ARG start_ARG 1 - γ + γ δ end_ARG < ε. Since d⁢(spd,sterm)subscriptpdsubscripttermd(s_pd,s_term)d ( spd , sterm ) is small, it is straightforward to check that the Hausdorff distance dH⁢(ℳ,ℳ′)≤O⁢(N⁢γ⁢δ)<O⁢(ε)subscriptℳsuperscriptℳ′d_H(M,M )≤ O(Nγδ)<O( )ditalic_H ( M , M′ ) ≤ O ( N γ δ ) < O ( ε ). On the other hand, all nearly optimal policies in ℳMM will transform to policies in ℳ′M M′ that visit S∖stermsubscripttermS s_termS ∖ sterm infinitely many times. Appendix C Case (2): LLM agents C.1. LLM construction An example of how an LLM can be used to construct a policy for an MDP is illustrated in figure 5. …ToYouIn<all other tokens>…<all tokens>cancopy…the‘/⋮…EndEndEnd…E⁢("User: How can I copy ‘/dir_1/file‘ to ‘/dir_2‘?"),I"User: How can I copy ‘/dir_1/file‘ to ‘/dir_2‘?"E ( 164.0pt"User: How can I copy `/dir\_1/file` to `/dir\_% 2`?" ),\,IE ( "User: How can I copy ‘/dir_1/file‘ to ‘/dir_2‘?" ) , IE⁢("User: <...> Bot: To"),I"User: <...> Bot: To"E ( 99.0pt"User: <...> n \, \, \, \, \, \, % Bot: To" ),\,IE ( "User: <...> Bot: To" ) , IE⁢("User: <...> Bot: You"),I"User: <...> Bot: You"E ( 99.0pt"User: <...> n \, \, \, \, \, \, % Bot: You" ),\,IE ( "User: <...> Bot: You" ) , IE⁢("User: <...> Bot: In"),I"User: <...> Bot: In"E ( 99.0pt"User: <...> n \, \, \, \, \, \, % Bot: In" ),\,IE ( "User: <...> Bot: In" ) , IE⁢("User: <...> Bot: You can"),I"User: <...> Bot: You can"E ( 99.0pt"User: <...> n \, \, \, \, \, \, % Bot: You can" ),\,IE ( "User: <...> Bot: You can" ) , IE⁢("User: <...> Bot: To copy"),I"User: <...> Bot: To copy"E ( 99.0pt"User: <...> n \, \, \, \, \, \, % Bot: To copy" ),\,IE ( "User: <...> Bot: To copy" ) , IE⁢("User: <...> Bot: To copy the"),I"User: <...> Bot: To copy the"E ( 112.0pt"User: <...> n \, \, \, \, \, \,% Bot: To copy the" ),\,IE ( "User: <...> Bot: To copy the" ) , IE⁢("User: <...> Bot: To copy ‘/"),I"User: <...> Bot: To copy ‘/"E ( 105.0pt"User: <...> n \, \, \, \, \, \,% Bot: To copy `/" ),\,IE ( "User: <...> Bot: To copy ‘/" ) , IE⁢("User: <...> Bot: To copy the file ‘/dir_1/file‘ to ‘/dir_2‘ <...>"),I"User: <...> Bot: To copy the file ‘/dir_1/file‘ to ‘/dir_2‘ <...>"E ( 222.0pt"User: <...> n \, \, \, \, \, \,% \, \, \, \, \, \, Bot: To copy the file `/dir\_1/file` to `/dir\_2` <...>"% ),\,IE ( "User: <...> Bot: To copy the file ‘/dir_1/file‘ to ‘/dir_2‘ <...>" ) , IE⁢("User: <...> Bot: To copy ‘/dir_1/file‘ to ‘/dir_2‘, you can use the ‘cp‘ command <...>"),I"User: <...> Bot: To copy ‘/dir_1/file‘ to ‘/dir_2‘, you can use the ‘cp‘ command <...>"E ( 165.0pt"User: <...> n \, \, \, \, \, \,% Bot: To copy `/dir\_1/file` to `/dir\_2`, you can use the `cp` command <...>"% ),\,IE ( "User: <...> Bot: To copy ‘/dir_1/file‘ to ‘/dir_2‘, you can use the ‘cp‘ command <...>" ) , IE⁢("User: <...> Bot: You can copy the file ‘/dir_1/file‘ to ‘/dir_2‘ using the ‘cp‘ command <...>"),I"User: <...> Bot: You can copy the file ‘/dir_1/file‘ to ‘/dir_2‘ using the ‘cp‘ command <...>"E ( 180.0pt"User: <...> n \, \, \, \, \, \,% Bot: You can copy the file `/dir\_1/file` to `/dir\_2` using the `cp` command% <...>" ),\,IE ( "User: <...> Bot: You can copy the file ‘/dir_1/file‘ to ‘/dir_2‘ using the ‘cp‘ command <...>" ) , IE⁢("User: <...> Bot: To copy the file ‘/dir_1/file‘ to ‘/dir_2‘ <...> User: Thanks!"),I′"User: <...> Bot: To copy the file ‘/dir_1/file‘ to ‘/dir_2‘ <...> User: Thanks!"superscript′E ( 222.0pt"User: <...> n \, \, \, \, \, \,% \, \, \, \, \, \, \, Bot: To copy the file `/dir\_1/file` to `/dir\_2` <...> % n \, \, \, \, \, \, \, \, \, \, \, \, User: Thanks!" ),% \,I E ( "User: <...> Bot: To copy the file ‘/dir_1/file‘ to ‘/dir_2‘ <...> User: Thanks!" ) , I′E⁢("User: <...> Bot: To copy ‘/dir_1/file‘ to ‘/dir_2‘, you can use the ‘cp‘ command <...> User: Thanks!"),I′"User: <...> Bot: To copy ‘/dir_1/file‘ to ‘/dir_2‘, you can use the ‘cp‘ command <...> User: Thanks!"superscript′E ( 165.0pt"User: <...> n \, \, \, \, \, \,% Bot: To copy `/dir\_1/file` to `/dir\_2`, you can use the `cp` command <...> % n \, \, \, \, \, \, User: Thanks!" ),\,I E ( "User: <...> Bot: To copy ‘/dir_1/file‘ to ‘/dir_2‘, you can use the ‘cp‘ command <...> User: Thanks!" ) , I′ ′E⁢("User: <...> Bot: To copy ‘/dir_1/file‘ to ‘/dir_2‘, you can use the ‘cp‘ command <...> User: I tried but <...>"),I′"User: <...> Bot: To copy ‘/dir_1/file‘ to ‘/dir_2‘, you can use the ‘cp‘ command <...> User: I tried but <...>"superscript′E ( 165.0pt"User: <...> n \, \, \, \, \, \,% Bot: To copy `/dir\_1/file` to `/dir\_2`, you can use the `cp` command <...> % n \, \, \, \, \, \, User: I tried but <...>" ),\,I^% E ( "User: <...> Bot: To copy ‘/dir_1/file‘ to ‘/dir_2‘, you can use the ‘cp‘ command <...> User: I tried but <...>" ) , I′ ′ ′ For original input "User: How can I copy ‘/dir_1/file‘ to ‘/dir_2‘?", the LLM agent starts at an MDP state which corresponds to the embedding E⁢(⋅)⋅E(·)E ( ⋅ ) of that input and which may include other information I on the current world state. and takes the agent to the state corresponding to the embedding of the new chat history, with the other information I unchanged. (“<...>” is text omitted for length.) …until it emits the End token, which prints out the full response to the user. Figure 5. Example of constructing an MDP policy from an LLM. Note that the transitions are split into those depending deterministically on the agent’s actions (red) and those depending entirely on the environment (blue). C.2. Proofs toward Theorem 6.5 First we prove points (1) and (2). (1) πsubscriptS_πSitalic_π is not necessarily upper semicontinuous, because, as in 5.4.2, an arbitrarily small change in transition probabilities may suddenly increase the probability of reaching SsafesubscriptsafeS_safeSsafe—e.g. for a recurrent state sbadsubscriptbads_badsbad, the probability of transition sbad→Ssafe→subscriptbadsubscriptsafes_bad→ S_safesbad → Ssafe changing from 0 to ϵitalic-ϵεϵ redirects all the end-state probability mass from sbadsubscriptbads_badsbad to SsafesubscriptsafeS_safeSsafe. (2) Note that (3) implies (2) because for any η>00η>0η > 0, we may take a δ<ϵ,ηB⁢(ℳ)italic-ϵℳδ<ε, ηB(M)δ < ϵ , divide start_ARG η end_ARG start_ARG B ( M ) end_ARG; then ‖δ⁢ℳ‖1<δ⟹−δ⁢πηB⁢(ℳ)<−δ⁢π‖δ⁢ℳ‖1<B⁢(ℳ)subscriptnormℳ1subscriptℳsubscriptsubscriptnormℳ1ℳ\| \|_1<δ - _π % ηB(M)<- _π\| \|_% 1<B(M)∥ δ M ∥1 < δ ⟹ - divide start_ARG δ Sitalic_π end_ARG start_ARG divide start_ARG η end_ARG start_ARG B ( M ) end_ARG end_ARG < - divide start_ARG δ Sitalic_π end_ARG start_ARG ∥ δ M ∥1 end_ARG < B ( M ), so δ⁢π>−ηsubscript _π>-ηδ Sitalic_π > - η, as desired. Thus it suffices to prove (3). (3) Our proof of (3) will follow this overall structure: Proof sketch. The end behavior of ℳMM, and hence the value of πsubscriptS_πSitalic_π, depends on high powers of the transition matrix: PNsuperscriptP^NPitalic_N as N→∞→N→∞N → ∞. To determine the change in the end behavior, then, we do the following: (1) Bound the change in each element of the transition matrix P (Lemma C.1). (2) Bound ‖δ⁢P‖1subscriptnorm1\|δ P\|_1∥ δ P ∥1, the sum of the absolute values of the changes in the transition matrix, by ‖δ⁢ℳ‖1subscriptnormℳ1\| \|_1∥ δ M ∥1 (Corollary C.2). (3) Express πsubscriptS_πSitalic_π in terms of P (Lemma C.3). (4) Combine these results to obtain a bound on |δ⁢π|‖δ⁢ℳ‖1subscriptsubscriptnormℳ1 | _π|\| \|_1divide start_ARG | δ Sitalic_π | end_ARG start_ARG ∥ δ M ∥1 end_ARG in the case where StranssubscripttransS_transStrans, the states from which SsafesubscriptsafeS_safeSsafe is reachable, is not changed by δ⁢ℳ δ M (Lemma C.4). (5) Show that sufficiently small changes in ℳMM cannot decrease |Strans|subscripttrans|S_trans|| Strans | (Lemma C.5). (6) Take the limit as ‖δ⁢ℳ‖1→0→subscriptnormℳ10\| \|_1→ 0∥ δ M ∥1 → 0 to get a bound on the local rate of decrease, handling the case where StranssubscripttransS_transStrans is changed by δ⁢ℳ δ M. ∎ Now we advance to the full proof. First, we bound the change in transition probabilities δ⁢Pδ Pδ P due to both the change in the environmental transition function (i.e. the probabilities of navigating from one state to another given a certain action) and the changes in the policy π from the state location changes (i.e. the probabilities of selecting a certain action given a current state). Lemma C.1. |δ⁢Pi⁢j|≤12⁢‖∇π⁢(si)‖1⁢|δ⁢si|+∑a|δ⁢P⁢(si,a,sj)|subscript12subscriptnorm∇subscript1subscriptsubscriptsubscriptsubscript|δ P_ij|≤ 12\|∇π(s_i)\|_1|δ s_i|+ _a|% δ P(s_i,a,s_j)|| δ Pitalic_i j | ≤ divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ ∇ π ( sitalic_i ) ∥1 | δ sitalic_i | + ∑a | δ P ( sitalic_i , a , sitalic_j ) |. Proof. δ⁢Pδ Pδ P depends on δ⁢ℳ δ M both via changes to the transition probabilities directly and how the changes to the state locations affect the actions chosen by π. |δ⁢Pi⁢j|=subscriptabsent |δ P_ij|=| δ Pitalic_i j | = |∑aπ⁢(si+δ⁢si,a)⁢(P⁢(si,a,sj)+δ⁢P⁢(si,a,sj))−π⁢(si,a)⁢P⁢(si,a,sj)|subscriptsubscriptsubscriptsubscriptsubscriptsubscriptsubscriptsubscriptsubscriptsubscript | _aπ(s_i+δ s_i,a)(P(s_i,a,s_j)+δ P% (s_i,a,s_j))-π(s_i,a)P(s_i,a,s_j) || ∑a π ( sitalic_i + δ sitalic_i , a ) ( P ( sitalic_i , a , sitalic_j ) + δ P ( sitalic_i , a , sitalic_j ) ) - π ( sitalic_i , a ) P ( sitalic_i , a , sitalic_j ) | = == |∑aπ⁢(si,a)⁢P⁢(si,a,sj)+|⁢δ⁢si|(∇δ⁢siπ⁢(si,a))⁢P⁢(si,a,sj)+π⁢(si,a)⁢δ⁢P⁢(si,a,sj)conditionallimit-fromsubscriptsubscriptsubscriptsubscriptsubscriptsubscript∇subscriptsubscriptsubscriptsubscriptsubscriptsubscriptsubscript | _aπ(s_i,a)P(s_i,a,s_j)+|δ s_i|(∇% _δ s_iπ(s_i,a))P(s_i,a,s_j)+π(s_i,a)δ P(s_i,a,s_j)| ∑a π ( sitalic_i , a ) P ( sitalic_i , a , sitalic_j ) + | δ sitalic_i | ( ∇δ s start_POSTSUBSCRIPT i end_POSTSUBSCRIPT π ( sitalic_i , a ) ) P ( sitalic_i , a , sitalic_j ) + π ( sitalic_i , a ) δ P ( sitalic_i , a , sitalic_j ) +|δsi|(∇δ⁢siπ(si,a))δP(si,a,sj)−π(si,a)P(si,a,sj)| +|δ s_i|( _δ s_iπ(s_i,a))δ P(s_i,% a,s_j)-π(s_i,a)P(s_i,a,s_j) |+ | δ sitalic_i | ( ∇δ s start_POSTSUBSCRIPT i end_POSTSUBSCRIPT π ( sitalic_i , a ) ) δ P ( sitalic_i , a , sitalic_j ) - π ( sitalic_i , a ) P ( sitalic_i , a , sitalic_j ) | = == |∑a|⁢δ⁢si⁢|(∇δ⁢siπ⁢(si,a))⁢P⁢(si,a,sj)+π⁢(si,a)⁢δ⁢P⁢(si,a,sj)+|⁢δ⁢si⁢|(∇δ⁢siπ⁢(si,a))⁢δ⁢P⁢(si,a,sj)|subscriptsubscriptsubscript∇subscriptsubscriptsubscriptsubscriptlimit-fromsubscriptsubscriptsubscriptsubscriptsubscript∇subscriptsubscriptsubscriptsubscript | _a|δ s_i|( _δ s_iπ(s_i,a))P% (s_i,a,s_j)+π(s_i,a)δ P(s_i,a,s_j)+|δ s_i|( _% δ s_iπ(s_i,a))δ P(s_i,a,s_j) || ∑a | δ sitalic_i | ( ∇δ s start_POSTSUBSCRIPT i end_POSTSUBSCRIPT π ( sitalic_i , a ) ) P ( sitalic_i , a , sitalic_j ) + π ( sitalic_i , a ) δ P ( sitalic_i , a , sitalic_j ) + | δ sitalic_i | ( ∇δ s start_POSTSUBSCRIPT i end_POSTSUBSCRIPT π ( sitalic_i , a ) ) δ P ( sitalic_i , a , sitalic_j ) | = == |∑a|⁢δ⁢si⁢|(∇δ⁢siπ⁢(si,a))⁢P⁢(si,a,sj)+π⁢(si,a)⁢δ⁢P⁢(si,a,sj)|removing the δ2 term.subscriptsubscriptsubscript∇subscriptsubscriptsubscriptsubscriptsubscriptsubscriptsubscriptremoving the δ2 term. | _a|δ s_i|( _δ s_iπ(s_i,a))P% (s_i,a,s_j)+π(s_i,a)δ P(s_i,a,s_j) | 10.00002pt% removing the $δ^2$ term.| ∑a | δ sitalic_i | ( ∇δ s start_POSTSUBSCRIPT i end_POSTSUBSCRIPT π ( sitalic_i , a ) ) P ( sitalic_i , a , sitalic_j ) + π ( sitalic_i , a ) δ P ( sitalic_i , a , sitalic_j ) | removing the δ2 term. Note that |δ⁢si|⁢∇δ⁢siπ⁢(si,a)subscriptsubscript∇subscriptsubscript|δ s_i| _δ s_iπ(s_i,a)| δ sitalic_i | ∇δ s start_POSTSUBSCRIPT i end_POSTSUBSCRIPT π ( sitalic_i , a ) is the change in the probability that π assigns to action a when sisubscripts_isitalic_i moves by δ⁢sisubscriptδ s_iδ sitalic_i. In other words, this is the contribution from policy change to the change in the transition matrix. Since the sum of π over all actions must always equal 1, the sum of these changes is 0. 0≤P⁢(si,a,sj)≤10subscriptsubscript10≤ P(s_i,a,s_j)≤ 10 ≤ P ( sitalic_i , a , sitalic_j ) ≤ 1, so the sum is maximized if P⁢(si,a,sj)subscriptsubscriptP(s_i,a,s_j)P ( sitalic_i , a , sitalic_j ) is 0 whenever the change is negative and 1 whenever it is positive. Since the sum of all the positive changes is half the sum of the absolute value of the changes (again since the changes overall sum to 0), this yields ≤ ≤ 12⁢∑a|δ⁢si|⁢|∇δ⁢siπ⁢(si,a)|+|∑aπ⁢(si,a)⁢δ⁢P⁢(si,a,sj)|.12subscriptsubscriptsubscript∇subscriptsubscriptsubscriptsubscriptsubscriptsubscript 12 _a|δ s_i|| _δ s_iπ(s_i,% a)|+ | _aπ(s_i,a)δ P(s_i,a,s_j) |.divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑a | δ sitalic_i | | ∇δ s start_POSTSUBSCRIPT i end_POSTSUBSCRIPT π ( sitalic_i , a ) | + | ∑a π ( sitalic_i , a ) δ P ( sitalic_i , a , sitalic_j ) | . For the first term, note that ∑a|∇δsiπ⁢(si,a)|subscriptsubscript∇subscriptsubscriptsubscript _a| _ _s_iπ(s_i,a)|∑a | ∇δ start_POSTSUBSCRIPT s start_POSTSUBSCRIPT i end_POSTSUBSCRIPT end_POSTSUBSCRIPT π ( sitalic_i , a ) | is precisely ‖∇δsiπ⁢(si)‖1subscriptnormsubscript∇subscriptsubscriptsubscript1\| _ _s_iπ(s_i)\|_1∥ ∇δ start_POSTSUBSCRIPT s start_POSTSUBSCRIPT i end_POSTSUBSCRIPT end_POSTSUBSCRIPT π ( sitalic_i ) ∥1. For the second term, note that δ⁢P⁢(si,a,sj)subscriptsubscriptδ P(s_i,a,s_j)δ P ( sitalic_i , a , sitalic_j ) is the change in the probability of transitioning from sisubscripts_isitalic_i to sjsubscripts_jsitalic_j conditional on selecting action a. Since π⁢(si,a)subscriptπ(s_i,a)π ( sitalic_i , a ) is the probability of selecting action a, the second sum is the environmental contribution to the change in the transition matrix. Since π is positive and the sum over all actions of π⁢(si,a)subscriptπ(s_i,a)π ( sitalic_i , a ) is 1, we obtain ≤ ≤ 12⁢‖∇π⁢(si)‖1⁢|δ⁢si|+∑a|δ⁢P⁢(si,a,sj)|.12subscriptnorm∇subscript1subscriptsubscriptsubscriptsubscript 12\|∇π(s_i)\|_1|δ s_i|+ _a|δ P% (s_i,a,s_j)|.divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ ∇ π ( sitalic_i ) ∥1 | δ sitalic_i | + ∑a | δ P ( sitalic_i , a , sitalic_j ) | . ∎ Corollary C.2. ‖δ⁢P‖1≤‖δ⁢ℳ‖1.subscriptnorm1subscriptnormℳ1\|δ P\|_1≤\| \|_1.∥ δ P ∥1 ≤ ∥ δ M ∥1 . Proof. Recall that ‖δ⁢ℳ‖1=12⋅|S|⋅b⋅‖δ⁢S‖1+‖δ⁢T‖1subscriptnormℳ1⋅12subscriptnorm1subscriptnorm1\| \|_1= 12·|S|· b·\|δ S\|_1+\|% δ T\|_1∥ δ M ∥1 = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ⋅ | S | ⋅ b ⋅ ∥ δ S ∥1 + ∥ δ T ∥1. Note that, by definition, ‖∇π⁢(si)‖1≤bsubscriptnorm∇subscript1\|∇π(s_i)\|_1≤ b∥ ∇ π ( sitalic_i ) ∥1 ≤ b. Sum over i,ji,ji , j: ‖δ⁢P‖1=∑i,j|δ⁢Pi⁢j|≤∑i,j12⁢‖∇π⁢(si)‖1⁢|δ⁢si|+∑i,j,a|δ⁢P⁢(si,a,sj)|≤‖δ⁢ℳ‖1.subscriptnorm1subscriptsubscriptsubscript12subscriptnorm∇subscript1subscriptsubscriptsubscriptsubscriptsubscriptnormℳ1\|δ P\|_1= _i,j|δ P_ij|≤ _i,j 12\|∇π% (s_i)\|_1|δ s_i|+ _i,j,a|δ P(s_i,a,s_j)|≤\|δ% M\|_1.∥ δ P ∥1 = ∑i , j | δ Pitalic_i j | ≤ ∑i , j divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ ∇ π ( sitalic_i ) ∥1 | δ sitalic_i | + ∑i , j , a | δ P ( sitalic_i , a , sitalic_j ) | ≤ ∥ δ M ∥1 . ∎ In fact, was the reason we defined ‖δ⁢ℳ‖1subscriptnormℳ1\| \|_1∥ δ M ∥1 the way we did—the system’s behavior is fully described by the transition matrix given by combining ℳMM’s transition function with a policy, so the size of the change in this transition matrix is a good candidate for a metric on ℳ MM. Lemma C.3. π=Δ⋅∑t=0∞(P⁢I)t⁢P⁢v.subscript⋅Δsuperscriptsubscript0superscriptsubscriptsubscriptS_π= · _t=0^∞(PI_trans)^tPv_% safe.Sitalic_π = Δ ⋅ ∑t = 0∞ ( P Itrans )t P vsafe . Proof. Let Itrans:=d⁢i⁢a⁢g⁢(si∈Strans)assignsubscripttranssubscriptdouble-struck-subscriptsubscripttransI_trans:=diag(1_s_i∈ S_trans)Itrans := d i a g ( blackboard_s start_POSTSUBSCRIPT i ∈ Strans end_POSTSUBSCRIPT ) be the matrix that acts as the identity on StranssubscripttransS_transStrans, the states of P from which SsafesubscriptsafeS_safeSsafe is reachable, and which zeroes out all the recurrent states. Let vsafe:=si∈Sr⁢e⁢cassignsubscriptsafesubscriptdouble-struck-subscriptsubscriptv_safe:=1_s_i∈ S_recvsafe := blackboard_s start_POSTSUBSCRIPT i ∈ Sitalic_r e c end_POSTSUBSCRIPT be the vector with ones at every safe state and zeroes everywhere else. Then, Δ⋅(P⁢Itrans)t⋅Δsuperscriptsubscripttrans ·(PI_trans)^tΔ ⋅ ( P Itrans )t is the row vector corresponding to the probability mass in StranssubscripttransS_transStrans at step t, noting that we are free to apply the projection ItranssubscripttransI_transItrans at every step rather than just at the end because no probability mass can ever return to these states once it has left (by construction). For convenience, we define Ptrans=P⁢ItranssubscripttranssubscripttransP_trans=PI_transPtrans = P Itrans. Right-multiplying this by P yields the distribution of the probability mass that was still in StranssubscripttransS_transStrans at step t after one more step. Finally, taking the inner product with vsafesubscriptsafev_safevsafe sums up the portion of this distribution that is now in SsafesubscriptsafeS_safeSsafe: so each Δ⋅(Ptrans)t⁢P⁢vsafe⋅Δsuperscriptsubscripttranssubscriptsafe ·(P_trans)^tPv_safeΔ ⋅ ( Ptrans )t P vsafe is the amount of probability mass that moves into SsafesubscriptsafeS_safeSsafe at step t+11t+1t + 1. So, we can sum over all t to get the total probability mass that moves into SsafesubscriptsafeS_safeSsafe. ∎ Now we bound the change in πsubscriptS_πSitalic_π in terms of the change in P. Lemma C.4. If δ⁢ℳ δ M does not change SsubscriptS_transStrans (i.e. for every state, it does not change whether SsubscriptS_safeSsafe is reachable from that state), |δ⁢π|‖δ⁢ℳ‖1≤(1−λ1)−1⁢(1+(1−λ1)−1)⁢|S|,subscriptsubscriptnormℳ1superscript1subscript111superscript1subscript11subscript | _π|\| \|_1≤(1- _1)% ^-1 (1+(1- _1)^-1 )|S_safe|,divide start_ARG | δ Sitalic_π | end_ARG start_ARG ∥ δ M ∥1 end_ARG ≤ ( 1 - λ1 )- 1 ( 1 + ( 1 - λ1 )- 1 ) | Ssafe | , where λ1subscript1 _1λ1 is the largest eigenvalue of PsubscriptP_transPtrans. Proof. We express δ⁢πsubscript _πδ Sitalic_π in terms of δ⁢Pδ Pδ P and δ⁢Ptranssubscripttransδ P_transδ Ptrans; the latter is well-defined because δ⁢ℳ δ M does not change StranssubscripttransS_transStrans. δ⁢π=subscriptabsent _π=δ Sitalic_π = Δ⋅∑t=0∞(Ptrans+δ⁢Ptrans)t⁢(P+δ⁢P)⁢vsafe−Δ⋅∑t=0∞Ptranst⁢P⁢vsafe⋅Δsuperscriptsubscript0superscriptsubscripttranssubscripttranssubscriptsafe⋅Δsuperscriptsubscript0superscriptsubscripttranssubscriptsafe · _t=0^∞(P_trans+δ P_% trans)^t(P+δ P)v_safe- · _t=0^∞P_ % trans^tPv_safeΔ ⋅ ∑t = 0∞ ( Ptrans + δ Ptrans )t ( P + δ P ) vsafe - Δ ⋅ ∑t = 0∞ Ptransitalic_t P vsafe = == Δ⋅(∑t=0∞(Ptrans+δ⁢Ptrans)t⁢(P+δ⁢P)−∑t=0∞Ptranst⁢P)⁢vsafe⋅Δsuperscriptsubscript0superscriptsubscripttranssubscripttranssuperscriptsubscript0superscriptsubscripttranssubscriptsafe · ( _t=0^∞(P_trans+δ P_% trans)^t(P+δ P)- _t=0^∞P_trans^tP )% v_safeΔ ⋅ ( ∑t = 0∞ ( Ptrans + δ Ptrans )t ( P + δ P ) - ∑t = 0∞ Ptransitalic_t P ) vsafe We expand to first order in δ⁢Pδ Pδ P using the binomial theorem to obtain = == Δ⋅(∑t=0∞(Ptranst+t⁢Ptranst−1⁢δ⁢Ptrans)⁢(P+δ⁢P)−∑t=0∞Ptranst⁢P)⁢vsafe⋅Δsuperscriptsubscript0superscriptsubscripttranssuperscriptsubscripttrans1subscripttranssuperscriptsubscript0superscriptsubscripttranssubscriptsafe · ( _t=0^∞(P_trans^t+tP_% trans^t-1δ P_trans)(P+δ P)- _t=0^∞P_% trans^tP )v_safeΔ ⋅ ( ∑t = 0∞ ( Ptransitalic_t + t Ptransitalic_t - 1 δ Ptrans ) ( P + δ P ) - ∑t = 0∞ Ptransitalic_t P ) vsafe and expand and once again remove a higher-order δ⁢Pδ Pδ P term to get = == Δ⋅(∑t=0∞Ptranst⁢P+Ptranst⁢δ⁢P+t⁢Ptranst−1⁢δ⁢Ptrans⁢P−∑t=0∞Ptranst⁢P)⁢vsafe⋅Δsuperscriptsubscript0superscriptsubscripttranssuperscriptsubscripttranssuperscriptsubscripttrans1subscripttranssuperscriptsubscript0superscriptsubscripttranssubscriptsafe · ( _t=0^∞P_trans^tP+P_% trans^tδ P+tP_trans^t-1δ P_transP-Σ% _t=0^∞P_trans^tP )v_safeΔ ⋅ ( ∑t = 0∞ Ptransitalic_t P + Ptransitalic_t δ P + t Ptransitalic_t - 1 δ Ptrans P - ∑t = 0∞ Ptransitalic_t P ) vsafe = == Δ⋅(∑t=0∞Ptranst⁢δ⁢P+t⁢Ptranst−1⁢δ⁢Ptrans⁢P)⁢vsafe⋅Δsuperscriptsubscript0superscriptsubscripttranssuperscriptsubscripttrans1subscripttranssubscriptsafe · ( _t=0^∞P_trans^tδ P+% tP_trans^t-1δ P_transP )v_safeΔ ⋅ ( ∑t = 0∞ Ptransitalic_t δ P + t Ptransitalic_t - 1 δ Ptrans P ) vsafe = == Δ⋅((∑t=0∞Ptranst)⁢δ⁢P+(∑t=0∞t⁢Ptranst−1)⁢δ⁢Ptrans⁢P)⁢vsafe⋅Δsuperscriptsubscript0superscriptsubscripttranssuperscriptsubscript0superscriptsubscripttrans1subscripttranssubscriptsafe · ( ( _t=0^∞P_trans^t% )δ P+ ( _t=0^∞tP_trans^t-1 )δ P% _transP )v_safeΔ ⋅ ( ( ∑t = 0∞ Ptransitalic_t ) δ P + ( ∑t = 0∞ t Ptransitalic_t - 1 ) δ Ptrans P ) vsafe Recall that a geometric series of matrix powers ∑t=0∞Atsuperscriptsubscript0superscript _t=0^∞A^t∑t = 0∞ Aitalic_t converges to (1−A)−1superscript11(1-A)^-1( 1 - A )- 1 if and only if all the eigenvalues of the matrix are strictly less than 1. This is indeed the case for PtranssubscripttransP_transPtrans, since only recurrent states have eigenvalue 1. Also recall that ∑t=1∞t⁢At−1=(1−A)−2superscriptsubscript1superscript1superscript12 _t=1^∞tA^t-1=(1-A)^-2∑t = 1∞ t Aitalic_t - 1 = ( 1 - A )- 2 since (1−A)⁢∑t=1∞t⁢At−1=∑t=1∞t⁢At−1−∑t=0∞t⁢At=∑t=0∞(t+1)⁢At−t⁢At=∑t=0∞At1superscriptsubscript1superscript1superscriptsubscript1superscript1superscriptsubscript0superscriptsuperscriptsubscript01superscriptsuperscriptsuperscriptsubscript0superscript(1-A) _t=1^∞tA^t-1= _t=1^∞tA^t-1- _t=0^% ∞tA^t= _t=0^∞(t+1)A^t-tA^t= _t=0^∞A^t( 1 - A ) ∑t = 1∞ t Aitalic_t - 1 = ∑t = 1∞ t Aitalic_t - 1 - ∑t = 0∞ t Aitalic_t = ∑t = 0∞ ( t + 1 ) Aitalic_t - t Aitalic_t = ∑t = 0∞ Aitalic_t. δ⁢π=subscriptabsent _π=δ Sitalic_π = Δ⋅((1−Ptrans)−1⁢δ⁢P+(1−Ptrans)−2⁢δ⁢P⁢Itrans⁢P)⁢vsafe⋅Δsuperscript1subscripttrans1superscript1subscripttrans2subscripttranssubscriptsafe · ( (1-P_trans )^-1δ P+% (1-P_trans )^-2δ PI_transP )v_% safeΔ ⋅ ( ( 1 - Ptrans )- 1 δ P + ( 1 - Ptrans )- 2 δ P Itrans P ) vsafe We will bound both of these terms. Note that the eigenvalues of (1−Ptrans)−1superscript1subscripttrans1(1-P_trans)^-1( 1 - Ptrans )- 1 are 11−λi11subscript 11- _idivide start_ARG 1 end_ARG start_ARG 1 - λitalic_i end_ARG where λisubscript _iλitalic_i are the eigenvalues of PtranssubscripttransP_transPtrans. Since 11−λ11 11-λdivide start_ARG 1 end_ARG start_ARG 1 - λ end_ARG is monotonically increasing in λ for λ<11λ<1λ < 1, the largest eigenvalue of (1−Ptrans)−1superscript1subscripttrans1(1-P_trans)^-1( 1 - Ptrans )- 1 is 11−λ111subscript1 11- _1divide start_ARG 1 end_ARG start_ARG 1 - λ1 end_ARG, where λ1subscript1 _1λ1 is the largest eigenvalue of PtranssubscripttransP_transPtrans. Then, |δ⁢π|≤subscriptabsent | _π|≤| δ Sitalic_π | ≤ Δ⋅((1−λ1)−1⁢‖δ⁢P‖1+(1−λ1)−2⁢‖δ⁢P‖1⁢Itrans⁢P)⁢vsafe⋅Δsuperscript1subscript11subscriptnorm1superscript1subscript12subscriptnorm1subscripttranssubscriptsafe · ((1- _1)^-1\|δ P\|_1+(1- _% 1)^-2\|δ P\|_1I_transP )v_safeΔ ⋅ ( ( 1 - λ1 )- 1 ∥ δ P ∥1 + ( 1 - λ1 )- 2 ∥ δ P ∥1 Itrans P ) vsafe ≤ ≤ |Δ|⁢((1−λ1)−1⁢‖δ⁢P‖1+(1−λ1)−2⁢‖δ⁢P‖1)⁢|vsafe|Δsuperscript1subscript11subscriptnorm1superscript1subscript12subscriptnorm1subscriptsafe | | ((1- _1)^-1\|δ P\|_1+(1- _1% )^-2\|δ P\|_1 )|v_safe|| Δ | ( ( 1 - λ1 )- 1 ∥ δ P ∥1 + ( 1 - λ1 )- 2 ∥ δ P ∥1 ) | vsafe | ≤ ≤ (1−λ1)−1⁢(1+(1−λ1)−1)⁢|Ssafe|⁢‖δ⁢ℳ‖1.superscript1subscript111superscript1subscript11subscriptsafesubscriptnormℳ1 (1- _1)^-1 (1+(1- _1)^-1 )|S_% safe|\| \|_1.( 1 - λ1 )- 1 ( 1 + ( 1 - λ1 )- 1 ) | Ssafe | ∥ δ M ∥1 . Dividing by ‖δ⁢ℳ‖1subscriptnormℳ1\| \|_1∥ δ M ∥1, we obtain the desired inequality. ∎ Now consider the case where StranssubscripttransS_transStrans is changed by δ⁢Mδ Mδ M. Lemma C.5. For each ℳMM, ∃ϵ>0italic-ϵ0∃\,ε>0∃ ϵ > 0 with ‖δ⁢ℳ‖1<ϵ⟹S⊂S′subscriptnormℳ1italic-ϵsubscriptsuperscriptsubscriptnormal-′\| \|_1<ε S_trans⊂ S_% trans ∥ δ M ∥1 < ϵ ⟹ Strans ⊂ Strans′. Proof. Consider a state s from which SsafesubscriptsafeS_safeSsafe was reachable in ℳMM. Let PssubscriptP_sPitalic_s be the probability that an agent eventually transitions to SsafesubscriptsafeS_safeSsafe from s. By continuity of PssubscriptP_sPitalic_s, for ‖δ⁢ℳ‖1<ϵssubscriptnormℳ1subscriptitalic-ϵ\| \|_1< _s∥ δ M ∥1 < ϵitalic_s small enough, PssubscriptP_sPitalic_s will still be positive in ℳ′M M′. Taking the minimum ϵssubscriptitalic-ϵ _sϵitalic_s over all s the result follows.555The requirement that ℳMM and ℳ′M M′ be close enough that there are no nonzero transitions of ℳMM which become 0 in ℳ′M M′ suggests the possibility of using KL divergence to compare the transition functions, since DK⁢L⁢(P∥Q)subscriptconditionalD_KL(P\|Q)Ditalic_K L ( P ∥ Q ) is infinite if Q lacks support somewhere P does have support. This may be interesting to apply to scenarios like RLHF on a base model, where the KL divergence is estimated and constrained during training. Note that Strans′ubscripttrans′S_trans Strans′ may strictly contain StranssubscripttransS_transStrans: a small perturbation may cause SsafesubscriptsafeS_safeSsafe to be reachable from a state s, e.g. by adding a transition directly from s to SsafesubscriptsafeS_safeSsafe. ∎ Finally, we may conclude the proof of lower semicontinuity with bounded rate of decrease by combining Lemma C.4 with Lemma C.5. Note that if Strans′>Stranssubscriptsuperscript′transsubscripttransS _trans>S_transS′trans > Strans, Lemma C.4 still provides a lower bound on the safety: if s′∈Strans′∖Stranssuperscript′subscriptsuperscript′transsubscripttranss ∈ S _trans S_transs′ ∈ S′trans ∖ Strans, then Ptrans+δ⁢PtranssubscripttranssubscripttransP_trans+δ P_transPtrans + δ Ptrans to excludes entries corresponding to s′. Thus, any probability flow that goes through s′ is not included in the bound on safety. Therefore, the true value of πsubscriptS_πSitalic_π accounting for s′ can only increase from the lower bound from C.4. Therefore, we can use Lemma C.5 to obtain ϵ>0italic-ϵ0ε>0ϵ > 0 such that ‖δ⁢ℳ‖1<ϵ⟹−δ⁢π‖δ⁢ℳ‖1≤(1−λ1)−1⁢(1+(1−λ1)−1)⁢|Ssafe|.subscriptnormℳ1italic-ϵsubscriptsubscriptnormℳ1superscript1subscript111superscript1subscript11subscriptsafe\| \|_1<ε - _π\|% \|_1≤(1- _1)^-1 (1+(1- _1)^-1% )|S_safe|.∥ δ M ∥1 < ϵ ⟹ - divide start_ARG δ Sitalic_π end_ARG start_ARG ∥ δ M ∥1 end_ARG ≤ ( 1 - λ1 )- 1 ( 1 + ( 1 - λ1 )- 1 ) | Ssafe | . C.3. Proof of Proposition 6.6 Proof. By Lemma C.3, π=Δ⋅∑t=0∞(P⁢Itrans)t⁢P⁢vsafesubscript⋅Δsuperscriptsubscript0superscriptsubscripttranssubscriptsafeS_π= · _t=0^∞(PI_trans)^tPv_% safeSitalic_π = Δ ⋅ ∑t = 0∞ ( P Itrans )t P vsafe. Since πsubscriptS_πSitalic_π is a linear function of Δ Δ, it is continuous in Δ Δ; in fact, it is uniformly continuous: |π′−π|=superscriptsubscript′subscriptabsent |S_π -S_π|=| Sitalic_π′ - Sitalic_π | = |(Δ′−Δ)⋅∑t=0∞(P⁢Itrans)t⁢P⁢vsafe|⋅superscriptΔ′Δsuperscriptsubscript0superscriptsubscripttranssubscriptsafe |( - )· _t=0^∞(PI_% trans)^tPv_safe || ( Δ′ - Δ ) ⋅ ∑t = 0∞ ( P Itrans )t P vsafe | ≤ ≤ ‖δ⁢Δ‖2⁢‖∑t=0∞(P⁢Itrans)t⁢P⁢vsafe‖2subscriptnormΔ2subscriptnormsuperscriptsubscript0superscriptsubscripttranssubscriptsafe2 \|δ \|_2 \| _t=0^∞(PI_trans)% ^tPv_safe \|_2∥ δ Δ ∥2 ∥ ∑t = 0∞ ( P Itrans )t P vsafe ∥2 ≤ ≤ ‖δ⁢Δ‖2,subscriptnormΔ2 \|δ \|_2,∥ δ Δ ∥2 , which is independent of ℳMM. (This also holds for ∥1\|\|_1∥ ∥1 and other reasonable norms for δ⁢Δδ δ Δ.) ∎ Appendix D Discussion: Lower hemicontinuity D.1. Definition Hemicontinuity generalizes semicontinuity to set-valued functions. A lower semicontinuous function must increase only gradually (but may suddenly decrease); a lower hemicontinuous function must move from a point only gradually (but may suddenly add new points). Definition D.1. [Aliprantis and Border, 2007]. A function Γ:A→2Bnormal-:normal-Γnormal-→superscript2 :A→ 2^BΓ : A → 2B is lower hemicontinuous at a if for any open V⊆2Bsuperscript2V 2^BV ⊆ 2B with Γ⁢(a)∩V≠∅normal-Γ (a)∩ V≠ Γ ( a ) ∩ V ≠ ∅ there is a neighbourhood U of a so that x∈U⟹Γ⁢(x)∩V≠∅normal-Γx∈ U (x)∩ V≠ ∈ U ⟹ Γ ( x ) ∩ V ≠ ∅. The image of ℳMM under the functions ℱδsubscriptℱF_δFitalic_δ and ℱπ′subscriptℱ′F_π Fitalic_π′ below represents ℳMM’s region of safety. Lower hemicontinuity means that, as ℳMM changes, the safe region can recede only gradually, but may suddenly expand to new areas. D.2. ℱδsubscriptℱF_δFitalic_δ (defined in section 7.1, item 1) is lower hemicontinuous Let ℳMM be an MDP and V⊆ℝ≥0×[0,1]subscriptℝabsent001V _≥ 0×[0,1]V ⊆ ℝ≥ 0 × [ 0 , 1 ] be an open set with ℱδ⁢(ℳ)∩V≠∅subscriptℱℳF_δ(M)∩ V≠ _δ ( M ) ∩ V ≠ ∅. Let (N,ε)∈ℱδ⁢(ℳ)∩Vsubscriptℱℳ(N, ) _δ(M)∩ V( N , ε ) ∈ Fitalic_δ ( M ) ∩ V. Since V is an open set, there are some η1,η2>0subscript1subscript20 _1, _2>0η1 , η2 > 0 so that (N+η2,ε−η1)∈Vsubscript2subscript1(N+ _2, - _1)∈ V( N + η2 , ε - η1 ) ∈ V. Now, let U be the ball centered at ℳMM with radius δ⁢(η1,η2)subscript1subscript2δ( _1, _2)δ ( η1 , η2 ) as in remark 5.9. Then we apply theorem 5.7 to find that every ℳ′∈Usuperscriptℳ′M ∈ UM′ ∈ U is (N+η2,ε−η1)subscript2subscript1(N+ _2, - _1)( N + η2 , ε - η1 )-safe, i.e. (N+η2,ε−η1)∈ℱδ⁢(ℳ′)subscript2subscript1subscriptℱsuperscriptℳ′(N+ _2, - _1) _δ(M )( N + η2 , ε - η1 ) ∈ Fitalic_δ ( M′ ); thus, ℱδ⁢(ℳ′)∩V≠∅subscriptℱsuperscriptℳ′F_δ(M )∩ V≠ _δ ( M′ ) ∩ V ≠ ∅ for all ℳ′∈Usuperscriptℳ′M ∈ UM′ ∈ U, so ℱδsubscriptℱF_δFitalic_δ is lower hemicontinuous, as desired. D.3. ℱπ′subscriptℱ′F_π Fitalic_π′ (defined in section 7.1, item 2) is lower hemicontinuous The proof is extremely similar to that for ℱδsubscriptℱF_δFitalic_δ. Let ℳMM be an MDP and V⊆ℙ⁢(Sℳ)×[0,1]ℙsubscriptℳ01V (S_M)×[0,1]V ⊆ ℙ ( Scaligraphic_M ) × [ 0 , 1 ] be an open set with ℱπ′⁢(ℳ)∩V≠∅superscriptsubscriptℱ′ℳF_π (M)∩ V≠ _π′ ( M ) ∩ V ≠ ∅. Let (Δ,p)∈ℱπ′⁢(ℳ)∩VΔsuperscriptsubscriptℱ′ℳ( ,p) _π (M)∩ V( Δ , p ) ∈ Fitalic_π′ ( M ) ∩ V. Since V is an open set, there is some 0<η<p00<η<p0 < η < p so that (Δ,p−η)∈VΔ( ,p-η)∈ V( Δ , p - η ) ∈ V. Now, let U be the ball centered at ℳMM with radius min⁡ϵ,ηB⁢(ℳ)italic-ϵℳ \ε, ηB(M) \min ϵ , divide start_ARG η end_ARG start_ARG B ( M ) end_ARG , where ϵitalic-ϵεϵ and B⁢(ℳ)ℳB(M)B ( M ) are as in theorem 6.5. Then we apply theorem 6.5 to find that every ℳ′∈Usuperscriptℳ′M ∈ UM′ ∈ U has −π⁢(ℳ′,Δ)−π⁢(ℳ,Δ)db⁢(ℳ′,ℳ)<B⁢(ℳ)subscriptsuperscriptℳ′ΔsubscriptℳΔsubscriptsuperscriptℳ′ℳ- S_π(M , )-S_π(% M, )d_b(M ,M)<B(M)- divide start_ARG Sitalic_π ( M′ , Δ ) - Sitalic_π ( M , Δ ) end_ARG start_ARG ditalic_b ( M′ , M ) end_ARG < B ( M ). Then db⁢(ℳ′,ℳ)<ηB⁢(ℳ)⟹−π⁢(ℳ′,Δ)+π⁢(ℳ,Δ)<ηsubscriptsuperscriptℳ′ℳsubscriptsuperscriptℳ′ΔsubscriptℳΔd_b(M ,M)< ηB(M) -% S_π(M , )+S_π(M,% )< _b ( M′ , M ) < divide start_ARG η end_ARG start_ARG B ( M ) end_ARG ⟹ - Sitalic_π ( M′ , Δ ) + Sitalic_π ( M , Δ ) < η and p≤π⁢(ℳ,Δ)⟹p−η<π⁢(ℳ′,Δ)subscriptℳΔsubscriptsuperscriptℳ′Δp _π(M, ) p-η<S_π(% M , )p ≤ Sitalic_π ( M , Δ ) ⟹ p - η < Sitalic_π ( M′ , Δ ). Thus (Δ,p−η)∈ℱπ′⁢(ℳ′)∩VΔsuperscriptsubscriptℱ′ℳ′( ,p-η) _π (M )∩ V( Δ , p - η ) ∈ Fitalic_π′ ( M′ ) ∩ V, so ℱπ′subscriptℱ′F_π Fitalic_π′ is lower hemicontinuous, as desired.