
Paper deep dive

Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies

Frédéric Berdoz, Roger Wattenhofer

Year: 2024 | Venue: arXiv preprint | Area: Formal/Theoretical | Type: Theoretical | Embeddings: 273

Abstract

While autonomous agents often surpass humans in their ability to handle vast and complex data, their potential misalignment (i.e., lack of transparency regarding their true objective) has thus far hindered their use in critical applications such as social decision processes. More importantly, existing alignment methods provide no formal guarantees on the safety of such models. Drawing from utility and social choice theory, we provide a novel quantitative definition of alignment in the context of social decision-making. Building on this definition, we introduce probably approximately aligned (i.e., near-optimal) policies, and we derive a sufficient condition for their existence. Lastly, recognizing the practical difficulty of satisfying this condition, we introduce the relaxed concept of safe (i.e., nondestructive) policies, and we propose a simple yet robust method to safeguard the black-box policy of any autonomous agent, ensuring all its actions are verifiably safe for the society.

Tags

ai-safety (imported, 100%) · formaltheoretical (suggested, 92%) · theoretical (suggested, 88%)


Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/12/2026, 5:28:57 PM

Summary

The paper addresses the challenge of aligning autonomous AI agents with human interests in critical social decision-making processes. It introduces a quantitative definition of alignment based on utility and social choice theory, defines 'probably approximately aligned' (PAA) policies, and proposes a method to safeguard black-box policies to ensure they remain verifiably safe for society.

Entities (5)

Frédéric Berdoz · researcher · 100%
Roger Wattenhofer · researcher · 100%
Probably Approximately Aligned (PAA) Policies · concept · 98%
Social Markov Decision Processes · framework · 95%
Utility Theory · theory · 95%

Relation Signals (3)

Frédéric Berdoz authored Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies

confidence 100% · Paper title and author list

Roger Wattenhofer authored Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies

confidence 100% · Paper title and author list

Probably Approximately Aligned (PAA) Policies defined in Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies

confidence 95% · The paper introduces PAA policies as a core contribution.

Cypher Suggestions (2)

Find all papers authored by a specific researcher. · confidence 90% · unvalidated

MATCH (p:Paper)<-[:AUTHORED]-(r:Researcher {name: 'Frédéric Berdoz'}) RETURN p.title

Identify concepts defined within a specific paper. · confidence 90% · unvalidated

MATCH (c:Concept)-[:DEFINED_IN]->(p:Paper {title: 'Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies'}) RETURN c.name

Full Text

272,739 characters extracted from source content.


Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies Frédéric Berdoz ETH Zürich fberdoz@ethz.ch Roger Wattenhofer ETH Zürich wattenhofer@ethz.ch Abstract While autonomous agents often surpass humans in their ability to handle vast and complex data, their potential misalignment (i.e., lack of transparency regarding their true objective) has thus far hindered their use in critical applications such as social decision processes. More importantly, existing alignment methods provide no formal guarantees on the safety of such models. Drawing from utility and social choice theory, we provide a novel quantitative definition of alignment in the context of social decision-making. Building on this definition, we introduce probably approximately aligned (i.e., near-optimal) policies, and we derive a sufficient condition for their existence. Lastly, recognizing the practical difficulty of satisfying this condition, we introduce the relaxed concept of safe (i.e., nondestructive) policies, and we propose a simple yet robust method to safeguard the black-box policy of any autonomous agent, ensuring all its actions are verifiably safe for the society. 1 Introduction The deployment of AI systems in critical applications, such as social decision-making, is often stalled by the following two shortcomings: 1) They are brittle and usually provide no guarantees on their expected performance when deployed in the real world [8], and 2) there is no formal guarantee that the objective they have been trained against, typically a scalar quantity such as a loss or a reward, faithfully represents human interest at large [41]. Addressing these limitations is commonly referred to as AI alignment, an umbrella term including a wide array of methods supposed to make AI systems of different modalities behave as intended [17, 24]. 
Yet, to our knowledge, every metric for alignment is a posteriori, i.e., a system is deemed aligned as long as it does not display misaligned behavior (e.g., through red teaming [14]). This stems from the fact that most of these methods focus on aligning generative models of complex modalities (text, images, video, audio, etc.) where the input and output domains are particularly vast, and where no single metric can perfectly represent the intended behavior. In the context of critical (e.g., social) decision processes, where an autonomous agent must repeatedly take actions in a complex environment with many stakeholders, a posteriori alignment is not sufficient. Indeed, for the same reasons that a society would not trust a human policymaker with hidden motives and unknown track record, it would also distrust an autonomous policymaker whose objective is not, a priori, perfectly clear and verifiably aligned, as the cost of a single bad action (due to their known brittleness) could easily outweigh the benefits of leveraging such systems. This issue is accentuated by the fact that, unlike humans, holding a deceptive AI agent accountable remains a challenge [25]. Conversely, recent breakthroughs in AI have significantly increased its potential for beneficial use in these critical settings. For example, tax rates and public spending are typically set periodically by a parliament. However, this small group of representatives is inevitably overwhelmed by the vast amount of complex economic data as well as the pleas of millions of individuals. Due to this information bottleneck, governments can only make educated guesses about what is best for the society (assuming they can come to an agreement in the first place). On the other hand, provided that it is verifiably aligned, an autonomous government could efficiently leverage this vast amount of information in order to find the optimal tax rates and public spending (see Figure 1). 
Figure 1: Democratic (left) vs autonomous (right) governments. Transparent elections must be replaced with reliable alignment mechanisms. While it is clear that a priori alignment is essential for the safe delegation of such decision power, it comes at the cost of two unavoidable prerequisites. Firstly, it requires a definition of alignment that is both quantitative and measurable. Given our focus on social decision processes, where the perception of each action may differ among individuals, we define and quantify alignment by drawing from long-established theories of utility and social choice. For completeness, we also discuss all the assumptions that allow such a metric to exist. Secondly, one can only ensure that an agent is a priori aligned if one independently understands (at least reasonably well) how its available actions may affect society, regardless of how those effects are perceived. Indeed, asserting the safety of an autonomous agent would be impossible if it could take actions with unknown consequences. Although we make no assumption of the exact nature of this knowledge (domain specific expertise, heuristics, physical/data-driven modeling, or a mix thereof), we assume it is encapsulated in a world model that estimates, given an action and a current state, the probability of moving to any other state. A natural question then arises: Can we leverage this knowledge to construct a verifiably aligned policy? Or, at the very least, can we use it to ensure the safety of a more complex (black-box) policy? We address these questions by introducing probably approximately aligned (PAA) and safe policies, and by studying their existence based on the quality of this knowledge. Akin to the probably approximately correct (PAC) framework in the theory of learnability [38], we are also interested in the sample complexity of finding PAA policies. 
In our setting, this complexity is two-faceted: first, the number of calls to the world model (similar to the number of calls to a perfect generative model in [18]), and second, the amount of feedback (think ballots) required to confidently rank which state is socially preferred. While we are not yet concerned with efficient PAA policies, whose sample complexity is at most polynomial with respect to the desired tolerances, we argue that these complexities should, at least, be independent of the number of possible states (similarly to sparse sampling algorithms [19]), as most natural state spaces are infinite. We refer to such policies as computable. Conversely, as one can decide which decisions are delegated to autonomous agents, we assume that the action space is finite. Concretely, our contribution is threefold:

• First, we define a new type of Social Markov Decision Process, replacing the traditional reward with aggregated utilities of individuals. Expanding on this definition, we present a formal quantitative definition of alignment in the context of social choice, which naturally leads to the concept of probably approximately aligned (PAA) policies.

• Next, given an approximate world model with sufficient statistical accuracy (which we quantitatively derive), we prove the existence of PAA policies by amending a simple sparse sampling algorithm.

• Finally, we propose a simple and intuitive method to safeguard any black-box (potentially misaligned) policy in order to ensure that it does not take any destructive decisions, i.e., actions that might lead to a state where even the optimal policy is unsatisfactory. We refer to these adapted policies as safe policies.

2 Background

2.1 Utility and social choice theory

We are interested in building autonomous agents whose objective is to maximize social satisfaction by taking actions that alter the state of the society.
In this section, we define what is meant by social satisfaction (often called social welfare), and we provide the conditions under which it is quantifiable.

2.1.1 Utility theory

It is commonly assumed that the agency of an individual is governed by its internal binary preference relation ⪯ over the set of outcomes S. When presented with two choices s and s′, the individual will introspect its satisfaction (welfare) levels and either strictly prefer one outcome (s ≺ s′ or s′ ≺ s) or be indifferent between the two (s ∼ s′). We are interested in quantitatively measuring these welfare levels. Debreu's representation theorem [10] states that if this preference relation is complete, continuous, and transitive on the topological space S, then there exist infinitely many continuous, real-valued functions u : S → ℝ (called utility functions) such that u(s) ≤ u(s′) ⇔ s ⪯ s′ for all s, s′ ∈ S (note that any strictly monotonically increasing function can transform a valid utility function into another). While these findings establish the existence of utility functions, which are proxies for the intrinsic welfare levels of individuals, they do not provide insights into their measurability and interpersonal comparability. It is these two properties, however, that eventually determine what measure of social welfare can be derived. In a nutshell, measurability and comparability impose how much information can be extracted from the values |u_i(s) − u_i(s′)| and u_i(s) − u_j(s), respectively, for any i ≠ j and s ≠ s′.
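To make the representation concrete: on a finite outcome set, a utility function induces the preference relation s ⪯ s′ ⇔ u(s) ≤ u(s′), and any strictly increasing transform induces the same relation. A minimal sketch with hypothetical outcomes and utility values (none of them from the paper):

```python
# Illustrative sketch: a utility function represents a preference order, and
# a strictly increasing transform of it represents the same order.
import math

outcomes = ["recession", "status_quo", "boom"]             # hypothetical states
u = {"recession": 0.1, "status_quo": 0.5, "boom": 0.9}     # one utility function

def weakly_prefers(s_prime, s, util):
    """s ⪯ s' under the ordering induced by util."""
    return util[s] <= util[s_prime]

# Any strictly monotone transform yields another valid utility function.
v = {s: math.exp(3 * u[s]) for s in outcomes}

for s in outcomes:
    for sp in outcomes:
        assert weakly_prefers(sp, s, u) == weakly_prefers(sp, s, v)
```

The loop verifies Debreu's point that utility functions are only unique up to monotone transforms, which is exactly why measurability and comparability are separate questions.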
We detail the various measurability and comparability levels in Appendix A.1.1, and we refer to these levels as the informational basis of utilities. Apart from that, we fix 0 < U_min ≤ u_i(s) ≤ U_max < ∞ for any s and i (we will allow U_min = 0 in specific cases). That is, we assume that individuals cannot be infinitely satisfied or dissatisfied, and that they must scale their utilities when reporting (which does not imply measurability or comparability!). Lastly, we define ΔU ≜ U_max − U_min and U ≜ {u : S → [U_min, U_max]}.

2.1.2 Social choice theory

Let I be a society composed of N members, each with its preference relation ⪯_i and a corresponding utility function u_i ∈ U over the state space S, i ∈ I. Let R_S be the set of complete orderings on S and u ∈ U^N a vector gathering the utility functions of all individuals. A social welfare functional (SWFL) f is a mapping f : D_f → R_S with D_f ⊆ U^N. In other words, it is an aggregator of individuals' utilities, and indirectly of their preferences. A long line of work [9, 32, 29] has attempted to define which conditions this SWFL should satisfy (sometimes called axioms of cardinal welfarism; see Appendix A.1.2 for an extensive list of these properties and their respective implications).
For the remainder of this work, we follow the common assumption that any reasonable SWFL should satisfy the following: universality (U), informational basis invariance (XI), independence of irrelevant alternatives (IIA), the weak Pareto criterion (WP), and anonymity (A). An important result [29] states that, for any informational basis (X) listed in Appendix A.1.1 and any SWFL f satisfying (XI), (U), (IIA) and (WP), there exists a social welfare function (SWF) W : ℝ^N → ℝ such that, if W(u(s)) > W(u(s′)), then s ranks strictly higher than s′ in f(u). This is important as it states that the best social state must maximize a certain function W, which can therefore be used as a measure of social satisfaction. In other words, the non-welfare characteristics (i.e., any information influencing f(u) besides u, such as the judgement of an AI agent) are of secondary importance, as they can only break ties between s and s′ such that W(u(s)) = W(u(s′)) and cannot be detrimental to the society. Although we do not require it in this work, maximization of W can be made sufficient if one imposes Welfarism (W), e.g., by replacing (WP) with Weak Pareto Indifference (WPI) or, more drastically, by imposing Continuity (C) (see Appendix A.1.2 for more details). We are left with the following question: Given a SWFL satisfying (XI), (U), (IIA), (WP) and (A), what is the form of the corresponding SWF? It turns out that the choice is relatively limited and depends mostly on the informational basis invariance (XI). It has been shown, with additional small technical assumptions [7], that the power mean defined in Eq. (1) covers all possible SWFLs. See Appendix A.1.3 for a detailed mapping between informational bases, SWFLs and the parameter q.
W_q(u(s); I) =
  min_{i∈I} u_i(s)                          if q = −∞,
  ( (1/|I|) ∑_{i∈I} u_i(s)^q )^{1/q}        if q ∈ ℝ*,
  ( ∏_{i∈I} u_i(s) )^{1/|I|}                if q = 0,
  max_{i∈I} u_i(s)                          if q = ∞.    (1)

2.1.3 Future discounted social welfare

At deployment, a safe autonomous agent must provide assurances that its future actions will continue to serve the best interests of society. This becomes ill-defined if these interests evolve with time. To address this, we assume that both I and u are constant, i.e., u_i(s; t) = u_i(s; t′) ≡ u_i(s) for all s, i and discrete times t ≠ t′. In addition, we also assume that the meaning of these utilities does not change with time; that is, if u_i(s; t) ≥ u_i(s′; t′) for states s, s′ and times t ≠ t′, then i's welfare is at least as high in state s at time t as in state s′ at time t′ (and vice versa for ≤). Finally, we assume that the SWFL f is such that its corresponding SWF remains the same. In other words, only the method used to break ties between states can evolve through time.
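The power mean of Eq. (1) translates directly into code; the limiting cases q = −∞ (egalitarian minimum), q = 0 (geometric/Nash mean), and q = +∞ (maximum) must be handled separately. A sketch with a made-up utility vector, not the authors' code:

```python
# The power mean W_q of Eq. (1) over a finite utility profile, with the
# limiting cases handled explicitly. Utilities are assumed positive.
import math

def W_q(utilities, q):
    n = len(utilities)
    if q == -math.inf:
        return min(utilities)          # Rawlsian / egalitarian
    if q == math.inf:
        return max(utilities)
    if q == 0:
        return math.prod(utilities) ** (1.0 / n)   # geometric (Nash) mean
    return (sum(u ** q for u in utilities) / n) ** (1.0 / q)

us = [0.2, 0.5, 0.8]                   # hypothetical utility profile
assert W_q(us, -math.inf) == 0.2
assert W_q(us, math.inf) == 0.8
assert abs(W_q(us, 1) - 0.5) < 1e-12   # q = 1 is the utilitarian mean

# Sanity check: W_q is nondecreasing in q (the power-mean inequality).
qs = [-math.inf, -2, 0, 1, 2, math.inf]
vals = [W_q(us, q) for q in qs]
assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:]))
```

The final check recovers the familiar chain min ≤ geometric mean ≤ arithmetic mean ≤ max, so the single parameter q interpolates between egalitarian and maximal-individual welfare.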
This makes it possible to predict, at time t, what the satisfaction levels at time t′ > t will be in any given state. However, to model the fact that humans prefer immediate reward, we discount the utility of the state at time t′ with a factor γ^(t′−t) when comparing it with the utility of the state at time t, where γ ∈ [0, 1) is a discount factor. From these assumptions, it becomes possible, at time t = 0, to quantify the cumulative social welfare of any future state trajectory s_1 s_2 s_3 s_4 … by computing the quantity ∑_{t=0}^∞ γ^t W_q(u(s_{t+1})), which we refer to as the future discounted social welfare of that trajectory (see Section 2.2.2). Using this quantity, we formally define alignment as follows: An autonomous agent is aligned if and only if it always takes actions that maximize the expected future discounted social welfare.

2.2 Social dynamics

The expectation in the above definition accounts for the inherent randomness of most natural systems. In this section, we model the social dynamics as a particular type of Markov Decision Process (MDP), where the probability of transitioning to any state depends solely on the current state and next action.
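In practice the infinite sum is approximated by truncating at some horizon T; the discarded tail is bounded by γ^T U_max/(1 − γ). A minimal sketch with arbitrary welfare values and discount factor:

```python
# Truncated future discounted social welfare of a trajectory prefix:
# welfares[t] stands for W_q(u(s_{t+1})), and the result is
# sum_t gamma^t * welfares[t] over the prefix.
def discounted_social_welfare(welfares, gamma):
    return sum(gamma ** t * w for t, w in enumerate(welfares))

# With constant welfare 1.0 and gamma = 0.5, the 4-step prefix sums to
# 1 + 0.5 + 0.25 + 0.125 = 1.875 (the full series would converge to 2).
total = discounted_social_welfare([1.0, 1.0, 1.0, 1.0], 0.5)
assert abs(total - 1.875) < 1e-12
```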
2.2.1 Markov decision process

Let M = (S, A, p, r, γ) be an infinite-horizon, γ-discounted, discrete-time MDP where S is the state space (discrete or continuous), A is the action space (discrete or continuous), p : S × A → P(S) is the transition dynamics of the environment (with P(S) the set of probability distributions over S), r : S × A → [R_min, R_max] is the reward of the environment, and γ ∈ [0, 1) is a discount factor (favoring immediate over distant rewards). Given s ∈ S and a ∈ A, p(s′ | s, a) is the probability of transitioning to state s′ after taking action a in state s, and r(s, a) is the expected immediate reward after taking that action. In MDPs, actions are chosen according to a policy π : S → P(A), with π(a | s) the probability of taking action a in state s. Given an initial state s_0, the tuple (M, π, s_0) fully defines a distribution p_τ over trajectories τ = s_0 a_0 s_1 a_1 s_2 a_2 …, where a_t ∼ π(· | s_t) and s_{t+1} ∼ p(· | s_t, a_t). If the environment dynamics or the policy are deterministic, we will use the slight abuse of notation s_{t+1} = p(s_t, a_t) and a_t = π(s_t), respectively.
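As a concrete toy illustration of these ingredients, the sketch below samples a trajectory τ = s_0 a_0 s_1 a_1 … from hypothetical two-state tabular dynamics under a fixed deterministic policy; none of the numbers come from the paper.

```python
# Trajectory sampling in a tiny tabular MDP (illustrative dynamics only).
import random

random.seed(0)
S, A = [0, 1], [0, 1]
# p[s][a] is the categorical distribution over next states p(.|s, a).
p = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}

def policy(s):
    """A fixed deterministic policy a_t = pi(s_t)."""
    return 0 if s == 0 else 1

def sample_trajectory(s0, horizon):
    s, traj = s0, []
    for _ in range(horizon):
        a = policy(s)
        s_next = random.choices(S, weights=p[s][a])[0]  # s_{t+1} ~ p(.|s_t, a_t)
        traj.append((s, a, s_next))
        s = s_next
    return traj

traj = sample_trajectory(0, 5)
assert len(traj) == 5 and all(sp in S for _, _, sp in traj)
```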
The efficacy of a given policy π is measured by the state and state-action value functions, defined respectively as

V^π(s) = E_{τ ∼ p_τ(·|π, s_0 = s)} [ ∑_{t=0}^∞ γ^t r(s_t, a_t) ],    Q^π(s, a) = r(s, a) + γ E_{s′ ∼ p(·|s, a)} [ V^π(s′) ].

For a given state s and action a, the optimal state and state-action value functions are defined by V*(s) = sup_π V^π(s) and Q*(s, a) = sup_π Q^π(s, a). Given ε ≥ 0, a policy is called ε-optimal (or optimal) if it satisfies V^π(s) ≥ V*(s) − ε (respectively V^π(s) = V*(s)) for all s. Under mild technical assumptions, it can be shown that there always exists an optimal policy [37, 12].

2.2.2 Social Markov decision process

Definition (Social Markov Decision Process). Let I be a society with utility profile u ∈ U^N and social welfare function W_q. In addition, let S and A be the corresponding state and action spaces, p the social environment dynamics and γ a discount factor.
The MDP (S, A, p, r_I, γ) with reward r_I defined by

r_I(s, a) = E_{s′ ∼ p(·|s, a)} [ W_q(u(s′); I) ]    (2)

is a Social Markov Decision Process (SMDP), denoted M_I = (S, A, p, W_q, u, γ). In this setting, S contains all the possible states of the N individuals, as well as all those of the environment in which they evolve, and A contains all the actions that are delegated to an autonomous agent. Expanding on this definition, we can formally define the alignment metric proposed above:

Definition (Expected Future Discounted Social Welfare). Let M_I = (S, A, p, W_q, u, γ) be a SMDP. The expected future discounted social welfare of a policy π in state s is defined as

W^π(s) = E_{τ ∼ p_τ(·|π, s_0 = s)} [ ∑_{t=0}^∞ γ^t W_q(u(s_{t+1}); I) ],

and takes values between W_min ≜ U_min/(1 − γ) and W_max ≜ U_max/(1 − γ), with ΔW ≜ W_max − W_min. As shown with the next lemma, the expected future discounted social welfare of a SMDP is equivalent to the state value function of the corresponding MDP. This equivalence makes it a natural metric for alignment, as it enables the use of a wide array of known results on MDPs.

Lemma 1.
For any SMDP M_I = (S, A, p, W_q, u, γ), the expected future discounted social welfare of a policy π is the state value function of π in the MDP M = (S, A, p, r_I, γ), with r_I as set in Eq. (2). The proof follows directly from the tower property of conditional expectations (see Appendix A.2.1).

2.2.3 Approximate rewards

If p is unknown, the true reward of the SMDP in Eq. (2) can only be estimated a posteriori, i.e., after taking action a in state s multiple times and observing W_q(u(s′)). This would require a long exploration phase if S is large, which can be costly and even impossible in critical decision processes. Instead, one must usually plan using an approximate dynamics model p̂ to anticipate the effect of an action. Moreover, even if p is known, computing W_q(u(s′)) exactly for a given s′ ∼ p would require full knowledge of u(s′), which is only possible by obtaining feedback from the entire society about s′ without making additional assumptions on u. For these reasons, we consider a more realistic scenario: given a set of assessors I_n ⊆ I of size n ≤ N and an approximate dynamics model p̂, the true reward can be approximated by asking the assessors about their utilities on the anticipated future societal states:

r̂_{I_n}(s, a; K) = Ê^K_{s′ ∼ p̂(·|s, a)} [ W_q(u(s′); I_n) ] ≜ (1/K) ∑_{k=1}^K W_q(u(s^k); I_n),    (3)

where Ê^K_{s′ ∼ p̂} is a Monte Carlo estimate of E_{s′ ∼ p̂}, with K ∈ ℕ samples independently drawn from p̂(·|s, a), denoted s^1, s^2, …, s^K. The core of our analysis is to understand how K, n and the inaccuracies of p̂ affect the validity of alignment guarantees.

3 Results

3.1 Existence of aligned policies

Having formally derived a quantitative measure of alignment W^π in the context of social decision processes, we are now prepared to introduce and prove the existence of verifiably aligned policies:

Definition (Probably Approximately Aligned Policy). Given 0 ≤ δ < 1, ε > 0 and a SMDP M_I = (S, A, p, W_q, u, γ), a policy π is δ-ε-PAA (Probably Approximately Aligned) if, for any given s ∈ S, the following inequality holds with probability at least 1 − δ:

W^π(s) ≥ max_{π′} W^{π′}(s) − ε.    (4)

Definition (Approximately Aligned Policy). Given ε > 0, a policy π is ε-A (Approximately Aligned) if and only if it is 0-ε-PAA.
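The estimator of Eq. (3) can be sketched as follows. Everything here is illustrative: the society of N = 100 individuals with random utilities is invented, and a degenerate single-successor sampler stands in for p̂ so that only the panel subsampling and averaging are exercised.

```python
# Sketch of r-hat in Eq. (3): average the social welfare of K sampled
# successor states, as reported by a random panel of n assessors.
import random

random.seed(1)
N = 100                                                  # society size
true_u = [random.uniform(0.1, 1.0) for _ in range(N)]    # hypothetical utilities

def W_mean(panel, utilities):
    """Utilitarian (q = 1) power mean restricted to the assessor panel."""
    return sum(utilities[i] for i in panel) / len(panel)

def r_hat(sample_next_state, utilities_in_state, K, n):
    panel = random.sample(range(N), n)   # assessors I_n, drawn once
    total = 0.0
    for _ in range(K):                   # K independent draws from p-hat
        s_next = sample_next_state()
        total += W_mean(panel, utilities_in_state(s_next))
    return total / K

# With a single deterministic successor and the full society as panel,
# the estimate matches the exact mean welfare.
est = r_hat(lambda: "s_next", lambda s: true_u, K=10, n=N)
assert abs(est - sum(true_u) / N) < 1e-9
```

The interesting regime, which the paper's analysis quantifies, is n < N and a genuinely stochastic p̂, where both the panel and the samples inject error into the estimate.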
We state below one of our main contributions, i.e., the existence of computable PAA and A policies, which follows directly from Theorem 3 (δ > 0) and Corollary 4 (δ = 0) in the next section.

Theorem 2 (Existence of PAA and A policies). Given a SMDP M_I = (S, A, p, W_q, u, γ) with q ∈ ℝ and any tolerances ε > 0 and 0 ≤ δ < 1, if there exists an approximate world model p̂ such that

sup_{(s,a) ∈ S×A} D_KL( p(·|s, a) ∥ p̂(·|s, a) ) < ε²(1 − γ)⁴ / (8 ΔW²),    (5)

then there exists a computable δ-ε-PAA policy. Consequently, there also exists a computable ε-A policy.

3.2 Near optimal planning

We prove Theorem 2 by providing a planning policy π_PAA and by proving that it satisfies Eq. (4) under the given assumptions. To this end, we present a modified version of the sparse sampling algorithm [19] (which originally assumes that p and r are known, which is not the case here).
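For a finite state space, condition (5) can be checked numerically, since the KL divergence between categorical distributions has a closed form. The tolerances and transition probabilities below are made up; the sketch mainly illustrates how stringent the bound is for a long-horizon (γ close to 1) problem.

```python
# Checking the world-model accuracy condition of Eq. (5) at one (s, a) pair.
import math

def kl_divergence(p, q):
    """KL divergence between two categorical distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

gamma, eps = 0.9, 1.0                        # illustrative tolerances
U_min, U_max = 0.0, 1.0
delta_W = (U_max - U_min) / (1 - gamma)      # Delta W = Delta U / (1 - gamma)
bound = eps ** 2 * (1 - gamma) ** 4 / (8 * delta_W ** 2)

p_true = [0.7, 0.2, 0.1]                     # true dynamics p(.|s, a)
p_hat = [0.68, 0.22, 0.10]                   # a close but imperfect model

worst_kl = kl_divergence(p_true, p_hat)
accurate_enough = worst_kl < bound
# The bound here is 1.25e-7, while the KL gap of this already-close model
# is about 1.2e-3: the condition fails, matching the paper's remark that
# satisfying it is practically difficult.
```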
Given some parameters K, C and n, we define the recursive functions

Q̂^h(s, a; K, C, I_n) =
  0                                                                        if h = 0,
  r̂_{I_n}(s, a; K) + γ Ê^C_{s′ ∼ p̂(·|s, a)} [ V̂^{h−1}(s′; K, C, I_n) ]     if h ∈ ℕ*,    (8)

V̂^h(s; K, C, I_n) = max_{a ∈ A} Q̂^h(s, a; K, C, I_n).

Intuitively, Q̂^h and V̂^h are recursive approximations of Q* and V*; K and C control the accuracy of the empirical expectation operators in Q̂, h controls how far one looks into the future, and n controls the accuracy of the social welfare function estimates. The proposed PAA policy is simply the greedy policy acting on the state-action value estimates, i.e.,

π_PAA(s) = arg max_{a ∈ A} Q̂^H(s, a; K, C, I_n).    (9)

It is deterministic in the sense that, for a given Q̂^H, it outputs a single action. However, Q̂^H itself is non-deterministic, since I_n and the samples s′ are drawn randomly.
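The recursion of Eqs. (8) and (9) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the two-state dynamics and welfare values are invented, the reward estimator is collapsed to an exact lookup (the dynamics are deterministic), and the assessor subsampling via I_n is omitted.

```python
# Sketch of the amended sparse sampling recursion: Q-hat and V-hat are built
# by recursive sampling from the (approximate) model, and pi_PAA is greedy
# with respect to Q-hat at depth H.
gamma, A = 0.9, [0, 1]
# Deterministic toy dynamics: action a moves to state a. Welfare per state.
step = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
welfare = {0: 0.0, 1: 1.0}

def sample_next(s, a):
    """Stands in for a draw from p-hat(.|s, a)."""
    return step[(s, a)]

def r_hat(s, a, K):
    """Eq. (3); exact here because the toy dynamics are deterministic."""
    return sum(welfare[sample_next(s, a)] for _ in range(K)) / K

def Q_hat(s, a, h, K, C):
    """Eq. (8): zero at the horizon, else reward plus discounted lookahead."""
    if h == 0:
        return 0.0
    nxt = [sample_next(s, a) for _ in range(C)]      # C model samples
    return r_hat(s, a, K) + gamma * sum(V_hat(sp, h - 1, K, C) for sp in nxt) / C

def V_hat(s, h, K, C):
    return max(Q_hat(s, a, h, K, C) for a in A)

def pi_PAA(s, H, K, C):
    """Eq. (9): greedy in Q-hat at depth H."""
    return max(A, key=lambda a: Q_hat(s, a, H, K, C))

# Action 1 reaches (and keeps) the high-welfare state, so the greedy
# lookahead policy selects it from state 0.
assert pi_PAA(0, H=3, K=2, C=2) == 1
```

Note how the cost of one call to pi_PAA is roughly (|A| C)^H model calls, independent of |S|, which is the sense in which the paper calls such policies computable.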
The next results clarify under which conditions $\pi_{PAA}$ is indeed $\delta$-$\varepsilon$-PAA (Theorem 3) or $\varepsilon$-A (Corollary 4).

Theorem 3 ($\pi_{PAA}$ is $\delta$-$\varepsilon$-PAA). Let $\mathcal{M}_\mathcal{I} = (\mathcal{S}, \mathcal{A}, p, W_q, \boldsymbol{u}, \gamma)$ be an SMDP with $q \in \mathbb{R}$ and $\hat{p}$ an approximate dynamics model such that $d \triangleq \sup_{(s,a)} D_{KL}(p(\cdot|s,a) \,\|\, \hat{p}(\cdot|s,a)) < \frac{\varepsilon^2 (1-\gamma)^6}{8\, \Delta U^2}$ for any desired tolerances $\varepsilon > 0$ and $0 < \delta < 1$. For any $k \ge \log_\gamma\!\big(\frac{(1-\gamma)\varepsilon}{\Delta U} - \frac{\sqrt{8d}}{(1-\gamma)^2}\big)$, define
$$\beta \triangleq \frac{(1-\gamma)^2 \varepsilon}{8} - \frac{\sqrt{d}\, \Delta U}{\sqrt{8}\,(1-\gamma)} - \frac{(1-\gamma)\,\gamma^k\, \Delta U}{8}$$
and let $\pi_{PAA}$ be the policy defined in Eq.
(9) with parameters
• $H \ge \max\big\{1,\ \log_\gamma\big(\frac{\beta}{\Delta U}\big)\big\}$,
• $K \ge \frac{\Delta U^2}{\beta^2}\Big((H-1)\ln\Big(\sqrt[H-1]{24k}\,(H-1)\,|\mathcal{A}|\,\frac{\Delta U^2}{\beta^2}\Big) + \ln\frac{1}{\delta}\Big)$,
• $C \ge \frac{\gamma^2}{(1-\gamma)^2}\,K$,
• $n \ge N\big(1 + \frac{\Delta U^2 N}{2K}\,\Gamma(\beta, U_{min}, U_{max}, q)\big)^{-1}$,
and where $\Gamma(\beta, U_{min}, U_{max}, q)$ is a function defined in Eq. (16). Then $\pi_{PAA}$ is a $\delta$-$\varepsilon$-PAA policy.

Corollary 4. Let $\mathcal{M}_\mathcal{I} = (\mathcal{S}, \mathcal{A}, p, W_q, \boldsymbol{u}, \gamma)$ be an SMDP with $q \in \mathbb{R}$ and $\hat{p}$ an approximate dynamics model such that $d \triangleq \sup_{(s,a)} D_{KL}(p(\cdot|s,a) \,\|\, \hat{p}(\cdot|s,a)) < \frac{\varepsilon^2 (1-\gamma)^6}{8\, \Delta U^2}$ for any desired tolerance $\varepsilon > 0$. Define $\beta \triangleq \frac{(1-\gamma)^2 \varepsilon}{10} - \frac{\sqrt{2d}\, \Delta U}{5(1-\gamma)}$ and let $\pi_{PAA}$ be the policy defined in Eq.
(9) with parameters $H$, $C$ and $n$ as in Theorem 3 and $K \ge \frac{\Delta U^2}{\beta^2}\Big((H-1)\ln\Big(\sqrt[H-1]{12}\,(H-1)\,|\mathcal{A}|\,\frac{\Delta U^2}{\beta^2}\Big) + \ln\frac{\Delta U}{\beta}\Big)$. Then $\pi_{PAA}$ is an $\varepsilon$-A policy.

See Appendix A.2.2 for the full derivation of these two results. We now outline a proof sketch. The skeleton of the proof is similar to the one proposed by Kearns et al. [19] for their original sparse sampling algorithm, although several additional tricks and intermediate results are necessary to accommodate the approximate world model $\hat{p}$ and reward $\hat{r}_\mathcal{I}$. First, we derive a concentration inequality for the power mean function in order to quantify the approximation error between $W_q(\cdot\,; I_n)$ and $W_q(\cdot\,; \mathcal{I})$. Using a slightly modified (two-sided) version of the Hoeffding-Serfling inequality [4, 33] (see Lemma 9 in Appendix A.2.1), we find the following bounds:

Lemma 5. Let $W_q$ be the power mean defined in Eq. (1), with $q \in \mathbb{R}$. Given $a, b \in \mathbb{R}_+^*$ (or $\mathbb{R}_+$ for $q = 1$) such that $a < b$, let $X \in [a,b]^N$ be a set of size $N$ and let $X_n$ be a subset of size $n < N$ sampled uniformly at random without replacement from $X$.
Then, for $0 < \varepsilon < W_q(X)$ and $m = \min(n, N-n)$,
$$\mathbb{P}\big[|W_q(X_n) - W_q(X)| \ge \varepsilon\big] \le 2\exp\!\left(-\frac{2n\varepsilon^2}{\big(1-\frac{n}{N}\big)\big(1+\frac{1}{m}\big)}\,\Gamma(\varepsilon, a, b, q)\right), \qquad (10)$$
where
$$\Gamma(\varepsilon, a, b, q) = \begin{cases} \dfrac{(1-2^q)^2\, b^{2q-2}}{(a^q - b^q)^2} & q < 0 \\[4pt] \dfrac{1}{(b+\varepsilon)^2 (\log b - \log a)^2} & q = 0 \\[4pt] \dfrac{q^2 a^{2q}}{(b + (1-q)\varepsilon)^2 (b^q - a^q)^2} & 0 < q < 1 \\[4pt] \dfrac{1}{(b-a)^2} & q = 1 \\[4pt] \dfrac{q^2 a^{2q}}{(b + q\varepsilon)^2 (b^q - a^q)^2} & q > 1 \end{cases} \qquad (16)$$
Similarly, for $q \in \{\pm\infty\}$:
$$\mathbb{P}\big[|W_q(X_n) - W_q(X)| \ge \varepsilon\big] \le 1 - \frac{n}{N}.$$
See Appendix A.2.1 for the full proof. Note that, for $q = 1$ (utilitarian rule), it is sufficient to have $n = N\big(1 + \frac{N}{2K}\big)^{-1}$ in Theorem 3. However, for $q \ne 1$, $\Gamma$ depends highly on $a$ and $b$ (respectively $U_{min}$ and $U_{max}$ in our setting). Worse, for $q = \pm\infty$, the bound depends linearly on $n$. Indeed, $q = -\infty$ corresponds to the egalitarian rule, which defines social welfare as the lowest welfare among individuals. As this individual might be unique, the probability of not selecting it in $I_n$ can be as high as $1 - \frac{n}{N}$. The same argument can be made for $q = +\infty$, which is why we purposefully avoid these scenarios in Theorems 2, 3 and 8. Regarding the error induced by the approximate model $\hat{p}$, we bound it using the following lemma:

Lemma 6. Let $f : \mathcal{S} \to [f_{min}, f_{max}]$ be a bounded function with $0 \le f_{min} \le f_{max} < \infty$, and let $p, \hat{p} \in \mathcal{P}(\mathcal{S})$ be two distributions such that $D_{KL}(p \,\|\, \hat{p}) \le d \in \mathbb{R}$.
Then
$$\big|\mathbb{E}_{s \sim p}[f(s)] - \mathbb{E}_{s \sim \hat{p}}[f(s)]\big| \le 2\,(f_{max} - f_{min})\,\sqrt{\min\big\{\tfrac{d}{2},\ 1 - e^{-d}\big\}}.$$
See Appendix A.2.1 for the full proof. Combining Lemmas 5 and 6 along with other classical concentration inequalities, we can bound the error $|Q^*(s,a) - \hat{Q}^h(s,a)|$ with high probability. The last step of the proof is to quantify how this error affects the state value function $V^{\pi_{PAA}}$ (and consequently $W^{\pi_{PAA}}$, from Lemma 1), which can be done using the following results:

Lemma 7. Let $\hat{Q}$ be a (randomized) approximation of $Q^*$ such that $|Q^*(s,a) - \hat{Q}(s,a)| \le \varepsilon$ with probability at least $1-\delta$ for any state-action pair $(s,a)$, with $\varepsilon > 0$ and $0 \le \delta < 1$. Let $\pi_{\hat{Q}}$ be the greedy policy defined by $\pi_{\hat{Q}}(s) = \arg\max_{a \in \mathcal{A}} \hat{Q}(s,a)$.
Then, for all states $s$:
1) $V^*(s) - V^{\pi_{\hat{Q}}}(s) \le \frac{2\varepsilon}{1-\gamma} + \gamma^k (V_{max} - V_{min})$ with probability at least $1 - 2k\delta$, $\forall k \in \mathbb{N}^*$;
2) $V^*(s) - V^{\pi_{\hat{Q}}}(s) \le \frac{2\varepsilon + 2\delta (V_{max} - V_{min})}{1-\gamma}$ almost surely.
These are fairly general results, as they do not depend on how $\hat{Q}$ is derived. Statements closely related to 2) have already been shown [19, 35]. We provide a proof in Appendix A.2.1 for completeness.

3.3 Safe policies

Although Theorem 2 may initially inspire optimism regarding the title of the paper, the policy $\pi_{PAA}$ proposed in Eq. (9) is expensive for small $\varepsilon$, both in terms of sample complexity and in terms of the required accuracy of the world model. A more efficient PAA policy derived in future work might partially solve the sample complexity issue, but the challenge of building predictive models of high accuracy remains untouched. In most realistic settings, $D_{KL}(p \,\|\, \hat{p})$ is imposed by the state-of-the-art knowledge upon which $\hat{p}$ is built, which implicitly restricts the achievable tolerance $\varepsilon$. Therefore, it seems unlikely that such policies could be used as a primary tool for social decisions, as their sole objective would be to maximize a dubious approximation of social welfare.
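How sharply the model's accuracy caps the achievable tolerance can be made concrete: rearranging the accuracy condition of Theorem 3, $d < \varepsilon^2(1-\gamma)^6 / (8\,\Delta U^2)$, for $\varepsilon$ gives the smallest admissible tolerance as a function of the KL gap $d$. A minimal numeric sketch (the values plugged in are hypothetical):

```python
import math

def min_tolerance(d, delta_u, gamma):
    """Smallest PAA tolerance epsilon permitted by the accuracy condition
    of Theorem 3, d < eps^2 * (1 - gamma)^6 / (8 * delta_u^2), solved for
    eps. delta_u is the utility range U_max - U_min."""
    return math.sqrt(8.0 * d) * delta_u / (1.0 - gamma) ** 3

# Even a fairly accurate world model forces a loose tolerance once gamma
# approaches 1 (i.e., long effective planning horizons):
print(min_tolerance(d=1e-4, delta_u=1.0, gamma=0.5))   # ~0.226
print(min_tolerance(d=1e-4, delta_u=1.0, gamma=0.95))  # ~226
```

The cubic dependence on $1/(1-\gamma)$ is one way to read the paper's remark that the state of the art in predictive modeling, not the planner, bounds the achievable $\varepsilon$.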
On the other hand, even for large $\varepsilon$, we will show that we can use our PAA policy to adapt any black-box policy $\pi$ (e.g., a policy built on top of an LLM) into a safe policy, which we formally define as follows:

Definition (Safe Policy). Given $\omega \in [W_{min}, W_{max}]$ and $0 < \delta < 1$, a policy $\pi$ is $\delta$-$\omega$-safe if, for any current state $s$, the inequality
$$\mathbb{E}_{s' \sim p(\cdot|s,a)}\Big[\sup_{\pi'} W^{\pi'}(s')\Big] \ge \omega$$
holds with probability at least $1-\delta$ for any action $a$ such that $\pi(a|s) > 0$.

Intuitively, a safe policy ensures (with high probability and in expectation over the environment dynamics) that the society is not led into a destructive state, that is, a state that might generate high immediate satisfaction but from which no policy can generate an expected future discounted social welfare of at least $\omega$. This is considerably weaker than the PAA requirements, as we are no longer concerned with social welfare optimality. The ability to adapt any black-box policy into a safe policy would allow one to leverage their strengths while fully removing their brittleness (by bounding the probability of a destructive decision by any desired value $\delta > 0$). To this end, we use another type of policy:

Definition (Restricted Version of a Black-Box Policy). Let $\pi : \mathcal{S} \to \mathcal{A}$ be any policy and $\bar{\mathcal{A}}(s) \subseteq \mathcal{A}$ be restricted subsets of actions for all states $s$, with $\Pi(s) \triangleq \sum_{a \in \bar{\mathcal{A}}(s)} \pi(a|s)$.
The restricted version $\bar{\pi}$ of $\pi$ is defined as
$$\bar{\pi}(a|s) \triangleq \begin{cases} 0 & a \in \mathcal{A} \setminus \bar{\mathcal{A}}(s) \ \text{ or } \ \Pi(s) = 0, \\[2pt] \dfrac{\pi(a|s)}{\Pi(s)} & a \in \bar{\mathcal{A}}(s) \ \text{ and } \ 0 < \Pi(s) < 1, \\[2pt] \pi(a|s) & a \in \bar{\mathcal{A}}(s) \ \text{ and } \ \Pi(s) = 1. \end{cases} \qquad (17)$$
This is similar to the action masking presented in [20]. It might happen that $\bar{\pi}(a|s) = 0$ for all actions $a$, in which case the policy stops operating. However, if this happens, we have the guarantee that, with high probability, the society is currently not in a destructive state. The challenge lies in finding the subsets of safe actions for every $s$. Our proposed method to safeguard any policy is the following:

Theorem 8 (Safeguarding a Black-Box Policy).
Given an SMDP $\mathcal{M}_\mathcal{I} = (\mathcal{S}, \mathcal{A}, p, W_q, \boldsymbol{u}, \gamma)$ with $q \in \mathbb{R}$, a predictive model $\hat{p}$ and desired tolerances $\omega \in [W_{min}, W_{max}]$ and $0 < \delta < 1$, define $\hat{Q}_\omega(s,a) \triangleq \hat{Q}^H(s,a;K,C,I_n)$ with $\hat{Q}^H$ given in Eq. (8) and any $H, K, C, n \ge 1$. For any policy $\pi$, let $\pi_{safe}$ be the restricted version of $\pi$ obtained with the restricted subsets
$$\mathcal{A}_{safe}(s) \triangleq \big\{a \in \mathcal{A} : \hat{Q}_\omega(s,a) \ge \gamma\omega + U_{max} + \alpha\big\},$$
where
$$\alpha \triangleq \frac{2\,\Delta U\, d'}{(1-\gamma)^2} + \frac{\sqrt{\ln\big(\frac{12\,(C|\mathcal{A}|)^{H-1}}{\delta}\big)}}{1-\gamma}\left(\sqrt{\frac{N-n}{nN\,\Gamma_{max}}} + \sqrt{\frac{\Delta U^2}{2K}} + \gamma\sqrt{\frac{\Delta U^2}{2C(1-\gamma)^2}}\right) + \frac{\gamma^H\, \Delta U}{1-\gamma},$$
and with the shortened notations $\Gamma_{max} \triangleq \Gamma(U_{max}, U_{min}, U_{max}, q)$, $d' \triangleq \sqrt{\min\big\{\frac{d}{2},\ 1-e^{-d}\big\}}$ and $d \triangleq \sup_{(s,a)} D_{KL}(p(\cdot|s,a) \,\|\, \hat{p}(\cdot|s,a))$. Then $\pi_{safe}$ is $\delta$-$\omega$-safe.

See Appendix A.2.3 for the full proof. While we are, in theory, not restricted by the statistical accuracy of the world model when deriving a safe version of any black-box policy, low statistical accuracy will, in practice, drastically reduce the number of verifiably safe actions, which at some point will render the safe policy obsolete (as it will refuse to take any action).

4 Related work

MDP for social choice. MDPs have already been used in the context of social decision processes. For instance, Parkes and Procaccia [28] (and more recently [21]) use social choice MDPs in order to tackle decision-making under dynamic preferences. However, their setting is significantly different from ours: in their work, the state space $\mathcal{S}$ is the set of preference profiles $\mathcal{U}^N$, and $p$ dictates how these preferences evolve based on the outcome selected by the social choice functional (the policy). Another relevant line of study is that of preference-based reinforcement learning (PbRL) [40], where the traditional numerical reward of the underlying MDP is replaced with relative preferences between state-action trajectories. While social decision processes can also be cast as a PbRL problem, no previous work has attempted to formally and quantitatively define social alignment in that setting.

AI safety. A long line of work has attempted to address the challenge of building verifiably safe autonomous systems [2, 30]. In the context of MDPs, safe reinforcement learning (RL) tackles this challenge by introducing safety constraints [1].
See [15, 20] for comprehensive surveys on the topic. However, RL relies on exploration, which is not allowed in our setting. On the other hand, existing planning methods (where exploration is not needed if a world model $\hat{p}$ is available) do not relate the accuracy of $\hat{p}$ to the validity of the desired safety guarantees, as they mostly assume that $p$ is known.

Alignment. The goal of alignment can be entirely different based on the context [13]. Recent research on this topic has primarily focused on aligning large language models (LLMs) [17] using human feedback [6, 26] and derivatives [11, 39, 5]. While some work has attempted to tackle LLM alignment from a social choice perspective [22], the issue of aligning the meaning of generated text is, by nature, both qualitative and subjective, and therefore separate from ours. A setting closer to our work is the value alignment problem [34, 23], based on the theory of basic human values [31], where the preferences of individuals (over social states) are assumed to be guided by a predefined set of common values, and where the goal is to find "norms" (i.e., hard-coded logical constraints on actions) that guide society towards states that maximize these values. The alignment of these norms can be quantified by measuring the level of these values in the subsequent states, and the alignment of an autonomous system is simply given by the alignment of the norms it follows. However, this measure is intractable in most realistic settings, as it must be computed over all possible state trajectories [34].

5 Limitations and future work

From a practical standpoint, the main challenge lies in building a reliable world model $\hat{p}$, since PAA guarantees depend on its statistical accuracy, which can only be measured exactly if the true world model $p$ is known. In practice, a conservative estimate of this accuracy could be used instead.
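One way such a conservative estimate could be obtained, in the tabular case, is a plug-in estimate of $D_{KL}(p(\cdot|s,a) \,\|\, \hat{p}(\cdot|s,a))$ from held-out transitions. The paper does not prescribe a procedure; the sketch below is only an illustration (the smoothing parameter `alpha` and the function name are assumptions), and plug-in KL estimates are biased, so in practice the value would need to be inflated before being plugged into the theorems.

```python
import math
from collections import Counter

def plugin_kl(samples, p_hat, alpha=1e-3):
    """Plug-in estimate of D_KL(p || p_hat) for one (s, a) pair, from
    next-state samples drawn from the true dynamics p. The empirical
    distribution is smoothed by alpha; the estimate is biased, so a
    conservative (inflated) value should be used in practice."""
    counts = Counter(samples)
    n = len(samples)
    support = set(counts) | set(p_hat)
    kl = 0.0
    for s in support:
        p_emp = (counts.get(s, 0) + alpha) / (n + alpha * len(support))
        kl += p_emp * math.log(p_emp / max(p_hat.get(s, alpha), alpha))
    return kl

# A model matching the data-generating distribution scores near zero;
# a mismatched one does not.
samples = ["s0"] * 70 + ["s1"] * 30
print(plugin_kl(samples, {"s0": 0.7, "s1": 0.3}))  # close to 0
print(plugin_kl(samples, {"s0": 0.5, "s1": 0.5}))  # clearly positive
```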
Another limitation arises from the dependence on the informational basis of utilities, a philosophical question that falls outside the scope of this paper but that is common to all systems involving human feedback. A third practical limitation is the assumption that individuals can observe and evaluate the entire social state when reporting their utilities. Future work could extend our analysis to the setting of partially observable MDPs (POMDPs) [42], for instance. Finally, the assumption that individuals have static preferences can also be challenged, but it is not clear how evolving preferences could be modeled, let alone factored into our analysis. From a theoretical perspective, the complexity results presented in the various theorems are poor for $q \ne 1$ and $\gamma \approx 1$. These dependencies are hard to improve, as they relate to a known property of the power mean [7] and to the ability to foresee the future, respectively. Lastly, while we make no assumption about the distribution of utilities $\boldsymbol{u}$, one could investigate how such assumptions might improve these complexities (e.g., using the Bernstein-Serfling inequality [4]).

6 Conclusion

We present a formal and quantitative definition of alignment in the context of social choice, leading to the concept of probably approximately aligned policies. Using an approximate world model, we derive sufficient conditions for such policies to exist (and be computable). In addition, we introduce the relaxed concept of safe policies, and we present a method to make any policy safe by restricting its action space. Overall, this work provides a first attempt at rigorously defining the alignment of governing autonomous agents, and at quantifying the resources ($n$, $K$, $C$, $H$) and knowledge ($\hat{p}$) needed to enforce the desired level of alignment ($\varepsilon$) or safety ($\omega$) with high confidence ($\delta$).

Acknowledgments and Disclosure of Funding

Special thanks to Prof. Dr.
Hans Gersbach for his insightful and thorough feedback, which contributed to the refinement and clarity of this paper. References Altman [2021] Eitan Altman. Constrained Markov Decision Processes. Routledge, 2021. Amodei et al. [2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete Problems in AI Safety. arXiv preprint arXiv:1606.06565, 2016. Arrow [1951] Kenneth Joseph Arrow. Social Choice and Individual Values. Wiley: New York, 1951. Bardenet and Maillard [2015] Rémi Bardenet and Odalric-Ambrym Maillard. Concentration Inequalities for Sampling without Replacement. Bernoulli, 21(3):1361 – 1385, 2015. Bowman et al. [2022] Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamile Lukosuite, Amanda Askell, Andy Jones, Anna Chen, et al. Measuring Progress on Scalable Oversight for Large Language Models. arXiv preprint arXiv:2211.03540, 2022. Christiano et al. [2017] Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems 30 (NIPS), pages 4299–4307, 2017. Cousins [2021] Cyrus Cousins. An Axiomatic Theory of Provably-Fair Welfare-Centric Machine Learning. In Advances in Neural Information Processing Systems 34 (NeurIPS), pages 16610–16621, 2021. Cummings [2021] Mary Cummings. Rethinking the Maturity of Artificial Intelligence in Safety-Critical Settings . AI Magazine, 42(1):6–15, 2021. D’Aspremont and Gevers [1977] Claude D’Aspremont and Louis Gevers. Equity and the Informational Basis of Collective Choice. The Review of Economic Studies, 44(2):199–209, 1977. Debreu and Hildenbrand [1983] Gerard Debreu and Werner Hildenbrand. Representation of a preference ordering by a numerical function. In Mathematical Economics: Twenty Papers of Gerard Debreu, Econometric Society Monographs, chapter 6, page 105–110. Cambridge University Press, 1983. Dong et al. 
[2023] Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment. arXiv preprint arXiv:2304.06767, 2023. Feinberg [2011] Eugene A. Feinberg. Total Expected Discounted Reward MDPS: Existence of Optimal Policies. Wiley Encyclopedia of Operations Research and Management Science. John Wiley & Sons, Ltd, 2011. Gabriel [2020] Iason Gabriel. Artificial intelligence, values, and alignment. Minds and machines, 30(3):411–437, 2020. Ganguli et al. [2022] Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. arXiv preprint arXiv:2209.07858, 2022. Gu et al. [2023] Shangding Gu, Long Yang, Yali Du, Guang Chen, Florian Walter, Jun Wang, Yaodong Yang, and Alois Knoll. A Review of Safe Reinforcement Learning: Methods, Theory and Applications. arXiv preprint arXiv:2205.10330, 2023. Hicks and Allen [1934] John R. Hicks and Roy G. D. Allen. A Reconsideration of the Theory of Value. Part I. Economica, 1(1):52–76, 1934. Ji et al. [2023] Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, et al. AI Alignment: A Comprehensive Survey. arXiv preprint arXiv:2310.19852, 2023. Kakade [2003] Sham Machandranath Kakade. On the Sample Complexity of Reinforcement Learning. Doctoral dissertation, University College London (United Kingdom), 2003. Kearns et al. [2002] Michael Kearns, Yishay Mansour, and Andrew Y Ng. A Sparse Sampling Algorithm for Near-optimal Planning in Large Markov Decision Processes. Machine learning, 49:193–208, 2002. Krasowski et al. [2023] Hanna Krasowski, Jakob Thumm, Marlon Müller, Lukas Schäfer, Xiao Wang, and Matthias Althoff. 
Provably Safe Reinforcement Learning: Conceptual Analysis, Survey, and Benchmarking. arXiv preprint arXiv:2205.06750, 2023. Kulkarni and Neth [2020] Kshitij Kulkarni and Sven Neth. Social Choice with Changing Preferences: Representation Theorems and Long-run Policies. arXiv preprint arXiv:2011.02544, 2020. Mishra [2023] Abhilash Mishra. AI Alignment and Social Choice: Fundamental Limitations and Policy Implications. arXiv preprint arXiv:2310.16048, 2023. Montes and Sierra [2021] Nieves Montes and Carles Sierra. Value-Guided Synthesis of Parametric Normative Systems. In Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), page 907–915, 2021. Ngo et al. [2022] Richard Ngo, Lawrence Chan, and Sören Mindermann. The Alignment Problem from a Deep Learning Perspective. arXiv preprint arXiv:2209.00626, 2022. Novelli et al. [2023] Claudio Novelli, Mariarosaria Taddeo, and Luciano Floridi. Accountability in Artificial Intelligence: What it is and how it works. AI & SOCIETY, pages 1–12, 2023. Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training Language Models to Follow Instructions with Human Feedback. In Advances in Neural Information Processing Systems 35 (NeurIPS), pages 27730–27744, 2022. Pareto [1906] Vilfredo Pareto. Manuale di Economia Politica con una Introduzione alla Scienza Sociale. Piccola biblioteca scientifica. Societa editrice libraria, 1906. Parkes and Procaccia [2013] David Parkes and Ariel Procaccia. Dynamic Social Choice with Evolving Preferences. In Proceedings of the 27th AAAI conference on artificial intelligence, pages 767–773, 2013. Roberts [1980] Kevin W. S. Roberts. Interpersonal Comparability and Social Choice Theory. 
The Review of Economic Studies, 47(2):421–439, 1980. Russell et al. [2015] Stuart Russell, Daniel Dewey, and Max Tegmark. Research Priorities for Robust and Beneficial Artificial Intelligence. AI magazine, 36(4):105–114, 2015. Schwartz [1992] Shalom H. Schwartz. Universals in the Content and Structure of Values: Theoretical Advances and Empirical Tests in 20 Countries. In Mark P. Zanna, editor, Advances in Experimental Social Psychology, volume 25, pages 1–65. Academic Press, 1992. Sen [1977] Amartya Sen. On Weights and Measures: Informational Constraints in Social Welfare Analysis. Econometrica, 45(7):1539–1572, 1977. Serfling [1974] R. J. Serfling. Probability Inequalities for the Sum in Sampling without Replacement. The Annals of Statistics, 2(1):39 – 48, 1974. Sierra et al. [2021] Carles Sierra, Nardine Osman, Pablo Noriega, Jordi Sabater-Mir, and Antoni Perelló. Value Alignment: A Formal Approach. arXiv preprint arXiv:2110.09240, 2021. Singh and Yee [1994] Satinder P. Singh and Richard C. Yee. An Upper Bound on the Loss from Approximate Optimal-Value Functions. Machine Learning, 16(3):227–233, 1994. Strotz [1953] Robert H. Strotz. Cardinal Utility. The American Economic Review, 43(2):384–397, 1953. Sutton and Barto [2018] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018. Valiant [1984] Leslie Gabriel Valiant. A Theory of the Learnable. Communications of the ACM, 27(11):1134–1142, 1984. Wei et al. [2022] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned Language Models are Zero-Shot Learners. In Proceedings of the 10th International Conference on Learning Representations (ICLR), 2022. Wirth et al. [2017] Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A Survey of Preference-Based Reinforcement Learning Methods. Journal of Machine Learning Research, 18(136):1–46, 2017. 
Zhuang and Hadfield-Menell [2020] Simon Zhuang and Dylan Hadfield-Menell. Consequences of Misaligned AI. In Advances in Neural Information Processing Systems 33 (NeurIPS), pages 15763–15773, 2020. Åström [1965] Karl Johan Åström. Optimal Control of Markov Processes with Incomplete State Information I. Journal of Mathematical Analysis and Applications, 10:174–205, 1965.

Appendix A

A.1 Utility and Social Choice Theory

A.1.1 Measuring and comparing utilities

Measurability of $u \in \mathcal{U}$ can either be cardinal, where one can numerically measure absolute levels of satisfaction for a given state up to positive affine transformations, or ordinal, where one can only hope to rank states, meaning one can numerically measure utilities only up to increasing monotone transformations [16, 36, 27]. Additional dichotomization becomes possible when comparability is taken into account [29, 9, 32]. In a nutshell, utilities are either ordinal non-comparable (ONC), where individuals transform their intrinsic utilities differently when reporting; cardinal non-comparable (CNC), which is similar to ONC but with cardinal measurability; ordinal level comparable (OLC), where individuals transform their intrinsic utilities similarly when reporting; cardinal unit comparable (CUC), in which the affine transforms of the individuals have identical scaling factors but different biases; cardinal fully comparable (CFC), where individuals share the same affine transform; or cardinal ratio-scale comparable (CRS), in which all transforms are unbiased with the same scaling factor. Note that CRS is stronger than CFC in the sense that one can make statements such as "Bob is $x$ times more satisfied than Alice" under the CRS assumption. In this paper, we refer to this classification as the informational basis of utilities.
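What these comparability classes buy in practice is which social comparisons survive the reporting transforms. A small illustration (all utility values hypothetical): the utilitarian ranking of two states is invariant under CUC transforms (identical scaling factor, individual biases), but an ONC transform (arbitrary individual increasing maps) can flip it.

```python
def utilitarian_prefers(u_s, u_t):
    """True if state s ranks at least as high as state t under the
    utilitarian rule (power mean with q = 1, i.e., the plain sum)."""
    return sum(u_s) >= sum(u_t)

# Reported utilities of two individuals in states s and t; s wins by the sum.
u_s, u_t = [10.0, 1.0], [3.0, 5.0]
assert utilitarian_prefers(u_s, u_t)

# CUC: common scaling factor, individual biases -- the ranking is preserved.
cuc = lambda u: [2.0 * u[0] + 3.0, 2.0 * u[1] - 1.0]
assert utilitarian_prefers(cuc(u_s), cuc(u_t))

# ONC: individual increasing maps -- the ranking can flip (here a square
# root compresses only individual 0's report, erasing s's advantage).
onc = lambda u: [u[0] ** 0.5, u[1]]
assert not utilitarian_prefers(onc(u_s), onc(u_t))
```

This is why the informational basis matters: a SWFL that aggregates reported numbers is only meaningful under the comparability assumptions its aggregation rule is invariant to.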
A.1.2 SWFL properties and implications

Here is a list of the most common properties imposed on a given SWFL $f: D_f \to \mathcal{R}_S$, along with their definitions. We denote by $\preceq_{f(u)}$ the binary preference relation corresponding to $f(u)$, that is, $s \preceq_{f(u)} s'$ if and only if $s'$ ranks equally or higher than $s$ in $f(u)$.

(U) Universality or unrestricted domain: $D_f = U^N$.

(IIA) Independence of Irrelevant Alternatives: For every $u, u' \in U^N$ and $S' \subseteq S$, if $u(s) = u'(s)$ for all $s \in S'$, then $f_{S'}(u) = f_{S'}(u')$, where $f_{S'}(u)$ is the partial ranking obtained after excluding $S \setminus S'$ from $f(u)$.

(WP) Weak Pareto criterion or unanimity: For all pairs $s, s' \in S$, if $u(s) > u(s')$, then $s$ ranks strictly higher than $s'$ in $f(u)$.

(WPI) Weak Pareto with Indifference criterion: For all pairs $s, s' \in S$, if $u(s) \geq u(s')$, then $s$ ranks equally or higher than $s'$ in $f(u)$.

(SP) Strong Pareto criterion: For all pairs $s, s' \in S$, if there exists $i \in \mathcal{I}$ such that $u_i(s) > u_i(s')$ and $u_j(s) \geq u_j(s')$ for all $j \in \mathcal{I} \setminus \{i\}$, then $s$ ranks strictly higher than $s'$ in $f(u)$.

(NI) Non-Imposition: For all $R \in \mathcal{R}_S$, there exists $u \in D_f$ such that $f(u) = R$.
(C) Continuity: For any $v \in \mathbb{R}^N$ and $s, s' \in S$, the sets $\{v' \in \mathbb{R}^N : u(s') = v',\ u(s) = v \text{ and } s \preceq_{f(u)} s' \text{ for some } u \in U^N\}$ and $\{v' \in \mathbb{R}^N : u(s') = v',\ u(s) = v \text{ and } s' \preceq_{f(u)} s \text{ for some } u \in U^N\}$ are closed.

(WC) Weak Continuity: For any $u \in U^N$ and $\varepsilon \in \mathbb{R}_+^N$, there exists $u' \in U^N$ such that $f(u) = f(u')$ and $0 < u(s) - u'(s) < \varepsilon$ for all $s \in S$ (component-wise inequalities).

(N) Neutrality or welfarism: For all quadruples $s, s', t, t' \in S$, if $u, u' \in U^N$ are such that $u(s) = u'(s')$ and $u(t) = u'(t')$, then $f(u)$ and $f(u')$ agree on the partial rankings of $(s, t)$ and $(s', t')$ (i.e., either $s$ and $s'$ are both preferred, or $t$ and $t'$ are).

(A) Anonymity or symmetry: For all $u \in U^N$, $f(u) = f(u')$ with $u'$ any permutation of $u$.

(ND) Non-Dictatorship: There is no single individual $i$ such that $\preceq_i \, = \, \preceq_{f(u)}$ for any $u$.
(IC) Incentive compatibility: It is in the best interest of each individual to report their true preferences (i.e., there is no tactical voting).

(S) Separability or independence of unconcerned agents: For all $u, u' \in U^N$, if there exists $\mathcal{I}' \subseteq \mathcal{I}$ such that $u_i = \alpha \in \mathbb{R}$ and $u'_i = \alpha' \in \mathbb{R}$ for $i \in \mathcal{I}'$, and $u_i = u'_i$ for $i \in \mathcal{I} \setminus \mathcal{I}'$, then $f(u) = f(u')$. $\mathcal{I}'$ is the set of unconcerned agents.

(PDT) Pigou–Dalton Transfer principle: Define $\bar{u}(s) = \sum_{i \in \mathcal{I}} u_i(s)$. For all $u \in U^N$ and $s, t \in S$ such that $\bar{u}(s) = \bar{u}(t)$, if $|u_i(s) - \bar{u}(s)| \leq |u_i(t) - \bar{u}(t)|$ for all $i \in \mathcal{I}$, then $s$ ranks equally or higher than $t$ in $f(u)$.

(XI) Informational basis Invariance: The SWFL is invariant under the given measurability and comparability assumptions: (ONCI), (CNCI), (CUCI), (CFCI), (CRSI), (OLCI). That is, $f(u) = f(u')$ for any two utility profiles $u, u'$ that are indistinguishable under the given informational basis.

We now provide a short, intuitive explanation for the properties (U), (XI), (IIA), (WP) and (A) that we impose in this work. Enforcing an unrestricted domain ensures that the SWFL always outputs a ranking. Informational basis invariance ensures that the SWFL outputs the same ranking for two preference profiles that are indistinguishable under the given measurability and comparability assumptions.
Independence of irrelevant alternatives ensures that the rankings are robust, in the sense that they are not incoherently affected by removing or adding other options. The weak Pareto criterion ensures that the SWFL represents the preferences of society reasonably well (if an outcome is unanimously preferred over another, then it must also be preferred at the social level). Anonymity ensures that all individuals have equal influence on the social ranking.

Lastly, we recall a few important results (assuming $|S| \geq 3$).

• (CRSI) ⇒ (CFCI) ⇒ (CUCI) ⇒ (CNCI) ⇒ (ONCI). Additionally, (OLCI) ⇒ (ONCI).

• (XI) + (U) + (IIA) + (WP) ⇒ there exists $W: \mathbb{R}^N \to \mathbb{R}$ such that $W(u(s)) > W(u(s'))$ implies that $s$ ranks strictly higher than $s'$ in $f(u)$. This is important as it states that the best social state $s$ must maximize $W$. In other words, even if $f$ is not neutral, the non-welfarism characteristics are of secondary importance and can only break ties between $s$ and $t$ such that $W(u(s)) = W(u(t))$. Maximization can be made sufficient if one imposes (W) on $f$ (e.g., by replacing (WP) by (WPI) or, more drastically, by imposing continuity on $f$).

• Arrow's impossibility theorem [3]: (ONCI) + (U) + (WP) + (IIA) ⇒ ¬(ND). While this theorem seems to prevent any hope of finding good SWFLs, its statement depends strongly on the (often hidden) ONCI assumption. Indeed, it is challenging to find a good social aggregator that only knows rankings of alternatives. Strengthening the measurability and comparability assumptions (and thus narrowing the informational basis invariance property) makes it possible to find SWFLs that are non-dictatorial.

• (A) ⇒ (ND).

• Gibbard–Satterthwaite theorem (single-winner elections): (ONCI) + (IC) ⇒ ¬(ND).

• (SP) ⇒ (WP), but the converse is not true.
• (XI) ⇒ (WC).

A.1.3 Power mean and SWFL correspondence

Table 1: The different social welfare functions (SWF) corresponding to a SWFL that satisfies (U), (IIA), (WP), (A) and (XI) for the different informational bases [29].

(XI) | SWF ($W$) | Power mean
(ONCI) or (CNCI) | Impossibility (Arrow [3]) | –
(CUCI) | $\sum_{i \in \mathcal{I}} u_i(s)$ | $q = 1$
(OLCI) | $\min_{i \in \mathcal{I}} u_i(s)$ or $\max_{i \in \mathcal{I}} u_i(s)$ | $q \in \{\pm\infty\}$
(CFCI) | $\min_{i \in \mathcal{I}} u_i(s)$, $\max_{i \in \mathcal{I}} u_i(s)$ or $\sum_{i \in \mathcal{I}} u_i(s)$ | $q \in \{\pm\infty, 1\}$
(CRSI) | $\min_{i \in \mathcal{I}} u_i(s)$, $\max_{i \in \mathcal{I}} u_i(s)$, $\sum_{i \in \mathcal{I}} u_i(s)^q$ or $\sum_{i \in \mathcal{I}} \log[u_i(s)]$ | $q \in \mathbb{R} \cup \{\pm\infty\}$

A.2 Proofs

A.2.1 Intermediate results

See Lemma 1.

Proof. Let $\tau_t = (s_0, a_0, s_1, a_1, \ldots, s_t)$ denote a truncated trajectory up to time $t$.
From the definitions of $V^\pi(s; \mathcal{M})$ and $r_{\mathcal{I}}$, we have

\begin{align*}
V^\pi(s; \mathcal{M}) &= \mathbb{E}_{\tau \sim p_\tau(\cdot \mid \pi, s_0 = s)}\left[\sum_{t=0}^{\infty} \gamma^t r_{\mathcal{I}}(s_t, a_t)\right] \\
&= \mathbb{E}_{\tau \sim p_\tau(\cdot \mid \pi, s_0 = s)}\left[\sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}_{s' \sim p(\cdot \mid s_t, a_t)}[W_q(u(s'))]\right] \\
&= \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}_{\tau \sim p_\tau(\cdot \mid \pi, s_0 = s)}\left[\mathbb{E}_{s' \sim p(\cdot \mid s_t, a_t)}[W_q(u(s'))]\right] \\
&= \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}_{\tau_t \sim p_{\tau_t}(\cdot \mid \pi, s_0 = s),\, a_t \sim \pi(\cdot \mid s_t)}\left[\mathbb{E}_{s' \sim p(\cdot \mid s_t, a_t)}[W_q(u(s'))]\right] \\
&= \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}_{\tau_{t+1} \sim p_{\tau_{t+1}}(\cdot \mid \pi, s_0 = s)}[W_q(u(s_{t+1}))] \\
&= \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}_{\tau \sim p_\tau(\cdot \mid \pi, s_0 = s)}[W_q(u(s_{t+1}))] \\
&= \mathbb{E}_{\tau \sim p_\tau(\cdot \mid \pi, s_0 = s)}\left[\sum_{t=0}^{\infty} \gamma^t W_q(u(s_{t+1}))\right] \\
&= W^\pi(s; \mathcal{M}_{\mathcal{I}}). \qquad \text{∎}
\end{align*}

See Lemma 5.

Proof. To simplify the notation, we write $S = W_q(X_n)$ and $\mu = W_q(X)$, such that
\[
S^q = \frac{1}{n} \sum_{x_i \in X_n} x_i^q, \qquad \mu^q = \frac{1}{N} \sum_{x_i \in X} x_i^q \qquad \text{for } q \in \mathbb{R}^*,
\]
\[
\log S = \frac{1}{n} \sum_{x_i \in X_n} \log x_i, \qquad \log \mu = \frac{1}{N} \sum_{x_i \in X} \log x_i \qquad \text{for } q = 0.
\]

For $q \in \mathbb{R}$, we have
\begin{align}
\mathbb{P}[|S - \mu| \geq \varepsilon] &= 1 - \mathbb{P}[\mu - \varepsilon < S < \mu + \varepsilon] \nonumber \\
&= \begin{cases}
1 - \mathbb{P}[(\mu + \varepsilon)^q < S^q < (\mu - \varepsilon)^q] & q < 0 \\
1 - \mathbb{P}[\log(\mu - \varepsilon) < \log S < \log(\mu + \varepsilon)] & q = 0 \\
1 - \mathbb{P}[(\mu - \varepsilon)^q < S^q < (\mu + \varepsilon)^q] & q > 0
\end{cases} \tag{21}
\end{align}

Therefore:

• For $q < 0$:
\begin{align}
\mathbb{P}[(\mu + \varepsilon)^q < S^q < (\mu - \varepsilon)^q] &= \mathbb{P}\left[\left(1 + \tfrac{\varepsilon}{\mu}\right)^q \mu^q < S^q < \left(1 - \tfrac{\varepsilon}{\mu}\right)^q \mu^q\right] \nonumber \\
&\geq \mathbb{P}\left[\left(1 - (1 - 2^q)\tfrac{\varepsilon}{\mu}\right) \mu^q < S^q < \left(1 + (1 - 2^q)\tfrac{\varepsilon}{\mu}\right) \mu^q\right] \nonumber \\
&= \mathbb{P}\left[\mu^q - (1 - 2^q)\mu^{q-1}\varepsilon < S^q < \mu^q + (1 - 2^q)\mu^{q-1}\varepsilon\right] \nonumber \\
&\geq \mathbb{P}\left[\mu^q - (1 - 2^q)b^{q-1}\varepsilon < S^q < \mu^q + (1 - 2^q)b^{q-1}\varepsilon\right] \nonumber \\
&= \mathbb{P}\left[|S^q - \mu^q| < (1 - 2^q)b^{q-1}\varepsilon\right] \tag{22}
\end{align}
where we have used the following approximation (holding for $0 < x \leq 1$ and $q < 0$):
\[
(1 + x)^q \leq 1 - (1 - 2^q)x < 1 < 1 + (1 - 2^q)x \leq (1 - x)^q.
\]
Combining Eq. (21) and (22), and using the Hoeffding–Serfling inequality (Lemma 9) after observing that $b^q \leq x_i^q \leq a^q$ for all $i$, we get
\begin{align*}
\mathbb{P}[|S - \mu| \geq \varepsilon] &\leq \mathbb{P}\left[|S^q - \mu^q| \geq (1 - 2^q)b^{q-1}\varepsilon\right] \\
&\leq 2\exp\left(-\frac{2n(1 - 2^q)^2 b^{2q-2}\varepsilon^2}{\left(1 - \frac{n}{N}\right)\left(1 + \frac{1}{m}\right)(a^q - b^q)^2}\right).
\end{align*}
• For $q = 0$:
\begin{align}
\mathbb{P}[\log(\mu - \varepsilon) < \log S < \log(\mu + \varepsilon)] &= \mathbb{P}\left[\log\left(1 - \tfrac{\varepsilon}{\mu}\right) < \log S - \log \mu < \log\left(1 + \tfrac{\varepsilon}{\mu}\right)\right] \nonumber \\
&\geq \mathbb{P}\left[-\tfrac{\varepsilon}{\mu + \varepsilon} < \log S - \log \mu < \tfrac{\varepsilon}{\mu + \varepsilon}\right] \nonumber \\
&\geq \mathbb{P}\left[-\tfrac{\varepsilon}{b + \varepsilon} < \log S - \log \mu < \tfrac{\varepsilon}{b + \varepsilon}\right], \tag{23}
\end{align}
where we have used the following approximation (holding for $0 < x < 1$):
\[
\log(1 - x) \leq -\frac{x}{1 + x} < 0 < \frac{x}{1 + x} \leq \log(1 + x).
\]
Combining Eq. (21) and (23) and using the Hoeffding–Serfling inequality (Lemma 9) after observing that $\log a \leq \log x_i \leq \log b$ for all $i$, we get
\begin{align*}
\mathbb{P}[|S - \mu| \geq \varepsilon] &\leq \mathbb{P}\left[|\log S - \log \mu| \geq \frac{\varepsilon}{b + \varepsilon}\right] \\
&\leq 2\exp\left(-\frac{2n\varepsilon^2}{\left(1 - \frac{n}{N}\right)\left(1 + \frac{1}{m}\right)(b + \varepsilon)^2(\log b - \log a)^2}\right).
\end{align*}
• For $0 < q < 1$:
\begin{align}
\mathbb{P}[(\mu - \varepsilon)^q < S^q < (\mu + \varepsilon)^q] &= \mathbb{P}\left[\left(1 - \tfrac{\varepsilon}{\mu}\right)^q \mu^q < S^q < \left(1 + \tfrac{\varepsilon}{\mu}\right)^q \mu^q\right] \nonumber \\
&\geq \mathbb{P}\left[-\frac{q\varepsilon\mu^q}{\mu + (1 - q)\varepsilon} < S^q - \mu^q < \frac{q\varepsilon\mu^q}{\mu + (1 - q)\varepsilon}\right] \nonumber \\
&\geq \mathbb{P}\left[-\frac{q\varepsilon\mu^q}{b + (1 - q)\varepsilon} < S^q - \mu^q < \frac{q\varepsilon\mu^q}{b + (1 - q)\varepsilon}\right] \nonumber \\
&\geq \mathbb{P}\left[-\frac{qa^q\varepsilon}{b + (1 - q)\varepsilon} < S^q - \mu^q < \frac{qa^q\varepsilon}{b + (1 - q)\varepsilon}\right] \nonumber \\
&= \mathbb{P}\left[|S^q - \mu^q| < \frac{qa^q\varepsilon}{b + (1 - q)\varepsilon}\right] \tag{24}
\end{align}
where we have used the following approximation (holding for $0 < x \leq 1$ and $0 < q < 1$):
\[
(1 - x)^q \leq 1 - \frac{qx}{1 + (1 - q)x} < 1 < 1 + \frac{qx}{1 + (1 - q)x} \leq (1 + x)^q.
\]
Combining Eq. (21) and (24), and using the Hoeffding–Serfling inequality (Lemma 9) after observing that $a^q \leq x_i^q \leq b^q$ for all $i$, we get
\begin{align*}
\mathbb{P}[|S - \mu| \geq \varepsilon] &\leq \mathbb{P}\left[|S^q - \mu^q| \geq \frac{qa^q\varepsilon}{b + (1 - q)\varepsilon}\right] \\
&\leq 2\exp\left(-\frac{2nq^2 a^{2q}\varepsilon^2}{\left(1 - \frac{n}{N}\right)\left(1 + \frac{1}{m}\right)(b + (1 - q)\varepsilon)^2(b^q - a^q)^2}\right).
\end{align*}

• For $q = 1$: We directly apply the Hoeffding–Serfling inequality (Lemma 9) to obtain
\[
\mathbb{P}[|S - \mu| \geq \varepsilon] \leq 2\exp\left(-\frac{2n\varepsilon^2}{\left(1 - \frac{n}{N}\right)\left(1 + \frac{1}{m}\right)(b - a)^2}\right).
\]
• For $q > 1$:
\begin{align}
\mathbb{P}[(\mu - \varepsilon)^q < S^q < (\mu + \varepsilon)^q] &= \mathbb{P}\left[\left(1 - \tfrac{\varepsilon}{\mu}\right)^q \mu^q < S^q < \left(1 + \tfrac{\varepsilon}{\mu}\right)^q \mu^q\right] \nonumber \\
&\geq \mathbb{P}\left[\left(1 - \frac{q\varepsilon}{\mu + q\varepsilon}\right)\mu^q < S^q < \left(1 + \frac{q\varepsilon}{\mu + q\varepsilon}\right)\mu^q\right] \nonumber \\
&\geq \mathbb{P}\left[\left(1 - \frac{q\varepsilon}{b + q\varepsilon}\right)\mu^q < S^q < \left(1 + \frac{q\varepsilon}{b + q\varepsilon}\right)\mu^q\right] \nonumber \\
&\geq \mathbb{P}\left[\mu^q - \frac{qa^q\varepsilon}{b + q\varepsilon} < S^q < \mu^q + \frac{qa^q\varepsilon}{b + q\varepsilon}\right] \nonumber \\
&= \mathbb{P}\left[|S^q - \mu^q| < \frac{qa^q\varepsilon}{b + q\varepsilon}\right] \tag{25}
\end{align}
where we have used the following approximation (holding for $0 < x \leq 1$ and $q > 1$):
\[
(1 - x)^q \leq 1 - \frac{qx}{1 + qx} < 1 < 1 + \frac{qx}{1 + qx} \leq (1 + x)^q.
\]
Combining Eq. (21) and (25), and using the Hoeffding–Serfling inequality (Lemma 9) after observing that $a^q \leq x_i^q \leq b^q$ for all $i$, we get
\begin{align*}
\mathbb{P}[|S - \mu| \geq \varepsilon] &\leq \mathbb{P}\left[|S^q - \mu^q| \geq \frac{qa^q\varepsilon}{b + q\varepsilon}\right] \\
&\leq 2\exp\left(-\frac{2nq^2 a^{2q}\varepsilon^2}{\left(1 - \frac{n}{N}\right)\left(1 + \frac{1}{m}\right)(b + q\varepsilon)^2(b^q - a^q)^2}\right).
\end{align*}

• For $q = +\infty$: Since we make no assumptions on the distribution of the $x_i$ other than $a \leq x_i \leq b$, we can only guarantee that $|\max_{x_i \in X_n} x_i - \max_{x_i \in X} x_i| < \varepsilon$ if $x_{max} = \arg\max_{x_i \in X} x_i$ is sampled in $X_n$. This happens with probability at least $\frac{n}{N}$ (the maximum might not be unique), and thus
\[
\mathbb{P}\left[\left|\max_{x_i \in X_n} x_i - \max_{x_i \in X} x_i\right| \geq \varepsilon\right] \leq 1 - \frac{n}{N}.
\]

• For $q = -\infty$: Same analysis as for $q = +\infty$. ∎

See Lemma 6.

Proof.
\begin{align*}
\left|\mathbb{E}_{s \sim p}[f(s)] - \mathbb{E}_{s \sim \hat{p}}[f(s)]\right| &= \left|\int_S f(s)p(s)\,ds - \int_S f(s)\hat{p}(s)\,ds\right| \\
&= \left|\int_S (f(s) - f_{min})p(s)\,ds - \int_S (f(s) - f_{min})\hat{p}(s)\,ds\right| \\
&= \left|\int_S (f(s) - f_{min})(p(s) - \hat{p}(s))\,ds\right| \\
&\leq \int_S |f(s) - f_{min}|\,|p(s) - \hat{p}(s)|\,ds \\
&\leq (f_{max} - f_{min}) \int_S |p(s) - \hat{p}(s)|\,ds \\
&= 2(f_{max} - f_{min})\,\delta(p, \hat{p}),
\end{align*}
where the second equality holds because the constant $f_{min}$ term cancels ($p$ and $\hat{p}$ both integrate to one), and where we have used the definition of the total variation distance: $\delta(p, \hat{p}) = \frac{1}{2}\int_S |p(s) - \hat{p}(s)|\,ds$. By Pinsker's inequality, we have
\[
\delta(p, \hat{p}) \leq \sqrt{\tfrac{1}{2} D_{KL}(p \,\|\, \hat{p})}.
\]
Additionally, by Bretagnolle and Huber's inequality:
\[
\delta(p, \hat{p}) \leq \sqrt{1 - e^{-D_{KL}(p \,\|\, \hat{p})}}.
\]
The result follows from the assumption $D_{KL}(p \,\|\, \hat{p}) \leq d$. ∎

See Lemma 7.

Proof. First, note that if $|Q^*(s, a) - \hat{Q}(s, a)| \leq \varepsilon$ with probability at least $1 - \delta$ for all state-action pairs $(s, a)$, then $Q^*(s, \pi^*(s)) - Q^*(s, \pi_{\hat{Q}}(s)) \leq 2\varepsilon$ with probability at least $1 - 2\delta$, since
\[
Q^*(s, \pi^*(s)) \leq \hat{Q}(s, \pi^*(s)) + \varepsilon \leq \hat{Q}(s, \pi_{\hat{Q}}(s)) + \varepsilon \leq [Q^*(s, \pi_{\hat{Q}}(s)) + \varepsilon] + \varepsilon.
\]
The factor $2$ in the probability comes from the fact that we need the $\hat{Q}$ estimates to be accurate for both actions $\pi^*(s)$ and $\pi_{\hat{Q}}(s)$. For the first inequality, note that, with probability at least $1 - 2k\delta$, the first $k$ estimates of $\hat{Q}$ are $2\varepsilon$-accurate for actions $\pi^*(s_t)$ and $\pi_{\hat{Q}}(s_t)$, $t = 0, \ldots, k-1$. If this happens, then we have
\begin{align*}
V^*(s) - V^{\pi_{\hat{Q}}}(s) &= Q^*(s, \pi^*(s)) - Q^{\pi_{\hat{Q}}}(s, \pi_{\hat{Q}}(s)) \\
&\leq 2\varepsilon + Q^*(s, \pi_{\hat{Q}}(s)) - Q^{\pi_{\hat{Q}}}(s, \pi_{\hat{Q}}(s)) \\
&= 2\varepsilon + \gamma\,\mathbb{E}_{s' \sim p(\cdot \mid s, \pi_{\hat{Q}}(s))}\left[V^*(s') - V^{\pi_{\hat{Q}}}(s')\right] \\
&\leq 2\varepsilon \sum_{t=0}^{k-1} \gamma^t + \gamma^k (V_{max} - V_{min}) \\
&\leq \frac{2\varepsilon}{1 - \gamma} + \gamma^k (V_{max} - V_{min}),
\end{align*}
where the penultimate inequality follows from unrolling the recursion $k$ times and bounding the remaining value gap by $V_{max} - V_{min}$, which concludes the first part of the proof.
Concerning the second inequality, since $V_{min} \leq Q^*(s, a) \leq V_{max}$, we have
\[
\mathbb{E}_{\hat{Q}}\left[Q^*(s, \pi_{\hat{Q}}(s))\right] \geq (1 - 2\delta)\left[Q^*(s, \pi^*(s)) - 2\varepsilon\right] + 2\delta V_{min} \geq Q^*(s, \pi^*(s)) - (2\varepsilon + 2\delta(V_{max} - V_{min})).
\]
Let $\pi_j$ be a policy that replicates $\pi_{\hat{Q}}$ for the first $j$ actions and that is optimal from action $j+1$ onward. We now show by induction that $V^{\pi_j}(s) \geq V^*(s) - \lambda_j$ for all $s$, where $\lambda = \lambda_1 = 2\varepsilon + 2\delta(V_{max} - V_{min})$ and $\lambda_j = \lambda + \gamma\lambda_{j-1}$ for $j > 1$. This clearly holds for $j = 1$:
\begin{align*}
V^{\pi_1}(s) &= \mathbb{E}_{\hat{Q}}\left[r(s, \pi_{\hat{Q}}(s)) + \gamma\,\mathbb{E}_{s' \sim p(\cdot \mid s, \pi_{\hat{Q}}(s))}[V^*(s')]\right] \\
&= \mathbb{E}_{\hat{Q}}\left[Q^*(s, \pi_{\hat{Q}}(s))\right] \\
&\geq Q^*(s, \pi^*(s)) - \lambda \\
&= V^*(s) - \lambda_1.
\end{align*}
For $j>1$, assuming that the statement holds for $j-1$, we have
$$\begin{aligned}
V^{\pi_j}(s)&=\mathbb{E}_{\hat{Q}}\!\left[r(s,\pi_{\hat{Q}}(s))+\gamma\,\mathbb{E}_{s'\sim p(\cdot|s,\pi_{\hat{Q}}(s))}[V^{\pi_{j-1}}(s')]\right]\\
&\ge\mathbb{E}_{\hat{Q}}\!\left[r(s,\pi_{\hat{Q}}(s))+\gamma\,\mathbb{E}_{s'\sim p(\cdot|s,\pi_{\hat{Q}}(s))}[V^*(s')-\lambda_{j-1}]\right]\\
&=\mathbb{E}_{\hat{Q}}\!\left[r(s,\pi_{\hat{Q}}(s))+\gamma\,\mathbb{E}_{s'\sim p(\cdot|s,\pi_{\hat{Q}}(s))}[V^*(s')]\right]-\gamma\lambda_{j-1}\\
&=\mathbb{E}_{\hat{Q}}\!\left[Q^*(s,\pi_{\hat{Q}}(s))\right]-\gamma\lambda_{j-1}\\
&\ge Q^*(s,\pi^*(s))-\lambda-\gamma\lambda_{j-1}\\
&=V^*(s)-\lambda_j.
\end{aligned}$$
We now show that $\lim_{j\to\infty}V^{\pi_j}(s)=V^{\pi_{\hat{Q}}}(s)$ for all $s$.
Noting that $V^{\pi_j}(s)\ge V^{\pi_{\hat{Q}}}(s)$ (due to the optimality of $\pi_j$ after $j$ steps), we have
$$\begin{aligned}
0\le V^{\pi_j}(s)-V^{\pi_{\hat{Q}}}(s)&=\mathbb{E}_{\tau\sim p_\tau(\cdot|\pi_j,s_0=s)}\!\left[\sum_{i=0}^{\infty}\gamma^i r(s_i,a_i)\right]-\mathbb{E}_{\tau\sim p_\tau(\cdot|\pi_{\hat{Q}},s_0=s)}\!\left[\sum_{i=0}^{\infty}\gamma^i r(s_i,a_i)\right]\\
&=\mathbb{E}_{\tau\sim p_\tau(\cdot|\pi_j,s_0=s)}\!\left[\sum_{i=j}^{\infty}\gamma^i r(s_i,a_i)\right]-\mathbb{E}_{\tau\sim p_\tau(\cdot|\pi_{\hat{Q}},s_0=s)}\!\left[\sum_{i=j}^{\infty}\gamma^i r(s_i,a_i)\right]\\
&\le\sum_{i=j}^{\infty}\gamma^i\left(R_{max}-R_{min}\right)\\
&=\gamma^j\,\frac{R_{max}-R_{min}}{1-\gamma}\;\xrightarrow{j\to\infty}\;0,
\end{aligned}$$
where the second equality holds because $\pi_j$ and $\pi_{\hat{Q}}$ coincide on the first $j$ actions, so the first $j$ terms of the two expectations cancel. Therefore, by the squeeze theorem, we have $\lim_{j\to\infty}V^{\pi_j}(s)-V^{\pi_{\hat{Q}}}(s)=0$.
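As a quick numerical check of the recursion used in this induction (illustrative values only): $\lambda_j=\lambda+\gamma\lambda_{j-1}$ with $\lambda_1=\lambda$ has the closed form $\lambda_j=\lambda(1-\gamma^j)/(1-\gamma)$, which increases monotonically toward $\lambda/(1-\gamma)$:

```python
# Closed form of the recursion lambda_j = lambda + gamma * lambda_{j-1},
# lambda_1 = lambda; hypothetical values for lambda and gamma.
lam, gamma = 0.3, 0.8
vals = [lam]
for _ in range(30):
    vals.append(lam + gamma * vals[-1])
for j, v in enumerate(vals, start=1):
    assert abs(v - lam * (1 - gamma**j) / (1 - gamma)) < 1e-12
assert vals[-1] < lam / (1 - gamma)   # monotone approach to the limit
```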
Finally, noting that $\lim_{j\to\infty}\lambda_j=\sum_{j=0}^{\infty}\gamma^j\lambda=\frac{\lambda}{1-\gamma}$, we have for all $s$:
$$V^{\pi_{\hat{Q}}}(s)=\lim_{j\to\infty}V^{\pi_j}(s)\ge V^*(s)-\lim_{j\to\infty}\lambda_j=V^*(s)-\frac{2\varepsilon+2\delta(V_{max}-V_{min})}{1-\gamma}.$$
∎

Lemma 9 (Hoeffding-Serfling Inequalities [4, 33]). Let $X=\{x_i\}_{i=1}^{N}$ be a finite set of $N>1$ real points and $X_n=\{X_j\}_{j=1}^{n}$ a subset of size $n<N$ sampled uniformly at random without replacement from $X$. Additionally, denote $\mu=\frac{1}{N}\sum_{i=1}^{N}x_i$, $a=\min_i x_i$ and $b=\max_i x_i$. Then, for $m=\min(n,N-n)$ and any $\varepsilon>0$:
$$\mathbb{P}\!\left[\left|\frac{1}{n}\sum_{j=1}^{n}X_j-\mu\right|\ge\varepsilon\right]\le 2\exp\!\left\{-\frac{2\varepsilon^2 n}{\left(1-\frac{n}{N}\right)\left(1+\frac{1}{m}\right)(b-a)^2}\right\}.$$

Proof.
First, we show the slightly more general result:
$$\text{1)}\quad\mathbb{P}\!\left[\frac{1}{n}\sum_{j=1}^{n}X_j-\mu\ge\varepsilon\right]\le\exp\!\left\{-\frac{2\varepsilon^2 n}{\left(1-\frac{n}{N}\right)\left(1+\frac{1}{n}\right)(b-a)^2}\right\},$$
$$\text{2)}\quad\mathbb{P}\!\left[\frac{1}{n}\sum_{j=1}^{n}X_j-\mu\le-\varepsilon\right]\le\exp\!\left\{-\frac{2\varepsilon^2 n}{\left(1-\frac{n}{N}\right)\left(1+\frac{1}{N-n}\right)(b-a)^2}\right\}.$$
The proof of 1) is similar to the one proposed in [4]: Let $Z_n=\frac{1}{n}\sum_{j=1}^{n}X_j-\mu$. We have for any $\lambda>0$:
$$\mathbb{P}[Z_n\ge\varepsilon]=\mathbb{P}\!\left[e^{\lambda nZ_n}\ge e^{\lambda n\varepsilon}\right]\le\frac{\mathbb{E}\!\left[e^{\lambda nZ_n}\right]}{e^{\lambda n\varepsilon}}\le\exp\!\left\{\frac{1}{8}(b-a)^2\lambda^2(n+1)\left(1-\frac{n}{N}\right)-\lambda n\varepsilon\right\},$$
where we have used Markov's inequality along with Proposition 2.3 from [4] (slightly improving the original result proposed by Serfling [33]).
The result 1) follows by finding the $\lambda$ that minimizes this upper bound, i.e.,
$$\lambda=\frac{4n\varepsilon}{(n+1)\left(1-\frac{n}{N}\right)(b-a)^2}.$$
For 2), we note that sampling $X_n$ is equivalent to sampling $X'_{N-n}=X\setminus X_n$. Therefore:
$$\begin{aligned}
\mathbb{P}\!\left[\frac{1}{n}\sum_{j=1}^{n}X_j-\mu\le-\varepsilon\right]&=\mathbb{P}\!\left[\frac{1}{n}\left(\sum_{j=1}^{n}X_j-n\mu\right)\le-\varepsilon\right]\\
&=\mathbb{P}\!\left[\frac{1}{n}\left(N\mu-\sum_{j=1}^{N-n}X'_j-n\mu\right)\le-\varepsilon\right]\\
&=\mathbb{P}\!\left[\frac{N-n}{n}\left(\mu-\frac{1}{N-n}\sum_{j=1}^{N-n}X'_j\right)\le-\varepsilon\right]\\
&=\mathbb{P}\!\left[\frac{1}{N-n}\sum_{j=1}^{N-n}X'_j-\mu\ge\frac{n\varepsilon}{N-n}\right]\\
&\le\exp\!\left\{-\frac{2\varepsilon^2 n}{\left(1-\frac{n}{N}\right)\left(1+\frac{1}{N-n}\right)(b-a)^2}\right\},
\end{aligned}$$
where we have used 1) in the last step.
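Lemma 9's two-sided bound can be spot-checked by simulation. The following sketch (hypothetical population and sample sizes) draws many size-$n$ subsets without replacement from a fixed population and compares the empirical deviation frequency with the bound:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, eps, trials = 50, 20, 0.15, 20000   # hypothetical sizes
x = rng.uniform(0.0, 1.0, size=N)
mu, a, b = x.mean(), x.min(), x.max()
m = min(n, N - n)

# Empirical frequency of a deviation of at least eps when sampling
# n points without replacement from the fixed population x.
hits = 0
for _ in range(trials):
    sample = rng.choice(x, size=n, replace=False)
    if abs(sample.mean() - mu) >= eps:
        hits += 1
emp = hits / trials

# Two-sided Hoeffding-Serfling bound from Lemma 9.
bound = 2 * np.exp(-2 * eps**2 * n / ((1 - n / N) * (1 + 1 / m) * (b - a) ** 2))
assert emp <= bound
```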
Lastly, the final result follows from Boole's inequality (union bound) between 1) and 2). ∎

A.2.2 Existence of PAA and A policies

See 3

Proof. Similarly to [19], the core idea (and challenge) of the proof is to bound $|Q^*(s,a)-\hat{Q}^H(s,a)|$ for all state-action pairs so that Lemma 7 applies. However, in contrast with [19], we must now deal with an approximate dynamics model, as well as an approximate reward, which significantly complicates the task. To do so, we write the following for $h>0$ (omitting the dependency on $K$, $C$ and $I_n$ in the notation):
$$\begin{aligned}
Q^*(s,a)-\hat{Q}^h(s,a)&=r_{\mathcal{I}}(s,a)-\hat{r}_{I_n}(s,a)+\gamma\left(\mathbb{E}_{s'\sim p(\cdot|s,a)}[V^*(s')]-\hat{\mathbb{E}}^{C}_{s'\sim\hat{p}(\cdot|s,a)}[\hat{V}^{h-1}(s')]\right)\\
&=\mathbb{E}_{s'\sim p(\cdot|s,a)}\!\left[W_q(u(s');\mathcal{I})\right]-\hat{\mathbb{E}}^{K}_{s'\sim\hat{p}(\cdot|s,a)}\!\left[W_q(u(s');I_n)\right]\\
&\quad+\gamma\left(\mathbb{E}_{s'\sim p(\cdot|s,a)}[V^*(s')]-\hat{\mathbb{E}}^{C}_{s'\sim\hat{p}(\cdot|s,a)}[\hat{V}^{h-1}(s')]\right)\\
&=\mathbb{E}_{s'\sim p(\cdot|s,a)}\!\left[W_q(u(s');\mathcal{I})\right]-\mathbb{E}_{s'\sim p(\cdot|s,a)}\!\left[W_q(u(s');I_n)\right]\\
&\quad+\mathbb{E}_{s'\sim p(\cdot|s,a)}\!\left[W_q(u(s');I_n)\right]-\mathbb{E}_{s'\sim\hat{p}(\cdot|s,a)}\!\left[W_q(u(s');I_n)\right]\\
&\quad+\mathbb{E}_{s'\sim\hat{p}(\cdot|s,a)}\!\left[W_q(u(s');I_n)\right]-\hat{\mathbb{E}}^{K}_{s'\sim\hat{p}(\cdot|s,a)}\!\left[W_q(u(s');I_n)\right]\\
&\quad+\gamma\left(\mathbb{E}_{s'\sim p(\cdot|s,a)}[V^*(s')]-\mathbb{E}_{s'\sim\hat{p}(\cdot|s,a)}[V^*(s')]\right)\\
&\quad+\gamma\left(\mathbb{E}_{s'\sim\hat{p}(\cdot|s,a)}[V^*(s')]-\hat{\mathbb{E}}^{C}_{s'\sim\hat{p}(\cdot|s,a)}[V^*(s')]\right)\\
&\quad+\gamma\left(\hat{\mathbb{E}}^{C}_{s'\sim\hat{p}(\cdot|s,a)}[V^*(s')]-\hat{\mathbb{E}}^{C}_{s'\sim\hat{p}(\cdot|s,a)}[\hat{V}^{h-1}(s')]\right),
\end{aligned}$$
such that, for any state-action pair,
$$\begin{aligned}
|Q^*(s,a)-\hat{Q}^h(s,a)|&\le\mathbb{E}_{s'\sim p(\cdot|s,a)}\Big[\underbrace{|W_q(u(s');\mathcal{I})-W_q(u(s');I_n)|}_{Z_1}\Big]\\
&\quad+\underbrace{\left|\mathbb{E}_{s'\sim p(\cdot|s,a)}\!\left[W_q(u(s');I_n)\right]-\mathbb{E}_{s'\sim\hat{p}(\cdot|s,a)}\!\left[W_q(u(s');I_n)\right]\right|}_{Z_2}\\
&\quad+\underbrace{\left|\mathbb{E}_{s'\sim\hat{p}(\cdot|s,a)}\!\left[W_q(u(s');I_n)\right]-\hat{\mathbb{E}}^{K}_{s'\sim\hat{p}(\cdot|s,a)}\!\left[W_q(u(s');I_n)\right]\right|}_{Z_3}\\
&\quad+\gamma\underbrace{\left|\mathbb{E}_{s'\sim p(\cdot|s,a)}[V^*(s')]-\mathbb{E}_{s'\sim\hat{p}(\cdot|s,a)}[V^*(s')]\right|}_{Z_4}\\
&\quad+\gamma\underbrace{\left|\mathbb{E}_{s'\sim\hat{p}(\cdot|s,a)}[V^*(s')]-\hat{\mathbb{E}}^{C}_{s'\sim\hat{p}(\cdot|s,a)}[V^*(s')]\right|}_{Z_5}\\
&\quad+\gamma\underbrace{\hat{\mathbb{E}}^{C}_{s'\sim\hat{p}(\cdot|s,a)}\Big[|V^*(s')-\hat{V}^{h-1}(s')|\Big]}_{Z_6},
\end{aligned}$$
where we have used the triangle inequality, the linearity of expectation and the fact that $|\mathbb{E}[X]|\le\mathbb{E}[|X|]$ for any random variable $X$.
From Lemma 5, we have that $Z_1\le\varepsilon_1$ with probability at least $1-\delta_1$, where
$$\delta_1\triangleq 2\exp\!\left(-\frac{2n\varepsilon_1^2}{\left(1-\frac{n}{N}\right)\left(1+\frac{1}{m}\right)}\,\Gamma(\varepsilon_1,U_{min},U_{max},q)\right)\le 2\exp\!\left(-\frac{n\varepsilon_1^2}{1-\frac{n}{N}}\,\Gamma(\varepsilon_1,U_{min},U_{max},q)\right).$$
Recall that $U_{min}\le W_q(u(s))\le U_{max}$ and $V_{min}=\frac{U_{min}}{1-\gamma}\le V^*(s)\le\frac{U_{max}}{1-\gamma}=V_{max}$ for any state $s$ and utility profile $u$.
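The relaxation of $\delta_1$ above uses only $1+\frac{1}{m}\le 2$ for $m\ge 1$; a quick numerical confirmation with arbitrary illustrative values, where $G$ stands in for the (here unspecified) value of $\Gamma(\varepsilon_1,U_{min},U_{max},q)$:

```python
import math

# Check that the relaxed form of delta_1 is indeed an upper bound: since
# m >= 1 implies (1 + 1/m) <= 2, the tighter exponent is more negative.
for n, N, eps_1, G in [(10, 50, 0.1, 1.0), (30, 40, 0.2, 0.5)]:
    m = min(n, N - n)
    tight = 2 * math.exp(-2 * n * eps_1**2 * G / ((1 - n / N) * (1 + 1 / m)))
    loose = 2 * math.exp(-n * eps_1**2 * G / (1 - n / N))
    assert tight <= loose
```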
Therefore, from Lemma 6, we have (with probability 1)
$$Z_2\le 2\Delta U\sqrt{\min\left\{\frac{d}{2},\,1-e^{-d}\right\}}\triangleq\varepsilon_2\qquad\text{and}\qquad Z_4\le\frac{2\Delta U}{1-\gamma}\sqrt{\min\left\{\frac{d}{2},\,1-e^{-d}\right\}}\triangleq\varepsilon_4,\tag{26}$$
where $\Delta U\triangleq U_{max}-U_{min}$ and $d\triangleq\sup_{(s,a)}D_{KL}\!\left(p(\cdot|s,a)\,\|\,\hat{p}(\cdot|s,a)\right)$. Furthermore, by the standard Hoeffding inequality, we have $Z_3\le\varepsilon_3$ with probability at least $1-\delta_3$, where
$$\delta_3\triangleq 2\exp\!\left(-\frac{2K\varepsilon_3^2}{\Delta U^2}\right),$$
and $Z_5\le\varepsilon_5$ with probability at least $1-\delta_5$, where
$$\delta_5\triangleq 2\exp\!\left(-\frac{2C(1-\gamma)^2\varepsilon_5^2}{\Delta U^2}\right).$$
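The shape of $\varepsilon_2$ and $\varepsilon_4$ appears to reflect the standard Pinsker ($d/2$) and Bretagnolle-Huber ($1-e^{-d}$) bounds on total variation in terms of KL divergence. A numerical spot check on random categorical distributions, with $f$ a bounded utility-like function (all values here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
k = 8
p = rng.dirichlet(np.ones(k))       # "true" next-state distribution (toy)
q = rng.dirichlet(np.ones(k))       # "approximate" model (toy)
f = rng.uniform(2.0, 5.0, size=k)   # bounded function, range dU
dU = f.max() - f.min()

d = float(np.sum(p * np.log(p / q)))    # KL(p || q)
gap = abs(float(p @ f - q @ f))
# Expectation gap bounded as in eps_2: 2 * dU * sqrt(min{d/2, 1 - e^-d}).
bound = 2 * dU * np.sqrt(min(d / 2, 1 - np.exp(-d)))
assert gap <= bound + 1e-12
```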
Finally, we have
$$\begin{aligned}
\hat{\mathbb{E}}^{C}_{s'\sim\hat{p}(\cdot|s,a)}\Big[|V^*(s')-\hat{V}^{h-1}(s')|\Big]&=\frac{1}{C}\sum_{i=1}^{C}|V^*(s_i)-\hat{V}^{h-1}(s_i)|\\
&=\frac{1}{C}\sum_{i=1}^{C}\Big|\max_a Q^*(s_i,a)-\max_a\hat{Q}^{h-1}(s_i,a)\Big|\\
&\le\frac{1}{C}\sum_{i=1}^{C}|Q^*(s_i,\tilde{a}_i)-\hat{Q}^{h-1}(s_i,\tilde{a}_i)|
\end{aligned}$$
with
$$\tilde{a}_i\triangleq\begin{cases}\arg\max_a Q^*(s_i,a)&\text{if }\max_a Q^*(s_i,a)\ge\max_a\hat{Q}^{h-1}(s_i,a)\\[2pt]\arg\max_a\hat{Q}^{h-1}(s_i,a)&\text{if }\max_a Q^*(s_i,a)<\max_a\hat{Q}^{h-1}(s_i,a),\end{cases}$$
and where the last inequality is obtained after a careful analysis of the absolute value operator. This suggests that we can proceed by induction on $h$. Let $\alpha_0\triangleq\frac{\Delta U}{1-\gamma}$ and $\alpha_h\triangleq\varepsilon_1+\varepsilon_2+\varepsilon_3+\gamma(\varepsilon_4+\varepsilon_5+\alpha_{h-1})$ for $h>0$, and let $\phi_0\triangleq 0$ and $\phi_h\triangleq\delta_1+\delta_3+\delta_5+C|A|\phi_{h-1}$ for $h>0$. We start the induction by noting that, for any state-action pair,
$$|Q^*(s,a)-\hat{Q}^0(s,a)|=|Q^*(s,a)|\le\frac{\Delta U}{1-\gamma}=\alpha_0\qquad\text{with probability }1=1-\phi_0.$$
For $h>1$, assuming that, for any state-action pair, $|Q^*(s,a)-\hat{Q}^{h-1}(s,a)|\le\alpha_{h-1}$ with probability at least $1-\phi_{h-1}$, we have from above
$$|Q^*(s,a)-\hat{Q}^h(s,a)|\le\varepsilon_1+\varepsilon_2+\varepsilon_3+\gamma(\varepsilon_4+\varepsilon_5+\alpha_{h-1})=\alpha_h\tag{27}$$
with probability at least $1-\delta_1-\delta_3-\delta_5-C|A|\phi_{h-1}=1-\phi_h$, as we require that all $C$ estimates of $\hat{Q}^{h-1}$ are accurate for each action. Solving for $\alpha_H$, we get
$$\begin{aligned}
\alpha_H&=\sum_{i=0}^{H-1}\gamma^i\big(\varepsilon_1+\varepsilon_2+\varepsilon_3+\gamma(\varepsilon_4+\varepsilon_5)\big)+\gamma^H\frac{\Delta U}{1-\gamma}\\
&=\big(\varepsilon_1+\varepsilon_2+\varepsilon_3+\gamma(\varepsilon_4+\varepsilon_5)\big)\frac{1-\gamma^H}{1-\gamma}+\gamma^H\frac{\Delta U}{1-\gamma}\\
&\le\frac{\varepsilon_1+\varepsilon_2+\varepsilon_3+\gamma(\varepsilon_4+\varepsilon_5)+\gamma^H\Delta U}{1-\gamma}.
\end{aligned}\tag{28}$$
Note that $\varepsilon_2$ and $\varepsilon_4$ in Eq. (26) are non-controllable and only depend on the accuracy of the environment model $\hat{p}$. If we require $|V^*(s)-V^{\pi_{PAA}}(s)|\le\varepsilon$, then we must have $\frac{2(\varepsilon_2+\gamma\varepsilon_4)}{(1-\gamma)^2}<\varepsilon$ from Lemma 7. Using the definitions of $\varepsilon_2$ and $\varepsilon_4$ from Eq. (26), this condition becomes
$$\min\left\{\frac{d}{2},\,1-e^{-d}\right\}<\frac{(1-\gamma)^6\varepsilon^2}{16\,\Delta U^2}$$
with $d\triangleq\sup_{(s,a)}D_{KL}\!\left(p(\cdot|s,a)\,\|\,\hat{p}(\cdot|s,a)\right)$. Since $\varepsilon\le\frac{\Delta U}{1-\gamma}$ (as the bound would be trivial otherwise), the right-hand side of the inequality is less than $1$, and thus it is less restrictive (on $\hat{p}$) to only bound $\frac{d}{2}$ (as $\frac{x}{2}\le 1-e^{-x}$ for $x\in[0,1]$), which concludes the proof on the required accuracy of the dynamics model. Now that we have ensured that the environment model is accurate enough, we must choose the algorithm's parameters (namely $K$, $C$ and $n$) to ensure that the other inaccuracies do not exceed the remaining "approximation budget"
$$\varepsilon-\frac{2(\varepsilon_2+\gamma\varepsilon_4)}{(1-\gamma)^2}\ge\varepsilon-\frac{\sqrt{8d}\,\Delta U}{(1-\gamma)^3}>0.$$
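The elementary fact used here, $\frac{x}{2}\le 1-e^{-x}$ on $[0,1]$, is easy to confirm numerically:

```python
import numpy as np

# Verify x/2 <= 1 - exp(-x) on a fine grid of [0, 1]; this justifies
# bounding d/2 alone when the right-hand side is below 1.
x = np.linspace(0.0, 1.0, 1001)
assert np.all(x / 2 <= 1 - np.exp(-x) + 1e-12)
```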
Solving for $\phi_H$, we get
$$\begin{aligned}
\phi_H&=(\delta_1+\delta_3+\delta_5)\sum_{i=0}^{H-1}(C|A|)^i\\
&=(\delta_1+\delta_3+\delta_5)\frac{(C|A|)^H-1}{C|A|-1}\\
&\le 2(\delta_1+\delta_3+\delta_5)(C|A|)^{H-1},
\end{aligned}\tag{29}$$
where the last inequality follows from the fact that $C|A|-1\ge\frac{1}{2}C|A|$ as $C\ge 1$ and $|A|\ge 2$. Finally, using part 1) of Lemma 7, we get that for any state $s$, with probability at least $1-2k\phi_H$,
$$\begin{aligned}
V^*(s)-V^{\pi_{PAA}}(s)&\le\frac{2\alpha_H+\gamma^k\Delta U}{1-\gamma}\\
&\le\frac{2}{(1-\gamma)^2}\left(\varepsilon_1+\varepsilon_2+\varepsilon_3+\gamma(\varepsilon_4+\varepsilon_5)+\gamma^H\Delta U+\frac{1-\gamma}{2}\gamma^k\Delta U\right).
\end{aligned}\tag{30}$$
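The last inequality in Eq. (29) can be confirmed exhaustively for small cases, with $c$ standing for $C|A|\ge 2$:

```python
# Geometric-sum bound: (c^H - 1) / (c - 1) <= 2 * c^(H-1) whenever c >= 2,
# since c - 1 >= c/2 in that range.
for c in [2, 3, 10, 100]:
    for H in range(1, 8):
        assert (c**H - 1) / (c - 1) <= 2 * c ** (H - 1)
```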
That is, using the definitions of $\varepsilon_2$ and $\varepsilon_4$, and imposing the tolerance $\varepsilon$, we require
$$\varepsilon_1+\varepsilon_3+\gamma\varepsilon_5+\gamma^H\Delta U+\frac{1-\gamma}{2}\gamma^k\Delta U\le\frac{(1-\gamma)^2}{2}\left(\varepsilon-\frac{\sqrt{8d}\,\Delta U}{(1-\gamma)^3}\right).$$
We fix
$$\varepsilon_1=\varepsilon_3=\gamma\varepsilon_5=\gamma^H\Delta U=\frac{(1-\gamma)^2}{8}\left(\varepsilon-\frac{\sqrt{8d}\,\Delta U}{(1-\gamma)^3}-\frac{1}{1-\gamma}\gamma^k\Delta U\right)\triangleq\beta.$$
From that, we directly obtain the required planning "depth"
$$H=\max\left\{1,\left\lceil\log_\gamma\!\left(\frac{\beta}{\Delta U}\right)\right\rceil\right\},$$
as well as the lower bound $k\ge\log_\gamma\!\left(\frac{(1-\gamma)\varepsilon}{\Delta U}-\frac{\sqrt{8d}}{(1-\gamma)^2}\right)$.
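To illustrate how $\beta$, $k$ and $H$ fall out of these formulas, here is a sketch with hypothetical parameter values, chosen so that the approximation budget is positive (these numbers are not from the paper):

```python
import math

gamma, eps, dU, d = 0.9, 0.5, 1.0, 1e-9   # hypothetical values

slack = eps - math.sqrt(8 * d) * dU / (1 - gamma) ** 3
assert slack > 0                           # the budget must be positive

# Smallest admissible rollout length k from the stated lower bound.
k = math.ceil(math.log((1 - gamma) * eps / dU - math.sqrt(8 * d) / (1 - gamma) ** 2, gamma))
beta = (1 - gamma) ** 2 / 8 * (slack - gamma**k * dU / (1 - gamma))
assert beta > 0
H = max(1, math.ceil(math.log(beta / dU, gamma)))
```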
Additionally, we choose $C=\frac{\gamma^2}{(1-\gamma)^2}K$ such that $\delta_3=\delta_5=2\exp\!\left(-\frac{2K\beta^2}{(\Delta U)^2}\right)$, and we choose $n$ such that $\delta_1\le\delta_3$, or equivalently
$$\frac{n\,\Gamma(\beta,U_{min},U_{max},q)}{1-\frac{n}{N}}\ge\frac{2K}{\Delta U^2}.$$
This is satisfied for
$$n\ge N\left(1+\frac{\Delta U^2\,\Gamma(\beta,U_{min},U_{max},q)\,N}{2K}\right)^{-1}.$$
With this choice of $C$ and $n$, we can write
$$\delta_1+\delta_3+\delta_5\le 6\exp\!\left(-\frac{2K\beta^2}{\Delta U^2}\right).$$
The last step is to choose $K$ such that $2k\phi_H\le\delta$, that is
$$24k(K|A|)^{H-1}\exp\!\left(-\frac{2K\beta^2}{\Delta U^2}\right)\le\delta.\tag{31}$$
If $H=1$ (i.e., $\gamma\le\frac{\beta}{\Delta U}$), we can simply choose
$$K\ge\frac{\Delta U^2}{2\beta^2}\ln\!\left(\frac{24k}{\delta}\right),$$
and if $H>1$ (i.e., $\gamma>\frac{\beta}{\Delta U}$), we can choose
$$K\ge\frac{\Delta U^2}{\beta^2}\left((H-1)\ln\!\left(\sqrt[H-1]{24k}\,(H-1)|A|\frac{\Delta U^2}{\beta^2}\right)+\ln\!\left(\frac{1}{\delta}\right)\right).$$
Indeed, setting $x\triangleq\sqrt[H-1]{24k}\,(H-1)|A|$ and $y\triangleq\frac{\beta}{\Delta U}$ and substituting the expression for $K$, the left-hand term of inequality (31) can be rewritten as
$$\begin{aligned}
\left(\frac{x}{y^2}\right)^{H-1}&\left(\ln\!\left(\frac{x}{y^2}\right)+\frac{1}{H-1}\ln\!\left(\frac{1}{\delta}\right)\right)^{H-1}\left(\frac{y^2}{x}\right)^{2(H-1)}\delta^2\\
&=\left(\frac{\ln\!\left(\frac{x}{y^2\delta^{\frac{1}{H-1}}}\right)}{\frac{x}{y^2}}\right)^{H-1}\delta^2\\
&=\left(\frac{\ln\!\left(\frac{x}{y^2\delta^{\frac{1}{H-1}}}\right)}{\frac{x\delta^{\frac{1}{H-1}}}{y^2\delta^{\frac{1}{H-1}}}}\right)^{H-1}\delta^2\\
&=\left(\frac{\ln\!\left(\frac{x}{y^2\delta^{\frac{1}{H-1}}}\right)}{\frac{x}{y^2\delta^{\frac{1}{H-1}}}}\right)^{H-1}\delta\\
&\le\delta,
\end{aligned}\tag{32}$$
where we have used the observation that $x\ge 1$ and $y\le 1$ along with the fact that $0\le\frac{\ln(z)}{z}\le 1$ for all $z\ge 1$ in the last step. Finally, assuming there exists an optimal policy in $\mathcal{M}_{\mathcal{I}}$ (see [12, 37] for the necessary conditions), and knowing from Lemma 1 that $V^\pi(s)=W^\pi(s)$ for any policy, we know that $V^*(s)-V^{\pi_{PAA}}(s)\le\varepsilon$ implies $W^{\pi_{PAA}}(s)\ge\sup_{\pi'}W^{\pi'}(s)-\varepsilon$. ∎

See 4

Proof. The first part of the proof is identical to Theorem 3, but we use part 2) of Lemma 7 instead of part 1) in Eq. (30).
With that, we have for any state $s$:
$$V^*(s) - V^{\pi_{PAA}}(s) \leq \frac{2(1-\gamma)\alpha_H + 2\phi_H\,\Delta U}{(1-\gamma)^2} \leq \frac{2}{(1-\gamma)^2}\left(\varepsilon_1 + \varepsilon_2 + \varepsilon_3 + \gamma(\varepsilon_4 + \varepsilon_5) + \gamma^H\Delta U + 2\Delta U(\delta_1 + \delta_3 + \delta_5)(C|\mathcal{A}|)^{H-1}\right).$$
That is, using the definitions of $\varepsilon_2$ and $\varepsilon_4$, and imposing the tolerance $\varepsilon$, we require
$$\varepsilon_1 + \varepsilon_3 + \gamma\varepsilon_5 + \gamma^H\Delta U + 2\Delta U(\delta_1 + \delta_3 + \delta_5)(C|\mathcal{A}|)^{H-1} \leq \frac{(1-\gamma)^2}{2}\left(\varepsilon - \frac{\sqrt{8d}\,\Delta U}{(1-\gamma)^3}\right).$$
We fix $\varepsilon_1 = \varepsilon_3 = \gamma\varepsilon_5 = \gamma^H\Delta U = \beta \triangleq \frac{(1-\gamma)^2}{10}\left(\varepsilon - \frac{\sqrt{8d}\,\Delta U}{(1-\gamma)^3}\right)$. From that, we directly obtain the required planning "depth"
$$H = \max\left\{1, \left\lceil \log_\gamma\left(\frac{\beta}{\Delta U}\right) \right\rceil\right\}.$$
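The choice of $H$ can be checked directly: by construction $\gamma^H \Delta U \leq \beta$, whether or not the ceiling term is active. A minimal numerical sketch with hypothetical values of $\gamma$, $\beta$, and $\Delta U$:

```python
import math

# Hypothetical sample values with 0 < β < ΔU.
dU, beta = 1.0, 0.05

for gamma in (0.04, 0.5, 0.9, 0.99):
    # Required planning depth H = max{1, ⌈log_γ(β/ΔU)⌉}.
    H = max(1, math.ceil(math.log(beta / dU, gamma)))
    # The truncation term γ^H ΔU is then at most β.
    assert gamma**H * dU <= beta + 1e-12
```

The first value of $\gamma$ exercises the $H = 1$ branch ($\gamma \leq \beta/\Delta U$), the others the ceiling branch.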
Similarly to Theorem 3, we choose $C = \frac{\gamma^2}{(1-\gamma)^2} K$ and
$$n \geq N \left(1 + \frac{\Delta U^2\,\Gamma(\beta, U_{min}, U_{max}, q)\,N}{2K}\right)^{-1},$$
such that $\delta_1 + \delta_3 + \delta_5 \leq 6\exp\left(-\frac{2K\beta^2}{\Delta U^2}\right)$. The last step is to choose $K$ such that
$$12(K|\mathcal{A}|)^{H-1} \exp\left(-\frac{2K\beta^2}{\Delta U^2}\right) \leq \frac{\beta}{\Delta U}. \tag{33}$$
If $H = 1$ (i.e., $\gamma \leq \frac{\beta}{\Delta U}$), we can simply choose
$$K \geq \frac{\Delta U^2}{2\beta^2} \ln\left(\frac{12\Delta U}{\beta}\right),$$
and if $H > 1$ (i.e., $\gamma > \frac{\beta}{\Delta U}$), we can choose
$$K \geq \frac{\Delta U^2}{\beta^2} \left((H-1)\ln\left(\sqrt[H-1]{12}\,(H-1)\,|\mathcal{A}|\,\frac{\Delta U^2}{\beta^2}\right) + \ln\left(\frac{\Delta U}{\beta}\right)\right).$$
We show that this choice of $K$ satisfies Eq. (33) by setting $\delta = \frac{\beta}{\Delta U}$ and $k = \frac{1}{2}$ in Eq. (32). We also conclude the proof using Lemma 1.
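This choice of $K$ can also be spot-checked numerically for the case $H > 1$; the sketch below uses hypothetical values of $\beta$, $\Delta U$, and $|\mathcal{A}|$ and confirms that the left-hand side of Eq. (33) stays below $\beta/\Delta U$:

```python
import math

# Hypothetical sample values (case H > 1, i.e., γ > β/ΔU).
dU, beta, A, H = 1.0, 0.1, 4, 2

# K as chosen above: (ΔU²/β²)((H-1) ln(12^(1/(H-1)) (H-1) |A| ΔU²/β²) + ln(ΔU/β)).
x = 12 ** (1 / (H - 1)) * (H - 1) * A
K = dU**2 / beta**2 * ((H - 1) * math.log(x * dU**2 / beta**2) + math.log(dU / beta))

# Left-hand side of Eq. (33): 12 (K|A|)^(H-1) exp(-2Kβ²/ΔU²).
lhs = 12 * (K * A) ** (H - 1) * math.exp(-2 * K * beta**2 / dU**2)
assert lhs <= beta / dU
```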
∎

A.2.3 Safe policies

See Theorem 8.

Proof. Assuming there exists an optimal policy in $\mathcal{M}_\mathcal{I}$ (see [12, 37] for the necessary conditions), and knowing from Lemma 1 that $V^\pi(s) = W^\pi(s)$ for any policy, we can write $\sup_{\pi'} W^{\pi'}(s) = V^*(s)$. Therefore, the condition for safe actions becomes $\mathbb{E}_{s' \sim p(\cdot|s,a)}[V^*(s')] \geq \omega$. Using the definition of the optimal state-action value function $Q^*$:
$$Q^*(s,a) = r_\mathcal{I}(s,a) + \gamma\,\mathbb{E}_{s' \sim p(\cdot|s,a)}[V^*(s')],$$
we can rewrite this condition once again as
$$Q^*(s,a) \geq \gamma\omega + r_\mathcal{I}(s,a).$$
We know from Eq. (27) in the proof of Theorem 3 that, for any state-action pair $(s,a)$, $|Q^*(s,a) - \hat{Q}^H(s,a;K,C,I_n)| \leq \alpha_H$ with probability at least $1-\phi_H$, where $\alpha_H$ and $\phi_H$ are given in Eq. (28) and (29), respectively. That is, if $\hat{Q}^H(s,a;K,C,I_n) \geq \gamma\omega + r_\mathcal{I}(s,a) + \alpha_H$, then the above condition is satisfied with probability at least $1-\phi_H$. From Eq. (28), we have
$$\alpha_H \leq \frac{\varepsilon_2 + \gamma\varepsilon_4}{1-\gamma} + \frac{\varepsilon_1 + \varepsilon_3 + \gamma\varepsilon_5}{1-\gamma} + \frac{\gamma^H\Delta U}{1-\gamma}, \tag{34}$$
where (using $d \triangleq \sup_{(s,a)} D_{KL}\left(p(\cdot|s,a) \,\|\, \hat{p}(\cdot|s,a)\right)$):
$$\varepsilon_1 = \sqrt{\frac{N-n}{nN\,\Gamma(\varepsilon_1, U_{min}, U_{max}, q)}\ln\left(\frac{2}{\delta_1}\right)} \leq \sqrt{\frac{N-n}{nN\,\Gamma(U_{max}, U_{min}, U_{max}, q)}\ln\left(\frac{2}{\delta_1}\right)},$$
$$\varepsilon_2 = 2\Delta U\sqrt{\min\left\{\frac{d}{2},\, 1-e^{-d}\right\}}, \qquad \varepsilon_3 = \sqrt{\frac{\Delta U^2}{2K}\ln\left(\frac{2}{\delta_3}\right)},$$
$$\varepsilon_4 = \frac{2\Delta U}{1-\gamma}\sqrt{\min\left\{\frac{d}{2},\, 1-e^{-d}\right\}}, \qquad \varepsilon_5 = \sqrt{\frac{\Delta U^2}{2C(1-\gamma)^2}\ln\left(\frac{2}{\delta_5}\right)}.$$
Finally, in order to obtain a $\delta$-$\omega$-safe policy, we must have $\phi_H \leq \delta$. From Eq. (29), this is satisfied if
$$2(\delta_1 + \delta_3 + \delta_5)(C|\mathcal{A}|)^{H-1} \leq \delta,$$
or similarly if $\delta_1 = \delta_3 = \delta_5 = \frac{\delta}{6(C|\mathcal{A}|)^{H-1}}$. Defining $d' = \sqrt{\min\left\{\frac{d}{2},\, 1-e^{-d}\right\}}$ and $\Gamma_{max} = \Gamma(U_{max}, U_{min}, U_{max}, q)$, and substituting $\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4, \varepsilon_5, \delta_1, \delta_3, \delta_5$ in Eq. (34), we obtain
$$\alpha_H \leq \frac{2\Delta U d'}{(1-\gamma)^2} + \frac{\sqrt{\ln\left(\frac{12(C|\mathcal{A}|)^{H-1}}{\delta}\right)}}{1-\gamma}\left(\sqrt{\frac{N-n}{nN\,\Gamma_{max}}} + \sqrt{\frac{\Delta U^2}{2K}} + \gamma\sqrt{\frac{\Delta U^2}{2C(1-\gamma)^2}}\right) + \frac{\gamma^H\Delta U}{1-\gamma}.$$
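The resulting bound is straightforward to evaluate for concrete parameters. The sketch below computes the right-hand side for hypothetical values, with $\Gamma_{max}$ and $d'$ treated as given constants:

```python
import math

# Hypothetical sample values; Gamma_max and d_prime stand in for Γ_max and d'.
dU, gamma, H, A = 1.0, 0.9, 3, 4
N, n, K, C = 10_000, 2_000, 500.0, 500.0
delta, Gamma_max, d_prime = 0.05, 0.8, 0.1

# Right-hand side of the bound on α_H derived above.
log_term = math.sqrt(math.log(12 * (C * A) ** (H - 1) / delta))
alpha = (
    2 * dU * d_prime / (1 - gamma) ** 2
    + log_term / (1 - gamma)
    * (
        math.sqrt((N - n) / (n * N * Gamma_max))
        + math.sqrt(dU**2 / (2 * K))
        + gamma * math.sqrt(dU**2 / (2 * C * (1 - gamma) ** 2))
    )
    + gamma**H * dU / (1 - gamma)
)
assert math.isfinite(alpha) and alpha > 0
```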
Using the fact that $r_\mathcal{I}(s,a) \leq U_{max}$ for any state-action pair, we can conservatively define the restricted subsets of safe actions as
$$\mathcal{A}_{safe}(s) = \left\{a : \hat{Q}^H(s,a;K,C,I_n) \geq \gamma\omega + U_{max} + \alpha\right\}$$
for any state $s$, where $\alpha$ is the right-hand term of the above inequality. Therefore, restricting any policy $\pi$ with these subsets ensures that $\mathbb{E}_{s' \sim p(\cdot|s,a)}[V^*(s')] \geq \omega$ with probability at least $1-\delta$ for any action $a$ that has a non-zero probability of being selected. ∎
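In implementation terms, the safeguard reduces to a simple threshold filter on estimated Q-values. A minimal sketch, where the Q-value estimates, action names, and constants are all hypothetical:

```python
# All names and values below are hypothetical illustrations.
gamma, omega, U_max, alpha = 0.9, 0.0, 1.0, 0.3

def safe_actions(q_hat):
    """Return the actions whose estimated Q-value clears the conservative
    threshold γω + U_max + α, so that E[V*(s')] >= ω holds with high
    probability for every retained action."""
    threshold = gamma * omega + U_max + alpha
    return {a for a, q in q_hat.items() if q >= threshold}

# Example: with threshold 1.3, only action "b" is kept.
q_hat = {"a": 1.1, "b": 1.5, "c": 0.4}
assert safe_actions(q_hat) == {"b"}
```

Any black-box policy can then be restricted to sample only from the returned set, which is the safeguarding mechanism the proof justifies.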