
Paper deep dive

Information Theoretic Guarantees For Policy Alignment In Large Language Models

Youssef Mroueh

Year: 2024 · Venue: arXiv preprint · Area: Formal/Theoretical · Type: Theoretical · Embeddings: 242

Abstract

Policy alignment of large language models refers to constrained policy optimization, where the policy is optimized to maximize a reward while staying close to a reference policy with respect to an $f$-divergence such as the $\mathsf{KL}$ divergence. The best of $n$ alignment policy selects a sample from the reference policy that has the maximum reward among $n$ independent samples. For both cases (policy alignment and best of $n$), recent works showed empirically that the reward improvement of the aligned policy on the reference one scales like $\sqrt{\mathsf{KL}}$, with an explicit bound in $n$ on the $\mathsf{KL}$ for the best of $n$ policy. We show in this paper that the $\sqrt{\mathsf{KL}}$ information theoretic upper bound holds if the reward under the reference policy has sub-gaussian tails. Moreover, we prove for the best of $n$ policy, that the $\mathsf{KL}$ upper bound can be obtained for any $f$-divergence via a reduction to exponential order statistics owing to the Rényi representation of order statistics, and a data processing inequality. If additional information is known on the tails of the aligned policy we show that tighter control on the reward improvement can be obtained via the Rényi divergence. Finally we demonstrate how these upper bounds transfer from proxy rewards to golden rewards which results in a decrease in the golden reward improvement due to overestimation and approximation errors of the proxy reward.

Tags

ai-safety (imported, 100%) · alignment-training (suggested, 80%) · formaltheoretical (suggested, 92%) · theoretical (suggested, 88%)

Links

Open PDF directly →

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/12/2026, 6:49:24 PM

Summary

This paper provides information-theoretic upper bounds for policy alignment in Large Language Models (LLMs), specifically focusing on the KL divergence scaling laws for both standard policy alignment and 'best-of-n' sampling. The author proves that reward improvement scales with the square root of the KL divergence under sub-gaussian reward assumptions, extends these results to f-divergences and Rényi divergences, and demonstrates how these bounds transfer from proxy rewards to golden rewards, providing a theoretical basis for Goodhart's law in LLM alignment.

Entities (5)

Best of n Policy · methodology · 100%
KL Divergence · mathematical-concept · 100%
Large Language Models · technology · 100%
Goodhart's Law · phenomenon · 95%
Rényi Divergence · mathematical-concept · 95%

Relation Signals (3)

KL Divergence bounds Reward Improvement

confidence 95% · the reward improvement of the aligned policy on the reference one scales like square-root of KL

Best of n Policy is a form of Policy Alignment

confidence 95% · Another important paradigm for optimizing the reward is test time alignment via best of n sampling

Proxy Reward leads to Goodhart's Law

confidence 90% · optimizing the proxy reward results in deterioration of the golden reward. This phenomena is referred to... as Goodhart’s law.

Cypher Suggestions (2)

Find all methodologies related to policy alignment · confidence 90% · unvalidated

MATCH (m:Methodology)-[:IS_A_FORM_OF]->(p:Methodology {name: 'Policy Alignment'}) RETURN m

Identify concepts that lead to Goodhart's Law · confidence 90% · unvalidated

MATCH (c:Concept)-[:LEADS_TO]->(g:Phenomenon {name: 'Goodhart\'s Law'}) RETURN c

Full Text

241,411 characters extracted from source content.


Information Theoretic Guarantees For Policy Alignment In Large Language Models

Youssef Mroueh, IBM Research

Abstract. Policy alignment of large language models refers to constrained policy optimization, where the policy is optimized to maximize a reward while staying close to a reference policy with respect to an $f$-divergence such as the $\mathsf{KL}$ divergence. The best of $n$ alignment policy selects a sample from the reference policy that has the maximum reward among $n$ independent samples. For both cases (policy alignment and best of $n$), recent works showed empirically that the reward improvement of the aligned policy on the reference one scales like $\sqrt{\mathsf{KL}}$, with an explicit bound in $n$ on the $\mathsf{KL}$ for the best of $n$ policy. We show in this paper that the $\sqrt{\mathsf{KL}}$ information theoretic upper bound holds if the reward under the reference policy has sub-gaussian tails. Moreover, we prove for the best of $n$ policy that the $\mathsf{KL}$ upper bound can be obtained for any $f$-divergence via a reduction to exponential order statistics, owing to the Rényi representation of order statistics and a data processing inequality. If additional information is known on the tails of the aligned policy, we show that tighter control on the reward improvement can be obtained via the Rényi divergence. Finally we demonstrate how these upper bounds transfer from proxy rewards to golden rewards, which results in a decrease in the golden reward improvement due to overestimation and approximation errors of the proxy reward.

1. Introduction

Aligning Large Language Models (LLMs) with human preferences allows a tradeoff between maintaining the utility of the pre-trained reference model and the alignment of the model with human values such as safety or other socio-technical considerations.
Alignment is becoming a crucial step in the LLM training pipeline, especially as these models are leveraged in decision making as well as becoming more and more accessible to the general public. Policy alignment starts by learning a reward model that predicts human preferences; these reward models are typically fine-tuned LLMs that are trained on pairwise human preference data (Christiano et al., 2017; Stiennon et al., 2020; Ouyang et al., 2022; Bai et al., 2022). The reward is then optimized using training time alignment, i.e. via policy gradient based reinforcement learning, leading to the so called Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017). RLHF ensures that the reward is maximized while the policy $\pi$ stays close to the initial reference policy $\pi_{\mathrm{ref}}$ in the sense of the Kullback-Leibler divergence $\mathsf{KL}(\pi\,\|\,\pi_{\mathrm{ref}})$. Other variants of these training time alignment methods have been proposed via direct preference optimization (Rafailov et al., 2024; Zhao et al., 2023; Ethayarajh et al., 2024). Another important paradigm for optimizing the reward is test time alignment via best of $n$ sampling from the reference policy and retaining the sample that maximizes the reward. The resulting policy is known as the best of $n$ policy. The best of $n$ policy is also used in controlled decoding settings (Yang and Klein, 2021; Mudgal et al., 2023) and in fine-tuning LLMs to match the best of $n$ policy responses (Touvron et al., 2023).

Figure 1. Qualitative plot of centered rewards vs. $\mathsf{KL}$ of proxy and gold rewards for both best of $n$ and RL policies. (See Fig. 1 a) and b) in (Gao et al., 2023) for scaling laws in policy alignment.)

(Gao et al., 2023) and (Hilton and Gao, 2022) studied the scaling laws of reward model optimization in both the RL and the best of $n$ setups.
(Gao et al., 2023) distinguished between the "golden reward," which can be thought of as the golden human preference, and the "proxy reward," which is trained to predict the golden reward. For proxy rewards, (Gao et al., 2023) found experimentally for both RL and best of $n$ policies that the reward improvement on the reference policy scales as $\sqrt{\mathsf{KL}(\pi\,\|\,\pi_{\mathrm{ref}})}$. Similar observations for reward improvement scaling in RL were made in (Bai et al., 2022). For golden rewards, (Gao et al., 2023) showed for both RL and best of $n$ policies that LLMs that optimize the proxy reward suffer from over-optimization, in the sense that as the policy drifts from the reference policy, optimizing the proxy reward results in deterioration of the golden reward. This phenomenon is referred to in (Gao et al., 2023) and (Hilton and Gao, 2022) as Goodhart's law. A qualitative plot of the scaling laws discovered in (Gao et al., 2023) is given in Figure 1. For the best of $n$ policy, most works in this space assumed that $\mathsf{KL}(\pi\,\|\,\pi_{\mathrm{ref}}) = \log(n) - \frac{n-1}{n}$ (Stiennon et al., 2020; Coste et al., 2024; Nakano et al., 2021; Go et al., 2024; Gao et al., 2023). Recently Beirami et al. (2024) showed that this is in fact an inequality under the assumption that the reward is a one to one map (a bijection) and for finite alphabets. The main contributions of this paper are: (1) We provide in Theorem 1 in Section 2 a new proof for the best of $n$ policy inequality $\mathsf{KL}(\pi\,\|\,\pi_{\mathrm{ref}}) \leq \log(n) - \frac{n-1}{n}$ and show via a reduction to exponential random variables that it is a consequence of the data processing inequality of the $\mathsf{KL}$ divergence.
We extend this inequality beyond the setup of (Beirami et al., 2024) of one to one rewards and finite alphabets to a more realistic setup of surjective rewards and beyond finite alphabets. We also give conditions under which equality is met, and extend these inequalities to $f$-divergences and Rényi divergences. (2) We show in Section 3 that the scaling laws on policy improvement versus $\mathsf{KL}$ of (Gao et al., 2023) are information theoretic upper bounds and are consequences of transportation inequalities with the $\mathsf{KL}$ divergence under sub-gaussian tails of the reward under the reference policy. We discuss how the dependency on $\mathsf{KL}$ is driven only by the tails of the reward under the reference model: it cannot be improved by a better alignment algorithm, and can only be improved if the tails of the reference rewards are fatter than sub-gaussian, such as sub-gamma or sub-exponential tails. (3) We study in Theorem 4 the tightness of these information theoretic upper bounds when the tails of the optimized policy are also known, via new transportation inequalities for the Rényi divergence $D_\alpha$ for $\alpha \in (0,1)$, and show that the upper bound $\sqrt{\mathsf{KL}}$ cannot be met, echoing Goodhart's law of (Gao et al., 2023). (4) We finally study in Section 4 the transfer of transportation inequalities from proxy rewards to golden rewards and prove that indeed the golden reward improvement is hindered by "overestimation" of the proxy reward, as reported empirically in (Gao et al., 2023).

2. The Alignment Problem

2.1. RLHF: A Constrained Policy Optimization Problem

Let $\mathcal{X}$ be the space of prompts and $\mathcal{Y}$ be the space of responses $y \in \mathcal{Y}$ from an LLM conditioned on a prompt $x \in \mathcal{X}$. The reference LLM is represented as a policy $\pi_{\mathrm{ref}}(y|x)$, i.e. as a conditional probability on $\mathcal{Y}$ given a prompt $x \in \mathcal{X}$.
Let $\rho_{\mathcal{X}}$ be a distribution on prompts, and $r$ a reward, $r : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$; $r$ represents a safety or alignment objective that is desirable to maximize. Given a reference policy $\pi_{\mathrm{ref}}$, the goal of alignment is to find a policy $\pi^*$ that maximizes the reward $r$ while staying close to the original reference policy, for some $\Delta > 0$:

$$\pi^*_{y|x} = \arg\max_{\pi_{y|x}} \mathbb{E}_{x \sim \rho_{\mathcal{X}}} \mathbb{E}_{y \sim \pi(\cdot|x)}\, r(x,y) \quad \text{s.t.} \quad \int_{\mathcal{X}} \mathsf{KL}(\pi(y|x)\,\|\,\pi_{\mathrm{ref}}(y|x))\, d\rho_{\mathcal{X}}(x) \leq \Delta, \tag{1}$$

where $\mathsf{KL}(\pi(y|x)\,\|\,\pi_{\mathrm{ref}}(y|x)) = \mathbb{E}_{y \sim \pi(\cdot|x)} \log\left(\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}\right)$.

With some abuse of notation, we write $\pi(x,y) = \pi(y|x)\,\rho_{\mathcal{X}}(x)$ and $\pi_{\mathrm{ref}}(x,y) = \pi_{\mathrm{ref}}(y|x)\,\rho_{\mathcal{X}}(x)$. Let $\mathcal{P}(\mathcal{X} \times \mathcal{Y})$ be the set of joint probabilities defined on $\mathcal{X} \times \mathcal{Y}$ that have $\rho_{\mathcal{X}}$ as marginal on $\mathcal{X}$. Hence we can write the alignment problem (1) in a more compact way as follows:

$$\sup_{\pi \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})} \int r\, d\pi \quad \text{s.t.} \quad \mathsf{KL}(\pi\,\|\,\pi_{\mathrm{ref}}) \leq \Delta. \tag{2}$$

For $\beta > 0$, we can also write a penalized form of this constrained policy optimization problem as follows:
$$\sup_{\pi \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})} \int r\, d\pi - \frac{1}{\beta} \mathsf{KL}(\pi\,\|\,\pi_{\mathrm{ref}}).$$

It is easy to see that the optimal policy of the penalized problem is given by:

$$\pi_{\beta,r}(y|x) = \frac{\exp(\beta r(x,y))\, \pi_{\mathrm{ref}}(y|x)}{\int \exp(\beta r(x,y))\, d\pi_{\mathrm{ref}}(y|x)}, \quad \rho_{\mathcal{X}}\text{-almost surely}. \tag{3}$$

The constrained problem (2) has a similar solution (see e.g. (Yang et al., 2024)):

$$\pi_{\lambda_\Delta,r}(y|x) = \frac{\exp\left(\frac{r(x,y)}{\lambda_\Delta}\right) \pi_{\mathrm{ref}}(y|x)}{\int \exp\left(\frac{r(x,y)}{\lambda_\Delta}\right) d\pi_{\mathrm{ref}}(y|x)}, \quad \rho_{\mathcal{X}}\text{-almost surely}, \tag{4}$$

where $\lambda_\Delta > 0$ is a Lagrange multiplier that satisfies $\int_{\mathcal{X}} \mathsf{KL}(\pi_{\lambda_\Delta,r}(y|x)\,\|\,\pi_{\mathrm{ref}}(y|x))\, d\rho_{\mathcal{X}}(x) = \Delta$.

2.2. Best of n Policy Alignment

Let $X$ be the random variable associated with prompts such that $\mathrm{Law}(X) = \rho_{\mathcal{X}}$. Let $Y$ be the random variable associated with the conditional response of $\pi_{\mathrm{ref}}$ given $X$.
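The exponentially tilted solution (3) above can be sketched numerically on a discrete toy alphabet. This is a minimal illustration, not from the paper: the reward values, reference probabilities, and the helper names `tilted_policy` and `kl` are all assumptions made for the example.

```python
import numpy as np

def tilted_policy(pi_ref, rewards, beta):
    """Exponentially tilted policy of Eq. (3):
    proportional to exp(beta * r) * pi_ref, normalized over responses."""
    logits = beta * rewards + np.log(pi_ref)
    w = np.exp(logits - logits.max())  # numerically stabilized softmax
    return w / w.sum()

def kl(p, q):
    """KL(p || q) for discrete distributions on the same support."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy reference policy over 4 responses with illustrative rewards.
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])
r = np.array([0.0, 1.0, 2.0, 3.0])

for beta in [0.0, 1.0, 5.0]:
    pi = tilted_policy(pi_ref, r, beta)
    print(beta, pi.round(3),
          "E[r] =", round(float(pi @ r), 3),
          "KL =", round(kl(pi, pi_ref), 3))
```

As $\beta$ grows the policy concentrates on high-reward responses while $\mathsf{KL}(\pi\,\|\,\pi_{\mathrm{ref}})$ increases, tracing the reward/KL tradeoff of problems (1)-(2).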
Define the conditional reward of the reference policy:

$$R(Y)|X := r(X,Y) \text{ where } Y \sim \pi_{\mathrm{ref}}(\cdot|X).$$

We assume that $R(Y)|X$ admits a CDF, denoted $F_{R(Y)|X}$, and let $F^{-1}_{R(Y)|X}$ be its quantile function:

$$F^{(-1)}_{R(Y)|X}(p) = \inf\{\eta : F_{R(Y)|X}(\eta) \geq p\} \text{ for } p \in [0,1].$$

Let $Y_1, \dots, Y_n$ be independent samples from $\pi_{\mathrm{ref}}(\cdot|X)$. We define the best of $n$ reward as follows:

$$R^{(n)}(Y)|X = \max_{i=1\dots n} R(Y_i)|X, \tag{5}$$

i.e. the maximum of $n$ i.i.d. random variables with common CDF $F_{R(Y)|X}$. The best of $n$ policy corresponds to $Y^{(n)}|X := \arg\max_{i=1\dots n} r(X, Y_i)$. We denote by $\pi^{(n)}_{r,\mathrm{ref}}(\cdot|X)$ the law of $Y^{(n)}|X$; $\pi^{(n)}_{r,\mathrm{ref}}$ is referred to as the best of $n$ alignment policy. We consider two setups for the reward:

Assumption 1. We assume that the reward $r$ is a one to one map for fixed $x$, and admits an inverse $h_x : \mathbb{R} \to \mathcal{Y}$ such that $h_x(r(x,y)) = y$. This assumption was considered in (Beirami et al., 2024).
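The best of $n$ reward (5) is easy to simulate. The following Monte Carlo sketch assumes, purely for illustration, that the conditional rewards under the reference policy are standard Gaussian; it is not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 16, 200_000

# Assumed model: conditional rewards R(Y)|X under pi_ref are N(0, 1).
ref_rewards = rng.normal(size=(trials, n))

# Best of n reward R^(n)(Y)|X = max over n i.i.d. draws (Eq. (5)).
best_of_n = ref_rewards.max(axis=1)

print("E[R]     ~", round(float(ref_rewards[:, 0].mean()), 3))
print("E[R^(n)] ~", round(float(best_of_n.mean()), 3))
```

The mean of the best of $n$ reward is visibly above the reference mean, previewing the first order dominance stated later in Proposition 1.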
Nevertheless this assumption is strong and not usually met in practice; we weaken it to the following:

Assumption 2. We assume that there is a stochastic map $H_X$ such that $H_X(R_{Y|X}) \overset{d}{=} Y|X$ and $H_X(R_{Y^{(n)}|X}) \overset{d}{=} Y^{(n)}|X$.

Under Assumption 2 the reward can be surjective, which is more realistic, but we assume that there is a stochastic map that ensures invertibility not point-wise but at the distribution level. Our assumption means that, conditionally on $X$, $R|X \to Y|X$ forms a Markov chain, i.e. there exists a kernel $A(Y|R,X)$ so that $P_{Y|X} = A(Y|R,X)\, P_{R|X}$ and $P_{Y^{(n)}|X} = A(Y|R,X)\, P_{R^{(n)}|X}$.

Best of n Policy $\mathsf{KL}$ Guarantees: A Reduction to Exponentials. In what follows, for random variables $Z, Z'$ with laws $p_Z, p_{Z'}$ we write interchangeably $\mathsf{KL}(p_Z \,\|\, p_{Z'}) = \mathsf{KL}(Z \,\|\, Z')$. Let us start by looking at $\mathsf{KL}[R^{(n)}(Y) \,\|\, R(Y) \,|\, X]$, the $\mathsf{KL}$ divergence between the conditional reward of the best of $n$ policy and that of the reference policy.
Let $E \sim \mathrm{Exp}(1)$. The optimal transport map $F^{-1}_{R(Y)|X} \circ F_E$ from the exponential distribution to $R(Y)|X$ (see for example Theorem 2.5 in (Santambrogio, 2015): $E$ is atomless, but $R(Y)|X$ can be discrete valued) allows us to write:

$$R(Y)|X \overset{d}{=} F^{-1}_{R(Y)|X} \circ F_E(E), \tag{6}$$

where $\overset{d}{=}$ means equality in distribution. On the other hand, let $R^{(1)}(Y)|X \leq \dots \leq R^{(n)}(Y)|X$ be the order statistics of the rewards of $n$ independent samples $Y_i$, $i = 1\dots n$, $Y_i \sim \pi_{\mathrm{ref}}(\cdot|X)$. The order statistics refer to sorting the random variables from the minimum (index $(1)$) to the maximum (index $(n)$). Consider $n$ independent exponentials $E_1, \dots, E_n$, where $E_i \sim \mathrm{Exp}(1)$, and their order statistics $E^{(1)} \leq E^{(2)} \leq \dots \leq E^{(n)}$. The Rényi representation of order statistics (Rényi, 1953), similar to the optimal transport (OT) representation, allows us to express the distribution of the order statistics of the rewards in terms of the order statistics of exponentials as follows:
$$\left(R^{(1)}(Y)|X, \dots, R^{(n)}(Y)|X\right) \overset{d}{=} \left(F^{-1}_{R(Y)|X} \circ F_E(E^{(1)}), \dots, F^{-1}_{R(Y)|X} \circ F_E(E^{(n)})\right). \tag{7}$$

The central idea in the Rényi representation is that the mapping $F^{-1}_{R(Y)|X} \circ F_E$ is monotonic and hence order preserving, and by the OT representation each component is distributed as $R(Y)|X$. See (Boucheron and Thomas, 2012) for more on the Rényi representation of order statistics. Hence, using the OT representation (6) and the Rényi representation of the maximum (7), we can reduce the $\mathsf{KL}$ between the rewards to a $\mathsf{KL}$ between functions of exponentials and their order statistics:

$$\mathsf{KL}[R^{(n)}(Y) \,\|\, R(Y) \,|\, X] = \mathsf{KL}\left(F^{-1}_{R(Y)|X} \circ F_E(E^{(n)}) \,\Big\|\, F^{-1}_{R(Y)|X} \circ F_E(E)\right) = \mathsf{KL}(T_X(E^{(n)}) \,\|\, T_X(E)), \tag{8}$$

where $T_X = F^{-1}_{R(Y)|X} \circ F_E = F^{-1}_{(r(X,\cdot))_\sharp \pi_{\mathrm{ref}}(\cdot|X)} \circ F_E$.

Under Assumption 1 we can write samples from the best of $n$ policy as $Y^{(n)}|X = h_X(R^{(n)}(Y))|X$ and from the reference policy as $Y|X = h_X(R(Y))|X$.
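The representation (7) can be checked by simulation: pushing the maximum of $n$ Exp(1) variables through $T = F^{-1}_R \circ F_E$ reproduces the law of the maximum of $n$ direct reward draws. The discrete reward law below is an illustrative assumption (it also exercises the atomless-to-discrete case mentioned above); `quantile` is a hypothetical helper implementing $F^{-1}_R$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 5, 400_000

# Assumed discrete reward law under the reference policy (illustrative).
vals = np.array([0.0, 1.0, 2.0])
probs = np.array([0.5, 0.3, 0.2])
cdf = np.cumsum(probs)

def quantile(p):
    """F^{-1}_R(p) = inf{eta : F(eta) >= p} for the discrete law above."""
    return vals[np.searchsorted(cdf, p)]

# Direct route: max of n i.i.d. reward draws.
direct = rng.choice(vals, p=probs, size=(trials, n)).max(axis=1)

# Renyi/OT route: T(E^(n)) with T = F^{-1}_R o F_E, E^(n) the max of n Exp(1).
e_max = rng.exponential(size=(trials, n)).max(axis=1)
mapped = quantile(1.0 - np.exp(-e_max))  # F_E(e) = 1 - exp(-e)

for v in vals:
    print(v, round(float((direct == v).mean()), 3),
             round(float((mapped == v).mean()), 3))
```

Both routes give the same distribution; e.g. both put mass $(1/2)^5$ on the smallest reward value, since $P(\max \leq 0) = F(0)^n$ on either side.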
Hence, by the data processing inequality (DPI) for the $\mathsf{KL}$ divergence (see e.g. (Polyanskiy and Wu, 2023)), under Assumption 1:

$$\begin{aligned}
\mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}} \,\|\, \pi_{\mathrm{ref}} \,|\, X) &= \mathsf{KL}(Y^{(n)} \,\|\, Y \,|\, X) \\
&= \mathsf{KL}(h_X(R^{(n)}(Y)) \,\|\, h_X(R(Y)) \,|\, X) \\
&= \mathsf{KL}(R^{(n)}(Y) \,\|\, R(Y) \,|\, X) \quad \text{(by Assumption 1, $h_X$ is one to one and the DPI is an equality)} \\
&= \mathsf{KL}(T_X(E^{(n)}) \,\|\, T_X(E)) \quad \text{(Rényi and optimal transport representations, Eq. (8))}
\end{aligned} \tag{9}$$

Recall that $T_X = F^{-1}_{R(Y)|X} \circ F_E$, and $F_E$ is one to one. If the space $\mathcal{Y}$ is finite, $R(Y|X)$ has a discontinuous CDF, hence not strictly monotonic.
It follows that its quantile $F^{-1}_{R(Y)|X}$ is not a one to one map, so $T_X$ is not a one to one map, and hence by the DPI (an inequality in this case, since $T_X$ is not one to one):

$$\mathsf{KL}(T_X(E^{(n)}) \,\|\, T_X(E)) \leq \mathsf{KL}(E^{(n)} \,\|\, E). \tag{10}$$

If the space $\mathcal{Y}$ is infinite and we assume that the CDF of $R(Y|X)$ is continuous and strictly monotonic, then $F^{-1}_{R(Y)|X}$ is a one to one map; as a result $T_X$ is a one to one map and the DPI is an equality:

$$\mathsf{KL}(T_X(E^{(n)}) \,\|\, T_X(E)) = \mathsf{KL}(E^{(n)} \,\|\, E). \tag{11}$$

Hence under Assumption 1 and for $\mathcal{Y}$ finite, combining (9) and (10) we have:

$$\mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}} \,\|\, \pi_{\mathrm{ref}} \,|\, X) \leq \mathsf{KL}(E^{(n)} \,\|\, E), \tag{12}$$

and under Assumption 1, for $\mathcal{Y}$ infinite and assuming $F_{R(Y)|X}$ is continuous and strictly monotonic, combining (9) and (11) we have:

$$\mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}} \,\|\, \pi_{\mathrm{ref}} \,|\, X) = \mathsf{KL}(E^{(n)} \,\|\, E). \tag{13}$$
Under the more realistic Assumption 2 we can also apply the DPI with the stochastic map $H_X$, since the DPI also holds for stochastic maps (under our assumption $R|X \to Y|X$; see for example (van Erven and Harremos, 2014), Example 2):

$$\mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}} \,\|\, \pi_{\mathrm{ref}} \,|\, X) = \mathsf{KL}(H_X(R^{(n)}(Y)) \,\|\, H_X(R(Y)) \,|\, X) \leq \mathsf{KL}(R^{(n)}(Y) \,\|\, R(Y) \,|\, X) = \mathsf{KL}(T_X(E^{(n)}) \,\|\, T_X(E)), \tag{14}$$

and hence under Assumption 2, regardless of whether $T_X$ is a one to one map or not, we have $\mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}} \,\|\, \pi_{\mathrm{ref}} \,|\, X) \leq \mathsf{KL}(E^{(n)} \,\|\, E)$.

The following lemma gives a closed form expression for $\mathsf{KL}(E^{(n)} \,\|\, E)$:

Lemma 1 ($\mathsf{KL}$ Between Exponential and Maximum of Exponentials). Let $E \sim \mathrm{Exp}(1)$, and let $E_1, \dots, E_n$ be i.i.d. exponentials with maximum $E^{(n)}$. We have:

$$\mathsf{KL}(E^{(n)} \,\|\, E) = \log(n) - \frac{n-1}{n}. \tag{15}$$

Hence we conclude with the following result:

Theorem 1. The best of $n$ policy satisfies, under (i) Assumption 1 (reward one to one) and for finite $\mathcal{Y}$, or under (ii) Assumption 2 (existence of a stochastic "inverse"):

$$\mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}} \,\|\, \pi_{\mathrm{ref}}) \leq \mathsf{KL}(E^{(n)} \,\|\, E) = \log(n) - \frac{n-1}{n}. \tag{16}$$
Under Assumption 1, for infinite $\mathcal{Y}$ and assuming $F_{R(Y)|X}$ is continuous and strictly increasing for all $X$, we have:

$$\mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}} \,\|\, \pi_{\mathrm{ref}}) = \mathsf{KL}(E^{(n)} \,\|\, E) = \log(n) - \frac{n-1}{n}. \tag{17}$$

Proof. Combining Lemma 1, the analysis above, and taking the expectation over $X$, we obtain the result. ∎

Beirami et al. (2024) showed this result under condition (i), which is not a realistic setting, and used the finiteness of $\mathcal{Y}$ to provide a direct proof. Our analysis, via chaining DPIs and using the OT and Rényi representations to reduce the problem to exponentials, allows us to extend the result to a more realistic setup under condition (ii), i.e. the existence of a stochastic "inverse," without any assumption on $\mathcal{Y}$. Furthermore, we unveil under which conditions the equality holds that was assumed in previous works (Stiennon et al., 2020; Coste et al., 2024; Nakano et al., 2021; Go et al., 2024; Hilton and Gao, 2022; Gao et al., 2023). Our approach of reduction to exponentials using the Rényi representation of order statistics and data processing inequalities extends to bounding the $f$-divergence $D_f(\pi^{(n)}_{r,\mathrm{ref}} \,\|\, \pi_{\mathrm{ref}})$ as well as the $\alpha$-Rényi divergence. The Rényi divergence for $\alpha \in (0,1) \cup (1,\infty)$ is defined as follows:

$$D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log\left(\int p^\alpha(x)\, q^{1-\alpha}(x)\, dx\right);$$

the limit as $\alpha \to 1$ coincides with $\mathsf{KL}$, i.e. $D_1(P \,\|\, Q) = \mathsf{KL}(P \,\|\, Q)$. These bounds are summarized in Table 1. Full proofs and theorems are in the Appendix.
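The closed form of Lemma 1 can be checked by Monte Carlo. This is a numerical sanity check written for this summary, not an experiment from the paper; it uses the fact that the density of the maximum of $n$ Exp(1) variables is $q(x) = n(1-e^{-x})^{n-1}e^{-x}$, so the log likelihood ratio against $p(x) = e^{-x}$ is $\log n + (n-1)\log(1-e^{-x})$.

```python
import numpy as np

rng = np.random.default_rng(2)
trials = 1_000_000

def kl_max_exp_vs_exp(n):
    """Monte Carlo estimate of KL(E^(n) || E) for E_i ~ Exp(1)."""
    x = rng.exponential(size=(trials, n)).max(axis=1)  # samples from q
    log_ratio = np.log(n) + (n - 1) * np.log1p(-np.exp(-x))
    return float(log_ratio.mean())

for n in [2, 4, 16]:
    mc = kl_max_exp_vs_exp(n)
    exact = np.log(n) - (n - 1) / n  # Lemma 1
    print(n, round(mc, 4), round(exact, 4))
```

The estimates match $\log(n) - \frac{n-1}{n}$ to Monte Carlo precision.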
| Divergence | $f(x)$ | Bound on $D_f(\pi^{(n)}_{r,\mathrm{ref}} \Vert \pi_{\mathrm{ref}})$ |
| $\mathsf{KL}$ | $x\log(x)$ | $\log(n) - \frac{n-1}{n}$ |
| Chi-squared | $(x-1)^2$ | $\frac{(n-1)^2}{2n-1}$ |
| Total Variation | $\frac{1}{2}\lvert x-1\rvert$ | $\left(\frac{1}{n}\right)^{\frac{1}{n-1}} - \left(\frac{1}{n}\right)^{\frac{n}{n-1}}$ |
| Hellinger distance | $(1-\sqrt{x})^2$ | $\frac{2(1-\sqrt{n})^2}{n+1}$ |
| Forward $\mathsf{KL}$ | $-\log(x)$ | $n - 1 - \log(n)$ |
| $\alpha$-Rényi divergence | NA | $\frac{1}{\alpha-1}\log\left(\frac{n^\alpha}{\alpha(n-1)+1}\right)$ |

Table 1. Best of $n$ policy $f$-divergence and $\alpha$-Rényi divergence bounds.

Best of n Policy Dominance over the Reference Policy. The following proposition shows that the best of $n$ policy leads to an improved reward on average:

Proposition 1. $R^{(n)}$ dominates $R$ in first order stochastic dominance, that is, $R^{(n)}$ dominates $R$ on all quantiles: $Q_{R^{(n)}}(t) \geq Q_R(t)$ for all $t \in [0,1]$. It follows that $\mathbb{E}\, R^{(n)} \geq \mathbb{E}\, R$.
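Two of the Table 1 entries can likewise be checked numerically in the exponential reduction, where the divergences between $E^{(n)}$ and $E$ are exactly the tabulated expressions. This Monte Carlo sketch (an illustration added here, not from the paper) uses the likelihood ratio $q/p = n(1-e^{-x})^{n-1}$ between the max density and Exp(1).

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 1_000_000

def table1_check(n):
    """Monte Carlo check of the chi-squared and forward-KL entries of Table 1."""
    x_max = rng.exponential(size=(trials, n)).max(axis=1)   # samples from q
    x_ref = rng.exponential(size=trials)                    # samples from p
    ratio = lambda x: n * (1.0 - np.exp(-x)) ** (n - 1)     # q(x) / p(x)
    chi2 = float(ratio(x_max).mean() - 1.0)      # chi^2(q || p) = E_q[q/p] - 1
    fwd_kl = float(-np.log(ratio(x_ref)).mean())  # KL(p || q) = E_p[log(p/q)]
    return chi2, fwd_kl

n = 8
chi2, fwd = table1_check(n)
print("chi2 :", round(chi2, 3), "exact:", round((n - 1) ** 2 / (2 * n - 1), 3))
print("fwdKL:", round(fwd, 3), "exact:", round(n - 1 - np.log(n), 3))
```

Both estimates agree with the closed forms $\frac{(n-1)^2}{2n-1}$ and $n - 1 - \log(n)$ to sampling error.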
Best of n Policy and RL Policy. The following proposition discusses the sub-optimality of the best of $n$ policy with respect to the RL alignment objective given in (1):

Proposition 2. Assume a bounded reward in $[-M, M]$. For $\Delta > 0$ and $n = \exp(\Delta)$, the best of $n$ policy $\pi^{(n)}_{r,\mathrm{ref}}$ and the $\Delta$-constrained RL policy $\pi_{\lambda_\Delta,r}$ (given in (4)) satisfy:

$$\mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}} \,\|\, \pi_{\lambda_\Delta,r}) \leq \frac{\sqrt{2\pi}\, M \left(e^{\frac{2M}{\lambda_\Delta}} - 1\right)}{\lambda_\Delta} \exp\left(-\frac{\Delta}{2}\right).$$

A similar asymptotic result appeared in (Yang et al., 2024) for $\Delta \to \infty$, showing that as $n \to \infty$, $\mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}} \,\|\, \pi_{\lambda_\Delta,r}) \to 0$; we provide here a non-asymptotic result for finite $n$ and finite $\Delta$.

3. Reward Improvement Guarantees Through Transportation Inequalities

Notations. Let $X$ be a real random variable. The logarithmic moment generating function of $X$ is defined for $\lambda \in \mathbb{R}$ as:

$$\psi_X(\lambda) = \log \mathbb{E}_X\, e^{\lambda(X - \mathbb{E}X)}.$$

$X$ is said to be sub-Gaussian with variance $\sigma^2$ if:

$$\psi_X(\lambda) \leq \frac{\lambda^2 \sigma^2}{2} \text{ for all } \lambda \in \mathbb{R}.$$
We denote by $\mathsf{SubGauss}(\sigma^2)$ the set of sub-Gaussian random variables with variance $\sigma^2$. $X$ is said to be sub-Gamma on the right tail with variance factor $\sigma^2$ and scale parameter $c > 0$ if:
$$\psi_X(\lambda) \leq \frac{\lambda^2 \sigma^2}{2(1 - c\lambda)} \quad \text{for every } \lambda \text{ such that } 0 < \lambda < \frac{1}{c}.$$
We denote by $\mathsf{SubGamma}(\sigma^2, c)$ the set of left- and right-tailed sub-Gamma random variables. Sub-Gamma tails can be thought of as an interpolation between sub-Gaussian and sub-exponential tails.

Scaling Laws in Alignment. It has been observed empirically (Coste et al., 2024; Nakano et al., 2021; Go et al., 2024; Hilton and Gao, 2022; Gao et al., 2023) that the optimal RL policy $\pi_{\lambda_\Delta,r}$ satisfies the following inequality for a constant $\sigma^2$:
$$\mathbb{E}_{\pi_{\lambda_\Delta,r}} r - \mathbb{E}_{\pi_{\mathrm{ref}}} r \leq \sqrt{2\sigma^2\, \mathsf{KL}(\pi_{\lambda_\Delta,r} \,\|\, \pi_{\mathrm{ref}})}.$$
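As a concrete illustration of these definitions (the examples are my own, not from the paper): a Rademacher variable is in $\mathsf{SubGauss}(1)$ since $\log\cosh(\lambda) \leq \lambda^2/2$, and a centered $\mathrm{Exp}(1)$ variable is sub-Gamma on the right tail with $\sigma^2 = c = 1$ since $-\log(1-\lambda) - \lambda \leq \frac{\lambda^2}{2(1-\lambda)}$ for $0 < \lambda < 1$. Both conditions can be verified numerically:

```python
import math

def psi_rademacher(lam: float) -> float:
    # log MGF of a centered Rademacher variable X in {-1, +1}: log E e^{lam X} = log cosh(lam)
    return math.log(math.cosh(lam))

def psi_centered_exp(lam: float) -> float:
    # log MGF of X - 1 for X ~ Exp(1), valid for lam < 1
    return -math.log(1.0 - lam) - lam

# Sub-Gaussian condition with sigma^2 = 1, checked on a symmetric grid of lambda.
assert all(psi_rademacher(l) <= l * l / 2 + 1e-12 for l in [i / 50 - 1 for i in range(101)])
# Sub-Gamma(1, 1) right-tail condition, checked for 0 < lambda < 1.
assert all(psi_centered_exp(l) <= l * l / (2 * (1 - l)) + 1e-12 for l in [i / 100 for i in range(1, 100)])
```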
A similar scaling holds for the best of $n$ policy:
$$\mathbb{E}_{\pi^{(n)}_{r,\mathrm{ref}}} r - \mathbb{E}_{\pi_{\mathrm{ref}}} r \leq \sqrt{2\sigma^2 \left(\log n - \frac{n-1}{n}\right)},$$
and those bounds are oftentimes tight even when estimated empirically from samples. This hints that these bounds are information theoretic and independent of the particular alignment problem. Indeed, if the reward were bounded, a simple application of Pinsker's inequality would give rise to the $\sqrt{\mathsf{KL}}$ scaling. Let $\mathsf{TV}$ be the total variation distance; we have:
$$\mathsf{TV}(\pi, \pi_{\mathrm{ref}}) = \frac{1}{2} \sup_{\|r\|_\infty \leq 1} \left(\mathbb{E}_\pi r - \mathbb{E}_{\pi_{\mathrm{ref}}} r\right) \leq \sqrt{\frac{1}{2}\, \mathsf{KL}(\pi \,\|\, \pi_{\mathrm{ref}})}.$$
Hence we can deduce, for bounded rewards $r$ with infinity norm $\|r\|_\infty$, that:
$$\mathbb{E}_\pi r - \mathbb{E}_{\pi_{\mathrm{ref}}} r \leq \sqrt{2 \|r\|_\infty^2\, \mathsf{KL}(\pi \,\|\, \pi_{\mathrm{ref}})}.$$
Nevertheless, this boundedness assumption on the reward is not realistic, since most reward models are unbounded: quoting Lambert et al. (2024), reward models are "implemented by appending a linear layer to predict one logit or removing the final decoding layers and replacing them with a linear layer", and hence the reward is unbounded by construction.
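The best of $n$ scaling can be illustrated by simulation (my own illustrative setup, not from the paper): take the reference reward to be standard Gaussian, which is in $\mathsf{SubGauss}(1)$, and compare the Monte Carlo estimate of the best-of-$n$ improvement with $\sqrt{2(\log n - \frac{n-1}{n})}$:

```python
import math
import random

random.seed(0)

def best_of_n_gap(n: int, trials: int = 20000) -> float:
    """Monte Carlo estimate of E[max of n N(0,1) draws] minus the reference mean (0)."""
    total = 0.0
    for _ in range(trials):
        total += max(random.gauss(0.0, 1.0) for _ in range(n))
    return total / trials

for n in (2, 10, 50):
    bound = math.sqrt(2.0 * (math.log(n) - (n - 1) / n))  # sigma^2 = 1
    assert best_of_n_gap(n) <= bound
```

For $n = 10$ the empirical improvement is roughly $1.54$ against a bound of about $1.68$, consistent with the claim that the bound is tight up to a modest constant.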
We will show in what follows that these scaling laws are tied to the tails of the reward under the reference policy and are instances of transportation inequalities.

3.1. Transportation Inequalities with $\mathsf{KL}$ Divergence

For a policy $\pi \in \mathcal{P}(\mathcal{Y})$ and a reward function $r: \mathcal{Y} \to \mathbb{R}$, we denote by $r_\sharp \pi$ the push-forward of $\pi$ through $r$. The reader is referred to Appendix D.1 for background on transportation inequalities and how they are derived from the so-called Donsker-Varadhan variational representation of the $\mathsf{KL}$ divergence. The following proposition is an application of Lemma 4.14 in (Boucheron et al., 2013):

Proposition 3 (Transportation Inequalities). The following inequalities hold depending on the tails of $r_\sharp \pi_{\mathrm{ref}}$:
(1) Assume $r_\sharp \pi_{\mathrm{ref}} \in \mathsf{SubGauss}(\sigma^2_{\mathrm{ref}})$. For any $\pi \in \mathcal{P}(\mathcal{Y})$ that is absolutely continuous with respect to $\pi_{\mathrm{ref}}$ and such that $\mathsf{KL}(\pi \,\|\, \pi_{\mathrm{ref}}) < \infty$, we have:
$$\left|\mathbb{E}_\pi r - \mathbb{E}_{\pi_{\mathrm{ref}}} r\right| \leq \sqrt{2 \sigma^2_{\mathrm{ref}}\, \mathsf{KL}(\pi \,\|\, \pi_{\mathrm{ref}})}.$$
(2) Assume $r_\sharp \pi_{\mathrm{ref}} \in \mathsf{SubGamma}(\sigma^2_{\mathrm{ref}}, c)$.
For any $\pi \in \mathcal{P}(\mathcal{Y})$ that is absolutely continuous with respect to $\pi_{\mathrm{ref}}$ and such that $\mathsf{KL}(\pi \,\|\, \pi_{\mathrm{ref}}) < \infty$, we have:
$$\left|\mathbb{E}_\pi r - \mathbb{E}_{\pi_{\mathrm{ref}}} r\right| \leq \sqrt{2 \sigma^2_{\mathrm{ref}}\, \mathsf{KL}(\pi \,\|\, \pi_{\mathrm{ref}})} + c\, \mathsf{KL}(\pi \,\|\, \pi_{\mathrm{ref}}).$$
In particular we have the following corollary:

Corollary 1 (Expected Reward Improvement). If $r_\sharp \pi_{\mathrm{ref}} \in \mathsf{SubGauss}(\sigma^2_{\mathrm{ref}})$, the following holds for the optimal RL policy $\pi_{\lambda_\Delta,r}$ and for the best of $n$ policy $\pi^{(n)}_{r,\mathrm{ref}}$:
(1) For the optimal RL policy $\pi_{\lambda_\Delta,r}$ we have:
$$0 \leq \mathbb{E}_{\pi_{\lambda_\Delta,r}} r - \mathbb{E}_{\pi_{\mathrm{ref}}} r \leq \sqrt{2\sigma^2_{\mathrm{ref}}\, \mathsf{KL}(\pi_{\lambda_\Delta,r} \,\|\, \pi_{\mathrm{ref}})} \leq \sqrt{2\sigma^2_{\mathrm{ref}}\, \Delta}.$$
(2) For the best of $n$ policy $\pi^{(n)}_{r,\mathrm{ref}}$, under Assumption 2 we have:
$$0 \leq \mathbb{E}_{\pi^{(n)}_{r,\mathrm{ref}}} r - \mathbb{E}_{\pi_{\mathrm{ref}}} r \leq \sqrt{2\sigma^2_{\mathrm{ref}}\, \mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}} \,\|\, \pi_{\mathrm{ref}})} \leq \sqrt{2\sigma^2_{\mathrm{ref}} \left(\log n - \frac{n-1}{n}\right)}.$$
A similar statement holds under sub-Gamma tails of the reward of the reference model. We turn now to providing a high-probability bound on the empirical reward improvement of RL.

Remark 1. Item (1) in Corollary 1 shows that $\sqrt{\sigma^2_{\mathrm{ref}}\, \mathsf{KL}}$ provides an upper bound on the reward improvement of alignment under sub-Gaussian tails of the reference reward. Under sub-Gaussian tails of the reference, this information theoretic barrier cannot be broken by a better algorithm. One way to improve on the $\sqrt{\mathsf{KL}}$ ceiling is to aim for a reference model whose reward has sub-Gamma tails, which improves the upper limit to $\sqrt{\sigma^2_{\mathrm{ref}}\, \mathsf{KL}} + c\, \mathsf{KL}$, or sub-exponential tails, which makes the limit linear in the $\mathsf{KL}$. Item (2) can be seen as a refinement of the classical $\sqrt{2\sigma^2_{\mathrm{ref}} \log(n)}$ upper bound on the expectation of the maximum of sub-Gaussians; see e.g. Corollary 2.6 in (Boucheron et al., 2013).
If in addition $r$ is positive and, for $X = r_\sharp \pi_{\mathrm{ref}} - \mathbb{E}_{\pi_{\mathrm{ref}}} r$, we have for $t > 0$ that $\mathbb{P}(X > t) \geq \mathbb{P}(|g| > t)$, where $g \sim \mathcal{N}(0, \sigma^2_\ell)$ for some variance $\sigma^2_\ell$, then we have a matching lower bound for $\pi^{(n)}_{r,\mathrm{ref}}$ that scales with $\sqrt{\sigma^2_\ell \log(n)}$ for sufficiently large $n$ (see (Kamath, 2015)). The following theorem gives high-probability bounds for the excess reward when estimated from empirical samples:

Theorem 2 (High Probability Empirical Reward Improvement For RL). Assume $r_\sharp \pi_{\mathrm{ref}} \in \mathsf{SubGauss}(\sigma^2_{\mathrm{ref}})$. Let $\beta > 1$ and $t_0 > 0$. Let $\pi_{\beta,r}$ be the optimal policy of the penalized RL problem given in Equation (3). Let $R_{i,\beta}$ and $R_{i,\mathrm{ref}}$, $i = 1 \dots m$, be the rewards evaluated at $m$ samples from $\pi_{\beta,r}$ and $\pi_{\mathrm{ref}}$ respectively. Assume that the $\beta$-Rényi divergence $D_\beta(\pi_{\beta,r} \,\|\, \pi_{\mathrm{ref}})$ and $\mathsf{KL}(\pi_{\beta,r} \,\|\, \pi_{\mathrm{ref}})$ are both finite.
The following inequality holds with probability at least $1 - e^{-\frac{m t_0^2}{2\sigma^2_{\mathrm{ref}}}} - e^{-m(\beta-1)t_0}$:
$$\frac{1}{m}\sum_{i=1}^m R_{i,\beta} - \frac{1}{m}\sum_{i=1}^m R_{i,\mathrm{ref}} \leq \sqrt{2\sigma^2_{\mathrm{ref}}\, \mathsf{KL}(\pi_{\beta,r} \,\|\, \pi_{\mathrm{ref}})} + \frac{D_\beta(\pi_{\beta,r} \,\|\, \pi_{\mathrm{ref}}) - \mathsf{KL}(\pi_{\beta,r} \,\|\, \pi_{\mathrm{ref}})}{\beta} + 2t_0.$$
Note that in Theorem 2 we did not make any assumptions on the tails of $r_\sharp \pi_{\beta,r}$, and this results in a biased concentration inequality with a non-negative bias $\frac{D_\beta(\pi_{\beta,r} \| \pi_{\mathrm{ref}}) - \mathsf{KL}(\pi_{\beta,r} \| \pi_{\mathrm{ref}})}{\beta} \geq 0$. For the best of $n$ policy, if the reward is positive and has a folded normal distribution (absolute value of a Gaussian), (Boucheron and Thomas, 2012) provides concentration bounds, owing to the sub-Gamma tails of the maximum of absolute values of Gaussians.

3.2. Tail Adaptive Transportation Inequalities with the Rényi Divergence

An important question raised by the bounds in Corollary 1 is their tightness.
We answer this question by considering additional information on the tails of the reward under the policy $\pi$, and we obtain tail-adaptive bounds that are eventually tighter than those in Corollary 1. Our new bounds leverage a variational representation of the Rényi divergence that uses the logarithmic moment generating functions of both measures at hand.

Preliminaries for the Rényi Divergence. The Donsker-Varadhan representation of $\mathsf{KL}$ was crucial in deriving transportation inequalities. In Shayevitz (2011) the following variational form is given for the Rényi divergence in terms of the $\mathsf{KL}$ divergence, for all $\alpha \in \mathbb{R}$:
$$(1-\alpha)\, D_\alpha(P \,\|\, Q) = \inf_R \left\{ \alpha\, \mathsf{KL}(R \,\|\, P) + (1-\alpha)\, \mathsf{KL}(R \,\|\, Q) \right\}. \tag{18}$$
A similar variational form was rediscovered in (Anantharam, 2018). Finally, a Donsker-Varadhan-Rényi representation of $D_\alpha$ was given in (Birrell et al., 2021). For all $\alpha \in \mathbb{R}_+$, $\alpha \neq 0, 1$, we have:
$$\frac{1}{\alpha} D_\alpha(P \,\|\, Q) = \sup_{h \in \mathcal{H}} \left\{ \frac{1}{\alpha-1} \log\left(\mathbb{E}_P\, e^{(\alpha-1)h}\right) - \frac{1}{\alpha} \log\left(\mathbb{E}_Q\, e^{\alpha h}\right) \right\}, \tag{19}$$
where $\mathcal{H} = \left\{ h \,\middle|\, \int e^{(\alpha-1)h}\, dP < \infty,\; \int e^{\alpha h}\, dQ < \infty \right\}$. Birrell et al. (2021) presents a direct proof of this formulation without exploring its link to the representation given in (18); we show in what follows, via an elementary proof based on convex conjugacy, the duality relationship between equations (18) and (19).

Theorem 3.
For $0 < \alpha < 1$, Equations (18) and (19) are dual of one another. For $\alpha > 1$ they are Toland dual.

We collect in what follows elementary lemmas that will be instrumental in deriving transportation inequalities in terms of the Rényi divergence. Proofs are given in the Appendix.

Lemma 2. Let $\alpha \in (0,1) \cup (1,\infty)$, and define $\mathcal{H} = \left\{ h \,\middle|\, e^{(\alpha-1)(h - \int h\, dP)} \in L^1(P),\; e^{\alpha(h - \int h\, dQ)} \in L^1(Q) \right\}$. For all $h \in \mathcal{H}$ and $\alpha \in (0,1) \cup (1,\infty)$:
$$\int h\, dP - \int h\, dQ \leq \frac{1}{\alpha} D_\alpha(P \,\|\, Q) - \frac{1}{\alpha-1} \log\left(\int e^{(\alpha-1)(h - \int h\, dP)}\, dP\right) + \frac{1}{\alpha} \log\left(\int e^{\alpha(h - \int h\, dQ)}\, dQ\right).$$

Lemma 3. The following limit holds for the Rényi divergence:
$$\lim_{\alpha \to 0} \frac{1}{\alpha} D_\alpha(P \,\|\, Q) = \mathsf{KL}(Q \,\|\, P).$$

Transportation Inequalities with Rényi Divergence. The following theorem shows that, when considering the tails of $\pi$, we can obtain tighter upper bounds using the Rényi divergence, which is more tail adaptive:

Theorem 4 (Tail Adaptive Transportation Inequalities). Let $\alpha \in (0,1)$.
Assume $r_\sharp \pi \in \mathsf{SubGauss}(\sigma^2_\pi)$ and $r_\sharp \pi_{\mathrm{ref}} \in \mathsf{SubGauss}(\sigma^2_{\mathrm{ref}})$. Then we have, for all $\alpha \in (0,1)$:
$$\mathbb{E}_\pi r - \mathbb{E}_{\pi_{\mathrm{ref}}} r \leq \sqrt{2\left((1-\alpha)\sigma^2_\pi + \alpha \sigma^2_{\mathrm{ref}}\right) \frac{D_\alpha(\pi \,\|\, \pi_{\mathrm{ref}})}{\alpha}}. \tag{20}$$
In particular, if there exists $\alpha \in (0,1)$ such that $D_\alpha(\pi \,\|\, \pi_{\mathrm{ref}}) \leq \frac{\alpha \sigma^2_{\mathrm{ref}}}{(1-\alpha)\sigma^2_\pi + \alpha\sigma^2_{\mathrm{ref}}}\, \mathsf{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$, then the tail-adaptive upper bound given in Equation (20) is tighter than the one provided by the tails of $\pi_{\mathrm{ref}}$ alone, i.e. $\sqrt{2\sigma^2_{\mathrm{ref}}\, \mathsf{KL}(\pi \,\|\, \pi_{\mathrm{ref}})}$. Note that this is possible because $D_\alpha$ is increasing in $\alpha \in (0,1)$ (van Erven and Harremos, 2014), i.e. $D_\alpha(\pi \,\|\, \pi_{\mathrm{ref}}) \leq \mathsf{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$, and $\frac{\alpha\sigma^2_{\mathrm{ref}}}{(1-\alpha)\sigma^2_\pi + \alpha\sigma^2_{\mathrm{ref}}} \leq 1$.
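To see Theorem 4 in action, one can take $r$ to be the identity and compare both bounds for a pair of Gaussians; the specific choice $\pi_{\mathrm{ref}} = \mathcal{N}(0,1)$ (so $\sigma^2_{\mathrm{ref}} = 1$) and $\pi = \mathcal{N}(1, 0.5^2)$ (so $\sigma^2_\pi = 0.25$, lighter tails) is my illustrative choice, with $D_\alpha$ and $\mathsf{KL}$ computed by numerical integration:

```python
import math

def npdf(x, mu, sig):
    return math.exp(-((x - mu) ** 2) / (2 * sig * sig)) / (sig * math.sqrt(2 * math.pi))

def integrate(f, lo=-12.0, hi=12.0, steps=40000):
    # midpoint rule; the Gaussian tails beyond [-12, 12] are negligible here
    dx = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * dx) for i in range(steps)) * dx

mu, sig_pi, sig_ref = 1.0, 0.5, 1.0
p = lambda x: npdf(x, mu, sig_pi)    # pi (policy)
q = lambda x: npdf(x, 0.0, sig_ref)  # pi_ref (reference)

alpha = 0.5
d_alpha = math.log(integrate(lambda x: p(x) ** alpha * q(x) ** (1 - alpha))) / (alpha - 1)
kl = integrate(lambda x: p(x) * math.log(p(x) / q(x)))

kl_bound = math.sqrt(2 * sig_ref ** 2 * kl)
tail_bound = math.sqrt(2 * ((1 - alpha) * sig_pi ** 2 + alpha * sig_ref ** 2) * d_alpha / alpha)

gap = mu  # E_pi r - E_ref r for the identity reward
assert d_alpha <= kl                  # monotonicity of D_alpha
assert gap <= tail_bound <= kl_bound  # (20) is valid and tighter here
```

With these parameters the tail-adaptive bound evaluates to roughly $1.25$ against a $\mathsf{KL}$ bound of roughly $1.28$, for a true gap of $1$: a modest but strict improvement, as the theorem predicts when $\sigma^2_\pi < \sigma^2_{\mathrm{ref}}$.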
Note that taking the limits $\alpha \to 0$ (applying Lemma 3) and $\alpha \to 1$, and taking the minimum of the two resulting upper bounds, we obtain:
$$\mathbb{E}_\pi r - \mathbb{E}_{\pi_{\mathrm{ref}}} r \leq \sqrt{2 \min\left(\sigma^2_{\mathrm{ref}}\, \mathsf{KL}(\pi \,\|\, \pi_{\mathrm{ref}}),\; \sigma^2_\pi\, \mathsf{KL}(\pi_{\mathrm{ref}} \,\|\, \pi)\right)};$$
this inequality can also be obtained by applying Proposition 3 twice: to the tails of $\pi$ and of $\pi_{\mathrm{ref}}$ respectively. Another important implication of Theorem 4, beyond a tighter-than-$\mathsf{KL}$ upper bound, is that if we were to change the RL alignment problem (1) to be constrained by $D_\alpha$, $\alpha \in (0,1)$, instead of $\mathsf{KL}$, we might end up with a smaller upper limit on the reward improvement. This $D_\alpha$-constrained alignment may lead to a policy that under-performs compared to a policy obtained with the $\mathsf{KL}$ constraint. This was indeed observed experimentally in (Wang et al., 2024), which used constraints with $\alpha$-divergences for $\alpha \in (0,1)$ (which are related to Rényi divergences) and noticed a degradation in the reward improvement with respect to the policy obtained using $\mathsf{KL}$ constraints.

4. Transportation Inequality Transfer From Proxy to Golden Reward

As we saw in the previous sections, the tightness of the $\sqrt{\mathsf{KL}(\pi \,\|\, \pi_{\mathrm{ref}})}$ upper bound in alignment can be due to the tails of the reward of the aligned policy $\pi$ (Theorem 4) and to the concentration around the mean in finite sample size (Theorem 2).
Another important consideration is the mismatch between the golden reward $r^*$ that one desires to maximize, which is expensive and difficult to obtain (for example human evaluation), and a proxy reward $r$ that approximates $r^*$. The proxy reward $r$ is used instead of $r^*$ in RL and in the best of $n$ policy. While we may know the tails of the reward $r$ of the reference and aligned models, we don't have access to this information for the golden reward $r^*$. We show in this section how to transfer transportation inequalities from $r$ to $r^*$ for RL and the best of $n$ policy.

Proposition 4 ($r^*$ Transportation Inequality for RL Policy). The following inequality holds:
$$\mathbb{E}_{\pi_{\beta,r}} r^* - \mathbb{E}_{\pi_{\mathrm{ref}}} r^* \leq \mathbb{E}_{\pi_{\beta,r}} r - \mathbb{E}_{\pi_{\mathrm{ref}}} r - \frac{1}{\beta} \log\left(\int e^{\beta\left(r - r^* - \left(\int r\, d\pi_{\mathrm{ref}} - \int r^*\, d\pi_{\mathrm{ref}}\right)\right)}\, d\pi_{\beta,r^*}\right).$$
Assume $r_\sharp \pi_{\mathrm{ref}} \in \mathsf{SubGauss}(\sigma^2_{\mathrm{ref}})$, and that there exists $\delta > 0$ such that:
$$\frac{1}{\beta} \log\left(\int e^{\beta\left(r - r^* - \left(\int r\, d\pi_{\mathrm{ref}} - \int r^*\, d\pi_{\mathrm{ref}}\right)\right)}\, d\pi_{\beta,r^*}\right) \geq \delta\, \mathsf{KL}(\pi_{\beta,r^*} \,\|\, \pi_{\mathrm{ref}});$$
then we have:
$$\mathbb{E}_{\pi_{\beta,r}} r^* - \mathbb{E}_{\pi_{\mathrm{ref}}} r^* \leq \sqrt{2\sigma^2_{\mathrm{ref}}\, \mathsf{KL}(\pi_{\beta,r} \,\|\, \pi_{\mathrm{ref}})} - \delta\, \mathsf{KL}(\pi_{\beta,r^*} \,\|\, \pi_{\mathrm{ref}}).$$
Note that $\frac{1}{\beta} \log\left(\int e^{\beta\left(r - r^* - \left(\int r\, d\pi_{\mathrm{ref}} - \int r^*\, d\pi_{\mathrm{ref}}\right)\right)}\, d\pi_{\beta,r^*}\right)$ is interpreted here as an interpolation between the mean and the maximum of its argument on the support of $\pi_{\beta,r^*}$ (Proposition 9 in (Feydy et al., 2018)). Indeed, as $\beta \to 0$ this boils down to the mean $\int (r - r^*)\, d\pi_{\beta,r^*} - \left(\int r\, d\pi_{\mathrm{ref}} - \int r^*\, d\pi_{\mathrm{ref}}\right)$, and as $\beta \to \infty$ it boils down to $\max_{\mathsf{supp}\, \pi_{\beta,r^*}} \left\{ r - r^* - \left(\int r\, d\pi_{\mathrm{ref}} - \int r^*\, d\pi_{\mathrm{ref}}\right) \right\}$. Our assumption means that $r$ overestimates $r^*$, and that the overestimation is accentuated as we drift away from $\pi_{\mathrm{ref}}$, on which $r$ was learned. This assumption echoes findings in (Gao et al., 2023) showing that proxy reward models overestimate the golden reward (see Figure 8 in (Gao et al., 2023)). Note that in Proposition 4 we are evaluating the golden reward $r^*$ improvement when using the proxy-reward optimal policy $\pi_{\beta,r}$.
We see that the golden reward of the RL policy inherits the transportation inequality from the proxy one, but the improvement of the reward is hindered by possible overestimation of the golden reward by the proxy model. This explains the dip in performance, as measured by the golden reward, depicted in Figure 1 and reported in (Gao et al., 2023).

Proposition 5 ($r^*$ Transportation Inequality for Best of n Policy). Let $\varepsilon > 0$. Let $r$ be a surrogate reward such that $\|r - r^*\|_\infty \leq \varepsilon$ and assume $r_\sharp \pi_{\mathrm{ref}} \in \mathsf{SubGauss}(\sigma^2_{\mathrm{ref}})$. Then the best of $n$ policy $\pi^{(n)}_{r,\mathrm{ref}}$ satisfies:
$$\mathbb{E}_{\pi^{(n)}_{r,\mathrm{ref}}}(r^*) - \mathbb{E}_{\pi_{\mathrm{ref}}}(r^*) \leq \sqrt{2\sigma^2_{\mathrm{ref}}\left(\log(n) - \frac{n-1}{n}\right)} + 2\varepsilon \left(\left(\frac{1}{n}\right)^{\frac{1}{n-1}} - \left(\frac{1}{n}\right)^{\frac{n}{n-1}}\right).$$
The transportation inequality thus transfers for the best of $n$ policy from $r$ to $r^*$, paying only an additional error term $2\|r - r^*\|_\infty\, \mathsf{TV}(\pi^{(n)}_{r,\mathrm{ref}} \,\|\, \pi_{\mathrm{ref}})$; an upper bound on this total variation as a function of $n$ is given in Table 1.
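The two terms in Proposition 5 behave differently in $n$: the $\sqrt{\mathsf{KL}}$ term grows like $\sqrt{\log n}$, while the total-variation term from Table 1 increases toward 1, so the extra error is capped at $2\varepsilon$. A small helper sketching the bound (function and parameter names are mine):

```python
import math

def tv_bound(n: int) -> float:
    """Table 1 bound on TV(pi^(n) || pi_ref): (1/n)^(1/(n-1)) - (1/n)^(n/(n-1))."""
    return (1 / n) ** (1 / (n - 1)) - (1 / n) ** (n / (n - 1))

def golden_reward_bound(n: int, sigma2_ref: float, eps: float) -> float:
    """Proposition 5 upper bound on the golden-reward improvement of best of n."""
    kl_term = math.sqrt(2 * sigma2_ref * (math.log(n) - (n - 1) / n))
    return kl_term + 2 * eps * tv_bound(n)

tvs = [tv_bound(n) for n in (2, 10, 100, 10000)]
assert all(0 < t < 1 for t in tvs)
assert tvs == sorted(tvs)  # the TV bound increases toward 1 as n grows
```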
As mentioned in Remark 1, if we have lower bounds on the tail of the reference reward, then we also have a lower bound on the golden-reward improvement that scales like
$$C\sqrt{\sigma^2_\ell \log(n)} - 2\varepsilon\left(\left(\frac{1}{n}\right)^{\frac{1}{n-1}} - \left(\frac{1}{n}\right)^{\frac{n}{n-1}}\right).$$
This is in line with empirical findings in (Hilton and Gao, 2022; Gao et al., 2023) showing that the best of $n$ policy is resilient as the reward model $r$ gets closer to $r^*$.

5. Conclusion

We presented in this paper a comprehensive information-theoretic analysis of policy alignment using reward optimization with RL and best of $n$ sampling. We showed for best of $n$ a bound on the $\mathsf{KL}$ under realistic assumptions on the reward. Our analysis showed that the alignment reward improvement is intrinsically constrained by the tails of the reward under the reference policy, and that controlling the $\mathsf{KL}$ divergence results in an upper bound on the policy improvement. We showed that the $\mathsf{KL}$ bound may not be tight if the tails of the optimized policy satisfy a condition expressed via the Rényi divergence. We also explained the deterioration of the golden reward via overestimation by the proxy reward.

References

Anantharam [2018] V. Anantharam. A variational characterization of Rényi divergences. IEEE Transactions on Information Theory, 64(11):6979–6989, 2018. Bai et al. [2022] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022. Beirami et al. [2024] A. Beirami, A. Agarwal, J. Berant, A. D'Amour, J.
Eisenstein, C. Nagpal, and A. T. Suresh. Theoretical guarantees on the best-of-n alignment policy, 2024. Birrell et al. [2021] J. Birrell, P. Dupuis, M. A. Katsoulakis, L. Rey-Bellet, and J. Wang. Variational representations and neural network estimation of Rényi divergences. SIAM Journal on Mathematics of Data Science, 3(4):1093–1116, 2021. Boucheron and Thomas [2012] S. Boucheron and M. Thomas. Concentration inequalities for order statistics, 2012. Boucheron et al. [2013] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 978-0-19-953525-5. doi: 10.1093/ACPROF:OSO/9780199535255.001.0001. URL https://doi.org/10.1093/acprof:oso/9780199535255.001.0001. Christiano et al. [2017] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf. Coste et al. [2024] T. Coste, U. Anwar, R. Kirk, and D. Krueger. Reward model ensembles help mitigate overoptimization. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=dcjtMYkpXx. Ethayarajh et al. [2024] K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky, and D. Kiela. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024. Feydy et al. [2018] J. Feydy, T. Séjourné, F.-X. Vialard, S.-i. Amari, A. Trouvé, and G. Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences, 2018. Gao et al. [2023] L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023.
Go et al. [2024] D. Go, T. Korbak, G. Kruszewski, J. Rozen, and M. Dymetman. Compositional preference models for aligning LMs. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=tiiAzqi6Ol. Hilton and Gao [2022] J. Hilton and L. Gao. Measuring Goodhart's law, 2022. Kamath [2015] G. Kamath. Bounds on the expectation of the maximum of samples from a Gaussian. URL http://www.gautamkamath.com/writings/gaussian_max.pdf, 2015. Lambert et al. [2024] N. Lambert, V. Pyatkin, J. Morrison, L. Miranda, B. Y. Lin, K. Chandu, N. Dziri, S. Kumar, T. Zick, Y. Choi, N. A. Smith, and H. Hajishirzi. RewardBench: Evaluating reward models for language modeling, 2024. Mudgal et al. [2023] S. Mudgal, J. Lee, H. Ganapathy, Y. Li, T. Wang, Y. Huang, Z. Chen, H.-T. Cheng, M. Collins, T. Strohman, et al. Controlled decoding from language models. arXiv preprint arXiv:2310.17022, 2023. Nakano et al. [2021] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021. Ouyang et al. [2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022. Polyanskiy and Wu [2023] Y. Polyanskiy and Y. Wu. Information theory: From coding to learning, 2023. Rafailov et al. [2024] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024. Rényi [1953] A. Rényi. On the theory of order statistics. Acta Mathematica Academiae Scientiarum Hungarica, 4:191–231, 1953. URL https://api.semanticscholar.org/CorpusID:123132570. Santambrogio [2015] F.
Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Birkhäuser, Cham, 2015. ISBN 9783319208275. doi: 10.1007/978-3-319-20828-2. Shayevitz [2011] O. Shayevitz. On Rényi measures and hypothesis testing. In 2011 IEEE International Symposium on Information Theory Proceedings, pages 894–898, 2011. doi: 10.1109/ISIT.2011.6034266. Stiennon et al. [2020] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020. Touvron et al. [2023] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. van Erven and Harremos [2014] T. van Erven and P. Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014. doi: 10.1109/TIT.2014.2320500. Wang et al. [2024] C. Wang, Y. Jiang, C. Yang, H. Liu, and Y. Chen. Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=2cRzmWXK9N. Yang et al. [2024] J. Q. Yang, S. Salamatian, Z. Sun, A. T. Suresh, and A. Beirami. Asymptotics of language model alignment, 2024. Yang and Klein [2021] K. Yang and D. Klein. FUDGE: Controlled text generation with future discriminators. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3511–3535, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.276.
URL https://aclanthology.org/2021.naacl-main.276.
Zhao et al. [2023] Y. Zhao, M. Khalman, R. Joshi, S. Narayan, M. Saleh, and P. J. Liu. Calibrating sequence likelihood improves conditional language generation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=0qSOodKmJaN.

Appendix A Broader Impact and Limitations

We believe this work, which explains scaling laws for reward models and alignment, will give practitioners insights into the limits of what is attainable via alignment. All assumptions under which our statements hold are stated explicitly. We do not foresee any negative societal impact of this work.

Appendix B Proofs For Best of n Policy

B.1. Best of n Policy $\mathsf{KL}$ Guarantees

Proof of Lemma 1.
$$\mathsf{KL}(E^{(n)}\,||\,E) = \int_0^{+\infty} f_{E^{(n)}}(x)\log\left(\frac{f_{E^{(n)}}(x)}{f_E(x)}\right)dx.$$
We have $f_E(x) = e^{-x}\mathbf{1}_{x\geq 0}$. Note that the CDF of the maximum of $n$ independent exponentials is $F_{E^{(n)}}(x) = (1-e^{-x})^{n}\,\mathbf{1}_{x\geq 0}$, and hence $f_{E^{(n)}}(x) = n(1-e^{-x})^{n-1}e^{-x}\,\mathbf{1}_{x\geq 0}$.

Hence we have:
$$\mathsf{KL}(E^{(n)}||E) = \int_0^{+\infty} n(1-e^{-x})^{n-1}e^{-x}\log\left(\frac{n(1-e^{-x})^{n-1}e^{-x}}{e^{-x}}\right)dx = \int_0^{+\infty} n(1-e^{-x})^{n-1}e^{-x}\log\left(n(1-e^{-x})^{n-1}\right)dx.$$
Let $u = 1-e^{-x}$, so that $du = e^{-x}dx$. It follows that:
\begin{align*}
\mathsf{KL}(E^{(n)}||E) &= \int_0^1 nu^{n-1}\log\left(nu^{n-1}\right)du \\
&= \int_0^1 nu^{n-1}\left(\log(n) + (n-1)\log(u)\right)du \\
&= \log(n)\int_0^1 d(u^n) + (n-1)\int_0^1 nu^{n-1}\log(u)\,du \\
&= \log(n) + (n-1)\int_0^1 d\left(u^n\log u - \frac{u^n}{n}\right) \\
&= \log(n) - \frac{n-1}{n}. \qquad\blacksquare
\end{align*}
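As a quick sanity check (ours, not part of the paper), the closed form $\log(n) - \frac{n-1}{n}$ can be verified by numerically integrating $\int_0^1 nu^{n-1}\log(nu^{n-1})\,du$, i.e. the integral obtained after the substitution $u = 1-e^{-x}$; the helper name below is our own.

```python
import math

def kl_best_of_n(n, steps=200_000):
    """Midpoint-rule approximation of KL(E^(n) || E), written as the
    integral over (0, 1) obtained after the substitution u = 1 - e^{-x}."""
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) / steps  # midpoint of the i-th subinterval of (0, 1)
        total += n * u ** (n - 1) * math.log(n * u ** (n - 1))
    return total / steps

# The quadrature should agree with the closed form log(n) - (n-1)/n.
for n in (2, 4, 16):
    closed_form = math.log(n) - (n - 1) / n
    assert abs(kl_best_of_n(n) - closed_form) < 1e-3
```

Note the logarithmic growth in $n$: doubling the number of best-of-$n$ samples adds only a constant $\log 2$ to the KL budget.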
B.2. Best of n Policy f-Divergence and Rényi Divergence Guarantees

Given that our proof technique relies on the DPI and the Rényi representation, we show that similar results hold for any f-divergence and for the Rényi divergence:
$$D_f(P||Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)dx, \qquad (21)$$
where $f$ is convex and $f(1)=0$. Hence we have, by the DPI for f-divergences:

Theorem 5. Under Assumption 2 the best of n policy satisfies, for any f-divergence:
$$D_f(\pi^{(n)}_{r,\mathrm{ref}}\,||\,\pi_{\mathrm{ref}}) \leq \int_0^1 f\!\left(nu^{n-1}\right)du. \qquad (22)$$

Proof of Theorem 5.
\begin{align*}
D_f(\pi^{(n)}_{r,\mathrm{ref}}||\pi_{\mathrm{ref}}|X) &= D_f(Y^{(n)}||Y|X) \\
&= D_f(H_X(R^{(n)}(Y))||H_X(R(Y))|X) \\
&\leq D_f(R^{(n)}(Y)||R(Y)|X) \quad \text{by the data processing inequality} \qquad (23)\\
&= D_f(T_X(E^{(n)})||T_X(E)) \quad \text{Rényi and optimal transport representations (8)} \\
&= D_f(E^{(n)}||E) \quad \text{since } T_X \text{ is a monotonic bijection, the DPI is an equality} \qquad (24)\\
&= \int_0^{+\infty} f_E(x)\, f\!\left(\frac{f_{E^{(n)}}(x)}{f_E(x)}\right)dx \qquad (25)\\
&= \int_0^{\infty} e^{-x}\, f\!\left(n(1-e^{-x})^{n-1}\right)dx \qquad (26)\\
&= \int_0^1 f\!\left(nu^{n-1}\right)du. \qquad (27)
\end{align*}

In particular we have the following bounds for common f-divergences:

• For $f(x) = x\log(x)$ we obtain the KL divergence and recover the result:
$$\int_0^1 nu^{n-1}\log\left(nu^{n-1}\right)du = \mathsf{KL}(E^{(n)}||E) = \log(n) - \frac{n-1}{n}.$$

• For $f(x) = (x-1)^2$ we obtain the chi-squared divergence:
$$\int_0^1 \left(nu^{n-1}-1\right)^2 du = \int_0^1 \left(n^2u^{2(n-1)} - 2nu^{n-1} + 1\right)du = \frac{n^2}{2n-1}u^{2n-1} - 2u^n + u\,\Big|_0^1 = \frac{n^2}{2n-1} - 2 + 1 = \frac{(n-1)^2}{2n-1}.$$

• For $f(x) = \frac{1}{2}|x-1|$ we obtain the total variation distance ($TV$):
$$\frac{1}{2}\int_0^1 \left|nu^{n-1}-1\right|du = \frac{1}{2}\left(\int_0^{u^*}\left(1-nu^{n-1}\right)du + \int_{u^*}^1 \left(nu^{n-1}-1\right)du\right) = u^* - (u^*)^n,$$
where $n(u^*)^{n-1} = 1$, i.e. $u^* = \left(\frac{1}{n}\right)^{\frac{1}{n-1}}$. Hence the TV is $\left(\frac{1}{n}\right)^{\frac{1}{n-1}} - \left(\frac{1}{n}\right)^{\frac{n}{n-1}}$.

• For $f(x) = (1-\sqrt{x})^2$ we obtain the squared Hellinger distance:
$$\int_0^1 \left(\sqrt{n}\,u^{\frac{n-1}{2}} - 1\right)^2 du = \int_0^1 \left(nu^{n-1} - 2\sqrt{n}\,u^{\frac{n-1}{2}} + 1\right)du = u^n - 2\sqrt{n}\,\frac{u^{\frac{n+1}{2}}}{\frac{n+1}{2}} + u\,\Big|_0^1 = 2\left(1 - \frac{2\sqrt{n}}{n+1}\right) = \frac{2(1-\sqrt{n})^2}{n+1}.$$

• For $f(x) = -\log(x)$ we obtain the forward KL:
$$\int_0^1 f\!\left(nu^{n-1}\right)du = n - 1 - \log(n). \qquad\blacksquare$$

Guarantees with Rényi Divergence. Turning now to the Rényi divergence for $\alpha \in (0,1)\cup(1,\infty)$:
$$D_\alpha(P||Q) = \frac{1}{\alpha-1}\log\left(\int p^\alpha(x)\, q^{1-\alpha}(x)\,dx\right),$$
with the limit as $\alpha \to 1$ given by $D_1(P||Q) = \mathsf{KL}(P||Q)$.

Theorem 6.
Under Assumption 2 the best of n policy satisfies:
$$D_\alpha(\pi^{(n)}_{r,\mathrm{ref}}\,||\,\pi_{\mathrm{ref}}) \leq \frac{1}{\alpha-1}\log\left(\frac{n^\alpha}{\alpha(n-1)+1}\right). \qquad (28)$$

Proof of Theorem 6. Applying the DPI (which also holds for the Rényi divergence) twice, from $Y, Y^{(n)}$ to $R, R^{(n)}$ and from $R, R^{(n)}$ to $E, E^{(n)}$, we obtain:
$$D_\alpha(\pi^{(n)}_{r,\mathrm{ref}}||\pi_{\mathrm{ref}}|X) \leq D_\alpha(E^{(n)}||E).$$
\begin{align*}
D_\alpha(E^{(n)}||E) &= \frac{1}{\alpha-1}\log\left(\int_0^{\infty} n^\alpha (1-e^{-x})^{\alpha(n-1)}\, e^{-\alpha x}\, e^{-x(1-\alpha)}\,dx\right) \\
&= \frac{1}{\alpha-1}\log\left(\int_0^{+\infty} n^\alpha (1-e^{-x})^{\alpha(n-1)}\, e^{-x}\,dx\right).
\end{align*}
Let $u = 1-e^{-x}$, so that $du = e^{-x}dx$:
\begin{align*}
D_\alpha(E^{(n)}||E) &= \frac{1}{\alpha-1}\log\left(\int_0^1 n^\alpha u^{\alpha(n-1)}\,du\right) \\
&= \frac{1}{\alpha-1}\left(\log n^\alpha + \log \int_0^1 u^{\alpha(n-1)}\,du\right) \\
&= \frac{1}{\alpha-1}\left(\log n^\alpha + \log \frac{u^{\alpha(n-1)+1}}{\alpha(n-1)+1}\Big|_0^1\right) \\
&= \frac{1}{\alpha-1}\log\left(\frac{n^\alpha}{\alpha(n-1)+1}\right). \qquad\blacksquare
\end{align*}

From Rényi to KL guarantees. Let $s_1(\alpha) = \alpha-1$ and $s_2(\alpha) = \log\left(\frac{n^\alpha}{\alpha(n-1)+1}\right)$, so that $D_\alpha(E^{(n)}||E) = \frac{s_2(\alpha)}{s_1(\alpha)}$. We have $\mathsf{KL}(E^{(n)}||E) = \lim_{\alpha\to 1} D_\alpha(E^{(n)}||E) = \lim_{\alpha\to 1}\frac{s_2(\alpha)}{s_1(\alpha)} = \frac{0}{0}$, hence applying L'Hôpital's rule:
$$\lim_{\alpha\to 1}\frac{s_2(\alpha)}{s_1(\alpha)} = \lim_{\alpha\to 1}\frac{s_2'(\alpha)}{s_1'(\alpha)} = \lim_{\alpha\to 1}\frac{\log(n) - \frac{n-1}{\alpha(n-1)+1}}{1} = \log(n) - \frac{n-1}{n}.$$
Hence we recover the result for the $\mathsf{KL}$ divergence.
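The closed forms above can all be checked against the common bound $\int_0^1 f(nu^{n-1})\,du$ by quadrature; the snippet below (ours, with hypothetical helper names) does so for the chi-squared, TV, Hellinger, and forward-KL cases, and also checks that the Rényi expression approaches the $\mathsf{KL}$ value as $\alpha \to 1$.

```python
import math

def f_div_integral(f, n, steps=200_000):
    """Midpoint-rule evaluation of the common bound ∫_0^1 f(n u^(n-1)) du."""
    s = 0.0
    for i in range(steps):
        u = (i + 0.5) / steps
        s += f(n * u ** (n - 1))
    return s / steps

n = 5
checks = {
    "chi2":      (lambda t: (t - 1) ** 2,             (n - 1) ** 2 / (2 * n - 1)),
    "tv":        (lambda t: 0.5 * abs(t - 1),
                  (1 / n) ** (1 / (n - 1)) - (1 / n) ** (n / (n - 1))),
    "hellinger": (lambda t: (1 - math.sqrt(t)) ** 2,  2 * (1 - math.sqrt(n)) ** 2 / (n + 1)),
    "fwd_kl":    (lambda t: -math.log(t),             n - 1 - math.log(n)),
}
for name, (f, closed_form) in checks.items():
    assert abs(f_div_integral(f, n) - closed_form) < 1e-3, name

# The Rényi bound tends to the KL value log(n) - (n-1)/n as α → 1.
alpha = 1.0001
renyi = math.log(n ** alpha / (alpha * (n - 1) + 1)) / (alpha - 1)
assert abs(renyi - (math.log(n) - (n - 1) / n)) < 1e-3
```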
B.3. Best of n Dominance

Proof of Proposition 1. $F_{E^{(n)}}(x) = (F_E(x))^n \leq F_E(x)$ for all $x \geq 0$, which implies $F^{-1}_{E^{(n)}}(t) \geq F^{-1}_E(t)$ for all $t\in[0,1]$, i.e. $E^{(n)}$ dominates $E$ in the first stochastic order: $E^{(n)} \succcurlyeq_{\mathrm{FSD}} E$. This means there exists a coupling $\pi \in \Pi(E^{(n)}, E)$ such that $E' \geq e$ for all $(E', e) \sim \pi$. On the other hand, by the Rényi and Monge map representations we have $R^{(n)} = F_R^{-1}\circ F_E(E^{(n)})$ and $R = F_R^{-1}\circ F_E(E)$. Given that $T = F_R^{-1}\circ F_E$ is non-decreasing, the same coupling $\pi$ guarantees that $T(E') \geq T(e)$ for all $(E', e)\sim\pi$, and hence $R^{(n)} \succcurlyeq_{\mathrm{FSD}} R$. $\blacksquare$
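A small Monte Carlo sketch (ours, not the paper's) illustrates this dominance: empirical quantiles of the maximum of $n$ i.i.d. exponentials sit above those of a single exponential, which in turn gives the larger mean used in Corollary 2.

```python
import random

random.seed(0)
n, m = 8, 20_000

# Empirical samples of E (one exponential) and E^(n) (max of n i.i.d. exponentials).
e = sorted(random.expovariate(1.0) for _ in range(m))
e_n = sorted(max(random.expovariate(1.0) for _ in range(n)) for _ in range(m))

# First-order stochastic dominance: the empirical quantile function of E^(n)
# should lie above that of E, up to sampling noise at the extremes.
violations = sum(a < b for a, b in zip(e_n, e))
assert violations / m < 0.01
# Dominance implies a larger expectation (the p = 1 case of the TVaR statement).
assert sum(e_n) / m > sum(e) / m
```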
Corollary 2. The best of n policy has higher expectation:
$$\mathbb{E}\,R^{(n)} \geq \mathbb{E}\,R,$$
and is a safer policy. Let the Tail Value at Risk be:
$$\mathrm{TVAR}_p(R) = \frac{1}{p}\int_0^p Q_R(t)\,dt.$$
We have $\mathrm{TVAR}_p(R^{(n)}) \geq \mathrm{TVAR}_p(R)$ for all $p\in[0,1]$.

Proof of Corollary 2. First-order dominance implies second-order dominance (i.e. by integrating quantiles). The expectation statement is obtained for $p=1$. $\blacksquare$

Appendix C Best of n and RL Policy

Proof of Proposition 2. We fix here $\beta = \frac{1}{\lambda_\Delta}$.
\begin{align*}
\mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}}||\pi_{\beta,r}) &= \int \pi^{(n)}_{r,\mathrm{ref}}(y|x)\log\left(\frac{\pi^{(n)}_{r,\mathrm{ref}}(y|x)}{\pi_{\beta,r}(y|x)}\right) = \int \pi^{(n)}_{r,\mathrm{ref}}(y|x)\log\left(\frac{\pi^{(n)}_{r,\mathrm{ref}}(y|x)}{\pi_{\mathrm{ref}}(y|x)\,\frac{e^{\beta r(x,y)}}{Z_\beta(x)}}\right) \\
&= \mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}}||\pi_{\mathrm{ref}}) + \log\left(\mathbb{E}_{\pi_{\mathrm{ref}}} e^{\beta r}\right) - \beta\int r\, d\pi^{(n)}_{r,\mathrm{ref}}.
\end{align*}
On the other hand, by optimality of $\pi_{\beta,r}$ we have:
$$\mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}}) = \beta\int r\,d\pi_{\beta,r} - \log\left(\int e^{\beta r}\,d\pi_{\mathrm{ref}}\right),$$
and hence:
$$\mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}}||\pi_{\beta,r}) = \mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}}||\pi_{\mathrm{ref}}) - \mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}}) + \beta\left(\int r\,d\pi_{\beta,r} - \int r\,d\pi^{(n)}_{r,\mathrm{ref}}\right).$$
We choose $n$ such that:
$$\mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}}||\pi_{\mathrm{ref}}) \leq \log(n) - \frac{n-1}{n} \leq \mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}}) = \Delta,$$
and we conclude, choosing $n = e^\Delta$, that for this choice of $n$:
$$\mathsf{KL}(\pi^{(n)}_{r,\mathrm{ref}}||\pi_{\beta,r}) \leq \beta\left(\int r\,d\pi_{\beta,r} - \int r\,d\pi^{(n)}_{r,\mathrm{ref}}\right).$$
On the other hand we have:
\begin{align*}
\left|\int r\,d\pi_{\beta,r} - \int r\,d\pi^{(n)}_{r,\mathrm{ref}}\right| &= \left|\int r\,e^{\beta r}\,\frac{1}{Z_\beta}\,d\pi_{\mathrm{ref}} - \int \max_i r(x_i)\,d\pi_{\mathrm{ref}}(x_1)\dots d\pi_{\mathrm{ref}}(x_n)\right| \\
&= \left|\int\left(\frac{1}{n}\sum_{i=1}^n \frac{r(x_i)\exp(\beta r(x_i))}{Z_\beta} - \max_i r(x_i)\right)d\pi_{\mathrm{ref}}(x_1)\dots d\pi_{\mathrm{ref}}(x_n)\right| \\
&= \left|\int\left(\frac{\sum_{i=1}^n r(x_i)\exp(\beta r(x_i))}{\sum_{i=1}^n \exp(\beta r(x_i))}\cdot\frac{\frac{1}{n}\sum_{i=1}^n \exp(\beta r(x_i))}{Z_\beta} - \max_i r(x_i)\right)d\pi_{\mathrm{ref}}(x_1)\dots d\pi_{\mathrm{ref}}(x_n)\right| \\
&\leq \int \left|\max_i r(x_i)\left(\frac{\frac{1}{n}\sum_{i=1}^n \exp(\beta r(x_i))}{Z_\beta} - 1\right)\right| d\pi_{\mathrm{ref}}(x_1)\dots d\pi_{\mathrm{ref}}(x_n) \\
&\leq \frac{M}{Z_\beta}\,\mathbb{E}\left|\frac{1}{n}\sum_{i=1}^n \exp(\beta r(x_i)) - Z_\beta\right|,
\end{align*}
where we used the following fact, followed by Jensen's inequality:
$$\frac{\sum_{i=1}^n r(x_i)\exp(\beta r(x_i))}{\sum_{i=1}^n \exp(\beta r(x_i))} \leq \max_i r(x_i).$$
Assume the reward is bounded by $M$; then by Hoeffding's inequality:
$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^n \exp(\beta r(x_i)) - Z_\beta\right| \geq t\right) \leq 2\,e^{-\frac{nt^2}{2(\exp(\beta M)-\exp(-\beta M))^2}}.$$
Hence we have:
$$\mathbb{E}\left|\frac{1}{n}\sum_{i=1}^n \exp(\beta r(x_i)) - Z_\beta\right| \leq 2\sqrt{\frac{\pi}{2}}\,\frac{\exp(\beta M)-\exp(-\beta M)}{\sqrt{n}},$$
so that:
$$\mathsf{KL}(\pi^{(\exp(\Delta))}_{r,\mathrm{ref}}||\pi_{\lambda_\Delta,r}) \leq \frac{M}{\lambda_\Delta Z_{1/\lambda_\Delta}}\,\sqrt{2\pi}\left(\exp(\beta M)-\exp(-\beta M)\right)\sqrt{\exp(-\Delta)}. \qquad\blacksquare$$

Appendix D Transportation Inequalities and KL Divergence
D.1. Transportation Inequalities with KL

The following lemma (Lemma 4.14 in [Boucheron et al., 2013]) uses the Donsker-Varadhan representation of the KL divergence to obtain bounds on the change of measure, using the tails of $\pi_{\mathrm{ref}}$.

Lemma 4 (Lemma 4.14 in [Boucheron et al., 2013]). Let $\psi$ be a convex and continuously differentiable function on a possibly unbounded interval $[0,b)$, and assume $\psi(0) = \psi'(0) = 0$. Define for every $x \geq 0$ the convex conjugate $\psi^*(x) = \sup_{\lambda\in[0,b)}\, \lambda x - \psi(\lambda)$, and let $\psi^{*-1}(t) = \inf\{x \geq 0 : \psi^*(x) > t\}$. Then the following statements are equivalent:

(i) For $\lambda \in [0,b)$:
$$\log\left(\int e^{\lambda(r - \int r\,dQ)}\,dQ\right) \leq \psi(\lambda);$$

(ii) For any probability measure $P$ that is absolutely continuous with respect to $Q$ and such that $\mathsf{KL}(P||Q) < \infty$:
$$\int r\,dP - \int r\,dQ \leq \psi^{*-1}\left(\mathsf{KL}(P||Q)\right).$$

Lemma 5 (Inverse of the conjugate [Boucheron et al., 2013]).
(1) If $Q \in \mathsf{SubGauss}(\sigma^2)$, we have for $t \geq 0$: $\psi^{*-1}(t) = \sqrt{2\sigma^2 t}$.
(2) If $Q \in \mathsf{SubGamma}(\sigma^2, c)$, we have for $t \geq 0$: $\psi^{*-1}(t) = \sqrt{2\sigma^2 t} + ct$.

We give here a direct proof for the subgaussian case:

Proof.
By the Donsker-Varadhan representation of the $\mathsf{KL}$ we have:
$$\mathsf{KL}(P||Q) = \sup_h \int h\,dP - \log\left(\int e^h\,dQ\right).$$
Fix $x$ and $M > 0$, and define for $0 < \lambda < M$:
$$h_\lambda(y) = \lambda\left(r(x,y) - \mathbb{E}_{\pi_{\mathrm{ref}}(y|x)}\, r(x,y)\right).$$
We omit $x$ and $y$ in what follows, but the reader can assume from here on that $\pi$ and $\pi_{\mathrm{ref}}$ are conditioned on $x$. Note that $R_{\mathrm{ref}|x} = (r(x,\cdot))_\sharp\, \pi_{\mathrm{ref}}(\cdot|x)$, and we assume $R_{\mathrm{ref}|x}$ is subgaussian. Note that
$$\mathbb{E}_{\pi_{\mathrm{ref}}}\, e^{h_\lambda} = \mathbb{E}_{\pi_{\mathrm{ref}|x}}\, e^{\lambda(r - \mathbb{E}_{\pi_{\mathrm{ref}|x}} r)} = M_{R_{\mathrm{ref}|x}}(\lambda),$$
where $M_{R_{\mathrm{ref}|x}}$ is the moment generating function of the reward under the reference policy.

Since $R_{\mathrm{ref}|x}$ is subgaussian, we have for all $\lambda\in\mathbb{R}$:
$$\mathbb{E}_{\pi_{\mathrm{ref}|x}}\, e^{h_\lambda} \leq e^{\frac{\lambda^2\sigma^2}{2}} \leq e^{\frac{M^2\sigma^2}{2}} < \infty.$$
Hence $h_\lambda \in \mathcal{H}$, and we have for all $\pi \ll \pi_{\mathrm{ref}}$, all $0 < M < \infty$ and all $0 < \lambda < M$:
$$\lambda\,\mathbb{E}_{\pi|x}\left(r - \mathbb{E}_{\pi_{\mathrm{ref}|x}}\, r\right) \leq \mathsf{KL}(\pi||\pi_{\mathrm{ref}}|x) + \log\left(\mathbb{E}_{\pi_{\mathrm{ref}|x}}\, e^{\lambda(r - \mathbb{E}_{\pi_{\mathrm{ref}|x}} r)}\right),$$
or equivalently:
$$\mathbb{E}_{\pi|x}\, r - \mathbb{E}_{\pi_{\mathrm{ref}|x}}\, r \leq \frac{1}{\lambda}\mathsf{KL}(\pi||\pi_{\mathrm{ref}}|x) + \frac{1}{\lambda}\log\left(M_{R_{\mathrm{ref}|x}}(\lambda)\right). \qquad (29)$$
Being subgaussian, the MGF of $R_{\mathrm{ref}|x}$ is bounded as follows:
$$\log\left(M_{R_{\mathrm{ref}|x}}(\lambda)\right) \leq \frac{\lambda^2\sigma^2}{2}.$$
Hence we have:
$$\mathbb{E}_{\pi|x}\, r - \mathbb{E}_{\pi_{\mathrm{ref}|x}}\, r \leq \frac{1}{\lambda}\mathsf{KL}(\pi||\pi_{\mathrm{ref}}|x) + \frac{\lambda\sigma^2}{2}.$$
Integrating over $x$, we obtain for all $\pi \ll \pi_{\mathrm{ref}}$ and all $0 < \lambda < M$:
$$\mathbb{E}_{\pi}\, r - \mathbb{E}_{\pi_{\mathrm{ref}}}\, r \leq \frac{1}{\lambda}\mathsf{KL}(\pi||\pi_{\mathrm{ref}}) + \frac{\lambda\sigma^2}{2}.$$
Define:
$$\delta(\lambda) = \frac{1}{\lambda}\mathsf{KL}(\pi||\pi_{\mathrm{ref}}) + \frac{\lambda\sigma^2}{2}.$$
Minimizing the upper bound $\delta(\lambda)$ for $\lambda\in(0,M]$, setting the derivative $\delta'(\lambda) = -\frac{\mathsf{KL}(\pi||\pi_{\mathrm{ref}})}{\lambda^2} + \frac{\sigma^2}{2} = 0$ gives $\lambda^* = \sqrt{\frac{2\,\mathsf{KL}(\pi||\pi_{\mathrm{ref}})}{\sigma^2}}$. Taking $M = 2\lambda^*$, $\lambda^*$ is the minimizer.
Putting this in the bound we have finally, for all rewards $r$ and all $\pi$:
$$\mathbb{E}_{\pi}\, r - \mathbb{E}_{\pi_{\mathrm{ref}}}\, r \leq \sqrt{2\sigma^2\,\mathsf{KL}(\pi||\pi_{\mathrm{ref}})}. \quad (30)$$
∎

Proof of Corollary 1. (i) This follows from the optimality of $\pi_{\lambda_\Delta}$ and applying the transportation inequality for gaussian tails. (ii) This follows from applying Corollary 2 (the best of $n$ policy has a larger mean) and (17) for bounding the $\mathsf{KL}$. ∎

Proof of Theorem 2. For the penalized RL we have by optimality:
$$\int r\,d\pi_{\beta,r} - \frac{1}{\beta}\,\mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}}) = \frac{1}{\beta}\log\Big(\int e^{\beta r}\,d\pi_{\mathrm{ref}}\Big) = \frac{1}{\beta}\log\Big(\int e^{\beta(r-\int r\,d\pi_{\mathrm{ref}})}\,d\pi_{\mathrm{ref}}\Big) + \int r\,d\pi_{\mathrm{ref}}.$$
It follows that:
$$\frac{1}{\beta}\log\Big(\int e^{\beta(r-\int r\,d\pi_{\mathrm{ref}})}\,d\pi_{\mathrm{ref}}\Big) = \int r\,d\pi_{\beta,r} - \int r\,d\pi_{\mathrm{ref}} - \frac{1}{\beta}\,\mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}}). \quad (31)$$
On the other hand, by the variational representation of the Rényi divergence we have:
$$\int r\,d\pi_{\beta,r} - \int r\,d\pi_{\mathrm{ref}} \leq \frac{D_\beta(\pi_{\beta,r}||\pi_{\mathrm{ref}})}{\beta} - \frac{1}{\beta-1}\log\Big(\int e^{(\beta-1)(r-\int r\,d\pi_{\beta,r})}\,d\pi_{\beta,r}\Big) + \frac{1}{\beta}\log\Big(\int e^{\beta(r-\int r\,d\pi_{\mathrm{ref}})}\,d\pi_{\mathrm{ref}}\Big). \quad (32)$$
Summing Equations (31) and (32) we obtain a bound on the moment generating function at $\beta$ of $r_\sharp\pi_{\beta,r}$ (this is not a uniform bound; it holds only for this $\beta$):
$$\frac{1}{\beta-1}\log\Big(\int e^{(\beta-1)(r-\int r\,d\pi_{\beta,r})}\,d\pi_{\beta,r}\Big) \leq \frac{D_\beta(\pi_{\beta,r}||\pi_{\mathrm{ref}}) - \mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}})}{\beta}. \quad (33)$$
Let us assume $\beta>1$; we therefore have the following bound on the logarithmic moment generating function at $\beta-1$:
$$\psi_{r_\sharp\pi_{\beta,r}}(\beta-1) \leq \frac{\beta-1}{\beta}\big(D_\beta(\pi_{\beta,r}||\pi_{\mathrm{ref}}) - \mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}})\big).$$
Let $R_{i,\beta} \sim r_\sharp\pi_{\beta,r}$, $i=1\dots m$, be the reward evaluations of $m$ independent samples of $\pi_{\beta,r}$. We have:
$$\mathbb{P}\Big(\sum_{i=1}^m \big(R_{i,\beta} - \int r\,d\pi_{\beta,r}\big) > mt\Big) = \mathbb{P}\Big(e^{\sum_{i=1}^m(\beta-1)(R_{i,\beta}-\int r\,d\pi_{\beta,r})} > e^{m(\beta-1)t}\Big) \leq e^{-(\beta-1)mt}\,e^{m\,\psi_{R_\beta}(\beta-1)} \leq e^{-m(\beta-1)\big(t-\frac{D_\beta(\pi_{\beta,r}||\pi_{\mathrm{ref}})-\mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}})}{\beta}\big)}. \quad (34)$$
Let $t_0>0$; hence we have for $\beta>1$:
$$\mathbb{P}\Big(\frac{1}{m}\sum_{i=1}^m R_{i,\beta} > \int r\,d\pi_{\beta,r} + t_0 + \frac{D_\beta(\pi_{\beta,r}||\pi_{\mathrm{ref}})-\mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}})}{\beta}\Big) \leq e^{-m(\beta-1)t_0}.$$
Now turning to $R_{\mathrm{ref}} = r_\sharp\pi_{\mathrm{ref}}$: since $R_{\mathrm{ref}} \in \mathsf{SubGauss}(\sigma^2_{\mathrm{ref}})$ we have for every $t_0>0$:
$$\mathbb{P}\Big(-\frac{1}{m}\sum_{i=1}^m R_{i,\mathrm{ref}} > -\int r\,d\pi_{\mathrm{ref}} + t_0\Big) \leq e^{-\frac{m t_0^2}{2\sigma^2_{\mathrm{ref}}}}.$$
Hence we have with probability at least $1 - e^{-\frac{m t_0^2}{2\sigma^2_{\mathrm{ref}}}} - e^{-m(\beta-1)t_0}$:
$$\frac{1}{m}\sum_{i=1}^m R_{i,\beta} - \frac{1}{m}\sum_{i=1}^m R_{i,\mathrm{ref}} \leq \int r\,d\pi_{\beta,r} - \int r\,d\pi_{\mathrm{ref}} + 2t_0 + \frac{D_\beta(\pi_{\beta,r}||\pi_{\mathrm{ref}})-\mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}})}{\beta}$$
$$\leq \sqrt{2\sigma^2_{\mathrm{ref}}\,\mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}})} + 2t_0 + \frac{D_\beta(\pi_{\beta,r}||\pi_{\mathrm{ref}})-\mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}})}{\beta}.$$
∎

Appendix E Proofs for Transportation Inequalities and Rényi Divergence

Proposition 6 (Fenchel Conjugate Properties). Let $F$ and $G$ be convex functions on a space $E$ and let $F^*$, $G^*$ be their convex conjugates defined on $E^*$. We have:
(1) Let $F_\gamma(x) = \gamma F(x)$; then $F^*_\gamma(p) = \gamma F^*\big(\frac{p}{\gamma}\big)$. (35)
(2) Duality: $\min_{x\in E} F(x)+G(x) = \max_{p\in E^*} -F^*(-p)-G^*(p)$. (36)
(3) Toland duality: $\min_{x\in E} F(x)-G(x) = \min_{p} G^*(p)-F^*(p)$. (37)

Proof of Theorem 3. Let $\gamma>0$ and let $F_{P,\gamma}(R) = \gamma\,\mathsf{KL}(R||P)$. The Fenchel conjugate of $F_{P,1}(\cdot)$ is defined, for $h$ a bounded measurable function, as $F^*_{P,1}(h) = \log\mathbb{E}_P\, e^h$.
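The Fenchel pair $(F_{P,1}, F^*_{P,1})$ above is the Donsker–Varadhan duality $\mathsf{KL}(R||P) = \sup_h \mathbb{E}_R h - \log\mathbb{E}_P e^h$, with the supremum attained at $h = \log\frac{dR}{dP}$. A minimal finite-space sketch (the distributions `P` and `R` are arbitrary toy values, not from the paper):

```python
import math

# Donsker-Varadhan check on a 3-point space:
# KL(R||P) = sup_h { E_R h - log E_P e^h }, attained at h = log(dR/dP).
P = [0.5, 0.3, 0.2]
R = [0.2, 0.5, 0.3]

kl = sum(r * math.log(r / p) for r, p in zip(R, P))
h_opt = [math.log(r / p) for r, p in zip(R, P)]  # optimal test function

def dv_gap(h):
    # E_R h - log E_P e^h for a candidate test function h
    return (sum(r * hi for r, hi in zip(R, h))
            - math.log(sum(p * math.exp(hi) for p, hi in zip(P, h))))

assert abs(dv_gap(h_opt) - kl) < 1e-12          # supremum attained at the log-ratio
assert dv_gap([0.1, -0.2, 0.4]) <= kl + 1e-12   # any other h gives a smaller value
```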
It follows by (1) in Proposition 6 that $F^*_{P,\gamma}(h) = \gamma F^*_{P,1}\big(\frac{h}{\gamma}\big) = \gamma\log\mathbb{E}_P\, e^{\frac{h}{\gamma}}$.

For $0<\alpha<1$: the objective function in (18) is the sum of convex functions $F_{P,\alpha}(R) + F_{Q,1-\alpha}(R)$, so by (2) in Proposition 6 we have by duality:
$$(1-\alpha)D_\alpha(P||Q) = \inf_R F_{P,\alpha}(R) + F_{Q,1-\alpha}(R) = \sup_{h\in\mathcal{H}} -F^*_{P,\alpha}(-h) - F^*_{Q,1-\alpha}(h) = \sup_{h\in\mathcal{H}} -\alpha\log\mathbb{E}_P\, e^{-\frac{h}{\alpha}} - (1-\alpha)\log\mathbb{E}_Q\, e^{\frac{h}{1-\alpha}}.$$
Replacing $h$ by $\alpha(1-\alpha)h$ does not change the value of the sup, hence we obtain:
$$(1-\alpha)D_\alpha(P||Q) = \sup_{h\in\mathcal{H}} -\alpha\log\mathbb{E}_P\, e^{-(1-\alpha)h} - (1-\alpha)\log\mathbb{E}_Q\, e^{\alpha h}.$$
Dividing both sides by $\alpha(1-\alpha)$ we obtain for $0<\alpha<1$:
$$\frac{1}{\alpha}D_\alpha(P||Q) = \sup_{h\in\mathcal{H}} -\frac{1}{1-\alpha}\log\mathbb{E}_P\, e^{-(1-\alpha)h} - \frac{1}{\alpha}\log\mathbb{E}_Q\, e^{\alpha h}.$$
For $\alpha>1$: the objective function in (18) is the difference of convex functions $F_{P,\alpha}(R) - F_{Q,\alpha-1}(R)$, so by Toland duality, (3) in Proposition 6, we have:
$$(1-\alpha)D_\alpha(P||Q) = \inf_R F_{P,\alpha}(R) - F_{Q,\alpha-1}(R) = \inf_{h\in\mathcal{H}} F^*_{Q,\alpha-1}(h) - F^*_{P,\alpha}(h) = \inf_{h\in\mathcal{H}} (\alpha-1)\log\mathbb{E}_Q\, e^{\frac{h}{\alpha-1}} - \alpha\log\mathbb{E}_P\, e^{\frac{h}{\alpha}}.$$
The inf does not change when we replace $h$ by $\alpha(\alpha-1)h$, hence we have:
$$(\alpha-1)D_\alpha(P||Q) = -\inf_{h\in\mathcal{H}}\Big[(\alpha-1)\log\mathbb{E}_Q\, e^{\frac{\alpha(\alpha-1)h}{\alpha-1}} - \alpha\log\mathbb{E}_P\, e^{\frac{\alpha(\alpha-1)h}{\alpha}}\Big] = \sup_{h\in\mathcal{H}} \alpha\log\mathbb{E}_P\, e^{(\alpha-1)h} - (\alpha-1)\log\mathbb{E}_Q\, e^{\alpha h}.$$
Dividing both sides by $\alpha(\alpha-1)$ we obtain for $\alpha>1$:
$$\frac{1}{\alpha}D_\alpha(P||Q) = \sup_{h\in\mathcal{H}} \frac{1}{\alpha-1}\log\mathbb{E}_P\, e^{(\alpha-1)h} - \frac{1}{\alpha}\log\mathbb{E}_Q\, e^{\alpha h}.$$
∎

Proof of Lemma 2. Adding and subtracting $\int h\,dP$ and $\int h\,dQ$ respectively in the exponentials, we obtain the result:
$$\frac{1}{\alpha-1}\log\Big(\int e^{(\alpha-1)h}\,dP\Big) - \frac{1}{\alpha}\log\Big(\int e^{\alpha h}\,dQ\Big) = \frac{1}{\alpha-1}\log\Big(\int e^{(\alpha-1)(h-\int h\,dP+\int h\,dP)}\,dP\Big) - \frac{1}{\alpha}\log\Big(\int e^{\alpha(h-\int h\,dQ+\int h\,dQ)}\,dQ\Big)$$
$$= \int h\,dP - \int h\,dQ + \frac{1}{\alpha-1}\log\Big(\int e^{(\alpha-1)(h-\int h\,dP)}\,dP\Big) - \frac{1}{\alpha}\log\Big(\int e^{\alpha(h-\int h\,dQ)}\,dQ\Big).$$
∎

Proof of Lemma 3. Note that we have for $0<\alpha<1$: $\frac{1}{\alpha}D_\alpha(P||Q) = \frac{1}{1-\alpha}D_{1-\alpha}(Q||P)$ (see Proposition 2 in van Erven and Harremos [2014]). Taking limits we obtain
$$\lim_{\alpha\to 0}\frac{1}{\alpha}D_\alpha(P||Q) = D_1(Q||P) = \mathsf{KL}(Q||P).$$
∎

Proof of Theorem 4. For $0<\alpha<1$ we have for all $h\in\mathcal{H}$:
$$\int h\,dP - \int h\,dQ \leq \frac{1}{\alpha}D_\alpha(P||Q) + \frac{1}{1-\alpha}\log\Big(\int e^{(\alpha-1)(h-\int h\,dP)}\,dP\Big) + \frac{1}{\alpha}\log\Big(\int e^{\alpha(h-\int h\,dQ)}\,dQ\Big). \quad (38)$$
Assuming $r$ is bounded, $0<r<b$, then $r_\sharp P$ centered by $\mathbb{E}_P\, r$ and $r_\sharp Q$ centered by $\mathbb{E}_Q\, r$ are sub-Gaussian with parameter $\sigma^2 = \frac{b^2}{4}$.
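The sub-Gaussian parameter $b^2/4$ for a bounded reward is Hoeffding's lemma; a small numeric sketch (the reward scale `b` and probability `p` are arbitrary toy values) checks the centered MGF bound $\mathbb{E}\, e^{\lambda(X-\mathbb{E}X)} \leq e^{\lambda^2 b^2/8}$ for a two-point reward supported on $\{0, b\}$:

```python
import math

# Hoeffding's lemma: X supported on [0, b] is sub-Gaussian with
# sigma^2 = b^2/4, i.e. E exp(lam*(X - EX)) <= exp(lam^2 * b^2 / 8).
b, p = 2.0, 0.3            # arbitrary toy values: scale and P(X = b)
mean = p * b
sigma2 = b * b / 4.0

def mgf_centered(lam):
    # E exp(lam * (X - EX)) for the two-point law P(X=b)=p, P(X=0)=1-p
    return ((1 - p) * math.exp(lam * (0.0 - mean))
            + p * math.exp(lam * (b - mean)))

for lam in (-3.0, -1.0, -0.1, 0.1, 1.0, 3.0):
    assert mgf_centered(lam) <= math.exp(lam * lam * sigma2 / 2.0) + 1e-12
```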
Hence we have for $\lambda\in\mathbb{R}$:
$$\mathbb{E}_P\, e^{\lambda(r-\int r\,dP)} \leq \exp\Big(\frac{\lambda^2\sigma_P^2}{2}\Big) \ \text{ and } \ \mathbb{E}_Q\, e^{\lambda(r-\int r\,dQ)} \leq \exp\Big(\frac{\lambda^2\sigma_Q^2}{2}\Big).$$
Fix a finite $M>0$. For $0<\lambda<M$, $P = \pi|x$ and $Q = \pi_{\mathrm{ref}}|x$, consider $h_\lambda = \lambda r$; thanks to sub-Gaussianity and the boundedness of $\lambda$, $h_\lambda\in\mathcal{H}$ for all $\lambda\in(0,M)$. Hence we have by Equation (38), for all $\lambda\in(0,M)$:
$$\lambda\Big(\int r\,dP - \int r\,dQ\Big) \leq \frac{1}{\alpha}D_\alpha(P||Q) + \frac{1}{1-\alpha}\log\Big(\int e^{\lambda(\alpha-1)(r-\int r\,dP)}\,dP\Big) + \frac{1}{\alpha}\log\Big(\int e^{\lambda\alpha(r-\int r\,dQ)}\,dQ\Big).$$
We have by sub-Gaussianity:
$$\frac{1}{1-\alpha}\log\Big(\int e^{\lambda(\alpha-1)(r-\int r\,dP)}\,dP\Big) \leq \frac{1}{1-\alpha}\cdot\frac{\lambda^2(1-\alpha)^2\sigma_P^2}{2} = \frac{\lambda^2(1-\alpha)\sigma_P^2}{2},$$
$$\frac{1}{\alpha}\log\Big(\int e^{\lambda\alpha(r-\int r\,dQ)}\,dQ\Big) \leq \frac{1}{\alpha}\cdot\frac{\lambda^2\alpha^2\sigma_Q^2}{2} = \frac{\lambda^2\alpha\sigma_Q^2}{2}.$$
It follows that for all $\lambda\in(0,M)$:
$$\lambda\Big(\int r\,d\pi|x - \int r\,d\pi_{\mathrm{ref}}|x\Big) \leq \frac{1}{\alpha}D_\alpha(\pi|x\,||\,\pi_{\mathrm{ref}}|x) + \frac{\lambda^2(1-\alpha)\sigma_P^2}{2} + \frac{\lambda^2\alpha\sigma_Q^2}{2} = \frac{1}{\alpha}D_\alpha(\pi|x\,||\,\pi_{\mathrm{ref}}|x) + \frac{\lambda^2\big((1-\alpha)\sigma_P^2+\alpha\sigma_Q^2\big)}{2}.$$
Integrating over $x$ we obtain:
$$\lambda\Big(\int r\,d\pi - \int r\,d\pi_{\mathrm{ref}}\Big) \leq \frac{1}{\alpha}D_\alpha(\pi||\pi_{\mathrm{ref}}) + \frac{\lambda^2\big((1-\alpha)\sigma_P^2+\alpha\sigma_Q^2\big)}{2}.$$
Finally we have:
$$\int r\,d\pi - \int r\,d\pi_{\mathrm{ref}} \leq \frac{1}{\lambda\alpha}D_\alpha(\pi||\pi_{\mathrm{ref}}) + \frac{\lambda\big((1-\alpha)\sigma_P^2+\alpha\sigma_Q^2\big)}{2}.$$
Minimizing over $\lambda\in(0,M)$ we obtain $\lambda^* = \sqrt{\frac{2 D_\alpha(\pi||\pi_{\mathrm{ref}})}{\big((1-\alpha)\sigma_P^2+\alpha\sigma_Q^2\big)\alpha}}$. $M$ is free of choice; choosing $M = 2\lambda^*$ gives that $\lambda^*$ is the minimizer, and hence we have for all $\alpha\in(0,1)$:
$$\int r\,d\pi - \int r\,d\pi_{\mathrm{ref}} \leq \sqrt{\frac{2\big((1-\alpha)\sigma_P^2+\alpha\sigma_Q^2\big)D_\alpha(\pi||\pi_{\mathrm{ref}})}{\alpha}}.$$
∎

Appendix F Goodhart Laws

Proof of Proposition 4. We have by duality:
$$\frac{1}{\beta}\log\Big(\int e^{\beta r^*}\,d\pi_{\mathrm{ref}}\Big) = \sup_\nu \int r^*\,d\nu - \frac{1}{\beta}\,\mathsf{KL}(\nu||\pi_{\mathrm{ref}}),$$
hence for $\nu = \pi_{\beta,r}$ we have:
$$\frac{1}{\beta}\log\Big(\int e^{\beta r^*}\,d\pi_{\mathrm{ref}}\Big) \geq \int r^*\,d\pi_{\beta,r} - \frac{1}{\beta}\,\mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}}).$$
Hence:
$$\int r^*\,d\pi_{\beta,r} \leq \frac{1}{\beta}\log\Big(\int e^{\beta r^*}\,d\pi_{\mathrm{ref}}\Big) + \frac{1}{\beta}\,\mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}}).$$
On the other hand, by optimality of $\pi_{\beta,r}$ we have:
$$\mathsf{KL}(\pi_{\beta,r}||\pi_{\mathrm{ref}}) = \beta\int r\,d\pi_{\beta,r} - \log\Big(\int e^{\beta r}\,d\pi_{\mathrm{ref}}\Big).$$
Hence we have:
$$\int r^*\,d\pi_{\beta,r} \leq \frac{1}{\beta}\log\Big(\int e^{\beta r^*}\,d\pi_{\mathrm{ref}}\Big) + \int r\,d\pi_{\beta,r} - \frac{1}{\beta}\log\Big(\int e^{\beta r}\,d\pi_{\mathrm{ref}}\Big) = \int r\,d\pi_{\beta,r} + \frac{1}{\beta}\log\Big(\frac{\int e^{\beta r^*}\,d\pi_{\mathrm{ref}}}{\int e^{\beta r}\,d\pi_{\mathrm{ref}}}\Big).$$
It follows that:
$$\int r^*\,d\pi_{\beta,r} - \int r^*\,d\pi_{\mathrm{ref}} \leq \int r\,d\pi_{\beta,r} - \int r\,d\pi_{\mathrm{ref}} + \frac{1}{\beta}\log\Big(\frac{\int e^{\beta(r^*-\int r^*\,d\pi_{\mathrm{ref}})}\,d\pi_{\mathrm{ref}}}{\int e^{\beta(r-\int r\,d\pi_{\mathrm{ref}})}\,d\pi_{\mathrm{ref}}}\Big).$$
Moreover:
$$\frac{\int e^{\beta(r^*-\int r^*\,d\pi_{\mathrm{ref}})}\,d\pi_{\mathrm{ref}}}{\int e^{\beta(r-\int r\,d\pi_{\mathrm{ref}})}\,d\pi_{\mathrm{ref}}} = \int e^{\beta\big(r^*-r-(\int r^*\,d\pi_{\mathrm{ref}}-\int r\,d\pi_{\mathrm{ref}})\big)}\,\frac{e^{\beta r}\,d\pi_{\mathrm{ref}}}{\int e^{\beta r}\,d\pi_{\mathrm{ref}}} = \int e^{\beta\big(r^*-r-(\int r^*\,d\pi_{\mathrm{ref}}-\int r\,d\pi_{\mathrm{ref}})\big)}\,d\pi_{\beta,r}.$$
Hence we have finally:
$$\int r^*\,d\pi_{\beta,r} - \int r^*\,d\pi_{\mathrm{ref}} \leq \int r\,d\pi_{\beta,r} - \int r\,d\pi_{\mathrm{ref}} + \frac{1}{\beta}\log\Big(\int e^{\beta(r^*-r-(\int r^*\,d\pi_{\mathrm{ref}}-\int r\,d\pi_{\mathrm{ref}}))}\,d\pi_{\beta,r}\Big)$$
and
$$\int r^*\,d\pi_{\beta,r} - \int r^*\,d\pi_{\mathrm{ref}} \leq \int r\,d\pi_{\beta,r} - \int r\,d\pi_{\mathrm{ref}} - \frac{1}{\beta}\log\Big(\int e^{\beta(r-r^*-(\int r\,d\pi_{\mathrm{ref}}-\int r^*\,d\pi_{\mathrm{ref}}))}\,d\pi_{\beta,r^*}\Big).$$
The proof follows from using the sub-Gaussianity of $r_\sharp\pi_{\mathrm{ref}}$ and the assumption on the softmax.
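The change-of-measure step above, rewriting the ratio of centered exponential moments under $\pi_{\mathrm{ref}}$ as an expectation under the tilted policy $\pi_{\beta,r}\propto e^{\beta r}\pi_{\mathrm{ref}}$, can be verified exactly on a finite space. All numbers below are arbitrary toy values:

```python
import math

# Check: int e^{b(r*-E r*)} dref / int e^{b(r-E r)} dref
#      = int e^{b(r*-r - (E r* - E r))} d pi_{b,r},  pi_{b,r} ~ e^{b r} * ref
beta = 1.5
pi_ref = [0.4, 0.35, 0.25]
r      = [0.2, 1.0, 0.5]        # proxy reward (toy values)
r_star = [0.3, 0.7, 0.9]        # golden reward (toy values)

Er  = sum(p * ri for p, ri in zip(pi_ref, r))
Ers = sum(p * ri for p, ri in zip(pi_ref, r_star))

Z = sum(p * math.exp(beta * ri) for p, ri in zip(pi_ref, r))
pi_beta = [p * math.exp(beta * ri) / Z for p, ri in zip(pi_ref, r)]  # tilted policy

lhs = (sum(p * math.exp(beta * (rs - Ers)) for p, rs in zip(pi_ref, r_star))
       / sum(p * math.exp(beta * (ri - Er)) for p, ri in zip(pi_ref, r)))
rhs = sum(q * math.exp(beta * (rs - ri - (Ers - Er)))
          for q, ri, rs in zip(pi_beta, r, r_star))
assert abs(lhs - rhs) < 1e-12   # the identity holds exactly
```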
∎

Proof of Proposition 5. We have:
$$\mathbb{E}_\pi(r^*-r) - \mathbb{E}_{\pi_{\mathrm{ref}}}(r^*-r) \leq 2\,\|r-r^*\|_\infty\,\mathsf{TV}(\pi,\pi_{\mathrm{ref}}).$$
For $\pi^{(n)}_{r,\mathrm{ref}}$ we have:
$$\mathbb{E}_{\pi^{(n)}_{r,\mathrm{ref}}}(r^*) - \mathbb{E}_{\pi_{\mathrm{ref}}}(r^*) \leq \mathbb{E}_{\pi^{(n)}_{r,\mathrm{ref}}}(r) - \mathbb{E}_{\pi_{\mathrm{ref}}}(r) + 2\,\|r-r^*\|_\infty\,\mathsf{TV}(\pi^{(n)}_{r,\mathrm{ref}},\pi_{\mathrm{ref}})$$
and
$$\mathbb{E}_{\pi^{(n)}_{r,\mathrm{ref}}}(r^*) - \mathbb{E}_{\pi_{\mathrm{ref}}}(r^*) \geq \mathbb{E}_{\pi^{(n)}_{r,\mathrm{ref}}}(r) - \mathbb{E}_{\pi_{\mathrm{ref}}}(r) - 2\,\|r-r^*\|_\infty\,\mathsf{TV}(\pi^{(n)}_{r,\mathrm{ref}},\pi_{\mathrm{ref}}).$$
By the data processing inequality we have:
$$\mathsf{TV}(\pi^{(n)}_{r,\mathrm{ref}},\pi_{\mathrm{ref}}) \leq \mathsf{TV}(R^{(n)}_{r,\mathrm{ref}},R) = \Big(\frac{1}{n}\Big)^{\frac{1}{n-1}} - \Big(\frac{1}{n}\Big)^{\frac{n}{n-1}}.$$
If $r$ has sub-gaussian tails under $\pi_{\mathrm{ref}}$ then we have:
$$\mathbb{E}_{\pi^{(n)}_{r,\mathrm{ref}}}(r^*) - \mathbb{E}_{\pi_{\mathrm{ref}}}(r^*) \leq \sqrt{2\sigma^2\Big(\log(n) - \frac{n-1}{n}\Big)} + 2\,\|r-r^*\|_\infty\Big(\Big(\frac{1}{n}\Big)^{\frac{1}{n-1}} - \Big(\frac{1}{n}\Big)^{\frac{n}{n-1}}\Big)$$
and
$$\mathbb{E}_{\pi^{(n)}_{r,\mathrm{ref}}}(r^*) - \mathbb{E}_{\pi_{\mathrm{ref}}}(r^*) \leq \sqrt{2\sigma^2\Big(\log(n) - \frac{n-1}{n}\Big)} + 2\inf_{r\in\mathcal{H}}\|r-r^*\|_\infty\Big(\Big(\frac{1}{n}\Big)^{\frac{1}{n-1}} - \Big(\frac{1}{n}\Big)^{\frac{n}{n-1}}\Big).$$
∎
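The closed-form $\mathsf{TV}$ between the best-of-$n$ reward law and the reference can be sanity-checked by direct integration: for $n$ i.i.d. $\mathrm{Uniform}(0,1)$ samples the maximum has density $n x^{n-1}$, and $\frac{1}{2}\int_0^1 |n x^{n-1}-1|\,dx$ matches $(1/n)^{1/(n-1)} - (1/n)^{n/(n-1)}$. A small numeric sketch (the grid resolution is arbitrary):

```python
import math

# TV between Beta(n,1) (max of n iid Uniform(0,1), density n*x^(n-1))
# and Uniform(0,1), versus the closed form (1/n)^(1/(n-1)) - (1/n)^(n/(n-1)).
def tv_closed_form(n):
    return (1.0 / n) ** (1.0 / (n - 1)) - (1.0 / n) ** (n / (n - 1))

def tv_numeric(n, steps=100000):
    # midpoint-rule evaluation of (1/2) * int_0^1 |n x^(n-1) - 1| dx
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += abs(n * x ** (n - 1) - 1.0) * h
    return 0.5 * total

for n in (2, 4, 16):
    assert abs(tv_closed_form(n) - tv_numeric(n)) < 1e-3
assert tv_closed_form(16) > tv_closed_form(2)   # TV grows toward 1 with n
```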