Paper deep dive
A Diffusion Analysis of Policy Gradient for Stochastic Bandits
Tor Lattimore
Abstract
We study a continuous-time diffusion approximation of policy gradient for $k$-armed stochastic bandits. We prove that with a learning rate $\eta = O(\Delta^2/\log(n))$ the regret is $O(k \log(k) \log(n) / \eta)$ where $n$ is the horizon and $\Delta$ the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless $\eta = O(\Delta^2)$.
Tags
Links
- Source: https://arxiv.org/abs/2603.10219v1
- Canonical: https://arxiv.org/abs/2603.10219v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/13/2026, 1:08:19 AM
Summary
The paper provides a continuous-time diffusion analysis of the policy gradient algorithm for k-armed stochastic bandits. It establishes regret bounds for the continuous-time approximation, proving that with a learning rate η = O(Δ²/log(n)), the regret is O(k log(k) log(n) / η). It also demonstrates that for instances with logarithmically many arms, the regret is linear unless η = O(Δ²), highlighting the sensitivity of policy gradient to the learning rate in multi-armed settings.
Entities (5)
Relation Signals (2)
Policy Gradient → appliedto → Stochastic Bandits
confidence 100% · We study a continuous-time diffusion approximation of policy gradient for k-armed stochastic bandits.
Learning Rate → determines → Regret
confidence 95% · we show that if η ≤ cΔ²/log(n), then the regret is at most Ck log(k) log(n)/η.
Cypher Suggestions (2)
Find all algorithms and the problem settings they are applied to. · confidence 90% · unvalidated
MATCH (a:Algorithm)-[:APPLIED_TO]->(p:ProblemSetting) RETURN a.name, p.name
Identify the relationship between hyperparameters and performance metrics. · confidence 90% · unvalidated
MATCH (h:Hyperparameter)-[:DETERMINES]->(m:Metric) RETURN h.name, m.name
Full Text
55,271 characters extracted from source content.
lattimore@google.com

A Diffusion Analysis of Policy Gradient for Stochastic Bandits

Tor Lattimore

1 Introduction

Policy gradient is a classical and widely used reinforcement learning algorithm (sutton1998reinforcement). Our focus is on the dynamics of policy gradient with the softmax policy class on Gaussian bandits. Even in this simple setup, only the two-armed setting is properly understood. We opt for the unusual simplification of studying a continuous-time diffusion approximation of the policy gradient algorithm. We believe (but do not prove) that this is a high-quality approximation. The approach comes with a number of advantages. Most straightforwardly, the randomness arising from the sampling of actions is removed, which slightly simplifies the analysis. More significantly, by switching to continuous time we can exploit the vast literature on stochastic differential equations. Needless to say, we hope the proof ideas can eventually be generalised to discrete time, but for now we are happy to settle for a simpler analysis and new means of understanding a delicate problem.

Basic notation. We let $\mathbf{1}$ and $\mathbf{0}$ denote the vectors of all ones and all zeros, with the dimension always obvious from context. The standard basis vectors in $\mathbb{R}^k$ are $e_1,\ldots,e_k$ and the identity matrix is $\mathrm{Id}$. Given a vector $x$, $\mathrm{diag}(x)$ is the diagonal matrix with $x$ on its diagonal. We let $\|\cdot\|_p$ be the $p$-norm and, for positive definite $A$, $\|x\|_A = (x^\top A x)^{1/2}$. We let $c, C > 0$ be suitably small/large absolute constants.
Bandit notation. There are $k$ actions and the horizon is $n$. Purely to simplify cluttered expressions, we assume that $n \ge \max(3, k)$. We consider Gaussian rewards with means $\mu \in \mathbb{R}^k$ and standard deviations $\sigma \in \mathbb{R}^k$. We assume that $1 \ge \mu_\star = \mu_1 > \mu_2 \ge \cdots \ge \mu_k \ge 0$ and that $\|\sigma\|_\infty \le 1$. We let $\Delta_a = \mu_1 - \mu_a$. Given a vector $\theta \in \mathbb{R}^k$ we let $\pi(\theta)$ be the softmax policy with $\pi(\theta)_a \propto \exp(\theta_a)$. In discrete time the learner sequentially selects actions $(A_t)_{t=1}^n$ over $n$ rounds and observes rewards $Y_t = \mu_{A_t} + \sigma_{A_t}\epsilon_t$, where $(\epsilon_t)_{t=1}^n$ is an independent sequence of standard Gaussian random variables. The regret in the discrete-time setting is
$$\mathrm{Reg}_n = n\mu_\star - \sum_{t=1}^n \mu_{A_t}\,, \qquad \text{where } \mu_\star = \max_{a \in \{1,\ldots,k\}} \mu_a\,.$$
Our main focus, however, is a continuous-time approximation of the discrete process. Let $(B_t)$ be a standard $k$-dimensional Brownian motion on some probability space and $(\mathcal{F}_t)$ its natural filtration. In the continuous-time bandit problem the learner selects a policy $(\pi_t)$, a process in the simplex adapted to the natural filtration associated with the process $(X_t)$ satisfying the stochastic differential equation (SDE)
$$dX_t = \mathrm{diag}(\pi_t)\mu\, dt + \mathrm{diag}(\pi_t)^{1/2}\Sigma^{1/2}\, dB_t\,,$$
where $\Sigma = \mathrm{diag}(\sigma^2)$. The existence of a solution $(X_t)$ might be hard to resolve for complex policies $(\pi_t)$, but it is straightforward for our choice of a softmax policy updated by (continuous-time) policy gradient. The regret in continuous time is $\mathrm{Reg}_n = \int_0^n R_t\, dt$ with $R_t = \mu_\star - \langle \pi_t, \mu\rangle$. Note that $(B_t)$ always denotes the $k$-dimensional Brownian motion driving $(X_t)$; we use $(W_t)$ for a standard one-dimensional Brownian motion, though not always the same one. The classical policy gradient algorithm is given in Algorithm 1, and its continuous-time approximation in Algorithm 2.
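The SDE above is easy to explore numerically. Below is a minimal Euler–Maruyama sketch (not from the paper; the mean vector, step size and horizon are illustrative), taking $\Sigma = \mathrm{Id}$ so the per-coordinate noise scale of $dX$ is $\sqrt{\pi_{t,a}}$, and applying the policy gradient update $d\theta_t = \eta(\mathrm{Id} - \pi_t\mathbf{1}^\top)\, dX_t$ from Algorithm 2:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def simulate_continuous_pg(mu, eta, horizon, dt, rng):
    """Euler-Maruyama discretisation of continuous-time policy gradient
    (Algorithm 2) with Sigma = Id, on a grid of step size dt."""
    k = len(mu)
    theta = np.zeros(k)
    regret = 0.0
    for _ in range(int(horizon / dt)):
        pi = softmax(theta)
        # increment of X_t: drift diag(pi) mu dt, noise diag(pi)^{1/2} dB
        dB = rng.standard_normal(k) * np.sqrt(dt)
        dX = pi * mu * dt + np.sqrt(pi) * dB
        # update d(theta) = eta (Id - pi 1^T) dX; its coordinates sum to 0
        theta += eta * (dX - pi * dX.sum())
        regret += (mu.max() - pi @ mu) * dt  # accumulate R_t dt
    return theta, regret

rng = np.random.default_rng(1)
mu = np.array([1.0, 0.6, 0.3])
theta, reg = simulate_continuous_pg(mu, eta=0.1, horizon=50.0, dt=0.01, rng=rng)
```

Because the update is multiplied by $(\mathrm{Id} - \pi_t\mathbf{1}^\top)$, the simulated $\theta$ stays on the zero-sum hyperplane, matching Lemma 1 below.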
Algorithm 1 (discrete-time policy gradient):
  args: learning rate $\eta > 0$
  let $\theta_1 = \mathbf{0}$
  for $t = 1, 2, \ldots$:
    let $\pi_t = \pi(\theta_t)$
    sample $A_t$ from $\pi_t$ and observe $Y_t \sim \mathcal{N}(\mu_{A_t}, \sigma_{A_t}^2)$
    update, for all $a$: $\theta_{t+1,a} = \theta_{t,a} + \eta(\mathbf{1}_{A_t = a} - \pi_{t,a})Y_t$

Algorithm 2 (continuous-time policy gradient):
  args: learning rate $\eta > 0$
  let $\theta_0 = \mathbf{0} \in \mathbb{R}^k$
  for $t \in [0, n]$:
    let $\pi_t = \pi(\theta_t)$
    observe $dX_t = \mathrm{diag}(\pi_t)\mu\, dt + \mathrm{diag}(\pi_t)^{1/2}\Sigma^{1/2}\, dB_t$
    update $d\theta_t = \eta(\mathrm{Id} - \pi_t\mathbf{1}^\top)\, dX_t$

Contribution. We study the regret of Algorithm 2. On the positive side, we show that if $\eta \le c\Delta_2^2/\log(n)$, then the regret is at most
$$\frac{Ck\log(k)\log(n)}{\eta}\,.$$
More negatively, we show that if $\eta = \Omega(\Delta_2^2)$, then there exist instances with logarithmically many actions and a very large horizon for which the regret is $\Omega(n\Delta_2)$.

Related work. mei2020global study policy gradient in discrete time without noise and show $O(1/t)$ convergence; a similar result in continuous time is provided by walton2020short. mei2023stochastic also prove almost-sure convergence, but under the assumption that the learning rate is $O(\Delta_2^2/k^{3/2})$. mei2024small show that policy gradient converges to a unique optimal action almost surely with any learning rate and that asymptotically the regret is $O(\log(t))$ with unspecified problem-dependent constants. baudry2025does show that setting the learning rate $\eta \le \Delta_2$ is needed for logarithmic regret and that this can fail when $\eta > \Delta_2$. When $k > 2$ they argue that $\eta$ must be $O(1/k)$ and prove that the regret is $O(k^2\log(n)/\Delta_2)$ when $\eta \sim \Delta_2/k$ and all suboptimal arms have the same gap.

2 Policy gradient in continuous time

In discrete time the classical policy gradient algorithm, specialised to $k$-armed bandits, is derived as follows. The value function is $v(\theta) = \langle \pi(\theta), \mu\rangle$ and its gradient is
$$v'(\theta) = (\mathrm{diag}(\pi) - \pi\pi^\top)\mu\,. \quad (1)$$
This quantity is not observed, but is classically estimated in round $t$ using the Monte-Carlo estimator
$$\hat\nabla_t = (\mathrm{Id} - \pi_t\mathbf{1}^\top)e_{A_t}Y_t\,.$$
The policy gradient update with learning rate $\eta$ is then $\theta_{t+1} = \theta_t + \eta\hat\nabla_t$. In continuous time there are two ways to proceed. One option is to approximate the discrete-time stochastic process $(\theta_t)$ by a continuous process with matching drift and variance. Alternatively, the mechanism for deriving the policy gradient method can be repeated from first principles in continuous time. Eq. (1) remains the same in continuous time. Because the drift of the continuous-time process $(X_t)$ is $\mathrm{diag}(\pi_t)\mu$, an unbiased continuous-time estimator of $v'(\theta_t)$ is $(\mathrm{Id} - \pi_t\mathbf{1}^\top)\, dX_t$. The continuous-time policy gradient algorithm is then naturally defined by the stochastic process
$$d\theta_t = \eta(\mathrm{diag}(\pi_t) - \pi_t\pi_t^\top)\mu\, dt + \eta\,\mathrm{diag}(\pi_t)^{1/2}\Sigma^{1/2}\, dB_t - \eta\pi_t\langle \sqrt{\pi_t}, \Sigma^{1/2}\, dB_t\rangle\,,$$
where $\sqrt{\pi_t}$ is taken elementwise. This is equivalent to approximating the discrete-time policy gradient algorithm by matching the mean and variance.

Elementary properties. Before the serious analysis, we start with a collection of elementary results about the vectors $(\theta_t)$ and $(\pi_t)$ defined in Algorithm 2. The proofs appear either inline or in the appendix.

Lemma 1 (conservation). $\sum_{a=1}^k \theta_{t,a} = 0$ almost surely.

Proof. Substitute the definition of $d\theta_t$ to show that $\langle \mathbf{1}, d\theta_t\rangle = 0$. ∎

The next lemma shows that with high probability $\theta_{t,a}$ never becomes too negative. This is intuitive since, by Lemma 1, $\max_b \theta_{t,b} \ge 0$, so that $\pi_{t,a} \le \exp(\theta_{t,a})$. Hence, if $\theta_{t,a}$ becomes very negative, then $\pi_{t,a}$ becomes vanishingly small and the policy gradient updates to $\theta_{t,a}$ become negligible.

Lemma 2. Suppose that $\eta \le 1$. Then, for all actions $a \in \{1,\ldots,k\}$ and $\delta \in (0,1)$,
$$\mathbb{P}\Big(\inf_{t \le n} \theta_{t,a} \le -\log(n/\delta)\Big) \le \delta\,.$$
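The conservation property of Lemma 1 also holds exactly for the discrete-time update, since the coordinates of $\eta(\mathbf{1}_{A_t=a} - \pi_{t,a})Y_t$ sum to zero. A minimal sketch of Algorithm 1 illustrating this (the instance and parameter values are illustrative, not from the paper):

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def discrete_policy_gradient(mu, sigma, eta, n, rng):
    """Algorithm 1: sample A_t ~ pi_t, observe Gaussian reward Y_t,
    apply the REINFORCE-style update to theta."""
    k = len(mu)
    theta = np.zeros(k)
    regret = 0.0
    for _ in range(n):
        pi = softmax(theta)
        a = rng.choice(k, p=pi)
        y = mu[a] + sigma[a] * rng.standard_normal()
        # theta_{t+1,b} = theta_{t,b} + eta * (1{A_t = b} - pi_b) * Y_t
        update = -pi * y
        update[a] += y        # add the indicator term for the played arm
        theta += eta * update # coordinates of `update` sum to zero
        regret += mu.max() - mu[a]
    return theta, regret

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.5])
sigma = np.ones(2)
theta, reg = discrete_policy_gradient(mu, sigma, eta=0.05, n=2000, rng=rng)
```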
The standard behaviour of an effective bandit algorithm is that suboptimal arms are slowly eliminated as they are identified as being statistically suboptimal. For randomised algorithms this roughly means that $1/\pi_{t,1}$ is a measure of the number of actions that still appear competitive. The following lemma relates this quantity to the magnitude of $\theta_{t,1}$, showing that as $\theta_{t,1}$ grows, the number of arms that appear competitive shrinks:

Lemma 3. Suppose that $\theta_{t,a} \in [-\log(n/\delta), k\log(n/\delta)]$ for all $a$ and $\theta_{t,1} \ge \max_a \theta_{t,a} - 1$. Then
$$\frac{1}{\pi_{t,1}} \le \frac{6k\log(n/\delta)}{\theta_{t,1} + 1 + \log(n/\delta)}\,.$$
The conditions in Lemma 3 are interesting. The first holds with high probability by Lemmas 1 and 2. Later we will prove that the second holds with high probability when the learning rate is sufficiently small; it is equivalent to $\pi_{t,1}/\pi_{t,a} \ge 1/e$ for all $a$.

3 Upper bounds

To avoid clutter we assume that the variance matrix is the identity: $\Sigma = \mathrm{Id}$. This corresponds to the assumption in the discrete-time setting that the reward distributions of all actions have unit variance. All results in this section continue to hold when $\Sigma \preceq \mathrm{Id}$. As a warmup, we start with the two-armed setting, which is straightforward and also well understood in discrete time (baudry2025does, Theorem 1). In fact, our proof follows theirs almost without modification.

Proposition 4. Suppose that $a = \Delta_2/\eta > 1$. Then with $k = 2$,
$$\mathbb{E}[\mathrm{Reg}_n] \le \frac{a}{2\Delta_2}\log\Big(1 + \frac{2(a+1)n\Delta_2^2}{a^2}\Big) + \frac{a^2}{2(a-1)\Delta_2}\,.$$
Proof. Abbreviate $\pi_t = \pi_{t,1}$. Starting with the same regret decomposition as proposed by baudry2025does,
$$\mathrm{Reg}_n = \Delta_2\int_0^n (1 - \pi_t)\, dt = \Delta_2\int_0^n \pi_t(1 - \pi_t)\, dt + \Delta_2\int_0^n (1 - \pi_t)^2\, dt\,. \quad (2)$$
Let $Z_t = \theta_{t,1} - \theta_{t,2}$ and $M_t = \exp(bZ_t)$. By Itô's rule,
$$\mathbb{E}[M_n] - 1 = \mathbb{E}\Big[\int_0^n \pi_t(1-\pi_t)M_t\big(2b\eta\Delta_2 + 2b^2\eta^2\big)\, dt\Big] = \mathbb{E}\Big[\int_0^n \pi_t(1-\pi_t)M_t\Delta_2^2\big(2b/a + 2b^2/a^2\big)\, dt\Big]\,.$$
When $b = 1$, then $M_t = \pi_t/(1-\pi_t)$ and in this case
$$\mathbb{E}[M_n] = 1 + \mathbb{E}\Big[\int_0^n \pi_t^2\Delta_2^2(2/a + 2/a^2)\Big] \le 1 + \big(2/a + 2/a^2\big)n\Delta_2^2\,. \quad (3)$$
Therefore, since $(Z_t)$ has drift $2\eta\Delta_2\pi_t(1-\pi_t)$ and $Z_0 = 0$,
$$\Delta_2\,\mathbb{E}\Big[\int_0^n \pi_t(1-\pi_t)\, dt\Big] = \frac{\mathbb{E}[Z_n]}{2\eta} \le \frac{\log(\mathbb{E}[M_n])}{2\eta} \le \frac{a}{2\Delta_2}\log\Big(1 + \frac{2(a+1)n\Delta_2^2}{a^2}\Big)\,, \quad (4)$$
where the inequalities follow from Jensen, Eq. (3) and because $a = \Delta_2/\eta$. For the other term in the decomposition, let $b = -1$ so that $M_t = (1-\pi_t)/\pi_t$ and
$$\mathbb{E}[M_n] - 1 = \mathbb{E}\Big[\int_0^n (1-\pi_t)^2\Delta_2^2(2/a^2 - 2/a)\Big]\,.$$
Since $M_n$ is non-negative and $a > 1$,
$$\Delta_2\,\mathbb{E}\Big[\int_0^n (1-\pi_t)^2\, dt\Big] \le \frac{a^2}{2(a-1)\Delta_2}\,. \quad (5)$$
Combining Eqs. (4) and (5) with Eq. (2) yields the bound. ∎

Remark 5. As $n \to \infty$ and by choosing $a \to 1$ very slowly, one obtains a regret for the (non-anytime) algorithm of $\mathbb{E}[\mathrm{Reg}_n] \sim \log(n)/(2\Delta_2)$, which appears to contradict the lower bound by LR85. The reason there is no contradiction is that the learning rate depends on $\Delta_2$; in this scenario the classical lower bound does not apply and the $\log(n)/(2\Delta_2)$ regret is optimal (garivier2016explore).

The argument above does not generalise to the situation where there are more than two actions. Not only is the problem more difficult to analyse, but the learning rate needs to be considerably smaller and the regret is correspondingly much worse. We will prove the following:

Theorem 6. Suppose that $\eta \le \frac{\Delta_2^2}{8\log(2n^2)}$. Then $\mathbb{E}[\mathrm{Reg}_n] = O\big(k\log(k)\log(n)/\eta\big)$.

We start with a straightforward lemma governing the dynamics of $\log(\pi_{t,1}/\pi_{t,a})$ that reveals the main challenge when $k > 2$.

Lemma 7. Let $Z_{t,a} = \theta_{t,1} - \theta_{t,a}$.
Then
$$dZ_{t,a} = \eta\big[\pi_{t,a}\Delta_a + (\pi_{t,1} - \pi_{t,a})R_t\big]\, dt + \eta\sqrt{\pi_{t,1} + \pi_{t,a} - (\pi_{t,1} - \pi_{t,a})^2}\, dW_t\,,$$
where $(W_t)$ is a standard Brownian motion.

The expression in Lemma 7 is revealing. If we are to prove the algorithm is learning, then we must show that $Z_{t,a}$ increases with time for $a \ne 1$. Naturally, this will be easiest to prove if the SDE associated with $Z_{t,a} = \theta_{t,1} - \theta_{t,a}$ has positive drift and well-controlled noise. Regrettably, however, this is not the case. The drift $\eta\pi_{t,a}\Delta_a + \eta(\pi_{t,1} - \pi_{t,a})R_t$ can be negative if $\pi_{t,1} < \pi_{t,a}$ and the regret is large. If there were no noise, then an elementary analysis of the differential equation would show that the drift is always positive, because at initialisation $\pi_{t,1} \ge \pi_{t,a}$ and the positive drift ensures this stays true throughout the trajectory. The following lemma shows that when $\eta$ is suitably small, $Z_{t,a}$ can be lower-bounded with high probability.

Lemma 8. Suppose that $\eta \le \frac{\Delta_2^2}{8\log(2n/\delta)}$. Then $\mathbb{P}\big(\inf_{t \le n} Z_{t,a} \le -\Delta_2/2\big) \le \delta$.

Proof. Abbreviate $Z_t = Z_{t,a}$, let $\tau = \inf\{t : Z_t = -\Delta_2/2\}$, let
$$s(t) = \eta^2\int_0^t \big(\pi_{u,1} + \pi_{u,a} - (\pi_{u,1} - \pi_{u,a})^2\big)\, du$$
and let $t(\cdot) = s^{-1}(\cdot)$ be its inverse. Lastly, let $U_{s(t)} = Z_t$. By definition, $\inf_{t \le n} Z_t$ has the same law as $\inf_{s \le s(n)} U_s$. Moreover, when $U_s \ge -\Delta_2/2$, then
$$dU_s = \frac{1}{\eta}\Big[\frac{\pi_{t(s),a}\Delta_a + \pi_{t(s),a}(\exp(U_s) - 1)R_{t(s)}}{\pi_{t(s),1} + \pi_{t(s),a} - (\pi_{t(s),1} - \pi_{t(s),a})^2}\Big]\, ds + dW_s$$
$$\ge \frac{1}{2\eta}\Big[\frac{\pi_{t(s),a}\Delta_a}{\pi_{t(s),1} + \pi_{t(s),a}}\Big]\, ds + dW_s = \frac{\Delta_a}{2\eta(\exp(U_s) + 1)}\, ds + dW_s\,,$$
where the inequality holds because $R_{t(s)} \le 1$ and $\exp(U_s) - 1 \ge \exp(-\Delta_2/2) - 1 \ge -\Delta_2/2 \ge -\Delta_a/2$, which means the numerator is positive.
Looking at this relation, when the learning rate is small relative to $\Delta_2$ the process has considerable positive drift while $U_s \in [-\Delta_2/2, 1]$. Formally, since $s(n) \le n$ and by a comparison to the SDE in Proposition 16, $\mathbb{P}\big(\inf\{U_s : 0 \le s \le s(n)\} \le -\Delta_2/2\big) \le \delta$. ∎

We are now in a position to prove Theorem 6.

Proof of Theorem 6. Let $\delta = 1/n$ and $Z_{t,a} = \theta_{t,1} - \theta_{t,a}$ and define a stopping time $\tau$ by
$$\tau = \inf\Big\{t \le n : \min_{a \in \{2,\ldots,k\}} Z_{t,a} \le -\Delta_2/2 \ \text{ or } \ \min_{a \in \{1,\ldots,k\}} \theta_{t,a} \le -\log(n/\delta)\Big\}\,.$$
By Lemmas 2 and 8 and a union bound, with probability at least $1 - 2k\delta$ we have $\tau = n$. For $t \le \tau$, Lemma 1 shows that $-\log(n/\delta) \le \theta_{t,1} = -\sum_{a=2}^k \theta_{t,a} \le k\log(n/\delta)$. For $u > -\log(n/\delta)$, let
$$\psi'(u) = \frac{6k\log(n/\delta)}{u + 1 + \log(n/\delta)}\,,$$
which is the function that appeared in Lemma 3. Then let
$$\psi(u) = \int_0^u \psi'(v)\, dv = 6k\log(n/\delta)\log\Big(\frac{u + 1 + \log(n/\delta)}{1 + \log(n/\delta)}\Big)\,.$$
Later we will use the fact that
$$\psi(k\log(n/\delta)) \le 6k\log(n/\delta)\log(1 + k)\,. \quad (6)$$
Moreover, for $u > -\log(n/\delta)$,
$$\psi''(u) = -\frac{6k\log(n/\delta)}{(u + 1 + \log(n/\delta))^2} \ge -\psi'(u)\,. \quad (7)$$
Let $\sigma_t^2 = \eta^2\pi_{t,1}(1 - \pi_{t,1})$. By definition $d\theta_{t,1} = \eta\pi_{t,1}R_t\, dt + \sigma_t\, dW_t$ and hence, by Itô's formula,
$$d\psi(\theta_{t,1}) = \psi'(\theta_{t,1})\, d\theta_{t,1} + \tfrac{1}{2}\psi''(\theta_{t,1})\sigma_t^2\, dt$$
$$\overset{(a)}{\ge} \Big[\eta\psi'(\theta_{t,1})\pi_{t,1}R_t - \frac{\eta^2\psi'(\theta_{t,1})\pi_{t,1}(1 - \pi_{t,1})}{2}\Big]\, dt + \psi'(\theta_{t,1})\sigma_t\, dW_t$$
$$\overset{(b)}{\ge} \Big[\eta\psi'(\theta_{t,1})\pi_{t,1}R_t - \frac{\eta^2\psi'(\theta_{t,1})\pi_{t,1}R_t}{2\Delta_2}\Big]\, dt + \psi'(\theta_{t,1})\sigma_t\, dW_t$$
$$\overset{(c)}{\ge} \tfrac{1}{2}\eta\psi'(\theta_{t,1})\pi_{t,1}R_t\, dt + \psi'(\theta_{t,1})\sigma_t\, dW_t$$
$$\overset{(d)}{\ge} \tfrac{1}{2}\eta R_t\, dt + \psi'(\theta_{t,1})\sigma_t\, dW_t\,,$$
where (a) follows from Eq. (7), (b) since $R_t \ge (1 - \pi_{t,1})\Delta_2$, (c) since $\eta \le \Delta_2$, and (d) follows from Lemma 3. Taking expectations and rearranging, in combination with Eq. (6), shows that
$$\mathbb{E}[\mathrm{Reg}_\tau] \le \frac{2}{\eta}\mathbb{E}[\psi(\theta_{\tau,1})] \le \frac{12k\log(n/\delta)\log(1 + k)}{\eta}\,.$$
Since $\mathbb{P}(\tau < n) \le 2k\delta = 2k/n$, we have $\mathbb{E}[\mathrm{Reg}_n] \le 2k + \mathbb{E}[\mathrm{Reg}_\tau]$ and the argument is complete. ∎

Remark 9. The analysis in Lemma 8 can be improved by refining the bound $R_t \le 1$ to $R_t \le \Delta_k$. With this change one obtains the same bound as Theorem 6, but the condition on $\eta$ can be relaxed to $\eta = \tilde O(\Delta_2^2/\Delta_k)$. Consequently, when all actions have the same suboptimality gap and $\eta = c\Delta_2/\log(n)$, then
$$\mathbb{E}[\mathrm{Reg}_n] = O\Big(\frac{k\log(k)\log(n)^2}{\Delta_2}\Big)\,.$$

4 Lower bound

Proposition 4 shows that when there are only two actions, setting $\eta$ just slightly smaller than $\Delta_2$ leads to near-optimality of policy gradient. The next construction shows that when there are more than two actions the correct choice of $\eta$ can be as small as $\Delta_2^2$, and in general no choice leads to near-optimal regret in the sense of being remotely close to the lower bound by LR85. The fundamental difference is that when $k = 2$ the drift of $\theta_{t,1} - \theta_{t,2}$ is always positive. When there are more arms, this holds only on trajectories that are well behaved, in the sense that $\theta_{t,1} - \theta_{t,a}$ never gets too negative for $a \ne 1$.

Lower bound construction. The poor behaviour of policy gradient is most dramatic when $k > 2$ and $\mu$ is such that $\Delta = (0, \Delta_2, 1, \ldots, 1)$ with $\Delta_2$ close to zero. The following arguments can be made to work with $\Sigma = \mathrm{Id}$, but are simplified considerably by instead letting $\Sigma = \mathrm{diag}(e_1 + e_2)$, so that the only noise arises from the first two actions.
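This construction is easy to explore in a discrete-time simulation. A sketch (parameters are illustrative and not from the paper): run Algorithm 1 on $\mu = (1, 1 - \Delta_2, 0, \ldots, 0)$, i.e. gaps $\Delta = (0, \Delta_2, 1, \ldots, 1)$, with noise only on the first two arms. With $\eta$ well above $\Delta_2^2$, one of the first two arms tends to absorb almost all of the probability mass, and not always the optimal one:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def run_lower_bound_instance(k, delta2, eta, n, rng):
    """Discrete-time policy gradient on mu = (1, 1 - delta2, 0, ..., 0)
    with unit-variance noise only on the first two arms
    (Sigma = diag(1, 1, 0, ..., 0)). Returns the final policy."""
    mu = np.concatenate(([1.0, 1.0 - delta2], np.zeros(k - 2)))
    sigma = np.concatenate(([1.0, 1.0], np.zeros(k - 2)))
    theta = np.zeros(k)
    for _ in range(n):
        pi = softmax(theta)
        a = rng.choice(k, p=pi)
        y = mu[a] + sigma[a] * rng.standard_normal()
        update = -pi * y
        update[a] += y
        theta += eta * update
    return softmax(theta)

rng = np.random.default_rng(2)
pi_final = run_lower_bound_instance(k=12, delta2=0.02, eta=0.1, n=20000, rng=rng)
```

Running this over many seeds and recording which of the first two arms ends up dominant gives a rough picture of the 'picks a winner' phenomenon described next.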
What happens in this scenario is that for a long time actions $1$ and $2$ are nearly statistically indistinguishable. Since actions $a > 2$ are very suboptimal, $\theta_{t,a}$ for $a > 2$ decreases rapidly while $\theta_{t,1}$ and $\theta_{t,2}$ increase. But they do not increase together at the same rate. Rather, unless the learning rate is very small, the dynamics of policy gradient effectively 'picks a winner' in $\{1, 2\}$ at random, and by the time $\pi_{t,a}$ is negligible for $a > 2$, it turns out that either $\pi_{t,1}$ or $\pi_{t,2}$ is nearly $1$. From then on policy gradient behaves as if the bandit had two actions, but with a potentially very disadvantageous initialisation.

Theorem 10. When $\Delta = (0, \Delta_2, 1, \ldots, 1)$ and $\Sigma = \mathrm{diag}(1, 1, 0, \ldots, 0)$ and $k \ge C\log(n/\Delta_2)$ and the learning rate is bounded as $C\Delta_2^2 \le \eta \le c/k$, the regret of continuous policy gradient (Algorithm 2) satisfies $\mathbb{E}[\mathrm{Reg}_n] = \Omega(n\Delta_2)$.

Remark 11. The assumption that $\eta \le c/k$ is for simplicity in the analysis and can be removed at the price of a slightly more complicated proof.

In full gory detail the proof is embarrassingly intricate. We try to present the high-level argument in the main body and defer the fiddly computations to the appendix. Alternatively, you can skip the analysis entirely and look at Fig. 1 in Appendix F, which illustrates the striking behaviour of discrete-time policy gradient on the instance in Theorem 10. You may also wonder whether the lower bound might hold with $k = 3$. We believe the answer is no, which is partially supported by the results in Fig. 2.

Proof of Theorem 10. Abbreviate $m = k - 2$ and let $\pi_{t,3+} = \sum_{a=3}^k \pi_{t,a}$ be the probability assigned to all but the first two actions. Our argument revolves around showing that with constant probability $\theta_{t,2}$ grows much faster than $\theta_{t,1}$ during the initial period when $\pi_{t,3+}$ is decreasing rapidly.
Let $S_t = \theta_{t,1} + \theta_{t,2}$ and $Z_t = \theta_{t,1} - \theta_{t,2}$ and $C_t = \eta\pi_{t,3+}(1 - \pi_{t,3+})$ and $s(t) = \int_0^t C_u\, du$, and let $s \mapsto t(s)$ be its inverse.

Definition 12. Let $\epsilon = 2/m$ and define a stopping time $\tau$ as the first time that any of the following holds: (a) $\tau = n$; or (b) $S_\tau \notin \big((1 - \Delta_2)s(\tau) - 1,\ s(\tau) + 1\big)$; or (c) $Z_\tau \ge 2\,\mathrm{arcsinh}\big(\sqrt{\eta}\,(e^{-s(\tau)/4} - 1/16)\,e^{(1-\epsilon)s(\tau)/2}\big)$; or (d) $s(\tau) = s_{\max} \triangleq \frac{1}{1-\epsilon}\big[\log(400/\eta) + \log(n)\big]$.

By our assumption that the variance arises only from the first two actions, it follows that $\theta_{t,a} = \theta_{t,b}$ for all $t$ and $a, b \ge 3$, which by Lemma 1 and the definition $S_t = \theta_{t,1} + \theta_{t,2}$ means that $\theta_{t,a} = -S_t/m$ and
$$\frac{\pi_{t,1}}{\pi_{t,3+}} = \frac{1}{m}\exp\Big(\frac{S_t + Z_t}{2} + \frac{S_t}{m}\Big) \triangleq G_t\,.$$
Note that if $t \le \tau$, then by the inequality $\mathrm{arcsinh}(x) \le -\log(-2x)$ for $x < 0$,
$$Z_t \le \min\Big(1,\ \log\Big(\frac{400}{\eta}\Big) - (1 - \epsilon)s(t)\Big)\,. \quad (8)$$

Step 1: Dynamics and intuition. By substituting the definitions and crude bounding we obtain the following:
$$dS_t = \Big[1 - \frac{\pi_{t,2}\,\Delta_2}{1 - \pi_{t,3+}}\Big]C_t\, dt + \sqrt{\eta\,\pi_{t,3+}\,C_t}\, dW_t \quad \text{and} \quad (9)$$
$$dZ_t \le \big[\Delta_2(1 + 2G_t) + \tanh(Z_t/2)\big]C_t\, dt + \sqrt{\eta\,\sigma_t^2\,C_t}\, dW_t\,, \quad (10)$$
where
$$1 - \tanh(Z_t/2)^2 \le \sigma_t^2 \le 1 + 4G_t\,. \quad (11)$$
The tedious calculation establishing the above is deferred to the last step. You will surely notice that these SDEs have been massaged so that the drift terms depend linearly on $C_t$ and the diffusion terms depend linearly on $\sqrt{C_t}$, which suggests we might make progress by changing time using the clock $s(t) = \int_0^t C_u\, du$. Rather than moving immediately to the formalities, let us spend a moment arguing informally. The SDE in Eq. (9) is very well behaved after a time change.
Since $\Delta_2$ is small and $\pi_{t,2} \le \pi_{t,1} + \pi_{t,2} = 1 - \pi_{t,3+}$, we have
$$dS_{t(s)} \approx ds + \sqrt{\eta\,\pi_{t(s),3+}}\, dW_s\,,$$
which implies that with high probability $S_{t(s)} \approx s$ for all $s \le s_{\max}$. The SDE in Eq. (10) is much more complicated. At initialisation $G_t = 1/m$ is relatively small, so that
$$dZ_{t(s)} \approx \big[\Delta_2 + \tanh(Z_{t(s)}/2)\big]\, ds + \sqrt{\eta}\, dW_s\,.$$
When $|z|$ is small, $\tanh(z/2) \approx z/2$. Hence, when $|Z_{t(s)}|$ is small, a reasonable approximation of the solution (see Remark 17) is
$$Z_{t(s)} \approx 2(\exp(s/2) - 1)\Delta_2 + \sqrt{\eta}\exp(s/2)\int_0^s \exp(-u/2)\, dW_u\,,$$
which means that $Z_{t(s)}$ has approximate law $\mathcal{N}\big(2(\exp(s/2) - 1)\Delta_2,\ \eta(\exp(s) - 1)\big)$. In particular, provided that $\sqrt{\eta}$ is suitably larger than $\Delta_2$, the noise dominates: with roughly constant probability $|Z_{t(s)}| \approx \exp(s/2)\sqrt{\eta}$ and $\mathrm{sign}(Z_{t(s)})$ is close to uniform on $\{-1, 1\}$. In particular, if $s \approx \log(1/\eta)$, then $|Z_{t(s)}| \approx 1$. At this point $\tanh(Z_{t(s)}/2) \approx \mathrm{sign}(Z_{t(s)})$, the drift term is roughly constant, and depending on the sign the process drifts up or down at a roughly linear rate. Hence a reasonable ansatz is that $|Z_{t(s)}| \approx s$ for large $s$, with the sign roughly uniform when $\sqrt{\eta}$ is sufficiently larger than $\Delta_2$. This turns out to be the correct rate. As a consequence, we can expect that $S_t \approx s(t)$ and $Z_t \approx -s(t)$ with roughly constant probability. Hence $\theta_{t,1} = (S_t + Z_t)/2 \approx 0$ while $\theta_{t,2} = (S_t - Z_t)/2 \approx s(t)$, which at minimum implies that $\theta_{t,1}$ does not grow much larger than $\theta_{t,2}$ and implies linear regret. The challenge is that we need a sample-path result, and the additional terms in Eq. (10) need to be handled carefully.

Remark 13. Handling the additional terms is unsurprisingly non-trivial.
Indeed, they are what causes the argument above to fail asymptotically, which it must, because for any learning rate policy gradient does learn asymptotically.

Step 2: Formal details. The following proposition formalises the ansatz in the previous step. The proof is based on a technical comparison to the standard linear SDE and is deferred to Appendix E.

Proposition 14. With probability at least $c$, either $\tau = n$ or $s(\tau) = s_{\max}$.

Hence it suffices to show that on this event $\int_0^n R_t\, dt \ge cn\Delta_2$. There are two cases. When $\tau = n$, then $Z_t < 1$ for all $t \le n$ and $\pi_{t,1} \le e\pi_{t,2} \le e(1 - \pi_{t,1})$, and hence $\pi_{t,1} \le e/(1 + e)$, from which it follows that $\mathrm{Reg}_n = \Omega(n\Delta_2)$. Suppose instead that $\tau < n$, which means that $s(\tau) = s_{\max}$. Therefore, by Eq. (8),
$$Z_\tau \le \log\Big(\frac{400}{\eta}\Big) - (1 - \epsilon)s_{\max} = -\log(n)\,.$$
Similarly,
$$S_\tau \ge (1 - \Delta_2)s_{\max} - 1 \ge (1 - \epsilon)s_{\max} - 1 = \log\Big(\frac{400n}{\eta}\Big) - 1 \ge \log(n/\eta)\,.$$
Therefore $\pi_{\tau,1} \le \exp(Z_\tau) \le \frac{1}{n}$ and, since $\eta \le c/k \le c/m$,
$$\pi_{\tau,3+} \le m\exp\Big(-\frac{S_\tau}{m} - \frac{S_\tau - Z_\tau}{2}\Big) \le \frac{m\eta}{n} \le \frac{1}{n}\,.$$
Hence $\pi_{\tau,2} \ge 1 - 2/n$, and from this a straightforward argument shows that with high probability the algorithm never recovers:
$$\mathbb{P}\Big(\sup_{t \ge \tau} \pi_{t,1} \ge 1/2 \,\Big|\, \mathcal{F}_\tau\Big) \le 1/2\,.$$
Since $Z_t < 1$ for $t \le \tau$, it follows that with constant probability $\pi_{t,1} \le 1/2$ for all $t \le n$, and therefore $\mathbb{E}[\mathrm{Reg}_n] = \Omega(n\Delta_2)$ as required.

Step 4: The calculations. We promised the concrete calculations. By definition,
$$d\theta_{t,1} = \eta\pi_{t,1}R_t\, dt + \eta\big[\sqrt{\pi_{t,1}}\, dB_{t,1} - \pi_{t,1}^{3/2}\, dB_{t,1} - \pi_{t,1}\sqrt{\pi_{t,2}}\, dB_{t,2}\big] \quad \text{and}$$
$$d\theta_{t,2} = \eta\pi_{t,2}\big[R_t - \Delta_2\big]\, dt + \eta\big[\sqrt{\pi_{t,2}}\, dB_{t,2} - \pi_{t,2}^{3/2}\, dB_{t,2} - \pi_{t,2}\sqrt{\pi_{t,1}}\, dB_{t,1}\big]\,.$$
The calculation for $dS_t$ follows by substituting the definitions.
For $dZ_t$, we have
$$dZ_t = \eta\big[\pi_{t,2}\Delta_2 + (\pi_{t,1} - \pi_{t,2})R_t\big]\, dt + \eta\sqrt{\pi_{t,1}(1 + \pi_{t,2} - \pi_{t,1})^2 + \pi_{t,2}(1 + \pi_{t,1} - \pi_{t,2})^2}\, dW_t$$
$$= \big[\alpha_t\Delta_2 + \tanh(Z_t/2)\big]C_t\, dt + \sqrt{\eta\,\sigma_t^2\,C_t}\, dW_t\,,$$
with $(\alpha_t)$ and $(\sigma_t)$ defined by
$$\alpha_t = \frac{\pi_{t,2}(1 + \pi_{t,1} - \pi_{t,2})}{\pi_{t,3+}(1 - \pi_{t,3+})}\,, \qquad \sigma_t^2 = 1 + \frac{4\pi_{t,1}\pi_{t,2}}{\pi_{t,3+}(1 - \pi_{t,3+})} - (1 - \pi_{t,3+})\Big(\frac{\pi_{t,1} - \pi_{t,2}}{\pi_{t,1} + \pi_{t,2}}\Big)^2\,.$$
Since $\pi_{t,2} \le \pi_{t,1} + \pi_{t,2} = 1 - \pi_{t,3+}$,
$$\alpha_t = \frac{\pi_{t,2}(2\pi_{t,1} + \pi_{t,3+})}{\pi_{t,3+}(1 - \pi_{t,3+})} \le 1 + 2G_t\,.$$
Moreover, $\frac{\pi_{t,1} - \pi_{t,2}}{\pi_{t,1} + \pi_{t,2}} = \tanh(Z_t/2)$, so that $1 - \tanh(Z_t/2)^2 \le \sigma_t^2 \le 1 + 4G_t$. ∎

5 Discussion

Continuous time vs discrete time. When the learning rate is small, the continuous-time and discrete-time processes are extremely similar. Even for larger learning rates, the noise terms contract asymptotically in both cases, suggesting that the diffusion approximation may be good for any learning rate once the algorithm reaches the asymptotic regime. We believe our proof technique for the upper bound is likely to translate to the discrete-time setting, probably with only a little tedium. Proving the lower bound in discrete time is probably more challenging and arguably less worthwhile.

Refining the lower bound. Our upper bound is $\tilde O(k/\Delta_2^2)$. The lower bound shows that this is not in general improvable when $k$ is logarithmic. An interesting direction is to extend the lower bound construction to introduce a linear dependence on $k$.

$k$-dependence. baudry2025does constructed a lower bound showing that the learning rate needs to be $O(1/k)$ and speculated that the correct learning rate might be around $\Delta_2/k$.
Our upper bound holds without assuming that $\eta$ is smaller than $O(1/k)$, suggesting that this is a transient requirement on which the continuous-time and discrete-time setups are not comparable. Moreover, our lower bound shows that in certain cases the learning rate needs to be $O(\Delta_2^2)$ in order to achieve sublinear regret.

Logarithmic factors. The upper bound in Theorem 6 relies on the assumption that $\eta \le c\Delta_2^2/\log(n)$. One may wonder whether the logarithmic term in the denominator can be dropped. The argument in Lemma 8 is quite crude, so there may be room for improvement there.

References

Appendix A Technical inequalities

Proposition 15. Suppose that $X_0 = 0$ and $dX_t = a\, dt + dB_t$ with $a > 0$. Then
$$\mathbb{P}\Big(\inf_{t \ge 0} X_t \le -\epsilon\Big) \le \exp(-2a\epsilon)\,.$$
Proof. Use the fact that $M_t = \exp(-2aX_t)$ is a martingale and Doob's maximal inequality. ∎

Proposition 16. Suppose that $X_0 = 0$ and $dX_t = \frac{a}{\exp(X_t) + 1}\, dt + dB_t$. Then
$$\mathbb{P}\big(\inf\{X_t : 0 \le t \le n\} \le -\epsilon\big) \le \big(1 + \sqrt{n/2}\big)\exp\Big(-\frac{2a\epsilon}{e + 1}\Big)\,.$$
In particular, if $a \ge \frac{e+1}{2\epsilon}\log\big((1 + \sqrt{n/2})/\delta\big)$, then $\mathbb{P}\big(\inf\{X_t : 0 \le t \le n\} \le -\epsilon\big) \le \delta$.

Proof. Obviously the drift is positive everywhere, and when $X_t \in (-\infty, 1]$ the drift term is at least $a/(e + 1)$. Hence, letting $\tau = \inf\{t : X_t \in \{-\epsilon, 1\}\}$, it holds that $\mathbb{P}(X_\tau = -\epsilon) \le \exp(-2a\epsilon/(e + 1))$. Moreover, since $X_t$ has positive drift everywhere, the number of down-crossings $D_n$ of the interval $[0, 1]$ is bounded in expectation by the number of down-crossings of a Brownian motion over the same time interval, which by Doob's down-crossing lemma is at most $\mathbb{E}[D_n] \le \mathbb{E}[\max(0, B_n)] \le \sqrt{n/2}$. Therefore $\mathbb{P}\big(\inf\{X_t : 0 \le t \le n\} \le -\epsilon\big) \le (1 + \sqrt{n/2})\exp(-2a\epsilon/(e + 1))$. ∎

Appendix B Proof of Lemma 2

Let $M_t = \exp(-\theta_{t,a})$.
By Lemma 1 there exists a $b$ such that $\theta_{t,b} \ge 0$, and hence $M_t \le \exp(\theta_{t,b} - \theta_{t,a}) = \pi_{t,b}/\pi_{t,a} \le 1/\pi_{t,a}$. Therefore, by Itô's formula,
$$dM_t = M_t\Big[\eta\pi_{t,a}(\bar\mu_t - \mu_a) + \frac{\eta^2\pi_{t,a}(1 - \pi_{t,a})}{2}\Big]\, dt + \eta M_t\sqrt{\pi_{t,a}(1 - \pi_{t,a})}\, dW_t$$
$$\le \Big[\eta + \frac{\eta^2}{2}\Big]\, dt + \eta M_t\sqrt{\pi_{t,a}(1 - \pi_{t,a})}\, dW_t\,.$$
Therefore, since $M_0 = 1$, by Markov's inequality,
$$\mathbb{P}\Big(\sup_{t \le n} M_t \ge \frac{1}{\delta} + \frac{n}{\delta}\Big[\eta + \frac{\eta^2}{2}\Big]\Big) \le \delta\,,$$
which, after naively simplifying using $\eta \le c$, implies that with probability at least $1 - \delta$, $\theta_{t,a} \ge -\log(n/\delta)$.

Appendix C Proof of Lemma 3

Suppose that $a \le 0 \le b$, let $K = \{x \in [a,b]^m : \sum_{i=1}^m x_i = 0\}$, and let $F$ be the set of measurable functions from $[0, m] \to [a, b]$ with $\int_0^m f(x)\, dx = 0$. Then
$$\max_{x \in K} \|\exp(x)\|_1 \le \max_{f \in F}\int_0^m \exp(f(x))\, dx = m\Big[\frac{b}{b - a}\exp(a) - \frac{a}{b - a}\exp(b)\Big]\,.$$
To reduce clutter we drop the time indices. Hence, by the previous display with $b = \theta_1 + 1$ and $a = -\log(n/\delta)$,
$$\|\exp(\theta)\|_1 \le k\Big[\frac{\theta_1 + 1}{\theta_1 + 1 + \log(n/\delta)}\Big(\frac{\delta}{n}\Big) + \frac{\log(n/\delta)}{\theta_1 + 1 + \log(n/\delta)}\exp(\theta_1 + 1)\Big] \le \frac{k\delta}{n} + \frac{k\log(n/\delta)}{\theta_1 + 1 + \log(n/\delta)}\exp(\theta_1 + 1)\,.$$
Since $\theta_1 \ge \theta_a - 1$ and $\sum_{a=1}^k \theta_a = 0$, it follows that $\theta_1 \ge -1$. Therefore
$$\frac{1}{\pi_1} = \|\exp(\theta)\|_1\exp(-\theta_1) \le e\Big[\frac{k\delta}{n} + \frac{k\log(n/\delta)}{\theta_1 + 1 + \log(n/\delta)}\Big]\,.$$
The result now follows by crudely simplifying the constants.

Appendix D Proof of Lemma 7

Let $dZ_{t,a} = b_t\, dt + \langle \sigma_t, dB_t\rangle$. By definition, $dZ_{t,a} = \langle e_1 - e_a, d\theta_t\rangle$. Substituting the definition of $d\theta_t$, the drift is $\eta$ times
$$\langle e_1 - e_a, (\mathrm{diag}(\pi_t) - \pi_t\pi_t^\top)\mu\rangle = \pi_{t,1}(\mu_1 - \bar\mu_t) - \pi_{t,a}(\mu_a - \bar\mu_t) = \pi_{t,1}R_t - \pi_{t,a}(R_t - \Delta_a) = \pi_{t,a}\Delta_a + (\pi_{t,1} - \pi_{t,a})R_t\,.$$
Similarly, the diffusion coefficient is
$$\eta\,\|e_1 - e_a\|_{\operatorname{diag}(\pi_t) - \pi_t\pi_t^\top} = \eta\sqrt{\pi_{t,1} + \pi_{t,a} - (\pi_{t,1} - \pi_{t,a})^2}\,.$$

Appendix E: Proof of Proposition 14

The analysis of $(S_t)$ is straightforward. We have
$$dS_t(s) \le ds + \sqrt{\eta}\,\pi_{t(s),3}\,dW_s\,.$$
Hence $S_t(s) \le s + \sqrt{\eta}\,W_s$ and by the maximal inequality, with probability at least $1-\delta$, for all $t$ such that $s(t) \le s_{\max}$,
$$S_t(s) \le s + \sqrt{2\eta\, s_{\max}\log(1/\delta)} < s + 1\,.$$
Similarly, $dS_t(s) \ge (1-\Delta_2)\,ds + \sqrt{\eta}\,\pi_{t(s),3}\,dW_s$ and with probability at least $1-\delta$, for all $t$ such that $s(t) \le s_{\max}$, $S_t(s) > (1-\Delta_2)s - 1$.

The last two parts involve $(Z_t)$ and are more challenging.

Remark 17 (method of integrating factors). Remember that the linear SDE with $X_0 = 0$ and $dX_t = [A_t + B_tX_t]\,dt + C_t\,dW_t$ has the solution
$$X_t = e^{\beta(t)}\Big[\int_0^t e^{-\beta(u)}A_u\,du + \int_0^t e^{-\beta(u)}C_u\,dW_u\Big] \quad\text{with}\quad \beta(t) = \int_0^t B_u\,du\,.$$

Recall the identities $\sinh'(x) = \cosh(x)$, $\cosh'(x) = \sinh(x)$ and $\cosh(x) = \sqrt{1 + \sinh^2(x)}$. Let $U_t = \sinh(Z_t/2)$. By Itô's formula,
$$dU_t \le \frac{C_t}{2}\Big[\Delta_2(1+2G_t)\sqrt{1+U_t^2} + U_t + \frac{\eta(1+4G_t)U_t}{4}\Big]dt + \frac{1}{2}\sqrt{\eta\,\sigma_t^2(1+U_t^2)\,C_t}\,dW_t\,.$$
Let us start by upper bounding the drift term. Suppose that $t \le \tau$. Then $U_t \le \sqrt{\eta}$.
Hence,
$$\mathrm{Drift}_t = \frac{C_t}{2}\Big[\Delta_2(1+2G_t)\sqrt{1+U_t^2} + U_t + \frac{\eta(1+4G_t)U_t}{4}\Big]$$
$$\overset{(a)}{\le} \frac{C_t}{2}\Big[\Delta_2(1+2G_t)(1+\sqrt{2}) + U_t\big(1 - \Delta_2(1+2G_t)\mathbb{1}_{U_t<0}\big) + \frac{\eta(1+4G_t)}{4}\Big]$$
$$\overset{(b)}{\le} \frac{C_t}{2}\Big[6(\Delta_2+\eta)(1+G_t) + U_t\big(1 - \Delta_2(1+2G_t)\mathbb{1}_{U_t<0}\big)\Big]$$
$$\overset{(c)}{\le} \frac{C_t}{2}\Big[6(\Delta_2+\eta)(1+G_t) + U_t\Big(1 - \Delta_2\Big(1 + \frac{400}{m\sqrt{\eta}}\Big)\mathbb{1}_{U_t<0}\Big)\Big]$$
$$\overset{(d)}{\le} C_t\Big[3\big(\Delta_2 + \eta + \epsilon\sqrt{\eta}\big)(1+G_t) + \frac{U_t(1-\epsilon)}{2}\Big]$$
$$\overset{(e)}{\le} C_t\Big[\frac{\sqrt{\eta}}{96}(1+G_t) + \frac{U_t(1-\epsilon)}{2}\Big]\,,$$
where (a) follows from the inequality $\sqrt{1+u^2} \le 1 + \sqrt{2} - u\,\mathbb{1}_{u<0}$ for $u \le 1$; (b) by direct simplification; (c) by the bound on $G_t$ below (Eq. 13); (d) since $\sqrt{\eta} \ge C\Delta_2$ and $\Delta_2 \le 1/m$ and by the definition of $\epsilon = 2/m$; and (e) follows from the assumptions on $\eta$ and $m$. Hence, by a comparison, there exists a coupling between $(U_t)$ and another process $(V_s)$ such that $U_t \le V_{s(t)}$, where $V_0 = 0$ and
$$dV_s = \Big[3(\Delta_2+\eta)(1+G_{t(s)}) + \frac{\epsilon\sqrt{\eta}}{2} + \frac{V_s(1-\epsilon)}{2}\Big]ds + \frac{\sqrt{\eta}}{2}\gamma_s\,dW_s\,,$$
where $1 \le \gamma_s^2 \le (1+4G_{t(s)})(1+V_s^2)$.

Exact solution. Recall now that
$$G_t = \frac{1}{m}\exp\Big(\frac{S_t+Z_t}{2} + \frac{S_t}{m}\Big) \le \frac{8}{m}\exp(s(t)/2)\,, \tag{12}$$
where we used Definition 12 to argue that for $t \le \tau$, $S_t \le s(t)+1$, together with the assumption that $(s_{\max}+1)/m \le 1$. Alternatively, $G_t$ can be bounded in another way for $t \le \tau$ by
$$G_t = \frac{1}{m}\exp\Big(\frac{S_t+Z_t}{2} + \frac{S_t}{m}\Big) \le \frac{200}{m\sqrt{\eta}}\,, \tag{13}$$
which again follows from Definition 12 and Eq. 8 and the assumption that $(s_{\max}+1)/m \le 1$.
By the method of integrating factors (Remark 17),
$$V_s = e^{(1-\epsilon)s/2}\Big[\frac{\sqrt{\eta}}{96}\int_0^s e^{-(1-\epsilon)u/2}(1+G_{t(u)})\,du + \sqrt{\eta}\,W_{[M]_s}\Big]\,, \tag{14}$$
where $[M]_s$ is the quadratic variation of the martingale
$$M_s = \frac{1}{2}\int_0^s e^{-(1-\epsilon)u/2}\gamma_u\,dW_u\,.$$
Let $\kappa = \inf\{s : V_s = 1 \text{ or } [M]_s = 3/4\}$.

Constant probability event. By definition, $[M]_s \le 3/4$ for $s \le \kappa$. Consider the event $E$ that $W_u + u \in [-1/8, 1/8]$ for all $u \le 3/2$. An elementary argument shows that $\mathbb{P}(E) > c$. For the remainder we show that the following hold on $E$ for all $s \le s_{\max}$: (a) $[M]_s < 3/4$; (b) $V_s < 1$; (c) $V_s \le -\frac{\sqrt{\eta}}{40}e^{(1-\epsilon)s/2}$. These imply that $\kappa = s_{\max}$ and in particular, by our comparison to $(U_t)$, the claim holds.

Bounding the quadratic variation. Note that by definition, if $s \le \kappa$, then $V_s < 1$ and by Eq. 14,
$$V_s^2 \le \max\big(1,\, 4\eta\, e^{(1-\epsilon)s}\big)\,, \tag{15}$$
where we used the fact that when $s \le \kappa$, then $[M]_s \le 3/4$ and hence $W_{[M]_s} \ge -2$. By definition the quadratic variation is
$$[M]_s = \frac{1}{4}\int_0^s e^{-(1-\epsilon)u}\gamma_u^2\,du\,. \tag{16}$$
Since all the terms are positive and $\gamma_u \ge 1$, it immediately follows that $[M]_s \ge (1-e^{-s})/4$. Upper bounding is more delicate. Since $\gamma_u^2 \le (1+4G_{t(u)})(1+V_u^2)$,
$$[M]_s \overset{(a)}{\le} \frac{1}{4}\int_0^s e^{-(1-\epsilon)u}(1+4G_{t(u)})(1+V_u^2)\,du$$
$$\overset{(b)}{\le} \frac{1}{4}\int_0^s e^{-(1-\epsilon)u}(1+4G_{t(u)})\big(2+4\eta\, e^{(1-\epsilon)u}\big)\,du$$
$$\overset{(c)}{\le} \frac{1}{4}\int_0^s e^{-(1-\epsilon)u}\Big(1+\frac{800}{m}\min\Big(\frac{1}{\sqrt{\eta}},\, e^{u/2}\Big)\Big)\big(2+4\eta\, e^{(1-\epsilon)u}\big)\,du$$
$$\overset{(d)}{\le} \frac{1-e^{-(1-\epsilon)s}}{2(1-\epsilon)} + \frac{1}{8} \overset{(e)}{\le} \frac{3}{4}\,.$$
Here (a) follows from Eq. 16, (b) from Eq. 15, (c) from Eq. 12 and Eq. 13, (d) by computing the integral and naive bounding, and (e) is immediate. Therefore $(1-e^{-s})/4 \le [M]_s \le 3/4$ for all $s \le s_{\max}$.

Bounding the SDE. Our exact solution is
$$V_s = e^{(1-\epsilon)s/2}\Big[\frac{\sqrt{\eta}}{96}\int_0^s e^{-(1-\epsilon)u/2}(1+G_{t(u)})\,du + \sqrt{\eta}\,W_{[M]_s}\Big]$$
$$\overset{(a)}{\le} e^{(1-\epsilon)s/2}\Big[\frac{\sqrt{\eta}}{96}\int_0^s e^{-(1-\epsilon)u/2}\Big(1 + \frac{8}{m}e^{u/2}\Big)du + \sqrt{\eta}\,W_{[M]_s}\Big]$$
$$\overset{(b)}{\le} e^{(1-\epsilon)s/2}\Big[\frac{\sqrt{\eta}}{96}\int_0^s e^{-u/2 + \epsilon s_{\max}/2}\Big(1 + \frac{8}{m}e^{u/2}\Big)du + \sqrt{\eta}\,W_{[M]_s}\Big]$$
$$\overset{(c)}{\le} e^{(1-\epsilon)s/2}\Big[\frac{\sqrt{\eta}}{48}\int_0^s e^{-u/2}\Big(1 + \frac{8}{m}e^{u/2}\Big)du + \sqrt{\eta}\,W_{[M]_s}\Big]$$
$$\overset{(d)}{\le} e^{(1-\epsilon)s/2}\Big[\frac{\sqrt{\eta}}{48}\Big(2 + \frac{8 s_{\max}}{m}\Big) + \sqrt{\eta}\,W_{[M]_s}\Big]$$
$$\overset{(e)}{\le} e^{(1-\epsilon)s/2}\Big[\frac{\sqrt{\eta}}{16} + \sqrt{\eta}\,W_{[M]_s}\Big]$$
$$\overset{(f)}{\le} e^{(1-\epsilon)s/2}\Big[\frac{\sqrt{\eta}}{16} + \sqrt{\eta}\Big(\frac{e^{-s}}{4} - \frac{1}{8}\Big)\Big]$$
$$\overset{(g)}{<} e^{(1-\epsilon)s/2}\,\sqrt{\eta}\Big(\frac{e^{-s}}{4} - \frac{1}{16}\Big)\,,$$
where (a) follows from Eq. 12; (b) since $s \le s_{\max}$; (c) since $\exp(\epsilon s_{\max}/2) \le 2$; (d) by crudely bounding the integral; (e) since $m \ge 8s_{\max}$; (f) since $W_u \le 1/8 - u$ and $[M]_s \ge (1-e^{-s})/4$; and (g) since $\eta < c$ and $\sqrt{\eta} \ge C\Delta_2$. Hence, with constant probability, whenever $t \le \tau$, then $S_t(s) \in ((1-\Delta_2)s - 1,\, s+1)$ and
$$Z_t = 2\operatorname{arcsinh}(U_t) \le 2\operatorname{arcsinh}(V_{s(t)}) < 2\operatorname{arcsinh}\Big(e^{(1-\epsilon)s/2}\sqrt{\eta}\Big(\frac{e^{-s}}{4} - \frac{1}{16}\Big)\Big)\,,$$
which completes the proof.
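The integrating-factor formula of Remark 17, which drives the exact solution in Eq. 14, can be checked numerically. The sketch below is illustrative only: it uses constant coefficients $A$, $B$, $C$ (a simplification; the remark allows time-varying ones, and all values are arbitrary) and compares an Euler–Maruyama simulation against the closed form driven by the same Brownian increments.

```python
import numpy as np

# Numeric check of the integrating-factor formula in Remark 17 for the linear
# SDE dX_t = (A + B*X_t) dt + C dW_t with X_0 = 0. Constant coefficients are a
# simplifying assumption and all values below are arbitrary illustration choices.

rng = np.random.default_rng(1)
A, B, C = 0.5, -1.0, 0.3
T, n = 1.0, 50_000
dt = T / n
dW = np.sqrt(dt) * rng.standard_normal(n)

# Euler-Maruyama simulation of the SDE.
x = 0.0
for k in range(n):
    x += (A + B * x) * dt + C * dW[k]

# Closed form X_T = e^{B T} [ int_0^T e^{-B u} A du + int_0^T e^{-B u} C dW_u ],
# discretised with the same Brownian increments (left-endpoint sums).
u = np.arange(n) * dt
x_closed = np.exp(B * T) * (A * np.sum(np.exp(-B * u)) * dt
                            + np.sum(np.exp(-B * u) * C * dW))

# Both approximate the same path, so the gap shrinks with the step size.
assert abs(x - x_closed) < 1e-2
```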
Appendix F: Experiments

We collect here experiments on the lower bound construction of Theorem 10, but using discrete-time policy gradient as outlined in Algorithm 1.

Figure 1: The plot shows 40 trajectories of $\pi_{t,1}$ produced by Algorithm 1 on the instance of Theorem 10 for 6 different learning rates with $\Delta_2 = 0.002$.

Figure 2: The figure shows the results of the same experiment as in Fig. 1 but with $k = 3$, suggesting that the logarithmically many arms used in the lower bound are needed.
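For reference, a minimal discrete-time softmax policy-gradient bandit in the spirit of these experiments can be sketched as follows. This is an illustration only: the Gaussian rewards, the means, the horizon and the absence of a baseline are assumptions, and the exact update in Algorithm 1 may differ.

```python
import numpy as np

# Minimal sketch of discrete-time softmax policy gradient on a k-armed
# Gaussian bandit (illustrative; not the paper's Algorithm 1 verbatim).
# theta is updated by the REINFORCE gradient r * (e_a - pi).

def run(eta, mu, n=10_000, seed=0):
    rng = np.random.default_rng(seed)
    k = len(mu)
    theta = np.zeros(k)
    for _ in range(n):
        pi = np.exp(theta - theta.max())   # stable softmax
        pi /= pi.sum()
        a = rng.choice(k, p=pi)            # sample an arm
        r = mu[a] + rng.standard_normal()  # noisy reward
        grad = -r * pi                     # r * (e_a - pi)
        grad[a] += r
        theta += eta * grad
    pi = np.exp(theta - theta.max())
    return pi / pi.sum()

mu = np.array([0.5, 0.0, 0.0])     # hypothetical means: arm 1 optimal
pi_small = run(eta=0.01, mu=mu)    # small learning rate: mass moves to arm 1
```

Sweeping `eta` upward should qualitatively reproduce the unstable trajectories visible in Figure 1, while small learning rates concentrate $\pi_{t,1}$ on the optimal arm.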