← Back to papers

Paper deep dive

Post-Training with Policy Gradients: Optimality and the Base Model Barrier

Alireza Mousavi-Hosseini, Murat A. Erdogdu

Year: 2026 · Venue: arXiv preprint · Area: stat.ML · Type: Preprint · Embeddings: 195

Abstract

We study post-training linear autoregressive models with outcome and process rewards. Given a context $\boldsymbol{x}$, the model must predict the response $\boldsymbol{y} \in \mathcal{Y}^N$, a sequence of length $N$ that satisfies a $\gamma$ margin condition, an extension of the standard separability to sequences. We prove that on test samples where the base model achieves a non-trivial likelihood $\alpha$, a variant of policy gradient (PG) can achieve likelihood $1 - \varepsilon$ with an essentially minimax optimal number of reward queries $\tilde{O}((\alpha^{-1} + \varepsilon^{-1})/\gamma^2)$. However, a barrier arises for going beyond the support of the base model. We prove that the overall expected error after post-training with outcome rewards is governed by a property of the base model called the Likelihood Quantile (LQ), and that variants of PG, while minimax optimal, may require a number of reward queries exponential in $N$ to go beyond this support, regardless of the pre-training algorithm. To overcome this barrier, we study post-training with a process reward model, and demonstrate how PG variants in this setting avoid the curse of dimensionality in $N$ via dependence on a token-level LQ. Along the way, we prove that under the margin condition, SGD with adaptive learning rate (LR) achieves a near optimal test error for statistical learning, and PG with adaptive LR achieves a near optimal number of mistakes for online learning while being computationally efficient whenever possible, both of which may be of independent interest.

Tags

ai-safety (imported, 100%) · preprint (suggested, 88%) · statml (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/13/2026, 12:29:14 AM

Summary

The paper investigates the theoretical foundations of post-training autoregressive models using policy gradients (PG) with outcome and process rewards. It identifies a 'base model barrier': with outcome rewards, PG may require exponentially many reward queries to move beyond the support of the base model, whereas process reward models (PRMs) can overcome this by leveraging token-level likelihood quantiles. The authors prove minimax optimality for PG variants and show that adaptive learning rates give SGD and PG near-optimal performance in the statistical and online learning settings, respectively.

Entities (5)

Policy Gradient · algorithm · 98%
SGD · algorithm · 98%
Outcome Reward Model · reward-mechanism · 95%
Process Reward Model · reward-mechanism · 95%
Likelihood Quantile · metric · 92%

Relation Signals (3)

SGD achieves Near Optimal Test Error

confidence 95% · SGD with adaptive learning rate (LR) achieves a near optimal test error for statistical learning

Process Reward Model alleviates Base Model Barrier

confidence 90% · demonstrate how PG variants in this setting avoid the curse of dimensionality in N

Policy Gradient uses Outcome Reward Model

confidence 90% · post-training with outcome rewards is governed by a property of the base model

Cypher Suggestions (2)

Find all algorithms and their associated reward mechanisms · confidence 90% · unvalidated

MATCH (a:Algorithm)-[:USES]->(r:RewardMechanism) RETURN a.name, r.name

Identify constraints related to specific algorithms · confidence 85% · unvalidated

MATCH (a:Algorithm)-[:SUBJECT_TO]->(c:Constraint) RETURN a.name, c.name

Full Text

194,921 characters extracted from source content.


Post-Training with Policy Gradients: Optimality and the Base Model Barrier

Alireza Mousavi-Hosseini¹, Murat A. Erdogdu²

Abstract. We study post-training linear autoregressive models with outcome and process rewards. Given a context $x$, the model must predict the response $y \in \mathcal{Y}^N$, a sequence of length $N$ that satisfies a $\gamma$ margin condition, an extension of the standard separability to sequences. We prove that on test samples where the base model achieves a non-trivial likelihood $\alpha$, a variant of policy gradient (PG) can achieve likelihood $1 - \varepsilon$ with an essentially minimax optimal number of reward queries $\tilde{O}((\alpha^{-1} + \varepsilon^{-1})/\gamma^2)$. However, a barrier arises for going beyond the support of the base model. We prove that the overall expected error after post-training with outcome rewards is governed by a property of the base model called the Likelihood Quantile (LQ), and that variants of PG, while minimax optimal, may require a number of reward queries exponential in $N$ to go beyond this support, regardless of the pre-training algorithm. To overcome this barrier, we study post-training with a process reward model, and demonstrate how PG variants in this setting avoid the curse of dimensionality in $N$ via dependence on a token-level LQ. Along the way, we prove that under the margin condition, SGD with adaptive learning rate (LR) achieves a near optimal test error for statistical learning, and PG with adaptive LR achieves a near optimal number of mistakes for online learning while being computationally efficient whenever possible, both of which may be of independent interest.

¹ University of Toronto, Vector Institute. mousavi@cs.toronto.edu. ² University of Toronto, Vector Institute, and Numurho. erdogdu@cs.toronto.edu.

1 Introduction

[Figure 1: The evolution of the model's likelihood of generating the correct response over different contexts throughout PG. (Left) The average likelihood over samples with initial likelihood $\approx 0$ under the base model. On-policy PG with PRM is able to improve this average, while with ORM it stays at 0. (Center) The comparison between the expected test error with ORM, where the error plateaus at a threshold, and with PRM, where the error continues to decrease. (Right) The likelihood of individual samples throughout PG with ORM, where the color denotes initial likelihood. Experiment details are presented in Section 6.]

Outcome-based reinforcement learning (RL) has been a promising approach to enable large language models (LLMs) to go beyond their pre-training data. RL has been particularly successful in domains such as math and coding, where one is able to verify the correctness of the model's response without requiring a complete ground-truth solution [9, 27]. However, the extent to which RL can create new knowledge that is missing from the base model is unclear. While some works argue that RL with outcome rewards can genuinely enhance the base model [32], others have observed that the effect of RL post-training has mostly been to sharpen the base distribution, and that the fine-tuned model is not able to generate responses outside the support of the base model [34, 33, 21]. Indeed, explicitly sampling from the sharpened base distribution can match or outperform RL [12], which suggests that RL cannot successfully improve upon the base model unless it is already highly capable.
On the theoretical side, prior works have attempted to characterize the benefits of RL post-training compared to supervised fine-tuning (SFT) [23] or on top of pre-training [28, 14]. While providing positive results for RL, these works leave out the role of the base model. FMR [8] demonstrates that a notion of coverage is needed for computationally efficient exploration with RL; however, their definition does not depend on the final desired likelihood, i.e. the probability that the model generates a correct response to a context. CHG+ [3] studies the coverage profile, which characterizes the probability of success with best-of-$N$ sampling, and considers how log-likelihood maximization improves this coverage. However, the consequences of this definition for RL post-training, which is used to achieve the best-of-$N$ performance using a single generation, are less clear. CJRX [4] studies outcome-based online RL with a focus on sample complexity, while the computational efficiency of their algorithm is unclear. Importantly, the works above leave open the study of practical algorithms for post-training, specifically PG variants such as proximal policy optimization (PPO) [25] and group relative policy optimization (GRPO) [26]. In this work, we aim to answer the following question in a theoretically tractable setting:

Q1. How do the number of reward queries and policy gradient steps in post-training depend on the base model for on- and off-support samples?

We answer the above question by contrasting the on- and off-support cases. For sequences of length $N$, the probability that the uniform policy generates a correct response is exponentially small in $N$. If the base policy from pre-training achieves a probability of correct response (likelihood) that is non-trivial, i.e. at most polynomially small in $N$, we consider this sample to be on the base model support, and the complexities above remain polynomial. However, if the base policy is not significantly better than a uniform distribution on a sample, then the sample is off-support and improving its likelihood may require exponentially many iterations. This phenomenon is demonstrated in Figure 1. While this gives us a per-sample answer to the above question, a natural next step is to investigate the expected error over the entire distribution:

Q2. Can RL post-training achieve a significantly smaller expected test error compared to the base model while being computationally efficient?

We answer the above question for outcome and process rewards. For outcome reward models, we show that in the worst case, any post-training algorithm may require exponentially many reward queries to significantly improve upon the base model obtained by SGD. However, this exponential dependence on the sequence length can be alleviated by using a process reward model, where after generating each token, the learner can query the reward to verify the correctness of the response so far.

1.1 Our Contributions

Our main contributions are as follows:

• In Section 3.1, we study post-training with outcome rewards and access to a base model $q$, with the goal of predicting a response $y^*(x) = (y_1^*(x), \ldots, y_N^*(x))$, where $y_i \in \{1, \ldots, k\}$, to an input context $x$, which satisfies a $\gamma$ margin assumption. We show that conditioned on the base model achieving a non-trivial likelihood $\alpha$ on a sample $(x, y^*(x))$, a variant of policy gradient can boost this likelihood and get expected error $\varepsilon$ with $\tilde{O}(1/(\alpha\gamma^2\varepsilon))$ iterations. As a consequence, we show that PG with a uniform behavior policy can achieve the minimax optimal mistake bound $\tilde{O}(k^N/\gamma^2)$ in online learning.

• In Section 3.2, we show that for the overall expected error to go below $\varepsilon$, the number of reward queries depends on a property of the base model, which we refer to as the Likelihood Quantile (LQ). This creates a fundamental barrier of the base model: if the base model is pre-trained with SGD, PG requires exponentially many reward queries to go below the SGD error rate.

• In Section 4, we demonstrate how using process rewards instead of outcome rewards can alleviate the above problem, with the number of reward queries instead depending linearly on $N$ and another property of the base model, called the Token-Level Likelihood Quantile, which is linear in $k$ in the worst case.

• In Section 5, we establish the minimax optimality of the policy gradient variants explored in Sections 3 and 4. We also rule out the possibility of significant algorithmic improvements beyond SGD in pre-training for a better LQ; the minimax optimal LQ is of order $k^{-N}$ unless enough labeled samples are given for an SGD pre-trained model to already have a very low test error of order $k^{-N}$.

1.2 Additional Related Works

Online Linear Classification with Bandit Feedback. The problem of post-training a policy with outcome rewards can be cast as an instance of a contextual bandit problem. For the case of $N = 1$, there is a rich literature on learning a linear multiclass classifier with bandit feedback. The seminal work of KSST [13] introduced the Banditron algorithm, a modification of the classical Perceptron algorithm, that only achieves a mistake bound of $\tilde{O}(k^2/\gamma^2)$. DH [6] shows that a mistake bound of $\tilde{O}(k/\gamma^2)$ is information-theoretically optimal, and presents an inefficient algorithm to achieve it. BPS+ [2] provides a computationally efficient algorithm that achieves the $\tilde{O}(k/\gamma^2)$ rate only under a stronger notion of margin. Others have considered more general data distributions and achieved a relative mistake bound of order $\sqrt{T}$; see e.g. vdHFCB [29] and references therein. In contrast, we introduce a simple online algorithm, a variant of PG with a uniform behavior policy, that achieves the minimax optimal mistake bound $\tilde{O}(k^N/\gamma^2)$, thus achieving the desirable $\tilde{O}(k/\gamma^2)$ when $N = 1$ while also being computationally efficient per iteration. Thus, we answer the open question of achieving such an efficient online learning algorithm raised in [2].

SGD for Separable Data. The study of SGD on the logistic loss for linearly separable data first gained attention through its connection to the max-margin SVM and its implicit bias, SHN+ [19], JT [11]. Extensions to multiclass settings have been considered in [17, 20]. Our analysis for both SGD and PG is most similar to Sha [18]. In contrast, we go significantly beyond the binary setting by considering autoregressive models and adapting the margin assumption to sequences of labels, where each label has $k \ge 2$ choices.

Policy Gradient Analysis. A comprehensive study of the theoretical properties of policy gradient methods in tabular settings was carried out by AKLM [1]. However, their guarantees for PG with softmax parameterization are only asymptotic. Further works have made this guarantee non-asymptotic and extended it beyond the tabular setting [35, 16, 36, 15].
We remark, however, that none of these results are directly applicable in our setting, since their generality prevents establishing tight bounds. We instead carry out an analysis tailored to the 0-1 reward structure of post-training an autoregressive model.

Notation. For quantities $a$ and $b$, we use $a \lesssim b$ or $a = O(b)$ to denote $a \le Cb$ for some absolute constant $C > 0$, and define $a \gtrsim b$ and $a = \Omega(b)$ similarly. We use $\tilde{O}$ and $\tilde{\Omega}$ to hide polylogarithmic factors. We use $f(\varepsilon) = o(1)$ to denote $f(\varepsilon) = O(1/\log(1/\varepsilon))$, a stricter version of the usual definition $\lim_{\varepsilon \downarrow 0} f(\varepsilon) = 0$. We use the notation $y_{1:i} = (y_j)_{j=1}^{i}$, and $y_{1:0}$ is the empty sequence. We use $\mathrm{Clip}(x, a, b) \coloneqq (x \vee a) \wedge b$ for $x, a, b \in \mathbb{R}$. For a set $\mathcal{Y}$, we define $\mathcal{Y}^+ \coloneqq \bigcup_{i=1}^{N} \mathcal{Y}^i$. We use $e_y$ to denote the $y$-th standard basis vector of $\mathbb{R}^k$.

2 Autoregressive Linear Models and Pre-Training

Given access to a verifier, post-training can be performed using outcome rewards. This approach yields a contextual bandit formulation: given a prompt $x \in \mathcal{X}$, the model $p_w(\cdot \mid x)$ generates a response $y = (y_1, \ldots, y_N) \in \mathcal{Y}^N$ of length $N$, and receives a reward $r(x, y)$ from an outcome reward model (ORM). For simplicity, we fix the length of the response, while our arguments can easily be adapted to dynamic lengths. We assume $|\mathcal{Y}| = k$. For many verifiers this reward is binary, i.e. $r(x, y) = 1$ if $y$ is a correct answer to $x$, and $r(x, y) = 0$ otherwise. There are also settings where a process reward model (PRM) can provide intermediate signal to the learner. We study this formulation in the subsequent section.

We specify the generative model $p_w(\cdot \mid x)$ using the decomposition $p_w(y \mid x) = \prod_{i=1}^{N} p_w(y_i \mid x, y_{1:i-1})$, where

$$p_w(y_i \mid x, y_{1:i-1}) = \frac{\exp(\langle w, \phi(x, y_{1:i-1}, y_i) \rangle)}{\sum_{y' \in \mathcal{Y}} \exp(\langle w, \phi(x, y_{1:i-1}, y') \rangle)}.$$

Here, $\phi : \mathcal{X} \times \mathcal{Y}^+ \to \mathbb{R}^D$ is the feature map. Current models typically implement $\phi$ using a deep Transformer [30]. The above formulation simplifies the analysis by focusing only on the final linear layer while freezing the feature map $\phi$. We assume that each prompt $x$ has a unique correct response $y^*(x)$ that satisfies the following assumption.

Assumption 1. Suppose there exists $w^* \in \mathbb{R}^D$ such that $\|w^*\|_2 = 1$ and for all $i \in [N]$ and $y \ne y_i^*(x)$,

$$\langle w^*, \phi(x, y^*_{1:i}(x)) \rangle \ge \langle w^*, \phi(x, y^*_{1:i-1}(x), y) \rangle + \gamma,$$

a.s. over $x$ for some $\gamma > 0$. We further assume $\|\phi(x, y^*_{1:i}(x))\|_2 \le 1$ for all $i \in [N]$ and a.s. over $x$.

When $N = 1$, letting $\phi(x, y) = \mathrm{Vec}(e_y x^\top)$ and $w = \mathrm{Vec}(W)$ with $W \in \mathbb{R}^{k \times d}$, the above reduces to a standard $\gamma$-margin assumption for $k$-class linear classification, and the autoregressive model reduces to the usual logistic model. Notice that the above only requires separability at the token level while assuming all previous tokens are correct, as opposed to sequence-level separability, which would be harder to satisfy. With Assumption 1, it might be tempting to train on only a single token to recover $w^*$, and thus remove the dependence on $N$ from the complexity of supervised training. However, such a direction may not be unique, and a direction that separates some tokens might not be useful for the rest.

Before moving to the analysis of post-training, we first recall results about pre-training. During pre-training, we have access to i.i.d. labeled examples $(x^{(t)}, y^*(x^{(t)}))_{t \ge 0}$, and we can run the SGD iterates

$$w_{t+1} = w_t + \eta_t \nabla \log p_{w_t}(y^*(x^{(t)}) \mid x^{(t)}), \tag{SGD}$$

where $\eta_t$ is a potentially adaptive learning rate. [3] demonstrated that with a constant LR, SGD suffers from a convergence rate that depends linearly on the sequence length $N$. They mitigate this issue by analyzing an Adagrad-type [7] adaptive LR on SGD with a sufficiently large mini-batch size. Under Assumption 1, we can study the simpler variant above, which operates in a one-pass mode over the pre-training data and does not require batching, and hence can also be applied in online settings.

Proposition 1. Suppose Assumption 1 holds, and let $(w_t)_{t=0}^{T-1}$ denote the iterates of SGD with learning rate schedule $(\eta_t)$. Then:

1. (Constant LR) If $\eta_t = 1/(2N)$, then
$$1 - \mathbb{E}\big[p_{\bar{w}_T^{\mathrm{SGD}}}(y^*(x) \mid x)\big] \lesssim \frac{N \log^2(Tk\gamma^2)}{\gamma^2 T},$$
where $\bar{w}_T^{\mathrm{SGD}} \coloneqq \frac{1}{T}\sum_{t=0}^{T-1} w_t$ is the averaged iterate.

2. (Adaptive LR) If $\eta_t = \big(2 + 4\|\nabla \log p_{w_t}(y^*(x_t) \mid x_t)\|_2\big)^{-1}$, then
$$1 - \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\big[p_{w_t}(y^*(x) \mid x)\big] \lesssim \frac{\log^2(TNk\gamma^2)}{\gamma^2 T}.$$
In other words, if we draw $\tau \sim \mathrm{Unif}(\{0, \ldots, T-1\})$, then $1 - \mathbb{E}\big[p_{w_\tau}(y^*(x) \mid x)\big] \lesssim \frac{\log^2(TNk\gamma^2)}{\gamma^2 T}$.

A few remarks are in order. First, it is well known that the VC dimension of the class of $\gamma$-margin half-spaces in dimension $D \ge 1/\gamma^2$ is at least $\Omega(1/\gamma^2)$ (which can be proved, e.g., by considering an orthonormal set of points). Therefore, by standard minimax lower bounds on sample complexity via the VC dimension in realizable settings [24, Theorem 6.8], we require at least $\Omega(1/(\gamma^2\varepsilon))$ samples to achieve a test error $\varepsilon$ in the worst case for $N = 1$ and $k = 2$, which can thus be extended to all $N$ and $k$. This lower bound can alternatively be derived using the simpler technique we employ in Theorem 10. Interestingly, while for $N = O(1)$ SGD with constant or adaptive LR (almost) achieves the minimax optimal sample complexity, for $N \gg 1$ only the adaptive LR does so, with nearly the same rate as the $N = 1$, $k = 2$ case. Moreover, while we state Proposition 1 and other results in the paper in a conventional statistical learning setup where data is i.i.d., the proofs are obtained by an online analysis and an online-to-batch conversion (see Appendix A for details). As a consequence, an online learner that samples responses $y^{(t)} \sim p_{w_t}(\cdot \mid x^{(t)})$ would make at most $\tilde{O}(1/\gamma^2)$ mistakes in expectation over online (potentially adversarially chosen) data, which nearly matches the classical Perceptron algorithm and is minimax optimal, while covering the broader setting of autoregressive linear models.
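To make the setup concrete, here is a minimal NumPy sketch of the softmax-linear autoregressive policy and the one-pass SGD step with the adaptive learning rate of Proposition 1. The toy feature map `phi` and the dimensions are placeholder assumptions for illustration, not the paper's construction.

```python
import numpy as np

k, D = 5, 40  # vocabulary size and feature dimension (toy values)
rng = np.random.default_rng(0)
PHI = rng.normal(size=(k, D)) / np.sqrt(D)  # hypothetical feature table

def phi(x, prefix, y):
    # Placeholder feature map phi(x, y_{1:i-1}, y); a real one would
    # depend on the context x and the prefix.
    return PHI[y]

def token_probs(w, x, prefix):
    """p_w(. | x, y_{1:i-1}): softmax over <w, phi(x, prefix, y)>, y in [k]."""
    logits = np.array([w @ phi(x, prefix, y) for y in range(k)])
    logits -= logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_lik(w, x, y_seq):
    """grad_w log p_w(y | x) = sum_i (phi(x, y_{<i}, y_i) - E_{Y_i ~ p_w}[phi])."""
    g = np.zeros_like(w)
    for i, yi in enumerate(y_seq):
        prefix = y_seq[:i]
        feats = np.stack([phi(x, prefix, y) for y in range(k)])
        g += feats[yi] - token_probs(w, x, prefix) @ feats
    return g

def sgd_adaptive_step(w, x, y_star):
    """One-pass SGD with the adaptive LR eta_t = (2 + 4||grad||)^(-1) of Proposition 1."""
    g = grad_log_lik(w, x, y_star)
    eta = 1.0 / (2.0 + 4.0 * np.linalg.norm(g))
    return w + eta * g

w = np.zeros(D)  # both pre- and post-training analyses initialize at zero
```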
3 Policy Gradient with Outcome Reward (Bandit Feedback)

For post-training, we only have access to i.i.d. prompts $(x_t)_{t \ge 0}$ as well as a reward model, but not the labels directly. In this section, we assume access to a reward model for the final response generated, i.e. $r : \mathcal{X} \times \mathcal{Y}^N \to \{0, 1\}$ where $r(x, y) = \mathbb{1}[y = y^*(x)]$. This stage is typically performed using some policy gradient algorithm such as PPO or GRPO. Here, we study the most basic policy gradient algorithm, also referred to as REINFORCE [31]. The PG update with outcome rewards is given by

$$w_{t+1} = w_t + \eta_t\, r_t\, \nabla \log p_{w_t}(y^{(t)} \mid x^{(t)}), \tag{PG-OR}$$

where $r_t \coloneqq r(x^{(t)}, y^{(t)})$, $y^{(t)} \sim q_t(\cdot \mid x^{(t)})$, and $q_t$ is the behavior policy at round $t$, which can be chosen adaptively using past samples. Unless otherwise stated, we assume $w_0 = 0$ for simplicity. While in practice the initialization is provided by the base model, in our analysis the benefit of the base model comes from the behavior policy, so we can simplify the analysis by initializing at zero. Two remarks are in order.

• For (PG-OR) to be policy gradient in the sense of following the population reward gradient, we have to introduce the importance weights $p_{w_t}(y^{(t)} \mid x^{(t)})/q_t(y^{(t)} \mid x^{(t)})$ unless we are fully on-policy, i.e. $q_t = p_{w_t}$. We defer the study of this fully on-policy PG to Section A.5, since our analysis for it yields a qualitatively similar behavior to the subsequent PG variants we study, yet with a suboptimal convergence rate. When $q_t \ne p_{w_t}$, practical algorithms such as PPO clip the importance weight to be close to 1. Our analysis can also accommodate clipped importance weights; in Section A.6 we study the updates $w_{t+1} = w_t + \eta_t r_t \rho_t \nabla \log p_{w_t}(y^{(t)} \mid x^{(t)})$, where $\rho_t = \mathrm{Clip}\big(p_{w_t}(y^{(t)} \mid x^{(t)})/q_t(y^{(t)} \mid x^{(t)}),\, 1/\zeta,\, \zeta\big)$ for some $\zeta \ge 1$. In summary, we can show that all convergence rates are scaled by a factor of $\zeta^2$. In the main text, we focus the presentation on the case where the importance weights are removed, i.e. $\zeta = 1$, as it optimizes the convergence rate and achieves minimax optimality, which suggests that following the (unbiased estimate of the) population reward gradient in this setting is not strictly necessary.

• Note that (PG-OR) uses a simple advantage estimator with no baseline. Once again, this is motivated by the minimax optimality of the updates: no advantage estimator achieves a better rate under Assumption 1 alone.

3.1 Conditional Convergence of PG

We begin with the following conditional guarantee. The proofs of all results in this section are stated in Appendix A.

Theorem 2 (PG-OR with Constant LR). Let $(w_t)$ denote the (PG-OR) iterates. Suppose Assumption 1 holds. For simplicity, assume $\eta = 1/(2N)$ and $\|w_0\|_2 \le \frac{1}{\gamma}\log(NkT\gamma^2)$. Let $x$ denote a test sample. For any $\alpha \in (0, 1]$, define the event $\mathcal{E}_\alpha \coloneqq \{\min_t q_t(y^*(x) \mid x) \ge \alpha\}$ and $\pi_\alpha \coloneqq \mathbb{P}(\mathcal{E}_\alpha)$. Then $\bar{w}_T^{\mathrm{PG}} = \frac{1}{T}\sum_{t=0}^{T-1} w_t$ satisfies

$$\mathbb{E}\big[p_{\bar{w}_T^{\mathrm{PG}}}(y^*(x) \mid x) \mid \mathcal{E}_\alpha\big] \ge 1 - \tilde{O}\Big(\frac{N}{\pi_\alpha \alpha \gamma^2 T}\Big). \tag{3.1}$$

The conditioning above highlights that the test performance on an individual sample after post-training depends on how likely the behavior policy $q_t$ is to generate the correct response for that sample. In the bound above, $\alpha$ denotes this likelihood. If we set $q_t = \mathrm{Unif}$ for all $t$, then $\pi_\alpha = 1$ for $\alpha = k^{-N}$ and $\pi_\alpha = 0$ for any $\alpha > k^{-N}$. The base model enters the above bound through the function $\alpha \mapsto \pi_\alpha$: the slower it decays, the faster the convergence will be for samples with an initial likelihood of at least $\alpha$. Similar to SGD, we can remove $N$ from the convergence rate above by using adaptive learning rates.

Theorem 3 (PG-OR with Adaptive LR). Consider the same setting as Theorem 2, except we use the adaptive learning rate $\eta_t = \big(4 + 2\|\nabla \log p_{w_t}(y^{(t)} \mid x^{(t)})\|_2\big)^{-1}$. Then, in expectation over the randomness in training, in the test prompt $x$, and in $\tau \sim \mathrm{Unif}(\{0, \ldots, T-1\})$, we have

$$\mathbb{E}\big[p_{w_\tau}(y^*(x) \mid x) \mid \mathcal{E}_\alpha\big] \ge 1 - \tilde{O}\Big(\frac{1}{\pi_\alpha \alpha \gamma^2 T}\Big), \tag{3.2}$$

where we recall $\mathcal{E}_\alpha \coloneqq \{\min_t q_t(y^*(x) \mid x) \ge \alpha\}$ and $\pi_\alpha \coloneqq \mathbb{P}(\mathcal{E}_\alpha)$ for any $\alpha \in (0, 1]$.

Remark (Online Learning Regret Bound). As discussed earlier, while we stated the guarantee of Theorem 3 for i.i.d. data, we can utilize the same proof technique for an arbitrary sequence $(x^{(t)}, y^*(x^{(t)}))$, where an adversary can choose $x^{(t)}$ and $y^*(x^{(t)})$ for all $t$, with the only constraint being Assumption 1 (see Proposition 19 and its proof). In particular, if $q_t$ is simply the uniform policy, we have $\pi_\alpha = 1$ for $\alpha = k^{-N}$, so we can remove the conditioning and write

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\big[p_{w_t}(y^*(x^{(t)}) \mid x^{(t)})\big] \ge 1 - \tilde{O}\Big(\frac{k^N}{\gamma^2 T}\Big).$$

Therefore, the online learning algorithm that draws the response $y^{(t)}$ to $x^{(t)}$ according to $p_{w_t}$ will make an expected number of mistakes $\tilde{O}(k^N/\gamma^2)$ over $T$ rounds. Indeed, this number of mistakes is minimax optimal (up to log terms), as established in Theorem 11. To our knowledge, this is the first online learning algorithm in this setting that is computationally efficient per iteration while nearly achieving the minimax optimal number of mistakes. Prior works achieved this number of expected mistakes under a stronger notion of separability and only for $N = 1$ [2].

3.2 Unconditional Convergence of PG

In the previous section, we conditioned the expected error on the initial likelihoods of test samples. Our focus now is to use those results to bound the expected test error over the entire distribution. For a policy $q$, we define the Likelihood Quantile (LQ) function $Q_q : [0, 1] \to [0, 1]$ as

$$Q_q(\varepsilon) \coloneqq \sup\{\alpha \in [0, 1] : \mathbb{P}_x(q(y^*(x) \mid x) \le \alpha) \le \varepsilon\}.$$

Specifically, $Q_q(\varepsilon)$ provides the $\varepsilon$-quantile of the random variable $q(y^*(x) \mid x)$, which is the likelihood of the ground-truth data $(x, y^*(x))$ under the base model $q$. As can be seen from Figure 2, the more concentrated around 1 the random variable $q(y^*(x) \mid x)$ is (corresponding to longer pre-training), the larger $Q_q(\varepsilon)$ will be for a fixed $\varepsilon$. With $\pi_\alpha \coloneqq \mathbb{P}_x(q(y^*(x) \mid x) \ge \alpha)$, the mapping $\alpha \mapsto 1 - \pi_\alpha$ is the CDF of the likelihood over test samples, and $Q_q$ is the generalized inverse of this function; it satisfies $Q_q(1 - \pi_\alpha) \ge \alpha$. Through this observation, LQ is connected to other notions studied in the literature; $1 - \pi_\alpha$ is referred to as the $\alpha$-coverage profile in CHG+ [3]. We refer to the references therein for other definitions in the literature.
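Since $Q_q(\varepsilon)$ is simply the $\varepsilon$-quantile of the scalar random variable $q(y^*(x) \mid x)$, it can be estimated directly from held-out samples. A minimal sketch, assuming we can score ground-truth responses under the base model; the function names are ours, and the token-level variant anticipates the definition of Section 4.

```python
import numpy as np

def lq(likelihoods, eps):
    """Empirical Likelihood Quantile: an estimate of
    sup{alpha : P(q(y*(x)|x) <= alpha) <= eps}, i.e. the eps-quantile
    of the ground-truth likelihood q(y*(x)|x) over test contexts."""
    return np.quantile(np.asarray(likelihoods), eps, method="lower")

def token_level_lq(token_likelihoods, eps):
    """Empirical token-level LQ (Section 4): the same quantile, but of
    min_i q(y*_i | x, y*_{<i}); input is an (n_contexts, N) array of
    per-token ground-truth likelihoods."""
    return np.quantile(np.asarray(token_likelihoods).min(axis=1), eps, method="lower")

# Example: 25% of these samples sit at or below likelihood 1e-12.
likes = np.array([1e-40, 1e-12, 0.02, 0.3, 0.7, 0.9, 0.95, 0.99])
print(lq(likes, eps=0.25))  # -> 1e-12
```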
We can examine lower bounds on $Q_q$ when $q = p_{\bar{w}_n^{\mathrm{SGD}}}$ is obtained by SGD pre-training with $T^{\mathrm{SGD}} = n$ samples:

• With the constant LR $\eta \asymp 1$: $Q_q(\varepsilon) \ge e^{-\tilde{O}(N/(n\gamma^2\varepsilon))}$.

• With the adaptive LR $\eta_t \asymp (1 + \|\nabla \log p_{w_t}\|_2)^{-1}$: $Q_q(\varepsilon) \ge \big(1 - \tilde{O}((n\gamma^2\varepsilon)^{-1})\big) \vee 0$.

Further, for the uninformative uniform policy $q$, we have $Q_q(\varepsilon) = k^{-N}$ for all $\varepsilon < 1$. We can now state unconditional bounds on the accuracy of the post-trained model.

Corollary 4 (Test Error of PG-OR). Suppose Assumption 1 holds. Given a base policy $q_0$, let $q = \frac{1}{2}q_0 + \frac{1}{2}\mathrm{Unif}$. Consider (PG-OR) with $q_t = q$ for all $t$ and with the adaptive learning rate of Theorem 3. Let $\tau \sim \mathrm{Unif}(\{0, \ldots, T-1\})$. Then, for all $\varepsilon \in (0, 1)$, with

$$T = \tilde{O}\Big(\frac{\min\big(Q_{q_0}((1 - o(1))\varepsilon)^{-1},\, k^N\big)}{\gamma^2 \varepsilon}\Big)$$

iterations, we have $\mathbb{E}[p_{w_\tau}(y^*(x) \mid x)] \ge 1 - \varepsilon$.

Remark. We distinguish between the reciprocal and the inverse in our notation: $Q_{q_0}(\varepsilon)^{-1} = 1/Q_{q_0}(\varepsilon)$, while $Q_{q_0}^{-1}(\cdot)$ denotes the inverse function of $Q_{q_0}(\cdot)$. By monotonicity of $Q_{q_0}$, we have $\varepsilon \ge 1 - \pi_{k^{-N}} = Q_{q_0}^{-1}(k^{-N})$ iff $Q_{q_0}(\varepsilon) \ge k^{-N}$. Therefore, we can express the above bound as

$$T = \begin{cases} \tilde{O}\Big(\dfrac{Q_{q_0}((1 - o(1))\varepsilon)^{-1}}{\gamma^2 \varepsilon}\Big) & \text{if } \varepsilon \ge Q_{q_0}^{-1}(k^{-N}), \\[4pt] \tilde{O}\Big(\dfrac{k^N}{\gamma^2 \varepsilon}\Big) & \text{otherwise.} \end{cases} \tag{3.3}$$

Algorithm 1 (Best-of-$m$ exploration for PG-OR). Input: $x^{(t)}$, $p_{w_t}$, $q_0$, $m$.
1. Draw $y \sim p_{w_t}(\cdot \mid x^{(t)})$ and set $j = 1$.
2. While $j \le m$ and $r(x^{(t)}, y) = 0$: set $j = j + 1$ and redraw $y \sim \big(\frac{1}{2}q_0 + \frac{1}{2}\mathrm{Unif}\big)(\cdot \mid x)$.
3. Return $y$.

To understand the base model barrier, consider the case where $q_0$ is given by SGD with adaptive LR on $n$ labeled samples. Then $Q_{q_0}^{-1}(k^{-N}) = \tilde{\Omega}(1/(n\gamma^2))$. This suggests that to go below expected error $\tilde{O}(1/(\gamma^2 n))$, which is already achieved by the SGD-trained base model, policy gradient will require exponentially many iterations. One may wonder if this is due to a suboptimal bound on $Q_{q_0}$ for SGD, or due to a fundamental suboptimality of SGD itself. In Section 5, we rule out both cases by showing a minimax bound of $Q_q(\varepsilon) \le O(k^{-N}/(1 - \gamma^2 n \varepsilon))$ among all algorithms that output $q$ given $n$ labeled i.i.d. samples. If $n \ll 1/(\gamma^2\varepsilon)$, the upper bound above inevitably depends on $k^N$. If $n \gg 1/(\gamma^2\varepsilon)$, the SGD pre-trained model already achieves error $\varepsilon$.

The barrier above is not specific to policy gradient and its variants. We demonstrate in Theorem 11 that (PG-OR) with the behavior policy described above is minimax optimal (up to log factors) among algorithms that have access to $q_0$ and make only one reward query per context. However, it is possible to improve both the sample/iteration complexity and the number of reward queries by considering a modification that makes multiple queries per round. Specifically, by choosing $q_t$ according to Algorithm 1, we have the following.

Corollary 5. Suppose Assumption 1 holds. Given a base policy $q_0$ and for any $\varepsilon \in (0, 1)$, consider (PG-OR) with the adaptive learning rate of Theorem 3 and $q_t$ given by Algorithm 1 with $m = \min(\lceil Q_{q_0}((1 - o(1))\varepsilon)^{-1} \rceil, k^N)$. Let $T$ and $Q$ denote the total number of iterations (also samples) and reward queries (also queries to $q_0$), respectively, and let $\tau \sim \mathrm{Unif}(\{0, \ldots, T-1\})$. Then, with $Q = \tilde{O}((m + \varepsilon^{-1})/\gamma^2)$ reward queries and $T = \tilde{O}(1/(\gamma^2\varepsilon))$ iterations, we have $\mathbb{E}[p_{w_\tau}(y^*(x) \mid x)] \ge 1 - \varepsilon$.

Remark. Compared to Corollary 4, the number of PG iterations and context samples only scales with $\tilde{O}(1/(\gamma^2\varepsilon))$. On the other hand, the number of reward queries (and sampling queries from $q_0$) still blows up exponentially for $\varepsilon < Q_{q_0}^{-1}(k^{-N})$, thus still facing the base model barrier.

Remark. As we prove in Theorem 10, both the number of samples and the number of reward queries above are minimax optimal (up to log factors) among learners with access to $q_0$.
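A minimal sketch of the (PG-OR) step with Algorithm 1 as the behavior policy, reusing `k`, `token_probs`, and `grad_log_lik` from the earlier sketch; the `reward` oracle and `base_q0` sampler are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_seq(w, x, N):
    """Autoregressively sample y ~ p_w(. | x), token by token."""
    y = []
    for _ in range(N):
        y.append(int(rng.choice(k, p=token_probs(w, x, y))))
    return y

def best_of_m_exploration(w, x, reward, base_q0, m, N):
    """Algorithm 1: one draw from the current policy, then up to m redraws
    from the mixture (1/2) q0 + (1/2) Unif until the outcome reward fires."""
    y = sample_seq(w, x, N)
    j = 1
    while j <= m and reward(x, y) == 0:
        j += 1
        if rng.random() < 0.5:
            y = base_q0(x)                         # full response from the base model
        else:
            y = [int(c) for c in rng.integers(0, k, size=N)]  # uniform response
    return y

def pg_or_step(w, x, reward, base_q0, m, N):
    """(PG-OR) with adaptive LR: w <- w + eta_t * r_t * grad log p_w(y | x)."""
    y = best_of_m_exploration(w, x, reward, base_q0, m, N)
    if reward(x, y) == 0:
        return w  # zero reward contributes zero gradient
    g = grad_log_lik(w, x, y)
    eta = 1.0 / (4.0 + 2.0 * np.linalg.norm(g))
    return w + eta * g
```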
Unlike Corollary 4, the behavior policy in Corollary 5 depends on the current policy $p_{w_t}$, taking advantage of online exploration. If we remove line 1 from Algorithm 1, then the expected number of reward queries will always be exponential in $N$. When $\varepsilon < Q_{q_0}^{-1}(k^{-N})$, we could choose $m = \infty$ and obtain the same bounds in expectation. In this case, PG and SGD coincide, and we only rely on Algorithm 1 to recover the ground-truth label of $x^{(t)}$.

[Figure 2: Likelihood CDF (left) and quantile (right) functions for a model trained with Adagrad for a varying number of steps. As the model trains for longer, likelihoods increase and $Q_q(\varepsilon)$ gets closer to 1 for fixed $\varepsilon$. Details are provided in Section 6.]

Algorithm 2 (Best-of-$m$ exploration for PG-PR). Input: $x^{(t)}$, $p_{w_t}$, $q$, $m$.
1. Draw $y \sim p_{w_t}(\cdot \mid x^{(t)})$. If $r^*(x, y) = 1$, return $y$.
2. For $i = 1, \ldots, N$: set $j = 1$ and draw $y_{i,j} \sim q(\cdot \mid x, y_{1:i-1})$; while $j \le m$ and $r^*(x, y_{1:i-1}, y_{i,j}) = 0$, set $j = j + 1$ and redraw $y_{i,j} \sim \big(\frac{1}{2}q_0 + \frac{1}{2}\mathrm{Unif}\big)(\cdot \mid x, y_{1:i-1})$; then set $y_i = y_{i,j}$.
3. Return $y$.

4 Policy Gradient with Process Rewards

In the previous section, we demonstrated the hardness of post-training when we only have access to outcome rewards. An alternative is to provide the model with reward feedback throughout the generation process [22, 5]. While no longer a contextual bandit problem, the policy gradient iterates can be written as

$$w_{t+1} = w_t - \eta_t \nabla_w \ell_t(w_t), \qquad \ell_t(w_t) \coloneqq -\sum_{i=1}^{N} A_i^{(t)} \log p_{w_t}(y_i^{(t)} \mid x, y_{1:i-1}^{(t)}), \tag{PG-PR}$$

where $y^{(t)} \sim q_t(\cdot \mid x_t)$ is sampled from some behavior policy, and $A_i^{(t)}$ is the advantage of taking action $y_i^{(t)}$ at step $i$ of rollout $t$. Suppose we have a 0-1 reward function $r^* : \mathcal{X} \times \mathcal{Y}^+ \to \{0, 1\}$ that satisfies

$$r^*(x, y_{1:i}) = \mathbb{1}[y_{1:i} = y^*_{1:i}(x)], \quad \forall\, i \in [N]. \tag{4.1}$$

The most basic estimator for the advantage would be the (undiscounted) return from taking action $y_i^{(t)}$, i.e. $A_i^{(t)} = \sum_{j \ge i} r^*(x^{(t)}, y_{1:j}^{(t)})$. However, we can achieve optimal rates using the simpler advantage $A_i^{(t)} = r^*(x, y_{1:i}^{(t)})$, which is what we consider in the following. This form of reward allows us to efficiently explore sequences with a base model that is only accurate for next-token prediction (rather than for predicting the entire sequence). In particular, we can define the Token-Level Likelihood Quantile function $Q_q^{\mathrm{TL}} : [0, 1] \to [0, 1]$ as

$$Q_q^{\mathrm{TL}}(\varepsilon) \coloneqq \sup\Big\{\alpha \in [0, 1] : \mathbb{P}_x\Big(\min_{i \in [N]} q(y_i^*(x) \mid x, y_{1:i-1}^*(x)) \le \alpha\Big) \le \varepsilon\Big\}.$$

Unlike the LQ, for a uniform base policy $q$ we have $Q_q^{\mathrm{TL}}(\varepsilon) = k^{-1}$ for all $\varepsilon < 1$, notably independent of $N$. This definition leads to the following guarantee. The proofs of this section are deferred to Appendix B.

Theorem 6 (PG-PR with Adaptive Learning Rate). Suppose Assumption 1 holds. Given a base policy $q_0$ and for any $\varepsilon \in (0, 1)$, let $q_t$ be the output of Algorithm 2 with $m = \lceil 2(\log N + 1) \cdot \min(Q_{q_0}^{\mathrm{TL}}((1 - o(1))\varepsilon)^{-1}, k) \rceil$. Let $T$ and $Q$ denote the total number of post-training iterations (also samples) and reward (also sampling from $q_0$) queries, respectively, and let $\tau \sim \mathrm{Unif}(\{0, \ldots, T-1\})$. Consider the (PG-PR) iterates with the adaptive learning rate $\eta_t = (4 + 2\|\nabla_w \ell_t(w_t)\|)^{-1}$. Then, with $T = \tilde{O}(1/(\gamma^2\varepsilon))$ iterations and

$$Q = \tilde{O}\Big(\frac{N \min\big(Q_{q_0}^{\mathrm{TL}}((1 - o(1))\varepsilon)^{-1}, k\big)}{\gamma^2} + \frac{1}{\gamma^2\varepsilon}\Big)$$

reward queries, we have $\mathbb{E}[p_{w_\tau}(y^*(x) \mid x)] \ge 1 - \varepsilon$.

We immediately observe that compared to Corollary 5, which required $\tilde{O}((k^N + \varepsilon^{-1})/\gamma^2)$ reward queries in the worst case, the above theorem requires only $\tilde{O}((Nk + \varepsilon^{-1})/\gamma^2)$ reward queries. Further, the requirement on the base model is much milder for this theorem: $q_0$ only needs to generate the next correct token conditioned on a partially correct response. One can observe that for any base model $q_0$, we have $Q_{q_0}^{\mathrm{TL}}(\varepsilon) \ge Q_{q_0}(\varepsilon)$. In certain settings, the benefit of process reward models will be significant, and they will enable PG to go beyond the support of the base model and achieve a significantly smaller expected test error. One such example is provided below.

Corollary 7. Suppose Assumption 1 holds, and additionally, the feature map satisfies $\phi(x, y_{1:i}^*(x)) = \psi(x)$ for some $\psi : \mathbb{R}^d \to \mathbb{R}^D$ and all $i \in [N]$. Let $(w_t^{\mathrm{SGD}})_{t=0}^{n-1}$ denote the SGD iterates with constant learning rate $\eta = 1/(2N)$, and define $q_0 = p_{\bar{w}_n^{\mathrm{SGD}}}$ where $\bar{w}_n^{\mathrm{SGD}} = \frac{1}{n}\sum_{t=0}^{n-1} w_t^{\mathrm{SGD}}$. Let $(w_t^{\mathrm{PG}})_{t=0}^{T-1}$ denote the (PG-PR) iterates in the setting of Theorem 6, and let $\tau \sim \mathrm{Unif}(\{0, \ldots, T-1\})$.

1. To achieve $\mathbb{E}[p_{\bar{w}_n^{\mathrm{SGD}}}(y^*(x) \mid x)] \ge 1 - \varepsilon$, we have a supervised pre-training sample complexity of $n = \tilde{O}(N/(\gamma^2\varepsilon))$.

2. To achieve $\mathbb{E}[p_{w_\tau^{\mathrm{PG}}}(y^*(x) \mid x)] \ge 1 - \varepsilon$, we have a post-training iteration complexity of $T = \tilde{O}(1/(\gamma^2\varepsilon))$. Further, the sufficient number of reward queries is given by

$$Q = \begin{cases} \tilde{O}((N + \varepsilon^{-1})/\gamma^2) & \text{if } n = \tilde{\Omega}(1/(\gamma^2\varepsilon)), \\ \tilde{O}((Nk + \varepsilon^{-1})/\gamma^2) & \text{otherwise.} \end{cases}$$

The above clearly demonstrates (PG-PR) with an SGD-trained base policy as an optimal combination of PG and SGD. While learning with (constant LR) SGD alone would require a number of supervised samples scaling linearly with $N$, and doing PG alone would require a number of reward queries scaling with $Nk$, their combination removes the $N$ factor in the supervised sample complexity and the $k$ factor in the number of reward queries. We remark, however, that this still does not outperform SGD with adaptive LR given the same number of supervised samples. We leave such an attempt to future work.
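Before moving to the lower bounds, here is a minimal sketch of Algorithm 2 and the (PG-PR) step with the prefix advantage $A_i = r^*(x, y_{1:i})$, again reusing the helpers (`k`, `phi`, `token_probs`, `sample_seq`) from the earlier sketches; `prefix_reward` and `base_token` are placeholder oracles.

```python
import numpy as np

rng = np.random.default_rng(2)

def token_best_of_m(w, x, prefix_reward, base_token, m, N):
    """Algorithm 2: after a failed full rollout, rebuild y token by token,
    redrawing each token up to m times from (1/2) q0 + (1/2) Unif until
    the prefix reward r*(x, y_{1:i}) fires."""
    y = sample_seq(w, x, N)
    if prefix_reward(x, y) == 1:  # full response already correct
        return y
    y = []
    for _ in range(N):
        yi = base_token(x, y)     # first proposal from the base policy
        j = 1
        while j <= m and prefix_reward(x, y + [yi]) == 0:
            j += 1
            yi = base_token(x, y) if rng.random() < 0.5 else int(rng.integers(k))
        y.append(yi)              # kept even if still unverified after m tries
    return y

def pg_pr_step(w, x, y, prefix_reward):
    """(PG-PR): only verified-correct prefixes contribute to the gradient."""
    g = np.zeros_like(w)
    for i in range(len(y)):
        if prefix_reward(x, y[: i + 1]) == 1:  # A_i = r*(x, y_{1:i})
            feats = np.stack([phi(x, y[:i], c) for c in range(k)])
            g += feats[y[i]] - token_probs(w, x, y[:i]) @ feats
    eta = 1.0 / (4.0 + 2.0 * np.linalg.norm(g))  # adaptive LR of Theorem 6
    return w + eta * g

# One post-training round:
#   y = token_best_of_m(w, x, prefix_reward, base_token, m, N)
#   w = pg_pr_step(w, x, y, prefix_reward)
```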
5 Lower Bounds

In this section, we argue for the tightness and optimality of the policy gradient algorithms studied for post-training, and of SGD with adaptive LR for pre-training. To establish post-training optimality, it is natural to define classes of algorithms with access to a base model and a reward function. Our lower bounds cover both the online learning setting, where the learner receives only one reward feedback per iteration and its goal is to minimize the number of mistakes, and the statistical setting, where the learner has access to the reward model, can make multiple queries on the same context, and its goal is to have high accuracy on a new i.i.d. sample. We are specifically interested in the role the base model plays in the lower bounds. Hence, we formalize the definitions of online and statistical learners with access to a base model as an oracle, and we believe these definitions might be useful in subsequent works on this topic.

Definition 8 (Online Learner with Bandit Feedback). An online learner with access to a base model $q(\cdot \mid x)$ is a potentially randomized algorithm that at every iteration receives $x_t$, predicts $y_t$ (using its history and the base model), receives reward $r(x_t, y_t)$, and proceeds to the next round.

Definition 9 (Statistical Learner with Bandit Feedback). A statistical learner with access to a base model $q(\cdot \mid x)$ is a potentially randomized algorithm that receives $n$ samples $\{x_t\}_{t=1}^{n}$, and on sample $t$ makes $m_t$ reward queries $\{r(x_t, y_t^{(i)})\}_{i=1}^{m_t}$, where $y_t^{(i)}$ can be chosen adaptively using access to the base model. The algorithm then constructs the training set $S = \{(x_t, y_t^{(i)}, r(x_t, y_t^{(i)})) : t \in [T],\, i \in [m_t]\}$ and returns a prediction rule $\hat{y}(S) : \mathcal{X} \to \mathcal{Y}^N$.

The performance of an online learner is measured by the number of mistakes it makes over $T$ rounds, i.e. $\sum_{t=0}^{T-1} \mathbb{1}[y_t \ne y^*(x_t)]$, while the performance of a statistical learner is measured by its expected error on a new test sample, i.e. $\mathbb{P}_x(\hat{y}(S)(x) \ne y^*(x))$. Note that (PG-OR) is an example of both: it can make online predictions as $y_t \sim p_{w_t}(\cdot \mid x_t)$ (assuming the behavior policy does not make additional reward queries), and it can also make predictions on new test samples at the end of training via $\hat{y} \sim p_{\bar{w}}(\cdot \mid x)$, where $\bar{w}$ is the averaged iterate.

With the above definitions, we are ready to state the lower bound for statistical learners, which establishes the near optimality of the PG variant with adaptive learning rate presented in Corollary 5. The proofs of this section are deferred to Appendix C.

Theorem 10 (Lower Bound for Statistical Learner under ORM). Let $(x_t)_{t \ge 0}$ be an i.i.d. sequence drawn from some distribution $\mathcal{D}$. Consider the statistical learner of Definition 9, where $r$ is the outcome reward. For every $\gamma > 0$, $\alpha \in [k^{-N}, 0.5]$, $\varepsilon^* \in (0, 1/2)$, $\varepsilon \in (0, 1/2)$, and $D \ge k\lfloor 1/\gamma^2 \rfloor$, there exists a base policy $q$ satisfying

$$Q_q(\varepsilon) = \begin{cases} \alpha & \text{if } \varepsilon \ge \varepsilon^*, \\ k^{-N} & \text{if } \varepsilon < \varepsilon^*, \end{cases} \tag{5.1}$$

and a feature map $\phi$, such that for every learner with access to $q$, there exist a distribution $\mathcal{D}$ and a $w^*$ satisfying Assumption 1 with margin $\gamma$ such that, to achieve expected test error less than $\varepsilon$, the learner requires at least $Q = \Omega(Q_q((1 + o(1))\varepsilon)^{-1}/\gamma^2)$ reward queries and at least $T = \Omega(1/(\gamma^2\varepsilon))$ samples.

We highlight a contrast with the result of [10]. They show that learning a reward model from trajectory samples can be done efficiently, with a polynomial dependence on $N$. Our lower bound, however, states that the number of reward queries, whether to the true reward or to a learned reward model, may have to scale exponentially with $N$ in the worst case, which provides a computational, rather than a statistical, barrier.

Next, we establish the near optimality of the online PG variant with adaptive learning rate of Corollary 4 among online learners.

Theorem 11 (Lower Bound for Online Learner under ORM). Consider the same setting as Theorem 10, except with the online learner of Definition 8. There exists a base policy $q$ satisfying (5.1) and a feature map $\phi$ such that for every learner, there is a distribution such that to achieve expected test error $\varepsilon$ on a new sample the learner needs at least $T = \Omega(Q_q((1 + o(1))\varepsilon)^{-1}/(\gamma^2\varepsilon))$ reward queries (equivalently, samples). Further, over $T \ge k^N/\varepsilon^*$ rounds, the expected number of mistakes is at least $\Omega(k^N/\gamma^2)$.

5.1 Lower Bounds on the Base Model

So far we established that under an ORM, the optimal number of reward queries to achieve expected error $\varepsilon$ scales as $Q_q(\varepsilon)^{-1}/\gamma^2$. One may hope that there is a supervised pre-training algorithm that, with a modest number of samples $n$, can achieve an LQ that is polynomially (rather than exponentially) small in $N$. The following lower bound shows that such an algorithm does not exist in general.

Theorem 12 (Lower Bound for Pre-Training). Let $S_n \coloneqq \{(x^{(t)}, y^*(x^{(t)}))\}_{t=1}^{n}$ be $n$ i.i.d. samples drawn from some distribution $\mathcal{D}$. Consider a learner that receives $S_n$ and outputs a base policy $q$. For every $n \ge 1$, $\gamma > 0$, $\varepsilon \in (0, 1/(8n\gamma^2))$, and $D \ge k\lfloor 1/\gamma^2 \rfloor$, and every learner, there exists a distribution $\mathcal{D}$ that satisfies Assumption 1 and $Q_q(\varepsilon) \le k^{-N}/(1 - O(\gamma^2 n \varepsilon))$.

The above lower bound implies that to have $Q_q(\varepsilon) \gg k^{-N}$, any pre-training algorithm needs $n \gg 1/(\gamma^2\varepsilon)$. On the other hand, SGD (with adaptive LR) has an expected error of $\tilde{O}(1/(\gamma^2 n))$ over $n$ samples. Thus, with $n \gg 1/(\gamma^2\varepsilon)$ samples, SGD can already achieve an expected error of order $\varepsilon$ (up to log terms), and there will be no significant improvement from PG in terms of expected error. This argument shows that the barrier of the base model is not an artifact of our analysis or specific to pre-training with SGD; it is a fundamental property of post-training with outcome rewards.
6 Experiments

To validate the intuitions from our theory, we perform simple experiments on a synthetic dataset where $x$ is drawn according to either a mixture model or uniformly from the hypercube $\{\pm 1\}^d$. For the mixture model, to generate $x$ we sample uniformly from the standard basis vectors, add isotropic Gaussian noise with per-dimension standard deviation $0.05/\sqrt{d}$ whose norm is clipped at $0.05$, and project the resulting vector onto $\mathbb{S}^{d-1}(\sqrt{d})$. To generate $y^*(x)$, we first fix $W_1^* \in \mathbb{R}^{k \times d}$ and $W_2^* \in \mathbb{R}^{k \times k}$ once by drawing their entries i.i.d. from $\mathcal{N}(0, 1)$. Then $y_1^*(x) = \arg\max(W_1^* x)$ and $y_i^* = \arg\max(W_1^* x + W_2^* e_{y_{i-1}^*(x)})$. Since $y_{i+1}$ can be predicted from $x$ and $y_i$, we can use the feature map $\phi(x, y_{1:i}) = \mathrm{Concat}(\mathrm{Vec}(e_{y_i} x^\top), \mathrm{Vec}(e_{y_i} e_{y_{i-1}}^\top))$ if $i > 1$ and $\phi(x, y) = \mathrm{Concat}(\mathrm{Vec}(e_y x^\top), \mathbf{0}_{k^2})$ otherwise. Thus $\phi(\cdot, \cdot) \in \mathbb{R}^{dk + k^2}$.

In Figure 1, we use the mixture dataset described above with $N = 128$ and $d = k = 32$. To obtain the base model, we train a linear autoregressive model initialized from zero using Adagrad on the negative log-likelihood for 1000 steps with a learning rate of 0.1 and a batch size of 256. We then use this model to initialize on-policy PG with either the outcome rewards of Section 3 or the process rewards of Section 4. For PG, we use Adagrad with 4000 steps, a learning rate of 0.1, and a batch size of 1024. Each batch is a fresh draw of i.i.d. samples during both pre- and post-training. For Figure 1, we define off-support samples as mixture centers that have a likelihood under the base model smaller than $10^{-12}$; 4 out of 32 centers satisfy this constraint. We plot the average likelihood of generating a correct response over these 4 samples throughout PG training. PG with ORM is unable to boost the likelihood of any of these 4 samples, while the average likelihood improves under PRM. We further plot the expected error over 1024 i.i.d. test samples throughout PG. We also plot the likelihood trajectories of 16 out of 32 mixture centers under ORM, where we can observe that samples starting near 0 stay near 0. This is in line with Theorem 3: for these samples, the initial likelihood is $\alpha \asymp k^{-N}$, and improving over the base model takes exponentially many iterations. However, samples with larger $\alpha$ show progress throughout PG.

In Figure 2, we illustrate the evolution of the LQ function and its inverse (CDF) over the supervised training trajectory. For this experiment, we use the uniform-over-hypercube distribution for $x$, and set $N = 128$, $d = 32$, and $k = 10$. We use the Adagrad optimizer with learning rate 1.0 and batch size 256. The results can be reproduced using https://github.com/mousavih/rlvr-base-model-barrier.
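A minimal sketch of the synthetic label-generation process described above, with the paper's dimensions $d = k = 32$, $N = 128$; this is our reconstruction for illustration, see the linked repository for the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 32, 32, 128

# Fixed ground-truth matrices, drawn once with i.i.d. N(0, 1) entries.
W1 = rng.normal(size=(k, d))
W2 = rng.normal(size=(k, k))

def sample_context():
    """Mixture context: a standard basis vector plus norm-clipped Gaussian
    noise, projected onto the sphere of radius sqrt(d)."""
    x = np.zeros(d)
    x[rng.integers(d)] = 1.0
    noise = rng.normal(scale=0.05 / np.sqrt(d), size=d)
    nrm = np.linalg.norm(noise)
    if nrm > 0.05:
        noise *= 0.05 / nrm  # clip the noise norm at 0.05
    x = x + noise
    return np.sqrt(d) * x / np.linalg.norm(x)

def label(x):
    """y*_1 = argmax(W1 x); y*_i = argmax(W1 x + W2 e_{y*_{i-1}})."""
    y = [int(np.argmax(W1 @ x))]
    for _ in range(1, N):
        y.append(int(np.argmax(W1 @ x + W2[:, y[-1]])))
    return y

x = sample_context()
print(label(x)[:8])  # first few ground-truth tokens
```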
7 Conclusion

In this paper, we studied variants of policy gradient for post-training an autoregressive linear model, where the responses satisfy a separability assumption. We proved that with an outcome reward model, PG can efficiently increase the likelihood of samples inside the support of the base model. To go outside this support and achieve test error significantly smaller than that of the base model, the number of reward queries depends on a property of the base model we call the likelihood quantile, and can be exponential in the sequence length. This issue can be alleviated using a process reward model, for which we introduce a token-level likelihood quantile and show that it scales more favorably with the sequence length. We further demonstrated the tightness of our analysis, as well as fundamental statistical limitations on the LQ, through minimax lower bounds. While we assume access to accurate process rewards, it is interesting to find algorithms that can learn such a reward with statistical and computational efficiency whenever possible. Another open question is to extend the results of this paper to settings with noisy and non-separable responses, and to develop optimal post-training algorithms accordingly.

Acknowledgments

The authors would like to thank Denny Wu and Juno Kim for helpful discussions. MAE was partially supported by the NSERC Grant [2019-06167], the CIFAR AI Chairs program, the CIFAR Catalyst grant, and the Ontario Early Researcher Award.

References

AKLM [21] Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76, 2021.
BPS+ [19] Alina Beygelzimer, David Pal, Balazs Szorenyi, Devanathan Thiruvenkatachari, Chen-Yu Wei, and Chicheng Zhang. Bandit multiclass linear classification: Efficient algorithms for the separable case. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 624–633. PMLR, 2019.
CHG+ [25] Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, and Dylan J. Foster. The coverage principle: How pre-training enables post-training, 2025.
CJRX [25] Fan Chen, Zeyu Jia, Alexander Rakhlin, and Tengyang Xie. Outcome-based online reinforcement learning: Algorithms and fundamental limits. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
CYW+ [25] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.
DH [13] Amit Daniely and Tom Helbertal. The price of bandit information in multiclass online classification. In Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning Research, pages 93–104. PMLR, 2013.
DHS [11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
FMR [25] Dylan J. Foster, Zakaria Mhammedi, and Dhruv Rohatgi. Is a good foundation necessary for efficient reinforcement learning? The computational role of the base model in exploration. arXiv preprint arXiv:2503.07453, 2025.
GYZ+ [25] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
JRX [25] Zeyu Jia, Alexander Rakhlin, and Tengyang Xie. Do we need to verify step by step? Rethinking process supervision from a theoretical perspective. In Forty-second International Conference on Machine Learning, 2025.
JT [19] Ziwei Ji and Matus Telgarsky. The implicit bias of gradient descent on nonseparable data. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 1772–1798. PMLR, 2019.
KD [25] Aayush Karan and Yilun Du. Reasoning with sampling: Your base model is smarter than you think. arXiv preprint arXiv:2510.14901, 2025.
KSST [08] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Efficient bandit algorithms for online multiclass prediction. In Proceedings of the 25th International Conference on Machine Learning, pages 440–447, 2008.
KWLS [25] Juno Kim, Denny Wu, Jason Lee, and Taiji Suzuki. Metastable dynamics of chain-of-thought reasoning: Provable benefits of search, RL and distillation. arXiv preprint arXiv:2502.01694, 2025.
LZBY [20] Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33:7624–7636, 2020.
MXSS [20] Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, pages 6820–6829. PMLR, 2020.
RSSW [24] Hrithik Ravi, Clayton Scott, Daniel Soudry, and Yutong Wang. The implicit bias of gradient descent on separable multiclass data. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
Sha [21] Ohad Shamir. Gradient methods never overfit on separable data. Journal of Machine Learning Research, 22(85):1–20, 2021.
SHN+ [18] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70):1–57, 2018.
SK [25] Matan Schliserman and Tomer Koren. Multiclass loss geometry matters for generalization of gradient descent in separable classification. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
SLX+ [25] Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, et al. Spurious rewards: Rethinking training signals in RLVR. arXiv preprint arXiv:2506.10947, 2025.
SNF+ [24] Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for LLM reasoning. arXiv preprint arXiv:2410.08146, 2024.
SRLK [25] Amrith Setlur, Nived Rajaraman, Sergey Levine, and Aviral Kumar. Scaling test-time compute without verification or RL is suboptimal. arXiv preprint arXiv:2502.12118, 2025.
SSBD [14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
SWD+ [17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
SWZ+ [24] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
TDG+ [25] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025.
TMUK [25] Nikolaos Tsilivis, Eran Malach, Karen Ullrich, and Julia Kempe. How reinforcement learning after next-token prediction facilitates learning. arXiv preprint arXiv:2510.11495, 2025.
vdHFCB [21] Dirk van der Hoeven, Federico Fusco, and Nicolò Cesa-Bianchi. Beyond bandit feedback in online multiclass classification. Advances in Neural Information Processing Systems, 34:13280–13291, 2021.
VSP+ [17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
Wil [92] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
WLZ+ [25] Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, et al. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. arXiv preprint arXiv:2506.14245, 2025.
WXL+ [25] Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, and Yejin Choi. The invisible leash: Why RLVR may not escape its origin. arXiv preprint arXiv:2507.14843, 2025.
YCL+ [25] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025.
ZKZB [20] Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Basar. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 58(6):3586–3612, 2020.
ZNS+ [21] Junyu Zhang, Chengzhuo Ni, Csaba Szepesvari, Mengdi Wang, et al. On the convergence and sample efficiency of variance-reduced policy gradient method. Advances in Neural Information Processing Systems, 34:2228–2240, 2021.
Appendix A Omitted Results and Proofs of Section 3 Our core argument for PG upper bounds is that PG can be seen as online gradient descent on a convex loss. Therefore, we state and use the following general fact for online gradient descent. Proposition 13. Let (ℓt)t=0T−1( _t)_t=0^T-1 be a sequence of convex and non-negative functions on ℝDR^D. Consider online gradient descent updates (t)t=0T( w_t)_t=0^T, initialized from 0 w_0 and defined as t+1=t−ηt​∇ℓt w_t+1= w_t- _t∇ _t where (ηt)t=0T−1( _t)_t=0^T-1 is the learning rate sequence. Then: 1. If ∥∇2ℓt∥2≤L \|∇^2 _t \|_2≤ L for some L>0L>0 and all t, let ηt=η _t=η where η∈(0,L−1)η∈(0,L^-1). Then, for all ∈ℝD v ^D we have 1T∑t=0T−1ℓt(t)≤11−η​L(1T∑t=0T−1ℓt()+∥−0∥222​η​T). 1T _t=0^T-1 _t( w_t)≤ 11-η L ( 1T _t=0^T-1 _t( v)+ \| v- w_0 \|_2^22η T ). (A.1) 2. If ∥∇ℓt∥2≤Cℓt \|∇ _t \|_2≤ C _t for some C>0C>0 and all t, let ηt=η/(λ+∥∇ℓt∥2) _t=η/(λ+ \|∇ _t \|_2) where λ>0λ>0 and η∈(0,2​C−1)η∈(0,2C^-1). Then, for all ∈ℝD v ^D we have 1T∑t=0T−1ℓt​(t)1+C​λ−1​ℓt​(t)≤11−C​η/2(1T∑t=0T−1ℓt()+λ∥−0∥222​η​T). 1T _t=0^T-1 _t( w_t)1+Cλ^-1 _t( w_t)≤ 11-Cη/2 ( 1T _t=0^T-1 _t( v)+ λ \| v- w_0 \|_2^22η T ). (A.2) While the constant learning rate case is well-known in the literature, the adaptive learning rate case, inspired by [3], is particularly relevant for gradient descent to maximize log-likelihood on softmax-linear policies where the smoothness constant can grow with the sequence length. To prove this proposition, we recall the following lemmas. Lemma 14. Suppose f:ℝd→ℝf:R^d satisfies ∥∇2f∥2≤L \|∇^2f \|_2≤ L. Then, ∥∇f(x)∥22≤2L(f(x)−inff) \|∇ f(x) \|_2^2≤ 2L(f(x)- f) for all x∈ℝdx ^d. Proof. By smoothness, for any pair ,′ x, x we have f(′)≤f()+⟨∇f(),′−⟩+L2∥′−∥22f( x )≤ f( x)+ ∇ f( x), x - x + L2 \| x - x \|_2^2. The proof is completed by letting ′=−1L​∇f​() x = x- 1L∇ f( x). ∎ Lemma 15. Suppose (t)t=0T−1( u_t)_t=0^T-1 is an arbitrary sequence of vectors, and starting from 0 w_0, define t+1≔t−t w_t+1 w_t- u_t for all t≤T−1t≤ T-1. Then, for any v, ∑t=0T−1⟨t,t−⟩≤∥−0∥22+∑t=0T−1∥t∥22. _t=0^T-1 u_t, w_t- v ≤ \| v- w_0 \|^22+ _t=0^T-1 \| u_t \|^22. Proof. For any t≥0t≥ 0, ∥t+1−∥2=∥t−∥2+∥t∥2−2⟨t,t−⟩. \| w_t+1- v \|^2= \| w_t- v \|^2+ \| u_t \|^2-2 u_t, w_t- v . By rearranging the terms, 2∑t=0T−1⟨t,t−⟩=∑t=0T−1∥t−∥2−∥t+1−∥2+∑t=0T−1∥t∥2≤∥0−∥2+∑t=0T−1∥t∥2.2 _t=0^T-1 u_t, w_t- v = _t=0^T-1 \| w_t- v \|^2- \| w_t+1- v \|^2+ _t=0^T-1 \| u_t \|^2≤ \| w_0- v \|^2+ _t=0^T-1 \| u_t \|^2. ∎ We can now state the proof of the proposition. Proof. [Proof of Proposition 13] By convexity of ℓt _t, for any ∈ℝD v ^D we have ℓt(t)−ℓt()≤⟨∇ℓt(t),t−⟩ _t( w_t)- _t( v)≤ ∇ _t( w_t), w_t- v . Therefore ∑t=0T−1ηt​(ℓt​(t)−ℓt​()) _t=0^T-1 _t( _t( w_t)- _t( v)) ≤∑t=0T−1⟨ηt∇ℓt(t),t−⟩ ≤ _t=0^T-1 _t∇ _t( w_t), w_t- v ≤∥−0∥222+∑t=0T−1ηt2∥∇ℓt(t)∥222. ≤ \| v- w_0 \|_2^22+ _t=0^T-1 _t^2 \|∇ _t( w_t) \|_2^22. (A.3) Now, suppose ∥∇2ℓt∥2≤L \|∇^2 _t \|_2≤ L. By Lemma 14 and non-negativity of ℓt _t, we have ∥∇ℓt∥22≤2Lℓt \|∇ _t \|_2^2≤ 2L _t, consequently ∑t=0T−1ηt​(ℓt​(t)−ℓt​())≤∥−0∥222+∑t=0T−1ηt2​ℓt​(t). _t=0^T-1 _t( _t( w_t)- _t( v))≤ \| v- w_0 \|_2^22+ _t=0^T-1 _t^2 _t( w_t). Recall that in the first case we choose ηt=η<1/L _t=η<1/L, therefore by rearranging the above (1−η​L)​∑t=0T−1ℓt​(t)≤∑t=0T−1ℓt​()+∥−0∥222,(1-η L) _t=0^T-1 _t( w_t)≤ _t=0^T-1 _t( v)+ \| v- w_0 \|_2^22, which finishes the proof of the first case. For the second case, we have ηt=η/(λ+∥∇ℓt(t)∥2) _t=η/(λ+ \|∇ _t( w_t) \|_2). Thus ηt∥∇ℓt(t)∥2≤η _t \|∇ _t( w_t) \|_2≤η. 
With this general theorem in hand, we can prove the results of Section 3. Note that Proposition 1 is a special case of Theorem 2 and Theorem 3, where $q_t(\cdot \mid x) = \delta_{y^*(x)}$ is the Dirac measure. As a result, we only need to prove the two theorems.

A.1 Proof of Theorem 2

This theorem uses the first case of Proposition 13. Thus, we need to estimate the smoothness of the log-likelihood, which is achieved by the following lemma.

Lemma 16. For any $w \in \mathbb{R}^D$, $x \in \mathbb{R}^d$, and $y \in Y^N$, we have
\[
\nabla_w^2 \log p_w(y \mid x) = \sum_{i=1}^{N}\Big(-\mathbb{E}_{Y_i}\big[\varphi(x, y_{1:i-1}, Y_i)\varphi(x, y_{1:i-1}, Y_i)^\top\big] + \mathbb{E}_{Y_i}\big[\varphi(x, y_{1:i-1}, Y_i)\big]\,\mathbb{E}_{Y_i}\big[\varphi(x, y_{1:i-1}, Y_i)\big]^\top\Big),
\]
where $Y_i \sim p_w(\cdot \mid x, y_{1:i-1})$. As a result, if $\|\varphi(x, y_{1:i})\|_2 \le R$ for all $x, y, i$, then $\|\nabla_w^2 \log p_w(y \mid x)\|_2 \le NR^2$ for all $x, y$.

Proof. We have
\[
\nabla_w^2 \log p_w(y \mid x) = \sum_{i=1}^{N} \nabla_w^2 \log p_w(y_i \mid x, y_{1:i-1}).
\]
Further,
\[
\nabla_w \log p_w(y_i \mid x, y_{1:i-1}) = \varphi(x, y_{1:i-1}, y_i) - \mathbb{E}_{Y_i}\big[\varphi(x, y_{1:i-1}, Y_i)\big].
\]
Taking a second derivative yields
\begin{align*}
\nabla_w^2 \log p_w(y_i \mid x, y_{1:i-1}) &= -\sum_{y=1}^{k} \varphi(x, y_{1:i-1}, y)\,\nabla_w p_w(y \mid x, y_{1:i-1})^\top \\
&= -\sum_{y=1}^{k} p_w(y \mid x, y_{1:i-1})\,\varphi(x, y_{1:i-1}, y)\,\nabla_w \log p_w(y \mid x, y_{1:i-1})^\top \\
&= -\mathbb{E}_{Y_i}\big[\varphi(x, y_{1:i-1}, Y_i)\,\nabla_w \log p_w(Y_i \mid x, y_{1:i-1})^\top\big] \\
&= -\mathbb{E}_{Y_i}\big[\varphi(x, y_{1:i-1}, Y_i)\varphi(x, y_{1:i-1}, Y_i)^\top\big] + \mathbb{E}_{Y_i}\big[\varphi(x, y_{1:i-1}, Y_i)\big]\,\mathbb{E}_{Y_i}\big[\varphi(x, y_{1:i-1}, Y_i)\big]^\top,
\end{align*}
which completes the proof. ∎
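The identity in Lemma 16, that the Hessian of a softmax-linear log-likelihood at a single token is the negative covariance of the feature vectors under the model, can be sanity-checked numerically. The sketch below is illustrative only; the random features, weights, and all sizes are assumptions made for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
k, D = 5, 8                      # vocabulary size and parameter dimension (illustrative)
Phi = rng.normal(size=(k, D))    # row y holds phi(x, y_{1:i-1}, y)
w = rng.normal(size=D)
eps, target = 1e-4, 2            # finite-difference step and an arbitrary token y_i

def log_softmax_prob(v, t):
    logits = Phi @ v
    return logits[t] - np.log(np.sum(np.exp(logits)))

# Analytic Hessian from Lemma 16: -E[phi phi^T] + E[phi] E[phi]^T under Y ~ p_w.
p = np.exp(Phi @ w); p /= p.sum()
mean_phi = p @ Phi
analytic = -(Phi.T * p) @ Phi + np.outer(mean_phi, mean_phi)

def perturb(v, i, j, si, sj):
    out = v.copy(); out[i] += si * eps; out[j] += sj * eps
    return out

# Central finite-difference Hessian of log p_w(y_i | .).
numeric = np.zeros((D, D))
for i in range(D):
    for j in range(D):
        numeric[i, j] = (log_softmax_prob(perturb(w, i, j, +1, +1), target)
                         - log_softmax_prob(perturb(w, i, j, +1, -1), target)
                         - log_softmax_prob(perturb(w, i, j, -1, +1), target)
                         + log_softmax_prob(perturb(w, i, j, -1, -1), target)) / (4 * eps**2)

assert np.allclose(analytic, numeric, atol=1e-4)
```

Note that the Hessian does not depend on the target token, consistent with the lemma: only the covariance of the features under the current model appears.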
To handle both i.i.d. and online settings, we state a more general version of Theorem 2 as a corollary of Proposition 13, and then show how Theorem 2 for i.i.d. data follows from this corollary.

Proposition 17. Let $(w_t)$ denote the (PG-OR) iterates. Suppose the sequence $(x^{(t)}, y^*(x^{(t)}))$ is chosen arbitrarily (in particular, not necessarily i.i.d.), with the only constraint being Assumption 1. For simplicity, assume $\eta_t = 1/(2N)$ for all $t$ and $\|w_0\|_2 \le \frac{1}{\gamma}\log(kT\gamma^2)$. Then,
\[
\mathbb{E}_{y^{(0)},\dots,y^{(T-1)}}\left[\frac{1}{T}\sum_{t=0}^{T-1} -q_t\big(y^*(x^{(t)}) \mid x^{(t)}\big)\log p_{w_t}\big(y^*(x^{(t)}) \mid x^{(t)}\big)\right] \le \frac{10N\log^2(kT\gamma^2)}{\gamma^2 T}, \tag{A.4}
\]
where the expectation is over the randomness in sampling $y^{(t)} \sim q_t(\cdot \mid x^{(t)})$.

Proof. Note that since $r(x_t, y_t) = \mathbb{1}[y_t = y^*(x_t)]$, we can rewrite (PG-OR) as
\[
w_{t+1} = w_t + \eta_t\, r(x^{(t)}, y^{(t)})\,\nabla\log p_{w_t}\big(y^{*(t)} \mid x^{(t)}\big).
\]
For simplicity, throughout the proof we use the shorthand $y^{*(t)} := y^*(x^{(t)})$. Define
\[
\ell_t(w) := -r(x^{(t)}, y^{(t)})\log p_w\big(y^{*(t)} \mid x^{(t)}\big).
\]
Then $\ell_t$ is convex, and $\|\nabla^2 \ell_t\|_2 \le N$ by Lemma 16 (with $R = 1$) and the fact that $r(x^{(t)}, y^{(t)}) \le 1$. Therefore, for $\eta_t = 1/(2N)$, we can invoke the first case of Proposition 13 to write
\[
\frac{1}{T}\sum_{t=0}^{T-1} -r(x^{(t)}, y^{(t)})\log p_{w_t}\big(y^{*(t)} \mid x^{(t)}\big) \le \frac{2}{T}\sum_{t=0}^{T-1} -r(x^{(t)}, y^{(t)})\log p_{v}\big(y^{*(t)} \mid x^{(t)}\big) + \frac{2N\|v - w_0\|_2^2}{T} \tag{A.5}
\]
for any $v \in \mathbb{R}^D$. Next, we take the expectation over the randomness of $y^{(0)},\dots,y^{(T-1)}$ on both sides. In the above, the only random quantities are $(y^{(t)})_{t=0}^{T-1}$ and $(w_t)_{t=0}^{T-1}$, as we are considering an arbitrary sequence of contexts $(x^{(t)})_{t=0}^{T-1}$. Note that $w_t$ is independent of the draw of $y^{(t)}$; therefore,
\begin{align*}
\mathbb{E}\big[r(x^{(t)}, y^{(t)})\log p_{w_t}(y^{*(t)} \mid x^{(t)})\big] &= \mathbb{E}\big[\log p_{w_t}(y^{*(t)} \mid x^{(t)})\,\mathbb{E}[r(x^{(t)}, y^{(t)}) \mid w_t]\big] \\
&= \mathbb{E}\big[q_t(y^{*(t)} \mid x^{(t)})\log p_{w_t}(y^{*(t)} \mid x^{(t)})\big],
\end{align*}
where we used the fact that $\mathbb{E}[r(x^{(t)}, y^{(t)}) \mid w_t] = q_t(y^{*(t)} \mid x^{(t)})$. By plugging this into (A.5), we have
\begin{align*}
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} -q_t(y^{*(t)} \mid x^{(t)})\log p_{w_t}(y^{*(t)} \mid x^{(t)})\right] &\le \frac{2}{T}\sum_{t=0}^{T-1} -q_t(y^{*(t)} \mid x^{(t)})\log p_{v}(y^{*(t)} \mid x^{(t)}) + \frac{2N\|v - w_0\|_2^2}{T} \\
&\le -\frac{2}{T}\sum_{t=0}^{T-1} \log p_{v}(y^{*(t)} \mid x^{(t)}) + \frac{2N\|v - w_0\|_2^2}{T}.
\end{align*}
Let $v^* = R_v w^*$, where $w^*$ is the unit-norm vector of Assumption 1 and $R_v$ is to be chosen later. Then, for any $x$,
\begin{align*}
-\log p_{v^*}(y^* \mid x) &= -\sum_{i=1}^{N} \log p_{v^*}\big(y_i^* \mid x, y^*_{1:i-1}\big) \\
&= \sum_{i=1}^{N} \log\Big(1 + \sum_{y' \ne y_i^*} e^{R_v \langle w^*,\, \varphi(x, y^*_{1:i-1}, y') - \varphi(x, y^*_{1:i})\rangle}\Big) \\
&\le Nke^{-R_v\gamma}.
\end{align*}
Therefore,
\[
\mathbb{E}_{y^{(0)},\dots,y^{(T-1)}}\left[\frac{1}{T}\sum_{t=0}^{T-1} -q_t\log p_{w_t}(y^{*(t)} \mid x^{(t)})\right] \le 2Nke^{-R_v\gamma} + \frac{4NR_v^2 + 4N\|w_0\|_2^2}{T},
\]
where we used $\|v^* - w_0\|_2^2 \le 2R_v^2 + 2\|w_0\|_2^2$. We can balance the first two terms by choosing $R_v = \frac{1}{\gamma}\log(kT\gamma^2)$, and obtain
\[
\mathbb{E}_{y^{(0)},\dots,y^{(T-1)}}\left[\frac{1}{T}\sum_{t=0}^{T-1} -q_t\log p_{w_t}(y^{*(t)} \mid x^{(t)})\right] \le \frac{6N\log^2(kT\gamma^2)}{\gamma^2 T} + \frac{4N\|w_0\|_2^2}{T} \le \frac{10N\log^2(kT\gamma^2)}{\gamma^2 T},
\]
which completes the proof. ∎
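For intuition, one (PG-OR) step as analyzed in Proposition 17 can be rendered schematically as follows. This is a hedged sketch under the paper's softmax-linear model; the sampler, feature map, and vocabulary size `K` are illustrative stand-ins, not the authors' implementation.

```python
import numpy as np

K = 5  # vocabulary size; illustrative assumption

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pg_or_step(w, x, y_star, behavior_sample, features, eta):
    """One PG-OR step: sample a response from the behavior policy q_t, query the
    0-1 outcome reward once, and on success ascend the log-likelihood of y*."""
    y = behavior_sample(x)                      # y ~ q_t(. | x); one reward query
    reward = float(tuple(y) == tuple(y_star))   # r(x, y) = 1[y = y*(x)]
    if reward == 0.0:
        return w                                # zero reward: no parameter update
    grad = np.zeros_like(w)
    for i in range(len(y_star)):                # token-wise log-likelihood gradient
        prefix = y_star[:i]
        phis = np.stack([features(x, prefix, tok) for tok in range(K)])  # (K, D)
        probs = softmax(phis @ w)
        grad += phis[y_star[i]] - probs @ phis  # phi(y*_i) - E_{Y ~ p_w}[phi(Y)]
    return w + eta * reward * grad
```

The per-token gradient is exactly the expression from the proof of Lemma 16: the ground-truth feature minus the model-averaged feature.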
To finish the proof of Theorem 2, we employ a simple online-to-batch conversion.

Proof. [Proof of Theorem 2] By taking the expectation on both sides of (A.4), we have
\[
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} -q_t\big(y^*(x_t) \mid x_t\big)\log p_{w_t}\big(y^*(x_t) \mid x_t\big)\right] \le \frac{10N\log^2(kT\gamma^2)}{\gamma^2 T}.
\]
Since $x_t$ is independent of $w_t$ and $q_t$, we can replace $x_t$ with the test sample $x$, which has the same law as $x_t$ and is independent of all $(w_t)$ and $(q_t)$, while preserving the expectation above, and obtain
\[
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} -q_t\big(y^*(x) \mid x\big)\log p_{w_t}\big(y^*(x) \mid x\big)\right] \le \frac{10N\log^2(kT\gamma^2)}{\gamma^2 T}.
\]
Let
\[
\mathcal{L} := \frac{1}{T}\sum_{t=0}^{T-1} -q_t\big(y^*(x) \mid x\big)\log p_{w_t}\big(y^*(x) \mid x\big).
\]
Using the law of total expectation and the fact that $\mathcal{L} \ge 0$, we have $\mathbb{E}[\mathcal{L} \mid \mathcal{E}_\alpha]\,\mathbb{P}(\mathcal{E}_\alpha) \le \mathbb{E}[\mathcal{L}]$. As a result,
\[
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} -q_t\big(y^*(x) \mid x\big)\log p_{w_t}\big(y^*(x) \mid x\big) \,\middle|\, \mathcal{E}_\alpha\right] \le \frac{10N\log^2(kT\gamma^2)}{\pi_\alpha \gamma^2 T}.
\]
Note that on $\mathcal{E}_\alpha$ we have $q_t(y^*(x) \mid x) \ge \alpha$ for all $t$, thus
\[
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} -\log p_{w_t}\big(y^*(x) \mid x\big) \,\middle|\, \mathcal{E}_\alpha\right] \le \frac{10N\log^2(kT\gamma^2)}{\pi_\alpha \alpha \gamma^2 T}.
\]
Using the convexity of the negative log-likelihood and Jensen's inequality, we obtain
\[
\mathbb{E}\left[-\log p_{\bar{w}^{\mathrm{PG}}_T}\big(y^*(x) \mid x\big) \,\middle|\, \mathcal{E}_\alpha\right] \le \frac{10N\log^2(kT\gamma^2)}{\pi_\alpha \alpha \gamma^2 T}.
\]
Finally, we can use the inequality $1 - x \le -\log x$ for $x > 0$ to complete the proof. ∎

A.2 Proof of Theorem 3

For this part, we use the second case of Proposition 13. To do so, we take advantage of the structure of the log-likelihood gradient, characterized by the following lemma.

Lemma 18. Let $R = \sup_{w,x,y,i} \|\varphi(x, y_{1:i})\|_2$. Then, for any $w \in \mathbb{R}^D$, $x \in \mathbb{R}^d$, $y \in Y^N$, and $i \in [N]$, we have
\[
\big\|\nabla_w \log p_w(y_i \mid x, y_{1:i-1})\big\|_2 \le -2R\log p_w(y_i \mid x, y_{1:i-1}).
\]
Therefore, by the triangle inequality, we also have $\|\nabla_w \log p_w(y \mid x)\|_2 \le -2R\log p_w(y \mid x)$.

Proof. Let us define $p_y = p_w(y \mid x, y_{1:i-1})$. Then,
\begin{align*}
\big\|\nabla_w \log p_w(y_i \mid x, y_{1:i-1})\big\|_2 &= \Big\|\varphi(x, y_{1:i-1}, y_i) - \sum_{y=1}^{k} \varphi(x, y_{1:i-1}, y)\,p_y\Big\|_2 \\
&\le (1 - p_{y_i})\|\varphi(x, y_{1:i-1}, y_i)\|_2 + \sum_{y \ne y_i} p_y\|\varphi(x, y_{1:i-1}, y)\|_2 \\
&\le 2R(1 - p_{y_i}) \\
&\le -2R\log p_{y_i},
\end{align*}
where the last inequality used the fact that $1 - x \le -\log x$ for $x > 0$. ∎

Similar to the previous section, we first prove a more general online guarantee for PG with an adaptive learning rate.

Proposition 19. Let $(w_t)$ denote the (PG-OR) iterates. Suppose the sequence $(x^{(t)}, y^*(x^{(t)}))$ is chosen arbitrarily (in particular, not necessarily i.i.d.), with the only constraint being Assumption 1. Assume $\eta_t = 1/\big(2(\lambda + \|\nabla\log p_{w_t}(y^{(t)} \mid x^{(t)})\|_2)\big)$ for all $t$ and some $\lambda > 0$, and $\|w_0\|_2 \le \frac{1}{\gamma}\log(NkT\gamma^2/\lambda)$. Then,
\[
\mathbb{E}_{y^{(0)},\dots,y^{(T-1)}}\left[\frac{1}{T}\sum_{t=0}^{T-1} q_t\big(y^*(x^{(t)}) \mid x^{(t)}\big)\min\Big(\frac{\lambda}{2},\, -\log p_{w_t}\big(y^*(x^{(t)}) \mid x^{(t)}\big)\Big)\right] \le \frac{20\lambda\log^2(NkT\gamma^2/\lambda)}{\gamma^2 T}. \tag{A.6}
\]
Proof. [Proof of Proposition 19] We can make a similar observation as in Proposition 17 about the (PG-OR) update and rewrite it as
\[
w_{t+1} = w_t + \eta_t\, r(x^{(t)}, y^{(t)})\,\nabla\log p_{w_t}\big(y^{*(t)} \mid x^{(t)}\big) = w_t - \eta_t\nabla\ell_t(w_t)
\]
for $\ell_t(w) := -r(x^{(t)}, y^{(t)})\log p_w(y^{*(t)} \mid x^{(t)})$, where we recall $y^{*(t)} := y^*(x^{(t)})$. Note that $\ell_t$ is convex, and by Lemma 18 (with $R = 1$) it satisfies $\|\nabla\ell_t\|_2 \le 2\ell_t$. For simplicity, define $r_t := r(x^{(t)}, y^{(t)})$ and $\tilde{\ell}_t(w) := -\log p_w(y^{*(t)} \mid x^{(t)})$. Recall the learning rate $\eta_t = 1/\big(2(\lambda + \|\nabla\tilde{\ell}_t(w_t)\|_2)\big)$, and note that $\eta_t$ only matters when $r_t = 1$, in which case $\tilde{\ell}_t = \ell_t$ and thus $\eta_t = 1/\big(2(\lambda + \|\nabla\ell_t(w_t)\|_2)\big)$. We can therefore invoke the second case of Proposition 13 (with $C = 2$ and $\eta = 1/2$) and write
\[
\frac{1}{T}\sum_{t=0}^{T-1} \frac{r_t\tilde{\ell}_t(w_t)}{1 + 2\lambda^{-1}\tilde{\ell}_t(w_t)} \le \frac{1}{T}\sum_{t=0}^{T-1} \frac{r_t\tilde{\ell}_t(w_t)}{1 + 2\lambda^{-1}r_t\tilde{\ell}_t(w_t)} \le \frac{2}{T}\sum_{t=0}^{T-1} r_t\tilde{\ell}_t(v) + \frac{2\lambda\|v - w_0\|_2^2}{T}
\]
for any $v \in \mathbb{R}^D$. Similar to the proof of Proposition 17, we take the expectation over the randomness of $y^{(0)},\dots,y^{(T-1)}$ on both sides, and use the independence of $y^{(t)}$ and $w_t$ as well as the fact that $\mathbb{E}[r(x^{(t)}, y^{(t)}) \mid w_t] = q_t(y^{*(t)} \mid x^{(t)}) =: q_t$ to write
\[
\mathbb{E}_{y^{(0)},\dots,y^{(T-1)}}\left[\frac{1}{T}\sum_{t=0}^{T-1} \frac{q_t\tilde{\ell}_t(w_t)}{1 + 2\lambda^{-1}\tilde{\ell}_t(w_t)}\right] \le \frac{2}{T}\sum_{t=0}^{T-1} q_t\tilde{\ell}_t(v) + \frac{2\lambda\|v - w_0\|_2^2}{T}.
\]
Note that
\[
\frac{\tilde{\ell}_t}{1 + 2\lambda^{-1}\tilde{\ell}_t} = \frac{(\lambda/2)\,\tilde{\ell}_t}{\lambda/2 + \tilde{\ell}_t} \ge \frac{1}{2}\min\big(\lambda/2,\ \tilde{\ell}_t\big).
\]
Thus,
\begin{align*}
\mathbb{E}_{y^{(0)},\dots,y^{(T-1)}}\left[\frac{1}{T}\sum_{t=0}^{T-1} q_t\min\Big(\frac{\lambda}{2},\ \tilde{\ell}_t(w_t)\Big)\right] &\le \frac{4}{T}\sum_{t=0}^{T-1} q_t\tilde{\ell}_t(v) + \frac{4\lambda\|v - w_0\|_2^2}{T} \\
&\le \frac{4}{T}\sum_{t=0}^{T-1} \tilde{\ell}_t(v) + \frac{4\lambda\|v - w_0\|_2^2}{T}.
\end{align*}
Recall from the proof of Proposition 17 that for $v = R_v w^*$ we have $\tilde{\ell}_t(v) \le Nke^{-R_v\gamma}$. Therefore,
\[
\mathbb{E}_{y^{(0)},\dots,y^{(T-1)}}\left[\frac{1}{T}\sum_{t=0}^{T-1} q_t\min\Big(\frac{\lambda}{2},\ \tilde{\ell}_t(w_t)\Big)\right] \le 4Nke^{-R_v\gamma} + \frac{8\lambda R_v^2 + 8\lambda\|w_0\|_2^2}{T}.
\]
By choosing $R_v = \frac{1}{\gamma}\log(NkT\gamma^2/\lambda)$, we obtain
\[
\mathbb{E}_{y^{(0)},\dots,y^{(T-1)}}\left[\frac{1}{T}\sum_{t=0}^{T-1} q_t\min\Big(\frac{\lambda}{2},\ \tilde{\ell}_t(w_t)\Big)\right] \le \frac{20\lambda\log^2(NkT\gamma^2/\lambda)}{\gamma^2 T},
\]
which completes the proof. ∎

Similar to the previous section, we can use an online-to-batch conversion to prove Theorem 3.

Proof. [Proof of Theorem 3] By taking the expectation on both sides of (A.6) and replacing $x_t$ with $x$ (as they are both independent of $w_t$, $q_t$ and have the same law), we obtain
\[
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} q_t\big(y^*(x) \mid x\big)\min\Big(\frac{\lambda}{2},\, -\log p_{w_t}\big(y^*(x) \mid x\big)\Big)\right] \lesssim \frac{\lambda\log^2(NkT\gamma^2/\lambda)}{\gamma^2 T}.
\]
Using the inequality $1 - x \le -\log x$ for all $x$, we obtain
\[
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} q_t\big(y^*(x) \mid x\big)\min\Big(\frac{\lambda}{2},\ 1 - p_{w_t}\big(y^*(x) \mid x\big)\Big)\right] \lesssim \frac{\lambda\log^2(NkT\gamma^2/\lambda)}{\gamma^2 T}.
\]
Now, by choosing $\lambda = 2$, we can remove the minimum and write
\[
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} q_t\big(y^*(x) \mid x\big)\big(1 - p_{w_t}(y^*(x) \mid x)\big)\right] \lesssim \frac{\log^2(NkT\gamma^2)}{\gamma^2 T}.
\]
Similar to the proof of Theorem 2, we can use the non-negativity of the LHS above along with the law of total expectation to write
\[
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} q_t\big(y^*(x) \mid x\big)\big(1 - p_{w_t}(y^*(x) \mid x)\big) \,\middle|\, \mathcal{E}_\alpha\right] \lesssim \frac{\log^2(NkT\gamma^2)}{\pi_\alpha\gamma^2 T}.
\]
Observing that $q_t(y^*(x) \mid x) \ge \alpha$ on $\mathcal{E}_\alpha$ concludes the proof. ∎

A.3 Proof of Corollary 4

Let $\tilde{\mathcal{E}}_{\tilde{\alpha}} := \{x : q(y^*(x) \mid x) \ge \tilde{\alpha}\}$. Then, for any $\tilde{\alpha} \in (0, 1]$,
\[
\mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x)\big] = \mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x) \mid \tilde{\mathcal{E}}_{\tilde{\alpha}}\big]\,\mathbb{P}\big(\tilde{\mathcal{E}}_{\tilde{\alpha}}\big) + \mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x) \mid \tilde{\mathcal{E}}^c_{\tilde{\alpha}}\big]\,\mathbb{P}\big(\tilde{\mathcal{E}}^c_{\tilde{\alpha}}\big). \tag{A.7}
\]
We will consider two approaches to choosing $\tilde{\alpha}$ for bounding this expression.
First, note that
\[
\mathbb{P}\big(\tilde{\mathcal{E}}^c_{\tilde{\alpha}}\big) = \mathbb{P}\Big(\frac{1}{2}q_0(y^*(x) \mid x) + \frac{k^{-N}}{2} < \tilde{\alpha}\Big) \le \mathbb{P}\big(q_0(y^*(x) \mid x) \le 2\tilde{\alpha}\big) = 1 - \pi_{2\tilde{\alpha}},
\]
where we recall $\pi_\alpha := \mathbb{P}\big(q_0(y^*(x) \mid x) \ge \alpha\big)$. Then, we can rewrite (A.7) as
\[
\mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x)\big] \le \mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x) \mid \tilde{\mathcal{E}}_{\tilde{\alpha}}\big]\,\mathbb{P}\big(\tilde{\mathcal{E}}_{\tilde{\alpha}}\big) + 1 - \pi_{2\tilde{\alpha}}. \tag{A.8}
\]
By choosing $\tilde{\alpha} = \frac{1}{2}Q_{q_0}\big(\big(1 - \frac{1}{\log(1/\varepsilon)}\big)\varepsilon\big)$, we have $1 - \pi_{2\tilde{\alpha}} \le \varepsilon\big(1 - \frac{1}{\log(1/\varepsilon)}\big)$. Combined with the bound from Theorem 3 for the first term, we obtain
\begin{align*}
\mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x)\big] &\le \tilde{O}\Big(\frac{\tilde{\alpha}^{-1}}{\gamma^2 T}\Big) + 1 - \pi_{2\tilde{\alpha}} \\
&\le \tilde{O}\Big(\frac{Q_{q_0}\big((1 - o(1))\varepsilon\big)^{-1}}{\gamma^2 T}\Big) + \varepsilon\Big(1 - \frac{1}{\log(1/\varepsilon)}\Big),
\end{align*}
where the $o(1)$ hides the $1/\log(1/\varepsilon)$ factor. By letting $T = \tilde{O}\big(\tilde{\alpha}^{-1}/(\gamma^2\varepsilon)\big)$, we can drive the first term below $\varepsilon/\log(1/\varepsilon)$, and the total expectation below $\varepsilon$. This gives the sufficient number of iterations
\[
T = \tilde{O}\Big(\frac{Q_{q_0}\big((1 - o(1))\varepsilon\big)^{-1}}{\gamma^2\varepsilon}\Big).
\]
For the second approach, note that for $\tilde{\alpha} = \frac{k^{-N}}{2}$ we have $\mathbb{P}(\tilde{\mathcal{E}}_{\tilde{\alpha}}) = 1$. Thus, we can directly use Theorem 3 to write $\mathbb{E}[1 - p_{w_\tau}(y^*(x) \mid x)] \le \tilde{O}\big(k^N/(\gamma^2 T)\big)$. Hence, another sufficient number of iterations is given by $T = \tilde{O}\big(k^N/(\gamma^2\varepsilon)\big)$. We conclude the proof by taking the minimum of the two approaches. ∎

A.4 Proof of Corollary 5

Recall the definitions
\[
\mathcal{E}_\alpha := \big\{x : q_0(y^*(x) \mid x) \ge \alpha\big\}, \qquad \tilde{\mathcal{E}}_{\tilde{\alpha}} := \big\{x : q(y^*(x) \mid x) \ge \tilde{\alpha}\big\}.
\]
From Theorem 3,
\[
\mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x) \mid \tilde{\mathcal{E}}_{\tilde{\alpha}}\big]\,\mathbb{P}\big(\tilde{\mathcal{E}}_{\tilde{\alpha}}\big) \le \tilde{O}\Big(\frac{1}{\tilde{\alpha}\gamma^2 T}\Big).
\]
By inserting this bound into (A.7), we obtain
\[
\mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x)\big] \le \tilde{O}\Big(\frac{1}{\tilde{\alpha}\gamma^2 T}\Big) + \mathbb{P}\big(\tilde{\mathcal{E}}^c_{\tilde{\alpha}}\big). \tag{A.9}
\]
Let
\[
m := \min\Big(\Big\lceil Q_{q_0}\big(\big(1 - \tfrac{1}{\log(1/\varepsilon)}\big)\varepsilon\big)^{-1}\Big\rceil,\ k^N\Big).
\]
We break down the argument based on which term attains the minimum.

• Suppose the minimum is attained by the first term. Since $q(y^*(x) \mid x) \ge 1 - \big(1 - q_0(y^*(x) \mid x)/2\big)^m$, with $\alpha = m^{-1}$ we have $q(y^*(x) \mid x) \ge 1 - e^{-1/2}$ on $\mathcal{E}_\alpha$, so we may take $\tilde{\alpha} = 1 - e^{-1/2}$. Further,
\[
\mathbb{P}\big(\mathcal{E}^c_\alpha\big) = 1 - \pi_\alpha \le 1 - \pi_{m^{-1}} = Q^{-1}_{q_0}\big(m^{-1}\big) \le \Big(1 - \frac{1}{\log(1/\varepsilon)}\Big)\varepsilon.
\]
Since $\mathbb{P}(\tilde{\mathcal{E}}^c_{\tilde{\alpha}}) \le \mathbb{P}(\mathcal{E}^c_\alpha)$, from (A.9) we have
\[
\mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x)\big] \le \tilde{O}\Big(\frac{1}{\gamma^2 T}\Big) + \varepsilon\Big(1 - \frac{1}{\log(1/\varepsilon)}\Big).
\]
Thus $T = \tilde{O}\big(1/(\gamma^2\varepsilon)\big)$ is sufficient to ensure the first term is less than $\varepsilon/\log(1/\varepsilon)$ and the overall expectation is less than $\varepsilon$.

• Suppose instead that the minimum is attained by $k^N$. Note that
\[
q(y^*(x) \mid x) \ge 1 - \big(1 - k^{-N}/2\big)^m \ge 1 - e^{-1/2}.
\]
Hence, this time $\tilde{\alpha} = 1 - e^{-1/2}$ gives $\mathbb{P}(\tilde{\mathcal{E}}^c_{\tilde{\alpha}}) = 0$. As a result, (A.9) reads
\[
\mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x)\big] \le \tilde{O}\Big(\frac{1}{\gamma^2 T}\Big),
\]
and taking $T = \tilde{O}\big(1/(\gamma^2\varepsilon)\big)$ is once again sufficient.
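The reward-query accounting that follows relies on a behavior policy in the style of Algorithm 1, which is specified in the main text and not reproduced here. The following is therefore a hedged sketch of a sampling scheme consistent with that accounting (one query when the current policy's sample is already correct, at most $1+m$ queries otherwise); all names are illustrative assumptions.

```python
def behavior_policy_step(x, policy_sample, base_sample, reward, m):
    """Hedged sketch consistent with the query accounting below: first check the
    current policy's sample with one reward query; on failure, fall back to up
    to m draws from the base model, one reward query each."""
    y = policy_sample(x)              # y ~ p_{w_t}(. | x)
    queries = 1
    if reward(x, y) == 1:
        return y, queries             # success: a single reward query
    for _ in range(m):                # otherwise at most 1 + m queries in total
        y = base_sample(x)            # y ~ q_0(. | x)
        queries += 1
        if reward(x, y) == 1:
            break
    return y, queries
```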
For the number of reward queries, we observe that every time $y^{(t)} = y^*(x^{(t)})$ for $y^{(t)} \sim p_{w_t}$, Algorithm 1 makes one reward query; otherwise, it makes at most $1 + m$ queries. The (PG-OR) update also makes one reward query. Therefore,
\begin{align*}
Q &= 2T + m\,\mathbb{E}\Big[\sum_{t=1}^{T} \mathbb{1}\big[y^{(t)} \ne y^*(x^{(t)})\big]\Big] \\
&= 2T + m\sum_{t=1}^{T} \mathbb{E}\big[1 - p_{w_t}\big(y^*(x^{(t)}) \mid x^{(t)}\big)\big] \\
&= 2T + m\sum_{t=1}^{T} \mathbb{E}\big[1 - p_{w_t}\big(y^*(x) \mid x\big)\big] \quad \text{(using the independence of $x^{(t)}$ from $w_t$)} \\
&\le 2T + mT\,\mathbb{P}\big(\tilde{\mathcal{E}}^c_{\tilde{\alpha}}\big) + m\,\tilde{O}(\gamma^{-2}) \quad \text{(using (A.9) and $\tilde{\alpha} = 1 - e^{-1/2}$).}
\end{align*}
Recall that $\mathbb{P}(\tilde{\mathcal{E}}^c_{\tilde{\alpha}}) \le \varepsilon(1 - o(1))$ and $T = \tilde{O}(1/(\gamma^2\varepsilon))$. Therefore, we can combine the middle and last terms on the RHS above and write
\[
Q \le 2T + m\,\tilde{O}(\gamma^{-2}) \le \tilde{O}\big((m + \varepsilon^{-1})/\gamma^2\big),
\]
finishing the proof. ∎

A.5 Fully On-Policy Policy Gradient

In this section, we consider (PG-OR) when $q_t = p_{w_t}$ for all $t$, starting from a base model $q_0 = p_{w_0}$. We first state the result, then explain the intuition behind it and how it relates qualitatively to the observations in Section 3, and finally prove it.

Proposition 20 (Fully On-Policy PG). Consider the (PG-OR) updates under Assumption 1, with a base policy $q_0 = p_{w_0}$, and assume $\|w_0\|_2 \lesssim \log(Tk\gamma^2)/\gamma$ for simplicity. Let $\tau \sim \mathrm{Unif}\{0,\dots,T-1\}$. For any $\delta \in (0,1)$, with a proper choice of learning rate of order $\eta = \tilde{\Theta}(\gamma^2\delta/N)$, after
\[
T = \tilde{O}\left(\frac{N}{\gamma^4\delta^2\varepsilon}\cdot\min\Big(1,\ \frac{1 - \mathbb{E}[q_0(y^*(x) \mid x)]}{\delta}\Big)\right)
\]
iterations, we have $p_{w_\tau}(y^*(x) \mid x) \ge 1 - \varepsilon$ with probability at least $\mathbb{E}[q_0(y^*(x) \mid x)] - \delta$.

We observe that, with a fixed probability of $\mathbb{E}[q_0(y^*(x) \mid x)] - \delta$, on-policy PG can boost the likelihood to $1 - \varepsilon$, at a rate that, although suboptimal, still depends only linearly on the sequence length and polynomially on the other parameters. Crucially, the role of the base model here is to determine the probability of success, which can be at most $\mathbb{E}[q_0(y^*(x) \mid x)]$. While the above proposition does not characterize which test samples win this lottery ticket, Figure 1 hints that they are the samples where the initial likelihood is non-trivial. Further, we expect the above rate to be an artifact of our analysis and possibly improvable.
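Concretely, the fully on-policy regime replaces the behavior policy with the current model itself. A minimal sketch follows; `sample_from_policy` and `grad_log_likelihood` are assumed helpers for the softmax-linear model, not the authors' code.

```python
import numpy as np

def fully_on_policy_pg(w0, contexts, y_star_fn, sample_from_policy,
                       grad_log_likelihood, eta, T, rng):
    """Fully on-policy PG-OR (Section A.5): the behavior policy q_t is the
    current model p_{w_t}. Sketch only; all helper callables are assumptions."""
    w = w0.copy()
    iterates = [w.copy()]
    for _ in range(T):
        x = contexts[rng.integers(len(contexts))]  # fresh context each round
        y = sample_from_policy(w, x)               # y ~ p_{w_t}(. | x)
        if y == y_star_fn(x):                      # 0-1 outcome reward query
            w = w + eta * grad_log_likelihood(w, x, y)
        iterates.append(w.copy())
    return iterates[rng.integers(T)]               # w_tau, tau ~ Unif{0,...,T-1}
```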
To prove the above proposition, we make a number of observations.

Lemma 21. Recall that $\sup_{w,x,y,i}\|\varphi(x, y_{1:i})\|_2 \le 1$. Then $\sup_{w,x,y}\|\nabla^2_w p_w(y \mid x)\|_2 \lesssim 1$.

Proof. Define the shorthand notations
\[
p := p_w(y \mid x), \quad p^i_y := p_w(y \mid x, y_{1:i-1}), \quad p_i := p_w(y_i \mid x, y_{1:i-1}), \quad Y_i \sim p_w(\cdot \mid x, y_{1:i-1}).
\]
We have $\nabla^2 p = p\,\nabla\log p\,\nabla\log p^\top + p\,\nabla^2\log p$. Therefore,
\[
\|\nabla^2 p\|_2 \le p\|\nabla\log p\|_2^2 + p\|\nabla^2\log p\|_2 \le p\Big(\sum_{i=1}^{N}\|\nabla\log p_i\|_2\Big)^2 + p\sum_{i=1}^{N}\|\nabla^2\log p_i\|_2. \tag{A.10}
\]
Further, from Lemma 18 with $R = 1$, recall that $\|\nabla\log p_i\|_2 \le -2\log p_i$. Also, from the proof of Lemma 16, recall that
\begin{align*}
\|\nabla^2\log p_i\|_2 &= \big\|\mathbb{E}_{Y_i}[\phi_{Y_i}\phi_{Y_i}^\top] - \mathbb{E}_{Y_i}[\phi_{Y_i}]\,\mathbb{E}_{Y_i}[\phi_{Y_i}]^\top\big\|_2 \\
&\le \mathbb{E}_{Y_i}\big[\|\phi_{Y_i}\|_2^2\big] - \big\|\mathbb{E}_{Y_i}[\phi_{Y_i}]\big\|_2^2 \quad \text{(since $\|\mathrm{Cov}_{Y_i}(\phi_{Y_i})\|_2 \le \mathrm{Tr}(\mathrm{Cov}_{Y_i}(\phi_{Y_i}))$)} \\
&= \mathbb{E}_{Y_i}\big[\|\phi_{Y_i} - \mathbb{E}_{Y_i}[\phi_{Y_i}]\|_2^2\big] \\
&\le 2\,\mathbb{E}_{Y_i}\big[\|\phi_{Y_i} - \mathbb{E}_{Y_i}[\phi_{Y_i}]\|_2\big] \quad \text{(since $\|\phi_{Y_i} - \mathbb{E}_{Y_i}[\phi_{Y_i}]\|_2 \le 2$)} \\
&\le 2\big\|\phi_{y_i} - \mathbb{E}_{Y_i}[\phi_{Y_i}]\big\|_2 + 2\sum_{y \ne y_i} p^i_y\big\|\phi_y - \mathbb{E}_{Y_i}[\phi_{Y_i}]\big\|_2 \\
&\le 4(1 - p_i) + 4\sum_{y \ne y_i} p^i_y \quad \text{(from the proof of Lemma 18)} \\
&\le 8(1 - p_i) \le -8\log p_i,
\end{align*}
where $\phi_y := \varphi(x, y_{1:i-1}, y)$. Plugging these estimates back into (A.10) yields
\[
\|\nabla^2 p\|_2 \lesssim p\left(\Big(\sum_{i=1}^{N} -\log p_i\Big)^2 - \sum_{i=1}^{N}\log p_i\right) = e^{-x}\big(x^2 + x\big)\Big|_{x = -\log p}.
\]
Using the fact that the function $x \mapsto e^{-x}(x^2 + x)$ is bounded over $x \ge 0$ completes the proof. ∎

To apply Proposition 17, we need to lower bound $q_\tau$ where $\tau \sim \mathrm{Unif}\{0,\dots,T-1\}$. In the case of on-policy PG, this means lower bounding $p_{w_\tau}$ itself, which is exactly the quantity Proposition 17 controls. We thus need another tool to lower bound $p_{w_\tau}$, which we then combine with Proposition 17 to obtain an improved conditional lower bound. We establish this lower bound by using the fact that on-policy PG effectively performs stochastic gradient ascent on the function $w \mapsto \mathbb{E}[p_w(y^*(x) \mid x)]$; hence, this quantity should not decrease significantly from its initial value along the PG dynamics if the learning rate is sufficiently small. This fact is established by the following lemma.

Lemma 22. Let $(w_t)$ denote the on-policy (PG-OR) iterates where $q_t = p_{w_t}$, initialized from the base policy $p_{w_0} = q_0$. Suppose Assumption 1 holds and $\|w_0\|_2 \lesssim \log(Tk\gamma^2)/\gamma$ for simplicity. Define $\tau \sim \mathrm{Unif}\{0,\dots,T-1\}$. Then, for any constant learning rate $\eta \le 1/(2N)$, we have
\[
\mathbb{E}\big[p_{w_\tau}(y^*(x) \mid x)\big] \ge \mathbb{E}\big[q_0(y^*(x) \mid x)\big] - \frac{C\eta N\log^2(Tk\gamma^2)}{\gamma^2},
\]
where $C > 0$ is an absolute constant.
Proof. Let $F(w) := -\mathbb{E}[p_w(y^*(x) \mid x)]$ and recall $\ell_t(w) := -r(x^{(t)}, y^{(t)})\log p_w(y^*(x^{(t)}) \mid x^{(t)})$. Also define $p_t := p_{w_t}(y^*(x^{(t)}) \mid x^{(t)})$ and $r_t := r(x^{(t)}, y^{(t)})$ for brevity. Recall that using the 0-1 structure of the reward, we can write (PG-OR) as $w_{t+1} = w_t - \eta\nabla\ell_t(w_t)$. Then
\[
\mathbb{E}\big[\nabla\ell_t(w_t) \mid w_t\big] = \mathbb{E}\big[-r(x^{(t)}, y^{(t)})\,\nabla\log p_{w_t}\big(y^*(x^{(t)}) \mid x^{(t)}\big) \,\big|\, w_t\big] = \nabla F(w_t),
\]
as expected from policy gradient, where we used $\mathbb{E}[r(x^{(t)}, y^{(t)}) \mid x^{(t)}, w_t] = p_{w_t}(y^*(x^{(t)}) \mid x^{(t)})$. Further, Lemma 21 proves that $\|\nabla^2 F(w)\|_2 \le L$ with $L = O(1)$ for all $w$. Hence, using the smoothness of $F$, we can write
\[
F(w_{t+1}) \le F(w_t) - \big\langle\nabla F(w_t), \eta\nabla\ell_t(w_t)\big\rangle + \frac{L\eta^2}{2}\|\nabla\ell_t(w_t)\|_2^2 = F(w_t) - \big\langle\nabla F(w_t), \eta\nabla\ell_t(w_t)\big\rangle + \frac{L\eta^2 r_t}{2}\|\nabla\log p_t\|_2^2,
\]
where we used $r_t^2 = r_t$. By taking expectations on both sides, we obtain
\begin{align*}
\mathbb{E}[F(w_{t+1})] &\le \mathbb{E}[F(w_t)] - \eta\,\mathbb{E}\big[\|\nabla F(w_t)\|_2^2\big] + \frac{L\eta^2}{2}\mathbb{E}\big[p_t\|\nabla\log p_t\|_2^2\big] \\
&\le \mathbb{E}[F(w_t)] + L\eta^2 N\,\mathbb{E}\big[-p_t\log p_t\big] \quad \text{(by Lemmas 14 and 16)} \\
&\le \mathbb{E}[F(w_0)] + L\eta^2 N\sum_{t=0}^{T-1}\mathbb{E}\big[-p_t\log p_t\big].
\end{align*}
Therefore, using $\mathbb{E}[p_{w_\tau}(y^*(x) \mid x)] = \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[p_{w_t}(y^*(x) \mid x)]$, we have
\[
\mathbb{E}\big[p_{w_\tau}(y^*(x) \mid x)\big] \ge \mathbb{E}\big[q_0(y^*(x) \mid x)\big] - L\eta^2 N\sum_{t=0}^{T-1}\mathbb{E}\big[-p_t\log p_t\big]. \tag{A.11}
\]
To bound $\sum_{t=0}^{T-1}\mathbb{E}[-p_t\log p_t]$, we can follow the exact same path as the proof of Proposition 17, while keeping $\eta \le 1/(2N)$ in all equations, which yields
\[
\sum_{t=0}^{T-1}\mathbb{E}\big[-p_t\log p_t\big] \lesssim \frac{\log^2(Nk\gamma^2\eta T)}{\gamma^2\eta} + \frac{\|w_0\|_2^2}{\eta T}. \tag{A.12}
\]
Using the bound $\|w_0\|_2 \lesssim \log(Tk\gamma^2)/\gamma$, we obtain
\[
\sum_{t=0}^{T-1}\mathbb{E}\big[-p_t\log p_t\big] \lesssim \frac{\log^2(Tk\gamma^2)}{\gamma^2\eta}.
\]
Substituting this bound into (A.11) results in
\[
\mathbb{E}\big[p_{w_\tau}(y^*(x) \mid x)\big] \ge \mathbb{E}\big[q_0(y^*(x) \mid x)\big] - \frac{CN\eta\log^2(Tk\gamma^2)}{\gamma^2}
\]
for some absolute constant $C > 0$, where we used the fact that $L = O(1)$. ∎

With the tools above, we can state the proof of the proposition.

Proof. [Proof of Proposition 20] Let $\alpha \in (0,1)$ to be fixed later. For convenience, we use the notation $p_{w_\tau} := p_{w_\tau}(y^*(x) \mid x)$, where $(x, y^*(x))$ is an independently drawn test sample. We rely on the following decomposition:
\begin{align*}
\mathbb{P}\big(1 - p_{w_\tau} \ge \varepsilon\big) &= \mathbb{P}\big(1 - p_{w_\tau} \ge \varepsilon \mid p_{w_\tau} \ge \alpha\big)\mathbb{P}\big(p_{w_\tau} \ge \alpha\big) + \mathbb{P}\big(1 - p_{w_\tau} \ge \varepsilon \mid p_{w_\tau} < \alpha\big)\mathbb{P}\big(p_{w_\tau} < \alpha\big) \\
&\le \mathbb{P}\big(1 - p_{w_\tau} \ge \varepsilon \mid p_{w_\tau} \ge \alpha\big)\mathbb{P}\big(p_{w_\tau} \ge \alpha\big) + \mathbb{P}\big(p_{w_\tau} < \alpha\big) \\
&\le \frac{1}{\varepsilon}\mathbb{E}\big[1 - p_{w_\tau} \mid p_{w_\tau} \ge \alpha\big]\mathbb{P}\big(p_{w_\tau} \ge \alpha\big) + \frac{\mathbb{E}[1 - p_{w_\tau}]}{1 - \alpha} \quad \text{(by Markov's inequality)} \\
&\le \frac{1}{\alpha\varepsilon}\mathbb{E}\big[p_{w_\tau}(1 - p_{w_\tau}) \mid p_{w_\tau} \ge \alpha\big]\mathbb{P}\big(p_{w_\tau} \ge \alpha\big) + \frac{\mathbb{E}[1 - p_{w_\tau}]}{1 - \alpha} \\
&\le \frac{1}{\alpha\varepsilon}\mathbb{E}\big[-p_{w_\tau}\log p_{w_\tau} \mid p_{w_\tau} \ge \alpha\big]\mathbb{P}\big(p_{w_\tau} \ge \alpha\big) + \frac{\mathbb{E}[1 - p_{w_\tau}]}{1 - \alpha} \quad \text{(using $1 - x \le -\log x$)} \\
&\le \frac{1}{\alpha\varepsilon}\mathbb{E}\big[-p_{w_\tau}\log p_{w_\tau}\big] + \frac{\mathbb{E}[1 - p_{w_\tau}]}{1 - \alpha} \quad \text{(by the law of total expectation)} \\
&\le \frac{C\log^2(Tk\gamma^2)}{\alpha\varepsilon\gamma^2\eta T} + \frac{\mathbb{E}[1 - p_{w_\tau}]}{1 - \alpha} \quad \text{(using (A.12))} \\
&\le \frac{C\log^2(Tk\gamma^2)}{\alpha\varepsilon\gamma^2\eta T} + \frac{1 - \mathbb{E}[q_0(y^*(x) \mid x)]}{1 - \alpha} + \frac{C\eta N\log^2(Tk\gamma^2)}{\gamma^2(1 - \alpha)}, \quad \text{(by Lemma 22)}
\end{align*}
where $C > 0$ is an absolute constant. For a fixed $\alpha$, the (constant) learning rate minimizing the above bound is of order $\eta = \sqrt{(1-\alpha)/(\alpha NT\varepsilon)}$. We will have to choose $T$ sufficiently large so that this learning rate is also smaller than $1/(2N)$, as needed for (A.12). Using this optimal $\eta$, we obtain
\[
\mathbb{P}\big(1 - p_{w_\tau} \ge \varepsilon\big) \le 1 - \mathbb{E}[q_0] + \frac{\alpha(1 - \mathbb{E}[q_0])}{1 - \alpha} + \tilde{O}\left(\sqrt{\frac{N}{\alpha(1-\alpha)\gamma^4 T\varepsilon}}\right),
\]
where we used $q_0 := q_0(y^*(x) \mid x)$ for brevity. Next, we must choose $\alpha$.

• If $1 - \mathbb{E}[q_0] \ll \delta$, then to achieve a probability of error of $1 - \mathbb{E}[q_0] + \delta$, the choice $\alpha = 1/2$ suffices, and $T = \tilde{O}\big(N/(\gamma^4\varepsilon\delta^2)\big)$ iterations are sufficient. Note that the learning rate will satisfy $\eta \asymp \sqrt{1/(NT\varepsilon)} = \tilde{\Theta}(\gamma^2\delta/N) \ll 1/N$.

• If $1 - \mathbb{E}[q_0] \gg \delta$, then a suitable choice is $\alpha = \delta/\big(2(1 - \mathbb{E}[q_0])\big)$, and $T = \tilde{O}\big(N(1 - \mathbb{E}[q_0])/(\gamma^4\varepsilon\delta^3)\big)$ iterations are sufficient.
Note that the learning rate will satisfy $\eta \asymp \sqrt{(1 - \mathbb{E}[q_0])/(\delta NT\varepsilon)} = \tilde{\Theta}(\gamma^2\delta/N) \ll 1/N$.

Overall, we conclude that with a learning rate of $\eta = \tilde{\Theta}(\gamma^2\delta/N)$ and
\[
T = \tilde{O}\left(\frac{N}{\gamma^4\varepsilon\delta^2}\cdot\min\Big(1,\ \frac{1 - \mathbb{E}[q_0]}{\delta}\Big)\right)
\]
iterations, we achieve $p_{w_\tau}(y^*(x) \mid x) \ge 1 - \varepsilon$ with probability at least $\mathbb{E}[q_0(y^*(x) \mid x)] - \delta$. ∎

A.6 Policy Gradient with Clipped Importance Weights

In this section, we study the effect of clipped importance weights on the convergence of policy gradients. Namely, we study updates of the form
\[
w_{t+1} = w_t + \eta_t\, r(x^{(t)}, y^{(t)})\,\rho_t\,\nabla\log p_{w_t}\big(y^{(t)} \mid x^{(t)}\big), \tag{A.13}
\]
where $y^{(t)} \sim q_t(\cdot \mid x^{(t)})$ and
\[
\rho_t = \mathrm{Clip}\left(\frac{p_{w_t}(y^{(t)} \mid x^{(t)})}{q_t(y^{(t)} \mid x^{(t)})},\ \zeta^{-1},\ \zeta\right), \qquad \zeta \ge 1.
\]
We begin by adapting the conditional convergence results of (PG-OR) to clipped importance weights, as outlined below.

Theorem 23 (PG-OR with Clipped Importance Weights). Let $(w_t)_{t=0}^{T-1}$ denote the iterates of PG with outcome rewards and clipped importance weights as in (A.13). Suppose Assumption 1 holds. For simplicity, assume $\|w_0\|_2 \lesssim \frac{1}{\gamma}\log(NTk\gamma^2)$. Let $\tau \sim \mathrm{Unif}\{0,\dots,T-1\}$ and let $x$ denote an i.i.d. test sample. For any $\alpha \in (0,1]$, define the event $\mathcal{E}_\alpha := \{\min_t q_t(y^*(x) \mid x) \ge \alpha\}$ and $\pi_\alpha := \mathbb{P}(\mathcal{E}_\alpha)$. Then:

1. For the constant learning rate $\eta_t = 1/(2\zeta N)$, we have
\[
\mathbb{E}\big[p_{w_\tau}(y^*(x) \mid x) \mid \mathcal{E}_\alpha\big] \ge 1 - \tilde{O}\Big(\frac{\zeta^2 N}{\pi_\alpha\alpha\gamma^2 T}\Big).
\]
Further, the same bound holds for the averaged iterate $\bar{w}^{\mathrm{PG}}_T = \frac{1}{T}\sum_{t=0}^{T-1} w_t$.

2. For the adaptive learning rate $\eta_t = \big(4\zeta + 2\rho_t\|\nabla\log p_{w_t}(y^{(t)} \mid x^{(t)})\|_2\big)^{-1}$, we have
\[
\mathbb{E}\big[p_{w_\tau}(y^*(x) \mid x) \mid \mathcal{E}_\alpha\big] \ge 1 - \tilde{O}\Big(\frac{\zeta^2}{\pi_\alpha\alpha\gamma^2 T}\Big).
\]

Proof. Both statements follow by closely tracking the proofs of Theorem 2 and Theorem 3, respectively. Since they are similar, we only present the proof of the second case. We can recreate the online bound of Proposition 19 by retracing its proof steps: for the learning rate $\eta_t = 1/\big(2(\zeta\lambda + \rho_t\|\nabla\log p_{w_t}(y^{(t)} \mid x^{(t)})\|_2)\big)$, we obtain
\[
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} q_t\big(y^*(x^{(t)}) \mid x^{(t)}\big)\min\Big(\frac{\lambda}{2},\, -\log p_{w_t}\big(y^*(x^{(t)}) \mid x^{(t)}\big)\Big)\right] \lesssim \frac{\zeta^2\log^2(NTk\gamma^2)}{\gamma^2 T}. \tag{A.14}
\]
Specifically, this time we define $\ell_t(w) := -r_t\rho_t\log p_w\big(y^*(x^{(t)}) \mid x^{(t)}\big)$. Importantly, a consequence of this definition is that while $\rho_t$ depends on the weight $w_t$, this weight is frozen, so no gradient flows through $\rho_t$ in $\nabla\ell_t(w)$. Namely, $\ell_t(w)$ is still a convex function of $w$, and it satisfies $\|\nabla\ell_t\|_2 \le 2\ell_t$ thanks to Lemma 18. Further, notice that (A.13) is equivalent to $w_{t+1} = w_t - \eta_t\nabla\ell_t(w_t)$ due to the 0-1 reward structure, and recall that we use $\tilde{\ell}_t(w) := -\log p_w(y^*(x^{(t)}) \mid x^{(t)})$ to denote the negative log-likelihood. Therefore, we can invoke the second case of Proposition 13 with the learning rate given above, which yields
\[
\frac{1}{T}\sum_{t=0}^{T-1} \frac{r_t\rho_t\tilde{\ell}_t(w_t)}{1 + 2\zeta^{-1}\lambda^{-1}\rho_t r_t\tilde{\ell}_t(w_t)} \le \frac{2}{T}\sum_{t=0}^{T-1} r_t\rho_t\tilde{\ell}_t(v) + \frac{2\zeta\lambda\|v - w_0\|_2^2}{T}.
\]
Using the bounds $\rho_t \in [\zeta^{-1}, \zeta]$ and $r_t \le 1$, we can lower bound the LHS above and write
\[
\frac{1}{T}\sum_{t=0}^{T-1} \frac{r_t\zeta^{-1}\tilde{\ell}_t(w_t)}{1 + 2\lambda^{-1}\tilde{\ell}_t(w_t)} \le \frac{2}{T}\sum_{t=0}^{T-1} r_t\rho_t\tilde{\ell}_t(v) + \frac{2\zeta\lambda\|v - w_0\|_2^2}{T}.
\]
Next, we take expectations on both sides and use the facts that $\mathbb{E}[r_t \mid x^{(t)}, w_t] = q_t(y^*(x^{(t)}) \mid x^{(t)}) =: q_t$ and
\[
\mathbb{E}\big[r_t\rho_t \mid x^{(t)}, w_t\big] \le q_t\max\big(p_{w_t}/q_t,\ \zeta^{-1}\big) \le 1.
\]
We can therefore write
\[
\frac{\zeta^{-1}}{T}\,\mathbb{E}\left[\sum_{t=0}^{T-1} \frac{q_t\tilde{\ell}_t(w_t)}{1 + 2\lambda^{-1}\tilde{\ell}_t(w_t)}\right] \le \frac{2}{T}\sum_{t=0}^{T-1} \tilde{\ell}_t(v) + \frac{2\zeta\lambda\|v - w_0\|_2^2}{T}.
\]
By following the remaining steps from the proof of Proposition 19, we arrive at (A.14). By letting $\lambda = 2$, using the inequality $1 - x \le -\log x$, and employing the same online-to-batch conversion of replacing $x^{(t)}$ with an independent copy $x$, we can show
\[
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} q_t\big(y^*(x) \mid x\big)\big(1 - p_{w_t}(y^*(x) \mid x)\big)\right] \lesssim \frac{\zeta^2\log^2(NTk\gamma^2)}{\gamma^2 T},
\]
or equivalently
\[
\mathbb{E}\big[q_\tau(y^*(x) \mid x)\big(1 - p_{w_\tau}(y^*(x) \mid x)\big)\big] \lesssim \frac{\zeta^2\log^2(NTk\gamma^2)}{\gamma^2 T}.
\]
Combining the above bound with the fact that
\[
\alpha\,\mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x) \mid \mathcal{E}_\alpha\big]\,\mathbb{P}(\mathcal{E}_\alpha) \le \mathbb{E}\big[q_\tau(y^*(x) \mid x)\big(1 - p_{w_\tau}(y^*(x) \mid x)\big)\big]
\]
finishes the proof. ∎

Using the above theorem, we can also adapt the unconditional post-training guarantees of Corollaries 4 and 5 to clipped importance weights. We state this result without proof; it can be obtained by going through the proofs of Corollary 4 (in Section A.3) and Corollary 5 (in Section A.4), respectively.

Corollary 24. Suppose Assumption 1 holds and a base policy $q_0$ is given. Let $(w_t)_{t=0}^{T-1}$ denote the iterates of PG with outcome rewards and clipped importance weights as in (A.13), with the adaptive learning rate of Theorem 23. Let $T$ denote the number of PG iterations and $Q$ the number of reward queries, and let $\tau \sim \mathrm{Unif}\{0,\dots,T-1\}$.

1. Suppose $q_t = \frac{1}{2}q_0 + \frac{1}{2}\mathrm{Unif}$ for all $t$. Then, for any $\varepsilon \in (0,1)$, with
\[
Q = T = \tilde{O}\left(\frac{\zeta^2\min\big(Q_{q_0}((1 - o(1))\varepsilon)^{-1},\ k^N\big)}{\gamma^2\varepsilon}\right)
\]
iterations and reward queries, we achieve $\mathbb{E}[p_{w_\tau}(y^*(x) \mid x)] \ge 1 - \varepsilon$.

2. Suppose $q_t = q$, where $q$ is the output of Algorithm 1 with $m = \min\big(\lceil Q_{q_0}((1 - o(1))\varepsilon)^{-1}\rceil,\ k^N\big)$. Then, for any $\varepsilon \in (0,1)$, with $Q = \tilde{O}\big((m + \varepsilon^{-1})/\gamma^2\big)$ reward queries and $T = \tilde{O}\big(\zeta^2/(\gamma^2\varepsilon)\big)$ iterations, we achieve $\mathbb{E}[p_{w_\tau}(y^*(x) \mid x)] \ge 1 - \varepsilon$.

We observe that when Algorithm 1 provides the behavior policy, only the number of iterations scales with $\zeta^2$, whereas the simple mixture of the base and uniform policies $q_t = \frac{1}{2}q_0 + \frac{1}{2}\mathrm{Unif}$ causes both the number of reward queries and the number of iterations to scale with $\zeta^2$.
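The clipping in (A.13) is the familiar PPO-style ratio clip applied to the full-sequence importance ratio. A minimal sketch of the weight computation, with log-probability inputs assumed to be supplied by the caller:

```python
import numpy as np

def clipped_is_weight(logp_policy, logp_behavior, zeta):
    """Importance ratio p_{w_t}(y|x) / q_t(y|x), clipped to [1/zeta, zeta] as in
    update (A.13); computed from log-probabilities for numerical stability."""
    ratio = np.exp(logp_policy - logp_behavior)
    return np.clip(ratio, 1.0 / zeta, zeta)

# In the update (A.13), the clipped ratio is treated as a frozen scalar, so no
# gradient flows through rho_t; only grad log p_{w_t}(y | x) is differentiated:
#   w_next = w + eta_t * reward * clipped_is_weight(...) * grad_log_likelihood
```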
Appendix B: Proofs of Section 4

Here we state the proofs of the results that appeared in Section 4.

B.1 Proof of Theorem 6

To prove Theorem 6, similar to the previous section, we first prove a general statement that handles any input sequence.

Proposition 25. Let $(w_t)$ denote the (PG-PR) iterates with $y^{(t)} \sim q_t(\cdot \mid x^{(t)})$. Suppose the sequence $(x^{(t)}, y^*(x^{(t)}))$ is chosen arbitrarily (in particular, not necessarily i.i.d.), with the only constraint being Assumption 1. Assume $\eta_t = 1/\big(2(\lambda + \|\nabla\ell_t(w_t)\|_2)\big)$ for all $t$ and some $\lambda > 0$, and $\|w_0\|_2 \le \frac{1}{\gamma}\log(NkT\gamma^2/\lambda)$. Then, for any $i \in [N]$, we have
\[
\mathbb{E}_{y^{(0)},\dots,y^{(T-1)}}\left[\frac{1}{T}\sum_{t=0}^{T-1} q_t\big(y^*_{1:i}(x^{(t)}) \mid x^{(t)}\big)\min\Big(\frac{\lambda}{2},\, -\log p_{w_t}\big(y^*_{1:i}(x^{(t)}) \mid x^{(t)}\big)\Big)\right] \le \frac{20\lambda\log^2(NkT\gamma^2/\lambda)}{\gamma^2 T}. \tag{B.1}
\]

Proof. The proof is similar to that of Proposition 19. Using the fact that $A^{(t)}_j \ne 0$ if and only if $y^{(t)}_{1:j} = y^{*(t)}_{1:j}$ (where we recall the shorthand $y^{*(t)} := y^*(x^{(t)})$), we can write (PG-PR) as
\[
w_{t+1} = w_t + \eta_t\sum_{j=1}^{N} A^{(t)}_j\,\nabla\log p_{w_t}\big(y^{*(t)}_j \mid x^{(t)}, y^{*(t)}_{1:j-1}\big) = w_t - \eta_t\nabla\ell_t(w_t),
\]
where we define
\[
\ell_t(w) := -\sum_{j=1}^{N} A^{(t)}_j\log p_w\big(y^{*(t)}_j \mid x^{(t)}, y^{*(t)}_{1:j-1}\big).
\]
Once again, by the non-negativity of the advantages $A^{(t)}_j \ge 0$, $\ell_t$ is convex. We therefore aim to apply the second case of Proposition 13. For that, we need a bound on $\|\nabla\ell_t\|_2$, which we establish by the triangle inequality:
\[
\|\nabla\ell_t(w)\|_2 \le \sum_{j=1}^{N} A^{(t)}_j\big\|\nabla\log p_w\big(y^{*(t)}_j \mid x^{(t)}, y^{*(t)}_{1:j-1}\big)\big\|_2 \le -2\sum_{j=1}^{N} A^{(t)}_j\log p_w\big(y^{*(t)}_j \mid x^{(t)}, y^{*(t)}_{1:j-1}\big) = 2\ell_t(w),
\]
where the second inequality follows from Lemma 18. Using the above and the learning rate schedule $\eta_t = 1/\big(2(\lambda + \|\nabla\ell_t(w_t)\|_2)\big)$, we can invoke the second case of Proposition 13 to obtain
\[
\frac{1}{T}\sum_{t=0}^{T-1} \frac{\ell_t(w_t)}{1 + 2\lambda^{-1}\ell_t(w_t)} \le \frac{2}{T}\sum_{t=0}^{T-1} \ell_t(v) + \frac{2\lambda\|v - w_0\|_2^2}{T} \tag{B.2}
\]
for any $v \in \mathbb{R}^D$. Further, we use the following definitions:
\begin{align*}
\ell^i_t(w) &:= -\sum_{j=1}^{i} A^{(t)}_j\log p_w\big(y^{*(t)}_j \mid x^{(t)}, y^{*(t)}_{1:j-1}\big), \\
\tilde{\ell}^i_t(w) &:= -\log p_w\big(y^{*(t)}_{1:i} \mid x^{(t)}\big), \\
\tilde{\ell}_t(w) &:= -\log p_w\big(y^{*(t)} \mid x^{(t)}\big).
\end{align*}
Since $\ell^i_t \le \ell_t$ and $\ell^i_t \le \tilde{\ell}^i_t$, we have
\[
\frac{1}{T}\sum_{t=0}^{T-1} \frac{\ell^i_t(w_t)}{1 + 2\lambda^{-1}\tilde{\ell}^i_t(w_t)} \le \frac{1}{T}\sum_{t=0}^{T-1} \frac{\ell^i_t(w_t)}{1 + 2\lambda^{-1}\ell^i_t(w_t)} \le \frac{1}{T}\sum_{t=0}^{T-1} \frac{\ell_t(w_t)}{1 + 2\lambda^{-1}\ell_t(w_t)} \le \frac{2}{T}\sum_{t=0}^{T-1} \ell_t(v) + \frac{2\lambda\|v - w_0\|_2^2}{T}, \tag{B.3}
\]
where the last inequality follows from (B.2). Next, we take the expectation over the randomness in the draw of $y^{(0)},\dots,y^{(T-1)}$. Note that $w_t$ is independent of $y^{(t)}$, and $\mathbb{E}[A^{(t)}_j \mid w_t] = q_t(y^{*(t)}_{1:j} \mid x^{(t)})$ for all $j \in [N]$. Therefore,
\begin{align*}
\mathbb{E}_{y^{(0)},\dots,y^{(T-1)}}\big[\ell^i_t(w_t)\big] &= \mathbb{E}\left[\sum_{j=1}^{i} -q_t\big(y^{*(t)}_{1:j} \mid x^{(t)}\big)\log p_{w_t}\big(y^{*(t)}_j \mid x^{(t)}, y^{*(t)}_{1:j-1}\big)\right] \\
&\ge \mathbb{E}\Big[-q_t\big(y^{*(t)}_{1:i} \mid x^{(t)}\big)\log p_{w_t}\big(y^{*(t)}_{1:i} \mid x^{(t)}\big)\Big] = \mathbb{E}\Big[q_t\big(y^{*(t)}_{1:i} \mid x^{(t)}\big)\,\tilde{\ell}^i_t(w_t)\Big]. \tag{B.4}
\end{align*}
By combining (B.3), (B.4), and the fact that $\mathbb{E}[\ell_t(v)] \le \tilde{\ell}_t(v)$, we obtain
\[
\mathbb{E}_{y^{(0)},\dots,y^{(T-1)}}\left[\frac{1}{T}\sum_{t=0}^{T-1} \frac{q_t\big(y^{*(t)}_{1:i} \mid x^{(t)}\big)\,\tilde{\ell}^i_t(w_t)}{1 + 2\lambda^{-1}\tilde{\ell}^i_t(w_t)}\right] \le \frac{2}{T}\sum_{t=0}^{T-1} \tilde{\ell}_t(v) + \frac{2\lambda\|v - w_0\|_2^2}{T}.
\]
The remainder of the proof follows exactly as in that of Proposition 19. Namely, we let $v = R_v w^*$, for which we have $\tilde{\ell}_t(v) \le Nke^{-R_v\gamma}$. We choose $R_v = \frac{1}{\gamma}\log(NkT\gamma^2/\lambda)$ and notice that
\[
\frac{\tilde{\ell}^i_t}{1 + 2\lambda^{-1}\tilde{\ell}^i_t} \ge \frac{1}{2}\min\big(\lambda/2,\ \tilde{\ell}^i_t\big).
\]
Combined with the bound on $\|w_0\|_2$, the proof is complete. ∎
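The (PG-PR) update just analyzed differs from (PG-OR) in that the process reward assigns one advantage per token prefix. A minimal sketch of the update direction follows; the advantage convention (nonzero only while the sampled prefix matches the ground truth) mirrors the proof of Proposition 25, and all helpers are assumed.

```python
import numpy as np

def pg_pr_direction(w, x, y_star, advantages, grad_log_token):
    """Update direction of (PG-PR): a sum of per-token advantages times the
    log-likelihood gradients of ground-truth tokens given ground-truth prefixes.
    `advantages[j]` plays the role of A_j^{(t)} and is assumed nonzero only when
    the sampled prefix y^{(t)}_{1:j+1} equals y*_{1:j+1}."""
    direction = np.zeros_like(w)
    for j in range(len(y_star)):
        if advantages[j] == 0.0:
            continue  # a zero process reward contributes nothing
        # grad of log p_w(y*_j | x, y*_{1:j-1}); assumed helper
        direction += advantages[j] * grad_log_token(w, x, y_star[:j], y_star[j])
    return direction
```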
Note that the above guarantee is more general than that of Proposition 19. While it recovers the same bound for $i = N$, it also covers the case $i < N$, where it shows that (PG-PR) can boost the accuracy on the first $i$ tokens as long as the behavior policy assigns non-trivial probability to the first $i$ ground-truth tokens. As before, we use an online-to-batch conversion to turn the above bound into a conditional guarantee on the test error.

Proposition 26. Let $(w_t)$ denote the (PG-PR) iterates with adaptive learning rate $\eta_t = \big(4 + 2\|\nabla\ell_t(w_t)\|_2\big)^{-1}$. Suppose Assumption 1 holds, and $\|w_0\|_2 \le \frac{1}{\gamma}\log(NkT\gamma^2/2)$. For any $i \in [N]$ and any $\alpha \in (0,1]$, define the event $\mathcal{E}_{i,\alpha} := \{\inf_t q_t(y^*(x)_{1:i} \mid x) \ge \alpha\}$ and $\pi_{i,\alpha} := \mathbb{P}(\mathcal{E}_{i,\alpha})$. Then, in expectation over the training randomness, over the test sample $x$, and over the choice $\tau \sim \mathrm{Unif}\{0,\dots,T-1\}$, we have
\[
\mathbb{E}\big[p_{w_\tau}\big(y^*(x)_{1:i} \mid x\big) \mid \mathcal{E}_{i,\alpha}\big] \ge 1 - \tilde{O}\Big(\frac{1}{\pi_{i,\alpha}\alpha\gamma^2 T}\Big). \tag{B.5}
\]

Proof. Similar to the proof of Theorem 3, we take the expectation on both sides of (B.1) and use the independence of $x^{(t)}$ from $w_t$ to replace it with an independent test sample $x$, writing
\[
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} q_t\big(y^*_{1:i}(x) \mid x\big)\min\Big(\frac{\lambda}{2},\, -\log p_{w_t}\big(y^*_{1:i}(x) \mid x\big)\Big)\right] \lesssim \frac{\lambda\log^2(NkT\gamma^2/\lambda)}{\gamma^2 T}.
\]
Using $1 - x \le -\log x$, we can write
\[
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} q_t\big(y^*_{1:i}(x) \mid x\big)\min\Big(\frac{\lambda}{2},\ 1 - p_{w_t}\big(y^*_{1:i}(x) \mid x\big)\Big)\right] \lesssim \frac{\lambda\log^2(NkT\gamma^2/\lambda)}{\gamma^2 T}.
\]
By choosing $\lambda = 2$, we can remove the minimum and write
\[
\mathbb{E}\Big[q_\tau\big(y^*_{1:i}(x) \mid x\big)\Big(1 - p_{w_\tau}\big(y^*_{1:i}(x) \mid x\big)\Big)\Big] \lesssim \frac{\log^2(NkT\gamma^2)}{\gamma^2 T},
\]
where we replaced the average with the expectation over $\tau$. Next, we use the law of total expectation and note that on $\mathcal{E}_{i,\alpha}$ we have $q_\tau(y^*_{1:i}(x) \mid x) \ge \alpha$, which yields
\[
\mathbb{E}\Big[1 - p_{w_\tau}\big(y^*_{1:i}(x) \mid x\big) \,\Big|\, \mathcal{E}_{i,\alpha}\Big]\,\mathbb{P}\big(\mathcal{E}_{i,\alpha}\big) \lesssim \frac{\log^2(NkT\gamma^2)}{\alpha\gamma^2 T},
\]
which completes the proof. ∎

With this proposition, we can present the proof of the main theorem of this section.

Proof. [Proof of Theorem 6] The proof is roughly similar to the argument presented in Section A.4. This time, we define
\[
\mathcal{E}_\alpha := \Big\{x : \min_{j \in [N]} q_0\big(y^*_j(x) \mid x, y^*_{1:j-1}(x)\big) \ge \alpha\Big\}, \qquad \tilde{\mathcal{E}}_{\tilde{\alpha}} := \big\{x : q(y^*(x) \mid x) \ge \tilde{\alpha}\big\},
\]
for $\alpha, \tilde{\alpha} \in (0, 1]$. Using the law of total expectation and Proposition 26, we have
\begin{align*}
\mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x)\big] &= \mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x) \mid \tilde{\mathcal{E}}_{\tilde{\alpha}}\big]\,\mathbb{P}\big(\tilde{\mathcal{E}}_{\tilde{\alpha}}\big) + \mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x) \mid \tilde{\mathcal{E}}^c_{\tilde{\alpha}}\big]\,\mathbb{P}\big(\tilde{\mathcal{E}}^c_{\tilde{\alpha}}\big) \\
&\le \tilde{O}\Big(\frac{1}{\tilde{\alpha}\gamma^2 T}\Big) + \mathbb{P}\big(\tilde{\mathcal{E}}^c_{\tilde{\alpha}}\big). \tag{B.6}
\end{align*}
Let
\[
m := \big\lceil 2(\log N + 1)\cdot m^*\big\rceil, \qquad m^* := \min\Big(Q^{\mathrm{TL}}_{q_0}\big(\big(1 - \tfrac{1}{\log(1/\varepsilon)}\big)\varepsilon\big)^{-1},\ k\Big).
\]
Once again, we proceed by considering each case of the minimum separately.

• Suppose the minimum is attained by the first term.
Note that $q\big(y^*_j(x) \mid x, y^*_{1:j-1}(x)\big) \ge 1 - \big(1 - q_0(y^*_j(x) \mid x, y^*_{1:j-1}(x))/2\big)^m$ for all $j$. Therefore, on $\mathcal{E}_\alpha$ we have $q(y^*_j(x) \mid x, y^*_{1:j-1}(x)) \ge 1 - e^{-\alpha m/2}$ and
\[
q(y^*(x) \mid x) \ge 1 - Ne^{-\alpha m/2} \ge 1 - Ne^{-\alpha m^*(\log N + 1)}.
\]
By letting $\alpha = 1/m^*$, we obtain $q(y^*(x) \mid x) \ge 1 - Ne^{-\log N - 1} = 1 - e^{-1}$. Thus, if we let $\tilde{\alpha} := 1 - e^{-1}$, we have
\[
\mathbb{P}\big(\tilde{\mathcal{E}}^c_{\tilde{\alpha}}\big) \le \mathbb{P}\big(\mathcal{E}^c_\alpha\big) = \big(Q^{\mathrm{TL}}_{q_0}\big)^{-1}\big(1/m^*\big) \le \Big(1 - \frac{1}{\log(1/\varepsilon)}\Big)\varepsilon.
\]
Going back to (B.6), we obtain
\[
\mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x)\big] \le \tilde{O}\Big(\frac{1}{\gamma^2 T}\Big) + \Big(1 - \frac{1}{\log(1/\varepsilon)}\Big)\varepsilon.
\]
Choosing $T = \tilde{O}\big(1/(\gamma^2\varepsilon)\big)$ is sufficient to ensure the first term remains bounded by $\varepsilon/\log(1/\varepsilon)$ and the overall expected error remains bounded by $\varepsilon$.

• Now suppose the minimum is attained by the second term, $k$. Then,
\[
q\big(y^*_j(x) \mid x, y^*_{1:j-1}(x)\big) \ge 1 - \big(1 - 1/(2k)\big)^m \ge 1 - e^{-m/(2k)},
\]
and $q(y^*(x) \mid x) \ge 1 - Ne^{-m/(2k)} \ge 1 - Ne^{-\log N - 1} = 1 - e^{-1}$. Thus, for $\tilde{\alpha} = 1 - e^{-1}$ we have $\mathbb{P}(\tilde{\mathcal{E}}^c_{\tilde{\alpha}}) = 0$. Hence (B.6) reads
\[
\mathbb{E}\big[1 - p_{w_\tau}(y^*(x) \mid x)\big] \le \tilde{O}\Big(\frac{1}{\gamma^2 T}\Big),
\]
and choosing $T = \tilde{O}\big(1/(\gamma^2\varepsilon)\big)$ is sufficient.

The argument for the number of reward queries is also similar to that of Section A.4, adapted to Algorithm 2. Note that if $y^{(t)} = y^*(x^{(t)})$ for $y^{(t)} \sim p_{w_t}$, then Algorithm 2 makes only one reward query, and (PG-PR) needs no additional reward queries. Otherwise, Algorithm 2 makes at most $1 + mN$ reward queries, and (PG-PR) makes at most $N$ reward queries. Thus, we can write the total number of reward queries as
\begin{align*}
Q &= T + (m+1)N\,\mathbb{E}\Big[\sum_{t=1}^{T}\mathbb{1}\big[y^{(t)} \ne y^*(x^{(t)})\big]\Big] \\
&= T + (m+1)N\sum_{t=1}^{T}\mathbb{E}\big[1 - p_{w_t}\big(y^*(x^{(t)}) \mid x^{(t)}\big)\big] \\
&= T + (m+1)N\sum_{t=1}^{T}\mathbb{E}\big[1 - p_{w_t}\big(y^*(x) \mid x\big)\big] \quad \text{(using the independence of $x^{(t)}$ from $w_t$)} \\
&\le T + (m+1)NT\,\mathbb{P}\big(\tilde{\mathcal{E}}^c_{\tilde{\alpha}}\big) + (m+1)N\,\tilde{O}(\gamma^{-2}) \quad \text{(using (B.6) and $\tilde{\alpha} = 1 - e^{-1}$).}
\end{align*}
Using $\mathbb{P}(\tilde{\mathcal{E}}^c_{\tilde{\alpha}}) \le (1 - o(1))\varepsilon$ and $T = \tilde{O}(1/(\gamma^2\varepsilon))$, we have
\[
Q = T + (m+1)N\,\tilde{O}(\gamma^{-2}) = \tilde{O}\Big(\frac{mN + \varepsilon^{-1}}{\gamma^2}\Big).
\]
Plugging in $m$ completes the proof. ∎

B.2 Proof of Corollary 7

The first case of the corollary follows directly from Proposition 1. For the second case, note that as a consequence of the assumption on the feature map, for a given $x$ we have
\[
p_w\big(y^*_i(x) \mid x, y^*_{1:i-1}(x)\big) = p_w\big(y^*(x) \mid x\big)^{1/N} \tag{B.7}
\]
for all $i$. Further, by going back to Proposition 17 in the SGD setting (thus removing the behavior policy from (A.4)), we have
\[
\mathbb{E}\big[-\log p_{\bar{w}^{\mathrm{SGD}}_n}(y^*(x) \mid x)\big] \le O\Big(\frac{N\log^2(nk\gamma^2)}{\gamma^2 n}\Big),
\]
where we used Jensen's inequality. Recall $q_0 = p_{\bar{w}^{\mathrm{SGD}}_n}$. Define $Z(x) := \min_{i \in [N]} q_0\big(y^*_i(x) \mid x, y^*_{1:i-1}(x)\big)$. By (B.7), we have $Z(x) = q_0(y^*(x) \mid x)^{1/N}$. Consequently,
\begin{align*}
\mathbb{P}\big(Z(x) < \alpha\big) &= \mathbb{P}\big(q_0(y^*(x) \mid x)^{1/N} < \alpha\big) \\
&= \mathbb{P}\Big(-\frac{1}{N}\log q_0(y^*(x) \mid x) \ge \log(1/\alpha)\Big) \\
&\le \frac{\mathbb{E}\big[-\log q_0(y^*(x) \mid x)\big]}{N\log(1/\alpha)} \quad \text{(by Markov's inequality)} \\
&\le \frac{\log^2(nk\gamma^2)}{n\gamma^2\log(1/\alpha)}.
\end{align*}
Recall that
\[
Q^{\mathrm{TL}}_{q_0}(\varepsilon) = \sup\big\{\alpha \in [0,1] : \mathbb{P}(Z \le \alpha) \le \varepsilon\big\}.
\]
Thus we have
\[
Q^{\mathrm{TL}}_{q_0}(\varepsilon) \ge \exp\Big(-\frac{\log^2(n\gamma^2 k)}{n\gamma^2\varepsilon}\Big).
\]
Recall from Theorem 6 that while the number of iterations is always $T = \tilde{O}(1/(\gamma^2\varepsilon))$, the number of reward queries is
\[
Q = \tilde{O}\Big(\frac{N\min\big(Q^{\mathrm{TL}}_{q_0}(\varepsilon)^{-1},\ k\big) + \varepsilon^{-1}}{\gamma^2}\Big).
\]
When $n \ge \frac{\log(k/\varepsilon)}{\gamma^2\varepsilon}$, one can verify that $Q^{\mathrm{TL}}_{q_0}(\varepsilon) \ge e^{-3}$, so we can use this bound on $Q^{\mathrm{TL}}_{q_0}$ in the minimum above. Otherwise, we can use $k$. This finishes the proof. ∎

Appendix C: Proofs of Section 5

Our minimax lower bounds rely on the following fact.

Lemma 27. Suppose $y$ is a random variable drawn uniformly from $\{1,\dots,m\}$. To estimate $y$, an algorithm is allowed $l \le m$ rounds of guesses, where at round $t$ the algorithm produces a guess $\hat{y}_t$ and receives the answer $\mathbb{1}[\hat{y}_t = y]$. Let $\hat{y}_l$ denote the final guess of the algorithm. For any such algorithm, we have
\[
\mathbb{P}\big(\hat{y}_l \ne y\big) \ge \frac{m - l}{m}.
\]

Proof. Define $A_l = \bigcap_{i=1}^{l}\{\hat{y}_i \ne y\}$. We will prove by induction that $\mathbb{P}(A_l) \ge (m - l)/m$ for $l \le m$. Note that $\hat{y}_1$ is independent of $y$; therefore, $\mathbb{P}(A_1) \ge (m-1)/m$. For $l \ge 2$, we have
\[
\mathbb{P}(A_l) = \mathbb{P}\big(\hat{y}_l \ne y \mid A_{l-1}\big)\,\mathbb{P}(A_{l-1}). \tag{C.1}
\]
Further,
\begin{align*}
\mathbb{P}\big(\hat{y}_l \ne y \mid A_{l-1}\big) &= \mathbb{P}\big(\hat{y}_l \ne y \mid A_{l-1},\ \hat{y}_l \in \{\hat{y}_i\}_{i=1}^{l-1}\big)\,\mathbb{P}\big(\hat{y}_l \in \{\hat{y}_i\}_{i=1}^{l-1}\big) \\
&\quad + \mathbb{P}\big(\hat{y}_l \ne y \mid A_{l-1},\ \hat{y}_l \notin \{\hat{y}_i\}_{i=1}^{l-1}\big)\,\mathbb{P}\big(\hat{y}_l \notin \{\hat{y}_i\}_{i=1}^{l-1}\big) \\
&\ge \mathbb{P}\big(\hat{y}_l \ne y \mid A_{l-1},\ \hat{y}_l \notin \{\hat{y}_i\}_{i=1}^{l-1}\big) \\
&\ge \frac{m - l}{m - (l-1)},
\end{align*}
where we used the fact that $\mathbb{P}\big(\hat{y}_l \ne y \mid A_{l-1},\ \hat{y}_l \in \{\hat{y}_i\}_{i=1}^{l-1}\big) = 1$. Plugging back into (C.1) and using the induction hypothesis $\mathbb{P}(A_{l-1}) \ge (m - (l-1))/m$, we obtain
\[
\mathbb{P}(A_l) \ge \frac{m - l}{m},
\]
which completes the induction. The proof of the lemma is completed by noticing that $\mathbb{P}(\hat{y}_l \ne y) \ge \mathbb{P}(A_l)$. ∎

Remark. The above lower bound is sharp: an optimal algorithm draws $\hat{y}_l$ uniformly from $\{1,\dots,m\}\setminus\{\hat{y}_1,\dots,\hat{y}_{l-1}\}$.
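The optimal strategy in the remark is simply sampling without replacement. A minimal simulation (illustrative only) confirms the $(m-l)/m$ error rate:

```python
import numpy as np

def guess_without_replacement(m, l, trials=100_000, seed=0):
    """Simulate the optimal guesser of Lemma 27: each round, guess uniformly
    among values not yet ruled out; report the final-guess error rate."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(trials):
        y = rng.integers(m)
        remaining = list(range(m))
        guess = None
        for _ in range(l):
            guess = remaining[rng.integers(len(remaining))]
            if guess == y:
                break                   # the feedback 1[guess == y] ends the search
            remaining.remove(guess)     # rule this value out for later rounds
        errors += (guess != y)
    return errors / trials              # approximately (m - l) / m

# e.g. guess_without_replacement(10, 3) is about 0.7 = (10 - 3) / 10
```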
C.1 Proof of Theorem 10

The proof proceeds by randomizing the data distribution and proving bounds in expectation over this randomness, which implies the existence of a worst-case distribution with the same bound.

Setup and Definitions. Define $I := \lfloor 1/\gamma^2\rfloor$ and $m := \lfloor 1/\alpha\rfloor$, and let $e_i$ be the $i$-th standard basis vector of $\mathbb{R}^D$. Define the four sets $\mathcal{X}_i = \{x^{(i-1)I/4 + 1}, \dots, x^{iI/4}\}$ for $i \in [4]$, where we assume without loss of generality that $I$ is divisible by $4$. We define $\mathcal{D}$ as follows: let
\[
p_1 := (1 - \varepsilon^*)(1 - \delta), \quad p_2 := (1 - \varepsilon^*)\delta, \quad p_3 := \varepsilon^*(1 - \delta), \quad p_4 := \varepsilon^*\delta
\]
for some $\delta \in (0,1)$ to be fixed later. With probability $p_i$, we draw $x$ uniformly from $\mathcal{X}_i$. Let $Y_m \subseteq Y^N$ be an arbitrary set of $m$ distinct sequences of length $N$. To determine the ground-truth labels $y^*(x)$, we draw them independently from $\mathrm{Unif}(Y_m)$ for $x \in \mathcal{X}_1 \cup \mathcal{X}_2$ and from $\mathrm{Unif}(Y^N)$ for $x \in \mathcal{X}_3 \cup \mathcal{X}_4$. We will also use the notation $(m_i)_{i=1}^4$, where $m_i = m$ for $i \in \{1,2\}$ and $m_i = k^N$ for $i \in \{3,4\}$; with this notation, the label for $x \in \mathcal{X}_i$ is drawn uniformly from $m_i$ sequences. Note that $m_i \ge 2$ for all $i$. We define $q(\cdot \mid x)$ to be $\mathrm{Unif}(Y_m)$ for $x \in \mathcal{X}_1 \cup \mathcal{X}_2$ and $\mathrm{Unif}(Y^N)$ for $x \in \mathcal{X}_3 \cup \mathcal{X}_4$. Therefore, $q(y^*(x) \mid x) = m^{-1}$ if $x \in \mathcal{X}_1 \cup \mathcal{X}_2$, and $q(y^*(x) \mid x) = k^{-N}$ otherwise. One can then verify that $q$ satisfies
\[
Q_q(\varepsilon) = \begin{cases} k^{-N} & \text{if } \varepsilon < \varepsilon^*, \\ m^{-1} & \text{if } \varepsilon \ge \varepsilon^*. \end{cases}
\]
For every $j \in [N]$ and $i \in [I]$, the feature map $\varphi$ is defined as
\[
\varphi\big(x^i, y_{1:j}\big) = \varphi\big(x^i, y_j\big) = e_{k(i-1) + y_j}.
\]
As a result, $\langle\varphi(x^i, y_{1:j}), \varphi(x^u, y_{1:v})\rangle = \mathbb{1}[i = u]\,\mathbb{1}[y_j = y_v]$. It is straightforward to verify that $w^* := \gamma\sum_{i=1}^{I}\varphi\big(x^i, y^*(x^i)\big)$ satisfies Assumption 1.

Lower Bound on the Number of Samples. Let $m(x)$ denote the number of reward queries made on context $x$, and let $x \sim \mathcal{D}$ be independent of $S$. We have
\[
\mathbb{P}\big(\hat{y}_S(x) \ne y^*(x)\big) = \sum_{i=1}^{4}\mathbb{P}\big(\hat{y}_S(x) \ne y^*(x) \mid x \in \mathcal{X}_i\big)\,\mathbb{P}\big(x \in \mathcal{X}_i\big). \tag{C.2}
\]
Moreover,
\begin{align*}
\mathbb{P}\big(\hat{y}_S(x) \ne y^*(x) \mid x \in \mathcal{X}_i\big) &= \mathbb{E}\big[\mathbb{P}\big(\hat{y}_S(x) \ne y^*(x) \mid S,\ x \in \mathcal{X}_i\big) \mid x \in \mathcal{X}_i\big] \\
&\ge \mathbb{E}\left[\mathbb{1}[x \notin S]\cdot\frac{m_i - 1}{m_i} + \mathbb{1}[x \in S]\cdot\frac{m_i - 1 - m(x)}{m_i}\vee 0 \,\middle|\, x \in \mathcal{X}_i\right] \\
&= \mathbb{P}\big(x \notin S \mid x \in \mathcal{X}_i\big)\cdot\frac{m_i - 1}{m_i} + \mathbb{E}\left[\frac{\mathbb{1}[x \in S]\,\big(m_i - 1 - m(x)\big)}{m_i}\vee 0 \,\middle|\, x \in \mathcal{X}_i\right], \tag{C.3}
\end{align*}
where the inequality follows from Lemma 27. To obtain a lower bound on the number of samples, we drop the second term of (C.3) and obtain
\[
\mathbb{P}\big(\hat{y}_S(x) \ne y^*(x) \mid x \in \mathcal{X}_i\big) \ge \frac{m_i - 1}{m_i}\,\mathbb{P}\big(x \notin S \mid x \in \mathcal{X}_i\big).
\]
Note that if $x \in \mathcal{X}_i$, then the probability that it is drawn in a single sample is $\frac{4p_i}{I}$, and by a union bound, the probability that it is drawn at least once among $M$ samples is at most $\frac{4Mp_i}{I}$. Therefore,
\[
\mathbb{P}\big(x \notin S \mid x \in \mathcal{X}_i\big) \ge \Big(1 - \frac{4p_iM}{I}\Big)\vee 0.
\]
Thus,
\[
\mathbb{P}\big(\hat{y}_S(x) \ne y^*(x)\big) \ge \sum_{i=1}^{4} p_i\cdot\frac{m_i - 1}{m_i}\cdot\Big(1 - \frac{4p_iM}{I}\Big)\vee 0.
\]
Since each term is non-negative, we can drop the $i = 1$ and $i = 3$ terms from the lower bound and write
\[
\mathbb{P}\big(\hat{y}_S(x) \ne y^*(x)\big) \ge \frac{p_2}{2}\Big(1 - \frac{4p_2M}{I}\Big) + \frac{p_4}{2}\Big(1 - \frac{4p_4M}{I}\Big) \ge \frac{p_2 + p_4}{2}\Big(1 - \frac{4(p_2\vee p_4)M}{I}\Big),
\]
where we used $m_i \ge 2$. For the probability of error to be less than $\varepsilon$, we need
\[
M \ge \frac{I}{4(p_2\vee p_4)}\Big(1 - \frac{2\varepsilon}{p_2 + p_4}\Big).
\]
Recall $p_2 = (1 - \varepsilon^*)\delta$ and $p_4 = \varepsilon^*\delta$. For $\varepsilon \le 1/2 - c$ for some absolute constant $c > 0$, let $\delta = 2\varepsilon(1 + 2c) \le 1$. Then, the above implies
\[
M \ge \frac{I}{4\delta\big((1 - \varepsilon^*)\vee\varepsilon^*\big)}\cdot\frac{2c}{1 + 2c} \ge \frac{cI}{4\delta(1 + 2c)} \gtrsim \frac{1}{\gamma^2\varepsilon},
\]
which completes the proof of the sample lower bound.

Lower Bound on the Number of Reward Queries. For this part, we choose $\delta = 1/2$. We return to (C.3) and note that
\[
\mathbb{P}\big(\hat{y}_S(x) \ne y^*(x) \mid x \in \mathcal{X}_i\big) \ge \frac{m_i - 1}{m_i} - \mathbb{E}\left[\frac{\mathbb{1}[x \in S]\,m(x)}{m_i} \,\middle|\, x \in \mathcal{X}_i\right].
\]
In particular, using $m_1 = m_2 = m$ and $m_3 = m_4 = k^N$, we can write
\[
\mathbb{P}\big(\hat{y}_S(x) \ne y^*(x) \mid x \in \mathcal{X}_1\cup\mathcal{X}_2\big) \ge \frac{m - 1}{m} - \frac{\mathbb{E}\big[\mathbb{1}[x \in S]\,m(x) \mid x \in \mathcal{X}_1\cup\mathcal{X}_2\big]}{m},
\]
and
\[
\mathbb{P}\big(\hat{y}_S(x) \ne y^*(x) \mid x \in \mathcal{X}_3\cup\mathcal{X}_4\big) \ge \frac{k^N - 1}{k^N} - \frac{\mathbb{E}\big[\mathbb{1}[x \in S]\,m(x) \mid x \in \mathcal{X}_3\cup\mathcal{X}_4\big]}{k^N}.
\]
Define $Z := \mathbb{E}[\mathbb{1}[x \in S]\,m(x) \mid x \in \mathcal{X}_1\cup\mathcal{X}_2]$ and $Z' := \mathbb{E}[\mathbb{1}[x \in S]\,m(x) \mid x \in \mathcal{X}_3\cup\mathcal{X}_4]$. Using (C.2), we obtain
\[
\mathbb{P}\big(\hat{y}_S(x) \ne y^*(x)\big) \ge (1 - \varepsilon^*)\Big(\frac{m - 1 - Z}{m}\Big)\vee 0 + \varepsilon^*\Big(\frac{k^N - 1 - Z'}{k^N}\Big)\vee 0, \tag{C.4}
\]
where we used the non-negativity of each conditional probability. In particular, each term on the RHS must individually be bounded by the LHS, i.e.,
\[
\mathbb{P}\big(\hat{y}_S(x) \ne y^*(x)\big) \ge (1 - \varepsilon^*)\Big(\frac{m - 1 - Z}{m}\Big) \tag{C.5}
\]
when considering only $x \in \mathcal{X}_1\cup\mathcal{X}_2$, and
\[
\mathbb{P}\big(\hat{y}_S(x) \ne y^*(x)\big) \ge \varepsilon^*\Big(\frac{k^N - 1 - Z'}{k^N}\Big) \tag{C.6}
\]
when considering only $x \in \mathcal{X}_3\cup\mathcal{X}_4$. Before proceeding, we observe that the number of reward queries equals $Q = \frac{(Z + Z')I}{2}$. This is because
\begin{align*}
Q &= \mathbb{E}\Big[\sum_{i=1}^{I}\mathbb{1}[x^i \in S]\,m(x^i)\Big] \\
&= \mathbb{E}\Big[\sum_{x^i \in \mathcal{X}_1\cup\mathcal{X}_2}\mathbb{1}[x^i \in S]\,m(x^i)\Big] + \mathbb{E}\Big[\sum_{x^i \in \mathcal{X}_3\cup\mathcal{X}_4}\mathbb{1}[x^i \in S]\,m(x^i)\Big] \\
&= \frac{I}{2}\,\mathbb{E}\big[\mathbb{1}[x \in S]\,m(x) \mid x \in \mathcal{X}_1\cup\mathcal{X}_2\big] + \frac{I}{2}\,\mathbb{E}\big[\mathbb{1}[x \in S]\,m(x) \mid x \in \mathcal{X}_3\cup\mathcal{X}_4\big],
\end{align*}
where we recall $x \sim \mathcal{D}$. Now, consider the following cases:

• First, consider the case $\big(1 + 1/\log(1/\varepsilon)\big)\varepsilon \ge \varepsilon^*$. We have
\[
\varepsilon \ge (1 - \varepsilon^*)\Big(\frac{m - 1 - Z}{m}\Big) + \varepsilon^*\Big(\frac{k^N - 1 - Z'}{k^N}\Big) \ge \frac{m - 1 - (Z + Z')}{m},
\]
which implies $Z + Z' \ge m(1 - \varepsilon - m^{-1}) \gtrsim m$, where we used $\varepsilon < 1/2$ and $m^{-1} < 1/2$. Therefore, the number of reward queries is $Q = I(Z + Z')/2 \gtrsim mI \gtrsim \alpha^{-1}/\gamma^2$. Note that for this range of $\varepsilon$ we have $Q_q\big((1 + 1/\log(1/\varepsilon))\varepsilon\big) = m^{-1}$; therefore $Q \gtrsim Q_q\big((1 + 1/\log(1/\varepsilon))\varepsilon\big)^{-1}/\gamma^2$.

• Second, assume $\big(1 + 1/\log(1/\varepsilon)\big)\varepsilon < \varepsilon^*$. Using (C.6), we have
\[
\varepsilon \ge \varepsilon^*\Big(\frac{k^N - 1 - Z'}{k^N}\Big),
\]
and consequently
\[
Z' \ge k^N\Big(1 - k^{-N} - \frac{\varepsilon}{\varepsilon^*}\Big) \ge k^N\Big(1 - k^{-N} - \frac{1}{1 + 1/\log(1/\varepsilon)}\Big).
\]
We want the multiplier of $k^N$ on the RHS to be at least some absolute constant $c$; without loss of generality, we assume $N$ is sufficiently large that $k^{-N} \le c$. It then suffices to guarantee $1 + 1/\log(1/\varepsilon) \ge (1 - 2c)^{-1}$. This inequality always holds for $\varepsilon < 1/2$ as long as $c$ is a sufficiently small absolute constant; e.g., $c = 0.2$ works by direct calculation. With this, we have $Z' \ge ck^N$ and $Q = I(Z + Z')/2 \gtrsim k^NI \gtrsim Q_q\big((1 + 1/\log(1/\varepsilon))\varepsilon\big)^{-1}/\gamma^2$, which completes the proof. ∎

C.2 Proof of Theorem 11

We use the same setup as in Section C.1. Let $\tilde{\pi} := \pi_\alpha$ if $\varepsilon \ge 1 - \pi_\alpha$, and $\tilde{\pi} := 1 - \varepsilon(1 + c)$ if $\varepsilon(1 + 2c) \le 1 - \pi_\alpha$, for some absolute constant $c > 0$. Setting $\varepsilon^* := 1 - \tilde{\pi}$, we define $p_1 := \tilde{\pi}(1 - \delta)$, $p_2 := \tilde{\pi}\delta$, $p_3 := \varepsilon^*(1 - \delta)$, and $p_4 := \varepsilon^*\delta$, where $\delta$ will be chosen later. Note that with this new definition, we still have $\mathbb{P}_x\big(q(y^*(x) \mid x) \ge \alpha\big) \ge \pi_\alpha$. Here, we use $S$ to denote the previous samples together with the learner's predictions on them and the obtained rewards. We begin with the same decomposition as in (C.2). This time, for each conditional probability, we write
\begin{align*}
\mathbb{P}\big(\hat{y}_{S}(x) \ne y^*(x) \mid x \in \mathcal{X}_i\big) &= \mathbb{E}\big[\mathbb{P}\big(\hat{y}_{S}(x) \ne y^*(x) \mid S,\ x \in \mathcal{X}_i\big) \mid x \in \mathcal{X}_i\big] \\
&\ge \mathbb{E}\left[\frac{m_i - 1 - m(x)}{m_i} \,\middle|\, x \in \mathcal{X}_i\right]\vee 0,
\end{align*}
where $m(x)$ is the number of times $x$ has appeared in $S$, and we used Lemma 27 for the inequality. Moreover, when $x \in \mathcal{X}_i$, the probability that an individual draw in $S$ equals $x$ is $4p_i/I$; hence $\mathbb{E}[m(x) \mid x \in \mathcal{X}_i] = \frac{4Tp_i}{I}$, where we recall that $T$ is the number of rounds (hence samples). Therefore,
\[
\mathbb{P}\big(\hat{y}_{S}(x) \ne y^*(x) \mid x \in \mathcal{X}_i\big) \ge \left(\frac{(m_i - 1)I - 4Tp_i}{m_iI}\right)\vee 0. \tag{C.7}
\]

Lower Bound on the Number of Samples. We consider the following two cases:

• Consider the case $\big(1 + 1/\log(1/\varepsilon)\big)\varepsilon \ge \varepsilon^*$. Using (C.7) and (C.2), we have
\begin{align*}
\mathbb{P}\big(\hat{y}_{S}(x) \ne y^*(x)\big) &\ge \sum_{i=1}^{4} p_i\cdot\left(\frac{(m_i - 1)I - 4Tp_i}{m_iI}\right)\vee 0 \tag{C.8} \\
&\ge p_2\left(\frac{(m - 1)I - 4Tp_2}{mI}\right) + p_4\left(\frac{(k^N - 1)I - 4Tp_4}{k^NI}\right) \tag{C.9} \\
&\ge (p_2 + p_4)\left(\frac{(m - 1)I - 4T(p_2\vee p_4)}{mI}\right), \tag{C.10}
\end{align*}
where we used the fact that $k^N \ge m$. Recall $p_2 = (1 - \varepsilon^*)\delta$ and $p_4 = \varepsilon^*\delta$. Then we can rearrange the above to write
\[
T \ge \frac{mI}{4\delta\big((1 - \varepsilon^*)\vee\varepsilon^*\big)}\Big(1 - m^{-1} - \frac{\varepsilon}{\delta}\Big) \ge \frac{mI}{8\delta}\Big(1 - m^{-1} - \frac{\varepsilon}{\delta}\Big).
\]
By choosing $\delta = 2\varepsilon < 1$ and using the fact that $m \ge 2$, we get $T \gtrsim mI/\varepsilon \gtrsim m/(\gamma^2\varepsilon)$. Note that for $\big(1 + 1/\log(1/\varepsilon)\big)\varepsilon \ge \varepsilon^*$ we have $Q_q(\varepsilon) = m^{-1}$. Hence we can write $T \gtrsim Q_q\big((1 + 1/\log(1/\varepsilon))\varepsilon\big)^{-1}/(\gamma^2\varepsilon)$, which completes the proof of this case.

• For the case $\big(1 + 1/\log(1/\varepsilon)\big)\varepsilon < \varepsilon^*$, we use (C.8) with $i = 4$, where we recall $m_4 = k^N$ and $p_4 = \varepsilon^*\delta$. We then have
\[
\varepsilon \ge \varepsilon^*\delta\left(\frac{(k^N - 1)I - 4T\delta\varepsilon^*}{k^NI}\right),
\]
and consequently
\[
T \ge \frac{k^NI}{4\varepsilon^*\delta}\Big(1 - k^{-N} - \frac{\varepsilon}{\varepsilon^*\delta}\Big).
\]
The argument is similar to that of Section C.1. Let $c$ denote an absolute constant, and assume $N$ is sufficiently large that $k^{-N} \le c$. Then we can choose $\delta = (\varepsilon/\varepsilon^*)\cdot(1 - 2c)^{-1}$ to guarantee $T \gtrsim k^NI/\varepsilon$. We only need to ensure $\delta \le 1$, for which it suffices to show $1 + 1/\log(1/\varepsilon) \ge (1 - 2c)^{-1}$; for $\varepsilon < 1/2$, this inequality always holds with $c = 0.2$. With this, we have $T \gtrsim k^N/(\gamma^2\varepsilon)$, and we conclude the proof of this case by recalling that for $\big(1 + 1/\log(1/\varepsilon)\big)\varepsilon < \varepsilon^*$ we have $Q_q\big((1 + 1/\log(1/\varepsilon))\varepsilon\big) = k^{-N}$.

Lower Bound on the Number of Mistakes. For the lower bound on the number of mistakes, we choose $\delta = 1/2$ and merge $\mathcal{X}_1$ with $\mathcal{X}_2$ and $\mathcal{X}_3$ with $\mathcal{X}_4$. For sample $t$, define $S_t$ to be the collection of previous samples together with their corresponding predictions and rewards. Then,
\begin{align*}
\mathbb{P}\big(\hat{y}_{S_t}(x^{(t)}) \ne y^*(x^{(t)})\big) &= \mathbb{P}\big(\hat{y}_{S_t}(x^{(t)}) \ne y^*(x^{(t)}) \mid x^{(t)} \in \mathcal{X}_1\cup\mathcal{X}_2\big)(1 - \varepsilon^*) \\
&\quad + \mathbb{P}\big(\hat{y}_{S_t}(x^{(t)}) \ne y^*(x^{(t)}) \mid x^{(t)} \in \mathcal{X}_3\cup\mathcal{X}_4\big)\,\varepsilon^*.
\end{align*}
Using Lemma 27 and arguments similar to those above, we have
\[
\mathbb{P}\big(\hat{y}_{S_t}(x^{(t)}) \ne y^*(x^{(t)}) \mid x^{(t)} \in \mathcal{X}_1\cup\mathcal{X}_2\big) \ge \frac{(m - 1)I - 2(t-1)(1 - \varepsilon^*)}{mI}\vee 0,
\]
and
\[
\mathbb{P}\big(\hat{y}_{S_t}(x^{(t)}) \ne y^*(x^{(t)}) \mid x^{(t)} \in \mathcal{X}_3\cup\mathcal{X}_4\big) \ge \frac{(k^N - 1)I - 2(t-1)\varepsilon^*}{k^NI}\vee 0. \tag{C.11}
\]
Therefore,
\begin{align*}
\mathbb{E}\Big[\sum_{t=1}^{T}\mathbb{1}\big[\hat{y}_{S_t}(x^{(t)}) \ne y^*(x^{(t)})\big]\Big] &= \sum_{t=1}^{T}\mathbb{P}\big(\hat{y}_{S_t}(x^{(t)}) \ne y^*(x^{(t)})\big) \\
&\ge \sum_{t=1}^{T}(1 - \varepsilon^*)\cdot\frac{(m - 1)I - 2(t-1)(1 - \varepsilon^*)}{mI}\vee 0 \\
&\quad + \sum_{t=1}^{T}\varepsilon^*\cdot\frac{(k^N - 1)I - 2(t-1)\varepsilon^*}{k^NI}\vee 0.
\end{align*}
Note that for all $t \lesssim k^NI/\varepsilon^*$, the terms inside the second sum above are $\Omega(\varepsilon^*)$. Therefore, for $T \ge k^NI/\varepsilon^*$ the sum above is at least $k^NI \gtrsim k^N/\gamma^2$, completing the proof. ∎

C.3 Proof of Theorem 12

We reuse $I := \lfloor 1/\gamma^2\rfloor$, the basis $(e_i)$, and the definition of the feature map $\varphi$ from Section C.1. This time, we define $\mathcal{X}_1 = \{x^1,\dots,x^{I/2}\}$ and $\mathcal{X}_2 = \{x^{I/2+1},\dots,x^I\}$, where we assume without loss of generality that $I$ is even. We construct the distribution as follows: with probability $p_1 := 1 - \delta$, we draw $x$ uniformly from $\mathcal{X}_1$, and with probability $p_2 := \delta$, we draw it uniformly from $\mathcal{X}_2$. We randomize the distribution of the labels, drawing $y^*(x^i) \sim \mathrm{Unif}(Y^N)$ independently.
We recall that $w^* = \gamma\sum_{i=1}^{I}\varphi\big(x^i, y^*(x^i)\big)$ satisfies Assumption 1. Let $S$ denote the training set, and let $q_S(\cdot \mid x)$ be the output of the algorithm. Consider the decomposition
\[
\mathbb{P}\big(q_S(y^*(x) \mid x) \le \alpha\big) = \sum_{i=1}^{2} p_i\,\mathbb{P}\big(q_S(y^*(x) \mid x) \le \alpha \mid x \in \mathcal{X}_i\big).
\]
We choose $i = 2$ and use the non-negativity of probabilities to write
\begin{align*}
\mathbb{P}\big(q_S(y^*(x) \mid x) \le \alpha\big) &\ge p_2\,\mathbb{P}\big(q_S(y^*(x) \mid x) \le \alpha \mid x \in \mathcal{X}_2\big) \\
&\ge p_2\,\mathbb{P}\big(q_S(y^*(x) \mid x) \le \alpha \mid x \in \mathcal{X}_2,\ x \notin S\big)\,\mathbb{P}\big(x \notin S \mid x \in \mathcal{X}_2\big). \tag{C.12}
\end{align*}
By a union bound, we have $\mathbb{P}(x \notin S \mid x \in \mathcal{X}_2) \ge 1 - \frac{2p_2n}{I}$. Further, note that conditioned on $x \notin S$, $y^*(x)$ is independent of $q_S(\cdot \mid x)$. Therefore,
\begin{align*}
\mathbb{P}\big(q_S(y^*(x) \mid x) > \alpha \mid x \in \mathcal{X}_2,\ x \notin S\big) &\le \alpha^{-1}\,\mathbb{E}\big[q_S(y^*(x) \mid x) \mid x \in \mathcal{X}_2,\ x \notin S\big] \quad \text{(Markov's inequality)} \\
&\le \alpha^{-1}\,\mathbb{E}\left[\frac{1}{k^N}\sum_{y \in Y^N} q_S(y \mid x) \,\middle|\, x \in \mathcal{X}_2,\ x \notin S\right] \quad \text{(expectation over $y^*(x)$)} \\
&\le \alpha^{-1}k^{-N},
\end{align*}
where the last inequality uses the fact that the distribution $q_S(\cdot \mid x)$ must sum to $1$. Plugging this back into (C.12), we obtain
\[
\mathbb{P}\big(q_S(y^*(x) \mid x) \le \alpha\big) \ge \delta\big(1 - \alpha^{-1}k^{-N}\big)\big(1 - 2\delta n/I\big).
\]
By choosing $\delta = I/(4n)$, we have
\[
\mathbb{E}_{y^*}\Big[\mathbb{P}\big(q_S(y^*(x) \mid x) \le \alpha \mid y^*\big)\Big] \ge \frac{I\big(1 - \alpha^{-1}k^{-N}\big)}{8n} \gtrsim \frac{1 - \alpha^{-1}k^{-N}}{\gamma^2 n},
\]
where $y^*$ denotes the vector of all label sequences $(y^*(x^i))_{i=1}^{I}$. The fact that the expectation over labels satisfies this lower bound implies the existence of at least one label assignment that satisfies it, namely
\[
1 - \pi_\alpha = \mathbb{P}\big(q_S(y^*(x) \mid x) \le \alpha \mid y^*\big) \gtrsim \frac{1 - \alpha^{-1}k^{-N}}{\gamma^2 n}. \qquad ∎
\]