Paper deep dive
Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment
Hsiang Hsu, Eric Lei, Chun-Fu Chen
Abstract
Inference-time alignment effectively steers large language models (LLMs) by generating multiple candidates from a reference model and selecting among them with an imperfect reward model. However, current strategies face a fundamental dilemma: "optimistic" approaches like Best-of-N suffer from reward hacking, while "pessimistic" regularized methods often stifle the exploration needed to discover high-quality responses. In this work, we formalize this trade-off through the lens of regret minimization, demonstrating that the optimal strategy depends critically on the tail behavior of the reward distribution. We show theoretically that light-tailed regimes favor optimism to unearth high-quality outliers, whereas heavy-tailed regimes require pessimism to guard against reward mis-calibration in the extremes. Guided by this insight, we introduce Best-of-Tails (BoT), an adaptive inference-time alignment framework that uses Tsallis divergence as a tunable regularizer to provide a finer granularity of interpolation between these extremes. BoT uses the Hill estimator to characterize reward-tail heaviness on a per-prompt basis and dynamically adjusts its selection rule to balance exploration gains against alignment error. Across math, multiple-choice reasoning, and human-preference evaluations, BoT improves alignment performance across a range of reference and reward model configurations relative to fixed-strategy baselines.
Tags
Links
- Source: https://arxiv.org/abs/2603.06797v1
- Canonical: https://arxiv.org/abs/2603.06797v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/13/2026, 12:27:54 AM
Summary
Best-of-Tails (BoT) is an adaptive inference-time alignment framework for LLMs that balances optimistic and pessimistic selection strategies. By using the Hill estimator to characterize the tail heaviness of reward distributions on a per-prompt basis, BoT dynamically adjusts its selection rule via Tsallis divergence regularization, effectively mitigating reward hacking in heavy-tailed regimes while maintaining exploration in light-tailed regimes.
Entities (5)
Relation Signals (3)
Best-of-Tails → uses → Tsallis divergence
confidence 100% · BoT uses Tsallis divergence as a tunable regularizer
Best-of-Tails → uses → Hill estimator
confidence 100% · BoT uses the Hill estimator to characterize reward-tail heaviness
Best-of-Tails → addresses → Reward Hacking
confidence 95% · BoT... dynamically adjusts its selection rule to balance exploration gains against alignment error.
Cypher Suggestions (2)
Find all methods used by the Best-of-Tails framework · confidence 90% · unvalidated
MATCH (f:Framework {name: 'Best-of-Tails'})-[:USES]->(m) RETURN m.name, m.entity_type
Identify problems addressed by inference-time alignment strategies · confidence 85% · unvalidated
MATCH (s:Strategy)-[:ADDRESSES]->(p:Problem) RETURN s.name, p.name
Full Text
104,798 characters extracted from source content.
Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment Hsiang Hsu, Eric Lei, and Chun-Fu (Richard) Chen JPMorganChase Global Technology Applied Research Correspondence to: Hsiang Hsu <hsiang.hsu@jpmchase.com>. Abstract Inference-time alignment effectively steers large language models (LLMs) by generating multiple candidates from a reference model and selecting among them with an imperfect reward model. However, current strategies face a fundamental dilemma: “optimistic” approaches like Best-of-N suffer from reward hacking, while “pessimistic” regularized methods often stifle the exploration needed to discover high-quality responses. In this work, we formalize this trade-off through the lens of regret minimization, demonstrating that the optimal strategy depends critically on the tail behavior of the reward distribution. We show theoretically that light-tailed regimes favor optimism to unearth high-quality outliers, whereas heavy-tailed regimes require pessimism to guard against reward mis-calibration in the extremes. Guided by this insight, we introduce Best-of-Tails (BoT), an adaptive inference-time alignment framework that uses Tsallis divergence as a tunable regularizer to provide a finer granularity of interpolation between these extremes. BoT uses the Hill estimator to characterize reward-tail heaviness on a per-prompt basis and dynamically adjusts its selection rule to balance exploration gains against alignment error. Across math, multiple-choice reasoning, and human-preference evaluations, BoT improves alignment performance across a range of reference and reward model configurations relative to fixed-strategy baselines. Keywords: Inference-time alignment, inference-time scaling, Best-of-N, reward hacking, tail behavior, tail-adaptive selection, Tsallis divergence, Hill estimator, regret minimization. 
1 Introduction
Inference-time scaling has emerged as a key mechanism for adapting large language models (LLMs) to complex real-world tasks (Brown et al., 2024; Snell et al., 2025). Extending traditional scaling laws (Kaplan et al., 2020) to the inference phase, models are allocated additional compute at deployment—effectively giving them more time to “think”—so they can search a richer response space and select outputs that are more accurate and contextually appropriate. This paradigm has helped unleash latent capabilities of LLMs, including improved reasoning through chain-of-thought (CoT) generation (Wei et al., 2022; Feng et al., 2023), iterative self-correction and self-evaluation (Wu et al., 2025; Ren et al., 2023; Asai et al., 2024), and longer-horizon planning via structured search (Zhang et al., 2024; Yao et al., 2023). The principles of inference-time scaling have also found a natural and increasingly important application in LLM alignment, where the goal is to steer the model’s behavior toward human preferences such as correctness, helpfulness, and safety (Christiano et al., 2017; Schulman et al., 2017; Rafailov et al., 2023). Rather than permanently encoding preferences into model weights through fine-tuning, inference-time alignment uses extra computation at inference to evaluate and select among multiple candidates produced by a reference model, guided by a reward model (RM) (Stiennon et al., 2020; Jinnai et al., 2024; Verdun et al., 2025; Gui et al., 2024; Geuter et al., 2025; Huang et al., 2025b; Khalaf et al., 2025). This approach offers two practical advantages: it avoids the expense and operational burden of repeated re-training, and enables rapid adaptation to new tasks and deployments by simply changing the reward objective.
Figure 1: Conceptual illustration of selection probabilities (colored solid lines) for optimistic, pessimistic, and the proposed BoT strategies.
The plots depict how these strategies re-weight candidates under light-tailed (left) versus heavy-tailed (right) reward distributions (black dashed lines). While optimism consistently concentrates mass on the highest rewards (risking reward hacking) and pessimism remains conservative (risking under-exploration), BoT adaptively shifts its strategy: it mimics optimism in light-tailed regimes to exploit safe gains, but pivots toward robust, conservative selection in heavy-tailed regimes to prevent over-optimization. The most prominent inference-time alignment strategy is Best-of-N (BoN), which selects the candidate with the highest reward from N generated responses (Stiennon et al., 2020). At its core, BoN reflects an “optimistic” premise: that improvements in reward scores translate reliably into improvements in true quality. In practice, however, reward models are only imperfect proxies (Laidlaw et al., 2025; Sun et al., 2025). As N increases, BoN is increasingly driven by the extreme tail of the reward distribution, precisely where reward scores can become systematically mis-calibrated and over-estimate true quality, leading to reward hacking (also known as reward over-optimization or Goodhart’s Law) (Stiennon et al., 2020; Gao et al., 2022; Skalse et al., 2022; Kwa et al., 2024). To mitigate this failure mode, recent work has proposed more “pessimistic” alternatives that explicitly curb over-optimization through regularization or conservative selection rules (Jinnai et al., 2025; Huang et al., 2025c, b; Verdun et al., 2025). Yet, these safeguards often come at a cost, stifling the model’s exploration at inference time to discover genuinely outstanding solutions when the reward signal is informative. We visually illustrate this dichotomy in Figure 1, which demonstrates how optimistic and pessimistic strategies select responses depending on the observed reward tail heaviness.
In this paper, we formalize the central trade-off between optimistic and pessimistic inference-time alignment strategies. Specifically, we argue that the effectiveness of inference-time alignment depends not only on how accurately the reward model reflects true response quality, but also on the distribution of reward scores induced by the generated candidates. In § 3, we analyze inference-time alignment through the lens of regret minimization. Our theoretical results reveal that the reward score tail behavior fundamentally shapes what “good” selection looks like. When rewards are largely concentrated—corresponding to a light-tailed regime—overly pessimistic selection can underutilize inference compute, since aggressive conservatism makes it unlikely to surface meaningfully better responses. In this setting, optimistic strategies such as BoN can be effective by actively exploring candidates with higher proxy reward. In contrast, when reward scores exhibit heavy tails—indicating substantial headroom for exploration—BoN is more likely to chase extreme proxy scores where mis-calibration is severe, making it particularly vulnerable to reward hacking; here, more pessimistic strategies become essential for robustness. Guided by this insight, in § 4 we propose Best-of-Tails (BoT), a new framework that bridges the optimism needed to discover high-reward responses and the pessimism required to ensure their validity. BoT introduces Tsallis divergence (Tsallis, 1988; Lee et al., 2019), a generalized family of divergences that interpolates between the KL and χ² divergences, as a tunable regularization term for response selection. By our design, BoT does not rely on a fixed hyperparameter; instead, it first characterizes the reward landscape by estimating the tail index (power coefficient) via the Hill estimator (Hill, 1975). Based on this estimate, BoT adaptively interpolates between optimistic and pessimistic selection strategies for each input prompt (cf.
Figure 1), offering a robust trade-off between exploration cost and alignment error. In § 5, we empirically demonstrate that BoT consistently outperforms fixed optimistic baselines (e.g., BoN) and fixed pessimistic alternatives (Jinnai et al., 2024; Verdun et al., 2025; Huang et al., 2025b) across math, multiple-choice reasoning, and human-preference evaluations, under multiple reference and reward model configurations. We conclude in § 6 with a discussion of limitations and future research directions. Omitted proofs, additional discussions, experiment setups, and additional experiments are included in the Appendix. Code to reproduce our experiments will be released soon.
2 Taxonomy of Inference-Time Alignment through Regret Analysis
Given a prompt x ∈ X, an LLM is represented as a (stochastic) policy π: X → Δ(Y) that induces a conditional distribution over responses y ∈ Y. In other words, π(y|x) specifies the probability of generating a response y conditioned on x. We consider a true (but unknown) reward function r*: X × Y → [0, R_max] with a bounded range R_max. For a prompt x, the true reward r* quantifies the extent to which y aligns with human preference, passes a verifier, or demonstrates sound reasoning. In practice, r* is inaccessible, and an imperfect proxy r̂ is often employed instead. For a given reward function r ∈ {r*, r̂}, the corresponding expected reward under a policy π is defined as J_r(π; x) = E_{y∼π(·|x)}[r(x, y)]. Moreover, letting ‖·‖_{L^p(Q)} denote the L^p-norm with respect to a distribution Q, we measure the quality of r̂ via its expected error with respect to r* under a policy:

ε_p(r̂; π(·|x)) = ‖r̂ − r*‖_{L^p(π)} = (E_{y∼π(·|x)}[|r̂(x, y) − r*(x, y)|^p])^{1/p}.

For simplicity, we denote ε = ε_2(r̂; π_ref(·|x)). Finally, let 1(·) denote the indicator function and [b]_+ = max(b, 0) for a scalar b.
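Under a toy scalar response space, the expected reward J_r(π; x) and the reward error ε_p can be approximated by plain Monte Carlo. A minimal sketch; `sample`, `r_true`, and `r_hat` are illustrative stand-ins, not the paper's models.

```python
import random

def expected_reward(reward, sampler, n=20_000, seed=0):
    """Monte Carlo estimate of J_r(pi; x) = E_{y ~ pi(.|x)}[r(x, y)]."""
    rng = random.Random(seed)
    return sum(reward(sampler(rng)) for _ in range(n)) / n

def reward_error(proxy, true_reward, sampler, p=2, n=20_000, seed=0):
    """Monte Carlo estimate of eps_p(r_hat; pi) = (E[|r_hat - r*|^p])^(1/p)."""
    rng = random.Random(seed)
    total = sum(abs(proxy(y) - true_reward(y)) ** p
                for y in (sampler(rng) for _ in range(n)))
    return (total / n) ** (1.0 / p)

# Toy setup: "responses" are scalars y ~ Uniform(0, 1), the true reward is y
# itself, and the proxy over-estimates it by a constant 0.1 (capped at 1).
sample = lambda rng: rng.random()
r_true = lambda y: y
r_hat = lambda y: min(y + 0.1, 1.0)

J_hat = expected_reward(r_hat, sample)      # analytically 0.595
eps = reward_error(r_hat, r_true, sample)   # analytically about 0.097
```

The constant over-estimation here is a crude stand-in for proxy mis-calibration; ε_2 directly measures it under the sampling policy.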
We define the α-exponential as exp_α(x) = [1 + (α−1)x]_+^{1/(α−1)} and the α-logarithm as log_α(x) = (x^{α−1} − 1)/(α−1); both converge to the standard exponential and logarithm as α → 1 (Lemma A.1).

Inference-time alignment. Ideally, LLM alignment seeks a policy that maximizes the expected true reward: π*(·|x) = argmax_{π∈Δ(Y)} J_{r*}(π; x). However, since the true reward r* is typically inaccessible, practitioners must optimize a proxy objective J_{r̂}(π; x) instead. To prevent the model from over-exploiting the imperfect proxy, it is standard to include an explicit distortion penalty. For instance, given λ ≥ 0, reinforcement learning from human feedback (RLHF) (Christiano et al., 2017) uses KL regularization:

max_{π∈Δ(Y)} J_{r̂}(π; x) − λ D_KL(π ‖ π_ref),    (1)

which limits how far the aligned policy can deviate from the reference policy. Yet, optimizing this objective via iterative parameter updates is often infeasible, as modern LLMs are frequently accessible only as black boxes (e.g., via commercial APIs) or are prohibitively expensive to fine-tune. These limitations motivate the study of inference-time alignment, which aims to steer model behavior dynamically at inference time without updating model parameters. An inference-time alignment method begins with a reference policy π_ref that is capable of generating high-reward responses by chance. The method then steers the output distribution by re-weighting these responses using a weight function w(y|x), which is typically designed to be monotonically increasing in the proxy reward r̂(x, y). At the distribution level, this induces a re-weighted policy

π_w(y|x) ∝ π_ref(y|x) w(y|x),    (2)

which can be interpreted as an “aligned” version of π_ref that favors higher proxy-reward responses.
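The α-exponential and α-logarithm defined above are straightforward to implement directly; a small sketch (helper names are ours):

```python
import math

def exp_alpha(x, alpha):
    """alpha-exponential: [1 + (alpha - 1) * x]_+ ** (1 / (alpha - 1));
    recovers exp(x) in the limit alpha -> 1."""
    if abs(alpha - 1.0) < 1e-12:
        return math.exp(x)
    base = max(1.0 + (alpha - 1.0) * x, 0.0)   # truncation at zero: [.]_+
    return base ** (1.0 / (alpha - 1.0))

def log_alpha(x, alpha):
    """alpha-logarithm: (x ** (alpha - 1) - 1) / (alpha - 1);
    recovers log(x) in the limit alpha -> 1."""
    if abs(alpha - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (alpha - 1.0) - 1.0) / (alpha - 1.0)

# alpha near 1 approximates the standard exp/log; alpha = 2 is linear:
print(exp_alpha(0.5, 1.0001))  # close to exp(0.5)
print(exp_alpha(0.5, 2.0))     # [1 + 0.5]_+ = 1.5
print(log_alpha(2.0, 2.0))     # 2 - 1 = 1.0
```

The truncation [·]_+ is what later allows α = 2 to zero out low-reward candidates rather than merely down-weighting them.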
Since direct access to the full reference distribution is generally unavailable, inference-time alignment procedures operate on empirical samples. Consequently, the procedure first draws a finite set of i.i.d. candidates Y_N = {y_1, …, y_N} ∼ π_ref(·|x), and then approximates π_w(y|x) using techniques such as weighted sampling (Verdun et al., 2025) or rejection sampling (Liu et al., 2024; Block and Polyanskiy, 2023). We denote this finite-sample approximation by π̂_w(y|x). A final “best” response is then selected according to π̂_w(y|x), typically by sampling (Verdun et al., 2025; Huang et al., 2025b) or by greedily choosing the highest-scoring candidate (Stiennon et al., 2020). This sample–evaluate–select pipeline has become a standard framework for inference-time alignment (Huang et al., 2025a).

Inference-time regret analysis. The effectiveness of an inference-time alignment policy π̂_w(y|x) can be quantified by the loss in expected true reward caused by imperfect alignment under a finite sample budget N. This quantity, termed the inference-time regret (Huang et al., 2025b; Aminian et al., 2025), is defined as

regret(π̂_w; x) = J_{r*}(π*; x) − J_{r*}(π̂_w; x).    (3)

Analyzing the upper bound of this regret cleanly isolates key factors that govern the design of effective inference-time alignment strategies.

Proposition 1. For a given prompt x ∈ X, the inference-time regret for a general inference-time alignment policy π̂_w(y|x), defined via a re-weighting function w(y|x), admits the following upper bound (a more general upper bound is provided in Appendix A.1):

regret(π̂_w; x) ≤ 2 R_max D_TV(π̂_w(·|x) ‖ π_w(·|x))   [term (i)]
    − Δ(π_w(·|x))   [term (ii)]
    + √(C^{π*}(x) − 1) [term (iii)] · ( ε_2(r̂; π_ref(·|x)) [term (iv)] + ‖r̂‖_{L²(π_ref(·|x))} )
    + √(1 + D_{χ²}(π_w(·|x) ‖ π_ref(·|x))) [term (v)] · ε_2(r̂; π_ref(·|x)).    (4)

Here, D_TV(·‖·) is the total variation distance (Lehmann and Romano, 2005). Δ(π_w(·|x)) = J_{r̂}(π_w) − J_{r̂}(π_ref) represents the alignment gain, i.e., the improvement in expected proxy reward achieved by shifting from π_ref to π_w. We denote by C^{π*}(x) = E_{y∼π*(·|x)}[π*(y|x)/π_ref(y|x)] the coverage of the reference policy π_ref with respect to the optimal policy π*. Finally, D_{χ²}(π_w(·|x) ‖ π_ref(·|x)) measures the distortion of the aligned policy π_w from π_ref via the χ² divergence. Apart from term (i), which arises solely from the finite-sample approximation of π_w using the candidate set Y_N, the remaining terms (ii)–(v) each capture a distinct axis along which inference-time alignment methods can be analyzed and improved. Crucially, the bound in Eq. (4) offers a more nuanced perspective than recent analyses (e.g., Huang et al. (2025b)), which tend to be dominated by the oracle constant C^{π*}(x) and consequently overlook the structural constraints imposed by π_ref. To explicitly investigate how the reward tail governs alignment dynamics, we focus on the interplay between terms (ii) and (v). Together, these terms mirror the core RLHF objective (Eq. (1)), forming a trade-off to maximize reward while penalizing distortion. We next iterate over these terms and survey the related work.

Alignment gain (term (ii)). Empirically, Best-of-N (BoN) (Stiennon et al., 2020) effectively maximizes the alignment gain by applying a hard selection rule that chooses the single highest-scoring candidate. However, this aggressive optimization is prone to reward hacking (Gao et al., 2022; Skalse et al., 2022).
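The hard BoN selection rule just described fits in a few lines; the candidates and scores below are hypothetical stand-ins.

```python
def best_of_n(candidates, proxy_reward):
    """Hard Best-of-N: score every candidate with the proxy reward model
    and greedily return the single highest-scoring one."""
    return max(candidates, key=proxy_reward)

# Toy usage: candidates are strings, the "reward model" is a lookup table.
cands = ["draft a", "draft b", "draft c"]
scores = {"draft a": 0.31, "draft b": 0.87, "draft c": 0.54}
best = best_of_n(cands, scores.get)
print(best)  # draft b
```

Note that the rule depends only on the argmax of the proxy scores, which is precisely why it inherits any mis-calibration in the extreme upper tail.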
To mitigate this, soft-BoN (sBoN) (Jinnai et al., 2024; Verdun et al., 2025) relaxes the hard maximum with a softmax weighting,

π_sBoN(y|x) ∝ π_ref(y|x) exp(r̂(x, y)/λ),    (5)

thereby smoothing the selection to preserve exploration. Extending this probabilistic view beyond one-shot re-weighting, Faria and Smith (2025) formulate alignment as a Markov chain Monte Carlo process guided by process-based rewards; while theoretically more expressive, this approach incurs significantly higher computational overhead due to iterative sampling.

Coverage (term (iii)). The central premise of inference-time alignment is that π_ref already places non-negligible probability mass on high-reward responses, so they can be uncovered by sampling (Snell et al., 2025). Inference-time alignment is effective only if π_ref has sufficient coverage of high-reward responses, so that Y_N is likely to include near-optimal candidates. This is captured by C^{π*}(x) in Eq. (4): smaller values indicate better overlap between π_ref and π*. Consequently, a key line of work—inference-aware fine-tuning—focuses on improving this coverage. For instance, Gui et al. (2024) fine-tune π_ref on BoN samples via supervised fine-tuning, while Balashankar et al. (2024) optimize reward transformations to stabilize this process. From a distribution-matching perspective, Amini et al. (2025) align π_ref by minimizing the KL divergence between it and the theoretical BoN distribution, whereas Sessa et al. (2025) employ Jeffreys divergence (Jeffreys, 1946) for greater robustness. Beyond parameter updates, recent work also leverages in-context learning to dynamically enhance coverage at inference time (Lin et al., 2024).

Reward estimate (term (iv)). Reward estimation shapes inference-time alignment both through the quality of the proxy and through how the reward is modeled and used during generation.
In Eq. (4), reducing ε_2(r̂; π_ref(·|x)) improves regret, motivating alternative reward objectives and parameterizations. For example, Sun et al. (2025) argue that the Bradley-Terry model is not uniquely required, proposing order consistency and a classification-style upper-bound objective, while Coste et al. (2024) learn reward ensembles to reduce annotation noise. More recently, Geuter et al. (2025) and Liao et al. (2025) couple reward modeling with speculative decoding (Leviathan et al., 2023) to improve inference efficiency: a lightweight auxiliary policy proposes candidates, and a π_ref-aware reward guides which candidates are selected.

Distortion (term (v)). The distortion regularizer is closely related to the induced steering rule: optimizing KL-regularized RLHF in Eq. (1) yields an exponential re-weighting of the reference policy, exactly the soft-BoN policy in Eq. (5) (Korbak et al., 2022; Gui et al., 2024; Beirami et al., 2025; Yang et al., 2024a; Khalaf et al., 2025). While KL is a common default, recent work has explored alternative regularizers. For instance, Jinnai et al. (2024) demonstrate that minimizing Bayes risk in BoN corresponds to Wasserstein regularization. More critical to robust alignment, Huang et al. (2025c) argue that KL regularization can be too weak to reliably prevent reward hacking. They instead propose χ² regularization, i.e., max_{π∈Δ(Y)} J_{r̂}(π; x) − λ D_{χ²}(π ‖ π_ref). This objective induces a linear re-weighting of the reference policy, yielding the Inference-Time Pessimism (ITP) policy (Huang et al., 2025b):

π_ITP(y|x) ∝ π_ref(y|x) [r̂(x, y)/λ]_+.    (6)

This linear scaling represents a more “pessimistic” strategy specifically designed to mitigate reward hacking in inference-time alignment.

3 Reward Tails and Alignment Regret
Beyond the explicit terms in the regret bound of Eq.
(4), our analysis highlights a critical, often implicit factor: the tail behavior of the imperfect proxy reward r̂ under π_ref. By explicitly coupling alignment gain with distortion, our framework provides a tractable way to analyze how the reward tail impacts inference-time alignment. This perspective is crucial because the alignment trade-off is fundamentally driven by how the weighting function w(·|x) shifts probability mass toward the high-reward tail, where rare, extreme proxy scores can dominate steering, amplify distortion, and trigger reward hacking. In contrast, prior analyses such as Huang et al. (2025b) do not capture this structural trade-off, focusing instead on broad reward estimation errors. To systematically analyze how optimistic versus pessimistic strategies govern inference-time alignment, we look beyond aggregate metrics and examine the tail behavior of the proxy reward distribution. In this analysis, we focus on two representative strategies: sBoN (optimistic), which employs exponential re-weighting (Eq. (5)) and includes standard BoN as a limiting case (λ → 0); and ITP (pessimistic), which uses linear re-weighting (Eq. (6)) to explicitly constrain deviation from the reference policy.

We start by characterizing the reward tail of r̂(x, y) under π_ref(·|x). Fix a prompt x and assume the proxy reward is normalized to r̂(x, y) ∈ [0, 1]. Since the reward is bounded, its tail behavior is defined by the concentration of probability mass near the maximum value 1. We characterize this using the endpoint gap U(x, y) = 1 − r̂(x, y), which maps upper-tail events of r̂ to lower-tail events of U. Motivated by the Hill estimator from extreme value theory (Hill, 1975), we parameterize this endpoint tail using a prompt-dependent scalar κ(x) > 0 via a Weibull-type model:

Pr(U ≤ u) ≈ u^{1/κ(x)}  ⇔  U ∼ Beta(1/κ(x), 1).    (7)

In this bounded setting, a “heavy tail” (κ → ∞) implies a high density of responses near the maximum reward, while a “light tail” (κ → 0) implies that high-reward responses are exponentially rare.

3.1 Decomposing the Regret Bound
To understand how tail behavior governs alignment performance, we analyze the regret bound in Eq. (4), specifically focusing on how the reward tail impacts the trade-off between alignment gain and distortion. Isolating the terms dependent on π_w, we define a regret proxy B(π_w) that aggregates finite-sample error (i), gain (ii), and distortion (v). (The regret proxy is actually prompt-dependent; we omit the x in the notation for simplicity.) Using standard rejection sampling bounds and re-parameterizing the χ²-divergence via moments of w(r̂), we obtain a tractable upper bound:

B(π_w) ≤ 2 exp(−N E[w]/sup w)  [sampling error]  +  ε √(E[w²])/E[w]  [distortion]  −  Δ(π_w)  [gain],    (8)

where expectations are taken over π_ref. This formulation explicitly reveals an inherent gain-distortion trade-off. Any strategy that aggressively re-weights the top-reward region to maximize alignment gain inevitably inflates the higher moments of w(r̂). This consequently increases distortion, which can outweigh the benefits of the realized gain, particularly in heavy-tailed regimes.

3.2 Light vs. Heavy Tails: When to be Optimistic?
We now apply this framework to compare sBoN and ITP. Proposition 2 quantifies the regret proxy B(π_w) for these strategies under extreme tail conditions.

Proposition 2. Fix a prompt x, and assume U(x, y) = 1 − r̂(x, y) ∼ Beta(1/κ, 1). Let λ > 0 be the inverse steering temperature.

• Heavy-tailed regime (large κ):

B(π_sBoN) ≤ 2 e^{−N/(1 + 1/(λκ))} + ε (1 + 1/(4λ²κ)) − (1/κ)(1 − λ(1 − e^{−1/λ})) + o(κ^{−1}),    (9)
B(π_ITP) ≤ 2 e^{−N/(1 + 1/((λ+1)κ))} + ε (1 + 1/(4(λ+1)²κ)) − 1/(2κ(λ+1)) + o(κ^{−1}).

• Light-tailed regime (small κ):

B(π_sBoN) ≤ 2 e^{−N(λ+κ)/(λ(λ+1))} + ε (1 + κ²/(2λ²)) − (κ²/λ + κ³/λ²) + o(κ⁴),    (10)
B(π_ITP) ≤ 2 e^{−N(λ+κ)/(λ(λ+1))} + ε (1 + κ²/(2λ²)) − κ²/λ + o(κ³).

The asymptotic behaviors in Proposition 2 reveal why neither a fixed optimistic nor a fixed pessimistic strategy is universally optimal:

• Heavy tails favor pessimism (ITP). Here, the reward distribution features a long tail of high scores, creating a high risk of over-fitting (reward hacking). Consequently, the primary constraint is preventing unbounded distortion. sBoN fails here because its distortion penalty scales as O(1/λ²), exploding under aggressive steering (λ → 0). In contrast, ITP is inherently robust: its penalty scales as O(1/(λ+1)²), remaining bounded even as λ → 0, effectively acting as a saturation mechanism against heavy-tailed noise.

• Light tails favor optimism (soft-BoN). In this regime, high-reward candidates are exponentially rare (“finding a needle in a haystack”), while distortion remains naturally mild for both strategies. The primary challenge is therefore gain realization. sBoN proves superior here, with a gain term scaling as O(κ³/λ²) compared to ITP’s O(κ²/λ). This confirms that the aggressive, exponential re-weighting of sBoN is essential to separate and select rare optimal candidates, whereas the linear re-weighting of ITP is too conservative to fully capture the potential gain.

This dichotomy motivates a prompt-adaptive strategy that dynamically interpolates between optimism and pessimism based on the specific tail behavior of each prompt, as developed in the next section.
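A quick simulation makes the two regimes of the endpoint model in Eq. (7) concrete: heavy tails put substantial mass right next to the maximum reward, while light tails make near-maximal rewards vanishingly rare. The κ values below are arbitrary illustrative choices.

```python
import random

def sample_rewards(kappa, n, rng):
    """Draw proxy rewards under the Weibull-type endpoint model of Eq. (7):
    U = 1 - r_hat satisfies P(U <= u) = u**(1/kappa), which is sampled by
    inverse-CDF as U = V**kappa for V ~ Uniform(0, 1)."""
    return [1.0 - rng.random() ** kappa for _ in range(n)]

rng = random.Random(0)
n = 5000
heavy = sample_rewards(kappa=5.0, n=n, rng=rng)   # heavy tail: mass near r = 1
light = sample_rewards(kappa=0.2, n=n, rng=rng)   # light tail: high r is rare

# Probability of a near-maximal reward r > 0.9, i.e. U < 0.1:
frac_heavy = sum(r > 0.9 for r in heavy) / n  # model predicts 0.1**(1/5) ~ 0.63
frac_light = sum(r > 0.9 for r in light) / n  # model predicts 0.1**5 = 1e-5
print(frac_heavy, frac_light)
```

In the heavy-tailed draw, most of the candidate pool already sits near the reward ceiling, which is exactly where an exponential re-weighting chases tiny, possibly mis-calibrated score differences.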
4 BoT: A Tail-Adaptive Alignment
The dichotomy between the exponential re-weighting of sBoN (optimistic) and the linear re-weighting of ITP (pessimistic) stems fundamentally from the choice of probabilistic divergence used to regularize the policy. Specifically, sBoN corresponds to KL-regularization (D_KL), while ITP corresponds to χ²-regularization (D_{χ²}). To bridge these regimes, we introduce Best-of-Tails (BoT), a unified framework that adapts the regularization to the specific tail behavior of the prompt. BoT leverages the Tsallis divergence of order α > 1 (Tsallis, 1988) to smoothly interpolate between the aggressive exploration of KL and the robust mode-seeking of χ²:

D_α(P ‖ Q) = E_P[log_α(P/Q)] = (1/(α−1)) (Σ_x P(x)^α Q(x)^{1−α} − 1).    (11)

Crucially, this generalizes the standard regularizers: D_α converges to the KL divergence (D_KL) as α → 1, and recovers the χ² divergence (D_{χ²}) at α = 2. For a fixed prompt x, we define the prompt-conditional BoT objective analogous to the standard RLHF objective:

L_α(π; x) = E_{y∼π}[r̂(x, y)] − λ D_α(π(·|x) ‖ π_ref(·|x)).    (12)

Let π_BoT(·|x) denote the optimal solution to Eq. (12). We show that this optimal policy retains a closed-form re-weighting structure.

Proposition 3. The BoT policy π_BoT(·|x) follows an α-exponential re-weighting:

π_BoT(y|x) ∝ π_ref(y|x) exp_α(r̂(x, y)/λ).    (13)

The parameter α explicitly controls the behavior of BoT. As α → 1, exp_α(u) → e^u, recovering π_sBoN; at α = 2, exp_α(u) becomes linear, recovering π_ITP. Intermediate α values provide a continuum between “optimistic but fragile” and “robust but pessimistic,” allowing the regularization to match the specific tail properties of the proxy reward. It is this capability to adaptively exploit the reward tail that motivates the name Best-of-Tails.
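Proposition 3's α-exponential re-weighting can be sketched directly. This toy snippet (helper names are ours, candidates assumed i.i.d. so uniform base weights stand in for π_ref) shows α → 1 behaving like the sBoN softmax and α = 2 giving a linear, ITP-style weighting.

```python
import math

def bot_weights(rewards, alpha, lam):
    """BoT re-weighting (Prop. 3): w_i = exp_alpha(r_i / lam), normalized
    over a set of i.i.d. candidates (so pi_ref contributes uniformly)."""
    if abs(alpha - 1.0) < 1e-12:
        w = [math.exp(r / lam) for r in rewards]          # sBoN (softmax) limit
    else:
        w = [max(1.0 + (alpha - 1.0) * r / lam, 0.0) ** (1.0 / (alpha - 1.0))
             for r in rewards]                            # alpha-exponential
    z = sum(w)
    return [x / z for x in w]

rewards = [0.1, 0.5, 0.9]
p_optimistic = bot_weights(rewards, alpha=1.0, lam=0.2)   # exponential in r
p_pessimistic = bot_weights(rewards, alpha=2.0, lam=0.2)  # linear in r
print(p_optimistic, p_pessimistic)
```

With these toy numbers, the α = 1 weighting concentrates most of its mass on the top candidate, while the α = 2 weighting spreads it far more evenly, matching the optimistic-versus-pessimistic behavior described above.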
4.1 Deciding the Prompt-Dependent α(x)
Building on the analysis in § 3, we derive asymptotic expansions of the regret proxy B(π_BoT; x) to make its dependence on the interpolation parameter α, the tail index κ, and the steering temperature λ explicit, and show how the optimal interpolation parameter α adapts to tail heaviness (through κ) to minimize the regret proxy.

Proposition 4. Fix a prompt x with normalized reward r̂ ∈ [0, 1]. For α ∈ (1, 2] and λ > 0, let the α-exponential re-weighting be w_α = [1 + (α−1) r̂/λ]_+^{1/(α−1)}. The regret proxy B = B(π_BoT) is given by:

B = e^{−N m₀(α; κ, λ)} + ε ‖w_α‖_{L²}/‖w_α‖_{L¹} − Δ(π_BoT),    (14)

where m₀(α; κ, λ) := E_κ[(1 − ρU)^{1/(α−1)}] with ρ = ((α−1)/λ)/(1 + (α−1)/λ), and the norms are taken with respect to π_ref. Moreover, the following asymptotic expansions hold:

• Heavy-tailed regime (large κ):

B ≈ const. + (1/κ) [ ε/(4(λ+α−1)²) + (N e^{−N}/(4(λ+α−1)²)) (4λ − 1/λ + 5(α−1)) − (α−1)/(2(λ+α−1)) ].    (15)

• Light-tailed regime (small κ):

B ≈ const. − κ²/λ − (3−2α) κ³/λ² + ε (1 + κ²/(2λ²)).    (16)

The expansions in Proposition 4 rigorously justify the adaptive mechanism of BoT by analyzing the gradient of the regret proxy B with respect to α. We again split the discussion into the heavy- and light-tailed regimes. In the heavy-tailed regime, the distortion penalty (the coefficient of ε in Eq. (15)) is proportional to (λ+α−1)^{−2}. Differentiating this term with respect to α yields a strictly negative gradient proportional to −(λ+α−1)^{−3}. Since the distortion penalty dominates the regret bound (amplified by ε), minimizing B requires maximizing α. Thus, the optimal strategy converges to the upper boundary α → 2. This confirms that heavy tails mathematically necessitate the bounded influence of χ²-regularization (ITP) to prevent reward hacking.
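As a numeric sanity check on the light-tailed expansion, one can evaluate the α-dependent part of Eq. (16) and confirm it increases in α, so the proxy is minimized at the α → 1 boundary. The constants below are arbitrary illustrative choices.

```python
def b_light(alpha, kappa, lam, eps):
    """alpha-dependent part of the light-tail expansion (Eq. 16),
    dropping the additive constant."""
    return (-kappa**2 / lam
            - (3.0 - 2.0 * alpha) * kappa**3 / lam**2
            + eps * (1.0 + kappa**2 / (2.0 * lam**2)))

kappa, lam, eps = 0.1, 0.5, 0.05
values = [b_light(a, kappa, lam, eps) for a in (1.0, 1.5, 2.0)]
# d/d(alpha) = 2 * kappa**3 / lam**2 > 0, so the values are increasing.
print(values)
```

Only the −(3−2α)κ³/λ² term depends on α, so the gradient 2κ³/λ² quoted in the text falls out immediately.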
Conversely, in the light-tailed regime, distortion is bounded, prioritizing alignment gain. Differentiating Eq. (16) with respect to $\alpha$ yields a positive gradient ($2\kappa^3/\lambda^2>0$), implying the regret bound is minimized at $\alpha\to 1$. This justifies sBoN in light-tailed settings: minimizing regret requires aggressive re-weighting to maximize the probability of discovering rare high-reward responses. With the theoretical monotonicity of the optimal $\alpha$ with respect to $\kappa$ established, we now turn to practical implementation.

Algorithm 1 Best-of-Tails (BoT) Inference
1: Input: Prompt $x$, reference policy $\pi_{\mathrm{ref}}$, reward model $\hat{r}$ (normalized to $[0,1]$ by dividing by $R_{\max}$).
2: Hyper-parameters: sample count $N$, Hill estimation top-$K$, pivot $\kappa_0$, steering temperature $\lambda$.
3: // Phase 1: Sampling & Scoring
4: Sample the candidate set $\mathcal{Y}_N=\{y_1,\ldots,y_N\}\sim\pi_{\mathrm{ref}}(\cdot|x)$.
5: Compute rewards $r_i\leftarrow\hat{r}(x,y_i)$ for $i=1,\ldots,N$.
6: // Phase 2: Tail Estimation (Hill Estimator)
7: Sort rewards in descending order: $r_{(1)}\ge r_{(2)}\ge\cdots\ge r_{(N)}$.
8: Estimate the tail index $\hat{\kappa}$ using the top $K$ order statistics:
9: $\hat{\kappa}\leftarrow\frac{1}{K}\sum_{j=1}^{K}\log\!\left(\frac{1-r_{(K+1)}}{1-r_{(j)}}\right)$. (Eq. (17))
10: // Phase 3: Adaptive Interpolation
11: Compute the adaptive parameter $\alpha$:
12: $\alpha\leftarrow 1+\frac{\hat{\kappa}}{\hat{\kappa}+\kappa_0}$. (Eq. (18))
13: // Phase 4: Re-weighting & Selection
14: for $i=1$ to $N$ do
15: Calculate the unnormalized weight using the $\alpha$-exponential:
16: $w_i\leftarrow\exp_\alpha(r_i/\lambda)=\left[1+(\alpha-1)\frac{r_i}{\lambda}\right]_+^{\frac{1}{\alpha-1}}$. (Prop. 3)
17: end for
18: Normalize the weights: $p_i\leftarrow w_i/\sum_j w_j$.
19: Return: sample $y^\star$ from $\mathcal{Y}_N$ with probabilities $\{p_i\}$.

Figure 2: Optimistic (sBoN/BoN), pessimistic (ITP), and adaptive (BoT) strategies on GSM8K, MMLU, and MATH (rows) across different reference models and proxy rewards (columns). Each curve traces the trajectory of true reward ($r^*$) vs. proxy reward ($\hat{r}$) as the sample size $N$ increases from $2^0$ to $2^{10}$, with marker size proportional to $N$.
The smallest marker represents $N=1$ (standard sampling), while the largest represents $N=1024$. Optimistic strategies typically over-optimize the proxy reward as $N$ grows, leading to reward hacking (where $r^*$ degrades despite increasing $\hat{r}$). Conversely, ITP tends to saturate early, failing to leverage larger $N$ for further gains. The proposed BoT successfully navigates this trade-off, achieving higher true and proxy rewards without succumbing to reward hacking.

4.2 Practical Implementation

To estimate the tail index $\kappa(x)$, one approach is to model the full per-prompt reward distribution (e.g., via Gaussian kernels in Raman et al. (2025)). However, directly estimating tail statistics has significantly lower sample complexity than estimating the full distribution (see Appendix B.2). Our approach is therefore distinct: rather than modeling the entire density, we explicitly target the tail. In practice, the prompt-specific tail index $\kappa(x)$ is latent and must be estimated empirically from the proxy rewards of the sampled candidates. To this end, we employ the standard Hill estimator (Hill, 1975). Given $N$ sampled responses with rewards sorted as $r_{(1)}\ge r_{(2)}\ge\cdots\ge r_{(N)}$, we estimate $\hat{\kappa}(x)$ using the top $K$ order statistics (where $K<N$):

$$\hat{\kappa}(x)=\frac{1}{K}\sum_{i=1}^{K}\log\!\left(\frac{1-r_{(K+1)}}{1-r_{(i)}}\right). \quad (17)$$

See Appendix B.1 for details on the Hill estimator and Appendix B.2 for its sample complexity. Since Proposition 4 establishes the monotonicity of the optimal $\alpha$, we propose a simple mapping to adaptively set $\alpha(x)$:

$$\alpha(x)=1+\frac{\hat{\kappa}(x)}{\hat{\kappa}(x)+\kappa_0}, \quad (18)$$

where $\kappa_0$ is a hyperparameter serving as a pivot. This mapping smoothly interpolates between the regimes: if the tail is light ($\hat{\kappa}(x)\ll\kappa_0$), then $\alpha(x)\to 1$ (sBoN); otherwise, $\alpha(x)\to 2$ (ITP). We summarize the complete BoT inference procedure built in this section in Algorithm 1.
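The full procedure (Phases 2-4 of Algorithm 1, applied to pre-computed proxy rewards) can be sketched as follows; the function name `bot_select`, the top-$K$ default of $\max(1,\lfloor\sqrt{N}\rfloor)$, and the default $\lambda$ and $\kappa_0$ values are our own illustrative assumptions, not the paper's tuned settings:

```python
import math
import random

def bot_select(rewards, lam=0.5, kappa0=0.1, K=None, rng=None):
    """Sketch of Algorithm 1 (Phases 2-4) on pre-computed proxy rewards in [0, 1)."""
    rng = rng or random.Random(0)
    N = len(rewards)
    K = K if K is not None else max(1, int(math.isqrt(N)))  # illustrative top-K cutoff
    # Phase 2: Hill estimator on the upper reward tail, Eq. (17).
    srt = sorted(rewards, reverse=True)
    kappa = sum(math.log((1 - srt[K]) / (1 - srt[j])) for j in range(K)) / K
    # Phase 3: adaptive interpolation, Eq. (18); alpha lies in (1, 2).
    alpha = 1 + kappa / (kappa + kappa0)
    # Phase 4: alpha-exponential re-weighting (Prop. 3), then sample one candidate.
    if alpha - 1 < 1e-9:  # alpha -> 1 limit: plain exponential tilt (sBoN)
        w = [math.exp(r / lam) for r in rewards]
    else:
        w = [max(1 + (alpha - 1) * r / lam, 0.0) ** (1 / (alpha - 1)) for r in rewards]
    total = sum(w)
    idx = rng.choices(range(N), weights=[wi / total for wi in w], k=1)[0]
    return idx, kappa, alpha

# Synthetic check: r = 1 - V**0.7 with V ~ Uniform(0,1) gives an endpoint gap
# 1 - r with tail index 0.7, which the Hill step should approximately recover.
rng = random.Random(1)
rewards = [1 - rng.random() ** 0.7 for _ in range(4096)]
idx, kappa_hat, alpha = bot_select(rewards, K=256, rng=rng)
```

On these synthetic rewards, whose endpoint gap has tail index 0.7 by construction, the Hill step should recover $\hat{\kappa}\approx 0.7$ and map it to an $\alpha$ strictly between the sBoN ($\alpha=1$) and ITP ($\alpha=2$) endpoints.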
5 Empirical Study

We complement our theoretical results with a suite of experiments that investigate the practicality of BoT, comparing it against the re-weighting alignment algorithms introduced in § 2, i.e., BoN, sBoN, and ITP.

Datasets and evaluation. We consider two experimental setups. In the first, we focus on benchmark tasks where the true reward $r^*$ is verifiable via rule-based answer checking. Specifically, we evaluate on three datasets of increasing difficulty: (i) the test split of GSM8K (Cobbe et al., 2021), a grade-school math dataset with integer answers; (ii) the math and chemistry subsets of MMLU (Hendrycks et al., 2021a), framed as multiple-choice questions; and (iii) the test split of MATH (Hendrycks et al., 2021b), which contains advanced competition-style problems with answers expressed in LaTeX. In the second setup, we study human-preference alignment using AlpacaFarm (Dubois et al., 2023), where the true reward is provided by an LLM-as-a-judge using Alpaca-RM (Dubois et al., 2023).

Policies and reward models. For the first setup, we evaluate three reference policies $\pi_{\mathrm{ref}}$ spanning different architectures and increasing model sizes: Gemma-2-2B (Gemma et al., 2024), Llama-3-3B (Grattafiori et al., 2024), and Mistral-7B (Jiang et al., 2023). We prompt all $\pi_{\mathrm{ref}}$ using zero-shot CoT (Wei et al., 2022), i.e., with no in-context demonstrations. For the second setup, we use a Pythia-1.4B fine-tuned on AlpacaFarm as $\pi_{\mathrm{ref}}$ (Biderman et al., 2023). For both setups, we consider four proxy reward models $\hat{r}$: OASST-RM (1.4B) (Köpf et al., 2023), Gemma-RM (2B) (Dong et al., 2023), Llama-RM (3B) (Yang et al., 2024b), and ARMO-RM (8B) (Wang et al., 2024). For each dataset-policy-reward combination, we generate a candidate pool of $2^{12}$ responses per prompt and conduct experiments with varying sample sizes $N$ by bootstrapping subsets from this pool, reporting results averaged over 10 independent trials.
We set $K=\max(1,N)$ in the Hill estimator (Eq. (17)) and define $\kappa_0$ as the median of $\hat{\kappa}(x)$ across prompts. See Appendix B.3 for details.

Verifier as true reward. Figure 2 compares the performance of BoN, sBoN, ITP, and BoT by tracing the trajectory of the true reward $r^*$ against the proxy reward $\hat{r}$ as the sample size $N$ increases. The trajectory begins at $N=1$ (smallest marker) and extends to $N=1024$ (largest marker). Initially, increasing $N$ improves the true reward (accuracy) across all methods, demonstrating the value of inference-time alignment in selecting the "best" responses from $\mathcal{Y}_N$. However, as $N$ grows further, optimistic strategies like BoN and sBoN, owing to their aggressive re-weighting of high-proxy responses, succumb to reward hacking. This is evidenced by the characteristic reversion in the curve, where $r^*$ degrades despite continued increases in $\hat{r}$. Conversely, the pessimistic ITP remains highly conservative; while robust, it effectively saturates early, with both $\hat{r}$ and $r^*$ stalling once $N$ exceeds a certain threshold. Consequently, despite preventing reward hacking, ITP may miss significant potential gains in regimes where the proxy is reliable. In contrast, the proposed adaptive BoT effectively navigates this trade-off, reaching higher peak true rewards than ITP while maintaining the robustness required to avoid the reward-hacking pitfalls of the optimistic baselines.

Figure 3: Left: Performance comparison of optimistic (sBoN/BoN), pessimistic (ITP), and adaptive (BoT) strategies on AlpacaFarm, using Llama-RM as the proxy and Alpaca-RM as the true reward. Right: The prompt-specific Hill estimates $\hat{\kappa}(x)$ and the corresponding adaptive parameter $\alpha(x)$. By design, BoT minimizes inference regret by shifting $\alpha\to 2$ under heavy tails (large $\hat{\kappa}$) to enforce robustness, while allowing $\alpha\to 1$ under light tails (small $\hat{\kappa}$) to enable aggressive exploration.

LLM as true reward.
We observe similar trade-offs on AlpacaFarm, shown in Figure 3 (left). Note that in this setting, the true reward is the evaluation score from Alpaca-RM rather than exact-match accuracy. Consistent with the verifier results, alignment initially improves performance for all methods. However, sBoN and BoN eventually drift into the reward-hacking regime, while ITP exhibits "early" robustness but premature saturation. In contrast, BoT successfully sustains high true rewards without degradation in the large-$N$ regime. To explain this adaptability, Figure 3 (right) visualizes the Hill estimates $\hat{\kappa}(x)$ (Eq. (17)) for 1024 responses across 500 different prompts, along with the corresponding adaptive $\alpha(x)$ (with $\kappa_0=0.1$ in Eq. (18)). The plot reveals significant diversity in the reward distributions, ranging from light tails (small $\hat{\kappa}(x)$) to heavy tails (large $\hat{\kappa}(x)$). This heterogeneity confirms that fixing a static strategy, whether optimistic sBoN ($\alpha=1.0$) or pessimistic ITP ($\alpha=2.0$), is suboptimal, as it limits the flexibility of inference-time alignment to exploit high-reward responses when safe, or to constrain them when risky. We defer additional results to Appendix C, including full comparisons (Appendix C.1), sample scaling curves ($r^*$ and $\hat{r}$ vs. $N$) with gain-distortion analysis (Appendix C.2), and ablation studies on $\lambda$ and $\kappa_0$ (Appendix C.3).

6 Final Remark

Here we reflect on limitations and highlight interesting avenues for future work.

Limitations. BoT's effectiveness relies on accurate tail estimation. While asymptotically consistent, the Hill estimator can exhibit high variance in small-sample regimes ($N<50$), potentially causing noisy $\alpha(x)$ selection. Furthermore, our mapping function $\alpha(x)$ is a heuristic interpolation; although it respects the theoretical monotonicity of the optimal strategy, it may not perfectly capture the ideal trajectory for every reward landscape.
Additionally, BoT introduces hyper-parameters (the pivot $\kappa_0$ and the cutoff $K$) that require validation. Finally, a multi-modal reward distribution can confound tail estimation, leading to mis-calibration in BoT.

Future directions. We envision several extensions of the BoT framework. First, reward modeling could be enhanced via ensembles (Coste et al., 2024) to mitigate heavy-tailed errors at the source. To reduce latency, tail estimation could be amortized using a lightweight predictor applied directly to prompt embeddings, or used to dynamically stop generating samples for light-tailed prompts while allocating larger $N$ to heavy-tailed ones to maximize efficiency (Kalayci et al., 2025). Finally, the BoT policy could be distilled into a dense model, embedding tail-adaptive robustness directly into the weights to eliminate sampling overhead.

Impact Statement. This paper presents work whose goal is to advance the field of machine learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Disclaimer. This paper was prepared for informational purposes by the Global Technology Applied Research group of JPMorgan Chase & Co. This paper is not a product of the Research Department of JPMorgan Chase & Co. or its affiliates. Neither JPMorgan Chase & Co. nor any of its affiliates makes any explicit or implied representation or warranty and none of them accept any liability in connection with this paper, including, without limitation, with respect to the completeness, accuracy, or reliability of the information contained herein and the potential legal, compliance, tax, or accounting effects thereof. This document is not intended as investment research or investment advice, or as a recommendation, offer, or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction.

References

A. Amini, T.
Vieira, E. Ash, and R. Cotterell (2025) Variational best-of-N alignment. In Proc. of the International Conference on Learning Representations (ICLR), Cited by: §2. G. Aminian, I. Shenfeld, A. R. Asadi, A. Beirami, and Y. Mroueh (2025) Best-of-N through the smoothing lens: KL divergence and regret analysis. In Proc. of the Workshop on Efficient Systems for Foundation Models at the International Conference on Machine Learning (ICML), Cited by: §2. A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi (2024) Self-RAG: learning to retrieve, generate, and critique through self-reflection. In Proc. of the International Conference on Learning Representations (ICLR), Cited by: §1. A. Balashankar, Z. Sun, J. Berant, J. Eisenstein, M. Collins, A. Hutter, J. Lee, C. Nagpal, F. Prost, A. Sinha, et al. (2024) InfAlign: inference-aware language model alignment. arXiv preprint arXiv:2412.19792. Cited by: §2. A. Beirami, A. Agarwal, J. Berant, A. D’Amour, J. Eisenstein, C. Nagpal, and A. T. Suresh (2025) Theoretical guarantees on the best-of-N alignment policy. In Proc. of the International Conference on Machine Learning (ICML), Cited by: §2. S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. (2023) Pythia: a suite for analyzing large language models across training and scaling. In Proc. of the International Conference on Machine Learning (ICML), Cited by: §B.3, §5. A. Block and Y. Polyanskiy (2023) The sample complexity of approximate rejection sampling with applications to smoothed online learning. In Procs. of the Annual Conference on Learning Theory (COLT), Cited by: §2. B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V. Le, C. Ré, and A. Mirhoseini (2024) Large language monkeys: scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787. Cited by: §1. P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. 
Amodei (2017) Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2. K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: item 1, §5. T. Coste, U. Anwar, R. Kirk, and D. Krueger (2024) Reward model ensembles help mitigate overoptimization. In Proc. of the International Conference on Learning Representations (ICLR), Cited by: item 4, §B.3, §2, §6. L. De Haan and S. Resnick (1998) On asymptotic normality of the hill estimator. Stochastic Models 14 (4), p. 849–866. Cited by: §B.2. H. Dong, W. Xiong, D. Goyal, Y. Zhang, W. Chow, R. Pan, S. Diao, J. Zhang, K. Shum, and T. Zhang (2023) RAFT: reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research (TMLR). External Links: ISSN 2835-8856 Cited by: item 2, §5. Y. Dubois, C. X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. S. Liang, and T. B. Hashimoto (2023) AlpacaFarm: a simulation framework for methods that learn from human feedback. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: item 4, §5. G. Faria and N. A. Smith (2025) Sample, don’t search: rethinking test-time alignment for language models. arXiv preprint arXiv:2504.03790. Cited by: §2. G. Feng, B. Zhang, Y. Gu, H. Ye, D. He, and L. Wang (2023) Towards revealing the mystery behind chain of thought: a theoretical perspective. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1. L. Gao, J. Schulman, and J. Hilton (2022) Scaling laws for reward model overoptimization. In Proc. of the International Conference on Machine Learning (ICML), Cited by: §1, §2. T. Gemma, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari, A. Ramé, et al. 
(2024) Gemma 2: improving open language models at a practical size. arXiv preprint arXiv:2408.00118. Cited by: §B.3, §5. J. Geuter, Y. Mroueh, and D. Alvarez-Melis (2025) Guided speculative inference for efficient test-time alignment of llms. arXiv preprint arXiv:2506.04118. Cited by: §1, §2. A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: §B.3, §5. L. Gui, C. Gârbacea, and V. Veitch (2024) BoNBoN alignment for large language models and the sweetness of best-of-N sampling. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2, §2. D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021a) Measuring massive multitask language understanding. In Proc. of the International Conference on Learning Representations (ICLR), Cited by: item 2, §5. D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021b) Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: item 3, §5. B. M. Hill (1975) A simple general approach to inference about the tail of a distribution. The Annals of Statistics, p. 1163–1174. Cited by: §B.1, §1, §3, §4.2. A. Huang, A. Block, D. J. Foster, D. Rohatgi, C. Zhang, M. Simchowitz, J. T. Ash, and A. Krishnamurthy (2025a) Self-improvement in language models: the sharpening mechanism. In Proc. of the International Conference on Learning Representations (ICLR), Cited by: §2. A. Huang, A. Block, Q. Liu, N. Jiang, A. Krishnamurthy, and D. J. Foster (2025b) Is best-of-N the best of them? coverage, scaling, and optimality in inference-time alignment. In Proc. of the International Conference on Machine Learning (ICML), Cited by: §C.1, §1, §1, §1, §2, §2, §2, §2, §3. A. Huang, W. Zhan, T. Xie, J. D. Lee, W. Sun, A. Krishnamurthy, and D. 
J. Foster (2025c) Correcting the mythos of KL-regularization: direct alignment without overoptimization via chi-squared preference optimization. In Proc. of the International Conference on Learning Representations (ICLR), Cited by: §1, §2. H. Jeffreys (1946) An invariant form for the prior probability in estimation problems. In Proc. of the Royal Society of London. Series A. Mathematical and Physical Sciences, Cited by: §2. A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. (2023) Mistral 7b. arXiv preprint arXiv:2310.06825. Cited by: §B.3, §5. Y. Jinnai, T. Morimura, K. Ariu, and K. Abe (2024) Regularized best-of-N sampling to mitigate reward hacking for language model alignment. In Proc. of the Workshop on Models of Human Feedback for AI Alignment at the International Conference on Machine Learning (ICML), Cited by: §1, §1, §2, §2. Y. Jinnai, T. Morimura, K. Ariu, and K. Abe (2025) Regularized best-of-N sampling with minimum bayes risk objective for language model alignment. In Proc. of the Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Cited by: §1. Y. Kalayci, V. Raman, and S. Dughmi (2025) Optimal stopping vs best-of-N for inference time optimization. arXiv preprint arXiv:2510.01394. Cited by: §6. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. Cited by: §1. H. Khalaf, C. M. Verdun, A. Oesterling, H. Lakkaraju, and F. d. P. Calmon (2025) Inference-time reward hacking in large language models. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2. A. Köpf, Y. Kilcher, D. Von Rütte, S. Anagnostidis, Z. R. Tam, K. Stevens, A. Barhoum, D. Nguyen, O. Stanley, R. Nagyfi, et al. 
(2023) OpenAssistant conversations-democratizing large language model alignment. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: item 1, §5. T. Korbak, E. Perez, and C. Buckley (2022) RL with KL penalties is better viewed as Bayesian inference. In Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2. T. Kwa, D. Thomas, and A. Garriga-Alonso (2024) Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1. C. Laidlaw, S. Singhal, and A. Dragan (2025) Correlated proxies: a new definition and improved mitigation for reward hacking. In Proc. of the International Conference on Learning Representations (ICLR), Cited by: §1. M. Ledoux (2001) The concentration of measure phenomenon. American Mathematical Society. Cited by: Lemma A.2. K. Lee, S. Kim, S. Lim, S. Choi, and S. Oh (2019) Tsallis reinforcement learning: a unified framework for maximum entropy reinforcement learning. arXiv preprint arXiv:1902.00137. Cited by: §1. E. L. Lehmann and J. P. Romano (2005) Testing statistical hypotheses. Springer. Cited by: Proposition 1. Y. Leviathan, M. Kalman, and Y. Matias (2023) Fast inference from transformers via speculative decoding. In Proc. of the International Conference on Machine Learning (ICML), Cited by: §2. B. Liao, Y. Xu, H. Dong, J. Li, C. Monz, S. Savarese, D. Sahoo, and C. Xiong (2025) Reward-guided speculative decoding for efficient llm reasoning. In Proc. of the International Conference on Machine Learning (ICML), Cited by: §2. B. Y. Lin, A. Ravichander, X. Lu, N. Dziri, M. Sclar, K. Chandu, C. Bhagavatula, and Y. Choi (2024) The unlocking spell on base LLMs: rethinking alignment via in-context learning. In Proc. of the International Conference on Learning Representations (ICLR), Cited by: §2. T. Liu, Y. Zhao, R. Joshi, M. Khalman, M. Saleh, P. J. Liu, and J. 
Liu (2024) Statistical rejection sampling improves preference optimization. In Proc. of the International Conference on Learning Representations (ICLR), Cited by: §2. R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1. V. Raman, H. Asi, and S. Kale (2025) AdaBoN: adaptive best-of-N alignment. arXiv preprint arXiv:2505.12050. Cited by: §4.2. J. Ren, Y. Zhao, T. Vu, P. J. Liu, and B. Lakshminarayanan (2023) Self-evaluation improves selective generation in large language models. In Proc. of the I Can’t Believe It’s Not Better Workshop at Neural Information Processing Systems (NeurIPS), Cited by: §1. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1. P. G. Sessa, R. Dadashi, L. Hussenot, J. Ferret, N. Vieillard, A. Ramé, B. Shariari, S. Perrin, A. Friesen, G. Cideron, et al. (2025) BOND: aligning LLMs with best-of-N distillation. In Proc. of the International Conference on Learning Representations (ICLR), Cited by: §2. J. Skalse, N. Howe, D. Krasheninnikov, and D. Krueger (2022) Defining and characterizing reward gaming. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2. C. Snell, J. Lee, K. Xu, and A. Kumar (2025) Scaling llm test-time compute optimally can be more effective than scaling model parameters. In Proc. of the International Conference on Learning Representations (ICLR), Cited by: §1, §2. N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano (2020) Learning to summarize with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §1, §2, §2. H. Sun, Y. Shen, and J. 
Ton (2025) Rethinking Bradley-Terry models in preference-based reward modeling: foundations, theory, and alternatives. In Proc. of the International Conference on Learning Representations (ICLR), Cited by: §1, §2. C. Tsallis (1988) Possible generalization of boltzmann-gibbs statistics. Journal of Statistical Physics 52 (1), p. 479–487. Cited by: §1, §4. C. M. Verdun, A. Oesterling, H. Lakkaraju, and F. P. Calmon (2025) Soft best-of-N sampling for model alignment. In Proc. of the IEEE International Symposium on Information Theory (ISIT), Cited by: §1, §1, §1, §2, §2. H. Wang, W. Xiong, T. Xie, H. Zhao, and T. Zhang (2024) Interpretable preferences via multi-objective reward modeling and mixture-of-experts. arXiv preprint arXiv:2406.12845. Cited by: item 4, §5. J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §B.3, §1, §5. T. Wu, W. Yuan, O. Golovneva, J. Xu, Y. Tian, J. Jiao, J. E. Weston, and S. Sukhbaatar (2025) Meta-rewarding language models: self-improving alignment with LLM-as-a-meta-judge. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1. J. Q. Yang, S. Salamatian, Z. Sun, A. T. Suresh, and A. Beirami (2024a) Asymptotics of language model alignment. In Proc. of the IEEE International Symposium on Information Theory (ISIT), Cited by: §2. R. Yang, R. Ding, Y. Lin, H. Zhang, and T. Zhang (2024b) Regularizing hidden states enables learning generalizable reward model for LLMs. Advances in Neural Information Processing Systems (NeurIPS). Cited by: item 3, §5. S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan (2023) Tree of thoughts: deliberate problem solving with large language models. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1. D. Zhang, S. Zhoubian, Z. Hu, Y. Yue, Y. 
Dong, and J. Tang (2024) ReST-MCTS*: LLM self-training via process reward guided tree search. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1.

The appendix is divided into the following parts. Appendix A: omitted proofs and theoretical results; Appendix B: additional discussions and details on the experimental setup; and Appendix C: additional empirical results and ablation studies.

Appendix A Omitted Proofs and Theoretical Results

We first introduce (and prove) the following useful lemmas, which are referenced in the main text and facilitate the proofs of the propositions.

Lemma A.1. The $\alpha$-exponential and $\alpha$-logarithm are generalizations of the exponential and logarithm functions, since $\lim_{\alpha\to 1}\exp_\alpha(z)=\exp(z)$ and $\lim_{\alpha\to 1}\log_\alpha(z)=\log(z)$.

Proof. Since $\lim_{t\to 0}(1+tz)^{1/t}=\exp(z)$, we have $\lim_{\alpha\to 1}\exp_\alpha(z)=\lim_{\alpha\to 1}(1+(\alpha-1)z)^{1/(\alpha-1)}=\exp(z)$. Similarly, for the $\alpha$-logarithm, by L'Hôpital's rule,

$$\lim_{\alpha\to 1}\log_\alpha(t)=\lim_{\alpha\to 1}\frac{t^{\alpha-1}-1}{\alpha-1}=\lim_{\beta\to 0}\frac{t^\beta-1}{\beta}=\lim_{\beta\to 0}\frac{e^{\beta\log(t)}-1}{\beta}=\log(t)\lim_{\beta\to 0}\frac{e^{\beta\log(t)}-1}{\beta\log(t)}=\log(t). \quad (A.1)$$

∎

Lemma A.2 (Hölder's inequality (Ledoux, 2001)). Let $p,q\in[1,\infty]$ with $\frac{1}{p}+\frac{1}{q}=1$. Then for all real-valued functions $f$ and $g$ measurable with respect to a measure $\mu$, $\|fg\|_{L^1(\mu)}\le\|f\|_{L^p(\mu)}\|g\|_{L^q(\mu)}$. When $p=q=2$, Hölder's inequality reduces to the Cauchy-Schwarz inequality.

A.1 Proof of Proposition 1

Following the definition of the inference-time regret, we decompose it using $\pi_{\mathrm{ref}}$ as an anchor policy:

$$\mathrm{regret}(\hat{\pi}_w;x)=J_{r^*}(\pi^*;x)-J_{r^*}(\hat{\pi}_w;x) \quad (A.2)$$
$$=\underbrace{J_{r^*}(\pi^*;x)-J_{r^*}(\pi_{\mathrm{ref}};x)}_{\mathrm{(i)}}-\underbrace{\left(J_{r^*}(\hat{\pi}_w;x)-J_{r^*}(\pi_{\mathrm{ref}};x)\right)}_{\mathrm{(ii)}}.$$
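The $\alpha$-logarithm limit in Lemma A.1 is easy to verify numerically; a minimal sketch (the helper name `log_alpha` is ours):

```python
import math

def log_alpha(t, alpha):
    # alpha-logarithm: (t**(alpha - 1) - 1) / (alpha - 1), defined for alpha != 1
    return (t ** (alpha - 1) - 1) / (alpha - 1)

# As alpha -> 1 the alpha-logarithm approaches the natural logarithm, per Eq. (A.1).
errors = [abs(log_alpha(2.5, a) - math.log(2.5)) for a in (1.5, 1.1, 1.01, 1.001)]
```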
Since $r^*$ and $\pi^*$ are fixed, once $\pi_{\mathrm{ref}}$ is selected, term (i) becomes a baseline sub-optimality constant that characterizes how far the reference is from optimal under the true reward; it is independent of the inference-time alignment mechanism. Using the definition of the expected density ratio, term (i) can be upper bounded as follows. First,

$$J_{r^*}(\pi^*;x)-J_{r^*}(\pi_{\mathrm{ref}};x)=\left(J_{\hat{r}}(\pi^*;x)-J_{\hat{r}}(\pi_{\mathrm{ref}};x)\right)-\left(J_{\hat{r}-r^*}(\pi^*;x)-J_{\hat{r}-r^*}(\pi_{\mathrm{ref}};x)\right). \quad (A.3)$$

By Lemma A.2, we have

$$\left|J_{\hat{r}-r^*}(\pi^*;x)-J_{\hat{r}-r^*}(\pi_{\mathrm{ref}};x)\right|=\left|\mathbb{E}_{y\sim\pi^*(\cdot|x)}[\hat{r}-r^*]-\mathbb{E}_{y\sim\pi_{\mathrm{ref}}(\cdot|x)}[\hat{r}-r^*]\right| \quad (A.4)$$
$$=\left|\mathbb{E}_{y\sim\pi_{\mathrm{ref}}(\cdot|x)}\left[\left(\frac{\pi^*}{\pi_{\mathrm{ref}}}-1\right)\left(\hat{r}-r^*\right)\right]\right|\le\sqrt{C^{\pi^*}-1}\times\epsilon_2(\hat{r};\pi_{\mathrm{ref}},x).$$

Similarly, $\left|J_{\hat{r}}(\pi^*;x)-J_{\hat{r}}(\pi_{\mathrm{ref}};x)\right|\le\sqrt{C^{\pi^*}-1}\times\|\hat{r}\|_{L^2(\pi_{\mathrm{ref}})}$. Putting these together, term (i) can be upper bounded by

$$J_{r^*}(\pi^*;x)-J_{r^*}(\pi_{\mathrm{ref}};x)\le\left|J_{\hat{r}}(\pi^*;x)-J_{\hat{r}}(\pi_{\mathrm{ref}};x)\right|+\left|J_{\hat{r}-r^*}(\pi^*;x)-J_{\hat{r}-r^*}(\pi_{\mathrm{ref}};x)\right| \quad (A.5)$$
$$\le\sqrt{C^{\pi^*}-1}\left(\epsilon_2(\hat{r};\pi_{\mathrm{ref}},x)+\|\hat{r}\|_{L^2(\pi_{\mathrm{ref}})}\right).$$

Next, term (ii) can be similarly decomposed into

$$J_{r^*}(\hat{\pi}_w;x)-J_{r^*}(\pi_{\mathrm{ref}};x)=\left(J_{\hat{r}}(\hat{\pi}_w;x)-J_{\hat{r}}(\pi_{\mathrm{ref}};x)\right)-\left(J_{\hat{r}-r^*}(\hat{\pi}_w;x)-J_{\hat{r}-r^*}(\pi_{\mathrm{ref}};x)\right) \quad (A.6)$$
$$\ge\left(J_{\hat{r}}(\hat{\pi}_w;x)-J_{\hat{r}}(\pi_{\mathrm{ref}};x)\right)-\left|J_{\hat{r}-r^*}(\hat{\pi}_w;x)-J_{\hat{r}-r^*}(\pi_{\mathrm{ref}};x)\right|.$$

Since $\hat{r},r^*\in[0,R_{\max}]$, we have

$$J_{\hat{r}}(\hat{\pi}_w;x)-J_{\hat{r}}(\pi_{\mathrm{ref}};x)=J_{\hat{r}}(\hat{\pi}_w;x)-J_{\hat{r}}(\pi_w;x)+J_{\hat{r}}(\pi_w;x)-J_{\hat{r}}(\pi_{\mathrm{ref}};x) \quad (A.7)$$
$$=\Delta(\pi_w)+J_{\hat{r}}(\hat{\pi}_w;x)-J_{\hat{r}}(\pi_w;x)\ge\Delta(\pi_w)-2R_{\max}D_{\mathrm{TV}}(\hat{\pi}_w\|\pi_w).$$
Then, by Lemma A.2 again, we have

$$\left|J_{\hat{r}-r^*}(\hat{\pi}_w;x)-J_{\hat{r}-r^*}(\pi_{\mathrm{ref}};x)\right|\le\left(1+(\alpha-1)D_\alpha(\pi_w\|\pi_{\mathrm{ref}})\right)^{1/\alpha}\|\hat{r}-r^*\|_{L^\beta(\pi_{\mathrm{ref}})} \quad (A.8)$$
$$=C^{\pi_w}\times\epsilon_2(\hat{r};\pi_{\mathrm{ref}},x)\quad\text{if we select }\alpha=\beta=2.$$

Combining Eq. (A.5), Eq. (A.7), and Eq. (A.8), we obtain

$$\mathrm{regret}(\hat{\pi}_w;x)\le\sqrt{C^{\pi^*}-1}\left(\epsilon_2(\hat{r};\pi_{\mathrm{ref}},x)+\|\hat{r}\|_{L^2(\pi_{\mathrm{ref}})}\right)-\Delta(\pi_w)+2R_{\max}D_{\mathrm{TV}}(\hat{\pi}_w\|\pi_w) \quad (A.9)$$
$$+\left(1+(\alpha-1)D_\alpha(\pi_w\|\pi_{\mathrm{ref}})\right)^{1/\alpha}\|\hat{r}-r^*\|_{L^\beta(\pi_{\mathrm{ref}})}.$$

Taking $\alpha=\beta=2$ leads to the desired result.

A.2 Proof of Proposition 2

Fix a prompt $x$ and suppress the conditioning on it. Let the proxy reward be $\hat{r}:=\hat{r}(x,Y)\in[0,1]$ with $Y\sim\pi_{\mathrm{ref}}$, and define the endpoint gap $U:=1-\hat{r}$. By assumption,

$$U\sim\mathrm{Beta}\!\left(\tfrac{1}{\kappa},1\right),\quad\text{with density }f_U(u)=\tfrac{1}{\kappa}u^{\frac{1}{\kappa}-1}\text{ for }u\in[0,1]. \quad (A.10)$$

We utilize the general regret bound decomposition derived previously, where the regret $B(\pi)$ depends on the Peak (relative supremum), Dist (distortion), and Gain ($\Delta$) of the weighting function $w(\hat{r})$:

$$B(\pi)\le 2\exp\!\left(-\frac{N}{\mathrm{Peak}(w)}\right)+\epsilon\cdot\mathrm{Dist}(w)-\Delta(w). \quad (A.11)$$

The statistics are defined as:

$$\mathrm{Peak}(w)=\frac{\sup_{\hat{r}}w(\hat{r})}{\mathbb{E}[w(\hat{r})]},\quad\mathrm{Dist}(w)=\frac{\sqrt{\mathbb{E}[w(\hat{r})^2]}}{\mathbb{E}[w(\hat{r})]},\quad\Delta(w)=\Delta(\pi_w)=\mathbb{E}_{\pi_w}[\hat{r}]-\mathbb{E}[\hat{r}]. \quad (A.12)$$

Preliminaries: moments. For $U\sim\mathrm{Beta}(1/\kappa,1)$, the raw moments are given by

$$\mathbb{E}[U^m]=\int_0^1 u^m\,\tfrac{1}{\kappa}u^{\frac{1}{\kappa}-1}\,du=\frac{1}{m\kappa+1}. \quad (A.13)$$

This implies that in the heavy-tail regime ($\kappa\to\infty$), $U$ remains distributed over $[0,1]$ but with mean $\approx 1/\kappa$. In the light-tail regime ($\kappa\to 0$), $\hat{r}=1-U$ concentrates near 0, with moments

$$\mathbb{E}[\hat{r}]=1-\mathbb{E}[U]=\frac{\kappa}{\kappa+1},\quad\mathbb{E}[\hat{r}^2]=1-\frac{2}{\kappa+1}+\frac{1}{2\kappa+1}. \quad (A.14)$$

As $\kappa\to 0$, for each fixed integer $m\ge 1$, $\mathbb{E}[\hat{r}^m]=m!\,\kappa^m+O(\kappa^{m+1})$.
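The moment formula in Eq. (A.13) is easy to spot-check by Monte Carlo; a small sketch of our own, using the fact that $U=V^\kappa$ with $V\sim\mathrm{Uniform}(0,1)$ has the $\mathrm{Beta}(1/\kappa,1)$ law:

```python
import random

rng = random.Random(42)
kappa = 0.5
# U = V**kappa has CDF P(U <= u) = u**(1/kappa), i.e. U ~ Beta(1/kappa, 1).
samples = [rng.random() ** kappa for _ in range(200000)]

def moment(xs, m):
    return sum(x ** m for x in xs) / len(xs)

# Eq. (A.13): E[U^m] = 1 / (m*kappa + 1).
for m in (1, 2, 3):
    print(m, moment(samples, m), 1 / (m * kappa + 1))
```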
Part 1: Heavy-tailed regime (large $\kappa$). Here, we analyze the statistics using $U$ as the primary variable.

1.1 Soft-BoN. The weight is $w_{\mathrm{sBoN}}(\hat{r})=e^{\hat{r}/\lambda}=e^{(1-U)/\lambda}\propto e^{-U/\lambda}$. Let $\bar{w}(U)=e^{-U/\lambda}$.

• Peak: We expand the expectation of the weight using a Taylor series:

$$\mathbb{E}[e^{-U/\lambda}]=1-\frac{1}{\lambda}\mathbb{E}[U]+O(\mathbb{E}[U^2])=1-\frac{1}{\lambda(\kappa+1)}+O(\kappa^{-2}). \quad (A.15)$$

The Peak is the ratio of the supremum (attained at $U=0$) to the expectation:

$$\mathrm{Peak}=\frac{1}{\mathbb{E}[e^{-U/\lambda}]}=\frac{1}{1-\frac{1}{\lambda\kappa}+o(\kappa^{-1})}=1+\frac{1}{\lambda\kappa}+o(\kappa^{-1}). \quad (A.16)$$

Substituting this into the exponential term in Eq. (A.11):

$$\exp\!\left(-\frac{N}{\mathrm{Peak}}\right)\approx\exp\!\left(-\frac{N}{1+1/(\lambda\kappa)}\right). \quad (A.17)$$

• Dist: We compute the second moment $\mathbb{E}[w^2]$ using $\bar{w}^2=e^{-2U/\lambda}$:

$$\mathbb{E}[e^{-2U/\lambda}]=1-\frac{2}{\lambda\kappa}+\frac{2}{\lambda^2(2\kappa)}+o(\kappa^{-1})=1-\frac{2}{\lambda\kappa}+\frac{1}{\lambda^2\kappa}+o(\kappa^{-1}). \quad (A.18)$$

The squared distortion is

$$\mathrm{Dist}^2=\frac{\mathbb{E}[e^{-2U/\lambda}]}{\left(\mathbb{E}[e^{-U/\lambda}]\right)^2}\approx\frac{1-\frac{2}{\lambda\kappa}+\frac{1}{\lambda^2\kappa}}{1-\frac{2}{\lambda\kappa}+\frac{1}{2\lambda^2\kappa}}\approx 1+\frac{1}{2\lambda^2\kappa}. \quad (A.19)$$

Thus, $\mathrm{Dist}\approx 1+\frac{1}{4\lambda^2\kappa}$.

• Gain: The proxy gain is $\Delta=\mathbb{E}[U]-\mathbb{E}_{\pi_w}[U]$. We compute the tilted expectation $\mathbb{E}_{\pi_w}[U]$ using the substitution $t=\kappa U$ (approximating $U$ as $\mathrm{Exponential}(1/\kappa)$):

$$\mathbb{E}[U e^{-U/\lambda}]\approx\int_0^\infty u e^{-u/\lambda}\,\tfrac{1}{\kappa}e^{-u/\kappa}\,du\approx\tfrac{1}{\kappa}\int_0^1 u e^{-u/\lambda}\,du. \quad (A.20)$$

A precise integration yields $\mathbb{E}_{\pi_w}[U]\approx\frac{\lambda}{\kappa}(1-e^{-1/\lambda})$. Thus:

$$\Delta\approx\frac{1}{\kappa}-\frac{\lambda}{\kappa}(1-e^{-1/\lambda})=\frac{1}{\kappa}\left(1-\lambda(1-e^{-1/\lambda})\right)+o(\kappa^{-1}). \quad (A.21)$$

1.2 ITP. The weight is $w_{\mathrm{ITP}}(\hat{r})=1+\hat{r}/\lambda=1+(1-U)/\lambda$. Let $C=1+1/\lambda$; then $w_{\mathrm{ITP}}=C-U/\lambda$.

• Peak: The supremum is $C$. The expectation is $\mathbb{E}[w_{\mathrm{ITP}}]=C-\mathbb{E}[U]/\lambda\approx C-\frac{1}{\lambda\kappa}$, so

$$\mathrm{Peak}=\frac{C}{C-\frac{1}{\lambda\kappa}}=\frac{1}{1-\frac{1}{C\lambda\kappa}}\approx 1+\frac{1}{C\lambda\kappa}=1+\frac{1}{(\lambda+1)\kappa}+o(\kappa^{-1}).$$
(A.22)

• Dist and Gain: Using standard variance approximations for linear weights:

$$\mathrm{Dist}\approx 1+\frac{1}{4(\lambda+1)^2\kappa}+o(\kappa^{-1}),\qquad\Delta\approx\frac{1}{2\kappa(\lambda+1)}+o(\kappa^{-1}).\tag{A.23}$$

Part 2: Light-tailed regime (small $\kappa$). Here $\hat r$ concentrates near $0$. Both soft-BoN ($e^{\hat r/\lambda}$) and ITP ($1+\hat r/\lambda$) behave as $1+\hat r/\lambda$ to first order, and therefore they have similar Peak and Dist terms; the difference in the bound lies in the Gain.

• Peak: For $w(\hat r)=1+\hat r/\lambda$:

$$\sup w=1+\frac{1}{\lambda}=\frac{\lambda+1}{\lambda},\qquad\mathbb{E}[w]=1+\frac{\mathbb{E}[\hat r]}{\lambda}\approx 1+\frac{\kappa}{\lambda}=\frac{\lambda+\kappa}{\lambda}.\tag{A.24}$$

The Peak is:

$$\mathrm{Peak}=\frac{(\lambda+1)/\lambda}{(\lambda+\kappa)/\lambda}=\frac{\lambda+1}{\lambda+\kappa}.\tag{A.25}$$

Substituting into the bound exponent:

$$\exp\!\left(-\frac{N}{\mathrm{Peak}}\right)=\exp\!\left(-\frac{N(\lambda+\kappa)}{\lambda+1}\right).\tag{A.26}$$

• Dist: Using $\mathrm{Var}(w)\approx\mathrm{Var}(\hat r)/\lambda^2\approx\kappa^2/\lambda^2$ and $\mathbb{E}[w]\approx 1$:

$$\mathrm{Dist}\approx 1+\frac{\mathrm{Var}(w)}{2\,\mathbb{E}[w]^2}\approx 1+\frac{\kappa^2}{2\lambda^2}.\tag{A.27}$$

2.1 Soft-BoN. The Gain is given by $\mathbb{E}_{\pi_w}[\hat r]=\mathbb{E}[\hat re^{\hat r/\lambda}]/\mathbb{E}[e^{\hat r/\lambda}]$. Expanding the numerator $\mathbb{E}[\hat re^{\hat r/\lambda}]$ yields

$$\mathbb{E}\!\left[\hat r\left(1+\frac{\hat r}{\lambda}+\frac{\hat r^2}{2\lambda^2}+O(\hat r^3)\right)\right]=\mathbb{E}[\hat r]+\frac{\mathbb{E}[\hat r^2]}{\lambda}+\frac{\mathbb{E}[\hat r^3]}{2\lambda^2}+O(\mathbb{E}[\hat r^4]).\tag{A.28}$$

By the moments of $\hat r$ with small $\kappa$, we obtain

$$\mathbb{E}[\hat re^{\hat r/\lambda}]=\kappa+\frac{2\kappa^2}{\lambda}+\frac{3\kappa^3}{\lambda^2}+O(\kappa^4),\tag{A.29}$$

while the denominator expands as $\mathbb{E}[e^{\hat r/\lambda}]=1+\frac{\kappa}{\lambda}+\frac{\kappa^2}{\lambda^2}+O(\kappa^3)$. Performing a ratio expansion then yields $\mathbb{E}_{\pi_w}[\hat r]=\kappa+\frac{\kappa^2}{\lambda}+\frac{\kappa^3}{\lambda^2}+O(\kappa^4)$. Finally, subtracting the reference mean $\mathbb{E}[\hat r]=\kappa+O(\kappa^2)$ gives the proxy gain

$$\Delta=\frac{\kappa^2}{\lambda}+\frac{\kappa^3}{\lambda^2}+O(\kappa^4).\tag{A.30}$$

2.2 ITP. The Gain is dominated by the covariance term:

$$\Delta=\frac{\mathrm{Cov}(w,\hat r)}{\mathbb{E}[w]}\approx\frac{\mathrm{Var}(\hat r)/\lambda}{1}\approx\frac{\kappa^2}{\lambda}+O(\kappa^3).\tag{A.31}$$

Part 3: BoN limit ($\lambda\to 0$). Best-of-$N$ (BoN) arises as the limit of Soft-BoN as the inverse temperature $1/\lambda\to\infty$.
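As a numerical aside (our own sketch, not part of the proof, with illustrative parameter values), the contrast between the Peak values derived in Parts 1 and 2 can be checked by direct simulation under the $\mathrm{Beta}(1/\kappa,1)$ gap model: in the light-tail regime the soft-BoN Peak is large (approaching $e^{1/\lambda}$), while in the heavy-tail regime both Peaks sit near $1$.

```python
import math
import random

def sample_gaps(kappa: float, n: int, seed: int = 0) -> list:
    """U ~ Beta(1/kappa, 1) via inverse CDF: F(u) = u^(1/kappa), so U = V**kappa."""
    rng = random.Random(seed)
    return [rng.random() ** kappa for _ in range(n)]

def empirical_peaks(kappa: float, lam: float, n: int = 100_000) -> tuple:
    """Return (Peak(sBoN), Peak(ITP)) with Peak(w) = sup w / E[w].
    The sup is taken analytically at r = 1, since both weights increase in r."""
    rewards = [1.0 - u for u in sample_gaps(kappa, n)]
    mean = lambda xs: sum(xs) / len(xs)
    p_sbon = math.exp(1.0 / lam) / mean([math.exp(r / lam) for r in rewards])
    p_itp = (1.0 + 1.0 / lam) / mean([1.0 + r / lam for r in rewards])
    return p_sbon, p_itp

if __name__ == "__main__":
    lam = 0.5
    for kappa, regime in ((0.05, "light-tailed"), (20.0, "heavy-tailed")):
        p_sbon, p_itp = empirical_peaks(kappa, lam)
        print(f"{regime} (kappa={kappa}): Peak(sBoN)={p_sbon:.2f}, Peak(ITP)={p_itp:.2f}")
```

For $\lambda=0.5$ the light-tailed soft-BoN Peak is several times larger than its heavy-tailed counterpart, matching Eqs. (A.16) and (A.25).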
Since the supremum of the weights is $\sup_y w_{\mathrm{soft}}(\hat r(x,y))=e^{1/\lambda}$ and the normalizing constant $\mathbb{E}[e^{\hat r/\lambda}]$ remains finite for any fixed $\kappa$, the peak ratio diverges as $\mathrm{Peak}(w_{\mathrm{soft}};x)=e^{1/\lambda}/\mathbb{E}[e^{\hat r/\lambda}]\to\infty$. Similarly, the distortion diverges as $\mathrm{Dist}(w_{\mathrm{soft}};x)=\sqrt{\mathbb{E}[e^{2\hat r/\lambda}]}/\mathbb{E}[e^{\hat r/\lambda}]\to\infty$. In the light-tail regime, the expansion $\mathbb{E}[e^{\hat r/\lambda}]=1+\kappa/\lambda+\kappa^2/\lambda^2+O(\kappa^3)$ implies that the peak scales as $\mathrm{Peak}(w_{\mathrm{soft}};x)\asymp e^{1/\lambda}$. Consequently, driving the finite-$N$ proxy term $2\exp(-N/\mathrm{Peak})\approx 2\exp(-N/e^{1/\lambda})$ to be small requires the sample size $N$ to scale on the order of $e^{1/\lambda}$, confirming the qualitative behavior of BoN. Combining Parts 1–3 completes the proof.

A.3 Proof of Proposition 3

Consider the generator function $\phi_\alpha(t)$ of the $\alpha$-Tsallis divergence,

$$\phi_\alpha(t)=\frac{t^\alpha-\alpha t+(\alpha-1)}{\alpha(\alpha-1)},\qquad t>0,\ \alpha>1.\tag{A.32}$$

Then the $\alpha$-Tsallis divergence between two distributions $P$ and $Q$ can be alternatively expressed as $D_\alpha(P\|Q)=\sum_{x\in\mathcal{X}}Q(x)\,\phi_\alpha\!\left(\frac{P(x)}{Q(x)}\right)$. Note that $\phi_\alpha(t)$ is convex with $\phi_\alpha(1)=0$, and therefore $D_\alpha$ is also convex in $P$. Now, letting the tilting function be $t(y|x)=\pi(y|x)/\pi_{\mathrm{ref}}(y|x)$, the objective in Eq. (12) becomes

$$\begin{aligned}\mathcal{L}_\alpha(\pi)&=\mathbb{E}_{y\sim\pi(\cdot|x)}\left[\hat r(x,y)\right]-\lambda D_\alpha\!\left(\pi(\cdot|x)\|\pi_{\mathrm{ref}}(\cdot|x)\right)\\&=\sum_{y\in\mathcal{Y}}\pi(y|x)\hat r(x,y)-\lambda\sum_{y\in\mathcal{Y}}\pi_{\mathrm{ref}}(y|x)\,\phi_\alpha\!\left(\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}\right)\\&=\sum_{y\in\mathcal{Y}}\pi_{\mathrm{ref}}(y|x)\left(t(y|x)\hat r(x,y)-\lambda\phi_\alpha(t(y|x))\right).\end{aligned}\tag{A.33}$$

The simplex constraint $\sum_{y\in\mathcal{Y}}\pi(y|x)=1$ becomes $\sum_{y\in\mathcal{Y}}\pi_{\mathrm{ref}}(y|x)t(y|x)=1$. The transformation in Eq. (A.33) in fact forms the convex conjugate (Fenchel–Legendre transform) of the generator function, i.e.,

$$\psi_\alpha(z)=\sup_{t>0}\left\{zt-\lambda\phi_\alpha(t)\right\}.$$
(A.34)

The optimizer $t^*(z)=\arg\max_{t>0}\{zt-\lambda\phi_\alpha(t)\}$ can then be derived directly from the first-order condition

$$\frac{d}{dt}\left(zt-\lambda\phi_\alpha(t)\right)=z-\lambda\phi_\alpha'(t)=z-\lambda\cdot\frac{1}{\alpha(\alpha-1)}\left(\alpha t^{\alpha-1}-\alpha\right)=z-\lambda\,\frac{t^{\alpha-1}-1}{\alpha-1}=0.\tag{A.35}$$

By re-arranging and plugging in $z=\hat r(x,y)$, we have

$$t^{\alpha-1}=1+\frac{\alpha-1}{\lambda}\hat r(x,y)\ \Rightarrow\ t^*(\hat r(x,y))=\left[1+(\alpha-1)\frac{\hat r(x,y)}{\lambda}\right]_+^{\frac{1}{\alpha-1}}=\exp_\alpha\!\left(\frac{\hat r(x,y)}{\lambda}\right),\tag{A.36}$$

and therefore

$$\pi_\alpha(\cdot|x)\propto\pi_{\mathrm{ref}}(\cdot|x)\exp_\alpha\!\left(\frac{\hat r(x,y)}{\lambda}\right).\tag{A.37}$$

The simplex condition $\sum_{y\in\mathcal{Y}}\pi_\alpha(y|x)=1$ can be satisfied by finding a $\nu\ge 0$ such that $\sum_{y\in\mathcal{Y}}\pi_{\mathrm{ref}}(y|x)\,t^*(\hat r(x,y)+\nu)=1$.

A.4 Proof of Proposition 4

We proceed in three parts.

Part 1: Algebraic reductions. Let $U:=1-\hat r$, so $\hat r=1-U$. We use the parameters from the proposition statement: $\alpha\in(1,2]$ and $\lambda>0$. The weight function is:

$$w_\alpha(\hat r)=\left[1+(\alpha-1)\frac{\hat r}{\lambda}\right]_+^{\frac{1}{\alpha-1}}=\left(1+\frac{\alpha-1}{\lambda}(1-U)\right)^{\frac{1}{\alpha-1}}.\tag{A.38}$$

We factor out the endpoint value at $\hat r=1$ (i.e., $U=0$). Define the constants:

$$A:=1+\frac{\alpha-1}{\lambda},\qquad\rho:=\frac{(\alpha-1)/\lambda}{1+(\alpha-1)/\lambda}=\frac{\alpha-1}{\lambda A},\qquad\gamma:=\frac{1}{\alpha-1}.\tag{A.39}$$

Then we can rewrite the inner term as:

$$1+\frac{\alpha-1}{\lambda}(1-U)=1+\frac{\alpha-1}{\lambda}-\frac{\alpha-1}{\lambda}U=A-A\rho U=A(1-\rho U).\tag{A.40}$$

Thus, the weight becomes:

$$w_\alpha(1-U)=\left(A(1-\rho U)\right)^\gamma=A^\gamma(1-\rho U)^\gamma.\tag{A.41}$$

The moments of the weights under the reference distribution (parameterized by $\kappa$) are:

$$\mathbb{E}_\kappa[w_\alpha(\hat r)]=A^\gamma\,\mathbb{E}_\kappa[(1-\rho U)^\gamma]=A^\gamma m_0,\qquad\mathbb{E}_\kappa[w_\alpha(\hat r)^2]=A^{2\gamma}\,\mathbb{E}_\kappa[(1-\rho U)^{2\gamma}]=A^{2\gamma}m_2,\tag{A.42}$$

where we define $m_0(\alpha;\kappa,\lambda):=\mathbb{E}_\kappa[(1-\rho U)^\gamma]$ and $m_2(\alpha;\kappa,\lambda):=\mathbb{E}_\kappa[(1-\rho U)^{2\gamma}]$. Therefore, the distortion ratio simplifies to:
$$\frac{\|w_\alpha\|_{L^2}}{\|w_\alpha\|_{L^1}}=\frac{\sqrt{\mathbb{E}_\kappa[w_\alpha(\hat r)^2]}}{\mathbb{E}_\kappa[w_\alpha(\hat r)]}=\frac{A^\gamma\sqrt{m_2}}{A^\gamma m_0}=\frac{\sqrt{m_2}}{m_0}.\tag{A.43}$$

Since $w_\alpha(\hat r)$ is increasing in $\hat r$, its supremum is at $\hat r=1$, so $\sup w_\alpha=A^\gamma$. The envelope constant for the exponential mechanism is:

$$M(\alpha)=\frac{\sup w_\alpha}{\mathbb{E}_\kappa[w_\alpha]}=\frac{A^\gamma}{A^\gamma m_0}=\frac{1}{m_0}.\tag{A.44}$$

The probability of selecting the reference policy is $e^{-NM(\alpha)^{-1}}=e^{-Nm_0}$. This yields the first term in (14). The proxy gain is:

$$\Delta(\pi_{\mathrm{BoT}})=\mathbb{E}_{\pi_\alpha}[\hat r]-\mathbb{E}_{\pi_{\mathrm{ref}}}[\hat r]=\mathbb{E}_{\pi_{\mathrm{ref}}}[U]-\mathbb{E}_{\pi_\alpha}[U],\tag{A.45}$$

where we used $\hat r=1-U$.

Part 2: Heavy-tailed regime (large $\kappa$). In the heavy-tailed regime ($\kappa\to\infty$), $U$ concentrates near $0$. Using $U\sim\mathrm{Beta}(1/\kappa,1)$, we have the moments $\mathbb{E}[U]=\frac{1}{\kappa+1}=\frac{1}{\kappa}-\frac{1}{\kappa^2}+O(\kappa^{-3})$ and $\mathbb{E}[U^2]=\frac{1}{2\kappa+1}=\frac{1}{2\kappa}+O(\kappa^{-2})$. We expand $(1-\rho U)^\gamma$ around $U=0$:

$$(1-\rho U)^\gamma=1-\gamma\rho U+\frac{\gamma(\gamma-1)}{2}\rho^2U^2+O(U^3).\tag{A.46}$$

Taking expectations:

$$m_0=1-\gamma\rho\,\mathbb{E}[U]+\frac{\gamma(\gamma-1)}{2}\rho^2\,\mathbb{E}[U^2]+O(\kappa^{-2})=1+\frac{1}{\kappa}\left(-\gamma\rho+\frac{\gamma(\gamma-1)}{4}\rho^2\right)+O(\kappa^{-2}).\tag{A.47}$$

Substituting $\gamma=\frac{1}{\alpha-1}$ and $\rho=\frac{\alpha-1}{\lambda A}$:

$$\gamma\rho=\frac{1}{\alpha-1}\cdot\frac{\alpha-1}{\lambda A}=\frac{1}{\lambda A}.\tag{A.48}$$

The quadratic coefficient involves $\gamma(\gamma-1)=\frac{1}{\alpha-1}\left(\frac{1}{\alpha-1}-1\right)=\frac{2-\alpha}{(\alpha-1)^2}$. Thus:

$$\frac{\gamma(\gamma-1)}{4}\rho^2=\frac{2-\alpha}{4(\alpha-1)^2}\cdot\frac{(\alpha-1)^2}{\lambda^2A^2}=\frac{2-\alpha}{4\lambda^2A^2}.\tag{A.49}$$

Combining these over the common denominator $4\lambda^2A^2=4(\lambda+\alpha-1)^2$:

$$m_0=1+\frac{1}{\kappa}\cdot\frac{-4\lambda A+2-\alpha}{4\lambda^2A^2}+O(\kappa^{-2})=1+\frac{1}{\kappa}\cdot\frac{-4\lambda-5(\alpha-1)+1}{4(\lambda+\alpha-1)^2}+O(\kappa^{-2}).\tag{A.50}$$

The finite-$N$ term expands as $e^{-Nm_0}\approx e^{-N}\left(1-N(m_0-1)\right)$, contributing the term matching (15). Expanding $m_2=\mathbb{E}[(1-\rho U)^{2\gamma}]$ similarly leads to $m_2=1+\frac{1}{\kappa}\left(-2\gamma\rho+\frac{\gamma(2\gamma-1)}{2}\rho^2\right)+O(\kappa^{-2})$. The distortion ratio then expands as:
$$\frac{\sqrt{m_2}}{m_0}\approx 1+\frac{1}{\kappa}\left(\tfrac{1}{2}\,\mathrm{coeff}(m_2)-\mathrm{coeff}(m_0)\right)=1+\frac{1}{\kappa}\cdot\frac{\gamma^2\rho^2}{4}.\tag{A.51}$$

Substituting $\gamma\rho=\frac{1}{\lambda+\alpha-1}$:

$$\frac{\sqrt{m_2}}{m_0}=1+\frac{1}{4\kappa(\lambda+\alpha-1)^2}.\tag{A.52}$$

This yields the first term in (15). We next compute $\Delta(\pi_{\mathrm{BoT}})=\mathbb{E}[U]-\mathbb{E}_{\pi_\alpha}[U]$. Using the integral approximation for large $\kappa$:

$$\mathbb{E}_{\pi_\alpha}[U]=\frac{1}{\kappa}\int_0^1(1-\rho u)^\gamma\,du+O(\kappa^{-2})=\frac{1}{\kappa}\cdot\frac{1-(1-\rho)^{\gamma+1}}{\rho(\gamma+1)}+O(\kappa^{-2}).\tag{A.53}$$

Linearizing near $\alpha=2$ (where $\gamma=1$), the integral is $1-\rho/2$. Thus:

$$\Delta(\pi_{\mathrm{BoT}})\approx\frac{1}{\kappa}-\frac{1}{\kappa}\left(1-\frac{\rho}{2}\right)=\frac{\rho}{2\kappa}=\frac{1}{\kappa}\cdot\frac{\alpha-1}{2(\lambda+\alpha-1)}.\tag{A.54}$$

This matches the final term in (15).

Part 3: Light-tailed regime (small $\kappa$). In this regime ($\kappa\to 0$), $\hat r$ concentrates near $0$. From $U\sim\mathrm{Beta}(1/\kappa,1)$, $\hat r=1-U$ has moments:

$$\mathbb{E}[\hat r]\approx\kappa-\kappa^2+\kappa^3,\qquad\mathbb{E}[\hat r^2]\approx 2\kappa^2-6\kappa^3.\tag{A.55}$$

Using the expansion $w_\alpha(\hat r)\approx 1+\frac{\hat r}{\lambda}+\frac{2-\alpha}{2\lambda^2}\hat r^2$:

$$\mathbb{E}[w_\alpha]\approx 1+\frac{\kappa}{\lambda}+\kappa^2\left(\frac{2-\alpha}{\lambda^2}-\frac{1}{\lambda}\right),\qquad\mathbb{E}[w_\alpha^2]\approx 1+\frac{2\kappa}{\lambda}+\kappa^2\left(\frac{2(3-\alpha)}{\lambda^2}-\frac{2}{\lambda}\right).\tag{A.56}$$

The gain $\Delta=\mathbb{E}_{\pi_\alpha}[\hat r]-\mathbb{E}[\hat r]$ is dominated by the covariance term $\mathrm{Cov}(w_\alpha,\hat r)/\mathbb{E}[w_\alpha]$, and $\mathbb{E}_{\pi_\alpha}[\hat r]\approx\kappa+\frac{\kappa^2}{\lambda}$, so $\Delta\approx\frac{\kappa^2}{\lambda}$. Including higher-order terms leads to the expansion in (16). The squared distortion is:

$$\mathrm{Dist}^2=\frac{\mathbb{E}[w_\alpha^2]}{\mathbb{E}[w_\alpha]^2}\approx 1+\frac{\kappa^2}{\lambda^2}.\tag{A.57}$$

Taking the square root yields $\mathrm{Dist}\approx 1+\frac{\kappa^2}{2\lambda^2}$. Combining these expansions yields the asymptotic results stated in the proposition.

Appendix B Additional Discussion and Experimental Setup

We detail the Hill estimator introduced in § 4.2, elucidating its sample-complexity advantages over estimating the entire reward distribution, and outline our experimental implementation.

B.1 Tail Estimation via the Hill Estimator

To implement the prompt-adaptive strategy, we require a robust estimate of the tail index $\kappa(x)$ from a finite set of $N$ candidate responses.
We adopt the Hill estimator (Hill, 1975), the standard semi-parametric estimator for heavy-tailed distributions. Standard extreme value theory typically deals with unbounded random variables; however, our proxy rewards are bounded, $\hat r(x,y)\in[0,1]$. To bridge this, we analyze the reward gaps (or regrets):

$$U(x,y)=1-\hat r(x,y).\tag{A.58}$$

The “tail” of the reward distribution (where $\hat r\approx 1$) corresponds to the limit $U\to 0$. As established in our modeling assumption (Prop. 2), the tail behavior is characterized by $\Pr(U\le u)\approx u^{1/\kappa}$. By applying the transformation $Z=1/U=(1-\hat r)^{-1}$, we map the problem to standard Pareto tail estimation, where $\Pr(Z>z)=\Pr(U<1/z)\approx z^{-1/\kappa}$. Here, $\kappa$ appears precisely as the shape parameter (tail index) of the resulting Pareto distribution.

Given $N$ sampled responses, let $r_{(1)}\ge r_{(2)}\ge\cdots\ge r_{(N)}$ denote the order statistics of the rewards. These correspond to the ordered transformed variables $Z_{(1)}\ge Z_{(2)}\ge\cdots$, where $Z_{(i)}=(1-r_{(i)})^{-1}$. The Hill estimator uses the top $K$ order statistics to estimate $\kappa$:

$$\hat\kappa=\frac{1}{K}\sum_{i=1}^K\log\!\left(\frac{Z_{(i)}}{Z_{(K+1)}}\right)=\frac{1}{K}\sum_{i=1}^K\log\!\left(\frac{(1-r_{(i)})^{-1}}{(1-r_{(K+1)})^{-1}}\right)=\frac{1}{K}\sum_{i=1}^K\log\!\left(\frac{1-r_{(K+1)}}{1-r_{(i)}}\right).\tag{A.59}$$

The hyperparameter $K$ determines the “tail cut-off”—the boundary between the distribution’s tail and its body.

• Small $K$: Uses only the most extreme data points. This minimizes bias (the asymptotic tail approximation is most accurate there) but increases variance due to the small sample size.

• Large $K$: Incorporates more samples, reducing variance. However, if $K$ is too large, it includes samples from the central body of the distribution where the power-law assumption fails, introducing significant bias.
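A direct implementation of Eq. (A.59) is short. The sketch below is our own code, with a synthetic check that draws rewards from the Beta-gap model of Prop. 2 (the function names and the $K\sim\sqrt{N}$ choice in the demo are ours):

```python
import math

def hill_tail_index(rewards: list, k: int) -> float:
    """Hill estimator of kappa from proxy rewards in [0, 1), Eq. (A.59):
    kappa_hat = (1/K) * sum_{i<=K} log((1 - r_(K+1)) / (1 - r_(i))),
    where r_(1) >= ... >= r_(N) are the reward order statistics."""
    if not 1 <= k < len(rewards):
        raise ValueError("need 1 <= K < N")
    top = sorted(rewards, reverse=True)   # r_(1) >= r_(2) >= ...
    gap_threshold = 1.0 - top[k]          # 1 - r_(K+1)  (0-indexed: top[k])
    return sum(math.log(gap_threshold / (1.0 - r)) for r in top[:k]) / k

if __name__ == "__main__":
    import random
    rng = random.Random(1)
    true_kappa = 2.0
    # Synthetic rewards with Beta(1/kappa, 1) gaps: r = 1 - V**kappa
    rewards = [1.0 - rng.random() ** true_kappa for _ in range(5000)]
    k = max(1, int(math.sqrt(len(rewards))))  # small-fraction tail cutoff
    print(f"kappa_hat = {hill_tail_index(rewards, k):.2f} (true {true_kappa})")
```

Because the ratio inside the logarithm is at least 1 for the top-$K$ samples, each summand is non-negative, and for exact power-law gaps the estimate concentrates around the true $\kappa$ with standard deviation roughly $\kappa/\sqrt{K}$.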
In BoT, we fix $K$ to a small fraction of $N$ (e.g., $K\approx\sqrt{N}$, or $5\%$–$10\%$ of $N$) to prioritize low bias, ensuring $\alpha(x)$ reacts only to the true extremal behavior of the prompt.

B.2 Sample Complexity: Tail Index vs. Full Distribution

A fundamental advantage of the BoT framework is that it adapts to the reward landscape without requiring a full estimate of the reward distribution. Estimating the complete probability density function $p(r)$ is a non-parametric problem subject to the curse of dimensionality and slow convergence rates. In contrast, estimating the tail index $\kappa$ is a semi-parametric problem that enjoys faster statistical convergence. We formalize this distinction below.

Consider the task of estimating the underlying reward density $p(r)$ from $N$ samples to within an error $\epsilon$ in total variation distance or $L_2$ norm. The minimax convergence rate for estimating a generic smooth density (Hölder class $\Sigma(\beta,L)$) in $d=1$ dimension is

$$\inf_{\hat p}\sup_{p\in\Sigma}\mathbb{E}\left[\|\hat p-p\|\right]\asymp N^{-\frac{\beta}{2\beta+1}},\tag{A.60}$$

where $\beta$ is the smoothness parameter. For typical values (e.g., $\beta=2$), this yields a rate of $N^{-2/5}$, significantly slower than the parametric rate $N^{-1/2}$. This implies that accurately characterizing the entire reward landscape to optimize a policy would require a prohibitively large number of samples $N$.

Conversely, BoT relies only on the tail index $\kappa$, a single scalar summary of the extreme-value behavior. Under standard second-order regular variation conditions, the Hill estimator $\hat\kappa$ (using the top $K$ statistics) is asymptotically normal (De Haan and Resnick, 1998):

$$\sqrt{K}\left(\hat\kappa_K-\kappa\right)\xrightarrow{d}\mathcal{N}(0,\kappa^2).\tag{A.61}$$

This indicates that the estimation error decays at the parametric rate $O(1/\sqrt{K})$. Crucially, the sample complexity to estimate $\kappa$ to a fixed precision $\epsilon$ depends on the number of tail samples $K$, not on the complexity of the full distribution’s support. In our design, $\alpha(x)$ (Eq.
(18)) is a smooth, bounded, monotonic function of $\kappa$, so we do not need infinite precision in $\hat\kappa$. We essentially perform a “soft classification” of the tail regime (heavy vs. light). The sample complexity to distinguish these regimes is low; even with moderate sample sizes (e.g., $N\approx 50$), the variance of $\hat\kappa$ is sufficiently small to steer $\alpha(x)$ toward the correct end of the interval $[1,2]$. This makes BoT highly sample-efficient compared to methods that attempt to model the full reward surface.

B.3 Experimental Setup

We conduct a comprehensive empirical analysis across a diverse suite of reasoning and instruction-following tasks. Our evaluation protocol encompasses four standard benchmarks, three instruction-tuned reference policies, and four diverse reward models. We evaluate performance on the following four benchmarks:

1. GSM8K (Grade School Math) (Cobbe et al., 2021): A standard benchmark for multi-step mathematical reasoning. We utilize the test split, comprising 1,319 high-quality grade-school math problems with integer answers. We define the oracle reward $r^\star(x,y)$ as a binary metric: $r^\star(x,y)=1$ if the final numerical answer is correct, and $0$ otherwise.

2. MMLU (Massive Multitask Language Understanding) (Hendrycks et al., 2021a): To assess knowledge-intensive reasoning in specialized STEM domains, we utilize the College Mathematics and College Chemistry test splits (approx. 100 questions each). Unlike GSM8K, these tasks use a multiple-choice format; correctness is determined by matching the model’s final selection to the ground-truth key.

3. MATH (Mathematics Aptitude Test of Heuristics) (Hendrycks et al., 2021b): For competition-level mathematics, we construct an evaluation set by randomly sampling 512 instances from the MATH test split. These problems are more challenging, requiring complex heuristic application and advanced reasoning beyond standard grade-school arithmetic.
Similar to GSM8K, the oracle reward is binary, based on an exact match of the final calculated result.

4. AlpacaFarm (Dubois et al., 2023): Designed to simulate “in-the-wild” interactions, this dataset covers diverse tasks from creative writing to open-ended QA. Following Coste et al. (2024), we use the standard test set of 805 prompts. Since open-ended tasks lack verifiable ground truth, we employ Alpaca-RM (Dubois et al., 2023)—fine-tuned from LLaMA-7B on human preference data—as the oracle reward $r^\star$ for evaluation.

For the reasoning tasks (GSM8K, MMLU, MATH) with verifiable ground truth, we consider three instruction-tuned reference policies ($\pi_{\mathrm{ref}}$): Gemma-2-2B (Gemma et al., 2024), Llama-3-3B (Grattafiori et al., 2024), and Mistral-7B (Jiang et al., 2023). For AlpacaFarm, we follow prior work and use the Pythia-1.4B model (Biderman et al., 2023). We pair these policies with four diverse proxy reward models ($\hat r$):

1. OASST-RM (1.4B): Based on Pythia-1.4B (Köpf et al., 2023).

2. Gemma-RM (2B): Trained from google/gemma-2b-it (Dong et al., 2023).

3. Llama-RM (3B): A generalizable RM trained from meta/llama-3.2-3B (Yang et al., 2024b).

4. ARMO-RM (8B): A mixture-of-experts model trained from meta/llama-3-8B (Wang et al., 2024).

We prompt all reference policies using zero-shot Chain-of-Thought (CoT) (Wei et al., 2022), i.e., without any in-context demonstrations, to elicit reasoning. To maximize response diversity, we set the temperature to 1.0 and top-p to 1.0 (no nucleus sampling), limiting generation to 512 tokens. For each unique combination of dataset, policy, and reward model, we pre-generate a large candidate pool of $2^{12}=4{,}096$ responses per prompt. For AlpacaFarm, we generate 12,600 responses per prompt following Coste et al. (2024). We simulate experiments by bootstrapping subsets of varying size $N$ from this pool, reporting results averaged over 10 independent trials to ensure statistical robustness. For the Hill estimator (Eq.
(17)) in BoT, we adaptively set the tail cutoff as $K=\max(1,\lfloor\sqrt{N}\rfloor)$ to balance bias and variance as $N$ scales. To calibrate the mapping function $\alpha(x)$, we select the pivot $\kappa_0$ based on the reward model’s typical behavior; specifically, we set $\kappa_0$ to the median of the estimated tail indices $\{\hat\kappa(x_i)\}_{i=1}^M$ over the set of $M$ prompts. This centers the sigmoid mapping, preventing the policy from saturating into purely optimistic or pessimistic modes, and ensures BoT remains sensitive to relative variations in tail risk across prompts.

Appendix C Additional Experiments and Ablation Studies

We include additional experimental results and comparisons: (i) comprehensive results for all configurations of reference policies and reward models on each dataset, analogous to Figures 2 and 3 in the main text; (ii) additional results for selected configurations in Figure 2, including the true reward $r^*$ over $N$, the proxy reward $\hat r$ over $N$, and the proxy reward over distortions (i.e., the RLHF objective in Eq. (12)); and (iii) ablation studies on the steering temperature $\lambda$ and the tail pivot $\kappa_0$ in Algorithm 1.

C.1 The Complete Results over all Datasets, Reference Policies, and Reward Models

We show the trajectory of the true reward ($r^*$) versus the proxy reward ($\hat r$) as the sample size $N$ scales exponentially from $2^0$ to $2^{10}$, i.e., $N\in\{2^0,2^1,\cdots,2^9,2^{10}\}$. These detailed results span all combinations of reference policies and reward models, i.e., {Gemma-2-2B, Llama-3-3B, Mistral-7B} × {OASST-RM, Gemma-RM, Llama-RM, ARMO-RM} for GSM8K (Figure A.4), MMLU (Figure A.5), and MATH (Figure A.6), and Pythia-1.4B × {OASST-RM, Gemma-RM, Llama-RM, ARMO-RM} for AlpacaFarm (Figure A.7).
Across these configurations, we observe distinct failure modes for the baselines: optimistic strategies such as BoN and sBoN consistently succumb to reward hacking at large $N$, while ITP remains overly conservative, often failing to adequately explore high-reward regions—a finding consistent with Huang et al. (2025b). In contrast, BoT demonstrates robust adaptivity: depending on the reward landscape, it either outperforms all baselines in high-variance regimes (e.g., Mistral-7B/Gemma-RM on MATH), matches the efficiency of BoN in light-tailed regimes (e.g., Llama-3-3B/Llama-RM on MMLU), or converges to the safety of ITP when heavy-tailed rewards are dominant (e.g., Gemma-2-2B/Gemma-RM on GSM8K).

Figure A.4: Comparison of optimistic (sBoN/BoN), pessimistic (ITP), and adaptive (BoT) strategies on GSM8K across different reference models and proxy rewards. Each curve traces the trajectory of true reward ($r^*$) vs. proxy reward ($\hat r$) as the sample size $N$ increases from $2^0$ to $2^{10}$, with marker size proportional to $N$. The smallest marker represents $N=1$ (standard sampling), while the largest represents $N=1024$.

Figure A.5: Comparison of optimistic (sBoN/BoN), pessimistic (ITP), and adaptive (BoT) strategies on MMLU across different reference models and proxy rewards. Each curve traces the trajectory of true reward ($r^*$) vs. proxy reward ($\hat r$) as the sample size $N$ increases from $2^0$ to $2^{10}$, with marker size proportional to $N$. The smallest marker represents $N=1$ (standard sampling), while the largest represents $N=1024$.

Figure A.6: Comparison of optimistic (sBoN/BoN), pessimistic (ITP), and adaptive (BoT) strategies on MATH across different reference models and proxy rewards. Each curve traces the trajectory of true reward ($r^*$) vs. proxy reward ($\hat r$) as the sample size $N$ increases from $2^0$ to $2^{10}$, with marker size proportional to $N$. The smallest marker represents $N=1$ (standard sampling), while the largest represents $N=1024$.
Figure A.7: Comparison of optimistic (sBoN/BoN), pessimistic (ITP), and adaptive (BoT) strategies on AlpacaFarm across different reference models and proxy rewards. Each curve traces the trajectory of true reward ($r^*$) vs. proxy reward ($\hat r$) as the sample size $N$ increases from $2^0$ to $2^{10}$, with marker size proportional to $N$. The smallest marker represents $N=1$ (standard sampling), while the largest represents $N=1024$.

C.2 Additional Results on Selected Configurations in Figure 2

We present the trajectories of the true reward $r^*$ and the proxy reward $\hat r$ as the sample size $N$ increases (top row), alongside the trade-off (bottom row) between the proxy reward and policy distortion (KL, $\chi^2$, and Tsallis divergences) on GSM8K (Figure A.8), MMLU (Figure A.9), and MATH (Figure A.10). While BoN maximizes the raw proxy reward, it suffers from reward hacking at large $N$, leading to lower true utility (accuracy). In contrast, BoT achieves the highest true reward among all strategies, surpassing both the conservative ITP and the optimistic baselines. The distortion analysis further validates the efficiency of BoT: it attains a higher proxy reward than ITP at the same level of distortion, yet maintains significantly tighter control over divergence than sBoN or BoN as $N$ scales, effectively balancing exploration with robustness.

Figure A.8: The trajectories of the true reward $r^*$ and the proxy reward $\hat r$ as the sample size $N$ increases (top row), alongside the trade-off (bottom row) between the proxy reward and policy distortion (KL, $\chi^2$, and Tsallis divergences) on GSM8K. We compare optimistic (sBoN, BoN), pessimistic (ITP), and adaptive (BoT) strategies using selected configurations from Figure 2.

Figure A.9: The trajectories of the true reward $r^*$ and the proxy reward $\hat r$ as the sample size $N$ increases (top row), alongside the trade-off (bottom row) between the proxy reward and policy distortion (KL, $\chi^2$, and Tsallis divergences) on MMLU.
We compare optimistic (sBoN, BoN), pessimistic (ITP), and adaptive (BoT) strategies using selected configurations from Figure 2.

Figure A.10: The trajectories of the true reward $r^*$ and the proxy reward $\hat r$ as the sample size $N$ increases (top row), alongside the trade-off (bottom row) between the proxy reward and policy distortion (KL, $\chi^2$, and Tsallis divergences) on MATH. We compare optimistic (sBoN, BoN), pessimistic (ITP), and adaptive (BoT) strategies using selected configurations from Figure 2.

C.3 Ablation Studies

We conduct ablation studies on the GSM8K dataset to analyze sensitivity to the steering temperature $\lambda$ and the tail pivot $\kappa_0$, using Gemma-2-2B as the reference policy and ARMO-RM as the reward model. For all experiments in this section, we fix the sample size at $N=1024$.

Steering Temperature $\lambda$ (Figure A.11, Left). We first fix $\kappa_0$ to the median of $\hat\kappa(x)$ across prompts and sweep $\lambda\in\{0.001,\ldots,1.0\}$. In the high-temperature regime (large $\lambda$), the re-weighting term $\exp(\hat r/\lambda)$ flattens, causing the candidate distribution to approach uniformity; consequently, the behaviors of sBoN, ITP, and BoT converge. As $\lambda$ decreases, the strategies diverge in how they prioritize high-reward responses. ITP exhibits stable performance due to its linear constraint, whereas sBoN becomes increasingly aggressive, converging toward the hard-maximization behavior of standard BoN (i.e., placing all weight on the single highest-reward candidate). Crucially, while BoT does not always maximize the raw proxy reward—thereby avoiding the over-optimization trap—its adaptive $\alpha(x)$ allows it to achieve higher true reward (accuracy) by mitigating the tail risks that optimistic baselines ignore.

Tail Pivot $\kappa_0$ (Figure A.11, Right). Next, we fix $\lambda=0.01$ and sweep $\kappa_0\in\{0.001,\ldots,5.0\}$. Since sBoN and ITP are independent of $\kappa_0$, they appear as constant baselines.
The parameter $\kappa_0$ governs the transition point of our adaptive strategy:

• Optimistic Limit ($\kappa_0\to\infty$): As $\kappa_0$ increases, the term $\frac{\hat\kappa(x)}{\hat\kappa(x)+\kappa_0}$ vanishes, driving $\alpha(x)=1+\frac{\hat\kappa(x)}{\hat\kappa(x)+\kappa_0}\to 1$. In this regime, BoT mimics the optimistic behavior of sBoN, as reflected in the proxy reward curves.

• Pessimistic Limit ($\kappa_0\to 0$): Conversely, as $\kappa_0$ decreases below the median of $\hat\kappa(x)$, the ratio approaches $1$, driving $\alpha(x)=1+\frac{\hat\kappa(x)}{\hat\kappa(x)+\kappa_0}\to 2$. Here, BoT shifts toward the conservative, pessimistic profile of ITP.

The results demonstrate that by selecting an intermediate $\kappa_0$ (calibrated to the median tail index), BoT successfully interpolates between these regimes, outperforming static baselines by adapting to the risk profile of each prompt.

Figure A.11: Ablation analysis of the steering temperature $\lambda$ (left two figures) and the tail pivot $\kappa_0$ (right two figures). We compare optimistic (sBoN, BoN), pessimistic (ITP), and adaptive (BoT) strategies. The black dashed line marks the median of the estimated tail indices $\hat\kappa(x)$ across all prompts.
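The adaptive rule discussed above reduces to two small formulas: the pivot mapping $\alpha(x)=1+\hat\kappa(x)/(\hat\kappa(x)+\kappa_0)$ and the Tsallis weight $\exp_\alpha(\hat r/\lambda)$ of Eq. (A.36). The following is our own illustrative sketch of both (function names are ours, not the authors' released code):

```python
def alpha_map(kappa_hat: float, kappa0: float) -> float:
    """Adaptive Tsallis order: alpha(x) = 1 + kappa_hat / (kappa_hat + kappa0).
    kappa_hat >> kappa0 (heavy tail) -> alpha -> 2 (pessimistic, ITP-like);
    kappa_hat << kappa0 (light tail) -> alpha -> 1 (optimistic, sBoN-like)."""
    return 1.0 + kappa_hat / (kappa_hat + kappa0)

def tsallis_weight(r: float, alpha: float, lam: float) -> float:
    """exp_alpha(r/lam) = [1 + (alpha-1) * r/lam]_+^(1/(alpha-1)) per Eq. (A.36),
    for alpha in (1, 2]. As alpha -> 1 this approaches exp(r/lam);
    at alpha = 2 it is the linear ITP-style weight 1 + r/lam."""
    base = 1.0 + (alpha - 1.0) * r / lam
    return max(base, 0.0) ** (1.0 / (alpha - 1.0))

if __name__ == "__main__":
    print(alpha_map(10.0, 0.01))          # heavy tail vs. pivot: close to 2
    print(alpha_map(0.01, 10.0))          # light tail vs. pivot: close to 1
    print(tsallis_weight(0.8, 2.0, 0.5))  # linear weight: 1 + 0.8/0.5
```

At $\alpha=2$ the candidate weights are linear in $\hat r$; as $\alpha$ is pushed toward 1 by a light-tailed $\hat\kappa(x)$, the weights sharpen toward the exponential soft-BoN tilting, which is exactly the interpolation the ablation above sweeps through.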