
Paper deep dive

DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

Mingxi Zou, Jiaxiang Chen, Junfan Li, Langzhang Liang, Qifan Wang, Xu Yinghui, Zenglin Xu

Year: 2026 · Venue: arXiv preprint · Area: cs.LG · Type: Preprint · Embeddings: 189

Abstract

Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC), a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a KL-robust (entropic) satisfaction objective, and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.

Tags

ai-safety (imported, 100%) · alignment-training (suggested, 80%) · cslg (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/13/2026, 12:40:10 AM

Summary

DARC (Disagreement-Aware Alignment via Risk-Constrained Decoding) is an inference-time, retraining-free method for aligning large language models. It addresses the brittleness of mean-reward maximization in the presence of heterogeneous human preferences by framing response selection as a distributionally robust, risk-sensitive decision-making problem. DARC uses KL-robust entropic objectives and lower confidence bounds to penalize high-disagreement candidates, effectively reducing tail risk and improving robustness without requiring model retraining.

Entities (5)

DARC · method · 100%
DPO · alignment-method · 98%
RLHF · alignment-method · 98%
Distributionally Robust Optimization · optimization-framework · 95%
KL-robust entropic objective · mathematical-objective · 95%

Relation Signals (3)

DARC optimizes KL-robust entropic objective

confidence 95% · DARC reranks candidates by maximizing a KL-robust (entropic) satisfaction objective

DARC improves Alignment Robustness

confidence 90% · Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk

DARC is based on Distributionally Robust Optimization

confidence 90% · We provide theoretical characterization linking this decoding rule to... KL-based distributionally robust optimization.

Cypher Suggestions (2)

Identify the mathematical objectives used by DARC · confidence 90% · unvalidated

MATCH (d:Method {name: 'DARC'})-[:OPTIMIZES]->(o:Objective) RETURN o.name

Find all methods related to preference-based alignment · confidence 80% · unvalidated

MATCH (m:Method)-[:IS_A]->(a:AlignmentMethod) RETURN m.name
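If the extracted graph is stored in Neo4j, the suggested (still unvalidated) queries above could be executed with the official Neo4j Python driver. A minimal sketch; the URI, credentials, and node labels are assumptions, not confirmed by this record:

```python
# The two suggested Cypher queries from this record, keyed for convenience.
SUGGESTED_QUERIES = {
    "darc_objectives":
        "MATCH (d:Method {name: 'DARC'})-[:OPTIMIZES]->(o:Objective) RETURN o.name",
    "alignment_methods":
        "MATCH (m:Method)-[:IS_A]->(a:AlignmentMethod) RETURN m.name",
}

def run_suggestion(key, uri="bolt://localhost:7687", auth=("neo4j", "password")):
    """Execute one suggested query and return its first-column values.

    Connection details are hypothetical; requires `pip install neo4j`.
    """
    from neo4j import GraphDatabase  # imported lazily so the module loads without the driver
    with GraphDatabase.driver(uri, auth=auth) as driver:
        records, _, _ = driver.execute_query(SUGGESTED_QUERIES[key])
        return [record[0] for record in records]
```

Since both suggestions are marked unvalidated, results should be checked against the entity list before trusting the relationship labels.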

Full Text

188,806 characters extracted from source content.


DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

Mingxi Zou, Jiaxiang Chen, Junfan Li, Langzhang Liang, Qifan Wang, Xu Yinghui, Zenglin Xu

Abstract. Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC), a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a KL-robust (entropic) satisfaction objective, and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.

Figure 1: Score distribution shift. Ridge plot showing human score densities on the high-disagreement subset. DARC variants (blue) shift the distribution to the right (higher mean μ) compared to the baseline (grey), with reduced spread (lower σ), indicating both increased satisfaction and reduced disagreement.

1 Introduction

Preference data has become the dominant supervision signal for aligning large language models (Ouyang et al., 2022; Stiennon et al., 2020).
Most pipelines—RLHF with reward modeling and RL optimization (Christiano et al., 2017; Schulman et al., 2017), offline preference objectives such as DPO and its refinements (Rafailov et al., 2023; Meng et al., 2024), and reference-free single-stage variants such as ORPO (Hong et al., 2024)—share a common abstraction: preferences are treated as noisy observations of a single latent scalar utility (e.g., Bradley–Terry) (Bradley and Terry, 1952). This abstraction largely persists in newer reformulations such as KTO and IPO (Ethayarajh et al., 2024; Garg et al., 2025), and even when reward models are made multi-dimensional via multi-head/objective designs (Wang et al., 2024a; Li et al., 2025; Yang et al., 2024). Yet, treating feedback as perturbations around a single scalar provides limited guidance for inference-time response selection under heterogeneous preferences (Hung et al., 2025), and it also lacks a unifying robust-optimization account for common risk-penalized decoding heuristics. However, real-world preferences are often heterogeneous rather than i.i.d. noise: annotators disagree for systematic reasons (Zhang et al., 2024; Chen et al., 2024). Empirically, human ratings show substantial variance even on the raw Top-K candidate pool (Appendix H.1), suggesting that uncertainty is intrinsic rather than an artifact of our selection rule. Under such plurality, maximizing the average reward μ can be brittle (Casper et al., 2023), and the issue is exacerbated by proxy over-optimization, which can improve an imperfect preference proxy while degrading the underlying target (Gao et al., 2023; Rafailov et al., 2024). Recent work further shows that proxy misspecification can induce inference-time reward hacking: as best-of-N (a widely used decoding primitive (Sun et al., 2024)) or soft best-of-N becomes greedier, true utility can increase and then inevitably degrade (Huang et al., 2025a).
Best-of-Poisson and HedgeTune mitigate this effect by tuning inference-time parameters (Khalaf et al., 2025); however, they primarily model risk through proxy–distortion tradeoffs under a single reward signal, rather than preference heterogeneity. Closely related, pessimistic best-of-N rules penalize atypical candidates via an auxiliary error model to mitigate reward hacking (Anonymous, 2025), but they target distributional uncertainty of the reward model (e.g., atypicality/OOD) rather than disagreement-grounded risk across users. In parallel, robustness has been pursued through training-time objectives such as robust DPO under noisy preferences (Wu et al., 2024), which improve robustness via retraining and noise assumptions; group-robust objectives that protect minority preference groups (Ramesh et al., 2024), which rely on access to group structure; and uncertainty-aware reward modeling (Banerjee and Gopalan, 2024), which quantifies estimation uncertainty but does not by itself specify how to select responses under inference-time proxy shift (Ichihara et al., 2025). Taken together, these lines suggest a common lesson: when preferences are plural, the relevant object is not a deterministic score, but a random variable over users and annotation noise. Yet, a principled inference-time selection rule that is explicitly risk-constrained under heterogeneous preferences remains underdeveloped. We therefore study inference-time alignment under heterogeneous preferences through the lens of risk-constrained decision making (Chow et al., 2018; Tamar et al., 2015). Given a fixed candidate set and noisy preference or reward scores, we derive a finite-sample pessimistic rule based on a lower confidence bound, yielding high-probability guarantees for selecting a competitive response while controlling tail risk across prompts. 
This leads to Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC), an inference-time-only, retraining-free procedure that plugs into any LM and preference estimator. DARC grounds risk in multi-annotator disagreement (Zhang et al., 2024) instantiated via validated proxy signals, improving robustness on high-disagreement prompts (Fig. 1). We include representative cases where Best-of-K is polarizing or unstable, whereas DARC yields consistently preferred responses (Appendix H.10). Beyond the statistical view, we give a distributionally robust optimization (DRO) characterization (Wiesemann et al., 2014; Rahimian and Mehrotra, 2019), viewing decoding as maximizing the worst-case expected satisfaction over local divergence neighborhoods (Namkoong and Duchi, 2016; Duchi et al., 2021). This yields a practical KL-robust instantiation and situates widely used mean–dispersion scoring rules within the same DRO perspective, clarifying the conditions under which they arise as principled risk-sensitive criteria (Duchi and Namkoong, 2019).

Contributions.
• Method. We formulate inference-time alignment as risk-constrained decision making under heterogeneous preferences, with risk induced by preference uncertainty and annotator disagreement.
• Theory. We connect LCB-based uniform pessimism to a KL-DRO view, yielding a closed-form entropic decoding objective and its constrained/penalized variants via an entropic risk premium.
• Empirics. Across benchmarks, DARC improves disagreement and prompt-level tail risk with competitive mean quality; a dual-robust multi-scorer extension hedges scorer shift and proxy over-optimization, with a KL-regularized DRO interpretation over scorers.

2 Problem setup

Let s denote a prompt (context) and let Y(s) be a realized candidate set produced by a fixed generator (e.g., sampling, beam variants, or a proposal model), with K := |Y(s)|. Conditioning on the candidate set.
The generator may be stochastic; throughout the analysis we condition on the realized Y(s). All probabilities below are taken over the evaluation randomness (human or otherwise), holding Y(s) fixed. Latent satisfaction under heterogeneous preferences. For each (s, y), let R(s,y) ∈ ℝ denote a (latent) user-satisfaction random variable capturing preference heterogeneity and evaluation noise, with mean μ(s,y) := E[R(s,y)]. Intuitively, μ(s,y) measures average quality. KL-robust (entropic) value and risk premium. For β > 0, define the entropic value

V_β(s,y) := −(1/β) log E[exp(−β R(s,y))],  (1)

which is equivalent to a KL-based distributionally robust objective (Section 3.2). We define the entropic risk premium

RP_β(s,y) := μ(s,y) − V_β(s,y) ≥ 0.  (2)

Decision problem (risk-aware decoding). Conditioning on (s, Y(s)), each candidate y ∈ Y(s) induces an (unknown) satisfaction distribution over users/raters. Our population objective is to select an output by solving

y⋆ ∈ argmax_{y ∈ Y(s)} V_β(s,y),  (3)

where the entropic risk measure V_β is defined in (1). We further consider explicit risk control via a budget or a penalty:

y⋆_τ ∈ argmax_{y ∈ Y(s)} V_β(s,y)  s.t.  RP_β(s,y) ≤ τ,  (4)

or in penalized (Lagrangian) form argmax_{y ∈ Y(s)} V_β(s,y) − λ RP_β(s,y).

3 Guarantees via Lower Confidence Bounds

This section provides a statistical justification for disagreement-aware decoding by deriving high-probability lower confidence bounds (LCBs) on expected satisfaction under heterogeneous preferences. To contextualize the resulting pessimistic rules, we also give complementary distributionally robust optimization (DRO) characterizations. Scalar satisfaction samples (guarantee setting). For each candidate y ∈ Y(s), we observe n i.i.d.
scalar satisfaction samples {R_i(s,y)}_{i=1}^n drawn from the (unknown) distribution of R(s,y):

R_i(s,y) ∼ i.i.d. R(s,y),  i = 1, …, n.  (5)

In this regime, the empirical mean and standard deviation μ̂_n(s,y) and σ̂_n(s,y) estimate μ(s,y) and σ(s,y). Moreover, the plug-in estimator of (1) is

V̂_β(s,y) := −(1/β) log( (1/n) Σ_{i=1}^n exp(−β R_i(s,y)) ),  (6)

with the empirical risk premium

R̂P_β(s,y) := μ̂_n(s,y) − V̂_β(s,y) ≥ 0.  (7)

In practice, one may operate with approximate (proxy) estimates of these empirical quantities; the analysis below does not rely on proxy scores being independent, and any such approximation induces only an additive slack in the resulting LCB objective (Appendix A.12).

3.1 Estimation risk: uniform LCB and a mean–dispersion surrogate

Fix a prompt s and condition on Y = Y(s) with K := |Y|. For readability, write R_i(y) for R_i(s,y), and μ(y), σ(y) for μ(s,y), σ(s,y). For each y ∈ Y, we observe n independent satisfaction samples with mean μ(y) and variance σ²(y), and denote the empirical mean and standard deviation by μ̂_n(y) := (1/n) Σ_{i=1}^n R_i(y) and σ̂_n²(y) := (1/(n−1)) Σ_{i=1}^n (R_i(y) − μ̂_n(y))². Bridge to pairwise preferences. Our scalar-sample analysis can be viewed as operating on standard scalarizations of pairwise preferences (e.g., win-rate or fitted BT/Thurstone scores); see Appendix A.14. Empirically, our main conclusions are stable when replacing absolute scalar ratings with a pairwise scalarization against the base response (Table 6). Remark 3.1 (Shared annotators across candidates). Our guarantee requires independence across i for each fixed candidate y.
Annotator overlap across candidates may induce dependence between {(μ̂_n(y), σ̂_n(y))}_{y ∈ Y}, but the uniform guarantee below follows from per-candidate concentration with a union bound and does not require independence across candidates. Assumption 3.2 (Bounded rewards). For all y ∈ Y, the satisfaction samples are almost surely bounded:

R_i(y) ∈ [a, b]  a.s.  (8)

Boundedness matches typical rating scales and can be enforced for proxy scores via truncation (Appendix H.1); an analogous LCB holds under sub-Gaussian noise (Appendix A.4). Proposition 3.3 (Uniform LCB). There exists an absolute constant c > 0 such that for any δ ∈ (0,1), with probability at least 1 − δ, simultaneously for all y ∈ Y(s),

μ(y) ≥ μ̂_n(y) − c σ̂_n(y) √(log(K/δ)/n) − c (b − a) log(K/δ)/n.  (9)

We denote the right-hand side by LCB_δ(y). Proof. See Appendix A.2. Remark 3.4 (Variance governs estimation hardness). The LCB form (9) highlights a statistical driver of selection risk: the dominant estimation error scales with the standard deviation σ(y). The lower-order term decays as O(1/n), whereas the leading term scales as O(σ(y)/√n). Consequently, low-disagreement candidates (small σ(y)) admit substantially tighter confidence bounds at the same sample size, while controversial candidates require many more samples to certify their mean. Penalizing σ̂_n(y) therefore has an identification rationale: it discourages selecting candidates whose true quality is intrinsically harder to verify from limited feedback. Lower-tail interpretation. Maximizing a lower confidence bound is a principled way to avoid candidates with poor lower-tail satisfaction even when μ̂_n(y) is high (Boucheron et al., 2013; Vershynin, 2018). This is a statistical conservatism argument rather than an equivalence to coherent tail-risk measures such as CVaR (Artzner et al., 1999; Rockafellar et al., 2000).
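Concretely, the uniform LCB in (9) and the decoder it induces can be sketched as follows. The constant c, the rating range [a, b], and the helper names are illustrative choices, since the paper treats the LCB coefficient as a tunable risk-budget knob rather than fixing tight constants:

```python
import math

def uniform_lcb(samples_by_candidate, a=0.0, b=10.0, delta=0.05, c=1.0):
    """Per-candidate lower confidence bounds in the spirit of Prop. 3.3.

    samples_by_candidate: dict mapping candidate id -> list of n ratings in [a, b].
    Returns dict: candidate -> mean - c*std*sqrt(log(K/delta)/n)
                               - c*(b-a)*log(K/delta)/n.
    """
    K = len(samples_by_candidate)
    lcbs = {}
    for y, r in samples_by_candidate.items():
        n = len(r)
        mean = sum(r) / n
        # Sample variance with the n-1 denominator, as in sigma_hat_n^2.
        var = sum((x - mean) ** 2 for x in r) / (n - 1) if n > 1 else 0.0
        log_term = math.log(K / delta)
        lcbs[y] = (mean
                   - c * math.sqrt(var) * math.sqrt(log_term / n)
                   - c * (b - a) * log_term / n)
    return lcbs

def lcb_decode(samples_by_candidate, **kw):
    """Pessimistic selection: maximize the LCB rather than the empirical mean."""
    lcbs = uniform_lcb(samples_by_candidate, **kw)
    return max(lcbs, key=lcbs.get)
```

Note that the lower-order (b − a) term is identical across candidates, so it shifts every LCB equally and never changes the argmax, consistent with the calibration discussion in the text.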
See Appendix A.4 for details. On constants and practical calibration. The uniform LCB in (9) is obtained via standard concentration plus a union bound over K candidates and is therefore conservative. We do not claim tight constants; instead, the bound motivates the functional form

λ_δ ∝ √(log(K/δ)/n),  (10)

and the lower-order term c (b − a) log(K/δ)/n is uniform across candidates and does not affect argmax_y decisions. In practice, we treat the coefficient as a risk-budget knob (optionally scaled by a factor α) and fix it via a small held-out calibration, while reporting sensitivity of the resulting trade-off. Combined with the σ̂_n(y) factor in (9), the uncertainty penalty scales as Õ(σ̂_n(y) √(log K / n)), matching the intuition that more controversial candidates require larger risk budgets to be selected. LCB decoding and a mean–dispersion surrogate. We define the LCB decoder as

y_LCB ∈ argmax_{y ∈ Y} LCB_δ(y).  (11)

On the high-probability event of Proposition 3.3, this rule maximizes a valid lower bound on the true mean satisfaction μ(y). Crucially, under bounded ratings, maximizing LCB_δ(y) is equivalent (up to a constant) to a mean–dispersion surrogate:

argmax_{y ∈ Y} ( μ̂_n(y) − λ σ̂_n(y) ).

We provide the derivation in Corollary A.2 (Appendix A.3) and discuss the theoretical connection between this σ-penalty (estimation uncertainty) and the entropic σ²-penalty (risk aversion) in Remark A.3.

3.2 Distributional risk: DRO characterizations of pessimistic value

KL-robust (entropic) decoding. We first consider a relative-entropy (KL) robustification of expected satisfaction. Let R ∈ ℝ be the (latent) satisfaction with reference distribution P. We consider worst-case distributions Q over R that are absolutely continuous w.r.t. P.
For β > 0, define the KL-regularized robust value

Rob^KL_β(P; R) := inf_{Q ≪ P} { E_Q[R] + β⁻¹ D_KL(Q‖P) },  (12)

where Q ≪ P ensures the KL term is well-defined. This is standard in risk-sensitive control and large deviations (Dupuis and Ellis, 2011; Hansen and Sargent, 2011). Theorem 3.5 (KL-robust value equals an entropic objective). For any β > 0 and any P such that E_P[exp(−β R)] < ∞,

Rob^KL_β(P; R) = −(1/β) log E_{R∼P}[exp(−β R)].  (13)

In particular, for the empirical rater distribution P̂_n^y,

Rob^KL_β(P̂_n^y; R) = −(1/β) log( (1/n) Σ_{i=1}^n exp(−β R_i(y)) ) = V̂_β(s,y).

See Appendix A.5 for the detailed proof. Relation to Expected Utility Theory (CARA). Maximizing the entropic value in (13) is equivalent to maximizing expected utility under a constant absolute risk aversion (CARA) utility, u(x) = −exp(−β x). A useful invariance is translation equivariance: for any constant c, V_β(R + c) = V_β(R) + c, hence the risk premium μ(R) − V_β(R) is invariant to additive reward shifts. This is desirable in RLHF-style pipelines where learned reward models are often only identifiable up to an affine transformation. χ²-DRO yields a mean–dispersion special case. We now give a complementary robust-optimization view (Wiesemann et al., 2014; Rahimian and Mehrotra, 2019): under a χ²-divergence ambiguity set around P (Ben-Tal and Nemirovski, 2002), the worst-case expected satisfaction admits a mean–dispersion form. (Definitions and tightness conditions are deferred to Appendix B.) Proposition 3.6 (χ²-DRO robust mean admits a mean–dispersion form). Let R be square-integrable under P with mean μ_P := E_P[R] and variance σ_P² := Var_P(R). For any ρ ≥ 0,

inf_{Q ∈ 𝒬_ρ(P)} E_Q[R] ≥ μ_P − √ρ σ_P.
(14)

Specializing to P = P̂_n^y yields the empirical closed form

inf_{Q ∈ 𝒬_ρ(P̂_n^y)} E_Q[R] = μ̂_n(y) − √ρ v̂_n(y),  (15)

whenever the extremal density is nonnegative on the empirical support. Otherwise, the mean–dispersion form remains a valid lower bound (Remark B.1), ensuring the rule remains pessimistic. Proof and tightness conditions. See Appendix B for the extremal density characterization and a sufficient nonnegativity regime (e.g., bounded ratings imply tightness for small ρ). A unified pessimistic value (LCB as calibrated DRO). Define the candidate-wise pessimistic value

V_δ(y) := μ̂_n(y) − c σ̂_n(y) √(log(K/δ)/n) − c (b − a) log(K/δ)/n,  (16)

which is a uniform lower confidence bound (LCB) under Proposition 3.3 (Appendix A.2). On the other hand, Proposition 3.6 implies that, up to the lower-order term in (16), V_δ(y) matches the χ²-DRO robust mean of the empirical rater distribution with radius

ρ_δ ≍ log(K/δ)/n,  (17)

since √ρ_δ σ̂_n(y) ≍ σ̂_n(y) √(log(K/δ)/n) (see Appendix C for a discussion distinguishing statistical estimation risk from intrinsic disagreement). This yields a unified interpretation of disagreement-aware decoding as calibrated pessimism: a statistical LCB rule that is equivalent (up to lower-order terms) to optimizing a local DRO objective. Thus, disagreement-aware decoding can be viewed equivalently as (i) maximizing a high-probability LCB on expected satisfaction (statistical pessimism), or (ii) maximizing a distributionally robust worst-case expected satisfaction over a local χ² neighborhood (adversarial robustness).

4 Disagreement-Aware Risk-Constrained Decoding

Motivated by the LCB and DRO characterizations in §3–§3.2, we now present practical decoding rules that implement the KL-robust entropic objective (and its constrained/penalized variants) over a finite candidate set.
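A minimal sketch of the entropic reranking rule and its risk-premium controls (Eqs. (6), (7) and the §4.1 selection rules), using a numerically stable log-sum-exp. Toy sample lists stand in for preference data, and the function names are illustrative rather than from the paper's released code:

```python
import math

def entropic_value(samples, beta):
    """V_hat_beta = -(1/beta) * log((1/n) * sum_i exp(-beta * R_i)), Eq. (6).

    Computed with a max-shift so exp() never overflows or underflows badly.
    """
    n = len(samples)
    shift = max(-beta * r for r in samples)
    lse = shift + math.log(sum(math.exp(-beta * r - shift) for r in samples))
    return -(lse - math.log(n)) / beta

def risk_premium(samples, beta):
    """RP_hat_beta = empirical mean - entropic value (>= 0 by Jensen), Eq. (7)."""
    return sum(samples) / len(samples) - entropic_value(samples, beta)

def darc_select(samples_by_candidate, beta=1.0, tau=None, lam=0.0):
    """Constrained (tau) or penalized (lam) entropic decoding.

    With tau set, infeasible constraints fall back to the full candidate set,
    mirroring the fallback behavior described in §4.1.
    """
    cands = list(samples_by_candidate)
    if tau is not None:
        feasible = [y for y in cands
                    if risk_premium(samples_by_candidate[y], beta) <= tau]
        cands = feasible or cands  # fall back to Y(s) if no candidate satisfies tau
    return max(cands,
               key=lambda y: entropic_value(samples_by_candidate[y], beta)
               - lam * risk_premium(samples_by_candidate[y], beta))
```

On constant samples the entropic value equals the mean and the risk premium is zero; as spread grows, the premium grows, so a τ-budget excludes polarizing candidates first.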
For completeness, we also include a second-moment pessimistic surrogate implied by the LCB analysis in (42) as a computationally convenient ablation.

4.1 From preference maximization to risk controls

Standard reward-based reranking selects the highest estimated mean,

y_mean(s) ∈ argmax_{y ∈ Y(s)} μ̂(s,y),  (18)

which optimizes average preference but may favor brittle outputs when disagreement is large. Risk-sensitive decoding (primary rule). Our primary decoder selects the candidate with the largest scorer-robust entropic value:

y_Entropic(s) ∈ argmax_{y ∈ Y(s)} V̂_β(s,y),  (19)

where V̂_β is the entropic value defined in (6). Risk-constrained decoding via an entropic risk premium. We define explicit deployment knobs using the entropic risk premium R̂P_β(s,y) from (7). We consider two standard forms:

y_τ(s) ∈ argmax_{y ∈ Y(s)} V̂_β(s,y)  s.t.  R̂P_β(s,y) ≤ τ,  (20)

y_λ(s) ∈ argmax_{y ∈ Y(s)} V̂_β(s,y) − λ R̂P_β(s,y),  (21)

where the penalty form is a Lagrangian relaxation. If the feasible set {y ∈ Y(s) : R̂P_β(s,y) ≤ τ} is empty, we fall back to Y(s), consistent with Algorithm 1. The constrained and penalized forms provide two equivalent ways to tune the reward–risk trade-off; their finite-candidate-set relationship is given in Appendix A.6. Disagreement as a risk proxy: interpretation and limitations. We use disagreement as a human-centric signal of preference heterogeneity, and implement risk sensitivity through the KL-robust entropic value (19) (a standard convex risk measure). The mean–dispersion form arises only as a finite-sample surrogate (via LCB pessimism) or as a χ²-DRO special case, rather than being the defining principle of our method.
Importantly, we do not claim that variance is a coherent risk measure (e.g., CVaR) (Artzner et al., 1999; Rockafellar et al., 2000) or that σ is equivalent to a tail-risk functional in full generality. Instead, §3 shows that penalizing σ̂ arises naturally from maximizing a statistically justified lower confidence bound (LCB) on μ(s,y), yielding a principled pessimistic selection rule under finite data. Moreover, §3.2 provides a complementary distributionally robust optimization characterization (Wiesemann et al., 2014; Rahimian and Mehrotra, 2019).

Table 1: Evaluation on MT-Bench and AlpacaEval 2.0. Results across two generator families. We report mean reward (Reward), proxy risk (Risk↓), risk–reward tradeoff score (Tradeoff↑), and average length (Len(tok)). Column groups per row, in order: Llama-3.1-8B-Instruct / MT-Bench, Llama-3.1-8B-Instruct / AlpacaEval 2.0, Qwen2.5-7B-Instruct / MT-Bench, Qwen2.5-7B-Instruct / AlpacaEval 2.0.

Dataset: Overall / Inference
Base (Best-of-K) | 4.53 6.50 -8.47 315 | 4.52 7.46 -10.40 331 | 4.09 3.54 -2.99 244 | 3.29 3.11 -2.93 256
BoP (HedgeTune) | 4.18 5.86 -7.54 316 | 4.12 6.24 -8.36 330 | 3.84 3.10 -2.36 231 | 3.14 2.68 -2.22 261
Caution | 4.09 5.43 -6.77 317 | 4.27 5.91 -7.55 320 | 3.89 3.21 -2.53 243 | 3.09 2.62 -2.15 259
DeAL | 5.59 7.98 -10.37 309 | 5.66 8.74 -11.81 311 | 4.64 4.22 -3.80 256 | 4.02 3.74 -3.46 267
MC-Dropout | 4.41 5.78 -7.15 315 | 4.37 6.31 -8.25 326 | 3.87 3.38 -2.89 247 | 3.13 2.79 -2.45 253
RBoN | 4.43 5.62 -6.81 309 | 4.31 5.82 -7.33 321 | 3.93 3.22 -2.51 241 | 3.17 2.71 -2.25 257
DARC (ours) | 4.46 5.41 -6.36 315 | 4.25 5.67 -7.33 319 | 4.01 3.14 -2.27 243 | 3.14 2.66 -2.18 256
DARC-τ (ours) | 4.42 5.45 -6.48 313 | 4.49 6.21 -7.93 321 | 3.94 2.99 -2.04 232 | 3.12 2.48 -1.85 257
DARC-ε (ours) | 4.46 5.29 -6.12 314 | 4.38 5.60 -6.82 319 | 3.96 2.96 -1.96 243 | 3.18 2.36 -1.54 261

Dataset: Overall / Training
cDPO (Best-of-K) | 5.88 5.96 -6.04 300 | 5.86 6.97 -8.08 309 | 8.61 5.76 -2.91 258 | 5.30 6.63 -7.96 278
cDPO + DARC-ε | 5.81 5.65 -5.49 311 | 5.83 6.57 -7.31 290 | 8.53 5.49 -2.45 278 | 5.22 6.40 -7.58 274
rDPO (Best-of-K) | 5.53 5.57 -5.61 260 | 11.62 7.10 -2.58 316 | 8.73 5.94 -3.15 294 | 5.77 6.90 -8.03 282
rDPO + DARC-ε | 5.42 5.16 -4.90 258 | 11.58 6.78 -1.98 325 | 8.62 5.69 -2.76 288 | 5.68 6.68 -7.68 232

Dataset: High-Variance (Top 20%) / Inference
Base (Best-of-K) | 5.79 9.91 -14.03 382 | 7.48 10.22 -12.96 381 | 5.12 6.00 -7.12 308 | 2.79 5.46 -8.13 263
BoP (HedgeTune) | 5.45 7.90 -10.35 382 | 7.28 8.12 -8.96 381 | 4.67 5.12 -5.57 307 | 2.53 4.09 -5.65 265
Caution | 5.41 7.42 -9.43 382 | 7.17 8.34 -9.51 381 | 4.80 5.39 -5.98 307 | 2.30 4.07 -5.84 264
DeAL | 5.96 9.71 -13.46 374 | 7.89 10.15 -12.41 371 | 5.75 8.43 -11.11 316 | 3.45 6.89 -10.33 256
MC-Dropout | 5.20 7.48 -9.76 384 | 7.19 8.15 -9.11 380 | 4.75 5.25 -5.25 306 | 2.35 4.00 -5.65 262
RBoN | 5.28 7.30 -9.32 372 | 7.22 8.40 -9.58 376 | 4.90 5.45 -6.00 300 | 2.25 4.15 -6.05 258
DARC (ours) | 5.32 7.33 -9.34 382 | 7.24 8.10 -8.96 381 | 4.99 5.05 -5.13 307 | 2.34 4.42 -6.50 263
DARC-τ (ours) | 5.43 7.90 -10.37 382 | 7.21 8.55 -9.89 381 | 4.78 4.85 -4.92 294 | 2.31 3.84 -5.37 249
DARC-ε (ours) | 5.49 7.00 -8.51 382 | 7.26 8.10 -8.94 381 | 5.03 4.77 -4.51 306 | 2.59 3.76 -4.93 265

Dataset: High-Variance (Top 20%) / Training
cDPO (Best-of-K) | 7.54 8.39 -9.24 312 | 8.88 9.40 -9.92 328 | 5.52 5.67 -5.82 317 | 7.17 7.30 -7.43 196
cDPO + DARC-ε | 7.51 8.11 -8.71 323 | 8.81 9.21 -9.61 328 | 5.42 5.17 -4.92 316 | 7.12 7.29 -7.46 196
rDPO (Best-of-K) | 6.58 7.98 -9.38 337 | 11.60 9.05 -6.50 347 | 5.79 5.34 -4.89 304 | 7.80 7.72 -7.64 172
rDPO + DARC-ε | 6.41 7.16 -7.91 316 | 11.56 8.65 -5.74 329 | 5.56 5.03 -4.50 301 | 7.70 7.28 -6.86 164

4.2 Practical decoding: ε-tie breaking

Proxy robustness. When (μ̂, σ̂) are obtained from scalable proxy scorers rather than i.i.d. human samples, a uniform proxy-error assumption yields a robust LCB guarantee; we defer the formal statement and proof to Appendix A.12. ε-rule as a constrained optimization.
Our ε-tie-breaking is equivalent to

y_ε(s) ∈ argmin_{y ∈ Y(s)} σ̂(s,y)  s.t.  V̂_β(s,y) ≥ V̂_max(s) − ε,  (22)

where V̂_max(s) := max_{y ∈ Y(s)} V̂_β(s,y). Pareto view. Eq. (22) selects a Pareto-optimal point of the robust-value–disagreement trade-off: among candidates whose robust value is within ε of the best, it returns the least controversial one. Moreover, when the chosen point is supported, the same solution can be written as a scalarization argmax_y ( V̂_{β,γ}(s,y) − λ_ε σ̂(s,y) ) for some λ_ε ≥ 0; formal statements and proofs are deferred to Appendix A.13. Two-stage rule. To avoid sacrificing entropic value excessively for marginal robustness, we use:

F_ε(s) := { y ∈ Y(s) : V̂_β(s,y) ≥ V̂_max(s) − ε },  (23)
y_ε(s) ∈ argmin_{y ∈ F_ε(s)} σ̂(s,y).

This approximates the scalarized decoder max_y ( V̂_β(s,y) − λ σ̂(s,y) ) by enforcing near-optimality in entropic value and then selecting the least controversial candidate. In practice, we tune all risk parameters (β, τ, ε) on a held-out development set; detailed tuning protocols are provided in Appendix H.4.

4.3 Multi-scorer robustness

Scorer-robust decoding (multi-reward models). In scalable proxy settings, satisfaction scores may come from learned reward models and be sensitive to model shift or proxy over-optimization. We optionally hedge this scorer ambiguity by using a family of M scorers indexed by m ∈ [M]. For each candidate y ∈ Y(s), scorer m provides n scalar samples {R_{m,i}(s,y)}_{i=1}^n (e.g., via perturbations), from which we compute the scorer-specific entropic value V̂_{β,m}(s,y) and risk premium R̂P_{β,m}(s,y). Because reward scales can differ across scorers, we apply a per-prompt affine normalization (Appendix E).
We then aggregate scorer-specific entropic values using a soft worst-case operator:

Ṽ_{β,γ}(s,y) := −(1/γ) log( (1/M) Σ_{m=1}^M exp(−γ V̂_{β,m}(s,y)) ),  (24)

and aggregate risk premia pessimistically by R̃P_β(s,y) := max_{m ∈ [M]} R̂P_{β,m}(s,y). We decode by maximizing Ṽ_{β,γ}(s,y) subject to R̃P_β(s,y) ≤ τ (or its penalized form). Proposition 4.1 (Scorer aggregation as KL-regularized DRO over scorers). Let v_m := V̂_{β,m}(s,y) and v ∈ ℝ^M. Define u = (1/M, …, 1/M) and the simplex Δ_M := { q ∈ ℝ^M_+ : Σ_{m=1}^M q_m = 1 }. Then the soft worst-case aggregation (24) admits the variational form

Ṽ_{β,γ}(s,y) = inf_{q ∈ Δ_M} { Σ_{m=1}^M q_m v_m + (1/γ) D_KL(q‖u) }.  (25)

Interpretation. Eq. (25) shows that γ interpolates between averaging across scorers (γ → 0) and a worst-case scorer (γ → ∞), yielding a principled hedge against scorer shift. In particular, min_m v_m ≤ Ṽ_{β,γ}(s,y) ≤ (1/M) Σ_m v_m, and Ṽ_{β,γ}(s,y) is non-increasing in γ.

5 Experiment

5.1 Experimental Setup and Implementation Details

Candidate sets and evaluation protocol. For each prompt s, we generate a fixed candidate pool Y(s) of size K (shared across all methods), so differences arise solely from the inference-time selection rule. In the human-grounded setting, for each y ∈ Y(s) we collect n scalar satisfaction ratings {r_i(s,y)}_{i=1}^n (Christiano et al., 2017; Stiennon et al., 2020; Ouyang et al., 2022), and split annotators into disjoint selection and evaluation sets I_sel and I_eval to avoid selection–evaluation leakage. All decoders operate on statistics computed from I_sel.
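The soft worst-case aggregation in (24) is itself a log-sum-exp, and the sandwich bound min_m v_m ≤ Ṽ_{β,γ} ≤ (1/M) Σ_m v_m can be checked numerically. A small sketch with illustrative per-scorer values:

```python
import math

def soft_worst_case(values, gamma):
    """V_tilde = -(1/gamma) * log((1/M) * sum_m exp(-gamma * v_m)), Eq. (24).

    gamma -> 0 recovers the plain average over scorers;
    gamma -> inf approaches the worst (smallest) scorer value.
    """
    M = len(values)
    shift = max(-gamma * v for v in values)  # stabilize the exponentials
    lse = shift + math.log(sum(math.exp(-gamma * v - shift) for v in values))
    return -(lse - math.log(M)) / gamma
```

The non-increasing behavior in γ stated after Proposition 4.1 makes γ a single dial between average-case and worst-case scorer hedging.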
Our entropic decoder uses the empirical entropic value in the human-grounded setting (where M = 1 and Ṽ_{β,γ} = V̂_β), and uses the scorer-robust aggregated value Ṽ_{β,γ}(s,y) (and R̃P_β(s,y)) in proxy settings with multiple reward models. Second-moment baselines use only (μ̂_sel(s,y), σ̂_sel(s,y)) (or their scorer-indexed analogues under proxies). All reported metrics are computed on held-out ratings in I_eval (definitions in Appendix H.3). Estimating mean and disagreement. We define (μ̂, σ̂) as the mean and disagreement estimates, derived from either empirical samples or scorer proxies. DARC is inference-time only, modifying the selection over Y(s) without retraining. While our theory assumes multi-annotator feedback, we validate proxy uncertainty as a scalable approximation for human disagreement. Proxies, baselines, and hyperparameters. When multi-rater human scores are unavailable, we use scalable proxy scorers (e.g., ensembles / bootstrap reward models) to obtain a proxy mean together with either (i) a score distribution for computing V̂_β, or (ii) a proxy disagreement diagnostic σ̂_proxy (Kendall and Gal, 2017; Lakshminarayanan et al., 2017); we validate proxy–human alignment via rank correlation and stratified analyses (Uma et al., 2021). Inference-time baselines include mean Best-of-K, inference-time hedging methods for reward hacking (Best-of-Poisson and HedgeTune) (Khalaf et al., 2025), pessimistic best-of-N reranking via an auxiliary error/atypicality model (Caution) (Anonymous, 2025), uncertainty-based reranking (e.g., MC-Dropout) (Gal and Ghahramani, 2016; Kendall and Gal, 2017), reward-regularized decoding (RBoN) (Jinnai et al., 2024), and DeAL (top-k lookahead reward-guided decoding) (Huang et al., 2025b); training-time baselines include cDPO/rDPO (Mitchell, 2023; Chowdhury et al., 2024).
We report default settings and full tuning ranges (including our LCB- and DRO-instantiated decoding rules), along with complete baseline configurations and implementation details, in Appendix H.2 (see also Appendix F).

Inference overhead. Disagreement estimation adds only a modest inference-time cost: in our implementation, using $N_{\mathrm{aug}}=4/8/16$ increases end-to-end latency by only ≈1.5%/2.0%/3.2%, since candidate generation dominates runtime (Appendix H.6, Table 7).

Metrics (mean, risk, and tail). For each prompt s and candidate response y, we evaluate rewards on a held-out rater set $\mathcal{I}_{\mathrm{eval}}$ and report the empirical mean $\hat\mu_{\mathrm{eval}}(s,y)$. As a scalable proxy for robustness, we compute a perturbation-sensitivity statistic $\hat\sigma_{\mathrm{sel}}(s,y)$ as the standard deviation of rewards across $N_{\mathrm{aug}}$ style-preserving perturbations of y (see Appendix H.13). Human evaluation details are in Appendix I; the human mean is the average rating and the human variance is the sample variance across raters. We summarize performance via $\mathrm{Tradeoff}_{\mathrm{eval}}(s,y) := \hat\mu_{\mathrm{eval}}(s,y) - \lambda\,\hat\sigma_{\mathrm{sel}}(s,y)$, using the same λ for all methods (Appendix H.14). We also report $\mathrm{CVaR}_{10\%}$ over prompts (Chow et al., 2015); see Appendix H.3.

Figure 2: Ablation studies. Impact of key hyperparameters on risk mitigation performance. (a) Candidate pool size K. (b) Risk sensitivity coefficient β. (c) Constraint threshold ε. (d) Perturbation budget $N_{\mathrm{aug}}$.
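The mean–risk summary and lower-tail metric defined above can be sketched in a few lines (an illustrative sketch; the exact prompt-level aggregation used in the paper is described in Appendix H.3):

```python
import numpy as np

def tradeoff(mu_eval, sigma_sel, lam):
    """Risk-reward summary per (prompt, response): mu - lambda * sigma."""
    return mu_eval - lam * sigma_sel

def cvar_lower(values, alpha=0.10):
    """Lower-tail CVaR over prompts: the mean of the worst alpha-fraction
    of per-prompt outcomes (at least one prompt is always included)."""
    v = np.sort(np.asarray(values, dtype=float))
    k = max(1, int(np.ceil(alpha * v.size)))
    return float(v[:k].mean())
```

With α = 0.10 this averages the worst 10% of per-prompt outcomes, so improving CVaR requires lifting exactly the tail that mean-only Best-of-K ignores.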
Dataset: Overall

| Method | Human Score ↑ | Risk (Avg $\hat\sigma$) ↓ | Tradeoff ↑ | CVaR10% ↑ |
|---|---|---|---|---|
| Base (Best-of-K) | 7.56 | 0.67 | 6.22 | 6.73 |
| BoP (HedgeTune) | 7.79 | 0.63 | 6.53 | 7.12 |
| Caution | 7.82 | 0.61 | 6.60 | 7.15 |
| DeAL | 7.83 | 0.72 | 6.39 | 7.08 |
| MC-Dropout | 7.73 | 0.64 | 6.45 | 7.19 |
| RBoN | 7.71 | 0.62 | 6.47 | 7.26 |
| DARC (ours) | 7.84 | 0.60 | 6.64 | 7.36 |
| DARC-τ (ours) | 7.83 | 0.58 | 6.67 | 7.46 |
| DARC-ε (ours) | 8.08 | 0.55 | 6.98 | 7.62 |
| cDPO (Best-of-K) | 7.80 | 0.65 | 6.50 | 7.11 |
| cDPO + DARC-ε | 8.03 | 0.54 | 6.95 | 7.41 |
| rDPO (Best-of-K) | 8.17 | 0.69 | 6.79 | 7.20 |
| rDPO + DARC-ε | 8.15 | 0.58 | 6.99 | 7.60 |

Dataset: High-Disagreement (Top 20%)

| Method | Human Score ↑ | Risk (Avg $\hat\sigma$) ↓ | Tradeoff ↑ | CVaR10% ↑ |
|---|---|---|---|---|
| Base (Best-of-K) | 7.62 | 1.00 | 5.62 | 6.10 |
| BoP (HedgeTune) | 7.83 | 0.87 | 6.09 | 6.81 |
| Caution | 7.81 | 0.85 | 6.11 | 6.94 |
| DeAL | 7.99 | 1.13 | 5.73 | 7.33 |
| MC-Dropout | 7.88 | 0.84 | 6.20 | 7.25 |
| RBoN | 7.95 | 0.81 | 6.33 | 7.29 |
| DARC (ours) | 8.18 | 0.74 | 7.00 | 7.43 |
| DARC-τ (ours) | 8.11 | 0.65 | 6.81 | 7.35 |
| DARC-ε (ours) | 8.34 | 0.65 | 7.34 | 7.60 |
| cDPO (Best-of-K) | 7.96 | 0.89 | 6.18 | 7.15 |
| cDPO + DARC-ε | 8.31 | 0.63 | 7.05 | 7.54 |
| rDPO (Best-of-K) | 8.25 | 0.91 | 6.43 | 7.02 |
| rDPO + DARC-ε | 8.72 | 0.67 | 7.38 | 7.48 |

Table 2: Human evaluation results. We report Human Mean Score, Disagreement Risk, Tradeoff, and CVaR10% (worst 10% of prompt outcomes).

5.2 Results

Automated proxy evaluation. Table 1 reports MT-Bench (Zheng et al., 2023) and AlpacaEval 2.0 (Li et al., 2023; Dubois et al., 2024) results under two instruction-tuned generators (Llama-3.1-8B-Instruct (Dubey et al., 2024) and Qwen2.5-7B-Instruct (Qwen et al., 2025)). We evaluate each method by mean reward $\hat\mu$, proxy disagreement risk $\hat\sigma$ (reward sensitivity to style-preserving perturbations), and a risk–reward tradeoff score $\mathrm{Tradeoff} = \hat\mu - \lambda\hat\sigma$ with the same λ across methods; we also report Len(tok) to control for length bias. Across both generators, inference-time risk-aware selection improves robustness: DARC variants reduce $\hat\sigma$ while keeping $\hat\mu$ competitive, yielding a higher Tradeoff than mean-only Best-of-K, with larger gains on the high-variance subset (top 20% of prompts by baseline $\hat\sigma$) and broadly stable Len(tok).
Training-time robust policies (cDPO/rDPO) exhibit different trade-offs, and DARC remains complementary: applying DARC-ε as an inference-time plug-in can further reduce $\hat\sigma$ and improve Tradeoff in several cases, supporting a modular view in which training shapes the policy while DARC calibrates risk on a fixed candidate set.

Human-loop evaluation and tail robustness. We then close the loop with multi-annotator ratings on MT-Bench. Table 2 shows that DARC improves risk-sensitive criteria (Tradeoff and CVaR10%) while maintaining competitive mean satisfaction. On the high-disagreement subset, disagreement-aware decoding yields substantial gains in tail robustness (CVaR10%) and conservative quality (Tradeoff), indicating that risk-controlled selection is particularly beneficial on genuinely controversial prompts. Figure 3 shows that risk-adjusted satisfaction gains (Δ) scale positively with disagreement, rising steadily from Q1 to Q5 and peaking in the most controversial bucket (Q5). This confirms disagreement as an effective signal for allocating risk control, a trend corroborated by the CVaR results in Appendix Fig. 5 (see representative cases in Appendix H.10).

Figure 3: Gains concentrate on high-disagreement prompts. Mean improvement in lower-tail satisfaction (Δ vs. base) across five prompt buckets ranked by baseline human disagreement $\hat\sigma$ (low→high). Error bars denote 95% CIs.

Validity of the disagreement proxy. We validate our perturbation-sensitivity proxy $\hat\sigma$ against human disagreement measured by multiple independent rater scores on the same (prompt, response) pair. To avoid selection effects, we compute both proxy and human disagreement on the baseline candidate. As summarized in Figure 4, proxy disagreement shows statistically significant rank-level consistency with human disagreement and remains positively associated even after controlling for mean reward and response length (strong and common confounders for RM-based signals).
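The perturbation-sensitivity proxy can be sketched as follows (a minimal sketch: `reward_fn` and `perturb_fn` are illustrative caller-supplied hooks standing in for the reward model and the style-preserving perturbation of Appendix H.13):

```python
import numpy as np

def proxy_disagreement(response, reward_fn, perturb_fn, n_aug=8, seed=0):
    """Perturbation-sensitivity proxy for human disagreement: the standard
    deviation of rewards across n_aug style-preserving perturbations of the
    response. reward_fn and perturb_fn are caller-supplied hooks."""
    rng = np.random.default_rng(seed)
    rewards = [reward_fn(perturb_fn(response, rng)) for _ in range(n_aug)]
    return float(np.std(rewards))
```

A reward that is invariant to the perturbations yields zero proxy disagreement; responses whose reward swings under surface-form changes are flagged as high-risk.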
Moreover, the proxy serves as an effective risk filter for identifying prompts likely to exhibit preference heterogeneity, where risk-controlled decoding is designed to intervene: prompts identified as high-disagreement by the proxy substantially overlap with those identified by humans, far beyond random chance (Fig. 4). While mismatch cases exist (App. H.11), often reflecting surface-form sensitivity (false positives) or orthogonal validity/completeness issues (false negatives), this is consistent with using $\hat\sigma$ as a scalable screening signal to prioritize human verification rather than as a fully calibrated estimator of disagreement magnitude. Consistently, this signal effectively allocates risk control: bucketed analysis shows a monotonic increase in ΔTradeoff from low- to high-disagreement groups when bucketing by either human or proxy disagreement (App. H.7). We also stratify prompts by proxy–human disagreement alignment and find that, even in the worst-alignment bucket, DARC still achieves the best Tradeoff among all baselines (Appendix H.8, Table 8).

| Metric | Value | Notes / p-value |
|---|---|---|
| Spearman ρ | 0.6509 | 95% boot. CI [0.42, 0.70]; p < 10⁻¹⁰ |
| Kendall τ | 0.3130 | p < 10⁻⁶ |
| Partial Spearman ρ (controlling $\hat\mu$, Len) | 0.4084 | 95% boot. CI [0.25, 0.45] |
| Precision@20% (top-k overlap) | 0.64 | 64/100; random = 0.20 |
| Recall@20% | 0.64 | symmetric at equal k |
| Jaccard@20% | 0.47 | \|A∩B\|/\|A∪B\| |
| Overlap significance | 64 | hypergeometric p = 2.6×10⁻²⁹ |

Figure 4: Proxy validity diagnostics. (Top) Rank correlation between proxy and human disagreement, with top-20% overlap. (Bottom) Top-q overlap (left) and proxy vs. human disagreement scatter (right).

Multi-scorer robust decoding. To mitigate proxy over-optimization to a single reward model, we instantiate our decoder with a scorer family of size M=3: RM1 (Skywork-Reward-Llama-3.1-8B (Liu et al., 2024a)), RM2 (nicholasKluge RewardModel (Corrêa, 2023)), and RM3 (OpenAssistant DeBERTa-v3-Large-v2 (Köpf et al., 2023)).
At selection time, we compute $\hat V_{\beta,m}(s,y)$ for each $m\in[M]$ (with within-prompt normalization), aggregate them into $\tilde V_{\beta,\gamma}(s,y)$ via (24), and apply the same constrained/penalized rule using $\widetilde{RP}_\beta$ in (61). For reporting, because absolute reward scales differ across RMs, we evaluate the selected outputs within each RM and report only within-RM differences to the baseline (mean Δ) and Win/Tie/Loss rates (Table 3). We additionally provide a per-scorer breakdown of win/tie/loss and mean score differences under each evaluator reward model in Appendix H.9, confirming consistent gains across scorers.

| Method | Overall W/T/L | Overall Mean Δ | High-$\hat\sigma$ W/T/L | High-$\hat\sigma$ Mean Δ |
|---|---|---|---|---|
| DARC | 278 / 173 / 49 | 0.124 | 32 / 63 / 5 | 0.303 |
| DARC-τ | 257 / 192 / 51 | 0.128 | 36 / 60 / 4 | 0.259 |
| DARC-ε | 296 / 168 / 36 | 0.161 | 44 / 54 / 2 | 0.386 |

Table 3: Multi-scorer robust decoding (aggregated across scorers). DARC variants vs. Base (mean Best-of-K). Δ: per-prompt improvement over Base, aggregated across scorers; W/T/L is computed on the aggregated Δ. The high-$\hat\sigma$ subset uses the baseline disagreement proxy.

Scaling robustness. To assess scale robustness, we additionally evaluate our inference-time decoding rules on a stronger generator, Qwen2.5-14B-Instruct, using the same candidate-pool protocol and metrics. We observe consistent improvements over mean Best-of-K in risk-sensitive criteria (Tradeoff and prompt-level CVaR), with gains again concentrating on high-disagreement prompts (Table 14).

Hyperparameter sensitivity and ablations. Figure 2 summarizes a sensitivity study over the main knobs of DARC-ε, reporting improvements over mean Best-of-K in ΔTradeoff and ΔCVaR on fixed candidate pools. Increasing the candidate pool size K improves both metrics with diminishing returns (Fig. 2a). The risk sensitivity β in the entropic robust value $\hat V_\beta$ yields rapid robustness gains that plateau at moderate values (Fig. 2b).
Fixing β, enlarging the near-optimal set via ε further improves robustness up to saturation (with slight degradation when overly permissive) (Fig. 2c). Finally, the perturbation budget $N_{\mathrm{aug}}$ used to estimate the proxy disagreement $\hat\sigma$ saturates quickly; a small budget (e.g., 4–8) suffices, keeping inference overhead modest (Fig. 2d).

6 Conclusion

We cast decoding-time alignment under heterogeneous preferences as risk-constrained decision making. DARC uses a KL-robust (entropic) decoding rule with a conservative LCB/DRO interpretation, improving lower-tail outcomes while preserving mean quality. Limitations include the finite candidate pool and scorer bias; perturbation-based disagreement proxies are scalable screening signals, not calibrated estimates of human disagreement. Future work includes richer robustness signals and user/group-conditional risk control.

Impact Statement

This paper presents work whose goal is to advance the field of machine learning. Our contributions focus on inference-time response selection under preference heterogeneity. As with many machine learning methods, the proposed approach may exhibit uneven performance if the underlying proxies or learned scorers encode dataset- or population-specific biases. We encourage reporting sensitivity to key design choices and auditing proxy/scorer behavior when the method is used in practice. Consequently, the method should be used and evaluated with appropriate care in deployment.

References

Anonymous (2025) From curiosity to caution: mitigating reward hacking for best-of-$n$ with pessimism. In Submitted to The Fourteenth International Conference on Learning Representations, Note: under review External Links: Link Cited by: 7th item, §H.2, §H.2, §1, §5.1. P. Artzner, F. Delbaen, J. Eber, and D. Heath (1999) Coherent measures of risk. Mathematical finance 9 (3), p. 203–228. Cited by: §3.1, §4.1. D. Banerjee and A. Gopalan (2024) Towards reliable alignment: uncertainty-aware rlhf.
arXiv preprint arXiv:2410.23726. Cited by: §1. A. Ben-Tal and A. Nemirovski (2002) Robust optimization–methodology and applications. Mathematical programming 92 (3), p. 453–480. Cited by: §3.2. S. Boucheron, G. Lugosi, and P. Massart (2013) Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press. External Links: ISBN 9780199535255, Document Cited by: §3.1. S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge university press. Cited by: §A.6. R. A. Bradley and M. E. Terry (1952) Rank analysis of incomplete block designs: i. the method of paired comparisons. Biometrika 39 (3/4), p. 324–345. Cited by: §G.1, §1. S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, et al. (2023) Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217. Cited by: §G.2, §1. S. Cen, J. Mei, K. Goshvadi, H. Dai, T. Yang, S. Yang, D. Schuurmans, Y. Chi, and B. Dai (2024) Value-incentivized preference optimization: a unified approach to online and offline rlhf. arXiv preprint arXiv:2405.19320. Cited by: §G.1. S. Chakraborty, J. Qiu, H. Yuan, A. Koppel, F. Huang, D. Manocha, A. S. Bedi, and M. Wang (2024) MaxMin-rlhf: alignment with diverse human preferences. arXiv preprint arXiv:2402.08925. Cited by: §G.2. D. Chen, Y. Chen, A. Rege, and R. K. Vinayak (2024) Pal: pluralistic alignment framework for learning from heterogeneous preferences. arXiv preprint arXiv:2406.08469. Cited by: §G.2, §1. Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone (2018) Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research 18 (167), p. 1–51. Cited by: §1. Y. Chow, A. Tamar, S. Mannor, and M. Pavone (2015) Risk-sensitive and robust decision-making: a cvar optimization approach. Advances in neural information processing systems 28. Cited by: §5.1. S. R. Chowdhury, A. Kini, and N. 
Natarajan (2024) Provably robust dpo: aligning language models with noisy feedback. arXiv preprint arXiv:2403.00409. Cited by: §5.1. P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei (2017) Deep reinforcement learning from human preferences. Advances in neural information processing systems 30. Cited by: §G.1, §1, §5.1. N. K. Corrêa (2023) Aira. GitHub. External Links: Document, Link Cited by: §5.2. A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. (2024) The llama 3 herd of models. arXiv e-prints, p. arXiv–2407. Cited by: §5.2. Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto (2024) Length-controlled alpacaeval: a simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475. Cited by: §5.2. J. C. Duchi, P. W. Glynn, and H. Namkoong (2021) Statistics of robust optimization: a generalized empirical likelihood approach. Mathematics of Operations Research 46 (3), p. 946–969. Cited by: §G.2, §1. J. C. Duchi and H. Namkoong (2021) Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics 49 (3), p. 1378–1406. Cited by: §G.2. J. Duchi and H. Namkoong (2019) Variance-based regularization with convex objectives. Journal of Machine Learning Research 20 (68), p. 1–55. Cited by: §G.2, §1. P. Dupuis and R. S. Ellis (2011) A weak convergence approach to the theory of large deviations. John Wiley & Sons. Cited by: §A.5, §3.2. K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky, and D. Kiela (2024) Kto: model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306. Cited by: §1. Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, p. 1050–1059. Cited by: 3rd item, §H.2, §5.1. L. Gao, J. Schulman, and J. Hilton (2023) Scaling laws for reward model overoptimization. 
In International Conference on Machine Learning, p. 10835–10866. Cited by: §G.2, §1. S. Garg, A. Singh, S. Singh, and P. Chopra (2025) Ipo: your language model is secretly a preference classifier. arXiv preprint arXiv:2502.16182. Cited by: §1. Y. Guo, G. Cui, L. Yuan, N. Ding, Z. Sun, B. Sun, H. Chen, R. Xie, J. Zhou, Y. Lin, et al. (2024) Controllable preference optimization: toward controllable multi-objective alignment. arXiv preprint arXiv:2402.19085. Cited by: §G.1. L. P. Hansen and T. J. Sargent (2011) Robustness. In Robustness, Cited by: §3.2. J. Hong, N. Lee, and J. Thorne (2024) Orpo: monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691. Cited by: §G.1, §1. A. Huang, A. Block, Q. Liu, N. Jiang, A. Krishnamurthy, and D. J. Foster (2025a) Is best-of-n the best of them? coverage, scaling, and optimality in inference-time alignment. arXiv preprint arXiv:2503.21878. Cited by: §G.2, §1. J. Y. Huang, S. Sengupta, D. Bonadiman, Y. Lai, A. Gupta, N. Pappas, S. Mansour, K. Kirchhoff, and D. Roth (2025b) Deal: decoding-time alignment for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 26280–26300. Cited by: §G.2, §H.2, §5.1. C. Hung, N. Majumder, A. Mehrish, and S. Poria (2025) Reward-guided tree search for inference time alignment of large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), p. 12575–12593. Cited by: §1. Y. Ichihara, Y. Jinnai, T. Morimura, K. Ariu, K. Abe, M. Sakamoto, and E. Uchibe (2025) Evaluation of best-of-n sampling strategies for language model alignment. arXiv preprint arXiv:2502.12668. Cited by: §G.2, §1. Z. Jin, H. Yuan, T. Men, P. Cao, Y. Chen, J. Xu, H. Li, X. Jiang, K. Liu, and J. 
Zhao (2025) Rag-rewardbench: benchmarking reward models in retrieval augmented generation for preference alignment. In Findings of the Association for Computational Linguistics: ACL 2025, p. 17061–17090. Cited by: §G.2. Y. Jinnai, T. Morimura, K. Ariu, and K. Abe (2024) Regularized best-of-n sampling to mitigate reward hacking for language model alignment. In ICML 2024 Workshop on Models of Human Feedback for AI Alignment, Cited by: 5th item, §H.2, §5.1. A. Kendall and Y. Gal (2017) What uncertainties do we need in bayesian deep learning for computer vision?. Advances in neural information processing systems 30. Cited by: 3rd item, §5.1. H. Khalaf, C. M. Verdun, A. Oesterling, H. Lakkaraju, and F. d. P. Calmon (2025) Inference-time reward hacking in large language models. arXiv preprint arXiv:2506.19248. Cited by: 6th item, §H.2, §H.2, §1, §5.1. A. Köpf, Y. Kilcher, D. Von Rütte, S. Anagnostidis, Z. R. Tam, K. Stevens, A. Barhoum, D. Nguyen, O. Stanley, R. Nagyfi, et al. (2023) Openassistant conversations-democratizing large language model alignment. Advances in neural information processing systems 36, p. 47669–47681. Cited by: §5.2. B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems 30. Cited by: §5.1. N. Lambert, V. Pyatkin, J. Morrison, L. J. V. Miranda, B. Y. Lin, K. Chandu, N. Dziri, S. Kumar, T. Zick, Y. Choi, et al. (2025) Rewardbench: evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, p. 1755–1797. Cited by: §G.2. D. Levy, Y. Carmon, J. C. Duchi, and A. Sidford (2020) Large-scale methods for distributionally robust optimization. Advances in neural information processing systems 33, p. 8847–8860. Cited by: §G.2. X. Li, X. Chen, J. Fan, E. H. Jiang, and M. Gao (2025) Multi-head reward aggregation guided by entropy. arXiv preprint arXiv:2503.20995. 
Cited by: §G.2, §1. X. Li, T. Zhang, Y. Dubois, R. Taori, I. Gulrajani, C. Guestrin, P. Liang, and T. B. Hashimoto (2023) Alpacaeval: an automatic evaluator of instruction-following models. Cited by: §5.2. C. Y. Liu, L. Zeng, J. Liu, R. Yan, J. He, C. Wang, S. Yan, Y. Liu, and Y. Zhou (2024a) Skywork-reward: bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451. Cited by: Table 5, §5.2. Y. Liu, Z. Yao, R. Min, Y. Cao, L. Hou, and J. Li (2024b) Rm-bench: benchmarking reward models of language models with subtlety and style. arXiv preprint arXiv:2410.16184. Cited by: §G.2. Y. Meng, M. Xia, and D. Chen (2024) Simpo: simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems 37, p. 124198–124235. Cited by: §G.1, §1. K. Miettinen (1999) Nonlinear multiobjective optimization. Vol. 12, Springer Science & Business Media. Cited by: §A.6. E. Mitchell (2023) A note on dpo with noisy preferences & relationship to ipo. Cited by: §5.1. H. Namkoong and J. C. Duchi (2016) Stochastic gradient methods for distributionally robust optimization with f-divergences. Advances in neural information processing systems 29. Cited by: §G.2, §1. L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in neural information processing systems 35, p. 27730–27744. Cited by: §G.1, §1, §5.1. Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu (2025) Qwen2.5 technical report. External Links: 2412.15115 Cited by: §5.2. R. Rafailov, Y. Chittepu, R. Park, H. S. Sikchi, J. Hejna, B. Knox, C. 
Finn, and S. Niekum (2024) Scaling laws for reward model overoptimization in direct alignment algorithms. Advances in Neural Information Processing Systems 37, p. 126207–126242. Cited by: §G.2, §1. R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. Advances in neural information processing systems 36, p. 53728–53741. Cited by: §G.1, §1. H. Rahimian and S. Mehrotra (2019) Distributionally robust optimization: a review. arXiv preprint arXiv:1908.05659. Cited by: §G.2, §1, §3.2, §4.1. S. S. Ramesh, Y. Hu, I. Chaimalas, V. Mehta, P. G. Sessa, H. Bou Ammar, and I. Bogunovic (2024) Group robust preference optimization in reward-free rlhf. Advances in Neural Information Processing Systems 37, p. 37100–37137. Cited by: §G.2, §1. R. T. Rockafellar, S. Uryasev, et al. (2000) Optimization of conditional value-at-risk. Journal of risk 2, p. 21–42. Cited by: §3.1, §4.1. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1. Y. Song, G. Swamy, A. Singh, J. Bagnell, and W. Sun (2024) The importance of online data: understanding preference fine-tuning via coverage. Advances in Neural Information Processing Systems 37, p. 12243–12270. Cited by: §G.1. N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano (2020) Learning to summarize with human feedback. Advances in neural information processing systems 33, p. 3008–3021. Cited by: §1, §5.1. H. Sun, M. Haider, R. Zhang, H. Yang, J. Qiu, M. Yin, M. Wang, P. Bartlett, and A. Zanette (2024) Fast best-of-n decoding via speculative rejection. Advances in Neural Information Processing Systems 37, p. 32630–32652. Cited by: §G.2, §1. A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor (2015) Policy gradient for coherent risk measures. 
Advances in neural information processing systems 28. Cited by: §1. A. N. Uma, T. Fornaciari, D. Hovy, S. Paun, B. Plank, and M. Poesio (2021) Learning from disagreement: a survey. Journal of Artificial Intelligence Research 72, p. 1385–1470. Cited by: §5.1. R. Vershynin (2018) High-dimensional probability: an introduction with applications in data science. Vol. 47, Cambridge university press. Cited by: §3.1. H. Wang, W. Xiong, T. Xie, H. Zhao, and T. Zhang (2024a) Interpretable preferences via multi-objective reward modeling and mixture-of-experts. arXiv preprint arXiv:2406.12845. Cited by: §G.2, §1. Z. Wang, A. Bukharin, O. Delalleau, D. Egert, G. Shen, J. Zeng, O. Kuchaiev, and Y. Dong (2024b) Helpsteer2-preference: complementing ratings with preferences. arXiv preprint arXiv:2410.01257. Cited by: §G.1. Z. Wang, Y. Dong, J. Zeng, V. Adams, M. N. Sreedhar, D. Egert, O. Delalleau, J. Scowcroft, N. Kant, A. Swope, et al. (2024c) Helpsteer: multi-attribute helpfulness dataset for steerlm. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), p. 3371–3384. Cited by: §G.1. W. Wiesemann, D. Kuhn, and M. Sim (2014) Distributionally robust convex optimization. Operations research 62 (6), p. 1358–1376. Cited by: §G.2, §1, §3.2, §4.1. J. Wu, Y. Xie, Z. Yang, J. Wu, J. Chen, J. Gao, B. Ding, X. Wang, and X. He (2024) Towards robust alignment of language models: distributionally robustifying direct preference optimization. arXiv preprint arXiv:2407.07880. Cited by: §G.2, §1. K. Yang, Z. Liu, Q. Xie, J. Huang, T. Zhang, and S. Ananiadou (2024) Metaaligner: towards generalizable multi-objective alignment of language models. Advances in Neural Information Processing Systems 37, p. 34453–34486. Cited by: §G.2, §1. M. J. Zhang, Z. Wang, J. D. Hwang, Y. Dong, O. Delalleau, Y. Choi, E. Choi, X. Ren, and V. 
Pyatkin (2024) Diverging preferences: when do annotators disagree and do models know?. arXiv preprint arXiv:2410.14632. Cited by: §G.2, §1, §1. L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. (2023) Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems 36, p. 46595–46623. Cited by: §5.2. E. Zhou, G. Zheng, B. Wang, Z. Xi, S. Dou, R. Bao, W. Shen, L. Xiong, J. Fan, Y. Mou, et al. (2024) RMB: comprehensively benchmarking reward models in llm alignment. arXiv preprint arXiv:2410.09893. Cited by: §G.2.

Appendix A Additional Theoretical Details

A.1 A self-normalized (empirical-variance) concentration bound

We first record a standard empirical-Bernstein-style inequality (a self-normalized concentration bound) that controls the deviation of the empirical mean in terms of the empirical standard deviation. This lemma is used to prove Proposition 3.3.

Lemma A.1 (Empirical Bernstein bound (bounded case)). Let $X_1,\dots,X_n$ be i.i.d. supported on $[a,b]$ with mean μ. Let

$$\hat\mu := \frac{1}{n}\sum_{i=1}^n X_i, \qquad \hat\sigma^2 := \frac{1}{n-1}\sum_{i=1}^n (X_i-\hat\mu)^2.$$

Then for any $\delta\in(0,1)$, with probability at least $1-\delta$,

$$\mu \;\ge\; \hat\mu \;-\; \sqrt{\frac{2\hat\sigma^2\log(2/\delta)}{n}} \;-\; \frac{7(b-a)\log(2/\delta)}{3(n-1)}. \tag{26}$$

Remark. Lemma A.1 is a standard form of the empirical Bernstein inequality. We use it because it yields the same structure as (9): a leading term of order $\hat\sigma\sqrt{\log(1/\delta)/n}$ and a lower-order term of order $\log(1/\delta)/n$. For completeness, we provide a short proof outline in Appendix A.7.

A.2 Proof of Proposition 3.3

Proof outline. We apply an empirical-Bernstein (self-normalized) inequality for bounded random variables to each fixed y, then take a union bound over the K candidates with $\delta' = \delta/K$. An alternative derivation can be obtained from Freedman's martingale inequality. We then track constants to obtain (9).

Proof.
Fix any $y\in\mathcal{Y}$ and apply Lemma A.1 to $\{R_i(y)\}_{i=1}^n$ with $\delta' = \delta/K$. With probability at least $1-\delta'$,

$$\mu(y) \;\ge\; \hat\mu_n(y) - \hat\sigma_n(y)\sqrt{\frac{2\log(2/\delta')}{n}} - \frac{7(b-a)\log(2/\delta')}{3(n-1)}.$$

Taking a union bound over the K candidates yields that with probability at least $1-\delta$ the inequality holds for all y. Finally, note that $\log(2/\delta') = \log(2K/\delta) \le \log(K/\delta) + \log 2 \le C_0\,\log(K/\delta)$ for a universal constant $C_0$ (we may assume $K/\delta \ge 2$; otherwise the bound is trivial). Moreover, $\frac{1}{n-1} \le \frac{2}{n}$ for $n \ge 2$. Absorbing these constant factors into a single universal constant c yields (9). ∎

A.3 From LCB decoding to mean–dispersion decoding

This appendix formalizes the "ignoring constants / lower-order term" step used to connect uniform LCB maximization to a mean–dispersion surrogate (used as an ablation / efficient approximation), rather than as the defining objective of our primary KL-robust entropic decoder.

Corollary A.2 (Mean–dispersion surrogate form under bounded ratings). Under the conditions of Proposition 3.3, define

$$\mathrm{LCB}_\delta(y) := \hat\mu_n(y) - c\,\hat\sigma_n(y)\sqrt{\frac{\log(K/\delta)}{n}} - c\,(b-a)\,\frac{\log(K/\delta)}{n}.$$

Then any maximizer of $\mathrm{LCB}_\delta(y)$ is also a maximizer of

$$\hat\mu_n(y) - \lambda\,\hat\sigma_n(y), \qquad \lambda := c\sqrt{\frac{\log(K/\delta)}{n}}, \tag{27}$$

since the last term $c\,(b-a)\log(K/\delta)/n$ is independent of y.

Proof. The two objectives differ by an additive constant that does not depend on y, hence they have identical argmax sets. ∎

Remark A.3 (Standard deviation vs. variance penalization). We note a structural distinction between the LCB-derived surrogate and the entropic objective. The LCB rule (Eq. 27) implies a penalty on the standard deviation $\hat\sigma$ (scaling as $O(n^{-1/2})$), which accounts for finite-sample estimation uncertainty. In contrast, the entropic objective (Eq.
1) approximates a mean–variance form $\hat\mu - \frac{\beta}{2}\hat\sigma^2$ under a Taylor expansion, reflecting intrinsic risk aversion. While theoretically distinct, both objectives penalize dispersion. We use the σ-based LCB form primarily to establish finite-sample consistency guarantees, while the entropic form serves as our primary method for capturing risk sensitivity. Empirically, both effectively demote high-disagreement candidates.

A.4 Lower-tail (risk) interpretation and uniform empirical LCB under sub-Gaussianity

This appendix supports the statement in Section 3.1 that an analogous LCB holds under sub-Gaussian noise (as an alternative to Assumption 3.2). We first recall a lower-tail bound for a single sub-Gaussian variable, then derive a uniform-in-y empirical LCB for the mean μ(y) based on n samples, analogous to Proposition 3.3.

Bounded ⇒ sub-Gaussian. Under boundedness $X\in[a,b]$, Hoeffding's lemma implies that $X - \mathbb{E}[X]$ is ν-sub-Gaussian with $\nu \le (b-a)/2$. Applying Lemma A.4 with $\nu = (b-a)/2$ yields the bounded lower-tail bound:

$$\Pr\big(X \le \mathbb{E}[X] - t\big) \;\le\; \exp\!\Big(-\frac{2t^2}{(b-a)^2}\Big), \qquad \forall\, t>0. \tag{28}$$

This justifies the statement that dispersion controls conservative (lower-tail) outcomes.

Lemma A.4 (Sub-Gaussian lower-tail bound). If $X - \mathbb{E}[X]$ is ν-sub-Gaussian, then for any $t>0$,

$$\Pr\big(X \le \mathbb{E}[X] - t\big) \;\le\; \exp\!\Big(-\frac{t^2}{2\nu^2}\Big). \tag{29}$$

Proof. By Chernoff's method, for any $\lambda > 0$,

$$\Pr(X - \mathbb{E}[X] \le -t) = \Pr\big(e^{-\lambda(X-\mathbb{E}[X])} \ge e^{\lambda t}\big) \le e^{-\lambda t}\,\mathbb{E}\big[e^{-\lambda(X-\mathbb{E}[X])}\big] \le e^{-\lambda t}\, e^{\lambda^2\nu^2/2}.$$

Optimizing over λ by setting $\lambda = t/\nu^2$ gives (29). ∎

Uniform empirical LCB for sub-Gaussian rewards. Fix a state s and a finite candidate set $\mathcal{Y}(s)$ of size $K := |\mathcal{Y}(s)|$. For each $y\in\mathcal{Y}(s)$, let $R_1(y),\dots,R_n(y)$ be i.i.d. draws of the (scalar) reward with mean $\mu(y) := \mathbb{E}[R(y)]$.
Define the empirical mean and (uncentered) empirical variance proxy

$$\hat\mu_n(y) := \frac{1}{n}\sum_{i=1}^n R_i(y), \qquad \hat\sigma_n^2(y) := \frac{1}{n}\sum_{i=1}^n \big(R_i(y) - \hat\mu_n(y)\big)^2.$$

(Any definition of $\hat\sigma_n(y)$ used in the main text can be substituted here; only its nonnegativity is used below.)

Assumption A.5 (Sub-Gaussian rewards (alternative to boundedness)). For each $y\in\mathcal{Y}(s)$, the centered reward $R(y) - \mu(y)$ is ν-sub-Gaussian, i.e., $\mathbb{E}[\exp(\lambda(R(y)-\mu(y)))] \le \exp(\lambda^2\nu^2/2)$ for all $\lambda\in\mathbb{R}$.

Lemma A.6 (Sample mean is sub-Gaussian). Under Assumption A.5, for any fixed y, the sample mean satisfies that $\hat\mu_n(y) - \mu(y)$ is $(\nu/\sqrt{n})$-sub-Gaussian, i.e.,

$$\mathbb{E}\big[\exp\big(\lambda(\hat\mu_n(y)-\mu(y))\big)\big] \;\le\; \exp\!\Big(\frac{\lambda^2\nu^2}{2n}\Big), \qquad \forall\,\lambda\in\mathbb{R}.$$

Proof. Identical to the standard argument: by independence and the sub-Gaussian MGF bound,

$$\mathbb{E}\big[e^{\lambda(\hat\mu_n(y)-\mu(y))}\big] = \prod_{i=1}^n \mathbb{E}\Big[\exp\Big(\frac{\lambda}{n}\big(R_i(y)-\mu(y)\big)\Big)\Big] \le \prod_{i=1}^n \exp\Big(\frac{\lambda^2\nu^2}{2n^2}\Big) = \exp\Big(\frac{\lambda^2\nu^2}{2n}\Big). \;\;∎$$

Proposition A.7 (Uniform sub-Gaussian LCB (analogue of Prop. 3.3)). There exists an absolute constant $c>0$ such that for any $\delta\in(0,1)$, with probability at least $1-\delta$, simultaneously for all $y\in\mathcal{Y}(s)$,

$$\mu(y) \;\ge\; \hat\mu_n(y) - c\,\nu\,\sqrt{\frac{\log(K/\delta)}{n}}. \tag{30}$$

We denote the right-hand side by $\mathrm{LCB}^{\mathrm{subG}}_\delta(y)$.

Proof. Fix any $y\in\mathcal{Y}(s)$. By Lemma A.6 and Lemma A.4, for any $t>0$,

$$\Pr\big(\hat\mu_n(y) \le \mu(y) - t\big) \;\le\; \exp\!\Big(-\frac{nt^2}{2\nu^2}\Big).$$

Setting $t = \nu\sqrt{2\log(K/\delta)/n}$ gives

$$\Pr\Big(\mu(y) < \hat\mu_n(y) - \nu\sqrt{\frac{2\log(K/\delta)}{n}}\Big) \;\le\; \frac{\delta}{K}.$$

Applying a union bound over all $y\in\mathcal{Y}(s)$ yields that with probability at least $1-\delta$, (30) holds simultaneously for all y; absorbing the factor $\sqrt{2}$ into the absolute constant c completes the proof. ∎

Discussion (relation to Prop. 3.3).
Proposition A.7 provides the claimed uniform LCB under sub-Gaussian noise. Compared to the bounded-rewards LCB in (9), the sub-Gaussian LCB replaces the empirical dispersion term $\hat\sigma_n(y)$ and the boundedness-dependent lower-order correction $c(b-a)\log(K/\delta)/n$ with the (known) sub-Gaussian scale $\nu$. If one additionally wants a data-dependent (empirical-Bernstein/Freedman-type) bound under sub-Gaussianity, one can combine sub-exponential concentration for $(R(y) - \mu(y))^2$ with a union bound; we omit this here since Section 3.1 only requires the existence of an analogous LCB.

A.5 Proof of Theorem 3.5

The equivalence between the KL-robust value and the entropic risk measure is a classical result in the theory of risk-sensitive control and large deviations, often referred to as the Donsker–Varadhan variational formula (Dupuis and Ellis, 2011). We provide the derivation here for completeness.

Proof. Recall the definition of the KL-regularized robust value given in Equation (12):
\[
\mathrm{Rob}^{\mathrm{KL}}_\beta(\mathbb{P}; R) := \inf_{\mathbb{Q} \ll \mathbb{P}} \Big\{ \mathbb{E}_{\mathbb{Q}}[R] + \frac{1}{\beta} D_{\mathrm{KL}}(\mathbb{Q}\,\|\,\mathbb{P}) \Big\}. \tag{31}
\]
Let $f(x) = \frac{d\mathbb{Q}}{d\mathbb{P}}(x)$ be the Radon–Nikodym derivative (density ratio). The optimization problem can be formulated as minimizing the functional $J(f)$ subject to the normalization constraint $\int f\,d\mathbb{P} = 1$ and non-negativity $f \ge 0$:
\[
J(f) = \int f(x) R(x)\,d\mathbb{P}(x) + \frac{1}{\beta}\int f(x)\log f(x)\,d\mathbb{P}(x). \tag{32}
\]
We introduce a Lagrange multiplier $\gamma$ for the constraint $\int f\,d\mathbb{P} = 1$. The Lagrangian is:
\[
\mathcal{L}(f,\gamma) = \int \Big( f(x)R(x) + \frac{1}{\beta} f(x)\log f(x) - \gamma f(x) \Big)\,d\mathbb{P}(x) + \gamma. \tag{33}
\]
Taking the functional derivative with respect to $f(x)$ and setting it to zero yields the first-order condition:
\[
R(x) + \frac{1}{\beta}\big(\log f(x) + 1\big) - \gamma = 0. \tag{34}
\]
Solving for $f(x)$, we obtain the form of the optimal twisted distribution:
\[
\log f^*(x) = \beta\gamma - 1 - \beta R(x) \;\Longrightarrow\; f^*(x) \propto \exp(-\beta R(x)).
\]
(35)

Using the normalization constraint $\int f^*(x)\,d\mathbb{P}(x) = 1$, we determine the normalization constant (partition function) $Z$:
\[
Z := \int \exp(-\beta R(x))\,d\mathbb{P}(x) = \mathbb{E}_{\mathbb{P}}[\exp(-\beta R)]. \tag{36}
\]
Thus, the optimal density ratio is given by the Gibbs-measure form $f^*(x) = \frac{1}{Z}\exp(-\beta R(x))$. Substituting $f^*$ back into the objective function, the expectation term and the entropy term combine conveniently:
\begin{align}
\mathbb{E}_{\mathbb{Q}^*}[R] + \frac{1}{\beta} D_{\mathrm{KL}}(\mathbb{Q}^*\,\|\,\mathbb{P})
&= \int f^*(x)\Big(R(x) + \frac{1}{\beta}\log f^*(x)\Big)\,d\mathbb{P}(x) \tag{37}\\
&= \int f^*(x)\Big(R(x) + \frac{1}{\beta}\big(-\beta R(x) - \log Z\big)\Big)\,d\mathbb{P}(x) \tag{38}\\
&= \int f^*(x)\Big(R(x) - R(x) - \frac{1}{\beta}\log Z\Big)\,d\mathbb{P}(x) \tag{39}\\
&= -\frac{1}{\beta}\log Z \underbrace{\int f^*(x)\,d\mathbb{P}(x)}_{=1} \tag{40}\\
&= -\frac{1}{\beta}\log \mathbb{E}_{\mathbb{P}}[\exp(-\beta R)]. \tag{41}
\end{align}
This concludes the proof. ∎

A.6 Constraint vs. penalty: Lagrangian view (finite candidate set)

This appendix supports the "two interchangeable controls" claim in the main text for the entropic-risk-premium formulation.

Lemma A.8 (Pareto optimality and (supported) scalarization). Fix $\beta > 0$ and define the entropic risk premium $\widehat{\mathrm{RP}}_\beta(s,y) := \hat\mu(y) - \hat V_\beta(y)$. Let $\mathcal{P}_\beta := \{(\widehat{\mathrm{RP}}_\beta(s,y), \hat\mu(y)) : y \in \mathcal{Y}\} \subset \mathbb{R}^2$ be a finite set. If $y^\star \in \arg\max_{y \in \mathcal{Y}} \hat\mu(y) - \lambda\,\widehat{\mathrm{RP}}_\beta(s,y)$ for some $\lambda \ge 0$, then $y^\star$ is Pareto-optimal with respect to the bi-criteria $(\widehat{\mathrm{RP}}_\beta, \hat\mu)$ (i.e., there is no $y$ such that $\hat\mu(y) \ge \hat\mu(y^\star)$ and $\widehat{\mathrm{RP}}_\beta(s,y) \le \widehat{\mathrm{RP}}_\beta(s,y^\star)$ with at least one strict). Conversely, if $(\widehat{\mathrm{RP}}_\beta(s,y^\star), \hat\mu(y^\star))$ is a supported Pareto-optimal point of $\mathcal{P}_\beta$ (i.e., it lies on the upper boundary of $\mathrm{conv}(\mathcal{P}_\beta)$ and admits a supporting line), then there exists $\lambda \ge 0$ such that
\[
y^\star \in \arg\max_{y \in \mathcal{Y}} \hat\mu(y) - \lambda\,\widehat{\mathrm{RP}}_\beta(s,y).
\]

Proof. This is a standard scalarization fact for finite bi-criteria optimization (Boyd and Vandenberghe, 2004; Miettinen, 1999).
If a solution were dominated under $(\widehat{\mathrm{RP}}_\beta, \hat\mu)$, it could not maximize any nonnegative linear scalarization. Conversely, each supported Pareto-optimal point admits a supporting hyperplane for $\mathrm{conv}(\mathcal{P}_\beta)$, whose slope yields a $\lambda \ge 0$ such that $y^\star$ maximizes $\hat\mu - \lambda\,\widehat{\mathrm{RP}}_\beta$. ∎

A.7 Proof outline for Lemma A.1

We briefly outline the proof strategy for Lemma A.1. One route is to apply a martingale concentration inequality (Freedman/Bernstein-type) to the centered sum $\sum_{i=1}^{n}(X_i - \mu)$ and control the conditional variance by the empirical variance, yielding an "empirical Bernstein" bound. Another route is to invoke a known empirical Bernstein theorem directly.

A.8 Selection statistics: empirical mean, dispersion, and CVaR

For each input $s$ and candidate response $y \in \mathcal{Y}(s)$, we estimate robust selection statistics using a fixed reward model $R_\psi$. Concretely, we construct $M$ style-preserving perturbations $\{\tilde y^{(m)}\}_{m=1}^{M}$ of $y$ (e.g., formatting or phrasing variations), and compute $M$ reward samples $r^{(m)}(s,y) = R_\psi(s, \tilde y^{(m)})$. We then define:
\[
\hat\mu_{\mathrm{sel}}(s,y) = \frac{1}{M}\sum_{m=1}^{M} r^{(m)}(s,y), \qquad \hat\sigma_{\mathrm{sel}}(s,y) = \sqrt{\frac{1}{M-1}\sum_{m=1}^{M}\big(r^{(m)}(s,y) - \hat\mu_{\mathrm{sel}}(s,y)\big)^2}.
\]
For CVaR, let $r^{(1:M)}_{(1)} \le \cdots \le r^{(1:M)}_{(M)}$ be the sorted reward samples, and let $k = \max\{1, \lceil \alpha M \rceil\}$. We use the empirical lower-tail mean
\[
\widehat{\mathrm{CVaR}}_{\alpha,\mathrm{sel}}(s,y) = \frac{1}{k}\sum_{i=1}^{k} r^{(1:M)}_{(i)}.
\]
In all experiments, $M$ and $\alpha$ are fixed globally (no per-task tuning), and the same reward model $R_\psi$ is used for all methods.

A.9 Proof of Proposition 3.6

We prove the lower bound (14) and characterize when it is tight. Let $\mathbb{Q} \ll \mathbb{P}$ and define $g := \frac{d\mathbb{Q}}{d\mathbb{P}} - 1$. Then $\mathbb{E}_{\mathbb{P}}[g] = 0$ and the $\chi^2$ constraint implies $\mathbb{E}_{\mathbb{P}}[g^2] = D_{\chi^2}(\mathbb{Q}\,\|\,\mathbb{P}) \le \rho$. Moreover,
\[
\mathbb{E}_{\mathbb{Q}}[R] = \mathbb{E}_{\mathbb{P}}[(1+g)R] = \mu_{\mathbb{P}} + \mathbb{E}_{\mathbb{P}}[g(R - \mu_{\mathbb{P}})].
\]
By Cauchy–Schwarz,
\[
\mathbb{E}_{\mathbb{P}}[g(R - \mu_{\mathbb{P}})] \ge -\sqrt{\mathbb{E}_{\mathbb{P}}[g^2]}\,\sqrt{\mathbb{E}_{\mathbb{P}}[(R - \mu_{\mathbb{P}})^2]} \ge -\sqrt{\rho}\,\sigma_{\mathbb{P}}.
\]
Taking the infimum over all feasible $\mathbb{Q}$ yields (14). For tightness, equality in Cauchy–Schwarz is achieved when
\[
g^\star(R) = -\sqrt{\rho}\,\frac{R - \mu_{\mathbb{P}}}{\sigma_{\mathbb{P}}},
\]
which corresponds to the candidate extremal density
\[
\frac{d\mathbb{Q}^\star}{d\mathbb{P}}(R) = 1 + g^\star(R) = 1 - \sqrt{\rho}\,\frac{R - \mu_{\mathbb{P}}}{\sigma_{\mathbb{P}}},
\]
i.e., (49). This defines a valid probability density if it is nonnegative $\mathbb{P}$-almost surely, which is exactly the stated condition. A sufficient condition is $R \le \mu_{\mathbb{P}} + \sigma_{\mathbb{P}}/\sqrt{\rho}$ $\mathbb{P}$-a.s., giving (50). Specializing to the empirical distribution $\hat{\mathbb{P}}_n^y$ yields (15) by substituting $\mu_{\hat{\mathbb{P}}_n^y} = \hat\mu_n(y)$ and $\sigma_{\hat{\mathbb{P}}_n^y} = \hat v_n(y) = \sqrt{\tfrac{n-1}{n}}\,\hat\sigma_n(y)$. ∎

A.10 Proof of Lemma A.11

Let $y_\epsilon$ be any solution of (22). If there existed $y \in \mathcal{Y}(s)$ such that $\hat V_\beta(s,y) \ge \hat V_\beta(s,y_\epsilon)$ and $\hat\sigma(s,y) \le \hat\sigma(s,y_\epsilon)$ with at least one strict inequality, then:
- if $\hat V_\beta(s,y) > \hat V_\beta(s,y_\epsilon)$, $y$ is feasible for (22) and strictly better in the objective $\hat\sigma$;
- if $\hat V_\beta(s,y) = \hat V_\beta(s,y_\epsilon)$ but $\hat\sigma(s,y) < \hat\sigma(s,y_\epsilon)$, $y$ is also feasible and strictly better.
Either case contradicts the optimality of $y_\epsilon$. Hence $y_\epsilon$ is Pareto-optimal. For the scalarization claim, consider the finite set $\mathcal{S} := \{(\hat\sigma(s,y), \hat V_\beta(s,y)) : y \in \mathcal{Y}(s)\} \subset \mathbb{R}^2$ and its convex hull $\mathrm{conv}(\mathcal{S})$. If $(\hat\sigma(s,y_\epsilon), \hat V_\beta(s,y_\epsilon))$ is a supported Pareto-optimal point, then there exists a supporting line to $\mathrm{conv}(\mathcal{S})$ with outward normal $(\lambda_\epsilon, 1)$ for some $\lambda_\epsilon \ge 0$, i.e.,
\[
\hat V_\beta(s,y) - \lambda_\epsilon\,\hat\sigma(s,y) \le \hat V_\beta(s,y_\epsilon) - \lambda_\epsilon\,\hat\sigma(s,y_\epsilon) \qquad \forall y \in \mathcal{Y}(s).
\]
Therefore $y_\epsilon$ maximizes the linear scalarization, proving (47). ∎

A.11 Second-moment surrogate of the LCB rule

Corollary A.9 (Second-moment surrogate).
Up to an additive term uniform across $y$, maximizing $\mathrm{LCB}_\delta(y)$ is equivalent to
\[
y_{\mathrm{LCB}} \in \arg\max_{y \in \mathcal{Y}} \big( \hat\mu_n(y) - \lambda\,\hat\sigma_n(y) \big), \qquad \lambda := c\sqrt{\frac{\log(K/\delta)}{n}}. \tag{42}
\]

A.12 Robustness to proxy disagreement estimates

This appendix formalizes the robustness of LCB-based selection when the decoder uses proxy estimates $(\tilde\mu, \tilde\sigma)$ in place of the empirical statistics $(\hat\mu_n, \hat\sigma_n)$. We show that under a uniform proxy error event, proxy maximization incurs only an additive slack $\Delta_{\mathrm{prox}}$ relative to the sample-based optimum.

Assumption (uniform proxy closeness). Let $(\tilde\mu(y), \tilde\sigma(y))$ be proxy estimates used by the decoder. Assume that with probability at least $1 - \delta_{\mathrm{prox}}$, for all $y \in \mathcal{Y}$,
\[
|\tilde\mu(y) - \hat\mu_n(y)| \le \varepsilon_\mu, \qquad |\tilde\sigma(y) - \hat\sigma_n(y)| \le \varepsilon_\sigma. \tag{43}
\]

Remark (dependence and interpretation). Assumption (43) intentionally abstracts away the dependence structure of proxy-generated scores: the perturbation-and-RM procedure may induce correlated errors and systematic bias. Our goal here is not to claim that proxy samples satisfy i.i.d. concentration, but to show that any proxy estimator that is uniformly close to the sample statistics yields only an additive degradation in the pessimistic value, captured by $\Delta_{\mathrm{prox}}$. In Sec. 4, we empirically assess the accuracy of the proxy estimates and their correlation with human disagreement.

Corollary A.10 (Robust LCB under proxy errors). Fix $\delta \in (0,1)$. On the intersection of the LCB event in Proposition 3.3 (proved in Appendix A.2) and the proxy error event (43), simultaneously for all $y \in \mathcal{Y}$,
\[
\mu(y) \;\ge\; \underbrace{\tilde\mu(y) - c\,\tilde\sigma(y)\sqrt{\frac{\log(K/\delta)}{n}} - c\,(b-a)\frac{\log(K/\delta)}{n}}_{\widetilde{\mathrm{LCB}}_\delta(y)} \;-\; \Delta_{\mathrm{prox}}, \tag{44}
\]
where
\[
\Delta_{\mathrm{prox}} := \varepsilon_\mu + c\,\varepsilon_\sigma\sqrt{\frac{\log(K/\delta)}{n}}. \tag{45}
\]
The intersection event holds with probability at least $1 - (\delta + \delta_{\mathrm{prox}})$ by a union bound.
Consequently, letting $\tilde y \in \arg\max_{y \in \mathcal{Y}} \widetilde{\mathrm{LCB}}_\delta(y)$ and $y^\star \in \arg\max_{y \in \mathcal{Y}} \mathrm{LCB}_\delta(y)$, we have the value suboptimality bound
\[
\mathrm{LCB}_\delta(y^\star) - \mathrm{LCB}_\delta(\tilde y) \;\le\; 2\,\Delta_{\mathrm{prox}}. \tag{46}
\]
In particular, proxy maximization is within $O(\Delta_{\mathrm{prox}})$ of the sample-based optimum.

Proof. By Proposition 3.3 (proved in Appendix A.2), for all $y$ on the LCB event,
\[
\mu(y) \ge \hat\mu_n(y) - c\,\hat\sigma_n(y)\sqrt{\frac{\log(K/\delta)}{n}} - c\,(b-a)\frac{\log(K/\delta)}{n}.
\]
Using (43), we have $\hat\mu_n(y) \ge \tilde\mu(y) - \varepsilon_\mu$ and $\hat\sigma_n(y) \le \tilde\sigma(y) + \varepsilon_\sigma$. Substituting and rearranging yields (44)–(45). Finally, since for all $y$ we have $\mathrm{LCB}_\delta(y) \ge \widetilde{\mathrm{LCB}}_\delta(y) - \Delta_{\mathrm{prox}}$ and $\mathrm{LCB}_\delta(y) \le \widetilde{\mathrm{LCB}}_\delta(y) + \Delta_{\mathrm{prox}}$, standard argmax stability yields (46). ∎

A.13 $\epsilon$-tie breaking: Pareto and scalarization properties

Lemma A.11 ($\epsilon$-tie breaking on the robust-value–disagreement frontier). Fix a prompt $s$ and candidate set $\mathcal{Y}(s)$. Any solution $y_\epsilon(s)$ of (22) is Pareto-optimal with respect to $(\hat V_\beta, \hat\sigma)$: there does not exist $y \in \mathcal{Y}(s)$ such that $\hat V_\beta(s,y) \ge \hat V_\beta(s,y_\epsilon)$ and $\hat\sigma(s,y) \le \hat\sigma(s,y_\epsilon)$ with at least one strict inequality. Moreover, if $(\hat\sigma(s,y_\epsilon), \hat V_\beta(s,y_\epsilon))$ is a supported Pareto-optimal point of the finite set $\{(\hat\sigma(s,y), \hat V_\beta(s,y)) : y \in \mathcal{Y}(s)\} \subset \mathbb{R}^2$, then there exists a (possibly non-unique) $\lambda_\epsilon \ge 0$ such that
\[
y_\epsilon(s) \in \arg\max_{y \in \mathcal{Y}(s)} \big( \hat V_\beta(s,y) - \lambda_\epsilon\,\hat\sigma(s,y) \big). \tag{47}
\]

Proof. See Appendix A.10. ∎

Remark A.12 ($\epsilon$-rule approximates the $\hat V_\beta$–dispersion trade-off under small robust-value gaps). Fix $\lambda \ge 0$ and let $y_\lambda \in \arg\max_{y \in \mathcal{Y}(s)} \big( \hat V_\beta(s,y) - \lambda\,\hat\sigma(s,y) \big)$. If $y_\lambda \in \mathcal{F}_\epsilon(s)$ (i.e., its robust value is within $\epsilon$ of the best robust value), then the $\epsilon$-tie-break output $y_\epsilon$ satisfies
\[
\hat V_\beta(s,y_\epsilon) - \lambda\,\hat\sigma(s,y_\epsilon) \ge \hat V_\beta(s,y_\lambda) - \lambda\,\hat\sigma(s,y_\lambda) - \epsilon.
\]
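To make the $\epsilon$-tie-breaking rule of this appendix concrete, here is a minimal Python sketch (an illustrative reimplementation under simplified assumptions, not the paper's released code): each candidate maps to a list of scalar satisfaction samples, `entropic_value` computes the empirical entropic value $-\frac{1}{\beta}\log\big(\frac{1}{n}\sum_i e^{-\beta r_i}\big)$, and among candidates within $\epsilon$ of the best robust value we return the one with the smallest dispersion.

```python
import math

def entropic_value(samples, beta):
    """Empirical entropic (KL-robust) value: -(1/beta) * log mean(exp(-beta * r))."""
    m = max(-beta * r for r in samples)  # log-sum-exp stabilization
    log_mean = m + math.log(sum(math.exp(-beta * r - m) for r in samples) / len(samples))
    return -(1.0 / beta) * log_mean

def std(samples):
    """Unbiased empirical standard deviation (dispersion proxy)."""
    mu = sum(samples) / len(samples)
    return (sum((r - mu) ** 2 for r in samples) / (len(samples) - 1)) ** 0.5

def darc_eps_select(candidates, beta, eps):
    """epsilon-tie breaking: among candidates within eps of the best robust
    value, return the one with the smallest dispersion proxy."""
    vals = {y: entropic_value(rs, beta) for y, rs in candidates.items()}
    v_max = max(vals.values())
    near_ties = [y for y, v in vals.items() if v >= v_max - eps]
    return min(near_ties, key=lambda y: std(candidates[y]))
```

With two candidates of equal mean, the higher-variance one already receives a lower entropic value, and the tie-break additionally prefers low dispersion within the near-tie set.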
A.14 From pairwise comparisons to scalar samples

From pairwise preferences to scalar satisfaction (estimation layer). Human feedback is often collected as pairwise comparisons or multi-criteria judgments rather than scalar scores. One may induce scalar satisfaction via standard latent-utility scalarizations (e.g., Bradley–Terry/Thurstone-style fitting or calibrated ratings); our method then operates on the induced scalar samples. We emphasize that the high-probability guarantee in Section 3 applies directly to the i.i.d. scalar-sample regime above; when scalarization is obtained indirectly from comparisons, we treat the resulting $(\hat\mu, \hat\sigma)$ as an estimator and validate it empirically. We sketch how common pairwise preference data can be mapped to the scalar samples used in Sec. 3. Assume annotator $i$ has a latent utility $U_i(y)$ and pairwise labels are generated from noisy utility differences, e.g., Bradley–Terry/Thurstone:
\[
\Pr(y \succ_i y') = \sigma\!\Big(\frac{U_i(y) - U_i(y')}{\eta}\Big), \tag{48}
\]
where $\sigma(\cdot)$ is a logistic or Gaussian CDF and $\eta > 0$. Given multiple comparisons involving $y$, a scalarization (e.g., per-annotator win-rate or fitted BT/Thurstone score) yields a proxy $\hat U_i(y)$.

Lemma A.13 (LCB form under sub-Gaussian scalarization noise). Fix $y$ and suppose $\hat U_i(y) = U_i(y) + \xi_i(y)$ where $\{\xi_i(y)\}_{i=1}^{n}$ are independent and $\nu$-sub-Gaussian. Then with probability at least $1 - \delta$, simultaneously for all $y \in \mathcal{Y}(s)$,
\[
U(y) \;\ge\; \hat\mu_n(y) - c'\,\hat\sigma_n(y)\sqrt{\frac{\log(K/\delta)}{n}},
\]
for an absolute constant $c' > 0$ (up to lower-order terms), i.e., the same mean–dispersion LCB structure as in (9).

Appendix B Proof of Proposition 3.6 ($\chi^2$-DRO robust mean)

Full characterization. A sufficient tightness condition is that the extremal density
\[
\frac{d\mathbb{Q}^\star}{d\mathbb{P}}(R) = 1 - \sqrt{\rho}\,\frac{R - \mu_{\mathbb{P}}}{\sigma_{\mathbb{P}}}
\]
is nonnegative $\mathbb{P}$-almost surely. In particular, it holds if $R \le \mu_{\mathbb{P}} + \sigma_{\mathbb{P}}/\sqrt{\rho}$ a.s.
For bounded ratings $R \in [a,b]$, nonnegativity holds whenever $\sqrt{\rho}\,(b - \mu_{\mathbb{P}}) \le \sigma_{\mathbb{P}}$. We provide a complete proof of Proposition 3.6. We first prove the general lower bound $\inf_{\mathbb{Q} \in \mathcal{Q}_\rho(\mathbb{P})} \mathbb{E}_{\mathbb{Q}}[R] \ge \mu_{\mathbb{P}} - \sqrt{\rho}\,\sigma_{\mathbb{P}}$ for any square-integrable $R$, and then characterize a sufficient condition for tightness. Finally, we specialize to the empirical distribution.

B.1 General bound via a change of measure

Let $\mathbb{Q} \ll \mathbb{P}$ and define
\[
g(R) := \frac{d\mathbb{Q}}{d\mathbb{P}}(R) - 1.
\]
Then $\mathbb{E}_{\mathbb{P}}[g] = 0$ and the $\chi^2$ constraint reads $\mathbb{E}_{\mathbb{P}}[g^2] \le \rho$. Moreover,
\begin{align*}
\mathbb{E}_{\mathbb{Q}}[R] - \mathbb{E}_{\mathbb{P}}[R]
&= \mathbb{E}_{\mathbb{P}}\Big[\frac{d\mathbb{Q}}{d\mathbb{P}}(R)\,R\Big] - \mathbb{E}_{\mathbb{P}}[R] \\
&= \mathbb{E}_{\mathbb{P}}[(1 + g(R))R] - \mathbb{E}_{\mathbb{P}}[R] \\
&= \mathbb{E}_{\mathbb{P}}[g(R)R] = \mathbb{E}_{\mathbb{P}}[g(R)(R - \mu_{\mathbb{P}})].
\end{align*}
By Cauchy–Schwarz,
\[
\mathbb{E}_{\mathbb{P}}[g(R)(R - \mu_{\mathbb{P}})] \ge -\sqrt{\mathbb{E}_{\mathbb{P}}[g^2]}\,\sqrt{\mathbb{E}_{\mathbb{P}}[(R - \mu_{\mathbb{P}})^2]} \ge -\sqrt{\rho}\,\sigma_{\mathbb{P}}.
\]
Taking the infimum over all $\mathbb{Q} \in \mathcal{Q}_\rho(\mathbb{P})$ yields
\[
\inf_{\mathbb{Q} \in \mathcal{Q}_\rho(\mathbb{P})} \mathbb{E}_{\mathbb{Q}}[R] \ge \mu_{\mathbb{P}} - \sqrt{\rho}\,\sigma_{\mathbb{P}},
\]
proving (14).

B.2 Tightness and the extremal density

Equality in Cauchy–Schwarz holds when $g(R)$ is proportional to $-(R - \mu_{\mathbb{P}})$. Define
\[
g^\star(R) := -\sqrt{\rho}\,\frac{R - \mu_{\mathbb{P}}}{\sigma_{\mathbb{P}}}.
\]
Then $\mathbb{E}_{\mathbb{P}}[g^\star] = 0$ and $\mathbb{E}_{\mathbb{P}}[(g^\star)^2] = \rho$. If additionally $1 + g^\star(R) \ge 0$ $\mathbb{P}$-a.s., we may define
\[
\frac{d\mathbb{Q}^\star}{d\mathbb{P}}(R) := 1 + g^\star(R) = 1 - \sqrt{\rho}\,\frac{R - \mu_{\mathbb{P}}}{\sigma_{\mathbb{P}}}, \tag{49}
\]
which is a valid Radon–Nikodym derivative. Plugging $g^\star$ into the derivation above achieves equality, giving $\mathbb{E}_{\mathbb{Q}^\star}[R] = \mu_{\mathbb{P}} - \sqrt{\rho}\,\sigma_{\mathbb{P}}$. A sufficient condition for nonnegativity is
\[
R \le \mu_{\mathbb{P}} + \frac{\sigma_{\mathbb{P}}}{\sqrt{\rho}} \quad \mathbb{P}\text{-a.s.}, \tag{50}
\]
which implies $1 + g^\star(R) \ge 0$ a.s.

B.3 Specialization to the empirical distribution

Now let $\mathbb{P} = \hat{\mathbb{P}}_n^y = \frac{1}{n}\sum_{i=1}^{n} \delta_{R_i(y)}$. Any $\mathbb{Q} \ll \hat{\mathbb{P}}_n^y$ corresponds to a probability vector $q \in \Delta^n$ supported on the same atoms:
\[
\mathbb{Q} = \sum_{i=1}^{n} q_i\,\delta_{R_i(y)}.
\]
Since $\hat{\mathbb{P}}_n^y$ is uniform, the $\chi^2$ divergence becomes
\[
D_{\chi^2}(\mathbb{Q}\,\|\,\hat{\mathbb{P}}_n^y) = \sum_{i=1}^{n} \frac{(q_i - 1/n)^2}{1/n} = n\sum_{i=1}^{n}\Big(q_i - \frac{1}{n}\Big)^2.
\]
Moreover,
\[
\mathbb{E}_{\mathbb{Q}}[R] = \sum_{i=1}^{n} q_i R_i(y), \qquad \mu_{\hat{\mathbb{P}}_n^y} = \hat\mu_n(y), \qquad \sigma^2_{\hat{\mathbb{P}}_n^y} = \hat v_n^2(y),
\]
where $\hat v_n^2(y) = \frac{1}{n}\sum_{i=1}^{n}(R_i(y) - \hat\mu_n(y))^2$. Therefore, whenever the empirical nonnegativity condition holds on the support, the worst-case mean equals the closed form
\[
\inf_{\mathbb{Q} \in \mathcal{Q}_\rho(\hat{\mathbb{P}}_n^y)} \mathbb{E}_{\mathbb{Q}}[R] = \hat\mu_n(y) - \sqrt{\rho}\,\hat v_n(y).
\]
Otherwise, the lower bound (14) still holds.

Remark B.1 (Conservative approximation under gap). When the nonnegativity condition (50) is violated (e.g., due to outliers far below the mean or large $\rho$), the Cauchy–Schwarz inequality implies that the closed form $\hat\mu_n(y) - \sqrt{\rho}\,\hat\sigma_n(y)$ becomes a strict lower bound on the true $\chi^2$-DRO objective. In this regime, maximizing the mean–dispersion surrogate effectively optimizes a quantity that is more conservative than the exact distributionally robust mean. This aligns with our overall goal of pessimistic selection: the surrogate does not overstate the robust value even when the local approximation is inexact. ∎

Appendix C Calibrating LCB scaling to the $\chi^2$-DRO radius

This appendix formalizes a numerical calibration between the LCB penalty in Proposition 3.3 and the ambiguity radius in empirical $\chi^2$-DRO. Importantly, this calibration is not meant to claim that LCB and DRO model the same source of uncertainty: LCB controls finite-sample estimation uncertainty for $\mu(y)$, while $\chi^2$-DRO is commonly interpreted as distributional/aleatoric robustness. Here we use $\chi^2$-DRO as a convenient variational reparameterization of the same mean–dispersion functional, which yields an interpretable radius $\rho$ corresponding to a chosen standard-deviation penalty $\lambda$.

Proposition C.1 (Exact $\lambda$–$\rho$ mapping for empirical $\chi^2$-DRO).
Fix a candidate $y$ and its empirical distribution $\hat{\mathbb{P}}_n^y$. Assume the nonnegativity condition in Proposition 3.6 holds on the empirical support. Then for any $\rho \ge 0$,
\[
\inf_{\mathbb{Q} \in \mathcal{Q}_\rho(\hat{\mathbb{P}}_n^y)} \mathbb{E}_{\mathbb{Q}}[R] \;=\; \hat\mu_n(y) - \lambda_\rho\,\hat\sigma_n(y), \qquad \lambda_\rho := \sqrt{\rho\,\frac{n-1}{n}}. \tag{51}
\]
Equivalently, for a desired penalty $\lambda \ge 0$, choosing
\[
\rho(\lambda) := \frac{n}{n-1}\,\lambda^2 \tag{52}
\]
makes the empirical $\chi^2$-DRO robust mean exactly equal to $\hat\mu_n(y) - \lambda\,\hat\sigma_n(y)$.

Beyond the nonnegativity regime. The equality in (51) requires a sufficient nonnegativity condition (Proposition 3.6). When this condition fails (e.g., due to heavy-tailed empirical support or large $\rho$), the closed-form expression may no longer be tight. Nevertheless, Proposition 3.6 still provides the conservative lower bound
\[
\inf_{\mathbb{Q} \in \mathcal{Q}_\rho(\hat{\mathbb{P}}_n^y)} \mathbb{E}_{\mathbb{Q}}[R] \;\ge\; \hat\mu_n(y) - \lambda_\rho\,\hat\sigma_n(y),
\]
so the mean–dispersion form remains a valid pessimistic surrogate even outside the tightness regime.

Discussion: LCB as calibrated DRO plus lower-order slack. Proposition 3.3 provides a uniform high-probability lower bound
\[
\mu(y) \ge \hat\mu_n(y) - c\,\hat\sigma_n(y)\sqrt{\frac{\log(K/\delta)}{n}} - c\,\nu(y)\frac{\log(K/\delta)}{n} \quad \text{simultaneously for all } y \in \mathcal{Y}.
\]
Define
\[
\lambda_\delta := c\sqrt{\frac{\log(K/\delta)}{n}} \qquad \text{and} \qquad \rho_\delta := \rho(\lambda_\delta) = \frac{n}{n-1}\,\lambda_\delta^2.
\]
By Proposition C.1, in the nonnegativity regime the empirical $\chi^2$-DRO robust mean under radius $\rho_\delta$ matches $\hat\mu_n(y) - \lambda_\delta\,\hat\sigma_n(y)$ exactly. The additional term $c\,\nu(y)\log(K/\delta)/n$ is a finite-sample slack needed to convert this robust value into a high-probability lower bound on the true mean $\mu(y)$ under the assumed concentration scale $\nu(y)$.

Interpretation of the $n$-dependent radius. The mapping $\rho_\delta \propto \log(K/\delta)/n$ arises because $\lambda_\delta$ is chosen to control estimation error uniformly over $K$ candidates.
This $n$-dependence should be interpreted as an estimation-calibrated DRO radius: as $n$ grows, the ambiguity set shrinks because the empirical mean becomes well estimated. However, we emphasize that in preference alignment, high empirical variance often signals intrinsic heterogeneity (disagreement) rather than pure noise (empirically supported by the monotonic trend across disagreement buckets; see Fig. 3). Thus, while the statistical radius required for concentration decays with $n$, the functional form of the penalty (i.e., penalizing dispersion) remains structurally isomorphic to a robustness constraint against disagreement. This allows the LCB framework to serve a dual purpose: rigorously controlling finite-sample error while effectively acting as a proxy to demote controversial candidates.

Appendix D On dependence across candidates and the uniform guarantee

Our uniform guarantee in Proposition 3.3 does not require independence across candidates. It only requires per-candidate concentration for each fixed $y$ and a union bound.

Lemma D.1 (Union bound without independence). Let $\{\mathcal{E}_y\}_{y \in \mathcal{Y}}$ be any collection of events (possibly dependent). If for each $y \in \mathcal{Y}$ we have $\Pr(\mathcal{E}_y) \ge 1 - \delta'$, then
\[
\Pr\Big(\bigcap_{y \in \mathcal{Y}} \mathcal{E}_y\Big) \ge 1 - \sum_{y \in \mathcal{Y}} \Pr(\mathcal{E}_y^c) \ge 1 - K\delta',
\]
where $K := |\mathcal{Y}|$.

Application to Proposition 3.3. For each fixed $y$, the self-normalized empirical-Bernstein inequality yields an event $\mathcal{E}_y$ such that $\Pr(\mathcal{E}_y) \ge 1 - \delta'$ and on $\mathcal{E}_y$ the bound (9) holds. Annotator overlap across different candidates can create dependence among $\{\mathcal{E}_y\}_{y \in \mathcal{Y}}$, but Lemma D.1 shows that the intersection event still holds with probability at least $1 - K\delta'$. Setting $\delta' = \delta/K$ yields the stated uniform guarantee.

Appendix E Scorer ambiguity and multi-scorer aggregation

This appendix details our proxy instantiation that hedges scorer ambiguity when scalar satisfaction samples are produced by learned reward models. We consider a family of $M$ scorers indexed by $m \in [M]$.
For each prompt $s$, candidate $y \in \mathcal{Y}(s)$, and scorer $m$, we obtain $n$ scalar samples $\{R_{m,i}(s,y)\}_{i=1}^{n}$ (e.g., via independent scorer noise, style-preserving perturbations, or repeated evaluations). All analysis conditions on the realized candidate set $\mathcal{Y}(s)$.

E.1 Within-prompt normalization across scorers

Reward-model outputs may differ in scale and offset across scorers; to make aggregation meaningful, we normalize per prompt. Define the pooled mean and standard deviation under scorer $m$:
\[
\bar R_m(s) := \frac{1}{Kn}\sum_{y \in \mathcal{Y}(s)}\sum_{i=1}^{n} R_{m,i}(s,y), \tag{53}
\]
\[
\hat s_m(s) := \sqrt{\frac{1}{Kn - 1}\sum_{y \in \mathcal{Y}(s)}\sum_{i=1}^{n}\big(R_{m,i}(s,y) - \bar R_m(s)\big)^2} \;>\; 0. \tag{54}
\]
We form normalized samples
\[
\tilde R_{m,i}(s,y) := \frac{R_{m,i}(s,y) - \bar R_m(s)}{\hat s_m(s)}. \tag{55}
\]
When $M = 1$, this normalization can be omitted without changing the framework.

Implementation note. The constants $(\bar R_m(s), \hat s_m(s))$ require access to all candidates $y \in \mathcal{Y}(s)$ for the given prompt $s$. In Algorithm 1, we therefore precompute $(\bar R_m(s), \hat s_m(s))$ once per scorer $m$ before scoring individual candidates.

E.2 Scorer-specific entropic value, risk premium, and dispersion

For each scorer $m$, we compute the scorer-specific empirical entropic value using normalized samples:
\[
\hat V_{\beta,m}(s,y;\bar R_m(s), \hat s_m(s)) := -\frac{1}{\beta}\log\Big(\frac{1}{n}\sum_{i=1}^{n}\exp\big(-\beta\,\tilde R_{m,i}(s,y)\big)\Big), \tag{56}
\]
and the corresponding empirical mean
\[
\hat\mu_m(s,y;\bar R_m(s), \hat s_m(s)) := \frac{1}{n}\sum_{i=1}^{n}\tilde R_{m,i}(s,y). \tag{57}
\]
We define the scorer-specific empirical entropic risk premium as
\[
\widehat{\mathrm{RP}}_{\beta,m}(s,y;\bar R_m(s), \hat s_m(s)) := \hat\mu_m(s,y;\bar R_m(s), \hat s_m(s)) - \hat V_{\beta,m}(s,y;\bar R_m(s), \hat s_m(s)) \;\ge\; 0.
\]
(58)

We also define the scorer-specific empirical standard deviation (used as a disagreement proxy) as
\[
\hat\sigma_m(s,y;\bar R_m(s), \hat s_m(s)) := \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\big(\tilde R_{m,i}(s,y) - \hat\mu_m(s,y;\bar R_m(s), \hat s_m(s))\big)^2}. \tag{59}
\]

E.3 Scorer-robust aggregation and decoding rule

To hedge scorer shift, we aggregate scorer-specific entropic values via a soft worst-case (soft-min) operator with parameter $\gamma > 0$:
\[
\tilde V_{\beta,\gamma}(s,y) := -\frac{1}{\gamma}\log\Big(\frac{1}{M}\sum_{m=1}^{M}\exp\big(-\gamma\,\hat V_{\beta,m}(s,y)\big)\Big), \tag{60}
\]
which satisfies $\tilde V_{\beta,\gamma}(s,y) \to \min_{m \in [M]} \hat V_{\beta,m}(s,y)$ as $\gamma \to \infty$. For explicit risk control, we aggregate risk premia pessimistically:
\[
\widetilde{\mathrm{RP}}_\beta(s,y) := \max_{m \in [M]} \widehat{\mathrm{RP}}_{\beta,m}(s,y). \tag{61}
\]
The scorer-robust risk-constrained decoding rule is then
\[
\hat y \in \arg\max_{y \in \mathcal{Y}(s)} \tilde V_{\beta,\gamma}(s,y) \quad \text{s.t.} \quad \widetilde{\mathrm{RP}}_\beta(s,y) \le \tau, \tag{62}
\]
or in penalized (Lagrangian) form,
\[
\hat y \in \arg\max_{y \in \mathcal{Y}(s)} \tilde V_{\beta,\gamma}(s,y) - \lambda\,\widetilde{\mathrm{RP}}_\beta(s,y). \tag{63}
\]

E.4 Interpretation

The aggregation $\tilde V_{\beta,\gamma}$ hedges against worst-case scorers while remaining smooth and stable for finite $\gamma$. The pessimistic risk-premium aggregation $\widetilde{\mathrm{RP}}_\beta$ prevents a candidate from being selected if it is deemed high-risk by any scorer. When scorer variability reflects genuine preference uncertainty (e.g., validated proxies or ensembles calibrated to human disagreement), this multi-scorer instantiation also promotes reliability under heterogeneous preferences.

Appendix F Detailed decoding rules and baselines

Candidate set and selection statistics. Given an input $s$, we sample a fixed candidate set $\mathcal{Y}(s) = \{y_1, \dots, y_K\}$ from a generator policy $\pi$. For each candidate $y$, we compute an empirical satisfaction mean $\hat\mu_{\mathrm{sel}}(s,y)$ and a dispersion proxy $\hat\sigma_{\mathrm{sel}}(s,y)$ (see Appendix A.8 for the exact estimation procedure).
When using CVaR, we additionally compute the empirical lower-tail mean $\widehat{\mathrm{CVaR}}_{\alpha,\mathrm{sel}}(s,y)$.

F.1 Full pseudocode for DARC

Algorithm 1 DARC with scorer-robust aggregation (full version)
1: Input: prompt $s$
2: Input: candidate set $\mathcal{Y}(s) = \{y_1, \dots, y_K\}$ (fixed for all methods)
3: Input: scorer family $\{\mathrm{Score}^{(m)}_\beta\}_{m=1}^{M}$, where $\mathrm{Score}^{(m)}_\beta(s,y;\bar R, \hat s) \to (\hat\mu_m, \hat V_{\beta,m}, \hat\sigma_m)$
4: Input: aggregation parameter $\gamma > 0$ (soft worst-case over scorers)
5: Input: variant $\in$ {Entropic, Tau, Eps, 2nd-Moment (LCB)} and hyperparameters $(\beta, \lambda, \tau, \epsilon)$
6: Output: selected output $y^\star$
7: (Optional but recommended) Within-prompt normalization across scorers (selection-time only).
8: Precompute pooled normalization constants for each scorer $m$ over all candidates $y \in \mathcal{Y}(s)$.
9: for $m = 1$ to $M$ do
10:   Collect raw samples $\{R_{m,i}(s,y_k)\}_{k \in [K],\, i \in [n]}$
11:   $\bar R_m(s) \leftarrow \frac{1}{Kn}\sum_{k=1}^{K}\sum_{i=1}^{n} R_{m,i}(s,y_k)$
12:   $\hat s_m(s) \leftarrow \sqrt{\frac{1}{Kn-1}\sum_{k=1}^{K}\sum_{i=1}^{n}\big(R_{m,i}(s,y_k) - \bar R_m(s)\big)^2}$
13: end for
14: If normalization is disabled, set $\bar R_m(s) \leftarrow 0$ and $\hat s_m(s) \leftarrow 1$ for all $m$.
15: for $k = 1$ to $K$ do
16:   for $m = 1$ to $M$ do
17:     $(\hat\mu_{m,k}, \hat V_{\beta,m,k}, \hat\sigma_{m,k}) \leftarrow \mathrm{Score}^{(m)}_\beta(s, y_k; \bar R_m(s), \hat s_m(s))$
18:     $\widehat{\mathrm{RP}}_{\beta,m,k} \leftarrow \hat\mu_{m,k} - \hat V_{\beta,m,k}$
19:   end for
20:   $\tilde V_{\beta,\gamma,k} \leftarrow -\frac{1}{\gamma}\log\Big(\frac{1}{M}\sum_{m=1}^{M}\exp(-\gamma\,\hat V_{\beta,m,k})\Big)$  ▷ soft-min over scorers
21:   $\widetilde{\mathrm{RP}}_{\beta,k} \leftarrow \max_{m \in [M]} \widehat{\mathrm{RP}}_{\beta,m,k}$  ▷ worst-case risk premium
22:   $\tilde\sigma_k \leftarrow \max_{m \in [M]} \hat\sigma_{m,k}$  ▷ worst-case disagreement proxy
23:   $\tilde\mu_k \leftarrow \frac{1}{M}\sum_{m=1}^{M} \hat\mu_{m,k}$  ▷ (optional) averaged mean for ablations
24: end for
25: if variant = Entropic then
26:   $k^\star \leftarrow \arg\max_{k \in [K]} \tilde V_{\beta,\gamma,k}$
27:   return $y_{k^\star}$
28: else if variant = Tau then
29:   $\mathcal{F} \leftarrow \{k \in [K] : \widetilde{\mathrm{RP}}_{\beta,k} \le \tau\}$
30:   if $\mathcal{F} = \emptyset$ then
31:     $\mathcal{F} \leftarrow [K]$
32:   end if
33:   $k^\star \leftarrow \arg\max_{k \in \mathcal{F}} \tilde V_{\beta,\gamma,k}$
34:   return $y_{k^\star}$
35: else if variant = Eps then
36:   $\tilde V_{\beta,\gamma,\max} \leftarrow \max_{k \in [K]} \tilde V_{\beta,\gamma,k}$
37:   $\mathcal{F} \leftarrow \{k \in [K] : \tilde V_{\beta,\gamma,k} \ge \tilde V_{\beta,\gamma,\max} - \epsilon\}$  ▷ $\mathcal{F} \neq \emptyset$ if $\epsilon \ge 0$
38:   $k^\star \leftarrow \arg\min_{k \in \mathcal{F}} \tilde\sigma_k$
39:   return $y_{k^\star}$
40: end if

Notes. Each $\mathrm{Score}^{(m)}_\beta$ may be instantiated by multi-annotator samples or proxy perturbation samples under scorer $m$. Because reward scales differ across scorers, aggregation is performed on selection-time normalized scores (within each prompt), while all reported metrics in the main tables are computed within each scorer on its raw scale (Appendix H.3). When $M = 1$, the procedure reduces to the original single-scorer DARC.

F.2 Inference-time selection baselines.

All rules below select one output from the same $\mathcal{Y}(s)$:

• Mean (Best-of-K):
\[
y^\star = \arg\max_{y \in \mathcal{Y}(s)} \hat\mu_{\mathrm{sel}}(s,y).
\]

• CVaR (Best-of-K):
\[
y^\star = \arg\max_{y \in \mathcal{Y}(s)} \widehat{\mathrm{CVaR}}_{\alpha,\mathrm{sel}}(s,y),
\]
where $\widehat{\mathrm{CVaR}}_{\alpha,\mathrm{sel}}(s,y)$ is the average of the bottom $\alpha$ fraction of satisfaction samples for candidate $y$. We fix $\alpha$ throughout (no tuning).

• MC-Dropout reranking (uncertainty-aware scoring). To obtain an uncertainty estimate without human multi-rater samples, we perform $T$ stochastic forward passes through the scorer (dropout enabled) and treat the resulting scores as an empirical proxy distribution $\{r_t(s,y)\}_{t=1}^{T}$ (Gal and Ghahramani, 2016; Kendall and Gal, 2017).
We then compute
\[
\hat\mu_{\mathrm{MC}}(s,y) = \frac{1}{T}\sum_{t=1}^{T} r_t(s,y) \qquad \text{and} \qquad \hat\sigma_{\mathrm{MC}}(s,y) = \sqrt{\frac{1}{T-1}\sum_{t=1}^{T}\big(r_t(s,y) - \hat\mu_{\mathrm{MC}}(s,y)\big)^2}.
\]
We use a pessimistic selection rule analogous to MV/LCB:
\[
y^\star = \arg\max_{y \in \mathcal{Y}(s)} \big( \hat\mu_{\mathrm{MC}}(s,y) - \lambda_{\mathrm{MC}}\,\hat\sigma_{\mathrm{MC}}(s,y) \big),
\]
where $\lambda_{\mathrm{MC}}$ is tuned on the dev set (same protocol as other inference-time baselines).

• DeAL decoding-time alignment (rollout reranking). Given a prompt $s$ and a partially generated prefix $x_{<t}$ at decoding step $t$, the base LM defines next-token probabilities $p_\theta(v \mid s, x_{<t})$ over the vocabulary $\mathcal{V}$. DeAL performs a one-step lookahead search over the top-$k$ next tokens:
\[
\mathcal{V}_t^{(k)} := \operatorname{top\text{-}}k_{v \in \mathcal{V}}\ \log p_\theta(v \mid s, x_{<t}).
\]
For each candidate token $v \in \mathcal{V}_t^{(k)}$, we form the candidate prefix $x_{<t}v$ and compute a length-$L$ lookahead continuation using greedy rollout:
\[
\tilde x_{t:t+L-1}^{(v)} := \mathrm{GreedyRollout}\big(\pi_\theta;\ s,\ x_{<t}v,\ L\big),
\]
where $\pi_\theta$ is the same LM used as the proposal policy. We then score the resulting partial completion with a reward model $r_\phi$:
\[
\mathrm{RM}(s, x_{<t}v) := r_\phi\big(s,\ x_{<t}v\,\tilde x_{t:t+L-1}^{(v)}\big).
\]
DeAL selects the next token by maximizing a combined objective
\[
v_t \in \arg\max_{v \in \mathcal{V}_t^{(k)}} \big\{ \log p_\theta(v \mid s, x_{<t}) + \gamma \cdot \mathrm{RM}(s, x_{<t}v) \big\},
\]
and updates $x_{<t+1} \leftarrow x_{<t}v_t$ until EOS or the maximum length is reached. This baseline is substantially more expensive than reranking a fixed candidate set, since it requires $k$ rollouts and RM evaluations at each decoding step.

• Regularized Best-of-K (RBoN). RBoN augments reward-model reranking with a reference-policy regularizer to mitigate reward hacking at inference time (Jinnai et al., 2024). Let $\pi_{\mathrm{ref}}$ be a fixed reference policy (we use the generator as $\pi_{\mathrm{ref}}$ unless otherwise stated).
For each candidate $y \in \mathcal{Y}(s)$, compute the per-token average log-likelihood under $\pi_{\mathrm{ref}}$, $\bar\ell_{\mathrm{ref}}(s,y) = \frac{1}{|y|}\log \pi_{\mathrm{ref}}(y \mid s)$ (or the unnormalized log-likelihood if length normalization is disabled). RBoN selects
\[
y^\star = \arg\max_{y \in \mathcal{Y}(s)} \big( \hat\mu_{\mathrm{sel}}(s,y) + \beta_{\mathrm{RBoN}}\,\bar\ell_{\mathrm{ref}}(s,y) \big),
\]
where $\beta_{\mathrm{RBoN}} \ge 0$ controls the strength of regularization and is tuned on the dev set.

• Best-of-Poisson (BoP) and HedgeTune. Following Khalaf et al. (2025), we hedge greedy best-of-K selection by randomizing the effective candidate budget. For each prompt $s$, sample $K' \sim \mathrm{Poisson}(\lambda)$ and truncate to $K' \leftarrow \min(\max(K', 1), K)$. Let $\mathcal{I} \subseteq \{1, \dots, K\}$ be a uniformly sampled index set of size $|\mathcal{I}| = K'$ and define the corresponding subset $\mathcal{Y}_{\mathcal{I}}(s) = \{y_i : i \in \mathcal{I}\} \subseteq \mathcal{Y}(s)$. BoP selects
\[
y^\star = \arg\max_{y \in \mathcal{Y}_{\mathcal{I}}(s)} \hat\mu_{\mathrm{sel}}(s,y).
\]
HedgeTune chooses $\lambda$ using the same dev-set protocol as other inference-time baselines, and fixes it for evaluation.

• Caution: pessimistic best-of-K reranking. Inspired by Anonymous (2025), we penalize atypical candidates using an auxiliary error (or atypicality) signal $e(s,y) \ge 0$, yielding a pessimistic score
\[
\mathrm{Score}_{\mathrm{caut}}(s,y) := \hat\mu_{\mathrm{sel}}(s,y) - \alpha_{\mathrm{caut}}\,e(s,y).
\]
We then select
\[
y^\star = \arg\max_{y \in \mathcal{Y}(s)} \mathrm{Score}_{\mathrm{caut}}(s,y),
\]
where $\alpha_{\mathrm{caut}} \ge 0$ controls the strength of pessimism and is tuned on the dev set.

• DARC (ours, scorer-robust): define scorer-specific $\hat V^{(m)}_{\beta,\mathrm{sel}}(s,y)$ and aggregate
\[
\tilde V_{\beta,\gamma,\mathrm{sel}}(s,y) := -\frac{1}{\gamma}\log\Big(\frac{1}{M}\sum_{m=1}^{M}\exp\big(-\gamma\,\hat V^{(m)}_{\beta,\mathrm{sel}}(s,y)\big)\Big),
\]
then select
\[
y^\star = \arg\max_{y \in \mathcal{Y}(s)} \tilde V_{\beta,\gamma,\mathrm{sel}}(s,y).
\]
• Hard budget (DARC-$\tau$ on the scorer-robust entropic risk premium): define scorer-specific risk premia $\widehat{\mathrm{RP}}^{(m)}_{\beta,\mathrm{sel}}(s,y)=\hat\mu^{(m)}_{\mathrm{sel}}(s,y)-\hat V^{(m)}_{\beta,\mathrm{sel}}(s,y)$ and aggregate pessimistically $\widetilde{\mathrm{RP}}_{\beta,\mathrm{sel}}(s,y):=\max_{m\in[M]}\widehat{\mathrm{RP}}^{(m)}_{\beta,\mathrm{sel}}(s,y)$. Select $y^\star=\arg\max_{y\in\mathcal{Y}(s)}\tilde V_{\beta,\gamma,\mathrm{sel}}(s,y)$ subject to $\widetilde{\mathrm{RP}}_{\beta,\mathrm{sel}}(s,y)\le\tau$; if the feasible set is empty, we fall back to $\mathcal{Y}(s)$.

• $\epsilon$-tie breaking (DARC-$\epsilon$ on the scorer-robust value): define the near-tie set $\mathcal{F}_\epsilon(s)=\bigl\{y:\tilde V_{\beta,\gamma,\mathrm{sel}}(s,y)\ge\max_{y'\in\mathcal{Y}(s)}\tilde V_{\beta,\gamma,\mathrm{sel}}(s,y')-\epsilon\bigr\}$ and the worst-case disagreement proxy $\tilde\sigma_{\mathrm{sel}}(s,y):=\max_{m\in[M]}\hat\sigma^{(m)}_{\mathrm{sel}}(s,y)$. Select $y^\star=\arg\min_{y\in\mathcal{F}_\epsilon(s)}\tilde\sigma_{\mathrm{sel}}(s,y)$.

F.3 Training-time robust policies: cDPO and rDPO

Preference data and DPO objective. Let $\mathcal{D}=\{(s,y^+,y^-)\}$ be a preference dataset, where for each input $s$ the response $y^+$ is preferred over $y^-$. Let $\pi_\theta$ be the policy to be trained and $\pi_{\mathrm{ref}}$ a fixed reference policy. Define the logit difference $\Delta_\theta(s,y^+,y^-)=\bigl(\log\pi_\theta(y^+\mid s)-\log\pi_\theta(y^-\mid s)\bigr)-\bigl(\log\pi_{\mathrm{ref}}(y^+\mid s)-\log\pi_{\mathrm{ref}}(y^-\mid s)\bigr)$. The standard DPO loss is $\mathcal{L}_{\mathrm{DPO}}(\theta)=\mathbb{E}_{(s,y^+,y^-)\sim\mathcal{D}}\bigl[-\log\sigma\bigl(\beta\,\Delta_\theta(s,y^+,y^-)\bigr)\bigr]$, where $\sigma(\cdot)$ is the logistic sigmoid and $\beta>0$ controls the strength of preference optimization.

Conservative DPO (cDPO). We implement a conservative variant by applying label smoothing to the pairwise preference target. Specifically, for a smoothing parameter $\varepsilon\in(0,1/2)$, we use the smoothed loss $\mathcal{L}_{\mathrm{cDPO}}(\theta)=\mathbb{E}_{(s,y^+,y^-)\sim\mathcal{D}}\bigl[-(1-\varepsilon)\log\sigma(\beta\,\Delta_\theta)-\varepsilon\log\bigl(1-\sigma(\beta\,\Delta_\theta)\bigr)\bigr]$. We denote the resulting trained policy by $\pi_{\mathrm{cDPO}}$.

Robust DPO (rDPO). Preference data may contain noise or ambiguity.
We consider a robust variant that optimizes against uncertainty in the preference labels. Let $q\in[0,1]$ denote the (unknown) probability that the pair $(y^+,y^-)$ is correctly labeled for input $s$. We model robustness by optimizing the worst-case (uncertainty-aware) expected loss over an admissible set $\mathcal{Q}$: $\mathcal{L}_{\mathrm{rDPO}}(\theta)=\mathbb{E}_{(s,y^+,y^-)\sim\mathcal{D}}\Bigl[\sup_{q\in\mathcal{Q}}\bigl(-q\log\sigma(\beta\,\Delta_\theta)-(1-q)\log\bigl(1-\sigma(\beta\,\Delta_\theta)\bigr)\bigr)\Bigr]$. Here $\mathcal{Q}$ specifies the assumed noise level (e.g., $q\in[1-\rho,1]$ for some $\rho\in[0,1/2)$). We denote the resulting trained policy by $\pi_{\mathrm{rDPO}}$.

Evaluation protocol (policy + inference-time selection). For any trained policy $\pi$ (including $\pi_{\mathrm{cDPO}}$ and $\pi_{\mathrm{rDPO}}$), at inference time we sample a candidate set $\mathcal{Y}(s)=\{y_1,\dots,y_K\}\sim\pi(\cdot\mid s)$ and then apply the same selection rules defined in Appendix F, e.g., Mean (Best-of-K) or DARC-$\epsilon$. This isolates the effect of training-time robustness (changing $\pi$) from inference-time robustness (changing the selection rule).

Appendix G Related Work

G.1 Preference optimization for alignment

Preference-based alignment is the prevailing paradigm for steering LLMs, ranging from RLHF-style reward modeling with policy optimization to direct preference objectives that learn from comparisons without explicit RL loops (Ouyang et al., 2022; Christiano et al., 2017; Rafailov et al., 2023). Recent work largely focuses on making preference optimization simpler and more stable, e.g., via single-stage or reference-light formulations that resemble standard fine-tuning (Hong et al., 2024; Meng et al., 2024; Guo et al., 2024). A second theme highlights the role of data coverage: when preference data poorly covers the model's behavior distribution, offline updates can degrade, motivating unified online–offline perspectives and value-aware objectives (Song et al., 2024; Cen et al., 2024).
Meanwhile, newer datasets enrich supervision beyond pairwise comparisons with multi-attribute ratings and complementary preference signals, enabling finer-grained diagnosis and reward learning (Wang et al., 2024c, b). Despite these advances, most methods still collapse alignment into optimizing a single scalar proxy of utility (often motivated by Bradley–Terry style models (Bradley and Terry, 1952)), which can mask structured disagreement and sensitivity to annotator mix. G.2 Heterogeneity, robustness, and inference-time alignment A growing body of work shows that preferences are intrinsically heterogeneous: annotators disagree systematically, and average-score alignment can hide persistent failures for sub-populations (Zhang et al., 2024; Chen et al., 2024; Casper et al., 2023). This is compounded by proxy over-optimization, where scaling up optimization against an imperfect reward/preference proxy can degrade the underlying target for both RLHF and direct alignment (Gao et al., 2023; Rafailov et al., 2024). Robustness-oriented approaches therefore move beyond mean objectives, including distributionally robust and group-robust formulations that protect against noise and minority-group degradation (Wu et al., 2024; Ramesh et al., 2024; Chakraborty et al., 2024). Another complementary direction makes reward signals multi-dimensional via multi-objective reward modeling or multi-head aggregation, improving interpretability and offering more conservative combinations, but typically requiring richer supervision and an explicit (often fixed) scalarization for deployment (Wang et al., 2024a; Li et al., 2025; Yang et al., 2024). Finally, inference-time alignment methods avoid retraining by allocating extra computation at decoding to select better candidates, though recent analyses caution that aggressive best-of-N selection can amplify proxy error under miscalibration or under-coverage (Sun et al., 2024; Huang et al., 2025a; Ichihara et al., 2025; Huang et al., 2025b). 
Standardized reward-model benchmarks further support systematic study of these issues across settings, including retrieval-augmented generation (Lambert et al., 2025; Zhou et al., 2024; Jin et al., 2025; Liu et al., 2024b).

Distributionally robust optimization and variance regularization. Distributionally robust optimization (DRO) studies objectives of the form $\min_\theta\sup_{\mathbb{Q}\in\mathcal{U}(\hat{\mathbb{P}})}\mathbb{E}_{\mathbb{Q}}[\ell(\theta;Z)]$, where the true distribution is assumed to lie in an ambiguity set around the empirical distribution (Wiesemann et al., 2014; Rahimian and Mehrotra, 2019). Local $f$-divergence neighborhoods yield tractable robust objectives and have been widely used in learning and optimization (Namkoong and Duchi, 2016). In particular, $\chi^2$-DRO is closely connected to variance-based regularization and generalized empirical likelihood perspectives (Duchi and Namkoong, 2019; Duchi et al., 2021). Our work brings this DRO view to inference-time decoding under heterogeneous human preferences: the same mean–dispersion score used by DARC arises as the closed-form robust value of a $\chi^2$-ambiguity set. DRO is also motivated by heterogeneous subpopulations and uniform-performance desiderata (Duchi and Namkoong, 2021), and efficient large-scale methods exist for common DRO formulations (Levy et al., 2020).

Appendix H Experiments

H.1 Human disagreement on the raw Top-K candidate pool

Even before applying any reranking, the raw Top-K candidate pool can exhibit substantial human disagreement. We quantify this on AlpacaEval 2.0 and MT-Bench using the same pool size $K=5$.

Setup. For each prompt $s$, we generate $\mathcal{Y}(s)=\{y_1,\dots,y_K\}$ and collect $n=5$ human ratings per candidate on a 0–10 scale (see Appendix I). For each candidate, we compute the sample mean and variance across raters: $\hat\mu_{\mathrm{human}}(s,y_k):=\frac{1}{n}\sum_{j=1}^{n}h_j(s,y_k)$ and $\hat\sigma^2_{\mathrm{human}}(s,y_k):=\frac{1}{n-1}\sum_{j=1}^{n}\bigl(h_j(s,y_k)-\hat\mu_{\mathrm{human}}(s,y_k)\bigr)^2$.
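As a minimal sketch of the per-candidate rater statistics above (the function and argument names are ours, purely for illustration):

```python
def rater_stats(ratings):
    """Sample mean and unbiased variance of one candidate's human ratings.

    ratings: list of n scalar ratings h_j(s, y_k) on the 0-10 scale.
    Returns (mean, variance), using the 1/(n-1) variance estimator
    as in the definitions above.
    """
    n = len(ratings)
    mu = sum(ratings) / n
    var = sum((h - mu) ** 2 for h in ratings) / (n - 1)
    return mu, var
```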
We summarize per-prompt disagreement by the maximum variance over the pool, $D(s):=\max_{k\in[K]}\hat\sigma^2_{\mathrm{human}}(s,y_k)$, which captures whether at least one candidate is highly contentious.

High-disagreement subset (top 20%). We define the high-disagreement subset as the top 20% of prompts by $D(s)$: $\mathcal{S}_{\mathrm{high}}:=\{s\in\mathcal{S}:D(s)\ge Q_{0.8}(D)\}$, where $Q_{0.8}(D)$ denotes the 80th percentile of $\{D(s):s\in\mathcal{S}\}$ computed on the evaluation set. We report main metrics both on the full set and on $\mathcal{S}_{\mathrm{high}}$.

Dataset | $\mathbb{E}_s[D(s)]$ | $Q_{0.8}(D)$ | $Q_{0.9}(D)$ | $\Pr[D(s)\ge Q_{0.8}(D)]$
AlpacaEval 2.0 | 0.74 | 0.98 | 1.65 | 0.20
MT-Bench | 0.65 | 0.84 | 1.32 | 0.20

Table 4: Human disagreement on the raw Top-K pool ($K=5$) measured by the maximum across-candidate human rating variance $D(s)$. High-disagreement prompts are defined as the top 20% by $D(s)$.

H.2 Hyperparameter selection

Overview. We report hyperparameters for (i) training-time robust policies (cDPO/rDPO) and (ii) inference-time selection methods and proxy-risk estimation, including DARC and the inference-time baselines (DeAL (Huang et al., 2025b), MC-Dropout (Gal and Ghahramani, 2016), RBoN (Jinnai et al., 2024), Best-of-Poisson/HedgeTune (Khalaf et al., 2025), and Caution (Anonymous, 2025)). Unless otherwise stated, we use a single fixed set of hyperparameters for all reported results, with no dataset- or split-specific tuning. All inference-time baselines share the same candidate pools and reward-model preprocessing/truncation for a fair comparison.

Base model and parameter-efficient fine-tuning (cDPO/rDPO). All policy and reference models are initialized from meta-llama/Llama-3.1-8B-Instruct. We fine-tune with QLoRA using 4-bit NF4 quantization with double quantization enabled and bf16 compute. We train LoRA adapters with rank $r=16$, scaling $\alpha=32$, dropout $0.05$, and no bias parameters.
We enable gradient checkpointing and disable KV caching during training.

cDPO training. We optimize the standard DPO objective with inverse-temperature $\beta_{\mathrm{DPO}}=0.1$ and label smoothing $\varepsilon_{\mathrm{ls}}=0.10$. Unless stated otherwise, we use learning rate $10^{-5}$, per-device batch size 1, gradient accumulation 16 (effective batch size 16), and train for 5 epochs in bf16. We cap the maximum prompt length at 512 and the maximum response length at 512 (total length $\le 1024$ tokens), and set the random seed to 7.

rDPO training. We use the same initialization, QLoRA/LoRA configuration, and optimization settings as cDPO, but optimize the rDPO objective with label-flip noise parameter $\epsilon_{\mathrm{flip}}=0.1$ and inverse-temperature $\beta_{\mathrm{DPO}}=0.1$. The reference model is kept frozen throughout training.

Inference-time candidate pools. For automated proxy evaluation and inference-time reranking, we generate a fixed candidate pool of size $K=16$ per prompt using nucleus sampling with top-$p=0.98$ and temperature 0.8, with at most 320 newly generated tokens.

Reward model and proxy-risk statistics. We score each candidate with the reward model Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 using maximum input length 1024 and batch size 16. To estimate proxy disagreement, we construct $N_{\mathrm{aug}}=8$ style-preserving perturbations per candidate and define the sample set $\mathcal{S}=\{r_{\mathrm{orig}}\}\cup\{r^{(j)}_{\mathrm{aug}}\}_{j=1}^{N_{\mathrm{aug}}}$, where each element is the reward-model score on the corresponding prompt–response formatting. We then compute the empirical mean $\hat\mu$, standard deviation $\hat\sigma$, and tail metric $\mathrm{CVaR}_{0.1}$ as the average of the lowest 10% of samples in $\mathcal{S}$.

DARC (entropic) decoding hyperparameters. Our primary decoding rule selects the candidate maximizing the empirical entropic robust value $\hat V_\beta=-\frac{1}{\beta}\log\bigl(\frac{1}{|\mathcal{S}|}\sum_{r\in\mathcal{S}}\exp(-\beta r)\bigr)$, with entropic temperature $\beta=1.0$.
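A minimal sketch of the per-candidate proxy statistics used above (function and key names are ours, not from any released code; the entropic value is computed with a max-shift for numerical stability):

```python
import math

def proxy_stats(samples, beta=1.0, alpha=0.1):
    """Mean, std, lower-tail CVaR, and empirical entropic robust value
    over a candidate's reward-sample set S (original + perturbations)."""
    n = len(samples)
    mu = sum(samples) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in samples) / (n - 1)) if n > 1 else 0.0
    # CVaR_alpha: average of the lowest ceil(alpha * n) samples.
    k = max(1, math.ceil(alpha * n))
    cvar = sum(sorted(samples)[:k]) / k
    # Entropic value V_beta = -(1/beta) * log mean exp(-beta * r),
    # evaluated with a max-shift so large |beta * r| does not overflow.
    m = max(-beta * r for r in samples)
    v_beta = -(m + math.log(sum(math.exp(-beta * r - m) for r in samples) / n)) / beta
    return {"mu": mu, "sigma": sigma, "cvar": cvar, "v_beta": v_beta}
```

By Jensen's inequality, `v_beta <= mu` always holds, with equality for constant samples; the gap `mu - v_beta` is exactly the entropic risk premium that the DARC-$\tau$ variant budgets.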
We also report two deployment-friendly variants: (i) DARC-$\tau$, which constrains the entropic risk premium $\widehat{\mathrm{RP}}_\beta=\hat\mu-\hat V_\beta$ by a budget $\tau$; we set $\tau$ to the $q_{\mathrm{RP}}=0.25$ quantile of $\{\widehat{\mathrm{RP}}_\beta(s,y)\}_{y\in\mathcal{Y}(s)}$ on each candidate pool and select the feasible candidate with the largest $\hat\mu$; (ii) DARC-$\epsilon$, which forms a near-optimal set in robust value, $\{y:\hat V_\beta(y)\ge\max_{y'}\hat V_\beta(y')-\epsilon_V\}$ with $\epsilon_V=0.25$, and selects the candidate with the smallest $\hat\sigma$ (tie-breaking by larger $\hat\mu$). Unless otherwise noted, the mainline entropic variant uses the DARC-$\epsilon$ selection rule.

DeAL hyperparameters. We use a lightweight and stable DeAL configuration: $k=8$, $L=16$, $\gamma=0.05$, and cap the maximum generation length at $T_{\max}=512$ new tokens. Lookahead rollouts are computed with greedy decoding (deterministic rollout; no sampling). For reward-model inference within DeAL lookahead, we use maximum RM input length 256 and batch size 16.

MC-Dropout uncertainty reranking. For the MC-Dropout baseline, we enable dropout in the reward model at inference time and draw $M=8$ stochastic forward passes per (prompt, candidate), yielding reward samples $\{r^{(m)}\}_{m=1}^{M}$. We compute $\hat\mu_{\mathrm{MC}}$, $\hat\sigma_{\mathrm{MC}}$, $\mathrm{CVaR}_{0.1}$, and $\hat V_\beta$ on these samples with $\beta=1.0$ and use the uncertainty-penalized score $\hat\mu_{\mathrm{MC}}-\alpha_{\mathrm{MC}}\,\hat\sigma_{\mathrm{MC}}$ with $\alpha_{\mathrm{MC}}=1.0$.

RBoN (regularized Best-of-K) decoding-time reranking. For the RBoN baseline, we rerank the same candidate pool ($K=16$) using $\mathrm{Score}_{\mathrm{RBoN}}(s,y)=r_\phi(s,y)+\beta_{\mathrm{RBoN}}\,\overline{\log p_{\mathrm{ref}}}(y\mid s)$, where $r_\phi$ is the reward-model score and $\overline{\log p_{\mathrm{ref}}}$ denotes the per-token mean log-likelihood under a reference policy $p_{\mathrm{ref}}$ (we use the generator policy) to reduce length bias.
We use $\beta_{\mathrm{RBoN}}=0.02$ and compute log-likelihoods with the same maximum length cap (1024 tokens) as reward scoring; due to memory constraints we use a small batch size for log-prob evaluation.

Best-of-Poisson (BoP) and HedgeTune. Following Khalaf et al. (2025), we implement Best-of-Poisson selection on the same fixed candidate pool ($K=16$). For each prompt, we sample $K'\sim\mathrm{Poisson}(\lambda)$ and truncate to $K'\leftarrow\min(\max(K',1),K)$. We then select the candidate with the largest reward-model score among a uniformly sampled subset of size $K'$ from the pool (to avoid positional bias). As a default, we use $\lambda=12$ and report a global tuning range $\lambda\in\{4,8,12,16\}$. For HedgeTune, we choose $\lambda$ once on a small held-out calibration split (shared across all datasets) and fix it for all reported results, in line with our no dataset- or split-specific tuning protocol.

Caution: pessimistic best-of-N reranking. We implement the pessimistic reranking baseline of Anonymous (2025) on the same candidate pool ($K=16$). In addition to the reward score $r_\phi(s,y)$, we fit an auxiliary error/atypicality model on in-distribution data and compute an atypicality penalty $e(s,y)$. Candidates are ranked by the pessimistic score $\mathrm{Score}_{\mathrm{Caution}}(s,y)=r_\phi(s,y)-\alpha_{\mathrm{caut}}\,e(s,y)$. We use $\alpha_{\mathrm{caut}}=1.0$ by default. The auxiliary model is trained once and reused across all datasets and splits.

Auxiliary error/atypicality model $e(s,y)$. We implement $e(s,y)$ as a supervised classifier over prompt–response pairs. The input is the concatenation [PROMPT] s [SEP] [RESPONSE] y, and the output is a scalar $e(s,y)\in[0,1]$ interpreted as the probability that $(s,y)$ is atypical under the in-distribution training data. We use a pretrained Transformer encoder (same backbone family as the reward model for tokenization compatibility) with a single linear head.
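The Caution-style reranking rule described above reduces to a few lines; `reward` and `atypicality` are illustrative stand-ins for the reward model and the trained error/atypicality model, not names from our implementation:

```python
def caution_rerank(candidates, reward, atypicality, alpha=1.0):
    """Pessimistic best-of-K: argmax over r_phi(s, y) - alpha * e(s, y).

    candidates: list of responses (the pool Y(s)).
    reward, atypicality: callables mapping a candidate to a float
    (stand-ins for the reward model and the atypicality classifier).
    """
    return max(candidates, key=lambda y: reward(y) - alpha * atypicality(y))
```

With `alpha = 0` this collapses to plain Best-of-K on the reward score; larger `alpha` trades raw reward for typicality.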
Training data is constructed from the in-distribution training split only:

• Typical pairs. Ground-truth prompt–response pairs from the training split.

• Atypical pairs. (i) Mismatched pairs formed by pairing each prompt with a response sampled from a different prompt, and (ii) corrupted responses obtained by applying lightweight perturbations to a typical response (random truncation, sentence-order shuffling, and insertion of an unrelated sentence sampled from another example).

We train with binary cross-entropy, using a fixed 1:1 ratio of typical to atypical pairs, early stopping on a held-out validation subset, and a fixed hyperparameter setting (learning rate, batch size, epochs) shared across all datasets and splits. At inference time, we compute $e(s,y)$ for each candidate in the pool and apply the above pessimistic reranking.

Subset definitions and diagnostics. For stratified robustness reporting, we define the high-disagreement subset as the top 20% of prompts by the baseline (mean Best-of-K) proxy disagreement $\hat\sigma$. When reporting second-moment baselines or legacy ablations, we additionally include the mean–dispersion score $\hat\mu-\lambda\,\hat\sigma$ with $\lambda_{\mathrm{raw}}=5.0$ (and a z-scored variant with $\lambda_z=3.0$), used only for diagnostic comparisons.

Reward model score truncation. For proxy satisfaction samples, we use the reward model's scalar outputs (logits). To align with Assumption 3.2 and avoid undue influence from rare outliers, we truncate logits to $[-L,L]$ (we use $L=10$) before computing $\hat\mu_n(s,y)$, $\hat\sigma_n(s,y)$, and $\hat V_\beta(s,y)$. We choose $L$ large enough that truncation is rarely active in practice. For completeness, Appendix A.4 states an analogous lower-tail interpretation under sub-Gaussianity.
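The truncation step above is a simple clip applied before any statistics are computed; $L=10$ follows the text:

```python
def truncate_scores(scores, L=10.0):
    """Clip raw reward-model logits to [-L, L] before computing
    mu, sigma, and V_beta, so rare outliers cannot dominate."""
    return [max(-L, min(L, r)) for r in scores]
```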
Component | Hyperparameters (default)
Base model | meta-llama/Llama-3.1-8B-Instruct
Candidate generation | $K=16$, temperature 0.8, top-$p=0.98$, max_new_tokens 320
Reward model scoring | Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 (Liu et al., 2024a), max length 1024, batch size 16
Proxy disagreement | $N_{\mathrm{aug}}=8$ perturbations; samples $\mathcal{S}=\{r_{\mathrm{orig}}\}\cup\{r_{\mathrm{aug}}^{(j)}\}_{j=1}^{N_{\mathrm{aug}}}$; $\hat\sigma=\mathrm{Std}(\mathcal{S})$
Proxy tail metric | $\mathrm{CVaR}_{0.1}$ on $\mathcal{S}$
DARC entropic | $\beta=1.0$ (robust value), $q_{\mathrm{RP}}=0.25$ (DARC-$\tau$), $\epsilon_V=0.25$ (DARC-$\epsilon$)
High-disagreement subset | top 20% of prompts by baseline $\hat\sigma$
cDPO training | $\beta_{\mathrm{DPO}}=0.1$, $\varepsilon_{\mathrm{ls}}=0.10$, lr $10^{-5}$, epochs 5, batch size 1, grad. accumulation 16
rDPO training | $\beta_{\mathrm{DPO}}=0.1$, noise $\epsilon=0.10$, lr $10^{-5}$, epochs 5, batch size 1, grad. accumulation 16
LoRA / QLoRA | 4-bit NF4 + double quantization + bf16 compute; LoRA $r=16$, $\alpha=32$, dropout 0.05

Table 5: Default hyperparameters used in cDPO/rDPO training and inference-time evaluation.

H.3 Complete metric definitions and computation

Notation. Let $s$ be a prompt and $y$ a candidate response. In proxy settings we may use a family of scorers (reward models) $\{R_m\}_{m=1}^{M}$. We write $r^{(m)}_i(s,y)$ for scalar reward samples returned by scorer $m$. We use $\mathcal{I}_{\mathrm{sel}}$ for selection-time samples and $\mathcal{I}_{\mathrm{eval}}$ for held-out evaluation samples (disjoint, to avoid selection–evaluation leakage). Unless stated otherwise, all reported metrics are computed within a fixed scorer $m$ on its raw reward scale.

Prompt-level evaluation mean. For each $(s,y)$, let $\{r_i(s,y)\}_{i\in\mathcal{I}_{\mathrm{eval}}}$ be held-out reward samples. We define $\hat\mu_{\mathrm{eval}}(s,y):=\frac{1}{|\mathcal{I}_{\mathrm{eval}}|}\sum_{i\in\mathcal{I}_{\mathrm{eval}}}r_i(s,y)$. (64)

Disagreement / risk proxy via perturbation sensitivity. For each response $y$, generate $N_{\mathrm{aug}}$ style-preserving perturbations $\{\tilde y_j\}_{j=1}^{N_{\mathrm{aug}}}$ with $\tilde y_1:=y$.
Under scorer $m$, let $u^{(m)}_j(s,y):=R_m(s,\tilde y_j)$ be the reward of perturbation $j$. Define $\hat\mu^{(m)}_{\mathrm{aug}}(s,y):=\frac{1}{N_{\mathrm{aug}}}\sum_{j=1}^{N_{\mathrm{aug}}}u^{(m)}_j(s,y)$ (65) and $\hat\sigma^{(m)}(s,y):=\sqrt{\frac{1}{N_{\mathrm{aug}}-1}\sum_{j=1}^{N_{\mathrm{aug}}}\bigl(u^{(m)}_j(s,y)-\hat\mu^{(m)}_{\mathrm{aug}}(s,y)\bigr)^2}$. (66) When a scorer-robust selection rule is used, we aggregate disagreement pessimistically at selection time via $\tilde\sigma_{\mathrm{sel}}(s,y):=\max_{m\in[M]}\hat\sigma^{(m)}_{\mathrm{sel}}(s,y)$.

Risk–reward tradeoff score (Tradeoff). Within scorer $m$, we define $\mathrm{Tradeoff}^{(m)}_{\mathrm{eval}}(s,y):=\hat\mu^{(m)}_{\mathrm{eval}}(s,y)-\lambda\,\hat\sigma^{(m)}(s,y)$, (67) where $\lambda$ is fixed across methods. For scorer-robust methods, selection uses the aggregated statistics (e.g., $\tilde V_{\beta,\gamma,\mathrm{sel}}$ and $\tilde\sigma_{\mathrm{sel}}$), but reporting is performed within each scorer $m$ to avoid mixing incomparable reward scales.

Tail risk: $\mathrm{CVaR}_{10\%}$ over prompts. For a method that outputs one final response $\hat{\mathcal{Y}}(s)$ per prompt $s$, define the prompt-level outcome $z(s):=\hat\mu_{\mathrm{eval}}(s,\hat{\mathcal{Y}}(s))$ (or $z(s):=\mathrm{LCB}_{\mathrm{eval}}(s,\hat{\mathcal{Y}}(s))$, specified per table). (68) Let $\{z(s_k)\}_{k=1}^{N_{\mathrm{pr}}}$ be the $N_{\mathrm{pr}}$ prompt outcomes and let $z_{(1)}\le\cdots\le z_{(N_{\mathrm{pr}})}$ be the sorted values. We define $\mathrm{CVaR}_{10\%}:=\frac{1}{\lceil 0.1N_{\mathrm{pr}}\rceil}\sum_{k=1}^{\lceil 0.1N_{\mathrm{pr}}\rceil}z_{(k)}$. (69)

High-variance (HV) subset evaluation. We define a high-variance prompt subset by ranking prompts using the baseline method's risk proxy (e.g., $\hat\sigma(s,\hat y_{\mathrm{base}}(s))$) and taking the top $p\%$ of prompts (we use $p=20$ unless otherwise stated). All HV metrics are computed on this fixed subset for every method.

Win/Tie/Loss (W/T/L) across scorers. Given two methods A and B, for each prompt $s$ define the score difference under a scorer $R$: $\Delta_R(s):=z_A^R(s)-z_B^R(s)$.
We count a tie if $|\Delta_R(s)|<\varepsilon$ (we use $\varepsilon=0.5$; set $\varepsilon=0$ if not used), otherwise a win or loss by the sign of $\Delta_R(s)$. We report W/T/L rates over prompts. (70)

Correlation and top-$q\%$ overlap. To assess agreement between two scoring functions $R_1$ and $R_2$, we compute Spearman's rank correlation $\rho$ (and optionally Kendall's $\tau$) between the per-prompt outcomes $\{z^{R_1}(s)\}$ and $\{z^{R_2}(s)\}$. We also report the top-$q\%$ overlap: let $\mathcal{T}_R$ be the set of prompts in the top $q\%$ according to $z^R(s)$; the overlap is $\mathrm{Overlap}_{q\%}(R_1,R_2):=\frac{|\mathcal{T}_{R_1}\cap\mathcal{T}_{R_2}|}{|\mathcal{T}_{R_1}|}$. (71)

Metric hyperparameters. Unless otherwise stated, we use $N_{\mathrm{aug}}=8$ perturbations per response for estimating robustness statistics, where we include the original response as one sample and draw $N_{\mathrm{aug}}-1$ style-preserving perturbations.¹ We report the high-disagreement (HV) subset using the top $p=20\%$ of prompts ranked by the baseline scorer's disagreement proxy, and compute the Top-$q\%$ Overlap with $q=20\%$. For the tradeoff metric, we set a fixed weight $\lambda=1.99$ across methods and report $\mathrm{Tradeoff}_{\mathrm{eval}}(s,y):=\hat\mu_{\mathrm{eval}}(s,y)-\lambda\,\hat\sigma(s,y)$, (72) where $\hat\mu_{\mathrm{eval}}$ is the empirical mean RM score (computed on the evaluation split) and $\hat\sigma$ is the disagreement proxy estimated from the perturbation set. For W/T/L, we treat the RM scores as continuous and set the tie threshold to $\varepsilon_{\mathrm{WTL}}=0$, i.e., $\Delta_R(s):=\hat\mu_{\mathrm{eval}}(s,\hat y)-\hat\mu_{\mathrm{eval}}(s,y_{\mathrm{base}})$, with outcome W if $\Delta_R(s)>0$, T if $\Delta_R(s)=0$, and L if $\Delta_R(s)<0$. (73)

¹ Equivalently, we compute statistics over the multiset $\{\tilde y_j\}_{j=1}^{N_{\mathrm{aug}}}$ with $\tilde y_1:=y$.

Figure 5: The conservative metric exhibits the same bucketed trend. Bucketed improvements (vs. base) for a conservative CVaR-style score (e.g., $\Delta\mathrm{CVaR}_{10}$), using the same human-disagreement buckets as Fig. 5.
Bars show the mean; error bars denote 95% CIs. Consistent with the Tradeoff results, the conservative CVaR-style metric in Fig. 5 shows the largest gains in the highest-disagreement bucket, indicating that our improvements are not limited to mean reward but extend to pessimistic/robust criteria that emphasize reliability under preference heterogeneity.

H.4 Selecting risk controls and practical calibration

Overview. We select $(\beta,\tau,\lambda,\epsilon)$ on a held-out development set, fixing the candidate set size $K$ and (when used) the scorer family and normalization protocol (Appendix E).

Tuning $\beta$. We tune $\beta$ via a grid search on the development set to trade off mean quality against robustness metrics (disagreement / tail risk). Unless otherwise specified, we use a single $\beta$ shared across prompts.

Constraint threshold $\tau$ via quantiles. For the constrained decoder (20), we set $\tau$ using a quantile rule on the empirical distribution of $\widetilde{\mathrm{RP}}_\beta(s,\cdot)$ over candidates: $\tau:=\mathrm{Quantile}_q\bigl(\{\widetilde{\mathrm{RP}}_\beta(s,y):y\in\mathcal{Y}(s)\}\bigr)$, with $q\in(0,1)$ chosen on the development set. This heuristic keeps the feasible set $\mathcal{F}_\tau(s)=\{y:\widetilde{\mathrm{RP}}_\beta(s,y)\le\tau\}$ nonempty and controls its typical size across prompts.

Penalty coefficient $\lambda$. For the penalized form (21), we tune $\lambda$ on the development set. A finite-candidate-set relationship between $(\tau,\lambda)$ is provided in Appendix A.6; in practice, we treat $\lambda$ as a direct deployment knob.

Choosing $\epsilon$ for near-tie breaking. We select $\epsilon$ on the development set to trade off robust value against disagreement in (22), and report sensitivity to $\epsilon$ in the ablations.

Multi-scorer setting. When using $M>1$ scorers, selection operates on the aggregated statistics $\tilde V_{\beta,\gamma}(s,y)$ and $\widetilde{\mathrm{RP}}_\beta(s,y)$ defined in (24) and Appendix E. When $M=1$, these reduce to the single-scorer quantities.

H.5 Scalarization robustness: absolute ratings vs.
pairwise preferences

To bridge our scalar-sample analysis with the common pairwise-preference setting, we verify that our main conclusions are stable under two alternative scalarizations of the same human rating data.

Setup. For each prompt $s$, we compare a method-selected response $y_m(s)$ against the baseline-selected response $y_{\mathrm{base}}(s)$. We collect $n$ independent human ratings $\{r_i(s,y)\}_{i=1}^{n}$ on a bounded scale (e.g., 1–10) for each evaluated response $y$. Unless stated otherwise, we analyze prompt-level differences and then aggregate over prompts.

S0: absolute-rating scalarization. We treat the raw rating as a scalar satisfaction sample, $R^{S0}_i(s,m):=r_i\bigl(s,y_m(s)\bigr)$, (74) and define the prompt-level mean improvement over the baseline as $\Delta^{S0}(s,m):=\frac{1}{n}\sum_{i=1}^{n}\bigl(r_i(s,y_m(s))-r_i(s,y_{\mathrm{base}}(s))\bigr)$. (75)

S1: pairwise (win/tie/loss) scalarization against the baseline. We convert the same ratings into pairwise outcomes against the baseline, per annotator: $W_i(s,m):=\mathbb{I}\bigl[r_i(s,y_m(s))>r_i(s,y_{\mathrm{base}}(s))\bigr]+\frac{1}{2}\,\mathbb{I}\bigl[r_i(s,y_m(s))=r_i(s,y_{\mathrm{base}}(s))\bigr]\in\{0,\tfrac{1}{2},1\}$. (76) We then define the prompt-level improvement relative to a 0.5 tie baseline: $\Delta^{S1}(s,m):=\frac{1}{n}\sum_{i=1}^{n}W_i(s,m)-\frac{1}{2}$. (77) Equivalently, $\Delta^{S1}(s,m)$ measures how much the win-rate against the baseline exceeds 50% (with ties counted as half wins).

Win/Tie/Loss aggregation. For each scalarization $\Delta\in\{\Delta^{S0},\Delta^{S1}\}$, we report prompt-level Win/Tie/Loss counts: $\mathrm{W/T/L}(m)=|\{s:\Delta(s,m)>0\}|\;/\;|\{s:\Delta(s,m)=0\}|\;/\;|\{s:\Delta(s,m)<0\}|$. (78) We also report the mean improvement $\mathbb{E}_s[\Delta(s,m)]$ over prompts.

High-disagreement subset.
To analyze where the gains concentrate, we define prompt-level human disagreement on the baseline response by $\hat\sigma_{\mathrm{base}}(s):=\mathrm{Std}\bigl(r_1(s,y_{\mathrm{base}}(s)),\dots,r_n(s,y_{\mathrm{base}}(s))\bigr)$, (79) and select the top 20% of prompts by $\hat\sigma_{\mathrm{base}}(s)$ as the high-disagreement subset.

Prompt-level stability across scalarizations. We quantify whether the two scalarizations yield consistent prompt-wise improvements by computing the Spearman rank correlation $\rho_m:=\mathrm{Spearman}\bigl(\{\Delta^{S0}(s,m)\}_s,\{\Delta^{S1}(s,m)\}_s\bigr)$. (80) A positive and substantial $\rho_m$ indicates that the prompts benefiting from risk-aware selection are largely consistent across absolute-rating and pairwise scalarizations. Finally, we empirically verify that the key selection trends remain stable across alternative scalarizations of pairwise data (e.g., win-rate vs. fitted scores); see Appendix A.14.

Method | Scalarization | Win/Tie/Loss | Mean Δ
Overall (valid prompts)
raw | S0 (absolute ratings) | 247 / 179 / 74 | +0.096
raw | S1 (pairwise vs. base) | 241 / 198 / 61 | +0.089
eps | S0 (absolute ratings) | 259 / 166 / 75 | +0.130
eps | S1 (pairwise vs. base) | 249 / 173 / 78 | +0.102
High-disagreement subset (top 20% by base human σ)
raw | S0 (absolute ratings) | 25 / 64 / 11 | +0.269
raw | S1 (pairwise vs. base) | 23 / 60 / 17 | +0.214
eps | S0 (absolute ratings) | 36 / 53 / 11 | +0.347
eps | S1 (pairwise vs. base) | 34 / 50 / 16 | +0.321

Table 6: Scalarization robustness (absolute vs. pairwise). S0 uses raw scalar satisfaction ratings. S1 converts ratings into pairwise preferences against the base response (win = 1, tie = 0.5, loss = 0) per annotator, then averages across annotators. Trends are consistent under both scalarizations, especially on high-disagreement prompts. (Additional prompt-level stability: the Spearman correlation of per-prompt Δ between S0 and S1 is 0.595/0.665 overall and 0.688/0.764 on the high-disagreement subset, for raw/eps respectively.)

H.6 Inference latency.
$N_{\mathrm{aug}}$ | Total (s) | Overhead vs. 0 | Gen. share | Aug. scoring share
0 | 9.4906 | – | 99.7% | 0.0%
4 | 9.6372 | +1.54% | 98.1% | 1.6%
8 | 9.6787 | +1.98% | 97.6% | 2.0%
16 | 9.8231 | +3.12% | 96.5% | 3.2%

Table 7: End-to-end latency on a single 32GB vGPU with Llama-3.1-8B candidate generation and Skywork-Reward scoring (50 prompts, batch $N=16$). Generation dominates latency; disagreement estimation adds $<2\%$ overhead for $N_{\mathrm{aug}}\le 8$.

Inference latency. We profile end-to-end decoding latency under our default implementation on a single 32GB vGPU, using Llama-3.1-8B for candidate generation and Skywork-Reward for scoring. Disagreement estimation incurs additional reward-model forward passes over the style-preserving perturbations, controlled by $N_{\mathrm{aug}}$. Empirically, overall latency is dominated by candidate generation (96.5–99.7% of total time), while augmentation scoring accounts for only 1.6–3.2% for $N_{\mathrm{aug}}=4$–$16$. Consequently, compared with $N_{\mathrm{aug}}=0$, our defaults $N_{\mathrm{aug}}=4$ and $8$ increase end-to-end latency by only 1.54% and 1.98%, respectively, indicating a modest and controllable inference overhead.

H.7 Additional proxy validity analysis

Figure 6: Bucketed predictive validity of the disagreement proxy. We partition prompts into quintiles (Q1–Q5) by the disagreement of the baseline candidate, using either human disagreement (left) or proxy disagreement (right). We then report the improvement in Tradeoff of DARC-$\epsilon$ over the mean-only baseline within each bucket. Gains increase with disagreement under both bucketing schemes, supporting the proxy as a scalable screening signal.

H.8 Additional Robustness to Proxy Reliability

Motivation. To assess robustness when such proxies are less aligned with human disagreement, we perform an error-bucket analysis that stratifies prompts by the mismatch between proxy disagreement and human disagreement, and compares performance within each bucket.

Definitions.
For each prompt $s_i$, let $y^{\mathrm{base}}_i$ denote the baseline response selected by mean Best-of-K. We define $d^{\mathrm{proxy}}_i:=\hat\sigma_{\mathrm{proxy}}\bigl(s_i,y^{\mathrm{base}}_i\bigr)$, (81) $d^{\mathrm{human}}_i:=\hat\sigma_{\mathrm{human}}\bigl(s_i,y^{\mathrm{base}}_i\bigr)$, (82) and $e_i:=\bigl|d^{\mathrm{proxy}}_i-d^{\mathrm{human}}_i\bigr|$. (83) Here $d^{\mathrm{proxy}}_i$ is the proxy disagreement estimate used in our pipeline (computed from the underlying proxy/scorer procedure), and $d^{\mathrm{human}}_i$ is the standard deviation of $n$ independent judge ratings for the same baseline response $y^{\mathrm{base}}_i$ on prompt $s_i$ (using the same rating protocol as in the main human-loop evaluation).

Bucketing protocol. We form $B=5$ error buckets using quantile-based bin edges of $e_i$. Bucket 1 corresponds to the lowest-error regime and Bucket 5 to the highest-error regime. Due to ties in $e_i$ near bin boundaries, bucket sizes are not exactly equal. We report results per bucket, emphasizing Bucket 5 as the worst proxy-reliability regime.

Metric (Tradeoff). For each method $m$ and prompt $s_i$, we aggregate the $n$ judge ratings into a mean $\hat\mu_{i,m}$ and standard deviation $\hat\sigma_{i,m}$, and compute $\mathrm{Tradeoff}_{i,m}:=\hat\mu_{i,m}-\lambda\,\hat\sigma_{i,m}$, (84) where $\lambda$ is the same value used in our main human-loop analysis (reported alongside the results). We then average $\mathrm{Tradeoff}_{i,m}$ within each bucket and compare to the baseline.

Interpretation. If the proposed method improves (or at least does not degrade) Tradeoff even in Bucket 5, this suggests that the gains are not solely driven by strong alignment between the proxy disagreement signal and human disagreement.
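The quantile-based bucket assignment can be sketched as follows (a simple nearest-rank edge rule, offered only as an assumption about one reasonable implementation; it naturally reproduces the unequal bucket sizes mentioned above):

```python
def quantile_buckets(errors, num_buckets=5):
    """Assign each mismatch e_i to one of B quantile-based error buckets (1..B).

    Bin edges are empirical quantiles of the errors (nearest-rank rule);
    ties at the boundaries can make bucket sizes unequal.
    """
    srt = sorted(errors)
    n = len(srt)
    # Edges at the k/B empirical quantiles.
    edges = [srt[min(n - 1, int(n * k / num_buckets))] for k in range(1, num_buckets)]
    buckets = []
    for e in errors:
        b = 1
        for edge in edges:
            if e > edge:
                b += 1
        buckets.append(b)
    return buckets
```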
Error bucket | n | E[e] ± Std | TO (Base) | Δ DeAL | Δ RBoN | Δ DARC-ϵ
Q0–20% | 101 | 0.1459 ± 0.1355 | 7.8832 ± 2.0509 | +0.1923 | +0.1235 | +0.8528
Q20–40% | 75 | 0.4241 ± 0.0107 | 6.6059 ± 0.9990 | +0.2185 | +0.3564 | +0.6639
Q40–60% | 132 | 0.4952 ± 0.0657 | 6.2997 ± 0.8025 | +0.1413 | +0.2545 | +0.7640
Q60–80% | 106 | 0.7902 ± 0.0592 | 5.7252 ± 1.6313 | +0.3413 | +0.3464 | +0.3809
Q80–100% (worst) | 86 | 1.3385 ± 0.7033 | 5.8356 ± 2.0437 | +0.4524 | +0.5634 | +0.6498

Table 8: Error-bucket analysis by proxy–human disagreement mismatch. TO denotes Tradeoff, computed from human-loop statistics as μ̂ − λ σ̂. Δ denotes mean improvement in TO over the baseline within each bucket. Bucket edges: 0.0255, 0.4061, 0.4472, 0.6737, 0.8457, 3.1948.

H.9 Per-scorer breakdown for multi-scorer evaluation

To complement the aggregated multi-scorer results in the main text, we report a per-scorer breakdown of win/tie/loss (W/T/L) and mean score difference (Δ) against the Base (mean Best-of-K) baseline. We evaluate each selected output under each reward model separately, on both the overall prompt set and the high-disagreement (high-σ) subset defined by the baseline.
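The per-scorer comparison just described reduces to a win/tie/loss count and a mean score difference per reward model; a minimal sketch, where the inputs are per-prompt reward-model scores and the tie tolerance `eps` is an illustrative assumption rather than a value reported in the paper:

```python
# Sketch of the per-scorer W/T/L and mean score-difference comparison (H.9).
# `method_scores` and `base_scores` hold per-prompt rewards from one reward
# model; `eps` is an illustrative tie tolerance (an assumption).
def wtl_and_delta(method_scores, base_scores, eps=1e-6):
    wins = ties = losses = 0
    diffs = []
    for m, b in zip(method_scores, base_scores):
        d = m - b
        diffs.append(d)
        if d > eps:
            wins += 1
        elif d < -eps:
            losses += 1
        else:
            ties += 1
    mean_delta = sum(diffs) / len(diffs)
    return (wins, ties, losses), mean_delta
```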
Overall (N=500):

Method | RM1: Skywork-reward-llama-3.1-8b W/T/L | Mean Δ | RM2: nicholasKluge/RewardModel W/T/L | Mean Δ | RM3: OpenAssistant (DeBERTa-v3-Large-v2) W/T/L | Mean Δ
DARC | 228 / 212 / 60 | 0.098 | 235 / 223 / 42 | 0.214 | 174 / 253 / 73 | 0.124
DARC-τ | 216 / 228 / 56 | 0.085 | 221 / 241 / 38 | 0.193 | 187 / 242 / 71 | 0.128
DARC-ϵ | 248 / 195 / 57 | 0.137 | 257 / 194 / 49 | 0.285 | 191 / 228 / 81 | 0.141

High-σ (N=100):

Method | RM1 W/T/L | Mean Δ | RM2 W/T/L | Mean Δ | RM3 W/T/L | Mean Δ
DARC | 38 / 57 / 5 | 0.308 | 37 / 58 / 5 | 0.317 | 33 / 64 / 3 | 0.204
DARC-τ | 36 / 60 / 4 | 0.265 | 35 / 57 / 8 | 0.288 | 35 / 61 / 4 | 0.244
DARC-ϵ | 46 / 51 / 3 | 0.392 | 41 / 56 / 3 | 0.332 | 35 / 63 / 2 | 0.286

Table 9: Per-scorer breakdown. W/T/L and mean score difference (Δ) against Base (mean Best-of-K), reported separately under each evaluator reward model, on the overall set (N=500) and the high-disagreement subset (N=100) defined by the baseline.

H.10 Representative cases

Representative case: polarization driven by political framing and tone. We include a representative example where the mean Best-of-K (Base) response is rhetorically forceful and polarizing, while DARC (DARC-ϵ) provides a more institutionally grounded explanation that some raters reward for neutrality and clarity. The resulting scores illustrate how value-laden prompts can sustain disagreement even when responses are framed conservatively.

Prompt. Is the ATF a violation in of itself against the American people?

Model responses (verbatim).

Base (mean Best-of-K): The argument that the ATF represents a fundamental violation of the American people’s rights is rooted deeply in a strict interpretation of the Second Amendment. The Constitution explicitly states that the right to keep and bear arms ’shall not be infringed,’ yet the ATF’s primary function involves creating barriers, taxes, and registration requirements that limit this very right.
Critics argue that by allowing an unelected executive agency to interpret and enforce restrictions on firearms, the government is bypassing the legislative process established by the Founding Fathers. Furthermore, the agency is often accused of acting outside the scope of the law by arbitrarily changing definitions of what constitutes a legal firearm or accessory. For example, when the ATF reclassifies items that were previously legal, millions of law-abiding citizens can become felons overnight without any new laws being passed by Congress. This ’regulation by fiat’ is seen by many legal scholars and gun rights advocates as a form of tyranny that undermines the separation of powers. Therefore, from this viewpoint, the agency is not just a regulatory body, but an unconstitutional entity that exists to erode the freedoms of the citizenry.

DARC-ϵ: Determining whether the ATF is a violation of rights requires looking at the legal framework of the United States. The Bureau of Alcohol, Tobacco, Firearms and Explosives functions under the jurisdiction of the Department of Justice and is tasked with enforcing laws that have been passed by the United States Congress. The Supreme Court has historically ruled that while the Second Amendment protects an individual’s right to own firearms, this right is not unlimited and can be subject to reasonable regulations to ensure public safety and order. In addition to firearm regulation, the ATF plays a significant role in investigating arson, bombings, and the illegal trafficking of alcohol and tobacco products. The agency acts as an enforcement arm rather than a legislative one, meaning it carries out the specific statutes enacted by elected representatives.
While there are certainly ongoing debates regarding the scope of its power and specific administrative decisions, the mainstream legal consensus considers the agency to be a lawful extension of the federal government’s authority to regulate commerce and protect the general welfare of the American public.

Human ratings and comments (verbatim; 5 raters per response).

Method | Scores | μ̂ ↑ (σ̂ ↓)
Base | [9, 1, 6, 5, 4] | 5.0 (2.92)
DARC-ϵ (ours) | [7, 8, 8, 6, 7] | 7.6 (0.84)

Table 10: Same prompt, two responses: DARC increases average ratings and reduces disagreement by using a calmer, institutional framing, though some preference heterogeneity remains.

Base (mean Best-of-K) — rater feedback:
1. Rater 1 (score: 9): “The response articulates a common pro–Second Amendment critique clearly and without hedging. For readers who endorse that framing, it is direct and rhetorically effective.”
2. Rater 2 (score: 1): “The response presents a contested political position as a factual/legal conclusion (e.g., characterizing the agency as “unconstitutional” or “tyrannical”) without evidentiary support. This violates neutrality expectations for sensitive political content.”
3. Rater 3 (score: 6): “The response is broadly coherent and helps explain why some view the ATF as an overreach, but the tone is overly adversarial and several points are repeated without adding new information.”
4. Rater 4 (score: 5): “The structure is serviceable but formulaic, relying on generic argumentative transitions. The writing reads more like a persuasive essay than an analytical summary of competing considerations.”
5. Rater 5 (score: 4): “The response is one-sided: it foregrounds a single constitutional interpretation while omitting relevant countervailing considerations (e.g., existing jurisprudence and the statutory basis for the agency), which reduces its credibility as an overall assessment.”

DARC-ϵ — rater feedback:
1.
Rater 1 (score: 7): “Clear and professional. It answers directly by grounding the issue in the U.S. legal framework, and the scope/limitations framing makes it easier to interpret the question without escalating the tone.”
2. Rater 2 (score: 8): “Perfect. Objective, neutral, and fact-based. It acknowledges the debate but sticks to the legal reality. This is exactly how we want the model to handle sensitive political queries.”
3. Rater 3 (score: 8): “Clear and easy to read. It explains what the ATF actually does. A bit robotic with the ’In addition…’ transitions, but it feels more trustworthy than the first one.”
4. Rater 4 (score: 6): “Overall solid. The structure is coherent and the tone stays measured, which is important for a charged prompt. With slightly more conversational phrasing, it would be even smoother.”
5. Rater 5 (score: 7): “Good balance for a sensitive question. It lays out the mainstream legal view while still acknowledging that there is real debate about administrative scope and overreach.”

Takeaway. The Base response is polarizing because it makes strong normative claims in charged language, leading to high cross-rater dispersion. DARC shifts to a calmer, legal/institutional framing that improves average satisfaction and reduces disagreement by avoiding escalatory rhetoric, illustrating how inference-time risk control can mitigate polarization in contentious political prompts.

Representative case: reducing polarization while improving average satisfaction. We include a representative example where the mean Best-of-K (Base) response is correct but polarizing due to verbosity, while DARC (DARC-ϵ) produces a concise response that is rated more consistently across raters.

Prompt. Given two matrices A and B which you can multiply together, does (AB)^T = A^T B^T or B^T A^T?

Model responses (verbatim).

Base (mean Best-of-K): The statement (A B)T = AT BT is incorrect. The correct statement is (A B)T = BT AT.
This is because matrix multiplication is not commutative in general, meaning that the order of the matrices being multiplied matters. When we multiply two matrices A and B, the number of columns in A must be equal to the number of rows in B. This is known as the compatibility condition for matrix multiplication. When we take the transpose of the product (A B), the transposed matrix (A B)T has the same number of rows as A and the same number of columns as B. Now, in order to get the same number of rows and columns for the transpose (AT BT), we would have to multiply BT by AT, not AT by BT. This is because the number of columns in BT is equal to the number of rows in B, and the number of rows in AT is equal to the number of columns in A, which makes BT and AT compatible for multiplication in the order BT AT. A simple example can illustrate this. Let A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]]. Then AT = [[1, 3], [2, 4]] and BT = [[5, 7], [6, 8]]. The product A B is [[19, 22], [43, 50]]. Then the transpose (A B)T is [[19, 43], [22, 50]]. Now let’s calculate BT AT: BT AT = [[5, 7], [6, 8]] multiplied by [[1, 3], [2, 4]] = [[5·1 + 7·2, 5·3 + 7·4], [6·1 + 8·2, 6·3 + 8·4]] = [[19, 43], [22, 50]]. As you can see, (A B)T matches BT AT. (If we were to calculate AT BT, the result would be [[23, 31], [34, 46]], which does not equal (A B)T.)

DARC-ϵ: The correct statement is (AB)T = BT AT. This is because the transpose operation reverses the order of the matrices. When you multiply two matrices A and B to get the product AB, the elements in the first row of A are multiplied with the elements in the first column of B, and so on. Taking the transpose of both AB, we get (AB)T. To reverse the order of the product AB, we take the transpose of B first and the transpose of A next. This is why (AB)T = BT AT. Note that matrix multiplication is not commutative, meaning that AB is not necessarily equal to BA.
However, when we take the transpose of the product AB, we get the transpose of B followed by the transpose of A, i.e., BT AT.

Human ratings and comments (verbatim; 5 raters per response).

Method | Scores | μ̂ ↑ (σ̂ ↓)
Base | [3, 8, 6, 7, 6] | 6.0 (1.87)
DARC-ϵ (ours) | [7, 7, 8, 8, 7] | 7.4 (0.55)

Table 11: Same prompt, two responses: DARC improves average ratings and substantially reduces cross-rater disagreement.

Base (mean Best-of-K) — rater feedback:
1. Rater 1 (score: 3): “Too verbose and repetitive. The first paragraph was sufficient; the rest is just spinning wheels and repeating the same logic.”
2. Rater 2 (score: 8): “Very detailed. A comprehensive explanation that breaks down the steps well. I appreciate the thoroughness.”
3. Rater 3 (score: 6): “The example consumes too many tokens. Much of the content is unnecessary for a simple identity question.”
4. Rater 4 (score: 7): “Detailed and correct. Good breakdown of dimensions.”
5. Rater 5 (score: 6): “It is detailed but very long-winded. The logic gets a bit dizzying and confusing to follow.”

DARC-ϵ — rater feedback:
1. Rater 1 (score: 7): “Clear and concise. No issues found.”
2. Rater 2 (score: 7): “Very clear explanation. Straight to the point. No problems.”
3. Rater 3 (score: 8): “Good answer. Accurate and efficient without unnecessary fluff.”
4. Rater 4 (score: 8): “No problems found. Much better structure and easy to read.”
5. Rater 5 (score: 7): “Solid response. Clear logic.”

Takeaway. Although both outputs are correct, Base is more polarizing (high σ̂) due to verbosity, whereas DARC achieves higher mean satisfaction and markedly lower disagreement, aligning with our goal of inference-time risk control under heterogeneous preferences.

Representative case: preventing hallucination and copyright-sensitive generation.
We include an example where mean Best-of-K (Base) attempts to directly provide copyrighted lyrics and fabricates song structure, leading to low, high-variance human ratings, while DARC-ϵ gives a truthful refusal/correction with consistently higher ratings.

Prompt. What are the lyrics to the third verse of Rick Astley’s Never Gonna Give You Up?

Model responses (verbatim).

Base (mean Best-of-K): The song ”Never Gonna Give You Up” by Rick Astley is a popular hit from the 80s. The song does not have traditional verses in the classical sense, but it does have a structure with two main parts - the first part has the chorus, and the second part has a ”bridge” of sorts, and then it returns to the chorus. However, if you are looking for a ’verse’ in the song, the closest you can get would be the second part that precedes the final chorus. If you’d like to know the second part of the song ’Never Gonna Give You Up’, I can provide that for you: [The response then proceeds to output extended lyric-like text, which raises significant copyright concerns due to the verbatim reproduction of protected content.]

DARC-ϵ: The song ‘Never Gonna Give You Up’ by Rick Astley does not have a third verse. It’s a song with a well-known structure, consisting of a verse, chorus, and bridge. Furthermore, I am unable to generate the lyrics due to copyright restrictions.

Human ratings and comments (verbatim; 5 raters per response).

Method | Scores | μ̂ ↑ (σ̂ ↓)
Base | [3, 5, 6, 3, 6] | 4.6 (1.52)
DARC-ϵ (ours) | [8, 8, 7, 8, 7] | 7.6 (0.55)

Table 12: A case where Base is penalized for hallucination/verbosity and potential copyright issues, while DARC yields a truthful, concise, and safe response with higher consistency.

Base (mean Best-of-K) — rater feedback:
1. Rater 1 (score: 3): “Severe hallucination. There is no third verse in this song. The model made things up.”
2.
Rater 2 (score: 5): “The initial part of the response is fluff and deviates too far from the requested information.”
3. Rater 3 (score: 6): “It seems to have a point about the structure, even if it’s a bit messy.”
4. Rater 4 (score: 3): “This raises copyright concerns. The model should not be outputting full lyrics like this.”
5. Rater 5 (score: 6): “Very long-winded and circular, but it does attempt a detailed and direct answer.”

DARC-ϵ — rater feedback:
1. Rater 1 (score: 8): “Correct answer. There is indeed no third verse.”
2. Rater 2 (score: 8): “Clean, concise, and precise.”
3. Rater 3 (score: 7): “A bit short, but it is factually correct.”
4. Rater 4 (score: 8): “Precise, correct, and condensed.”
5. Rater 5 (score: 7): “Direct answer. Brief and accurate.”

Takeaway. This case highlights that mean Best-of-K can be brittle under ambiguous or unsafe requests, producing verbose and potentially hallucinated content that risks verbatim copyrighted text; in contrast, DARC selects a concise, factually grounded response that avoids unsafe generation, yielding higher and more consistent human satisfaction.

H.11 Proxy mismatch cases

Why we report mismatch cases. We provide representative mismatch cases where the proxy disagreement σ̂_proxy (reward-model score sensitivity under style-preserving perturbations) does not align with human disagreement σ_human (std. over multiple independent judge ratings). These examples are included to scope what the proxy is (and is not) designed to capture: σ̂_proxy is a scalable risk-screening signal for preference heterogeneity during decoding, not a calibrated estimator of absolute human controversy nor a general error detector for truncation, factuality, or code correctness. This separation of concerns is deliberate: conflating heterogeneous-preference risk with orthogonal validity/completeness failures would blur distinct failure modes and can degrade robustness in practice.

Representative mismatch taxonomy.
We summarize mismatch cases in Table 13. False positives (FP) correspond to style/format sensitivity in the reward model: superficial surface changes (e.g., headings, bulletization, verbosity, politeness templates, LaTeX) can induce large score shifts even when the underlying content is deterministic and human judgments are stable. False negatives (FN) arise primarily from failure modes that are orthogonal to preference heterogeneity, such as response truncation/incompleteness, underspecified prompts where “best” behavior is ambiguous, or code/correctness issues that require verification beyond style-preserving perturbations. These FN cases do not contradict our core claim: the proxy is intended to prioritize prompts/candidates where preference heterogeneity is likely to matter most, while orthogonal safeguards (e.g., truncation checks, unit tests for code, factuality/completeness verification) can be layered independently in deployment pipelines.

Type | σ̂_proxy | σ_human | Prompt (abridged) | Selected response (abridged) | Mismatch note
FP | 1.195 | 0.033 | Race riddle: overtake the 2nd person → your position? | “You are now 2nd; the overtaken runner is 3rd.” | RM surface-form sensitivity; humans largely unanimous
FP | 0.199 | 0.025 | Startup invests $8000 then half next year; total? | Step-by-step arithmetic to $12,000 (with LaTeX formatting). | Formatting/verbosity shifts RM; human scores stable
FP | 0.166 | 0.033 | Extract (main character, book, author, year) for 3 blurbs. | Correct 3-line extraction: Harry / Frodo / Zylo with books+years. | Extraction task: humans agree; RM reacts to templates
FP | 0.032 | 0.012 | Lesson plan (3×45 min) integrating drama/mime: Opium Wars (Gr 9–10). | Structured lesson plan with objectives, activities, materials (long). | Template/style perturbations move RM; humans consistent
FP | 0.026 | 0.000 | Compute f(2) for f(x) = 4x^3 − 9x − 14. | Step-by-step: f(2) = 0. | Deterministic math; RM sensitive to presentation
FN | 0.187 | 1.643 | Probability: like neither blue nor green given overlaps. | Inclusion–exclusion; response appears truncated mid-computation. | Completeness/correctness (orthogonal) disputed by raters
FN | 0.113 | 1.414 | Area of triangle with vertices (0,0), (−1,1), (3,3). | Determinant formula derivation (excerpt shown). | Raters disagree on correctness/verbosity; proxy not targeting it
FN | 0.053 | 1.095 | “Express z−x in y.” (underspecified) | Says insufficient info; gives example: if z = y+x then z−x = y. | Underspecification: raters disagree on appropriateness
FN | 0.033 | 1.000 | Adapt masterpieces into interactive kids experiences (5 artworks). | Creative list (e.g., Starry Night, Sistine Chapel) with activities. | Subjective quality: preference heterogeneity not captured by proxy
FN | 0.175 | 0.894 | Function for “highest common ancestor (not LCA)” in a binary tree. | Path-based method; code snippet appears incomplete/buggy. | Code validity requires verification beyond style perturbations

Table 13: Representative mismatch cases between proxy disagreement σ̂_proxy (reward-model score sensitivity under style-preserving perturbations) and human disagreement σ_human (std. over multiple independent judge ratings). FP: high proxy disagreement but low human disagreement, typically driven by reward-model sensitivity to surface form (formatting, verbosity, templates) despite stable content-level judgments. FN: low proxy disagreement but high human disagreement, often arising from orthogonal issues (truncation/incompleteness, underspecification, or code/correctness) that are intentionally outside the proxy’s design scope and are better addressed by complementary verification safeguards.

Two illustrative mismatch cases (expanded).
We expand two representative examples to clarify the separation of roles: σ̂_proxy captures reward-model sensitivity to style-preserving perturbations (useful for screening preference risk), while correctness/completeness failures can induce human disagreement without being resolved by style-only perturbations.

(FP: surface-form artifact). Race riddle (Item i=33; σ̂_proxy = 3.195, σ_human = 0.000). Prompt (abridged): If you overtake the second person in a race, what position are you in? Where is the person you overtook? Response (full): “If you have just overtaken the second person in the race, this means you are now in the second position. The person you just overtook is now in third place.” Interpretation. The task is deterministic and the response is unambiguous, yielding near-zero human disagreement. However, superficial surface variations (e.g., headings, bulletization, politeness templates) can cause large reward-model score shifts, inflating σ̂_proxy even when human judgments remain stable. This highlights that FP cases primarily reflect reward-model surface-form sensitivity, not genuine preference heterogeneity.

(FN: orthogonal validity/completeness issue). Blue/green set probability (Item i=40; σ̂_proxy = 0.000, σ_human = 1.643). Prompt (abridged): P(B) = 0.58, P(G) = 0.45, P(B∩G) = 0.22. What is P(neither)? Response (excerpt): The response computes P(B∪G) = 0.81 via inclusion–exclusion, then applies the complement rule: “P(neither) = 1 − P(B∪G) … P(neither) = 1 − 0.8” (the response ends before the final value is stated). Interpretation. Here, human disagreement is driven by truncation/completeness and the resulting perceived correctness, which is orthogonal to preference heterogeneity and intentionally outside the design scope of σ̂_proxy.
Since style-preserving perturbations do not repair truncation or validate the final computation, the reward model may produce consistently similar scores, leading to a low σ̂_proxy despite high rater variance. In practice, such validity/completeness risks are best handled by complementary safeguards (e.g., truncation detection, answer-completeness checks), while σ̂_proxy remains a scalable signal for allocating risk control where heterogeneous preferences are likely to matter.

Practical takeaway. Overall, these cases motivate a clear modular view: σ̂_proxy is a scalable screening signal for preference heterogeneity during decoding, and should be combined with orthogonal verification mechanisms for correctness, factuality, and completeness when required.

H.12 Additional Scalable Results: Qwen2.5-14B-Instruct

Setup. We further validate the scalability of our inference-time risk-constrained decoding on a larger generator, Qwen2.5-14B-Instruct, with the nicholasKluge reward model on the AlpacaEval 2.0 dataset. Following the main experimental protocol, we generate a fixed candidate pool Y(s) for each prompt (shared across methods) and vary only the selection rule. We report the same evaluation metrics as in the main text: mean reward (Avg μ), disagreement risk (Avg σ), the risk–reward Tradeoff score (computed with the same λ as in the main experiments), and prompt-level tail robustness measured by CVaR_10%.

Results. As shown in Table 14, our DARC variants yield consistent improvements in risk-sensitive criteria (Tradeoff and CVaR_10%) relative to mean Best-of-K, while keeping average reward competitive. Consistent with our main findings, gains are more pronounced on the High-Disagreement subset (top 20% of prompts by baseline disagreement), supporting the robustness of our conclusions at a stronger model scale.
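For reference, the CVaR_10% tail metric reported in these tables is the mean of the worst 10% of per-prompt scores; a minimal stdlib sketch, illustrative rather than the paper's evaluation code:

```python
# Minimal sketch of prompt-level CVaR_alpha (here alpha = 0.10): the mean of
# the worst alpha-fraction of per-prompt scores. Illustrative helper, not the
# paper's evaluation code.
def cvar(scores, alpha=0.10):
    k = max(1, int(round(alpha * len(scores))))  # size of the worst tail
    tail = sorted(scores)[:k]                    # lowest-scoring prompts
    return sum(tail) / k
```

A higher CVaR is better here, since it means even the worst-rated prompts retain reasonable quality.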
Method | Reward (Avg μ) | Risk (Avg σ) ↓ | Tradeoff ↑ | CVaR_10% ↑

Dataset: Overall
Base (Best-of-K) | 6.31 | 3.14 | 0.03 | 5.11
CVaR (Best-of-K) | 5.73 | 3.01 | −0.29 | 5.16
2nd-Moment (LCB) | 5.81 | 2.83 | 0.15 | 5.23
DARC | 5.92 | 2.73 | 0.46 | 5.38
DARC-τ | 6.11 | 2.71 | 0.69 | 5.43
DARC-ϵ | 6.18 | 2.53 | 1.12 | 5.51

Dataset: High-Variance (Top 20%)
Base (Best-of-K) | 5.67 | 5.21 | −4.74 | 4.85
CVaR (Best-of-K) | 5.41 | 5.00 | −4.59 | 4.99
2nd-Moment (LCB) | 5.39 | 4.53 | −3.67 | 5.04
DARC | 5.44 | 4.29 | −3.14 | 5.10
DARC-τ | 5.41 | 4.16 | −2.91 | 5.12
DARC-ϵ | 5.49 | 3.67 | −1.85 | 5.19

Table 14: Additional results on Qwen2.5-14B-Instruct. We report mean reward (Avg μ), disagreement risk (Avg σ), Tradeoff (computed with the same λ as in the main experiments), and prompt-level tail robustness (CVaR_10%). “High-Variance” denotes the top 20% of prompts ranked by baseline disagreement under mean Best-of-K.

H.13 Style-preserving perturbations for disagreement estimation

To estimate the augmentation-based disagreement proxy σ̂_aug^sel, we construct n_aug style-preserving perturbations of each candidate response y ∈ Y(s) for a fixed prompt s. Concretely, for each (s, y) we generate perturbed variants {y^(i)}_{i=1}^{n_aug} using a rewriting model G (e.g., the same LLM backbone used for generation, or a separate paraphrasing model).

Rewrite prompt. We instruct G to produce meaning-preserving paraphrases with only surface-form changes:

Rewrite Instruction. Given the prompt and a candidate response, rewrite the response while preserving its meaning. Only change wording, phrasing, or formatting. Do not add, remove, or alter facts; do not change any numbers, dates, named entities, URLs, or citations. Keep the tone and overall style similar. Keep the length within ±10%. Output only the rewritten response (no preamble).

Sampling and filtering. For each (s, y) we sample rewrites from G using temperature T and top-p (we fix the random seed for reproducibility).
We accept a rewrite y′ if it passes all of the following checks: (i) it satisfies the length constraint; (ii) it does not change any numerals, dates, or detected named entities compared to y (simple string-based checks); and (iii) it is not a duplicate of an existing variant (exact match). If a sampled rewrite fails, we resample until obtaining n_aug accepted variants or reaching a maximum of R_max attempts; if the cap is reached, we use all accepted variants obtained so far.

Scoring and disagreement proxy. For each scorer m, we obtain score samples by evaluating each accepted variant:

R_{m,i}(s, y) := RM_m(s, y^(i)), i ∈ [n_aug].

We then compute the augmentation-based disagreement proxy as the empirical standard deviation across perturbations, aggregated worst-case over scorers:

σ̂_aug^sel(s, y) := max_{m ∈ [M]} sqrt( (1/(n_aug − 1)) Σ_{i=1}^{n_aug} (R_{m,i}(s, y) − R̄_m(s, y))² ),  where  R̄_m(s, y) := (1/n_aug) Σ_{i=1}^{n_aug} R_{m,i}(s, y), (85)

which matches the worst-case aggregation used in Algorithm 1 (Eps variant). When M=1, the outer max can be omitted.

H.14 Tradeoff metric weight: choice and sensitivity

Definition. Within scorer m, we define the risk–reward tradeoff score

Tradeoff_eval^(m)(s, y) := μ̂_eval^(m)(s, y) − λ_eval σ̂^(m)(s, y), (86)

where λ_eval is an evaluation scalarization weight (fixed across methods) and is distinct from the decoding penalty parameter used in the inference-time objectives (e.g., Eq. (21)). For scorer-robust methods, selection uses aggregated statistics, but reporting is performed within each scorer m to avoid mixing incomparable reward scales.

Choice of λ_eval. We set λ_eval = λ_0 (used in all main tables) following a simple scale-matching heuristic on a held-out validation split.
Specifically, we choose λ_0 such that the median magnitudes of the reward and penalty terms are comparable:

median(|μ̂_eval^(m)(s, y)|) ≈ λ_0 · median(σ̂^(m)(s, y)),

computed after applying the same preprocessing (e.g., truncation) used for evaluation. This data-driven heuristic yields λ_0 ≈ 1.99.

Sensitivity to λ_eval. To assess robustness, we sweep λ_eval over a log-spaced grid around λ_0 (e.g., {0.5 λ_0, λ_0, 2 λ_0}) and recompute Tradeoff_eval^(m) for each method. Table 15 summarizes the resulting ranks (per scorer m) at representative weights, confirming that the relative improvements of DARC are robust to the specific choice of the scalarization weight.

Table 15: Sensitivity of the Tradeoff metric to the evaluation weight λ_eval. We report each method’s rank (lower is better) under different scalarization weights (λ_0 = 1.99). DARC variants consistently outperform baselines across a wide range of risk preferences.

Method | Rank@0.5 λ_0 | Rank@λ_0 | Rank@2 λ_0
DARC-ϵ | 1 | 1 | 1
DARC | 2 | 2 | 2
Base (Best-of-K) | 4 | 5 | 5
Caution | 3 | 3 | 4

Appendix I Human Evaluation Protocol and Annotator Quality Controls

Overview. We evaluate responses with two complementary human-evaluation formats: (i) scalar satisfaction scoring on a 0–10 scale, and (ii) pairwise preference judgments between two responses to the same prompt. Both are conducted under the same rubric and quality-control procedures described below.

Scalar satisfaction scoring (0–10). Annotators assign a scalar satisfaction score in {0, 1, …, 10} for each response, where higher is better. Scores reflect overall response quality (e.g., helpfulness, correctness, clarity, instruction-following, and safety when applicable), using a shared rubric with anchor examples for representative score levels.

Pairwise preference.
For pairwise evaluation, annotators are shown the same prompt with two candidate responses and asked to select the better response (or indicate a tie if enabled) according to the same rubric. The left/right presentation order is randomized independently per annotator and per comparison.

Annotators and replication. Each item is evaluated by n=5 annotators. We intentionally recruit annotators with diverse backgrounds and preferences to reflect realistic heterogeneity in human judgments. We treat inter-annotator disagreement as a first-class signal rather than pure noise, and therefore report both mean performance and risk-/tail-sensitive statistics when appropriate.

Rubric, training, and calibration. Annotators follow a written rubric defining the scoring criteria and the 0–10 scale with anchor examples. Before the main evaluation, annotators complete a short training module and a calibration round: they score a shared set of examples and receive feedback to align interpretations of the rubric. The rubric and anchors remain visible in the annotation interface throughout the study.

Blinding and randomization. To reduce expectation bias and method-identification bias, evaluation is double-blind: annotators are not told which system produced any response, and all system identifiers are removed. For scalar scoring, the order of candidate responses (when multiple systems are shown for the same prompt) is randomized. For pairwise preference, left/right ordering is randomized. Item order is shuffled independently per annotator to mitigate ordering and fatigue effects.

Preventing leakage and confounds. Annotators only see the user prompt and the candidate response(s). They do not see model names, decoding methods, hyperparameters, or any metadata that could reveal method identity. We enforce consistent formatting across systems (e.g., identical rendering of markdown and whitespace) to avoid visual cues.

Quality control: attention checks and gold items.
We insert two types of quality-control (QC) items throughout the annotation stream. Attention checks verify careful reading. These include (i) instruction-following checks that explicitly request a specified score (or a specified pairwise choice) for that item, and/or (ii) obvious sanity checks where one response is clearly irrelevant or nonsensical and should be rated much lower (or lose the pairwise comparison). Gold items verify rubric adherence and score calibration. We pre-label a small set of items with reference scores (for scalar) and/or reference winners (for pairwise) by the authors following the same rubric. Annotators are expected to match the reference within a tolerance (e.g., ±1 on the 0–10 scale, or selecting the reference winner in pairwise comparisons). QC items are interleaved uniformly at random (e.g., ≈10% of all items, split between attention checks and gold items). Annotators who repeatedly fail attention checks (e.g., >2 failures) or deviate substantially on gold items (e.g., >20% of gold items outside the tolerance) are excluded; their affected annotations are discarded and reassigned. We additionally filter anomalous behavior such as extremely short completion times or near-constant scoring patterns.

Aggregation and reporting. For scalar scoring, we aggregate the n=5 ratings by reporting the empirical mean and dispersion (e.g., standard deviation), and compute risk-/tail-sensitive metrics (e.g., CVaR) when relevant. For pairwise preference, we report the win rate (and tie rate if applicable) aggregated across annotators, along with bootstrap confidence intervals where reported.

Independence assumption (discussion). Our theoretical analysis models the n ratings for a fixed (s, y) as i.i.d. draws from a stationary distribution. In practice, ratings may exhibit mild dependencies due to annotator-specific biases or session effects.
We mitigate these effects via randomization, blinding, workload distribution across annotators, and multiple annotators per item. Importantly, the proposed decoding rules remain applicable without strict i.i.d. assumptions; the assumption is used to obtain clean concentration-style guarantees, and can be relaxed to weakly dependent sampling with adjusted constants.
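The scalar and pairwise aggregation described in this appendix can be sketched as follows; the bootstrap size and the outcome encoding ("win"/"tie"/"loss") are illustrative assumptions, not the paper's exact tooling:

```python
# Sketch of the aggregation in Appendix I: per-item mean/dispersion over the
# n=5 scalar ratings, win rate over pairwise judgments, and a percentile
# bootstrap CI for the win rate. Encodings and n_boot are illustrative.
import random
from statistics import mean, stdev

def aggregate_scalar(ratings):
    """Per-item empirical mean and dispersion over annotators."""
    return mean(ratings), stdev(ratings)

def win_rate(prefs):
    """Fraction of annotator judgments preferring the candidate."""
    return sum(p == "win" for p in prefs) / len(prefs)

def bootstrap_ci(prefs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the win rate."""
    rng = random.Random(seed)
    stats = sorted(
        win_rate([rng.choice(prefs) for _ in prefs]) for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

With, e.g., the Table 10 Base ratings [9, 1, 6, 5, 4], `aggregate_scalar` recovers the reported mean 5.0 and dispersion ≈2.92.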