Paper deep dive
Conformal Arbitrage: Risk-Controlled Balancing of Competing Objectives in Language Models
William Overman, Mohsen Bayati
Models: GPT-4.1, GPT-4.1-nano
Abstract
Modern language model deployments must often balance competing objectives, for example helpfulness versus harmlessness, cost versus accuracy, and reward versus safety. We introduce Conformal Arbitrage, a post-hoc framework that learns a data-driven threshold to mediate between a Primary model optimized for a primary objective and a more conservative Guardian (another model or a human domain expert) aligned with a guardrail objective. The threshold is calibrated with conformal risk control, yielding finite-sample, distribution-free guarantees that the long-run frequency of undesirable events, such as factual errors or safety violations, does not exceed a user-specified quota. Because Conformal Arbitrage operates wholly at the API level, without requiring access to model logits or updating model weights, it complements weight-based alignment techniques and integrates seamlessly with existing cost-aware cascades. Empirically, Conformal Arbitrage traces an efficient frontier, allowing users to define an acceptable performance level for one objective while maximizing utility in another. We observe that our method outperforms, in terms of accuracy, cost-matched random routing between models. These properties make Conformal Arbitrage a practical, theoretically grounded tool for trustworthy and economical deployment of large language models across a broad range of potentially competing objectives.
Tags
Links
- Source: https://arxiv.org/abs/2506.00911
- Canonical: https://arxiv.org/abs/2506.00911
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/12/2026, 5:36:17 PM
Summary
Conformal Arbitrage is a post-hoc, weight-agnostic framework that uses conformal risk control to mediate between a primary language model and a more conservative guardian model. It enables users to balance competing objectives (e.g., helpfulness vs. harmlessness, cost vs. accuracy) by learning a data-driven threshold that determines when to defer to the guardian, providing finite-sample, distribution-free guarantees on risk metrics without requiring model retraining or logit access.
Entities (5)
Relation Signals (3)
Conformal Arbitrage → uses → Conformal Risk Control
confidence 100% · The threshold is calibrated with conformal risk control
Conformal Arbitrage → mediates_between → Primary Model
confidence 95% · mediate between a Primary model optimized for a primary objective and a more conservative Guardian
Conformal Arbitrage → mediates_between → Guardian Model
confidence 95% · mediate between a Primary model optimized for a primary objective and a more conservative Guardian
Cypher Suggestions (2)
Identify the relationship between models mediated by Conformal Arbitrage · confidence 95% · unvalidated
MATCH (ca:Framework {name: 'Conformal Arbitrage'})-[:MEDIATES_BETWEEN]->(m) RETURN m.name, labels(m)
Find all frameworks that utilize Conformal Risk Control · confidence 90% · unvalidated
MATCH (f:Framework)-[:USES]->(m:Method {name: 'Conformal Risk Control'}) RETURN f.name
Full Text
Conformal Arbitrage: Risk-Controlled Balancing of Competing Objectives in Language Models William Overman Graduate School of Business Stanford University wpo@stanford.edu Mohsen Bayati Graduate School of Business Stanford University bayati@stanford.edu Abstract Modern language-model deployments must often balance competing objectives—for example, helpfulness versus harmlessness, cost versus accuracy, and reward versus safety. We introduce Conformal Arbitrage, a post-hoc framework that learns a data-driven threshold to mediate between a Primary model optimized for a primary objective and a more conservative Guardian—which could be another model or a human domain expert—aligned with a guardrail objective. The threshold is calibrated with conformal risk control, yielding finite-sample, distribution-free guarantees that the long-run frequency of undesirable events (such as factual errors or safety violations) does not exceed a user-specified quota. Because Conformal Arbitrage operates wholly at the API level—without requiring access to model logits or updating model weights—it complements weight-based alignment techniques and integrates seamlessly with existing cost-aware cascades. Empirically, Conformal Arbitrage traces an efficient frontier, allowing users to define an acceptable performance level for one objective while maximizing utility in another. We observe that our method outperforms (in terms of accuracy) cost-matched random routing between models. These properties make Conformal Arbitrage a practical, theoretically grounded tool for trustworthy and economical deployment of large language models across a broad range of potentially competing objectives. 1 Introduction Large language models (LLMs) excel at reasoning, coding, and open-domain question answering, yet real-world deployments frequently need to navigate tensions between potentially competing objectives such as helpfulness and harmlessness or cost and accuracy. 
Current practices mostly tackle the tension between helpfulness and harmlessness by modifying the model itself: reinforcement learning from human feedback (RLHF) (Christiano et al., 2017; Ouyang et al., 2022), direct preference optimisation (DPO) (Rafailov et al., 2023), Constitutional AI (Bai et al., 2022b), and multi-objective fine-tuning (Zhou et al., 2023; Wang et al., 2024) each produce a single operating point along the Pareto frontier. While powerful, these methods demand expensive data collection and GPU-intensive retraining, and for API-only models they are often not applicable. For the cost versus accuracy tradeoff, there has been significant work on cascades: a cheap model handles easy queries and defers the rest to a stronger fallback (Chen et al., 2023; Aggarwal et al., 2025; Zellinger et al., 2025). Recently, Jung et al. (2025) introduced Cascaded Selective Evaluation (CSE), calibrating per-model confidence estimators via fixed-sequence multiple testing to obtain rigorous guarantees on alignment with human pairwise preferences. However, these approaches are tailored to controlling a binary disagreement risk, while a user may be interested in controlling arbitrary guardrail metrics at deployment time. We introduce Conformal Arbitrage (CA), a lightweight router that sits outside the language models. The term "arbitrage" captures how our approach exploits the performance gap between specialized models to achieve outcomes superior to naive selection between models. Given (i) a Primary model optimized for the primary objective and (ii) a more conservative Guardian model or human domain expert aligned with a guardrail objective, CA offers a principled alternative to randomized routing between models. Instead of merely alternating between models with some probability, CA learns a single scalar threshold on how strongly the Primary model favors its top choice over alternatives (a notion we formally define as the "score gap" later in the paper).
This threshold determines when the Primary model's confidence is sufficient to act upon its prediction versus when to defer to the Guardian, creating a principled decision boundary that optimizes the trade-off between objectives. The threshold is calibrated using conformal risk control (CRC) (Angelopoulos et al., 2024), yielding finite-sample, distribution-free guarantees that the long-run frequency (or magnitude) of undesirable events never exceeds a user-specified budget α. This enables precise control over trade-offs: users can explicitly specify how much they are willing to compromise on one objective to gain on the other. Because CA touches no model weights, it complements weight-based alignment and applies to closed, black-box APIs, making it a remarkably lightweight approach to achieving Pareto improvements over simple model selection strategies. Our experiments examine both the cost versus accuracy tradeoff, using the TruthfulQA and MMLU benchmarks, and the helpfulness versus harmlessness tradeoff, using the PKU-SafeRLHF benchmark. Across all settings CA traces an efficient frontier that outperforms random or cost-matched routing baselines. Conformal Arbitrage transforms an immutable, potentially unpredictable LLM (or a family of LLMs) into a controllable system whose risk–utility position can be dialed after deployment. In our experiments, we demonstrate this capability using state-of-the-art LLMs from the GPT-4.1 series (OpenAI, 2025), showing how our method enables fine-grained control over various tradeoffs without modifying the underlying models. By requiring only a few hundred logged examples for calibration, CA offers a pragmatic path toward trustworthy, cost-efficient, and customizable language-model services that can be adjusted to meet evolving requirements long after initial deployment.
2 Related work Real-world deployments must strike a pragmatic balance between helpfulness (supplying users with accurate and detailed information) and harmlessness (avoiding policy-breaking or dangerous content). Early alignment work framed the problem as a single-objective optimization: RLHF (Christiano et al., 2017; Ouyang et al., 2022) and its variant DPO (Rafailov et al., 2023) collapse nuanced feedback into a single reward model and therefore deliver one operating point on the Pareto frontier. Subsequent methods introduced explicit two-factor training: RLHF on mixed helpful–harmless datasets (Bai et al., 2022a), Constitutional AI's self-revision loop (Bai et al., 2022b), and Bi-Factorial Preference Optimisation (BFPO) (Zhang et al., 2025), which casts the bi-objective RLHF loss as a direct supervised criterion. Safe-RLHF (Dai et al., 2023) separates a reward and a cost head and enforces constraints by Lagrangian relaxation, while Circuit Breakers intervene at generation time to halt policy-violating continuations (Zou et al., 2024). The PKU-SafeRLHF benchmark (Ji et al., 2023) was specifically introduced to quantify this helpfulness–harmlessness trade-off, providing dual annotations that enable researchers to measure progress on both dimensions simultaneously. Anthropic's Constitutional AI (Bai et al., 2022b) further explores alignment by embedding principles directly into model training. More recently, the MoGU framework (Du et al., 2024) dynamically routes between model variants optimized separately for usability and safety. Empirically, while these approaches curb unsafe completions, they still lock the model into one fixed balance point between helpfulness and harmlessness. Beyond helpfulness–harmlessness, many other objectives (accuracy, cost, latency, fairness, demographic parity, domain-specific risk, etc.) can be in conflict. Many recent works have proposed weight-based strategies to navigate the resulting frontiers between such competing objectives.
Rewarded Soups linearly interpolates checkpoints fine-tuned on distinct rewards to trace that surface (Ramé et al., 2023), Directional Preference Alignment adds multiple reward heads for steerable inference (Wang et al., 2024), MaxMin-RLHF learns a mixture of reward models to protect minority preferences (Chakraborty et al., 2024), and MO-DPO converts several preference signals into a closed-form multi-objective loss (Zhou et al., 2023). These approaches nevertheless share two limitations: (i) they require access to model weights and retraining, and (ii) they provide no theoretical guarantees that the inherent guardrail metrics driving the trade-off (e.g., safety, accuracy, or cost) will stay within a user-specified budget. In contrast, our method of Conformal Arbitrage is weight-agnostic and sits outside the LLM. By calibrating a single threshold with conformal risk control (Angelopoulos et al., 2024), it transforms any pair of black-box models, one of which can be a human, into a continuum of operating points with provable finite-sample bounds on the chosen guardrail metric (e.g., harmlessness). Conformal Arbitrage is thus closely tied to routing and cascade approaches that tackle cost–accuracy trade-offs (Chen et al., 2023; Yue et al., 2024; Ong et al., 2024; Aggarwal et al., 2025; Zellinger et al., 2025; Varangot-Reille et al., 2025), but can be used to tackle any pair of objectives that may be in tension, thus covering cost–accuracy cascades as a special case. However, unlike these previous approaches, we make no particular optimizations for any specific trade-off, including cost and accuracy, and we do not claim to outperform such cascade systems on metrics for which they are explicitly optimized.
Furthermore, compared to most routing approaches that rely on complex learned functions to distribute queries between models (Varangot-Reille et al., 2025), Conformal Arbitrage employs a principled, theoretically grounded method using a single calibrated scalar threshold. Scalable-oversight research explores how weaker agents or humans can be organized into critique hierarchies that amplify limited supervision. Amplification and Debate delegate verification to inexpensive judges and, under certain complexity assumptions, achieve provable "weak-to-strong" guarantees (Christiano et al., 2018; Irving et al., 2018; Burns et al., 2023). Process supervision instead labels intermediate reasoning steps so that mistakes are caught early (Lightman et al., 2023). Self-reflection frameworks ask a model to generate critiques (and often revisions) of its own outputs (Madaan et al., 2023; Yang et al., 2024; Tang et al., 2024). Post-hoc risk control strategies in model deployment have also gained attention, particularly through moderation and oversight models deployed by industry leaders (OpenAI, 2023). Conformal Arbitrage complements these approaches with a statistically sound escalation rule: the Primary acts autonomously within a risk budget, and otherwise forwards a slimmed-down slate of actions to a human or Guardian. Finite-sample bounds from conformal risk control bound both the Guardian's load and the residual risk, giving a lightweight post-hoc route to scalable oversight without retraining. The underlying selective routing approach of our work resonates with classical selective prediction and reject-option frameworks initially formalized by Chow (1970) and later refined in modern selective classification research (Geifman and El-Yaniv, 2019).
Conformal prediction (CP) and its generalization, conformal risk control (CRC) (Vovk et al., 2005; Bates et al., 2021; Angelopoulos et al., 2024), provide distribution-free, finite-sample guarantees that make them generally attractive post-hoc alignment tools for high-stakes LLM deployments. For instance, Chen et al. (2025) align language models with human risk judgments by controlling tail risks such as toxicity, while Su et al. (2024) demonstrate conformal prediction applied effectively to black-box LLM APIs without internal access. Additionally, conformal risk control has been leveraged in deployment scenarios such as action deferral, illustrated by the KnowNo framework (Ren et al., 2023), which uses conformal uncertainty quantification to trigger human oversight. Conformal prediction and conformal risk control have been used to filter low-confidence QA answers (Kumar et al., 2023), retain only entailment-supported sub-claims (Mohri and Hashimoto, 2024), and bound hallucination rates via abstention (Abbasi-Yadkori et al., 2024). Beyond marginal guarantees, conditional and adaptive CRC tighten coverage on hard prompts (Cherian et al., 2024), and sampling-based set prediction extends CP to free-text generation (Quach et al., 2024). Framing alignment as property testing, Overman et al. (2024) calibrate outputs to satisfy safety or fairness constraints without retraining. Building on this lineage, we adapt CRC to learn a risk-calibrated switch between a Primary model and a Guardian model without retraining either model. Conformal Arbitrage is most closely related to Cascaded Selective Evaluation (CSE) of Jung et al. (2025). CSE equips each judge with a confidence score, calibrates a per-judge threshold, and escalates through a cascade until some judge is confident, thereby controlling the Bernoulli risk that a machine-preferred answer disagrees with human majority. 
Conformal Arbitrage addresses more general tradeoffs: it controls any bounded guardrail loss (safety, accuracy, cost, latency, etc.) and can filter a large action space down to a smaller candidate set that a Guardian or human refines, rather than abstaining on the whole instance. CSE's Simulated Annotators approach requires prompting the model N separate times (one per simulated human annotator) with K-shot prompts (K examples of preference annotations) to obtain an ensemble prediction, and needs access to predictive probabilities extracted from the model's logprobs; every judge call is therefore multiplied many-fold, and the method is limited to APIs that expose token-level logits. Conformal Arbitrage, by contrast, needs at most one call to the Primary and (when routed) one to the Guardian, treats the returned scores as opaque, requiring no access to logits or probabilities, and thus works with strictly black-box APIs. 3 Preliminaries Conformal Arbitrage uses conformal risk control (CRC) to supply finite-sample, distribution-free guarantees on the guardrail metric while treating the underlying language models as black boxes. CRC extends the framework of conformal prediction (CP) (Vovk et al., 2005; Bates et al., 2021) from binary error control to control of arbitrary bounded risks. We briefly summarize both ideas. Conformal prediction Let $\mathcal{X}$ and $\mathcal{Y}$ be the input and output spaces, equipped with a joint probability distribution, and draw an exchangeable sample $(X_i, Y_i)_{i=1}^{n+1} \sim P$, where the first $n$ samples are used for calibration and $(X_{n+1}, Y_{n+1})$ is used for testing. Given any predictor $f: \mathcal{X} \to \mathcal{Y}$ and score $s_f(x, y)$ (e.g.
$|y - f(x)|$), let $q_{1-\alpha}$ be the $(1-\alpha)$ empirical quantile of $\{s_f(X_i, Y_i)\}_{i=1}^{n}$. The conformal set is defined by $C(x) = \{y \in \mathcal{Y} : s_f(x, y) \le q_{1-\alpha}\}$, and enjoys the finite-sample guarantee $\Pr\{Y_{n+1} \notin C(X_{n+1})\} \le \alpha$. Thus any black-box predictor attains $(1-\alpha)$ coverage without distributional assumptions (Vovk et al., 2005; Bates et al., 2021). Conformal risk control Many real-world objectives are not binary mistakes but expectations of a task-specific loss, for example safety-violation rate, factual errors, mean latency, or excess dollar cost. Conformal risk control (Angelopoulos et al., 2024) handles such objectives by introducing, for each calibration point, a bounded, non-increasing loss curve $L_i(\lambda) \in [0, B]$, where $B$ is an upper bound on the loss, indexed by a tunable threshold $\lambda \in \Lambda \subset \mathbb{R}$. Defining the empirical risk $\hat{R}_n(\lambda) = \frac{1}{n} \sum_{i=1}^{n} L_i(\lambda)$, CRC selects $\hat{\lambda} = \inf\{\lambda \in \Lambda : \frac{n}{n+1} \hat{R}_n(\lambda) + \frac{B}{n+1} \le \alpha\}$, (1) and proves the finite-sample guarantee $\mathbb{E}[L_{n+1}(\hat{\lambda})] \le \alpha$, again under the assumption of exchangeability between the calibration data and the test point.
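As a concrete illustration, the threshold search in Equation (1) reduces to a one-line scan over a grid. The sketch below is not the authors' code; it assumes the per-sample loss curves $L_i(\lambda)$ have already been evaluated on a finite grid of candidate thresholds:

```python
import numpy as np

def crc_threshold(lambda_grid, losses, alpha, B=1.0):
    """Pick lambda_hat = inf{lambda : n/(n+1) * R_n(lambda) + B/(n+1) <= alpha}.

    lambda_grid : sorted 1-D array of candidate thresholds.
    losses      : (n, len(lambda_grid)) array with losses[i, j] = L_i(lambda_grid[j]),
                  each loss bounded in [0, B] and non-increasing in lambda.
    """
    n = losses.shape[0]
    # Finite-sample-corrected empirical risk from Equation (1).
    adjusted = (n / (n + 1)) * losses.mean(axis=0) + B / (n + 1)
    feasible = np.flatnonzero(adjusted <= alpha)
    if feasible.size == 0:
        raise ValueError("no threshold on the grid meets the risk budget alpha")
    return lambda_grid[feasible[0]]  # smallest feasible lambda
```

Because each loss curve is non-increasing in λ, the adjusted risk is non-increasing as well, so the first feasible grid point is the infimum over the grid.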
Choosing $L_i(\lambda) = \mathbb{I}\{Y_i \notin C_\lambda(X_i)\}$ recovers classical CP; alternative losses yield risk bounds tailored to deployment needs. 4 Methodology: conformal arbitrage We aim to invoke a Primary model as often as possible (e.g. a helpfulness-maximizing or low-cost model) while ensuring, with high confidence, that a critical requirement (e.g. harmlessness, accuracy) is satisfied by routing calls to a Guardian model (or human) as needed. The linkage between the two models is formalized through conformal risk control (Angelopoulos et al., 2024). 4.1 Setting Let $\{x_i\}_{i \ge 1}$ be an exchangeable sequence of $\mathcal{X}$-valued random variables that we refer to as contexts. Each context $x_i$ admits a finite, non-empty action set $A(x_i) = A_i \subseteq \mathcal{A}$, with $|A(x_i)| < \infty$. Additionally, we assume the existence of two functions $L: \mathcal{X} \times \mathcal{P}(\mathcal{A}) \to \mathbb{R}$ and $U: \mathcal{X} \times \mathcal{P}(\mathcal{A}) \to \mathbb{R}$, measuring, over subsets of the potential actions, loss for the guardrail metric and utility for the primary metric, respectively. We assume both functions satisfy the property that for $A_1 \subseteq A_2$ we have $L(x, A_1) \ge L(x, A_2)$ and $U(x, A_1) \ge U(x, A_2)$. We assume access to two fixed, pre-trained models $p, g: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, where $p$ is the Primary model (reward-seeking or cheap/low-accuracy) and $g$ is the Guardian model (safety-focused or costly/high-accuracy). Despite this simple interface, each model may internally implement arbitrarily complex computations: any architecture that outputs a score for each $(x, a)$ pair is admissible.
Although we write $p(x, a)$ and $g(x, a)$ as deterministic, each model call may depend on internal randomness $\zeta_P, \zeta_G$, producing scores $\tilde{p}(x, a, \zeta_P)$ and $\tilde{g}(x, a, \zeta_G)$. Such tuples $(x, \tilde{p}, \tilde{g})$ remain exchangeable across samples, so the finite-sample guarantees of conformal risk control are unaffected. 4.2 Calibration via conformal risk control To calibrate our Conformal Arbitrage policy, we use conformal risk control (CRC) to choose a relaxation parameter $\hat{\lambda}$ that satisfies a user-defined risk budget $\alpha \in (0, 1)$, controlling how much we can trust the Primary model before deferring to the Guardian. We begin with an exchangeable calibration set of $n$ samples, $\mathcal{D}^{(n)} = \{(x_i, P_i, G_i)\}_{i=1}^{n}$, where $P_i = \{p(x_i, a)\}_{a \in A_i}$ and $G_i = \{g(x_i, a)\}_{a \in A_i}$. Each sample consists of a context $x_i$ and the scores assigned by both the Primary model and the Guardian model across the available action set $A_i = A(x_i)$. For any $\lambda \ge 0$, we define the $\lambda$-relaxed candidate set: $C_\lambda(x) = \{a \in A(x) : p(x, a) \ge \max_{a' \in A(x)} p(x, a') - \lambda\}$.
This set includes all actions whose Primary scores are within $\lambda$ of the top score; in particular, larger values of $\lambda$ increase the size of this set. Since all of the subsets $A' \subseteq A(x)$ that we will consider are of this form $C_\lambda(x)$ for some $\lambda$, we adopt the notation $L_i(\lambda) = L(x_i, C_\lambda(x_i))$ and $U_i(\lambda) = U(x_i, C_\lambda(x_i))$. We then define a loss function on each calibration sample, measuring the residual risk that the Guardian model would assign to the best action in $C_\lambda(x_i)$: $L_i(\lambda) = \max_{a \in A(x_i)} g(x_i, a) - \max_{a \in C_\lambda(x_i)} g(x_i, a)$. (2) Intuitively, this loss captures how unsafe the most promising action (as judged by the Guardian) is among the candidates the Primary model would consider acceptable under $\lambda$. To summarize overall risk, we compute the empirical average $\hat{R}_n(\lambda) = \frac{1}{n} \sum_{i=1}^{n} L_i(\lambda)$, and select the smallest $\lambda$ that satisfies the CRC inequality: $\hat{\lambda} = \inf\{\lambda \ge 0 : \frac{n}{n+1} \hat{R}_n(\lambda) + \frac{1}{n+1} \le \alpha\}$. (3) Definition 1 (Relaxation Parameter).
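For intuition, the candidate set and the residual-risk loss of Equation (2) can be computed directly from the two models' scores. A minimal sketch, with hypothetical helper names and scores represented as plain dicts mapping action to score:

```python
def candidate_set(primary_scores, lam):
    """lambda-relaxed candidate set C_lambda(x): all actions whose Primary
    score is within lam of the top Primary score."""
    top = max(primary_scores.values())
    return [a for a, s in primary_scores.items() if s >= top - lam]

def residual_risk(primary_scores, guardian_scores, lam):
    """Loss of Eq. (2): the Guardian score of the best action overall, minus
    the Guardian score of the best action surviving in C_lambda(x)."""
    candidates = candidate_set(primary_scores, lam)
    return max(guardian_scores.values()) - max(guardian_scores[a] for a in candidates)
```

At λ = 0 only the Primary's top pick survives, so the loss is the full gap to the Guardian's favorite; for λ large enough to include every action, the loss is exactly zero, which is why the loss curves are non-increasing in λ.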
The relaxation parameter $\hat{\lambda}$ is defined as the minimal value of $\lambda$ that satisfies the conformal risk control inequality in Equation 3. This relaxation parameter controls the permissiveness of the candidate action set while ensuring that the expected residual risk on a new context remains bounded by $\alpha$. The guarantee holds exactly at finite sample size and requires no assumptions on score calibration or the context distribution. 4.3 Conformal arbitrage algorithm We now describe the deployment-time decision procedure for selecting actions using the calibrated relaxation parameter $\hat{\lambda}$ obtained in Section 4.2. At each test instance, the agent first consults the Primary model to form a $\hat{\lambda}$-relaxed candidate set. If the top action is sufficiently dominant (i.e., the set is a singleton), it is selected; otherwise, the Guardian model selects from the $\hat{\lambda}$-relaxed set. The procedure is outlined in Algorithm 1.

Algorithm 1 Conformal Arbitrage
Input: context $x$, relaxation parameter $\hat{\lambda}$, Primary model $p$, Guardian model $g$
1: Compute $p(x, a)$ for all $a \in A(x)$
2: Let $C_{\hat{\lambda}}(x) = \{a \in A(x) : p(x, a) \ge \max_{a'} p(x, a') - \hat{\lambda}\}$
3: if $|C_{\hat{\lambda}}(x)| = 1$ then
4:   return the unique element of $C_{\hat{\lambda}}(x)$
5: else
6:   Compute $g(x, a)$ for all $a \in C_{\hat{\lambda}}(x)$
7:   return $a^\star = \arg\max_{a \in C_{\hat{\lambda}}(x)} g(x, a)$
8: end if

4.4 Optimality amongst score-gap routers The fact that Algorithm 1 ensures an upper bound on the loss of our guardrail metric, $\mathbb{E}[L(x, C_{\hat{\lambda}}(x))] \le \alpha$, follows directly from the guarantees of conformal risk control (Angelopoulos et al., 2024). To address utility as measured by the primary metric, we define the class of "score-gap routers" in Definition 2. Additionally, for this theoretical result we require the stronger assumption that the calibration data and test point are i.i.d. Definition 2 (Score-gap router). Fix a Primary score function $p: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ and a non-negative threshold $\lambda \ge 0$. For each context $x$ let $a^\star(x) = \arg\max_{a \in A(x)} p(x, a)$ and $\Delta(x) = p(x, a^\star(x)) - \max_{b \in A(x) \setminus \{a^\star(x)\}} p(x, b)$, with the convention $\Delta(x) = +\infty$ if $|A(x)| = 1$. The score-gap router with threshold $\lambda$, $\mathcal{R}_\lambda: \mathcal{X} \to \mathcal{A} \cup \{\text{defer}\}$, acts as $\mathcal{R}_\lambda(x) = a^\star(x)$ if $\Delta(x) \ge \lambda$, and $\mathcal{R}_\lambda(x) = \text{defer}$ otherwise, where defer means "forward this instance to the Guardian model." Given the Primary model's confidence scores $p(x, a)$, the router chooses the top-scoring action whenever its margin over every alternative exceeds a non-negative threshold $\lambda$, and defers to the Guardian otherwise. This rule mirrors Chow's Bayes-optimal reject-option classifier (Chow, 1970): rather than rejecting an uncertain instance, we escalate it to a more conservative model.
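The deployment-time procedure of Algorithm 1 amounts to a short routing function. A schematic sketch, not the authors' implementation: the `primary` and `guardian` callables stand in for black-box API calls, and `lam_hat` is the calibrated relaxation parameter:

```python
def conformal_arbitrage(x, actions, primary, guardian, lam_hat):
    """Algorithm 1, schematically: act on the Primary's top choice when it is
    dominant, otherwise let the Guardian pick from the relaxed candidate set."""
    p_scores = {a: primary(x, a) for a in actions}
    top = max(p_scores.values())
    candidates = [a for a in actions if p_scores[a] >= top - lam_hat]
    if len(candidates) == 1:
        return candidates[0]                    # Primary acts autonomously
    g_scores = {a: guardian(x, a) for a in candidates}
    return max(g_scores, key=g_scores.get)      # Guardian refines the slate
```

Note that the Guardian is only queried on the filtered slate, so its per-query cost is incurred exactly when the Primary's score gap falls below the threshold.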
Theorem 1 establishes that no other score-gap router of the Primary scores alone can deliver strictly higher expected primary utility while still obeying the same guardrail risk budget $\alpha$, up to a vanishing $O(n^{-1})$ term. We let our Primary metric be measured by $U(\lambda) = \mathbb{E}[U_i(\lambda)]$, which we assume to be non-increasing and $K$-Lipschitz. This is natural, as raising $\lambda$ can only shrink the set of contexts on which we choose the Primary model's output. The proof of Theorem 1 is provided in Appendix A. Theorem 1 (Utility-optimality of Conformal Arbitrage). Fix a compact interval $\Lambda = [0, \lambda_{\max}]$. For each $\lambda \in \Lambda$ and every observation $i$ define a guardrail loss $L_i(\lambda) \in [0, B]$ and a primary-utility score $U_i(\lambda) \in [0, U_{\max}]$, both non-increasing in $\lambda$. Write $R(\lambda) = \mathbb{E}[L_i(\lambda)]$ and $U(\lambda) = \mathbb{E}[U_i(\lambda)]$. Assume $R$ is continuous and strictly decreasing, and $U$ is non-increasing and $K$-Lipschitz. For a desired risk budget $\alpha \in (0, B)$ let $\lambda^\star = \inf\{\lambda \in \Lambda : R(\lambda) \le \alpha\}$. Given an i.i.d. calibration sample $\mathcal{D}^{(n)}$ of size $n$, set $\hat{R}_n(\lambda) = \frac{1}{n} \sum_{i=1}^{n} L_i(\lambda)$ and $\hat{\lambda} = \inf\{\lambda \in \Lambda : \frac{n}{n+1} \hat{R}_n(\lambda) + \frac{B}{n+1} \le \alpha\}$. Then, with expectation taken over the calibration sample, $\mathbb{E}[U(\lambda^\star) - U(\hat{\lambda})] = O(n^{-1})$, (4) and $\mathbb{E}[\sup_{\tilde{\lambda} \in \Lambda : R(\tilde{\lambda}) \le \alpha} U(\tilde{\lambda}) - U(\hat{\lambda})] = O(n^{-1})$. (5) 5 Experiments We test Conformal Arbitrage on two different trade-off settings: a cost–accuracy axis using the multiple-choice datasets TruthfulQA and MMLU, and a helpfulness–harmlessness axis using PKU-SafeRLHF. Each experiment follows the same protocol: we draw a calibration split and use the loss given by Equation 2 to fit the CRC threshold $\hat{\lambda}$ via Equation 3. We then evaluate the guardrail risk and primary utility of Conformal Arbitrage on a disjoint evaluation split, and compare against single-model baselines and random routers. We report the results for TruthfulQA and PKU-SafeRLHF in the main text; the results for MMLU are qualitatively similar and appear in Appendix D.
5.1 TruthfulQA: cost versus accuracy

We first study Conformal Arbitrage on the multiple-choice split of TruthfulQA (Lin et al., 2022), a benchmark designed to expose factual misconceptions in language models (dataset: https://huggingface.co/datasets/EleutherAI/truthful_qa_mc). The benchmark contains 684 questions, each paired with four answer choices and exactly one correct label. Here the primary objective is minimizing cost, while the guardrail metric is factual accuracy.

Experimental set-up. The Primary model is gpt-4.1-nano-2025-04-14; the Guardian model is its larger counterpart gpt-4.1-2025-04-14. This is the natural pairing given that our primary and guardrail metrics are cost and accuracy, respectively (prices taken from https://openai.com/api/pricing/ on May 15, 2025). Both are queried in a zero-shot, multiple-choice format that elicits a real-valued confidence score in $[0,1]$ for each option. We use temperature=0.1 and max_tokens=50; replies that fail JSON parsing default to uniform scores, maintaining exchangeability. Exact prompts appear in Appendix B.1. We keep the Primary's raw scores but binarize the Guardian's: $g(x,a)=1$ if $a$ is its top-ranked answer and correct, and $0$ otherwise. Thus, when the Guardian answers correctly we assign confidence 1 to the correct choice and 0 to the three distractors; when it answers incorrectly we assign 0 to every choice, reflecting total uncertainty. This binarization is not required; one could instead feed the Guardian's real-valued scores into Conformal Arbitrage. It does, however, make the exposition crisper: the calibrated risk level $\alpha$ now translates directly into an $\alpha \times 100\%$ drop in accuracy relative to the Guardian's accuracy. See Appendix B.4 for results using the real-valued scores directly.
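To make the routing mechanics concrete, here is a hedged sketch of a score-gap router. We assume, based on the "$\hat\lambda$-relaxed" description elsewhere in the paper, that the conformal set keeps every choice whose Primary score is within $\lambda$ of the top score, and that the query defers to the Guardian whenever that set is not a singleton; the exact set construction is defined in the paper's earlier sections, not reproduced in this excerpt:

```python
def conformal_set(primary_scores: list[float], lam: float) -> list[int]:
    """Indices of choices whose Primary score is within lam of the best score
    (assumed form of the lambda-relaxed conformal set)."""
    best = max(primary_scores)
    return [i for i, s in enumerate(primary_scores) if s >= best - lam]

def route(primary_scores: list[float], lam: float) -> str:
    """Answer with the Primary when its top score is unambiguous at level lam;
    otherwise defer to the (more expensive) Guardian."""
    return "primary" if len(conformal_set(primary_scores, lam)) == 1 else "guardian"

scores = [0.1, 0.7, 0.15, 0.05]
decision_tight = route(scores, lam=0.2)   # runner-up is 0.55 below the top score
decision_loose = route(scores, lam=0.6)   # three choices now fall inside the gap
```

Raising `lam` only enlarges the set, which is why the guardrail loss in Equation 2 is non-increasing in $\lambda$.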
With Equation 2 the loss is $L_i(\lambda)=\mathbb{1}\{\text{Guardian correct and } C_\lambda(x_i)\not\ni a^\star\}$ for $a^\star=\arg\max_{a\in A(x_i)} g(x_i,a)$. Conformal risk control chooses the smallest $\hat\lambda$ whose empirical mean loss is $\le\alpha$; e.g., $\alpha=0.10$ guarantees that overall accuracy falls by at most ten percentage points relative to an always-Guardian policy. Each trial draws $n=400$ calibration and $N=284$ test questions. We fit $\hat\lambda$ via Eq. (3) on $\Lambda=\{0,0.01,\dots,1\}$ and repeat the calibration–evaluation loop 30 times with fresh random splits. As a baseline, we compare Conformal Arbitrage to a random router that, for each risk level $\alpha$, matches the average cost of our method but chooses the acting model uniformly at random, thereby controlling cost without calibration.

Results. Figure 1 and Table 1 show that CA traces an efficient cost–accuracy frontier, beating the cost-matched random router at every risk level except $\alpha=0.25$ while always respecting the $\alpha$-level guardrail budget. Tightening $\alpha$ from 0.25 to 0.05 raises accuracy from 0.62 to 0.81 at 2.6× the cost. These results demonstrate that statistical calibration, not mere stochastic routing, is essential for efficiency.

Figure 1: Accuracy vs. cost (TruthfulQA), mean ± 1 std over 30 trials; small points show individual CA runs.

Table 1: Accuracy, cost per 1000 examples, $\hat\lambda$, Δ above random baseline, and Guardian usage (mean ± std over 30 trials).
Calibration size n = 400.

| Policy | Accuracy | Cost ($/1000) | $\hat\lambda$ | Δ | Guardian % |
|---|---|---|---|---|---|
| Primary | 0.559 ± 0.015 | 0.032 ± 0.000 | – | – | 0.0% |
| CA (α = 0.25) | 0.621 ± 0.025 | 0.188 ± 0.024 | 0.277 ± 0.067 | −0.011 | 27.7 ± 3.9% |
| CA (α = 0.20) | 0.672 ± 0.025 | 0.234 ± 0.033 | 0.403 ± 0.058 | +0.019 | 34.3 ± 5.3% |
| CA (α = 0.15) | 0.714 ± 0.024 | 0.302 ± 0.035 | 0.529 ± 0.059 | +0.029 | 44.9 ± 5.7% |
| CA (α = 0.10) | 0.766 ± 0.017 | 0.407 ± 0.026 | 0.706 ± 0.031 | +0.031 | 62.1 ± 4.4% |
| CA (α = 0.05) | 0.806 ± 0.017 | 0.521 ± 0.035 | 0.867 ± 0.040 | +0.016 | 78.9 ± 5.6% |
| Guardian | 0.833 ± 0.011 | 0.620 ± 0.001 | – | – | 100.0% |

Ablation studies. Across ablations, CA's frontier stays stable.
First, varying the calibration split (300, 400, 500 points; Appendix B.3) lifts accuracy by only a point or two with flat cost, matching the theory that a few hundred examples suffice (Angelopoulos and Bates, 2022). Second, feeding CA the Guardian's raw scores instead of the 0/1 binarization nudges accuracy up under tight risk budgets and down by a similar amount when the budget loosens (Appendix B.4). Third, letting the Guardian operate on the full action set rather than the $\hat\lambda$-relaxed subset (unrestricted routing, Appendix B.5) raises accuracy a few points at roughly 10% extra cost; because the Primary still acts on the same contexts while the Guardian's menu only expands, the finite-sample risk bound is unaffected, though the primary metric (cost) can overshoot its target. Finally, swapping the Primary gpt-4.1-nano for the stronger but pricier gpt-4.1-mini (Appendix B.6) lifts the low-cost end of the frontier by about 0.22 accuracy points. CA still beats a cost-matched random router, but the margin narrows as the capability and cost gap between models decreases.

5.2 PKU-SafeRLHF: helpfulness versus harmlessness

We consider how Conformal Arbitrage can be applied to the trade-off between helpfulness and harmlessness. The PKU-SafeRLHF corpus contains ~90k prompts, each paired with two distinct LLM responses (dataset: https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF). Each response pair is annotated for (i) which response is more helpful, (ii) which is safer, and (iii) a severity label $\text{sev}\in\{0,1,2,3\}$ indicating the extent of the safety violation (higher is worse). We retain only the rows where the two responses differ in severity level and where the more helpful answer is not the safer answer. These are essentially the hardest cases, the ones that exhibit a genuine conflict between helpfulness and harmlessness. This leaves $N=3{,}552$ examples.
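The filtering rule above can be sketched as follows; the record field names are illustrative and do not reflect the dataset's actual schema:

```python
def keep_example(rec: dict) -> bool:
    """Keep only rows where the two responses differ in severity AND the
    annotator-preferred (more helpful) response is not the safer one."""
    sev_0, sev_1 = rec["severity_0"], rec["severity_1"]
    if sev_0 == sev_1:                       # responses must differ in severity
        return False
    safer = 0 if sev_0 < sev_1 else 1
    return rec["better_response"] != safer   # more helpful must not be safer

data = [
    {"severity_0": 0, "severity_1": 2, "better_response": 1},  # kept: conflict
    {"severity_0": 1, "severity_1": 1, "better_response": 0},  # dropped: same severity
    {"severity_0": 0, "severity_1": 3, "better_response": 0},  # dropped: helpful is safer
]
filtered = [rec for rec in data if keep_example(rec)]
```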
Experiment set-up. We report two quantities: Empirical Human Alignment, the fraction of prompts whose chosen answer matches the annotator-preferred (more-helpful) reply, and Safety-violation loss, the excess severity of the selected answer above the safer one (0 is ideal, larger is worse). The Primary model gpt-4.1-2025-04-14 is instructed (Appendix C) to assign a real-valued helpfulness score $p(x,a)\in[0,1]$ to every candidate reply while ignoring policy concerns. The Guardian is a deterministic rule that selects the lowest-severity reply, $g(x,a)=\mathbb{1}\{\text{sev}(x,a)=\min_{a'}\text{sev}(x,a')\}$, mimicking a human safety judge. Over 30 trials we draw 500/500 calibration–evaluation splits from the 3,552 prompts, tune $\hat\lambda$ on $\Lambda=\{0,0.0025,\dots,1\}$, and evaluate at risk budgets $\alpha\in\{0.10,0.20,\dots,0.60\}$. Baselines are (i) Primary-only ($\arg\max_a p(x,a)$), (ii) Guardian-only (lowest-severity reply), and (iii) a random router that calls the Guardian with probability $p\in\{0.2,0.4,0.5,0.6,0.8\}$.

Results. Fig. 2 shows that Conformal Arbitrage traces an efficient frontier between helpfulness and harmlessness. Exact numerical results are given in Appendix C.2. The mean of every CA model dominates the linear interpolation between the Primary and Guardian models obtainable via randomized routing. Additionally, CA meets the finite-sample guarantee $\mathbb{E}[L]\le\alpha$ for every guardrail budget $\alpha$, as indicated by the mean of each point falling to the left of its corresponding vertical target.

Figure 2: Harmfulness vs. helpfulness (PKU-SafeRLHF), mean ± 1 std over 30 trials.
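The two evaluation metrics above can be sketched as follows, with an illustrative record format of our own choosing (chosen index, annotator-preferred index, per-response severities):

```python
def evaluate(records):
    """Mean Empirical Human Alignment and mean Safety-violation loss.

    Each record is (chosen_idx, preferred_idx, severities); the safety loss
    is the chosen reply's severity in excess of the safest available reply.
    """
    aligned = 0
    excess = 0.0
    for chosen, preferred, sev in records:
        aligned += int(chosen == preferred)   # Empirical Human Alignment
        excess += sev[chosen] - min(sev)      # Safety-violation loss
    n = len(records)
    return aligned / n, excess / n

records = [
    (0, 0, [2, 0]),   # aligned, but picked the more severe reply (excess 2)
    (1, 0, [3, 1]),   # not aligned, but picked the safer reply (excess 0)
]
alignment, violation = evaluate(records)
```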
6 Conclusion

Conformal Arbitrage converts a fixed pair of black-box language models (or a model–human pairing) into a continuum of operating points on a frontier of competing objectives. By calibrating a single score-gap threshold with conformal risk control, CA supplies finite-sample, distribution-free guarantees that a user-chosen guardrail metric stays within budget while maximizing a second objective such as accuracy, helpfulness, or cost efficiency. Empirical results show that CA outperforms cost- and risk-matched random routing, recovers most of the gains of the stronger model at a fraction of the cost, and works with closed-API deployments without accessing weights or logits.

Limitations & future work. Our study is confined to multiple-choice tasks; applying Conformal Arbitrage to free-text generation would require bespoke loss functions. We forgo task-specific optimizations (e.g., cost–accuracy tuning), deferring comparisons with specialized cascade systems. We analyze only a single-step, two-model router; deeper cascades may be possible. Next steps include (i) integrating adaptive CRC (Blot et al., 2025), (ii) adding tailored optimizations to benchmark against state-of-the-art cascades, and (iii) extending CA to multi-model cascades and agentic pipelines.

References

Abbasi-Yadkori et al. [2024] Yasin Abbasi-Yadkori, Ilja Kuzborskij, David Stutz, András György, Adam Fisch, Arnaud Doucet, Iuliya Beloshapka, Wei-Hung Weng, Yao-Yuan Yang, Csaba Szepesvári, Ali Taylan Cemgil, and Nenad Tomasev. Mitigating llm hallucinations via conformal abstention. arXiv preprint arXiv:2405.01563, 2024. URL https://arxiv.org/abs/2405.01563. Aggarwal et al. [2025] Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, and Mausam. Automix: Automatically mixing language models.
In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS), 2025. arXiv:2310.12963. Angelopoulos and Bates [2022] Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification, 2022. URL https://arxiv.org/abs/2107.07511. Angelopoulos et al. [2024] Anastasios Nikolas Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In The Twelfth International Conference on Learning Representations, 2024. Bai et al. [2022a] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022a. URL https://arxiv.org/abs/2204.05862. Bai et al. [2022b] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. 
Constitutional ai: Harmlessness from ai feedback, 2022b. URL https://arxiv.org/abs/2212.08073. Bates et al. [2021] Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, and Michael I. Jordan. Distribution-free, risk-controlling prediction sets, 2021. Blot et al. [2025] Vincent Blot, Anastasios N Angelopoulos, Michael I Jordan, and Nicolas J-B Brunel. Automatically adaptive conformal risk control, 2025. URL https://arxiv.org/abs/2406.17819. Burns et al. [2023] Collin Burns et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023. Chakraborty et al. [2024] Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Bedi, and Mengdi Wang. Maxmin-rlhf: Towards equitable alignment of large language models with diverse human preferences. In ICML Workshop on Models of Human Feedback for AI Alignment, 2024. Chen et al. [2025] Catherine Yu-Chi Chen, Jingyan Shen, Zhun Deng, and Lihua Lei. Conformal tail risk control for large language model alignment, 2025. URL https://arxiv.org/abs/2502.20285. Chen et al. [2023] Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023. Cherian et al. [2024] John J. Cherian, Isaac Gibbs, and Emmanuel J. Candès. Large language model validity via enhanced conformal prediction methods. In Advances in Neural Information Processing Systems, volume 37, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/d02f1aeaa5c268dc34790d1ad21526-Abstract-Conference.html. Chow [1970] C. K. Chow. On optimum recognition error and reject trade-off. IEEE Transactions on Information Theory, 16(1):41–46, 1970. Christiano et al. [2018] Paul Christiano, Evan Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575, 2018. Christiano et al.
[2017] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, volume 30, 2017. Dai et al. [2023] Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773, 2023. Du et al. [2024] Yanrui Du, Sendong Zhao, Danyang Zhao, Ming Ma, Yuhan Chen, Liangyu Huo, Qing Yang, Dongliang Xu, and Bing Qin. Mogu: A framework for enhancing safety of llms while preserving their usability. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 87569–87591. Curran Associates, Inc., 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/9f7f063144103bf6debb09a3f15e00fb-Paper-Conference.pdf. Geifman and El-Yaniv [2019] Yonatan Geifman and Ran El-Yaniv. SelectiveNet: A deep neural network with an integrated reject option. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2151–2159. PMLR, 09–15 Jun 2019. Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. Irving et al. [2018] Geoffrey Irving, Paul Christiano, and Dario Amodei. Ai safety via debate. arXiv preprint arXiv:1805.00899, 2018. Ji et al. [2023] Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
URL https://openreview.net/forum?id=g0QovXbFw3. Jung et al. [2025] Jaehun Jung, Faeze Brahman, and Yejin Choi. Trust or escalate: Llm judges with provable guarantees for human agreement. arXiv preprint arXiv:2407.18370, 2025. Kumar et al. [2023] Bhawesh Kumar, Charles Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam. Conformal prediction with large language models for multi-choice question answering. In Proceedings of the ICML 2023 Workshop on Neural Conversational AI: Teaching Machines to Converse, 2023. URL https://arxiv.org/abs/2305.18404. Lightman et al. [2023] Sam Lightman, Nikita Nangia, and Samuel R. Bowman. Process supervision improves mathematical reasoning in chain-of-thought models. arXiv preprint arXiv:2305.20050, 2023. Lin et al. [2022] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022. URL https://arxiv.org/abs/2109.07958. Madaan et al. [2023] Aman Madaan, Guangtao Tu, Yiming Chen, Yulia Tsvetkov, and Graham Neubig. Self-refine: Iterative refinement with self-feedback. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023. Mohri and Hashimoto [2024] Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees. arXiv preprint arXiv:2402.10978, 2024. URL https://arxiv.org/abs/2402.10978. Ong et al. [2024] Isaac Ong, Pranav Patil, Shivang Agarwal, Harsh Gupta, Nelson F. Liu, Yanda Chen, Percy Liang, and Tatsunori Hashimoto. Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665, 2024. OpenAI [2023] OpenAI. Gpt-4 system card, 2023. https://openai.com/blog/gpt-4. OpenAI [2025] OpenAI. Introducing gpt-4.1 in the api, April 2025. URL https://openai.com/index/gpt-4-1/. Accessed: 2025-05-15. Ouyang et al. 
[2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf. Overman et al. [2024] William Overman, Jacqueline Jil Vallon, and Mohsen Bayati. Aligning model properties via conformal risk control. In Advances in Neural Information Processing Systems, volume 37, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/c79625091a4f8b5d3abe29f3b14fa43a-Abstract-Conference.html. Quach et al. [2024] Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2306.10193. Rafailov et al. [2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023. Ramé et al. [2023] Alexandre Ramé, Guillaume Couairon, Mustafa Shukor, Corentin Dancette, Jean-Baptiste Gaya, Laure Soulier, and Matthieu Cord. Rewarded soups: Towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. In NeurIPS, 2023. Ren et al. [2023] Allen Z.
Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, Zhenjia Xu, Dorsa Sadigh, Andy Zeng, and Anirudha Majumdar. Robots that ask for help: Uncertainty alignment for large language model planners. In 7th Annual Conference on Robot Learning, 2023. URL https://openreview.net/forum?id=4ZK8ODNyFXx. Su et al. [2024] Jiayuan Su, Jing Luo, Hongwei Wang, and Lu Cheng. Api is enough: Conformal prediction for large language models without logit-access, 2024. URL https://arxiv.org/abs/2403.01216. Tang et al. [2024] Yunhao Tang, Rohan Anil, Hyung Won Chung, Zhang Chen, Zhifeng Dai, and Barret Zoph. Scrit: Self-evolving critic for scalable oversight. arXiv preprint arXiv:2403.09613, 2024. Varangot-Reille et al. [2025] Clovis Varangot-Reille, Olivier Caelen, Emelyne Goffinet, Alison Baumann, Alexandre Chauvet, and Patrick von Platen. Doing more with less – implementing routing strategies in large language model-based systems: An extended survey. arXiv preprint arXiv:2502.00409, 2025. Vovk et al. [2005] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World, Second Edition. January 2005. doi: 10.1007/978-3-031-06649-8. Springer-Verlag New York, Inc. 2005. Wang et al. [2024] Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, and Tong Zhang. Arithmetic control of llms for diverse user preferences: Directional preference alignment with multi-objective rewards. In ACL, 2024. Yang et al. [2024] Hanjiang Yang, Tianyu Fu, Xu Wang, Yao Yao, Sean Welleck, Etienne Levin, Anqi Nie, Kyunghyun Cho, and Jason Weston. Deepcritic: Large language model critics for scalable oversight. arXiv preprint arXiv:2402.05497, 2024. Yue et al. [2024] Murong Yue, Jie Zhao, Min Zhang, Liang Du, and Ziyu Yao. Large language model cascades with mixture of thought representations for cost-efficient reasoning. 
In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=6okaSfANzh. Zellinger et al. [2025] Michael J. Zellinger, Rex Liu, and Matt Thomson. Cost-saving llm cascades with early abstention. arXiv preprint arXiv:2502.09054, 2025. Zhang et al. [2025] Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny, and Adel Bibi. Bi-factorial preference optimization: Balancing safety-helpfulness in language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=GjM61KRiTG. Zhou et al. [2023] Zhanhui Zhou, Jie Liu, Chao Yang, Jing Shao, Yu Liu, Xiangyu Yue, Wanli Ouyang, and Yu Qiao. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. arXiv preprint arXiv:2310.03708, 2023. Zou et al. [2024] Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers. arXiv preprint arXiv:2406.04313, 2024.

Appendix A Utility-optimality of CRC among score-gap routers

We restate Theorem 1 here for convenience and provide the full proof.

Theorem 1 (Utility-optimality of conformal risk control). Fix a compact interval $\Lambda=[0,\lambda_{\max}]$. For each $\lambda\in\Lambda$ and every observation $i$ define a guardrail loss $L_i(\lambda)\in[0,B]$ and a primary-utility score $U_i(\lambda)\in[0,U_{\max}]$, both non-increasing in $\lambda$. Write

$$R(\lambda)=\mathbb{E}[L_i(\lambda)],\qquad U(\lambda)=\mathbb{E}[U_i(\lambda)].$$

Assume $R$ is continuous and strictly decreasing, and $U$ is non-increasing and $K$-Lipschitz.
For a desired risk budget $\alpha\in(0,B)$ let

$$\lambda^\star=\inf\{\lambda\in\Lambda : R(\lambda)\le\alpha\}.$$

Given an i.i.d. calibration sample $D^{(n)}$ of size $n$, set

$$\hat R_n(\lambda)=\frac{1}{n}\sum_{i=1}^n L_i(\lambda),\qquad \hat\lambda=\inf\Big\{\lambda\in\Lambda : \tfrac{n}{n+1}\,\hat R_n(\lambda)+\tfrac{B}{n+1}\le\alpha\Big\}.$$

Then, with expectation taken over the calibration sample,

$$\mathbb{E}\big[U(\lambda^\star)-U(\hat\lambda)\big]=O(n^{-1}), \tag{6}$$

$$\mathbb{E}\Big[\sup_{\tilde\lambda\in\Lambda:\,R(\tilde\lambda)\le\alpha} U(\tilde\lambda)-U(\hat\lambda)\Big]=O(n^{-1}). \tag{7}$$

Proof. Angelopoulos et al. (2024, Thm. 2) show that the threshold $\hat\lambda$ selected by the conformal-risk-control rule satisfies the tight risk lower bound $\mathbb{E}[L_{n+1}(\hat\lambda)]\ge\alpha-\frac{2B}{n+1}$. Since $\alpha\ge R(\lambda^\star)$, this implies

$$R(\hat\lambda)\ge R(\lambda^\star)-\frac{2B}{n+1}.$$
Thus we get

$$0\le R(\lambda^\star)-R(\hat\lambda)\le\frac{2B}{n+1}.$$

Strict monotonicity and continuity of $R$ on the compact interval $\Lambda$ imply that its inverse is Lipschitz; writing $m=\inf_{\lambda\in\Lambda}|R'(\lambda)|>0$ gives $|\hat\lambda-\lambda^\star|\le 2B/(m(n+1))$. Then, by the non-increasing and $K$-Lipschitz assumptions on the utility curve,

$$U(\lambda^\star)-U(\hat\lambda)\le K\,|\lambda^\star-\hat\lambda|\le\frac{2KB}{m(n+1)}.$$

Here $U(\hat\lambda)$ is still random through $\hat\lambda=\hat\lambda(D^{(n)})$, while $U(\lambda^\star)$ is deterministic. Integrating the inequality over the distribution of $D^{(n)}$ preserves the bound and yields (6). If $\tilde\lambda$ satisfies $R(\tilde\lambda)\le\alpha$ then, by strict monotonicity of $R$, one must have $\tilde\lambda\ge\lambda^\star$ and hence $U(\tilde\lambda)\le U(\lambda^\star)$.
Therefore, for every calibration draw $D^{(n)}$,

$$\sup_{\tilde\lambda\in\Lambda:\,R(\tilde\lambda)\le\alpha}\big\{U(\tilde\lambda)-U(\hat\lambda)\big\}\;\le\; U(\lambda^\star)-U(\hat\lambda)\;\le\;\frac{2KB}{m(n+1)}.$$

Taking expectation establishes (7). ∎

Appendix B TruthfulQA: Additional Experimental Details

B.1 Prompt for Score Elicitation

System message:

You are an expert who evaluates multiple choice questions.
# Instructions
-- Assign a confidence score to each answer choice on a scale from 0 to 1
-- 0 means certainly incorrect, 1 means certainly correct
-- Don't assign similar scores to choices unless you are genuinely equally uncertain
# Response Format
-- Output ONLY a valid JSON object with a "scores" key containing an array of numbers
-- Example: "scores": [0.1, 0.8, 0.05, 0.05]
-- No explanations, just the JSON object

User message:

Question: <verbatim question text>
Answer Choices: <json.dumps(choices)>
Respond ONLY with a JSON object containing your confidence scores for these choices, e.g. "scores": [0.1, 0.8, 0.05, 0.05]

Both the Primary (gpt-4.1-nano-2025-04-14) and Guardian (gpt-4.1-2025-04-14) models receive exactly this dialog. We parse the returned JSON, extract the scores array, and then normalize it so that it sums to 1; these normalized values are used as the per-choice confidence scores $p(x,a)$ and $g(x,a)$ throughout calibration and evaluation.
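The parsing-and-normalization step described above can be sketched as follows. The fallback-to-uniform behavior mirrors the protocol stated in Section 5.1; the function name and the particular error cases handled are our own choices:

```python
import json

def parse_scores(reply: str, n_choices: int) -> list[float]:
    """Extract and normalize per-choice confidence scores from a model reply.

    Replies that fail JSON parsing (or return a malformed scores array) fall
    back to uniform scores, keeping calibration/test exchangeability intact.
    """
    try:
        scores = json.loads(reply)["scores"]
        assert len(scores) == n_choices and sum(scores) > 0
    except (json.JSONDecodeError, KeyError, TypeError, AssertionError):
        return [1.0 / n_choices] * n_choices
    total = sum(scores)
    return [s / total for s in scores]  # normalize so scores sum to 1

good = parse_scores('{"scores": [0.1, 0.8, 0.05, 0.05]}', 4)
bad = parse_scores('not json at all', 4)
```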
B.2 Cost Calculation

For every question in every trial we record the four token counts

$$(t_{\text{in}}^{\text{primary}},\; t_{\text{out}}^{\text{primary}},\; t_{\text{in}}^{\text{guardian}},\; t_{\text{out}}^{\text{guardian}}),$$

i.e. the prompt- and completion-token usage of the Primary and Guardian models, respectively. Each model is billed at its own per-token prices $c_{\text{in}}^{\text{primary}}, c_{\text{out}}^{\text{primary}}$ and $c_{\text{in}}^{\text{guardian}}, c_{\text{out}}^{\text{guardian}}$. For $M\in\{\text{primary},\text{guardian}\}$ the cost is

$$\text{cost}_M = c_{\text{in}}^{M}\, t_{\text{in}}^{M} + c_{\text{out}}^{M}\, t_{\text{out}}^{M}.$$

Hybrid (routed) calls. If the Primary's $\hat\lambda$-relaxed conformal set contains $m>1$ answers, the query is routed to the Guardian. To upper-bound this second leg we start from the original, full-prompt token count $t_{\text{in}}^{\text{full}}$ (the question shown to both models) and scale it according to the fraction of choices actually sent:

$$\hat t_{\text{in}} = \Big\lfloor t_{\text{in}}^{\text{full}}\,\big(0.5 + 0.5\,\tfrac{m}{n}\big)\Big\rfloor,$$

where $n$ is the total number of answer options.
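As an illustration of the accounting above, here is a hedged sketch with placeholder per-token prices; the function names and price values are ours, not the paper's:

```python
import math

def call_cost(c_in: float, c_out: float, t_in: int, t_out: int) -> float:
    """cost_M = c_in^M * t_in^M + c_out^M * t_out^M (per-call billing)."""
    return c_in * t_in + c_out * t_out

def routed_prompt_tokens(t_in_full: int, m: int, n: int) -> int:
    """Scaled Guardian prompt size when only m of n choices remain in the
    lambda-relaxed set: floor(t_in_full * (0.5 + 0.5 * m / n))."""
    return math.floor(t_in_full * (0.5 + 0.5 * m / n))

# Placeholder prices in $/token (illustrative, not the OpenAI schedule).
t_hat = routed_prompt_tokens(t_in_full=200, m=2, n=4)         # 200 * 0.75
primary_cost = call_cost(c_in=1e-7, c_out=4e-7, t_in=200, t_out=50)
```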
We keep the Guardian's completion length fixed at $t_{\mathrm{out}}^{\mathrm{guardian}}$, yielding the estimate

$$\mathrm{cost}_{\mathrm{guardian}}^{\mathrm{est}} = c_{\mathrm{in}}^{\mathrm{guardian}}\, \hat{t}_{\mathrm{in}} + c_{\mathrm{out}}^{\mathrm{guardian}}\, t_{\mathrm{out}}^{\mathrm{guardian}}, \qquad \mathrm{cost}_{\mathrm{total}} = \mathrm{cost}_{\mathrm{primary}} + \mathrm{cost}_{\mathrm{guardian}}^{\mathrm{est}}.$$

Because we (i) retain the Guardian's full completion length and (ii) shrink prompt tokens only linearly with $m/n$, this accounting is deliberately conservative: an implementation that truly shortens both prompt and completion when $m < n$ would only reduce the spend. Hence our reported savings under Conformal Arbitrage are a lower bound.[4]

[4] Token prices follow the OpenAI schedule of 15 May 2025.

B.3 Calibration Size Ablations

Table 2: TruthfulQA. Accuracy, cost per 1000 examples, $\hat{\lambda}$, $\Delta$ above random baseline, and Guardian usage (mean ± std over 30 trials). Calibration size $n = 300$.
| Policy | Accuracy | Cost ($/1000) | $\hat{\lambda}$ | Δ | Guardian % |
|---|---|---|---|---|---|
| Primary | 0.557 ± 0.012 | 0.032 ± 0.000 | – | – | 0.0% |
| CA (α = 0.25) | 0.619 ± 0.038 | 0.184 ± 0.030 | 0.280 ± 0.079 | −0.008 | 27.3 ± 5.1% |
| CA (α = 0.20) | 0.667 ± 0.033 | 0.236 ± 0.027 | 0.405 ± 0.048 | +0.016 | 35.0 ± 4.3% |
| CA (α = 0.15) | 0.710 ± 0.034 | 0.304 ± 0.040 | 0.542 ± 0.063 | +0.027 | 45.6 ± 6.5% |
| CA (α = 0.10) | 0.757 ± 0.031 | 0.394 ± 0.041 | 0.700 ± 0.048 | +0.028 | 60.3 ± 6.7% |
| CA (α = 0.05) | 0.801 ± 0.022 | 0.513 ± 0.048 | 0.861 ± 0.059 | +0.018 | 78.3 ± 7.7% |
| Guardian | 0.833 ± 0.010 | 0.615 ± 0.001 | – | – | 100.0% |

Table 3: TruthfulQA. Accuracy, cost per 1000 examples, $\hat{\lambda}$, $\Delta$ above random baseline, and Guardian usage (mean ± std over 30 trials). Calibration size $n = 500$.

| Policy | Accuracy | Cost ($/1000) | $\hat{\lambda}$ | Δ | Guardian % |
|---|---|---|---|---|---|
| Primary | 0.554 ± 0.012 | 0.032 ± 0.000 | – | – | 0.0% |
| CA (α = 0.25) | 0.625 ± 0.040 | 0.184 ± 0.019 | 0.301 ± 0.039 | −0.005 | 27.3 ± 3.4% |
| CA (α = 0.20) | 0.672 ± 0.042 | 0.233 ± 0.025 | 0.414 ± 0.045 | +0.020 | 34.6 ± 4.2% |
| CA (α = 0.15) | 0.715 ± 0.037 | 0.301 ± 0.024 | 0.563 ± 0.038 | +0.031 | 45.1 ± 3.9% |
| CA (α = 0.10) | 0.765 ± 0.033 | 0.402 ± 0.025 | 0.712 ± 0.026 | +0.032 | 62.0 ± 4.2% |
| CA (α = 0.05) | 0.806 ± 0.029 | 0.524 ± 0.024 | 0.881 ± 0.028 | +0.019 | 80.1 ± 3.8% |
| Guardian | 0.833 ± 0.010 | 0.615 ± 0.001 | – | – | 100.0% |

To assess how many calibration examples are needed for Conformal Arbitrage (CA) to stabilize, we repeat the TruthfulQA experiment with calibration split sizes $n \in \{300, 500\}$. Tables 2–3 report accuracy, dollar cost per 1000 questions, the fitted threshold $\hat{\lambda}$, and Guardian usage at the same guardrail levels $\alpha \in \{0.25, 0.20, 0.15, 0.10, 0.05\}$. Across all risk budgets the frontier is stable: moving from $n = 300$ to $n = 500$ changes mean accuracy by at most 1–2 percentage points, average cost remains effectively unchanged (differences < 3%) for every α, and the fraction of queries escalated to the Guardian varies by less than 2% absolute.

B.4 Guardian Scoring Ablation

Table 4: Accuracy, cost per 1000 examples, $\hat{\lambda}$, $\Delta$ above random baseline, and Guardian usage (mean ± std over 30 trials) when the Guardian's raw scores are used instead of hard 0/1 binarization.
| Policy | Accuracy | Cost ($/1000) | $\hat{\lambda}$ | Δ | Guardian % |
|---|---|---|---|---|---|
| Primary | 0.556 ± 0.012 | 0.032 ± 0.000 | – | – | 0.0% |
| CA (α = 0.25) | 0.598 ± 0.037 | 0.163 ± 0.026 | 0.203 ± 0.089 | −0.021 | 24.0 ± 4.5% |
| CA (α = 0.20) | 0.661 ± 0.035 | 0.222 ± 0.028 | 0.394 ± 0.059 | +0.014 | 32.8 ± 4.4% |
| CA (α = 0.15) | 0.714 ± 0.028 | 0.304 ± 0.032 | 0.558 ± 0.059 | +0.029 | 45.6 ± 5.3% |
| CA (α = 0.10) | 0.771 ± 0.025 | 0.414 ± 0.030 | 0.741 ± 0.036 | +0.032 | 63.1 ± 4.3% |
| CA (α = 0.05) | 0.813 ± 0.021 | 0.554 ± 0.059 | 0.917 ± 0.056 | +0.013 | 84.8 ± 9.6% |
| Guardian | 0.831 ± 0.010 | 0.615 ± 0.001 | – | – | 100.0% |

When calibrating Conformal Arbitrage (CA) on TruthfulQA we binarize the Guardian's output in the main experiments (assigning score 1 to the Guardian's highest scoring answer if and only if it is correct and 0 to all others) to make the accuracy loss $L_i(\lambda)$ in Eq. (2) directly interpretable as the fractional drop in accuracy relative to an always-Guardian policy. Here we repeat the experiment but feed CA the Guardian's raw confidence scores. The resulting frontier is reported in Table 4. For tighter risk budgets ($\alpha \le 0.10$), accuracy rises by roughly 1–2% while cost is unchanged. At loose risk budgets ($\alpha \ge 0.20$), accuracy drops slightly (about 0.5–1%). Cost differences remain negligible. With respect to the risk guarantees, feeding softer scores does not affect the finite-sample CRC bound; every row in Table 4 satisfies the $\mathbb{E}[L] \le \alpha$ constraint as expected.

B.5 Unrestricted Action Set Routing

In our main pipeline the Guardian is asked to choose only from the $\hat{\lambda}$-relaxed candidate set $C_{\hat{\lambda}}(x)$ generated by the Primary. Here we study a more liberal variant, denoted CA⋆, that lets the Guardian reconsider the entire action set $A(x)$. Table 5 shows that unrestricted routing lifts accuracy by roughly 3–6 percentage points across the tested risk budgets, with the largest gains appearing in the looser regimes ($\alpha \ge 0.20$). The calibration diagnostics in Table 6 explain why: as α grows the conformal set shrinks, increasing the odds that the Primary prunes away the correct answer. When the Guardian can inspect all options it can often recover that mistake, yielding the frontier in Figure 3. The cost penalty is modest: on average 7–10% above the restricted CA variant.
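The restricted and unrestricted policies differ only in which option pool the Guardian scores over. A minimal sketch, assuming a λ-relaxed set that keeps every option whose normalized Primary score is within $\hat{\lambda}$ of the best one (this thresholding form and all names are illustrative assumptions; the exact construction of $C_{\lambda}(x)$ is defined in the main text):

```python
def candidate_set(primary_scores: dict, lam: float) -> list:
    """Assumed lambda-relaxed set: options scoring within lam of the best."""
    best = max(primary_scores.values())
    return [a for a, s in primary_scores.items() if s >= best - lam]

def route(primary_scores: dict, guardian_scores: dict, lam: float,
          restricted: bool = True):
    """Pick an answer under restricted CA or unrestricted CA*."""
    cands = candidate_set(primary_scores, lam)
    if len(cands) == 1:              # Primary is confident: no escalation
        return cands[0]
    # CA: Guardian chooses within the pruned set; CA*: over all options
    pool = cands if restricted else list(guardian_scores)
    return max(pool, key=lambda a: guardian_scores[a])
```

Under this sketch, a larger $\hat{\lambda}$ yields a larger candidate set and hence more frequent Guardian involvement, matching the trend in Table 6.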
In many applications the action space is much larger than the four-choice multiple-choice setting considered here. Passing the full set to the Guardian would then erase most of the cost savings that Conformal Arbitrage provides. Moreover, for trade-offs other than cost–accuracy (e.g. reward versus safety) a filtered candidate set can be desirable: it biases the Guardian toward options with high primary utility while still respecting the guardrail budget. For these reasons we present the restricted policy as the default and treat unrestricted routing as an informative ablation.

Figure 3: Accuracy vs. cost per 1000 examples on TruthfulQA using unrestricted calibrated routing. Each point corresponds to the mean over 30 trials; error bars represent one standard deviation. Solid circles denote our CRC-hybrid policy, stars represent static baselines (Preferred-only and Guardian-only), and hollow diamonds show the random routing baseline.

Table 5: Accuracy, cost per 1000 examples, $\hat{\lambda}$, $\Delta$ above unrestricted random baseline, and Guardian usage (mean ± std over 30 trials). Calibration size $n = 400$. CA rows report the unrestricted variant.
| Policy | Accuracy | Cost ($/1000) | $\hat{\lambda}$ | Δ | Guardian % |
|---|---|---|---|---|---|
| Primary | 0.559 ± 0.015 | 0.032 ± 0.000 | – | – | 0.0% |
| CA⋆ (α = 0.25) | 0.687 ± 0.021 | 0.206 ± 0.025 | 0.277 ± 0.067 | +0.046 | 27.7 ± 3.9% |
| CA⋆ (α = 0.20) | 0.713 ± 0.022 | 0.247 ± 0.033 | 0.403 ± 0.058 | +0.052 | 34.3 ± 5.3% |
| CA⋆ (α = 0.15) | 0.741 ± 0.022 | 0.313 ± 0.036 | 0.529 ± 0.059 | +0.050 | 44.9 ± 5.7% |
| CA⋆ (α = 0.10) | 0.785 ± 0.016 | 0.421 ± 0.027 | 0.706 ± 0.031 | +0.043 | 62.1 ± 4.4% |
| CA⋆ (α = 0.05) | 0.812 ± 0.016 | 0.525 ± 0.035 | 0.867 ± 0.040 | +0.020 | 78.9 ± 5.6% |
| Guardian | 0.833 ± 0.011 | 0.620 ± 0.001 | – | – | 100.0% |

Table 6: Calibrated $\hat{\lambda}$ values and resulting conformal-set sizes for CA as used in the main text (means ± s.d. over 30 trials). As the risk budget α tightens (top → bottom), the candidate set grows.

| α | $\hat{\lambda}$ | Set size |
|---|---|---|
| 0.25 | 0.277 ± 0.067 | 1.457 ± 0.024 |
| 0.20 | 0.403 ± 0.058 | 1.801 ± 0.038 |
| 0.15 | 0.529 ± 0.059 | 2.105 ± 0.045 |
| 0.10 | 0.706 ± 0.031 | 2.587 ± 0.041 |
| 0.05 | 0.867 ± 0.040 | 3.253 ± 0.034 |

B.6 Model Choice Ablation

To probe how Conformal Arbitrage behaves for the cost–accuracy tradeoff when the capability gap between the two models is smaller, we replace the original gpt-4.1-nano Primary with the stronger but costlier gpt-4.1-mini. This boosts the stand-alone Primary accuracy from 0.56 to 0.77 (only about 6 percentage points below the Guardian) and raises the token price four-fold. Even in this compressed regime CA still delivers a meaningful improvement over cost-matched random routing: at α = 0.05 it gains +2 percentage points in accuracy while invoking the Guardian on just one quarter of the queries, and at α = 0.025 it matches the Guardian's accuracy for 40% of the cost. The detailed numbers are collected in Table 7, and the corresponding cost–accuracy frontier is visualized in Figure 4.

Table 7: Model-ablation results on TruthfulQA with gpt-4.1-mini as the Primary. Accuracy, cost per 1000 examples, fitted threshold $\hat{\lambda}$, improvement over a cost-matched random router ($\Delta$), and Guardian usage.
Means ± one standard deviation across 30 trials.

| Policy | Accuracy | Cost ($/1000) | $\hat{\lambda}$ | Δ | Guardian % |
|---|---|---|---|---|---|
| Primary (4.1-mini) | 0.7738 ± 0.0113 | 0.126 ± 0.000 | – | – | 0.0% |
| CA (α = 0.050) | 0.8156 ± 0.0194 | 0.265 ± 0.032 | 0.452 ± 0.082 | +0.021 | 23.9 ± 5.0% |
| CA (α = 0.025) | 0.8345 ± 0.0208 | 0.375 ± 0.064 | 0.669 ± 0.094 | +0.026 | 41.2 ± 10.7% |
| Guardian (4.1) | 0.8328 ± 0.0088 | 0.615 ± 0.001 | – | – | 100.0% |

Figure 4: Cost–accuracy frontier on TruthfulQA when the Primary is gpt-4.1-mini. Large markers denote means over 30 trials (error bars: ±1 s.d.); faint dots show individual CA runs. The dashed grey line is the linear interpolation between the single-model baselines.

Appendix C PKU-SafeRLHF: Additional Experimental Details

C.1 Prompt for Score Elicitation

System message

You are an expert assistant-evaluator.
# Task
Score the helpfulness of each assistant reply on a continuous scale from 0.00 to 1.00 (use two decimal places).
-- Evaluate only technical helpfulness: factual correctness, completeness, clarity, and relevance to the user's request.
-- Ignore safety, ethics, policy compliance, or any notion of harm.
# Reasoning
1. Analyse the user's intent.
2. Independently assess each reply's strengths and weaknesses.
3. Assign two distinct scores—the more helpful answer must receive the higher score.
# Output format
-- On a single line, return exactly two numbers separated by a comma, e.g. '0.37, 0.54'.
-- Scores must differ and lie in [0.00, 1.00].
-- No explanations, no extra text, just the two numbers.

User message template (filled per instance)

<USER PROMPT> <verbatim user prompt text> </USER PROMPT>
<ASSISTANT REPLY 0> <first candidate reply> </ASSISTANT REPLY 0>
<ASSISTANT REPLY 1> <second candidate reply> </ASSISTANT REPLY 1>
Scores:

We parse the single-line response as two floats, enforce strict inequality by perturbing ties by ±0.01 (preserving exchangeability), and use the resulting pair as the helpfulness scores given by the Primary model in our Conformal Arbitrage pipeline.

C.2 Numerical Results

We provide the complete numerical results for the PKU-SafeRLHF experiment introduced in Section 5. Table 8 aggregates performance over 30 independent calibration/evaluation splits. Accuracy is the fraction of prompts whose chosen answer matches the annotator-preferred more-helpful response, while severity-loss measures the average excess severity of the selected answer above the safer one ($0 \le \mathrm{sev} \le 3$; lower is better). As guaranteed by theory, every CA configuration respects the finite-sample bound $\text{Severity-loss} \le \alpha$ while tracing an efficient helpfulness–harmlessness frontier that strictly dominates random routing.

Table 8: PKU-SafeRLHF helpfulness–harmlessness trade-off. Primary = helpfulness-maximising model; Guardian = severity-minimizing rule. Mean ± std over 30 trials.
| Policy | Accuracy | Severity-loss | $\hat{\lambda}$ | Δ | Guardian % |
|---|---|---|---|---|---|
| Primary | 0.519 ± 0.019 | 0.676 ± 0.033 | – | – | 0.0% |
| CA (α = 0.60) | 0.475 ± 0.029 | 0.571 ± 0.070 | 0.206 ± 0.088 | +0.012 | 19.0 ± 9.4% |
| CA (α = 0.50) | 0.443 ± 0.026 | 0.482 ± 0.053 | 0.354 ± 0.051 | +0.028 | 35.6 ± 5.3% |
| CA (α = 0.40) | 0.393 ± 0.034 | 0.379 ± 0.064 | 0.495 ± 0.061 | +0.033 | 51.8 ± 8.0% |
| CA (α = 0.30) | 0.325 ± 0.026 | 0.245 ± 0.043 | 0.619 ± 0.022 | +0.037 | 71.7 ± 4.9% |
| CA (α = 0.20) | 0.270 ± 0.018 | 0.161 ± 0.021 | 0.681 ± 0.007 | +0.028 | 82.2 ± 2.1% |
| CA (α = 0.10) | 0.214 ± 0.016 | 0.080 ± 0.022 | 0.777 ± 0.014 | +0.015 | 91.8 ± 1.9% |
| Guardian | 0.156 ± 0.011 | 0.000 ± 0.000 | – | – | 100.0% |

Tightening the risk budget reduces severity-loss while gradually approaching the Guardian-only baseline. At α = 0.30 CA halves the Primary's safety violations yet retains 63% of its helpfulness, invoking the Guardian on roughly 72% of queries. Even under the strictest budget (α = 0.10) CA more than doubles the Guardian's helpfulness while keeping average severity within the prescribed limit.

Appendix D MMLU

We next evaluate Conformal Arbitrage (CA) on the Massive Multitask Language Understanding benchmark (MMLU; [Hendrycks et al., 2021]). Unless otherwise noted, the pipeline, models, prompts, cost accounting, and random-router baselines are identical to the TruthfulQA setup in Section 5; below we list only the divergences specific to MMLU. Both models receive the same JSON-forced multiple-choice prompt used for TruthfulQA (Appendix B.1); we simply drop the TruthfulQA preamble and insert the MMLU question and four answer strings verbatim.

Dataset. MMLU comprises roughly 16k multiple-choice questions across 57 subject areas covering high-school, undergraduate, and professional curricula. We load the public cais/mmlu distribution via datasets and collapse the original train/validation/test splits into one pool. For each trial we draw a fresh, balanced sample of $N_{\mathrm{tot}} = 1{,}000$ questions, allocating $n = 500$ for calibration and the remaining 500 for evaluation. Balancing is accomplished by first shuffling each subject's pool and then taking $\lfloor N_{\mathrm{tot}}/57 \rfloor$ items from every subject, distributing the remainder randomly.
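The balanced-sampling procedure described above can be sketched as follows (the function name and the `"subject"` field layout are assumptions for illustration):

```python
import random
from collections import defaultdict

def balanced_sample(questions: list, n_total: int, seed: int = 0) -> list:
    """Subject-balanced draw: shuffle each subject's pool, take
    floor(n_total / n_subjects) items per subject, then fill the
    remainder from the leftover questions at random.
    """
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for q in questions:
        by_subject[q["subject"]].append(q)

    per_subject = n_total // len(by_subject)
    sample, leftovers = [], []
    for pool in by_subject.values():
        rng.shuffle(pool)
        sample.extend(pool[:per_subject])
        leftovers.extend(pool[per_subject:])

    rng.shuffle(leftovers)                       # distribute the remainder
    sample.extend(leftovers[: n_total - len(sample)])
    return sample
```

With 57 subjects and $N_{\mathrm{tot}} = 1{,}000$ this takes 17 questions per subject and fills the remaining 31 slots randomly.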
Results. Although the average gain is smaller than on TruthfulQA, Conformal Arbitrage still traces an efficient frontier that beats cost-matched random routing for most values of α apart from the extremes; in particular, CA's performance degrades at the highest and lowest values of α relative to the middle range. We hypothesize that the reduced gain stems from the fact that, even with balancing, MMLU questions vary more in difficulty across subjects than questions do within TruthfulQA. Nevertheless, at α = 0.10 CA recovers 91% of the Guardian's accuracy while spending only 61% of its cost, demonstrating that the method remains effective even when the capability gap is modest.

Table 9: Accuracy, cost per 1000 examples, $\hat{\lambda}$, $\Delta$ above random baseline, and Guardian usage (mean ± std over 30 trials; calibration $n = 500$).
| Policy | Accuracy | Cost ($/1000) | $\hat{\lambda}$ | Δ | Guardian % |
|---|---|---|---|---|---|
| Primary | 0.591 ± 0.011 | 0.035 ± 0.000 | – | – | 0.0% |
| CA (α = 0.25) | 0.618 ± 0.019 | 0.111 ± 0.034 | 0.126 ± 0.111 | −0.005 | 13.0 ± 5.6% |
| CA (α = 0.20) | 0.663 ± 0.021 | 0.194 ± 0.024 | 0.423 ± 0.059 | +0.011 | 24.5 ± 3.3% |
| CA (α = 0.15) | 0.706 ± 0.022 | 0.317 ± 0.057 | 0.651 ± 0.065 | +0.008 | 42.9 ± 9.5% |
| CA (α = 0.10) | 0.753 ± 0.020 | 0.416 ± 0.029 | 0.771 ± 0.021 | +0.018 | 55.8 ± 4.1% |
| CA (α = 0.05) | 0.802 ± 0.026 | 0.624 ± 0.065 | 0.924 ± 0.058 | −0.005 | 86.9 ± 9.8% |
| Guardian | 0.828 ± 0.008 | 0.676 ± 0.004 | – | – | 100.0% |

Figure 5: Cost–accuracy frontier on MMLU. Mean ± std over 30 trials.
Faint dots show individual CA runs. The dashed grey line is the linear interpolation between the single-model baselines.