
Paper deep dive

Avoiding Obfuscation with Prover-Estimator Debate

Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras

Year: 2025 · Venue: arXiv preprint · Area: Scalable Oversight · Type: Theoretical · Embeddings: 232

Abstract

Training powerful AI systems to exhibit desired behaviors hinges on the ability to provide accurate human supervision on increasingly complex tasks. A promising approach to this problem is to amplify human judgement by leveraging the power of two competing AIs in a debate about the correct solution to a given problem. Prior theoretical work has provided a complexity-theoretic formalization of AI debate, and posed the problem of designing protocols for AI debate that guarantee the correctness of human judgements for as complex a class of problems as possible. Recursive debates, in which debaters decompose a complex problem into simpler subproblems, hold promise for growing the class of problems that can be accurately judged in a debate. However, existing protocols for recursive debate run into the obfuscated arguments problem: a dishonest debater can use a computationally efficient strategy that forces an honest opponent to solve a computationally intractable problem to win. We mitigate this problem with a new recursive debate protocol that, under certain stability assumptions, ensures that an honest debater can win with a strategy requiring computational efficiency comparable to their opponent.

Tags

ai-safety (imported, 100%) · scalable-oversight (suggested, 92%) · theoretical (suggested, 88%)


Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 6:28:13 PM

Summary

The paper introduces 'prover-estimator debate', a new recursive AI debate protocol designed to mitigate the 'obfuscated arguments problem'. By shifting the role of the opponent from a challenger to an estimator who assigns probabilities to subclaims, the protocol ensures that an honest debater can win against a dishonest one with comparable computational efficiency, provided a stability condition is met.

Entities (5)

Jonah Brown-Cohen · researcher · 100%
Obfuscated arguments problem · problem · 100%
Prover-estimator debate · protocol · 100%
Stability requirement · condition · 95%
Scalable oversight · field · 90%

Relation Signals (3)

Prover-estimator debate mitigates Obfuscated arguments problem

confidence 95% · We introduce prover-estimator debate, a protocol for AI debate that mitigates the obfuscated arguments problem

Stability requirement enables Completeness

confidence 90% · Specifically, under reasonable assumptions... (formalized as... the (ϵ, ρ)-stability condition), we show: 1. Completeness: An honest prover... can always win

Prover-estimator debate relies on Stability requirement

confidence 90% · Our formal notion of argument stability, in particular, clarifies the conditions required to avoid the obfuscated arguments problem

Cypher Suggestions (2)

Find all protocols designed to solve specific problems · confidence 90% · unvalidated

MATCH (p:Protocol)-[:MITIGATES]->(prob:Problem) RETURN p.name, prob.name

Identify conditions required for protocol success · confidence 85% · unvalidated

MATCH (c:Condition)-[:ENABLES]->(prop:Property) RETURN c.name, prop.name

Full Text

231,501 characters extracted from source content.


Avoiding Obfuscation with Prover-Estimator Debate

Jonah Brown-Cohen (Google DeepMind, jonahbc@google.com), Geoffrey Irving (UK AI Security Institute, geoffrey.irving@dsit.gov.uk), Georgios Piliouras (Google DeepMind, gpil@google.com)

Abstract: Training powerful AI systems to exhibit desired behaviors hinges on the ability to provide accurate human supervision on increasingly complex tasks. A promising approach to this problem is to amplify human judgement by leveraging the power of two competing AIs in a debate about the correct solution to a given problem. Prior theoretical work has provided a complexity-theoretic formalization of AI debate, and posed the problem of designing protocols for AI debate that guarantee the correctness of human judgements for as complex a class of problems as possible. Recursive debates, in which debaters decompose a complex problem into simpler subproblems, hold promise for growing the class of problems that can be accurately judged in a debate. However, existing protocols for recursive debate run into the obfuscated arguments problem: a dishonest debater can use a computationally efficient strategy that forces an honest opponent to solve a computationally intractable problem to win. We mitigate this problem with a new recursive debate protocol that, under certain stability assumptions, ensures that an honest debater can win with a strategy requiring computational efficiency comparable to their opponent.
Figure 1: The original recursive debate protocol suffered from the obfuscated arguments problem: debater A could decompose an easy question x into hard subclaims y_1, y_2, …, y_q, and debater B would fail to find the flaw even if he knew one existed. In prover-estimator debate, B assigns probabilities p_1, p_2, …, p_q to the subclaims and A chooses a probability to claim that B is wrong in a specific direction. Since A must point to a flaw in B's probabilities, B wins if neither player can locate a flaw.

1 Introduction

Modern machine learning (ML) is fundamentally driven by, and further enables, fast and complex computations. We leverage AI precisely because it allows us to automate and optimize tasks with an efficiency that surpasses human capabilities. This rapid computation, permeating increasingly high-complexity and intelligence tasks, is the engine of progress in countless domains. However, this very strength creates a fundamental tension: the speed and complexity of these computations make exhaustive inspection impossible. Understanding and verifying the internal workings of these powerful AI systems—what's going on "behind the scenes"—becomes increasingly intractable. This tension directly leads to the critical problem of scalable oversight: how can we provide accurate training supervision, such that these powerful, yet opaque, AI systems behave as intended, even when their tasks are too complex for direct human understanding and verification?
Ensuring the reliable and safe operation of advanced AI systems is a central challenge for the field. The potential for unintended consequences arising from misaligned or poorly understood AI behavior requires the development of robust oversight mechanisms in order to provide a training reward signal that accurately reflects human values and intentions. Effective scalable oversight is not merely a desirable feature; it is a foundational requirement for building trustworthy AI that aligns with human intentions and avoids producing systematically undesirable outcomes. Progress in this area is crucial for realizing the full potential of AI while mitigating potential risks. Various approaches to scalable oversight have been proposed, including amplification (Christiano et al., 2018), recursive reward modeling (Leike et al., 2018), and AI debate (Irving et al., 2018). All of these approaches are based on the intuition that it is possible to amplify the reward signal coming from human judgment by leveraging improving AI capabilities. For the case of AI debate, Irving et al. (2018) formalized this intuition in the language of computational complexity theory: they proved that a computationally limited human can accurately judge the correct solution to problems of much larger computational complexity by observing a specifically structured debate between two computationally powerful AIs. The main idea is that the two powerful AI debaters recursively break down a complex problem into simpler sub-problems, eventually terminating in subproblems that are simple enough for the computationally limited human to judge directly. As an example, consider the case of training a software-engineering AI system to produce large, complex pull requests. 
Rather than have a human expert carefully review each pull request during training, one could train the AI system via debate, where the first debater produces the pull request, and the second debater acts as a code reviewer attempting to point out flaws in the produced code. For example, the code-reviewing debater might claim that a particular function returns an incorrect value on a certain input. The coding debater could defend their code by tracing through the function on that input and arguing that the correct value is returned. The code-reviewing debater might then zero in on a single step of the trace calling a library function, and claim that the library function does not behave as the coding debater has claimed. The coding debater might appeal to the library's documentation to justify their claim, and the code-reviewing debater might respond by claiming that the version of the library in the production environment differs from that documentation. At this point, a human judge can quickly check which of the two debaters is correct about the production library version, without having to read and fully understand the complex pull request. Hence, the human judge only has to understand a specific, narrow claim about one piece of the code to provide an accurate reward signal.

However, a significant vulnerability exists in many recursive approaches, including recursive debate: the obfuscated arguments problem (Barnes, 2020). In recursive debate, a dishonest debater can strategically decompose an easy problem into difficult, even intractable, subproblems. This renders an honest debater unable to find a flaw in their opponent's argument even if both debaters know that a flaw exists somewhere. The debate will thus fail to identify the correct answer. Obfuscation in recursive debate is not merely a theoretical concern; it emerged naturally in prior empirical work (Barnes, 2020), suggesting obfuscation is a likely attractor state in some settings. Brown-Cohen et al. (2024) avoid the issue of obfuscation in debate by doing away with recursion entirely. In this case, an efficient honest debater can always win, but debaters are required to write out the full human-judgeable argument for the correctness of their claims. To return to the software engineering example, this would mean that the coding debater has to provide complete, detailed documentation of how and why every single piece of the code works, such that any plausible critique by the code-reviewing debater is already answered by the documentation. This is a significant sacrifice relative to recursive debate, which allows the debaters to efficiently and adaptively zoom in on a small fraction of the full, detailed human-judgeable argument.

Our Approach. We introduce prover-estimator debate, a protocol for AI debate that mitigates the obfuscated arguments problem while retaining the efficiency of recursive debate. In prover-estimator debate, the prover (Alice) decomposes a problem into subclaims. The estimator (Bob) has the sole task of assigning probabilities to the prover's subclaims. The prover uses these probabilities to decide which subclaim to recursively debate further. This departs from previous debate protocols, where the opponent selects which subclaim to challenge. We show that Bob can always produce probabilities that are indistinguishable from the truth by Alice, while expending computational effort not too much greater than Alice's. Alice cannot therefore (on average) select a subproblem where Bob is non-trivially wrong. However, prover-estimator debate has limitations. Alice's task becomes harder because Bob can defeat an argument by assigning low probabilities to many subclaims. This rules out certain delicate, yet correct arguments that are highly sensitive to the probabilities assigned to subproblems.
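The round structure just described can be sketched in a few lines of Python. This is an illustrative assumption of how one round might be encoded (the function names and the challenge encoding are hypothetical), not the paper's formal protocol:

```python
# Hypothetical sketch of one round of prover-estimator debate.
# Alice decomposes a claim, Bob assigns probabilities, and Alice either
# accepts Bob's estimates or challenges one in a specific direction.

def debate_round(claim, prover_decompose, estimator_probs, prover_challenge):
    """One recursion step; returns either an acceptance or the subclaim
    chosen for further recursive debate."""
    subclaims = prover_decompose(claim)                # Alice: y_1, ..., y_q
    probs = [estimator_probs(y) for y in subclaims]    # Bob: p_1, ..., p_q
    # Alice returns None, or (index, "too_high"/"too_low") to flag a flaw
    challenge = prover_challenge(subclaims, probs)
    if challenge is None:
        # Neither player located a flaw in the probabilities: Bob wins here.
        return ("accept", probs)
    i, direction = challenge
    return ("recurse", subclaims[i], probs[i], direction)
```

With stub strategies this runs end to end; the recursion would then repeat on the returned subclaim until reaching a claim the human judge can check directly.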
We formalize this condition by introducing a stability requirement, (ϵ, ρ)-stability, which demands that the validity of arguments cannot hinge on arbitrarily small changes in subclaim probabilities. We require stability only for usefulness, not safety: even if stability does not hold, an honest estimator can still win against any incorrect argument by a dishonest prover.

To gain intuition for how prover-estimator debate works, let us return to the pull request example. Anyone who has undertaken code review knows that reading code and directly identifying bugs is a difficult task. However, it is often possible to estimate where bugs are likely to occur from the structure of the code, and then to request additional evidence in the form of tests that demonstrate the desired behavior. Prover-estimator debate in this setting requires Alice to provide reliable test coverage, so that even when Bob is uncertain that any single test provides sufficient evidence that the code is correct, taken together it is still likely that Alice's code exhibits the desired behavior. On the other hand, if Alice has engaged in obfuscation by producing code that fails in a way neither Alice nor Bob can readily identify, then Bob can output low probabilities for Alice's claims, and Alice will not be able to reliably identify any probability as being unreasonable.

Our Results. We provide rigorous theoretical guarantees demonstrating that prover-estimator debate, combined with the stability requirement, mitigates the obfuscated arguments problem. Specifically, under reasonable assumptions about the structure of the problems being debated (formalized as recursively decomposable problems and the (ϵ, ρ)-stability condition), we show:

1. Completeness: An honest prover, who correctly provides a stable decomposition into solvable subproblems, can always win the debate against any dishonest estimator.
2. Soundness: An honest estimator can win on average against a dishonest prover, even if stability is violated and the subproblems are obfuscated, using computational effort comparable to their dishonest opponent.

We discuss how stability can be plausibly satisfied in practice, using a voting technique across many pieces of evidence. The stability requirement is a primary limitation of our work, and we are still working to understand it: while it is required only for completeness and not soundness in our main results (and thus not for safety), an independence condition is required for safety if we use the voting technique to ensure stability. Specifically, we can convert an unstable argument into a stable argument if there is sufficient independent evidence for each claim (Definition 8.1). This independence requirement is not checked by the protocol: it must be established via other means.

Prover-estimator debate provides a step towards building robust and trustworthy AI systems capable of tackling complex tasks beyond direct human oversight. Our formal notion of argument stability, in particular, clarifies the conditions required to avoid the obfuscated arguments problem and enables more targeted research on overcoming this problem. Our approach also suggests general lessons for designing future debate protocols, including the value of asymmetric debate protocols and explicit uncertainty estimates. This work opens up new avenues for research, both theoretical and empirical, in the pursuit of safe and aligned AI.

2 Related work

The prior theoretical work on the design of protocols for AI debate consists primarily of Irving et al. (2018), which introduced AI debate and its complexity-theoretic formalization, and Brown-Cohen et al. (2024), which introduced doubly-efficient debate. As per the introduction, the original debate proposal makes the too-strong assumption that both debaters are computationally unbounded.
Doubly-efficient debate relaxes this assumption, at the price of solving a more limited class of problems. In a complementary direction, the work of Chen et al. (2023) focuses on the theory of learning-in-games for AI debate. The objective in this case is to design efficient learning algorithms for debate-inspired games where the space of strategies is exponentially large, but the outcome of the game can be efficiently judged given the strategies of the two players. The much earlier work in complexity theory by Chandra et al. (1981) introduced, with different terminology, the question of studying the complexity of polynomial-time verifiable arguments between computationally unbounded provers. Later, Feige and Kilian (1997) proved stronger results in a complexity-theoretic model of an efficient verifier that judged arguments between competing computationally unbounded provers. A similar model with competing provers was studied in the context of delegation of computation to untrusted servers in Canetti et al. (2013). There has recently been a line of research applying single-prover interactive proofs to machine learning, for proving correctness of model outputs (Amit et al., 2024; Anil et al., 2021; Hammond and Adam-Day, 2025), and for improving interpretability and legibility (Wäldchen et al., 2022; Kirchner et al., 2024). Empirical work on debate has focused primarily on debates where both the debaters and the judge are large language models (LLMs), and has shown some positive signs that debate allows for more accurate judgements on standard datasets (Michael et al., 2023; Kenton et al., 2024; Khan et al., 2024).

3 Preliminaries

We use the notation [n] = {1, …, n}. For a sequence of elements x_1, …, x_n from a set X and S ⊆ [n], we use the notation x_S for the subsequence of x_i with i ∈ S.
Similarly, we write x_{≤i}, x_{<i}, and x_{≠i} to denote the subsequence consisting of x_j where j ≤ i, j < i, and j ≠ i respectively. For a probability p ∈ [0,1] we use the notation B(p) to denote the Bernoulli random variable that takes value 1 with probability p, and 0 otherwise.

3.1 Complexity-theoretic debate

As in prior work on debate, we use computational complexity theory to formalize the power of the debaters and human judges, as well as the complexity of the problems that can be judged via debate.

Algorithms and oracles. Algorithms are modeled by Turing machines M : {0,1}* → {0,1} which take an input x and output a yes-or-no answer encoded as a bit M(x) ∈ {0,1}. Furthermore, we consider oracle Turing machines that additionally have black-box access to an oracle O : {0,1}^n → {0,1}. Once a query u ∈ {0,1}^n is written on its tape, an oracle Turing machine can receive the oracle answer O(u) ∈ {0,1} in a single step. We use the notation M^O to denote an oracle Turing machine with black-box query access to the oracle O. In the setting of AI debate, the Turing machine M corresponds to a natural language plan or set of instructions to follow, and a query to the oracle O(u) corresponds to querying human judgement on a particular question u.

Classes of problems. The goal in debate is to design protocols that allow us to solve a broad class of computational problems, while requiring few queries to the human judgement oracle O. A computational problem with a yes-or-no answer is encoded by a language L ⊂ {0,1}* consisting of all the inputs x for which the answer is "yes".
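The indexing and Bernoulli notation above can be mirrored in a few lines of Python; the names `subseq` and `bernoulli` are illustrative, not from the paper:

```python
import random

# Python analogues of the paper's notation: x_S for the subsequence of x
# indexed by S (1-indexed, matching [n] = {1, ..., n}), and B(p) for a
# Bernoulli draw that is 1 with probability p.

def subseq(x, S):
    """x_S: the subsequence of x_i with i in S."""
    return [x[i - 1] for i in sorted(S)]

def bernoulli(p):
    """B(p): 1 with probability p, 0 otherwise."""
    return 1 if random.random() < p else 0
```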
We abuse notation to view L as a function L : {0,1}* → {0,1} where L(x) = 1 if x ∈ L and L(x) = 0 otherwise. The primary class of problems that we consider is NTIME^O(T), the set of problems where for every instance the answer is "yes" if and only if there is a proof that the answer is "yes", and this proof can be verified in time T with access to the human judgement oracle O.

Definition 3.1. A language L is in NTIME^O(T) if there is a time T oracle Turing machine M such that x ∈ L if and only if there exists a proof y of length at most T such that M^O(x, y) = 1.

We also use the class coNTIME^O(T) of languages L whose complement is in NTIME^O(T). coNTIME^O(T) is the set of yes-or-no problems with efficiently checkable proofs of "no".

Definition 3.2. A language L is in coNTIME^O(T) if there is a time T oracle Turing machine M such that x ∉ L if and only if there exists a proof y of length at most T such that M^O(x, y) = 1.

We use the notation Γ^O(T) = NTIME^O(T) ∩ coNTIME^O(T) for the set of problems where one can verify both the cases x ∈ L and x ∉ L with a proof that is checkable in time T given access to the ground-truth oracle. Our main results focus on this class.

Our debate protocols will rely on recursively decomposing the execution of a verifier machine M into individual steps. To formalize this we introduce the notion of the transcript of a machine M.

Definition 3.3.
The transcript of a time T machine M on input x is a string τ ∈ {0,1}^T where τ_i is the bit written at the current head position of M at time step i.

3.2 Game theory

The main game-theoretic concept used in our results is a Stackelberg equilibrium. This equilibrium concept models a situation where one player (the leader) must commit to a strategy, after which the other player (the follower) chooses their strategy. A Stackelberg equilibrium is a strategy for the leader that achieves the best possible payoff, given that the follower plays a best response.

Definition 3.4 (Stackelberg equilibrium). Consider a two-player (leader and follower) game with a sequential structure, where S_L is the set of strategies for the leader, S_F is the set of strategies for the follower, u_L(s_L, s_F) is the payoff for the leader, and u_F(s_L, s_F) is the payoff for the follower. The game proceeds as follows: first the leader chooses a strategy s_L ∈ S_L, then the follower observes s_L and chooses a strategy s_F ∈ S_F. An α-approximate Stackelberg equilibrium is a strategy profile (s_L*, s_F*(s_L)) with:
1. Follower's best response: For every leader strategy s_L, the follower's strategy s_F*(s_L) is an approximate best response:

u_F(s_L, s_F*(s_L)) ≥ u_F(s_L, s_F) − α  for all s_F ∈ S_F.  (1)

2. Leader's optimal strategy: The leader's strategy s_L* approximately maximizes the leader's payoff, given the follower's best response function:

u_L(s_L*, s_F*(s_L*)) ≥ u_L(s_L, s_F*(s_L)) − α  for all s_L ∈ S_L.  (2)

Some intuition is useful for why Alice-leading, Bob-following Stackelberg equilibria are the right setting. For completeness, we will use a prover Alice that always gives correct answers: this will win against any Bob regardless of Bob's computational complexity, and thus works as an Alice-leading strategy. For soundness, however, we must find an estimator Bob that uses computation not too much larger than Alice, which we do by repeatedly playing Alice against herself. This method of constructing Bob works only once Alice is fixed, and thus requires Bob to be the Stackelberg follower.

4 Debate protocols and games

The goal of debate is to provide an accurate training signal on questions too hard for humans to judge directly.
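As a toy illustration of Definition 3.4 with α = 0, an exact Stackelberg equilibrium of a small finite game can be found by brute force over pure strategies (the payoff matrices in the test are hypothetical, not from the paper):

```python
# Brute-force Stackelberg equilibrium over pure strategies in a bimatrix game.
# u_leader[i][j] and u_follower[i][j] are payoffs when the leader plays i and
# the follower plays j; this is the alpha = 0 case of Definition 3.4.

def stackelberg(u_leader, u_follower):
    """Return (leader strategy, follower best response) maximizing the
    leader's payoff given that the follower best-responds."""
    def best_response(i):
        # The follower observes the leader's commitment i and best-responds.
        return max(range(len(u_follower[i])), key=lambda j: u_follower[i][j])
    i_star = max(range(len(u_leader)), key=lambda i: u_leader[i][best_response(i)])
    return i_star, best_response(i_star)
```

The leader's commitment is what matters here: the leader optimizes knowing the follower's best-response function, which mirrors how the paper fixes Alice first and only then constructs Bob.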
Training with debate involves asking an AI A to give the answer to a question x drawn from a distribution μ, and having another AI B debate the correctness of A's answer. At the end of the debate, human judgement is used to provide a reward signal to both AI debaters involved. Thus, the more specific goal is to design a debate protocol, i.e. the rules of the debate and the rewards, such that training A and B with the reward signal provided by the debate protocol leads to A giving the correct answer to x. This can be formalized game-theoretically: a debate protocol induces a game between the AI debaters, and the protocol succeeds if every equilibrium of the game leads A to correctly answer almost all questions x ∼ μ. We begin with the formal definition of a debate protocol, which specifies two sets of possible strategies as well as a verification procedure to determine the rules and outcome of a debate.

Definition 4.1 (Debate protocol). A debate protocol is given by a family of tuples (A_n, B_n, V), where for each natural number n, A_n and B_n are finite sets of strategies defined by functions A, B : {0,1}^O(n) → {0,1}^O(n), and the verifier V : {0,1}* → [−1, 1] is a polynomial time oracle Turing machine.

A debate protocol induces a zero-sum game where debaters select strategies from A_n and B_n, exchange messages, and the verifier determines a final payoff based on all the messages in the debate.

Definition 4.2 (Two-player, zero-sum debate game).
Given a language L and a distribution μ on inputs x ∈ {0,1}^n, a debate protocol induces a two-player zero-sum game where one player selects a strategy A ∈ A_n, the other a strategy B ∈ B_n. The payoffs are defined by the following process:

1. Sample a random x ∼ μ.
2. The A player outputs a bit A(x) ∈ {0,1} to claim that L(x) = A(x) (that is, A suggests whether the claim is true or false).
3. The players A and B interact by exchanging messages on input x. In particular, A and B generate a sequence of messages a_i = A(x, a_1, b_1, …, a_{i−1}, b_{i−1}) and b_i = B(x, a_1, b_1, …, a_{i−1}).
4. The verifier V runs on the interaction to determine the payoffs.
5. The A player receives payoff V^{A,B}(x) and the B player receives payoff −V^{A,B}(x).

The rules of the debate are encoded by the verifier V, while the sets of functions A_n, B_n encode the possible strategies that debaters A and B can employ. For example, V might be a set of prompts and scaffolding for an LLM debate, while A_n and B_n are sets of possible LLMs of a given architecture and size. When the input length n is clear we write A and B for the strategy sets. Our setting in this paper is when A and B both contain functions of bounded computational complexity. We measure complexity purely relatively, i.e. we show that it is possible to incentivize the A-player to provide the correct solution when the set of strategies B are all low-complexity functions of strategies from A.
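The five-step process above can be sketched as a simple game loop. This is a minimal sketch, with the message order simplified to strict alternation and all strategy, sampling, and scoring functions supplied as placeholders:

```python
# Minimal sketch of the induced zero-sum debate game (Definition 4.2):
# sample x, let A claim an answer, exchange messages, and let the
# verifier V assign the (zero-sum) payoffs.

def play_debate(sample_mu, A, B, V, rounds):
    """Play one debate; returns the payoff pair (to A, to B)."""
    x = sample_mu()                    # 1. sample x ~ mu
    msgs = [A(x, ())]                  # 2. A's claimed answer A(x)
    for _ in range(rounds):            # 3. alternating message exchange
        msgs.append(B(x, tuple(msgs)))
        msgs.append(A(x, tuple(msgs)))
    v = V(x, tuple(msgs))              # 4. verifier scores the interaction
    return v, -v                       # 5. payoffs V^{A,B}(x) and -V^{A,B}(x)
```

Training A and B against the payoffs returned by such a loop is what the protocol-design question is about: choosing V so that every equilibrium has A answering correctly.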
Each debate will then also have low A-complexity, as each round of the debate requires a constant number of queries to the strategies of A and B.

5 Recursively decomposable problems

In this section we introduce the problem class for which our debate protocols will succeed. The intuition is that for any L ∈ Γ^O(T), checking the proof y that x ∈ L or x ∉ L can be recursively decomposed into subproblems. Given a transcript τ of a verifier machine M(x, y), a natural decomposition is to split into the two subproblems of checking that the first half of τ is correct and that the second half is correct. This decomposition can be repeated log T times until the final level, which checks individual steps in τ. We formalize the notion of a recursive decomposition as follows.

Definition 5.1. Let L ∈ Γ^O(T) be a language and M the machine that verifies proofs for membership x ∈ L (or x ∉ L respectively). A depth d, width q recursive decomposition for L is defined by a time O(nq) machine M_D that takes as input an original input x ∈ {0,1}^n, a sequence of q queries y = (y_1, …, y_q) ∈ {0,1}^{n×q}, and a sequence of q query answers z = (z_1, …, z_q) ∈ {0,1}^q. For any x ∈ L there exists an implicit proof u ∈ {0,1}^m with m ≤ q^d such that for all k ∈ [d], the correct answers z^k to queries y^k are defined recursively as follows.
1. For the base case k = 0, for all leaves i there exists a subset S_i ⊆ [m] of coordinates in the implicit proof, such that the query y_i^0 = S_i, and the correct answer z_i^0 = O(u_{S_i}).
2. The correct answers to queries y^{k−1} satisfy z_i^k = M_D(x, y_i^{k−1}, z^{k−1}), where z^{k−1} are the correct answers to y^{k−1}. That is, answers at level k − 1 support answers at level k.
3. At the top level k = d of the recursion, M_D(x, y^d, z^d) = M^O(x, y*), where y* is a witness to the original machine M^O for L.

Intuitively, a recursive decomposition breaks down the task of checking that x ∈ L into q subproblems, which are themselves broken down into q subproblems, with the recursion bottoming out in ground-truth oracle queries to an implicit proof u. Thus, the queries y_i^k at non-leaf steps in the recursion have correct answers z_i^k that are functions of larger subsets of u, with the top level depending on the entire implicit proof. For every language L ∈ Γ^O(T) there is a natural recursive decomposition of depth d = log T and width q = 2, where the implicit proof u is just the transcript τ of M(x, y*).
The base case queries $y_i^0$ are just individual steps in the transcript $\tau$, the queries $y_i^1$ are claims that 2 consecutive steps of $\tau$ are performed correctly, and in general $y_i^k$ is the claim that $2^k$ consecutive steps of $\tau$ are correct, which is clearly the AND of two queries $y_1^{k-1}, y_2^{k-1}$ regarding sequences of $2^{k-1}$ steps. Hence, $M_D$ simply computes the AND function on the query answers $z_1$ and $z_2$.

For a general language $L \in \Gamma^{\mathcal{O}}(T)$, some of the subproblems given by queries $y_i$ may be computationally intractable for debaters using only algorithms $A \in \mathcal{A}$, while others can be efficiently solved by some $A \in \mathcal{A}$. Our goal is to design a protocol that is agnostic to the difficulty of the subproblems. If all subproblems are easy, it should be possible for the honest debater to win the debate without expanding the full proof tree corresponding to the recursive decomposition. At the same time, it should not be possible for a dishonest debater to win by intentionally decomposing an easy problem into hard subproblems that the honest debater is unable to refute. To achieve this goal, the computational complexity of the provers in the debate is measured purely relatively, i.e. $\mathcal{B}$ contains strategies that are low-complexity functions of the strategies in $\mathcal{A}$.

We next capture problems where an AI can solve any subproblem in the recursive decomposition if asked, and thus in some sense "knows" the correct answer. The definition requires that there is a strategy $A \in \mathcal{A}$ that correctly answers all queries appearing in the recursive decomposition $M_D$.

Definition 5.2 ($\mathcal{A}$-provable).
For each natural number $n$, let $\mathcal{A}_n$ be a set of functions on $\{0,1\}^n$ and let $0 < \epsilon < 1/2$, $0 < \gamma < 1$. Let $\mu$ be a distribution on inputs $x$ of length $n$, let $L \in \Gamma^{\mathcal{O}}(T)$ be a language, and let $M_D$ be a recursive decomposition of $L$. The decomposition $M_D$ is $\mathcal{A}$-provable with probability $1 - \gamma$ if there exists a single $A \in \mathcal{A}$ such that with probability at least $1 - \gamma$ over $x \sim \mu$:

1. In the base case of the decomposition, $A(x, j) = u_j$ for any $j \in S_i$, where $S_i$ is the subset given by $y_i^0$.
2. For all queries $y_i^k$ in the recursive decomposition, $A(x, y_i^k) = z_i^k$ is the correct query answer.
3. $A(x) = L(x)$.

We also need the arguments made in the debate to not depend too precisely on the probabilities of the answers to the subproblems.

Definition 5.3 ($(\epsilon, \rho)$-stability). Let $x \in \{0,1\}^n$ be an instance of a problem $L \in \Gamma^{\mathcal{O}}(T)$ and let $M_D$ define a width-$q$ recursive decomposition of $L$.
For $0 < \epsilon < 1/2$, $0 < \rho < 1$, the machine $M_D$ is $(\epsilon, \rho)$-stable for input $x, y$ and conditional probabilities $\{p_i(z_{<i})\}_{i=1}^q$ if for all sequences of conditional probabilities $\{q_i(z_{<i})\}_{i=1}^q$ satisfying $|p_i(z_{<i}) - q_i(z_{<i})| < \epsilon$ for all $z$, we have
$$\left| \Pr_{z_i \sim \mathcal{B}(p_i(z_{<i}))}\left[M_D(x,y,z)=1\right] - \Pr_{z_i \sim \mathcal{B}(q_i(z_{<i}))}\left[M_D(x,y,z)=1\right] \right| < \rho\epsilon.$$
Further, let $\mu$ be a distribution on inputs $x$ and $0 < \gamma < 1$. Then $M_D$ is $(\epsilon, \rho)$-stable with probability $1 - \gamma$ over $\mu$ if and only if: with probability at least $1 - \gamma$ over $x \sim \mu$, the decomposition $M_D$ is $(\epsilon, \rho)$-stable at $x$, at the queries $y_i$ appearing in the decomposition, and for probabilities $p_i(z_{<i}) = 1$ if and only if the correct answer to $y_i$ is $z_i = 1$.

Intuitively, Definition 5.3 says that changing the probabilities of the sequence $z$ by less than $\epsilon$ leads to small changes in the output of the verifier machine.
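The acceptance-probability drift in Definition 5.3 can be computed exactly for small machines. The sketch below is a simplified illustration, assuming independent (rather than conditional) Bernoulli answers: a majority machine barely moves under a small per-claim perturbation, while an AND machine drifts by roughly $q\epsilon$ and so fails this stability check.

```python
# Exact check of the stability quantity in Definition 5.3 for two toy
# machines (independence of the z_i is a simplifying assumption here).
from itertools import product
from math import prod

def accept_prob(machine, probs):
    # Exact Pr[machine(z) = 1] when z_i ~ Bernoulli(probs[i]) independently.
    total = 0.0
    for z in product([0, 1], repeat=len(probs)):
        weight = prod(p if zi == 1 else 1 - p for zi, p in zip(z, probs))
        if machine(z):
            total += weight
    return total

def majority(z):
    return sum(z) > len(z) / 2

def and_machine(z):
    return all(z)

q, eps = 5, 0.1
p_true = [1.0] * q        # correct answers are all 1
p_pert = [1.0 - eps] * q  # each probability perturbed by less than eps

drift = abs(accept_prob(majority, p_true) - accept_prob(majority, p_pert))
assert drift < 0.5 * eps  # consistent with (eps, 1/2)-stability at this point

drift_and = abs(accept_prob(and_machine, p_true) - accept_prob(and_machine, p_pert))
assert drift_and > 0.5 * eps  # AND drifts by ~q*eps: not (eps, 1/2)-stable here
```

This matches the intuition above: a verifier whose outcome hinges on every subclaim simultaneously is exactly the kind of argument stability rules out.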
This type of stability is a natural requirement for judging fuzzy tasks, where arguments that are sensitive to very small changes in the probability of a particular subclaim are inherently unreliable. For example, scientific meta-analyses typically perform a sensitivity analysis, which involves dropping each individual scientific study from the analysis and measuring the impact on the overall conclusions. We will further show in Theorem 8.3 that Definition 5.3 can be satisfied by any language $L \in \Gamma^{\mathcal{O}}$ whenever the queries to $\mathcal{O}$ have sufficiently many pieces of independent supporting evidence. See Section 8 for the formal statement of this claim.

6 The Prover-Estimator Debate Protocol

Prover-estimator debate protocol. All parties have access to the input $x \in \{0,1\}^n$, a language $L \in \Gamma^{\mathcal{O}}(T)$, the human judgement oracle $\mathcal{O}$, a recursive decomposition $M_D$, and stability parameters $\epsilon > 0$, $\rho > 0$. The debate proceeds in a series of $d$ rounds in decreasing order. In round $k \in [d]$, $A$ is trying to prove that $\hat{z}^k$ is the answer to a query $\hat{y}^k$, while $B$ claims the probability of $\hat{z}^k = 1$ is $p_k = B^k(x, \hat{y}^k)$. Prior to the initial round $k = d$, initialize the debate as follows:

1. Set the reward ratio $r = \frac{1}{4}\epsilon(1-\rho) \le \frac{1}{4}$.
2. $A$ outputs a bit $A(x) \in \{0,1\}$ to claim either that $L(x) = 1$ or $L(x) = 0$.
Set $\hat{y}^d = x$.
3. $B$ outputs a probability $p_d = B(x)$, which is supposed to be $B$'s best estimate of $\Pr[L(x) = 1]$.
4. Sample a bit $\hat{z} \sim p_d$ with trusted randomness.
5. $A$ receives initial reward $r_{\mathrm{init}} = r^{d+1} \cdot \mathbb{1}(\hat{z} = A(x))$.

After initialization, the $k$-th round of the debate proceeds as follows:

1. $A$ outputs a value $A^k(x, \hat{y}^k, p_k) \in \{-1, 0, 1\}$ to claim that $p_k$ is overestimated, approximately correct, or underestimated respectively.
2. $A$ outputs a sequence of queries $y_1^k, \dots, y_q^k$ to claim that $\hat{z}^k = M_D(x, y^k, z^k)$ is the correct answer to $\hat{y}^k$, given correct answers $z^k$ to the queries $y^k$.
3. If this is the final round $k = 0$:
(a) Let $S_i \subseteq [m]$ be the subset of indices into the implicit proof given by $\hat{y}^0$.
(b) Evaluate $A$ with no other debate context to obtain each bit $\tilde{u}_j$ for $j \in S_i$ of the supposed proof $\tilde{u}_{S_i}$.
(c) $A$ receives reward $r_0 = A^0(x, \hat{y}^0, p_0)\left(\mathcal{O}(\tilde{u}_{S_i}) - p_0\right)$. $B$ receives reward $-r_0$.
(d) The final payoff to each player is $R = \sum_{k=0}^{d} r_k$ for $A$, and $-R$ for $B$.
4. If $k > 0$, $A$ and $B$ simulate the execution of $M_D(x, y^k, z^k)$ as follows:
(a) For each query $i \in [m]$ of the machine $M_D(x, y^k, z^k)$, $B$ outputs a probability $B_i^k(x, y^k, z_{<i}^k)$, which is supposed to be equal to $B$'s best estimate of the probability that $z_i = 1$, conditioned on the prior outcomes $z_{<i}^k$.
(b) Sample (using trusted randomness) $z_i^k = 1$ with probability $B_i^k(x, y^k, z_{<i}^k)$, and continue to step $i+1$ in the execution of $M_D$.
5. $A$ receives reward $r_k = r^k A^k(x, \hat{y}^k, p_k)\left(M_D(x, y^k, z^k) - p_k\right)$. $B$ receives reward $-r_k$.
6. $A$ selects a step $j$ for recursion.
7. Recurse to the next round with instance $\hat{y}^{k-1} = y_j^k$ and $B$'s probability $p_{k-1} = B_j^k(x, y^k, z_{<j}^k)$.

Figure 2: The prover-estimator debate protocol.

In this section we give our debate protocol and the main theorems showing that it correctly incentivizes debaters for problems in $\Gamma^{\mathcal{O}}$. The intuition is as in Figure 1: debater $A$ decomposes the problem into subproblems, and debater $B$ gives probabilities for the answers to each subproblem. Debater $A$ then selects for recursion a subproblem where $B$'s probability is allegedly incorrectly estimated. Rewards are assigned in each round to $A$ for correctly predicting inconsistencies between $B$'s probabilities in the current round versus the previous round, with the available reward growing by a factor of $1/r$ in each round. The final round uses the human judgement oracle $\mathcal{O}$ rather than $B$'s probabilities.

6.1 The protocol

Our protocol for two-player zero-sum debate is given in Figure 2. We can now state our main theorem.

Theorem 6.1. Let $\mathcal{A}$ be any set of functions on $\{0,1\}^n$ and $\eta > 0$. Let $\mu$ be a distribution on inputs $x \in \{0,1\}^n$, and let $L \in \Gamma^{\mathcal{O}}(n^c)$ be a language with a recursive decomposition $M_D$ of depth $d = \frac{\log n}{\log(8/\epsilon)} - 1$ and width $q = O\!\left(c \cdot (8/\epsilon)^c\right)$. Let $0 < \gamma < 1/2$. Then the protocol of Figure 2 run on $M_D$ satisfies:

• Completeness.
If $M_D$ is $\mathcal{A}$-provable and $(\epsilon, 1/2)$-stable, each with probability $1 - \gamma$, then there exists $A \in \mathcal{A}$ such that for all $B$ (of any size), with probability at least $1 - 2\gamma$,
$$\mathbb{E}\left[V^{A,B}(x)\right] \ge \frac{1 - \epsilon}{n},$$
where the expectation is over the internal randomness of the protocol only.

• Soundness. Let $\mathcal{B}$ be the set of circuits of size $O(q 2^{2q} n d^2 / \eta^2)$ that may query a function $A \in \mathcal{A}$ at each gate. For every strategy $A \in \mathcal{A}$, there exists $B \in \mathcal{B}$ such that
$$\mathbb{E}_{x \sim \mu}\left[V^{A,B}(x)\right] \le \frac{\Pr[A(x) = L(x)]}{n} + \eta.$$

Notice that we only require that $L$ admits an $\mathcal{A}$-provable, $(\epsilon, \rho)$-stable decomposition in the completeness case. This means that a dishonest debater $A$ may produce an incorrect decomposition into queries $y^k$ for some input $x$ such that $M_D(x, y^k, z^k)$ is not $(\epsilon, \rho)$-stable, with answers to queries that cannot be answered by any function in $\mathcal{A}$. Therefore, our protocol must account for situations where $A$ produces an unstable decomposition into hard-to-answer queries, and ensure $B$ is still able to win. By Theorem 8.3, the requirement in the completeness case that $M_D$ be $(\epsilon, 1/2)$-stable can be achieved for any $s$-supported language with $s = \Omega(c^d)$.
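The per-round reward rule of Figure 2 can be sketched numerically. This is a toy illustration under stated assumptions: the functions $A^k$, $B^k$ and the machine $M_D$ are replaced by fixed numbers, and `outcome` stands in for $M_D(x, y^k, z^k)$ (or the oracle answer at $k = 0$).

```python
# Toy sketch of the round rewards r_k = r^k * direction * (outcome - p_k)
# from Figure 2.  Not the full protocol: outcomes and probabilities are fixed.

def round_reward(r, k, direction, outcome, p_k):
    return (r ** k) * direction * (outcome - p_k)

r = 0.25                          # reward ratio r = eps*(1-rho)/4 <= 1/4
# Three rounds k = 2, 1, 0; the true outcomes are all 1, but B understates
# the probability until the final (human-judged) round forces honesty.
rounds = [(2, 0.3), (1, 0.3), (0, 1.0)]
total = 0.0
for k, p_k in rounds:
    outcome = 1.0
    # Honest A flags the direction of B's error (0 when p_k is correct).
    direction = 0 if abs(outcome - p_k) < 1e-9 else (1 if outcome > p_k else -1)
    total += round_reward(r, k, direction, outcome, p_k)

assert total > 0  # honest A accumulates positive reward against a lying B
# Rewards for the same error grow by a factor 1/r as k decreases:
assert round_reward(r, 0, 1, 1.0, 0.3) > round_reward(r, 1, 1, 1.0, 0.3)
```

The growing reward scale is what lets a late round, where $B$ is caught, dominate any earlier losses; this is made precise in the completeness argument below.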
Since Figure 2 has $O(d)$ rounds and the dominant per-round cost is $B$'s probability estimates, a single debate has a cost of $O(q 2^{2q} n d^3 / \eta^2)$ queries to functions in $\mathcal{A}$. Because $L \in \Gamma^{\mathcal{O}}(n^c)$, and $q$ depends only on $\epsilon$ and $c$, this computational complexity represents a polynomial improvement over the cost $O(n^c)$ required to verify the full human-judgeable argument given by the transcript of $M^{\mathcal{O}}(x, y^*)$. In particular, doubly-efficient debate would require $O(n^c)$ queries to the debaters, and naive recursive debate would either fail to satisfy soundness or require expanding the whole $O(n^c)$-size search tree.

We prove Theorem 6.1 in a series of lemmas in Section A. The proof of completeness, that an honest debater $A$ can always defeat a dishonest debater $B$, follows from the intuition that human judgement is used directly in the last round, and the reward in each round increases by a factor of $1/r$. Thus, even if $B$ lies throughout the first $k$ rounds, at some point $B$ (or eventually the human judgement oracle) must tell the truth, at which point a debater $A$ that has always answered honestly can pick up a positive reward larger than the sum of all negative rewards received so far. The key technical piece is in showing soundness: that regardless of the dishonest strategy employed by debater $A$, debater $B$ can gain an advantage while using only slightly more compute than $A$. We use methods from online convex optimization to construct probabilities for $B$ that are indistinguishable by $A$ from the truth.
The idea is to run online gradient descent on a sequence of loss functions which capture how well, on average, $A$ can distinguish $B$'s probabilities from the truth. The convergence rate of online gradient descent then yields bounds on the complexity of $B$'s strategy relative to $A$'s.

As a corollary of Theorem 6.1, we show that answering correctly is an $A$-leading Stackelberg equilibrium in the prover-estimator debate game. The equilibrium result is weaker than Theorem 6.1, because it requires $\mathcal{A}$-provability always rather than only for completeness (a definitional artifact).

Theorem 6.2. Let $\mathcal{A}$ be any set of functions on $\{0,1\}^n$ and let $d$, $q$, $\mathcal{B}$ be as in Theorem 6.1, with $\eta = O(\epsilon/n)$. Let $\mu$ be a distribution on inputs $x \in \{0,1\}^n$, and let $L \in \Gamma^{\mathcal{O}}(n^c)$ be a language with a recursive decomposition given by $M_D$. Let $G$ be the two-player zero-sum game defined by the protocol of Figure 2 with decomposition $M_D$, and let $\alpha < \frac{\epsilon}{2n}$. If $M_D$ is $\mathcal{A}$-provable and $(\epsilon, 1/2)$-stable, each with probability $1 - \frac{\epsilon}{4n}$, then in every $\alpha$-approximate, $A$-leading Stackelberg equilibrium of $G$,
$$\Pr_{x \sim \mu}[A(x) = L(x)] > 1 - 4\epsilon.$$

We prove Theorem 6.2 in Section B.

6.2 Consequences for training with prover-estimator debate

The key take-away of our two main theorems is that one can train debaters in the prover-estimator debate protocol via standard methods that converge to Stackelberg equilibria in zero-sum games.
If all claims in the debate have sufficient evidence to support them ($s$-supportedness) and the AI debaters only make claims they actually know the answer to ($\mathcal{A}$-provability), then Theorem 6.2 implies that $A$ will produce the correct answer with probability at least $1 - 4\epsilon$. On the other hand, if it is not possible for the AIs to decompose their arguments into subclaims which are both supported by sufficient evidence and which the AIs can actually answer, then the soundness case of Theorem 6.1 implies that the payoff to $A$ will fall below a known threshold. Thus, one can read off whether training has succeeded by looking at $A$'s expected payoff, and in the case of success, $A$ outputs the correct answer with probability $1 - 4\epsilon$. We provide further recommendations for empirical debate research in Section D.

7 Technical Overview

In this section we summarize the main ingredients in the proof of Theorem 6.1. We begin with a more detailed discussion of what goes wrong with naive recursive debate, and an idealized model of the obfuscated arguments problem.

Obfuscated arguments with naive recursion. In naive recursive debate, the debater $A$ decomposes the initial problem of deciding whether $x \in L$ into subproblems $y_1, \dots, y_q$ along with claimed answers $z_i = A(y_i)$ to each subproblem. Debater $B$ then selects a subproblem $y_i$, and the protocol recurses with $A$ claiming that $A(y_i) = z_i$. The obfuscated arguments problem refers to the fact that $A$ may choose a decomposition $y_1, \dots, y_q$ and answers $z_i = A(y_i)$ such that only one of the answers is wrong, but it is computationally intractable for both $A$ and $B$ to determine which one.
A concrete example of obfuscation is primality testing (Barnes, 2020). In this case, the initial question is whether an integer $x$ is prime or composite. The Miller-Rabin primality test can efficiently solve this problem, but there is no known polynomial-time algorithm to factor $x$ (Miller, 1975; Rabin, 1980). Indeed, if $x$ is the product of two randomly chosen $n$-bit primes, the RSA cryptosystem's security relies on the inability to efficiently factor $x$ (Rivest et al., 1978). Hence, $A$ can employ the following obfuscating strategy in the case that $x$ is a hard-to-factor product of two primes. $A$ decomposes the integers from 1 to $x$ into $q$ intervals $y_1, \dots, y_q$, and claims that $x$ has no factors in each interval. When $B$ chooses an interval $y_i$ for recursion, $A$ further decomposes this into $q$ subintervals, and so on. For $B$ to correctly identify the flaw in $A$'s argument, $B$ would need to find a factor of $x$, which is computationally intractable. This holds even though both $A$ and $B$ can run the Miller-Rabin test and be confident a factor exists. In summary, $A$ has efficiently decomposed the problem such that both debaters know there is a flaw somewhere in $A$'s argument, but neither can efficiently locate the flaw.

Soundness via indistinguishability. Prover-estimator debate aims to mitigate the obfuscated arguments problem using computational indistinguishability. Debater $A$ decomposes the problem $x$ into subproblems $y_1, \dots, y_q$, just as in naive recursive debate. Then debater $B$ is allowed to assign conditional probabilities $p_i$ for the truth values of the subclaims $y_i$. The obfuscated arguments problem shows that $B$ cannot count on being able to efficiently solve the subproblems $y_i$.
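The primality example above can be made concrete. The sketch below is a toy illustration, not the paper's formal construction: Miller-Rabin certifies compositeness cheaply, while refuting the claim "no factor in this interval" requires actually locating a factor, which here costs a linear scan (and for cryptographic-size inputs has no known efficient method).

```python
# Toy illustration of the obfuscated-argument asymmetry in primality testing.

def miller_rabin(n, bases=(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)):
    # Deterministic for all 64-bit n with this fixed base set.
    if n < 2:
        return False
    for p in bases:
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False   # base a witnesses that n is composite
    return True

x = 1009 * 1013             # product of two primes, hence composite
assert not miller_rabin(x)  # both debaters cheaply learn a factor exists...
assert miller_rabin(1009)

def has_factor_in(n, lo, hi):
    # ...but refuting A's claim "no factor in [lo, hi)" means searching it.
    return any(n % f == 0 for f in range(max(2, lo), hi))

assert has_factor_in(x, 1000, 1020)
assert not has_factor_in(x, 2, 1000)
```

For a toy 20-bit $x$ the scan is trivial; for an RSA-size $x$ it is exactly the intractable factoring problem that the obfuscating decomposition forces onto $B$.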
Instead, we show that $B$ can efficiently produce probabilities $p_i$ that are computationally indistinguishable from the correct answers by $A$. Our notion of efficiency is that $B$'s strategy can be computed by a small circuit making queries to $A$'s strategy at each gate. The notion of indistinguishability that we use in the proof is known as outcome indistinguishability (Dwork et al., 2021), and intuitively requires that, on average over the input distribution, $A$ cannot tell the difference between a sample from the ground-truth answers for $y_i$ and a sample from $B$'s chosen probabilities $p_i$. See Section A.1 for the formal definition. To ensure that $A$ has advantage at most $\delta$ in distinguishing, the size of $B$'s circuit must grow as $O(q 2^{2q} / \delta^2)$. The exponential factor in $q$ appears because this type of indistinguishability requires explicitly representing $B$'s probability distribution over the $2^q$ possible answers for the $q$ subclaims at each level. Hence, we will require that $q$ is a constant independent of the input size $n$.

It is instructive to see how indistinguishability helps for a particularly simple type of obfuscated argument by $A$. Suppose that $A$ decomposes $x$ into $q$ subproblems so that there is one subproblem $y_i$ where the correct answer is $z_i = 0$, but neither $A$ nor $B$ can identify this subproblem any better than by randomly guessing its location. In this case, $B$ can output a probability of $1/q$ that each answer is $z_i = 0$. If $A$ chooses a subproblem $y_i$ to recurse on, $A$ will in all likelihood choose one where the true answer is 1, hence $B$'s probability is only off by $1/q$. If $A$ again obfuscates in the next round of the prover-estimator debate, then $B$'s probability will be off by at most $1/q^2$.
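The arithmetic of this uniformly hidden flaw is worth checking. In this sketch, $B$'s per-subclaim error starts at $1/q$ and shrinks geometrically as $A$ re-obfuscates, so a reward proportional to that error stays small.

```python
# Error decay for the "uniformly hidden flaw" example: after k rounds of
# re-obfuscation, B's stated probability is off by at most (1/q)^k.
from math import isclose

q = 10
err = 1 / q
expected_errors = [err ** k for k in range(1, 5)]
for k, e in enumerate(expected_errors, start=1):
    assert isclose(e, q ** (-k), rel_tol=1e-12)

# Even summed over all rounds, A's reward source stays tiny:
assert sum(expected_errors) < 2 / q
```

This is the simplest case of the general bound: indistinguishability lets $B$ keep $A$'s per-round advantage below $\delta$ against arbitrary obfuscation, not just the uniform one.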
Thus, if we reward $A$ proportionally to the magnitude of the error in $B$'s probability, $A$ will receive a very small reward. In the general case, prover-estimator debate gives rewards in every round of the debate. Prior to the initial round, $A$ receives expected reward proportional to how close $B$'s initial probability estimate is to $A$'s initial claim about the input. In all but the last round, the rewards are proportional to any error in the consistency between $B$'s probability claimed in the current round versus the previous round. However, $A$ only receives these rewards if $A$ has correctly identified the direction of $B$'s error. In the final round, $B$'s probabilities are compared to human judgement, and again $A$ only receives a reward proportional to the magnitude of $B$'s error if $A$ correctly predicted the error direction. Hence, $B$ can use indistinguishability to ensure that $A$ gains reward at most $\delta$ in each round, at the price of using strategies with circuit size $O(q 2^{2q} / \delta^2)$. The only remaining source of reward for $A$ then comes from the initialization, where $A$ receives rewards depending on how closely $A$ agrees with $B$'s initial probability estimate $p$. Indistinguishability then implies that the fraction of inputs on which $A$ agrees with $B$'s initial estimate $p$ is approximately equal to the fraction of inputs where $A$ agrees with the ground truth. Thus, by using more compute, $B$ can ensure that $A$'s rewards are bounded by a value proportional to the probability that $A$ agrees with the ground truth on the initial problem.

Completeness via stability. For the soundness guarantees upper-bounding $A$'s rewards to be meaningful, we must also show that if $A$ tells the truth, it is possible to lower-bound the rewards received by $A$. Completeness is achieved via two key properties.
First, we require $(\epsilon, \rho)$-stability of the decomposition, and second, the magnitude of available rewards grows by a factor of $1/r$ in each round, where $r < 1/2$ depends on $\epsilon$ and $\rho$. To see why we need both properties, consider the following sketch of a completeness argument. Suppose $A$ always correctly answers the initial input problem and all subclaims, and that $A$ always selects the probability from $B$ that is furthest from the truth. Then if $B$ ever outputs an incorrect probability, it will eventually lead to $B$ being caught: either in the final round when comparing to human judgement, or in some intermediate round where $B$ switches to telling the truth and thus outputs an inconsistent probability. The reason for the rewards to grow by a factor of $1/r$ in each round is to ensure that, no matter what happened in the prior rounds, the cumulative negative rewards received earlier by $A$ are outweighed by the reward in the round where $B$ is finally caught in an inconsistency or by the human-judgement comparison.

The stability requirement arises from the need to check for inconsistencies between $B$'s probability for the query $\hat{y}$ selected for recursion in the previous round and $B$'s probabilities for the subclaims $y_i$ in the current round. In the worst case, very small changes in $B$'s probabilities for the subclaims could cause large changes in the overall outcome for the query $\hat{y}$. Thus, even if $B$'s probability estimate for $\hat{y}$ was off by a lot, $B$ could produce estimates for the $y_i$ that are individually only slightly off (say $O(1/q)$), but are approximately consistent with the original estimate for $\hat{y}$.
This would mean that, even when $A$ told the truth and $B$'s estimate of the initial probability was far from correct, $B$ could shift to approximately correct estimates while paying only a small cost in rewards to $A$ for the consistency checks. Stability prevents this failure mode, because it requires that small errors in probability for the $y_i$ have a small impact on the outcome for $\hat{y}$. Put another way, stability limits how rapidly errors can grow in the debate tree. This can also be interpreted as limiting, on average, the total search space necessary to determine correctness: even errors of magnitude $\epsilon > 1/q$ cannot change the overall outcome by much, so it is not information-theoretically necessary to search the full debate tree of size $q^d$ to determine correctness.

Ensuring a large reward gap. The argument outlined above ensures that $A$ receives some minimal reward when telling the truth, and that $B$ can spend more compute to ensure that $A$ receives an amount $\Delta$ less reward if $A$ frequently lies. However, $\Delta > 0$ alone is insufficient for efficient debate training. To see why, observe that if we run naive recursive debate with $q$ subclaims per round and $d$ rounds, we achieve $\Delta = 1/n^c$ if $B$ uniformly randomly selects subclaims for recursion. This does not enable efficient training, as in expectation one needs to run $n^c$ debates before seeing a single one where $B$ catches a flaw. In contrast, Theorem 6.1 obtains $\Delta = \Omega(1/n)$. This is a polynomially large improvement in efficiency, because one need only see $O(n)$ debates before getting a non-trivial training signal. This non-trivial reward requirement is obtained by balancing the parameter settings for $q$, $d$, $\epsilon$, and $r$ in our proof.
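Both pieces of bookkeeping above reduce to elementary arithmetic, which the sketch below checks numerically (the constants from the proofs are dropped; only the exponent and geometric-series arithmetic is verified).

```python
# Two numeric checks behind the completeness and reward-gap arguments.
from math import log, isclose

# (1) With reward ratio r <= 1/4, the reward scale r^k in round k exceeds
# the total scale of all earlier rounds k+1..d (rounds run in decreasing k),
# so the round where B is caught outweighs honest A's accumulated losses.
r, d = 0.25, 10
for k in range(d):
    assert r ** k > sum(r ** j for j in range(k + 1, d + 1))

# (2) Setting q = (1/eps)^c and d = log n / log(1/eps) makes the debate tree
# cover the full n^c-length proof (q^d = n^c), while the implicit search
# space is only (1/eps)^d = n.
eps, c, n = 1 / 8, 2, 8 ** 4          # n = 4096
d2 = log(n) / log(1 / eps)            # = 4 rounds
q = (1 / eps) ** c                    # = 64 subclaims per round
assert isclose(q ** d2, n ** c)       # full tree covers the n^c-length proof
assert isclose((1 / eps) ** d2, n)    # but only ~n of it must be searched
```

The second check is exactly the trade-off described in the next paragraph: the tree still has $n^c$ leaves, but stability shrinks the implicit search to $(1/\epsilon)^d = n$.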
Intuitively, worst-case naive recursive debate must search a depth-$d$ tree with a total of $n^c$ leaves to find a single flaw. In this case $q^d = n^c$. In contrast, prover-estimator debate uses $(\epsilon, \rho)$-stability along with indistinguishability to search a space of size $(1/\epsilon)^d$, even though the full debate tree has size $q^d$. The reason for this smaller search space is that errors in probability smaller than $\epsilon$ have limited impact on the outcome. Hence, if we set $q = 1/\epsilon^c$ and $d = \log n / \log(1/\epsilon)$, then we still have $q^d = n^c$, so that the leaves of the debate tree cover the full $n^c$-length proof. However, the overall implicit search required by the protocol will only be $(1/\epsilon)^d = n$. This trade-off allows for a reward gap of $\Delta = \Omega(1/n)$, and thus a polynomial speed-up.

8 The Existence of Stable Decompositions

Our main result, Theorem 6.1, requires $(\epsilon, \rho)$-stability for completeness: to ensure that an honest debater $A$ can win, it is necessary that an $(\epsilon, \rho)$-stable recursive decomposition exists for the language $L$. For the same reason, Theorem 6.2 requires a stable recursive decomposition to ensure that $A$ outputs the truth in every Stackelberg equilibrium. A natural question is whether there exist languages that admit sufficiently stable decompositions, and how large the class of such languages is. In this section, we show that a language $L \in \Gamma^{\mathcal{O}}$ admits a stable recursive decomposition if, for each query to human judgement required, we can find sufficiently many pieces of independent supporting evidence.
To formalize the notion of independent supporting evidence, we consider a distribution $P$ over human-judgement oracles $\mathcal{O}$, along with a single fixed ground-truth oracle $\mathcal{O}^*$. One can think of $\mathcal{O} \sim P$ as a noisy approximation of the ground truth $\mathcal{O}^*$, due either to errors in human judgement or to legitimate uncertainty about the true answer to certain questions. We further assume that for $u \in \{0,1\}^n$ it is possible to sample from the $\{0,1\}$-valued random variable $\mathcal{O}(u)$ for $\mathcal{O} \sim P$ using at most $O(n)$ random bits. We now give conditions on a language $L$, relative to the distribution $P$ on human-judgement oracles, that ensure $L$ has a recursive decomposition $M_D$ satisfying Definition 5.3.

Definition 8.1 ($s$-supported). Let $L \in \Gamma^{\mathcal{O}^*}(T)$ be a language, and for every $x \in L$ (resp. $x \notin L$) and proof $y$, let $u$ be the sequence of oracle queries made by $M^{\mathcal{O}^*}(x, y)$.
The language $L$ is $s$-supported under human-judgement oracle distribution $\mathcal{P}$ if for every oracle query $u_i$ made by $M^{\mathcal{O}^*}(x,y)$, there exist $s$ queries $v_i^1, \ldots, v_i^s$ satisfying
$$\mathbb{P}_{\mathcal{O} \sim \mathcal{P}}\left[\mathcal{O}(v_i^j) = \mathcal{O}^*(u_i) \,\middle|\, \mathcal{O}(v_{i'}^{j'}) \;\forall\, i' \neq i,\, j' \neq j\right] \geq 1 - \frac{\epsilon}{4n}.$$
The intuition is that for every question that one needs to answer when checking a proof that $x \in L$, we can come up with $s$ independent pieces of evidence. The notion of independence between the pieces of evidence is formalized by the distribution $\mathcal{P}$ over human-judgement oracles. The choice of the bound $1 - \frac{\epsilon}{4n}$ was purely for convenience. We could achieve the same results by instead choosing any constant for the bound (say $2/3$), and then applying majority voting after increasing $s$ by a factor of $O(\log(n/\epsilon))$. We caution that $s$-supportedness cannot be verified via calls to the oracles: additional causal structure or other knowledge about $L$ must be brought to bear, and this is a limitation of our paper.

Using Definition 8.1, we can show that a simple transformation of any $s$-supported language admits a stable decomposition. We begin by describing the transformation.

Definition 8.2 (Stable Transformation).
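The majority-voting amplification mentioned above can be illustrated with a small simulation. This is a toy sketch, not the paper's construction: each piece of evidence is modeled as an independent coin that matches the ground truth with probability $2/3$, and the per-vote accuracy and vote count are illustrative assumptions.

```python
import random

def majority_of_evidence(truth, s, accuracy, rng):
    """Majority vote over s conditionally independent noisy answers,
    each matching the ground truth with probability `accuracy`."""
    correct_votes = sum(1 for _ in range(s) if rng.random() < accuracy)
    return truth if correct_votes > s / 2 else 1 - truth

rng = random.Random(0)
truth, trials = 1, 10_000
# With per-evidence accuracy 2/3, taking s = 75 votes makes the
# majority wrong only with very small probability (a Chernoff bound).
errors = sum(majority_of_evidence(truth, 75, 2/3, rng) != truth
             for _ in range(trials))
assert errors / trials < 0.01
```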
Let $\mu$ be a distribution on $x \in \{0,1\}^n$ and let $L \in \Gamma^{\mathcal{O}^*}(n^c)$ be an $s$-supported language under oracle distribution $\mathcal{P}$. Define a new oracle $\bar{\mathcal{O}}$, language $L' \in \Gamma^{\bar{\mathcal{O}}}$, and distribution $\mu'$ as follows. Instances of $L'$ are pairs $(x,r) \in \{0,1\}^n \times \{0,1\}^{\Theta(n)}$. The distribution $\mu'$ is given by sampling $x \sim \mu$ and independently sampling $r \in \{0,1\}^{\Theta(n)}$ uniformly at random. Let $\bar{\mathcal{O}}(v,r)$ be the oracle defined by sampling $\mathcal{O} \sim \mathcal{P}$ using $r$ as the random bits and outputting $\mathcal{O}(v)$. Membership in $L'$ for a pair $(x,r)$ is determined as follows: run $M^{\mathcal{O}^*}(x,y)$, but in place of each original query $u_i$, instead make $s$ queries $\bar{\mathcal{O}}(v_i^j, r)$ for $j \in [s]$ and output the majority vote over $j$.

The transformation simply encodes the sampling of oracles $\mathcal{O} \sim \mathcal{P}$ via a uniformly random string $r$ which is part of the input, and otherwise answers each original oracle query by majority voting over the $s$ different pieces of independent evidence. We next show that the transformed language $L'$ admits a stable decomposition.

Theorem 8.3. Let $L \in \Gamma^{\mathcal{O}^*}(n^c)$ be an $s$-supported language under distribution $\mathcal{P}$, and let $0 < \epsilon < 1/2$. Let $L'$ be the language constructed via Definition 8.2.
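The two ingredients of the transformation, derandomizing the oracle via $r$ and answering each original query by a majority vote over its supporting queries, can be sketched as follows. Everything here is a toy stand-in: the parity ground truth, the 5% noise rate, and the `evidence` map are illustrative assumptions, not the paper's objects.

```python
import random

def ground_truth(v):
    # Toy stand-in for the ground-truth oracle O*: parity of v's bits.
    return bin(v).count("1") % 2

def O_bar(v, r):
    """Derandomized oracle O-bar(v, r): the random string r selects one
    oracle O ~ P, which here answers like O* but flips with probability
    5% (a toy model of human-judgement noise)."""
    rng = random.Random(r * 1_000_003 + v)   # r fixes the sampled oracle O
    return ground_truth(v) ^ (rng.random() < 0.05)

def transformed_query(u_i, evidence, r, s):
    """Answer one original query u_i by a majority vote over its s
    supporting queries v_i^1..v_i^s (the hypothetical `evidence` map)."""
    answers = [O_bar(v, r) for v in evidence[u_i][:s]]
    return int(sum(answers) > s / 2)

# Supporting evidence for u = 7: queries whose true answer agrees with O*(u).
u = 7
evidence = {u: [v for v in range(256) if ground_truth(v) == ground_truth(u)][:15]}
votes = [transformed_query(u, evidence, r, 15) for r in range(50)]
# With 15 votes at 5% noise each, the majority is almost always correct.
assert sum(v == ground_truth(u) for v in votes) >= 48
```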
Then if $s = \Omega(c^d)$ there is a depth $d = \frac{\log n}{\log(8/\epsilon)} - 1$, width $q = O\left(c \cdot (8/\epsilon)^c\right)$ recursive decomposition of $L'$ that is $(\epsilon, 1/2)$-stable with probability at least $1 - \frac{\epsilon}{4n}$. Furthermore, $x \in L$ if and only if $\mathbb{P}_{r \sim \mu'}[(x,r) \in L'] > 1 - \frac{\epsilon}{4n}$.

That is, if every human judgement required to verify that $x \in L$ has sufficiently many pieces of independent supporting evidence, then there is an $(\epsilon, 1/2)$-stable recursive decomposition of $L'$. Further, $x \in L$ if and only if $(x,r) \in L'$ with high probability over the uniform random bits $r$.

9 Discussion and Open Problems

Figure 3: While prover-estimator debate mitigates the obfuscated arguments problem that occurs in amplification and other scalable oversight protocols, it does not have all the features of these other protocols. For example, it lacks the argument/counterargument structure of Irving et al. (2018) and the Bayesian structure of Bengio et al. (2025). We hope that future work can explore whether the core ideas needed to avoid obfuscated arguments can be combined with features of other protocols.

Our paper presents debate protocols which circumvent the obfuscated arguments problem in recursive debate. Intuitively, this demonstrates that it is possible to design the rules of a debate such that training to win at debate leads to correct behavior of the final trained models in most cases.
These theoretical results also lead to qualitative recommendations for empirical debate research, and both directions present interesting avenues for future work.

Theoretical. Our results identify stability as a critical requirement for debate training to succeed, and hence we need to better understand when stability can be achieved, either in theory for interesting classes of problems, or in practice for debates in messy, unformalized settings. On the theoretical side, Theorem 6.1 requires stability for completeness but not soundness, and thus is safe to use even if one is not confident in stability. Furthermore, Theorem 8.3 allows us to turn any polynomial-time verifiable computation into a stable one, provided that $s$-supportedness holds. However, this introduces a new wrinkle: to ensure safety in this case, we must know that $s$-supportedness holds in order to trust that the new computation has the same result. This requires understanding the probabilistic structure of the different pieces of evidence used in the argument, either directly or via further debate about that structure. In particular, the human judge must determine whether several pieces of evidence are independent, or whether human uncertainty about these pieces is significantly correlated. We leave open how to do this in general, and whether there are better routes to stability than $s$-supportedness.

Another limitation is that our results hold in an average-case rather than a worst-case sense: we do not prove that the judge is always able to tell which debater is correct on every question. Rather, we show that the payoffs given by the judge incentivize the debaters to converge to answering correctly. Thus, the payoff assigned by the judge is a random variable that is correlated with correctness, but imperfectly so. We believe such an average-case analysis is necessary, and formalizing this is an interesting open question.
We are also curious whether there are corresponding lower bounds on $B$'s circuit size when the stability parameters $\epsilon, \rho$ are held constant, in either the oracle or fine-grained complexity settings.

Empirical. The central empirical question raised by our paper is how to incorporate the qualitative recommendations from our theoretical results into practical debate training experiments with LLMs. Such experiments may require additional flexibility for training to succeed, so we enumerate the key qualitative properties of prover-estimator debate that could guide practical debate experiments:

1. The prover $A$ both decomposes problems and chooses where to recurse given $B$'s estimates. The estimator $B$ only needs to express reasonable uncertainty about where a mistake might lie.
2. There is a per-round reward based on a consistency check between $B$'s estimate from the last round for the current main claim, and estimates for claims in the current round. $A$ can only collect this reward by accurately predicting the direction of $B$'s error between rounds.
3. In the last round, the judge directly decides if $B$'s estimates are correct.
4. The debaters should be trained to converge to an $A$-leading Stackelberg equilibrium.

Zooming out. The broad question is whether some variant of debate works as a component of an overall alignment strategy. Buhl et al. (2025) sketch what an overall safety case using debate might look like, combining both theoretical and empirical evidence. We believe prover-estimator debate is a useful contribution on the theory side, but hope that future work finds improved protocols that combine the advantages of other scalable oversight methods (Figure 3). In particular, we are interested in protocols which mitigate systematic errors in human input (Irving, 2025) and the potential for debaters to hide information in the subclaims $y_k$ (Pfau and Irving, 2025).
These are just two areas for improvement: the full safety case sketch has a lot more holes to fill.

Acknowledgments

We would like to thank Beth Barnes for good probing on the stability requirement, Marie Buhl for discussions relating this work to broader debate safety cases, Benjamin Hilton, Jacob Pfau, Zachary Kenton, Simon Marshall, Jacob Hilton, and Mario Szegedy for detailed comments on the paper, and Lean for refusing to buy the middle author's incomplete understandings of earlier versions of the proof.

References

Amit et al. [2024] Noga Amit, Shafi Goldwasser, Orr Paradise, and Guy Rothblum. Models that prove their own correctness. arXiv preprint arXiv:2405.15722, 2024.
Anil et al. [2021] Cem Anil, Guodong Zhang, Yuhuai Wu, and Roger Grosse. Learning to give checkable answers with prover-verifier games. arXiv preprint arXiv:2108.12099, 2021.
Barnes [2020] Beth Barnes. Debate update: Obfuscated arguments problem, 2020. URL https://w.alignmentforum.org/posts/PJLABqQ962hZEqhdB/debate-update-obfuscated-arguments-problem.
Bengio et al. [2025] Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, et al. Superintelligent agents pose catastrophic risks: Can scientist AI offer a safer path? arXiv preprint arXiv:2502.15657, 2025.
Brown-Cohen et al. [2024] Jonah Brown-Cohen, Geoffrey Irving, and Georgios Piliouras. Scalable AI safety via doubly-efficient debate. In Forty-first International Conference on Machine Learning, 2024.
Buhl et al. [2025] Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, and Geoffrey Irving. An alignment safety case sketch based on debate. arXiv preprint arXiv:2505.03989, 2025.
Canetti et al. [2013] Ran Canetti, Ben Riva, and Guy N. Rothblum. Refereed delegation of computation. Information and Computation, 226:16–36, 2013.
Chandra et al. [1981] Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer. Alternation. J. ACM, 28(1):114–133, January 1981. ISSN 0004-5411. doi: 10.1145/322234.322243. URL https://doi.org/10.1145/322234.322243.
Chen et al. [2023] Xinyi Chen, Angelica Chen, Dean Foster, and Elad Hazan. Playing large games with oracles and AI debate. arXiv preprint arXiv:2312.04792, 2023.
Christiano et al. [2018] Paul Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575, 2018.
Duchi et al. [2008] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In International Conference on Machine Learning, 2008. URL https://api.semanticscholar.org/CorpusID:1226433.
Dwork et al. [2021] Cynthia Dwork, Michael P. Kim, Omer Reingold, Guy N. Rothblum, and Gal Yona. Outcome indistinguishability. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 1095–1108, 2021.
Dwork et al. [2022] Cynthia Dwork, Michael P. Kim, Omer Reingold, Guy N. Rothblum, and Gal Yona. Beyond Bernoulli: Generating random outcomes that cannot be distinguished from nature. In International Conference on Algorithmic Learning Theory, 2022. URL https://api.semanticscholar.org/CorpusID:247682041.
Feige and Kilian [1997] Uriel Feige and Joe Kilian. Making games short. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 506–516, 1997.
Hammond and Adam-Day [2025] Lewis Hammond and Sam Adam-Day. Neural interactive proofs. In The Thirteenth International Conference on Learning Representations, 2025.
Hazan et al. [2016] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
Irving [2025] Geoffrey Irving. Dodging systematic human errors in scalable oversight, 2025. URL https://w.alignmentforum.org/posts/EgRJtwQurNzz8CEfJ/dodging-systematic-human-errors-in-scalable-oversight.
Irving et al. [2018] Geoffrey Irving, Paul Christiano, and Dario Amodei. AI safety via debate. arXiv preprint arXiv:1805.00899, 2018.
Kenton et al. [2024] Zachary Kenton, Noah Yamamoto Siegel, Janos Kramar, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah Goodman, et al. On scalable oversight with weak LLMs judging strong LLMs. In Forty-first International Conference on Machine Learning, 2024.
Khan et al. [2024] Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, and Ethan Perez. Debating with more persuasive LLMs leads to more truthful answers. In Forty-first International Conference on Machine Learning, 2024.
Kirchner et al. [2024] Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, and Yuri Burda. Prover-verifier games improve legibility of LLM outputs. arXiv preprint arXiv:2407.13692, 2024.
Leike et al. [2018] Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.
Michael et al. [2023] Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, and Samuel R. Bowman. Debate helps supervise unreliable experts. arXiv preprint arXiv:2311.08702, 2023.
Miller [1975] Gary L. Miller. Riemann's hypothesis and tests for primality. In Proceedings of the Seventh Annual ACM Symposium on Theory of Computing, pages 234–239, 1975.
Pfau and Irving [2025] Jacob Pfau and Geoffrey Irving. Unexploitable search: blocking malicious use of free parameters, 2025. URL https://w.alignmentforum.org/posts/CuneN5HmLnztsLRzD/unexploitable-search-blocking-malicious-use-of-free-1.
Rabin [1980] Michael O. Rabin. Probabilistic algorithm for testing primality. Journal of Number Theory, 12(1):128–138, 1980.
Rivest et al. [1978] Ronald L. Rivest, Adi Shamir, and Leonard Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21(2):120–126, 1978.
Von Neumann [1956] John von Neumann. Probabilistic logics and the synthesis of reliable organisms from unreliable components. Automata Studies, 34(34):43–98, 1956.
Wäldchen et al. [2022] Stephan Wäldchen, Kartikey Sharma, Max Zimmer, and Sebastian Pokutta. Merlin-Arthur classifiers: Formal interpretability with interactive black boxes. arXiv preprint arXiv:2206.00759, 2022.
Zinkevich [2003] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.

Appendix A Proof of Theorem 6.1

Fix a strategy $A$ for the leader. Our approach is to construct a small circuit implementing $B$'s strategy via online gradient descent on a sequence of loss functions chosen so that $B$'s probabilities will be indistinguishable by $A$ from the truth. In each iteration of online gradient descent the loss is chosen by measuring, averaged over all queries, how well $A$ can distinguish between $B$'s current strategy and the true answer. Crucially, our online gradient descent algorithm cannot be run: each step of gradient descent requires access to the truth, which is available only with exponential compute. However, we will show that the resulting circuit $B$ has complexity bounded in terms of the number of iterations, which will be small.

A.1 Indistinguishability

We begin with a formal definition of the notion of indistinguishability that we will use in our proof.

Definition A.1. Let $X$ be a set, $Z$ be a finite set, $\mu$ be a finitely supported probability distribution on $X$, and $g : X \to \Delta_Z$. Let $\mathcal{F}$ be a finite family of functions $f : X \times Z \times \Delta_Z \to [-1,1]$.
A function $h : X \to \Delta_Z$ is $(\delta,\mathcal{F})$-indistinguishable from $g$ if and only if for all $f \in \mathcal{F}$,
$$\left| \mathbb{E}_{x \sim \mu,\, z \sim g(x)}[f(x,z,h(x))] - \mathbb{E}_{x \sim \mu,\, z \sim h(x)}[f(x,z,h(x))] \right| < \delta.$$
Intuitively, Definition A.1 says that test functions from the class $\mathcal{F}$ cannot tell the difference between $h$ and $g$. Our notion of indistinguishability is nearly equivalent to a multi-class variant of outcome indistinguishability [Dwork et al., 2021], with the only difference being that we use real-valued bounded test functions, rather than functions taking values in $\{0,1\}$. The work of Dwork et al. [2022] further provides an algorithm for constructing multi-class outcome indistinguishability in the $\{0,1\}$-valued test function setting. However, due to a slightly different focus, Dwork et al. [2022] achieve a linear dependence on $|Z|$ in the complexity of $h$, at the cost of a $O(\log|\mathcal{A}|)$ factor. In our setting $|Z| = 2^q$ is small, while $\log|\mathcal{A}|$ is as large as the circuit complexity (or number of weights in the neural network) of the debaters. Thus, we design a different algorithm from that of Dwork et al. [2022] which has quadratic dependence on $|Z|$, but is independent of $|\mathcal{A}|$. Our main technical lemma uses online gradient descent to produce a $(\delta,\mathcal{F})$-indistinguishable function $h$ that has bounded complexity relative to $\mathcal{F}$.
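The quantity inside Definition A.1, the advantage of a test function at telling $h$ from $g$, can be estimated by Monte Carlo sampling. The following sketch is illustrative only (toy sets, a hand-picked test function), assuming distributions over $Z$ are given as probability lists.

```python
import random

def advantage(f, g, h, xs, mu, rng, samples=20_000):
    """Monte Carlo estimate of the distinguishing advantage
    E_{x~mu, z~g(x)} f(x,z,h(x)) - E_{x~mu, z~h(x)} f(x,z,h(x)).
    g(x) and h(x) return probability lists over Z."""
    def expect(dist_of):
        total = 0.0
        for _ in range(samples):
            x = rng.choices(xs, weights=mu)[0]
            pz = dist_of(x)
            z = rng.choices(range(len(pz)), weights=pz)[0]
            total += f(x, z, h(x))
        return total / samples
    return expect(g) - expect(h)

rng = random.Random(0)
xs, mu = [0, 1, 2, 3], [0.25] * 4
g = lambda x: [1.0, 0.0] if x % 2 == 0 else [0.0, 1.0]  # truth: z = x mod 2
h = lambda x: [0.5, 0.5]                                # uniform estimator
f = lambda x, z, p: 1.0 if z == x % 2 else -1.0         # one test function
adv = advantage(f, g, h, xs, mu, rng)  # close to 1: f distinguishes h from g
```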
In particular, $h$ will be computable by a small circuit that at each gate may query a function $f \in \mathcal{F}$. The result will follow from online gradient descent, where in each iteration we select a loss function determined by some $f \in \mathcal{F}$ such that the condition of Definition A.1 does not hold. We recall the standard guarantees of online gradient descent [Zinkevich, 2003, Hazan et al., 2016].

Theorem A.2 (Online Gradient Descent). Let $K$ be a convex set, $D$ be the diameter of $K$, and $G$ an upper bound on $\|\nabla l_t(v)\|_2$ for $v \in K$. Then for each $v \in K$, online gradient descent with step sizes $\eta_t = \frac{D}{G\sqrt{t}}$ satisfies
$$\mathrm{regret} = \frac{1}{T}\sum_{t=1}^T l_t(v_t) - \frac{1}{T}\sum_{t=1}^T l_t(v) \leq \frac{3DG}{2\sqrt{T}},$$
where the loss $l_t$ can be chosen adversarially after $v_t$ is revealed.

Algorithm 1 Constructing an indistinguishable $h$
1: Input: Finite sets $X$ and $Z$, a distribution $\mu$ on $X$, a function $g : X \to \Delta_Z$, a class $\mathcal{F}$ of functions $f : X \times Z \times \Delta_Z \to [-1,1]$, and accuracy parameter $\delta > 0$.
2: Output: A function $h : X \to \Delta_Z$ that is $(\delta,\mathcal{F})$-indistinguishable from $g$.
3: Online gradient descent with each loss given by a function $f \in \mathcal{F}$ that can distinguish $h$ from $g$.
4: Initialize $h_1 : X \to \Delta_Z$ to $h_1 \equiv \frac{1}{|Z|}$. ▷ Initialize estimates to the uniform distribution.
5: for $t = 1$ to $T$ do
6:  For any $f \in \mathcal{F}$ let $\mathrm{Adv}_f^t = \mathbb{E}_{x \sim \mu,\, z \sim g(x)}[f(x,z,h_t(x))] - \mathbb{E}_{x \sim \mu,\, z \sim h_t(x)}[f(x,z,h_t(x))]$.
7:  if there exists $f \in \mathcal{F}$ such that $|\mathrm{Adv}_f^t| \geq \delta$ then ▷ If $f$ can distinguish $h_t$ from $g$.
8:   $h_{t+1}(x,z) = h_t(x,z) + \eta_t \cdot \mathrm{sign}(\mathrm{Adv}_f^t) \cdot f(x,z,h_t(x))$ for all $x,z$. ▷ Gradient step with $f$.
9:   Project all outputs of $h_{t+1}$ to $\Delta_Z$. ▷ Project to probabilities.
10:  else
11:   return $h_t$
12:  end if
13: end for

Lemma A.3 (Indistinguishability). Let $X$ be a set, $Z$ be a finite set, $\mu$ be a finitely supported distribution on $X$, and $\mathcal{F}$ a set of functions $f : X \times Z \times \Delta_Z \to [-1,1]$. For any $g : X \to \Delta_Z$ there exists a function $h : X \to \Delta_Z$ such that:
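The loop above can be sketched in code for a toy instance where $X$, $Z$, and $\mathcal{F}$ are small enough to enumerate. This is only an illustration: the paper stresses that the real algorithm cannot be run, since computing $\mathrm{Adv}_f^t$ requires oracle access to the truth $g$. The step-size schedule below is an illustrative choice, and the simplex projection follows the sort-based method of Duchi et al. [2008].

```python
import math

def project_simplex(p):
    """Euclidean projection of a vector onto the probability simplex
    (sort-based method of Duchi et al., 2008)."""
    u = sorted(p, reverse=True)
    rho, css = 0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        if ui - (css - 1.0) / i > 0:
            rho = i
    theta = (sum(u[:rho]) - 1.0) / rho
    return [max(x - theta, 0.0) for x in p]

def build_indistinguishable_h(X, Z, mu, g, F, delta, T=1000):
    """Sketch of Algorithm 1: run online gradient descent until no test
    function f in F has advantage >= delta at distinguishing h from g.
    g[x] and the returned h[x] are probability lists over Z."""
    h = {x: [1.0 / len(Z)] * len(Z) for x in X}       # uniform initialization
    for t in range(1, T + 1):
        def adv(f):
            a = sum(mu[x] * sum(g[x][z] * f(x, z, h[x]) for z in Z) for x in X)
            b = sum(mu[x] * sum(h[x][z] * f(x, z, h[x]) for z in Z) for x in X)
            return a - b
        found = next(((f, a) for f in F for a in [adv(f)]
                      if abs(a) >= delta), None)       # a distinguishing f?
        if found is None:
            return h                # h is (delta, F)-indistinguishable from g
        f, a = found
        eta = 1.0 / math.sqrt(t)    # step size (illustrative choice)
        for x in X:
            stepped = [h[x][z] + eta * math.copysign(1.0, a) * f(x, z, h[x])
                       for z in Z]
            h[x] = project_simplex(stepped)            # back to Delta_Z
    return h

X, Z = [0, 1, 2, 3], [0, 1]
mu = {x: 0.25 for x in X}
g = {x: ([1.0, 0.0] if x % 2 == 0 else [0.0, 1.0]) for x in X}  # z = x mod 2
F = [lambda x, z, p: 1.0 if z == x % 2 else -1.0]               # one test fn
h = build_indistinguishable_h(X, Z, mu, g, F, delta=0.1)
```

On this toy instance a single gradient step followed by projection already drives every test advantage below $\delta$, so the loop terminates quickly; the point of the lemma is that the step count, and hence circuit size, is bounded in general.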
1. $h$ is $(\delta,\mathcal{F})$-indistinguishable from $g$.
2. $h$ is computable by a circuit of size $O(|Z|^2/\delta^2)$ that can query some $f \in \mathcal{F}$ at each gate.

Proof. The function $h$ will be the output of Algorithm 1. The first condition of the lemma will follow from the guarantees of online gradient descent, so we begin by showing that Algorithm 1 is online gradient descent on an appropriate sequence of losses. For a function $h : X \to \Delta_Z$ we let $h(x,z) = \mathbb{P}[h(x) = z]$ denote the probability assigned to $z \in Z$ by $h$ at input $x \in X$. We can interpret both $h(x,z)$ and $F(x,z) = f(x,z,h(x))$ as functions $X \times Z \to \mathbb{R}$. Construct an embedding $\Psi$ sending functions $h : X \times Z \to \mathbb{R}$ to vectors in $\mathbb{R}^{X \times Z}$ as
$$\Psi h(x,z) = \sqrt{\mu(x)} \cdot h(x,z).$$
Hence the inner product between embeddings is given by $\langle \Psi h, \Psi g \rangle = \mathbb{E}_{x \sim \mu}\left[\sum_{z \in Z} h(x,z)\, g(x,z)\right]$. Let $\mathcal{H} = \{\vec{h} \in \mathbb{R}^{X \times Z} \mid \vec{h} = \Psi h,\ h : X \to \Delta_Z\}$. Observe that $\|\vec{h}\|_2 \leq 1$ for all $\vec{h} \in \mathcal{H}$, and that the $\ell_2$-projection onto $\mathcal{H}$ is given by, for each $x$, projecting the $|Z|$-dimensional vector $\vec{h}(x,\cdot)$ to $\Delta_Z$.
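The inner-product identity for the embedding $\Psi$ can be checked numerically on a small random instance; the instance below is arbitrary and purely illustrative.

```python
import random

# Numeric check of <Psi h, Psi g> = E_{x~mu}[ sum_z h(x,z) g(x,z) ]
# on a small random instance.
rng = random.Random(1)
X, Z = range(5), range(3)
w = [rng.random() for _ in X]
mu = [wi / sum(w) for wi in w]                       # a random distribution
h = [[rng.random() for _ in Z] for _ in X]
g = [[rng.random() for _ in Z] for _ in X]

def Psi(func):
    # Psi f(x,z) = sqrt(mu(x)) * f(x,z), flattened into one long vector.
    return [mu[x] ** 0.5 * func[x][z] for x in X for z in Z]

inner = sum(a * b for a, b in zip(Psi(h), Psi(g)))
expect = sum(mu[x] * sum(h[x][z] * g[x][z] for z in Z) for x in X)
assert abs(inner - expect) < 1e-12
```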
Now for any $f \in \mathcal{F}$ we define the loss functions $l_t : \mathbb{R}^{X \times Z} \to \mathbb{R}$ by
$$l_t(\vec{h}) = \left| \langle \Psi g, \Psi F \rangle - \langle \vec{h}, \Psi F \rangle \right|. \qquad (3)$$
As the absolute value of an affine function, $l_t$ is convex in $\vec{h}$. Furthermore, for any $\vec{h} \in \mathcal{H}$ with $\vec{h} = \Psi h$, the subgradient of $l_t$ is given by
$$\begin{aligned} \partial l_t(\vec{h})(x,z) &= -\mathrm{sign}\left(\langle \Psi g, \Psi F \rangle - \langle \vec{h}, \Psi F \rangle\right) \Psi F(x,z) \\ &= -\mathrm{sign}\left(\mathbb{E}_{x \sim \mu,\, z \sim g(x)}[f(x,z,h(x))] - \mathbb{E}_{x \sim \mu,\, z \sim h(x)}[f(x,z,h(x))]\right) \Psi F(x,z) \\ &= -\mathrm{sign}(\mathrm{Adv}_f) \cdot \Psi F(x,z). \end{aligned}$$
Hence the embedding by $\Psi$ of the gradient update on line 8 of Algorithm 1 can be written as
$$\Psi h_{t+1} = \Psi h_t + \eta_t \cdot \mathrm{sign}(\mathrm{Adv}_f^t) \cdot \Psi F = \Psi h_t - \eta_t \cdot \partial l_t(\Psi h_t).$$
Furthermore, the projection on line 9
corresponds to the projection of $\Psi h_t$ onto $\mathcal{H}$. Therefore, Algorithm 1 is precisely online gradient descent on the embedded vectors $\Psi h_t$ with loss $l_t$. Since $|\mathrm{Adv}_f| \leq 2$ we have $\|\partial l_t(\vec{h})\|_2 \leq 2\|\Psi F\|_2 \leq 2\sqrt{|Z|}$ for all $\vec{h} \in \mathcal{H}$. By Theorem A.2 we conclude
$$\frac{3\sqrt{|Z|}}{\sqrt{T}} \geq \frac{1}{T}\sum_{t=1}^T l_t(\vec{h}_t) - \frac{1}{T}\sum_{t=1}^T l_t(\Psi g) = \frac{1}{T}\sum_{t=1}^T \left| \langle \Psi g, \Psi F \rangle - \langle \vec{h}_t, \Psi F \rangle \right| = \frac{1}{T}\sum_{t=1}^T \left| \mathrm{Adv}_f^t \right|, \qquad (4)$$
where the first equality uses $l_t(\Psi g) = 0$ and the second uses $\langle \Psi g, \Psi F \rangle - \langle \vec{h}_t, \Psi F \rangle = \mathbb{E}_{x \sim \mu,\, z \sim g(x)}[f(x,z,h_t(x))] - \mathbb{E}_{x \sim \mu,\, z \sim h_t(x)}[f(x,z,h_t(x))] = \mathrm{Adv}_f^t$. Each term in the final sum of (4) is greater than
or equal to $\delta$ by the if condition in Algorithm 1. Hence Algorithm 1 can run for at most $T = O(|Z|/\delta^2)$ steps before it is no longer possible to find an $f$ satisfying $|\mathrm{Adv}_f^t| \geq \delta$, as otherwise this would contradict (4). Thus, the function $h$ output by Algorithm 1 is $(\delta,\mathcal{F})$-indistinguishable from $g$, as desired.

For the second condition of the lemma, we can construct a size $O(|Z|^2/\delta^2)$ circuit for $h$ as follows. On input $x \in X$, the function $h_{t+1}(x)$ can be computed by evaluating $h_t(x,z) + \eta_t \cdot \mathrm{sign}(\mathrm{Adv}_f^t)\, f(x,z,h_t(x))$ for each $z \in Z$, and then performing the projection onto $\Delta_Z$. This takes $O(|Z|)$ evaluations of $f$, one evaluation of $h_t$, and $O(|Z|)$ time for the projection [Duchi et al., 2008]. Thus, a circuit to compute the final $h$ returned by Algorithm 1 can be constructed starting from $h_1$ with size at most $O(T \cdot |Z|) = O(|Z|^2/\delta^2)$. ∎

A.2 Soundness

Now that we have established that small circuits can produce indistinguishable probabilities for any class of test functions $\mathcal{F}$, we will leverage Lemma A.3 to produce the probabilities for $B$ in the prover-estimator debate protocol. The main idea is, given a fixed strategy for $A$, to construct $B$'s strategy in each round $k$ via Lemma A.3 with the function class $\mathcal{F}$ chosen to be size-$q$ circuits that make $q$ queries to $A$. The distribution on inputs for round $k$ is given by the fixed strategies for $A$ and $B$ in prior rounds applied to the original input distribution $\mu$.
Lemma A.4 (Soundness). For any strategy $A \in \mathcal{A}$ there exists a strategy $B$, computable by circuits of size $O(q 2^{2q} n/\delta^2)$ that may query $A$ at each gate, such that
$$\mathbb{E}_{x \sim \mu}[V^{A,B}(x)] < r^{d+1} \cdot \mathbb{P}[A(x) = L(x)] + (2d+1)\delta.$$

Proof. We first show that at initialization, we can construct $B$'s strategy so that the probability $p_d$ is indistinguishable by $A$ from the truth, and prove an upper bound on the initial reward $A$ can receive in terms of $\mathbb{P}[A(x) = L(x)]$.

Initialization. Let $X = \{0,1\}^n$ be the set of possible inputs $x$, let $Z = \{0,1\}$ be the set of outputs, and let $\mathcal{F}$ be the set of functions $f : X \times Z \times \Delta_Z \to [-1,1]$ computable by $O(1)$-sized circuits that evaluate $A$ once. Let $g : X \to \Delta_Z$ be the function that puts probability 1 on $L(x)$ given input $x$. By Lemma A.3 there exists a function $h : X \to \Delta_Z$ that is $(\delta,\mathcal{F})$-indistinguishable from $g$, and can be computed by $O(1/\delta^2)$-sized circuits that can query functions $A \in \mathcal{A}$ in each step. Let the strategy for $B$ be given by $B(x) = \mathbb{P}_{z \sim h(x)}[z = 1]$. Note that the initial reward $r_{\mathrm{init}}(x,\hat{z}) = r^{d+1} \cdot \mathbb{1}(\hat{z} = A(x))$ can be computed by evaluating $A$ once and performing a constant number of arithmetic operations; hence $r_{\mathrm{init}} \in \mathcal{F}$.
Therefore, by $(\delta, \mathcal{F})$-indistinguishability of $h$, we have

$$\delta > \left| \mathbb{E}_{\substack{x \sim \mu \\ \hat{z} \sim h(x)}}\big[r_{\mathrm{init}}(x, \hat{z})\big] - \mathbb{E}_{\substack{x \sim \mu \\ \hat{z} \sim g(x)}}\big[r_{\mathrm{init}}(x, \hat{z})\big] \right| = \left| \mathbb{E}_{\substack{x \sim \mu \\ \hat{z} \sim h(x)}}\big[r_{\mathrm{init}}(x, \hat{z})\big] - \mathbb{E}_{x \sim \mu}\big[r^{d+1} \cdot \mathbb{1}(A(x) = L(x))\big] \right| = \left| \mathbb{E}_{x \sim \mu}[r_{\mathrm{init}}] - r^{d+1}\, \mathbb{P}_{x \sim \mu}[A(x) = L(x)] \right|.$$
Rearranging yields

$$\mathbb{E}_{x \sim \mu}[r_{\mathrm{init}}] \le r^{d+1}\, \mathbb{P}_{x \sim \mu}[A(x) = L(x)] + \delta. \quad (5)$$

We next recursively construct B's strategy as follows. Given fixed strategies for A and B in rounds $k' > k$, let $\mu_k$ be the distribution on inputs $x$ and queries $y$ in round $k$.

Recursion. Let $X = \{0,1\}^{n+n+nq} \times [0,1]$ be the set of input-query sequences $(x, \hat{y}^k, y^k)$ along with the claimed probability $p_k$ from the previous round. Let $Z = \{0,1\}^q$ be the set of query answers $z$. Let $\mathcal{F}$ be the set of functions $f : X \times Z \times \Delta_Z \to [-1,1]$ computable by size-$O(nq)$ circuits that make $q$ queries to $A^k$, and define $g_k : X \to \Delta_Z$ by $\mathbb{P}_{z \sim g_k(x,y)}[z] = \prod_i \mathbb{1}(g(x, y_i) = z_i)$, where $g$ is the function that assigns the correct answer to each query $y_i$. Lemma A.3 implies that there exists a function $h_k : X \to \Delta_Z$ that is $(\delta, \mathcal{F})$-indistinguishable from $g_k$ and is computable by circuits of size $O(q 2^{2q} n / \delta^2)$.
We let B's strategy in round $k$ be $B_i^k(x, y, z_{<i}) = \mathbb{P}_{z \sim h_k(x,y)}[z_i = 1 \mid z_{<i}]$.

In every round $k$, the reward $r_k = r^k A^k(x, \hat{y}^k, p_k)\big(M_D(x, y^k, z^k) - p_k\big)$ can be viewed as a function $r_k(\bar{x}^k, z^k, h_k(\bar{x}^k))$ that depends only on the inputs $\bar{x}^k = (x, \hat{y}^k, y^k, p_k) \in X$, the probability distribution $h_k(\bar{x}^k) \in \Delta_Z$ produced by $B_i^k$, and the query answers $z \in Z$. Since $r_k$ can be computed by running the machine $M_D$ for $O(nq)$ steps, $r_k \in \mathcal{F}$.
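As an illustrative aside, B's per-bit strategy is simply the chain rule applied to the joint distribution $h_k$. In the following minimal sketch, the dictionary `h` (answer tuples mapped to probabilities) stands in for the circuit computing $h_k$:

```python
def B_strategy(h, i, z_prefix):
    """B's round-k reply: P[z_i = 1 | z_{<i}] under the joint
    distribution h over {0,1}^q (answer tuple -> probability)."""
    # Mass of all answer tuples consistent with the prefix z_{<i}.
    consistent = {z: p for z, p in h.items() if z[:i] == tuple(z_prefix)}
    total = sum(consistent.values())
    if total == 0:
        return 0.0  # the prefix has probability zero under h
    return sum(p for z, p in consistent.items() if z[i] == 1) / total
```

Sampling each bit in turn from these conditionals reproduces exactly the joint distribution $h_k(x, y)$ over $Z = \{0,1\}^q$.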
Hence, by $(\delta, \mathcal{F})$-indistinguishability of $h_k$,

$$\left| \mathbb{E}_{\substack{\bar{x} \sim \mu_k \\ z \sim h_k(\bar{x})}}\big[r_k(\bar{x}, z, h_k(\bar{x}))\big] - \mathbb{E}_{\substack{\bar{x} \sim \mu_k \\ z \sim g_k(\bar{x})}}\big[r_k(\bar{x}, z, h_k(\bar{x}))\big] \right| < \delta. \quad (6)$$

Recalling that $\hat{y}^k = y_j^{k+1}$ and $z_j^{k+1}$ are the query and answer that A selects for recursion in the previous round $k+1$, and that $p_k = h_{k+1}(\bar{x}^{k+1})$ is B's previous-round probability, the quantity

$$f_{k+1}\big(\bar{x}^{k+1}, z^{k+1}, h_{k+1}(\bar{x}^{k+1})\big) = r^k A^k(x, \hat{y}^k, p_k)\, z_j^{k+1} \quad (7)$$

can be viewed as a function taking inputs in $X \times Z \times \Delta_Z$ and making two queries to A: one to determine the index $j$ on which to recurse, and one for the value $A^k(x, \hat{y}^k, p_k)$. For the base case $k = d$, we abuse notation and let $z_j^{k+1} = \hat{z}$ and let $h_{k+1}$ be the function $h$ used to construct the strategy $B(x)$ at initialization. Hence $f_{k+1} \in \mathcal{F}$. By $(\delta, \mathcal{F})$-indistinguishability of $h_{k+1}$ on the distribution $\mu_{k+1}$, we have that

$$\left| \mathbb{E}_{\substack{\bar{x} \sim \mu_{k+1} \\ z \sim h_{k+1}(\bar{x})}}\big[f_{k+1}(\bar{x}, z, h_{k+1}(\bar{x}))\big] - \mathbb{E}_{\substack{\bar{x} \sim \mu_{k+1} \\ z \sim g_{k+1}(\bar{x})}}\big[f_{k+1}(\bar{x}, z, h_{k+1}(\bar{x}))\big] \right| < \delta. \quad (8)$$

Next, the definition of $r_k$ and the true answer distribution $g_k$ imply that for $k \ge 1$,

$$\mathbb{E}_{\substack{\bar{x} \sim \mu_k \\ z \sim g_k(\bar{x})}}\big[r_k(\bar{x}, z, h_k(\bar{x}))\big] = \mathbb{E}_{\substack{\bar{x} \sim \mu_k \\ z \sim g_k(\bar{x})}}\big[r^k A^k(x, \hat{y}^k, p_k)\big(M_D(x, y^k, z^k) - p_k\big)\big] = \mathbb{E}_{\substack{\bar{x} \sim \mu_k \\ z \sim g_k(\bar{x})}}\big[r^k A^k(x, \hat{y}^k, p_k)\big(g_{k+1}(x, \hat{y}^k) - p_k\big)\big] = \mathbb{E}_{\substack{\bar{x} \sim \mu_{k+1} \\ z \sim g_{k+1}(\bar{x})}}\big[f_{k+1}(\bar{x}, z, h_{k+1}(\bar{x}))\big] - \mathbb{E}_{\substack{\bar{x} \sim \mu_{k+1} \\ z \sim h_{k+1}(\bar{x})}}\big[f_{k+1}(\bar{x}, z, h_{k+1}(\bar{x}))\big] < \delta, \quad (9)$$

where the final inequality follows from (8). Finally, in the base case $k = 0$, we force A to produce the subset $\tilde{u}_{S_i}$ of the implicit proof without additional context, and so A's answers at the bottom level must correspond to a single consistent proof $\tilde{u}$, which implies that $g_0$ is well-defined. Hence, we have

$$\mathbb{E}_{\substack{\bar{x} \sim \mu_0 \\ z \sim g_0(\bar{x})}}\big[r_0(\bar{x}, z, h_0(\bar{x}))\big] = \mathbb{E}_{\substack{\bar{x} \sim \mu_0 \\ z \sim g_0(\bar{x})}}\big[A^0(x, \hat{y}^0, p_0)\big(O(\tilde{u}_{S_i}) - p_0\big)\big] \le \mathbb{E}_{\substack{\bar{x} \sim \mu_0 \\ z \sim g_0(\bar{x})}}\big[A^0(x, \hat{y}^0, p_0)\big(g_1(x, \hat{y}^0) - p_0\big)\big] = \mathbb{E}_{\substack{\bar{x} \sim \mu_1 \\ z \sim g_1(\bar{x})}}\big[f_1(\bar{x}, z, h_1(\bar{x}))\big] - \mathbb{E}_{\substack{\bar{x} \sim \mu_1 \\ z \sim h_1(\bar{x})}}\big[f_1(\bar{x}, z, h_1(\bar{x}))\big] < \delta. \quad (10)$$

Combining (9) and (10) with (6) implies that for all $k \ge 0$,

$$\mathbb{E}_{\substack{\bar{x} \sim \mu_k \\ z \sim h_k(\bar{x})}}\big[r_k(\bar{x}, z, h_k(\bar{x}))\big] < 2\delta. \quad (11)$$

Summing over the $d$ rounds of the debate and adding the expected $r_{\mathrm{init}}$ from (5) completes the proof. ∎

A.3 Completeness

In this section we prove the completeness case of Theorem 6.1. The main idea is that if $x \in L$ for an A-provable, sufficiently stable language $L$, then the strategy of A correctly outputting proofs $y$ and pointing out inaccurate estimations by B will always yield some positive reward for A.

Lemma A.5 (Completeness). Let $0 < \gamma < \frac{1}{2}$.
If the decomposition $M_D$ is $(\epsilon, \rho)$-stable and A-provable, each with probability $1 - \gamma$, then there exists a strategy $A \in \mathcal{A}$ in the protocol such that, with probability at least $1 - 2\gamma$ over an input $x \sim \mu$ and for all $B \in \mathcal{B}$,

$$\mathbb{E}\big[V^{A,B}(x)\big] \ge (1 - \epsilon) r^{d+1},$$

where $V$ is the verifier for the protocol of Figure 2 and the expectation is over the internal randomness of the protocol only.

Proof. By assumption, each of A-provability and $(\epsilon, \rho)$-stability holds on a per-input basis with probability at least $1 - \gamma$. Thus, with probability at least $1 - 2\gamma$, both conditions hold simultaneously. The remainder of the proof applies to this case only.

At initialization, the strategy for A is to output $A(x) = L(x)$. If $|p_d - A(x)| \le \epsilon$, then in all rounds $k \in [d]$ set $A^k(x, \hat{y}, p_k) = 0$. Hence all subsequent rewards $r_k = 0$, and the expected reward for A is at least

$$\mathbb{E}\big[V^{A,B}(x)\big] = \mathbb{E}[r_{\mathrm{init}}] = r^{d+1} \cdot \mathbb{P}_{\hat{z} \sim B(x)}[A(x) = \hat{z}] \ge (1 - \epsilon) r^{d+1}. \quad (12)$$

If instead $|p_d - A(x)| > \epsilon$, then the strategy for A is simply to produce the correct queries $y^k$ in each round, and then to choose $A^k(x, \hat{y}^k, p_k)$ and the recursion index $j$ according to the following rules. Let $g$ be the function that assigns the correct value to each query $y^k$:

$$A^k(x, \hat{y}^k, p_k) = \begin{cases} -1 & g(x, \hat{y}^k) - p_k < -\epsilon \\ 1 & g(x, \hat{y}^k) - p_k > \epsilon \\ 0 & \text{otherwise} \end{cases} \quad (13)$$

$$j = \operatorname*{argmax}_{i \in [q]} \big| B_i^k(x, y^k, z_{<i}^k) - g(x, y_i^k) \big|.$$

That is, $A^k(x, \hat{y}^k, p_k)$ is chosen to indicate over- or under-estimation by at least $\epsilon$, and $j$ is chosen to be the probability most inaccurately estimated by B. Such a strategy A exists in $\mathcal{A}$ because $L$ is A-provable. Now consider the last round $l$ in the recursion in which $|p_l - g(x, \hat{y}^l)| > \epsilon$.
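As an illustrative aside, the honest strategy just described reduces to a thresholded sign test plus an argmax. In the following minimal sketch, `g_value` stands in for the ground-truth value $g(x, \hat{y}^k)$, and `B_probs`, `g_values` stand in for B's per-bit estimates and the true bit values:

```python
def A_flag(g_value, p_k, eps):
    """Honest A's rule (13): flag over/under-estimation beyond eps."""
    diff = g_value - p_k
    if diff < -eps:
        return -1  # B overestimated by more than eps
    if diff > eps:
        return 1   # B underestimated by more than eps
    return 0       # B's estimate is eps-accurate: no challenge

def recursion_index(B_probs, g_values):
    """j = argmax_i |B_i - g_i|: recurse on the worst-estimated bit."""
    errs = [abs(b, ) if False else abs(b - g) for b, g in zip(B_probs, g_values)]
    return max(range(len(errs)), key=errs.__getitem__)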
Such a round $l$ exists because we are in the case that the initial probability $p_d = B(x)$ selected by B satisfies $|p_d - A(x)| > \epsilon$. Because A always selects the maximum-error step $j$ for recursion, this implies that for all subsequent rounds $k \le l$,

$$\big| B_i^k(x, y^k, z_{<i}^k) - g(x, y_i^k) \big| \le \epsilon. \quad (14)$$

This implies that $A^k(x, \hat{y}^k, p_k) = 0$ by construction, and hence $r_k = 0$, for all rounds $k \le l - 1$. Combining (14) with the $(\epsilon, \rho)$-stability of $M_D$ implies

$$\big| \mathbb{E}\big[M_D(x, y^l, z^l) - g(x, \hat{y}^l) \mid l \ge 1\big] \big| \le \rho\epsilon. \quad (15)$$

By (13) and (15), the expected reward in round $l$ conditioned on $l \ge 1$ is at least

$$\mathbb{E}[r_l \mid l \ge 1] = \mathbb{E}\big[r^l A^l(x, \hat{y}^l, p_l)\big(M_D(x, y^l, z^l) - p_l\big) \mid l \ge 1\big] = \mathbb{E}\big[r^l A^l(x, \hat{y}^l, p_l)\big(M_D(x, y^l, z^l) - g(x, \hat{y}^l) + g(x, \hat{y}^l) - p_l\big) \mid l \ge 1\big] \ge \mathbb{E}\big[r^l \big( -\big|M_D(x, y^l, z^l) - g(x, \hat{y}^l)\big| + A^l(x, \hat{y}^l, p_l)\big(g(x, \hat{y}^l) - p_l\big) \big) \mid l \ge 1\big] \ge \mathbb{E}[r^l \mid l \ge 1]\,(\epsilon - \rho\epsilon). \quad (16)$$

Conditioning on the outcome $l = 0$, and again applying (13) along with the fact that A outputs the correct implicit proof $\tilde{u}_{S_i}$, yields

$$\mathbb{E}[r_0 \mid l = 0] = \mathbb{E}\big[A^0(x, \hat{y}^0, p_0)\big(O(\tilde{u}_{S_i}) - p_0\big) \mid l = 0\big] = \mathbb{E}\big[A^0(x, \hat{y}^0, p_0)\big(\underbrace{O(\tilde{u}_{S_i}) - g(x, \hat{y}^0)}_{0} + g(x, \hat{y}^0) - p_0\big) \mid l = 0\big] \ge \epsilon. \quad (17)$$

Next, observe that since $A^k(x, \hat{y}^k, p_k) \in [-1, 1]$, the total expected reward from all prior rounds $k \ge l + 1$ is at least

$$\mathbb{E}\Big[r_{\mathrm{init}} + \sum_{k=l+1}^{d} r_k\Big] \ge \mathbb{E}\Big[-\sum_{k=l+1}^{d+1} r^k\Big] = \mathbb{E}\Big[-r^{l+1}\Big(\frac{1 - r^{d+2}}{1 - r}\Big)\Big] > \mathbb{E}\big[-2 r^{l+1}\big], \quad (18)$$

where the last inequality follows from the fact that $r < 1/2$. Recall that $r = \frac{1}{4}\epsilon(1 - \rho)$. Summing (16) and (18) yields the total reward for A conditioned on $l \ge 1$:

$$\mathbb{E}\big[V^{A,B}(x) \mid l \ge 1\big] \ge \mathbb{E}\big[r^l(\epsilon - \rho\epsilon) - 2 r^{l+1} \mid l \ge 1\big] > \mathbb{E}\big[\tfrac{1}{2} r^l \epsilon (1 - \rho) \mid l \ge 1\big] > r^{d+1}, \quad (19)$$

where the final inequality follows from $l \le d$. Similarly, summing (17) and (18) yields

$$\mathbb{E}\big[V^{A,B}(x) \mid l = 0\big] \ge \epsilon - 2r > 2r. \quad (20)$$

Taking the minimum of the possible outcomes given by (12), (19), and (20) completes the proof. ∎

A.4 Final proof

The proof of Theorem 6.1 follows by plugging the correct values into Lemma A.5 and Lemma A.4.
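As a numeric sanity check of this plug-in (an illustrative aside; the values of $\epsilon$ and $n$ are arbitrary), take $\rho = 1/2$, so that $r = \frac{1}{4}\epsilon(1-\rho) = \epsilon/8$, and depth $d = \frac{\log n}{\log(8/\epsilon)} - 1$:

```python
from math import log

eps, n = 0.5, 4096          # illustrative values only
rho = 0.5                   # stability parameter, as in Theorem 8.3
r = 0.25 * eps * (1 - rho)  # r = eps/8 when rho = 1/2
d = log(n) / log(8 / eps) - 1

# r^(d+1) = (eps/8)^(log n / log(8/eps)) = 1/n, so the completeness
# guarantee (1 - eps) * r^(d+1) equals (1 - eps)/n.
assert abs(r ** (d + 1) - 1.0 / n) < 1e-12
assert abs((1 - eps) * r ** (d + 1) - (1 - eps) / n) < 1e-12
```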
Proof of Theorem 6.1. By Theorem 8.3, for all $0 < \epsilon < 1/2$ and $s \ge \Omega(c^d)$ there is an $(\epsilon, 1/2)$-stable recursive decomposition of $L$ with depth $d = \frac{\log n}{\log(8/\epsilon)} - 1$ and width $q = O\big(c \cdot \big(\frac{8}{\epsilon}\big)^c\big)$. Plugging $\rho = \frac{1}{2}$ and $d$ into Lemma A.5 yields completeness:

$$r^{d+1} = \Big(\frac{\epsilon}{8}\Big)^{\frac{\log n}{\log(8/\epsilon)}} = \frac{1}{n}, \qquad \mathbb{E}_{x \sim \mu}\big[V^{A,B}(x)\big] \ge (1 - \epsilon) r^{d+1} = \frac{1 - \epsilon}{n}.$$

Further setting $\delta = \eta/(2d+1)$, $d$, and $q$ in Lemma A.4 yields soundness:

$$\mathbb{E}_{x \sim \mu}\big[V^{A,B}(x)\big] < r^{d+1} \cdot \mathbb{P}[A(x) = L(x)] + (2d+1)\delta = \frac{\mathbb{P}[A(x) = L(x)]}{n} + \eta. \qquad ∎$$

Appendix B Proof of Theorem 6.2

The proof of Theorem 6.2 follows from an analysis of Stackelberg equilibria via the soundness and completeness of Lemma A.4 and Lemma A.5.

Proof of Theorem 6.2. Let $A \in \mathcal{A}$ and $B \in \mathcal{B}$ be the respective strategies of debaters A and B in any $\alpha$-approximate, A-leading Stackelberg equilibrium in the protocol of Figure 2.
Suppose that A is dishonest with probability $\gamma$, i.e. $\mathbb{P}_{x \sim \mu}[A(x) \ne L(x)] = \gamma > 0$. Then by Lemma A.4 there exists a strategy $B' \in \mathcal{B}$ such that

$$\mathbb{E}_{x \sim \mu}\big[V^{A,B',O}(x)\big] \le (1 - \gamma) r^{d+1} + (2d+1)\delta.$$

The definition of an $\alpha$-approximate Stackelberg equilibrium then implies that

$$\mathbb{E}_{x \sim \mu}\big[V^{A,B}(x)\big] \le (1 - \gamma) r^{d+1} + (2d+1)\delta + \alpha. \quad (21)$$

On the other hand, by Lemma A.5, there exists a strategy $A' \in \mathcal{A}$ such that even for the best response $B^* \in \mathcal{B}$ to $A'$, the expected reward is at least

$$\mathbb{E}_{x \sim \mu}\big[V^{A',B^*,O}(x)\big] > (1 - \epsilon) r^{d+1} - \frac{\epsilon}{2n}. \quad (22)$$

Then, by the definition of an $\alpha$-approximate Stackelberg equilibrium, followed by an application of (22) and (21), we have

$$\alpha \ge \mathbb{E}_{x \sim \mu}\big[V^{A',B^*,O}(x)\big] - \mathbb{E}_{x \sim \mu}\big[V^{A,B}(x)\big] > (1 - \epsilon) r^{d+1} - \frac{\epsilon}{2n} - (1 - \gamma) r^{d+1} - (2d+1)\delta - \alpha.$$

Rearranging, and using $\delta = \frac{\epsilon r^{d+1}}{2d+1}$ and $\alpha < \frac{\epsilon}{2n}$, completes the proof:

$$\gamma < \epsilon + \frac{(2d+1)\delta + 2\alpha + \frac{\epsilon}{2n}}{r^{d+1}} = \epsilon + \frac{\epsilon r^{d+1} + 2\alpha + \frac{\epsilon}{2n}}{r^{d+1}} = 2\epsilon + 2\alpha n + \epsilon/2 < 4\epsilon. \qquad ∎$$

Appendix C Stability

In this section we prove Theorem 8.3, showing that every $s$-supported language can be transformed via majority voting into a language that admits a stable recursive decomposition. The main idea is to use a simplified variant of von Neumann's error-correcting approach to constructing reliable circuits from unreliable components [Von Neumann, 1956].

Proof of Theorem 8.3. Let $M$ be the time-$n^c$ verifier for $L$, and for any $x \in L$ let $y^*$ be the corresponding witness such that $M^{O^*}(x, y^*) = 1$. Let $\tau$ be the transcript of $M^{O^*}(x, y^*)$. Checking that $\tau$ is a valid transcript can be encoded as an equivalent circuit $C$ consisting of $d$ layers.
The input layer consists of the bits of $\tau$; the next layer consists of $O(1)$-sized CNF formulas that check whether an individual step of $\tau$ was executed correctly; and each subsequent layer consists of the AND of groups of $w = \Theta\big(\frac{8}{\epsilon}\big)^c$ outputs of the previous layer. Choosing the constant on $w$, we can arrange for $C$ to have depth $d = c \log n / \log w - 1 = \frac{\log n}{\log(8/\epsilon)} - 1$.

The construction. Next we perform the following modification of $C$ to produce a stable circuit $C'$. The construction proceeds layer by layer, where in each iteration we utilize additional independent inputs $v_i^j$ in place of the original queries to $\tau_i$. We recursively define intermediate circuits $C'_k$ as follows. For the base case, first connect $\kappa$ independent inputs $v_i^j$ to a majority gate $M_i$, and then add the original first layer of logic gates in $C$, with their $i$-th input connected to the majority gate $M_i$. The circuit $C'_k$ is then constructed recursively by making $\kappa$ copies of $C'_{k-1}$, each taking a fresh set of independent inputs $v_i^j$, connecting the $\kappa$ copies of the $i$-th output gate of $C'_{k-1}$ to a majority gate, and then adding the original $k$-th layer of logic gates from $C$, with each input wire connected to the corresponding majority gate. Observe that for each original input, the total number of independent inputs $v_i^j$ required by the construction is $s = \kappa^d$.
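The error reduction achieved by this recursive majority construction can be illustrated numerically. The following is an idealized aside that treats all copies as fully independent, with arbitrary illustrative parameters:

```python
from math import comb

def maj_fail(kappa, p):
    """Exact failure probability of a kappa-wise majority vote (odd
    kappa) over independent copies, each wrong with probability p."""
    return sum(comb(kappa, k) * p**k * (1 - p)**(kappa - k)
               for k in range((kappa + 1) // 2, kappa + 1))

kappa, depth, err = 5, 4, 0.2  # illustrative values only
errors = [err]
for _ in range(depth):          # one majority layer per circuit layer
    errors.append(maj_fail(kappa, errors[-1]))
assert all(a > b for a, b in zip(errors, errors[1:]))  # error shrinks
# Independent inputs consumed per original input: s = kappa**depth.
```

Each layer of majorities drives the per-gate error down, at the cost of the $\kappa$-fold blow-up in independent inputs noted above.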
Correctness of $C'$. For a logic gate $G$ in layer $k$ of $C'$, let $v_G$ be the subset of inputs $v_i^j$ that feed into $G$. Similarly, let $v_{\bar G}$ be the complement of this subset. We now proceed by induction, with the inductive hypothesis being that for any gate $G$ in layer $k$,
$$\mathbb{P}_{\mathcal{O}\sim\mathcal{P}}\left[G^{\mathcal{O}} = G^{\mathcal{O}^*} \mid v_{\bar G}\right] > 1 - \frac{\epsilon}{4n}. \tag{23}$$
For the base case $k = 1$, the inputs $v_i^j$ satisfy the condition because $L$ is $s$-supported. For the inductive case, observe that $G$ in layer $k$ takes as input $w$ majority gates, where each individual majority is computed over outputs from $\kappa$ copies of $C'_{k-1}$, each evaluated on a fresh set of independent inputs. Let $\operatorname{Maj}(z_1, \ldots, z_\kappa)$ be one of the majority gates, with inputs given by $z_i$. Let $z_i$ be the value of the input under oracle $\mathcal{O}$, and $z_i^*$ the value under $\mathcal{O}^*$. By the inductive hypothesis (23),
$$\mathbb{P}_{\mathcal{O}\sim\mathcal{P}}\left[z_i = z_i^* \mid z_{\neq i}, v_{\bar G}\right] > 1 - \frac{\epsilon}{4n} \tag{24}$$
since the sets of inputs feeding into each $z_i$ are disjoint.
Further, since majority is a monotone boolean function and the $z_i^*$ all take the same value under the ground-truth oracle $\mathcal{O}^*$, the distribution that minimizes the probability $\mathbb{P}_{\mathcal{O}\sim\mathcal{P}}[\operatorname{Maj}(z) = \operatorname{Maj}(z^*)]$ under the constraint that (24) is satisfied is given by independently setting each bit $z_i = z_i^*$ with probability $1 - \frac{\epsilon}{4n}$. Set
$$\kappa = \frac{3\ln(2w/\epsilon)}{D(1/2 \,\|\, 1 - 3\epsilon/2)} = \frac{3\ln(2w/\epsilon)}{\ln(1/(6\epsilon(1 - 3\epsilon/2)))} > 3.$$
Since $w$ and $\epsilon$ are constants, we have that for sufficiently large $n$,
$$\frac{\ln(4nw/\epsilon)}{D\left(1/2 \,\big\|\, 1 - \frac{\epsilon}{4n}\right)} = \frac{\ln(4nw/\epsilon)}{\frac{1}{2}\log\left(\frac{n}{\epsilon(1 - \frac{\epsilon}{4n})}\right)} = 2\,\frac{\ln(n) + \ln(4w/\epsilon)}{\ln n + \ln\left(1/\left(\epsilon(1 - \frac{\epsilon}{4n})\right)\right)} < 3. \tag{25}$$
By standard Chernoff bounds and (25) we then have
$$\mathbb{P}_{\mathcal{O}\sim\mathcal{P}}\left[\operatorname{Maj}(z) \neq \operatorname{Maj}(z^*) \mid v_{\bar G}\right] \leq \exp\left(-\kappa D\left(1/2 \,\big\|\, 1 - \tfrac{\epsilon}{4n}\right)\right) \leq \exp\left(-\kappa \tfrac{1}{3}\ln(4nw/\epsilon)\right) \leq \exp\left(-\ln(4nw/\epsilon)\right) = \frac{\epsilon}{4nw}.$$
Hence, by the union bound over the $w$ inputs to $G$ we have
$$\mathbb{P}_{\mathcal{O}\sim\mathcal{P}}\left[G^{\mathcal{O}} = G^{\mathcal{O}^*} \mid v_{\bar G}\right] > 1 - \frac{\epsilon}{4n}. \tag{26}$$
Recalling Definition 8.2, this implies that $x \in L$ if and only if $(x, r) \in L'$ with probability at least $1 - \frac{\epsilon}{4n}$ over the uniform random choice of $r$.

Stability of $C'$. Next we utilize the structure of $C'$ to produce an $(\epsilon, 1/2)$-stable recursive decomposition. The constructed circuit $C'$ takes a form where each layer of fan-in-$w$ logic gates $G$ (AND or OR) alternates with a layer of fan-in-$\kappa$ majority gates. The decomposition is defined by letting $M_D$ be the machine that computes a logic gate $G$ composed with $w$ majority gates, where the queries $y_1, \ldots, y_q$ are each $c\log n$-bit labels of input wires in $C'$. Hence, the total number of queries in each round by $M_D$ will be $q = w\kappa$. The decomposition proceeds downward from the output gate of $C'$, where each round corresponds to a logic gate $G$ composed with majority, and the true distribution on query answers is given by evaluating $C'$ at inputs given by $\mathcal{O} \sim \mathcal{P}$.
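The choice of $\kappa$ can be checked numerically for concrete constants. The sketch below (our helper names, not the paper's; $D$ is computed directly as the Bernoulli KL divergence in nats, matching the first expression for $\kappa$) verifies that $\kappa > 3$ and that the per-majority Chernoff bound $\exp(-\kappa D(1/2 \,\|\, 1 - 3\epsilon/2))$ falls below $\epsilon/(2w)$, as needed for the union bounds in the argument.

```python
import math

def kl_bernoulli(a: float, p: float) -> float:
    """KL divergence D(a || p) between Bernoulli(a) and Bernoulli(p), in nats."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def choose_kappa(w: float, eps: float) -> float:
    """kappa = 3 ln(2w/eps) / D(1/2 || 1 - 3 eps/2), as in the construction."""
    return 3 * math.log(2 * w / eps) / kl_bernoulli(0.5, 1 - 1.5 * eps)

eps = 0.01
w = (8 / eps) ** 1            # w = Theta(8/eps)^c with c = 1, for illustration
kappa = choose_kappa(w, eps)
# Per-majority failure bound exp(-kappa * D) = (eps/(2w))^3 <= eps/(2w).
per_majority = math.exp(-kappa * kl_bernoulli(0.5, 1 - 1.5 * eps))
```

Since $\kappa D(1/2 \,\|\, 1 - 3\epsilon/2) = 3\ln(2w/\epsilon)$ by construction, the per-majority bound is $(\epsilon/2w)^3$, comfortably inside the $\epsilon/(2w)$ budget.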
Next, for any set of queries $y_i$, let $p_i(z_{<i}) = \mathbb{P}_{\mathcal{O}\sim\mathcal{P}}[z_i = 1 \mid z_{<i}]$. We now prove that each $M_D$ is $(\epsilon, 1/2)$-stable at $p_i$. Let $q_i(z_{<i})$ be any sequence of probabilities satisfying $|q_i(z_i \mid z_{<i}) - p_i(z_i \mid z_{<i})| < \epsilon$. The computation of $M_D$ corresponds to a logic gate $G$ composed with $w$ majority gates. As above, let $\operatorname{Maj}(z_1, \ldots, z_\kappa)$ be one of the majority gates. By equation (24) we have
$$\mathbb{P}_{z_i \sim p_i}\left[z_i = z_i^* \mid z_{\neq i}\right] > 1 - \frac{\epsilon}{4n}$$
and hence
$$\mathbb{P}_{z_i \sim q_i}\left[z_i = z_i^* \mid z_{\neq i}\right] > 1 - \frac{\epsilon}{4n} - \epsilon > 1 - 3\epsilon/2. \tag{27}$$
Again, since majority is a monotone boolean function and the $z_i^*$ all take the same value under the ground-truth oracle $\mathcal{O}^*$, the distribution that minimizes the probability $\mathbb{P}_{z_i \sim q_i}[\operatorname{Maj}(z) = \operatorname{Maj}(z^*)]$ under the constraint that (27) is satisfied is given by independently setting each bit $z_i = z_i^*$ with probability $1 - 3\epsilon/2$. By standard Chernoff bounds we then have
$$\mathbb{P}_{z_i \sim q_i}\left[\operatorname{Maj}(z) \neq \operatorname{Maj}(z^*)\right] \leq e^{-\kappa D(1/2 \,\|\, 1 - 3\epsilon/2)} \leq e^{-\ln(2w/\epsilon)} = \frac{\epsilon}{2w}.$$
Hence, the union bound over the $w$ majorities yields
$$\mathbb{P}_{z_i \sim q_i}\left[M_D(x, y_i, z_i) = M_D(x, y_i, z_i^*)\right] > 1 - \frac{\epsilon}{2}. \tag{28}$$
For $\mathcal{O} \sim \mathcal{P}$, let $\bar z_i$ be the correct answers to the queries $y_i$ under the oracle $\mathcal{O}$. Combining (28) with (26), we have that with probability at least $1 - \frac{\epsilon}{4n}$ over the choice of $\mathcal{O} \sim \mathcal{P}$,
$$\left|\mathbb{P}_{z_i \sim q_i}\left[M_D(x, y, z)\right] - M_D(x, y, \bar z)\right| < \frac{\epsilon}{2}. \tag{29}$$
Recalling Definition 8.2, this immediately implies that with probability at least $1 - \frac{\epsilon}{4n}$ over $(x, r) \sim \mu'$, the decomposition $M_D$ is an $(\epsilon, 1/2)$-stable decomposition of $L'$. Finally, plugging in the values of $w = \Theta(8/\epsilon)^c$ and $\kappa = \frac{3\ln(2w/\epsilon)}{\ln(1/(6\epsilon(1 - 3\epsilon/2)))}$ yields the number of queries $q$ and the $s$-supportedness parameter:
$$q = w\kappa = O\left(c \cdot (8/\epsilon)^c\right), \qquad s = \kappa^d = \Theta(c^d). \qquad \blacksquare$$

Appendix D Recommendations for empirical debate

We make several qualitative recommendations for empirical debate based on our new protocol.

Asymmetry between the debaters. Much empirical work on debate thus far has had symmetric debaters, distinguished only by which debater speaks first. Our protocol suggests that it is useful to have asymmetry between the debaters. Debater A should be tasked with proposing a solution and providing evidence to defend it, while debater B should attempt to identify flaws and evaluate the plausibility of the evidence. This setup also makes sense for empirical debate research on open-ended tasks, as the first debater will need to propose a solution and then defend its correctness.

Explicit uncertainty estimates. Our protocols rely heavily on the ability to ask the estimator B to produce probability estimates for subclaims that arise in the debate. The payoffs are also determined by these estimates.
In a practical setting one could use multiple samples from a generative model to produce uncertainty estimates, or directly use token probabilities of the answer from an LLM. For example, after an LLM A generates an argument, one could prompt both A and an opposing LLM B to read through the argument step by step, and at each step answer whether the claim seems true. Token probabilities for the truth of each claim could then be compared between A and B.

Game-theoretically motivated training algorithms. We depart from prior theoretical work on debate by considering different game-theoretic equilibrium concepts to reason about our protocols. The leader-follower structure implies that for every training update made to A, we should make multiple updates to B. This ensures that whenever A receives a gradient update, it comes from an episode of debate where B is playing an approximate "best response" to A's current strategy. This type of training algorithm makes particular sense in the asymmetric setting where A proposes a complex solution to some problem and is asked to defend it. In this case, we would like to train B to reliably identify flaws in A's solutions before providing a new update to A.
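As a minimal illustration of the sampling approach to uncertainty estimates, the sketch below estimates each subclaim's truth probability by repeated yes/no sampling and compares A's and B's estimates claim by claim. The `judge` callables are hypothetical stand-ins for stochastic LLM queries, not an API from the paper.

```python
from typing import Callable, List

def estimate_probability(judge: Callable[[str], bool], claim: str,
                         n_samples: int = 16) -> float:
    """Estimate P[claim is true] by averaging repeated boolean judgements
    sampled from a model; `judge` stands in for an LLM call."""
    return sum(judge(claim) for _ in range(n_samples)) / n_samples

def compare_debaters(judge_a: Callable[[str], bool],
                     judge_b: Callable[[str], bool],
                     claims: List[str]) -> List[float]:
    """Per-claim gap between A's and B's estimated truth probabilities;
    large gaps flag the steps worth recursing on in a debate."""
    return [estimate_probability(judge_a, c) - estimate_probability(judge_b, c)
            for c in claims]
```

With deterministic judges, a claim both debaters accept produces a gap of 0, while a claim A asserts and B rejects produces a gap of 1, marking it as the natural point of contention.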