
Paper deep dive

Towards Provably Secure Generative AI: Reliable Consensus Sampling

Yu Cui, Hang Fu, Sicheng Pan, Zhuoyu Sun, Yifei Liu, Yuhong Nie, Bo Ran, Baohan Huang, Xufeng Zhang, Haibin Zhang, Cong Zuo, Licheng Wang

Year: 2024 · Venue: arXiv preprint · Area: Formal/Theoretical · Type: Theoretical · Embeddings: 53

Models: Qwen2.5-0.5b-instruct, Qwen2.5-7b-instruct, Qwen3Guard-Gen-8B

Abstract

Existing research on generative AI security is primarily driven by mutually reinforcing attack and defense methodologies grounded in empirical experience. This dynamic frequently gives rise to previously unknown attacks that can circumvent current detection and prevention, which necessitates the continual updating of security mechanisms. Constructing generative AI with provable security and theoretically controllable risk is therefore necessary. Consensus Sampling (CS) is a promising algorithm toward provably secure AI. It controls risk by leveraging overlap in model output probabilities. However, we find that CS relies on frequent abstention to avoid unsafe outputs, which reduces utility. Moreover, CS becomes highly vulnerable when unsafe models are maliciously manipulated. To address these issues, we propose a new primitive called Reliable Consensus Sampling (RCS) that traces acceptance probability to tolerate extreme adversarial behaviors, improving robustness. RCS also eliminates the need for abstention entirely. We further develop a feedback algorithm to continuously and dynamically enhance the safety of RCS. We provide theoretical guarantees that RCS maintains a controllable risk threshold. Extensive experiments show that RCS significantly improves robustness and utility while maintaining latency comparable to CS. We hope this work contributes to the development of provably secure generative AI.

Tags

ai-safety (imported, 100%) · formaltheoretical (suggested, 92%) · theoretical (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/11/2026, 1:13:13 AM

Summary

The paper introduces Reliable Consensus Sampling (RCS), a novel primitive for generative AI security that improves upon Consensus Sampling (CS) by eliminating the need for abstention and enhancing robustness against adversarial manipulation. RCS utilizes a trace-based mechanism to record acceptance probabilities and a feedback algorithm inspired by quantum entanglement to dynamically exclude unsafe models, providing theoretical guarantees for a controllable risk threshold.

Entities (4)

Consensus Sampling · algorithm · 100%
Reliable Consensus Sampling · algorithm · 100%
Feedback-Optimized Reliable Consensus Sampling · algorithm · 95%
Byzantine threat model · security-model · 90%

Relation Signals (3)

Reliable Consensus Sampling improves upon Consensus Sampling

confidence 95% · RCS significantly improves robustness and utility while maintaining latency comparable to CS.

Feedback-Optimized Reliable Consensus Sampling optimizes Reliable Consensus Sampling

confidence 95% · we construct an optimization algorithm for RCS, named Feedback-Optimized Reliable Consensus Sampling (F-RCS).

Reliable Consensus Sampling operates under Byzantine threat model

confidence 90% · We define a security property theory for model groups under a Byzantine threat model.

Cypher Suggestions (2)

Find all algorithms related to provably secure AI · confidence 90% · unvalidated

MATCH (a:Algorithm)-[:IMPROVES_UPON|OPTIMIZES*0..1]->(b:Algorithm) WHERE a.name CONTAINS 'Consensus' RETURN a, b

Map the relationship between security models and algorithms · confidence 85% · unvalidated

MATCH (a:Algorithm)-[:OPERATES_UNDER]->(m:SecurityModel) RETURN a.name, m.name

Full Text

52,437 characters extracted from source content.


Towards Provably Secure Generative AI: Reliable Consensus Sampling

Yu Cui¹, Hang Fu¹, Sicheng Pan¹, Zhuoyu Sun¹, Yifei Liu¹, Yuhong Nie¹, Bo Ran¹, Baohan Huang¹, Xufeng Zhang¹, Haibin Zhang², Cong Zuo¹, Licheng Wang¹
¹Beijing Institute of Technology ²Yangtze Delta Region Institute of Tsinghua University, Zhejiang
cuiyu@bit.edu.cn, bchainzhang@aliyun.com

Abstract

Existing research on generative AI security is primarily driven by mutually reinforcing attack and defense methodologies grounded in empirical experience. This dynamic frequently gives rise to previously unknown attacks that can circumvent current detection and prevention. This necessitates the continual updating of security mechanisms. Constructing generative AI with provable security and theoretically controllable risk is therefore necessary. Consensus Sampling (CS) is a promising algorithm toward provably secure AI. It controls risk by leveraging overlap in model output probabilities. However, we find that CS relies on frequent abstention to avoid unsafe outputs, which reduces utility. Moreover, CS becomes highly vulnerable when unsafe models are maliciously manipulated. To address these issues, we propose a new primitive called Reliable Consensus Sampling (RCS) that traces acceptance probability to tolerate extreme adversarial behaviors, improving robustness. RCS also eliminates the need for abstention entirely. We further develop a feedback algorithm to continuously and dynamically enhance the safety of RCS. We provide theoretical guarantees that RCS maintains a controllable risk threshold. Extensive experiments show that RCS significantly improves robustness and utility while maintaining latency comparable to CS. We hope this work contributes to the development of provably secure generative AI.
1 Introduction

With the widespread deployment of generative AI, especially large language models (LLMs), security issues continue to emerge (Ji et al., 2023a; Liu et al., 2024; Zhan et al., 2025; Wang et al., 2025b). Current AI safety research largely follows a coevolutionary trajectory between attacks and defenses (Zhang et al., 2025b). New attack methods and defenses appear continuously. This dynamic leaves existing defenses unprepared for future and unpredictable threats. As a result, the definition of AI safety requires constant revision and expansion. The root cause is inherent risk in model reasoning. This risk is unavoidable and difficult to control, as shown in Figure 1. Across the full reasoning pipeline, three stages permit risk intervention. At the model level, emerging threats demand repeated alignment. However, safety fine-tuning (Jain et al., 2024) introduces additional risks, such as backdoor attacks (Xu et al., 2024; Wen et al., 2024) and data poisoning (Chen et al., 2024b). For external filtering mechanisms, prior work (Ball et al., 2025) based on cryptographic hardness shows that efficient prompt filters do not exist for certain large models. Output filtering is computationally intractable. Moreover, some risks remain undetectable (Kalai et al., 2025). Therefore, security risk in model reasoning cannot be eliminated. Empirical defenses offer only temporary protection. A generative AI paradigm with provable and controllable risk is therefore necessary.

[Figure 1: Overview of security risks in generative model reasoning. These risks arise from the aggregation of risks across three core stages (prompt filtering, alignment, output filtering). From a theoretical perspective, such risks are unavoidable.]
[arXiv:2512.24925v1 [cs.CR] 31 Dec 2025]

Current research on provably safe AI remains position oriented (Dalrymple et al., 2024). Recent work on consensus sampling (CS) (Kalai et al., 2025) offers partial theoretical control over output risk. CS considers a model group with s safe models and f unsafe models (s > f). It exploits overlap among output distributions across models to set the acceptance probability for a response of unknown safety. The goal is to make the delivered response more likely to originate from safe models. The aggregated risk of the group admits a provable upper bound. Unlike prior approaches, CS does not define a safe response. It follows the principle that responses supported by more models carry lower risk. This line of work is orthogonal to empirical safety optimization. However, CS relies on frequent abstention to avoid unsafe outputs, reducing utility (Paulus et al., 2025). We further show that adversarial control over unsafe models sharply weakens CS security. Despite a cryptographic upper bound on risk, CS lacks robustness and practicality for deployment.

To address these issues, we propose a new primitive named Reliable Consensus Sampling (RCS). We present a provable safety framework for model groups that includes safety and liveness properties. The definitions draw on classical reliable distributed consensus theory. They enable formal security analysis for model groups. RCS records acceptance probability in real time after sampling failures. After a bounded number of rejections, it enters a trace phase that guarantees eventual delivery of a response. RCS fully avoids abstention. During the trace phase, it reweights model probability distributions to control their influence on the final decision. This design tolerates more extreme adversarial behavior and improves robustness.
In addition, inspired by quantum entanglement (Nielsen and Chuang, 2010), we introduce a feedback algorithm that captures correlations among model distributions. The algorithm identifies models that are unsafe for sampling on specific tasks. It improves RCS safety by excluding those models from the group decision. We prove that RCS achieves a controllable risk threshold. Extensive experiments show that RCS significantly improves security and utility compared with CS. Latency remains comparable to CS. We summarize our contributions as follows:

• We define a security property theory for model groups under a Byzantine threat model. The theoretical framework supports provable security analysis.
• We propose RCS, a trace-based method that eliminates abstention and guarantees eventual delivery. It tracks acceptance probabilities in real time, reweights models to mitigate adversarial influence, and uses a feedback module that exploits cross-model correlations to improve safety.
• We theoretically prove that RCS admits a tight upper bound on risk. Experiments demonstrate that RCS outperforms CS in robustness and utility while maintaining comparable latency.

2 Preliminary Analysis

Notation. We follow the notation established in (Kalai et al., 2025). For a prompt x, each model i ∈ {1, …, n} induces a probability distribution p_i(y) over an output space Y. Let Distr(Y) denote the set of probability distributions on Y. Y is the union of the safe space S and the unsafe space U. For any p ∈ Distr(Y) and any subset H ⊆ Y, we define the cumulative probability p(H) = Σ_{y∈H} p(y). Given a collection of distributions (p_1, …, p_n) ∈ Distr(Y)^n, let p_(i)(y) denote the i-th smallest of the probabilities {p_t(y)}_{t=1}^n. We use the symbol ⊥ ∉ Y to denote an abstention.

Output Distribution.
Because the model's output is terminated either by special tokens or by a maximum token limit, for a finite tokenizer, we follow existing work (Chijiwa et al., 2025) and treat the ostensibly unbounded output distribution as a finite set to facilitate analysis. Formally, for model i with maximum token length L and tokenizer T_i, the output space is Y_i = ∪_{j=1}^{L} T_i^j.

Model Group. The model group (MG) consists of n generative models, including s safe models and f = n − s unsafe models. Each safe model should maintain a reasonable probability distribution for any input prompt x. The probability that a safe model outputs a safe response is Ψ > 0. Unsafe models are assumed to be fully controllable by an adversary and may exhibit arbitrary unsafe behaviors or follow any probability distribution; we refer to such a model as a Byzantine model, based on the definition of Byzantine replicas in classical BFT consensus research (Zhang et al., 2023; Duan et al., 2024; Das et al., 2024).

3 Reliable Consensus Sampling

3.1 Safety Properties

We define the safety properties in MG based on the definitions from reliable distributed systems (Cachin et al., 2011).

Safety. Let f < ⌈n/2⌉. At all times, the risk of the model group satisfies q(U) ≤ n·μ(U) + negl(λ), where μ(U) is the average risk of generating an unsafe response y ∈ U by the s safe models and λ is a security parameter. Safety requires that the risk contributed by unsafe models is reduced to the risk of safe models plus negl(λ).

Liveness. The liveness property requires that, for any time t, there remains the hope that the property will be satisfied at some later time t′ ≥ t. The mechanism must still retain the possibility of eventually delivering a usable response y ∈ Y, although it may be risky. Liveness reflects the utility of MG.

Anti-Collusion. MG must tolerate arbitrary behavior from up to f Byzantine models. Such behavior includes full control over the probability distribution of y ∈ Y. This control may assign low probability to safe responses.
It may also assign high probability to unsafe responses. These behaviors should not significantly affect the safety of the response delivered by MG.

Half-Resilience. In real-world deployments, the values of s and f may change over time. We consider a time t at which m out of the s safe models become unsafe. The MG system then transitions to a new state with s′ = s − m and f′ = f + m. In this state, the condition f < ⌈n/2⌉ no longer holds. We do not require MG to preserve theoretical safety or liveness. Instead, we require the system to remain practically robust and avoid catastrophic failure. In this paper, we focus on the case of f = s.

Termination. The algorithm is guaranteed to complete within a finite time T.

3.2 Methodology

Our proposed RCS scheme is shown in Algorithm 1. During the R rounds of sampling, in each round, if the candidate response y is not accepted, we record y and its acceptance probability σ(y) in real time. When all R sampling rounds fail, the protocol enters the trace phase. In this phase, we select the top min(s, R) values of σ(y) and collect the corresponding y into the set F. For each y ∈ F, we compute α(y) as the sum of the largest n − s values in {p_j(y)}_{j=1}^n, which have already been obtained during the computation of σ(y). We then select the y with the largest α(y) in order to mitigate the impact of Byzantine models, while maintaining the liveness guarantee.

Algorithm 1: Reliable Consensus Sampling
Input: number of models |MG| = n; number of safe models s; rounds R; distributions p_1, …, p_n ∈ Distr(Y)^n
Output: response y ∈ Y
1: Buffer ← ∅
2: for r ← 1 to R do
3:     sample y ~ (1/n) Σ_{i=1}^n p_i
4:     σ(y) = [(1/s) Σ_{i=1}^s p_(i)(y)] / [(1/n) Σ_{i=1}^n p_i(y)]
5:     if y is accepted with probability σ(y) then
6:         return y
7:     Buffer ← Buffer ∪ {⟨y, σ(y)⟩}
8: sort Buffer = {⟨y_i, σ(y_i)⟩}_{i=1}^R so that σ(y_(1)) ≥ … ≥ σ(y_(R))
9: let F = {y_(1), …, y_(u)}, u = min(s, R)
10: for i ← 1 to u do
11:     α(y_(i)) = Σ_{j=s+1}^n p_(j)(y_(i))
12: return y ← argmax_{y∈F} α(y)
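As a concrete illustration, the sampling loop and trace phase of Algorithm 1 can be sketched in Python over toy discrete distributions. This is a minimal sketch, not the authors' implementation: `dists` is a hypothetical list of per-model `{response: probability}` dicts, and the helper names (`rcs`, `order_stats`, `sigma`) are our own.

```python
import random

def rcs(dists, n, s, R, rng=random):
    """Sketch of Reliable Consensus Sampling over toy discrete distributions.

    dists: list of n dicts mapping each response y to p_i(y).
    s: assumed number of safe models; R: number of sampling rounds.
    """
    support = sorted(set().union(*[set(d) for d in dists]))

    def order_stats(y):
        # Ascending order statistics p_(1)(y) <= ... <= p_(n)(y).
        return sorted(d.get(y, 0.0) for d in dists)

    def sigma(y):
        # sigma(y) = mean of the s smallest p_i(y) / mean of all p_i(y).
        p = order_stats(y)
        mean_safe = sum(p[:s]) / s
        mean_all = sum(p) / n
        return mean_safe / mean_all if mean_all > 0 else 0.0

    # Mixture (1/n) sum_i p_i used to draw candidate responses.
    mixture = [sum(d.get(y, 0.0) for d in dists) / n for y in support]
    buffer = []
    for _ in range(R):
        y = rng.choices(support, weights=mixture)[0]
        a = sigma(y)
        if rng.random() < a:          # accept y with probability sigma(y)
            return y
        buffer.append((y, a))         # record rejected candidates in real time

    # Trace phase: keep the top min(s, R) candidates by sigma, then deliver
    # the one maximizing alpha(y) = sum of the n - s largest probabilities.
    buffer.sort(key=lambda t: t[1], reverse=True)
    candidates = [y for y, _ in buffer[: min(s, R)]]
    return max(candidates, key=lambda y: sum(order_stats(y)[s:]))
```

Because the trace phase always returns some recorded candidate, the sketch never abstains, mirroring the paper's claim that RCS eliminates ⊥ entirely.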
We present a rigorous theoretical proof of the safety property of RCS. The overlap of the output probability distributions of the s safest models, that is, their degree of consensus, is computed as Z = Σ_y (1/s) Σ_{i=1}^s p_(i)(y) (Kalai et al., 2025). We denote by Pr[T] = (1 − Z)^R the probability of entering the trace phase, and by ρ(U) the probability that the trace phase eventually outputs an unsafe response. Let R = poly(λ). Then the risk of RCS satisfies:

q(U) = (1 − Pr[T]) · (1/Z) Σ_{y∈U} (1/s) Σ_{i=1}^s p_(i)(y) + Pr[T] · ρ(U)
     ≤ ZR · (1/Z) Σ_{y∈U} (1/s) Σ_{i=1}^s p_(i)(y) + Pr[T] · ρ(U)
     = R · Σ_{y∈U} (1/s) Σ_{i=1}^s p_(i)(y) + Pr[T] · ρ(U)
     ≤ R · (1/s) Σ_{i=1}^s p_(i)(U) + Pr[T] · ρ(U)
     = R · μ(U) + (1 − Z)^R · ρ(U)

(the second line uses 1 − Pr[T] = 1 − (1 − Z)^R ≤ ZR). In MG, at least s − f values of p_i(y) satisfy p_i(y) > 0, so Z > 0. It follows that (1 − Z)^R = e^{R ln(1−Z)}. Since ln(1 − Z) < 0, ρ(U) ∈ (0, 1), and R = poly(λ), we have (1 − Z)^R · ρ(U) = negl(λ). Therefore, q(U) ≤ R·μ(U) + negl(λ). We set the upper bound as n = kR + b, where k, b > 0 are constants. Under this choice, for any R ≤ n, q(U) ≤ n·μ(U) + negl(λ) holds. In fact, the above proof involves two equality conditions, which are difficult to satisfy simultaneously. The first condition is R = 1; this can be enforced by design. The second condition requires that for all y ∈ U, p_i(y) ≤ p_j(y) for all i ∈ S, j ∉ S; this is extremely difficult to satisfy in practice. Therefore, in general, we have q(U) < n·μ(U) + negl(λ). As R → +∞ (without considering termination), we have Pr[T] → 0, indicating that MG delivers a response with probability approaching unity. In this asymptotic regime, the condition q(U) ≤ n·μ(U) remains valid, thereby ensuring the safety property.

3.3 Analysis

We provide a comprehensive comparison between RCS and the existing CS method in Table 1. Below, we analyze each property in detail.

Latency. Let I denote the time required to sample y from a distribution p_i.
In RCS, if no sample is accepted after R rounds, the algorithm incurs an additional cost of O(n² log n) for the trace computation. This cost is negligible compared with I, especially for reasoning LLMs (Chen et al., 2025). Therefore, RCS and CS have comparable time complexity.

Anti-Collusion. The goal of collusion is to make MG more likely to deliver unsafe responses by manipulating one or more Byzantine models. We study the relation between the acceptance probabilities of an unsafe response y_t and a safe response y_v within the R-round loop of Algorithm 1. We focus on the quantity ∆(σ) = σ(y_t) − σ(y_v). The detailed derivation appears in Appendix B. The sign of ∆(σ) depends on:

D_{t,v} = [Σ_{i=1}^s p_(i)(y_t) · Σ_{i=s+1}^n p_(i)(y_v)] / [Σ_{i=1}^s p_(i)(y_v) · Σ_{i=s+1}^n p_(i)(y_t)]

When the Byzantine models assign a very low probability to y_v, the value of Σ_{i=1}^s p_(i)(y_v) decreases, leading to D_{t,v} > 1. As a result, ∆(σ) > 0, which significantly increases the acceptance probability of the unsafe response y_t relative to y_v. This effect weakens the security of MG. When unsafe models collude, we present the resulting model probability distributions in Figure 2. The adversarial behavior of unsafe models clearly has a significant impact on the probability overlap among models in CS. In contrast, the design in lines 9–12 of Algorithm 1 effectively mitigates this issue. When the values of p_j^{unsafe}(y_v) are very low, RCS can significantly downweight their influence on the final output response.

Half-Resilience. The CS algorithm has an abstention bound, meaning Pr[y = ⊥] does not exceed a threshold when f < ⌈n/2⌉. However, if this condition is violated, the CS abstention bound is affected, leading to an increased probability of abstention and a significant impact on liveness. RCS avoids this issue. We further validate the half-resilience of RCS in the experiments in Section 5.

Termination. According to Algorithm 1, for a finite R, RCS always terminates within R rounds.
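The role of the collusion diagnostic D_{t,v} from the anti-collusion analysis can be checked numerically. The following toy computation is ours, with all probability values invented for illustration; it shows that when Byzantine models suppress their probability for the safe response y_v, D_{t,v} crosses above 1.

```python
def collusion_ratio(p_t, p_v, s):
    """Toy evaluation of D_{t,v} for an unsafe response y_t and safe y_v.

    p_t, p_v: per-model probabilities for y_t and y_v; s: number of
    assumed-safe models. Helper name and inputs are illustrative only.
    """
    pt, pv = sorted(p_t), sorted(p_v)  # ascending order statistics
    num = sum(pt[:s]) * sum(pv[s:])    # smallest-s of y_t, largest n-s of y_v
    den = sum(pv[:s]) * sum(pt[s:])    # smallest-s of y_v, largest n-s of y_t
    return num / den

# Honest-looking setting: safe models favor y_v, so D_{t,v} < 1.
honest = collusion_ratio([0.01, 0.02, 0.9, 0.95], [0.6, 0.7, 0.05, 0.1], s=2)
# Collusion: Byzantine models suppress y_v, shrinking the smallest p_(i)(y_v).
collude = collusion_ratio([0.01, 0.02, 0.9, 0.95], [1e-6, 1e-6, 0.6, 0.7], s=2)
```

With these toy numbers, `honest` comes out below 1 and `collude` far above 1, matching the analysis that a suppressed Σ_{i=1}^s p_(i)(y_v) drives ∆(σ) > 0.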
4 Feedback-Optimized Reliable Consensus Sampling

For CS and RCS, the framework does not require an explicit definition of security properties. This design allows the algorithms to remain applicable over time. From this perspective, for each concrete task, one cannot determine the probability that the final delivered result is safe, even though the algorithmic risk admits an upper bound. External control over the sampling process is limited. This implies that, from both an algorithmic and a long-term operational perspective, if the MG remains unchanged and the input task types are essentially fixed, the risk in RCS remains stationary over time, thereby limiting the potential for dynamic optimization. In this section, we analyze the nature of response safety in RCS. Inspired by quantum computing (Nielsen and Chuang, 2010), we propose a research methodology for RCS based on quantum states. We further introduce a mechanism that intervenes in model distributions and dynamically improves safety.

4.1 Foundational Theory

Motivated by quantum computation theory, we propose a framework for studying the security of RCS by analogy with quantum theory.

Understanding RCS Safety from a Quantum State Perspective. We model a system with unknown safety as a quantum state |φ⟩, defined as |φ⟩ = α|0⟩ + β|1⟩, where |0⟩ denotes a safe state and |1⟩ denotes an unsafe state. The constraint α² + β² = 1 holds. Before evaluation, |φ⟩ does not reside in a definite safe or unsafe state. It remains in a superposition, analogous to Schrödinger's cat. The value |α|² denotes the probability of safety, while |β|² denotes the probability of unsafety. Given a fixed prompt q and an evaluation method e, we apply Eval(q, |φ⟩, e). The state |φ⟩ then collapses to |0⟩ or |1⟩. Only at this stage can one determine safety under q and e. Therefore, |φ⟩ alone cannot be labeled as safe or unsafe. The state becomes concrete only under the joint action of q and e. The desired condition is |α| > |β|, meaning a stronger bias toward the safe state.

Table 1: A comprehensive comparison between our proposed RCS and the existing method.

Algorithm | Safety | Liveness | Anti-Collusion | Half-Resilience | Termination | Time Complexity
Consensus Sampling | ✓ | ✗ | ✗ | ✗ | ✓ | O(RI)
Reliable Consensus Sampling | ✓ | ✓ | ✓ | ✓ | ✓ | [O(RI), O(RI + n² log n)]

[Figure 2: Model probability distribution for consensus sampling under diverse adversarial environments. (a) Distribution under general adversarial conditions. (b) Distribution under malicious collusion.]

Entanglement Theory for RCS. Consider an MG with n models |φ_1⟩, |φ_2⟩, …, |φ_n⟩. In each sampling round, the randomly selected model |φ_r⟩ remains in a superposition of safety and unsafety. However, the response y produced by |φ_r⟩ implicitly links |φ_r⟩ with other models. In MG, there exists a subset W = {|φ_1⟩, |φ_2⟩, …, |φ_ℓ⟩} that assigns a high probability to y. There also exists a subset X = {|φ_ℓ+1⟩, |φ_ℓ+2⟩, …, |φ_n⟩} that assigns a low probability to y. The model |φ_r⟩ belongs to neither W nor X. We say that W and X are entangled. After a time interval ∆(t), suppose evaluation under q and e verifies that |φ_r⟩ is unsafe. Then models in W are likely unsafe, while models in X are likely safe. In subsequent sampling, when a prompt x and an evaluation a resemble q and e, the algorithm should increase the weights of models in X and decrease the weights of models in W.
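The weight-adjustment rule at the end of Section 4.1 (after an unsafe verdict, down-weight the entangled set W that favored the flagged response and up-weight the set X that did not) can be sketched as follows. This is a minimal sketch under stated assumptions: the median split between W and X, the `boost` factor, and all names (`feedback_update`, `probs_by_model`) are illustrative choices of ours, not the paper's specification.

```python
def feedback_update(weights, probs_by_model, flagged_response, boost=2.0):
    """Hedged sketch of the entanglement-style feedback reweighting.

    weights: {model_name: sampling_weight}.
    probs_by_model: {model_name: {response: probability}}.
    flagged_response: a response later judged unsafe.
    """
    names = sorted(weights)
    p = [probs_by_model[m].get(flagged_response, 0.0) for m in names]
    median = sorted(p)[len(p) // 2]  # assumed split between W and X
    new_weights = {}
    for m, pm in zip(names, p):
        if pm > median:
            # Entangled with the unsafe response (set W): down-weight.
            new_weights[m] = weights[m] / boost
        else:
            # Assigned low probability to it (set X): up-weight.
            new_weights[m] = weights[m] * boost
    total = sum(new_weights.values())
    return {m: w / total for m, w in new_weights.items()}  # renormalize
```

A model whose weight decays toward zero under repeated unsafe verdicts is effectively excluded from sampling, which is the hard-exclusion behavior F-RCS implements in Algorithm 2.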
4.2 F-RCS Algorithm

Based on the foundational theory in Section 4.1, we construct an optimization algorithm for RCS, named Feedback-Optimized Reliable Consensus Sampling (F-RCS). In each sampling round, given the delivered response y and the generating model |φ_r⟩, the algorithm identifies two additional models by a function F(·) (see Algorithm 2). One model, |φ_max⟩, attains the maximum p(y); another, |φ_min⟩, attains the minimum p(y). If y is judged unsafe at time t, then for subsequent similar tasks, |φ_max⟩ is removed during random sampling. Formally, MG = MG \ {|φ_max⟩}. This allows the system, at the beginning of each algorithm execution, to make an automated decision: it selects safer models for sampling on x. This process increases the safety margin s − f and reduces risk. Here, |φ_max⟩ can equal |φ_r⟩. Initially, |φ_r⟩ assigns the highest probability to y, which aligns with the algorithm objective. We call this behavior of model |φ_r⟩ self-entanglement. In addition, during sampling, the execution of the feedback algorithm is independent of RCS itself, and therefore does not affect the safety of RCS. As for time overhead, the additional cost introduced by the feedback algorithm is negligible.

[Figure 3: Evaluation results for the safe rate when f < ⌈n/2⌉; panels (a) n = 33, (b) n = 65, (c) n = 129.]
[Figure 4: Evaluation results for the latency when f < ⌈n/2⌉; panels (a) n = 33, (b) n = 65, (c) n = 129.]

Algorithm 2: F-RCS Algorithm
Input: parameters (n, s, R); evaluation method a
Output: response y
1: for each prompt x do
2:     |φ_r⟩ ← F(x, MG)
3:     MG′ ← MG \ {|φ_r⟩}
4:     y, P ← RCS(MG′, x, s, R)    # P = {p_{|φ_1⟩}(y), …, p_{|φ_n⟩}(y)}
5:     # Identify entangled models.
6:     |φ_max⟩ ← argmax_{|φ_i⟩} p_{|φ_i⟩}(y)
7:     # Record ⟨|φ_max⟩, x, y, |φ_r⟩⟩.
8:     return y
9: while time t ∈ ℝ⁺ do
10:     if |1⟩ ← Eval(x, |φ_r⟩, a) then
11:         F ← Update(F, |φ_max⟩, x)
12:     t ← t + ∆(t)

5 Experiments

Although Section 3 presents a theoretical analysis that validates the advantages and effectiveness of our method, we still conduct extensive experiments to demonstrate its practical performance.

5.1 Experimental Setup

Parameter Settings. We follow cryptographic practice (Ruan et al., 2025; Couteau et al., 2022) in selecting the security parameter λ and set λ = 2^d, d ∈ Z. We choose R = λ + 1. The values of s and f are determined according to the specific type of experiment, and we set n = kR + b. From a practical perspective, the relationship between s and n is unknown. The value of s needs to be specified when deploying the algorithm; s does not reflect the actual number of safe models in the MG. To cover general practical scenarios, we adopt the weakest security assumption and set s = ⌈(n+1)/2⌉. This setting is similar to the widely used 3f + 1 = n in distributed systems research (Zhang et al., 2023; Zhang and Duan, 2022).

Datasets and Models.
To reflect adversarial conditions in real-world scenarios, we construct the probability distribution of the safe model using the output distributions of Qwen2.5-7b-instruct, Qwen2.5-0.5b-instruct (Yang et al., 2025), and Qwen3Guard-Gen-8B (Zhao et al., 2025) on safety evaluation datasets. The datasets include HarmBench (Mazeika et al., 2024) and AdvBench (Zou et al., 2023). We generate the perturbed probability distributions of the unsafe models by referring to the probability distribution of the safe model. This process simulates f Byzantine models that an adversary can fully control.

Evaluation Protocol. To reduce randomness, we report experimental results obtained from 8,000 repeated experiments.

[Figure 5: Evaluation results for the safe rate when Byzantine models collude; panels (a) n = 33, (b) n = 65, (c) n = 129.]

[Figure 6: Evaluation results for the latency when Byzantine models collude; panels (a) n = 33, (b) n = 65, (c) n = 129.]

5.2 Evaluation Metrics

In this work, we do not define specific criteria that label a response as safe or unsafe. For example,
under jailbreak attacks, LLMs may output tokens such as "Sorry" or "Sure" (Wang et al., 2025c). These outputs reflect temporary experimental observations. They do not imply provable security in a theoretical sense. Therefore, our evaluation does not rely on the LLM-as-a-judge paradigm (Wei et al., 2025) commonly used in prior jailbreak studies. We define the evaluation metrics as follows. Safe rate (SR) denotes the proportion of final responses produced by the safe models. Abstention rate (AR) denotes the proportion of cases where the MG returns ⊥ instead of a usable response. Latency denotes the time interval from the start of the sampling to the delivery of the final response. Accuracy measures the precision of the feedback algorithm in F-RCS when identifying unsafe models, that is, the proportion of models detected by the feedback algorithm that are indeed unsafe.

5.3 Experimental Results

We evaluate settings where f < ⌈n/2⌉. The results for large values of n are shown in Figure 3 and Figure 4. Results for smaller values of n are reported in Appendix D. For different values of the security parameter λ, RCS achieves significantly higher safety than CS. RCS maintains comparable overall latency to CS. These results are consistent with the theoretical analysis and proofs in Section 3. CS often leads to abstention, which reduces SR.

We further evaluate the algorithms under Byzantine model collusion. Results for large values of n are shown in Figure 5 and Figure 6.

[Figure 7: Accuracy of the feedback algorithm for different λ and n values under f < ⌈n/2⌉.]

Under collusion, the safety of CS degrades catastrophically compared to the f < ⌈n/2⌉ setting. In contrast, the safety of RCS is only slightly affected. RCS maintains more than 67% safety. This result demonstrates strong resistance to collusion. To assess half-resilience, we also evaluate the case where f = ⌈n/2⌉.
The results are reported in Appendix D. In this case, CS again exhibits clear safety weaknesses and performs significantly worse than RCS. Table 2 reports aggregated results across all values of λ and n. RCS outperforms CS in both SR and AR. RCS also achieves comparable latency to CS. These results further validate the theoretical analysis and proofs. For F-RCS, we evaluate the accuracy of the feedback algorithm in identifying unsafe models. The results are shown in Figure 7 and Appendix D. We show that for different values of λ and n, the average accuracy remains around 90%. This result demonstrates the effectiveness of the feedback algorithm.

Table 2: Comprehensive comparison between our RCS and the baseline across different dimensions. Boldface indicates the best value in each dimension.

Metric | CS: f<⌈n/2⌉ | CS: f=⌈n/2⌉ | CS: Collusion | CS: Avg | RCS: f<⌈n/2⌉ | RCS: f=⌈n/2⌉ | RCS: Collusion | RCS: Avg
Safe Rate (%) | 20.39 | 15.83 | 1.45 | 12.56 | 82.88 | 81.49 | 67.90 | 77.42 (↑ ×5.16)
Abstention Rate (%) | 71.79 | 72.52 | 82.95 | 75.75 | 0.00 | 0.00 | 0.00 | 0.00 (↓ 100%)
Latency (s) | 22.60 | 26.73 | 21.43 | 23.59 | 24.26 | 22.21 | 18.42 | 21.63 (↓ 8.31%)

6 Analysis and Discussion

6.1 Enhancements

Within the F-RCS framework, a general assumption is required. Consider the probability distributions produced by n models over an output y_h with unknown safety. After an arbitrary time interval ∆(t), if y_h is determined to be unsafe, this means that the model state |φ_h⟩ is unsafe. We require that the set of models W that assign high probability to y_h be trustworthy. Specifically, models in W cannot deny the probability distributions reported before ∆(t). They also cannot modify the reported distributions. Models in X cannot forge identities to impersonate models in W. These conditions can be enforced through cryptographic techniques such as blockchain (Li et al., 2023; Androulaki et al., 2018). Such solutions are isomorphic to existing security supervision mechanisms for voting (Yang et al., 2021).
Concrete designs tailored to F-RCS remain an open direction for future work.

For time-cost optimization, we can construct an intermediate algorithm by trading off between CS and RCS. One such example is RCS from local coins (Zhang et al., 2023). After R unsuccessful sampling rounds, the algorithm flips a coin to decide whether to return ⊥ or to enter the trace phase. This design improves efficiency by sacrificing part of the safety. The detailed algorithm is provided in Appendix A.

6.2 Additional Insights on Optimal Safety

Based on experimental results, we observe a clear pattern. When n is fixed, the SR of RCS often exhibits a maximum across different λ. For example, Figure 5 shows a maximum at λ = 8. Both very small and very large values of λ fail to achieve optimal safety. For a fixed n, it remains unclear how to determine whether such a maximum exists and how to theoretically identify the value λ_h at which it occurs.

7 Related Work

Existing research on generative AI security mainly focuses on the safety of model outputs. This line of work studies various attack methods (Li et al., 2024, 2025b) and corresponding defenses (Zhang et al., 2025a). However, these attacks lack a unified definition of security. A representative example is the jailbreak attack (Mehrotra et al., 2024; Zhang et al., 2025b; Yu et al., 2024; Wang et al., 2025a), which aims to induce models to generate unsafe content. Such content covers multiple categories (Guo et al., 2025), including discrimination, illegal behavior, harmful behavior, and ethical issues. Output-level unsafety is inherently subjective, and this subjectivity prevents a deterministic definition under cryptographic theory. Current evaluation of output safety largely relies on LLM-as-a-judge (Li et al., 2025a). This approach depends on specialized evaluator models trained for safety analysis, such as Qwen3Guard (Zhao et al., 2025), which perform semantic analysis on generated content.
However, evaluator models are not interpretable, and their correctness in safety assessment cannot be formally proven. Defenses against these attacks also lack formal guarantees. As a result, research on generative model security lacks a mature theoretical foundation, and this absence of theory leads to inconsistent standards across existing attack and defense studies.

8 Conclusion

In this paper, we formalize a provable security theory for model groups. We introduce RCS, a trace-based method that eliminates abstention and guarantees effective delivery. RCS can tolerate extreme adversarial behavior. We also design a feedback algorithm to improve RCS safety. We prove that RCS has tight upper bounds on response risk. Experiments show that RCS outperforms CS in robustness and utility, achieving a 5× increase in safe rate while maintaining comparable latency.

9 Limitations and Ethical Considerations

In this paper, we employ the safety threshold f < ⌈n/2⌉ from traditional consensus sampling. Although this condition is feasible and justified in real-world deployments, exploring looser thresholds, such as f < ⌈2n/3⌉, remains future work. In addition, our discussion of unsafe models covers various adversarial behaviors of Byzantine models; however, more extreme threats warrant further study. The Byzantine behavior discussed in this paper may be harmful. The methods described in this paper may be used for research purposes only.

References

Elli Androulaki, Artem Barger, Vita Bortnikov, Christian Cachin, Konstantinos Christidis, Angelo De Caro, David Enyeart, Christopher Ferris, Gennady Laventman, Yacov Manevich, Srinivasan Muralidharan, Chet Murthy, Binh Nguyen, Manish Sethi, Gari Singh, Keith Smith, Alessandro Sorniotti, Chrysoula Stathakopoulou, Marko Vukolić, Sharon Weed Cocco, and Jason Yellick. 2018. Hyperledger fabric: a distributed operating system for permissioned blockchains.
In Proceedings of the Thirteenth EuroSys Conference, EuroSys '18, New York, NY, USA. Association for Computing Machinery.

Sarah Ball, Greg Gluch, Shafi Goldwasser, Frauke Kreuter, Omer Reingold, and Guy N. Rothblum. 2025. On the impossibility of separating intelligence from judgment: The computational intractability of filtering for AI alignment. arXiv preprint arXiv:2507.07341.

Christian Cachin, Rachid Guerraoui, and Luís Rodrigues. 2011. Introduction to Reliable and Secure Distributed Programming. Springer Science & Business Media.

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2024. ChatEval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations.

Justin Chen, Swarnadeep Saha, and Mohit Bansal. 2024a. ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7066–7085, Bangkok, Thailand. Association for Computational Linguistics.

Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. 2025. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567.

Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. 2024b. AgentPoison: Red-teaming LLM agents via poisoning memory or knowledge bases. Advances in Neural Information Processing Systems, 37:130185–130213.

Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Shin'ya Yamaguchi, Tomoya Ohba, Tamao Sakao, and Susumu Takeuchi. 2025. Lossless vocabulary reduction for auto-regressive language models. arXiv preprint arXiv:2510.08102.

Geoffroy Couteau, Dahmun Goudarzi, Michael Klooß, and Michael Reichle. 2022. Sharp: Short relaxed range proofs.
In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, CCS '22, page 609–622, New York, NY, USA. Association for Computing Machinery.

David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, et al. 2024. Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems. arXiv preprint arXiv:2405.06624.

Sourav Das, Sisi Duan, Shengqi Liu, Atsuki Momose, Ling Ren, and Victor Shoup. 2024. Asynchronous consensus without trusted setup or public-key cryptography. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, CCS '24, page 3242–3256, New York, NY, USA. Association for Computing Machinery.

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2024. Improving factuality and reasoning in language models through multiagent debate. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org.

Sisi Duan, Haibin Zhang, Xiao Sui, Baohan Huang, Changchun Mu, Gang Di, and Xiaoyun Wang. 2024. Dashing and Star: Byzantine fault tolerance with weak certificates. In Proceedings of the Nineteenth European Conference on Computer Systems, EuroSys '24, page 250–264, New York, NY, USA. Association for Computing Machinery.

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. 2025. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638.

Samyak Jain, Ekdeep S. Lubana, Kemal Oksuz, Tom Joy, Philip Torr, Amartya Sanyal, and Puneet Dokania. 2024. What makes and breaks safety fine-tuning? A mechanistic study. Advances in Neural Information Processing Systems, 37:93406–93478.

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023a.
BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc.

Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. 2023b. Towards mitigating LLM hallucination via self reflection. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1827–1843, Singapore. Association for Computational Linguistics.

Adam Tauman Kalai, Yael Tauman Kalai, and Or Zamir. 2025. Consensus sampling for safer generative AI. arXiv preprint arXiv:2511.09493.

Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, and Huan Liu. 2025a. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2757–2791, Suzhou, China. Association for Computational Linguistics.

Huizhong Li, Yujie Chen, Xiang Shi, Xingqiang Bai, Nan Mo, Wenlin Li, Rui Guo, Zhang Wang, and Yi Sun. 2023. FISCO-BCOS: An enterprise-grade permissioned blockchain system with high-performance. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '23, New York, NY, USA. Association for Computing Machinery.

Linbao Li, Yannan Liu, Daojing He, and Yu Li. 2025b. One model transfer to all: On robust jailbreak prompts generation against LLMs. In The Thirteenth International Conference on Learning Representations.

Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, Tianwei Zhang, and Yang Liu. 2024. BadEdit: Backdooring large language models by model editing. In The Twelfth International Conference on Learning Representations.

Yexiang Liu, Jie Cao, Zekun Li, Ran He, and Tieniu Tan. 2025.
Breaking mental set to improve reasoning through diverse multi-agent debate. In The Thirteenth International Conference on Learning Representations.

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24), pages 1831–1847, Philadelphia, PA. USENIX Association.

Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. 2025. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460.

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org.

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. 2024. Tree of attacks: Jailbreaking black-box LLMs automatically. In Advances in Neural Information Processing Systems, volume 37, pages 61065–61105. Curran Associates, Inc.

Michael A. Nielsen and Isaac L. Chuang. 2010. Quantum Computation and Quantum Information. Cambridge University Press.

Anselm Paulus, Ilia Kulikov, Brandon Amos, Rémi Munos, Ivan Evtimov, Kamalika Chaudhuri, and Arman Zharmagambetov. 2025. Safety alignment of LMs via non-cooperative games. arXiv preprint arXiv:2512.20806.

Wenqiang Ruan, Xin Lin, Ruisheng Zhou, Guopeng Lin, Shui Yu, and Weili Han. 2025. Hawkeye: Statically and accurately profiling the communication cost of models in multi-party learning. In Proceedings of the 34th USENIX Conference on Security Symposium, SEC '25, USA. USENIX Association.

Hao Wang, Hao Li, Junda Zhu, Xinyuan Wang, Chengwei Pan, Minlie Huang, and Lei Sha.
2025a. DiffusionAttacker: Diffusion-driven prompt manipulation for LLM jailbreak. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 22193–22205, Suzhou, China. Association for Computational Linguistics.

Shang Wang, Tianqing Zhu, Bo Liu, Ming Ding, Dayong Ye, Wanlei Zhou, and Philip Yu. 2025b. Unique security and privacy threats of large language models: A comprehensive survey. ACM Comput. Surv. Just Accepted.

Yiwei Wang, Muhao Chen, Nanyun Peng, and Kai-Wei Chang. 2025c. Vulnerability of large language models to output prefix jailbreaks: Impact of positions on safety. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3939–3952, Albuquerque, New Mexico. Association for Computational Linguistics.

Zhipeng Wei, Yuqi Liu, and N. Benjamin Erichson. 2025. Emoji attack: Enhancing jailbreak attacks against judge LLM detection. In Forty-second International Conference on Machine Learning.

Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, and Nicholas Carlini. 2024. Privacy backdoors: Enhancing membership inference through poisoning pre-trained models. Advances in Neural Information Processing Systems, 37:83374–83396.

Zhiyuan Weng, Guikun Chen, and Wenguan Wang. 2025. Do as we do, not as you think: The conformity of large language models. In The Thirteenth International Conference on Learning Representations.

Jiashu Xu, Mingyu Ma, Fei Wang, Chaowei Xiao, and Muhao Chen. 2024. Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3111–3126, Mexico City, Mexico. Association for Computational Linguistics.
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115.

Yang Yang, Zhangshuang Guan, Zhiguo Wan, Jian Weng, Hwee Hwa Pang, and Robert H. Deng. 2021. PriScore: Blockchain-based self-tallying election system supporting score voting. IEEE Transactions on Information Forensics and Security, 16:4705–4720.

Zhiyuan Yu, Xiaogeng Liu, Shunning Liang, Zach Cameron, Chaowei Xiao, and Ning Zhang. 2024. Don't listen to me: Understanding and exploring jailbreak prompts of large language models. In 33rd USENIX Security Symposium (USENIX Security 24), pages 4675–4692.

Xiao Zhan, Juan Carlos Carrillo, William Seymour, and Jose Such. 2025. Malicious LLM-based conversational AI makes users reveal personal information. In Proceedings of the 34th USENIX Conference on Security Symposium, SEC '25, USA. USENIX Association.

Haibin Zhang and Sisi Duan. 2022. PACE: Fully parallelizable BFT from reproposable Byzantine agreement. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, CCS '22, page 3151–3164, New York, NY, USA. Association for Computing Machinery.

Haibin Zhang, Sisi Duan, Boxin Zhao, and Liehuang Zhu. 2023. WaterBear: Practical asynchronous BFT matching security guarantees of partially synchronous BFT. In Proceedings of the 32nd USENIX Conference on Security Symposium, SEC '23, USA. USENIX Association.

Lan Zhang, Xinben Gao, Liuyi Yao, Jinke Song, and Yaliang Li. 2025a.
Exploiting task-level vulnerabilities: An automatic jailbreak attack and defense benchmarking for LLMs. In 34th USENIX Security Symposium (USENIX Security 25), pages 2363–2382.

Shenyi Zhang, Yuchen Zhai, Keyan Guo, Hongxin Hu, Shengnan Guo, Zheng Fang, Lingchen Zhao, Chao Shen, Cong Wang, and Qian Wang. 2025b. JBShield: Defending large language models from jailbreak attacks through activated concept analysis and manipulation. In Proceedings of the 34th USENIX Conference on Security Symposium, SEC '25, USA. USENIX Association.

Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. 2025. Qwen3Guard technical report. arXiv preprint arXiv:2510.14276.

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

A RCS from Local Coins

Algorithm 3: Reliable Consensus Sampling from Local Coins
Input: number of models |MG| = n; number of safe models s; round bound R; distributions p_1, ..., p_n ∈ Distr(Y)^n
Output: response y ∈ Y
 1: Buffer ← ∅
 2: for r ← 1 to R do
 3:     sample y ∼ (1/n) Σ_{i=1}^{n} p_i
 4:     σ(y) ← ( (1/s) Σ_{i=1}^{s} p_{(i)}(y) ) / ( (1/n) Σ_{i=1}^{n} p_i(y) )
 5:     if y is accepted with probability σ(y) then
 6:         return y
 7:     Buffer ← Buffer ∪ {⟨y, σ(y)⟩}
 8: c ← Random()            ▷ obtain a local coin, 0 or 1
 9: if c = 0 then
10:     return ⊥
11: else
12:     sort Buffer = ⟨y_i, σ(y_i)⟩_{i=1}^{R} such that σ(y_{(1)}) ≥ ··· ≥ σ(y_{(R)})
13:     let F = {y_{(1)}, ..., y_{(u)}}, u = min(s, R)
14:     for i ← 1 to u do
15:         α(y_{(i)}) ← Σ_{j=s+1}^{n} p_{(j)}(y_{(i)})
16:     return y ← argmax_{y ∈ F} α(y)

B Proof

\begin{align*}
\sigma(y_t) - \sigma(y_v)
&= \frac{\frac{1}{s}\sum_{i \le s} p_{(i)}(y_t)}{\frac{1}{n}\sum_{i=1}^{n} p_i(y_t)}
 - \frac{\frac{1}{s}\sum_{i \le s} p_{(i)}(y_v)}{\frac{1}{n}\sum_{i=1}^{n} p_i(y_v)} \\
&= \frac{\frac{1}{s}\sum_{i \le s} p_{(i)}(y_t)\cdot\frac{1}{n}\sum_{i=1}^{n} p_i(y_v)
       - \frac{1}{s}\sum_{i \le s} p_{(i)}(y_v)\cdot\frac{1}{n}\sum_{i=1}^{n} p_i(y_t)}
       {\frac{1}{n}\sum_{i=1}^{n} p_i(y_t)\cdot\frac{1}{n}\sum_{i=1}^{n} p_i(y_v)} \\
&= \frac{n}{s}\cdot
   \frac{\sum_{i \le s} p_{(i)}(y_t)\left(\sum_{i=1}^{n} p_i(y_v)\right)
       - \sum_{i \le s} p_{(i)}(y_v)\left(\sum_{i=1}^{n} p_i(y_t)\right)}
       {\left(\sum_{i=1}^{n} p_i(y_t)\right)\left(\sum_{i=1}^{n} p_i(y_v)\right)} \\
&= \frac{n}{s}\cdot
   \frac{\sum_{i \le s} p_{(i)}(y_t)\sum_{i > s} p_{(i)}(y_v)
       - \sum_{i \le s} p_{(i)}(y_v)\sum_{i > s} p_{(i)}(y_t)}
       {\left(\sum_{i=1}^{n} p_i(y_t)\right)\left(\sum_{i=1}^{n} p_i(y_v)\right)}
\end{align*}

C Additional Discussion

C.1 RCS and Multi-Agent Debate

Multi-agent debate (MAD) (Du et al., 2024; Chan et al., 2024; Liu et al., 2025) is a reasoning-enhancement technique built on LLM inference. MAD lets multiple agents interact over several rounds, with the goal of reaching consensus on the same answer to a given query. A final decision is then produced through answer-aggregation methods such as majority voting (Du et al., 2024). In terms of output safety, MAD and RCS share a similar motivation: both introduce multiple models to improve the safety of the final output. Some MAD methods also adopt confidence scores to evaluate candidate answers proposed by other models (Chen et al., 2024a), a mechanism conceptually related to RCS. Compared to MAD, RCS follows a more fundamental strategy: it operates directly on the underlying probability distributions. In contrast, MAD requires models to possess an explicit understanding of safety, which limits robustness against unforeseen attacks. Moreover, RCS does not require interaction among models, a design that avoids the conformity (Weng et al., 2025) inherent in LLMs.
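The distribution-level operation of RCS can be made concrete with a minimal sketch of Algorithm 3 (Appendix A). This is our own illustrative Python rendering, not a reference implementation: the assumption that the first s reported distributions are the safe ones, the uniform-mixture proposal sampler, and the use of a standard PRNG for the acceptance and coin draws are all simplifications for exposition.

```python
import random

def rcs_local_coins(dists, s, R, sample_y, rng=None):
    """One run of RCS from local coins over n reported distributions.

    dists: list of n dicts mapping outputs y -> probability; for illustration,
           the first s entries are assumed to be the safe models.
    sample_y: callable drawing y from the uniform mixture (1/n) * sum_i p_i.
    Returns an accepted or traced output, or None (standing in for ⊥).
    """
    rng = rng or random.Random(0)
    n = len(dists)
    buffer = []
    for _ in range(R):
        y = sample_y(rng)
        mix = sum(p.get(y, 0.0) for p in dists) / n        # (1/n) Σ p_i(y)
        safe = sum(p.get(y, 0.0) for p in dists[:s]) / s   # (1/s) Σ p_(i)(y)
        sigma = safe / mix if mix > 0 else 0.0
        if rng.random() < sigma:                           # accept y with prob. σ(y)
            return y
        buffer.append((y, sigma))
    if rng.random() < 0.5:                                 # local coin c
        return None                                        # ⊥
    # trace phase: keep the u highest-σ candidates, score them by the
    # probability mass assigned by the remaining models (α in Algorithm 3)
    buffer.sort(key=lambda t: t[1], reverse=True)
    finalists = [y for y, _ in buffer[: min(s, R)]]
    alpha = {y: sum(p.get(y, 0.0) for p in dists[s:]) for y in finalists}
    return max(finalists, key=alpha.get)                   # line 16 of Algorithm 3

# Toy demo: three assumed-safe models favor "a"; one other model favors "b".
demo = [{"a": 0.9, "b": 0.1}] * 3 + [{"b": 1.0}]

def demo_sampler(rng):
    p = demo[rng.randrange(len(demo))]   # pick a model uniformly, then its output
    return "a" if rng.random() < p.get("a", 0.0) else "b"

result = rcs_local_coins(demo, s=3, R=4, sample_y=demo_sampler)
```

Note that σ(y) may exceed 1, in which case the comparison against a uniform draw accepts y with certainty.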
Operating on underlying probability distributions also mitigates the impact of hallucination (Ji et al., 2023b) during reasoning. However, this design limits practical applicability: in many scenarios, models are accessed as black-box services, and probability distributions are not available. We argue that an intersection between MAD and RCS can improve the practical applicability of RCS. We continue to avoid direct interaction among models. Instead, each model outputs a score for each candidate output y at every round, representing its level of support for y. All other components remain consistent with RCS. We refer to this variant as Practical Reliable Consensus Sampling (PRCS). More broadly, PRCS can be applied beyond safety; it can also enhance model reasoning performance in domains such as mathematics, medicine, and programming (Luo et al., 2025).

D Additional Experimental Results

[Figure 8: Evaluation results for the safe rate when f < ⌈n/2⌉; panels (a)–(c) for n = 5, 9, 17.]

[Figure 9: Evaluation results for the latency when f < ⌈n/2⌉; panels (a)–(c) for n = 5, 9, 17.]

[Figure 10: Evaluation results for the safe rate when Byzantine models collude; panels (a)–(c) for n = 5, 9, 17.]

[Figure 11: Evaluation results for the latency when Byzantine models collude; panels (a)–(c) for n = 5, 9, 17.]

[Figure 12: Evaluation results for the safe rate when f = ⌈n/2⌉; panels (a)–(c) for n = 34, 66, 130.]

[Figure 13: Evaluation results for the latency when f = ⌈n/2⌉; panels (a)–(c) for n = 34, 66, 130.]
[Figure 14: Accuracy of the feedback algorithm for different λ and n values when Byzantine models collude; λ ∈ {1, 2, 4, 8, 16, 32, 64}, n ∈ {5, 9, 17, 33, 65, 129}.]

E Necessity Analysis of Liveness

We further analyze the necessity of the liveness property. In real-world deployments, the goal of MG is not only to guarantee safety but also to accomplish normal inference tasks such as mathematics, medical diagnosis, and programming. The objective is to maintain both functionality and safety, rather than to treat safety as the sole consideration. As a result, in practical applications, MG must be able to perform valid reasoning on a wide variety of prompts. From the perspective of safety, abstention can be used to avoid unsafe responses. For regular inference tasks, however, abstention is not acceptable, as the response ⊥ cannot adaptively provide reasonable outputs for unknown prompts. Therefore, liveness is necessary for MG from a practical-application perspective.
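The liveness argument above can be illustrated with a toy simulation. This is our own construction, not the paper's experimental setup: it models a CS-style sampler that abstains (returns ⊥) once its round budget λ is exhausted, showing how disjoint support between safe and unsafe models forces a nonzero abstention rate that a trace phase would avoid.

```python
import random

def cs_round(dists, s, lam, sample_y, rng):
    """CS-style loop: up to lam accept/reject rounds, abstain (None) on failure."""
    n = len(dists)
    for _ in range(lam):
        y = sample_y(rng)
        mix = sum(p.get(y, 0.0) for p in dists) / n        # mixture mass at y
        safe = sum(p.get(y, 0.0) for p in dists[:s]) / s   # safe-model mass at y
        if mix > 0 and rng.random() < min(1.0, safe / mix):
            return y
    return None  # ⊥: the abstention that RCS's trace phase eliminates

def abstention_rate(dists, s, lam, sample_y, trials=2000, seed=0):
    rng = random.Random(seed)
    misses = sum(cs_round(dists, s, lam, sample_y, rng) is None for _ in range(trials))
    return misses / trials

# Toy setup: 2 safe models concentrated on "safe", 1 unsafe model on "bad".
dists = [{"safe": 1.0}, {"safe": 1.0}, {"bad": 1.0}]

def sampler(rng):
    return rng.choice(["safe", "safe", "bad"])  # uniform mixture of the three

# A "bad" draw has zero safe-model support and is always rejected, so a
# round budget of lam=1 abstains whenever the first draw is "bad" (prob. 1/3).
ar = abstention_rate(dists, s=2, lam=1, sample_y=sampler)
```

Raising λ shrinks the abstention rate geometrically but never to zero, which is why the paper argues that abstention-based designs sacrifice liveness.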