
Paper deep dive

Can LLMs Reason Like Automated Theorem Provers for Rust Verification? VCoT-Bench: Evaluating via Verification Chain of Thought

Zichen Xie, Wenxi Wang

Year: 2026 | Venue: arXiv preprint | Area: cs.SE | Type: Preprint | Embeddings: 65

Abstract

As Large Language Models (LLMs) increasingly assist secure software development, their ability to meet the rigorous demands of Rust program verification remains unclear. Existing evaluations treat Rust verification as a black box, assessing models only by binary pass or fail outcomes for proof hints. This obscures whether models truly understand the logical deductions required for verifying nontrivial Rust code. To bridge this gap, we introduce VCoT-Lift, a framework that lifts low-level solver reasoning into high-level, human-readable verification steps. By exposing solver-level reasoning as an explicit Verification Chain-of-Thought, VCoT-Lift provides a concrete ground truth for fine-grained evaluation. Leveraging VCoT-Lift, we introduce VCoT-Bench, a comprehensive benchmark of 1,988 VCoT completion tasks for rigorously evaluating LLMs' understanding of the entire verification process. VCoT-Bench measures performance along three orthogonal dimensions: robustness to varying degrees of missing proofs, competence across different proof types, and sensitivity to the proof locations. Evaluation of ten state-of-the-art models reveals severe fragility, indicating that current LLMs fall well short of the reasoning capabilities exhibited by automated theorem provers.

Tags

ai-safety (imported, 100%), csse (suggested, 92%), preprint (suggested, 88%)

Links

Open PDF directly →


Full Text

64,260 characters extracted from source content.


Can LLMs Reason Like Automated Theorem Provers for Rust Verification? VCoT-Bench: Evaluating via Verification Chain of Thought

Zichen Xie, Wenxi Wang

Abstract

As Large Language Models (LLMs) increasingly assist secure software development, their ability to meet the rigorous demands of Rust program verification remains unclear. Existing evaluations treat Rust verification as a black box, assessing models only by binary pass or fail outcomes for proof hints. This obscures whether models truly understand the logical deductions required for verifying nontrivial Rust code. To bridge this gap, we introduce VCoT-Lift, a framework that lifts low-level solver reasoning into high-level, human-readable verification steps. By exposing solver-level reasoning as an explicit Verification Chain-of-Thought, VCoT-Lift provides a concrete ground truth for fine-grained evaluation. Leveraging VCoT-Lift, we introduce VCoT-Bench, a comprehensive benchmark of 1,988 VCoT completion tasks for rigorously evaluating LLMs' understanding of the entire verification process. VCoT-Bench measures performance along three orthogonal dimensions: robustness to varying degrees of missing proofs, competence across different proof types, and sensitivity to the proof locations. Evaluation of ten state-of-the-art models reveals severe fragility, indicating that current LLMs fall well short of the reasoning capabilities exhibited by automated theorem provers.

1. Introduction

Rust (Matsakis & Klock, 2014) has emerged as the cornerstone of modern systems programming, adopted by the Linux kernel and major technology infrastructure for its rigorous guarantees of memory safety, concurrency correctness, and high performance.
However, as the software industry increasingly relies on Large Language Models (LLMs) to synthesize complex code (Zhuo et al., 2024; Zhang et al., 2024; Ishibashi & Nishimura, 2024; Yang et al., 2025b; Zhang et al., 2023), subtle errors can be introduced that make the programs functionally incorrect. Consequently, formally verifying the correctness of Rust programs, by mathematically proving that programs satisfy precise specifications under all possible inputs, has become increasingly critical for establishing reliable and trustworthy systems.

University of Virginia. Zichen Xie <graysonxie@virginia.edu>, Wenxi Wang <wenxiw@virginia.edu>

Verus (Lattuada et al., 2024) is a state-of-the-art formal verification framework for Rust programs. A Verus program consists of three components: (1) executable Rust code, (2) specifications that precisely describe intended behavior through preconditions (requires) and postconditions (ensures), and (3) user-provided proof hints, including loop invariants, assertions, and lemma functions, which guide the verifier. Figure 1a presents a running Verus program. Verus operates as an Automated Theorem Proving (ATP) system: it translates the program into SMT formulas and, with the assistance of user-provided proof hints, checks whether the Rust code satisfies the specifications using the SMT solver Z3 (De Moura & Bjørner, 2008).

To the best of our knowledge, existing work on LLMs for Verus focuses exclusively on automatic proof-hint synthesis. Systems such as AlphaVerus (Aggarwal et al., 2024), SAFE (Chen et al., 2024), AutoVerus (Yang et al., 2024), RagVerus (Zhong & Si, 2025), and VeriStruct (Sun et al., 2025) specialize LLMs for generating proof hints, while VeruSAGE (Yang et al., 2025a) evaluates model performance on this task. However, existing approaches treat the verification process as a black box, measuring success solely by a binary outcome: whether the program verifies with the generated proof hints.
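To make the three-part structure of a Verus program concrete, here is a small executable Python analogue of the running example from Figure 1a (replacing the last element of the first vector with all elements of the second). The function name is hypothetical, and Verus's requires/ensures clauses and loop invariants are modeled as plain assertions; this is an illustrative sketch, not the paper's actual Verus code.

```python
def replace_last_with(first, second):
    """Python analogue of the running Verus example: replace the last
    element of `first` with all elements of `second`."""
    # Precondition (Verus `requires`): first must be non-empty.
    assert len(first) >= 1
    result = first[:-1]          # executable code: drop the last element
    for x in second:             # in Verus, loop invariants (proof hints)
        result.append(x)         # would be stated on this loop
    # Postcondition (Verus `ensures`): result is first minus its last
    # element, followed by all of second.
    assert result == first[:-1] + list(second)
    return result
```

The asserts play the role that the specification plays in Verus: they state what must hold, while the loop body is the code whose correctness the verifier would prove for all inputs rather than test on one.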
However, such binary evaluation obscures the central question: Do LLMs actually understand the verification logic, or are they merely exploiting statistical patterns? Judging a model solely by whether a program verifies reveals nothing about how verification was achieved. It cannot distinguish a model that constructs a sound deductive chain from one that succeeds through chance syntactic alignment. To meaningfully assess whether LLMs exhibit reasoning capabilities comparable to automated theorem provers (e.g., the Z3 SMT solver), we must open the black box and examine whether they can construct a Verification Chain-of-Thought (VCoT), the step-by-step deductive process underlying formal verification, analogous to chain-of-thought reasoning in natural language.

arXiv:2603.18334v1 [cs.SE] 18 Mar 2026

Figure 1. (a) The program is about replacing the last element of the first vector (first) with all elements of the second (second). Executable Rust code, specifications, and proof hints are highlighted. The specifications define the precondition (requires) and postcondition (ensures), and verification relies on three parts of proof hints: two blocks of loop invariants and one assertion. (b) Shows the Z3 proof of the running example with high-level, medium-level, and low-level proofs highlighted. The high-level rule unit-resolution (line 491) establishes the property |seq.subrange(s,j,k)| = k − j, while medium-level rules such as monotonicity and rewrite (line 622) prove index bounds. Low-level rules encode trivial reasoning, e.g., refl (lines 653–654), which simply asserts $x775 = $x775.

Opening the verification black box is fundamentally hindered by the nature of the ATP solver. When verification succeeds, Z3 can emit a proof trace, but these traces are designed for machine consumption rather than human understanding.
As shown in Figure 1b, the Z3 proof for our small running example spans 10,003 lines, yet 81.69% of it consists of low-level reasoning, such as trivial equalities like $x775 = $x775. This extreme verbosity creates a semantic gap that obscures the solver's internal reasoning, leaving existing research confined to a results-oriented paradigm that cannot inspect, analyze, or evaluate the underlying verification process.

To bridge this gap, we introduce VCoT-Lift, an LLM-based framework that lifts low-level Z3 solver reasoning into explicit, human-readable Verus verification steps, exposing the solver's internal reasoning as a VCoT and providing transparent ground truth for program verification. Constructing such a VCoT is fundamentally a semantic abstraction problem, raising three core challenges: soundness, completeness, and conciseness. VCoT-Lift addresses these challenges through a four-stage pipeline that enforces soundness via direct interaction with the Verus verifier, improves completeness through an interactive loop between global transformation and targeted completeness checking, and enhances conciseness by principled pruning of trivial and redundant reasoning.

Leveraging VCoT-Lift, we introduce VCoT-Bench, a comprehensive benchmark that moves LLM evaluation beyond binary pass/fail outcomes to a fine-grained analysis of verification reasoning. VCoT-Bench consists of 1,988 VCoT completion tasks constructed by digging structured "proof holes" across the entire VCoT at the level of semantic blocks. The benchmark evaluates model understanding along three orthogonal dimensions: 1) robustness to varying degrees of missing reasoning; 2) competence across proof types; and 3) sensitivity to the position of missing reasoning steps.

We conducted a comprehensive study of ten state-of-the-art LLMs using VCoT-Bench, revealing a substantial gap between LLMs and automated theorem provers.
While LLMs can often complete verification steps given sufficient local context, their performance drops sharply even under modest information loss, exposing a heavy reliance on syntactic scaffolding rather than principled deduction. This weakness is most pronounced in intermediate, connective proof steps that require tracking program state and composing multi-step reasoning, as well as in assertion proofs demanding precise, state-dependent logic.

In summary, this paper makes the following contributions:

• Concept: We propose the new concept of VCoT.
• Tool: We develop VCoT-Lift, the first framework that lifts low-level reasoning into high-level VCoT.
• Benchmark: We introduce VCoT-Bench, the first benchmark for VCoT completion, containing 1,988 tasks stratified by three orthogonal evaluation dimensions.
• Study: We perform a comprehensive study of ten LLMs, uncovering their critical limitations in formal verification.

Figure 2. Overview of VCoT-Lift.

2. VCoT-Lift

In this section, we introduce VCoT-Lift, an LLM-based framework that lifts low-level reasoning from Z3 proofs into explicit Verus-level verification steps, forming a VCoT that exposes how verification unfolds at the program level.

Why use LLMs?
Transforming Z3 proofs into Verus-level verification steps is a semantic abstraction challenge: Z3 proofs are extremely fine-grained, while Verus operates over high-level constructs such as invariants, assertions, and lemmas, making the required semantic compression beyond static pattern matching or deterministic symbolic translation (Böhme & Nipkow, 2010; Blanchette et al., 2013; Mohamed et al., 2025). LLMs can recover this abstraction by jointly analyzing Z3 proofs and Verus programs to infer logical intent and synthesize equivalent high-level verification steps (Li et al., 2023; Liu et al., 2024; Wei et al., 2025; Wen et al., 2024; Li et al., 2022; Jiang et al., 2024; Islam et al., 2024; Zhuo et al., 2024), and thus VCoT-Lift adopts LLMs as its central reasoning engine.

Transformation Challenges. Even with LLMs, transforming Z3 proofs into Verus-level verification steps remains difficult. This process must simultaneously address three fundamental challenges:

• Soundness: The transformed verification steps should faithfully preserve the semantics of the original Z3 proof.
• Completeness: The transformed verification should capture the entire reasoning process of the Z3 proof, with no essential steps omitted.
• Conciseness: The resulting verification steps should be concise, free from trivial or redundant steps.

2.1. Overview

To address these challenges, we present VCoT-Lift as a four-stage pipeline (Figure 2). The pipeline starts with an iterative transformer–checker loop that improves proof completeness: the transformer synthesizes Verus proofs from Z3 proofs, and the checker evaluates their completeness until predefined criteria are met. Next, the Proof Pruner removes trivial or redundant steps to improve conciseness. Finally, Proof Repair enforces soundness by interacting with Verus to detect and fix syntactic and semantic errors. We detail each stage and its interactions in the following.
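The four-stage control flow just described can be sketched in a few lines. Every stage body below is a stand-in stub (the real stages are LLM agents plus the Verus verifier, and the step strings are invented for illustration); only the stage order and the transformer–checker loop follow the paper's description.

```python
def transform(z3_proof, verus_proof):
    # Stand-in for the LLM Proof Transformer: regenerate the Verus-level
    # proof from the whole Z3 proof (here: append one "lifted" step).
    return verus_proof + [f"assert(step_{len(verus_proof)});"]

def is_complete(verus_proof):
    # Stand-in for the checker agents: a binary complete/incomplete
    # signal with no explanation (here: require three lifted steps).
    return len(verus_proof) >= 3

def prune(verus_proof):
    # Stand-in for the Proof Pruner: drop trivial/redundant steps.
    return [s for s in verus_proof if "trivial" not in s]

def repair(verus_proof):
    # Stand-in for Proof Repair: iterate with Verus until it verifies.
    return verus_proof

def vcot_lift(z3_proof, max_rounds=5):
    proof = []
    for _ in range(max_rounds):          # transformer-checker loop
        proof = transform(z3_proof, proof)
        if is_complete(proof):
            break
    return repair(prune(proof))
```

Note that pruning and repair run once, after the loop has converged on a complete proof, mirroring the pipeline order in Figure 2.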
Figure 8 in Appendix A shows the complete VCoT exposed by VCoT-Lift for the running example, including 82 lines of additional lifted Verus proofs, which naturally unfold into five verification phases.

2.2. Proof Transformer

Given a Verus program and its corresponding Z3 proof, the Proof Transformer aims to transform the internal reasoning of the Z3 proof into semantically equivalent Verus proofs. The key challenge is that Z3 proofs are long and dominated by low-level reasoning, with critical logical steps tightly entangled throughout. Disentangling this structure requires a holistic view of the entire proof: because the reasoning is globally interconnected, the transformer cannot safely decompose proofs without losing context and must process them end to end to recover the underlying semantics. However, such long-context comprehension remains a major bottleneck for current LLMs (Li et al., 2024; Dong et al., 2024; Hosseini et al., 2025; Du et al., 2025).

Figure 3. The intermediate VCoT for the running example's second while loop across stages of VCoT-Lift. (1) Shows the Verus program with original proof hints and transformed proofs; after three transformer–checker loops, several assertions and a lemma establishing a key property s.subrange(0, s.len()) == s are introduced. (2) Shows the pruned proofs. (3) Shows the repaired proofs, fixing a syntactic error.

To address this challenge, we introduce a Z3 rule hierarchy that explicitly steers LLM attention toward reasoning steps more likely to be semantically informative. Our classification is based on syntactic cues of proof rules, but is motivated by empirical observations of how often such syntactic patterns tend to surface semantically meaningful reasoning in practice. Concretely, we organize all 36 Z3 proof rules (Z3Prover, 2026) into three importance levels:
• High-level rules (8): frequently correspond to explicit Verus verification steps (e.g., unit-resolution).
• Medium-level rules (12): support intermediate proof construction (e.g., trans and rewrite).
• Low-level rules (16): primarily encode tautologies, syntactic artifacts, or proof metadata (e.g., refl and symm).

Table 2 in Appendix B.1 presents the full categorization. Figure 1b illustrates a Z3 proof of our running example annotated by rule importance. Figure 3, part (1), shows the verification steps transformed from the Z3 proof by the Proof Transformer.

Figure 4. Conceptual visualization of: (1) aggregated Z3 proof segments for the running example, where mp and lemma segments are grouped and passed to the corresponding checkers; (2) the mapping mechanism, where the modus ponens checker identifies a Z3 proof without a corresponding Verus-level proof and marks the transformation incomplete.

2.3. Proof Checker

The Proof Checker is a key component of VCoT-Lift that evaluates the completeness of proofs produced by the Proof Transformer.
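The three-level rule hierarchy from Section 2.2 amounts to a simple lookup over rule names. The sketch below lists only the rules named in the text (the full 8/12/16 categorization is in the paper's Table 2, Appendix B.1), so the medium and low sets here are deliberately incomplete.

```python
# Z3 rule hierarchy sketch. The HIGH set is the full list of eight
# high-level rules given in the paper; MEDIUM and LOW contain only the
# examples mentioned in the text.
HIGH = {"lemma", "th-lemma", "mp", "mp~", "unit-resolution",
        "quant-intro", "quant-inst", "skolemize"}
MEDIUM = {"trans", "rewrite", "monotonicity"}   # examples of the 12
LOW = {"refl", "symm"}                          # examples of the 16

def importance(rule):
    """Map a Z3 proof-rule name to its importance level."""
    if rule in HIGH:
        return "high"
    if rule in MEDIUM:
        return "medium"
    if rule in LOW:
        return "low"
    return "unknown"   # rules not listed in the text
```

Steering the transformer with such a lookup is cheap because it is purely syntactic; the semantic judgment stays with the LLM.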
To improve efficiency and focus on essential reasoning, completeness checking is restricted to high-level Z3 proof rules, verifying only whether critical proof steps are correctly transformed. Because these rules share common reasoning patterns, we group the eight high-level rules into five categories: lemma (lemma), theory lemma (th-lemma), modus ponens (mp, mp~), quantifier (quant-intro, quant-inst, skolemize), and unit resolution (unit-resolution). To enable precise and specialized completeness assessment, we design a dedicated checker agent for each proof-rule category to evaluate the completeness of its corresponding transformations.

Inputs: Aggregated Z3 Proof Segments. To enable a focused completeness check, each checker agent receives only the Z3 proof segments relevant to the proof rules it assesses. VCoT-Lift constructs an abstract syntax tree of the Z3 proof, where each node represents a proof statement, and its subtree contains all steps required to derive that statement, referred to as the rule's dependency. By locating the node for a target proof rule and extracting its subtree, VCoT-Lift precisely isolates the proof segments associated with that rule.

Even high-level proof rules occur frequently in Z3 proofs, and processing each segment independently would require many LLM invocations. Moreover, proof segments, especially within the same category, are often interdependent; for example, one lemma may depend on another, as shown in Figure 4, part (1). Evaluating segments in isolation would thus introduce redundant context and unnecessary LLM calls. To address this issue, VCoT-Lift aggregates all segments from the same proof-rule category and provides them as a single input to the corresponding checker agent. As shown in Figure 4, part (1), the lemma checker receives all lemma rules together with their dependencies in one batch, enabling joint assessment and avoiding redundant processing.
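The subtree extraction and per-category aggregation just described can be sketched over a toy proof tree. The dict-based node representation and field names are illustrative stand-ins, not VCoT-Lift's actual data structures.

```python
def subtree(node):
    """All steps needed to derive `node`: the node plus its dependency
    subtree (the rule's dependency, in the paper's terminology)."""
    out = [node]
    for child in node.get("deps", []):
        out.extend(subtree(child))
    return out

def segments_for(root, rules):
    """Aggregate, for one checker agent, the segments of every proof
    node whose rule belongs to the checker's category `rules`."""
    segments = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node["rule"] in rules:
            segments.append(subtree(node))   # isolate this rule's segment
        stack.extend(node.get("deps", []))
    return segments
```

Passing all segments of one category to a single checker call is what avoids the per-segment LLM invocations described above.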
Checking Mechanism 1: Abstraction Filter. Although essential reasoning is often expressed with high-level proof rules, such rules can also encode trivial or redundant steps. For example, the Z3 proof step unit-resolution @x11036 @x3395 $x11026 uses the high-level rule unit-resolution but merely establishes 0 = 0. A checker agent may mistakenly treat such trivial steps as essential missing logic, causing it to diverge (e.g., fail to terminate) by repeatedly flagging the proof as incomplete based on unnecessary reasoning. To address this issue, VCoT-Lift introduces an abstraction filter that regulates proof granularity and guides the checker to ignore unnecessary reasoning. It distinguishes between trivial and redundant proof steps, both of which can be safely excluded from completeness evaluation.

Trivial steps correspond to local reasoning that the solver can justify using only immediate contextual facts, without invoking additional axioms or unfolding complex definitions. Typical examples include basic arithmetic facts and properties such as reflexivity (e.g., index = index) and commutativity (e.g., ((x + 0) = x) ↔ (x = (x + 0))).

Redundant steps are reasoning steps that add no new semantic information beyond what is already established. We categorize redundancy into three forms: (1) normalization, which rewrites expressions into canonical forms; (2) reassertion, which restates facts already implied by the current context; and (3) definition expansion, which unfolds definitions without advancing the core proof argument.

In our running example, the modus ponens checker initially flags index < index + 1 as a missing proof step, but it is correctly classified as trivial and filtered out by the abstraction filter, allowing the proof to be marked complete.

Checking Mechanism 2: Explicit Mapping. To reduce hallucinations and strengthen reasoning, VCoT-Lift adopts an explicit justification mechanism.
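The abstraction filter's trivial-step test can be illustrated for the two equality patterns named above, reflexivity and commutativity. The real filter is an LLM-guided judgment over many step forms (including arithmetic facts like index < index + 1), so this regex check is only a toy approximation on simple textual renderings.

```python
import re

def is_reflexive(step):
    # x = x: both sides of the equality are textually identical.
    m = re.fullmatch(r"\s*(.+?)\s*=\s*(.+?)\s*", step)
    return bool(m) and m.group(1) == m.group(2)

def is_commutation(step):
    # (a = b) <-> (b = a): the same equality restated with sides swapped.
    m = re.fullmatch(r"\s*\((.+?)\)\s*<->\s*\((.+?)\)\s*", step)
    if not m:
        return False
    left = [s.strip() for s in m.group(1).split("=")]
    right = [s.strip() for s in m.group(2).split("=")]
    return len(left) == len(right) == 2 and left == right[::-1]

def is_trivial(step):
    """Toy version of the abstraction filter's trivial-step criterion."""
    return is_reflexive(step) or is_commutation(step)
```

Steps matching these patterns would be excluded from completeness evaluation so the checker does not keep flagging them as missing logic.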
Instead of issuing a binary completeness verdict, each checker must justify its decision by explicitly mapping each Z3 proof step within its rule category to a semantically equivalent step in the transformed Verus proof. Figure 4, part (2), illustrates this mapping mechanism, where the modus ponens checker explicitly identifies that the step assert(0 <= index <= second@.len()) has no corresponding transformation. This structured, traceable output improves checker reliability and enables rigorous human validation.

Checking Mechanism 3: Specialization. For each checker agent, we describe the semantics and usage of the associated Z3 proof rules and specify the forms of Verus proofs into which they are commonly transformed. For example, for the lemma checker, we explain that lemmas discharge hypotheses and that a lemma proof rule is often transformed into a lemma function, a proof block, or a sequence of assertions.

2.4. Transformer–Checker Loop

VCoT-Lift further addresses extremely long proof contexts with a perform-all-check-partial architecture that deliberately separates responsibilities: the Proof Transformer operates with a global view of the entire Z3 proof, while the checker validates only a targeted subset of high-level, essential proof steps.

To preserve this asymmetry, the checker emits only coarse-grained binary signals (complete or incomplete), without explanations. Because its perspective is inherently partial, providing localized feedback could mislead or constrain the transformer's global reasoning. By acting as lightweight guardrails, the checkers guide transformation without polluting the transformer's context.

This separation improves scalability and strengthens transformation robustness. When the checker signals incompleteness, the transformer reprocesses the entire Z3 proof holistically, naturally recovering additional missing steps, including those beyond the checker's limited scope.
In this way, VCoT-Lift combines efficient partial checking with global regeneration to achieve comprehensive and reliable proof transformation under long-context constraints. In Figure 3, part (1), across three transformer–checker loops, the proofs become progressively more complete.

2.5. Proof Pruner

The transformed Verus proofs may include trivial or redundant steps. For example, handling a % 2 can introduce a trivial assertion assert(2 != 0) to prevent division by zero. While such checks are reasonable at the Z3 level, they are unnecessary in Verus. To improve conciseness, VCoT-Lift introduces a Proof Pruner to remove these trivial and redundant steps.

Building on the abstraction filter's definitions of trivial and redundant steps, the Proof Pruner refines these notions with fine-grained criteria tailored to the three Verus proof types, supported by representative examples. For instance, redundant assertions restate facts already established in nearby proof steps. To preserve rigor, the pruner agent must provide an explicit justification for every removal, ensuring the resulting proofs are both concise and transparent for human validation. As shown in Figure 3, the assertion assert(0 <= index <= second@.len()) is pruned because it is already established by the preceding while loop invariants (line 39); and a proof block (lines 55–61) is removed since it duplicates the preceding assertion (lines 52–53).

Figure 5. Comparison of total proof lines per program between VCoT-Bench-Org and Verus-Bench.

2.6. Proof Repair

Proof Repair is the final stage of the pipeline and ensures the soundness of the transformed verification steps. Because Verus is relatively new and has limited public resources, LLMs often produce syntactic or semantic errors during transformation.
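The pruner's reassertion criterion from Section 2.5 can be modeled as a single pass that drops any proof step whose (whitespace-normalized) text restates an already-established fact. The real pruner is an LLM agent applying per-proof-type criteria and justifying each removal; this sketch only captures the simplest, textual case.

```python
def prune_reassertions(steps):
    """Drop steps that merely restate a fact already established by an
    earlier step (redundancy form (2), reassertion)."""
    established, kept = set(), []
    for step in steps:
        fact = " ".join(step.split())      # normalize whitespace runs
        if fact in established:
            continue                       # redundant: already proved
        established.add(fact)
        kept.append(step)
    return kept
```

Semantic redundancy (a fact implied, rather than restated, by earlier steps) is exactly what this textual pass cannot detect, which is why the paper delegates the judgment to an LLM agent.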
To address this, VCoT-Lift introduces a dedicated repair agent that systematically detects and corrects such errors. To mitigate model unfamiliarity with Verus grammar, we inject expert verification knowledge to guide the repair agent. For example, when multiple type conversions cause misleading error messages, the agent is instructed to explicitly parenthesize each conversion. We also provide explicit grammar rules and curated repair examples. At this stage, the Verus verifier compiles and checks the program with the transformed proofs, and the resulting error messages are fed back to the repair agent, which iteratively refines the proofs until successful verification. As shown in Figure 3, in our running example, Proof Repair corrects a syntactic error for a transformed proof step.

3. VCoT-Bench

Leveraging VCoT-Lift to construct VCoTs as ground truth, we create VCoT-Bench, a benchmark that rigorously evaluates whether LLMs truly understand the verification process.

3.1. Ground-Truth: VCoT-Bench-Org

We build our benchmark on Verus-Bench (Yang et al., 2024), a widely used public benchmark for Verus proof-hint generation. Verus-Bench comprises 150 verified Verus programs drawn from MBPP-DFY-153 (Misu et al., 2024), CloverBench (Stack Overflow, 2024), Diffy (Chakraborty et al., 2021), and verified algorithms from the Verus libraries. Figure 1a shows an example from Verus-Bench (task_id_240).

Figure 6. Composition of VCoT-Bench. (Top-level split: Ratio 58.30%, Type 22.08%, Location 19.62%.)

For each verified program, we extract its Z3 proof and apply VCoT-Lift to lift low-level reasoning into explicit Verus-level verification steps, yielding the program's VCoT. This process produces VCoT-Bench-Org, a ground-truth dataset
As shown in Figure 5, and Figure 9 in Appendix B.2, VCoT-Bench-Org exposes substantially richer verification details than Verus- Bench: on average, it exhibits a 6.5x increase in proof lines (up to 32x), a 13.4x increase in assertions (up to 54x), and a 1.94x increase in lemma functions (up to 9x). 3.2. VCoT Completion Tasks via Semantic Blocks. To assess how LLMs understand the VCoT, we construct VCoT completion tasks by digging “proof holes” in VCoT- Bench-Org at the granularity of semantic blocks. Each semantic block falls into one of three categories: Lemma Blocks, where each independent lemma function forms a block; Invariant Blocks, where all loop invariants associated with the same loop form a block; and Assertion Blocks, where all assertions between two executable code lines are grouped into one block. As shown in Figure 3, Part 1 ⃝, the lemma functionlemma_subrange_all(lines 65–70) forms one semantic block; the loop invariants associated with thewhileloop (lines 38–42) form another; and the assertions after thewhileloop constitute the third block (lines 47–61). 3.3. Three Benchmark Variants With the semantic blocks, we construct the full VCoT-Bench suite with 1,988 VCoT completion tasks. To rigorously eval- uate LLMs’ understanding of the entire verification process, we organize these tasks along three orthogonal dimensions, Ratio, Type, and Location. Detailed statistics of VCoT- Bench are provided in Figure 6. VCoT-Bench-Ratio. This sub-benchmark evaluates an LLM’s ability to recover missing verification steps under varying degrees of information loss, quantified by the ratio of removed semantic blocks. For each program in VCoT- Bench-Org containingNsemantic blocks, we generateN variants by randomly removing between 1 andNblocks and computing the corresponding removal ratios. This pro- cedure yields a total of 1,159 VCoT completion tasks. VCoT-Bench-Type. This sub-benchmark evaluates an LLM’s ability to reconstruct specific proof types. 
For each program, we generate up to three variants by removing all blocks of a given type: Invariant-removal (all loop invariants), Assertion-removal (all inline assertions), and Lemma-removal (all lemma functions). This procedure yields 439 VCoT completion tasks.

VCoT-Bench-Loc. The order of the VCoT is critical. This sub-benchmark evaluates how the position of missing reasoning steps affects a model's ability to recover the underlying logic. We define three removal zones: Front (first 33% of the VCoT), Middle (33%–66%), and End (final 33%). This process yields 390 VCoT completion tasks.

4. Study

Using VCoT-Bench, we conduct a thorough study of ten state-of-the-art LLMs, evaluating their verification capabilities by measuring performance on VCoT completion.

4.1. Study Setup

Subjects. We study ten state-of-the-art LLMs in a zero-shot setting, comprising six proprietary models: GPT-5.2, GPT-5-mini, Claude Sonnet 4.5, Claude Haiku 4.5, Gemini 3, and Gemini 3 Flash; and four open-source models: DeepSeek V3.2 (685B), DeepSeek R1 (685B), Qwen 3 (8B, evaluated under both thinking and non-thinking inference modes), and gpt-oss (20B). All models are run with default settings, with output limits set sufficiently high for all programs. Table 3 in Appendix B.3 summarizes their configuration details.

Evaluation Metrics. To evaluate LLM performance on VCoT completion tasks, we employ three metrics: Syntactic Accuracy (SynAcc), Semantic Accuracy (SemAcc), and Overall Accuracy (Acc). Syntactic correctness is checked by invoking Verus in syntax-only mode (-no-verify). Semantic correctness is evaluated by a specialized LLM-based judge, guided by a detailed protocol with curated examples to align its decisions with human judgment. To validate the reliability of this judge, we conduct a meta-evaluation on 100 stratified tasks sampled from VCoT-Bench.
The judge achieves a 94% agreement rate with author consensus, demonstrating its effectiveness in capturing semantic equivalence.

For overall accuracy, inspired by CodeBLEU (Ren et al., 2020), we define a weighted accuracy metric that emphasizes semantic correctness over syntactic correctness. Each prediction is categorized into one of four levels:

• Level 3: Both semantically and syntactically correct.
• Level 2: Semantically correct but syntactically incorrect.
• Level 1: Syntactically correct but semantically incorrect.
• Level 0: Incorrect in both dimensions.

Overall accuracy is computed as

    Acc = (N(3) + 0.5 · N(2) + 0.25 · N(1)) / N × 100%,    (1)

where N(i) denotes the number of cases at Level i, and N is the total number of cases.

Figure 7. Overall accuracy across proof-removal ratios.

Notes: Due to space constraints, we report and analyze only overall accuracy. Detailed results and discussion of syntactic and semantic accuracy are deferred to Appendix C.

4.2. Performance across Proof-Removal Ratios

Using VCoT-Bench-Ratio, we evaluate LLM performance across varying proof-removal ratios and identify five key findings from the results shown in Figure 7.

F1: Context removal exposes fragile reasoning. Accuracy drops sharply as proof blocks are removed. With only 10% of blocks missing, the best model, Claude Sonnet 4.5, achieves just 71.58% accuracy, while the weakest, gpt-oss, reaches only 32.89%. When all blocks are removed (100%), turning the task into full VCoT construction, performance collapses: Claude Sonnet 4.5 falls to 17.22% and Qwen 3 think to near zero (0.66%).
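The weighted metric of Eq. (1) translates directly into code: Level 3 counts fully, Level 2 (semantic-only) counts half, and Level 1 (syntactic-only) counts a quarter.

```python
def overall_accuracy(levels):
    """Overall accuracy per Eq. (1), given the per-task level (0-3)
    of each prediction."""
    weight = {3: 1.0, 2: 0.5, 1: 0.25, 0: 0.0}
    return 100.0 * sum(weight[lv] for lv in levels) / len(levels)
```

For example, four tasks scored at Levels 3, 3, 2, and 1 give (1 + 1 + 0.5 + 0.25) / 4 × 100% = 68.75%.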
This shows that current LLMs do not reason from first principles but rely heavily on local pattern matching and syntactic scaffolding; once this context is stripped away, they fail to construct complex, multi-step deductive chains.

F2: A 40% removal threshold breaks proof inference. Although accuracy declines overall, the drop is non-uniform. Across all models, performance falls sharply as block removal increases from 10% to 40%, then degrades much more slowly from 40% to 100%. This pattern reflects the loss of critical structural anchors in the verification logic: once these anchors fall below a threshold, logical continuity collapses, and models can no longer infer the proof logic, making the task uniformly difficult regardless of further removal.

F3: Model scale helps. Performance is strongly influenced by model scale: larger models (e.g., proprietary variants and DeepSeek V3.2/R1) substantially outperform smaller ones (e.g., gpt-oss and Qwen 3), underscoring that sufficient parameter capacity is crucial for capturing the intricate logical dependencies of verification.

Table 1. Overall accuracy across proof types and proof locations. The highest and second-highest accuracy are highlighted. The Invariant/Assertion/Lemma columns report VCoT-Bench-Type; the Front/Middle/End columns report VCoT-Bench-Loc.

Model Name        | Invariant | Assertion | Lemma | Front | Middle | End
GPT-5.2           | 54.42     | 34.25     | 55.48 | 58.27 | 41.15  | 65.58
GPT-5 mini        | 23.47     | 32.88     | 36.30 | 49.04 | 36.92  | 60.00
gpt-oss           | 17.69     | 21.58     | 26.20 | 19.81 |  9.81  | 28.65
Claude Sonnet 4.5 | 68.71     | 39.55     | 51.54 | 69.04 | 46.35  | 65.96
Claude Haiku 4.5  | 52.55     | 27.23     | 42.64 | 58.85 | 35.19  | 57.69
Gemini 3          | 62.07     | 22.95     | 42.81 | 62.69 | 36.15  | 46.73
Gemini 3 Flash    | 57.65     | 28.60     | 41.61 | 67.12 | 47.69  | 62.50
DeepSeek V3.2     | 49.49     | 29.45     | 44.35 | 56.73 | 32.31  | 58.85
DeepSeek R1       | 46.60     | 28.94     | 38.53 | 53.46 | 27.88  | 48.08
Qwen 3 think      | 10.03     | 11.47     | 22.43 | 22.88 | 14.81  | 35.58
Qwen 3 non-think  | 26.87     | 13.70     | 23.63 | 33.46 | 16.54  | 37.12

F4: Reasoning variants can hurt performance.
We observe counterintuitive inversions within model families: Qwen 3 non-think generally outperforms its "thinking" variant, and Gemini 3 Flash overall surpasses Gemini 3. This suggests that current reasoning-oriented paradigms can inject noise into verification. In formal proofs, where syntactic precision and solver fidelity are critical, verbose "thinking" can induce hallucinated predicates or semantic drift, while simpler models may benefit from a more direct, lower-entropy pattern-matching approach.

F5: 100% removal accuracy is inflated by small programs. We observe a slight accuracy increase for most models at 100% removal. This is due to the discrete nature of the removal rate: single-block programs appear only at 100%, causing small, easier programs to be overrepresented and inflating accuracy at this extreme.

4.3. Performance across Proof Types

We evaluate LLM performance across proof types using VCoT-Bench-Type and report three key findings based on the results in Table 1.

F1: Assertions are the hardest. Across most models, assertion completion has the lowest accuracy. Even the strongest model, Claude Sonnet 4.5, achieves only 39.55% on assertions, compared to 68.71% on loop invariants and 51.54% on lemma functions. This consistent gap reflects a fundamental weakness: assertions require precise, state-specific reasoning where missing a single condition causes failure, whereas invariants and lemmas allow greater reliance on higher-level structural patterns.

F2: Loop invariants are easiest yet most discriminative. For nearly all models, loop invariant completion achieves the highest accuracy, but also shows the widest spread across models, ranging from 68.71% (Claude Sonnet 4.5) to 10.03% (Qwen 3 think), a 58.68-point gap.
This variance far exceeds that of assertions (28.08 points) and lemma functions (33.05 points), indicating that loop invariants are the most discriminative proof type: they require holistic reasoning over iterative program behavior, clearly separating models with genuine verification capability from those relying on shallow pattern matching.

F3: Model scale helps, but is not decisive. Model size generally correlates with higher accuracy, but the trend is not absolute. For example, GPT-5 mini (23.47%) underperforms Qwen 3 non-think (26.87%) on loop invariants. While proprietary models typically lead, DeepSeek V3.2 matches the performance of Claude Haiku 4.5, showing that well-trained open-source models can narrow the gap.

4.4. Performance across Proof Locations

We evaluate LLM performance across proof locations using VCoT-Bench-Loc and report two key findings based on the results in Table 1.

F1: Connective reasoning in the Middle is the hardest. Across all strong models, performance drops most sharply when missing blocks occur in the Middle of the proof. For example, Claude Sonnet 4.5 falls from 69.04% (Front) to 46.35% (Middle), and Gemini 3 from 62.69% to 36.15%. This consistent decline exposes a shared weakness: while models handle Front blocks that set up constraints and End blocks that follow recognizable closing patterns, they struggle with the Middle, where connective reasoning is required. These intermediate steps demand propagating invariants, tracking state evolution, and composing multi-step deductions, tasks for which local pattern matching offers little support.

F2: Model scale may not benefit the Middle. Larger models perform well on Front and End blocks but do not dominate on Middle blocks. Gemini 3 Flash achieves the highest Middle accuracy (47.69%), outperforming both Gemini 3 (36.15%) and GPT-5.2 (41.15%).
This inversion indicates that Middle-block performance is not driven by model scale; instead, models optimized for low-latency reasoning may better track intermediate verification states, while larger, more verbose models are more prone to semantic drift.

5. Conclusion

This paper introduces VCoT-Lift and VCoT-Bench, which expose solver-level reasoning as explicit VCoTs to rigorously assess whether LLMs can reconstruct the step-by-step logic of Rust verification, revealing a wide gap between current models and automated theorem provers in genuine deductive reasoning. Looking forward, VCoT can go beyond evaluation: it provides a concrete supervision signal for training, diagnosing, and guiding learning-based verification systems toward symbolic alignment, and it reframes the field from asking whether models pass verification to asking whether they can truly reconstruct and reason through the underlying verification logic.

References

Aggarwal, P., Parno, B., and Welleck, S. AlphaVerus: Bootstrapping formally verified code generation through self-improving translation and treefinement. arXiv preprint arXiv:2412.06176, 2024.

Anthropic. Introducing Claude Haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5, October 2025a. [Online; accessed: 11-Jan-2026].

Anthropic. Introducing Claude Sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5, September 2025b. [Online; accessed: 11-Jan-2026].

Blanchette, J. C., Böhme, S., and Paulson, L. C. Extending Sledgehammer with SMT solvers. Journal of Automated Reasoning, 51(1):109–128, 2013.

Böhme, S. and Nipkow, T. Sledgehammer: Judgement day. In International Joint Conference on Automated Reasoning, pp. 107–121. Springer, 2010.

Chakraborty, S., Gupta, A., and Unadkat, D. Diffy: Inductive reasoning of array programs using difference invariants. In International Conference on Computer Aided Verification, pp. 911–935. Springer, 2021.
Chen, T., Lu, S., Lu, S., Gong, Y., Yang, C., Li, X., Misu, M. R. H., Yu, H., Duan, N., Cheng, P., et al. Automated proof generation for Rust code via self-evolution. arXiv preprint arXiv:2410.15756, 2024.

De Moura, L. and Bjørner, N. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 337–340. Springer, 2008.

Dong, Z., Li, J., Men, X., Zhao, X., Wang, B., Tian, Z., Wen, J.-R., et al. Exploring context window of large language models via decomposed positional vectors. Advances in Neural Information Processing Systems, 37:10320–10347, 2024.

Du, Y., Tian, M., Ronanki, S., Rongali, S., Bodapati, S., Galstyan, A., Wells, A., Schwartz, R., Huerta, E. A., and Peng, H. Context length alone hurts LLM performance despite perfect retrieval. arXiv preprint arXiv:2510.05381, 2025.

Google. Gemini 3. https://deepmind.google/models/gemini/, November 2025a. [Online; accessed: 11-Jan-2026].

Google. Gemini 3 Flash. https://deepmind.google/models/gemini/flash/, December 2025b. [Online; accessed: 11-Jan-2026].

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

Hosseini, P., Castro, I., Ghinassi, I., and Purver, M. Efficient solutions for an intriguing failure of LLMs: Long context window does not mean LLMs can analyze long sequences flawlessly. In Proceedings of the 31st International Conference on Computational Linguistics, pp. 1880–1891, 2025.

Ishibashi, Y. and Nishimura, Y. Self-organized agents: An LLM multi-agent framework toward ultra large-scale code generation and optimization. arXiv preprint arXiv:2404.02183, 2024.

Islam, M. A., Ali, M. E., and Parvez, M. R. MapCoder: Multi-agent code generation for competitive problem solving. arXiv preprint arXiv:2405.11403, 2024.

Jiang, J., Wang, F., Shen, J., Kim, S., and Kim, S.
A survey on large language models for code generation. arXiv preprint arXiv:2406.00515, 2024.

Lattuada, A., Hance, T., Bosamiya, J., Brun, M., Cho, C., LeBlanc, H., Srinivasan, P., Achermann, R., Chajed, T., Hawblitzel, C., et al. Verus: A practical foundation for systems verification. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pp. 438–454, 2024.

Li, C., Liang, J., Zeng, A., Chen, X., Hausman, K., Sadigh, D., Levine, S., Fei-Fei, L., Xia, F., and Ichter, B. Chain of code: Reasoning with a language model-augmented code emulator. arXiv preprint arXiv:2312.04474, 2023.

Li, J., Wang, M., Zheng, Z., and Zhang, M. LooGLE: Can long-context language models understand long contexts? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16304–16333, 2024.

Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.

Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.

Liu, C., Dylan Zhang, S., Ibrahimzada, A. R., and Jabbarvand, R. CodeMind: A framework to challenge large language models for code reasoning. arXiv e-prints, p. arXiv–2402, 2024.

Matsakis, N. D. and Klock, F. S. The Rust language. In Proceedings of the 2014 ACM SIGAda Annual Conference on High Integrity Language Technology, pp. 103–104, 2014.

Misu, M. R. H., Lopes, C. V., Ma, I., and Noble, J. Towards AI-assisted synthesis of verified Dafny methods. Proceedings of the ACM on Software Engineering, 1(FSE):812–835, 2024.

Mohamed, A., Mascarenhas, T., Khan, H., Barbosa, H., Reynolds, A., Qian, Y., Tinelli, C., and Barrett, C.
Lean-SMT: An SMT tactic for discharging proof goals in Lean. In International Conference on Computer Aided Verification, pp. 197–212. Springer, 2025.

OpenAI. Introducing GPT-5.2. https://openai.com/index/introducing-gpt-5-2/, December 2025a. [Online; accessed: 11-Jan-2026].

OpenAI. GPT-5 mini model. https://platform.openai.com/docs/models/gpt-5-mini, August 2025b. [Online; accessed: 11-Jan-2026].

OpenAI. Introducing gpt-oss. https://openai.com/index/introducing-gpt-oss/, August 2025c. [Online; accessed: 11-Jan-2026].

Ren, S., Guo, D., Lu, S., Zhou, L., Liu, S., Tang, D., Sundaresan, N., Zhou, M., Blanco, A., and Ma, S. CodeBLEU: A method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297, 2020.

Stack Overflow. "Rust continues to be the most-admired programming language", 2024. URL https://survey.stackoverflow.co/2024/technology. [Online; accessed: 01-Oct-2025].

Sun, C., Sun, Y., Amrollahi, D., Zhang, E., Lahiri, S., Lu, S., Dill, D., and Barrett, C. VeriStruct: AI-assisted automated verification of data-structure modules in Verus. arXiv preprint arXiv:2510.25015, 2025.

Team, Q. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.

Wei, Y., Duchenne, O., Copet, J., Carbonneaux, Q., Zhang, L., Fried, D., Synnaeve, G., Singh, R., and Wang, S. I. SWE-RL: Advancing LLM reasoning via reinforcement learning on open software evolution. arXiv preprint arXiv:2502.18449, 2025.

Wen, J., Guan, J., Wang, H., Wu, W., and Huang, M. CodePlan: Unlocking reasoning potential in large language models by scaling code-form planning. In The Thirteenth International Conference on Learning Representations, 2024.

Yang, C., Li, X., Misu, M. R. H., Yao, J., Cui, W., Gong, Y., Hawblitzel, C., Lahiri, S., Lorch, J. R., Lu, S., et al. AutoVerus: Automated proof generation for Rust code. arXiv preprint arXiv:2409.13082, 2024.

Yang, C., Neamtu, N., Hawblitzel, C., Lorch, J. R., and Lu, S.
Verusage: A study of agent-based verification for Rust systems. arXiv preprint arXiv:2512.18436, 2025a.

Yang, C., Zhao, Z., Xie, Z., Li, H., and Zhang, L. KNighter: Transforming static analysis with LLM-synthesized checkers. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pp. 655–669, 2025b.

Z3Prover. Z3.Proof. https://z3prover.github.io/api/html/ml/Z3.Proof.html, 2026. [Online; accessed: 26-Jan-2026].

Zhang, F., Chen, B., Zhang, Y., Keung, J., Liu, J., Zan, D., Mao, Y., Lou, J.-G., and Chen, W. RepoCoder: Repository-level code completion through iterative retrieval and generation. arXiv preprint arXiv:2303.12570, 2023.

Zhang, K., Li, J., Li, G., Shi, X., and Jin, Z. CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. arXiv preprint arXiv:2401.07339, 2024.

Zhong, S. C. and Si, X. Towards repository-level program verification with large language models. In Proceedings of the 1st ACM SIGPLAN International Workshop on Language Models and Programming Languages, pp. 27–39, 2025.

Zhuo, T. Y., Vu, M. C., Chim, J., Hu, H., Yu, W., Widyasari, R., Yusuf, I. N. B., Zhan, H., He, J., Paul, I., et al. BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024.

A. Complete VCoT Exposed by VCoT-Lift

Figure 8 shows the complete VCoT exposed by VCoT-Lift for the running example program, which replaces the last element of the first vector (first) with all elements of the second vector (second). The VCoT is composed of five primary reasoning phases.
Phase 1: Preparing sequence properties. Before the first loop, the proof establishes foundational sequence properties needed later by invoking auxiliary lemmas on sequence concatenation and subranges. lemma_subrange_add_whole shows that splitting a sequence into two parts and then concatenating them reconstructs the original sequence, while lemma_seq_add_append_one shows that, during sequence construction, appending an element commutes with concatenation. Together, these lemmas ensure correct splitting and recombination of sequences and provide the algebraic basis for the loop invariants.

Phase 2: Verifying the first loop. The first loop copies all but the last element of first into replaced_list. Its loop invariant maintains that replaced_list equals first@.subrange(0, index). Each iteration appends first[index], and the invariant is preserved by showing that extending the subrange by one element matches the effect of push.

Phase 3: Bridging between the two loops. After the first loop terminates, the proof establishes that replaced_list equals first@.subrange(0, first.len() - 1). This result is then normalized to match the second loop invariant by rewriting it as an addition with an empty subrange of second, namely first@.subrange(0, first.len() - 1).add(second@.subrange(0, 0)).

Phase 4: Verifying the second loop. The second loop appends all elements of second to replaced_list. Its loop invariant states that replaced_list equals first@.subrange(0, first.len() - 1).add(second@.subrange(0, index)). Each iteration extends the subrange of second by one element, and the invariant is preserved by reasoning that push corresponds to subrange extension.

Phase 5: Finalizing the verification. When the second loop finishes with index == second.len(), a final proof step applies the lemma function lemma_subrange_all to prove that second@.subrange(0, second.len()) equals second@. This establishes the postcondition replaced_list@ == first@.subrange(0, first.len() - 1).add(second@), completing the verification.
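For readers who want the executable skeleton behind the five phases, the running example can be sketched in plain (unverified) Rust. The function name and types are our own illustration; the comments mark where the Verus proof hints described in each phase would attach. (Sequence notation like first[..k] ++ second[..i] in the comments stands in for the subrange/add reasoning above.)

```rust
// Plain-Rust sketch of the running example: replace the last element of
// `first` with all elements of `second`. Illustrative only; in the verified
// Verus version, the commented phases carry lemmas, invariants, and asserts.
fn replace_last_with(first: &[i32], second: &[i32]) -> Vec<i32> {
    assert!(!first.is_empty()); // precondition: `first` has a last element

    // Phase 1 (Verus): invoke lemma_subrange_add_whole and
    // lemma_seq_add_append_one to prepare sequence algebra.
    let mut replaced_list = Vec::new();

    // Phase 2: copy all but the last element of `first`.
    // Invariant: replaced_list == first[..index]
    let mut index = 0;
    while index < first.len() - 1 {
        replaced_list.push(first[index]);
        index += 1;
    }

    // Phase 3: here replaced_list == first[..first.len() - 1], rewritten as
    // first[..len - 1] ++ second[..0] to match the next loop's invariant.

    // Phase 4: append all elements of `second`.
    // Invariant: replaced_list == first[..first.len() - 1] ++ second[..index]
    let mut index = 0;
    while index < second.len() {
        replaced_list.push(second[index]);
        index += 1;
    }

    // Phase 5: with index == second.len(), lemma_subrange_all closes the
    // postcondition replaced_list == first[..len - 1] ++ second.
    replaced_list
}

fn main() {
    let out = replace_last_with(&[1, 2, 3], &[9, 9]);
    println!("{:?}", out); // [1, 2, 9, 9]
}
```

Stripping either loop's invariant comment corresponds exactly to an Invariant-removal task in VCoT-Bench; the bridging comment in Phase 3 is the kind of "Middle" connective step the study finds hardest to recover.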
B. Additional Details

B.1. Z3 Proof Rule Hierarchy

Table 2 provides a detailed overview of 36 Z3 proof rules. For each rule, we report its importance level along with a detailed explanation of its semantics and usage.

B.2. VCoT-Bench

We provide additional details comparing the numbers of assertions and lemma functions between VCoT-Bench-Org and Verus-Bench, as shown in Figure 9. As shown in Figure 9a, programs in VCoT-Bench-Org contain substantially more assertions, which are a core component of VCoT because they explicitly expose intermediate verification reasoning. Each program contains an average of 15.47 additional assertions, with a maximum of 56. Moreover, among 150 programs, 145 (96.67%) show an increase in the number of assertions.

Figure 9b compares the numbers of lemma functions, showing that programs in VCoT-Bench-Org also include significantly more lemmas, whereas only a small fraction of Verus-Bench programs contain lemma functions. Each program contains an average of 1.94 additional lemma functions, with a maximum of 9. Furthermore, 123 out of 150 programs (82%) show an increase in the number of lemma functions.

We omit a detailed comparison of loop invariant counts because their increase is marginal (only 1.99%). Unlike assertions and lemma functions, which depend on one another to form a chain of thought, Z3 reasons about loop invariants as a fixed point. As a result, loop invariants act as a summarized representation of a loop, with limited additional verification information to be lifted from the Z3 proof.

B.3. Study Setup

Machine. All our experiments are run on a machine with Ubuntu 22.04.5 LTS, 192-core CPUs, and 512GB RAM. For running open-source models locally, we use NVIDIA RTX PRO 6000 Blackwell GPUs with 96 GB VRAM.

Model Details. Table 3 presents the details of the evaluated LLMs, including ten models with their respective vendors, model checkpoints, and access types (via API or GPU).

C.
Analysis of Syntactic and Semantic Accuracy

In this section, we report and analyze the syntactic accuracy (SynAcc) and semantic accuracy (SemAcc) of LLM performance.

Figure 8. The complete VCoT exposed by VCoT-Lift for the running Verus example with specifications, including original proof hints and the additional proof steps. The VCoT is composed of five primary reasoning phases.

C.1. Performance across Proof-Removal Ratios

Using VCoT-Bench-Ratio, we evaluate LLM performance in both syntax and semantics across proof-removal ratios, and report three key findings based on the results shown in Figure 10.

F1: Proof removal reveals both syntactic and semantic

Table 2. Z3 Proof Rule Hierarchy.

High:
- lemma: Discharges a hypothesis; if a contradiction is derived from an assumption P, the rule concludes ¬P.
- th-lemma: Introduces a theory-specific tautology or conflict clause generated by the solver to justify an inference that is valid within that theory's specific axioms.
- mp: Given a proof of P and a proof that P implies Q (or P = Q), this rule concludes Q.
- mp~: Given a proof of P and a proof that P is equivalent to Q (P ⇐⇒ Q), this rule concludes Q.
- unit-resolution: Performs a resolution step between a multi-literal clause and one or more unit clauses.
- quant-intro: Given an equivalence between two terms f(x) ⇐⇒ g(x), concludes that ∀x. f(x) ⇐⇒ ∀x. g(x).
- quant-inst: Justifies the specialized case where a universal quantifier is removed by replacing the bound variable with a specific term, concluding that (∀x. φ) ⇒ φ[t/x].
- skolemize: Eliminates an existential quantifier ∃x. φ[x] by replacing the bound variable with a fresh constant or function c, yielding the equisatisfiable formula φ[c].

Medium:
- asserted: Introduces an initial formula or axiom provided directly by the user as a premise.
- hypothesis: Introduces a temporary assumption into a local scope to facilitate conditional reasoning.
- and-elim: Derives a specific conjunct from a conjunction; given a proof of (l1 ∧ l2 ∧ ... ∧ ln), concludes a single literal li, representing the elimination of the logical "and" operator.
- not-or-elim: Derives the negation of a specific disjunct from a negated disjunction; given a proof of ¬(l1 ∨ l2 ∨ ... ∨ ln), concludes ¬li, applying De Morgan's laws to extract individual negative literals.
- trans: Given proofs of x = y and y = z, concludes x = z.
- trans*: Chains an arbitrary sequence of equality or equivalence steps (x1 = x2, x2 = x3, ..., x(n-1) = xn) to conclude the direct relation between the first and last terms (x1 = xn).
- monotonicity: Given a1 = b1, ..., an = bn, concludes f(a1, ..., an) = f(b1, ..., bn).
- der: Simplifies a formula by eliminating a universally quantified variable x when the body contains a conjunct x = t (where x ∉ t), applying the variable substitution φ[t/x].
- rewrite: Represents a single step of the simplifier where t1 is transformed into t2 such that ⊢ t1 = t2.
- rewrite*: Collapses a series of individual simplifications and algebraic reductions into a single proof object, concluding that the initial term is equivalent to the final fully-reduced form.
- def-axiom: Introduces a formula that defines a new constant or function symbol.
- apply-def: Replaces a defined symbol with its corresponding definition.

Low:
- true: Concludes the logical constant 'true'.
- =/~: Justifies basic term equality or equivalence; concludes that two terms are identical or logically equivalent (t ⇐⇒ t or t = t).
- iff-true: Simplifies a formula by concluding P ⇐⇒ ⊤ given a proof of P.
- iff-false: Simplifies a formula by concluding P ⇐⇒ ⊥ given a proof of ¬P.
- goal: Represents the final formula to be proven.
- refl: Concludes t = t for any term t, serving as the fundamental base case for all equality-based reasoning.
- symm: Given a proof of t1 = t2, concludes t2 = t1.
- commutativity: Given a term f(x, y), concludes f(x, y) = f(y, x) for operations like addition, multiplication, or logical disjunction/conjunction.
- pull-quant: Given a formula where a quantifier is nested, such as (∀x. φ) ∨ ψ (where x is not free in ψ), concludes the equivalence (∀x. φ) ∨ ψ ⇐⇒ ∀x. (φ ∨ ψ).
- push-quant: Given a formula like ∀x. (φ ∧ ψ), concludes the equivalence ∀x. (φ ∧ ψ) ⇐⇒ (∀x. φ) ∧ (∀x. ψ).
- elim-unused: Concludes the equivalence (∀x. φ) ⇐⇒ φ or (∃x. φ) ⇐⇒ φ when x is not a free variable in φ.
- distributivity: Justifies the expansion or factoring of terms where one operator distributes over another; most commonly it validates identities like a × (b + c) = (a × b) + (a × c) in arithmetic, or p ∧ (q ∨ r) ⇐⇒ (p ∧ q) ∨ (p ∧ r) in boolean logic.
- nnf-pos: Validates the transformation of a formula into Negation Normal Form (NNF) under a positive context.
- nnf-neg: Validates the transformation of a formula into Negation Normal Form (NNF) under a negative context.
- iff-oeq: Justifies the equivalence between a formula P and its boolean reduction based on the current context.
- def-intro: Justifies the introduction of a new boolean constant p as a proxy for a complex formula φ.

Table 3. Detailed Configurations for the Evaluated LLMs

Vendor    | Model Name                           | Checkpoint                 | Type
OpenAI    | GPT-5.2 (OpenAI, 2025a)              | gpt-5.2-2025-12-11         | API
OpenAI    | GPT-5 mini (OpenAI, 2025b)           | gpt-5-mini-2025-08-07      | API
OpenAI    | gpt-oss 20b (OpenAI, 2025c)          | gpt-oss-20b                | GPU
Anthropic | Claude Sonnet 4.5 (Anthropic, 2025b) | claude-sonnet-4-5-20250929 | API
Anthropic | Claude Haiku 4.5 (Anthropic, 2025a)  | claude-haiku-4-5-20251001  | API
Google    | Gemini 3 (Google, 2025a)             | gemini-3-pro-preview       | API
Google    | Gemini 3 Flash (Google, 2025b)       | gemini-3-flash-preview     | API
DeepSeek  | DeepSeek V3.2 (Liu et al., 2025)     | DeepSeek-V3.2              | API
DeepSeek  | DeepSeek R1 (Guo et al., 2025)       | DeepSeek-R1-0528           | API
Alibaba   | Qwen 3 (Team, 2025)                  | Qwen3-8B                   | GPU

(a) Comparison of # Assertion.
(b) Comparison of # Lemma Function.

Figure 9. Comparison of # Assertion and # Lemma Function per program between VCoT-Bench-Org and Verus-Bench.

fragility. Both syntactic and semantic accuracies decline as the proof-removal ratio increases. For syntactic accuracy, with 10% removal, the Gemini family achieves near-perfect performance (91.58%), substantially outperforming other models. However, at 100% removal, syntactic accuracy drops sharply to 45.7% for Gemini 3 and 24.5% for Gemini 3 Flash. This steep decline indicates that current LLMs lack a robust understanding of Verus syntax and instead rely on local contextual cues, which disappear as more proofs are removed, exposing structural fragility. For semantic accuracy, with 10% removal, GPT-5.2 achieves the highest performance (73.68%), but this drops to 27.15% at 100% removal. The decline is even more severe for Qwen 3, whose semantic accuracy falls to 0% at 80% or 90% removal. These results further indicate that current LLMs lack a sufficient understanding of VCoT.

Figure 10. Impact of semantic block removal on (a) Syntactic Accuracy and (b) Semantic Accuracy.

F2: Different decline trends for syntactic and semantic accuracy. Syntactic and semantic accuracies exhibit different decline patterns.
Syntactic accuracy decreases smoothly as proofs are removed, whereas semantic accuracy drops sharply as removal increases from 10% to 40% and then declines more gradually from 40% to 100%. This difference stems from their distinct roles. Syntactically, proof blocks are largely independent, so each contributes similarly to overall correctness, producing a smooth decline as blocks are removed. Semantically, however, early removal disproportionately eliminates critical structural anchors in the verification logic; once these anchors fall below a threshold, logical continuity collapses, and models can no longer infer the proof structure, making the task uniformly difficult regardless of further removal.

Table 4. Syntactic and Semantic Accuracy across Proof Types. The highest and second-highest scores are highlighted. Each cell reports SynAcc / SemAcc.

Model Name        | Invariant     | Assertion     | Lemma
GPT-5.2           | 59.18 / 62.59 | 13.70 / 56.16 | 61.64 / 59.59
GPT-5 mini        | 12.24 / 39.46 |  2.74 / 63.01 | 23.97 / 54.79
gpt-oss           |  0.00 / 35.37 |  4.79 / 39.73 | 34.25 / 30.82
Claude Sonnet 4.5 | 97.96 / 59.18 | 53.42 / 39.73 | 90.41 / 39.73
Claude Haiku 4.5  | 89.80 / 40.82 |  8.22 / 47.26 | 69.18 / 36.99
Gemini 3          | 98.64 / 50.34 | 50.68 / 16.44 | 92.47 / 26.71
Gemini 3 Flash    | 97.96 / 44.90 | 28.08 / 38.36 | 88.36 / 26.71
DeepSeek V3.2     | 89.80 / 37.41 |  4.11 / 56.16 | 55.48 / 46.58
DeepSeek R1       | 79.59 / 37.41 |  7.53 / 52.74 | 34.93 / 49.32
Qwen 3 think      | 30.61 /  4.08 | 18.49 / 12.33 | 39.73 / 17.81
Qwen 3 non-think  | 73.47 / 12.24 | 28.08 / 11.64 | 47.26 / 16.44

F3: Syntactic-semantic imbalance. A syntactic-semantic imbalance is observed here. Gemini 3 exhibits relatively high syntactic accuracy but low semantic accuracy, while GPT-5.2 shows the opposite trend, with lower syntactic accuracy yet strong semantic understanding. This contrast highlights the pressing need for models to bridge the gap between grammatical fluency and verification understanding to ensure more holistic reasoning.

C.2.
Performance across Proof Types

Using VCoT-Bench-Type, we evaluate LLM performance in both syntax and semantics across proof types, and report three key findings based on the results in Table 4.

F1: Assertions are the hardest for syntactic accuracy. Syntactic accuracy for loop invariants and lemma functions is consistently much higher than for assertions across most models. For example, DeepSeek V3.2 achieves 89.80% syntactic accuracy on loop invariants but drops sharply to 4.11% on assertions. This large gap reflects LLMs' limited familiarity with assertion syntax. Loop invariants impose looser grammatical constraints and closely resemble Rust code, making them easier to generate. Assertions, by contrast, involve substantially more complex grammar, particularly strict type discipline that often requires explicit conversions to int, which models frequently miss. Lemma functions also involve specialized constructs such as requires and ensures, but demand less grammatical precision than assertions, resulting in syntactic accuracy between that of loop invariants and assertions.

Table 5. Syntactic and Semantic Accuracy across Proof Locations. The highest and second-highest scores are highlighted. Each cell reports SynAcc / SemAcc.

Model Name        | Front         | Middle        | End
GPT-5.2           | 63.85 / 63.85 | 35.38 / 54.62 | 66.15 / 73.85
GPT-5 mini        | 34.62 / 67.69 | 23.08 / 56.92 | 41.54 / 83.08
gpt-oss           | 10.00 / 30.77 |  6.92 / 15.38 | 22.31 / 38.46
Claude Sonnet 4.5 | 93.08 / 62.31 | 63.08 / 46.92 | 80.77 / 64.62
Claude Haiku 4.5  | 80.77 / 54.62 | 40.00 / 41.54 | 66.15 / 62.31
Gemini 3          | 90.77 / 54.62 | 64.62 / 28.46 | 80.00 / 36.92
Gemini 3 Flash    | 92.31 / 60.00 | 66.15 / 46.15 | 86.92 / 56.15
DeepSeek V3.2     | 80.00 / 53.08 | 23.08 / 47.69 | 60.77 / 65.38
DeepSeek R1       | 65.38 / 56.15 | 18.46 / 42.31 | 51.54 / 53.85
Qwen 3 think      | 49.23 / 15.38 | 33.08 / 10.77 | 63.85 / 28.46
Qwen 3 non-think  | 76.92 / 20.77 | 38.46 / 11.54 | 61.54 / 31.54

F2: Semantic accuracy varies across proof types. Assertions are not uniformly the most challenging proof type in terms of semantic accuracy, as model performance varies across proof types. Loop invariants require holistic reasoning across iterations, where GPT-5.2 and Claude Sonnet 4.5 perform best, achieving semantic accuracies of 62.59% and 59.18%, respectively. Assertions demand strong step-by-step inductive reasoning, on which GPT-5 mini (63.01%) performs best, outperforming GPT-5.2 (56.16%), Claude Sonnet 4.5 (39.73%), and Gemini 3 (16.44%). Lemma functions require contextual reasoning over the main proof body and auxiliary properties, a setting in which the GPT family achieves the strongest overall performance.

F3: Syntactic-semantic imbalance remains. A clear syntactic-semantic imbalance persists for Gemini 3 and GPT-5.2. Gemini 3 shows a striking disparity on loop invariants, achieving near-perfect syntactic accuracy (98.64%) but much lower semantic accuracy (50.34%). In contrast, GPT-5.2 exhibits the opposite pattern, with lower syntactic accuracy (59.18%) but higher semantic accuracy (62.59%). This imbalance appears consistently across all proof types, indicating that the syntactic-semantic bias persists regardless of proof type.

C.3. Performance across Proof Locations

Using VCoT-Bench-Loc, we evaluate LLM performance in both syntactic and semantic accuracy across proof locations and report three key findings based on the results in Table 5.

F1: Middle blocks are syntactically hardest. Across all models, syntactic accuracy on Front and End blocks consistently exceeds that on Middle blocks. For example, DeepSeek V3.2 achieves 80.00% syntactic accuracy on Front blocks and 60.77% on End blocks, but only 23.08% on Middle blocks.
This gap stems from differences in syntactic structure: Front and End blocks typically contain short constraints with simple equalities or bounds, whereas Middle blocks usually require longer, more complex expressions with nested function calls and chained conditions to maintain state consistency, placing greater demands on grammatical understanding and leading to lower syntactic accuracy.

F2: Middle blocks are semantically hardest. Across all models, semantic accuracy on Front and End blocks exceeds that on Middle blocks. Even for the best-performing model, GPT-5 mini, semantic accuracy reaches 67.69% on Front blocks and 83.08% on End blocks, but only 56.92% on Middle blocks. This pattern reinforces that current LLMs struggle significantly with connective reasoning in the middle of proofs.

F3: Syntactic gaps exceed semantic gaps. For syntactic accuracy, the performance gap across locations reaches 57.92%, while for semantic accuracy it narrows to 26.16%. This discrepancy suggests that current LLMs lack a robust understanding of Verus syntax but exhibit a relatively stronger grasp of verification semantics, which may build on code-reasoning capabilities that most models are already trained on.