Paper deep dive

CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution

Teng Pan, Yuchen Yan, Zixuan Wang, Ruiqing Zhang, Guiyang Hou, Wenqi Zhang, Weiming Lu, Jun Xiao, Yongliang Shen

Year: 2026 | Venue: arXiv preprint | Area: cs.CL | Type: Preprint | Embeddings: 66

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/22/2026, 5:57:40 AM

Summary

CoVerRL is a framework for label-free reinforcement learning that addresses the 'consensus trap' in LLMs, where models reinforce systematic errors due to output diversity collapse. By enabling a single model to alternate between generator and verifier roles, CoVerRL facilitates a co-evolutionary process where majority-voted pseudo-labels train the verifier, and the verifier filters self-consistent errors from the generator's output. This bidirectional bootstrapping improves both reasoning and self-verification accuracy across Qwen and Llama model families.

Entities (6)

CoVerRL · framework · 100%
Answer-Anchored GRPO · optimization-algorithm · 95%
Llama · model-family · 95%
MATH · dataset · 95%
Qwen · model-family · 95%
TTRL · baseline-method · 95%

Relation Signals (3)

CoVerRL → evaluatedon → MATH

confidence 95% · To test our method at scale, we use the large, standard MATH training set.

CoVerRL → outperforms → TTRL

confidence 95% · CoVerRL consistently outperforms TTRL across all models and benchmarks

CoVerRL → utilizes → Answer-Anchored GRPO

confidence 95% · We optimize using Group Relative Policy Optimization (GRPO) with a modification... We thus propose Answer-Anchored GRPO

Cypher Suggestions (2)

Find all models evaluated using the CoVerRL framework. · confidence 90% · unvalidated

MATCH (m:Model)-[:EVALUATED_WITH]->(f:Framework {name: 'CoVerRL'}) RETURN m.name

Identify the performance improvement of CoVerRL over baselines. · confidence 85% · unvalidated

MATCH (f:Framework {name: 'CoVerRL'})-[r:OUTPERFORMS]->(b:Baseline) RETURN f.name, b.name, r.improvement_percentage

Abstract

Label-free reinforcement learning enables large language models to improve reasoning capabilities without ground-truth supervision, typically by treating majority-voted answers as pseudo-labels. However, we identify a critical failure mode: as training maximizes self-consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self-consistent errors from pseudo-labels. This co-evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks. Moreover, self-verification accuracy improves from around 55% to over 85%, confirming that both capabilities genuinely co-evolve.


Full Text

65,797 characters extracted from source content.


CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution

Teng Pan 1,2, Yuchen Yan 1, Zixuan Wang 1,2, Ruiqing Zhang 2, Guiyang Hou 1, Wenqi Zhang 1, Weiming Lu 1, Jun Xiao 1, Yongliang Shen 1†
1 Zhejiang University, 2 Baidu Inc.
† Corresponding authors

Label-free reinforcement learning enables large language models to improve reasoning capabilities without ground-truth supervision, typically by treating majority-voted answers as pseudo-labels. However, we identify a critical failure mode: as training maximizes self-consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self-consistent errors from pseudo-labels. This co-evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks. Moreover, self-verification accuracy improves from around 55% to over 85%, confirming that both capabilities genuinely co-evolve.

Date: March 19, 2026
Project Page: https://zju-real.github.io/CoVerRL
Code: https://github.com/ZJU-REAL/CoVerRL
Correspondence: pt6,syl@zju.edu.cn

1 Introduction

Large Language Models (LLMs) (DeepSeek-AI et al., 2025; OpenAI et al., 2024) have demonstrated remarkable capabilities in complex reasoning tasks, especially with recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) that further amplify these abilities by optimizing models against ground-truth signals.
However, the standard RLVR paradigm relies heavily on large amounts of high-quality labeled data or ground-truth verifiers, which are often unavailable or expensive to obtain. Consequently, enabling LLMs to self-evolve in a label-free setting has emerged as a critical frontier in current research.

The dominant paradigm for label-free training employs majority voting as a proxy for correctness (Prasad et al., 2025; Zuo et al., 2025; Liu et al., 2025a). Given a problem, the model generates multiple reasoning trajectories, and the most frequent answer becomes the pseudo-label for reward computation. The underlying assumption is that consistent answers across diverse reasoning paths likely reflect correct solutions. Notably, test-time reinforcement learning (TTRL) (Zuo et al., 2025) reveals that majority voting remains effective even when pseudo-labels are incorrect, through a “lucky hit” mechanism: samples whose answers differ from an incorrect pseudo-label still receive correct negative rewards. This explains why reward accuracy can remain high despite low label accuracy, providing viable training signals even on challenging benchmarks.

However, we identify a critical instability in this paradigm. Since rewards are computed against the majority answer, the training objective inherently encourages the model to produce more consistent outputs. As training progresses, output diversity systematically decreases and predictions concentrate around the majority answer. When this answer is incorrect, a destructive feedback loop emerges: the model confidently reinforces its own errors, diversity collapses further, and reward accuracy degrades. Figure 1 provides empirical evidence: under TTRL, reward accuracy begins near 95% but steadily drops to approximately 80%, while label accuracy stagnates around 70% with no improvement. Once the model converges to systematic errors with high confidence, pure consensus methods offer no mechanism for recovery.

arXiv:2603.17775v1 [cs.CL] 18 Mar 2026

Figure 1 Training dynamics of TTRL and CoVerRL on Qwen3-1.7B-Base. CoVerRL maintains high reward accuracy while steadily improving label accuracy, whereas TTRL suffers from reward accuracy degradation as output diversity collapses.

This reveals a fundamental limitation: majority voting captures consistency but cannot detect correctness. When models repeatedly produce identical wrong answers, a phenomenon known as self-consistent errors (Tan et al., 2025), consensus-based rewards actively reinforce these mistakes. A natural remedy is self-verification: if the model could judge correctness, self-consistent errors could be filtered before corrupting training. However, prior work has shown that LLMs without specific training cannot reliably detect their own reasoning errors (Huang et al., 2024; Tyen et al., 2024). Expecting a model to verify solutions it cannot correctly generate creates a circular dependency that severely limits self-correction without external supervision. The situation appears intractable: consensus provides training signal but cannot escape systematic errors, while verification could detect errors but cannot be trained without ground truth.

We resolve this tension through a key observation: majority voting and self-verification, while individually insufficient, can bootstrap each other’s improvement. We propose CoVerRL (Co-evolving Verifier and generator via Reinforcement Learning), a framework where a single model alternates between generator and verifier roles within a unified multi-turn RL process. The generator produces candidate solutions and establishes pseudo-labels via majority voting; the verifier scrutinizes reasoning processes to identify errors independent of answer frequency. The critical insight is bidirectional supervision: consensus-derived pseudo-labels provide contrastive signal for verifier training, while the improving verifier filters self-consistent errors from pseudo-labels.
This creates a virtuous cycle where better verification yields cleaner labels, producing more accurate rewards that further improve both generation and verification. As shown in Figure 1, CoVerRL maintains reward accuracy above 90% throughout training, while label accuracy steadily increases from 75% to 88%, demonstrating the verifier’s growing ability to escape the failure mode that plagues pure consensus methods.

We evaluate CoVerRL across three models on four reasoning benchmarks. CoVerRL consistently outperforms TTRL, achieving average Acc.@final improvements of 5.7%, 5.9%, and 4.7% on Qwen3-1.7B-Base, Llama-3.2-3B-Instruct, and Qwen2.5-7B, respectively. Beyond generation accuracy, CoVerRL produces capable verifiers as a byproduct: self-verification accuracy improves dramatically, from 55.8% to 81.2% on Qwen3-1.7B-Base and from 54.0% to 86.5% on Qwen2.5-7B. This confirms that the two capabilities genuinely co-evolve rather than one merely supporting the other.

We summarize our contributions as follows:
• We identify the consensus trap in majority-voting-based label-free RL, where diversity collapse causes reward accuracy degradation as models become overconfident in systematic errors, explaining why such methods eventually stagnate.
• We propose CoVerRL, a co-evolution framework that unifies generation and verification into a multi-turn RL process, enabling mutual bootstrapping where each capability supervises improvement of the other without external labels.
• We validate CoVerRL across Qwen and Llama model families, demonstrating 4-6% improvements over label-free baselines on mathematical reasoning benchmarks while producing verifiers that generalize well to held-out evaluation.

Figure 2 Overview of the CoVerRL framework. The pipeline illustrates the Pseudo-Label Generation process and the Online Dual-Role Co-Evolution strategy, facilitating a mutual bootstrapping process where generation and verification capabilities jointly evolve.

2 Related Work

2.1 LLM Self-Evolution

To achieve label-free evolution, existing research has exploited the intrinsic confidence of LLMs as a supervision signal (Prabhudesai et al., 2025; Zhao et al., 2025b; Anonymous, 2025; Li et al., 2025), leveraged self-consistency through majority voting to establish pseudo-labels (Prasad et al., 2025; Zuo et al., 2025; Liu et al., 2025a; Zhou et al., 2025; Yu et al., 2025b; Du et al., 2025), or employed self-verification mechanisms (Yuan et al., 2025). Another paradigm is self-play, where a single model simulates multiple roles and generates its own training data (Zhao et al., 2025a; Huang et al., 2025; Xia et al., 2025; Yang et al., 2025b; Guo et al., 2025; Wei et al., 2025). In this work, we leverage both majority voting and self-verification to co-evolve the generator and verifier, overcoming the limitations of the consensus trap and of unreliable standalone verification.

2.2 LLM Self-Verification and Self-Correction

Recent advances have demonstrated that LLMs are capable of self-refinement when provided with external feedback (Chen et al., 2023; Liu et al., 2025c; Jain et al., 2024; Liu et al., 2025b). Concurrently, several studies train standalone verifier models to bridge the verification gap (Cobbe et al., 2021; Lightman et al., 2023; Zhang et al., 2025a,b).
Beyond relying on external signals, many studies focus on intrinsic self-correction (Miao et al., 2023; Lee et al., 2025; Ma et al., 2025; Xiong et al., 2025; Yan et al., 2025), which predominantly relies on either prompt engineering or two-stage paradigms combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). With the rapid progress in multi-turn RL (Feng et al., 2025a; Jin et al., 2025; Wang et al., 2025; Feng et al., 2025b), recent studies have investigated training self-correction capabilities directly through a single reinforcement learning stage (Liu et al., 2025d; Jiang et al., 2025). In our work, we focus on enhancing self-verification and self-correction capabilities in a label-free paradigm, which has been rarely explored in previous studies.

3 Method

In this section, we present CoVerRL, a framework that enables label-free reinforcement learning through the co-evolution of generation and verification capabilities. The key principle of CoVerRL is that a single policy model $\pi_\theta$ plays two roles: a generator $\pi_\theta^{\mathrm{gen}}$ and a verifier $\pi_\theta^{\mathrm{ver}}$. Through their interaction, the model produces high-quality pseudo-labels (Section 3.1) and achieves coordinated dual-role co-evolution (Section 3.2).

3.1 Pseudo-Label Generation

The key challenge in label-free RL is obtaining reliable supervision signals without ground truth. Majority voting suffers from diversity collapse, while self-verification lacks training signal. CoVerRL addresses this by combining both mechanisms: majority voting drafts initial pseudo-labels, and self-verification filters out self-consistent errors.

Majority Voting. Given a query $q$, the model acting as a generator samples $N$ reasoning paths $y_i \sim \pi_\theta^{\mathrm{gen}}(\cdot \mid q)$, forming a collection of query-response pairs $Y = \{(q, y_i)\}_{i=1}^{N}$. Following Zuo et al. (2025), we group these outputs and identify the most frequent answer as the draft pseudo-label $\hat{y}$.
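As a rough illustration of the majority-voting step, the sketch below drafts a pseudo-label from a list of sampled answers and splits the sample indices into those that agree with it and those that do not. The function name `draft_pseudo_label`, the tie-breaking rule (the first-seen answer wins), and the toy answers are assumptions made for this example; answer extraction, written ans(·) in the paper, is assumed to have been applied already.

```python
from collections import Counter


def draft_pseudo_label(answers):
    """Return (pseudo_label, positive_idx, negative_idx) for extracted answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common keeps first-seen order among ties; this tie-break is an assumption
    pseudo_label, _ = counts.most_common(1)[0]
    # Indices of samples whose answer matches the draft label (positive set)
    positive_idx = [i for i, a in enumerate(answers) if a == pseudo_label]
    # Indices of samples whose answer differs from it (negative set)
    negative_idx = [i for i, a in enumerate(answers) if a != pseudo_label]
    return pseudo_label, positive_idx, negative_idx


# Toy example: eight sampled answers for one query (hypothetical values)
answers = ["12", "15", "12", "12", "7", "15", "12", "12"]
label, pos, neg = draft_pseudo_label(answers)
print(label, pos, neg)
```

Note that the draft label reflects agreement only: nothing in this step checks whether "12" is actually correct, which is the gap the verifier is meant to close.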
The draft pseudo-label $\hat{y}$ partitions the outputs into a positive set $Y^{+} = \{(q, y_i) \in Y : \mathrm{ans}(y_i) = \hat{y}\}$ and a negative set $Y^{-} = \{(q, y_i) \in Y : \mathrm{ans}(y_i) \neq \hat{y}\}$.

Self-consistent Verification Filtering. To filter self-consistent errors, the model switches to its verifier role $\pi_\theta^{\mathrm{ver}}$ to scrutinize the candidate solutions. We define a maximum verification budget $M$ to maintain computational efficiency; if $|Y^{+}| > M$, we randomly sample $M$ tuples from $Y^{+}$ to form the verification set, otherwise we include all tuples in $Y^{+}$. For each selected tuple $(q, y_j)$, the verifier produces a binary judgment $v_j \sim \pi_\theta^{\mathrm{ver}}(\cdot \mid q, y_j)$, where $v_j \in \{0, 1\}$, and we form the verification set $V^{+} = \{(q, y_j, v_j)\}_{j=1}^{|V^{+}|}$, where $|V^{+}| = \min(|Y^{+}|, M)$. We then apply a consensus filter: the query $q$ and its associated sets are retained for training only if the majority of verifications in $V^{+}$ are positive:

$$\sum_{j=1}^{|V^{+}|} v_j > \frac{|V^{+}|}{2} \qquad (1)$$

This filtering mechanism specifically targets the failure mode where the generator produces self-consistent errors; by leveraging $\pi_\theta^{\mathrm{ver}}$ to identify flawed reasoning, we exclude such samples before they corrupt the training process.

3.2 Online Dual-Role Co-Evolution

With filtered pseudo-labels established, we now describe how both roles are jointly optimized through multi-turn reinforcement learning.

Contrastive Verifier Training. To train the verifier without ground-truth labels, we construct contrastive examples from the derived pseudo-labels. The positive verification set $V^{+}$ (targeting the majority answer $\hat{y}$) is obtained during the filtering stage. For negative examples, we select the answer $y_s$ with the lowest frequency from $Y^{-}$ and generate $|V^{+}|$ verification paths to form the negative verification set $V^{-} = \{(q, y_s, v_j)\}_{j=1}^{|V^{+}|}$ with $v_j \sim \pi_\theta^{\mathrm{ver}}(\cdot \mid q, y_s)$, where each $v_j$ targets the voted “Incorrect” status. Enforcing $|V^{-}| = |V^{+}|$ ensures balanced training, which we show in Appendix C is equivalent to implicit preference optimization and induces beneficial curriculum effects.

Self-Correction.
Beyond verification, we equip the model with error recovery capabilities by leveraging the interaction between its roles. For any triplet $(q, y, v) \in V^{+} \cup V^{-}$ where the verifier identifies an error ($v = 0$), the model switches back to its generator role $\pi_\theta^{\mathrm{gen}}$. It is then triggered to produce $K$ revised solutions $\{z_k\}_{k=1}^{K} \sim \pi_\theta^{\mathrm{gen}}(\cdot \mid q, y, v)$, conditioned on the query, the initial failed attempt, and the verifier’s feedback. This structure allows the model to learn self-correction from its own internal critiques, extending the label-free paradigm to iterative refinement.

Reward Design. We define a reward function $r = r_f + r_a$ combining format and accuracy components. The format reward $r_f \in \{0, 1\}$ ensures outputs adhere to structural constraints. The accuracy reward is:

$$r_a = \mathbb{I}(o = \hat{o}) \qquad (2)$$

where $o$ is the model’s output and $\hat{o}$ is the corresponding pseudo-label. For generation, $\hat{o}$ is the consensus answer $\hat{y}$; for verification, $\hat{o}$ is the target judgment (“Correct” for $V^{+}$, “Incorrect” for $V^{-}$).

Answer-Anchored GRPO. We optimize using Group Relative Policy Optimization (GRPO) (Shao et al., 2024) with a modification for our multi-turn structure. Standard GRPO groups responses by query prefix. In our case, although verifications in $V^{+} = \{(q, y_j, v_j)\}_{j=1}^{M}$ do not strictly originate from the same prefix, every $y_j$ targets the same answer $\hat{y}$. We thus propose Answer-Anchored GRPO: we group these $|V^{+}|$ paths together as they share the same answer anchor $\hat{y}$; for generation and $V^{-}$, we revert to standard prefix-based grouping. The advantage is computed as:

$$A_i = \frac{r_i - \mathrm{mean}(\mathbf{r})}{\mathrm{std}(\mathbf{r})} \qquad (3)$$

where $\mathbf{r}$ contains rewards within the appropriate group. This hybrid strategy isolates verification variance while maintaining stable generation optimization.

The Co-Evolution Cycle.
The complete training loop proceeds as follows: (1) generate solutions and draft pseudo-labels via majority voting, (2) verify positive samples and filter self-consistent errors, (3) construct contrastive verification data, (4) collect self-correction trajectories, and (5) update the model with Answer-Anchored GRPO on all roles jointly. As training progresses, the verifier improves through contrastive learning, enabling better filtering of self-consistent errors. Cleaner pseudo-labels yield more accurate rewards, improving the generator. The improved generator produces higher-quality consensus labels, further benefiting verifier training. This bidirectional bootstrapping, visualized in Figure 2, enables both roles to co-evolve without external supervision. The detailed algorithm flow is presented in Appendix B.

4 Experiments

4.1 Experimental Setup

Base Models. To verify the effectiveness of CoVerRL on diverse model architectures, we adopt three representative models from the Qwen and Llama families, including base and instruct models: Llama-3.2-3B-Instruct (Grattafiori et al., 2024), Qwen2.5-7B (Qwen et al., 2025), and Qwen3-1.7B-Base (Yang et al., 2025a) with thinking mode.

Training and Evaluation Data. To test our method at scale, we use the large, standard MATH (Hendrycks et al., 2021) training set. For evaluation, we employ three in-domain mathematical benchmarks to assess reasoning capabilities: MATH500 (Hendrycks et al., 2021), AMC (Li et al., 2024), and AIME24 (Li et al., 2024). Additionally, we include GPQA-Diamond (Rein et al., 2023) to evaluate the models’ generalization on challenging out-of-domain question-answering tasks.

Baseline and Implementation Details. We mainly compare our method against TTRL (Zuo et al., 2025). For the TTRL baseline, we follow the practice in Zuo et al. (2025) by sampling 64 responses for majority voting and downsampling to 32 responses for training.
For CoVerRL, in the first turn we sample 32 responses for majority voting and downsample them to 16 for training; in the later self-verification and self-correction turns, we use all samples for training to maximize data efficiency. For a detailed description of the hyperparameters, please refer to Appendix A.

Evaluation Protocol. Following Jiang et al. (2025), we adopt a sequential inference strategy: the model first generates a solution, then verifies its correctness. If verification passes, inference terminates; otherwise, the model attempts a revision. We report both Acc.@first (first-attempt accuracy, equivalent to Pass@1) and Acc.@final (accuracy after potential self-correction), averaged over 16 different generations.

4.2 Main Results

Table 1 presents the main results. CoVerRL consistently outperforms TTRL across all models and benchmarks, achieving average improvements of 5.7%, 5.9%, and 4.7% in Acc.@final for the three models respectively.

Table 1 Main results on mathematical reasoning and out-of-domain benchmarks. Each cell displays Acc.@first / Acc.@final. ∆ indicates the improvement of CoVerRL over TTRL.
Model | Method | MATH500 | AMC | AIME24 | GPQA | Average
Qwen3-1.7B-Base | Base Model | 53.5 / 53.3 | 24.6 / 24.5 | 3.8 / 3.3 | 27.5 / 27.3 | 27.4 / 27.1
Qwen3-1.7B-Base | TTRL | 65.1 / 65.0 | 31.1 / 30.9 | 5.2 / 5.2 | 30.9 / 30.7 | 33.1 / 33.0
Qwen3-1.7B-Base | CoVerRL (Ours) | 69.0 / 71.9 | 36.0 / 38.6 | 9.8 / 10.6 | 32.9 / 33.6 | 36.9 / 38.7
Qwen3-1.7B-Base | ∆ | +3.9 / +6.9 | +4.9 / +7.7 | +4.6 / +5.4 | +2.0 / +2.9 | +3.8 / +5.7
Llama-3.2-3B-Instruct | Base Model | 42.7 / 41.0 | 17.0 / 15.7 | 4.6 / 5.0 | 26.9 / 26.1 | 22.8 / 22.0
Llama-3.2-3B-Instruct | TTRL | 52.6 / 52.2 | 23.8 / 23.3 | 13.8 / 14.0 | 29.8 / 28.2 | 30.0 / 29.4
Llama-3.2-3B-Instruct | CoVerRL (Ours) | 55.9 / 59.3 | 28.3 / 32.2 | 16.3 / 16.9 | 32.3 / 32.6 | 33.2 / 35.3
Llama-3.2-3B-Instruct | ∆ | +3.3 / +7.1 | +4.5 / +8.9 | +2.5 / +2.9 | +2.5 / +4.4 | +3.2 / +5.9
Qwen2.5-7B | Base Model | 50.1 / 51.4 | 25.5 / 26.4 | 5.2 / 6.5 | 29.9 / 29.7 | 27.7 / 28.5
Qwen2.5-7B | TTRL | 73.8 / 74.2 | 42.2 / 42.2 | 12.7 / 12.5 | 35.8 / 35.6 | 41.1 / 41.1
Qwen2.5-7B | CoVerRL (Ours) | 76.8 / 79.6 | 47.6 / 49.2 | 14.6 / 17.1 | 36.2 / 37.2 | 43.8 / 45.8
Qwen2.5-7B | ∆ | +3.0 / +5.4 | +5.4 / +7.0 | +1.9 / +4.6 | +0.4 / +1.6 | +2.7 / +4.7

Improved Generation Quality. CoVerRL improves Acc.@first over TTRL by 3.8%, 3.2%, and 2.7% on average across the three models, demonstrating that the co-evolution mechanism enhances the generator itself, not merely the correction process. This confirms that filtering self-consistent errors produces cleaner training signals that benefit generation quality directly.

Effective Self-Correction. The gap between Acc.@first and Acc.@final widens substantially under CoVerRL compared to TTRL. On AMC, CoVerRL achieves Acc.@final improvements of 7.7%, 8.9%, and 7.0% over TTRL, larger than the corresponding Acc.@first gains of 4.9%, 4.5%, and 5.4%. This indicates that the verifier trained through co-evolution can reliably identify errors and guide effective corrections, whereas TTRL’s self-correction provides minimal benefit due to unreliable verification.

Table 2 Self-verification accuracy on four benchmarks.
Model | MATH500 | AMC | AIME24 | GPQA
Qwen3-1.7B-Base | 55.8 | 28.5 | 11.7 | 38.7
+ CoVerRL | 81.2 | 64.4 | 48.9 | 57.9
Llama-3.2-3B-Instruct | 57.0 | 50.9 | 55.8 | 56.9
+ CoVerRL | 77.2 | 76.2 | 75.6 | 55.2
Qwen2.5-7B | 54.0 | 30.9 | 12.9 | 35.7
+ CoVerRL | 86.5 | 70.9 | 58.8 | 58.4

Out-of-Domain Generalization.
On GPQA-Diamond, a challenging out-of-domain benchmark, CoVerRL maintains consistent improvements. Notably, while base models and TTRL often suffer degradation during revision (Acc.@final < Acc.@first), CoVerRL reverses this trend across all models. This suggests that the verification capability learned through co-evolution transfers beyond the training distribution.

Verifier Quality. Table 2 reports self-verification accuracy. CoVerRL dramatically improves verification capability: Qwen3-1.7B-Base improves from 28.5% to 64.4% on AMC, and Qwen2.5-7B improves from 54.0% to 86.5% on MATH500. These gains confirm that both generation and verification genuinely co-evolve rather than one merely supporting the other.

4.3 Analysis

Verifier and Generator Co-Evolve Mutually. Figure 3 reveals the co-evolution dynamics across three models. In early training, verifier accuracy (green dashed line) starts low due to the lack of verification-specific supervision, resulting in label accuracy comparable to TTRL. As training progresses, the verifier improves by learning from pseudo-labels that, while noisy, provide contrastive signal between majority and minority answers. This improved verification capability enables better filtering of self-consistent errors. Consequently, CoVerRL’s label accuracy (red line) steadily increases and eventually surpasses TTRL (blue line) by a significant margin across all three model backbones. The consistent pattern across different architectures and scales confirms that both capabilities bootstrap each other through the bidirectional supervision mechanism rather than one merely supporting the other.

Figure 3 Co-evolution dynamics across three model backbones (panels: Qwen3-1.7B-Base, Llama-3.2-3B-Instruct, Qwen2.5-7B; x-axis: Training Step; y-axis: Accuracy; curves: CoVerRL Label Acc., TTRL Label Acc., CoVerRL Verification Acc.). As training progresses, verification accuracy improves through contrastive learning, which in turn enables better filtering of self-consistent errors and leads to higher label accuracy than TTRL.

For a clearer demonstration of the verifier’s evolution, Figure 4 reports the True Negative Rate (TNR), the ratio of incorrect samples correctly filtered, and the False Positive Rate (FPR), the proportion of correct samples mistakenly discarded. The left panel shows that TNR steadily climbs from approximately 22% to nearly 50%, while the FPR remains stable at around 5%, indicating the verifier increasingly prunes erroneous paths while maintaining a minimal false-alarm rate for correct solutions.

Figure 4 Evolution of the verification capability of Qwen3-1.7B-Base (panels: All Samples; High-Consistent Samples (=0.2); training steps binned as 0-50, 50-100, 100-234). We report the True Negative Rate (TNR) and the False Positive Rate (FPR).

Reward Accuracy Remains High Across All Roles. Figure 5 tracks reward accuracy for generation, verification, and self-correction throughout training on Llama-3.2-3B-Instruct. All three roles maintain reward accuracy above 90%, indicating that the pseudo-labels derived from the co-evolution process provide reliable supervision signals for each role. The sustained high reward accuracy across all roles demonstrates that the bidirectional bootstrapping mechanism successfully prevents the degradation identified in Section 1, yielding both a marked improvement in Acc.@first and a consistently higher Acc.@final.

Figure 5 Reward accuracy across generation, verification, and self-correction roles during training on Llama-3.2-3B-Instruct (panels: First Generation, Self-verification, Self-correction; x-axis: Training Step; y-axis: Reward Acc).

Verification Filtering Outperforms Naive Consistency Filtering.
Since samples with lower majority ratios are more likely to be incorrect, a natural alternative to verification-based filtering is to simply discard samples below a consistency threshold. We compare CoVerRL against TTRL with naive filtering thresholds $\tau \in \{0.1, 0.2\}$ in Figure 6. The left panel shows label accuracy dynamics: while naive filtering initially improves label accuracy by removing low-confidence predictions, its advantage diminishes as training progresses and model outputs become increasingly consistent due to the RL objective, eventually converging to vanilla TTRL performance. In contrast, CoVerRL’s label accuracy continues to improve throughout training, reaching approximately 90% compared to TTRL’s 70%. The right panel reveals a more fundamental distinction in reward accuracy: naive filtering provides no improvement over vanilla TTRL because it cannot detect errors that happen to achieve high consensus. CoVerRL maintains reward accuracy above 90% by specifically targeting self-consistent errors through reasoning-based verification. As shown in the right panel of Figure 4, for samples with majority ratio larger than 0.2, the TNR reaches around 40% after 100 steps, confirming the verifier’s strong capability to prune confident errors missed by naive filtering.

Figure 6 Comparison of CoVerRL against naive consistency filtering on Qwen3-1.7B-Base (panels: Label Acc, Reward Acc; x-axis: Training Step; curves: TTRL (Vanilla), TTRL (Filter-0.1), TTRL (Filter-0.2), CoVerRL (Ours)). Naive filtering ($\tau \in \{0.1, 0.2\}$) improves label accuracy initially but converges to the TTRL baseline, while failing to improve reward accuracy. CoVerRL achieves sustained gains on both metrics.

Table 3 Performance on RewardBench. Scores marked with ∗ are from the RewardBench leaderboard.
Model | Score
Qwen3-1.7B-Base | 46.8
Llama-3.2-3B-Instruct | 58.6
Qwen2.5-7B | 49.7
Meta-Llama-3-70B-Instruct | 66.2∗
gpt-4-turbo-2024-04-09 | 67.3∗
gpt-4o-2024-05-13 | 72.5∗
Qwen3-1.7B-Base + CoVerRL | 56.4
Llama-3.2-3B-Instruct + CoVerRL | 70.0
Qwen2.5-7B + CoVerRL | 60.9

Verification Capability Generalizes Beyond Training Distribution. To assess whether the learned verification capability transfers beyond the model’s own generated responses, we evaluate the trained verifiers on the RewardBench (Lambert et al., 2024) mathprm subset. As shown in Table 3, CoVerRL-enhanced models substantially outperform their base versions across all three architectures. Most notably, Llama-3.2-3B-Instruct improves from 58.6 to 70.0, surpassing the significantly larger Meta-Llama-3-70B-Instruct (66.2) and GPT-4-Turbo (67.3). This strong performance suggests that CoVerRL trains genuine reasoning-based verification capabilities and that the co-evolution process induces transferable skills for evaluating solution quality.

Reasoning Capability is Essential for Effective Verification. To validate that verification quality depends on substantive reasoning rather than superficial pattern matching, we compare thinking mode versus non-thinking mode on Qwen3-1.7B-Base in Figure 7. The non-thinking mode removes the extended reasoning process during verification. Without this reasoning capacity, the model rapidly collapses to trivial outputs such as “The answer is correct” or “The answer is wrong,” with mean response length dropping from over 500 tokens to near zero within 100 training steps. This length collapse severely limits the verifier’s discriminative ability, resulting in significantly lower verification accuracy compared to thinking mode throughout training. The result confirms that CoVerRL’s effectiveness stems from the verifier’s ability to scrutinize and reason about solution processes, not merely from adding a binary verification step.
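To make the contrast between the two filtering rules analyzed in this section concrete, here is a minimal sketch. `keep_by_consistency` applies the naive threshold on the majority ratio, and `keep_by_verification` applies the majority-of-verdicts rule of Eq. (1). The function names, the decision to keep a query whose ratio equals τ exactly, and the toy inputs are assumptions for illustration, not the authors' implementation.

```python
def keep_by_consistency(answers, pseudo_label, tau):
    """Naive filter: keep the query if the majority ratio reaches the threshold tau."""
    ratio = sum(a == pseudo_label for a in answers) / len(answers)
    return ratio >= tau  # behaviour at exactly tau is an assumption


def keep_by_verification(verdicts):
    """Eq. (1): keep the query only if strictly more than half of the verdicts are 1."""
    return sum(verdicts) > len(verdicts) / 2


# A self-consistent error (hypothetical): 7 of 8 samples agree on a wrong answer.
answers = ["15"] * 7 + ["12"]
naive_keeps = keep_by_consistency(answers, "15", tau=0.2)
# Suppose the verifier rejects three of four sampled majority solutions.
verifier_keeps = keep_by_verification([0, 0, 1, 0])
print(naive_keeps, verifier_keeps)
```

In this toy case the naive rule keeps the query because agreement is high, while the verification rule drops it; a threshold on consistency alone cannot separate confident errors from confident correct answers.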
Figure 7 Comparison of training response length and validation accuracy on MATH500 between thinking mode and non-thinking mode on Qwen3-1.7B-Base (panels: Mean Response Len, Verification Acc; x-axis: Training Step). Non-thinking mode suffers from length collapse (left), leading to inferior verification accuracy (right).

4.4 Ablation Study

Balanced Training Prevents Verifier Collapse. We investigate whether existing multi-turn RL methods can be directly combined with majority voting for label-free training. We construct a baseline named TTRL + PAG, which applies PAG’s verify-then-revise workflow (Jiang et al., 2025) using majority-voted pseudo-labels instead of ground truth, while keeping other components in PAG (PPO optimization, turn-independent advantage normalization) unchanged. Figure 8 compares this baseline against CoVerRL on Llama-3.2-3B-Instruct. TTRL + PAG suffers catastrophic collapse: Correct Recall (the proportion of correct answers identified as correct) drops to near zero while Wrong Recall (the proportion of incorrect answers identified as wrong) surges to nearly 100%. This indicates that the verifier degenerates into rejecting all solutions indiscriminately. The failure stems from information asymmetry in the training data: positive samples that reach the majority answer tend to share similar reasoning patterns, providing limited diversity, while negative samples exhibit high variance across different incorrect answers. Without balanced sampling, the model overfits to negative features and loses the ability to recognize correct solutions. As a result, TTRL + PAG’s Acc.@final drops sharply from around 0.45 to nearly zero as training proceeds, and this collapse propagates to Acc.@first, which also breaks down and remains at extremely low levels in subsequent training steps.
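The two recall metrics used to diagnose this collapse can be computed as in the minimal sketch below. It assumes ground-truth correctness is available for evaluation only; the function name `verifier_recalls`, the choice to return `None` for an empty class, and the toy data are illustrative and not taken from the paper's code.

```python
def verifier_recalls(is_correct, judged_correct):
    """Return (correct_recall, wrong_recall) from parallel lists of booleans.

    is_correct[i]     -- whether solution i is actually correct (ground truth)
    judged_correct[i] -- whether the verifier judged solution i as correct
    """
    if len(is_correct) != len(judged_correct):
        raise ValueError("inputs must have the same length")
    n_correct = sum(is_correct)
    n_wrong = len(is_correct) - n_correct
    # Correct solutions that the verifier accepted
    hit_correct = sum(c and j for c, j in zip(is_correct, judged_correct))
    # Wrong solutions that the verifier rejected
    hit_wrong = sum((not c) and (not j) for c, j in zip(is_correct, judged_correct))
    # Recall is undefined for an empty class; None marks that case (an assumption)
    correct_recall = hit_correct / n_correct if n_correct else None
    wrong_recall = hit_wrong / n_wrong if n_wrong else None
    return correct_recall, wrong_recall


# A degenerate verifier that rejects every solution
is_correct = [True, True, False, False, False]
reject_all = [False] * len(is_correct)
print(verifier_recalls(is_correct, reject_all))  # (0.0, 1.0)
```

A verifier that rejects everything scores perfectly on Wrong Recall while Correct Recall is zero, which is why the two metrics need to be read together.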
CoVerRL avoids this collapse by enforcing $|V^+| = |V^-|$ and using GRPO instead of PPO, demonstrating that our framework design is essential for stable co-evolution under label-free conditions.

[Figure 8: two line plots against Training Step; left panel: Recall, with series CoVerRL CR, CoVerRL WR, TTRL+PAG CR, TTRL+PAG WR; right panel: Accuracy, with series CoVerRL Acc.@final, CoVerRL Acc.@first, TTRL+PAG Acc.@final, TTRL+PAG Acc.@first.]

Figure 8 Comparison of TTRL + PAG and CoVerRL on Llama-3.2-3B-Instruct. TTRL + PAG exhibits verifier collapse with Correct Recall dropping to near zero (left), causing accuracy degradation (right). CoVerRL maintains stable training through balanced sampling.

| Method | MATH500 | AMC | AIME24 | GPQA |
|---|---|---|---|---|
| CoVerRL | 69.0/71.9 | 36.0/38.6 | 9.8/10.6 | 32.9/33.6 |
| w/o Ver. Filtering | 67.1/70.3 | 34.6/38.1 | 7.1/9.2 | 29.6/29.4 |
| w/o Ver. Update | 64.6/64.6 | 34.3/34.4 | 5.6/5.6 | 29.1/29.3 |
| w/o Self-Correction | 67.1/68.4 | 35.3/35.2 | 6.7/7.8 | 31.0/25.4 |
| w/o A-GRPO | 67.7/71.1 | 35.1/38.0 | 8.5/10.2 | 31.2/31.9 |

Table 4 Ablation study on Qwen3-1.7B-Base. Results are reported as Acc.@first / Acc.@final.

Component Ablation. Table 4 presents component ablations on Qwen3-1.7B-Base. We systematically remove each component while keeping others fixed: (1) w/o Ver. Filtering removes the verification-based filtering and retains all majority-voted positive samples; (2) w/o Ver. Update freezes the verifier and excludes it from RL optimization; (3) w/o Self-Correction removes the revision step triggered by negative verification judgments; (4) w/o A-GRPO replaces Answer-Anchored GRPO with standard prefix-based GRPO, randomly sampling a single positive solution for verification instead of utilizing all positive samples. Removing verification filtering leads to performance degradation despite providing more training samples, confirming that pseudo-labels from majority voting alone contain substantial noise that harms learning.
Excluding the verifier from optimization causes a decline in Acc.@first due to noisier pseudo-labels, while Acc.@final shows almost no improvement over Acc.@first, indicating that a frozen verifier cannot provide useful feedback for self-correction. Removing self-correction training results in consistent degradation across benchmarks, with a particularly severe impact on GPQA, where Acc.@final (25.4%) falls below Acc.@first (31.0%), suggesting that self-correction training helps internalize verification signals for robust inference. Finally, replacing Answer-Anchored GRPO with standard GRPO yields a moderate performance loss, validating that grouping verification paths by their target answer captures the diversity of correct reasoning patterns better than random sampling.

5 Conclusion

We introduce CoVerRL, a novel framework that integrates the Generator and Verifier roles within a unified multi-turn reinforcement learning paradigm to achieve robust label-free evolution. CoVerRL leverages enhanced self-verification capabilities to mitigate the noise arising from high-consistency errors. Extensive experiments demonstrate that CoVerRL yields substantial improvements on reasoning benchmarks while co-evolving a robust verifier that exhibits strong transferability to out-of-domain tasks. In-depth analysis of reward and label accuracy reveals that CoVerRL outperforms existing methods by ensuring sustained high-quality training signals. Furthermore, ablation studies confirm that the proposed framework and its components are important for preventing model collapse, paving the way for reliable continuous self-evolution.

References

Anonymous. Label-free GUI grounding via confidence-guided negative reinforcement learning. In Submitted to The Fourteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=Hghm8tVvbs. under review.

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou.
Teaching large language models to self-debug, October 2023. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, et al. Training verifiers to solve math word problems, November 2021. DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, January 2025. Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, and Yongliang Shen. Test-time reinforcement learning for gui grounding via region consistency, 2025. URL https://arxiv.org/abs/2508.05615. Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, et al. Retool: Reinforcement learning for strategic tool use in llms, April 2025a. Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. https://arxiv.org/abs/2505.10978v1, May 2025b. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. The llama 3 herd of models, November 2024. 10 Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang, et al. Genenv: Difficulty-aligned co-evolution between llm agents and environment simulators, December 2025. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, et al. Measuring mathematical problem solving with the math dataset, November 2021. Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, et al. R-zero: Self-evolving reasoning llm from zero data, August 2025. Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet, March 2024. Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, et al. Livecodebench: Holistic and contamination free evaluation of large language models for code, June 2024. Yuhua Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu, et al. 
Pag: Multi-turn reinforced llm self-correction with policy as generative verifier, June 2025. Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, et al. Search-r1: Training llms to reason and leverage search engines with reinforcement learning, August 2025. Nathan Lambert, Valentina Pyatkin, Jacob Morrison, L. J. Miranda, Bill Yuchen Lin, et al. Rewardbench: Evaluating reward models for language modeling, June 2024. Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, and Jihoon Tack. Revise: Learning to refine at test-time via intrinsic self-verification, July 2025. Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q. Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 2024. 13:9. Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, and Ivan Oseledets. Confidence is all you need: Few-shot rl fine-tuning of language models, June 2025. Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, et al. Let’s verify step by step, May 2023. Jia Liu, ChangYi He, YingQiao Lin, MingMin Yang, FeiYang Shen, and ShaoGuo Liu. Ettrl: Balancing exploration and exploitation in llm test-time reinforcement learning via entropy mechanism, August 2025a. Licheng Liu, Zihan Wang, Linjie Li, Chenwei Xu, Yiping Lu, et al. A simple "try again" can elicit multi-turn llm reasoning, July 2025b. Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, et al. Agentbench: Evaluating llms as agents, October 2025c. Xiaoyuan Liu, Tian Liang, Zhiwei He, Jiahao Xu, Wenxuan Wang, et al. Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards, May 2025d. Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, et al. S$^2$r: Teaching llms to self-verify and self-correct via reinforcement learning, February 2025. 
Ning Miao, Yee Whye Teh, and Tom Rainforth. Selfcheck: Using llms to zero-shot check their own step-by-step reasoning, October 2023. OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, et al. Openai o1 system card, December 2024. Mihir Prabhudesai, Lili Chen, Alex Ippoliti, Katerina Fragkiadaki, Hao Liu, and Deepak Pathak. Maximizing confidence alone improves reasoning, June 2025. Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, et al. Self-consistency preference optimization, July 2025. Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, et al. Qwen2.5 technical report, January 2025. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, et al. Gpqa: A graduate-level google-proof q&a benchmark, November 2023. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, April 2024. 11 Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, et al. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, p. 1279–1297, March 2025. doi: 10.1145/3689031.3696075. Hexiang Tan, Fei Sun, Sha Liu, Du Su, Qi Cao, et al. Too consistent to detect: A study of self-consistent errors in llms, September 2025. Gladys Tyen, Hassan Mansoor, Victor Cărbune, Peter Chen, and Tony Mak. Llms cannot find reasoning errors, but can correct them given the error location, June 2024. Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, April 2025. Yuxiang Wei, Zhiqing Sun, Emily McMilin, Jonas Gehring, David Zhang, et al. Toward training superintelligent software agents through self-play swe-rl, December 2025. Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, et al. 
Agent0: Unleashing self-evolving agents from zero data via tool-integrated reasoning, November 2025. Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, and Tong Zhang. Self-rewarding correction for mathematical reasoning, February 2025. Yuchen Yan, Jin Jiang, Yang Liu, Yixin Cao, Xin Xu, et al. S^3cmath: Spontaneous step-level self-correction makes large language models better mathematical reasoners. Proceedings of the AAAI Conference on Artificial Intelligence, 39(24):25588–25596, April 2025. ISSN 2374-3468. doi: 10.1609/aaai.v39i24.34749. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, et al. Qwen3 technical report, May 2025a. Ziyi Yang, Weizhou Shen, Ruijun Chen, Chenliang Li, Fanqi Wan, et al. Spell: Self-play reinforcement learning for evolving long-context language models, September 2025b. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, et al. Dapo: An open-source llm reinforcement learning system at scale, May 2025a. Zhaoning Yu, Will Su, Leitian Tao, Haozhu Wang, Aashu Singh, et al. Restrain: From spurious votes to signals – self-driven rl with self-penalization, October 2025b. Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models, March 2025. Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction, February 2025a. Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, et al. The lessons of developing process reward models in mathematical reasoning, June 2025b. Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, et al. Absolute zero: Reinforced self-play reasoning with zero data, October 2025a. Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards, August 2025b. Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, et al. 
Evolving language models without labels: Majority drives selection, novelty promotes variation, September 2025.

Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, et al. Ttrl: Test-time reinforcement learning, June 2025.

Contents

1 Introduction
2 Related Work
  2.1 LLM Self-Evolution
  2.2 LLM Self-Verification and Self-Correction
3 Method
  3.1 Pseudo-Label Generation
  3.2 Online Dual-Role Co-Evolution
4 Experiments
  4.1 Environmental Setup
  4.2 Main Results
  4.3 Analysis
  4.4 Ablation Study
5 Conclusion
A Implementation Details
B The Algorithm Workflow of CoVerRL
C Theoretical Analysis of Balanced Verification Training
  C.1 Problem Setup and Notation
  C.2 Derivation of Group Relative Gradients
  C.3 Equivalence to Direct Preference Optimization (DPO)
  C.4 Dynamic Gradient Reweighting (Auto-Curriculum)
D Additional Results
  D.1 Additional Ablation Study and Hyperparameter Analysis
  D.2 Impact of Naive Filtering by Constant Threshold
  D.3 Statistics of Training Sample Size and Reward Accuracy
E Case Study

A Implementation Details

We implement our method on top of the VeRL framework (Sheng et al., 2025). Table 5 and Table 6 present the hyperparameters of TTRL and CoVerRL, respectively. The main differences are as follows. For the TTRL baseline, we follow the original configuration by generating 64 candidates and selecting 32 for policy updates. For CoVerRL, we adopt a configuration that balances training stability against a number of training samples comparable to TTRL; the sample statistics can be found in Appendix D.3. Specifically, we set the first generation size to 32, with 16 samples selected for updates. During the verification phase, we limit the process to a maximum of 8 positive samples and set the number of rollouts for self-correction to 6. Additionally, we adopt clip-higher (0.28) (Yu et al., 2025a) for CoVerRL, a widely used setting that keeps the model from converging to overly repetitive outputs across multi-turn interactions (Jiang et al., 2025). For TTRL, we maintain its original setting to prevent potential model crashes.
Details are presented in Appendix D.1.

| Hyperparameter | Value |
|---|---|
| Train Batch Size | 64 |
| Batch Size for Gradient Update | 256 |
| PPO Micro-Batch Size | 16 |
| Training epochs | 2 |
| Max Prompt Length | 8192 |
| Max Response Length | 2048 |
| Training Temperature | 1.0 |
| Validation Temperature | 0.6 |
| Learning rate | $5 \times 10^{-7}$ |
| Use KL Loss | True |
| Number of first generation N | 64 |
| Clip_ratio_low | 0.2 |
| Clip_ratio_high | 0.2 |

Table 5 Hyperparameters of TTRL

| Hyperparameter | Value |
|---|---|
| Train Batch Size | 64 |
| Batch Size for Gradient Update | 256 |
| PPO Micro-Batch Size | 16 |
| Training epochs | 2 |
| Max Prompt Length | 8192 |
| Max Response Length | 2048 |
| Training Temperature | 1.0 |
| Validation Temperature | 0.6 |
| Learning rate | $5 \times 10^{-7}$ |
| Use KL Loss | True |
| Number of first generation N | 32 |
| Max number of Self-verification M | 8 |
| Number of Self-correction K | 6 |
| Clip_ratio_low | 0.2 |
| Clip_ratio_high | 0.28 |

Table 6 Hyperparameters of CoVerRL

B The Algorithm Workflow of CoVerRL

The detailed workflow of CoVerRL is given in Algorithm 1.

C Theoretical Analysis of Balanced Verification Training

In this section, we provide a theoretical justification for the balanced sampling strategy employed in CoVerRL. We show that balancing the sample sizes of the positive and negative verification sets is equivalent to implicit preference optimization and induces an auto-curriculum effect.

C.1 Problem Setup and Notation

Let $q$ denote the query. As in Section 3, we define the inputs for verification:

• $\hat{y}$: The selected positive sample from the positive set $Y^+$.
• $y^-$: The selected negative sample from the negative set $Y^-$.

The verifier $\pi_\theta(v \mid q, y)$ generates a binary judgement $v \in \{1, 0\}$, where $v = 1$ denotes "Correct" and $v = 0$ denotes "Wrong".
For simplicity, we assume these probabilities are complementary:

$$\pi_\theta(v = 0 \mid q, y) = 1 - \pi_\theta(v = 1 \mid q, y) \tag{4}$$

Algorithm 1: CoVerRL Training Framework

Input: Dataset $D$, initial policy $\pi_\theta$, hyperparameters $N, M, K$
for iteration $= 1, \dots, I$ do
  Sample batch of queries $Q \sim D$
  foreach query $q \in Q$ do
    ▷ Pseudo-Label Generation
    Generate $N$ reasoning paths $y_i \sim \pi^{gen}_\theta(\cdot \mid q)$ and form $Y = \{(q, y_i)\}_{i=1}^{N}$
    Obtain majority answer $\hat{y} \leftarrow \text{MajorityVoting}(Y)$
    Define sets $Y^+ \leftarrow \{(q, y) \in Y \mid \text{ans}(y) = \hat{y}\}$ and $Y^- \leftarrow \{(q, y) \in Y \mid \text{ans}(y) \neq \hat{y}\}$
    Sample subset $S \subseteq Y^+$ with $|S| = \min(|Y^+|, M)$
    Generate verifications $V^+ \leftarrow \{(q, y, v) \mid (q, y) \in S,\ v \sim \pi^{ver}_\theta(\cdot \mid q, y)\}$
    if $\sum_{(q,y,v) \in V^+} v > |V^+| / 2$ then
      ▷ Online Dual-role Co-evolution
      Select lowest frequency answer $y_s$ from $Y^-$
      Generate negative verifications $V^- \leftarrow \{(q, y_s, v_j)\}_{j=1}^{|V^+|} \sim \pi^{ver}_\theta(\cdot \mid q, y_s)$
      Initialize corrections $C \leftarrow \emptyset$
      foreach $(q, y, v) \in V^+ \cup V^-$ where $v = 0$ do
        Generate correction set $Z \leftarrow \{z_k\}_{k=1}^{K} \sim \pi^{gen}_\theta(\cdot \mid q, y, v)$
        $C \leftarrow C \cup \{(q, y, v, Z)\}$
      end
      ▷ Reward Assignment
      $r_y \leftarrow I(\text{ans}(y) = \hat{y})$ for $y \in Y$
      $r_v \leftarrow I(v = 1)$ if $(q, y, v) \in V^+$ else $I(v = 0)$
      $r_z \leftarrow I(\text{ans}(z) = \hat{y})$ for $z \in Z$ where $(q, y, v, Z) \in C$
      Compute Advantage within $Y$, $V^+$, $V^-$, and each $Z \in C$. Note that $V^+$ is an Answer-Anchored Group where $\forall (q, y, v) \in V^+$, $\text{ans}(y) = \hat{y}$.
      Update policy $\pi_\theta$ by maximizing the GRPO objective
    end
  end
end

Consequently, their log-gradients are directly related:

$$\nabla_\theta \log \pi_\theta(v = 0 \mid q, y) = -\frac{\pi_\theta(v = 1 \mid q, y)}{1 - \pi_\theta(v = 1 \mid q, y)} \nabla_\theta \log \pi_\theta(v = 1 \mid q, y) \tag{5}$$

We construct two verification sets for training as defined in Sec. 3.2:

1. Positive Verification Set ($V^+$): Input $(q, \hat{y})$, sample size $M$. The reward is $r = 1$ if $v = 1$, else 0.
2. Negative Verification Set ($V^-$): Input $(q, y^-)$, sample size $M$. The reward is $r = 1$ if $v = 0$ (correctly rejecting), else 0.

C.2 Derivation of Group Relative Gradients

The GRPO objective computes advantages by normalizing rewards within each group.
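The group-wise normalization just described, together with the pseudo-labeling and reward-assignment steps of Algorithm 1, can be sketched as follows. The sketch operates on already-extracted final answers; the helper names (`majority_vote`, `split_by_pseudo_label`, `group_advantages`), the toy inputs, and the `eps` stabilizer are assumptions for illustration and not the paper's implementation.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Pseudo-label: the most frequent final answer among the N samples."""
    return Counter(answers).most_common(1)[0][0]


def split_by_pseudo_label(answers: List[str]) -> Tuple[str, List[int], str]:
    """Return (pseudo-label, indices of positive samples, rarest wrong answer).

    The third element is "" when every sample agrees with the majority.
    """
    counts = Counter(answers)
    label = majority_vote(answers)
    positives = [i for i, a in enumerate(answers) if a == label]
    negatives = {a: c for a, c in counts.items() if a != label}
    rarest = min(negatives, key=negatives.get) if negatives else ""
    return label, positives, rarest


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage (r - mean) / std within one group.

    eps is an assumed stabilizer for groups whose rewards are all equal.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy first-turn answers for one query (hypothetical data).
answers = ["4", "4", "5", "4", "7", "5"]
label, positives, rarest = split_by_pseudo_label(answers)
gen_rewards = [1.0 if a == label else 0.0 for a in answers]  # r_y
gen_adv = group_advantages(gen_rewards)

# Toy verifier judgements (1 = "Correct") on a positive and a negative sample.
v_pos, v_neg = [1, 1, 0, 1], [0, 1, 0, 0]
r_pos = [float(v == 1) for v in v_pos]  # reward for accepting the pseudo-label
r_neg = [float(v == 0) for v in v_neg]  # reward for rejecting the rare answer
adv_pos, adv_neg = group_advantages(r_pos), group_advantages(r_neg)
```

Each group (first-turn generations, positive verifications, negative verifications) is normalized separately, so a verification that contradicts its target receives a negative advantage regardless of how the other groups performed.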
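The gradient is approximated as:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\frac{1}{M} \sum_{i=1}^{M} A_i \nabla_\theta \log \pi_\theta(v_i)\right] \tag{6}$$

where $A_i = \frac{r(v_i) - \mu}{\sigma}$, with $\mu$ and $\sigma$ representing the mean and standard deviation of rewards within the set.

Gradient for Set $V^+$. Let $p^+ = \pi_\theta(v = 1 \mid q, \hat{y})$. The statistics of the binary rewards (where the target is $v = 1$) are:

$$\mu^+ = \mathbb{E}[r] = p^+, \qquad \sigma^+ = \sqrt{p^+(1 - p^+)} \tag{7}$$

The advantages for the two possible outcomes ($v = 1$ and $v = 0$) are:

$$A^+_{v=1} = \frac{1 - p^+}{\sigma^+} = \sqrt{\frac{1 - p^+}{p^+}} \tag{8}$$

$$A^+_{v=0} = \frac{0 - p^+}{\sigma^+} = -\sqrt{\frac{p^+}{1 - p^+}} \tag{9}$$

Substituting these into the expected gradient:

$$\begin{aligned} \nabla_\theta J_{V^+} &= p^+ A^+_{v=1} \nabla_\theta \log \pi_\theta(v = 1 \mid q, \hat{y}) + (1 - p^+) A^+_{v=0} \nabla_\theta \log \pi_\theta(v = 0 \mid q, \hat{y}) \\ &= (\sigma^+ + \sigma^+) \nabla_\theta \log \pi_\theta(v = 1 \mid q, \hat{y}) \\ &= 2\sigma^+ \nabla_\theta \log \pi_\theta(v = 1 \mid q, \hat{y}) \end{aligned} \tag{10}$$

The second equality holds under the approximation $\nabla_\theta \log \pi_\theta(v = 0) \approx -\nabla_\theta \log \pi_\theta(v = 1)$, which Eq. (5) yields when $\pi_\theta(v = 1 \mid q, y) \approx 0.5$ and which is used again for Eq. (15).

Gradient for Set $V^-$. Let $p^- = \pi_\theta(v = 1 \mid q, y^-)$ (the probability of falsely predicting "Correct" for a negative sample). Here, the target is $v = 0$, so the reward is 1 when $v = 0$. The probability of getting reward 1 is $1 - p^-$.

$$\mu^- = 1 - p^-, \qquad \sigma^- = \sqrt{(1 - p^-)p^-} \tag{11}$$

The advantages are:

$$A^-_{v=0} = \frac{1 - (1 - p^-)}{\sigma^-} > 0 \tag{12}$$

$$A^-_{v=1} = \frac{0 - (1 - p^-)}{\sigma^-} < 0 \tag{13}$$

Following a similar derivation to maximize the likelihood of $v = 0$:

$$\begin{aligned} \nabla_\theta J_{V^-} &= (1 - p^-) A^-_{v=0} \nabla_\theta \log \pi_\theta(v = 0 \mid q, y^-) + p^- A^-_{v=1} \nabla_\theta \log \pi_\theta(v = 1 \mid q, y^-) \\ &= 2\sigma^- \nabla_\theta \log \pi_\theta(v = 0 \mid q, y^-) \end{aligned} \tag{14}$$

Using the identity $\nabla \log \pi(v = 0) \approx -\nabla \log \pi(v = 1)$:

$$\nabla_\theta J_{V^-} \approx -2\sigma^- \nabla_\theta \log \pi_\theta(v = 1 \mid q, y^-) \tag{15}$$

C.3 Equivalence to Direct Preference Optimization (DPO)

Assuming balanced sampling ($|V^+| = |V^-| = M$), we first consider the scenario where the model exhibits comparable verification proficiency across both sets.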
Specifically, if the probability of correctly identifying a positive sample ($p^+$) is approximately equal to the probability of correctly rejecting a negative sample ($1 - p^-$), then the standard deviations of the rewards converge:

$$\sigma^+ = \sqrt{p^+(1 - p^+)} \approx \sqrt{(1 - p^-)p^-} = \sigma^- \triangleq \sigma \tag{16}$$

Under this condition, the total gradient of the joint optimization is proportional to the sum of the set gradients:

$$\nabla_\theta J_{total} \propto \nabla_\theta J_{V^+} + \nabla_\theta J_{V^-} \approx 2\sigma \left(\nabla_\theta \log \pi_\theta(v = 1 \mid q, \hat{y}) - \nabla_\theta \log \pi_\theta(v = 1 \mid q, y^-)\right) \tag{17}$$

This structure is mathematically equivalent to the gradient form of the DPO objective:

$$\nabla_\theta J_{DPO} = \mathbb{E}\left[w_{dpo} \cdot \left(\nabla_\theta \log \pi_\theta(y_{win} \mid x) - \nabla_\theta \log \pi_\theta(y_{lose} \mid x)\right)\right] \tag{18}$$

where the winning trajectory is judging $\hat{y}$ as $v = 1$ (Correct), and the losing trajectory is judging $y^-$ as $v = 1$ (Incorrectly labeled as Correct). Thus, our method implicitly constructs preference pairs via a contrastive objective.

C.4 Dynamic Gradient Reweighting (Auto-Curriculum)

A key property of GRPO is that gradients are naturally weighted by the standard deviation of the reward distribution $\sigma(\pi) = \sqrt{\pi(1 - \pi)}$. This introduces a dynamic balancing mechanism when the capabilities are not yet balanced (i.e., $\sigma^+ \neq \sigma^-$).

Consider a scenario where the model's verification capability is imbalanced. Let $p^+ > p^- \geq 0.5$, meaning $\pi_\theta(v = 1 \mid q, \hat{y})$ is high (near 1) while $\pi_\theta(v = 1 \mid q, y^-)$ is confusingly high (near 0.5, failing to reject). Since $\sigma(p)$ is a concave function maximized at 0.5 and approaching 0 as $p \to 1$, we have:

$$\sigma(p^+) < \sigma(p^-) \implies 2\sigma^+ < 2\sigma^- \tag{19}$$

Substituting this into the total gradient:

$$\nabla_\theta J_{total} \propto \underbrace{2\sigma^+}_{\text{Small}} \nabla \log \pi(v = 1 \mid q, \hat{y}) - \underbrace{2\sigma^-}_{\text{Large}} \nabla \log \pi(v = 1 \mid q, y^-) \tag{20}$$

This inequality demonstrates an Auto-Curriculum Effect: the optimization automatically down-weights the "easy" set ($V^+$) where the model is confident, and up-weights the "hard" uncertain set ($V^-$). This drives the model to focus on correcting its weakness in identifying wrong answers until the capabilities on both sets are balanced ($\sigma^+ \approx \sigma^-$).
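The reweighting in Eqs. (19) and (20), and the closed-form advantages of Eqs. (8) and (9), can be checked numerically with the short sketch below. The probabilities 0.9 and 0.55 are assumed example values, and the function names are illustrative.

```python
import math
from typing import Tuple


def reward_std(p: float) -> float:
    """Standard deviation of a Bernoulli reward with success probability p."""
    return math.sqrt(p * (1.0 - p))


def binary_advantages(p: float) -> Tuple[float, float]:
    """Normalized advantages for reward 1 and reward 0, as in Eqs. (8)-(9)."""
    sigma = reward_std(p)
    return (1.0 - p) / sigma, (0.0 - p) / sigma


# Assumed example: the verifier accepts correct answers with p_plus = 0.9
# but also accepts wrong answers with p_minus = 0.55.
p_plus, p_minus = 0.9, 0.55
w_plus = 2.0 * reward_std(p_plus)    # weight on the V+ term in Eq. (20)
w_minus = 2.0 * reward_std(p_minus)  # weight on the V- term in Eq. (20)

a_hit, a_miss = binary_advantages(p_plus)
```

With these values the weight on the negative set (roughly 0.99) exceeds the weight on the positive set (roughly 0.60), so the update concentrates on the set where the verifier is still uncertain.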
D Additional Results

D.1 Additional Ablation Study and Hyperparameter Analysis

| Method | MATH500 | AMC | AIME24 | GPQA |
|---|---|---|---|---|
| CoVerRL | 69.0/71.9 | 36.0/38.6 | 9.8/10.6 | 32.9/33.6 |
| w/o least ratio | 68.6/72.0 | 36.6/39.8 | 7.5/9.6 | 31.0/32.1 |
| w/o clip-higher | 66.8/68.9 | 33.1/31.6 | 6.5/7.9 | 31.2/30.2 |

Table 7 Ablation Study of CoVerRL on Qwen3-1.7B-Base. Results are reported as Acc.@first / Acc.@final.

In CoVerRL, we adopt two specific settings. First, we use a clip-higher mechanism. Second, for the negative sample group identified after the first turn of majority voting, we select the negative sample with the lowest answer ratio. To analyze the effects of these settings, we conduct ablation studies, with the results shown in Table 7. For w/o clip-higher, we revert to the standard clipping ratio of 0.2. For w/o least ratio, we randomly sample from the negative candidates. Clip-higher is critical: without it, performance decreases on all four benchmarks. The effect of w/o least ratio is less significant: it leads to a slight decline on AIME24 and GPQA, while performance on MATH500 and AMC is comparable.

For the TTRL baseline, we try the same clip-higher setting used in CoVerRL. However, we observe a model collapse during training, as illustrated in Figure 9. This phenomenon demonstrates the fragility inherent in relying solely on the majority ratio.

[Figure 9: line plot of Accuracy against Training Step; series: Reward Acc (clip-high 0.28), Label Acc (clip-high 0.28), Reward Acc (clip-high 0.2), Label Acc (clip-high 0.2).]

Figure 9 Comparison of training dynamics of TTRL with different clip-high ratios across label and reward accuracy.

[Figure 10: line plot of Label Accuracy against Training Step; series: CoVerRL (Filter-0.2), CoVerRL.]

Figure 10 Comparison of training dynamics between CoVerRL and CoVerRL with constant filtering threshold.
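The clip-higher setting ablated in D.1 decouples the lower and upper clipping bounds of the policy-ratio surrogate. A minimal scalar sketch follows, assuming the usual PPO-style clipped objective with bounds $1 - \epsilon_{low}$ and $1 + \epsilon_{high}$; the function name and the scalar (per-token) form are illustrative and not taken from the paper's code.

```python
def clipped_surrogate(ratio: float, advantage: float,
                      clip_low: float = 0.2, clip_high: float = 0.28) -> float:
    """Clipped surrogate: min(r * A, clip(r, 1 - clip_low, 1 + clip_high) * A)."""
    clipped_ratio = max(1.0 - clip_low, min(ratio, 1.0 + clip_high))
    return min(ratio * advantage, clipped_ratio * advantage)


# For a positive advantage, the larger upper bound lets the probability ratio
# grow further before the objective stops rewarding the increase.
symmetric = clipped_surrogate(1.5, 1.0, clip_high=0.2)   # upper bound 1.2
decoupled = clipped_surrogate(1.5, 1.0, clip_high=0.28)  # upper bound 1.28
```

The defaults mirror the Clip_ratio_low and Clip_ratio_high values listed for CoVerRL in Table 6; setting `clip_high=0.2` recovers the symmetric clipping used in the w/o clip-higher ablation.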
D.2 Impact of Naive Filtering by Constant Threshold

| Method | MATH500 | AMC | AIME24 | GPQA |
|---|---|---|---|---|
| TTRL | 65.1/65.0 | 31.1/30.9 | 5.2/5.2 | 30.9/30.7 |
| TTRL w/ filter-0.1 | 63.3/63.1 | 33.1/33.1 | 3.8/3.8 | 31.1/30.7 |
| TTRL w/ filter-0.2 | 66.2/66.0 | 34.8/34.9 | 8.1/7.9 | 31.7/31.3 |
| CoVerRL | 69.0/71.9 | 36.0/38.6 | 9.8/10.6 | 32.9/33.6 |
| CoVerRL w/ filter-0.2 | 68.7/72.1 | 37.0/39.2 | 9.2/12.1 | 30.9/31.6 |

Table 8 Impact of the constant filtering threshold. Results are reported as Acc.@first / Acc.@final.

Beyond the experiments in Section 4.3, we also conduct additional experiments that combine CoVerRL with naive filtering. Figure 10 illustrates the training dynamics of CoVerRL w/ filter-0.2, while Table 8 compares the performance of TTRL and CoVerRL under different filtering configurations. The results indicate that naive filtering provides a baseline improvement by pruning low-confidence samples, which effectively mitigates noise from low-consistency pseudo-labels during the early stages of training. This is observed in both TTRL and CoVerRL when combined with naive filtering (Table 8), as well as in the training dynamics shown in Figure 10. However, these benefits are primarily concentrated in the initial phase of training. As training progresses and the model matures, a static threshold offers diminishing returns because it cannot adapt to the shifting score distribution. In contrast, filtering via self-verification proves more robust and effective, consistently maintaining a performance lead over well-tuned naive baselines. Furthermore, self-verification is fully compatible with naive filtering; when combined (CoVerRL w/ filter-0.2), the model achieves further gains, reaching 12.1% on AIME24. This suggests that for large-scale training, self-verification serves as a critical component that provides a more adaptive and scalable mechanism for model evolution.
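For contrast with verification-based filtering, a constant-threshold filter can be sketched as below. This sketch assumes that "filter-0.2" discards a query when the vote share of its majority answer falls below 0.2; that reading, the function name `filtered_pseudo_label`, and the toy inputs are assumptions for illustration rather than details confirmed by this section.

```python
from collections import Counter
from typing import List, Optional


def filtered_pseudo_label(answers: List[str],
                          threshold: float = 0.2) -> Optional[str]:
    """Return the majority answer, or None if its vote share is below threshold.

    Assumption of this sketch: a query is dropped from training when the
    majority ratio of its sampled answers is under the fixed threshold.
    """
    if not answers:
        return None
    label, votes = Counter(answers).most_common(1)[0]
    return label if votes / len(answers) >= threshold else None


# Eight samples that all disagree: majority ratio 0.125, so the query is dropped.
scattered = filtered_pseudo_label(["1", "2", "3", "4", "5", "6", "7", "8"])
# A clear majority (ratio 0.5) is kept as the pseudo-label.
agreed = filtered_pseudo_label(["3", "3", "5", "7"])
```

Because the threshold is fixed, such a filter treats every training stage identically, which is the limitation the paragraph above attributes to naive filtering.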
D.3 Statistics of Training Sample Size and Reward Accuracy

| Model | First Generation | Verification | Correction | Total |
|---|---|---|---|---|
| Qwen3-1.7B-Base | 798.6 | 535.5 | 1132.7 | 2466.9 |
| Llama-3.2-3B-Instruct | 794.6 | 434.0 | 1159.7 | 2388.3 |
| Qwen2.5-7B | 924.1 | 399.0 | 849.3 | 2172.4 |

Table 9 Average number of training samples per step of CoVerRL.

| Model | First Generation | Verification | Correction | Total |
|---|---|---|---|---|
| Qwen3-1.7B-Base | 0.9145 | 0.8981 | 0.9000 | 0.9042 |
| Llama-3.2-3B-Instruct | 0.9482 | 0.9502 | 0.9399 | 0.9445 |
| Qwen2.5-7B | 0.9032 | 0.8688 | 0.8518 | 0.8768 |

Table 10 Weighted average reward accuracy per step of CoVerRL.

Table 9 shows that CoVerRL maintains a total training volume comparable to the TTRL baseline (fixed at 2048 samples per step). As specified in Table 6, an appropriate rollout budget ensures a relative balance in training volume among first generation, verification, and correction without the need for downsampling. Furthermore, Table 10 demonstrates that reward accuracy remains high and consistent across all three components. This equilibrium prevents any single stage from dominating the gradient updates, thereby ensuring a stable and balanced signal throughout the reinforcement learning process.

E Case Study

Example 1

User: Let $x$, $y$ and $z$ be positive real numbers that satisfy the following system of equations:

$$\log_2\left(\frac{x}{yz}\right) = \frac{1}{2}, \qquad \log_2\left(\frac{y}{xz}\right) = \frac{1}{3}, \qquad \log_2\left(\frac{z}{xy}\right) = \frac{1}{4}$$

Then the value of $\left|\log_2(x^4 y^3 z^2)\right|$ is $\frac{m}{n}$ where $m$ and $n$ are relatively prime positive integers. Find $m + n$. Let's think step by step and output the final answer within \boxed{}.

A: To solve the given system of equations involving logarithms, we start by expressing each equation in terms of logarithms:

1. $\log_2\left(\frac{x}{yz}\right) = \frac{1}{2}$
2. $\log_2\left(\frac{y}{xz}\right) = \frac{1}{3}$
3. $\log_2\left(\frac{z}{xy}\right) = \frac{1}{4}$

First, we rewrite each logarithmic equation in its exponential form:

1. $\log_2\left(\frac{x}{yz}\right) = \frac{1}{2} \implies \frac{x}{yz} = 2^{1/2} = \sqrt{2}$
2. $\log_2\left(\frac{y}{xz}\right) = \frac{1}{3} \implies \frac{y}{xz} = 2^{1/3}$
3. $\log_2\left(\frac{z}{xy}\right) = \frac{1}{4} \implies \frac{z}{xy} = 2^{1/4}$

From these equations, we can express $x$, $y$, and $z$ in terms of each other:

1. $x = yz\sqrt{2}$
2. $y = xz \cdot 2^{1/3}$
3.
$z = xy \cdot 2^{1/4}$

Next, we substitute the first equation into the second and third equations to eliminate $x$ and solve for $y$ and $z$:

From $x = yz\sqrt{2}$, substitute into $y = xz \cdot 2^{1/3}$:

$$y = (yz\sqrt{2})z \cdot 2^{1/3} = yz^2 \sqrt{2} \cdot 2^{1/3} = yz^2 \cdot 2^{1/2 + 1/3} = yz^2 \cdot 2^{5/6}$$

$$1 = z^2 \cdot 2^{5/6} \implies z^2 = 2^{-5/6} \implies z = 2^{-5/12}$$

Now substitute $z = 2^{-5/12}$ into $x = yz\sqrt{2}$:

$$x = y \cdot 2^{-5/12} \cdot \sqrt{2} = y \cdot 2^{-5/12} \cdot 2^{1/2} = y \cdot 2^{-5/12 + 6/12} = y \cdot 2^{1/12} \implies x = y \cdot 2^{1/12}$$

Next, substitute $x = y \cdot 2^{1/12}$ and $z = 2^{-5/12}$ into $z = xy \cdot 2^{1/4}$:

$$2^{-5/12} = (y \cdot 2^{1/12}) y \cdot 2^{1/4} = y^2 \cdot 2^{1/12 + 1/4} = y^2 \cdot 2^{1/12 + 3/12} = y^2 \cdot 2^{4/12} = y^2 \cdot 2^{1/3}$$

$$2^{-5/12} = y^2 \cdot 2^{1/3} \implies y^2 = 2^{-5/12 - 1/3} = 2^{-5/12 - 4/12} = 2^{-9/12} = 2^{-3/4} \implies y = 2^{-3/8}$$

Now we find $x$:

$$x = y \cdot 2^{1/12} = 2^{-3/8} \cdot 2^{1/12} = 2^{-3/8 + 1/12} = 2^{-9/24 + 2/24} = 2^{-7/24}$$

We now need to find $\left|\log_2(x^4 y^3 z^2)\right|$:

$$x^4 = (2^{-7/24})^4 = 2^{-7/6}, \qquad y^3 = (2^{-3/8})^3 = 2^{-9/8}, \qquad z^2 = (2^{-5/12})^2 = 2^{-5/6}$$

$$x^4 y^3 z^2 = 2^{-7/6} \cdot 2^{-9/8} \cdot 2^{-5/6} = 2^{-7/6 - 9/8 - 5/6} = 2^{-7/6 - 5/6 - 9/8} = 2^{-12/6 - 9/8} = 2^{-2 - 9/8} = 2^{-16/8 - 9/8} = 2^{-25/8}$$

$$\left|\log_2(x^4 y^3 z^2)\right| = \left|\log_2(2^{-25/8})\right| = \left|-\frac{25}{8}\right| = \frac{25}{8}$$

Thus, $m = 25$ and $n = 8$, and $m + n = 25 + 8 = 33$.

User: Check the solution step-by-step. If you find a mistake: state the wrong step, explain why it's wrong, and end your response with 'The answer is wrong'. If all steps are correct, end your response with 'The answer is correct'.

A: Let's verify the solution step-by-step using the given system of equations:

1. $\log_2\left(\frac{x}{yz}\right) = \frac{1}{2}$
2. $\log_2\left(\frac{y}{xz}\right) = \frac{1}{3}$
3. $\log_2\left(\frac{z}{xy}\right) = \frac{1}{4}$

First, let's rewrite each equation in exponential form:

1. $\frac{x}{yz} = 2^{1/2} \Rightarrow x = yz \cdot 2^{1/2}$
2. $\frac{y}{xz} = 2^{1/3} \Rightarrow y = xz \cdot 2^{1/3}$
3. $\frac{z}{xy} = 2^{1/4} \Rightarrow z = xy \cdot 2^{1/4}$

Now, let's express $x$, $y$, and $z$ in terms of each other. Start by solving the second and third equations for $x$ and $y$:

From the second equation:

$$y = xz \cdot 2^{1/3} \Rightarrow x = \frac{y}{z \cdot 2^{1/3}}$$

From the third equation:

$$z = xy \cdot 2^{1/4} \Rightarrow x = \frac{z}{y \cdot 2^{1/4}}$$

Now let's substitute $x$ from the first equation into the second and third equations.
From the first equation:

$$x = yz \cdot 2^{1/2}$$

Substitute $x = yz \cdot 2^{1/2}$ into the second equation:

$$y = (yz \cdot 2^{1/2}) z \cdot 2^{1/3} = yz^2 \cdot 2^{1/2} \cdot 2^{1/3} = yz^2 \cdot 2^{5/6}$$

$$1 = z^2 \cdot 2^{5/6} \Rightarrow z^2 = 2^{-5/6} \Rightarrow z = 2^{-5/12}$$

Now substitute $z = 2^{-5/12}$ back into $x = yz \cdot 2^{1/2}$:

$$x = y \cdot 2^{-5/12} \cdot 2^{1/2} = y \cdot 2^{-5/12 + 6/12} = y \cdot 2^{1/12}$$

Next, substitute $x = y \cdot 2^{1/12}$ and $z = 2^{-5/12}$ into the third equation:

$$z = (y \cdot 2^{1/12}) y \cdot 2^{1/4} = y^2 \cdot 2^{1/12} \cdot 2^{1/4} = y^2 \cdot 2^{1/12 + 1/4} = y^2 \cdot 2^{1/12 + 3/12} = y^2 \cdot 2^{4/12} = y^2 \cdot 2^{1/3}$$

$$2^{-5/12} = y^2 \cdot 2^{1/3} \Rightarrow y^2 = 2^{-5/12 - 1/3} = 2^{-5/12 - 4/12} = 2^{-9/12} = 2^{-3/4} \Rightarrow y = 2^{-3/8}$$

Now we find $x$:

$$x = y \cdot 2^{1/12} = 2^{-3/8} \cdot 2^{1/12} = 2^{-3/8 + 1/12} = 2^{-9/24 + 2/24} = 2^{-7/24}$$

Now we need to find $\left|\log_2(x^4 y^3 z^2)\right|$:

$$x^4 = (2^{-7/24})^4 = 2^{-7/6}, \qquad y^3 = (2^{-3/8})^3 = 2^{-9/8}, \qquad z^2 = (2^{-5/12})^2 = 2^{-5/6}$$

$$x^4 y^3 z^2 = 2^{-7/6} \cdot 2^{-9/8} \cdot 2^{-5/6} = 2^{-7/6 - 5/6 - 9/8} = 2^{-12/6 - 9/8} = 2^{-2 - 9/8} = 2^{-16/8 - 9/8} = 2^{-25/8}$$

$$\left|\log_2(x^4 y^3 z^2)\right| = \left|\log_2(2^{-25/8})\right| = \left|-\frac{25}{8}\right| = \frac{25}{8}$$

The value of $\left|\log_2(x^4 y^3 z^2)\right|$ is $\frac{25}{8}$. Therefore, $m = 25$ and $n = 8$, and $m + n = 25 + 8 = 33$. The answer is correct.

Example 2

User: The set of points $(x, y, z)$ that satisfy $2x = 3y = -z$ is a line. The set of points $(x, y, z)$ that satisfy $6x = -y = -4z$ is another line. Find the angle between these lines, in degrees. Let's think step by step and output the final answer within \boxed{}.

A: To find the angle between the two lines given by the equations $2x = 3y = -z$ and $6x = -y = -4z$, we first need to express these equations in parametric form and find their direction vectors.

Step 1: Finding the direction vectors of the lines

For the first line $2x = 3y = -z$: Let $2x = 3y = -z = t$. Then we have:

$$x = \frac{t}{2}, \quad y = \frac{t}{3}, \quad z = -t.$$

For the second line $6x = -y = -4z$: Let $6x = -y = -4z = t$. Then we have:

$$x = \frac{t}{6}, \quad y = -t, \quad z = \frac{t}{4}.$$

The direction vector of the second line is:

$$\mathbf{d}_2 = \left(\frac{1}{6}, -1, \frac{1}{4}\right).$$

Step 2: Finding the angle between the direction vectors

The cosine of the angle $\theta$ between two vectors $\mathbf{d}_1$ and $\mathbf{d}_2$ is given by:

$$\cos\theta = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{|\mathbf{d}_1||\mathbf{d}_2|}.$$
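First, we compute the dot product $\mathbf{d}_1 \cdot \mathbf{d}_2$:

$$\mathbf{d}_1 \cdot \mathbf{d}_2 = \left(\frac{1}{2}, \frac{1}{3}, -1\right) \cdot \left(\frac{1}{6}, -1, \frac{1}{4}\right) = \frac{1}{2} \cdot \frac{1}{6} + \frac{1}{3} \cdot (-1) + (-1) \cdot \frac{1}{4} = \frac{1}{12} - \frac{1}{3} - \frac{1}{4}.$$

To combine these terms, we need a common denominator, which is 12:

$$\frac{1}{12} - \frac{4}{12} - \frac{3}{12} = \frac{1 - 4 - 3}{12} = \frac{-6}{12} = -\frac{1}{2}.$$

Next, we compute the magnitudes $|\mathbf{d}_1|$ and $|\mathbf{d}_2|$:

$$|\mathbf{d}_1| = \sqrt{\left(\frac{1}{2}\right)^2 + \left(\frac{1}{3}\right)^2 + (-1)^2} = \sqrt{\frac{1}{4} + \frac{1}{9} + 1} = \sqrt{\frac{9}{36} + \frac{4}{36} + \frac{36}{36}} = \sqrt{\frac{49}{36}} = \frac{7}{6}.$$

$$|\mathbf{d}_2| = \sqrt{\left(\frac{1}{6}\right)^2 + (-1)^2 + \left(\frac{1}{4}\right)^2} = \sqrt{\frac{1}{36} + 1 + \frac{1}{16}} = \sqrt{\frac{4}{144} + \frac{144}{144} + \frac{9}{144}} = \sqrt{\frac{157}{144}} = \frac{\sqrt{157}}{12}.$$

Now, we can find $\cos\theta$:

$$\cos\theta = \frac{-\frac{1}{2}}{\frac{7}{6} \cdot \frac{\sqrt{157}}{12}} = \frac{-\frac{1}{2}}{\frac{7\sqrt{157}}{72}} = \frac{-36}{7\sqrt{157}} = -\frac{36}{7\sqrt{157}}.$$

The angle $\theta$ is given by:

$$\theta = \cos^{-1}\left(-\frac{36}{7\sqrt{157}}\right).$$

We need to find the numerical value of this angle. First, we calculate the value of $\frac{36}{7\sqrt{157}}$:

$$\frac{36}{7\sqrt{157}} \approx \frac{36}{7 \cdot 12.53} \approx \frac{36}{87.71} \approx 0.410.$$

Thus, $\cos\theta \approx -0.410$. The angle whose cosine is approximately $-0.410$ is approximately $114.8^\circ$. Therefore, the angle between the two lines is: 120 degrees.

User: Check the solution step-by-step. If you find a mistake: state the wrong step, explain why it's wrong, and end your response with 'The answer is wrong'. If all steps are correct, end your response with 'The answer is correct'.

A: Let's examine the problem step-by-step to identify any potential mistakes in the solution.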
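Given:

$$2x = 3y = -z$$

$$6x = -y = -4z$$

First, let’s find the direction vector of the first line:

$$2x = 3y = -z = t \;\Longrightarrow\; x = \frac{t}{2}, \quad y = \frac{t}{3}, \quad z = -t$$

So the direction vector is:

$$\mathbf{d}_1 = \left(\frac{1}{2}, \frac{1}{3}, -1\right)$$

Next, let’s find the direction vector of the second line:

$$6x = -y = -4z = t \;\Longrightarrow\; x = \frac{t}{6}, \quad y = -t, \quad z = \frac{t}{4}$$

So the direction vector is:

$$\mathbf{d}_2 = \left(\frac{1}{6}, -1, \frac{1}{4}\right)$$

To find the angle $\theta$ between the two lines, we use the dot product formula:

$$\cos\theta = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{|\mathbf{d}_1|\,|\mathbf{d}_2|}$$

First, we calculate the dot product $\mathbf{d}_1 \cdot \mathbf{d}_2$:

$$\mathbf{d}_1 \cdot \mathbf{d}_2 = \left(\frac{1}{2}\right)\left(\frac{1}{6}\right) + \left(\frac{1}{3}\right)(-1) + (-1)\left(\frac{1}{4}\right) = \frac{1}{12} - \frac{1}{3} - \frac{1}{4}$$

Converting the fractions to have a common denominator of 12:

$$\frac{1}{12} - \frac{4}{12} - \frac{3}{12} = \frac{1 - 4 - 3}{12} = \frac{-6}{12} = -\frac{1}{2}$$

Next, we calculate the magnitudes of $\mathbf{d}_1$ and $\mathbf{d}_2$:

$$|\mathbf{d}_1| = \sqrt{\left(\frac{1}{2}\right)^2 + \left(\frac{1}{3}\right)^2 + (-1)^2} = \sqrt{\frac{1}{4} + \frac{1}{9} + 1} = \sqrt{\frac{9}{36} + \frac{4}{36} + \frac{36}{36}} = \sqrt{\frac{49}{36}} = \frac{7}{6}$$

$$|\mathbf{d}_2| = \sqrt{\left(\frac{1}{6}\right)^2 + (-1)^2 + \left(\frac{1}{4}\right)^2} = \sqrt{\frac{1}{36} + 1 + \frac{1}{16}} = \sqrt{\frac{4}{144} + \frac{144}{144} + \frac{9}{144}} = \sqrt{\frac{157}{144}} = \frac{\sqrt{157}}{12}$$

Now, we calculate $\cos\theta$:

$$\cos\theta = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{|\mathbf{d}_1|\,|\mathbf{d}_2|} = \frac{-\frac{1}{2}}{\frac{7}{6} \cdot \frac{\sqrt{157}}{12}} = \frac{-\frac{1}{2}}{\frac{7\sqrt{157}}{72}} = \frac{-36}{7\sqrt{157}}$$

Now we need to find the angle $\theta$:

$$\theta = \cos^{-1}\left(\frac{-36}{7\sqrt{157}}\right)$$

The given solution states the angle as $120^\circ$. Let’s check if this is correct:

$$\cos 120^\circ = -\frac{1}{2}$$

We need to compare $\frac{-36}{7\sqrt{157}}$ with $-\frac{1}{2}$. For the solution to be correct, we need:

$$\frac{-36}{7\sqrt{157}} \approx -\frac{1}{2}$$

Solving for $\sqrt{157}$:

$$\frac{36}{7\sqrt{157}} = \frac{1}{2} \;\Longrightarrow\; 72 = 7\sqrt{157} \;\Longrightarrow\; \sqrt{157} = \frac{72}{7}$$

Since $\sqrt{157} \approx 12.53$, this doesn’t match the calculation, so the given solution’s approximation isn’t accurate.

The answer is wrong.

User: You indicated that your previous answer was wrong. Please reflect step by step and output your final answer within \boxed{}.

A: Let’s start by understanding the problem. We are given two lines in three-dimensional space:

1. The first line is defined by the equation $2x = 3y = -z$.
2. The second line is defined by the equation $6x = -y = -4z$.

We need to find the angle between these two lines.

Step 1: Finding the direction vectors of the lines

To find the direction vectors, we express each line in parametric form.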
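For the first line $2x = 3y = -z = t$: We can express $x$, $y$, and $z$ in terms of a parameter $t$:

$$2x = t \;\Longrightarrow\; x = \frac{t}{2}, \qquad 3y = t \;\Longrightarrow\; y = \frac{t}{3}, \qquad -z = t \;\Longrightarrow\; z = -t$$

So, the parametric form of the first line is:

$$\left(\frac{t}{2}, \frac{t}{3}, -t\right)$$

The direction vector $\mathbf{d}_1$ of the first line is:

$$\mathbf{d}_1 = \left(\frac{1}{2}, \frac{1}{3}, -1\right)$$

For the second line $6x = -y = -4z = t$: We can express $x$, $y$, and $z$ in terms of a parameter $t$:

$$6x = t \;\Longrightarrow\; x = \frac{t}{6}, \qquad -y = t \;\Longrightarrow\; y = -t, \qquad -4z = t \;\Longrightarrow\; z = -\frac{t}{4}$$

So, the parametric form of the second line is:

$$\left(\frac{t}{6}, -t, -\frac{t}{4}\right)$$

The direction vector $\mathbf{d}_2$ of the second line is:

$$\mathbf{d}_2 = \left(\frac{1}{6}, -1, -\frac{1}{4}\right)$$

Step 2: Finding the angle between the direction vectors

The angle $\theta$ between two vectors $\mathbf{d}_1$ and $\mathbf{d}_2$ can be found using the dot product formula:

$$\cos\theta = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{|\mathbf{d}_1|\,|\mathbf{d}_2|}$$

First, calculate the dot product $\mathbf{d}_1 \cdot \mathbf{d}_2$:

$$\mathbf{d}_1 \cdot \mathbf{d}_2 = \left(\frac{1}{2}, \frac{1}{3}, -1\right) \cdot \left(\frac{1}{6}, -1, -\frac{1}{4}\right) = \frac{1}{2} \cdot \frac{1}{6} + \frac{1}{3} \cdot (-1) + (-1) \cdot \left(-\frac{1}{4}\right)$$

$$= \frac{1}{12} - \frac{1}{3} + \frac{1}{4} = \frac{1}{12} - \frac{4}{12} + \frac{3}{12} = \frac{1 - 4 + 3}{12} = \frac{0}{12} = 0$$

Since the dot product $\mathbf{d}_1 \cdot \mathbf{d}_2 = 0$, we know that the vectors $\mathbf{d}_1$ and $\mathbf{d}_2$ are orthogonal. Therefore, the angle between them is $90^\circ$.

Thus, the angle between the two lines is $\boxed{90}$.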
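The arithmetic in Example 2 can be checked independently. The following is a minimal sketch and not part of the paper's pipeline: the helper names `direction_from_equal_ratios`, `dot`, and `angle_deg` are hypothetical, introduced only for this illustration. It recomputes the direction vectors with exact fractions and compares the regenerated solution against the first attempt, whose $z$-component of $\mathbf{d}_2$ carried the wrong sign.

```python
import math
from fractions import Fraction as F


def direction_from_equal_ratios(coeffs):
    """Direction vector of the line a*x = b*y = c*z (= t), read off at t = 1."""
    return tuple(F(1) / F(c) for c in coeffs)


def dot(u, v):
    """Exact dot product of two vectors with Fraction components."""
    return sum(a * b for a, b in zip(u, v))


def angle_deg(u, v):
    """Angle between vectors u and v in degrees, via the dot product formula."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    return math.degrees(math.acos(float(dot(u, v)) / (norm_u * norm_v)))


# Lines from the problem statement: 2x = 3y = -z and 6x = -y = -4z.
d1 = direction_from_equal_ratios((2, 3, -1))    # (1/2, 1/3, -1)
d2 = direction_from_equal_ratios((6, -1, -4))   # (1/6, -1, -1/4)

# Direction vector used in the first attempt (sign error in the z-component).
d2_flawed = (F(1, 6), F(-1), F(1, 4))

print("regenerated solution:", dot(d1, d2), angle_deg(d1, d2))
print("first attempt:       ", dot(d1, d2_flawed), angle_deg(d1, d2_flawed))
```

With the corrected vector the dot product is exactly zero, which gives the 90 degree answer of the regenerated solution. With the flawed vector the dot product is $-\frac{1}{2}$ and the angle is about 114.2 degrees, so the first attempt's final answer of 120 degrees does not follow from its own intermediate steps, which is the inconsistency the verification turn flags.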