Paper deep dive
Reasoning over mathematical objects: on-policy reward modeling and test time aggregation
Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim, Ilia Kulikov, Jack Lanchantin, Xian Li, Tianjian Li, Bo Liu, Graham Neubig, Anaelia Ovalle, Swarnadeep Saha, Sainbayar Sukhbaatar, Sean Welleck, Jason Weston, Chenxi Whitehouse, Adina Williams, Jing Xu, Ping Yu, Weizhe Yuan, Jingyu Zhang, Wenting Zhao
Intelligence
Summary
The paper introduces the Principia suite, comprising PrincipiaBench for evaluating LLM reasoning on complex mathematical objects, the Principia Collection for training, and Principia VerifyBench for assessing verifiers. It proposes RLLM (Reinforcement Learning with a strong LM as a Reward Model) and ParaGator (a test-time scaling recipe using parallel aggregation) to improve reasoning performance, demonstrating that on-policy training and model-based verification significantly outperform traditional methods on STEM tasks.
Entities (5)
Relation Signals (3)
Principia Collection → improves performance of → LLM
confidence 90% · RL training four different LMs... on the Principia Collection yields improvements of 7.22-18.35% in average
RLLM → outperforms → RLHF
confidence 90% · We first demonstrate that RLLM outperforms RLHF (with scalar RMs) and RLVR (with rule-based verifiers)
ParaGator → scales → Test-time compute
confidence 85% · we develop a test time scaling recipe for strong reasoning via on-policy-trained parallel aggregation
Cypher Suggestions (2)
Find all benchmarks introduced in the paper · confidence 90% · unvalidated
MATCH (b:Benchmark) WHERE b.name CONTAINS 'Principia' RETURN b.name
Map methods to their performance improvements · confidence 80% · unvalidated
MATCH (m:Method)-[:IMPROVES]->(b:Benchmark) RETURN m.name, b.name
Abstract
Abstract: The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet, current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats such as numerical values or multiple choice options due to the convenience of automated assessment. In this paper we provide three contributions for improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM-judges and verifiers, where we show that on-policy judge training boosts performance; (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs such as Qwen3-235B and o3 struggle on Principia, while our training recipes can bring significant improvements over different LLM backbones, while simultaneously improving results on existing numerical and MCQA tasks, demonstrating cross-format generalization of reasoning abilities.
Tags
Links
- Source: https://arxiv.org/abs/2603.18886v1
- Canonical: https://arxiv.org/abs/2603.18886v1
Full Text
Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

Pranjal Aggarwal 1, Marjan Ghazvininejad 1, Seungone Kim 1, Ilia Kulikov 1, Jack Lanchantin 1, Xian Li 1, Tianjian Li 1, Bo Liu 1, Graham Neubig 3, Anaelia Ovalle 2, Swarnadeep Saha 1, Sainbayar Sukhbaatar 1, Sean Welleck 3, Jason Weston 1, Chenxi Whitehouse 1, Adina Williams 1, Jing Xu 1, Ping Yu 1, Weizhe Yuan 1, Jingyu Zhang 1, Wenting Zhao 1 (authors listed in alphabetical order). 1 FAIR at Meta, 2 UCLA, 3 CMU.

The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet, current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats such as numerical values or multiple-choice options due to the convenience of automated assessment. In this paper we provide three contributions for improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM-judges and verifiers, where we show that on-policy judge training boosts performance; (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs such as Qwen3-235B and o3 struggle on Principia, while our training recipes can bring significant improvements over different LLM backbones, while simultaneously improving results on existing numerical and MCQA tasks, demonstrating cross-format generalization of reasoning abilities.
Principia Collection: https://huggingface.co/datasets/facebook/principia-collection
PrincipiaBench: https://huggingface.co/datasets/facebook/principia-bench

Introduction

The ability to perform complex reasoning is foundational for the challenging tasks that humans wish state-of-the-art AI models to solve. Deriving mathematical objects such as matrices and piecewise functions from a problem statement is foundational for precisely articulating scientific phenomena and is required in many graduate-level studies (Hestenes, 2009; Meerschaert, 2013; Goos et al., 2023). For example, when solving the problem statement in Figure 2, a student may be required to write the answer as a mathematical expression such as $\frac{1}{|G|}\left(2 + \sum_{x \in G,\, x \neq 1} \operatorname{Re}(\chi(x))\right)$. However, evaluations of language model (LM) reasoning have largely relied on problems that require either (1) numerical value answers (e.g., AIME, HMMT) or (2) multiple-choice options (e.g., GPQA-Diamond (Rein et al., 2024), SuperGPQA (Du et al., 2025)), primarily due to the convenience of automated scoring. Assessing LMs using such restricted answer formats and then inferring broad reasoning competence risks overstating their true capabilities (Li et al., 2024b, 2025d). In this work, we contribute both datasets and methods towards the goal of developing stronger reasoning methods. In section 1 we develop datasets and benchmarks in order to improve an LM's ability to derive complex mathematical objects. We introduce PrincipiaBench, an evaluation benchmark composed of 2,558 problems drawn from four sources: RealMath (Zhang et al., 2025), Physics (Feng et al., 2025), ARB (Sawada et al., 2023), and the Mathematics and Engineering subsets of SuperGPQA (Du et al., 2025) (with the answer options removed; see Figure 2). Each problem in PrincipiaBench requires a complex mathematical-object answer and has been carefully filtered to ensure clarity.
We evaluate 27 LMs on PrincipiaBench and find that it provides a more challenging testbed for assessing the reasoning capabilities of frontier LMs. We then introduce the Principia Collection, a synthetic training dataset consisting of 248K problem statements that are grounded in two subject taxonomies, the Mathematics Subject Classification (MSC 2020, https://zbmath.org/static/msc2020.pdf) and Physics Subject Headings (PhySH, https://physh.org/), and require generating six types of mathematical-object answers (equations, inequalities, intervals, sets, matrices, and piecewise functions). We show that a strong LLM verifier is crucial for training with such data. With access to such a verifier, however, RL training four different LMs – Qwen2.5-7B-Base (Yang et al., 2024), Qwen3-4B-Base, Qwen3-4B (Yang et al., 2025b), and OctoThinker-8B-Long-Base (Wang et al., 2025a) – on the Principia Collection yields improvements of 7.22-18.35% on average compared to each base LM on PrincipiaBench. These RL-trained LMs also outperform competitive baselines sharing the same base LM, including Qwen2.5-7B-Instruct (Yang et al., 2024), General-Reasoner-4/7B (Ma et al., 2025a), OpenReasoner-Zero (Hu et al., 2025a), SimpleRL-Qwen2.5-7B-Zoo (Zeng et al., 2025), and Polaris-4B (An et al., 2025). Furthermore, training on the Principia Collection leads to 7.5-17.5% improvements on AIME 2024 (numerical) and 12.31-25.47% improvements on GPQA-Diamond (MCQA), indicating that enhanced reasoning transfers to other formats. In section 2 we develop a post-training method for strong reasoning that we term RLLM, Reinforcement Learning with a strong LM as a Reward Model (RM), where the RM is trained on-policy.
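RLLM (detailed below) post-trains the policy using the LM-as-RM's k-wise comparative judgments as rewards. A minimal sketch of one plausible way to turn such a ranking into group-relative scalar rewards; the linear scoring and ranking format here are illustrative assumptions, not the paper's actual implementation:

```python
from typing import List

def kwise_rewards(ranking: List[int]) -> List[float]:
    """Convert a k-wise ranking from an LM judge into centered scalar rewards.

    `ranking` lists candidate indices from best to worst, e.g. [2, 0, 1]
    means candidate 2 was judged best. Rewards are linear in rank and then
    mean-centered, a common group-relative normalization for policy updates.
    """
    k = len(ranking)
    if k < 2:
        return [0.0] * k  # a single candidate carries no comparative signal
    rewards = [0.0] * k
    for place, idx in enumerate(ranking):
        rewards[idx] = (k - 1 - place) / (k - 1)  # best -> 1.0, worst -> 0.0
    mean = sum(rewards) / k
    return [r - mean for r in rewards]

# Example: the judge ranks candidate 2 best, then 0, then 1.
print(kwise_rewards([2, 0, 1]))  # -> [0.0, -0.5, 0.5]
```

Centering the rewards within each group of k samples keeps the policy update invariant to the judge's absolute scale, which is why only the relative ordering needs to be elicited.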
Post-training for LLMs typically follows one of two paradigms: Reinforcement Learning from Human Feedback (RLHF), which relies on scalar reward models trained from human preference data, or Reinforcement Learning with Verifiable Rewards (RLVR), which depends on rule-based verifiers. Scalar reward models do not generate chain-of-thought reasoning, making them prone to reward hacking and limiting their effectiveness on complex reasoning tasks. Rule-based verifiers, meanwhile, assume access to gold answers that can be both hard to obtain and hard to verify, limiting their utility to easily verifiable math and code problems. We show that RLLM, where a strong LM itself is used as a Reward Model, can serve as a single, unified post-training recipe for easy-to-verify, hard-to-verify, as well as non-verifiable domains. RLLM applies RL in two stages: (1) training an LM with verifiable rewards to act as a thinking Reward Model, and (2) post-training a policy LM using the LM-as-RM's k-wise comparative judgments as rewards. Notably, recent rubric-based evaluation methods, where structured criteria guide judgment, can be viewed as a special case within our framework: the LM-as-RM implicitly internalizes and flexibly applies such rubrics through its reasoning, without requiring explicitly specified scoring rules. We first demonstrate that RLLM outperforms RLHF (with scalar RMs) and RLVR (with rule-based verifiers) across easy-to-verify and hard-to-verify math and physics benchmarks from PrincipiaBench as well as non-verifiable instruction-following tasks. We then show that on-policy training of the LM-as-RM outperforms both prompted LMs-as-RMs (including a larger GPT-OSS-120B) and off-policy trained ones. Finally, through extensive analyses across a wide range of policy–reward LM pairings – varying in model size, capability, and training data (easy- vs. hard-to-verify, reference-free vs.
reference-based tasks) – we identify the key ingredients for effective post-training with Language Models as Reward Models. In section 3 we develop a test-time scaling recipe for strong reasoning via on-policy-trained parallel aggregation. Parallel thinking, a scaffold where a language model generates multiple solutions in parallel before aggregating them, has emerged as a promising approach for improving reasoning performance. However, we identify two key limitations in current methods: (1) during parallel generation, the model is unaware that its outputs will later be aggregated, leading to redundant or insufficiently diverse solutions, and (2) training typically focuses on aggregating off-policy generations, creating a mismatch between training and inference. We introduce ParaGator, which addresses both issues through a unified online training framework. Specifically, we apply pass@k optimization to the parallel generation stage, encouraging diverse and complementary solutions, while using pass@1 optimization for the aggregation stage to select or synthesize the best final answer. By training both stages online and end-to-end, our method ensures the generator produces outputs well-suited for aggregation. Experiments on competition math and scientific reasoning benchmarks, including PrincipiaBench, demonstrate that ParaGator achieves strong results, outperforming existing parallel reasoning methods.

1 Principia: Training Language Models to Reason over Mathematical Objects

Figure 1: RL training on the Principia Collection, which requires mathematical-object outputs, improves an LM's reasoning capability. (a) On our new challenging benchmark, PrincipiaBench, which requires deriving mathematical objects, Qwen3-4B-Base, Qwen2.5-7B-Base, and OctoThinker-8B-Long-Base trained on the Principia Collection yield average gains of +18.23%, +10.23%, and +15.16%, respectively.
(b) Training on the Principia Collection also improves performance on numerical (AIME-2025) and MCQA (SuperGPQA) benchmarks, demonstrating cross-format reasoning gains. (c) Using a strong model-based verifier (GPT-OSS-120B) and excluding MCQA data provides the best performance.

1.1 Overview

Our first goal is to develop datasets and benchmarks in order to improve an LM's ability to derive complex mathematical objects. One of the biggest current challenges in improving these abilities is the lack of such targeted data. In the context of enhancing reasoning capabilities, prior works have proposed collecting data from reliable sources such as olympiad and grade-school math problems (Hendrycks et al., 2021; Mitra et al., 2024; Yu et al., 2024; Li et al., 2024a; Gao et al., 2025; Luo et al., 2025; Chen et al., 2025a) or transforming data from online forums, pre-training corpora, and textbooks with LMs (Yue et al., 2024; Yuan et al., 2025b; Moshkov et al., 2025b; Ma et al., 2025a; Fan et al., 2025). However, in existing post-training datasets, the answers are mostly composed of multiple-choice options, numerical values, and simple mathematical objects. Consequently, problem statements that simultaneously demand graduate-level STEM knowledge and the derivation of complex mathematical objects remain scarce in the literature. In this work, we explore recipes to both measure and improve an LM's reasoning ability to derive mathematical objects from graduate-level STEM problem statements. First, to assess this capability, we introduce PrincipiaBench, an evaluation benchmark composed of 2,558 problems drawn from four sources: RealMath (Zhang et al., 2025), Physics (Feng et al., 2025), ARB (Sawada et al., 2023), and the Mathematics and Engineering subsets of SuperGPQA (Du et al., 2025) (with the answer options removed; see Figure 2). Each problem in PrincipiaBench requires a complex mathematical-object answer and has been carefully filtered to ensure clarity.
We evaluate 27 LMs on PrincipiaBench and find that it provides a more challenging testbed for assessing the reasoning capabilities of frontier LMs. For example, Qwen3-235B scores 55.58 and o3 scores 62.90 on PrincipiaBench, compared to 83.33 and 85.63 on AIME-2024, and 74.34 and 82.29 on GPQA-Diamond, indicating that PrincipiaBench opens additional headroom to track progress in LM reasoning. Next, we introduce the Principia Collection, a training dataset consisting of 248K problem statements that are grounded in two subject taxonomies, the Mathematics Subject Classification (MSC 2020, https://zbmath.org/static/msc2020.pdf) and Physics Subject Headings (PhySH, https://physh.org/), and require generating six types of mathematical-object answers (equations, inequalities, intervals, sets, matrices, and piecewise functions). A key component of using the Principia Collection for training is adopting a model-based verifier. Since the release of DeepSeek-R1 (Guo et al., 2025), many works have incorporated rule-based verifiers such as math-verify (Huggingface, 2025) during training, yet we find them surprisingly fragile when handling complex mathematical object answers. To assess the reliability of different verifiers, we introduce Principia VerifyBench, a meta-evaluation benchmark with human-labeled judgments of answer equivalence, i.e., whether a model's output represents the same mathematical object as the ground truth despite differences in expression. We gather 168 instances where math-verify and o3 judgments disagree, and find that the o3 verifier is correct in 94.05% of cases; hence we use o3 as a judge for evaluation.

Figure 2: LMs struggle to solve problems that require mathematical objects as answers. For example, Qwen3-235B can solve this problem from SuperGPQA (Du et al., 2025) in an MCQ setting by using the options as an anchor to perform backward chaining. However, when the options are removed, the model starts making false assumptions (i.e., that all 1-eigenspaces share a common fixed vector and that the representation cannot be fully trivial) and derives the incorrect solution (i.e., concluding $\langle 1_G, \chi \rangle = 1$ instead of allowing the valid case $\langle 1_G, \chi \rangle = 2$). Note that yellow-highlighted comments pinpoint and describe the critical flaws within the model's chain-of-thought.

As shown in Figure 1, RL training four different LMs – Qwen2.5-7B-Base (Yang et al., 2024), Qwen3-4B-Base, Qwen3-4B (Yang et al., 2025b), and OctoThinker-8B-Long-Base (Wang et al., 2025a) – on the Principia Collection yields improvements of 7.22-18.35% on average compared to each base LM on PrincipiaBench. These RL-trained LMs also outperform competitive baselines sharing the same base LM, including Qwen2.5-7B-Instruct (Yang et al., 2024), General-Reasoner-4/7B (Ma et al., 2025a), OpenReasoner-Zero (Hu et al., 2025a), SimpleRL-Qwen2.5-7B-Zoo (Zeng et al., 2025), and Polaris-4B (An et al., 2025). Furthermore, training on the Principia Collection leads to 7.5-17.5% improvements on AIME 2024 (numerical) and 12.31-25.47% improvements on GPQA-Diamond (MCQA), indicating that enhanced reasoning transfers to other formats. Our main contributions are (1) PrincipiaBench, a benchmark for evaluating the reasoning capabilities of LMs in deriving complex mathematical objects, (2) the Principia Collection, a large-scale training dataset that improves the reasoning capability of LMs across diverse answer formats, and (3) Principia VerifyBench, a meta-evaluation benchmark for assessing rule/model-based verifiers in determining answer equivalence of mathematical objects.

1.2 Motivation: Need for a benchmark and training data for mathematical objects

The community currently lacks both a benchmark and training data that specifically require LMs to derive complex mathematical objects, beyond numerical values and multiple-choice options.
This gap largely arises from (1) the convenience of evaluating simpler answer formats and (2) the need to avoid inaccurate reward signals during training, which together have implicitly pushed the field toward easy-to-evaluate benchmarks and easy-to-reward post-training datasets. However, when mathematicians, physicists, or machine learning researchers use math in their everyday work, they are building entirely new knowledge. Thus, naturally, they are not choosing from multiple existing possibilities as in current multiple-choice question benchmarks, but rather performing de novo derivations of new mathematical objects. If we aspire for LMs to be integrated into ambitious and reasoning-intensive applications such as end-to-end research for scientific discovery (OpenAI, 2025a; Si et al., 2025; Lu et al., 2024; Yamada et al., 2025), it becomes essential to assess and support the ability to handle the mathematical objects that underlie mathematical and scientific reasoning. In this section, we examine why existing post-training datasets and benchmarks are insufficient for evaluating and improving the capability to derive mathematical objects. Together, these considerations motivate the development of our benchmark, PrincipiaBench, and our training dataset, the Principia Collection.

Figure 3: Performance consistently drops when removing options in MCQA benchmarks. In the mathematical and engineering subset of SuperGPQA, among instances where answers are expressed as mathematical objects, LMs show a 10-20% decrease in performance when options are removed. This suggests that MCQA evaluations tend to overestimate true reasoning ability (i.e., high MCQA scores do not necessarily translate into strong reasoning over mathematical objects).

LMs tend to rely on shortcuts when options are present. Consider the example shown in Figure 2.
When presented with a SuperGPQA (Du et al., 2025) problem containing ten mathematical object choices, Qwen3-235B (Yang et al., 2025b) engages in backward chaining (Gandhi et al., 2025), using the provided options as anchors to derive necessary conditions and select the correct answer. This behavior may be natural in multiple-choice question answering (MCQA) settings (Wang et al., 2024a), and humans often employ similar strategies when solving exams (Gierl et al., 2017). However, once the options are removed, the model's reasoning chain changes fundamentally: it begins to make false assumptions, overlook valid cases, and construct illogical arguments. Removing options from SuperGPQA leads to a 10-20% performance drop even for frontier LMs. Building on the single example in Figure 2, we examine how performance changes across multiple instances. As shown in Figure 3, when we evaluate on the mathematics and engineering subset of SuperGPQA (Du et al., 2025), filtering to instances that require mathematical object answers⁵, model performance consistently declines by 10-20% once the options are removed. Notably, frontier models such as o3 and Qwen3-235B show substantial drops (69.10→62.90 and 69.33→55.58, respectively). Since open-ended generation is information-theoretically harder than MCQA (i.e., requiring the model to produce the exact answer from an unbounded space rather than select from k candidates), some performance drop is expected. However, the magnitude of these drops, combined with the qualitative reasoning failures shown in Figure 2, suggests that even the strongest LMs rely heavily on option-driven cues (Shah et al., 2020; Balepur et al., 2024; Balepur and Rudinger, 2024; Gupta et al., 2024; Chandak et al., 2025). This shows that current benchmarks may overestimate reasoning capability, as they fail to capture whether LMs can derive mathematical objects.
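The option-removal protocol above (scoring the same SuperGPQA items with and without answer choices) can be sketched as a small transformation; the item schema and field names here are illustrative, not SuperGPQA's actual format:

```python
def strip_options(item: dict) -> dict:
    """Turn an MCQA item into an open-ended one by dropping the answer
    choices and re-keying the gold label to the full answer text.

    Assumes an illustrative schema: {'question', 'options', 'answer_idx'}.
    The open-ended variant is then graded by an equivalence judge rather
    than by option matching.
    """
    gold = item["options"][item["answer_idx"]]
    return {
        "question": item["question"],   # same stem, no choices appended
        "reference_answer": gold,       # judged for equivalence, not identity
    }

mcqa = {
    "question": "Compute the multiplicity of the trivial representation.",
    "options": ["0", "1", "2", "3"],
    "answer_idx": 2,
}
open_ended = strip_options(mcqa)
print(open_ended["reference_answer"])  # -> "2"
```

The performance gap between the two variants of the same item is exactly the option-driven shortcut effect the section quantifies.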
The majority of RL post-training datasets consist of easy-to-reward (short) answers without complex mathematical object answers. As shown on the right side of Figure 4, the output (answer) token distributions of widely used RL post-training datasets, including OpenR1-Math (Hugging Face, 2025), NaturalReasoning (Yuan et al., 2025b), DeepScaleR (Luo et al., 2025), OpenMathReasoning (Moshkov et al., 2025b), NuminaMath-1.5 (Li et al., 2024a), WebInstruct-Verified (Ma et al., 2025a), and TextbookReasoning (Fan et al., 2025), mostly fall within the range of ten to a few hundred tokens. This is mainly because a large proportion of answers are numerical values, multiple-choice options, or short text spans; even the longest answers, which fall below roughly 100-200 tokens, are not complex mathematical objects. Likewise, as shown on the left side of Figure 4, existing datasets also lack problem statements that resemble the structured, specification-like descriptions typical of graduate-level STEM coursework; their inputs mostly fall within a few hundred tokens. Motivated by these observations, in subsubsection 1.3.3 we describe how we construct a training dataset (the Principia Collection) in which all answers are complex mathematical objects and all problem statements are grounded in graduate-level STEM subjects.

⁵ We provide more details on how we filtered out such instances in Subsection 1.3.1.

Figure 4: Token count distribution comparison between the Principia Collection and other widely used RL post-training datasets. The problem statements and answers of the Principia Collection are relatively longer, since it requires the derivation of complex mathematical objects and its problem statements, based on graduate-level STEM subjects, are very detailed.
Note that although token length itself is not a measure of difficulty, the differences are substantial: our answers are longer because they consist exclusively of mathematical objects, and our problem statements are longer because they encode graduate-level concepts with multiple conditions.

1.3 The Principia Suite

1.3.1 PrincipiaBench

Based on our observations described in Section 1.2, we propose PrincipiaBench, a benchmark for assessing an LM's capability to precisely derive mathematical objects directly, without relying on MCQA formats.

Problem Statement Sources & Filtering. We use the problems included in RealMath (Zhang et al., 2025), Physics (Feng et al., 2025), ARB (Sawada et al., 2023), and the Mathematics and Engineering subsets of SuperGPQA (Du et al., 2025) (with answer options removed). Note that not all problems in these benchmarks require deriving a mathematical object, and some exhibit other undesirable traits. Therefore, we apply filtering based on specific constraints. The constraints used for filtering are:

1. The answer should be expressed as a mathematical object, specifically an equation, inequality, interval, set, matrix, or piecewise function. In contrast, instances that require a numerical value, proof, text span, boolean value, or multiple-choice option answer are excluded. We apply this rule because the primary goal of our benchmark is to evaluate an LM's ability to precisely derive mathematical objects.

2. The problem statement should consist of only one question and should not require answering multiple questions at once. We apply this rule because solving multiple problems simultaneously can affect an LM's performance (Son et al., 2024), and artificially decomposing them may lead to awkward formulations (e.g., lack of context). Also, if the LM answers only one of the sub-questions correctly, it introduces additional complexity in the evaluation, such as determining whether to assign partial credit.
Example of excluded problem statement:

Seven antennae, radiating as electric dipoles polarized along the $\hat{z}$ direction, are placed along the x-axis in the xy-plane at $x = 0, \pm\frac{\lambda}{2}, \pm\lambda, \pm\frac{3\lambda}{2}$. The antennae all radiate at wavelength λ and are in phase. (a) Calculate the angular distribution of the radiated power as a function of the polar and azimuthal angles, θ and φ. Neglect any constant multiplying prefactors. (b) Consider the direction in which the radiated intensity is maximum for this array and for a single dipole antenna. How do these intensities compare?

This example contains two sub-questions, labeled (a) and (b), within a single problem statement.

3. The problem statement should be self-contained. If it requires additional context (e.g., referring to an image) or omits essential definitions, we exclude such cases. This rule is applied because our benchmark aims to evaluate whether LMs can derive mathematical objects from well-formulated problem statements, not whether they can request clarifications or reason from incomplete information.

Examples of excluded problem statements:

A Carnot engine has a cycle as shown in Fig. 1.12. If $W$ and $W'$ represent the work done by 1 mole of monatomic and diatomic gas, respectively, calculate $W'/W$.

Upon inspecting the Physics dataset (Feng et al., 2025), we found cases where no figure is provided, yet the problem statement requires a figure to solve the question, as shown in the example above. We regard such cases as indicating that the question is not self-contained.

Let $f : PG \to \mathbb{R}^2$ be a plane generic immersion. What are the congruences, modulo 2 and 4, of the sums $\sum_{\gamma \in \Gamma_5(PG)} c(f(\gamma))$, $\sum_{\gamma \in \Gamma_6(PG)} c(f(\gamma))$, $\sum_{\gamma \in \Gamma_9(PG)} c(f(\gamma))$, and $\sum_{\gamma \in \Gamma_8(PG)} c(f(\gamma))$?

In this example, solving the problem would require additional context that defines the graph $PG$ (or the class of graphs it belongs to), the set of cycles $\Gamma_k(PG)$, and the numerical invariant $c(f(\gamma))$.
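The three exclusion constraints above amount to a per-trait binary screen. A minimal sketch in which `flags` stands in for the screening LM's yes/no trait labels; the trait names are hypothetical, and the actual pipeline also includes a manual review stage:

```python
# Hypothetical trait names for the three exclusion constraints above.
TRAITS = ("non_object_answer", "multi_question", "not_self_contained")

def keep_problem(flags: dict) -> bool:
    """Keep a problem only if the screening judge flagged none of the
    three undesirable traits.

    `flags` maps each trait name to a boolean judged for this problem.
    Missing traits default to False (not flagged).
    """
    return not any(flags.get(t, False) for t in TRAITS)

# The antenna problem above would be dropped: it has sub-questions (a) and (b).
print(keep_problem({"non_object_answer": False,
                    "multi_question": True,
                    "not_self_contained": False}))  # -> False
```

Keeping the screen as independent binary decisions makes each exclusion auditable, which matters when a second-stage manual inspection follows.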
To filter out problems that exhibit any of the three undesirable trait types described above, we first prompted the GPT-OSS-120B model to label each problem with a binary yes/no decision for each trait. We then performed an initial filtering step by selecting only the problems that were judged to exhibit none of the three traits. Next, we conducted a second-stage manual inspection of all instances. Through this process, we obtained a total of 2,286 filtered instances. However, during evaluation, we identified 128 problems whose judgments showed substantial variability in a post-hoc analysis. We therefore excluded these cases and used the remaining 2,158 problems as our final dataset.

Difficulty in verification. When the answer is a multiple-choice option or a numerical value, determining whether the model's prediction exactly matches the ground truth is straightforward. However, when the answer is a complex mathematical object, it can be expressed in many valid forms, making equivalence checking challenging. Prior works have primarily used the Python SymPy library, with math-verify (Huggingface, 2025) being a representative example. In our early experiments, however, we find that such rule-based verification methods are fragile in practice. The following examples show failure patterns that are often observed:

[Case 1] Symbolic reasoning failure of equivalence:

Reference Answer: $f_{UV}(u,v) = \frac{1}{2\pi} \cdot \frac{1}{1+v^2}\, e^{-\frac{u}{2}}$, $u > 0$, $v \in \mathbb{R}$
Model Prediction: $f_{UV}(u,v) = \frac{1}{2\pi(v^2+1)}\, e^{-u/2}$ for $u > 0$, $v \in \mathbb{R}$

Sympy fails to recognize their equality because the terms appear in a different order ($v^2$ is placed differently), the constant $2\pi$ is written separately, and the exponent is expressed differently ($-u/2$ vs. $-\frac{u}{2}$).

[Case 2] Brittleness in handling LaTeX math commands:

Reference Answer: $-4ni$
Model Prediction: $-4in$

Sympy fails to recognize their equality because the order of the symbols $i$ and $n$ is reversed and $i$ is represented differently in the two LaTeX sources.
[Case 3] Brittleness in handling text spans:

Reference Answer: The generating function is $\frac{2t^2}{1-t^2} \prod_{n \ge 1} \frac{1}{1-t^n}$.
Model Prediction: $\frac{2t^2}{1-t^2} \prod_{m=1}^{\infty} \frac{1}{1-t^m}$.

Sympy fails to recognize their equality because different variable symbols are used ($n$ vs. $m$), the product notation is written differently ($n \ge 1$ vs. $m = 1$ to $\infty$), and the reference answer includes an additional text span ("The generating function is").

Figure 5: Annotation UI for constructing the Principia VerifyBench.

Evaluation procedure. Given a problem statement, an LM should generate a prediction; then, given the (problem statement, reference answer, LM prediction) triplet as input, a verifier should determine whether the reference answer and model prediction are equivalent to each other (i.e., reference-based evaluation). Notably, we find that a model-based verifier (such as a prompted general-purpose LM (Zheng et al., 2023; Kim et al., 2025b)) can effectively handle the aforementioned error cases. Namely, we use o3 as a judge for answer equivalence checking by prompting it to output either "Equivalent" or "Not Equivalent".

1.3.2 Principia VerifyBench

To determine the reliability of using a model-based judge in PrincipiaBench, we collect human annotations for cases where the judgments of o3 and math-verify (Huggingface, 2025) differ, and manually determine equivalence between the reference answer and the LM prediction. Using the problem statements in PrincipiaBench, we first generate responses with GPT-OSS-20B and GPT-OSS-120B and extract their predictions. We then sample 200 such instances (problem statement, reference answer, model prediction triplets) with disagreement: in 100 cases, math-verify judges "Equivalent" while o3 judges "Not Equivalent," and in the remaining 100 cases, the judgments are reversed. Then, we gather eight human annotators to determine which is correct.

Annotation procedure. An example of the annotation UI is shown in Figure 5. Eight annotators are each assigned 50 instances, where for each instance the annotator should choose between "Equivalent", "Not Equivalent", and "I don't know". We add the "I don't know" option to avoid introducing noise into the benchmark. Also, each instance is labeled by two annotators, and we keep only instances for which the two annotators provide the same judgment. This results in 168 remaining instances, of which 75 were labeled "Equivalent" and 93 were labeled "Not Equivalent". We name the resulting benchmark Principia VerifyBench.

Evaluation procedure. Given the triplet as input, a verifier should determine whether the reference answer and model prediction are equivalent to each other. Accuracy on this meta-evaluation benchmark can serve as a proxy for whether a verifier will provide reliable judgments (Zheng et al., 2023; Kim et al., 2023b, 2024) or accurate rewards for RL training (Lambert et al., 2025; Malik et al., 2025; Whitehouse et al., 2025b).

Figure 6: Two qualitative examples of instances within the Principia Collection. Each instance consists of a problem statement, an answer, and a subject entity. The subject entities originate from Physics Subject Headings (PhySH) and the Mathematics Subject Classification System (MSC 2020). The required answer types are equation, inequality, interval, set, matrix, and piecewise function. Note that for the example on the right side, the detailed setup (definitions of $t_i$, $t_0$, $t_G$, etc.) is not a hint but a necessary part of the problem statement, which is typical in research-level mathematics, specifying how to combine these quantities into the final piecewise formula for $e_T$.

1.3.3 Principia Collection

While many post-training datasets have been proposed to enhance the reasoning capabilities of LMs, none to date exclusively contain answers in the form of mathematical objects. In most cases, the answers are numerical values or simple mathematical expressions.
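The rule-based-verifier brittleness catalogued in subsubsection 1.3.1 is easy to reproduce outside math-verify. A minimal SymPy sketch illustrating one failure mode (dummy-variable renaming, as in Case 3) alongside an algebraic rearrangement that symbolic simplification does handle; this is a general illustration, not the paper's actual verifier configuration:

```python
from sympy import Sum, oo, simplify, symbols

t, n, m, x = symbols("t n m x", positive=True)

# Case-3-style failure: the same infinite sum written with different
# dummy variables (n vs. m). Structural equality reports "not equal",
# even though a human reader accepts the two as identical.
a = Sum(t**n, (n, 1, oo))
b = Sum(t**m, (m, 1, oo))
print(a == b)         # False: `==` is structural and dummy-name sensitive
print(a.dummy_eq(b))  # True: only the specialized check sees the equivalence

# By contrast, a purely algebraic rearrangement IS handled:
print(simplify((x**2 - 1) / (x - 1) - (x + 1)) == 0)  # True
```

The gap between the two behaviors is exactly why a chain-of-thought model-based judge, which reads the expressions rather than pattern-matching them, is used for equivalence checking here.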
To specifically improve an LM's ability to derive mathematical objects, we propose a synthetic post-training dataset called the Principia Collection, which consists of 248,748 graduate-level problem statements and corresponding answers. Two examples are shown in Figure 6. The construction procedure is as follows (we include the prompts for each step in subsection 4.3):

Step 1: Topic Mining. We first collect problem-statement subjects from the Mathematics Subject Classification System (MSC 2020) and the Physics Subject Headings. Each entity from these sources has a hierarchical structure that ranges from broad topics at the upper level to more specific research areas or concrete methodologies at the lower level. We extract them in full detail, such as "Techniques for Polymers & Soft Matter » Resonance techniques » Nuclear magnetic resonance » Magnetic resonance imaging" and "Differential geometry » Geometric evolution equations » Flows related to complex manifolds (e.g., Kähler–Ricci flows, Chern–Ricci flows)", so that the entire hierarchy is explicitly represented. The main purpose of extracting these topics is to ensure high coverage in the resulting dataset. Specifically, methods such as Self-Instruct (Wang et al., 2023b) often generate redundant problems, which reduces the number of unique instances after filtering. By incorporating a wide range of subject entities, from broad scientific areas to specialized subfields, we can generate diverse graduate-level problem statements that involve complex mathematical objects.

Step 2: Problem Statement Generation. Next, we use the 9,573 extracted entities to generate problem statements. This process consists of three sub-stages, and we use GPT-OSS-120B throughout.

•First, for each subject entity, we sketch strategy descriptions that outline the core capabilities required to solve a problem. For each entity, we generate 40 descriptions, resulting in a total of 382,920 strategies.
•Second, we iterate over each subject entity & capability pair to generate problem statements. In this step, one of six mathematical object types (equation, inequality, interval, set, matrix, or piecewise function) is randomly selected so that the resulting problem requires that type as its answer. We find that the chain-of-thought (CoT) during problem proposal often incorporates elements that make the problem more conceptually challenging, consistent with observations from prior work (Yu et al., 2025).

•Finally, we include an additional refinement step to revise problem statements that resemble simple knowledge-probing questions, ensuring they require more genuine reasoning during the solving process. We find this procedure effective in increasing the conceptual depth of the generated problems.

For a better understanding, refer to the example shown in Table 1. The subject entity "Classical equilibrium statistical mechanics (general)" can be interpreted through multiple capabilities, each representing a distinct reasoning approach to the same physical concept. The strategy description, "Evaluating free-energy differences by exploiting convexity and pair-correlation functions," highlights the link between microscopic interactions and macroscopic thermodynamic inequalities. The Problem Statement Proposal CoT operationalizes this idea by translating the entity's core principle ("monotonicity of the Helmholtz free energy") into a concrete question using potential ordering and partition functions, resulting in the Initial Problem Statement. While this problem statement captures the correct conceptual relationship, it remains straightforward, requiring only the recognition that u′(r) ≤ u(r) implies F′ ≤ F. The Problem Statement Revision CoT then strengthens it by introducing the interpolating Hamiltonian H_λ, the convexity of F(λ), and the pair-correlation function g_{u′}(r), elevating the problem statement to one involving quantitative variational reasoning.
Thus, the Revised Problem Statement becomes a more rigorous and challenging formulation while preserving conceptual coherence.

Step 3: Filtering out Invalid Problem Statements. We then filter out problem statements with undesirable traits from the 382,920 problems generated in the previous step. The filtering is based on three main criteria: (1) the problem statement must consist of only one question and be self-contained, (2) the problem should explicitly require the intended answer type, and (3) extensive hints or the answer itself should not be included in the problem statement. We prompt GPT-OSS-120B to evaluate all three conditions and retain only those instances that are judged as "Yes" for every criterion.

Step 4: Response Generation & Majority Voting for Mathematical Objects. To obtain labels for the generated problem statements, we prompt GPT-OSS-120B eight times and apply self-consistency (Wang et al., 2023a), taking the majority vote as the label. This simple yet effective recipe has been shown to work well in prior works (Huang et al., 2023; Prasad et al., 2025; Yu et al., 2025). Importantly, our goal in applying self-consistency is to derive labels for newly generated problem statements (since no ground-truth answers exist), not to collect chains-of-thought (Zelikman et al., 2022) or to filter out easy problems (An et al., 2025). One unique challenge in applying self-consistency in our setting is that determining the majority vote is itself hard, as mathematical objects can be expressed in many equivalent forms. For instance, among eight predictions a_1 to a_8, each may appear in a different format even though they are all mathematically identical. To address this, we perform pairwise equivalence checks using GPT-OSS-120B as a verifier, producing an equivalence matrix that records whether each pair of predictions a_i, a_j (1 ≤ i < j ≤ 8) is equivalent.
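The pairwise-equivalence step can be sketched as follows; `is_equivalent` is a placeholder for the GPT-OSS-120B verifier call, and the toy string checker in the usage note is purely illustrative:

```python
from itertools import combinations

def equivalence_matrix(predictions, is_equivalent):
    """Build a symmetric pairwise equivalence matrix over n predictions.

    `is_equivalent(a, b)` stands in for the model-based verifier call
    (GPT-OSS-120B in the paper); any boolean-valued checker works here.
    """
    n = len(predictions)
    eq = [[False] * n for _ in range(n)]
    for i in range(n):
        eq[i][i] = True  # every prediction is equivalent to itself
    for i, j in combinations(range(n), 2):
        # one verifier call per unordered pair
        eq[i][j] = eq[j][i] = is_equivalent(predictions[i], predictions[j])
    return eq
```

As a toy usage example, a checker that compares sorted characters treats "x + 1" and "1 + x" as equivalent but not "x - 1"; the real verifier is an LM judging mathematical equivalence.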
However, this introduces cases where transitivity is not preserved (e.g., when the verifier concludes that a_1 is equivalent to a_2, and a_2 to a_3, but a_1 is not equivalent to a_3).

Subject Entity: Statistical mechanics, structure of matter → Equilibrium statistical mechanics → Classical equilibrium statistical mechanics (general)

Strategy Description: Evaluating free-energy differences by exploiting convexity and pair-correlation functions: applying the convexity of the Helmholtz free energy with respect to an interpolating interaction parameter, using Jensen's inequality and the definition of the two-particle density (or radial distribution function) to derive rigorous upper and lower bounds on F − F′ for systems with ordered pair potentials.

Problem Statement Proposal CoT (Summary):
1. Identify a suitable statistical-mechanical inequality: monotonicity of F with respect to u(r).
2. Define the Hamiltonians and show that u′(r) ≤ u(r) implies Z′ ≥ Z.
3. Convert to free energies: F′ ≤ F.
4. Require a single final inequality as the answer.
5. Formulate the problem so the solver derives F′ ≤ F using only these definitions.

Initial Problem Statement: Consider a classical canonical ensemble of N identical particles of mass m confined to a volume V and in thermal equilibrium at temperature T. The Hamiltonian of the system is
H = Σ_{i=1}^{N} p_i²/(2m) + Σ_{1≤i<j≤N} u(|r_i − r_j|),
where u(r) is a pairwise interaction potential. Define a second Hamiltonian in which the pair potential is replaced by another function u′(r) that satisfies the pointwise inequality u′(r) ≤ u(r) for all r ≥ 0. Let the corresponding canonical partition functions be Z and Z′, and the Helmholtz free energies be F = −k_B T ln Z and F′ = −k_B T ln Z′, respectively. Using only the given ordering of the potentials, derive the rigorous inequality that relates F′ and F. Express your final answer as a single inequality (using "≤", "≥", etc.).

Problem Statement Revision CoT (Summary):
1. Start from the monotonicity inequality F′ ≤ F.
2. Add a quantitative upper bound using the convexity of F(λ) (Gibbs–Bogoliubov inequality).
3. Introduce the interpolating Hamiltonian H_λ = H′ + λ(H − H′).
4. Use the two-particle density or g_{u′}(r) to express ensemble averages.
5. Derive a two-sided bound 0 ≤ F − F′ ≤ (Nρ/2) ∫ 4πr² g_{u′}(r) [u(r) − u′(r)] dr.
6. Require the final answer to appear as a single mathematical inequality using only "≤" or "≥".

Revised Problem Statement: Consider a classical canonical ensemble of N identical particles of mass m confined to a volume V at temperature T. The Hamiltonian with pair potential u(r) is
H = Σ_{i=1}^{N} p_i²/(2m) + Σ_{1≤i<j≤N} u(|r_i − r_j|).
Define a second Hamiltonian in which the pair potential is replaced by a weaker function u′(r) that satisfies u′(r) ≤ u(r) for all r ≥ 0. Let the corresponding canonical partition functions be Z and Z′, and the Helmholtz free energies be F = −k_B T ln Z, F′ = −k_B T ln Z′. Introduce the interpolating Hamiltonian H_λ = H′ + λ(H − H′), 0 ≤ λ ≤ 1, and denote its free energy by F(λ) = −k_B T ln Z(λ) (so F(0) = F′, F(1) = F). Using only the given ordering of the potentials, the convexity of F(λ) with respect to λ, and the definition of the two-particle density (or, for a homogeneous fluid, the radial distribution function g_{u′}(r)) of the system governed by u′(r), derive a rigorous two-sided bound on the difference F − F′. Express your final result as a single inequality of the form (inequality involving F, F′, N, ρ, g_{u′}(r), u(r) − u′(r)) using only the symbols "≤" (or "≥").

Table 1 Example of a subject entity (acquired from PhySH), a strategy description (from Step 1), an initial problem statement and its corresponding CoT, and a revised problem statement and its corresponding CoT (from Step 2).

Figure 7 Illustration of our majority voting procedure used in dataset construction.
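The transitivity-aware grouping illustrated in Figure 7 can be sketched roughly as follows. This is a simplified reading of the procedure (witness-agreement threshold of 0.6 and a majority group of at least 5 of the 8 predictions, as stated in the text); the paper's Algorithm 1 in the Appendix is the authoritative version and may differ in detail:

```python
def majority_answer(eq, min_group=5, tau=0.6):
    """Simplified transitivity-aware majority vote over an equivalence matrix.

    `eq[i][j]` holds the verifier's pairwise equivalence judgment. For each
    pair judged equivalent, the edge is kept only if enough "witness"
    predictions k agree with it (eq[i][k] == eq[j][k]); surviving edges are
    grouped with union-find, and a group of at least `min_group` members is
    returned as the majority (else None).
    """
    n = len(eq)
    parent = list(range(n))

    def find(x):
        # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if not eq[i][j]:
                continue
            witnesses = [k for k in range(n) if k != i and k != j]
            # fraction of witnesses whose judgments are consistent with i ~ j
            support = sum(eq[i][k] == eq[j][k] for k in witnesses) / len(witnesses)
            if support >= tau:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    best = max(groups.values(), key=len)
    return best if len(best) >= min_group else None
```

With eight predictions where six form a mutually consistent clique, the clique is returned as the majority; with only three mutually equivalent predictions, no majority exists.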
Determining the majority vote itself becomes challenging with mathematical-object predictions, since the same object can be represented in distinct forms. Hence, (a) we assess the equivalence between each pair, (b) check whether transitivity is preserved, (c) adopt a conservative rule of flipping judgments when transitivity is not preserved with respect to the witnesses of each prediction pair (we adopt a threshold of 0.6), and (d) using the final results, determine the majority vote from the equivalence judgments. The detailed procedure is given in Algorithm 1 in the Appendix.

Figure 8 Answer type counts of the Principia Collection. Unlike other datasets, all 250K instances consist of mathematical objects.

To mitigate such inconsistencies, as shown in Figure 7, we adopt a conservative heuristic that re-evaluates pairwise equivalence by assessing the transitivity among all equivalence relations. Specifically, when determining whether a_1 and a_2 are equivalent, we check whether their transitivity holds consistently with respect to the other elements (a_3 through a_8). If the degree of transitivity exceeds a predefined threshold, we retain the equivalence between a_1 and a_2. Because these adjustments can influence other pairwise relations, we employ a union–find algorithm to ensure consistent grouping.⁶ Finally, if a prediction is found to be equivalent to at least a threshold number of other predictions (i.e., at least 5 out of 8), we designate this group as the majority vote and use it as the reference answer.

Figure 8 shows the answer type counts of the Principia Collection after all the aforementioned procedures. Overall, we filter out 134,172 instances and use the remaining 248,748 instances in the following sections.

1.4 Experimental Results

1.4.1 Experimental Setup

Baselines.
In our main results (subsubsection 1.4.2), we demonstrate the effectiveness of our Principia Collection dataset by training multiple base LMs on the Principia Collection and comparing the trained LMs to (1) existing post-trained LMs that share the same base LMs, and (2) existing post-training datasets, by training on the same set of base LMs. The specific baselines we employ from these two groups are listed below:

•Post-Trained LM Baselines: We use post-trained LMs that share the same base LMs (Qwen2.5-7B-Base, Qwen3-4B-Base, Qwen3-4B, and OctoThinker-8B-Long-Base) and provide publicly available checkpoints on Hugging Face as our baselines. These include Qwen2.5-7B-Instruct (Yang et al., 2024), Qwen3-4B (Yang et al., 2025b), General-Reasoner-4/7B (Ma et al., 2025a), OpenReasoner-Zero (Hu et al., 2025a), SimpleRL-Qwen-2.5-7B-Zoo (Zeng et al., 2025), and Polaris-4B (An et al., 2025). Note that we use the available checkpoints without replicating them and employ the hyperparameters specified in each model card.

•Post-training Dataset Baselines: We choose two baseline datasets: (1) WebInstruct-Verified (Ma et al., 2025a) and (2) DeepScaleR (Luo et al., 2025). WebInstruct-Verified is constructed by extracting problem–answer pairs from pre-training corpora and transforming them with Gemini-2.5-Pro (Comanici et al., 2025). It serves as our most direct baseline because it also encompasses mathematical objects as part of its dataset. DeepScaleR, on the other hand, collects problems from AMC, AIME, IMO, IMC, and the AoPS forum, and thus contains a large proportion of challenging math-competition problems.

⁶ Note that similar procedures have been explored in other disciplines; for example, Wittkop et al. (2011) use transitivity-based clustering in biological data, and Nguyen and Caruana (2007) develop clustering methods that aggregate multiple partitionings via pairwise similarity and voting mechanisms.
Notably, both datasets are based on human-written materials, making them well suited for assessing how effective our synthetic data (the Principia Collection) is. We train Qwen3-4B-Base and OctoThinker-8B-Long-Base on these datasets, employing the same hyperparameters as in our setting.

Training configuration. We use Group Relative Policy Optimization (GRPO) (Shao et al., 2024), a widely adopted RL algorithm, to train LMs to generate CoTs using the Principia Collection. As our base models, we employ four LMs: Qwen2.5-7B-Base (Yang et al., 2024), Qwen3-4B-Base (Yang et al., 2025b), Qwen3-4B (thinking) (Yang et al., 2025b), and OctoThinker-8B-Long-Base (Wang et al., 2025a). These models are selected to cover a range of parameter scales (4B–8B), different model families (Qwen-based and Llama-based), and training configurations that include both zero-RL and continual training of reasoning models, all within a computational budget that allows training on the full 250K-instance dataset in a reasonable time frame. To match the training compute when training on different datasets, we train for at most 1K steps (equivalent to one epoch on the Principia Collection under our hyperparameters) and choose the best-performing checkpoint (evaluated every 50 steps) using the overall validation scores. We use a train batch size of 256, a max response token length of 8,192, a PPO mini-batch size of 256, a clip ratio (low) of 0.2, a clip ratio (high) of 0.3, temperature 0.7, top_p 1.0, 32 response rollouts per prompt, a log-prob micro-batch size of 256, and a reference log-prob micro-batch size of 256. We set the KL coefficient to 0.0 for Qwen3-4B-Base and 0.001 for the other base LMs. An exception to this default setting is the LM trained from Qwen3-4B (thinking), where we use a train batch size of 128, a PPO mini-batch size of 128, a clip ratio (low) of 0.2, a clip ratio (high) of 0.28, and top_p of 1.0. Following An et al.
(2025), we gradually increase the max response token length from 52K to 90K and also increase the temperature from 1.2 to 1.4. All experiments are run on 8 H200 nodes with 64 GPUs.

Evaluation configuration. For all of the LMs we trained, we use temperature 0.7, top_p 1.0, and a max token length of 16,384 during evaluation. An exception is the LMs trained from Qwen3-4B (thinking), where we use temperature 1.4, top_p 1.0, and a max token length of 90,000, following the settings of An et al. (2025).

Validation set. Unlike prior works that use AIME for validation purposes (Zeng et al., 2025; Hu et al., 2025a; An et al., 2025), we prepare a held-out validation set to prevent overfitting. The validation sets are as follows.

•Held-out set of the Principia Collection: We randomly sample 40 problem statements for each answer type (equation, inequality, interval, set, matrix, piecewise function), which leads to a total of 240 instances.

•GPQA (non-Diamond): We use GPQA questions that are not included in the diamond set. Note that we evaluate on both MCQA (with options provided) and free-form (without options provided) formats.

•Physics (val set): We use the validation set of the Physics benchmark (Feng et al., 2025). We filter out invalid problem statements following the procedure discussed in Subsection 1.3.1.

•WebInstruct-Verified (val set): We use the validation set of the WebInstruct-Verified training data (Ma et al., 2025a). This data consists of diverse formats such as MCQA, numerical, and mathematical objects.

Evaluation benchmarks (Test set). Our test set consists of three main groups based on the required answer type.

•Mathematical objects: We use our proposed PrincipiaBench. Note that PrincipiaBench and the Principia Collection use problem statements from different sources. We report mean@8 scores.

•MCQA: We use GPQA-Diamond (Rein et al., 2024) and the mathematics and engineering subset of SuperGPQA (Du et al., 2025).
We report mean@16 and mean@8 scores for each benchmark, respectively.

•Numerical: We use AIME-2024 and AIME-2025. For both benchmarks, we report mean@32 scores.

Model | Training Data | PrincipiaBench (RealMath / Physics / ARB / SuperGPQA* / Avg.) | MCQA (SuperGPQA* / GPQA-Diamond) | Numerical (AIME-24 / AIME-25) | Total Avg.
[Base: Qwen2.5-7B-Base]
Qwen2.5-7B-Base | - | 11.19 / 9.32 / 16.76 / 13.75 / 12.75 | 20.41 / 27.40 | 8.02 / 3.65 | 13.81
Qwen2.5-7B-Instruct | - | 12.95 / 10.45 / 19.15 / 20.05 / 15.65 | 32.61 / 35.29 | 12.08 / 7.81 | 18.80
SimpleRL-7B-Zoo | SimpleZoo-Data | 14.00 / 10.68 / 26.86 / 21.17 / 18.18 | 28.49 / 28.85 | 15.31 / 8.75 | 19.26
General-Reasoner-7B | WebInstruct-Ver. | 15.96 / 12.39 / 26.86 / 23.15 / 19.59 | 32.95 / 36.55 | 10.63 / 7.71 | 20.78
OpenReasoner-Zero | ORZ-Math-Coll. | 15.09 / 13.75 / 30.85 / 25.12 / 21.20 | 37.10 / 37.25 | 17.19 / 13.65 | 23.75
Principia-7B-Zero (Ours) | Principia Collec. | 15.59 / 15.11 / 32.45 / 28.34 / 22.87 | 38.80 / 39.71 | 15.52 / 10.73 | 24.53
[Base: OctoThinker-8B-Long-Base]
OctoThinker-8B-Long-Base | - | 3.16 / 2.73 / 5.32 / 3.79 / 3.75 | 6.53 / 11.74 | 0.83 / 0.63 | 4.34
(Reimplemented) | DeepScaleR | 10.66 / 11.02 / 19.95 / 16.82 / 14.61 | 34.36 / 29.42 | 13.54 / 19.27 | 19.38
(Reimplemented) | WebInstruct-Ver. | 10.66 / 11.02 / 19.95 / 20.56 / 17.67 | 34.31 / 29.42 | 13.54 / 19.27 | 19.84
Principia-8B-Zero (Ours) | Principia Collec. | 13.57 / 14.20 / 28.46 / 19.96 / 19.05 | 33.49 / 33.62 | 18.33 / 20.73 | 22.80
[Base: Qwen3-4B-Base]
Qwen3-4B-Base | - | 9.43 / 5.80 / 17.81 / 12.18 / 11.31 | 20.46 / 18.09 | 12.50 / 10.21 | 13.31
Qwen3-4B (no-thinking) | - | 17.86 / 22.39 / 39.89 / 28.93 / 27.27 | 39.78 / 41.89 | 22.81 / 21.04 | 29.32
(Reimplemented) | DeepScaleR | 9.24 / 20.91 / 38.30 / 31.04 / 27.42 | 43.38 / 44.63 | 23.02 / 21.77 | 29.04
General-Reasoner-4B | WebInstruct-Ver. | 16.06 / 18.07 / 39.36 / 27.88 / 25.34 | 39.63 / 41.04 | 19.06 / 14.27 | 26.92
Principia-4B-Zero (Ours) | Principia Collec. | 19.28 / 21.81 / 43.62 / 33.92 / 29.66 | 45.53 / 43.56 | 22.71 / 23.96 | 31.80
[Base: Qwen3-4B (thinking)]
Qwen3-4B (thinking) | - | 23.81 / 40.57 / 58.78 / 41.77 / 41.23 | 58.87 / 53.70 | 71.88 / 64.90 | 51.79
Polaris-4B | Polaris-Data | 26.17 / 51.02 / 64.36 / 45.82 / 46.84 | 62.18 / 57.30 | 79.50 / 77.19 | 57.94
Principia-4B (Ours) | Principia Collec. | 28.96 / 51.24 / 66.53 / 47.05 / 48.45 | 63.83 / 57.48 | 81.06 / 78.12 | 59.28
[Reference Models]
Llama-3.2-3B-Instruct | - | 3.70 / 3.30 / 1.33 / 4.25 / 3.14 | 12.24 / 15.88 | 3.33 / 0.52 | 5.57
Llama-3.1-8B-Instruct | - | 6.01 / 5.80 / 9.31 / 7.40 / 7.13 | 16.18 / 21.46 | 3.13 / 1.15 | 8.81
Llama-3.3-70B-Instruct | - | 18.41 / 21.36 / 37.50 / 25.81 / 25.77 | 38.79 / 48.61 | 19.27 / 4.38 | 26.77
Qwen3-14B (no-thinking) | - | 21.34 / 28.64 / 50.27 / 36.50 / 34.19 | 47.89 / 53.47 | 29.69 / 26.04 | 36.73
Qwen3-14B (thinking) | - | 28.36 / 51.36 / 67.02 / 49.35 / 49.02 | 66.82 / 65.56 | 78.23 / 70.73 | 59.68
Qwen3-235B (no-thinking) | - | 28.54 / 44.32 / 63.03 / 45.14 / 45.26 | 55.82 / 62.94 | 37.40 / 27.08 | 45.38
Qwen3-235B (thinking) | - | 36.02 / 58.64 / 73.94 / 53.74 / 55.58 | 69.33 / 74.34 | 83.33 / 80.94 | 66.29
GPT-OSS-20B | - | 42.56 / 51.59 / 72.34 / 50.71 / 54.30 | 63.71 / 65.07 | 74.72 / 72.71 | 61.68
GPT-OSS-120B | - | 44.05 / 59.89 / 74.47 / 53.92 / 58.08 | 66.35 / 69.31 | 79.31 / 76.15 | 65.43
Claude-4.0-Sonnet | - | 32.04 / 41.82 / 59.57 / 48.19 / 45.40 | 63.37 / 70.71 | 57.19 / 44.58 | 52.18
GPT-4.1 | - | 36.30 / 51.25 / 61.44 / 46.43 / 48.85 | 60.78 / 67.20 | 46.98 / 35.71 | 50.76
o3 | - | 48.74 / 63.75 / 81.91 / 57.19 / 62.90 | 69.10 / 82.29 | 85.63 / 80.99 | 71.20

Table 2 Evaluation performance on PrincipiaBench involving hard reasoning problems with mathematical objects. Principia models are trained with new synthetic reasoning problems involving mathematical objects (the Principia Collection), which gives improvements both on PrincipiaBench and on existing MCQA and numerical benchmarks. Note that for SuperGPQA*, we use problems that require the generation of mathematical objects from the mathematics and engineering subset. LMs are grouped by the base model they were trained on, and the best scores in each group are bolded.
Note that "(Reimplemented)" denotes the baselines we trained ourselves with either the DeepScaleR or WebInstruct-Verified datasets, since there are no publicly available checkpoints for some of the base LMs.

1.4.2 Main Results

The main results are shown in Table 2.

Training on the Principia Collection improves the ability to derive mathematical objects. Across all four base models (Qwen2.5-7B-Base, OctoThinker-8B-Long-Base, Qwen3-4B-Base, and Qwen3-4B), we observe a consistent trend where training on the Principia Collection leads to performance improvements on PrincipiaBench. Specifically, Qwen2.5-7B-Base improves by +10.12% (12.75 → 22.87), OctoThinker-8B-Long-Base by +15.30% (3.75 → 19.05), Qwen3-4B-Base by +18.35% (11.31 → 29.66), and Qwen3-4B by +7.22% (41.23 → 48.45). The resulting models (Principia-7B-Zero, Principia-8B-Zero, Principia-4B-Zero, and Principia-4B) achieve the highest scores on PrincipiaBench across all settings, implying that the Principia Collection strengthens an LM's ability to derive mathematical objects.

Training on mathematical objects exhibits positive transfer to MCQA and numerical benchmarks. Unexpectedly, we also observe substantial improvements on MCQA and numerical benchmarks when training on the Principia Collection (which only consists of mathematical-object answers). For instance, Principia-4B outperforms Polaris-4B and markedly surpasses the Qwen3-4B (thinking) model on SuperGPQA, GPQA-Diamond, and AIME. Likewise, Principia-7B-Zero, Principia-8B-Zero, and Principia-4B-Zero achieve the best average performances across PrincipiaBench, MCQA, and numerical benchmarks. These results demonstrate that learning to construct complex mathematical forms strengthens reasoning abilities that generalize beyond mathematical-object derivation (PrincipiaBench) to diverse answer formats (AIME, GPQA, SuperGPQA).
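The mean@k scores reported in Table 2 (mean@8, mean@16, mean@32) are simple averages over k sampled responses per problem. A minimal sketch of the aggregation, where the hypothetical `score` callable stands in for the per-response equivalence check:

```python
def mean_at_k(samples_per_problem, score):
    """mean@k: average the per-response scores within each problem,
    then average across problems. `score` is a hypothetical scorer
    returning 1.0 for a verifier-accepted answer and 0.0 otherwise."""
    per_problem = [
        sum(score(s) for s in samples) / len(samples)
        for samples in samples_per_problem
    ]
    return sum(per_problem) / len(per_problem)
```

For example, with two problems where the first has 2/2 correct samples and the second 1/2, mean@2 is 0.75.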
Figure 9 Training directly on complex mathematical objects yields substantially better transfer than training on datasets requiring only numerical values or simple mathematical objects. Validation performance across different answer types when training with three RL post-training datasets. DeepScaleR and WebInstruct-Verified contain problems from reliable online sources but mostly require simple mathematical objects or numerical answers. In contrast, the Principia Collection is a fully synthetic dataset built around diverse and complex mathematical objects. Across all categories, Principia-trained models exhibit consistently higher and more stable improvements over training steps, demonstrating that exposure to complex answer structures during learning is crucial for generalizing to tasks involving rich mathematical forms.

Frontier models struggle on PrincipiaBench. Compared to widely used reasoning benchmarks such as SuperGPQA, GPQA-Diamond, and AIME, the vast majority of LMs, including frontier models like o3 (62.90) and Qwen3-235B (55.58), exhibit substantially lower performance on PrincipiaBench. This highlights that even the strongest models struggle to derive mathematical objects. Consequently, PrincipiaBench serves as a challenging and informative testbed for evaluating the depth of reasoning capabilities in frontier LMs.

Figure 10 Response length gradually increases when training on the Principia Collection. Note that the token length is measured on our validation set.

Training directly on complex mathematical objects provides far stronger transfer than relying on datasets limited to simple objects or numerical answers. As shown in Figure 9, we compare scores on our validation set when training Qwen3-4B-Base on different RL post-training datasets.
This experimental result addresses the question: "Do we really need to train on data whose answers are complex mathematical objects, or will a model naturally generalize if it is trained only on problems with simple answers such as numerical values?" The results show that training on either DeepScaleR (problems from AMC, AIME, IMO, IMC, and the AoPS forum with numerical-value answers) or WebInstruct-Verified (simple mathematical objects and numerical-value answers) is insufficient to achieve this goal. In contrast, the Principia Collection is better aligned with the desired target (i.e., precisely deriving complex mathematical objects), and we observe that it leads to more consistent improvements over the course of training. Hence, practitioners who want their LMs to precisely derive mathematical objects can include the Principia Collection in their training suite. Furthermore, as a side effect, Figure 10 shows that training on the Principia Collection leads to a more stable increase in the LM's response length than the baselines over the course of training.

Verifier Type | Verifier | Active Parameters | Agreement with Humans (%) | Precision | Recall | F1
Rule-based Verifier | math-verify | - | 5.95 | 5.38 | 6.67 | 5.96
Fine-tuned Verifier | general-verifier | 1.5B | 82.74 | 83.13 | 93.24 | 87.90
Fine-tuned Verifier | CompassVerifier | 3B | 81.55 | 90.74 | 65.33 | 75.97
Fine-tuned Verifier | CompassVerifier | 7B | 88.69 | 93.75 | 80.00 | 86.33
Fine-tuned Verifier | CompassVerifier | 32B | 91.66 | 94.20 | 86.67 | 90.28
General-purpose LM | Qwen3-4B | 4B | 92.26 | 89.74 | 93.33 | 91.50
General-purpose LM | Qwen3-14B | 14B | 93.45 | 92.21 | 94.67 | 93.42
General-purpose LM | GPT-OSS-20B | 3.6B | 94.64 | 95.83 | 92.00 | 93.88
General-purpose LM | GPT-OSS-120B | 5.1B | 95.24 | 97.18 | 92.00 | 94.52
General-purpose LM | o3 | - | 94.05 | 93.33 | 93.33 | 93.33

Table 3 Principia VerifyBench results. We find that model-based verifiers are more effective at determining answer equivalence than rule-based verifiers.
Also, o3 and GPT-OSS-120B achieve high agreement rates with the ground-truth labels, justifying our choice of a strong judge in PrincipiaBench and a strong reward model for RL training with the Principia Collection.

1.5 Analysis

1.5.1 Benefits of model-based verifiers over rule-based verifiers

At the core of our recipe, we use a model-based verifier, both for training on the Principia Collection and for evaluating on PrincipiaBench. To examine its reliability relative to rule-based verifiers, we compare the performance of rule-based and model-based verifiers on Principia VerifyBench. In addition, we run ablation experiments by training LMs with rewards from rule- or model-based verifiers and comparing their downstream performance.

Principia VerifyBench Experimental Setting. Using the 168 instances in Principia VerifyBench, we compare math-verify (Huggingface, 2025) and 13 different model-based verifiers. The model-based verifiers fall into two categories: (1) fine-tuned verifiers, which have been trained to check equivalence between the model prediction and the reference answer, and (2) general-purpose LMs, which are not specially trained but can serve as verifiers when prompted. For fine-tuned verifiers, we use general-verifier (1.5B) (Ma et al., 2025a) and CompassVerifier (Liu et al., 2025c) as baselines. For general-purpose LMs, we employ Qwen3-4/14B (Yang et al., 2025b), GPT-OSS-20/120B (Agarwal et al., 2025), and o3 (OpenAI, 2025c). We measure agreement with the ground-truth labels, precision, recall, and F1 scores for each baseline.

Principia VerifyBench Experimental Result. The results are shown in Table 3. First, we find that o3 scores 94.05% agreement with the ground-truth labels, which justifies our choice of o3 as the judge in PrincipiaBench. Furthermore, GPT-OSS-120B scores 95.24% agreement, indicating that it can serve as a reliable verifier providing accurate rewards during RL training.
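The metrics in Table 3 can be computed from paired human/verifier labels. A minimal sketch, assuming "Equivalent" is encoded as the positive class (True); this encoding is an assumption, not stated explicitly in the text:

```python
def verifier_metrics(human, pred):
    """Agreement, precision, recall, and F1 of a verifier against human
    labels, with "Equivalent" (True) treated as the positive class."""
    tp = sum(h and p for h, p in zip(human, pred))          # both say Equivalent
    fp = sum((not h) and p for h, p in zip(human, pred))    # verifier over-accepts
    fn = sum(h and (not p) for h, p in zip(human, pred))    # verifier over-rejects
    agreement = sum(h == p for h, p in zip(human, pred)) / len(human)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return agreement, precision, recall, f1
```

Note that the fine-tuned verifiers' low F1 in Table 3 is driven by recall, i.e., by the `fn` term: they reject many pairs the human annotators judged equivalent.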
Notably, we find that math-verify fails to adequately determine equivalence in most of the cases in Principia VerifyBench, scoring only 5.95% agreement with the ground-truth labels. Note that Principia VerifyBench was constructed in an adversarial manner for math-verify, since we specifically sample instances where the o3 judge and math-verify disagree. Hence, the experimental results imply that, in cases where the two judgments diverge, o3 is more likely to be correct. Furthermore, we find that the specially fine-tuned verifiers (general-verifier, CompassVerifier) exhibit lower agreement rates (<90%), primarily due to low recall, whereas even small general-purpose LMs such as Qwen3-4B achieve agreement rates as high as 92.26%. These findings suggest that, for RL training on the Principia Collection, it is better to use a general-purpose LM as the verifier rather than a fine-tuned verifier.

Reward Model Ablation Experimental Setting. Furthermore, we compare the effectiveness of a rule-based verifier and a model-based verifier by measuring the downstream performance of LMs RL-trained with each verifier. In this experiment, we train Qwen3-4B-Base on either DeepScaleR or the Principia Collection using three different verifiers: math-verify, general-verifier, and GPT-OSS-120B. This results in six training settings in total.

Reward Model Ablation Experimental Result. The results are shown in Table 4. Notably, the performance gap across verifier types is relatively small when training on DeepScaleR, where the total average ranges only from 28.67 to 30.45. This is likely because DeepScaleR contains many numerical-value answers sourced from math competitions (IMO, IMC, AMC, AIME), and thus using either rule-based or model-based verifiers does not make a substantial difference.⁷ In contrast, the results vary substantially when training on the Principia Collection.

Data | Verifier Type | Verifier | PrincipiaBench (RealMath / Physics / ARB / SuperGPQA* / Avg.) | MCQA (SuperGPQA* / GPQA-Diamond) | Numerical (AIME-24 / AIME-25) | Total Avg.
- | - | - | 9.43 / 5.80 / 17.81 / 12.18 / 11.31 | 20.46 / 18.09 | 12.50 / 10.21 | 13.31
DeepScaleR | Rule | math-verify | 9.24 / 20.91 / 38.30 / 28.10 / 24.14 | 43.38 / 44.63 | 23.02 / 21.77 | 28.67
DeepScaleR | Model | general-verifier | 16.20 / 18.86 / 38.03 / 30.13 / 25.81 | 42.35 / 45.74 | 25.00 / 23.13 | 29.93
DeepScaleR | Model | GPT-OSS-120B | 15.55 / 19.93 / 39.16 / 31.04 / 27.42 | 45.32 / 46.14 | 24.17 / 22.29 | 30.45
Principia Collec. | Rule | math-verify | 11.53 / 10.80 / 21.54 / 15.67 / 14.89 | 29.06 / 32.29 | 5.52 / 3.96 | 16.30
Principia Collec. | Model | general-verifier | 10.48 / 8.18 / 7.98 / 6.72 / 8.34 | 10.27 / 31.09 | 0.42 / 1.15 | 9.54
Principia Collec. | Model | GPT-OSS-120B | 19.28 / 21.81 / 43.62 / 33.92 / 29.66 | 45.53 / 43.56 | 22.71 / 23.96 | 31.80

Table 4 Training with a rule-based verifier vs. a model-based verifier: using a strong model-based verifier becomes highly important when the data is more challenging to verify. While there is little difference in performance when training on DeepScaleR, which consists of numerical-value and simple mathematical-object answers, performance varies widely when training on the Principia Collection, which consists of complex mathematical-object answers. The best performances across all baselines are bolded.

Figure 11 Example of training instances used in data ablation experiments. We prepare training data that share the same subject entity ("Information and communication theory, circuits » Communication, information » Sampling theory in information and communication theory") but require either a mathematical-object answer (our Principia Collection data), a numerical answer, or multiple-choice options. By training on these distinctly formatted datasets (subsubsection 1.5.2), we find that training to generate mathematical objects is effective at improving reasoning capabilities.
Performance increases only modestly with math-verify (13.31→16.30) and even decreases with general-verifier (13.31→9.54), whereas using a strong model-based verifier yields the largest gain (13.31→31.80). We speculate that this difference arises from the effectiveness of GPT-OSS-120B in determining the equivalence of mathematical objects. Notably, although general-verifier achieves an 82.74% agreement rate on Principia VerifyBench, the LM RL-trained with general-verifier attains very low performance (9.54). This suggests that even a slightly imperfect verifier can significantly deteriorate RL training outcomes. Taken together, these findings indicate that employing a strong LM-based verifier is crucial when post-training on datasets whose answers consist of complex mathematical objects (Principia Collection).

⁷ As also shown in Figure 4, most answers are under 10 tokens, making the verification task comparatively simple.

1.5.2 Unifying different answer formats

Table 5: Unifying mathematical-object and numerical-type data using weight merging leads to the best performance. We systematically generate three types of synthetic data that share the same subject entities but require different answer formats. We then either train on one of them, jointly train on several, or merge the weights of single-trained checkpoints. The first three columns indicate the training data mix; RealMath through PB Avg. are PrincipiaBench (math-object) results, SuperGPQA* (MCQA) and GPQA-Diamond are MCQA results, and AIME-24/25 are numerical results. The best scores are bolded.

| Math Obj. | Numer. | MCQA | Method | RealMath | Physics | ARB | SuperGPQA* | PB Avg. | SuperGPQA* (MCQA) | GPQA-Diamond | AIME-24 | AIME-25 | Total Avg. |
|:---:|:---:|:---:|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| ✗ | ✗ | ✗ | – | 9.43 | 5.80 | 17.81 | 12.18 | 11.31 | 20.46 | 18.09 | 12.50 | 10.21 | 13.31 |
| ✓ | ✗ | ✗ | Single Training | 19.28 | 21.81 | 43.62 | 33.92 | 29.66 | 45.53 | 43.56 | 22.71 | 23.96 | 31.80 |
| ✗ | ✓ | ✗ | Single Training | 17.96 | 20.14 | 40.96 | 30.28 | 27.34 | 46.10 | 43.50 | **26.98** | 23.75 | 31.19 |
| ✗ | ✗ | ✓ | Single Training | 15.84 | 18.75 | 37.23 | 27.87 | 24.96 | 34.74 | 29.29 | 17.71 | 19.79 | 25.15 |
| ✓ | ✓ | ✗ | Joint Training | 17.46 | 19.89 | 40.96 | 32.18 | 27.62 | 43.75 | 42.27 | 22.50 | 20.31 | 29.92 |
| ✓ | ✓ | ✗ | Weight Merging | **20.35** | **22.93** | **45.21** | **34.38** | **29.97** | **47.30** | **44.79** | 25.94 | **24.92** | **33.23** |
| ✓ | ✓ | ✓ | Joint Training | 16.83 | 19.66 | 39.63 | 30.98 | 26.77 | 38.80 | 40.28 | 20.42 | 20.52 | 28.39 |
| ✓ | ✓ | ✓ | Weight Merging | 17.68 | 22.84 | 44.68 | 33.59 | 29.70 | 46.02 | 41.95 | 26.04 | 22.81 | 31.95 |

Experimental Setting. Frontier LMs are often post-trained on heterogeneous datasets, which raises a natural question: "When post-training data spans multiple answer formats (mathematical objects, numerical, MCQA), how does this mixture affect performance on benchmarks associated with each format, and does mixing these formats provide any benefit?" To answer this question, we systematically analyze how including mathematical-object data affects downstream performance when combined with other data types. As shown in Figure 11, we generate additional synthetic datasets that use the same underlying entity subjects as the Principia Collection but require either a numerical-value answer or a multiple-choice option answer. The three problem statements share the same entity subject but require different answer formats. This enables us to minimize the effect of problem diversity (which is crucial in practice) and solely compare the effect of the answer format (Kim et al., 2025a). We follow the same data-generation procedure (described in Subsection 1.3.3); the resulting numerical-value dataset contains 300K questions and the MCQA dataset contains 350K questions, where fewer samples are filtered out compared to mathematical objects. For numerical-value answers, we categorize problems into six types: integer, fractional, and decimal values, each with or without units. For multiple-choice questions, we include three formats: questions with 4, 6, and 10 options. Then, we train Qwen3-4B-Base on these three datasets and compare their performance on our evaluation suite. When mixing training datasets, we compare two approaches: (1) joint training, where we mix the datasets and RL-train on them together, and (2) weight merging, where we train LMs on each dataset separately and then merge the resulting checkpoints.
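The weight-merging approach (linear averaging of independently trained checkpoints) can be sketched as a minimal, framework-agnostic example; the dict-of-lists "checkpoints" here are a toy stand-in for real model state dicts:

```python
def merge_checkpoints(checkpoints, weights=None):
    """Linearly merge parameter dicts (uniform 1:1 or 1:1:1 by default).

    `checkpoints` is a list of {param_name: list_of_floats} dicts, a
    stand-in for real framework state dicts, all with identical keys/shapes.
    """
    if weights is None:
        # uniform mixing, e.g. 1:1 for two checkpoints, 1:1:1 for three
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    merged = {}
    for name in checkpoints[0]:
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(checkpoints[0][name]))
        ]
    return merged

# 1:1 merge of two "checkpoints"
a = {"layer.w": [1.0, 2.0]}
b = {"layer.w": [3.0, 4.0]}
print(merge_checkpoints([a, b]))  # {'layer.w': [2.0, 3.0]}
```

In practice the same averaging is applied per tensor over real checkpoints; non-uniform weights generalize the 1:1 and 1:1:1 ratios used here.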
For weight merging, performance can in practice vary considerably depending on how the checkpoints are mixed; for simplicity, we use linear merging (1:1 or 1:1:1 ratios).

Experimental Results. The results are shown in Table 5. Our first observation is that training solely on MCQA data performs poorly across nearly all benchmarks. For example, the MCQA-only model reaches only 25.04 total average performance, with weak results on both mathematical-object tasks (e.g., 24.96 on PrincipiaBench Avg.) and numerical benchmarks (17.71 on AIME-24 and 19.79 on AIME-25). This suggests that MCQA supervision alone provides little transferable signal for complex reasoning. As illustrated in Figure 2, a likely cause of this failure is that LMs often rely on the provided options as an anchor during their reasoning process. In an MCQA-only training setup, the model is repeatedly rewarded for behaviors that exploit this shortcut—such as backward reasoning from the answer choices or forming conclusions conditioned on option structure—because these strategies can still lead to correct predictions. We suspect that over the course of training, such option-anchored heuristics become reinforced and amplified, preventing the model from learning the deeper derivational reasoning patterns needed to generalize to mathematical-object or numerical tasks. In contrast, training on mathematical-object data yields strong transfer to both MCQA and numerical benchmarks. The math-only model achieves 45.53 on SuperGPQA, 43.56 on GPQA-Diamond, and robust numerical performance such as 22.71 on AIME-24 and 23.96 on AIME-25, demonstrating that mathematical-object supervision teaches reasoning skills that generalize well across answer formats. Most surprisingly, when unifying data types, weight merging consistently outperforms joint training. For example, when combining math-object and numerical data, weight merging achieves a total average of 33.23—substantially higher than the joint-training result of 29.92.
A similar pattern holds when unifying all three answer formats, where weight merging again surpasses joint training (31.95 vs. 28.39). The strongest PrincipiaBench scores in these settings also come from weight merging, including 22.93 on Physics, 45.21 on ARB, and 34.38 on SuperGPQA*. Notably, for both joint training and weight merging, incorporating MCQA data consistently reduces performance compared to using only mathematical-object and numerical-value data. These results suggest that merging independently trained LMs may be more effective than co-training a single model on heterogeneous answer formats, and that MCQA supervision can dilute the benefits of mixing data.

1.6 Related Work

Evaluating reasoning capability of LMs. Reasoning has long been considered a cornerstone of machine intelligence and remains one of the most important capabilities to assess (Bottou, 2014; Weston et al., 2015; Chollet, 2019). In evaluating the reasoning capabilities of LMs, it has become common in recent years to use grade-school mathematics exams or mathematics Olympiad competition problems, which typically require a numerical answer (Cobbe et al., 2021; Hendrycks et al., 2021; Luong et al., 2025a). Recent efforts also assess reasoning across a broader range of disciplines through multiple-choice questions in subjects such as physics, chemistry, and biology—tasks that require both advanced university- or graduate-level knowledge and the ability to synthesize the information provided in the problem statement to infer the answer (Wang et al., 2024b; Rein et al., 2024; Du et al., 2025). On the other hand, there is considerable excitement that LMs could be used in reasoning-intensive real-world applications such as autonomous scientific discovery and assisting scientists' workflows (Van Noorden and Perkel, 2023; Mondorf and Plank, 2024; Lu et al., 2024; Wang et al., 2025b; OpenAI, 2025a).
Yet a substantial gap remains between these ambitions and the standardized tests we currently rely on. One of the most important capabilities in the space between exam-style standardized tests and scientific discovery, largely overlooked in the literature, is the ability to manipulate and infer complex mathematical objects. Problems that require producing or reasoning over mathematical objects present a more challenging test bed than multiple-choice or numerical-answer standardized exams, enable the evaluation of a broader range of reasoning skills, and simultaneously assess a critical competency needed for conducting experiments or writing scientific papers (Hestenes, 2009; Meerschaert, 2013; Goos et al., 2023). It is also worth noting that Ma et al. (2025a), whose motivation is closely aligned with ours, train General-Reasoner on diverse domains and formats, yet evaluate their model only on multiple-choice questions and olympiad problems that require numerical answers. In contrast, our work introduces PrincipiaBench to assess an LM's ability to derive mathematical objects.

Synthetic data for improving reasoning capability of LMs. Synthetic data generation has been widely used to improve the instruction-following capabilities of LMs, both by creating new prompts and by generating corresponding responses (Wang et al., 2023b; Taori et al., 2023; Liu et al., 2025a). In contrast, when acquiring data aimed at enhancing reasoning—such as mathematical problem-solving—the highest priority is placed on ensuring answer correctness. As a result, instead of generating entirely new problems, prior work has typically augmented existing problem–answer pairs with chain-of-thoughts (CoTs) (Zelikman et al., 2022; Kim et al., 2023a; Mukherjee et al., 2023).⁸ However, while this approach has the advantage of guaranteeing answer correctness, it also suffers from a key limitation: it restricts the diversity of problem sources to only those that already exist.
Notably, commonly used post-training datasets—such as DeepScaleR (Luo et al., 2025), OmniMath (Gao et al., 2025), NuminaMath (Li et al., 2024a), and Skywork-OR1 (He et al., 2025)—collect problems from reliable sources including AMC, AIME, IMO, IMC, and Chinese K-12 exams. Yet none of these datasets contain many of the problem types that require deriving complex mathematical objects. Another widely used data-synthesis approach extracts problems from online forums or documents in the pre-training corpus and then slightly modifies the problem statement while keeping the original answer unchanged. However, the diversity of the resulting data is still determined by the types of problems contained in those documents. Representative datasets produced through this approach include Natural Reasoning (Yuan et al., 2025b), WebInstruct-Verified (Ma et al., 2025a), and TextBookReasoning (Fan et al., 2025). Yet, when we analyze the token distribution of their outputs, we find that—despite the potential of document-based data synthesis—these datasets likewise contain far too few instances that require complex mathematical objects as answers. In our work, we generate new problem statements using subjects drawn from graduate-level disciplines (i.e., entities from MSC2020 and PhySH) that require mathematical objects as answers. Furthermore, we demonstrate that training on the Principia Collection yields strong performance on both PrincipiaBench and other reasoning benchmarks.

1.7 Conclusion

In this work, we propose PrincipiaBench, a benchmark that explicitly requires mathematical-object derivation; the Principia Collection, an RL post-training dataset designed to teach this ability; and Principia VerifyBench, a meta-evaluation suite for assessing the performance of verifiers used during RL training and benchmarking.
While the ability to derive mathematical objects is fundamental to scientific reasoning, existing benchmarks that rely on multiple-choice formats fail to capture this capability. Our main finding is that LMs trained on the Principia Collection not only improve substantially on PrincipiaBench but also exhibit enhanced reasoning on other answer formats, including MCQA and numerical tasks. Furthermore, through a range of experiments, we show that a model-based verifier is necessary for providing accurate rewards when training on the Principia Collection.

Looking ahead, we hope the artifacts introduced in this paper will be broadly useful to the community. PrincipiaBench can serve as a rigorous benchmark for researchers developing new RL algorithms or post-training datasets to track improvements in reasoning. The Principia Collection and its data-generation pipeline can be utilized for future work on synthetic data generation. Finally, Principia VerifyBench provides a principled way to train and evaluate model-based verifiers for fast RL training. We hope that the ability to derive mathematical objects is no longer overlooked and is instead recognized as a core capability of LMs.

⁸ When training LMs with SFT or DPO, CoTs that correctly predict the ground-truth answer are filtered beforehand; in RL settings, however, such CoTs are not pre-collected but are instead rewarded when the policy model generates a correct answer.

2 RLLM: Unified Post-Training via On-Policy-Trained Language Model as a Reward Model

Figure 12: Reinforcement Learning with an LM as Reward Model (RLLM). Comparison of RLHF, RLVR, and RLLM for post-training LLMs. RLHF optimizes a policy against a scalar reward model trained on human-annotated preference data using the Bradley-Terry objective. RLVR trains a policy using a rule-based verifier and hence is restricted to easy-to-verify tasks with ground-truth labels.
RLLM (ours) is a special case of RL-from-AI-Feedback (RLAIF) that first trains an LM-as-RM on synthetic judgments using RL and then uses its generative rewards to optimize the policy. An LM-as-RM exploits an LLM's (1) reasoning capabilities to produce higher-quality reward signals and (2) instruction-following capabilities to allow flexible reward design. Thus, RLLM unifies the post-training paradigm, enabling the policy model to excel across easy-to-verify, hard-to-verify, and non-verifiable tasks. We further show that training the RLLM reward model on-policy (via responses sampled from the policy model) improves results.

2.1 Overview

Large Language Models (LLMs) have demonstrated remarkable performance across a wide variety of tasks, from general instruction following to complex reasoning (Guo et al., 2025; Yang et al., 2025b; Hurst et al., 2024; Grattafiori et al., 2024). A key factor behind this progress is the post-training stage that relies on Reinforcement Learning (RL) to align models to human preferences. Traditionally, this alignment has been achieved through Reinforcement Learning from Human Feedback (RLHF), where human preference data is used to train a scalar reward model that assigns single numerical scores to guide optimization (Ouyang et al., 2022). However, scalar reward models are limiting: they produce a final score without generating intermediate reasoning steps, making them mostly ineffective for improving reasoning capabilities and susceptible to issues such as reward hacking and poor generalization to out-of-distribution (OOD) scenarios (Gao et al., 2023). More recently, with the advancement of o1 (Jaech et al., 2024) and R1-style thinking LLMs (Guo et al., 2025), a new paradigm has emerged: Reinforcement Learning with Verifiable Rewards (RLVR) (Lambert et al., 2024). As illustrated in Figure 12, RLVR replaces the scalar reward model with rule-based verifiers (e.g., code compilers or math checkers) to provide grounded feedback.
While RLVR offers verifiable correctness, it strictly requires access to "gold" answers that are often hard to obtain or hard to verify. This limits its utility primarily to domains with objective correctness criteria—such as simple math and coding—leaving a gap for complex reasoning tasks where answers are not easily checked by rules, or subjective tasks where no gold answer exists (Tao et al., 2025). These limitations motivate a third paradigm: RLLM (Reinforcement Learning with an LM itself as a Reward Model). Unlike scalar models that lack deeper reasoning, or rule-based verifiers that lack flexibility, RLLM leverages the inherent capabilities of the LLM itself to serve as a "thinking" Reward Model. The effectiveness of this approach stems from an LLM's two core strengths: its reasoning capabilities enable it to generate explicit thinking traces for higher-quality judgments, while its instruction-following capabilities allow for flexible reward design through natural language prompts. Consequently, RLLM serves as a single, unified post-training framework effective across easy-to-verify, hard-to-verify, and non-verifiable domains.

Figure 13: Performance comparison of post-trained Qwen3-1.7B models on (a) verifiable tasks (average of five math benchmarks) and (b) non-verifiable instruction-following tasks. Models are trained via RLHF (with Skywork-Reward-V2-Llama-3.1-8B as scalar RM), RLVR (with Math-Verify as rule-based verifier), and our RLLM (with J1-Qwen3-32B as LM-as-RM). Post-training data for verifiable tasks is either (1) easy-to-verify, (2) hard-to-verify, (3) reference-free, or (4) reference-based. Across all these settings, RLLM achieves consistently higher accuracy and win rates than RLVR and RLHF, with particularly large gains when trained on hard-to-verify problems. [Recovered bar values; panel (a): on easy-/hard-to-verify data, RLLM reaches 40.9/43.2 average math accuracy vs. 39.6/35.7 for RLVR (reference-based training) and 41.2/43.4 vs. 37.3/36.0 for RLHF (reference-free training); panel (b): RLLM attains win rates of 49.2 on AlpacaEval 2.0, 8.4 on ArenaHard 2.0 (Hard Prompts), and 14.7 on ArenaHard 2.0 (Creative Writing), vs. 47.2/6.0/8.5 for RLHF.]

Figure 14: RL validation rewards for LM-as-RM training (left) and RLLM policy training (center), alongside the downstream AIME24 accuracy of the policy model (right). A centered moving average is shown to highlight the trend.

Notably, recent rubric-based evaluation methods – where structured criteria guide judgment – can be viewed as a special case within our framework: the LM-as-RM implicitly internalizes and flexibly applies such rubrics through its reasoning, without requiring explicitly specified scoring rules. Our proposed method operates in two distinct stages. First, in the Thinking LM-as-RM Training stage, we train an LLM to act as a thinking Reward Model. Crucially, we employ an on-policy training recipe: the RM is trained to evaluate responses sampled specifically from the policy model, using synthetic labels derived from a stronger teacher or verifier. Unlike scalar RMs, this LM-as-RM generates an explicit reasoning trace to justify its judgment before assigning a score.
Second, in the Policy Post-Training stage, we use the generative rewards produced by our trained LM-as-RM to optimize the policy model via reinforcement learning. We validate RLLM through extensive experiments on diverse benchmarks, including competition math (MathArena), physics (Principia), and open-ended instruction following (AlpacaEval, ArenaHard). We demonstrate that RLLM significantly outperforms both RLHF (with SOTA scalar RMs) and RLVR (with rule-based verifiers). Notably, RLLM achieves large gains on hard-to-verify tasks, such as math or physics problems where standard verifiers fail, while simultaneously maintaining strong performance on non-verifiable creative tasks. Furthermore, we identify two critical ingredients for success: (i) the necessity of a sufficient generator-verifier gap (using a stronger RM to train a weaker policy), and (ii) the importance of on-policy RM training, as we show that off-policy RMs struggle to provide accurate signals for downstream improvements. Our contributions are summarized as follows:

• We introduce RLLM, a unified post-training framework that employs a "thinking" Language Model as a Reward Model, improving upon the limited rule-based verifiers (required by RLVR) and the uninterpretable, shallow scalar scoring (used in RLHF).
• We demonstrate that RLLM outperforms leading RLHF and RLVR baselines for Qwen, Llama, and Octothinker models of varying sizes, achieving significant gains on both verifiable tasks (Math, Physics) and non-verifiable tasks (AlpacaEval, ArenaHard).
• We provide a comprehensive analysis of the recipe required for effective LM-as-RM training, empirically showing that on-policy training and a strong generator-verifier capability gap are essential for successful policy improvement.

2.2 Preliminaries

RLHF: Reinforcement Learning from Human Feedback. The standard RLHF (Ouyang et al., 2022; Bai et al., 2022a) pipeline trains a scalar reward model on pairwise human preference data.
Each data point $(x, y_c, y_r) \in \mathcal{D}$ includes an instruction $x$, a chosen response $y_c$, and a rejected response $y_r$. The reward model $r_\phi(x, y)$ is optimized using a Bradley-Terry objective:

$$\mathcal{L}_R = -\mathbb{E}_{(x, y_c, y_r) \in \mathcal{D}}\big[\log \sigma\big(r_\phi(x, y_c) - r_\phi(x, y_r)\big)\big], \tag{1}$$

where $\sigma(\cdot)$ is the logistic sigmoid. This trains the model to assign higher rewards to the chosen responses. The trained scalar reward model is then used to optimize a policy model via Reinforcement Learning (e.g., PPO (Schulman et al., 2017)). In this paper, we use the term RLHF to refer specifically to the classical setup of InstructGPT (Ouyang et al., 2022), where a scalar reward model is trained from human preference data and thereafter used to optimize a policy model.

RLVR: Reinforcement Learning with Verifiable Rewards. Scalar reward models remain vulnerable to reward hacking, also known as reward over-optimization (Amodei et al., 2016; Gao et al., 2023; Eisenstein et al., 2024), especially under distributional shift, and have shown limited ability to drive improvements on challenging reasoning problems. Thus, in an attempt to prevent such hacking, RLVR replaces the reward model in RLHF with a verification function (Lambert et al., 2024) such that the policy only receives a reward when its generated responses are verifiably correct. Given an instruction $x$, a candidate response $y$, and a reference answer $y_{\text{ref}}$, the verification function $\psi(\cdot)$ is defined as:

$$\psi(x, y, y_{\text{ref}}) = \begin{cases} \gamma, & \text{if correct, i.e., } y \text{ is equivalent to } y_{\text{ref}}, \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$

These verifiers are typically rule-based (e.g., math-verify for math or unit tests for coding) but can also be model-based (e.g., a model that checks equivalence between two mathematical expressions). RLVR has shown great success on tasks with verifiable outcomes such as mathematical problem-solving (Shao et al., 2024) and verifiable instruction-following tasks (Lambert et al., 2024).
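The two reward formulations above, the Bradley-Terry objective for a scalar RM (Eq. 1) and the rule-based verifier reward (Eq. 2), can be sketched in a few lines; `equivalent` here is a hypothetical stand-in for a real equivalence checker:

```python
import math

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry objective (Eq. 1) for one preference pair:
    -log sigma(r_phi(x, y_c) - r_phi(x, y_r))."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def rlvr_reward(y, y_ref, equivalent, gamma=1.0):
    """Rule-based verifier reward (Eq. 2): gamma if the response is
    equivalent to the reference, 0 otherwise. `equivalent` is a
    placeholder for a verifier such as an exact-match or symbolic check."""
    return gamma if equivalent(y, y_ref) else 0.0

# The loss shrinks as the chosen response is scored increasingly
# above the rejected one, and grows when the ranking is inverted.
assert bt_loss(2.0, 0.0) < bt_loss(0.0, 0.0) < bt_loss(0.0, 2.0)

# A trivial string-equality "verifier" as the equivalence check:
eq = lambda a, b: a.strip() == b.strip()
print(rlvr_reward("42", "42 ", eq))  # 1.0
```

Real RLHF pipelines average Eq. 1 over batches and backpropagate through the RM; this sketch only shows the per-pair quantities.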
However, its reliance on high-quality reference answers that are also easy to verify limits its application, e.g., to easy-to-verify math and coding tasks, but not to non-verifiable or hard-to-verify tasks.

2.3 RLLM: Reinforcement Learning with Language Models as Reward Models

To reduce dependence on costly human preference collection, a third paradigm – Reinforcement Learning from AI Feedback (RLAIF) – was introduced (Bai et al., 2022b; Lee et al., 2024), where an AI model is itself used to provide rewards. Our RLLM paradigm is a special case of RLAIF (see Figure 12), which employs a Language Model as a thinking Reward Model (LM-as-RM) to unify post-training across diverse task types. In contrast to scalar reward models $r_\phi(\cdot)$ used in RLHF or deterministic verifiers $\psi(x, y, y_{\text{ref}})$ used in RLVR, RLLM utilizes generative rewards $r_{\text{LM}}(\cdot)$ obtained directly from a thinking LLM. This unifies the post-training recipe and allows the flexibility of computing rewards through pointwise, pairwise, or listwise judgments, in both reference-free $r_{\text{LM}}(x, y)$ and reference-based $r_{\text{LM}}(x, y_{\text{ref}}, y)$ settings, depending on the task requirements described below. We consider the standard RL objective for maximizing expected reward:

$$\max_{\pi_{\theta_{\text{policy}}}} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r_{\text{LM}}(x, y)\big] - \beta\, D_{\text{KL}}\big(\pi_{\theta_{\text{policy}}} \,\|\, \pi_{\text{ref}}\big), \tag{3}$$

where $\pi_{\theta_{\text{policy}}}$ is the policy, $\pi_{\text{ref}}$ is the reference model, and $\beta$ controls the KL-divergence penalty. Importantly, RLLM uses RL both to train an LLM as an RM and to optimize the policy using rewards produced by this LM-as-RM.

2.3.1 Task Settings and LM-as-RM Variants

We consider a general post-training setting where LLMs are expected to improve across Verifiable (e.g., Math, Code) and Non-Verifiable tasks (e.g., Open-ended Chat). Verifiable tasks are those with objective correctness criteria, while non-verifiable tasks are inherently subjective and lack definitive, externally verifiable references (Lu, 2025).
Within verifiable tasks, we further differentiate between easy-to-verify problems (gradeable via simple rules like math-verify) and hard-to-verify problems (requiring intermediate reasoning to check equivalence). Given this set of tasks, we define the following LM-as-RM reward formulations:

Reference-Free Pointwise LM-as-RM: $r_{\text{LM}}(x, y) \to (t, s)$. In a reference-free pointwise setting, the LM-as-RM assigns rewards to individual responses independently. Given an instruction $x$ and a candidate response $y$, the LLM generates a formatted judgment consisting of (i) a reasoning trace $t$ enclosed within <think> tags, where the model analyzes the quality of $y$, and (ii) a final scalar score $s$ within <score> tags.

Reference-Based Pointwise LM-as-RM: $r_{\text{LM}}(x, y_{\text{ref}}, y) \to (t, s)$. In the reference-based setting, the LM-as-RM is additionally provided with a ground-truth reference $y_{\text{ref}}$ and instructed to ground its judgment in this solution. This enables robust evaluation for both easy-to-verify and hard-to-verify problems that require substantial reasoning, effectively allowing the model to act as a generative verifier.

Pairwise and Listwise LM-as-RM: $r_{\text{LM}}(x, \mathbf{y}) \to (t, \mathbf{s})$ or $r_{\text{LM}}(x, y_{\text{ref}}, \mathbf{y}) \to (t, \mathbf{s})$. Pointwise evaluation can be extended to pairwise or, more generally, to listwise evaluation, where the LM-as-RM jointly evaluates multiple candidate responses. In this setting, the input consists of the instruction $x$ and a set of $k$ candidate responses $\mathbf{y} = \{y_1, \dots, y_k\}$. The model jointly reasons about the candidates in $t$ and outputs a list of scores $\mathbf{s} = \{s_1, \dots, s_k\}$. Such comparative evaluation has been shown to substantially improve zero-shot LLM-judge performance, particularly for non-verifiable tasks (Whitehouse et al., 2025a). In Section 5, we provide the prompt template for reference-free pointwise LMs-as-RMs in Figure 24, and pairwise in Figure 25. In Figure 26 we show the prompt template for reference-based pointwise LMs-as-RMs.
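A minimal parser for the judgment format described above can be sketched as follows; the exact <think>/<score> tag layout is an assumption based on the description here, and the paper's actual prompt templates (Figures 24-26) may differ:

```python
import re

def parse_judgment(text):
    """Extract (reasoning_trace, score) from an LM-as-RM judgment string.

    Assumes the format described above: a reasoning trace t inside
    <think>...</think> and a scalar score s inside <score>...</score>.
    Returns (None, None) for malformed judgments, which a training loop
    could map to a default or discarded reward.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    score = re.search(r"<score>\s*([-+]?\d+(?:\.\d+)?)\s*</score>", text)
    if score is None:
        return None, None
    return (think.group(1).strip() if think else "", float(score.group(1)))

t, s = parse_judgment("<think>Step 2 drops a factor of 2.</think><score>0</score>")
print(s)  # 0.0
```

Only the parsed score would feed into policy optimization; the trace is kept for inspection, mirroring the statement later in the section that thinking traces are generated but only scores are used as rewards.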
2.3.2 LM-as-RM Training via RLVR

Here we first describe how we use RL to train an LM-as-RM. While off-the-shelf LLMs can also serve as zero-shot RMs, their judgments often become unreliable when policy outputs drift out-of-distribution or when the generator–verifier gap is insufficient to provide trustworthy signals (Shao et al., 2025). To obtain more reliable and task-aligned reward signals, we thus train an LM-as-RM via Reinforcement Learning with Verifiable Rewards using the J1 framework (Whitehouse et al., 2025b). J1 constructs synthetic judgment tasks with labels, converting diverse tasks into a unified verifiable format compatible with RLVR-style training.

Synthetic Training Data Generation. Let $\pi_{\theta_{\text{policy}}}$ denote the initial policy LLM that we want to optimize using an LM-as-RM. To train the LM-as-RM, we first sample on-policy responses from $\pi_{\theta_{\text{policy}}}$ and synthetically annotate the responses for the reward modeling task. Specifically, given a dataset $\mathcal{D}$ with instructions $x$ and optionally available reference answers $y_{\text{ref}}$, we generate reward model training data in three steps: (i) sample a set of responses $\mathbf{y}$ from the policy $\pi_{\theta_{\text{policy}}}$; (ii) employ a strong teacher LLM to rate the correctness or quality of these responses, obtaining scores $\mathbf{s}$ (for mathematical reasoning tasks, these ratings are typically binary correct/incorrect; for non-verifiable tasks, the scores span a continuous scale $[s_{\min}, s_{\max}]$ reflecting response quality); and (iii) create a balanced dataset to ensure a uniform distribution over the assigned scores. The resulting synthetic dataset for LM-as-RM training is denoted as $\mathcal{D}_{\text{LM}} = \{(x, y_{\text{ref}}, \mathbf{y}, \mathbf{s}) \mid (x, y_{\text{ref}}) \in \mathcal{D}\}$, where $y_{\text{ref}} = \emptyset$ in reference-free settings. This enables us to train an LM-as-RM in both reference-free and reference-based scenarios.

RLVR Training. Given the scores, we now train the LM-as-RM using RLVR. Following Whitehouse et al.
(2025b), we format examples from the synthetic dataset $\mathcal{D}_{\text{LM}}$ into seed LM-as-RM prompts (depending on the training configuration), and instruct the model to generate a judgment $(t', s')$, consisting of a reasoning trace $t'$ and a predicted score $s'$. We optimize the model using GRPO (Shao et al., 2024), assigning a reward of 1 if the predicted score matches the teacher score, and 0 otherwise.

2.3.3 RLLM Policy Training with LM-as-RM

Given an LM-as-RM $\pi_{\theta_{\text{LM}}}$, either an off-the-shelf LLM or a trained one (as described above), we optimize the policy model $\pi_{\theta_{\text{policy}}}$ using the LM-as-RM's generative rewards. All LM-as-RM variants generate thinking traces and scores; however, only the scores are used for policy optimization. In particular, we sample rollouts $y_i \sim \pi_{\theta_{\text{policy}}}(\cdot \mid x)$ and, depending on whether the LM-as-RM is a pointwise, pairwise, or listwise model and on whether reference answers are available, we compute rewards from the rollouts as follows:

Pointwise RLLM. In this setting, the reward for each rollout $y_i$ is computed independently as $r_i = r_{\text{LM}}(x, y_i)$ (or with reference $y_{\text{ref}}$). This approach is most computationally efficient, providing an absolute quality score without requiring comparisons with other rollouts.

Pairwise RLLM. Pairwise comparison requires constructing pairs of rollouts to be evaluated jointly by the LM-as-RM. Let $\mathbf{y} = \{y_1, \dots, y_n\}$ denote the $n$ rollouts for a given prompt. We consider two pairwise strategies; both evaluate the ordered responses $(y_a, y_b)$ and $(y_b, y_a)$ to reduce positional bias (Zheng et al., 2023).

• Exhaustive Pairwise. We compare each rollout with every other rollout, resulting in all $\binom{n}{2}$ pairs evaluated in both orders. Each rollout appears in $2(n-1)$ comparisons, and its final reward is obtained by averaging the scores across these comparisons. This provides the strongest comparative signal but has $O(n^2)$ judgment complexity.

• Pivot Pairwise.
To reduce the quadratic complexity of all-pairs comparison, we randomly choose one rollout $y_{\text{pivot}}$ as a reference and pair it with every other rollout in both orders, resulting in $2(n-1)$ comparisons: $(y_a, y_{\text{pivot}})$ and $(y_{\text{pivot}}, y_a)$. All non-pivot rollouts receive scores relative to the same anchor, thus producing more consistent rewards than randomly constructing $O(n)$ pairs. The pivot rollout's reward is obtained by averaging the scores across all the comparisons. This setting remains lightweight, with $O(n)$ LM-as-RM evaluations.

Listwise RLLM. Similar constructions extend to lists of $K > 2$ rollouts, where the LM-as-RM jointly evaluates sets of $K$ rollouts rather than pairs, at increased reward-modeling complexity. The policy and the LM-as-RM may be initialized from the same or different base LLMs; when they coincide, this corresponds to a self-rewarding RLLM setting (Yuan et al., 2025a). Our experiments consider both cases.

2.4 Experimental Setup

2.4.1 LM-as-RM Models

Training Recipe. For the purpose of our main experiments, we use the J1 recipe to train on-policy LM-as-RM models starting from a large Qwen3-32B (thinking) model. As we will show later in our analysis (Subsubsection 2.5.2), (1) training a large RM that ensures a substantial generator–verifier gap and (2) training the RM on-policy are both critical for the best downstream policy improvements. Recall that LM-as-RM training can be done in both reference-free and reference-based settings, in which case we call the resulting RMs J1-Qwen3-32B-RM and J1-Qwen3-32B-Verifier, respectively. We train these J1 models by (1) sourcing hard-to-verify prompts from OpenMathReasoning and the Principia Collection, (2) generating 16 responses per prompt from the Qwen3-1.7B model (i.e., the policy model to be trained), and (3) labeling each response as correct or incorrect using GPT-OSS-120B (since the final answers are hard to verify).
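The two pairwise reward constructions above can be sketched as follows; `judge(y_a, y_b)` is a hypothetical stand-in for an LM-as-RM call that returns a score for each response in the ordered pair:

```python
import random

def exhaustive_pairwise_rewards(rollouts, judge):
    """Average each rollout's scores over all ordered pairs.

    Every pair is judged in both orders, so each rollout appears in
    2(n-1) comparisons; O(n^2) judge calls overall.
    """
    n = len(rollouts)
    totals, counts = [0.0] * n, [0] * n
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            s_a, s_b = judge(rollouts[a], rollouts[b])
            totals[a] += s_a; counts[a] += 1
            totals[b] += s_b; counts[b] += 1
    return [t / c for t, c in zip(totals, counts)]

def pivot_pairwise_rewards(rollouts, judge, rng=random):
    """Pair one random pivot rollout with every other rollout in both
    orders (2(n-1) comparisons, O(n) judge calls); all non-pivot rewards
    are measured against the same anchor."""
    n = len(rollouts)
    p = rng.randrange(n)
    totals, counts = [0.0] * n, [0] * n
    for i in range(n):
        if i == p:
            continue
        for first, second in [(i, p), (p, i)]:
            s_f, s_s = judge(rollouts[first], rollouts[second])
            totals[first] += s_f; counts[first] += 1
            totals[second] += s_s; counts[second] += 1
    return [t / c for t, c in zip(totals, counts)]
```

With a toy judge that scores a numeric "rollout" by its own value, `exhaustive_pairwise_rewards([1.0, 3.0], lambda a, b: (a, b))` averages each rollout over its 2(n-1) = 2 comparisons and recovers `[1.0, 3.0]`.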
See Figure 27 and Figure 28 for two such training examples of correct and incorrect responses. This results in a synthetically annotated and balanced dataset of 18,774 examples for pointwise LM-as-RM training with verifiable correctness rewards. For the purposes of ablations, whenever we train a weaker LM-as-RM model with off-policy samples, we follow a similar J1 recipe, with the exception that either the base model changes or the responses are sampled from a model different from the policy to be aligned.

Training Hyperparameters. Following Whitehouse et al. (2025b), we implement all J1 models on top of verl (Sheng et al., 2024). We use a train batch size of 512 with a maximum sequence length of 4096 for both input and output. We also set the KL coefficient to 0.001, the number of rollouts to 8, and the sampling temperature to 1.0 during RL training. All 32B J1 models are trained using 128 H200 GPUs. The best J1 checkpoint is chosen based on a held-out validation set of 100 samples.

Inference Hyperparameters. In addition to using J1 models during policy training, we also perform offline evaluation of these models on in-distribution validation data. Inference is done using vLLM (Kwon et al., 2023) with a temperature of 0.6 and a top_p value of 0.95.

2.4.2 RLLM Policy Models

Training Recipe. In our primary RLLM experiments, we optimize a Qwen3-1.7B (Instruct) policy using the aforementioned J1-Qwen3-32B reward models.⁹ This choice of model sizes for the policy and the RM creates a substantial generator–verifier gap, which we find to be essential for achieving strong downstream policy improvements. Starting with the same Qwen3-1.7B policy, we experiment with four different RLLM training paradigms that vary in their training data or the LM-as-RM:

• Policy Trained on Hard-to-Verify Samples without Reference Answer.
This forms the main experimental setting of RLLM, where we assume access to a post-training dataset of only hard-to-verify math problems without any reference answers. As LMs continue to grow in capability, we will increasingly confront tasks for which no human-curated reference answers exist. To test such a post-training setting, we sample 3000 hard-to-verify problems from the OpenMathReasoning dataset (Moshkov et al., 2025a). Following Tao et al. (2025), we select a sample as hard-to-verify if the final answer cannot be deterministically validated using a rule-based symbolic verifier like math_verify (see example in Appendix Figure 29), and easy-to-verify otherwise (see example in Appendix Figure 30). Given the reference-free setting, our RLLM recipe uses J1-Qwen3-32B-RM as the reward model for optimization.

• Policy Trained on Hard-to-Verify Samples with Reference Answer. This resembles an RLLM setting where we train on the same dataset as above but, this time, assume access to reference answers. Since these reference answers are hard-to-verify, our RLLM recipe uses J1-Qwen3-32B-Verifier as the (reference-based) verifier for policy training.

• Policy Trained on Easy-to-Verify Samples with/without Reference Answer. Our motivation to perform RLLM on easy-to-verify samples is to show that even when reference answers exist and they are easy-to-verify, RLLM with a strong LM-as-RM can substitute for rule-based verifiers like math-verify. Note that an LM-as-RM has the advantage of additionally evaluating the entire CoT and identifying process errors, unlike math-verify, which only performs equivalence checks between the final answers. For the purpose of this study, we conduct RLLM experiments by sampling 3000 easy-to-verify training examples from the same OpenMathReasoning dataset (Moshkov et al., 2025a).
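The hard-vs-easy-to-verify split above can be illustrated with a toy filter. This is only a sketch of the idea: the actual recipe uses the math_verify symbolic checker, whereas the stand-in below simply treats a plain numeric final answer as deterministically checkable.

```python
def is_hard_to_verify(final_answer: str) -> bool:
    """Illustrative stand-in for the paper's filtering rule: a sample is
    easy-to-verify if its final answer can be validated deterministically
    (approximated here as "parses as a plain number"), and hard-to-verify
    otherwise (equations, sets, general mathematical objects). The real
    recipe uses the math_verify symbolic verifier instead of float()."""
    try:
        float(final_answer.strip().replace(",", ""))
        return False   # numeric value -> easy-to-verify
    except ValueError:
        return True    # structured mathematical object -> hard-to-verify
```

Under this proxy, an answer like `42` is easy-to-verify, while an equation like `y = x^2 + 1` is hard-to-verify.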
In subsequent analyses and ablations of RLLM, we also consider different combinations of policy and reward models, e.g., other sizes of Qwen3 models, OctoThinker (Wang et al., 2025a), and Llama models. For non-verifiable instruction-following tasks, we train RLLM models on 1K WildChat prompts (Zhao et al., 2024).

Training Hyperparameters. We implement all policy models on top of fairseq2 (Balioglu et al., 2023). Models are trained using 64 H200 GPUs, allocating 48 GPUs for the trainer and 16 GPUs for inference. Fairseq2's trainer is implemented as Single Program Multiple Data (SPMD) and all models (policy model, reference model, and J1 reward model) run as Ray actors. RLLM training uses a batch size of 32 with 8 rollouts per prompt, sampled with a temperature of 1.0. As in J1 training, we set the KL coefficient to 0.001 for policy training. RLLM models are trained for a maximum of 1000 steps, checkpointing every 50 steps. To mitigate the effect of mismatch in log probabilities between the trainer and the inference engine in GRPO training, we apply a truncated importance sampling correction with the maximum clip ratio set to 2.0. We also set the maximum input and generation length to 4096 tokens and train Qwen3-1.7B in non-thinking mode to prevent long thinking sequences from exceeding the context window, ensuring that the reward models can evaluate the complete answers.

Footnote 9: Qwen instruct models are already post-trained using RL, making them harder to improve on compared to their base counterparts. Despite that, we conduct all RLLM experiments on top of such strong instruct models and show further improvements.

Inference Hyperparameters. We evaluate RLLM-trained policy models using vLLM with a temperature of 0.6 and a top_p value of 0.95. The maximum decoding length is set to 40K tokens. Given the hybrid nature of Qwen3 models, we evaluate our post-trained models in both thinking and non-thinking modes.
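The truncated importance sampling correction mentioned in the training hyperparameters can be sketched per token as below. This is a minimal illustration under our own naming, not the fairseq2 implementation; only the clip ratio of 2.0 comes from the paper.

```python
import math

def tis_weight(logp_trainer: float, logp_inference: float,
               clip_ratio: float = 2.0) -> float:
    """Truncated importance sampling (sketch): weight each token's
    policy-gradient loss by the trainer/inference probability ratio,
    clipped from above so a log-prob mismatch between the two engines
    cannot blow up the update. The paper sets the max clip ratio to 2.0."""
    ratio = math.exp(logp_trainer - logp_inference)
    return min(ratio, clip_ratio)
```

When the two engines agree, the weight is 1 and the update is unchanged; a large mismatch is capped at the clip ratio rather than amplifying the gradient.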
2.4.3 Evaluation Benchmarks and Metrics

To evaluate the effectiveness of RLLM as a general post-training recipe, we conduct experiments on both verifiable reasoning tasks and non-verifiable tasks.

Verifiable Reasoning Benchmarks. Within verifiable reasoning, we experiment with both easy-to-verify benchmarks (e.g., questions with numerical answers) and hard-to-verify benchmarks (e.g., questions with mathematical objects, such as equations, as answers).

• Easy-to-verify Math Benchmarks. We evaluate on five popular competition math benchmarks from MathArena (Balunović et al., 2025): AIME24, AIME25, BRUMO25, HMMT24, and HMMT25. For each benchmark, we report mean@16 results.

• Hard-to-verify Physics Benchmark. We also evaluate on the Physics subset of PrincipiaBench. This serves as a test bed both for hard-to-verify reasoning problems and for our models' generalization to a different domain (given that RLLM's training data consists only of math prompts). We use o3 (OpenAI, 2025b) to evaluate the correctness of the final answers and report mean@8 scores.

Non-verifiable Tasks. Finally, we also evaluate RLLM on two non-verifiable instruction-following benchmarks: AlpacaEval 2.0 (Li et al., 2023) and ArenaHard 2.0 (Li et al., 2025c). Following past work (Lanchantin et al., 2025), we report win rates (with and without length control), using GPT-4o and GPT-4.1 as the evaluators for AlpacaEval and ArenaHard, respectively.

2.4.4 Baselines

We compare RLLM to two groups of baselines, RLHF and RLVR.

• RLHF. This represents RL post-training with scalar reward models. In particular, we consider two strong general-purpose RMs, Skywork-Reward-V2-Llama-3.1-8B (Liu et al., 2025b)10 and Athene-RM-8B (Frick et al., 2024). We also compare against a math-specific reward model, AceMath-7B-RM (Liu et al., 2024).

• RLVR. This refers to RL post-training with a verifier.
The verifier operates in the presence of a reference answer and can be either rule-based (e.g., math-verify) or model-based (e.g., general-verifier (Ma et al., 2025b)). Even though general-verifier is a model-based verifier, it only checks the equivalence between the final answers (without any CoT reasoning). We compare RLLM to both these RLVR methods.

Within the RLLM framework, we compare our J1-trained LMs-as-RMs to various prompted LLMs-as-RMs. These belong to different families and are of different sizes and capabilities, thus enabling us to analyze how scaling up the RM/Verifier affects policy training and downstream task improvements. In particular, we report RLLM results with Llama-3.1-8B-Instruct, Qwen3-1.7B, Qwen3-32B, and GPT-OSS-120B as prompted LMs-as-RMs.

2.5 Results

2.5.1 Main Results

RLLM (with prompted and trained LMs-as-RMs) outperforms RLHF (with scalar RMs) on both easy-to-verify and hard-to-verify reasoning tasks. First, in Table 6, we compare different post-trained Qwen3-1.7B models, optimized via either scalar RMs (RLHF) or LM-as-RM (RLLM). Within RLLM, we compare our trained J1-Qwen3-32B-RM to different prompted LMs-as-RMs of varying sizes.10 All models are trained on the same hard-to-verify math prompts in a reference-free setting.

Footnote 10: This is the best-performing publicly available reward model according to the RewardBench2 leaderboard: https://huggingface.co/spaces/allenai/reward-bench.

Table 6 Reference-free setting: Comparison of different post-trained Qwen3-1.7B (Instruct) models using RLLM or RLHF on easy-to-verify and hard-to-verify reasoning benchmarks. All models are trained on hard-to-verify samples in a reference-free setting. RLHF'ed models are optimized using SOTA scalar RMs. RLLM models are optimized using either a prompted LM-as-RM or our trained J1 LM-as-RM. We observe improved RLLM results by scaling up the LM-as-RM, with J1-Qwen3-32B-RM improving AIME24 by 12% on top of a Qwen3-1.7B (Instruct) model.

| Method | RM | RM Type | RM Size | AIME24 | AIME25 | BRUMO25 | HMMT24 | HMMT25 | Math Avg. | Physics |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B | – | – | – | 45.22 | 36.20 | 49.78 | 23.30 | 21.44 | 35.20 | 15.57 |

Policy Trained on hard-to-verify samples w/o Reference Answer:

| RLHF | Nexusflow/Athene-RM-8B | Scalar | 8B | 40.22 | 32.92 | 38.95 | 17.08 | 19.58 | 29.75 | 13.18 |
| RLHF | Skywork-Reward-V2-Llama-3.1-8B | Scalar | 8B | 48.33 | 36.26 | 47.91 | 22.92 | 24.36 | 35.96 | 16.93 |
| RLHF | nvidia/AceMath-7B-RM | Scalar | 7B | 48.76 | 38.34 | 47.69 | 23.12 | 22.71 | 36.12 | 16.24 |
| RLLM | Llama-3.1-8B-Instruct | Generative | 8B | 48.54 | 36.68 | 46.47 | 22.51 | 24.18 | 35.67 | 15.68 |
| RLLM | Qwen3-1.7B | Generative | 1.7B | 51.25 | 37.71 | 51.67 | 26.70 | 26.88 | 38.84 | 17.84 |
| RLLM | Qwen3-32B | Generative | 32B | 54.38 | 43.53 | 53.76 | 26.70 | 31.02 | 41.88 | 16.48 |
| RLLM | GPT-OSS-120B | Generative | 120B | 52.09 | 39.79 | 52.91 | 33.30 | 28.34 | 41.29 | 16.36 |
| RLLM | J1-Qwen3-32B-RM | Generative | 32B | 57.91 | 44.17 | 54.16 | 33.30 | 27.50 | 43.41 | 18.75 |
| ∆ w/ Qwen3-1.7B | | | | +12.59 | +7.97 | +4.38 | +10.00 | +6.06 | +8.21 | +3.18 |

Table 7 Reference-based setting: Comparison of different post-trained Qwen3-1.7B (Instruct) models using RLLM or RLVR on easy-to-verify and hard-to-verify reasoning benchmarks. All models are trained on hard-to-verify examples in a reference-based setting. RLVR models are optimized using either rule-based or model-based verifiers. RLLM models are optimized using either prompted or trained LMs-as-RMs (functioning as reference-based verifiers). All RLLM variants outperform all RLVR variants.

| Method | Verifier | Verifier Type | Verifier Size | AIME24 | AIME25 | BRUMO25 | HMMT24 | HMMT25 | Math Avg. | Physics |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B | – | – | – | 45.22 | 36.20 | 49.78 | 23.30 | 21.44 | 35.20 | 15.57 |

Policy Trained on hard-to-verify samples w/ Reference Answer:

| RLVR | Math-Verify | Rule-based | – | 48.96 | 34.79 | 47.51 | 21.24 | 25.83 | 35.67 | 15.95 |
| RLVR | TIGER-Lab/general-verifier | Generative | 1.5B | 50.42 | 41.66 | 49.38 | 23.55 | 23.33 | 37.67 | 17.39 |
| RLLM | Qwen3-1.7B | Generative | 1.7B | 52.50 | 41.46 | 49.59 | 33.30 | 26.66 | 40.70 | 16.93 |
| RLLM | Qwen3-32B | Generative | 32B | 57.29 | 43.33 | 52.93 | 23.30 | 28.97 | 41.16 | 18.32 |
| RLLM | J1-Qwen3-32B-Verifier | Generative | 32B | 55.83 | 46.05 | 53.32 | 30.00 | 31.03 | 43.24 | 19.36 |
| ∆ w/ Qwen3-1.7B | | | | +10.61 | +9.85 | +3.54 | +6.70 | +9.59 | +8.04 | +3.79 |

Our main conclusions are listed below:

• All RLLM models (rows annotated in green), except for the one trained with the weaker Llama-3.1-8B-Instruct model, outperform all RLHF models, showcasing the effectiveness of LMs-as-RMs over scalar RMs (36.12→43.41). Importantly, while scalar RMs like Skywork-Reward-V2 may be the best-performing RMs on offline benchmarks (e.g., RewardBench2), such performance does not translate to the best downstream performance when doing online RL. This echoes the findings of past work showing that the highest-scoring RM on static benchmarks does not ensure a good post-trained model (Malik et al., 2025).

• In a self-rewarding RLLM setting where both the policy and the RM are Qwen3-1.7B models (5th row), we observe improvements over the base policy (35.20→38.84). This result demonstrates that a thinking LM-as-RM can provide sufficiently high-quality rewards to drive self-improvement, at least to some extent.

• Furthermore, RLLM scales with the size and capability of the LM-as-RM, with our on-policy-trained J1-Qwen3-32B-RM obtaining the best results and even outperforming a larger GPT-OSS-120B as the RM.
In particular, our best RLLM model obtains an average of 8% absolute improvement on competition math over the Qwen3-1.7B Instruct model. This is noteworthy given that Qwen3 models are already heavily post-trained for these benchmarks.

• RLLM-trained models also generalize to the out-of-domain and hard-to-verify Physics benchmark, despite training only on math prompts.

In summary, our results show that RLLM with strong LLMs-as-RMs can outperform RLHF. Moreover, on-policy training of an LM-as-RM can lead to further improvements. Using an LM both as the policy and as an RM also provides a natural framework for building self-rewarding and self-improving LLMs (Yuan et al., 2025a).

Table 8 Comparison of RLLM, RLHF, and RLVR across different training datasets – easy-to-verify, hard-to-verify, reference-free, and reference-based. RLLM on hard-to-verify data with a strong LM-as-RM outperforms all models trained on easy-to-verify data.

| Method | RM/Verifier | RM Type | RM Size | AIME24 | AIME25 | BRUMO25 | HMMT24 | HMMT25 | Average |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B | – | – | – | 45.22 | 36.20 | 49.78 | 23.30 | 21.44 | 35.20 |

Policy Trained on easy-to-verify samples w/o Reference Answer:

| RLHF | Skywork-Reward-V2-Llama-3.1-8B | Scalar | 8B | 48.96 | 35.21 | 43.96 | 33.30 | 25.20 | 37.33 |
| RLLM | J1-Qwen3-32B-RM | Generative | 32B | 53.34 | 45.43 | 52.08 | 26.70 | 27.33 | 41.18 |

Policy Trained on easy-to-verify samples w/ Reference Answer:

| RLVR | Math-Verify | Rule-based | – | 53.75 | 40.84 | 50.42 | 26.70 | 26.44 | 39.63 |
| RLLM | J1-Qwen3-32B-Verifier | Generative | 32B | 54.99 | 44.36 | 52.29 | 25.83 | 26.88 | 40.87 |

Policy Trained on hard-to-verify samples w/o Reference Answer:

| RLHF | Skywork-Reward-V2-Llama-3.1-8B | Scalar | 8B | 48.33 | 36.26 | 47.91 | 22.92 | 24.36 | 35.96 |
| RLLM | J1-Qwen3-32B-RM | Generative | 32B | 57.91 | 44.17 | 54.16 | 33.30 | 27.50 | 43.41 |

Policy Trained on hard-to-verify samples w/ Reference Answer:

| RLVR | Math-Verify | Rule-based | – | 48.96 | 34.79 | 47.51 | 21.24 | 25.83 | 35.67 |
| RLLM | J1-Qwen3-32B-Verifier | Generative | 32B | 55.83 | 46.05 | 53.32 | 30.00 | 31.03 | 43.24 |

RLLM also outperforms RLVR (with rule-based and model-based verifiers). Table 7 compares our RLLM-trained models with RLVR-trained models.
In this setting, we train on the same dataset as above but assume access to reference answers. We list our main takeaways below:

• When reference answers do exist, under the same RLLM framework we can also employ strong thinking LLMs as reference-based verifiers. RLLM with such thinking verifiers can thus outperform RLVR with rule-based verifiers like math-verify (35.67→43.24), which is known to fail in cases where answers are complex mathematical objects (Tao et al., 2025).

• Similar to the reference-free setting described above, training an on-policy J1-Qwen3-32B-Verifier model improves math results by 2% over prompted Qwen3-32B (41.16→43.24), thus highlighting the effectiveness of RL-trained LMs-as-RMs over prompted ones. Our RLLM model also generalizes better to the hard-to-verify physics benchmark.

RLLM on hard-to-verify training data outperforms RLVR on easy-to-verify training data. So far, we have shown that when the training data is hard-to-verify, RLLM can outperform RLHF and RLVR. A natural question then arises: if we already have access to easy-to-verify data (in which case a rule-based verifier like math-verify might suffice), do we even need to train on hard-to-verify data? We answer this in Table 8 by training RLLM on all different subsets of training data – easy-to-verify, hard-to-verify, reference-free, and reference-based. Our conclusions are as follows:

• Reference-free RLLM training on hard-to-verify data outperforms reference-based RLVR training on easy-to-verify data (39.63→43.41). This suggests that one way to scale RL for post-training (beyond easily verifiable domains) is to train strong LMs-as-RMs.

• RLVR training on easy-to-verify data works better than on hard-to-verify data because of more accurate rewards (35.67→39.63). However, even when training on easy-to-verify data, RLLM can match or outperform RLHF and RLVR.
• RLLM training on hard-to-verify data leads to better downstream performance than training on easy-to-verify data (41.18→43.41).

Overall, this helps establish RLLM as a unified and superior recipe across all post-training regimes. Our results also indicate that the following RLLM setting leads to the best downstream performance: (1) hard training prompts with (2) the strongest on-policy-trained LM-as-RM, (3) capable of rewarding rollouts in either a reference-free or a reference-based setting.

Table 9 Comparison of Win Rate (WR) and Length-Controlled Win Rate (LCWR) of RLLM and RLHF on non-verifiable instruction-following tasks when training a Qwen3-1.7B policy (either in thinking or non-thinking mode). For AlpacaEval 2.0, we use GPT-4o as the evaluator, and for ArenaHard 2.0, we use GPT-4.1 as the evaluator. RLLM matches or outperforms RLHF, obtaining the best win rates on the hard prompts of ArenaHard 2.0.

| Method | RM | RM Type | RM Size | AlpacaEval 2.0 (LCWR / WR) | ArenaHard 2.0 (Hard Prompts) | ArenaHard 2.0 (Creative Writing) |
|---|---|---|---|---|---|---|

Policy Trained on non-verifiable WildChat samples in thinking mode:

| Qwen3-1.7B | – | – | – | 37.5 / 39.2 | 6.4 (-0.6 / +0.7) | 6.9 (-1.0 / +1.1) |
| RLHF | Nexusflow/Athene-RM-8B | Scalar | 8B | 38.0 / 42.6 | 4.3 (-0.6 / +0.7) | 15.8 (-2.0 / +1.9) |
| RLHF | Skywork-Reward-V2-Llama-3.1-8B | Scalar | 8B | 43.0 / 47.2 | 6.0 (-0.6 / +0.6) | 8.4 (-0.9 / +1.0) |
| RLLM | Qwen3-1.7B | Generative | 1.7B | 43.9 / 49.2 | 8.5 (-0.8 / +0.8) | 14.7 (-1.4 / +1.8) |

Policy Trained on non-verifiable WildChat samples in non-thinking mode:

| Qwen3-1.7B | – | – | – | 27.8 / 30.0 | 3.1 (-0.6 / +0.4) | 3.1 (-0.9 / +0.7) |
| RLHF | Nexusflow/Athene-RM-8B | Scalar | 8B | 37.7 / 38.1 | 3.9 (-0.6 / +0.7) | 8.0 (-1.4 / +1.3) |
| RLHF | Skywork-Reward-V2-Llama-3.1-8B | Scalar | 8B | 34.9 / 40.0 | 3.6 (-0.8 / +0.6) | 7.8 (-1.2 / +1.2) |
| RLLM | Qwen3-1.7B | Generative | 1.7B | 41.4 / 43.7 | 6.0 (-0.8 / +0.7) | 9.1 (-1.2 / +1.2) |

Given the hybrid nature of Qwen3 models, we also evaluate our post-trained models in non-thinking mode. Results in Appendix Table 18 demonstrate that RLLM outperforms RLHF and RLVR by large margins.
We draw similar conclusions in Appendix Table 19 when experimenting with a Llama-based OctoThinker-8B-Hybrid-Base model and in Appendix Table 17 with a larger Qwen3-8B seed model.

Beyond verifiable domains, RLLM is also performant on non-verifiable instruction-following tasks. Next, in Table 9, we evaluate the effectiveness of RLLM for non-verifiable tasks. Unlike RLLM training for verifiable tasks, which used a pointwise LM-as-RM, here we use a pairwise LM-as-RM. This is because non-verifiable tasks do not have a strict notion of correctness, and past work has shown that such responses are easier to evaluate in a comparative setting. In particular, we consider a self-rewarding RLLM setting with Qwen3-1.7B as both the policy and the LM-as-RM, construct pairs of rollouts, and prompt the Qwen3-1.7B LM-as-RM to assign scores between 0 and 10 to each of the two rollouts. These scores are then averaged across all pairs to get pointwise reward estimates. Later in the ablations, we also compare pointwise, pairwise, and k-wise reward assignment for these tasks. We observe that RLLM can match or outperform RLHF, while obtaining better win rates for harder prompts. Appendix Table 20 reports similar conclusions when training a Qwen3-8B policy model. This result again reinforces the effectiveness of RLLM as a general framework for post-training on all kinds of tasks. As with verifiable tasks, we expect further improvements on non-verifiable tasks from on-policy training of the LM-as-RM, which we leave for future work.

2.5.2 Analyses and Ablations

Generator-Verifier Gap. In this section, we investigate the impact of the generator-verifier gap on RLLM training, specifically examining how the capability gap between the policy LM and the LM-as-RM influences downstream policy improvements. Recall that for our main experiments, we trained a Qwen3-1.7B policy with a J1-Qwen3-32B-RM, where the RM was trained on-policy (by sampling responses from the Qwen3-1.7B policy).
Now we ask: if we train a weaker 1.7B LM-as-RM on its own responses, i.e., J1-Qwen3-1.7B-RM, can that also lead to downstream improvements? As shown in Table 10, we do not observe further improvements on top of the prompted Qwen3-1.7B-as-RM with J1 training. This result is further evidenced by Figure 15, where we compare the raw accuracy of different LMs-as-RMs on an in-distribution validation set. We observe that J1 training of a Qwen3-32B model leads to a 10% improvement in judgment accuracy (averaged across 8 seeds) while providing almost no improvement on top of Qwen3-1.7B. In summary, training a Qwen3-1.7B model to evaluate its own responses leads to limited success, and consequently, the resultant RM also does not lead to any downstream policy improvements. This underscores the importance of the capability gap between the generator and the verifier for obtaining downstream improvements. In Appendix Figure 31, we show examples of correct and incorrect thinking traces generated by J1-Qwen3-1.7B-RM and J1-Qwen3-32B-RM, respectively.

Table 10 Analysis of Generator-Verifier Gap. RLLM post-training of a Qwen3-1.7B policy with a J1-Qwen3-1.7B LM-as-RM does not improve performance over the prompted LM-as-RM baseline, while post-training with a stronger J1-Qwen3-32B LM-as-RM improves over the corresponding prompted baseline.

| Method | RM/Verifier | RM Type | RM Size | AIME24 | AIME25 | BRUMO25 | HMMT24 | HMMT25 | Math Avg. | Physics |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B | – | – | – | 45.22 | 36.20 | 49.78 | 23.30 | 21.44 | 35.20 | 15.57 |

Policy Trained on hard-to-verify samples w/o Reference Answer:

| RLLM | Qwen3-1.7B | Generative | 1.7B | 51.25 | 37.71 | 51.67 | 26.70 | 26.88 | 38.84 | 17.84 |
| RLLM | J1-Qwen3-1.7B | Generative | 1.7B | 52.29 | 39.59 | 45.63 | 30.00 | 21.46 | 37.79 | 15.11 |
| RLLM | Qwen3-32B | Generative | 32B | 54.38 | 43.53 | 53.76 | 26.70 | 31.02 | 41.88 | 16.48 |
| RLLM | J1-Qwen3-32B-RM | Generative | 32B | 57.91 | 44.17 | 54.16 | 33.30 | 27.50 | 43.41 | 18.75 |

Off-policy vs. On-policy Trained LM-as-RM. In Table 11, we compare an on-policy trained LM-as-RM with two off-policy trained RMs. All three RMs are trained on top of the same Qwen3-32B model using the same J1 recipe, differing only in their training data: the off-policy RMs are trained on responses generated either by a weaker Llama model or by a stronger Qwen3-8B model. Although Figure 16 shows that J1 training improves judgment accuracy for all these models on their respective in-distribution validation sets, the off-policy trained LMs-as-RMs do not transfer to downstream policy improvements.

Table 11 Comparison of RLLM post-training of Qwen3-1.7B with on-policy versus off-policy J1-trained LMs-as-RMs. On-policy J1-Qwen3-32B-RM is trained on Qwen3-1.7B responses, while off-policy models are trained on either weaker Llama responses or stronger Qwen3-8B responses. The on-policy trained LM-as-RM outperforms the off-policy trained ones.

| Method | RM/Verifier | RM Training Data | AIME24 | AIME25 | BRUMO25 | HMMT24 | HMMT25 | Average |
|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B | – | – | 45.22 | 36.20 | 49.78 | 23.30 | 21.44 | 35.20 |

Policy Trained on hard-to-verify samples w/o Reference Answer:

| RLLM | Qwen3-32B (prompted) | – | 54.38 | 43.53 | 53.76 | 26.70 | 31.02 | 41.88 |
| RLLM | J1-Qwen3-32B-RM (off-policy-trained) | Llama-3.1-8B-Instruct | 57.09 | 44.37 | 48.96 | 26.87 | 28.12 | 41.08 |
| RLLM | J1-Qwen3-32B-RM (off-policy-trained) | Qwen3-8B | 54.99 | 45.82 | 50.00 | 23.55 | 27.71 | 40.41 |
| RLLM | J1-Qwen3-32B-RM (on-policy-trained) | Qwen3-1.7B | 57.91 | 44.17 | 54.16 | 33.30 | 27.50 | 43.41 |
This again shows that RM capability improvements measured on static, offline benchmarks (with different data distributions) may not always be indicative of downstream task improvements, owing to a lack of OOD generalization.

Reference-free vs. Reference-based LM-as-RM. LMs-as-RMs have the flexibility of being trained and used in both a reference-free and a reference-based manner. In Figure 17, we compare LMs-as-RMs with and without reference answers, indicated by the suffixes '-Verifier' and '-RM', respectively. Unsurprisingly, the presence of reference answers allows for more accurate judgments, with Qwen3-32B-Verifier outperforming Qwen3-32B-RM by 16%. The performance improves further to 92% after J1 training, showing that our recipe for training LMs-as-RMs is performant in both reference-free and reference-based settings.

Pointwise, Pairwise, vs. Listwise LM-as-RM Rewards. Recall that for non-verifiable tasks, we employed a pairwise LM-as-RM, primarily because non-verifiable tasks benefit from relative judgments. In Table 12, we compare the effect of scaling up reward modeling compute by conducting either pointwise, pairwise, or listwise scoring from the LM-as-RM. Since the complexity of pairwise scoring is quadratic in the number of rollouts, we also explore a second pairwise setting where one of the rollouts is chosen at random as a pivot (or reference) rollout to compare against. We observe that on the hard prompts, win rates improve with more judgments, while for the other categories, results mostly saturate at pairwise comparisons. Overall, this highlights the flexibility of an LM-as-RM's rewarding mechanism, allowing increased compute to be spent on evaluation.

2.6 Conclusion

We showed that RLLM – RL with (RL-trained) language models as reward models – can serve as a single, unified post-training recipe across easy-to-verify, hard-to-verify, and non-verifiable tasks.
Through extensive experiments, we demonstrated that RLLM outperforms both RLHF (with scalar RMs) and RLVR (with rule-based rewards), showcasing particularly large gains when training on hard-to-verify tasks.

Footnote 11: The reward curves also reflect the respective hardness of the RM data, wherein responses from the weaker Llama model are easiest to judge, followed by Qwen3-1.7B and Qwen3-8B.

Figure 15 Analysis of Generator-Verifier Gap. (a) Comparison of different LMs-as-RMs in a reference-free setting on a held-out validation set (of correct/incorrect responses). J1 training on top of a weaker Qwen3-1.7B does not lead to further improvements (56.62% → 56.88%), while the same on top of a stronger Qwen3-32B leads to a 10% absolute improvement (70.25% → 81.12%). Results are averaged across 8 seeds. (b) Corresponding validation reward curves for J1 training across RL steps.

Table 12 Effect of scaling up reward modeling compute in RLLM via pointwise, pairwise, pairwise-with-a-pivot-rollout, and triplet-based scoring between rollouts.

| Method | RM | RM Type | #RM Judgments | AlpacaEval 2.0 (LCWR / WR) | ArenaHard 2.0 (Hard Prompts) | ArenaHard 2.0 (Creative Writing) |
|---|---|---|---|---|---|---|
| Qwen3-1.7B | – | – | – | 37.5 / 39.2 | 6.4 (-0.6 / +0.7) | 6.9 (-1.0 / +1.1) |

Policy Trained on WildChat non-verifiable samples in thinking mode:

| RLLM | Qwen3-1.7B (prompted) | Pointwise | n | 40.1 / 46.2 | 5.3 (-0.7 / +0.8) | 12.4 (-1.6 / +1.4) |
| RLLM | Qwen3-1.7B (prompted) | Pairwise with pivot | 2n | 41.0 / 42.8 | 6.3 (-0.8 / +0.8) | 13.9 (-1.3 / +1.2) |
| RLLM | Qwen3-1.7B (prompted) | Pairwise | C(n,2) | 43.9 / 49.2 | 8.5 (-0.8 / +0.8) | 14.7 (-1.4 / +1.8) |
| RLLM | Qwen3-1.7B (prompted) | Triplet | C(n,3) | 42.0 / 48.1 | 10.1 (-0.9 / +1.0) | 13.1 (-1.4 / +1.8) |
We also studied the importance of on-policy training of LM-as-RM models alongside the impact of the generator-verifier gap, and showed that these are important components for successful RLLM training.

Figure 16 Comparison of validation reward curves for J1 training on different distributions of data. Data is sampled either from a Llama3 model (off-policy, weakest), a Qwen3-8B model (off-policy, strongest), or a Qwen3-1.7B model (on-policy). While all LMs-as-RMs improve on their respective in-distribution validation data, off-policy trained RMs do not lead to downstream task improvements, potentially because of a lack of OOD generalization (Table 11).

Figure 17 (a) Validation reward curves for a reference-free J1-Qwen3-32B-RM and a reference-based J1-Qwen3-32B-Verifier. (b) Comparison of different LMs-as-RMs evaluated in either a reference-free setting (suffixed with '-RM') or a reference-based setting (suffixed with '-Verifier'). Reference-based LMs-as-RMs are more accurate than reference-free ones (e.g., J1-Qwen3-32B-Verifier is 10% better than J1-Qwen3-32B-RM; 92.38% vs. 81.12%).
3 Learning to Aggregate through Online Reinforcement Learning

Figure 18 ParaGator Training. Illustration of our parallel thinking scaffolding and our method. We use pass@k optimization (Chen et al., 2025b) for optimizing the initial round of responses and pass@1 optimization (standard RLVR) for optimizing the aggregation rollouts, and train end-to-end.

Figure 19 ParaGator Inference. During each round, we sample rollouts from the past aggregation round, pack them into the aggregation prompt, and perform inference to obtain the next pool of rollouts.
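The round-based inference procedure of Figure 19 can be sketched as a simple loop. This is an illustrative skeleton under our own naming: `sample(prompt, n)` is a hypothetical stand-in for drawing n rollouts from the LM, and the template paraphrases the aggregation prompt.

```python
AGG_TEMPLATE = (
    "Given the following problem:\n{problem}\n"
    "and these solution attempts:\n{candidates}\n"
    "It is possible that any, all, or none of these solutions are correct "
    "or complete. Please provide a final, comprehensive solution."
)

def paragator_inference(problem, sample, m=3, rounds=2):
    """Sketch of ParaGator inference: start from m parallel rollouts, then in
    each round pack the previous pool into the aggregation prompt and sample
    a fresh pool of m aggregated rollouts."""
    pool = sample(problem, m)                       # initial parallel rollouts
    for _ in range(rounds):
        cands = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(pool))
        prompt = AGG_TEMPLATE.format(problem=problem, candidates=cands)
        pool = sample(prompt, m)                    # next pool of aggregated rollouts
    return pool
```

At deployment, `sample` would call the trained model; any rollout from the final pool (or a majority vote over it) can serve as the answer.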
3.1 Overview

Scaling up test-time compute has emerged as a powerful tool for improving language models (LMs) on challenging reasoning tasks (OpenAI et al., 2024; Snell et al., 2024; Wu et al., 2025). A widely adopted pattern is to generate multiple solution attempts in parallel and then either select or aggregate among them (Wang et al., 2022; Chen et al., 2023; Brown et al., 2024; Pan et al., 2025; Meituan LongCat Team et al., 2026). This paradigm underlies many of the strongest recent systems in complex reasoning (Tongyi DeepResearch Team et al., 2025; Li et al., 2025a; Hu et al., 2025c). In particular, several works train an aggregator model to learn how to properly synthesize a final answer from a large pool of parallel-generated candidates (Zhao et al., 2025; Venkatraman et al., 2025).

However, current parallel thinking strategies suffer from two main limitations. First, most methods treat the solution generator as fixed and only optimize the aggregator (Zhao et al., 2025; Venkatraman et al., 2025). This creates a fundamental mismatch: the aggregator is trained on off-policy generations that may differ substantially from what it encounters at inference time, leading to suboptimal performance. Second, even when the generator is trained on-policy, standard outcome-based RL encourages the model to converge toward a single high-probability solution mode (Yue et al., 2025; Li et al., 2025b). This diversity collapse leaves the aggregator with redundant candidates that provide little complementary information, thus undermining the core premise of parallel thinking, which relies on diverse solutions to enable effective aggregation.

In this work, we address both limitations through a unified online training framework. Our key insight is that the generator and the aggregator require different optimization objectives: the generator should produce diverse, complementary solutions, while the aggregator should synthesize them into a single best answer.
To this end, we apply pass@k optimization (Chen et al., 2025b) to the parallel generation stage, explicitly encouraging diversity by rewarding the model when any of its k samples succeeds. For the aggregation stage, we use pass@1 optimization, training the model to reliably produce the correct final answer. Crucially, both stages are trained online and end-to-end, ensuring the generator learns to produce candidates that are not only diverse but also well-suited for downstream aggregation. Our contributions are threefold:

• We show that the self-aggregation method is upper-bounded by the pass@k performance and that the model is unable to synthesize correct solutions based on incorrect ones.

• We develop a framework that enables online training of self-aggregation methods, where a single model is optimized for both pass@k generation and pass@1 aggregation end-to-end.

• We validate the effectiveness of our method, which generalizes across competition math and scientific reasoning setups, including our newly introduced PrincipiaBench.

3.2 Preliminaries

Our goal is to train a language model that takes a problem as input, generates a set of candidate solutions in parallel, and then aggregates them over an arbitrary number of steps to produce a final answer. Formally, let $x$ be a given problem and $y^\star$ its ground-truth solution. We consider two steps for the model to perform: (i) candidate sampling: the model reads the problem and generates a set of $m$ candidate solutions $y_{1:m}$; and (ii) self-aggregation: the model reads both the problem and the set of candidate solutions and generates a final solution $\tilde{y}$. The performance of the model can be evaluated at either an initial solution $y_i$ or the final solution $\tilde{y}$.

3.3 Method: ParaGator

Our method jointly trains a single language model, $M_\theta$, to (i) generate diverse candidate solutions and (ii) aggregate these solutions into a final answer. Both stages are optimized end-to-end using online RL.
Specifically, we train ParaGator using reinforcement learning with verifiable rewards (RLVR), with rewards computed at two levels: on the initial set of candidate solutions and on the final aggregated answer. The candidate set is rewarded with pass@k, incentivizing the model to produce plausible but diverse solutions. The final answer is rewarded with pass@1, incentivizing the model to reliably produce a single correct solution. Both candidate generation and aggregation rollouts are sampled online from the current policy during training. 3.3.1 Candidate Solution Generation and Pool Selection For each problem x, we first generate a pool of m candidate solutions by sampling from the model m times: y_i ∼ M_θ(y | p_C, x), i = 1, ..., m. At this stage, we evaluate each candidate with a verifiable reward function r(y_i, y⋆) ∈ {0, 1}, which indicates whether the candidate answer matches the ground-truth solution y⋆. Pass@k Candidate Optimization Pass@k is defined as the maximum reward given k random samples y_1, ..., y_k ∼ π_θ(· | x) from the policy: pass@k := max[r(y_1), r(y_2), ..., r(y_k)]. We use the pass@k optimization method described in Chen et al. (2025b), where the advantages of a correct response and an incorrect response are given by:

$$A_{\text{correct}} = \frac{1-\mu(x)}{\sigma(x)}, \qquad A_{\text{incorrect}} = \frac{1-\mu(x) - \binom{N_{\text{incorrect}}-1}{k-1} \big/ \binom{N-1}{k-1}}{\sigma(x)},$$

where N is the group size, N_incorrect is the number of incorrect rollouts in the group, and μ(x) and σ(x) are the mean and standard deviation of the rewards for the group whose prompt is x. Compared to standard GRPO (Shao et al., 2024), only the advantage of incorrect examples is modified, by an offset of $\binom{N_{\text{incorrect}}-1}{k-1} / \binom{N-1}{k-1}$. In our work, we make a further modification, as in Dr.GRPO (Liu et al., 2025d), by removing the division by σ(x).
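As a concrete illustration, the modified advantages above can be computed directly from a group of binary rewards. The following is a minimal sketch (not the authors' code; the function name and interface are our own), using the Dr.GRPO variant that drops the σ(x) divisor:

```python
from math import comb

def passk_advantages(rewards, k):
    """Group-relative advantages for pass@k optimization (Dr.GRPO-style,
    i.e. without the standard-deviation divisor). `rewards` are the binary
    verifiable rewards r(y_i, y*) for one group of N rollouts."""
    n = len(rewards)
    n_inc = sum(1 for r in rewards if r == 0)
    mu = sum(rewards) / n
    # Offset applied only to incorrect rollouts; it equals 1 when k = 1,
    # recovering the plain GRPO advantage r - mu.
    offset = comb(n_inc - 1, k - 1) / comb(n - 1, k - 1) if n_inc > 0 else 0.0
    return [(1.0 - mu) if r == 1 else (1.0 - mu - offset) for r in rewards]
```

With k = 1 the offset is exactly 1, so the incorrect-rollout advantage reduces to −μ(x) as in standard GRPO; for k > 1 the penalty on incorrect rollouts shrinks, which is what preserves exploration across the group.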
Intuitively, the model is rewarded when at least one of its m attempts solves the problem, which encourages spreading probability mass across complementary solution modes rather than collapsing onto a single trajectory. We do not perform any additional heuristic filtering or reranking: all generated candidates are visible to the aggregator and contribute to the candidate-level RL signal. 3.3.2 Self-Aggregation Given the problem x and the candidate set y_{1:m}, we prompt the same model to act as an aggregator. The input is the problem concatenated with the candidates in a fixed, structured format, for example:

Aggregation Prompt (p_A)
Given the following problem:
PROBLEM
and these solution attempts:
CANDIDATE_1
CANDIDATE_2
...
CANDIDATE_N
=========================
It is possible that any, all, or none of these solutions are correct or complete. Carefully review the provided solutions, using them as starting points — correcting mistakes, filling in gaps, and/or combining useful ideas — to produce a final, comprehensive, and correct solution to the problem.

Conditioned on this input, the model generates an aggregated solution ỹ ∼ M_θ(y | p_A, x, y_{1:m}). We then compute an aggregation-level verifiable reward R_agg(x, y_{1:m}, ỹ) = r(ỹ, y⋆) ∈ {0, 1}. Crucially, the aggregator is always trained on-policy: during training, it sees candidate pools sampled from the current generator M_θ, rather than from a frozen or separately trained model. This alignment between training and inference eliminates the off-policy mismatch common in prior self-aggregation methods and ensures that the generator learns to produce candidates that are well-suited for downstream aggregation. Pass@1 Aggregation Optimization The aggregated solutions are optimized for pass@1 performance: the aggregator receives a reward of 1 if and only if its final answer is correct.
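For concreteness, assembling the aggregation input p_A from a candidate pool might look as follows (an illustrative sketch paraphrasing the prompt above; the exact delimiter tokens and function name are our own assumptions):

```python
def build_aggregation_prompt(problem: str, candidates: list[str]) -> str:
    """Concatenate the problem and candidate solutions into the fixed,
    structured aggregation format described in the text."""
    attempts = "\n\n".join(
        f"CANDIDATE_{i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    return (
        f"Given the following problem:\n{problem}\n\n"
        f"and these solution attempts:\n\n{attempts}\n"
        "=========================\n"
        "It is possible that any, all, or none of these solutions are "
        "correct or complete. Carefully review the provided solutions, "
        "using them as starting points, correcting mistakes, filling in "
        "gaps, and/or combining useful ideas, to produce a final, "
        "comprehensive, and correct solution to the problem."
    )
```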
Unlike the candidate stage, only the single aggregated trajectory is rewarded, pushing the model to reliably synthesize the best answer from the available candidates. 3.3.3 Training Details We optimize both stages simultaneously with RLVR. For each training example, we generate a candidate pool y_{1:m} from M_θ(y | x) and an aggregation trajectory ỹ from M_θ(y | x, y_{1:m}). We then update θ using GRPO (Shao et al., 2024) with a KL-regularized RLVR loss. Because both terms are computed from online rollouts of the same policy M_θ, the generator learns to produce candidate sets that are not only individually plausible and diverse, but also useful for the aggregator it will later become. 3.3.4 Inference Details During training, we optimize a single round of aggregation over one candidate pool. At inference time, however, we naturally generalize this to multiple iterations of aggregation, enabling sequential scaling in addition to the learned parallel sampling. Concretely, given problem x, we first sample an initial pool of m candidates y^{(0)}_{1:m} ∼ M_θ(y | x) and sample m aggregated solutions ỹ^{(0)}_{1:m} ∼ M_θ(y | x, y^{(0)}_{1:m}). We then form an updated candidate pool y^{(1)}_{1:m} = ỹ^{(0)}_{1:m} from these aggregation rollouts. The model is then prompted again in aggregation mode on (x, y^{(1)}_{1:m}) to produce a refined set of solutions ỹ^{(1)}_{1:m}, and so on. This continues for T iterations before returning the final aggregated answer. This iterative procedure preserves the same generator–aggregator interface used during training, while allowing the model at test time to repeatedly refine its reasoning over an evolving pool of candidate solutions under a fixed compute budget. 3.4 Motivating Experiments: Inference-Only Results In this section, we first show that basic aggregation of parallel generations yields improvements on frontier open-sourced models, such as Kimi-K2-Thinking (Team et al., 2025) (§3.4.1).
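The multi-round inference procedure of §3.3.4 can be sketched as a simple loop. Here `sample(prompt, n)` is a hypothetical interface returning n completions from M_θ, and `make_agg_prompt` stands in for the aggregation template of §3.3.2; both names are our own:

```python
def iterative_aggregation(sample, make_agg_prompt, problem, m=4, T=3):
    """Sample an initial pool of m candidates, then run T aggregation
    rounds in which the m aggregated solutions become the next pool."""
    pool = sample(problem, m)                    # y^(0)_{1:m}
    for _ in range(T):
        prompt = make_agg_prompt(problem, pool)  # (x, y^(t)_{1:m})
        pool = sample(prompt, m)                 # aggregated pool y^(t+1)_{1:m}
    return pool[0]  # one final aggregated answer (or majority-vote the pool)
```

Note that each round reuses the same generator-aggregator interface seen during training; only the candidate pool evolves.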
This suggests that employing and improving aggregation procedures will likely continue to be useful as models scale, and that the results of our training experiments should generalize beyond the smaller models we employ. We then further show that self-aggregation requires diversity among the responses packed into the aggregation prompt in order to perform better (§3.4.2), motivating our training approach. 3.4.1 Self-aggregation improves frontier models In Figure 20, we plot the performance of self-aggregation on Kimi-K2-Thinking (Team et al., 2025), Qwen3-4B-Thinking-2507, and Qwen3-4B-Instruct-2507 (Yang et al., 2025b), three strong open-sourced models, on four competition math benchmarks: AIME 2025 (Art of Problem Solving, 2025), Brumo 2025 (Balunović et al., 2025), HMMT 2025 (Balunović et al., 2025), and IMO-Answerbench (Luong et al., 2025b).
[Figure 20: Bar charts of Initial Pass@1, Majority Voting@4, and Prompted Aggregation@4 on the four benchmarks for each of the three models.]
Figure 20 Parallel generation + aggregation (orange) brings gains across 4 competition math benchmarks (AIME, Brumo, HMMT and IMO-Answerbench) on top of 3 strong models: Kimi-K2-Thinking, Qwen3-4B-Thinking-2507, and Qwen3-4B-Instruct-2507, compared to standard generation (blue) and majority voting (green).
3.4.2 The role of candidate diversity (pass@k) in self-aggregation In Figure 21, we plot the performance of multiple rounds of aggregation using the scaffold detailed in Figure 19. We show pass@1, pass@4, and majority voting@4 curves.
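The pass@k curves in this section are typically computed with the standard unbiased combinatorial estimator over n samples of which c are correct; we assume that convention here (a sketch, not necessarily the authors' exact evaluation code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k), i.e. the
    probability that a uniformly random size-k subset of the n samples
    contains at least one correct solution."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 4 samples of which one is correct, pass@4 = 1.0 while pass@1 = 0.25, which is exactly the gap between the green and blue curves that aggregation tries to close.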
We see that the pass@1 performance (blue) never exceeds the initial pass@4 (green), showing that the asymptotic performance is bounded by pass@k at the initial round, motivating our pass@k optimization method.
[Figure 21: Performance vs. aggregation round (pass@4, pass@1, maj@4), averaged over HMMT, AIME, and Brumo. (a) Qwen3-4B-Thinking-2507; (b) Qwen3-4B-Instruct-2507.]
Figure 21 Performance of repeated aggregation is upper bounded by the initial pass@k (green) for both Qwen3-4B-Thinking-2507 (left) and Qwen3-4B-Instruct-2507 (right). The asymptotic performance is upper-bounded by the pass@k at the initial round.

Initial Temperature | Initial Pass@1 | Initial Pass@k | Aggregation Pass@1
T = 0.6             | 65.8           | 81.01          | 74.11
T = 0.8             | 65.8           | 81.3           | 74.9
T = 1.0             | 65.7           | 81.7           | 75.2

Table 13 Model = Qwen3-4B-Thinking-2507. Effect of initial sampling temperature on decoding performance, averaged over HMMT, Brumo, and AIME. Increasing the initial temperature leaves pass@1 nearly unchanged while improving pass@k, resulting in higher aggregation performance.

To further illustrate the importance of pass@k, we vary the initial sampling temperature (0.6, 0.8, 1.0) while keeping the aggregation sampling temperature fixed (1.0). We report the scores in Table 13. Pass@1 performance is similar at the initial round, but a higher initial pass@k results in a higher aggregation pass@1. 3.5 Main Experiments We validate training ParaGator in two regimes: competition math (§3.5.1) and scientific reasoning (§3.5.2). 3.5.1 Competition Math Setup We train on a subset of the DeepScaleR dataset (Luo et al., 2025), consisting of 10k prompts. Following previous work (Hu et al., 2025b), we train on a base model that has not been post-trained: Qwen3-4B-Base (Yang et al., 2025a).
We report scores on the following 4 benchmarks: HMMT 2025 (Balunović et al., 2025), Brumo 2025 (Balunović et al., 2025), AIME 2025 (Art of Problem Solving, 2025), and OlympiadBench (He et al., 2024). We generate 32 solutions with the following sampling parameters: Temp = 1.0, top-p = 1, top-k = −1, and report pass@1 for up to 3 aggregation rounds. We compare ParaGator against prompting the base model, Dr.GRPO (Liu et al., 2025d) (GRPO with token-level loss aggregation and the standard-deviation divisor removed), and the offline aggregation training approach proposed independently by AggLM (Zhao et al., 2025) and RSA (Venkatraman et al., 2025). Table 14 summarizes which components of the pipeline are trained by each method.
[Diagram: parallel rollouts feed aggregation prompts across successive rounds, with pass@k optimization applied to the initial rollouts and pass@1 optimization applied to the aggregation rollouts.]

Training Method                                                   | Initial Round        | Aggregation Round
No Aggregation (Only Dr.GRPO on initial candidates)               | pass@1 optimization  | No updates
Offline Aggregation (Zhao et al., 2025; Venkatraman et al., 2025) | No updates           | pass@1 optimization (data prepared offline)
Online Aggregation                                                | No updates           | pass@1 optimization
Online Multitask Aggregation                                      | pass@1 optimization  | pass@1 optimization
ParaGator (Ours)                                                  | pass@k optimization  | pass@1 optimization

Table 14 Comparison of training strategies across the initial and aggregation rounds. Columns show whether model parameters are updated via pass@1 or pass@k optimization, or kept fixed.

Results We plot the reward curves for both the initial round and the aggregation round in Figure 22, and report the detailed scores in Table 15. The curves show a clear trade-off in prior baselines: Dr.GRPO attains reasonable initial-round reward but lags in aggregation-round reward, while offline aggregation training exhibits the opposite pattern, improving aggregation performance at the expense of the initial round. Online multitask training partially mitigates this mismatch by optimizing both rounds jointly, but still underperforms our method. In contrast, ParaGator consistently achieves the highest reward in both rounds, which translates into the strongest overall pass rates across aggregation steps. As shown in Table 15, ParaGator delivers the best average performance after aggregation and achieves the strongest results on most benchmarks, demonstrating the benefit of jointly training initial-generation and aggregation behaviors within a unified online multitask framework.
[Figure 22(a): Initial-round reward vs. training step (EMA smoothed) for Online Multitask (pass@k), Offline Aggregation (AggLM), Online Multitask, and Dr.GRPO.]
(a) Reward by training step of the initial round.
[Figure 22(b): Aggregation-round reward vs. training step (EMA smoothed) for the same four methods.]
(b) Reward by training step for the aggregation round.
Figure 22 Reward curves for training Qwen3-4B-Base on DeepScaleR using different methods. The baseline (Dr.GRPO) only optimizes the initial round, making its performance lag behind during the aggregation round. Offline aggregation-only training (AggLM) only optimizes the aggregation round, making it lag behind during the initial round. Our proposed method (ParaGator) achieves the highest reward.

Method                                           | Round   | HMMT        | Brumo       | AIME        | Olympiad    | Avg.
Qwen3-4B-Base                                    | Initial | 3.20/5.15   | 16.67/29.04 | 8.20/21.33  | 33.72/47.41 | 15.45/25.73
                                                 | Agg 1   | 2.50        | 14.25       | 10.30       | 30.28       | 14.33
                                                 | Agg 2   | 1.80        | 14.39       | 9.20        | 28.15       | 13.38
                                                 | Agg 3   | 1.60        | 11.38       | 8.03        | 25.16       | 11.54
Dr.GRPO (Liu et al., 2025d)                      | Initial | 9.79/15.94  | 25.05/35.27 | 19.27/28.11 | 45.65/54.92 | 24.94/33.56
                                                 | Agg 1   | 11.30       | 26.35       | 21.15       | 47.25       | 26.51
                                                 | Agg 2   | 12.40       | 27.60       | 21.88       | 48.33       | 27.55
                                                 | Agg 3   | 13.70       | 28.10       | 21.22       | 49.50       | 28.13
Offline Aggregation Training (Zhao et al., 2025) | Initial | 6.34/9.02   | 19.31/27.94 | 14.49/23.65 | 38.62/49.10 | 19.69/27.43
                                                 | Agg 1   | 12.08       | 24.54       | 21.05       | 44.46       | 25.53
                                                 | Agg 2   | 15.10       | 29.32       | 26.44       | 47.67       | 29.63
                                                 | Agg 3   | 17.01       | 31.33       | 27.37       | 50.11       | 31.46
Online Multitask                                 | Initial | 10.16/14.49 | 26.15/35.11 | 18.12/27.95 | 45.33/55.20 | 24.94/-
                                                 | Agg 1   | 10.05       | 29.11       | 20.62       | 46.90       | 26.67
                                                 | Agg 2   | 10.68       | 27.24       | 19.38       | 47.22       | 26.13
                                                 | Agg 3   | 12.90       | 29.30       | 20.50       | 47.15       | 27.46
ParaGator-Zero-4B (Online Multitask w/ pass@k)   | Initial | 11.15/22.11 | 28.07/39.96 | 20.05/32.40 | 45.42/54.49 | 26.17/33.19
                                                 | Agg 1   | 13.44       | 32.24       | 23.59       | 51.93       | 30.30
                                                 | Agg 2   | 15.29       | 34.67       | 25.45       | 53.09       | 32.12
                                                 | Agg 3   | 17.60       | 36.25       | 27.25       | 54.89       | 34.00

Table 15 Competition Math evaluation results by method and round, with fine-tuning starting from Qwen3-4B-Base. Best values in each column bolded. Numbers = Pass@1/Pass@4. ParaGator gives the overall best results.

3.5.2 Scientific Reasoning Setup We train on a subset of the Principia dataset, consisting of a total of 30,000 questions.
15,000 are sampled from the mathematical_objects subset, and 15,000 are sampled from the numerical subset. We report pass@1 scores both on the original PrincipiaBench, consisting of 2558 questions,12 and Pass@1 scores for HMMT 2025, BrumoMath 2025, and AIME 2025. We train on two different models: Qwen3-4B-Base and Qwen3-4B-Instruct-2507 (Yang et al., 2025b). The former resembles the "RL-Zero" paradigm, where we perform RL directly on a base model that has not been post-trained. The latter resembles a more practical setup where we enhance parallel-thinking behavior on a strong post-trained model. Results Across both backbones, we observe the same specialization trade-off: Dr.GRPO improves the initial-round reward but under-optimizes the aggregation round, while offline aggregation training (AggLM) does the opposite. Online multitask alleviates this mismatch by optimizing both stages jointly, and the pass@k variant achieves the strongest reward in both rounds, consistent with its goal of improving the quality of the candidate set consumed by aggregation. Qwen3-4B-Base. ParaGator yields the best post-aggregation performance overall, achieving the highest PrincipiaBench average (33.77) and the best competition-math average (23.72). It is strongest on all PrincipiaBench subsets (RealMath/Physics/ARB/SuperGPQA), and delivers the best AIME score (27.96), indicating broad improvements in scientific reasoning and hard math when both rounds are trained end-to-end. Qwen3-4B-Instruct-2507. In this stronger, post-trained regime, online multitask attains the best PrincipiaBench average after aggregation (58.68), while ParaGator remains competitive (58.07). In contrast, ParaGator achieves the largest gains on competition math, reaching the best overall average (61.38) and the top AIME score (71.20). This suggests that pass@k-aware training is particularly effective for difficult math, where aggregation benefits most from a higher-quality, more solution-bearing candidate pool.
12 Note that 128 filtered problems described in Section 1.3.1 were used in this evaluation as well due to concurrent experimentation.
[Figure 23: Reward vs. training step on Principia for Offline Aggregation (AggLM), Online Multitask, Online Multitask (pass@k), and Dr.GRPO. (a) Initial round; (b) aggregation round.]
Figure 23 Reward curves for training Qwen3-4B-Base on Principia. Optimizing for pass@k during the initial round (ParaGator) achieves the highest reward on both the initial round and the aggregation round.

                                |        Scientific Reasoning (PrincipiaBench)        |      Competition Math
Model                           | RealMath | Physics | ARB   | SuperGPQA | Avg.  | AIME  | Brumo | HMMT  | Avg.
Experiments on Qwen3-4B-Base
Qwen3-4B-Base                   | 9.11     | 5.69    | 17.44 | 11.01     | 10.81 | 8.20  | 16.67 | 3.20  | 9.36
+aggregation                    | 8.35     | 4.79    | 18.24 | 10.68     | 10.52 | 10.30 | 14.25 | 2.50  | 9.02
Dr.GRPO                         | 17.40    | 15.65   | 35.26 | 29.3      | 24.40 | 20.93 | 23.19 | 8.14  | 17.42
+aggregation                    | 19.32    | 21.06   | 44.17 | 32.28     | 29.21 | 25.95 | 31.07 | 13.10 | 23.37
Offline aggregation (AggLM)     | 10.15    | 12.37   | 25.40 | 15.26     | 15.80 | 9.13  | 17.11 | 7.42  | 11.22
+aggregation                    | 12.10    | 19.64   | 31.27 | 21.62     | 21.15 | 16.91 | 25.65 | 12.40 | 18.32
Online Multitask                | 18.05    | 16.33   | 33.26 | 24.18     | 22.96 | 19.14 | 22.37 | 7.98  | 16.50
+aggregation                    | 23.07    | 25.31   | 43.40 | 29.65     | 30.36 | 26.19 | 29.54 | 14.11 | 23.28
ParaGator-Zero-4B-Principia     | 17.71    | 21.09   | 38.62 | 28.75     | 26.54 | 21.05 | 22.58 | 7.04  | 16.89
+aggregation                    | 27.40    | 29.24   | 45.17 | 33.24     | 33.77 | 27.96 | 31.06 | 12.15 | 23.72
Experiments on Qwen3-4B-Instruct-2507
Qwen3-4B-Instruct-2507          | 22.94    | 33.75   | 63.30 | 41.87     | 40.47 | 47.71 | 30.83 | 55.42 | 44.65
+aggregation                    | 29.45    | 44.12   | 72.56 | 49.09     | 48.80 | 56.46 | 35.94 | 61.88 | 51.43
Dr.GRPO (with KL)               | 28.13    | 49.95   | 68.74 | 52.31     | 49.78 | 58.02 | 44.69 | 61.56 | 54.20
+aggregation                    | 35.44    | 57.14   | 76.67 | 58.94     | 57.05 | 63.65 | 47.29 | 65.52 | 58.79
Offline aggregation (AggLM)     | 25.19    | 40.05   | 62.75 | 44.46     | 43.11 | 51.06 | 37.47 | 54.91 | 47.81
+aggregation                    | 33.19    | 51.04   | 72.13 | 52.30     | 52.17 | 62.42 | 43.16 | 65.14 | 56.91
Online Multitask                | 29.16    | 47.07   | 67.65 | 53.44     | 49.32 | 58.36 | 41.73 | 59.45 | 53.18
+aggregation                    | 37.12    | 60.10   | 79.23 | 58.25     | 58.68 | 64.24 | 45.26 | 67.33 | 58.94
ParaGator-4B-Instruct-Principia | 27.93    | 46.18   | 66.70 | 52.39     | 48.30 | 59.44 | 42.32 | 59.67 | 53.81
+aggregation                    | 36.84    | 59.41   | 79.19 | 56.85     | 58.07 | 71.20 | 46.65 | 66.28 | 61.38

Table 16 Scientific reasoning (PrincipiaBench) and competition math evaluation results. Numbers denote Pass@1. Best values per column and model group are bolded. ParaGator gives the overall best results.

Overall, ParaGator consistently improves end-to-end aggregation performance. 3.6 Related Work Parallel Inference Scaling Recent works have explored scaling up test-time compute by generating multiple candidate solutions in parallel, moving beyond foundational majority-voting methods like Self-Consistency (Wang et al., 2022). Rather than simply selecting a single best output, modern approaches focus on dynamically selecting, compressing, or aggregating intermediate candidates to form a cohesive final response. One prominent direction is dynamic branching: Pan et al. (2025) train models to adaptively spawn child threads for specific sub-tasks and join their results back into the main reasoning process. Similarly, Zheng et al. (2025) trigger multiple reasoning threads at critical steps to be merged into the main context. Li et al. (2025a) adapt this parallel scaling to agentic contexts by branching partial rollouts only at highly uncertain steps, utilizing compressed reports during answer aggregation to maintain computational efficiency. To handle the massive context overhead of parallel scaling, Hu et al. (2025c) utilize message passing to compress long reasoning trajectories, enabling test-time compute to scale to millions of tokens under a fixed context length. Alternatively, other works emphasize multi-round iterative refinement; Venkatraman et al.
(2025) generate parallel solutions and aggregate them over successive rounds. In a closely related approach, Madaan et al. (2025) employ a cyclical pipeline that generates parallel solutions, distills them into a condensed summary, and generates refined solutions conditioned on that summary. To ensure these iterative generation cycles do not amplify errors, Sharma et al. (2026) leverage Process Reward Models to provide step-level verification during aggregation of the candidate population. Alternatively, Singh et al. (2026) propose a pairwise self-verification method to select the single best solution from the generated pool, rather than aggregating a new solution. In this work, we utilize end-to-end reinforcement learning to train a model to operate within a minimal scaffolding of parallel generation and subsequent aggregation. Optimizing for Solution Variety in RL for LLMs The effectiveness of aggregation frameworks is fundamentally tied to the diversity of the underlying candidate solutions. To encourage this during training, many recent works directly incorporate diversity metrics into the reinforcement learning reward signal, utilizing techniques such as embedding models (Zhou et al., 2026), trained classifiers (Li et al., 2025b), and LLM-based diversity judges (Hu et al., 2026). A particularly effective subset of these methods focuses on pass@k optimization (Tang et al., 2025; Walder and Karkhanis, 2025; Peng et al., 2025). Building on this direction, our work adopts the direct pass@k optimization strategy from Chen et al. (2025b), which specifically up-weights the advantage of incorrect solutions to prevent mode collapse and expand the exploration space. 3.7 Conclusion Scaling test-time compute is only as effective as the diversity and quality of the reasoning paths that are explored. Traditional parallel decoding and self-aggregation methods are bottlenecked by off-policy generations and mode collapse.
To overcome these limitations, we introduced ParaGator, a unified online reinforcement learning framework that explicitly aligns and optimizes candidate generations with downstream aggregation. Our core insight is that generation and aggregation require distinct but complementary optimization strategies. In ParaGator, the generator actively explores a diverse, complementary set of solutions through pass@k optimization. Simultaneously, the aggregator is trained via pass@1 optimization to reliably synthesize the on-policy candidates into a final answer. Extensive evaluations across competition math and scientific reasoning benchmarks validate the strength of this approach. On both base models (e.g., Qwen3-4B-Base) and strong post-trained reasoners (e.g., Qwen3-4B-Instruct-2507), ParaGator consistently improves over standard offline self-aggregation. The gains are particularly pronounced on highly complex tasks, such as AIME and Principia, where synthesizing diverse reasoning trajectories is critical. By co-training generation and aggregation end-to-end, ParaGator provides a robust, scalable recipe for improving inference-time reasoning. References Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025. Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016. https://arxiv.org/abs/1606.06565. Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025. https://hkunlp.github.io/blog/2025/Polaris. Art of Problem Solving. AIME problems and solutions, 2025. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning From Human Feedback. arXiv preprint arXiv:2204.05862, 2022a. https://arxiv.org/abs/2204.05862. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073, 2022b. https://arxiv.org/abs/2212.08073. Nishant Balepur and Rachel Rudinger. Is your large language model knowledgeable or a choices-only cheater? In Sha Li, Manling Li, Michael JQ Zhang, Eunsol Choi, Mor Geva, Peter Hase, and Heng Ji, editors, Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024), pages 15–26, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.knowllm-1.2. https://aclanthology.org/2024.knowllm-1.2/. Nishant Balepur, Abhilasha Ravichander, and Rachel Rudinger. Artifacts or abduction: How do LLMs answer multiple-choice questions without the question? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10308–10330, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.555. https://aclanthology.org/2024.acl-long.555/. Can Balioglu, Alexander Erben, Martin Gleize, Artyom Kozhevnikov, Ilia Kulikov, and Julien Yao. fairseq2. 2023. http://github.com/facebookresearch/fairseq2. Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating LLMs on Uncontaminated Math Competitions. arXiv preprint arXiv:2505.23281, 2025. https://arxiv.org/abs/2505.23281. Léon Bottou. From machine learning to machine reasoning: An essay.
Machine learning, 94(2):133–149, 2014. Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024. https://arxiv.org/abs/2407.21787. Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, and Jonas Geiping. Answer matching outperforms multiple choice for language model evaluation. arXiv preprint arXiv:2507.02856, 2025. Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. Universal self-consistency for large language model generation, 2023. https://arxiv.org/abs/2311.17311. Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron: Advancing math and code reasoning through reinforcement learning. arXiv preprint arXiv:2505.16400, 2025a. Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models, 2025b. https://arxiv.org/abs/2508.10751. François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, et al.
Supergpqa: Scaling llm evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739, 2025. Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alexander Nicholas D’Amour, Krishnamurthy Dj Dvijotham, Adam Fisch, Katherine A Heller, Stephen Robert Pfohl, Deepak Ramachandran, et al. Helping or herding? reward model ensembles mitigate but do not eliminate reward hacking. In First Conference on Language Modeling, 2024. https://arxiv.org/abs/2312.09244. Run-Ze Fan, Zengzhi Wang, and Pengfei Liu. Megascience: Pushing the frontiers of post-training datasets for science reasoning. arXiv preprint arXiv:2507.16812, 2025. https://arxiv.org/abs/2507.16812. Kaiyue Feng, Yilun Zhao, Yixin Liu, Tianyu Yang, Chen Zhao, John Sous, and Arman Cohan. Physics: Benchmarking foundation models on university-level physics problem solving. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 11717–11743, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.findings-acl.610. https://aclanthology.org/2025.findings-acl.610/. Evan Frick, Peter Jin, Tianle Li, Karthik Ganesan, Jian Zhang, Jiantao Jiao, and Banghua Zhu. Athene-70B: Redefining the Boundaries of Post-Training for Open Models, July 2024. https://nexusflow.ai/blogs/athene. Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503.01307, 2025. Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-MATH: A universal olympiad level mathematic benchmark for large language models. 
In The Thirteenth International Conference on Learning Representations, 2025. https://openreview.net/forum?id=yaqPf0KAlN. Leo Gao, John Schulman, and Jacob Hilton. Scaling Laws for Reward Model Overoptimization. In International Conference on Machine Learning, pages 10835–10866, 2023. https://proceedings.mlr.press/v202/gao23h.html. Mark J Gierl, Okan Bulut, Qi Guo, and Xinxin Zhang. Developing, analyzing, and using distractors for multiple-choice tests in education: A comprehensive review. Review of educational research, 87(6):1082–1116, 2017. Merrilyn Goos, Susana Carreira, and Immaculate Kizito Namukasa. Mathematics and interdisciplinary stem education: Recent developments and future directions. ZDM–Mathematics Education, 55(7):1199–1217, 2023. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024. https://arxiv.org/abs/2407.21783. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. Vipul Gupta, David Pantoja, Candace Ross, Adina Williams, and Megan Ung. Changing answer order can decrease MMLU accuracy. arXiv preprint arXiv:2406.19470, 2024. Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3828–3850, Bangkok, Thailand, August 2024. Association for Computational Linguistics. 
doi: 10.18653/v1/2024.acl-long.211. https://aclanthology.org/2024.acl-long.211/. Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner 1 technical report. arXiv preprint arXiv:2505.22312, 2025. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. https://openreview.net/forum?id=7Bywt2mQsCe. David Hestenes. Modeling theory for math and science education. In Modeling Students’ Mathematical Modeling Competencies: ICTMA 13, pages 13–41. Springer, 2009. Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025a. Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025b. https://arxiv.org/abs/2503.24290. Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, Xiangwen Kong, Chengyuan Yao, Ailin Huang, Hongyu Zhou, Qi Han, Zheng Ge, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Pacore: Learning to scale test-time compute with parallel coordinated reasoning, 2025c. https://github.com/stepfun-ai/PaCoRe/blob/main/pacore_report.pdf. Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, and Bryan Hooi.
Rewarding the rare: Uniqueness-aware rl for creative problem solving in llms, 2026. https://arxiv.org/abs/2601.08763. Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 1051–1068, 2023. Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025. https://github.com/huggingface/open-r1. Huggingface. Math-verify. 2025. https://github.com/huggingface/Math-Verify. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o System Card. arXiv preprint arXiv:2410.21276, 2024. https://arxiv.org/abs/2410.21276. Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 System Card. arXiv preprint arXiv:2412.16720, 2024. https://arxiv.org/abs/2412.16720. Seungone Kim, Se June Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, and Minjoon Seo. The cot collection: Improving zero-shot and few-shot learning of language models via chain-of-thought fine-tuning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12685–12708, 2023a. Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations, 2023b. Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353, 2024.
Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, and Graham Neubig. Evaluating language models as synthetic data generators. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6385–6403, 2025a. Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Kiril Gashteovski, Carolin Lawrence, Julia Hockenmaier, Graham Neubig, et al. Scaling evaluation-time compute with reasoning models as process evaluators. arXiv preprint arXiv:2503.19877, 2025b. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. https://dl.acm.org/doi/abs/10.1145/3600006.3613165. Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing Frontiers in Open Language Model Post-training. arXiv preprint arXiv:2411.15124, 2024. https://arxiv.org/abs/2411.15124. Nathan Lambert, Valentina Pyatkin, Jacob Morrison, Lester James Validad Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. Rewardbench: Evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1755–1797, 2025. Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, et al. Bridging Offline and Online Reinforcement Learning for LLMs. arXiv preprint arXiv:2506.21495, 2025. https://arxiv.org/abs/2506.21495.
Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Ren Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback. In International Conference on Machine Learning, pages 26874–26901. PMLR, 2024. https://arxiv.org/abs/2309.00267. Baixuan Li, Dingchu Zhang, Jialong Wu, Wenbiao Yin, Zhengwei Tao, Yida Zhao, Liwen Zhang, Haiyang Shen, Runnan Fang, Pengjun Xie, Jingren Zhou, and Yong Jiang. Parallelmuse: Agentic parallel thinking for deep information seeking, 2025a. https://arxiv.org/abs/2510.24698. Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath, 2024a. https://huggingface.co/AI-MO/NuminaMath-1.5 (report: https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf). Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, and Tianlu Wang. Jointly reinforcing diversity and quality in language model generations, 2025b. https://arxiv.org/abs/2509.02534. Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From Crowdsourced Data to High-quality Benchmarks: Arena-Hard and Benchbuilder Pipeline. In Forty-second International Conference on Machine Learning, 2025c. https://arxiv.org/abs/2406.11939. Wangyue Li, Liangzhi Li, Tong Xiang, Xiao Liu, Wei Deng, and Noa Garcia. Can multiple-choice questions really be useful in detecting the abilities of llms? In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 2819–2834, 2024b. Xiang Li, Jiayi Xin, Qi Long, and Weijie J Su. Evaluating the unseen capabilities: How many theorems do llms know?
arXiv preprint arXiv:2506.02058, 2025d. Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval, 5 2023. Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, and Jason Weston. Spice: Self-play in corpus environments improves reasoning. arXiv preprint arXiv:2510.24684, 2025a. Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, et al. Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy. arXiv preprint arXiv:2507.01352, 2025b. https://arxiv.org/abs/2507.01352. Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F Wong, Songyang Zhang, et al. Compassverifier: A unified and robust verifier for llms evaluation and outcome reward. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 33454–33482, 2025c. Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025d. Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling. arXiv preprint, 2024. https://arxiv.org/abs/2412.15084. Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024. Xun Lu. Writing-Zero: Bridge the Gap Between Non-verifiable Problems and Verifiable Rewards. arXiv preprint arXiv:2506.00103, 2025. https://arxiv.org/abs/2506.00103. Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, 2025. Notion Blog. Minh-Thang Luong, Dawsen Hwang, Hoang H Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, et al. Towards robust mathematical reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35406–35430, 2025a. Thang Luong, Dawsen Hwang, Hoang H Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu Hoang Trinh, Quoc V Le, and Junehyuk Jung. Towards robust mathematical reasoning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35418–35442, Suzhou, China, November 2025b. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.1794. https://aclanthology.org/2025.emnlp-main.1794/. Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, and Wenhu Chen. General-reasoner: Advancing llm reasoning across all domains. arXiv preprint arXiv:2505.14652, 2025a. Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun MA, and Wenhu Chen. General-reasoner: Advancing LLM reasoning across all domains. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025b. https://openreview.net/forum?id=pBFVoll8Xa. Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, and Anirudh Goyal. Rethinking thinking tokens: Llms as improvement operators, 2025.
https://arxiv.org/abs/2510.01123. Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A Smith, Hannaneh Hajishirzi, and Nathan Lambert. Rewardbench 2: Advancing reward model evaluation. arXiv preprint arXiv:2506.01937, 2025. Mark Meerschaert. Mathematical modeling. Academic press, 2013. Meituan LongCat Team, Anchun Gui, Bei Li, Bingyang Tao, Bole Zhou, Borun Chen, Chao Zhang, Chao Zhang, Chen Gao, Chen Zhang, Chengcheng Han, Chenhui Yang, Chuyu Zhang, Cong Chen, Cunguang Wang, Daoru Pan, Defei Bu, Dengchang Zhao, Di Xiu, Dishan Liu, Dongyu Ru, Dunwei Tu, Fan Wu, Fengcheng Yuan, Fengcun Li, Gang Xu, Guanyu Wu, Guoyuan Lin, Haibin Wang, Hansi Yang, Hao Yang, Haonan Yan, Haoxiang Ma, Haoxing Wen, Hongyan Hao, Hongyin Tang, Hongyu Zang, Hongzhi Ni, Hui Su, Jiacheng Zhang, Jiahong Zhou, Jiahuan Li, Jiaming Wang, Jian Yang, Jianfei Zhang, Jianhao Xu, Jianing Wang, Jiapeng Zhu, Jiaqi Sun, Jiarong Shi, Jiarui Zhao, Jingang Wang, Jinluan Yang, Jinrui Ding, Jinwei Xiao, Jiyuan He, Juncan Xu, Kefeng Zhang, Keheng Wang, Li Wei, Lianhui Ma, Lin Qiu, Lingbing Kong, Lingchuan Liu, Linsen Guo, Mengshen Zhu, Mengxia Shen, Mingyang Zhu, Peiguang Li, Peng Pei, Peng Zhao, Pengcheng Jia, Pengtao Zhang, Ping Liu, Qi Gu, Qiong Huang, Qiyuan Duan, Quanchi Weng, Rongxiang Weng, Rongzhi Zhang, Rumei Li, Shanglin Lei, Shengnan An, Shijun Dai, Shizhe Wu, Shuaikang Liu, Shuang Zhou, Shuo Wang, Songyuan Zhao, Tao Liang, Tianhao Hu, Tianze Chen, Wei Liu, Wei Shi, Wei Wang, Weifeng Tang, Wenjie Shi, Wenlong Zhu, Wentao Chen, Wentao Shi, Xi Su, Xiandi Ma, Xiangcheng Liu, Xiangyu Xi, Xiangyuan Liu, Xiangzhou Huang, Xiao Liu, Xiaodong Cai, Xiaolong Chen, Xiaowei Shi, Xiaoyu Li, Xin Chen, Xingchen Liu, Xuan Huang, Xuezhi Cao, Xunliang Cai, Yan Chen, Yang Bai, Yang Liu, Yang Yang, Yang Zheng, Yanyu Chen, Yaoming Wang, Yaoming Zhu, Yaorui Shi, Yaqi Huo, Yerui Sun, Yi Zhang, Yi-Kai Zhang, Yifan Lu, Yifan Zhao, Yihao Chen, Yitao Zhai, Yongjing Yin, Yongwei Zhou, Youshao Xiao, Yu Wang, Yu 
Yang, Yuchen Xie, Yuchen Yu, Yuchuan Dai, Yue Xu, Yueqing Sun, Yufei Zhang, Yuhuai Wei, Yulei Qian, Yunfan Liang, Yunke Zhao, Yuwei Jiang, Yuxin Bian, Yuxin Chen, Yuxin Liu, Zeyang Yu, Zhao Yang, Zhengsheng Huang, Zhengyu Chen, Zhijian Liu, Zhikang Xia, Zhimin Lin, Zhiyuan Yao, Zhuofan Chen, Zhuowen Han, Zijian Zhang, Ziran Li, Ziwen Wang, and Ziyuan Zhuang. Longcat-flash-thinking-2601 technical report, 2026. https://arxiv.org/abs/2601.16725. Arindam Mitra, Hamed Khanpour, Corby Rosset, and Ahmed Awadallah. Orca-math: Unlocking the potential of slms in grade school math. arXiv preprint arXiv:2402.14830, 2024. Philipp Mondorf and Barbara Plank. Beyond accuracy: Evaluating the reasoning behavior of large language models - a survey. In First Conference on Language Modeling, 2024. https://openreview.net/forum?id=Lmjgl2n11u. Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning Dataset. arXiv preprint arXiv:2504.16891, 2025a. https://arxiv.org/abs/2504.16891. Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset. arXiv preprint arXiv:2504.16891, 2025b. Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707, 2023. Nam Nguyen and Rich Caruana. Consensus clusterings. In Seventh IEEE international conference on data mining (ICDM 2007), pages 607–612. IEEE, 2007. OpenAI. Early experiments in accelerating science with gpt-5. 2025a. https://openai.com/index/accelerating-science-gpt-5/. OpenAI. OpenAI o3 and o4-mini System Card, 2025b.
https://openai.com/index/o3-o4-mini-system-card. OpenAI. Introducing openai o3 and o4-mini. 2025c. https://openai.com/index/introducing-o3-and-o4-mini/. OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Benjamin Sokolowsky, Boaz Barak, Bob McGrew, Borys Minaiev, Botao Hao, Bowen Baker, Brandon Houghton, Brandon McKinzie, Brydon Eastman, Camillo Lugaresi, Cary Bassin, Cary Hudson, Chak Ming Li, Charles de Bourcy, Chelsea Voss, Chen Shen, Chong Zhang, Chris Koch, Chris Orsinger, Christopher Hesse, Claudia Fischer, Clive Chan, Dan Roberts, Daniel Kappler, Daniel Levy, Daniel Selsam, David Dohan, David Farhi, David Mely, David Robinson, Dimitris Tsipras, Doug Li, Dragos Oprica, Eben Freeman, Eddie Zhang, Edmund Wong, Elizabeth Proehl, Enoch Cheung, Eric Mitchell, Eric Wallace, Erik Ritter, Evan Mays, Fan Wang, Felipe Petroski Such, Filippo Raso, Florencia Leoni, Foivos Tsimpourlas, Francis Song, Fred von Lohmann, Freddie Sulit, Geoff Salmon, Giambattista Parascandolo, Gildas Chabot, Grace Zhao, Greg Brockman, Guillaume Leclerc, Hadi Salman, Haiming Bao, Hao Sheng, Hart Andrin, Hessam Bagherinezhad, Hongyu Ren, Hunter Lightman, Hyung Won Chung, Ian Kivlichan, Ian O’Connell, Ian Osband, Ignasi Clavera Gilaberte, Ilge Akkaya, Ilya Kostrikov, Ilya Sutskever, Irina Kofman, Jakub Pachocki, James Lennon, Jason Wei, Jean Harb, Jerry Twore, Jiacheng Feng, Jiahui Yu, Jiayi Weng, Jie Tang, Jieqi Yu, Joaquin Quiñonero Candela, Joe Palermo, Joel Parish, Johannes Heidecke, John Hallman, John Rizzo, Jonathan Gordon, Jonathan Uesato, Jonathan Ward, Joost Huizinga, Julie Wang, 
Kai Chen, Kai Xiao, Karan Singhal, Karina Nguyen, Karl Cobbe, Katy Shi, Kayla Wood, Kendra Rimbach, Keren Gu-Lemberg, Kevin Liu, Kevin Lu, Kevin Stone, Kevin Yu, Lama Ahmad, Lauren Yang, Leo Liu, Leon Maksin, Leyton Ho, Liam Fedus, Lilian Weng, Linden Li, Lindsay McCallum, Lindsey Held, Lorenz Kuhn, Lukas Kondraciuk, Lukasz Kaiser, Luke Metz, Madelaine Boyd, Maja Trebacz, Manas Joglekar, Mark Chen, Marko Tintor, Mason Meyer, Matt Jones, Matt Kaufer, Max Schwarzer, Meghan Shah, Mehmet Yatbaz, Melody Y. Guan, Mengyuan Xu, Mengyuan Yan, Mia Glaese, Mianna Chen, Michael Lampe, Michael Malek, Michele Wang, Michelle Fradin, Mike McClay, Mikhail Pavlov, Miles Wang, Mingxuan Wang, Mira Murati, Mo Bavarian, Mostafa Rohaninejad, Nat McAleese, Neil Chowdhury, Neil Chowdhury, Nick Ryder, Nikolas Tezak, Noam Brown, Ofir Nachum, Oleg Boiko, Oleg Murk, Olivia Watkins, Patrick Chao, Paul Ashbourne, Pavel Izmailov, Peter Zhokhov, Rachel Dias, Rahul Arora, Randall Lin, Rapha Gontijo Lopes, Raz Gaon, Reah Miyara, Reimar Leike, Renny Hwang, Rhythm Garg, Robin Brown, Roshan James, Rui Shu, Ryan Cheu, Ryan Greene, Saachi Jain, Sam Altman, Sam Toizer, Sam Toyer, Samuel Miserendino, Sandhini Agarwal, Santiago Hernandez, Sasha Baker, Scott McKinney, Scottie Yan, Shengjia Zhao, Shengli Hu, Shibani Santurkar, Shraman Ray Chaudhuri, Shuyuan Zhang, Siyuan Fu, Spencer Papay, Steph Lin, Suchir Balaji, Suvansh Sanjeev, Szymon Sidor, Tal Broda, Aidan Clark, Tao Wang, Taylor Gordon, Ted Sanders, Tejal Patwardhan, Thibault Sottiaux, Thomas Degry, Thomas Dimson, Tianhao Zheng, Timur Garipov, Tom Stasi, Trapit Bansal, Trevor Creech, Troy Peterson, Tyna Eloundou, Valerie Qi, Vineet Kosaraju, Vinnie Monaco, Vitchyr Pong, Vlad Fomenko, Weiyi Zheng, Wenda Zhou, Wes McCabe, Wojciech Zaremba, Yann Dubois, Yinghai Lu, Yining Chen, Young Cha, Yu Bai, Yuchen He, Yuchen Zhang, Yunyun Wang, Zheng Shao, and Zhuohan Li. Openai o1 system card, 2024. https://arxiv.org/abs/2412.16720. 
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training Language Models to Follow Instructions with Human Feedback. Advances in neural information processing systems, 35:27730–27744, 2022. https://papers.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf. Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, and Alane Suhr. Learning adaptive parallel reasoning with language models, 2025. https://arxiv.org/abs/2504.15466. Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, and Yandong Wen. Simko: Simple pass@k policy optimization, 2025. https://arxiv.org/abs/2510.14807. Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason E Weston, and Jane Yu. Self-consistency preference optimization. In Forty-second International Conference on Machine Learning, 2025. https://openreview.net/forum?id=94G4eL3RWi. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. https://openreview.net/forum?id=Ti67584b98. Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J Nay, Kshitij Gupta, and Aran Komatsuzaki. Arb: Advanced reasoning benchmark for large language models. arXiv preprint arXiv:2307.13692, 2023. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017. https://arxiv.org/abs/1707.06347. Krunal Shah, Nitish Gupta, and Dan Roth. What do we expect from multiple-choice QA systems?
In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3547–3553, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.317. https://aclanthology.org/2020.findings-emnlp.317/. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. Zhihong Shao, Yuxiang Luo, Chengda Lu, Z Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, and Xiaokang Zhang. DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning. arXiv preprint arXiv:2511.22570, 2025. https://arxiv.org/abs/2511.22570. Rituraj Sharma, Weiyuan Chen, Noah Provenzano, and Tu Vu. Prism: Pushing the frontier of deep think via process reward model-guided inference, 2026. https://arxiv.org/abs/2603.02479. Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024. Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can LLMs generate novel research ideas? a large-scale human study with 100+ NLP researchers. In The Thirteenth International Conference on Learning Representations, 2025. https://openreview.net/forum?id=M23dTGWCZy. Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan, Xiaoxia Wu, Junxiong Wang, Alpay Ariyak, Qingyang Wu, Samir Khaki, et al. v_1: Unifying generation and self-verification for parallel reasoners. arXiv preprint arXiv:2603.04304, 2026. Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024. Guijin Son, SangWon Baek, Sangdae Nam, Ilgyun Jeong, and Seungone Kim.
Multi-task inference: Can large language models follow multiple instructions at once? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5606–5627, 2024. Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and Remi Munos. Optimizing language models for inference time objectives using reinforcement learning. In Forty-second International Conference on Machine Learning, 2025. https://openreview.net/forum?id=ZVWJO5YTz4. Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Sharon Li, Jason E Weston, and Ping Yu. Hybrid Reinforcement: When Reward Is Sparse, It’s Better to Be Dense. arXiv preprint arXiv:2510.07242, 2025. https://arxiv.org/abs/2510.07242. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Hao Hu, Xiaoru Hao, Tianhong He, Weiran He, Wenyang He, Chao Hong, Yangyang Hu, Zhenxing Hu, Weixiao Huang, Zhiqi Huang, Zihao Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yongsheng Kang, Guokun Lai, Cheng Li, Fang Li, Haoyang Li, Ming Li, Wentao Li, Yanhao Li, Yiwei Li, Zhaowei Li, Zheming Li, Hongzhan Lin, Xiaohan Lin, Zongyu Lin, Chengyin Liu, Chenyu Liu, Hongzhang Liu, Jingyuan Liu, Junqi Liu, Liang Liu, Shaowei Liu, T. Y.
Liu, Tianwei Liu, Weizhou Liu, Yangyang Liu, Yibo Liu, Yiping Liu, Yue Liu, Zhengying Liu, Enzhe Lu, Lijun Lu, Shengling Ma, Xinyu Ma, Yingwei Ma, Shaoguang Mao, Jie Mei, Xin Men, Yibo Miao, Siyuan Pan, Yebo Peng, Ruoyu Qin, Bowen Qu, Zeyu Shang, Lidong Shi, Shengyuan Shi, Feifan Song, Jianlin Su, Zhengyuan Su, Xinjie Sun, Flood Sung, Heyi Tang, Jiawen Tao, Qifeng Teng, Chensi Wang, Dinglu Wang, Feng Wang, Haiming Wang, Jianzhou Wang, Jiaxing Wang, Jinhong Wang, Shengjie Wang, Shuyi Wang, Yao Wang, Yejie Wang, Yiqin Wang, Yuxin Wang, Yuzhi Wang, Zhaoji Wang, Zhengtao Wang, Zhexu Wang, Chu Wei, Qianqian Wei, Wenhao Wu, Xingzhe Wu, Yuxin Wu, Chenjun Xiao, Xiaotong Xie, Weimin Xiong, Boyu Xu, Jing Xu, Jinjing Xu, L. H. Xu, Lin Xu, Suting Xu, Weixin Xu, Xinran Xu, Yangchuan Xu, Ziyao Xu, Junjie Yan, Yuzi Yan, Xiaofei Yang, Ying Yang, Zhen Yang, Zhilin Yang, Zonghan Yang, Haotian Yao, Xingcheng Yao, Wenjie Ye, Zhuorui Ye, Bohong Yin, Longhui Yu, Enming Yuan, Hongbang Yuan, Mengjie Yuan, Haobing Zhan, Dehao Zhang, Hao Zhang, Wanlu Zhang, Xiaobin Zhang, Yangkun Zhang, Yizhi Zhang, Yongting Zhang, Yu Zhang, Yutao Zhang, Yutong Zhang, Zheng Zhang, Haotian Zhao, Yikai Zhao, Huabin Zheng, Shaojie Zheng, Jianren Zhou, Xinyu Zhou, Zaida Zhou, Zhen Zhu, Weiyu Zhuang, and Xinxing Zu. Kimi k2: Open agentic intelligence, 2025. https://arxiv.org/abs/2507.20534. Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi deepresearch technical report. arXiv preprint arXiv:2510.24701, 2025. Richard Van Noorden and Jeffrey M Perkel. Ai and science: what 1,600 researchers think. Nature, 621(7980):672–675, 2023. Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, Nikolay Malkin, and Moksh Jain. Recursive self-aggregation unlocks deep thinking in large language models, 2025. 
https://arxiv.org/abs/2509.26626. Christian Walder and Deep Karkhanis. Pass@k policy optimization: Solving harder reinforcement learning problems, 2025. https://arxiv.org/abs/2505.15201. Haochun Wang, Sendong Zhao, Zewen Qiang, Nuwa Xi, Bing Qin, and Ting Liu. Llms may perform mcqa by selecting the least incorrect option, 2024a. https://arxiv.org/abs/2402.01349. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022. https://arxiv.org/abs/2203.11171. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023a. https://openreview.net/forum?id=1PL1NIMMrw. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484–13508, 2023b. Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024b. Zengzhi Wang, Fan Zhou, Xuefeng Li, and Pengfei Liu. Octothinker: Mid-training incentivizes reinforcement learning scaling. arXiv preprint arXiv:2506.20512, 2025a. Zora Zhiruo Wang, Yijia Shao, Omar Shaikh, Daniel Fried, Graham Neubig, and Diyi Yang. How do ai agents do human work? comparing ai and human workflows across diverse occupations. arXiv preprint arXiv:2510.22780, 2025b. 
Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart Van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015. Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam, Nicolò Busetto, Denise Diaz, and Francisco Guzmán. MENLO: From Preferences to Proficiency–Evaluating and Modeling Native-like Quality Across 47 Languages. arXiv preprint arXiv:2509.26601, 2025a. https://arxiv.org/abs/2509.26601. Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason Weston, Ilia Kulikov, and Swarnadeep Saha. J1: Incentivizing thinking in llm-as-a-judge via reinforcement learning. arXiv preprint arXiv:2505.10320, 2025b. Tobias Wittkop, Dorothea Emig, Anke Truss, Mario Albrecht, Sebastian Böcker, and Jan Baumbach. Comprehensive cluster analysis with transitivity clustering. Nature protocols, 6(3):285–295, 2011. Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for LLM problem-solving. In The Thirteenth International Conference on Learning Representations, 2025. https://openreview.net/forum?id=VNckp7JEHn. Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 technical report, 2025a. https://arxiv.org/abs/2505.09388. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025b. Longhui Yu, Weisen Jiang, Han Shi, Jincheng YU, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. In The Twelfth International Conference on Learning Representations, 2024. https://openreview.net/forum?id=N8N0hgNDRt. Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, and Jing Xu. Cot-self-instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks. arXiv preprint arXiv:2507.23751, 2025. Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models, 2025a. https://arxiv.org/abs/2401.10020. Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Ilia Kulikov, Kyunghyun Cho, Dong Wang, Yuandong Tian, Jason E Weston, et al. Naturalreasoning: Reasoning in the wild with 2.8 m challenging questions. arXiv preprint arXiv:2502.13124, 2025b. 
Xiang Yue, Tianyu Zheng, Ge Zhang, and Wenhu Chen. Mammoth2: Scaling instructions from the web. Advances in Neural Information Processing Systems, 37:90629–90660, 2024. Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025. Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022. Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892, 2025. Jie Zhang, Cezara Petrui, Kristina Nikolić, and Florian Tramèr. Realmath: A continuous benchmark for evaluating language models on research-level mathematics. arXiv preprint arXiv:2505.12575, 2025. Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1m chatgpt interaction logs in the wild. arXiv preprint arXiv:2405.01470, 2024. 51 Wenting Zhao, Pranjal Aggarwal, Swarnadeep Saha, Asli Celikyilmaz, Jason Weston, and Ilia Kulikov. The majority is not always right: Rl training for solution aggregation, 2025. https://arxiv.org/abs/2509.06870. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023. Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, and Dong Yu. Parallel-r1: Towards parallel thinking via reinforcement learning, 2025. https://arxiv.org/abs/2509.07980. 
4 Appendix

4.1 Contributions

Chapter 1 contributors: Seungone Kim, Pranjal Aggarwal, Bo Liu, Swarnadeep Saha, Ping Yu, Anaelia Ovalle†, Jack Lanchantin, Jing Xu, Weizhe Yuan, Wenting Zhao, Adina Williams, Marjan Ghazvininejad, Graham Neubig‡, Sean Welleck‡, Jason Weston, Ilia Kulikov.

Chapter 2 contributors: Chenxi Whitehouse, Ilia Kulikov, Ping Yu, Jason Weston, Xian Li, Swarnadeep Saha.

Chapter 3 contributors: Tianjian Li, Jingyu Zhang, Ping Yu, Swarnadeep Saha, Sainbayar Sukhbaatar, Jason Weston, Ilia Kulikov, Jack Lanchantin.

All authors have a Meta AI affiliation except: †: UCLA, ‡: CMU.

4.2 Hyperparameters for RL training

For training LMs using a model-based verifier, we load the policy model and the verifier model on separate nodes. For running inference with the verifier model (determining answer equivalence), we use vLLM (v0.8.4) online inference, and for training the policy model we use verl (v0.4.0) with vLLM for rollouts. For inter-node communication (i.e., sending the verification results to verl), we use litellm (v1.72.2).

4.3 Prompts

4.3.1 Prompts for Training and Evaluating Principia Models

Default system prompt for LMs trained on the Principia Collection:

<instructions>
The user asks a question, and you should solve it as an intelligent assistant. You should first write the reasoning process step-by-step and then provide the user with the answer. The reasoning process and answer should be enclosed within <reason> </reason> and <answer> </answer> tags, respectively (i.e., <reason> reasoning process here </reason> <answer> answer here </answer>).
You should always start your response with <reason>, reason step-by-step, close your reasoning with </reason>, and then provide your final answer within <answer> and </answer> tags. Note that you should write <reason>, </reason>, <answer>, and </answer> only once in your response for clear separation of reasoning and answer. Most importantly, there should be no text before <reason> and after </answer> tags. Between <answer> and </answer>, you should be concise and only provide the final prediction without any additional explanations (e.g., <answer> L = \frac{1}{2} m \dot{x}^2 (1 + 4x^2a^2) - mgx^2a </answer>).
</instructions>

4.3.2 Prompts for Principia Collection Construction

Prompt for Sketching Strategy Description (first step of Problem Statement Generation)

### Task Description:
You are given a topic within mathematics or the natural sciences. Your task is to generate eight capabilities relevant to this topic as a list. A capability is a concise, actionable description of a specific technique, method, or conceptual tool that an expert in the topic would use to solve problems within that area. Each capability should be:
- Specific: Focus on a single, well-defined skill, method, or conceptual approach.
- Action-oriented: Phrase each capability as something that can be done (e.g., "Applying X to compute Y", "Bounding Z using W", "Framing A in terms of B").
- Relevant: Directly related to the topic and commonly used by researchers or advanced students in that field.
- Non-trivial: Go beyond basic definitions or elementary facts; focus on techniques that require some expertise or insight.
- Self-contained: Each capability should make sense on its own, without requiring reference to the others.
- Non-overlapping: Each capability should not overlap with one another.
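As an aside on the format enforced by the default system prompt in Section 4.3.1 above: a response is valid only if it consists of exactly one <reason>…</reason> block followed by exactly one <answer>…</answer> block, with no text before or after. A minimal validator for that contract might look like the sketch below; this is our own illustration (the paper does not specify its parsing code, and `parse_response` is a hypothetical helper name):

```python
import re

# Matches exactly: optional whitespace, one <reason>...</reason> block,
# one <answer>...</answer> block, optional whitespace -- nothing else.
_PATTERN = re.compile(
    r"\A\s*<reason>(?P<reason>.*?)</reason>\s*"
    r"<answer>(?P<answer>.*?)</answer>\s*\Z",
    re.DOTALL,
)

def parse_response(text: str):
    """Return (reasoning, answer) if `text` obeys the tag contract, else None."""
    m = _PATTERN.match(text)
    if m is None:
        return None
    # The prompt requires each tag to appear exactly once in the response.
    for tag in ("<reason>", "</reason>", "<answer>", "</answer>"):
        if text.count(tag) != 1:
            return None
    return m.group("reason").strip(), m.group("answer").strip()
```

A response with any preamble, trailing text, or duplicated tags is rejected, mirroring the "no text before <reason> and after </answer>" and "write each tag only once" requirements of the prompt.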
### Examples:
The following are some examples of a topic and corresponding capabilities:

(1) For the topic "Number theory >> Multiplicative number theory >> Distribution of integers with specified multiplicative constraints":
<start>
CAPABILITIES:
1. Framing prime conditions via Frobenius elements and splitting in number fields
2. Applying Chebotarev's density theorem to compute the density of primes with specified residue properties
3. Using the inclusion-exclusion principle to combine multiple conditions on primes (no smaller power residues)
4. Exploiting linear disjointness of field extensions to treat residue conditions independently ("quasi-multiplicative" densities)
5. Identifying and handling exceptional small primes that violate independence assumptions (special correlation cases)
6. Deriving explicit formulas for the density of primes satisfying complex power-residue conditions
7. Bounding tail contributions of infinite series using number-theoretic estimates (e.g., Euler's totient function growth)
8. Estimating and controlling error terms to ensure the final computed density is accurate to the required precision
<end>

(2) For the topic "Algebraic geometry >> Curves in algebraic geometry >> Coverings of curves, fundamental group":
<start>
CAPABILITIES:
1. Analyzing polynomial coverings via monodromy groups (permutation of 19 sheets) in algebraic geometry
2. Modeling irreducibility as transitivity of the monodromy action on fibered points
3. Utilizing branch loci and monodromy representations to encode the polynomial's structure
4. Applying Burnside's lemma and Sylow's theorem to deduce properties of a prime-order cycle in the permutation group
5. Establishing the normality of the 19-cycle subgroup to constrain the group's orbit structure (number of orbits)
6. Using the Riemann-Hurwitz formula to relate branch points to cycle structure and bound the number of components
7.
Deducing the cycle decomposition (e.g., 9 disjoint transpositions) from group-orbit analysis to determine irreducible components
8. Identifying the degree-19 Chebyshev polynomial (via the Riemann existence theorem) as the unique polynomial with the required monodromy, and adjusting it to satisfy oddness, monicity, and coefficient conditions
<end>

(3) For the topic "Number theory >> Sequences and sets >> Recurrences":
<start>
CAPABILITIES:
1. Solving the linear recurrence through its characteristic polynomial and eigenvalues (roots of the characteristic equation)
2. Writing an explicit formula for sequence terms using the contributions of each characteristic root
3. Analyzing the p-adic valuation of each characteristic root (condition $|a|_p < 1$) to determine viability for continuous extension
4. Using divisibility conditions in the recurrence values to pinpoint the prime that permits a continuous extension (narrowing down candidates)
5. Reducing the characteristic polynomial modulo p to examine how roots behave in $\mathbb{Z}_p$ (e.g., whether they become repeated or roots of unity)
6. Employing the p-adic exponential and logarithm to define $a^x$ for non-integer $x$, enabling extension of the sequence to $\mathbb{Z}_p$
7. Ensuring convergence of the power-series definition (requiring $|\log(a)|_p < 1$ or $|a-1|_p < 1$ for each root $a$) so that the extension is well-defined and continuous
8. Considering multiple potential branches (different choices of $p$-adic log for roots, especially when roots involve roots of unity) and requiring a unique branch to achieve a single continuous extension
<end>

(4) For the topic "Group theory and generalizations >> Representation theory of groups >> Representations of finite symmetric groups":
<start>
CAPABILITIES:
1. Recognizing that the given matrix tuple relations correspond to the Coxeter (braid) relations of a symmetric group
2.
Reformulating the classification of such matrix tuples as the classification of representations of $S_n$ (the symmetric group) of a given dimension
3. Noting that $GL(n)$-orbits of these tuples correspond to isomorphism classes of $n$-dimensional representations of the group
4. Using the character table of the symmetric group to obtain all irreducible representation dimensions available
5. Accounting for the fact that each irreducible dimension (for non-self-conjugate partitions) appears in two distinct irreps (a partition and its conjugate)
6. Formulating the counting of $n$-dimensional representations as a partitioning of $n$ into the multiset of irreducible dimensions (with unlimited repeats allowed)
7. Setting up a generating function whose coefficients count the number of ways to sum irreducible dimensions to $n$ (each irreducible dimension contributing a factor)
8. Extracting the coefficient of $x^n$ in the power series expansion to obtain the number of distinct $n$-dimensional representation types (hence orbits)
<end>

(5) For the topic "Number theory >> Arithmetic algebraic geometry (Diophantine geometry) >> Curves over finite and local fields":
<start>
CAPABILITIES:
1. Interpreting the homogeneous equation as a projective algebraic curve over the finite field $\mathbb{F}_{p^k}$ and translating the question into counting $\mathbb{F}_{p^k}$-rational points
2. Applying the Hasse-Weil bound to constrain the possible number of points on the curve (using the genus or other invariants to get an upper bound)
3. Recognizing if the curve is maximal (attains the Hasse-Weil upper bound), which would directly determine the number of points
4. Utilizing symmetry in the curve's equation (cyclic permutation of $x,y,z$) to reduce and simplify the point-counting process
5. Incorporating finite field extension techniques, such as constructing $\mathbb{F}_{p^k}$ via an irreducible polynomial, to systematically enumerate solutions in the extension field
6.
Accounting for the "up to scaling" condition by counting projective points (excluding the trivial zero vector and identifying scalar multiples as one)
7. Breaking the counting problem into cases (e.g., whether a coordinate is zero) and using inclusion-exclusion to avoid overcounting across those cases
8. Verifying the final count against theoretical bounds or known results for consistency, and adjusting if any deviations are detected (ensuring the count respects the Hasse-Weil limits)
<end>

### Format Requirements:
(1) Before writing the list of capabilities, you should think step by step about what would constitute a good set of capabilities for the given topic.
(2) After you think step by step, you should write "<start>" to denote that you will generate the list of capabilities. Your generation will be considered invalid if you do not generate any thinking before "<start>".
(3) There should be exactly 8 capabilities, where each capability starts with "1. ", "2. ", "3. ", "4. ", "5. ", "6. ", "7. ", "8. ". If not, your generation will be treated as invalid.
(4) After you are done writing the list of capabilities, write "<end>" to denote that you are done generating the list of capabilities.
(5) You should not include any explanations or other text between "<start>" and "<end>". Also, "<start>" and "<end>" should occur only once in your generation (i.e., when you are proposing the list of capabilities after your thinking). If not, your generation will be treated as invalid.

### Guidelines for Thinking before Proposing the List of Capabilities:
You should think step by step before actually writing the list of capabilities. Here are some guidelines:
- List the main entities: Identify all relevant objects (e.g., particles, fields, operators, manifolds, equations, symmetries, primes, groups, functions, etc.).
- Describe core and challenging problems: What are the most important, subtle, or technically difficult questions in this area?
(e.g., proving existence/uniqueness, classifying solutions, computing observables, determining distributions, understanding symmetries, etc.)
- Highlight sources of complexity: What makes these problems hard? (e.g., nonlinearity, high dimensionality, coupling between variables, deep connections to other areas, etc.)
- List foundational results and principles: What are the essential theorems, laws, or principles? (e.g., Noether's theorem, conservation laws, Chebotarev's theorem, variational principles, gauge invariance, etc.)
- General strategies: What are the typical approaches to analyze or manipulate these objects? (e.g., perturbation theory, symmetry reduction, use of generating functions, field extensions, variational calculus, etc.)
- Specialized or clever techniques: Are there unique tricks, constructions, or methods specific to this topic? (e.g., Feynman diagrams, explicit class field theory, advanced inclusion-exclusion, Monte Carlo simulations, etc.)
- Reduction steps: How do experts break down complex problems? (e.g., decomposing into simpler subsystems, using orthogonality, modular reduction, separation of variables, etc.)
- Combining multiple conditions: What methods are used to handle simultaneous constraints or coupled systems? (e.g., inclusion-exclusion, Lagrange multipliers, coupled equations, etc.)
- Exceptional or pathological cases: Are there special values, singularities, or degenerate cases that require separate treatment?
- Failure of naive approaches: Where do standard methods break down, and why? (e.g., non-perturbative effects, lack of independence, divergences, etc.)
- Technical obstacles: What are the hardest parts of the analysis? (e.g., bounding error terms, ensuring convergence, handling infinities, controlling chaos, etc.)

Always note that it is better if both your thinking and list of capabilities are as detailed as possible, including specific equations or cases in context.
### Topic:
Now it is your turn to propose eight capabilities relevant to the topic: [HERE_IS_THE_TOPIC]

Prompt for Generating Initial Problem Statement (second step of Problem Statement Generation)

### Task Description:
You are given a topic in mathematics or the natural sciences, a description of a capability, and a required answer format. Your task is to generate a question that is relevant to the given topic, requires the specified capability in order to solve the question, and must be answered in the specified answer format.

### Examples:
Below are some examples that include a topic, a description of a capability, an answer format, and a corresponding question. In particular, note that the questions are not concept-checking questions, but instead reasoning-intensive ones:

(1) For the topic "Quantum Mechanics >> Quantum systems with discrete spectra >> Time-independent perturbation theory", the capability "Identifying that $|001\rangle$ decouples (zero matrix elements) because $x$ and $y$ cannot connect $z$-excited states", and answer type 'Equation: A mathematical statement that includes an equals sign (=), expressing the equality of two expressions (e.g., $H = \frac{1}{2m}(\mathbf{p} - \frac{q}{c}\mathbf{A})^2 + q\varphi$).':
<question>
The first excited state of the three-dimensional isotropic harmonic oscillator (of natural angular frequency $\omega_0$ and mass $m$) is three-fold degenerate. Use the perturbation method to calculate the splitting (to the first order) of this three-fold degenerate state due to a small perturbation of the form $H' = bxy$, where $b$ is a constant.
Give the first-order wave functions of the three split levels in terms of the wave functions of the unperturbed three-dimensional harmonic oscillator, given that, for a one-dimensional harmonic oscillator, $$\langle n|x|n+1\rangle = \sqrt{\frac{(n+1)\hbar}{2m\omega_0}}.$$
</question>

(2) For the topic "Group theory and generalizations >> Special aspects of infinite or finite groups >> Braid groups; Artin groups", the capability "Employing the determinant condition to further constrain the matrix variables, typically reducing the system to a single or a few key variables", and answer type 'Piecewise Function: A function that is defined by different expressions or formulas over different intervals of its domain, with each piece applying to a specific part of the domain (e.g., $f(t) = \begin{cases} 1 - |t|, & \text{if } 0 \le t < 2 \\ t - 3, & \text{if } 2 \le t < 4 \end{cases}$). It should use symbols such as "\begin{cases}", "\end{cases}", "if", "otherwise"':
<question>
Consider the braid group $B_3$, the group of braids on three strands, with the standard presentation given by generators $\sigma_1, \sigma_2$ and the relation $\sigma_1 \sigma_2 \sigma_1 = \sigma_2 \sigma_1 \sigma_2$. The Burau representation of $B_3$ is a homomorphism $\rho: B_3 \to GL_2(\mathbb{Z}[t, t^{-1}])$, where $\sigma_1$ and $\sigma_2$ are mapped to the matrices: $$\rho(\sigma_1) = \begin{pmatrix} -t & 1 \\ 0 & 1 \end{pmatrix}, \qquad \rho(\sigma_2) = \begin{pmatrix} 1 & 0 \\ t & -t \end{pmatrix}.$$ For a real parameter $t \ge 0$, consider the matrix $M(t) = \rho(\sigma_1 \sigma_2 \sigma_1)$, the image of the braid word $\sigma_1 \sigma_2 \sigma_1$ under the Burau representation. Define the function $f(t)$ as the sum of the absolute values of the eigenvalues of $M(t)$, where the eigenvalues are computed over the field $\mathbb{C}$ (treating $t$ as a real number). Determine the function $f(t)$ explicitly as a piecewise function for $t \ge 0$.
</question>

(3) For the topic "Numerical analysis >> Numerical linear algebra >> Numerical solutions to overdetermined systems, pseudoinverses", the capability "Formulating a regression as $\min_x \frac{1}{2}\|Ax-b\|^2$, i.e. building the design matrix $A$, response vector $b$, parameter vector $x$", and answer type 'Matrix: A rectangular array of numbers, symbols, or expressions, arranged in rows and columns, that is used to represent and manipulate data or mathematical relationships (e.g., $f(\begin{bmatrix} x \\ y \end{bmatrix}) = S + \frac{\|G-S\|}{\|F-S\|} R (\begin{bmatrix} x \\ y \end{bmatrix} - S)$). It should be formulated in latex-compatible format, using symbols such as "\begin{pmatrix}", "\end{pmatrix}". This also includes vectors, which can be represented as a single-column or single-row matrix.':
<question>
Formulate the data fitting problem as a least squares problem $$\min_{x \in \mathbb{R}^n} f(x) = \frac{1}{2}\|Ax-b\|_2^2$$ for the model $$y(\theta; t) = \theta_1 + \theta_2 t + \theta_3 t^3$$ and measurement equation $$y_i = y(\theta; t_i) + e_i, \quad i = 1, 2, \ldots, m,$$ where $e_i \sim N(0, \sigma^2)$. Write expressions for $A$, $b$, $x$ and $n$.
</question>

### Format Requirements:
(1) Before writing the question, you should think step by step to sketch how the question should be formulated.
(2) After you think step by step, you should write "<question>" to denote that you will generate the question. Your generation will be considered invalid if you do not generate any thinking before "<question>".
(3) The question should not be a compositional problem. In other words, it should not consist of multiple sub-problems, but should instead require only a single decisive answer based on the requested answer format.
(4) After you are done writing the question, write "</question>" to denote that you are done generating the question.
(5) You should not include any explanations or other text between "<question>" and "</question>".
Also, "<question>" and "</question>" should occur only once in your generation (i.e., when you are proposing the question after your thinking). If not, your generation will be treated as invalid.

### Properties of a Good Question:
Here are the guidelines that define what a good question is:
(1) The question should be reasoning-intensive instead of a concept-checking question: Assume that the solver already has profound knowledge of the provided topic. The question should be challenging enough for the solver. Specifically, if the question simply requires standard calculation, combines well-known materials, and does not implicitly require any tricks or challenges to solve, the question should be considered a very bad question. The question should require deep reasoning in order to derive a decisive answer.
(2) The question should not be ill-defined: The question should provide only the necessary conditions or information to start reasoning.
(3) The question should not consist of multiple sub-problems: Instead of asking for multiple answers, the question should naturally require going through multiple steps implicitly in order to derive the final answer.
(4) The question should not require looking at images.
(5) Don't make the problem too simple by providing hints: The question should not explicitly provide the idea of how to solve the problem, and it also shouldn't provide so much detail that it could serve as a hint. Also, do not provide too much detail on the answer format (e.g., how it should be structured).
(6) The question should be grounded in the provided materials: The question should be relevant to the provided topic, require the provided capability in order to solve it, and require the provided answer format as output.

### Guidelines for Thinking before Proposing the Question:
You should think step by step before actually writing the question.
Here are some guidelines:
- Sketch multiple candidate problems: The problems you provide should be relevant to the given topic while inherently requiring the provided capability. While doing this, note that the questions should ask for the given answer format.
- Attempt to solve the questions you wrote: This will give you a sense of which problems are too easy, which are too hard, and which are ill-defined (e.g., the setting is too synthetic or the requirement is not clear). A good problem should require thinking step by step in order to derive the final answer.
- Ensure Uniqueness and Non-Triviality: Make sure the question is not a direct copy of standard textbook examples, and that it requires more than rote memorization or a single mechanical step.
- Check for Ambiguity: Ensure that all terms, variables, and requirements are clearly defined so that the problem cannot be misinterpreted.
- Balance Between Simplicity and Depth: The problem should not be so simple that it can be answered without any real thought, nor so complex that it becomes overwhelming or distracts from the core concept.
- Consider Edge Cases: Think about whether the problem could be improved by including or excluding special cases (e.g., repeated $x$ values, missing data, etc.), but avoid unnecessary complications unless they serve a pedagogical purpose.
- Revise or filter to choose the final question: Among the candidates you thought of, you could add more conditions, change the values or expressions to deliberately make the problem high-quality, or discard questions that you think are not good questions based on your thinking.
- Final check: Again, ensure that the question you propose is relevant to the given topic, requires the specified capability, and must be answered in the specified format.

Always note that it is better if your thinking is as detailed as possible, while the question itself shouldn't be overly verbose.
### Question Proposal:
Now it is your turn to propose a question based on the following inputs:
Topic: [HERE_IS_THE_TOPIC]
Description of the capability: [HERE_IS_THE_DOMAIN]
Answer Format: [HERE_IS_THE_ANSWER_TYPE]

Prompt for Revising Initial Problem Statement (last step of Problem Statement Generation)

### Task Description:
You are given a simple question, a topic in mathematics or the natural sciences, a description of a capability, and a required answer format. It is very likely that the question is a straightforward, textbook-level question for experts who have a lot of knowledge in this domain. Your task is to propose a harder variant of the question that even experts might need to think about for a long time, based on the provided materials. Specifically, the question should be relevant to the given topic and original question, require the specified capability while reasoning to solve the question, and must be answered in the specified answer format.

### Examples:
Below are some examples of challenging questions. These are not concept-checking questions, but instead require extensive reasoning:

(1) <question>
The first excited state of the three-dimensional isotropic harmonic oscillator (of natural angular frequency $\omega_0$ and mass $m$) is three-fold degenerate. Use the perturbation method to calculate the splitting (to the first order) of this three-fold degenerate state due to a small perturbation of the form $H' = bxy$, where $b$ is a constant. Give the first-order wave functions of the three split levels in terms of the wave functions of the unperturbed three-dimensional harmonic oscillator, given that, for a one-dimensional harmonic oscillator, $$\langle n|x|n+1\rangle = \sqrt{\frac{(n+1)\hbar}{2m\omega_0}}.$$
</question>

(2) <question>
Consider the braid group $B_3$, the group of braids on three strands, with the standard presentation given by generators $\sigma_1, \sigma_2$ and the relation $\sigma_1 \sigma_2 \sigma_1 = \sigma_2 \sigma_1 \sigma_2$.
The Burau representation of $B_3$ is a homomorphism $\rho: B_3 \to GL_2(\mathbb{Z}[t, t^{-1}])$, where $\sigma_1$ and $\sigma_2$ are mapped to the matrices: $$\rho(\sigma_1) = \begin{pmatrix} -t & 1 \\ 0 & 1 \end{pmatrix}, \qquad \rho(\sigma_2) = \begin{pmatrix} 1 & 0 \\ t & -t \end{pmatrix}.$$ For a real parameter $t \ge 0$, consider the matrix $M(t) = \rho(\sigma_1 \sigma_2 \sigma_1)$, the image of the braid word $\sigma_1 \sigma_2 \sigma_1$ under the Burau representation. Define the function $f(t)$ as the sum of the absolute values of the eigenvalues of $M(t)$, where the eigenvalues are computed over the field $\mathbb{C}$ (treating $t$ as a real number). Determine the function $f(t)$ explicitly as a piecewise function for $t \ge 0$.
</question>

(3) <question>
Let $a+bi \in \mathbb{Z}[i] \setminus \{0\}$ and suppose that $2^j \| (a+bi)$ (i.e., $2^j$ is the exact power of 2 dividing $a+bi$). Also, let $n$ be the smallest integer such that $$\max\left(\left|\frac{a}{2^j}\right|, \left|\frac{b}{2^j}\right|\right) + 2 \le w_n.$$ Determine the value of $$w_{\mathbb{Z}[i]}(a+bi)$$ in terms of $n$ and $j$, depending on whether $$\left|\frac{a}{2^j}\right| + \left|\frac{b}{2^j}\right| + 3 \le w_{n+1}$$ or not. Please provide your solution in the form of a piecewise function using "\begin{cases} ... \end{cases}".
</question>

(4) <question>
Assume that the potential between two point charges $q_1$ and $q_2$ is not the Coulomb potential $A q_1 q_2 / r$, but instead the Yukawa potential $A q_1 q_2 \mathrm{e}^{-Kr} / r$, where $A$ and $K$ are constants, and $K = 0^+$. Under this condition, determine how the Poisson equation for the electrostatic potential would be modified. Please provide your solution in the form of an equation.
</question>

(5) <question>
Suppose there are two kinds of E. coli (bacteria), 'red' ones and 'green' ones. Each reproduces faithfully (no sex) by splitting into half, red $\to$ red + red or green $\to$ green + green, with a reproduction time of 1 hour. Other than the markers 'red' and 'green', there are no differences between them. A colony of 5,000 'red' and 5,000 'green' E.
coli is allowed to eat and reproduce. In order to keep the colony size down, a predator is introduced which keeps the colony size at 10,000 by eating bacteria at random. After a very long time, what is the probability distribution of the number of red bacteria? Please provide your solution in the form of an equation.
</question>

(6) <question>
$$\max f = 3x_1 + 4x_2 + 4x_3$$ $$\text{s.t.} \begin{cases} 6x_1 + 3x_2 + 4x_3 \le [1200, 100] \\ 5x_1 + 4x_2 + 5x_3 \le [1550, 200] \\ x_1, x_2, x_3 \ge 0 \end{cases}$$ What are the fuzzy optimal solution and fuzzy optimal value?
</question>

### Format Requirements:
(1) Before writing the question, you should think step by step to sketch how the question should be formulated.
(2) After you think step by step, you should write "<question>" to denote that you will generate the question. Your generation will be considered invalid if you do not generate any thinking before "<question>".
(3) The question should not be a compositional problem. In other words, it should not consist of multiple sub-problems, but should instead require only a single decisive answer based on the requested answer format.
(4) After you are done writing the question, write "</question>" to denote that you are done generating the question.
(5) You should not include any explanations or other text between "<question>" and "</question>". Also, "<question>" and "</question>" should APPEAR ONLY ONCE in your generation (i.e., when you are proposing the question after your thinking). If not, your generation will be treated as invalid.

### Properties of a Good Question:
Here are the guidelines that define what a good question is:
(1) The question should be reasoning-intensive instead of a concept-checking question: Assume that the solver already has profound knowledge of the provided topic. The question should be challenging enough for the solver.
Specifically, if the question simply requires standard calculation, combining well-known materials, and does not implicitly require any tricks or challenges to solve, the question should be considered a very bad question. The question should require deep reasoning in order to derive a decisive answer. (2) The question should not be ill-defined: The question should provide only the necessary conditions or information to start reasoning. (3) The question should not consist of multiple sub-problems: Instead of asking for multiple answers, the question should naturally require going through multiple steps implicitly in order to derive the final answer. (4) The question should not require looking at images. (5) Don't make the problem too simple by providing hints: The question should not explicitly provide the idea of how to solve the problem, and it also shouldn't provide too many details that could serve as a hint. Also, do not provide too many details on the answer format (e.g., how it should be structured). (6) The question should be grounded in the provided materials: The question should be relevant to the provided topic, require the provided capability in order to solve it, and require outputting the provided answer format. Yet, it is important that the capability is not explicitly mentioned in the revised question.

### Guidelines for Thinking to Make the Question Harder:
You should deliberately increase the complexity and depth of the question so that it cannot be solved by a single mechanical step or by recalling a standard example. Here are some guidelines:
- Introduce Non-Obvious Constraints or Twists: Add conditions or constraints that are not immediately apparent or that force the solver to reconsider their initial approach. For example, require the answer to satisfy an additional property, or introduce a subtle exception.
- Incorporate Realistic or Contextual Elements: Frame the problem in a context that requires interpretation or modeling, rather than just computation. This could involve a real-world scenario, or data that must be interpreted before applying mathematical tools.
- Consider Edge Cases and Exceptions: Deliberately include or highlight edge cases (implicitly, not explicitly!) that might trip up a solver who is not thinking carefully, but avoid making the problem artificially tricky.
- Avoid Direct Imitation of Textbook Problems: Ensure that the question is not a direct copy or minor variation of a standard example. Instead, combine elements from different problems, or introduce a novel twist.
- Test Your Own Problem: Attempt to solve the question yourself. If you find that it requires you to pause, plan, and reason through multiple steps, it is likely at the right level. If it feels too direct, add another layer of complexity.
- Final Review for Clarity and Depth: Make sure the problem is still clearly stated, with all terms and requirements defined, but that it cannot be solved without genuine reasoning. The question should be challenging but fair, and the answer format should be specified.

Always note that it is better if your thinking to propose a harder question is as detailed as possible. Also, don't mention "<question>" in your thinking before actually proposing the new question.

### Question Proposal:
Now it is your turn to propose a harder question based on the following inputs:
Topic: [HERE_IS_THE_TOPIC]
Description of the capability: [HERE_IS_THE_DOMAIN]
Answer Format: [HERE_IS_THE_ANSWER_TYPE]
Original Question: [HERE_IS_THE_ORIGINAL_QUESTION]

Prompt for Filtering out Invalid Problem Statements

### Task Description:
You are given a question, a topic in mathematics or the natural sciences, and a required answer format. Your task is to check whether the question is valid or not.
Read through the following instructions and answer with either "This is problematic" or "Looks good to me":

### Guidelines for determining the validity of a question:
1. First, determine if the question requires outputting the answer type provided below. It may or may not explicitly mention generating in such an answer type; what you should determine is whether it is a totally irrelevant question or not. Additionally, questions that do not ask to provide an answer and only provide statements or conditions should be considered invalid.
2. Next, determine if the question is a compositional question or not. A compositional question refers to a question that consists of multiple sub-questions, or a question that asks to provide at least two answers at once. This differs from a question that asks to provide a single answer which is composed of multiple parts. The rule of thumb is that if it asks to provide more than two answers, it should be determined to be a compositional question.
3. Lastly, determine if the answer is accidentally included within the problem or not. This also includes explicitly providing guidelines on how to solve the problem, which would make the problem overly easy.

### Format Requirements:
(1) After you think step by step, you should write "<decision>" to denote that you will provide your decision of whether the question is valid or not. (2) Write "This is problematic" if there is at least one issue with the question and "Looks good to me" if there are none. (3) After providing your decision, write "</decision>" to denote that you are done providing your decision. (4) All together, you should write either "<decision> This is problematic </decision>" or "<decision> Looks good to me </decision>".
### Question:
Now it is your turn to assess the question based on the following inputs:
Question: [HERE_IS_THE_QUESTION]
Answer Format: [HERE_IS_THE_ANSWER_TYPE]

Algorithm 1 Conservative Equivalence Checking for Majority Voting (Principia Collection Construction)

Require: Predictions $A = \{a_1, \ldots, a_n\}$; model-based verifier $V$; agreement threshold $\tau = 0.6$; majority threshold $\theta = 5/8$
1: Notation: unique($A$) returns the set $U$ of distinct predictions in $A$; $|U|$ is the number of uniques; $|A|$ is the total number of predictions (e.g., $|A| = 8$).
Pairwise Equivalence Checking
2: $U \leftarrow$ unique($A$); $n_U \leftarrow |U|$
3: for $i \leftarrow 1$ to $n_U$ do
4:   for $j \leftarrow i + 1$ to $n_U$ do
5:     evaluate $V(u_i, u_j) \in \{\text{True}, \text{False}\}$ and record in matrix $V[u_i, u_j]$
6:   end for
7: end for
Measuring Transitivity Among Elements
8: if $n_U > 2$ then
9:   for all ordered pairs $(x, y) \in U \times U$, $x \neq y$ do
10:    agree$(x, y) \leftarrow \frac{1}{n_U - 2} \sum_{w \in U \setminus \{x, y\}} \mathbb{1}\left[V(x, w) = V(y, w)\right]$
11:    if $V(x, y) = \text{True}$ and agree$(x, y) < \tau$ then
12:      $V(x, y) \leftarrow \text{False}$  ▷ break weak links inconsistent with transitivity
13:    end if
14:  end for
15: end if
Grouping by Equivalence (DSU)
16: DSU (Disjoint-Set Union): maintain parent[·] with Find (path compression) and Union.
17: initialize DSU on $U$; cnt$[u] \leftarrow |\{a \in A : a = u\}|$  ▷ vote multiplicity per unique $u$
18: for all $(x, y)$ with $V(x, y) = \text{True}$ do
19:   Union$(x, y)$  ▷ merge $x, y$ in DSU
20: end for
21: $G \leftarrow$ DSU groups via Find
22: for all $g \in G$ do
23:   $c(g) \leftarrow \sum_{u \in g} \text{cnt}[u]$  ▷ class vote mass
24:   rep$(g) \leftarrow \arg\min_{u \in g} \text{len}(\text{string}(u))$ if $|g| \leq 2$; the element in $g$ whose length is closest to the median if $|g| \geq 3$
25: end for
Majority Decision (at least 5 of 8 votes)
26: if $G \neq \emptyset$ then
27:   $g^\star \leftarrow \arg\max_{g \in G} c(g)$; $a^\star \leftarrow$ rep$(g^\star)$
28:   if $c(g^\star) \geq \lceil \theta |A| \rceil$ then
29:     return $a^\star$ (majority vote), $c(g^\star)$, refined map $V$  ▷ $a^\star$ is chosen if the majority count is beyond the threshold
30:   else
31:     return "[NO MAJORITY]", $c(g^\star)$, refined map $V$
32:   end if
33: else
34:   return "[NO VALID PREDICTIONS]", 0, refined map $V$
35: end if

Prompt Template for Pointwise Reference-free LMs-as-RMs
You are given a user question and a response from an AI assistant.
Your task is to act as an impartial judge and evaluate how well the response fulfills the user's instructions. You will be shown multiple responses to the same prompt, but only one at a time. Evaluate each response independently. Think carefully about how to assess the quality of the response and assign the assistant's response a score of 1 if the response is correct, and 0 if not. Enclose the score within <score> and </score> tags.
Format your output like this: <think> your_thinking_process </think> <score> 0 or 1 </score>
Below are the user's question and the assistant's response:
[User Question] instruction
[The Start of the Assistant's Answer] response [The End of the Assistant's Answer]

Figure 24 Prompt template for pointwise reference-free LMs-as-RMs.

5 Prompt Templates

Figure 24, Figure 25, and Figure 26 show the prompt templates for training pointwise reference-free, pairwise, and pointwise reference-based LMs-as-RMs, respectively.

Prompt Template for Pairwise LMs-as-RMs for Non-verifiable Tasks
You are given a user question and two responses from two AI assistants. Your task is to act as an impartial judge and evaluate which response better follows the user's instructions and provides a higher-quality answer. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. Think carefully about how to assess the quality of the responses and assign each response a score from 0 to 10, using either an integer or a decimal with up to 0.1 precision, with a higher score indicating a higher-quality response that better satisfies the criteria. Enclose the scores within the tags <score_A> </score_A> and <score_B> </score_B>.
Format your output like this: <think> your_thinking_process </think> <score_A> your_score_a </score_A> <score_B> your_score_b </score_B>
Below are the user's question and the two responses:
[User Question] instruction
[The Start of Assistant A's Answer] response A [The End of Assistant A's Answer]
[The Start of Assistant B's Answer] response B [The End of Assistant B's Answer]

Figure 25 Prompt template for pairwise LMs-as-RMs for non-verifiable tasks. Note that unlike verifiable tasks, here we ask the model to assign a score between 0 and 10 to capture the fine-grained quality.

Table 17 Comparison of different post-trained Qwen3-8B (Instruct) models using RLLM, RLVR, or RLHF, showing generalization of our RLLM recipe to a stronger 8B policy model. All models are evaluated in thinking mode.

| Method | RM/Verifier | RM Type | RM Size | AIME24 | AIME25 | BRUMO25 | HMMT24 | HMMT25 | Math Avg. (easy-to-verify) | Physics (hard-to-verify) |
|---|---|---|---|---|---|---|---|---|---|---|
| *Policy trained on hard-to-verify samples w/o reference answer* | | | | | | | | | | |
| Qwen3-8B | – | – | – | 76.00 | 67.30 | 68.75 | 35.41 | 41.25 | 57.74 | 51.70 |
| RLHF | Nexusflow/Athene-RM-8B | Scalar | 8B | 73.34 | 60.42 | 65.00 | 37.91 | 38.33 | 55.00 | 50.35 |
| RLVR | nvidia/AceMath-7B-RM | Scalar | 7B | 72.91 | 65.00 | 65.41 | 33.75 | 42.49 | 55.91 | 50.76 |
| RLLM | Qwen3-8B | Generative | 8B | 75.84 | 70.00 | 69.58 | 50.00 | 43.33 | 61.75 | 54.32 |
| RLLM | J1-Qwen3-32B | Generative | 32B | 73.32 | 70.43 | 73.74 | 43.30 | 50.82 | 62.32 | 54.17 |

Table 18 Comparison of different post-trained Qwen3-1.7B (Instruct) models using RLLM, RLVR, or RLHF. All models are evaluated in non-thinking mode. Similar to our conclusions with results in thinking mode, RLLM outperforms RLHF and RLVR.
| Method | RM/Verifier | RM Type | RM Size | AIME24 | AIME25 | BRUMO25 | HMMT24 | HMMT25 | Math Avg. (easy-to-verify) | Physics (hard-to-verify) |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B | – | – | – | 13.33 | 10.83 | 16.04 | 6.25 | 5.21 | 10.33 | 8.75 |
| *Policy trained on hard-to-verify samples w/o reference answer* | | | | | | | | | | |
| RLHF | Nexusflow/Athene-RM-8B | Scalar | 8B | 5.84 | 6.25 | 12.08 | 2.28 | 1.65 | 5.62 | 7.95 |
| RLHF | Skywork-Reward-V2-Llama-3.1-8B | Scalar | 8B | 21.87 | 20.21 | 21.87 | 12.92 | 10.00 | 17.37 | 10.00 |
| RLHF | nvidia/AceMath-7B-RM | Scalar | 7B | 20.21 | 16.87 | 18.96 | 7.72 | 5.84 | 13.92 | 10.45 |
| RLLM | J1-Qwen3-32B-RM | Generative | 32B | 39.58 | 35.21 | 43.97 | 20.21 | 17.51 | 31.30 | 16.02 |
| ∆ w/ Qwen3-1.7B | – | – | – | +26.25 | +24.38 | +27.93 | +13.96 | +12.30 | +20.97 | +7.27 |
| *Policy trained on hard-to-verify samples w/ reference answer* | | | | | | | | | | |
| RLVR | Math-Verify | Rule-based | – | 12.51 | 12.71 | 20.00 | 8.12 | 7.93 | 12.25 | 8.75 |
| RLVR | TIGER-Lab/general-verifier | Generative | 1.5B | 17.71 | 15.41 | 18.97 | 7.92 | 7.92 | 13.58 | 11.93 |
| RLLM | J1-Qwen3-32B-Verifier | Generative | 32B | 35.41 | 30.01 | 41.04 | 20.00 | 15.00 | 28.29 | 14.36 |

Table 19 Comparison of different post-trained OctoThinker-8B-Hybrid-Base models using RLLM, RLVR, or RLHF. This shows the generalization of RLLM to Llama-based models.

| Method | RM/Verifier | RM Type | RM Size | AIME24 | AIME25 | HMMT24 | HMMT25 | BRUMO25 | MATH500 | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| OctoThinker-8B-Hybrid-Base | – | – | – | 2.49 | 2.70 | 4.30 | 0.00 | 0.00 | 52.36 | 10.32 |
| *Policy trained on hard-to-verify samples w/o reference answer* | | | | | | | | | | |
| RLHF | Skywork-Reward-V2-Llama-3.1-8B | Scalar | 8B | 7.30 | 3.74 | 6.88 | 3.30 | 0.00 | 60.04 | 13.54 |
| RLLM | Qwen3-1.7B (prompted) | Generative | 1.7B | 7.93 | 4.37 | 9.79 | 6.70 | 0.83 | 60.86 | 15.08 |
| ∆ w/ OctoThinker-8B-Hybrid-Base | – | – | – | +5.44 | +1.67 | +5.49 | +6.70 | +0.83 | +8.50 | +4.76 |
| *Policy trained on hard-to-verify samples w/ reference answer* | | | | | | | | | | |
| RLVR | Math-Verify | Rule-based | – | 3.31 | 1.45 | 6.67 | 3.30 | 0.41 | 55.64 | 11.80 |
| RLVR | TIGER-Lab/general-verifier | Generative | 1.5B | 2.29 | 1.24 | 5.00 | 3.30 | 0.00 | 47.67 | 9.92 |
| RLLM | Qwen3-1.7B (prompted) | Generative | 1.7B | 10.84 | 7.08 | 7.52 | 3.30 | 0.00 | 61.75 | 15.08 |
| ∆ w/ OctoThinker-8B-Hybrid-Base | – | – | – | +8.35 | +4.38 | +3.22 | +3.30 | +0.00 | +9.39 | +4.76 |

Table 20 Comparison of Win Rate (WR) and Length-Controlled Win Rate (LCWR) of RLLM and RLHF on non-verifiable instruction-following tasks when training a Qwen3-8B policy.
For AlpacaEval 2.0, we use GPT-4o as the evaluator, and for ArenaHard 2.0, we use GPT-4.1 as the evaluator. RLLM matches or outperforms RLHF, obtaining the best win rates on the hard prompts of ArenaHard 2.0.

| Method | RM | RM Type | RM Size | AlpacaEval 2.0 (LCWR / WR) | ArenaHard 2.0 (Hard Prompts) | ArenaHard 2.0 (Creative Writing) |
|---|---|---|---|---|---|---|
| *Policy trained on non-verifiable WildChat samples in thinking mode* | | | | | | |
| Qwen3-8B | – | – | – | 63.1 / 65.1 | 22.4 (-1.5 / +1.7) | 33.3 (-2.5 / +2.7) |
| RLHF | Nexusflow/Athene-RM-8B | Scalar | 8B | 70.9 / 71.2 | 25.4 (-1.6 / +1.6) | 57.9 (-2.8 / +2.0) |
| RLHF | Skywork-Reward-V2-Llama-3.1-8B | Scalar | 8B | 68.6 / 72.3 | 26.4 (-1.6 / +1.7) | 50.9 (-2.9 / +2.9) |
| RLLM | Qwen3-8B (prompted) | Generative | 8B | 71.4 / 77.1 | 32.7 (-1.6 / +1.5) | 61.9 (-2.3 / +2.3) |

Prompt Template for Pointwise Reference-based LMs-as-RMs
You are given a user question, a reference answer, and a response from an AI assistant. Your task is to act as an impartial judge and evaluate how well the response fulfills the user's instructions. You will be shown multiple responses to the same prompt, but only one at a time. Evaluate each response independently. Think carefully about how to assess the quality of the response and assign the assistant's response a score of 1 if the response is correct, and 0 if not. Enclose the score within <score> and </score> tags.
Format your output like this: <think> your_thinking_process </think> <score> 0 or 1 </score>
Below are the user's question and the assistant's response:
[User Question] instruction
[Reference Answer] reference_answer
[The Start of the Assistant's Answer] response [The End of the Assistant's Answer]

Figure 26 Prompt template for pointwise reference-based LMs-as-RMs, where an additional reference answer is provided compared to the reference-free version. Note that the reference answer is only provided as additional context and the LM-as-RM is still prompted to evaluate the entire response.
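These judging templates supply the pairwise equivalence signal that Algorithm 1 turns into a majority vote. The Python sketch below is a simplified illustration, not the paper's implementation: the model-based verifier is stubbed with an arbitrary symmetric equivalence function, and the group representative is simply the shortest string rather than Algorithm 1's median-length rule for larger groups.

```python
import math
from collections import Counter
from fractions import Fraction

def majority_vote(preds, verify, tau=0.6, theta=5 / 8):
    """Simplified sketch of Algorithm 1: conservative equivalence
    checking over the predictions, then a thresholded majority vote."""
    uniq = list(dict.fromkeys(preds))   # distinct predictions, first-seen order
    cnt = Counter(preds)                # vote multiplicity per unique prediction
    n_u = len(uniq)

    # Pairwise equivalence matrix (stored symmetrically).
    V = {}
    for i in range(n_u):
        for j in range(i + 1, n_u):
            eq = verify(uniq[i], uniq[j])
            V[(uniq[i], uniq[j])] = V[(uniq[j], uniq[i])] = eq

    # Transitivity filter: break "True" links whose agreement with the
    # remaining predictions falls below the threshold tau.
    if n_u > 2:
        for x in uniq:
            for y in uniq:
                if x == y or not V[(x, y)]:
                    continue
                others = [w for w in uniq if w not in (x, y)]
                agree = sum(V[(x, w)] == V[(y, w)] for w in others) / len(others)
                if agree < tau:
                    V[(x, y)] = V[(y, x)] = False

    # Group surviving equivalences with union-find (path halving).
    parent = {u: u for u in uniq}
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    for (x, y), eq in V.items():
        if eq:
            parent[find(x)] = find(y)
    groups = {}
    for u in uniq:
        groups.setdefault(find(u), []).append(u)

    # Majority decision: the winning class needs >= ceil(theta * |A|) votes.
    best = max(groups.values(), key=lambda g: sum(cnt[u] for u in g))
    mass = sum(cnt[u] for u in best)
    if mass >= math.ceil(theta * len(preds)):
        return min(best, key=len), mass   # shortest string as representative
    return "[NO MAJORITY]", mass

# Toy run with exact numeric equivalence standing in for the LM verifier:
preds = ["1/2", "0.5", "1/2", "1/2", "0.5", "3", "1/2", "0.5"]
winner, votes = majority_vote(preds, lambda a, b: Fraction(a) == Fraction(b))
```

In the toy run, four votes for "1/2" and three for "0.5" merge into one equivalence class with vote mass 7, clearing the ceil(5/8 * 8) = 5 threshold, so a representative of that class is returned.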
Pointwise LM-as-RM Training Example (Correct)

Question: A conducting cylinder of radius $a$ and finite length $L$ ($0 \leq z \leq L$) is filled with the steady helical flow
$$\mathbf{U}(r, \theta, z) = U_0 \hat{\mathbf{z}} + \Omega r \hat{\boldsymbol{\theta}}, \qquad 0 \leq r \leq a.$$
The magnetic field obeys the induction equation with magnetic diffusivity $\eta$. Assume that the end caps at $z = 0$ and $z = L$ are perfectly conducting, so that the axial dependence of any normal-mode perturbation must satisfy the boundary condition $B = 0$ at the end caps. Consequently the axial wavenumber is quantised as $k_n = n\pi/L$, $n = 1, 2, \ldots$. Adopt the normal-mode ansatz $\mathbf{B}(r, \theta, z, t) = \mathbf{b}(r)\, e^{im\theta + ik_n z + \sigma t}$, and focus on the most unstable azimuthal mode $m = 1$. For this mode the growth rate can be written in the form
$$\sigma = \frac{U_0}{a}\left(\mathrm{Rm}\, F(q, k_n a) - \lambda_1^2\right),$$
where $\mathrm{Rm} = U_0 a / \eta$ is the magnetic Reynolds number, $q = \Omega a / U_0$ is the dimensionless pitch, $\lambda_1 \approx 3.8317$ is the first zero of the Bessel function $J_1$, and $F(q, k_n a)$ is a positive dimensionless function that satisfies $0 < F \leq 1$ for all admissible $q$ and $k_n$. Derive the necessary condition for dynamo action (i.e., $\sigma > 0$) that must hold for every admissible axial mode $n$. Express your result as a single inequality involving only the magnetic Reynolds number $\mathrm{Rm}$, the constant $\lambda_1$, and the geometric ratio $a/L$.

Reference Answer: $\mathrm{Rm} > \lambda_1^2$

Model Response: To determine the necessary condition for dynamo action, we begin by analyzing the given expression for the growth rate: $\sigma = \frac{U_0}{a}\left(\mathrm{Rm}\, F(q, k_n a) - \lambda_1^2\right)$. For the magnetic field to grow (i.e., to exhibit dynamo action), the growth rate $\sigma$ must be positive. Therefore: $\sigma > 0 \Rightarrow \mathrm{Rm}\, F(q, k_n a) > \lambda_1^2$. Thus, the necessary condition for dynamo action is: $\mathrm{Rm} > \frac{\lambda_1^2}{F(q, k_n a)}$.
. . . [analysis of the function $F(q, k_n a)$ and worst-case scenarios] . . .
To ensure that $\mathrm{Rm} > \frac{\lambda_1^2}{F(q, k_n a)}$ holds for all $n$, we must consider the worst-case value of $F(q, k_n a)$. Since $F(q, k_n a)$ is bounded above by 1, the maximum of $\frac{\lambda_1^2}{F(q, k_n a)}$ over all $n$ occurs when $F(q, k_n a) = 1$.
Therefore, the necessary condition for dynamo action is: $\mathrm{Rm} > \lambda_1^2$.

GPT-OSS-120B Judgment: The response derives the same inequality as the reference answer and correctly concludes that the most restrictive necessary condition is obtained when $F = 1$. Both answers are equivalent. CORRECT

Figure 27 Example of a synthetically-generated correct training example for LM-as-RM training. Given a (question, reference answer), we sample a response from the policy and check whether the answer matches the reference answer using GPT-OSS-120B.

Pointwise LM-as-RM Training Example (Incorrect)

Question: A conducting cylinder of radius $a$ and finite length $L$ ($0 \leq z \leq L$) is filled with the steady helical flow
$$\mathbf{U}(r, \theta, z) = U_0 \hat{\mathbf{z}} + \Omega r \hat{\boldsymbol{\theta}}, \qquad 0 \leq r \leq a.$$
The magnetic field obeys the induction equation with magnetic diffusivity $\eta$. Assume that the end caps at $z = 0$ and $z = L$ are perfectly conducting, so that the axial dependence of any normal-mode perturbation must satisfy the boundary condition $B = 0$ at the end caps. Consequently the axial wavenumber is quantised as $k_n = n\pi/L$, $n = 1, 2, \ldots$. Adopt the normal-mode ansatz $\mathbf{B}(r, \theta, z, t) = \mathbf{b}(r)\, e^{im\theta + ik_n z + \sigma t}$, and focus on the most unstable azimuthal mode $m = 1$. For this mode the growth rate can be written in the form
$$\sigma = \frac{U_0}{a}\left(\mathrm{Rm}\, F(q, k_n a) - \lambda_1^2\right),$$
where $\mathrm{Rm} = U_0 a / \eta$ is the magnetic Reynolds number, $q = \Omega a / U_0$ is the dimensionless pitch, $\lambda_1 \approx 3.8317$ is the first zero of the Bessel function $J_1$, and $F(q, k_n a)$ is a positive dimensionless function that satisfies $0 < F \leq 1$ for all admissible $q$ and $k_n$. Derive the necessary condition for dynamo action (i.e., $\sigma > 0$) that must hold for every admissible axial mode $n$. Express your result as a single inequality involving only the magnetic Reynolds number $\mathrm{Rm}$, the constant $\lambda_1$, and the geometric ratio $a/L$.

Reference Answer: $\mathrm{Rm} > \lambda_1^2$

Model Response: To determine the necessary condition for dynamo action, we analyze the growth rate.
For dynamo action, we require $\sigma > 0$:
$$\sigma = \frac{U_0}{a}\left(\mathrm{Rm}\, F(q, k_n a) - \lambda_1^2\right) > 0.$$
Rewriting: $\mathrm{Rm}\, F(q, k_n a) > \lambda_1^2$. So, for every $n$, we must have: $\mathrm{Rm} > \frac{\lambda_1^2}{F(q, k_n a)}$. To ensure this is true for all $n$, we must consider the worst-case scenario, i.e., the minimum value of $F(q, k_n a)$, since it will make $\frac{\lambda_1^2}{F(q, k_n a)}$ as large as possible.
. . . [analysis of the function $F(q, k_n a)$ and geometric considerations] . . .
The function $F(q, k_n a)$ is positive and less than or equal to 1 for all admissible $q$ and $k_n$. The problem explicitly states that we are to express the necessary condition in terms of $\mathrm{Rm}$, $\lambda_1$, and the geometric ratio $a/L$. The necessary condition for $\sigma > 0$ for every $n$ is:
$$\mathrm{Rm} > \lambda_1^2 \cdot \frac{a}{L}.$$
This is the most restrictive condition, since it ensures that the growth rate is positive for the worst-case value of $F(q, k_n a)$.

Judgment: The reference answer states $\mathrm{Rm} > \lambda_1^2$ with no dependence on the geometric ratio $a/L$. The response incorrectly introduces a factor of $\frac{a}{L}$ in the final answer. This additional factor changes the condition qualitatively and is not present in the reference answer. The response's condition is therefore not equivalent. INCORRECT

Figure 28 Example of a synthetically-generated incorrect training example for LM-as-RM training. Given a (question, reference answer), we sample a response from the policy and check whether the answer matches the reference answer using GPT-OSS-120B.

Example of a hard-to-verify math problem
Question: Find all real polynomials $p(x)$ such that $p(x^2 + 1) = p(x)^2 + 1$ for all $x \in \mathbb{R}$.
Reference Answer: $p_n(x) = ((\cdots((x^2 + 1)^2 + 1)^2 + \cdots)^2 + 1$ (nested $n$ times)

Figure 29 Example of a hard-to-verify math problem for policy training where the reference answer is a mathematical expression.

Example of an easy-to-verify math problem
Question: In $\triangle ABC$ with $BC = a$, $CA = b$, and $AB = c$, given that $(b + c)(b - c) = ca$ and $\angle BAC = 75°$, find $\angle ABC$.
Reference Answer: $70°$

Figure 30 Example of an easy-to-verify math problem for policy training where the reference answer is an integer.

J1-Qwen3-1.7B vs J1-Qwen3-32B Evaluation Traces

Question: Consider two open Heisenberg antiferromagnetic spin-1/2 chains, one of length $N_1 = 100$ and one of length $N_2 = 200$. Using the density matrix renormalization group (DMRG) with three different numbers of kept states $m$, the following total ground-state energies (in units of the exchange constant $J$) and discarded weights $\varepsilon_m$ are obtained:

| Chain length | $m$ | Energy $E_m$ | Discarded weight $\varepsilon_m$ |
|---|---|---|---|
| $N_1 = 100$ | 50 | $-44.12$ | $3.2 \times 10^{-4}$ |
| $N_1 = 100$ | 100 | $-44.23$ | $8.5 \times 10^{-5}$ |
| $N_1 = 100$ | 200 | $-44.28$ | $1.9 \times 10^{-5}$ |
| $N_2 = 200$ | 50 | $-88.30$ | $3.2 \times 10^{-4}$ |
| $N_2 = 200$ | 100 | $-88.44$ | $8.5 \times 10^{-5}$ |
| $N_2 = 200$ | 200 | $-88.51$ | $1.9 \times 10^{-5}$ |

The DMRG algorithm is variational, i.e., each $E_m$ is an upper bound to the exact ground-state energy for the corresponding chain length: $E_0(N) \leq E_m(N)$. In addition, the following rigorous bound on the energy error is known: $|E_m(N) - E_0(N)| \leq K \varepsilon_m$, $K = 1.5 \times 10^4$. The exact ground-state energy per site of the infinite open Heisenberg chain is $e_\infty = -0.4431471806\,J$, and for any open chain of length $N$ the exact energy satisfies the finite-size inequality $N e_\infty \leq E_0(N) \leq N e_\infty + C/N$, $C = 12.0$. Using **all** the information above (variational property, discarded-weight error bound, infinite-chain reference, finite-size bound, and the data for both chain lengths), give a **rigorous interval** that is guaranteed to contain the exact total ground-state energy $E_0(N_1 = 100)$ of the 100-site chain. Express your answer **only** in interval notation (e.g., $[a, b]$) without any additional text or equations.

Response (Truncated): **From DMRG with $m = 200$:** $E_0(100) \in [-44.28 - 0.285, -44.28 + 0.285] = [-44.565, -43.995]$. Intersect with finite-size bound: $[-44.31471806, -44.19471806]$ — [...]
**Final Interval** The **exact** ground-state energy $E_0(100)$ is guaranteed to lie in the **intersection** of the DMRG error bounds and the finite-size bound. This is the **tightest and most rigorous** interval that contains $E_0(100)$: $[-44.31471806, -44.19471806]$

Reference Answer: $[-44.31471806, -44.28]$
Expected Evaluation: Incorrect <score> 0 </score>

J1-Qwen3-1.7B-RM's Incorrect Thinking Trace (Truncated): ...the assistant's final interval is $[-44.31471806, -44.19471806]$, which is the same as the finite-size bound. But the assistant also mentioned that the DMRG results with $m = 50, 100, 200$ give energies $-44.12$, $-44.23$, $-44.28$... So the intersection of those is $[-44.3147, -44.1947]$. So the assistant's final interval is $[-44.3147, -44.1947]$, which is the intersection of the DMRG error bound and the finite-size bound... Therefore, the assistant's answer is correct. So the score should be 1.
Evaluation Score: <score> 1 </score>

J1-Qwen3-32B-RM's Correct Thinking Trace (Truncated): ...the DMRG results provide upper bounds. For example, the DMRG with $m = 200$ gives $E_m = -44.28$, which is an upper bound... Since $E_0 \leq E_{200} = -44.28$, the tightest upper bound is $-44.28$, not $-44.1947$... the exact $E_0(100)$ must lie in $[-44.3147, -44.28]$. Because the upper bound is the minimum of the DMRG upper bounds (which is $-44.28$) and the finite-size upper bound (which is $-44.1947$)... the assistant's interval is too wide. The correct interval should be the intersection of the finite-size lower bound and the DMRG upper bounds... Therefore, the rigorous interval should be $[-44.31471806, -44.28]$. The assistant's answer is wrong.
Evaluation Score: <score> 0 </score>

Figure 31 Qualitative comparison of a weaker LM-as-RM J1-Qwen3-1.7B's thinking trace with a stronger J1-Qwen3-32B's thinking trace. The stronger LM-as-RM correctly identifies the response as incorrect and also specifies the correct answer (which matches the reference answer).
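The bound arithmetic in Figure 31 can be checked in a few lines. The sketch below (variable names are ours, not from the paper) intersects the variational and discarded-weight bounds with the finite-size inequality for the 100-site chain and recovers the reference interval:

```python
# Data for the N = 100 chain from Figure 31: (E_m, eps_m) per kept-state count m.
dmrg = [(-44.12, 3.2e-4), (-44.23, 8.5e-5), (-44.28, 1.9e-5)]
K = 1.5e4                 # rigorous error-bound constant: |E_m - E_0| <= K * eps_m
e_inf = -0.4431471806     # exact energy per site of the infinite chain (units of J)
C, N = 12.0, 100          # finite-size bound: N*e_inf <= E_0 <= N*e_inf + C/N

# Each m gives E_m - K*eps_m <= E_0 <= E_m (variational upper bound).
lo = max(max(E - K * eps for E, eps in dmrg), N * e_inf)
hi = min(min(E for E, _ in dmrg), N * e_inf + C / N)
print(f"[{lo:.8f}, {hi:.2f}]")   # the rigorous interval [-44.31471806, -44.28]
```

The lower bound comes from the finite-size inequality and the upper bound from the variational DMRG energy at m = 200, which is exactly the intersection the stronger judge identifies in its trace.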