
Paper deep dive

On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, Jingren Zhou

Year: 2026 · Venue: arXiv preprint · Area: cs.LG · Type: Preprint · Embeddings: 78

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/26/2026, 2:37:05 AM

Summary

The paper investigates the directional nature of RLVR (Reinforcement Learning with Verifiable Rewards) updates in LLMs, arguing that the signed token-level log-probability difference (Δlog p) is a superior metric for identifying reasoning-critical updates compared to magnitude-based metrics like entropy or KL divergence. The authors demonstrate that RLVR updates are sparse and concentrate on low-probability tokens, and they propose two applications: test-time extrapolation to amplify reasoning-critical policy shifts and training-time reweighting to focus learning on these critical tokens.

Entities (5)

DAPO · algorithm · 100%
GRPO · algorithm · 100%
RLVR · methodology · 100%
Δlog p · metric · 100%
AIME-24 · dataset · 95%

Relation Signals (3)

RLVR → concentrates updates on → low-probability tokens

confidence 95% · RLVR’s policy gradient inherently concentrates updates on rare, low-probability tokens

Δlog p → identifies → reasoning-critical updates

confidence 95% · Δlog p more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics

DAPO → refines → GRPO

confidence 95% · DAPO (Yu et al., 2025) is a state-of-the-art critic-free RLVR algorithm that further refines GRPO.

Cypher Suggestions (2)

Identify algorithms that refine GRPO · confidence 95% · unvalidated

MATCH (a1:Algorithm)-[:REFINES]->(a2:Algorithm {name: 'GRPO'}) RETURN a1.name

Find all metrics used to analyze RLVR updates · confidence 90% · unvalidated

MATCH (m:Metric)-[:USED_IN]->(a:Algorithm {name: 'RLVR'}) RETURN m.name

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log-probability difference Δlog p between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that Δlog p more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned Δlog p direction to improve reasoning accuracy without further training; (2) a training-time reweighting method that focuses learning on low-probability (corresponding to higher Δlog p) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.

Tags

ai-safety (imported, 100%) · cslg (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

77,813 characters extracted from source content.


On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

Qwen Pilot Team, Alibaba Group*

Project Page · GitHub

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log-probability difference Δlog p between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that Δlog p more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned Δlog p direction to improve reasoning accuracy without further training; (2) a training-time reweighting method that focuses learning on low-probability (corresponding to higher Δlog p) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.

1 Introduction

Recent advances have substantially improved the reasoning capabilities of large language models, giving rise to powerful reasoning-centric models such as OpenAI o1 (Jaech et al., 2024), DeepSeek R1 (Guo et al., 2025), Gemini 2.5 (Comanici et al., 2025), and Qwen3 (Yang et al., 2025a).
A key algorithmic driver of this progress is reinforcement learning with verifiable rewards (RLVR) (Guo et al., 2025; Team, 2025; Yang et al., 2025a), which fine-tunes a model's generation policy using feedback from task-specific verifiers, thereby eliciting and amplifying its reasoning ability. To elucidate how RLVR confers its gains, a natural lens is to compare what changes in the final RL-trained model π_RL relative to its base counterpart π_Base (Ren & Sutherland, 2025). Previous analyses have consistently shown that RLVR-induced changes are sparse, impacting only a small subset of tokens in the output sequence. For example, Wang et al. (2025b) associate these changes with high-entropy tokens, Huan et al. (2025) corroborate the sparsity by measuring the KL divergence between π_Base and π_RL, while Yang et al. (2025b) and Deng et al. (2025) attribute the sparsity to selective gradient updates during RLVR training.

However, when studying the difference between base and RLVR models, prior studies primarily emphasize the magnitude of change but largely overlook its direction. As shown in Fig. 1(b), magnitude-based metrics (e.g., entropy, KL divergence) yield nearly identical histograms for the base and final RLVR models, indicating that magnitude alone is insufficient to characterize the transformation from π_Base to π_RL.

* Full author list available in the Contributions section.

arXiv:2603.22117v1 [cs.LG] 23 Mar 2026

Figure 1: (a) Token-level metrics for analyzing RLVR updates. (b) Histograms of each metric on responses generated by base and RLVR models. With a log-scale y-axis, most values concentrate near zero for all metrics, but only Δlog p shows a directional shift distinguishing RLVR from the base model. (c) Token-replacement performance: replacing base tokens with RLVR choices at positions selected by each metric, where Δlog p recovers RLVR performance with the fewest replacements.

To address this gap, we directly quantify directional shifts in the model's distribution using the signed, token-level log-probability difference:

    Δlog p(y_t | x, y_<t) = log π_RL(y_t | x, y_<t) − log π_Base(y_t | x, y_<t),    (1)

which captures how RLVR shifts probability mass on each token: positive values indicate increased probability, negative values decreased probability. As shown in Fig. 1(b), histograms of Δlog p exhibit a clear bimodal pattern with two distinct tails, highlighting a directional signature absent in magnitude-based metrics. This metric can reveal which tokens RLVR prioritizes, such as reasoning-critical tokens (e.g., those enhancing reasoning correctness) versus irrelevant ones. We further validate its utility via a token-replacement intervention (Meng et al., 2026): for each metric, we identify salient positions and replace the base model's tokens with the RLVR model's choices at those positions during generation (cf. Algo. 1). As shown in Fig. 1(c), selecting by Δlog p reaches RLVR-level performance with the fewest substitutions, pinpointing the tokens where RLVR learns reasoning-critical updates. These findings underscore a key principle: analyzing the direction of changes, rather than solely their magnitude, provides deeper insights.
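Given per-token log-probabilities of the same response under both models, the Δlog p diagnostic of Eq. 1 reduces to a per-token subtraction. A minimal sketch with toy numbers (the helper and all values are illustrative, not the paper's implementation):

```python
import math

def delta_log_p(logp_rl, logp_base):
    """Signed, token-level log-probability difference (Eq. 1): positive values
    mark tokens the RLVR model favors, negative values tokens the base favors."""
    return [lr - lb for lr, lb in zip(logp_rl, logp_base)]

# Toy per-token probabilities of one three-token response under each model.
logp_base = [math.log(p) for p in [0.90, 0.05, 0.60]]
logp_rl = [math.log(p) for p in [0.88, 0.40, 0.62]]

dlp = delta_log_p(logp_rl, logp_base)
print([round(d, 3) for d in dlp])  # → [-0.022, 2.079, 0.033]
# Only the rare second token carries a large directional shift (log 8 ≈ 2.079);
# the near-certain tokens barely move, matching the sparsity finding.
```

Note how a magnitude-only view would treat a shift of +2.079 and −2.079 identically, whereas the sign here tells us the RLVR model now favors the token.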
The signed log-probability difference thus provides a practical and effective handle for this diagnostic analysis. Building on this principle, we first propose a test-time augmentation that selectively extrapolates the RLVR policy's distribution along the Δlog p direction for reasoning-critical tokens, amplifying reasoning-related updates and improving accuracy without additional training. Furthermore, we observe that tokens with the largest Δlog p consistently correspond to low-probability tokens during RLVR training. Motivated by this, we design a probability-aware reweighting of policy-gradient advantages, upweighting contributions from low-probability tokens to focus learning on the reasoning-critical positions that Δlog p indicates. This reweighting yields additional gains over current state-of-the-art RLVR methods (e.g., DAPO (Yu et al., 2025)) across diverse benchmarks and models.

In summary, this work introduces a directional diagnostic for analyzing RLVR's effects and, based on these findings, develops two practical strategies for reasoning enhancement: a test-time extrapolation technique and a training-time reweighting method. We hope our work offers a new perspective for analyzing and improving RLVR through the lens of update direction.

2 Preliminaries

Group Relative Policy Optimization (GRPO). GRPO (Shao et al., 2024) is a variant of the milestone policy-gradient algorithm PPO (Schulman et al., 2017). It is adapted for LLM training by eliminating the need for a separate critic model.
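At its core, the critic is replaced by a per-group standardization of verifier rewards (formalized in Eq. 2). A minimal sketch with toy rewards; the small epsilon guard against all-identical groups is an implementation assumption, not from the paper:

```python
def group_relative_advantage(rewards):
    """GRPO's critic-free advantage (Eq. 2): standardize each verifier reward
    within its rollout group of G responses for the same prompt."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]  # eps: all-equal groups

# G = 4 verifier rewards for one prompt (1 = verified correct, 0 = wrong).
adv = group_relative_advantage([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in adv])  # → [1.0, -1.0, -1.0, 1.0]
```

Every token in a response shares its response-level advantage, so correct responses push their tokens up and incorrect ones push theirs down.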
For each QA pair (x, a) sampled from dataset D, GRPO generates a group of G responses {y_i}_{i=1}^G using the old policy π_{θ_old}, computes their rewards {R_i}_{i=1}^G, and estimates the advantage of each response in a group-relative manner:

    Â_{i,t} = (R_i − mean({R_i}_{i=1}^G)) / std({R_i}_{i=1}^G).    (2)

Then the policy π_θ is optimized by maximizing the following objective:

    J_GRPO(θ) = E_{(x,a)∼D, {y_i}_{i=1}^G ∼ π_{θ_old}(·|x)} [ (1/G) Σ_{i=1}^G (1/|y_i|) Σ_{t=1}^{|y_i|} min( r_{i,t}(θ) Â_{i,t}, clip(r_{i,t}(θ), 1−ε, 1+ε) Â_{i,t} ) − β D_KL(π_θ ‖ π_ref) ],    (3)

where r_{i,t}(θ) = π_θ(y_{i,t} | x, y_{i,<t}) / π_{θ_old}(y_{i,t} | x, y_{i,<t}) is the importance-sampling ratio, ε is the clipping range for r_{i,t}(θ), and D_KL(π_θ ‖ π_ref) regularizes the policy to stay close to a reference policy π_ref.

Dynamic Sampling Policy Optimization (DAPO). DAPO (Yu et al., 2025) is a state-of-the-art critic-free RLVR algorithm that further refines GRPO. It introduces several techniques, including a clip-higher mechanism, a dynamic sampling strategy, token-level loss aggregation, overlong punishment, and removal of the KL penalty. DAPO's objective is defined as:

    J_DAPO(θ) = E_{(x,a)∼D, {y_i}_{i=1}^G ∼ π_{θ_old}(·|x)} [ (1 / Σ_{i=1}^G |y_i|) Σ_{i=1}^G Σ_{t=1}^{|y_i|} min( r_{i,t}(θ) Â_{i,t}, clip(r_{i,t}(θ), 1−ε_low, 1+ε_high) Â_{i,t} ) ],
    s.t. 0 < |{y_i : is_equivalent(a, y_i)}| < G.    (4)

Given its success, we adopt DAPO as the primary baseline algorithm for our empirical analysis.

Token-level metrics for RLVR analysis. To study how RLVR turns a base model into its RL-finetuned counterpart, we mainly compare the following token-level metrics:

• Entropy: Wang et al. (2025b) observed that RLVR-induced changes are sparse and tend to concentrate on high-entropy tokens. The token-level entropy is defined as:

    H_π(·|x, y_<t) = E_{y_t ∼ π(·|x, y_<t)}[ −log π(y_t | x, y_<t) ].    (5)

We calculate this entropy for both the RLVR model (H_{π_RL}) and the base model (H_{π_Base}).

• Divergences: Huan et al. (2025) used the KL divergence to quantify the distributional shift, also finding that the changes are sparse. The token-level KL divergence is defined as:

    D_KL(π_RL, π_Base)(·|x, y_<t) = E_{y_t ∼ π_RL(·|x, y_<t)}[ log( π_RL(y_t | x, y_<t) / π_Base(y_t | x, y_<t) ) ].    (6)

We also include its reversed variant D_KL(π_Base, π_RL) and the averaged KL divergence D̄_KL = ½ (D_KL(π_RL, π_Base) + D_KL(π_Base, π_RL)) to avoid asymmetry bias, for a comprehensive analysis.

3 Dissecting the Token-Level Changes Introduced by RLVR

This section dissects the token-level mechanisms through which RLVR training transforms a base model into its fine-tuned counterpart. First, we show that the logp difference (Δlog p, Eq. 1) captures directional shifts in probability mass and separates base from RLVR generations, whereas magnitude-only metrics (entropy/divergence) do not. Second, we conduct a token-replacement experiment to validate that Δlog p more precisely identifies the sparse, reasoning-critical tokens targeted by RLVR. Finally, we provide a gradient analysis showing that RLVR's policy-gradient updates concentrate on low-probability tokens, which explains the sparsity.

[Figure 2 plots AIME24 Avg@32 against the RLVR replace ratio for three model pairs: ORZ-32B (Qwen2.5; Base 3.02, RLVR 46.15), DAPO-32B (Qwen2.5; Base 6.67, RLVR 52.50), and UniReason-14B (Qwen3; Base 11.46, RLVR 54.58), comparing Logp Diff. Δlog p, KL Divergence D̄_KL, Entropy H, and Random.]

Figure 2: Token-replacement performance across metrics and model pairs.
While all metrics can recover RLVR-level accuracy, Δlog p does so with the fewest replacements, demonstrating its precision in isolating the reasoning-critical minority of tokens changed by RLVR training.

3.1 Statistical Analysis: Directional vs. Magnitude-Based Metrics

Experimental Setup. We conduct a statistical analysis on outputs from several RLVR-base model pairs (ORZ (Hu et al., 2025a), DAPO (Yu et al., 2025), UniReason (Huan et al., 2025)) to compare how different token-level metrics capture RLVR-induced changes. We plot histograms of entropy, divergences, and the logp difference over tokens generated by the different models on the AIME-24 dataset.

Statistical Comparison. Fig. 1(b) shows the distributions of these metrics for the UniReason model pair. Across all metrics, the histograms are sharply peaked near zero (note the log-scale y-axis), confirming that RLVR-induced changes are sparse.[1] However, the entropy and KL-divergence distributions are nearly identical for the base and RLVR model outputs. In contrast, the Δlog p distribution exhibits two distinct tails: a positive tail corresponding to tokens favored by the RLVR model and a negative tail for those favored by the base model. This pattern holds across all tested model pairs and for multiple entropy/divergence variants (Appx. E): the distributions of magnitude-based metrics are nearly indistinguishable between tokens generated by the RLVR and base models (Figs. 13-15), whereas Δlog p consistently exhibits clear bimodal patterns (Fig. 12). This is because magnitude-only metrics quantify the size of the distributional change but ignore its direction, i.e., whether a given token is more favored by the RLVR model or the base model. With directional information, Δlog p reveals a clear difference between the two modes, enabling more precise identification of the sparse, reasoning-enhancing updates induced by RLVR; we validate their impact on reasoning performance in the following section.
3.2 Recovering RLVR Performance via Selective Token Replacement

Token Replacement Setup. To further assess how the minority tokens identified by each metric affect reasoning ability, we conduct a selective token replacement[2] experiment proposed by Meng et al. (2026). At each decoding step, we sample a token from π_Base, then apply a metric-specific criterion f_τ to decide whether to replace the token with one sampled from π_RL (Alg. 1). The threshold τ is adjusted to control replacement rates across metrics, enabling fair comparisons. We compare entropy, KL divergences[3], and the logp difference, with the corresponding replacement criteria defined as follows:

• Entropy: Following the hypothesis that RLVR updates target high-entropy positions (Wang et al., 2025b), we replace the base model's token if its token distribution has entropy exceeding a threshold τ: f_τ^H(y_t | x, y_<t) = I(H(·|x, y_<t) > τ).

• KL Divergences: Similarly, to target positions where the two models diverge most, we replace the token if the divergence exceeds τ: f_τ^D(y_t | x, y_<t) = I(D(·|x, y_<t) > τ).

• Logp Difference: A large negative Δlog p for a token y_t indicates that RLVR has learned to penalize it relative to the base model. We exploit this by replacing tokens whose logp difference falls below a threshold τ: f_τ^logp(y_t | x, y_<t) = I(Δlog p(y_t | x, y_<t) < τ).

Algorithm 1 Selective Token Replacement
Require: base and RLVR models π_Base, π_RL, prompt x, criterion function f_τ(·) ∈ {0, 1}
1: Initialize response: t ← 0, y_≤0 ← ""
2: while y_t ≠ "<EOS>" do
3:   t ← t + 1
4:   Sample from base: y_t ∼ π_Base(·|x, y_<t)
5:   if f_τ(y_t | x, y_<t) = 1 then
6:     Replace the token: y_t ∼ π_RL(·|x, y_<t)
7:   end if
8: end while
9: return y_≤t

This selective replacement setup, controlled by the metric-specific thresholds, allows us to compare the impact of the tokens identified by each metric on reasoning performance at matched replacement rates. Fig. 2 shows results on AIME-24 for three representative metrics, H_{π_Base}, D̄_KL, and Δlog p, while Fig. 6 in Appx. A.2 provides ablations with additional metrics, including the RLVR model's entropy H_{π_RL} and KL-divergence variants. All metrics are contrasted with a random baseline that uniformly replaces tokens: f_τ^rand(·) = I(ρ < τ), ρ ∼ U[0, 1]. The key observations are as follows:

Observation I: Selectively replacing a minority of the base model's tokens recovers RLVR performance. As shown in Fig. 2, replacing 5-30% of the base model's sampled tokens, selected by any of the metrics, suffices to match the final RLVR model's accuracy. In contrast, randomly replacing tokens without metric-based selection produces much slower performance growth. This demonstrates that RLVR-modified tokens are sparsely distributed along the sequence but disproportionately important for reasoning, highlighting the efficacy of the evaluated metrics in identifying these critical tokens.

Observation II: Logp difference > divergence > entropy in identifying RLVR-learned reasoning patterns. Across all model pairs (Fig. 2), Δlog p-based replacement reaches the RLVR model's accuracy with the fewest substitutions (around 10% of tokens).

[1] Wang et al. (2025b) argue that RLVR primarily modifies tokens with high entropy. The observed concentration of near-zero-entropy tokens is therefore consistent with sparse updates under their assumptions.
[2] This follows the cross-sample experiment by Meng et al. (2026), which originally employs bidirectional token swapping to verify RL's sparsity. We use the term selective token replacement to better reflect our specific setup: comparing how different metrics select base tokens to be replaced by π_RL.
[3] We mainly use the averaged KL divergence D̄_KL = ½ (D_KL(π_RL, π_Base) + D_KL(π_Base, π_RL)) for token replacement to avoid potential asymmetry bias, and include the variants D_KL(π_RL, π_Base) and D_KL(π_Base, π_RL) for an ablation study.
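The replacement loop of Algorithm 1 with the Δlog p criterion f_τ^logp can be sketched as follows; the sampling callables and scorers below are toy stand-ins for π_Base and π_RL, not the paper's implementation:

```python
import math

def selective_token_replacement(sample_base, sample_rl, logp_base, logp_rl,
                                tau=-0.5, eos="<EOS>", max_len=64):
    """Algorithm 1 with the Delta log p criterion: decode from the base model,
    but where Delta log p = log pi_RL - log pi_Base falls below tau (i.e. RLVR
    has learned to penalize the base token), resample from the RLVR model."""
    y = []
    while (not y or y[-1] != eos) and len(y) < max_len:
        t = sample_base(y)                            # y_t ~ pi_Base(.|x, y_<t)
        if logp_rl(y, t) - logp_base(y, t) < tau:     # f_tau^logp fires
            t = sample_rl(y)                          # y_t ~ pi_RL(.|x, y_<t)
        y.append(t)
    return y

# Toy two-step decode: the base model picks "guess", which the RLVR model
# strongly penalizes (Delta log p ~= -2.48), so it is replaced by "verify".
sample_base = lambda y: ["guess", "<EOS>"][len(y)]
sample_rl = lambda y: ["verify", "<EOS>"][len(y)]
logp_base = lambda y, t: math.log(0.6) if t == "guess" else math.log(0.3)
logp_rl = lambda y, t: math.log(0.05) if t == "guess" else math.log(0.5)

print(selective_token_replacement(sample_base, sample_rl, logp_base, logp_rl))
# → ['verify', '<EOS>']
```

Raising τ toward 0 makes the gate fire more often, which is how the replacement rate is matched across metrics in the experiments.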
In comparison, magnitude-only metrics (e.g., divergence and entropy) require clearly more replacements to match RLVR performance, indicating lower precision in identifying the reasoning-critical changes introduced by RLVR. Between these two, divergence consistently outperforms entropy, suggesting that RLVR changes may not be restricted to high-entropy positions. This ordering (Δlog p highest, followed by divergence, then entropy) remains stable across different divergence and entropy variants (Fig. 6 in Appx. A.2), further validating the superiority of the logp difference in isolating the most influential positions.

3.3 A Gradient-Based Explanation for the Sparse Updates

Our previous analysis established that the RLVR model differs from its base counterpart on a small but critical subset of tokens, most effectively identified by Δlog p. Here, we provide a gradient-based explanation for this sparsity: RLVR's policy gradient inherently concentrates updates on rare, low-probability tokens, which correlate with the high-Δlog p tokens in the final model.

Figure 3: (a) Token probability and gradient-norm coefficient 1 − π_θ(·) at a DAPO step (59.48% of rollout tokens have P ≥ 0.98, only 3.17% have P < 0.02), where the gradient concentrates on rare, low-probability tokens. (b) Token probability within different Δlog p bins, where higher-Δlog p bins contain lower-probability tokens for both base and RLVR models. (c) Effect of top-p filtering on RLVR training performance: performance declines with more filtering.

RLVR's policy gradient sparsely concentrates on low-probability tokens. The gradient of the DAPO objective J_DAPO for an un-clipped token y_{i,t} can be written as w_{i,t} · ∇_θ log π_θ(y_{i,t} | x, y_{i,<t}), where w_{i,t} = r_{i,t}(θ) Â_{i,t} combines the importance-sampling ratio and the advantage. To analyze the token's gradient norm, we have the following lemma (see the proof in Appx. D):

Lemma 3.1. For a softmax-parameterized LLM policy with logits vector z for the output token y_{i,t}, the ℓ1-norm of the DAPO objective's gradient w.r.t. z is given by:

    ‖ ∇_z J_DAPO(y_{i,t} | x, y_{i,<t}) ‖_1 = 2 |w_{i,t}| · (1 − π_θ(y_{i,t} | x, y_{i,<t})).

This partial gradient's ℓ1-norm directly depends on 1 − π_θ(y_{i,t} | x, y_{i,<t}), with larger gradients for lower-probability tokens. Furthermore, Yang et al. (2025b) formally proved that the full gradient norm is tightly bounded by the 1 − π_θ(·) term. Consequently, low-probability tokens, despite their rarity, receive disproportionately large gradient updates. We corroborate this empirically in Fig. 3(a), which plots token probabilities and their gradient coefficients from an intermediate DAPO training step. Although low-probability tokens are sampled infrequently, they account for most of the total gradient mass. This concentration of gradients explains why RLVR's modifications are sparse: learning is naturally focused on a small, high-impact set of low-probability positions.

High-Δlog p tokens are the updated low-probability tokens.
To complete the argument, we link the low-probability tokens that dominate training updates to the high-Δlog p tokens in the final model. Fig. 3(b) analyzes tokens grouped by their Δlog p values. It reveals two patterns: first, the probability of tokens in high-Δlog p bins increases substantially from the base to the RLVR model; second, these high-Δlog p tokens have clearly lower probabilities in both models. This confirms that the most significant updates learned by RLVR target low-probability tokens. The sparsity of RLVR's changes is therefore a direct consequence of sparse, high-magnitude gradients acting on these critical tokens, which can be effectively identified post-hoc by their large Δlog p.

Excluding low-probability tokens during training impairs performance. To causally verify the importance of these low-probability tokens, we conduct a training-time intervention experiment. We train the Qwen2.5-Math-7B base model (Yang et al., 2024) using DAPO but adopt a top-p sampling strategy during rollout to filter out low-probability tokens. The results, plotted in Fig. 3(c), are conclusive. Even a mild filter (e.g., top-p = 0.95) leads to a substantial drop in performance compared to the default setting (top-p = 1.0). As the filter becomes more aggressive (i.e., lower top-p thresholds), performance degrades sharply. This experiment demonstrates that low-probability tokens are not merely correlated with gradient size but are essential to the reasoning improvements achieved by RLVR training.

Takeaway

1. RLVR's gains stem from sparse, high-impact modifications. Our analysis reveals that RLVR's performance gains originate not from a global distribution shift, but from targeted, high-impact changes to a minority of tokens.

2. Logp difference pinpoints these sparse changes. By capturing the direction of probability shifts from base to RLVR, the logp difference outperforms magnitude-only metrics like entropy or divergence in isolating the reasoning-critical tokens that RLVR learns.

3. Sparsity originates from RLVR's focus on low-probability tokens. The sparse differences are explained by the inherent concentration of RLVR's gradients on rare, low-probability tokens, making these tokens both the focal point for improvement and the source of the sparse, high-Δlog p changes we observe.

4 Exploiting RLVR's Directional Updates to Boost Reasoning

Building on Sec. 3, which isolates sparse, directional updates via Δlog p, we propose two practical strategies to exploit this directional learning: (i) a test-time selective extrapolation that shifts probability mass further along the learned direction on critical tokens; (ii) a training-time advantage reweighting that prioritizes the low-probability tokens implicated by high Δlog p. Both methods provide practical ways to boost performance by exploiting RLVR's directional mechanisms.

4.1 Test-Time Enhancement via Extrapolation

Selective test-time extrapolation along the Δlog p direction. Our token-replacement experiment demonstrated that Δlog p effectively identifies the reasoning-critical changes of RLVR. This raises a natural question: can we move beyond simple replacement and actively amplify these critical changes to surpass the RLVR model's performance? We therefore instantiate a token-level extrapolation: treat Δlog p = log π_RL(·) − log π_Base(·) as a learned "reasoning direction" pointing from the base to the RLVR distribution. Our strategy is to amplify this signal by extrapolating the RLVR model's distribution further along this direction.
The extrapolated policy π_Extra^γ is given by:

    log π_Extra^γ(y_t | x, y_<t) := log π_RL(y_t | x, y_<t) + γ · Δlog p(y_t | x, y_<t) + z(x, y_<t)
                                  = (1 + γ) · log π_RL(y_t | x, y_<t) − γ · log π_Base(y_t | x, y_<t) + z(x, y_<t),    (7)

where γ is a hyperparameter controlling the extrapolation strength and z(·) is a log-partition function. In probability space, this is equivalent to re-weighting the RLVR distribution:

    π_Extra^γ(y_t | x, y_<t) ∝ π_RL(y_t | x, y_<t) · exp( γ · Δlog p(y_t | x, y_<t) ).

This framing connects our method to the reward-guided decoding literature (Khanov et al., 2024; Liu et al., 2024; Xu et al., 2025), where a reward function re-weights the probability distribution. Δlog p thereby acts as a token-level reward that encourages better reasoning within this framework.

Why selective? RLVR's improvements concentrate on a minority of tokens; most positions exhibit negligible Δlog p. A global intervention risks distorting well-calibrated tokens. We therefore apply extrapolation selectively, using f_τ^logp to gate positions with large negative Δlog p, and sample from the extrapolated policy π_Extra^γ only at those positions (substituting for π_RL in Algo. 1, Line 6).

Empirical Setup. We evaluate our method on the AIME-24 benchmark using the ORZ, DAPO, and UniReason model pairs, generating 32 samples per question (see Appx. A.1 for details). To isolate the impact of our strategy, we compare three approaches: (1) RLVR: the original, non-intervened RLVR model π_RL; (2) Selective Replace: the base model with tokens replaced by π_RL; (3) Selective Extrapolate: the base model with tokens replaced by π_Extra^γ. For a controlled comparison, we use the same selection criteria for (2) and (3), with the only difference being the extrapolation.
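In probability space, Eq. 7 is a geometric reweighting π_RL^{1+γ} / π_Base^{γ} followed by renormalization (the z(·) term). A minimal sketch over a toy three-token vocabulary (all distributions and the γ value are illustrative):

```python
def extrapolate(p_rl, p_base, gamma):
    """Eq. 7 in probability space: pi_Extra ∝ pi_RL^(1+gamma) / pi_Base^gamma,
    i.e. pi_RL reweighted by exp(gamma * Delta log p), then renormalized."""
    w = [pr ** (1.0 + gamma) / pb ** gamma for pr, pb in zip(p_rl, p_base)]
    z = sum(w)
    return [wi / z for wi in w]

# Toy 3-token vocabulary: RLVR raised token 1's probability from 0.05 to 0.40;
# extrapolating (gamma = 0.5) pushes mass further along that same direction.
p_base = [0.90, 0.05, 0.05]
p_rl = [0.55, 0.40, 0.05]
p_extra = extrapolate(p_rl, p_base, gamma=0.5)
print([round(p, 3) for p in p_extra])  # → [0.267, 0.702, 0.031]
```

Tokens whose probability RLVR left unchanged (token 2 here) are barely perturbed, which is why the selective gate matters mainly for protecting well-calibrated positions from large γ.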
[Figure 4 reports AIME24 Avg@32 for Selective Replace / RLVR / Selective Extrapolate: ORZ-32B (Qwen2.5): 43.33 / 46.15 / 47.50; DAPO-32B (Qwen2.5): 51.67 / 52.50 / 55.42; UniReason-14B (Qwen3): 54.06 / 54.58 / 55.83.]

Figure 4: Extrapolation performance.

Results. On AIME-24, Selective Extrapolation yields higher Avg@32 (average over 32 samples) than π_RL across ORZ-32B, DAPO-32B, and UniReason-14B under matched gates (Fig. 4). In contrast, Selective Replace matches but does not surpass the RL baseline under the same criteria. These results indicate that moving beyond π_RL along Δlog p provides incremental gains in reasoning accuracy.

Extrapolating on π_RL. We also apply selective extrapolation directly on π_RL rather than on π_Base in Algo. 1 (Line 4). As the threshold τ in f_τ^logp increases, AIME-24 performance improves up to a moderate intervention ratio, after which the gains plateau (Table 1). This pattern aligns with the sparsity finding: amplifying a limited set of reasoning-critical tokens is effective, whereas aggressive interventions yield diminishing returns.

Table 1: Selective Extrapolate (γ = 0.1) on the RLVR model (DAPO-32B) instead of the base model.

Threshold τ   | N/A   | -0.5  | -0.2  | 0.0
Replace Ratio | 0.0%  | 1.8%  | 5.2%  | 20.0%
Avg@32        | 52.50 | 53.96 | 55.31 | 55.10

Theoretical Justification. Following a standard simplification in theoretical analyses of LLM RL training (Munos et al., 2024; Shi et al., 2025; Huang et al., 2025), we consider a tabular softmax bandit policy, π_θ(y|x) ∝ exp(θ_{x,y}), where the logit θ_{x,y} is individually parameterized for each prompt-response pair (x, y). We assume the policy is trained with Natural Policy Gradient (NPG (Kakade, 2001)) following Cui et al. (2025), since its updates resemble the controlled optimization of PPO (Schulman et al., 2017). The update rule of NPG via backtracking simplifies to: θ^{t+1}_{x,y} − θ^t_{x,y} = η · A^t(x, y), where η is the step size and A^t is the advantage function (Agarwal et al., 2021).
In this context, our extrapolated policy (Eq. 7) is defined asπ ω(θ t ,γ) , whereω(θ t ,γ) = θ t + γ(θ t − θ 0 ). Under these conditions, we have the following theorem (the proof can be found in Appx. D): Theorem 4.1. For a given promptx, if a tabular softmax policyπ θ t is updated via natural policy gradient (Kakade, 2001), then the extrapolated policy π ω(θ t ,γ) satisfies: ∃ γ > 0, E y∼π ω(θ t ,γ) (·|s) [R x,y ]≥ E y∼π θ t (·|s) [R x,y ]. Equality holds if and only if the reward R x,y is constant for all y. This theorem shows that, in the simplified setting, extrapolating along the learned difference direction of∆ log pcan improve the expected reward. Nevertheless, we need to note that the proof relies on the idealized NPG’s update rule, with a monotonic learning process consistently adjusting the logits along the reward’s direction. In contrast, our empirical analysis has shown that the updates learned by RLVR concentrate only on a minority of tokens, with∆ log pon most tokens being negligible. This disparity motivates our selective extrapolation only on positions with a significant difference, which exhibit the consistent, directional updates assumed by the theory. 4.2 Training-Time Enhancement via Advantage Reweighting Training-time enhancement via probability-aware advantage reweighting. While our test-time ap- proach amplifies the learned reasoning signal post-hoc, our training-time strategy proactively strengthens the model’s reasoning signal during learning. Instead of extrapolating the final logp difference∆ log p, we leverage the observed correlation between high∆ log pand low-probability tokens (Fig. 3(b)), and propose to amplify the learning signal of these critical low-probability tokens. Since the parameter update is driven by the advantage term ˆ A i,t in policy gradient methods, we modify the advantage in DAPO (Eq. 
4) to prioritize low-probability tokens:

Ã_{i,t} = [1 + α · (1 − π_{θ_old}(y_{i,t} | x, y_{i,<t}))] · Â_{i,t},    (8)

where α is a hyperparameter controlling the reweighting strength. Such a concentration on low-probability tokens also aligns with our top-p experiment in Fig. 3(c), which finds that low-probability tokens are irreplaceable for RLVR training.

Table 2: Comparison of our reweighting method and DAPO on math reasoning benchmarks.

Model            Method  AIME24          AIME25          AMC             Average
                         Avg@32 Pass@16  Avg@32 Pass@16  Avg@32 Pass@16  Avg@32 Pass@16
Qwen2.5-Math-7B  Base    14.79  47.46    6.67   27.84    40.62  79.25    20.69  51.52
                 DAPO    35.73  54.09    17.60  30.45    73.04  89.03    42.12  57.86
                 Ours    39.06  60.58    18.54  36.72    73.64  89.69    43.75  62.33
Qwen3-8B-Base    Base    5.42   30.63    5.73   32.80    27.64  78.09    12.93  47.17
                 DAPO    36.98  72.30    26.67  46.76    69.13  88.51    44.26  69.19
                 Ours    38.13  69.87    31.15  55.38    71.05  92.30    46.78  72.52

Experimental setup. We modify only the advantage (Eq. 8) in the standard DAPO recipe and keep all other hyperparameters fixed. We evaluate model performance on three math reasoning benchmarks: AIME-24, AIME-25, and AMC. Following DAPO's setup, we use top-p=0.7 for sampling during evaluation. We report Avg@32 and Pass@16^4, both computed over 32 samples per problem to ensure a stable estimate of the pass rates (Chen et al., 2021).

Table 3: Results of various reweighting methods.

Method           PPL    Dominate  Ours
AIME24  Avg@32   35.63  36.35     39.06
        Pass@16  61.95  55.27     60.58
AIME25  Avg@32   16.46  13.02     18.54
        Pass@16  32.19  20.69     36.72
AMC     Avg@32   72.06  79.97     73.64
        Pass@16  89.10  84.93     89.69
Average Avg@32   41.38  43.11     43.75
        Pass@16  61.08  53.63     62.33

Results: performance gains across models and datasets. We compare our reweighting method on two models: Qwen2.5-Math-7B (Yang et al., 2024) and Qwen3-8B-Base (Yang et al., 2025a). As shown in Tab. 2, enhancing the weight of low-probability tokens consistently improves reasoning accuracy across all tested models and datasets.
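As a sketch, Eq. 8 amounts to a one-line transform of the per-token advantages. The function below assumes the per-token probabilities under π_{θ_old} are already available; α = 0.2 is the value the paper uses for Qwen2.5 (Appx. B), and the numeric example is illustrative:

```python
def reweight_advantage(advantages, token_probs, alpha=0.2):
    # Eq. 8: A_tilde_{i,t} = [1 + alpha * (1 - pi_old(y_{i,t} | x, y_{i,<t}))] * A_hat_{i,t}.
    # Low-probability tokens receive a boost of up to (1 + alpha)x; a token
    # the old policy is certain about (p -> 1) keeps its original advantage.
    return [(1.0 + alpha * (1.0 - p)) * a
            for a, p in zip(advantages, token_probs)]

# A rare token (p = 0.01) and a near-certain token (p = 0.99), equal advantage:
print(reweight_advantage([1.0, 1.0], [0.01, 0.99]))  # rare ~1.198, certain ~1.002
```

In a real DAPO-style trainer this would be applied elementwise to the advantage tensor before the clipped policy-gradient loss; everything else in the recipe stays unchanged.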
Notably, this enhanced accuracy (Avg@32) does not come at the cost of exploration ability (often measured by Pass@k) (Yue et al., 2025); in fact, the average Pass@16 also increases over the DAPO baseline.

Comparison of different reweighting. While our reweighting method is motivated by the critical role of low-probability tokens, existing work has proposed alternative reweighting strategies that stem from different hypotheses: (1) PPL: Deng et al. (2025) find that RLVR updates favor low-PPL responses, so they reweight the advantage to enhance these responses: Ã^ppl_{i,t} = [1 − α · w_ppl(y_i)] · Â_{i,t}, where w_ppl(y_i) is a normalized log-PPL weight. (2) Dominate: Yang et al. (2025b) argue that RLVR training can be over-dominated by low-probability tokens, so they propose to counteract this by upweighting high-probability tokens: Ã^dom_{i,t} = [α · π_θ(y_{i,t}) + 1 − α] · Â_{i,t}. We implement these methods using their recommended hyperparameters and compare the performance on Qwen2.5-Math-7B. As shown in Table 3, our method of directly amplifying low-probability tokens achieves the best overall performance for both Avg@32 and Pass@16.

The training dynamics in Fig. 5 provide further insight: our method not only exhibits higher reasoning accuracy but also a steady increase in response length. This simultaneous increase in performance and length is a key pattern in effective reasoning RLVR training (Guo et al., 2025), suggesting that our method promotes reasoning behavior. Moreover, the training entropy of the Ã^dom_{i,t} reweighting is clearly lower, since they adopt a more restrictive clip-higher ratio of ε_high = 0.24 than the

4 With 32 samples, we report the more stable Pass@16 instead of Pass@32 for Pass@k evaluation.
default ε_high = 0.28 in DAPO^5. The lower entropy (less exploration) also explains their reduced Pass@k performance in Tab. 3.

[Figure 5: Training curves for different reweighting methods on Qwen2.5-Math-7B: AIME24 Avg@32, training-time entropy, and response length over 250 training steps for Ours, PPL, Dominate, and DAPO.]

5 Related Work

Reinforcement learning for LLM. Reinforcement learning is a pivotal component of the LLM post-training pipeline. Early applications centered on Reinforcement Learning from Human Feedback (RLHF) for model alignment (Ouyang et al., 2022; Stiennon et al., 2020), while recent advancements shift the focus to building reasoning models with RL. OpenAI o1 (Jaech et al., 2024) is the first reasoning model, and DeepSeek R1 (Guo et al., 2025) introduces a detailed RLVR (Lambert et al., 2024) recipe for building reasoning models with the GRPO algorithm (Shao et al., 2024). These seminal works inspired a series of subsequent models that further improve reasoning abilities, from industrial systems like Kimi (Team, 2025), Qwen3 (Yang et al., 2025a), and Gemini 2.5 (Comanici et al., 2025), to open-source academic algorithms such as Dr.GRPO (Liu et al., 2025), Open-Reasoner-Zero (Hu et al., 2025a), DAPO (Yu et al., 2025), GSPO (Zheng et al., 2025), and QAE (Wu et al., 2025). In this paper, we adopt DAPO as our baseline RLVR algorithm.

Understanding the effects of RLVR. The success of RLVR has prompted a line of research dedicated to understanding its effects. While early work analyzed high-level cognitive behaviors of RLVR-trained models (Gandhi et al., 2025; Hu et al., 2025b; Bogdan et al., 2025), recent studies have deepened the analysis with token-level quantification (Qian et al., 2025; Wang et al., 2025a). Cui et al.
(2025) studied the token entropy change during RLVR, Yang et al. (2025b) quantified the gradient norm of specific tokens, and Deng et al. (2025); Meng et al. (2026) used token replacement to measure tokens' impact on reasoning performance. A core finding from these analyses is that RLVR induces sparse updates, which has been verified through high-entropy tokens (Wang et al., 2025b), KL divergences (Huan et al., 2025), and the sparse gradient norm (Yang et al., 2025b; Deng et al., 2025). However, when studying the differences between base and RLVR models, prior studies mainly focus on the magnitude of changes, largely overlooking their direction. While Yang et al. (2025b) analyze the update direction (increase or decrease) of probabilities at each gradient step, we extend the notion of update direction to the full distributional shift from the base model to the RLVR model, and we propose explicitly extrapolating along this learned direction in distribution space.

6 Conclusion

In this work, we introduced a directional analysis of RLVR based on the log p difference ∆log p, shown to be more effective in identifying sparse yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this, we proposed a test-time extrapolation to amplify these directional updates and a training-time reweighting to focus learning on the low-probability tokens that ∆log p highlights. Both methods improve reasoning performance across different settings, validating our key principle: diagnose and improve RLVR by its update direction.

5 This follows the recommended value in their paper (Yang et al., 2025b). We also tested the default ε_high = 0.28, but it resulted in unstable training.

Limitations and future work. One primary limitation of our extrapolation method is the requirement of two models; future work could integrate this with parameter-efficient finetuning to reduce computational cost.
The extrapolation also introduces additional hyperparameters, and future work can explore combining the selection threshold and extrapolation strength for a more adaptive extrapolation. Additionally, our reweighting approach could be evaluated at different model scales or combined with other adaptive training techniques.

Contributions

Authors: Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, and Jingren Zhou.

References

Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76, 2021. URL http://jmlr.org/papers/v22/19-736.html.

Paul C. Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy. Thought anchors: Which llm reasoning steps matter?, 2025. URL https://arxiv.org/abs/2506.19143.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, et al. Evaluating large language models trained on code, 2021.

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025.

Jia Deng, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, and Ji-Rong Wen. Decomposing the entropy-performance exchange: The missing keys to unlocking effective reinforcement learning, 2025.
Kanishk Gandhi, Ayush K Chakravarthy, Anikait Singh, Nathan Lile, and Noah Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective STaRs. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=QGJ9ttXLTy.

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645:633–638, 2025.

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025a.

Xiao Hu, Xingyu Lu, Liyuan Mao, YiFan Zhang, Tianke Zhang, Bin Wen, Fan Yang, Tingting Gao, and Guorui Zhou. Why distillation can outperform zero-rl: The role of flexible reasoning. arXiv preprint arXiv:2505.21067, 2025b.

Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning. arXiv preprint arXiv:2507.00432, 2025.

Kexin Huang, Junkang Wu, Ziqian Chen, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, and Xiang Wang. Larger or smaller reward margins to select preferences for LLM alignment? In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=ncTwQagrj8.

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.

Sham M Kakade. A natural policy gradient. In T. Dietterich, S. Becker, and Z. Ghahramani (eds.), Advances in Neural Information Processing Systems, volume 14. MIT Press, 2001.

Maxim Khanov, Jirayu Burapacheep, and Yixuan Li. ARGS: Alignment as reward-guided search.
In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=shgx0eqdw6.

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NeurIPS '22, Red Hook, NY, USA, 2022.

Tianlin Liu, Shangmin Guo, Leonardo Bianco, Daniele Calandriello, Quentin Berthet, Felipe Llinares-López, Jessica Hoffmann, Lucas Dixon, Michal Valko, and Mathieu Blondel. Decoding-time realignment of language models. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 31015–31031. PMLR, 21–27 Jul 2024.

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.

Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, and Jingren Zhou. Sparse but critical: A token-level analysis of distributional shifts in RLVR fine-tuning of LLMs. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=8vWIXno8LW.

Remi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Côme Fiegel, et al. Nash learning from human feedback. In Forty-first International Conference on Machine Learning, 2024.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS), 35:27730–27744, 2022.

Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, and Jing Shao. Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in llm reasoning. arXiv preprint arXiv:2506.02867, 2025.

Yi Ren and Danica J. Sutherland. Learning dynamics of LLM finetuning. In The Thirteenth International Conference on Learning Representations, 2025.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

Ruizhe Shi, Runlong Zhou, and Simon Shaolei Du. The crucial role of samplers in online direct preference optimization. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=F6z3utfcYw.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems (NeurIPS), 33:3008–3021, 2020.

Kimi Team. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025.

Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, and Wenhu Chen. Emergent hierarchical reasoning in llms through reinforcement learning. arXiv preprint arXiv:2509.03646, 2025a.
Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, 2025b.

Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, and Xiangnan He. Quantile advantage estimation for entropy-safe reasoning. arXiv preprint arXiv:2509.22611, 2025.

Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, and Sumitra Ganesh. GenARM: Reward guided generation with autoregressive reward model for test-time alignment. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=J0qTpmbSbh.

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, 2024.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a.

Zhihe Yang, Xufang Luo, Zilong Wang, Dongqi Han, Zhiyuan He, Dongsheng Li, and Yunjian Xu. Do not let low-probability tokens over-dominate in rl for llms. arXiv preprint arXiv:2505.12929, 2025b.

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025.
Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, et al. Geometric-mean policy optimization. arXiv preprint arXiv:2507.20673, 2025.

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.

A Selective Token Replacement & Extrapolation

A.1 Implementation Details

Models. Our experiments use several publicly available RLVR-trained models and their corresponding base models from the Qwen series (Yang et al., 2025a; Team, 2024):

• ORZ: The Open-Reasoner-Zero-32B model (Hu et al., 2025a), finetuned from the Qwen2.5-32B base model using the PPO algorithm.
• DAPO: The DAPO-Qwen-32B model (Yu et al., 2025), finetuned from the same Qwen2.5-32B base but with the DAPO algorithm.
• UniReason: The UniReason-Qwen3-14B-RL model (Huan et al., 2025), finetuned from Qwen3-14B-Base using the GRPO algorithm.

Sampling settings. We utilize the AIME-24 dataset to evaluate the replacement performance. We adopt the default chat prompt template from each model, with the user prompt defined as follows:

[Question] Please reason step by step, and put your final answer within \boxed{}.

We set the sampling parameters to top-p=0.7, temperature=1.0, max-length=20k, and sample 32 responses for each question. The answer is extracted from the last "boxed"-wrapped text and verified using Math-Verify. We report the correctness averaged over 32 samples, i.e., Avg@32.

Hyperparameters for extrapolation. As described in Algo. 1, the replacement is adopted selectively, controlled by the threshold τ in the criterion function f_τ, while the extrapolation strength is adjusted by the parameter γ in π^γ_Extra. For the extrapolation results in Fig.
4, the "Selective Extrapolate" and "Selective Replace" methods share the same hyperparameters for each model, which we summarize as follows:

Table 4: Hyperparameters for the extrapolation results (Fig. 4).

Model                      ORZ    UniReason  DAPO
Threshold τ for f_τ^logp   -0.4   -0.35      -0.3
Replaced Ratio             10.1%  7.5%       11.4%
γ in π^γ_Extra             0.1    0.1        0.05

A.2 Additional Experiments

Additional metrics. As described in Sec. 3, our primary metrics for token replacement are the base model's entropy H_Base, KL divergence D_KL, and log p difference ∆log p. For our ablation study, we include additional metrics: the RLVR model's entropy H_RL and two KL-divergence variants, D_KL(π_RL, π_Base) and D_KL(π_Base, π_RL). We evaluate these metrics as criteria for the DAPO model's selective replacement. By varying the threshold τ for each criterion, we control the token replacement frequency and plot the performance on AIME-24 against various replacement ratios in Fig. 6. As shown in the figure, although the additional metrics' selected replacements also approach the RLVR model's performance, they still require more replacement than ∆log p does. This confirms the performance ordering for identifying reasoning-critical tokens: log p difference > divergence > entropy.

Selected Tokens. To provide an intuitive comparison of the metrics, we analyze the tokens utilized for replacing the base model's choice during DAPO's token replacement under the entropy H_Base, KL divergence D_KL, and log p difference ∆log p criteria. To ensure a fair comparison, we adjust the threshold for each metric to achieve a replacement rate of approximately 8%. Fig. 7 illustrates each criterion's top 50 substitution tokens.
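As a concrete sketch of the ∆log p selection itself (our reading of the per-token threshold test in Algo. 1 and Appx. A.3; the log-probability values below are illustrative, and τ = -0.3 is the DAPO setting from Table 4):

```python
def select_positions(logp_rl, logp_base, tau=-0.3):
    # Flag positions where Delta log p = log pi_RL(y_t) - log pi_Base(y_t)
    # falls below the threshold tau, i.e., where the RLVR model strongly
    # down-weights the token the base model generated; only these sparse
    # positions are replaced/extrapolated.
    return [(lr - lb) < tau for lr, lb in zip(logp_rl, logp_base)]

# Illustrative per-token log-probabilities for a 4-token span:
print(select_positions([-0.10, -2.60, -0.04, -1.00],
                       [-0.10, -1.20, -0.05, -2.00]))
# -> [False, True, False, False]: only the sharply down-weighted token is selected
```

Lowering τ (more negative) shrinks the selected set, which is how the replacement ratio is swept in the ablations above.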
The figure reveals that entropy-based selection favors logical transition words (e.g., "Thus", "need", "can"), while the divergence and ∆log p criteria utilize more specific mathematical reasoning tokens, including a higher proportion of math symbols. Combined with the inferior performance of the entropy criterion, this suggests that these specific mathematical tokens might be more efficient for improving reasoning performance.

[Figure 6: Selective token replacement results with additional criteria for DAPO. AIME24 Avg@32 vs. RLVR replace ratio (0–100%) on DAPO-32B/Qwen2.5 (Base: 6.67, RLVR: 52.50), comparing log p difference ∆log p, forward KL divergence D_KL(π_RL, π_Base), reversed KL divergence D_KL(π_Base, π_RL), averaged KL divergence, base entropy H_Base, RLVR entropy H_RL, and random selection.]

[Figure 7, panels "By Entropy H_Base" and "By KL Divergence D_KL": top 50 substitution tokens with their percentages of all replaced tokens under each criterion, dominated by common words (e.g., " the", "Let", "Thus") and math delimiters (e.g., " $", " \\").]
[Figure 7, panel "By Logp Difference ∆log p": top 50 substitution tokens under the ∆log p criterion.]

Figure 7: Top 50 tokens for replacing the base model's choice under different metrics' selection.

Per-Problem Accuracy during Replacement. We also report the per-problem accuracy changes in the token-replacement experiment in Fig. 8, to more finely examine how gradually increasing the replacement ratio affects model performance. We observe that: (1) Some problems are inherently difficult for the model, and their accuracy remains zero across all replacement ratios. (2) For the remaining problems, the overall trend is that accuracy generally increases as the replacement ratio grows, and then begins to fluctuate. This is consistent with the fact that, when only performing token replacement, the performance is ultimately capped by the upper bound of the RLVR model. (3) For a small number of problems, accuracy initially drops when we introduce a small amount of replacement, and then begins to improve as the replacement ratio continues to increase (e.g., problem 0 of DAPO). A qualitative inspection of these cases suggests that, for some of them, a small number of RL-replaced tokens introduce token options that the base model is not familiar with. As a result, the base model fails to continue the generation coherently, leading to an initial degradation in accuracy. However, when we further increase the replacement ratio, the generation becomes more strongly guided by the RL tokens, and the model's performance on these problems recovers and improves.
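The Avg@32 and Pass@16 numbers reported throughout can be computed per problem from the 32 samples; Pass@k uses the standard unbiased estimator of Chen et al. (2021), which the paper cites for stable pass-rate estimates (function names here are ours):

```python
from math import comb

def avg_at_n(num_correct, n):
    # Avg@n: mean correctness over n sampled responses for one problem.
    return num_correct / n

def pass_at_k(n, c, k):
    # Unbiased Pass@k estimator (Chen et al., 2021):
    # pass@k = 1 - C(n - c, k) / C(n, k), with c correct out of n samples.
    if n - c < k:
        return 1.0  # any size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a problem with 4 correct answers out of 32 samples.
print(f"Avg@32 = {avg_at_n(4, 32):.3f}, Pass@16 = {pass_at_k(32, 4, 16):.3f}")
```

Benchmark-level scores average these per-problem values; with n = 32 samples, Pass@16 is a lower-variance estimate than Pass@32, which motivates the paper's footnote 4.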
[Figure 8 plots: per-problem average accuracy vs. average resample rate on AIME24 (Problems 0–29) for the token replacement experiments of (a) DAPO, (b) ORZ, and (c) UniReason.]
Rate 0.0 0.2 0.4 0.6 0.8 1.0 Avg Accuracy Problem 28 0.0000.0250.0500.0750.1000.1250.150 Avg Resample Rate 0.0 0.2 0.4 0.6 0.8 1.0 Avg Accuracy Problem 29 (c) Per-problem accuracy on AIME24 of UniReason’s token replacement experiment Figure 8: Per-problem accuracy changes on AIME24 during each model’s selective token replacement experiment. We report the results with∆ log p being the selection criterion. A.3 Hyperparameter Sensitivity Analysis Our test-time extrapolation distributionπ γ Extra introduces a hyperparameterγthat determines the strength of extrapolation along the learned∆ log pdirection. This intervention operates within the token replacement procedure (Algo. 1) and is applied only to tokens selected by the criterion∆ log p < τ. To verify the robustness of the performance gain of extrapolation over simply replacing the token fromπ RL , we perform a grid search over bothγand the token-selection thresholdτ. We evaluate γ∈0.05, 0.1and varyτacross different ranges for different models. For DAPO and ORZ, we testτ ∈ 17 −0.5,−0.4,−0.3,−0.2,−0.1. For UniReason, we adopt a denser gridτ ∈−0.5,−0.45,−0.4,−0.35,−0.3 because relatively few replacements are needed to reach the RLVR performance level (Fig. 2). As shown in Tab. 5, across nearly all models and hyperparameter settings, extrapolation consistently outperforms the replace-only variant, demonstrating a strong robustness of our method. Notably, once the replacement ratio is sufficiently high to match the RLVR’s performance, further increases in replacement provide little to no additional benefit, since the performance is bounded by the RLVR model itself. In contrast, a proper test-time extrapolation can further exceed RLVR performance by 1–3 points without any additional training. Table 5: Hyperparameter sensitivity analysis for the selective extrapolation experiment. The ∗ sign marks the reported value for extrapolation results in Fig. 
4, while the † sign corresponds to the end point in token replacement of Fig. 2.

(a) Hyperparameters and Avg@32 performance on AIME24 of DAPO (Avg@32 of π_RL: 52.60).

Threshold τ             | −0.5   | −0.4   | −0.3   | −0.2   | −0.1
Average replace ratio   | 8.8%   | 10.0%  | 11.4%  | 13.4%  | 16.5%
Replace w/ π_RL         | 51.98† | 51.56  | 51.67  | 52.71  | 51.98
Extrapolate w/ γ = 0.05 | 51.88  | 53.02  | 55.42∗ | 54.06  | 54.90
Extrapolate w/ γ = 0.1  | 54.17  | 53.33  | 53.85  | 53.85  | 54.27

(b) Hyperparameters and Avg@32 performance on AIME24 of ORZ (Avg@32 of π_RL: 46.15).

Threshold τ             | −0.5   | −0.4   | −0.3   | −0.2   | −0.1
Average replace ratio   | 9.5%   | 10.1%  | 10.8%  | 11.6%  | 12.7%
Replace w/ π_RL         | 43.65  | 43.33  | 46.15† | 44.90  | 42.81
Extrapolate w/ γ = 0.05 | 47.19  | 45.52  | 45.83  | 46.25  | 43.44
Extrapolate w/ γ = 0.1  | 43.75  | 47.50∗ | 45.52  | 47.08  | 45.42

(c) Hyperparameters and Avg@32 performance on AIME24 of UniReason (Avg@32 of π_RL: 54.58).

Threshold τ             | −0.5   | −0.45  | −0.4   | −0.35  | −0.3
Average replace ratio   | 5.4%   | 6.0%   | 6.8%   | 7.5%   | 8.5%
Replace w/ π_RL         | 53.65† | 53.33  | 53.12  | 54.06  | 53.54
Extrapolate w/ γ = 0.05 | 51.88  | 54.79  | 53.54  | 55.00  | 54.69
Extrapolate w/ γ = 0.1  | 54.37  | 53.75  | 53.96  | 55.83∗ | 55.10

B RLVR Training Details

Hyperparameter Settings. We adopt the open-sourced DAPO recipe for RLVR training. Our configuration uses asymmetric clip ratios (ε_low = 0.2 and ε_high = 0.28) and a learning rate of 1e-6 with a 10-step warmup. Each RLVR step consists of 512 prompts with 16 sampled responses each, processed in mini-batches of 32 prompts to yield 16 gradient updates per step. The maximum generation length (and the overlong penalty threshold) is set to 8k (4k) for Qwen2.5-Math-7B and 20k (16k) for Qwen3-8B-Base, respectively. For reweighting, our parameter α (Eq. 8) is set to 0.2 for Qwen2.5 and 0.1 for Qwen3. Following the values recommended by Deng et al. (2025) and Yang et al. (2025b), we set α to 0.1 for Ã^dom_{i,t} and 0.01 for Ã^PPL_{i,t}. For Ã^dom_{i,t} specifically, we also adjust ε_high to 0.24.

Reproducibility Analysis.
To account for random variation in the RL process, we performed four separate training runs on the Qwen2.5-Math-7B backbone with our reweighting method. Fig. 9 displays the learning curves for these experiments (Runs 1–4, where Run 1 is the run reported in Fig. 5).

Figure 9: Reproducibility analysis. The learning curves across 4 independent runs on Qwen2.5-Math-7B with our reweighting method (Eq. 8) show consistent convergence and performance (AIME24 Avg@32 reaching the reported value of 39.06, alongside training-time entropy and response length).

The results indicate that our method is highly reproducible; across all trials, the model reached or surpassed the performance levels reported in Tab. 3.

C Performance beyond Pure-Math Reasoning Tasks

Although our models are primarily trained and evaluated on math-focused datasets, it is important to assess their reasoning ability on non-math tasks to evaluate generalization. Following prior work (Zhao et al., 2025), we use the Minerva dataset (Lewkowycz et al., 2022), which contains 272 undergraduate-level STEM problems spanning diverse subjects such as Chemistry and Astronomy.⁶ We begin by benchmarking the RLVR-trained models on Minerva using the same sampling parameters as in our other evaluations (e.g., AIME24). As shown in Tab. 6, models trained with our reweighting method continue to outperform the baselines in reasoning accuracy. Importantly, these gains do not come at the expense of exploration ability, as reflected by comparable or improved Pass@k scores. We further evaluate test-time extrapolation on Minerva. Because Minerva is substantially larger than AIME24 (around 7 times more questions), we report Avg@8 for the evaluated 14B–32B models. As shown in Fig.
10, test-time extrapolation consistently improves over the RLVR model's accuracy, validating its generalization ability beyond pure-math datasets. We also report the hyperparameter grids in Tab. 7, where the extrapolation results again consistently outperform replacing with π_RL only.

Table 6: Performance of RLVR-trained models on Minerva.

(a) On Qwen2.5-Math-7B

Method  | Base  | DAPO  | PPL   | Dominate | Ours
Avg@32  | 18.35 | 46.43 | 48.68 | 47.01    | 49.72
Pass@16 | 61.04 | 69.44 | 68.69 | 64.59    | 70.37

(b) On Qwen3-8B-Base

Method  | Base  | DAPO  | Ours
Avg@32  | 29.80 | 55.04 | 56.57
Pass@16 | 70.43 | 76.98 | 76.78

⁶ The dataset is named OCWCourses in the original paper; it can be found at https://openreview.net/attachment?id=IFXTZERXdM7&name=supplementary_material.

Figure 10: Extrapolation results on Minerva (Avg@8 for UniReason-14B (Qwen3), ORZ-32B (Qwen2.5), and DAPO-32B (Qwen2.5); bars compare Selective Replace, RLVR, and Selective Extrapolate).

Table 7: Hyperparameters and Avg@8 performance on the Minerva benchmark. The ∗ sign marks the tuned value in Fig. 10.

                        | DAPO            | ORZ             | UniReason
Threshold τ             | −1.0   | −0.9   | −1.0   | −0.9   | −1.0   | −0.9
Avg replace ratio       | 6.5%   | 7.0%   | 9.2%   | 9.6%   | 1.8%   | 2.2%
Replace w/ π_RL         | 56.63  | 56.43  | 56.41  | 56.39  | 54.00  | 54.14
Extrapolate w/ γ = 0.05 | 56.80  | 57.22  | 57.17∗ | 57.08  | 54.50  | 54.50
Extrapolate w/ γ = 0.1  | 58.27∗ | 56.57  | 55.51  | 55.28  | 54.32  | 56.16∗

D Proofs

Proof of Lemma 3.1. For ease of notation, we omit the context x, y_{i,<t} here. Since ∇_θ π_θ(y_{i,t}) = π_θ(y_{i,t}) ∇_θ log π_θ(y_{i,t}), the derivative of the DAPO objective on an unclipped token y_{i,t} is:

\[
\nabla_\theta J_{\text{DAPO}}(y_{i,t})
= \nabla_\theta\, r_{i,t}(\theta)\, \hat A_{i,t}
= \nabla_\theta \frac{\pi_\theta(y_{i,t})}{\pi_{\theta_{\text{old}}}(y_{i,t})}\, \hat A_{i,t}
= r_{i,t}(\theta)\, \hat A_{i,t} \cdot \nabla_\theta \log \pi_\theta(y_{i,t})
= w_{i,t} \cdot \nabla_\theta \log \pi_\theta(y_{i,t}).
\]
For the softmax-parameterized policy π_θ with logits z for y_{i,t}, assuming y_{i,t} corresponds to index k of the vocabulary V, we have:

\[
\frac{\partial}{\partial z_j} \log \pi_\theta(y_{i,t})
= \frac{1}{\pi_\theta(y_{i,t})} \cdot \frac{\partial}{\partial z_j} \frac{\exp(z_k)}{\sum_l \exp(z_l)}
= \frac{1}{\pi_\theta(y_{i,t})} \cdot
\begin{cases}
\dfrac{\exp(z_k)\sum_l \exp(z_l) - \exp(z_k)\exp(z_k)}{\big(\sum_l \exp(z_l)\big)^2}, & j = k\\[2ex]
-\dfrac{\exp(z_k)\exp(z_j)}{\big(\sum_l \exp(z_l)\big)^2}, & j \neq k
\end{cases}
= \begin{cases}
1 - \pi_\theta(V_k), & j = k\\
-\pi_\theta(V_j), & j \neq k
\end{cases}
= \mathbb{I}(j = k) - \pi_\theta(V_j).
\]

So the ℓ1-norm of ∇_z J_DAPO(y_{i,t}) becomes:

\[
\|\nabla_z J_{\text{DAPO}}(y_{i,t})\|_1
= \|w_{i,t}\, \nabla_z \log \pi_\theta(y_{i,t})\|_1
= |w_{i,t}| \cdot \sum_j \big| \mathbb{I}(j = k) - \pi_\theta(V_j) \big|
= |w_{i,t}| \cdot \Big( 1 - \pi_\theta(y_{i,t}) + \sum_{j \neq k} \pi_\theta(V_j) \Big) \quad (y_{i,t} = V_k)
= |w_{i,t}| \cdot 2\big(1 - \pi_\theta(y_{i,t})\big).
\]

Proof of Theorem 4.1. Let J(θ_x) = E_{y∼π_{θ_x}(·)}[R_{x,y}]. We need to show that for each x:

\[
\exists\, \gamma > 0,\quad J\big(\theta^t_x + \gamma(\theta^t_x - \theta^0_x)\big) \ge J(\theta^t_x).
\]

Denote the extrapolation direction d^t_x = θ^t_x − θ^0_x; this is equivalent to showing that the directional derivative of J at θ^t_x along d^t_x is nonnegative. The directional derivative is given by:

\[
\nabla_{d^t_x} J(\theta^t_x)
= \nabla_{\theta_x} J(\theta^t_x)^\top \frac{d^t_x}{\|d^t_x\|}
= \frac{1}{\|d^t_x\|} \cdot \sum_y \frac{\partial J(\theta^t_x)}{\partial \theta_{x,y}}\, d^t_{x,y}.
\]

For the softmax policy π_{θ_x}(y) = exp(θ_{x,y}) / Σ_{y'} exp(θ_{x,y'}), its gradient satisfies:

\[
\frac{\partial \pi_{\theta_x}(y')}{\partial \theta_{x,y}} = \pi_{\theta_x}(y') \big( \mathbb{I}(y = y') - \pi_{\theta_x}(y) \big).
\]

So the partial gradient of J with respect to θ_{x,y} is:

\[
\frac{\partial J(\theta_x)}{\partial \theta_{x,y}}
= \sum_{y'} R_{x,y'} \frac{\partial \pi_{\theta_x}(y')}{\partial \theta_{x,y}}
= R_{x,y}\, \pi_{\theta_x}(y) - \pi_{\theta_x}(y) \sum_{y'} R_{x,y'}\, \pi_{\theta_x}(y')
= \pi_{\theta_x}(y)\big( R_{x,y} - \pi_{\theta_x}^\top R_x \big).
\]

Noting that the advantage is A^t(x, y) = R_{x,y} − π_{θ^t_x}^⊤ R_x under the bandit setting, the directional derivative thus becomes:

\[
\nabla_{d^t_x} J(\theta^t_x)
= \frac{1}{\|d^t_x\|} \cdot \sum_y \pi_{\theta^t_x}(y)\big( R_{x,y} - \pi_{\theta^t_x}^\top R_x \big)\, d^t_{x,y}
= \frac{1}{\|d^t_x\|} \cdot \sum_y \pi_{\theta^t_x}(y)\, A^t(x, y)\, d^t_{x,y}.
\]

We now analyze the orderings of A^t(x, y) and d^t_{x,y}. Under the assumed bandit setting, the order of A^t(x, y) is the same as the order of R_{x,y}, i.e., A^t(x, y_1) > A^t(x, y_2) if and only if R_{x,y_1} > R_{x,y_2}. For d^t_{x,y}, we can prove by induction that its order is also the same as that of R_{x,y}.
At t = 1, using the update rule of NPG, we have:

\[
d^1_{x,y} - d^1_{x,y'} = \eta \cdot \big( A^0(x, y) - A^0(x, y') \big) = \eta \cdot ( R_{x,y} - R_{x,y'} ).
\]

So the order of d^1_{x,y} is the same as that of R_{x,y}. Assume at iteration t the order of d^t_{x,y} is the same as that of R_{x,y}; then at iteration t + 1 we have:

\[
d^{t+1}_{x,y} - d^{t+1}_{x,y'}
= d^t_{x,y} - d^t_{x,y'} + \eta \cdot \big( A^t(x, y) - A^t(x, y') \big)
= d^t_{x,y} - d^t_{x,y'} + \eta \cdot ( R_{x,y} - R_{x,y'} ).
\]

So we still have d^{t+1}_{x,y} > d^{t+1}_{x,y'} ⟺ R_{x,y} > R_{x,y'}. Thus by induction, the order of d^t_{x,y} is the same as that of R_{x,y} for all t. Since A^t(x, y) and d^t_{x,y} are similarly ordered, we can apply the Chebyshev sum inequality (with weights π_{θ^t_x}(y)) to get:

\[
\Big( \sum_y \pi_{\theta^t_x}(y) \Big) \cdot \sum_y \pi_{\theta^t_x}(y)\, A^t(x, y)\, d^t_{x,y}
\;\ge\;
\Big( \sum_y \pi_{\theta^t_x}(y)\, A^t(x, y) \Big) \cdot \Big( \sum_y \pi_{\theta^t_x}(y)\, d^t_{x,y} \Big),
\]

with equality if and only if A^t(x, y) or d^t_{x,y} is constant over all y (i.e., constant reward). Since the expected advantage satisfies ∑_y π_{θ^t_x}(y) · A^t(x, y) = 0 and ∑_y π_{θ^t_x}(y) = 1, we have:

\[
\nabla_{d^t_x} J(\theta^t_x)
= \frac{1}{\|d^t_x\|} \cdot \sum_y \pi_{\theta^t_x}(y)\, A^t(x, y)\, d^t_{x,y} \ge 0.
\]

The equality holds if and only if R_{x,y} is constant over all y.

E Statistical Comparison of Different Metrics

Empirical Setup. We evaluate three RLVR models (ORZ, DAPO, and UniReason) and their base counterparts. For each model, we generate 32 responses per question from the AIME-24 dataset with top-p = 0.7 and temperature = 1.0. Our analysis focuses on several metrics comparing each model pair: the base/RLVR model's entropy, the KL divergences, and the log-probability difference. The probability distribution versus different ∆log p bins in Fig. 3(b) is also measured on DAPO's generations under this setting.

Statistics of Different Metrics. We compute each metric for the three RLVR model pairs on both the base model's and the RLVR model's generations. As shown in Fig. 12, the distribution of the log-probability difference ∆log p is bimodal, with a positive tail for the RLVR model's generated text and a negative tail for the base model's generation.
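The per-token statistic underlying this comparison is simply the signed difference of the two models' token log-probabilities on the same sequence. The following is a minimal sketch, not the authors' code; the array values and the band width are hypothetical toy numbers:

```python
import numpy as np

def delta_log_p(logp_rl, logp_base):
    """Signed per-token log-probability difference, Delta log p."""
    return np.asarray(logp_rl) - np.asarray(logp_base)

def tail_fractions(dlp, band=0.25):
    """Fraction of tokens inside [-band, band] and in each signed tail."""
    dlp = np.asarray(dlp)
    inside = float(np.mean(np.abs(dlp) <= band))
    pos_tail = float(np.mean(dlp > band))
    neg_tail = float(np.mean(dlp < -band))
    return inside, pos_tail, neg_tail

# Toy values: most tokens barely move, a few are strongly re-weighted,
# mirroring the sparse, bimodal pattern described above.
dlp = delta_log_p([0.01, -0.02, 0.005, 3.2, -4.1, 0.0, 0.1, -0.15],
                  [0.0] * 8)
inside, pos, neg = tail_fractions(dlp)
```

In practice the two log-prob arrays would come from scoring one generated response under both the RLVR and base models; the near-zero mass inside the band is what makes ∆log p a sparse, direction-aware selection criterion.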
In contrast, the distributions of the other, magnitude-based metrics are nearly identical regardless of which model generated the output (Figs. 13–15).

Word Clouds of High-∆log p Tokens. To gain qualitative insight into the tokens with high ∆log p, i.e., those whose probabilities are substantially increased by RLVR training, we generate word clouds from the top-100 high-∆log p tokens for each model (Figure 11). As the figure shows, these tokens correspond to words related to problem-solving, and they fall into two clear categories: explicit reasoning actions (e.g., combine, break, simplify) and logical transitions (e.g., wait, think, step). The prevalence of this vocabulary suggests that the RLVR model has learned to construct more effective reasoning processes.

F The Use of Large Language Models

We use LLMs only to polish some of the language of this paper. All content was originally drafted by the authors. The use of LLMs was restricted to refining pre-existing text, and any suggested modifications were reviewed by the authors to confirm their accuracy and alignment with the original meaning.

Figure 11: Word clouds of the top-100 tokens by log-probability difference ∆log p, measured with different RLVR-trained models: (a) DAPO, (b) ORZ, (c) UniReason.
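A ranking like the one behind Figure 11 can be obtained by aggregating ∆log p per token string and keeping the highest-scoring entries. This is a hedged sketch under the assumption that we aggregate by mean; the token stream and scores below are toy values, not data from the paper:

```python
from collections import defaultdict

def top_tokens_by_dlp(tokens, dlp, k=100):
    """Rank token strings by their mean Delta log p and keep the top k."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for tok, d in zip(tokens, dlp):
        sums[tok] += d
        counts[tok] += 1
    means = {t: sums[t] / counts[t] for t in sums}
    # Sort token strings by descending mean Delta log p.
    return sorted(means, key=means.get, reverse=True)[:k]

# Hypothetical token stream with made-up Delta log p scores.
tokens = ["wait", "the", "combine", "the", "simplify", "wait"]
scores = [2.0, 0.0, 1.5, 0.1, 1.8, 2.4]
top3 = top_tokens_by_dlp(tokens, scores, k=3)
```

The resulting ranked list would then be fed to any word-cloud renderer; other aggregations (sum, max) are equally plausible choices.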
Figure 12: Histograms of the log-probability difference between each RLVR model and its base model, measured on both the RLVR and the base model's generations: (a) UniReason − Base (97.3% of tokens within [−0.25, 0.25] on base generations; 92.2% on RLVR generations); (b) DAPO − Base (90.5% within [−0.5, 0.5] on base generations; 90.9% on RLVR generations); (c) ORZ − Base (95.1% within [−0.5, 0.5] on base generations; 89.3% on RLVR generations).

Figure 13: Divergence and entropy histograms of UniReason and its corresponding base model, measured on UniReason's or the base model's generations: (a) divergence on UniReason's generations; (b) entropy on UniReason's generations (82.0% of RLVR-model and 86.9% of base-model entropies below 0.1); (c) divergence on the base model's generations; (d) entropy on the base model's generations (92.7% and 94.4% below 0.1, respectively).
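The token-level entropy histogrammed in Figures 13–15 is the Shannon entropy of the next-token distribution. A minimal sketch, under the assumption that raw per-position logits are available from the model:

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token softmax distribution."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                      # stabilize exp() against overflow
    p = np.exp(z) / np.exp(z).sum()
    # Clip avoids log(0) for probabilities that underflow to zero.
    return float(-(p * np.log(np.clip(p, 1e-12, None))).sum())

# A sharply peaked distribution has near-zero entropy; a uniform one
# over V tokens has entropy log(V).
peaked = token_entropy([10.0, 0.0, 0.0, 0.0])
uniform = token_entropy([1.0, 1.0, 1.0, 1.0])
```

Computed over every generated position, the fraction of values below a small cutoff (e.g., 0.1) gives the "Less than 0.1" statistics annotated in the figures.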
Figure 14: Divergence and entropy histograms of DAPO and its corresponding base model, measured on DAPO's or the base model's generations: (a) divergence on DAPO's generations; (b) entropy on DAPO's generations (72.2% of RLVR-model and 74.3% of base-model entropies below 0.2); (c) divergence on the base model's generations; (d) entropy on the base model's generations (74.0% and 83.1% below 0.1, respectively).

Figure 15: Divergence and entropy histograms of ORZ and its corresponding base model, measured on ORZ's or the base model's generations: (a) divergence on ORZ's generations; (b) entropy on ORZ's generations (80.2% of RLVR-model and 55.7% of base-model entropies below 0.1); (c) divergence on the base model's generations; (d) entropy on the base model's generations (84.5% and 83.1% below 0.1, respectively).