
Paper deep dive

Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, Jingren Zhou

Year: 2026 · Venue: arXiv preprint · Area: cs.CL · Type: Preprint · Embeddings: 126

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/26/2026, 1:35:02 AM

Summary

This paper presents a systematic empirical study of the token-level distributional shifts induced by Reinforcement Learning with Verifiable Rewards (RLVR) in Large Language Models (LLMs). The authors demonstrate that RLVR fine-tuning is a sparse and targeted refinement process, where only a small fraction of token distributions exhibit significant divergence from the base model. Through cross-sampling interventions, they confirm that these sparse, high-divergence token decisions are functionally critical for recovering performance gains, contrasting this behavior with the more global distributional changes observed in Supervised Fine-Tuning (SFT).

Entities (6)

RLVR · methodology · 100%
JS Divergence · metric · 99%
Qwen2.5-32B · model · 98%
DAPO · algorithm · 95%
GRPO · algorithm · 95%
SFT · methodology · 95%

Relation Signals (4)

JS Divergence measures Distributional Shifts

confidence 98% · To quantify distributional differences, we use the Jensen–Shannon (JS) divergence

RLVR induced Sparse Distributional Shifts

confidence 95% · RL fine-tuning induces highly sparse and targeted changes

DAPO isa RLVR

confidence 95% · RLVR variants trained using DAPO

SFT produces Denser Distributional Shifts

confidence 90% · Supervised fine-tuning produces substantially denser and more globally distributed shifts

Cypher Suggestions (2)

Find all algorithms associated with RLVR methodology · confidence 90% · unvalidated

MATCH (a:Algorithm)-[:IS_A]->(m:Methodology {name: 'RLVR'}) RETURN a.name

Identify metrics used to analyze distributional shifts · confidence 85% · unvalidated

MATCH (m:Metric)-[:MEASURES]->(p:Phenomenon {name: 'Distributional Shifts'}) RETURN m.name

Abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR's distributional effects organized around three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2) the impact of token-level distributional shifts on sequence-level reasoning performance through cross-sampling interventions, and (3) fine-grained mechanics of these shifts at the token level. We find that RL fine-tuning induces highly sparse and targeted changes, with only a small fraction of token distributions exhibiting meaningful divergence between the base and RL policies. We further characterize the structure and evolution of these shifts through analyses of token entropy, positional concentration, and reallocation of probability mass. To assess the functional importance of these sparse changes, we conduct cross-sampling experiments that selectively swap token choices between the base and RL models with varying intervention budgets. We show that inserting only a small fraction of RL-sampled tokens into base generations progressively recovers RL performance gains, while injecting a similarly small number of base token choices into otherwise RL-generated sequences collapses performance to base levels, isolating a small set of token-level decisions directly responsible for RLVR's performance gains. Finally, we explore divergence-weighted variants of the advantage signal as a diagnostic intervention, finding that they can yield improvements over baselines. Together, our results shed light on the distributional changes induced by RLVR and provide a fine-grained, token-level lens for understanding RLVR fine-tuning as a targeted refinement process.

Tags

ai-safety (imported, 100%) · cscl (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

126,019 characters extracted from source content.


†footnotetext: Published at ICLR 2026.

Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

Qwen Pilot Team, Alibaba Group. Full author list available in the Authors section.

Figure 1: Overview: RLVR acts as sparse, high-impact token-level refinement. RL fine-tuning induces sparse distributional shifts: divergence between base and RL token distributions remains near zero at most positions, with only a small subset of tokens exhibiting substantial changes. (Panels: (a) divergence across a sequence, SimpleRL; (b) divergence across a sequence, DAPO; (c) visualization of RLVR as a sparse trajectory steering mechanism.)

1 Introduction

Recent advances in reinforcement learning with verifiable rewards (RLVR) (Lambert et al., 2024) for reasoning in large language models (LLMs), such as Group Relative Policy Optimization (GRPO) (Shao et al., 2024), have enabled substantial performance improvements on challenging reasoning and mathematical benchmarks. Despite this empirical success, the mechanisms through which RLVR modifies model behavior remain unclear. Most evaluations of RL fine-tuning focus on aggregate response-level metrics such as accuracy, reward, and response length. While informative, these provide only a high-level view of improvement and offer limited insight into the mechanisms by which model behavior changes. In particular, a central unresolved question is: how does RLVR reshape the token-level predictive distributions of a base model, and which of these changes actually drive downstream reasoning gains? Recent work has begun to analyze RL fine-tuning through token-level entropy and uncertainty perspectives (Wang et al., 2025; Cheng et al., 2025; Cui et al., 2025), highlighting the role of high-entropy tokens and exploration dynamics.
Complementary analyses study RL-induced changes through token-level KL divergence and rank-shift statistics (Huan et al., 2025) as well as through the perspective of reasoning patterns (Chen et al., 2026). However, a more detailed distributional view of change remains missing: how such shifts are structured across positions and contexts, how probability mass is reallocated across candidate tokens, how they evolve over training, and to what extent they are responsible for RLVR's performance gains.

In this paper, we develop a fine-grained, token-level perspective on RLVR through the lens of distributional change. We perform a systematic empirical study of how RLVR alters next-token distributions relative to the base model, and connect these distributional shifts directly to sequence-level reasoning performance. Our analyses reveal that RLVR acts primarily as a sparse and targeted refinement process: most token distributions remain nearly unchanged, while a small subset of high-divergence positions carries disproportionate functional importance, guiding generation toward more effective reasoning trajectories otherwise accessible under the base model. Our contributions are organized as follows:

• Structure of Token-Level Distributional Shifts. We show that RLVR induces sparse token-level distribution shifts relative to the base model. We characterize the structure of these shifts through divergence, entropy, and positional analyses, and compare across multiple RLVR methods, revealing differences in exploration and refinement behavior.

• Cross-Sampling Interventions. We use forward and reverse cross-sampling interventions to measure the role of divergent token decisions. We show that modifying only a small fraction of token choices is sufficient to recover (in base-model generations) or erase (in RL-model generations) RLVR performance gains, linking the sparse distributional shifts directly to sequence-level reasoning outcomes. These results demonstrate that RL and base policies are behaviorally similar across most tokens but differ critically at a sparse set of high-impact decisions that steer reasoning trajectories.

• Fine-Grained Distribution Mechanics. We analyze how RLVR modifies token distributions at high-divergence positions and show that it primarily reallocates probability mass within an existing candidate set rather than introducing new tokens. We support this with top-k overlap, rank, tail-probability, and training-evolution analyses.

• Divergence-Weighted Advantage. Motivated by these findings, we study divergence-weighted variants of the RLVR advantage signal as a diagnostic objective modification and show that they can improve over baselines.

Taken together, our results provide a unified token-level picture of RLVR fine-tuning: rather than globally rewriting model behavior, RLVR predominantly performs sparse, structured probability reallocation in a small set of critical token positions that steer downstream reasoning trajectories. This distributional and functional perspective helps clarify the mechanisms by which RLVR improves reasoning in LLMs.

2 Token Distribution Analysis between Base and Fine-tuned Models

We begin by analyzing the general structure of distributional shifts induced by RLVR, with the goal of characterizing how token-level predictions differ between the base model and its RL-finetuned counterpart. Our analysis compares next-token distributions under identical sequence contexts: we take sequences generated by the RL policy and evaluate both models' conditional distributions at each token position. This framing treats the RL-generated trajectory as a reference path and allows us to quantify how the base model would need to adapt in order to emulate it.
2.1 Preliminaries

For each token position t and prefix x_{<t}, let π_base(·|x_{<t}) and π_RL(·|x_{<t}) denote the conditional next-token distributions of the base and RL models, respectively, defined on a vocabulary space V. To quantify distributional differences, we use the Jensen–Shannon (JS) divergence, defined as

JS(x_{<t}) = D_JS(π_base(·|x_{<t}) ‖ π_RL(·|x_{<t})) = (1/2) D_KL(π_base(·|x_{<t}) ‖ M_t) + (1/2) D_KL(π_RL(·|x_{<t}) ‖ M_t),

where M_t = (1/2)(π_base(·|x_{<t}) + π_RL(·|x_{<t})). One could use any notion of divergence or distance between probability measures, but we use JS divergence over something like KL divergence because: (i) it is symmetric, avoiding directional considerations; (ii) it is bounded in [0, log 2], preventing extreme values from dominating aggregate statistics; and (iii) it remains well-defined even when the measures lack absolute continuity with respect to each other. The latter is particularly important in practice, as memory constraints often limit the retrieval of the full distribution over the entire vocabulary, and also when comparing top-p truncated distributions, for which KL divergence may be undefined. Unless otherwise stated, divergences are computed on top-p truncated distributions using the same sampling configuration employed during generation, while entropy and probabilities are from the full estimated distributions. This ensures that the comparisons reflect the models' effective differences under the actual sampling regime, while still grounding the entropy and probabilities in their complete output distributions. Robustness checks across different top-p values and against estimates of the full distributions are provided in Appendix A.5 (Figures 30 and 32).

Models and Datasets.
Our primary analysis focuses on Qwen2.5-32B (Qwen et al., 2025) as the base model, with RLVR variants trained using DAPO (Yu et al., 2025) and GRPO, the latter paired with the corresponding SimpleRL model (Zeng et al., 2025). For evaluation on AIME 2024 and AIME 2025, we sample 32 responses per problem for robustness. We further extend the analysis to additional models (Qwen2.5-Math-7B (Yang et al., 2024) with two variants corresponding to different upper clip settings, Qwen3-8B-Base (Yang et al., 2025a) with DAPO, and Mistral-Small-24B (MistralAI, 2025) with SimpleRL), additional datasets (AIME25, GPQA (Rein et al., 2023), and the models' respective fine-tuning datasets), and to comparisons with supervised fine-tuning (SFT). These extensions, reported in Appendix A.4 and Appendix A.5, confirm that our findings generalize across models, datasets, and training paradigms.

2.2 Distribution Shifts Are Highly Targeted and Sparse

A natural starting question is: how broadly are distributional shifts distributed across token positions? To answer this, we examine the token-level JS divergence between the base and RL-finetuned models. Figure 2 presents log-scaled histograms and percentile curves of JS divergence for DAPO and SimpleRL on their respective generated responses for AIME 2024.

Figure 2: JS divergence distributions for Qwen2.5-32B DAPO and SimpleRL on AIME 2024. (Panels: (a) DAPO histogram, log y-axis; (b) DAPO percentile curve; (c) SimpleRL histogram, log y-axis; (d) SimpleRL percentile curve.)

The results reveal that RLVR refinement is highly sparse at the token distribution level. Under DAPO, more than 83% of token positions exhibit near-zero divergence, while this proportion exceeds 98% under SimpleRL. The clear spike at zero on the histograms and the steep rise of the percentile curves indicate that only a small subset of token positions undergo substantial distributional change as a result of RLVR.
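As a concrete sketch of this measurement (not the authors' code; function names and the toy values below are illustrative), the per-position JS divergence and the near-zero fraction could be computed as:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two next-token distributions,
    given as dicts mapping token -> probability over a shared candidate
    set (e.g. top-p truncated and renormalized)."""
    support = set(p) | set(q)
    # Mixture M_t = (p + q) / 2, then JS = (KL(p || M) + KL(q || M)) / 2.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in support}
    def kl(a):
        return sum(a.get(t, 0.0) * math.log(a.get(t, 0.0) / m[t])
                   for t in support if a.get(t, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def near_zero_fraction(js_values, tol=1e-3):
    """Fraction of token positions whose divergence is (near) zero."""
    return sum(v <= tol for v in js_values) / len(js_values)

# Identical distributions give 0; disjoint ones attain the log(2) bound.
assert js_divergence({"a": 0.7, "b": 0.3}, {"a": 0.7, "b": 0.3}) < 1e-12
assert abs(js_divergence({"a": 1.0}, {"b": 1.0}) - math.log(2)) < 1e-9

# A sparse shift profile: 95% of positions unchanged, a few large shifts.
js = [0.0] * 95 + [0.3, 0.4, 0.5, 0.6, 0.69]
assert near_zero_fraction(js) == 0.95
```

Because JS is bounded by log 2, the histogram spike at zero and the tail of large values can be compared directly across methods without any value dominating the aggregate.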
Comparing the two, DAPO exhibits a broader divergence distribution and a more gradual percentile curve, consistent with its clip-higher mechanism and lack of KL regularization, permitting broader exploratory updates. In contrast, SimpleRL imposes stricter constraints, resulting in more tightly concentrated changes. Importantly, even in the absence of KL regularization, the DAPO policy maintains near-zero divergence at most token distributions. For a more controlled comparison of models fine-tuned on the same dataset, Appendix A.5.2 presents results for Qwen2.5-Math-7B trained with DAPO, comparing upper clip settings of 0.28 and 0.2. Analogous to the results for the 32B models, the more restrictive 0.2 upper clip setting yields sparser distributional shifts, as shown by the percentiles corresponding to near-zero divergence (Figure 33). However, on its high-divergence set, the JS values are higher for the 0.2 clip, as indicated by the higher upper percentiles. This indicates that clip-higher admits a wider set of high-divergence token distributions but with reduced divergence magnitude at the extremes. We observe consistent behavior on AIME 2025 (Figure 31) and GPQA-Diamond (Figure 28), and the observed sparsity remains stable under different top-p settings and when using estimated full distributions instead of truncated ones (Appendix A.5, Figures 30 and 32).

2.3 Positional Concentration

Beyond how broadly changes are distributed across token positions, we next ask: where within a generated sequence do distributional shifts tend to occur? Figure 3 plots the mean and median JS divergence as a function of normalized token position (token index divided by sequence length), with percentile bands, for DAPO and SimpleRL on AIME 2024.

Figure 3: Mean and median JS divergence by normalized token position, with percentile bands, for (a) DAPO and (b) SimpleRL. Both methods concentrate updates at the start and, to a lesser degree, at the end of responses.
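The binned positional profile underlying such a plot can be sketched as follows (a minimal illustration under assumed inputs: each sequence is a list of per-position JS values; the binning scheme is ours, not necessarily the paper's):

```python
def positional_divergence(sequences, n_bins=10):
    """Mean JS divergence by normalized position (index / length),
    aggregated into n_bins equal-width bins across sequences."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for js_seq in sequences:
        length = len(js_seq)
        for i, v in enumerate(js_seq):
            b = min(n_bins - 1, int(i / length * n_bins))
            sums[b] += v
            counts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Toy sequences with elevated divergence at the start and (less so) the end,
# mimicking the aggregate shape described for Figure 3.
seqs = [[0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2] for _ in range(4)]
curve = positional_divergence(seqs)
assert curve[0] == 0.5 and curve[-1] == 0.2 and max(curve[1:-1]) == 0.0
```

Normalizing by sequence length lets responses of different lengths contribute to the same bins, which is what makes the early/late concentration visible in the aggregate.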
Both models exhibit a clear positional structure: average divergence across sequences is consistently higher near the beginning of the response, decreases through the middle, and increases modestly again toward the end. The early concentration aligns with the modification of initial high-level branching decisions, while the late increase aligns with adjustments to answer formatting and termination behavior. However, this aggregate trend masks substantial variability at the level of individual sequences; as reflected in Figures 1(a) and 1(b) and the wide percentile spread in Figure 3, high divergence occurs sporadically throughout the sequence. Comparing the two upper clip variants of Qwen2.5-Math-7B DAPO (Figure 34), both clip settings exhibit larger average divergences at the beginning of the sequence, with a smaller increase near the end, consistent with the behavior seen in the 32B models. Notably, the 0.2 clip setting shows higher average divergence at the beginning of the sequence compared to the 0.28 setting.

2.4 Divergence–Entropy Relationship

To further understand the general structure underlying these sparse distributional shifts, we ask: how are such shifts related to the model's token-level entropy? We thus examine the relationship between distributional divergence and predictive entropy at the token level. At each token position t, we compute the token-level entropy

H_π(x_{<t}) = −∑_{v∈V} π(v|x_{<t}) log π(v|x_{<t}),

and analyze how entropy relates to the distributional shifts from the base to the RL model. Prior work suggests that RLVR updates may primarily affect high-entropy predictions while leaving low-entropy predictions largely unchanged (Wang et al., 2025). We explore this perspective by comparing entropy distributions across low- and high-divergence token positions.
Specifically, token positions are grouped into low- and high-divergence bins, and we compare the entropy distributions of both the base and RL models within each bin. Figure 4 shows these results for DAPO, with corresponding SimpleRL results provided in Appendix A.5 (Figure 21).

Figure 4: Entropy distributions for (a) low-divergence (JS < 0.1) and (b) high-divergence (JS > 0.1) distributions for DAPO. Low-divergence tokens are generally low-entropy, while high-divergence tokens span both high- and low-entropy regions, indicating that DAPO can modify even initially confident predictions.

The results show that low-divergence token distributions are largely low-entropy, indicating that the distributions that are preserved are mostly initially low-entropy, though a non-negligible portion of them lie in the high-entropy regime. High-divergence contexts, however, can span a broad entropy range. In particular, DAPO modifies both initially high- and low-entropy predictions, demonstrating its ability to override even confident base-model outputs. By contrast, SimpleRL concentrates divergence more strongly in higher base-entropy regions, reflecting a more conservative update regime. Isolating the effect of clip-higher, Figure 35 illustrates this contrast more clearly. At high-divergence positions, the higher 0.28 upper clip produces a greater proportion of distributions with low base entropy, whereas the 0.2 clip concentrates its high-divergence distributions in the higher base-entropy regime. Additionally, the resulting RL entropy is higher under clip-higher, while for the 0.2 clip it is concentrated at lower values, consistent with the overall entropy collapse observed under standard clipping (Yu et al., 2025) and the steadily increasing entropy induced by clip-higher.
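A minimal sketch of this entropy computation and divergence binning (illustrative only; the 0.1 threshold matches the one quoted for Figure 4, everything else is an assumption):

```python
import math

def token_entropy(dist):
    """H_pi(x_<t) = -sum_v pi(v|x_<t) log pi(v|x_<t) for a next-token
    distribution given as a dict token -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def split_by_divergence(positions, threshold=0.1):
    """Split (js, entropy) pairs into low-/high-divergence bins,
    mirroring the binning used for the divergence-entropy analysis."""
    low = [h for js, h in positions if js < threshold]
    high = [h for js, h in positions if js >= threshold]
    return low, high

# A fully confident prediction has zero entropy; a uniform binary one, log 2.
assert token_entropy({"a": 1.0}) == 0.0
assert abs(token_entropy({"a": 0.5, "b": 0.5}) - math.log(2)) < 1e-12

# Two low-divergence positions (confident) and one high-divergence position.
low, high = split_by_divergence([(0.01, 0.2), (0.5, 1.4), (0.02, 0.1)])
assert low == [0.2, 0.1] and high == [1.4]
```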
Per-sequence scatter plots (Appendix A.5, Figure 27) show some variability across sequences, with DAPO exhibiting an overall broader entropy spread among divergent positions and SimpleRL showing a tighter concentration, consistent with our aggregate analysis.

2.5 Semantic Identity of Divergent Tokens

Given the sparsity and general structure of these shifts, a natural next question is: which types of tokens are actually being targeted by RL fine-tuning?

Figure 5: Word clouds of (a) high-divergence (JS > 0.1) and (b) low-divergence (JS < 0.01) tokens under DAPO.

To investigate this, we examine which types of tokens tend to be sampled from high- versus low-divergence distributions. Figure 5 visualizes representative examples using word clouds, where the size of each token is proportional to its frequency. On initial examination, tokens appearing in high-divergence distributions include common function words, reasoning-related terms, and certain equation fragments, whereas those in low-divergence distributions are dominated by numerals, operators, and structural components of mathematical expressions. However, token identity alone does not determine divergence behavior. Figure 23 shows the full JS divergence distributions for the tokens sampled most frequently from high- and low-divergence distributions, revealing substantial context dependence. For example, the word "the" appears among the most frequent high-divergence tokens, yet its full divergence distribution across all sampled occurrences is overwhelmingly concentrated in the lower regime. This suggests that token identity alone is insufficient to characterize divergence, and that divergence must be understood contextually rather than solely through token semantics.
Instead, what is likely more important is the role the token plays within the reasoning trajectory and in the (base) model's predictive distribution (as we'll see in the cross-sampling experiments in Section 3).

2.6 Comparison with Supervised Fine-Tuning (SFT)

While the above analyses reveal that RLVR induces sparse distributional shifts, it remains unclear whether this behavior is unique to RL fine-tuning. This raises the question: is such sparsity a distinctive property of RLVR, or a more general feature of fine-tuning? A natural point of comparison is supervised fine-tuning (SFT), which optimizes models to imitate target tokens rather than optimizing verifiable rewards on self-generated trajectories. Appendix A.4 presents a controlled comparison between SFT and RLVR (DAPO) on Qwen2.5-32B. Under the same JS divergence measurements (Section 2.2), SFT exhibits a substantially larger high-divergence set and a broader divergence distribution than RLVR (Figure 12). This demonstrates that the sparsity of distributional shifts observed under RLVR is not a generic consequence of fine-tuning.

Figure 6: JS divergence distributions for Qwen2.5-32B fine-tuned with SFT on AIME 2024. (Panels: (a) histogram; (b) percentiles.)

Positional analysis further shows that SFT induces elevated divergence across the entire response, while still exhibiting increased divergence near the start of the sequence (Figure 14), mirroring the early-position effects seen in RLVR. Finally, under the divergence–entropy analysis (Section 2.4), SFT's divergent tokens concentrate more strongly in regions of high base-model entropy (compared with DAPO). While this concentration may be partially influenced by SFT outputs appearing more uncertain when evaluated under the base model, the resulting fine-tuned entropy values are nevertheless substantially lower than those of the base model (Figure 15).
These results are consistent with SFT's objective of directly learning target outputs, leading to globally broader and sharper distributional updates.

Takeaways: General Structure of RLVR Distribution Shifts

RLVR induces sparse and structured token-level distribution shifts. Across models and datasets, we observe:

• Token-level sparsity of shifts: The vast majority of token positions exhibit near-zero JS divergence between base and RL models (often >80% and up to >98%), indicating that RLVR modifies only a small subset of token distributions (even without KL regularization), despite significant performance differences between the base and fine-tuned model.

• Method-dependent spread: Less constrained methods (e.g., clip-higher DAPO) produce a broader divergence distribution, while more constrained methods (e.g., SimpleRL or lower clip settings) concentrate updates on fewer token distributions.

• Positional concentration: Across sequences, divergence is consistently high near the beginning of responses and increases again near the end; however, individual sequences exhibit varying divergence throughout.

• Entropy interaction: Low-divergence positions are largely low-entropy (confident) predictions, though some are high-entropy. High-divergence token distributions span a wide entropy range under DAPO, showing that RLVR can override even low-entropy base-model predictions, while more conservative methods focus on higher-entropy regions.

• Context dependence: High-divergence token distributions are not determined solely by token identity; the same token can be sampled from both low- and high-divergence distributions depending on context.

• Contrast with SFT: Supervised fine-tuning produces substantially denser and more globally distributed shifts, indicating that the observed sparsity is not a generic feature of fine-tuning.
3 Cross-Sampling: Functional Importance of Divergent Distributions

In the previous section, we showed that only a small fraction of token distributions exhibit substantial shifts between the base and RL models. This observation motivates a fundamental question: are these divergent token distributions directly responsible for the performance gains induced by RLVR? More generally, to what extent are the base and RL policies functionally different over their entire sequence distributions? More concretely, can the accuracy improvements of the RL model be recovered by generating primarily under π_base while selectively substituting a small number of tokens sampled from π_RL? Conversely, does the RL model's performance degrade when a small number of its token choices are replaced with those sampled from π_base? If RLVR's gains are indeed concentrated in these sparse locations, then selectively intervening at such positions should have a disproportionate impact on performance. Furthermore, what happens if we apply only a limited number of interventions and then continue generation under the primary policy? Does performance progressively improve or degrade as we increase the number of interventions, or does it only change once most or all of the intervention-induced modifications are applied? To answer these questions, we conduct controlled cross-sampling experiments that selectively swap token choices between the base model π_base and the RL-trained model π_RL. We consider two complementary interventions: (i) forward cross-sampling, which injects RL-sampled tokens into base-model generations, and (ii) reverse cross-sampling, which replaces RL-sampled tokens with base-model tokens during RL generation. Together, these interventions probe the contribution of RL-induced token-level changes by evaluating how introducing them into base-model trajectories or reverting them in RL trajectories influences reasoning performance.
The general implementation procedure is summarized in Algorithm 1 of Appendix A.6.

3.1 Cross-Sampling Framework

Let (X_t)_{t≥1} denote the sequence of random variables on V generated during decoding, and define the stopping time

τ := inf{t ≥ 1 : X_t = EOS} ∧ T_max,

where EOS is the end-of-sequence token and T_max is the maximum number of tokens to generate. The generated response is then the finite sequence X_{1:τ}. Let π_prim denote the primary policy, which governs generation by default, and let π_int denote the intervention policy, which is used only at selected positions. These policies induce sequence-level distributions P_prim and P_int over finite sequences. To model cross-sampling, we introduce a switching rule S : V^{<ℕ} → {0,1}, where V^{<ℕ} is the set of finite sequences over the vocabulary V, which determines, at each generation step, whether the next token is sampled from the intervention policy (S_t = 1) or the primary policy (S_t = 0). Given a partial sequence X_{<t}, we define the switching variable

S_t := S(X_{<t}) ∈ {0,1},

and the resulting mixed policy governing the law of the next token:

X_t ∼ π_mix^{(prim,int)}(·|X_{<t}) = (1 − S_t) π_prim(·|X_{<t}) + S_t π_int(·|X_{<t}).

The corresponding sequence-level distribution is then denoted by P_mix^{(prim,int)}. In our experiments, to align with the analysis in Section 2, the switching rule S is defined in terms of the token-level Jensen–Shannon divergence D_JS between π_prim(·|X_{<t}) and π_int(·|X_{<t}), but one could use any notion of divergence or distance between probability measures. Given a fixed threshold ε ≥ 0, we set

S(X_{<t}) = 1{D_JS(π_prim(·|X_{<t}) ‖ π_int(·|X_{<t})) > ε},

so that cross-sampling intervenes only at high-divergence positions.
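One decoding step of this mixed policy can be sketched as follows (a hedged illustration, not the paper's Algorithm 1; the toy distributions are made up):

```python
import math
import random

def js_divergence(p, q):
    """JS divergence between next-token distributions (dicts token -> prob)."""
    support = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in support}
    def kl(a):
        return sum(a.get(t, 0.0) * math.log(a.get(t, 0.0) / m[t])
                   for t in support if a.get(t, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def mixed_next_token(p_prim, p_int, eps, rng):
    """Sample X_t from pi_int when D_JS > eps (S_t = 1), else from pi_prim."""
    s_t = 1 if js_divergence(p_prim, p_int) > eps else 0
    dist = p_int if s_t else p_prim
    tokens, probs = zip(*dist.items())
    return rng.choices(tokens, weights=probs)[0], s_t

rng = random.Random(0)
# Identical distributions: no intervention (S_t = 0).
_, s = mixed_next_token({"a": 0.6, "b": 0.4}, {"a": 0.6, "b": 0.4}, 0.05, rng)
assert s == 0
# Strongly shifted distribution: intervene (S_t = 1).
_, s = mixed_next_token({"a": 0.99, "b": 0.01}, {"b": 0.99, "a": 0.01}, 0.05, rng)
assert s == 1
```

Because S_t depends only on the two conditional distributions at the current prefix, the rule can be evaluated online during decoding with no lookahead.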
In Appendix A.1, we provide simple bounds on divergences between the sequence-level distributions P_mix and P_int under different cross-sampling settings.

Forward Cross-Sampling. In forward cross-sampling, the response is generated primarily under the base policy, which serves as the primary policy, i.e., π_prim = π_base. The intervention policy is the RL policy, π_int = π_RL. At positions where S_t = 1, the next token is sampled from π_RL(·|X_{<t}), after which generation continues under the base policy until the next intervention point or termination. This procedure tests whether selectively injecting RL token choices into trajectories that are otherwise generated by the base model is sufficient to recover RL-level performance.

Reverse Cross-Sampling. In reverse cross-sampling, the roles of the base and RL policies are reversed. The response is generated primarily under the RL policy, which serves as the primary policy, i.e., π_prim = π_RL, while the intervention policy is the base policy, π_int = π_base. At positions where S_t = 1, the next token is sampled from π_base(·|X_{<t}), while all other positions follow the RL policy. This intervention selectively replaces RL-sampled tokens at high-divergence positions with base-model choices, allowing us to quantify how rapidly RL performance degrades when those decisions are removed.

Evaluation with Intervention Budgets. We further evaluate the effect of cross-sampling by limiting the number of interventions, generating responses under the mixed policy P_mix with a fixed number of cross-sampling interventions, after which generation proceeds under the primary policy. This can be viewed as enforcing an intervention budget by setting S_t = 0 for all t such that ∑_{s=1}^{t−1} S_s ≥ k, where k denotes the maximum number of cross-sampling interventions. We then measure accuracy for a fixed divergence threshold ε as the intervention count k increases.
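Putting the threshold rule and the budget together, a full budget-limited generation loop might look like this (a sketch under assumed interfaces: `prim_step`/`int_step` return next-token distributions given the prefix; the `"<eos>"` token and toy policies are illustrative):

```python
import math
import random

def js_divergence(p, q):
    """JS divergence between next-token distributions (dicts token -> prob)."""
    support = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in support}
    def kl(a):
        return sum(a.get(t, 0.0) * math.log(a.get(t, 0.0) / m[t])
                   for t in support if a.get(t, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def cross_sample(prim_step, int_step, eps, budget, max_len, rng):
    """Generate under the primary policy, sampling from the intervention
    policy where D_JS > eps, for at most `budget` interventions; once the
    budget is spent, S_t is forced to 0 and generation continues under
    the primary policy until EOS or max_len."""
    seq, used = [], 0
    for _ in range(max_len):
        p_prim, p_int = prim_step(seq), int_step(seq)
        if used < budget and js_divergence(p_prim, p_int) > eps:
            dist, used = p_int, used + 1
        else:
            dist = p_prim
        tokens, probs = zip(*dist.items())
        tok = rng.choices(tokens, weights=probs)[0]
        if tok == "<eos>":
            break
        seq.append(tok)
    return seq, used

# Toy deterministic policies that always disagree: with budget k = 3,
# exactly the first three tokens come from the intervention policy.
prim = lambda seq: {"x": 1.0}
intv = lambda seq: {"y": 1.0}
seq, used = cross_sample(prim, intv, eps=0.05, budget=3, max_len=6,
                         rng=random.Random(0))
assert seq == ["y", "y", "y", "x", "x", "x"] and used == 3
```

Sweeping `budget` over k = 0, 1, 2, … while holding eps fixed reproduces the accuracy-versus-intervention-count curves described in the evaluation setup.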
This setup probes whether early, limited interventions are sufficient to induce downstream performance differences when generation is subsequently completed by the primary policy. In forward cross-sampling, this tests whether a small number of RL-induced interventions can steer trajectories such that the base model can complete them with improved accuracy, even if high-divergence positions remain later in the sequence. In reverse cross-sampling, it evaluates whether the RL model can still produce correct solutions after introducing a small number of base-sampled tokens, or whether such perturbations instead lead to a corresponding degradation in performance.

Connection to Speculative Decoding and BiLD. Our cross-sampling framework is conceptually related to speculative decoding (Leviathan et al., 2023; Kim et al., 2023) in that generation depends on two next-token distributions. It is closer to Big Little Decoder (BiLD) (Kim et al., 2023), which defines a routing policy rather than an exact sampling scheme that preserves a designated target distribution. BiLD trades off latency and quality via fallback and rollback rules, while our approach defines a mixed policy $\pi_{\mathrm{mix}}^{(\mathrm{prim},\mathrm{int})}$ to investigate the role of high-divergence decisions between base and RL models for reasoning performance. BiLD triggers fallback based on small-model confidence (e.g., a max-probability threshold) and uses rollback based on a discrepancy between the small and large models' predictive distributions (a cross-entropy-based quantity), whereas we intervene when $D_{\mathrm{JS}}(\pi_{\mathrm{prim}}(\cdot \mid X_{<t}) \,\|\, \pi_{\mathrm{int}}(\cdot \mid X_{<t}))$ exceeds a threshold and investigate the impact on performance as we increase the number of interventions for a fixed threshold. BiLD further proposes prediction alignment, fine-tuning the small model on large-model outputs to reduce avoidable disagreements, while our setting typically uses models that already agree at most positions (base vs.
RL) to isolate fine-tuning effects.

3.2 Results and Findings

We now evaluate the functional impact of cross-sampling interventions on downstream task performance. Figures 7 and 8 present the accuracy curves for forward and reverse cross-sampling on Qwen2.5-32B fine-tuned with SimpleRL, evaluated on AIME 2024. Each point along the curve corresponds to the Mean@32 accuracy obtained by generating responses under the mixed policy $P_{\mathrm{mix}}$ with a fixed number of cross-sampling interventions, after which generation is completed under the primary policy.

(a) AIME 2024 (b) AIME 2025
Figure 7: Forward cross-sampling results (Qwen2.5-32B SimpleRL): injecting RL tokens into base generations progressively recovers RL accuracy.

Table 1 summarizes the number and proportion of cross-sampled tokens required to approximately recover RL-level performance (forward cross-sampling) or collapse to base-level performance (reverse cross-sampling) for Qwen2.5-32B with SimpleRL and DAPO on AIME 2024 and AIME 2025. Additional cross-sampling results, including experiments on additional model configurations and datasets, are provided in Appendix A.6.

Forward Cross-Sampling: A small fraction of RL-sampled tokens suffices to recover or even exceed RL-level performance when generation otherwise follows the base model.

Forward Cross-Sampling: From Figure 7 and Table 1, we observe that for Qwen2.5-32B with SimpleRL, forward cross-sampling under the mixed policy $\pi_{\mathrm{mix}}^{(\mathrm{base},\mathrm{RL})}$ recovers RL-level accuracy with remarkably few interventions. Injecting fewer than 4% RL-sampled tokens per sequence on average, corresponding to fewer than 40 effective token substitutions per response, is sufficient to close the performance gap from the base model (approximately 8%) to the RL model (approximately 25%) on AIME 2024.
On AIME 2025, the effect is even more pronounced: using only 1.53% effective cross-sampling, or roughly 13 average token substitutions per response, raises accuracy from about 5% to over 14%. Interestingly, this level of performance exceeds that of the RL policy $\pi_{\mathrm{RL}}$ itself, indicating that the mixed policy $\pi_{\mathrm{mix}}^{(\mathrm{base},\mathrm{RL})}$ can, in some cases, outperform the standalone RL policy. This potentially occurs because the mixed policy is close but not identical to the RL policy (for $\varepsilon > 0$), which may sometimes avoid failures induced by the RL model. Recovering RL-level performance for DAPO requires a larger number of interventions, reflecting its substantially stronger fine-tuned performance. On AIME 2024, approximately 7.8% effective tokens are enough to boost accuracy from roughly 8% to over 44%, while on AIME 2025, fewer than 6.5% effective interventions suffice to increase performance from around 5% to over 33%. Importantly, even though the performance gains are substantially larger, the number of critical token-level decisions remains small relative to sequence length.

(a) AIME 2024 (b) AIME 2025
Figure 8: Reverse cross-sampling results (Qwen2.5-32B SimpleRL): swapping RL tokens with base tokens in RL generations causes near-monotonic degradation toward base performance.

Reverse Cross-Sampling: Reverting a small fraction of RL tokens causes performance to collapse to, or even below, base-model levels when generation otherwise follows the RL model.

Reverse Cross-Sampling: Reverse cross-sampling results show that the RL policy is highly sensitive to a small number of its token-level decisions, operating similarly to the base policy apart from the small number of token positions with high divergence.
From Figure 8 and Table 1, we observe that for Qwen2.5-32B with SimpleRL, when generating under the mixed policy $\pi_{\mathrm{mix}}^{(\mathrm{RL},\mathrm{base})}$, that is, generating primarily with the RL policy except at high-divergence positions, only a small fraction of base-sampled tokens is enough to rapidly degrade performance. On AIME 2024, replacing approximately 5% of high-divergence distributions, corresponding to fewer than 30 effective base-sampled tokens per response, is sufficient to collapse accuracy from RL levels (around 25%) back to base-level performance (around 8%). This phenomenon also holds on AIME 2025: roughly 4.7% of effective base-sampled tokens (around 30 effective tokens per response) reduces accuracy from approximately 12.7% to below 4%, which is below base-level performance. For DAPO, a larger number of substitutions is required to erase its gains, consistent with its substantially stronger performance. On AIME 2024, reverting roughly 10% of effective tokens suffices to reduce accuracy from over 44% to near base levels (around 8%), while on AIME 2025, fewer than 10% effective reversions collapse performance from over 33% to below 4.5%. Importantly, even in these cases, the required reversions constitute a small fraction of the total generated tokens, reinforcing that RL-level performance, across both SimpleRL and DAPO, depends critically on a sparse set of token-level decisions. Notably, the substituted base tokens are mostly plausible and semantically reasonable (Figure 52), yet they nonetheless progressively derail the reasoning process. Even when two token choices are locally equivalent or interchangeable to a human reader, they can induce different downstream conditional distributions and lead to diverging reasoning responses, revealing substantial trajectory sensitivity of the model.

Progressive Steering of Reasoning Trajectories.
Across both forward and reverse cross-sampling, reasoning performance varies smoothly as the cross-sampling intervention budget increases. In the forward direction, accuracy improves steadily as additional RL-sampled tokens are introduced, with no sharp threshold, indicating that each intervention contributes positively, on average, to performance and that RL’s gains are sparsely distributed across multiple decision points rather than requiring all RL-induced changes. In reverse cross-sampling, performance degrades in a similarly smooth and near-monotonic manner as RL token choices are reverted, demonstrating that RL-level performance depends on preserving a sparse set of token-level shifts throughout the generation, even when the remainder of the response is produced under the RL policy. An important aspect underlying both settings is that cross-sampling interventions are applied sequentially along the generation trajectory, while decoding otherwise proceeds under a single primary policy. Intuitively, one might expect that modifying only a small number of token choices, particularly early ones, would have limited impact once generation continues under the primary policy. However, our results show that this is not the case: injecting even just the first few RL-sampled tokens can already yield measurable performance gains in forward cross-sampling, while swapping the earliest few divergent tokens can noticeably degrade performance in the reverse setting. These effects arise not necessarily because early tokens are inherently dominant, but because small, local edits can redirect the generation process toward different reasoning trajectories, which are then continued by subsequent decoding under the primary policy. 
Rather than introducing entirely new reasoning behaviors, RLVR refines a sparse set of local token choices that reliably steer generation toward more effective reasoning trajectories that remain accessible to the base model but are unlocked through these targeted edits. Taken together, the forward and reverse cross-sampling results show that the RL fine-tuned model operates similarly to the base model, modulo a small number of token-level decisions.

Summary. Overall, these results establish that RLVR refinement operates in a highly targeted manner. Across datasets and training settings, forward and reverse cross-sampling show that RLVR's performance gains are functionally concentrated in a sparse set of high-divergence token positions. Forward interventions show that introducing only a small number of RL-sampled tokens into base-model generations is sufficient to recover, and in some cases exceed, RL-level accuracy, while reverse interventions demonstrate that replacing a similarly small number of RL token choices with base-sampled tokens can rapidly erase these gains, and in some cases degrade performance below base-model levels. Importantly, these effects emerge progressively as the intervention budget increases, indicating that reasoning trajectories are steadily shaped by sequential, local token-level modifications along the sequence.

Table 1: Summary of cross-sampled tokens required to reach approximate RL-level performance (forward) or base-level performance (reverse) for Qwen2.5-32B on AIME 2024 and AIME 2025 with a token generation budget of 8000. Effective token counts/percentages exclude identity swaps during cross-sampling. Token percentages are computed at the sequence level.

Dataset | Method           | Eff. % Tokens | Eff. # Tokens | Initial Acc. (%) ($\pi_{\mathrm{prim}}$) | Final Acc. (%) ($\pi_{\mathrm{mix}}$)
AIME24  | SimpleRL         | 3.86%  | 38  | 8.23  | >25
AIME24  | SimpleRL Reverse | 5%     | 29  | 25.52 | <8.3
AIME24  | DAPO             | 7.8%   | 280 | 8.23  | >44
AIME24  | DAPO Reverse     | 10.1%  | 173 | 44.8  | <8.5
AIME25  | SimpleRL         | 1.53%  | 13  | 5.3   | >14
AIME25  | SimpleRL Reverse | 4.73%  | 31  | 12.71 | <4
AIME25  | DAPO             | 6.47%  | 230 | 5     | >33
AIME25  | DAPO Reverse     | 9.89%  | 181 | 32    | <4.5

Takeaways: Functional Role of Divergent Token Distributions via Cross-Sampling

Cross-sampling experiments show that RLVR performance gains are concentrated at a small set of high-divergence token positions, revealing that RL and base models are largely similar overall but differ critically at sparse, high-impact token decisions where RLVR guides generation toward more effective reasoning trajectories that are otherwise accessible to the base model.

• Forward cross-sampling: Injecting a small fraction of RL-sampled tokens into base-model generations is sufficient to recover RL-level accuracy. In multiple settings, modifying only ~1–10% of tokens per response closes most or all of the performance gap, and can sometimes exceed standalone RL decoding performance.

• Reverse cross-sampling: Reverting a similarly small fraction of RL token choices back to base-sampled tokens in RL-model generations progressively collapses RL performance to base levels, and in some cases below base accuracy. RL gains are therefore highly sensitive to these sparse token-level decisions.

• Model improvement vs. number of required interventions: Stronger RLVR models (e.g., DAPO) in general require more interventions to recover or erase their gains, but the required substitutions still form a small fraction of tokens overall.

• Progressive trajectory shaping: Performance changes progressively and near-monotonically with the number of interventions, indicating that gains accumulate across multiple divergence points rather than requiring most or all RL- or base-induced changes to impact reasoning performance.
Even a small number of early interventions can propagate forward to produce globally different reasoning trajectories with sustained performance impact, even when generation is continued under the primary policy.

• Sensitivity to locally interchangeable tokens: Cross-sampled substitutions generally correspond to tokens that are locally plausible and often semantically equivalent from a human perspective, yet still produce significant downstream performance differences. This indicates that reasoning trajectories can be sensitive to distributionally distinct but interchangeable token choices, and highlights the limited invariance of the generation process in such LLMs to locally equivalent token choices.

• Insight into RLVR vs. base policies: The base and RL models behave similarly on most token decisions, but differ at a sparse set of high-divergence positions that have disproportionate impact on reasoning outcomes. RLVR thus acts as a targeted modification mechanism on the base model rather than a global policy shift.

4 Fine-Grained Mechanics of Distribution Shifts

Having established the general aspects of RLVR-induced distribution shifts, namely their sparsity, positional concentration, relationship to entropy, and their functional importance to reasoning performance, a natural next question is: at token positions where substantial changes occur, how novel are these updates? Do they introduce (effectively) new candidate tokens, or primarily redistribute probability mass among existing ones? This question targets the mechanism underlying the sparse shifts observed earlier. While previous analyses reveal where and how much change occurs, and its importance to reasoning, they do not resolve what kind of change is taking place at the level of individual token distributions. To address this, we conduct a fine-grained analysis of high-divergence positions, examining how probability mass is reallocated within the next-token distributions.
Concretely, we move beyond aggregate changes and directly study the structural changes in candidate tokens and their rankings. We examine this through multiple lenses: (i) overlap in top-k candidate sets and token rank reordering, (ii) low-probability token behavior, and (iii) the evolution over the course of training. This fine-grained analysis reveals that, even at positions with substantial divergence, current RLVR methods mostly do not fundamentally change the candidate space of predictions. Instead, they primarily reorder and selectively amplify tokens that are already plausible under the base model, with limited promotion of low-probability tokens. Nevertheless, these comparatively rare cases of substantial re-ranking or promotion of low-probability tokens may still play an important role in enabling improved reasoning. We further validate these findings across additional datasets, models, and RLVR hyperparameter settings in Appendix A.5.

4.1 Top-k Overlap and Rank Reordering

We first investigate whether RLVR changes which tokens are considered plausible, or mainly changes how they are prioritized. Concretely, we examine (1) the overlap between the base and RL models' top-k candidate sets, and (2) how the relative ranking of shared candidates shifts. Figure 9 reports the fraction of shared tokens between the top-k sets of the base and RL fine-tuned models, restricted to high-divergence token distributions. Despite only considering high-divergence positions, overlap between top-k token sets remains high once $k \ge 2$. SimpleRL exhibits over 80% average overlap (often exceeding 85%), while DAPO shows slightly lower but still substantial overlap. Both methods display a sharp increase in overlap from $k = 1$ to $k = 2$, suggesting that while the top-1 token often changes at high-divergence positions, the replacement was typically already among the base model's top-3.

(a) DAPO: Top-k overlap across thresholds. (b) SimpleRL: Top-k overlap across thresholds.
Figure 9: Top-k token overlap between base and RL models at divergent positions (JS > 0.1), computed as the size of the intersection divided by k. High overlap for $k \ge 2$ shows that distributional shifts occur mostly within shared candidate sets.

This observation is further clarified in Figure 22, which shows where the RL model's top-3 tokens appear in the base model's ranking, among high-divergence positions. Around 30% of RL top-1 tokens are already ranked first under the base model, and over 80% (DAPO) and 90% (SimpleRL) fall within the base top-3. RL top-2 tokens typically lie within the base top-3–4, with SimpleRL exhibiting consistently stronger alignment. Comparing the upper-clip mechanism with Qwen2.5-Math-7B, we observe a similar but more nuanced behavior. For small k, DAPO with a 0.28 upper clip yields lower average overlap with the base model's top-k set than the variant with a 0.2 clip (Figure 38), indicating more frequent changes among the highest-ranked tokens when using clip-higher. Interestingly, this trend reverses for larger k, where the model trained with a 0.2 upper clip exhibits smaller overlap, suggesting that agreement with the base model deteriorates in the tail of the candidate set. A consistent picture emerges from the rank-shift analysis in Figure 39. For the 0.2 clip setting, the base-model ranks of the RL model's top-3 tokens are more often preserved, but a non-negligible fraction of the RL model's 3rd-ranked tokens originate from much lower base ranks compared to the 0.28 clip setting. Taken together, these results suggest that clip-higher primarily redistributes probability mass within an already plausible candidate set, leading to more frequent reordering among the highest-ranked tokens while largely preserving the broader candidate space.
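The overlap statistic described above (intersection size divided by k) can be computed directly from a pair of next-token distributions; a minimal NumPy sketch with our own naming:

```python
import numpy as np

def topk_overlap(p_base, p_rl, k):
    """Fraction of shared tokens between the base and RL models'
    top-k candidate sets: |top_k(base) ∩ top_k(RL)| / k."""
    top_base = set(np.argsort(p_base)[::-1][:k])
    top_rl = set(np.argsort(p_rl)[::-1][:k])
    return len(top_base & top_rl) / k
```

For two distributions that merely swap their top-1 and top-2 tokens, the overlap is 0 at $k = 1$ but 1 at $k = 2$, mirroring the sharp jump observed between $k = 1$ and $k = 2$.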
In contrast, removing clip-higher tends to concentrate probability more strongly on a small number of dominant tokens, resulting in higher agreement with the base model among the very top ranks but poorer agreement deeper in the candidate set. Consistent with this, under the 0.2 clip setting, some tokens promoted to the RL model's 3rd rank originate from much lower base ranks. However, given the substantially lower RL entropy observed in this setting, such promoted tokens may still carry relatively little probability mass in practice, with most of the probability concentrated on the top one or two tokens. As a result, these apparent rank promotions in the 0.2 clip setting may often reflect secondary effects in the low-probability tail rather than substantial changes to the primary token choice driving generation.

(a) Percentage of divergent tokens with low base probability. (b) Histogram of RL probabilities for low base-probability tokens.
Figure 10: Analysis of tail behavior under DAPO for divergent token distributions (JS > 0.1). (a) shows the percentage of divergent tokens whose RL top-1 choice had base probability below a given threshold; (b) shows the distribution of RL probabilities for the subset with base probability < 0.01.

4.2 Low-Probability Behavior: Does RL Invent or Select?

We next examine whether RLVR promotes tokens that were highly unlikely under the base model, or instead amplifies alternatives that were already plausible but underweighted. For each high-divergence position, we take the RL model's top-1 token and record its probability under the base distribution. We then compute the fraction of such tokens whose base probability falls below a given threshold among high-divergence positions. Figure 10 shows that under DAPO, only about 5% of RL top-1 tokens (among high-divergence positions) have base probability below 0.01, while under SimpleRL this fraction is nearly zero (Figure 24).
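The statistic behind Figure 10(a) can be sketched as follows (a minimal illustration; function and argument names are ours):

```python
import numpy as np

def low_base_prob_fraction(base_dists, rl_dists, threshold=0.01):
    """Among high-divergence positions, the fraction whose RL top-1
    token had base-model probability below `threshold`. Each entry
    of base_dists/rl_dists is a next-token distribution at one
    high-divergence position."""
    count = 0
    for p_base, p_rl in zip(base_dists, rl_dists):
        rl_top1 = int(np.argmax(p_rl))  # RL model's preferred token
        if p_base[rl_top1] < threshold:
            count += 1
    return count / len(base_dists)
```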
Thus, even for DAPO, which encourages broader exploration through its clip-higher mechanism and lack of KL regularization, RLVR rarely elevates tokens that were highly unlikely under the base model. Comparing DAPO variants with different upper clip settings (Figure 37), we observe that the clip-higher mechanism substantially increases the fraction of RL top-1 tokens (among high-divergence positions) whose base-model probability was initially very small, compared to the variant without clip-higher. This aligns with earlier observations and supports the interpretation that clip-higher enables greater exploration, allowing tokens that had low probability under the base model distribution to be promoted more frequently. Importantly, although such low-probability promotions remain rare overall, they may still be consequential and important for improved reasoning performance.

4.3 Evolution Across Training

Finally, we analyze how the distributional shifts develop over training. Using intermediate checkpoints from DAPO training on Qwen2.5-Math-7B, we track token-level distributions while conditioning on the final model's (checkpoint 200) outputs, allowing us to follow the evolution of divergence for a fixed set of token sequences.

(a) JS divergence percentiles. (b) Jaccard index with the final divergent set ($JS_t > 0.1$).
Figure 11: Distributional shifts grow increasingly focused and stable. Most tokens remain unchanged; updates concentrate in a sparse set late in training.

Figure 11 shows that JS divergence increases monotonically throughout training, with higher percentiles (e.g., 95th and 99th) growing faster than lower ones. This widening gap indicates that distributional change becomes increasingly concentrated in a small subset of tokens, while the majority remain relatively stable.
Consistent with this perspective, the Jaccard overlap between each checkpoint's divergent-token set and the final set increases gradually before rising sharply near the end of training (Figure 11(b)).

Takeaways: Fine-Grained Mechanics of RLVR Distribution Shifts

At high-divergence positions, RLVR primarily reshapes probabilities within an existing candidate set rather than introducing new tokens, and this refinement becomes increasingly focused over training.

• Shared candidate sets: Even at high-divergence positions, base and RL models retain strong overlap in their top-k token candidate sets, especially for $k \ge 2$. Distributional shifts therefore typically occur within a shared candidate set, rather than through expansion of the candidate space.

• Re-ranking over replacement: Within this largely shared support, RLVR mainly changes which already-plausible tokens are prioritized, frequently swapping the top-ranked choice with another token that was already among the base model's top candidates.

• Selection rather than invention: RLVR rarely promotes tokens that were highly unlikely under the base model to a significant degree. Most RL top-1 tokens at divergent positions already had nontrivial base probability, showing that RLVR predominantly amplifies underweighted but plausible alternatives rather than effectively inventing new ones.

• Method differences: More exploratory methods (e.g., clip-higher DAPO) permit larger rank movements and more frequent promotion of lower-probability tokens than more constrained methods (e.g., SimpleRL or lower clip-high settings), consistent with their broader update behavior and stronger performance gains. Notably, clip-higher produces larger reshuffling within an already plausible candidate set, whereas removing clip-higher tends to concentrate probability mass on a few dominant tokens, occasionally promoting lower-ranked base tokens but without substantially broadening the effective candidate space.
• Overall picture: Taken together, RLVR acts primarily as a targeted probability reallocation mechanism: sharpening and reordering choices within an existing plausibility set, rather than globally altering the (effective) token support or prediction structure. Although substantial reshuffling or promotion of low-probability tokens occurs only rarely, these events may still be important and contribute meaningfully to improved reasoning performance, a view consistent with the observation that stronger-performing models exhibit such behaviors more frequently.

5 Exploratory Investigation: Divergence-Weighted Advantages

Our earlier analyses reveal that RL refinements are sparse and targeted, with only a small subset of tokens exhibiting meaningful distributional change. Moreover, cross-sampling experiments demonstrate that these high-divergence tokens are functionally critical, with performance gains hinging on precisely these positions. This raises a natural question: if only a small fraction of tokens drive improvements, can training be more effectively guided by modulating token-level learning signals according to these divergences? To investigate this possibility, we conduct a preliminary exploration of divergence-weighted advantages as a diagnostic intervention, where advantages are reweighted by token-level distributional divergence. We explore two approaches: high-KL boost, which concentrates updates on token distributions that are already changing substantially, and low-KL boost, which focuses updates on distributions that have changed less, potentially encouraging updates in previously stable regions.

5.1 GRPO-based Methods

GRPO in brief. GRPO (Shao et al., 2024) samples $G$ responses $\{o_i\}_{i=1}^{G}$ from a policy $\pi_{\theta_{\mathrm{old}}}(\cdot \mid q)$ for a prompt $q$ with ground-truth answer $a$, assigns sequence-level rewards $\{R_i\}_{i=1}^{G}$, and computes a group-normalized advantage for each sample.
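The group normalization just described can be sketched in a few lines (a minimal illustration; the stabilizing epsilon is our addition, and practical implementations operate on batched tensors):

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """Group-normalized advantage used by GRPO-style methods:
    normalize the G sequence-level rewards of one prompt's sample
    group by their mean and standard deviation. Every token in
    response i then shares the sequence-level advantage A_hat_i."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```

With binary verifiable rewards, correct responses in a half-correct group receive advantage +1 and incorrect ones -1 (up to the epsilon).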
GRPO then applies a PPO-style (Schulman et al., 2017) clipped surrogate objective at the token level, typically with an explicit KL penalty to a reference model.

DAPO. DAPO (Yu et al., 2025) modifies GRPO with an asymmetric clip-higher mechanism, dynamic sampling of correct/incorrect completions, token-level averaging, and removal of the explicit KL penalty term. Its training objective is
$$J_{\mathrm{DAPO}}(\theta) = \mathbb{E}_{(q,a)\sim\mathcal{D},\;\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}\left[\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\min\!\left(r_{i,t}(\theta)\,\hat{A}_{i,t},\;\mathrm{clip}\!\left(r_{i,t}(\theta),\,1-\epsilon_{\mathrm{low}},\,1+\epsilon_{\mathrm{high}}\right)\hat{A}_{i,t}\right)\right]$$
$$\text{s.t.}\quad 0 < \left|\{\,o_i : \mathrm{is\_equivalent}(a, o_i)\,\}\right| < G,$$
with
$$r_{i,t}(\theta) = \frac{\pi_{\theta}(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})},\qquad \hat{A}_{i,t} = \frac{R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}.$$

5.2 Divergence-Weighted Advantage

Standard RLVR objectives treat all tokens within a sequence uniformly in terms of their advantages (though the importance-sampling ratios are defined at the token level). Motivated by our observation that distributional shifts are sparse and concentrated, we investigate whether modulating token-level advantages according to divergence magnitude can help improve or control aspects of training. We explore modifications where advantages are rescaled depending on the per-token divergences.

General formulation. We define a divergence-weighted advantage $\tilde{A}_t = w_t \cdot \hat{A}_t$, where $\hat{A}_t$ denotes the standard group-normalized advantage and $w_t$ is a per-token weight based on divergence. To ensure that the introduced divergence weight influences only the weighting, divergences are detached from the computation graph.

Choice of divergence.
For ease of compatibility with standard frameworks, we employ the KL divergence with respect to the old policy as our primary divergence quantity:
$$\mathrm{KL}_t^{\mathrm{old}} = D_{\mathrm{KL}}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid x_{<t}) \,\|\, \pi_{\theta}(\cdot \mid x_{<t})\right),$$
where $\pi_{\theta_{\mathrm{old}}}$ denotes the policy from the previous update iteration, as in PPO/GRPO. This old-policy KL quantifies the magnitude of recent policy updates at each token position, serving as a proxy for the extent of local distributional change. For computational efficiency and compatibility with existing training frameworks such as verl (Sheng et al., 2024), we estimate these quantities using KL estimators computed over sampled tokens only, which may not capture the full distributional structure. In particular, we use the $k_3$ estimator (Schulman, 2020), defined by
$$\mathrm{KL}_{\mathrm{est}}\!\left(\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_{\theta}\right) \approx k_3\!\left(\frac{\pi_{\theta}(\cdot \mid x_{<t})}{\pi_{\theta_{\mathrm{old}}}(\cdot \mid x_{<t})}\right) = \frac{\pi_{\theta}(\cdot \mid x_{<t})}{\pi_{\theta_{\mathrm{old}}}(\cdot \mid x_{<t})} - \log\frac{\pi_{\theta}(\cdot \mid x_{<t})}{\pi_{\theta_{\mathrm{old}}}(\cdot \mid x_{<t})} - 1.$$

Weighting scheme. We adopt a simple sigmoid weighting scheme (to ensure bounded weights), which transforms divergence into weights via
$$w_t = 1 + s\left(\sigma(\alpha \cdot \mathrm{KL}_t) - \tfrac{1}{2}\right), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}.$$
The parameter $\alpha$ controls the direction and magnitude of emphasis: $\alpha > 0$ amplifies high-divergence tokens, whereas $\alpha < 0$ emphasizes low-divergence ones. The sigmoid function provides a smooth, bounded nonlinear transformation that enables selective focus on either high- or low-divergence regions depending on the sign of $\alpha$. This formulation allows us to investigate whether concentrating the learning signal on regions that have already changed, or on those that remain unchanged, yields more effective training dynamics.

Evaluation. We train with divergence-weighted advantages using the DAPO training recipe and data on Qwen2.5-Math-7B, evaluating on AIME 2024, AIME 2025, and AMC. Results are presented in Table 2.
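The $k_3$ estimate and sigmoid weighting can be sketched together as follows (a simplified NumPy illustration of the formulas, not the verl implementation; function names are ours, and the divergences here are plain arrays rather than detached tensors):

```python
import numpy as np

def k3_kl_estimate(logp_new, logp_old):
    """Per-token k3 estimator of KL(pi_old || pi_new) at the sampled
    token: r - log r - 1, with ratio r = pi_new / pi_old."""
    log_ratio = np.asarray(logp_new) - np.asarray(logp_old)
    return np.exp(log_ratio) - log_ratio - 1.0

def divergence_weight(kl, alpha, s=1.0):
    """Sigmoid weighting w_t = 1 + s * (sigmoid(alpha * KL_t) - 1/2);
    alpha > 0 boosts high-divergence tokens, alpha < 0 low-divergence."""
    return 1.0 + s * (1.0 / (1.0 + np.exp(-alpha * np.asarray(kl))) - 0.5)

def weighted_advantages(adv, logp_new, logp_old, alpha, s=1.0):
    """Divergence-weighted advantage A_tilde_t = w_t * A_hat_t; the
    divergences act only as fixed weights on the advantage."""
    kl = k3_kl_estimate(logp_new, logp_old)
    return divergence_weight(kl, alpha, s) * np.asarray(adv)
```

Note that $w_t = 1$ exactly when the estimated KL is zero, so unchanged token distributions keep their original advantage.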
Detailed training hyperparameters and implementation details are documented in Appendix A.2.3.

Table 2: Accuracy (%) under divergence-weighted configurations on Qwen2.5-Math-7B, using the KL divergence with respect to $\pi_{\theta_{\mathrm{old}}}$ and the sigmoid weighting scheme, across AIME 2024, AIME 2025, and AMC. The displayed results are Mean@32 scores (equivalently, pass@1 computed using 32 samples), each averaged over 3 runs.

Configuration  | AIME24 | AIME25 | AMC   | Overall Avg
Baseline DAPO  | 33.61  | 18.75  | 75.08 | 42.48 ± 1.35
Low-KL boost   | 35.90  | 19.90  | 78.97 | 44.92 ± 0.05
High-KL boost  | 36.74  | 20.00  | 78.40 | 45.05 ± 0.79

These results demonstrate that weighting token-level updates by divergence can amplify performance gains, providing empirical support for the hypothesis that targeted tokens disproportionately drive improvements. Both low-KL and high-KL boost configurations can yield improvements over the baseline, suggesting that different divergence weighting strategies may be effective. However, the optimal choice between these approaches, and indeed whether divergence weighting provides benefits at all, may depend on the specific models and training methods used. Effective divergence weighting across training configurations may require model-specific paradigms or adaptive scheduling mechanisms to stabilize learning dynamics. We present this approach as a complementary diagnostic tool that may inform future refinements of token-level training strategies, including approaches that aggregate information from token-level divergences into better signals, as well as those that more effectively promote the rare actions discussed in Section 4.

6 Related Work

RL fine-tuning in LLMs. Reinforcement learning has become an important component of LLM fine-tuning, largely stemming from reinforcement learning with human feedback (RLHF) used to align LLM behavior with human preferences (Christiano et al., 2023; Ouyang et al., 2022).
Recently, reinforcement learning with verifiable rewards (RLVR) has emerged as a central paradigm for improving reasoning by optimizing against verifiable reward signals on generated responses (Lambert et al., 2024). A number of RLVR methods build on policy-gradient variants, including Group Relative Policy Optimization (GRPO) (Shao et al., 2024), as well as extensions such as Dr.GRPO (Liu et al., 2025b), Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) (Yu et al., 2025), and sequence-level variants such as Group Sequence Policy Optimization (GSPO) (Zheng et al., 2025). Beyond these core methods, several works propose complementary improvements to better target impactful updates and stabilize training, including entropy-based exploration or minority-token perspectives (Wang et al., 2025; Cheng et al., 2025), clipping/KL regularization strategies (Cui et al., 2025), reweighting based on token probability, perplexity, or position (Yang et al., 2025b; Deng et al., 2025), as well as analyses of different reward designs (Shao et al., 2025). Additional works have studied the reasoning boundaries of RLVR and explored ways to expand them (Yue et al., 2025; Wen et al., 2025; Liu et al., 2025a).

Understanding RLVR and its differences from SFT. A growing body of work argues that RL fine-tuning often acts as a scalpel rather than a hammer, amplifying existing capabilities through localized changes. This perspective is supported mainly through evaluations on different domains/tasks, catastrophic forgetting, parameter changes, and overall KL divergence (Rajani et al., 2025; Chu et al., 2025; Shenfeld et al., 2025; Huan et al., 2025). Recent work also analyzes locality from the parameter perspective: for example, Mukherjee et al. (2025) report that RL fine-tuning can concentrate effective updates into relatively small subnetworks, while Zhu et al. (2025) provide theoretical insight into RLVR learning dynamics and the structure of these parameter-sparse updates.
Our paper complements these perspectives by quantifying the changes induced by RLVR at the level of token distributions. We show that RL not only induces smaller aggregate divergence than SFT, but that its changes are substantially sparser at the token level. Moreover, we connect these sparse distributional changes to sequence-level reasoning outcomes and to fine-grained reallocations of probability mass.

Token-level analyses of RLVR. Several works seek to understand RLVR through a token-level lens. Wang et al. (2025) attribute a substantial portion of RL gains to high-entropy minority tokens, while Cheng et al. (2025) connect such tokens to exploratory reasoning steps; related work also highlights entropy collapse risks and token-level regularization mechanisms (Cui et al., 2025). Other studies emphasize the disproportionate role of specific tokens or sampling decisions (Vassoyan et al., 2025; Lin et al., 2025; Karan and Du, 2025), and Huan et al. (2025) analyze RL-induced changes using token-level KL divergence and token rank shifts. Recent work by Chen et al. (2026) shows that RL training modifies a sparse subset of tokens when viewed through rank-shift statistics, and develops a theoretical analysis based on reasoning patterns. Our contributions are complementary but distinct: we conduct a systematic empirical study of RLVR-induced token-level changes through the lens of quantities such as divergence, entropy, and probability mass redistribution, and connect these shifts to sequence-level reasoning via forward and reverse cross-sampling interventions that establish their impact on reasoning performance.

7 Conclusion

Our study reveals that reinforcement learning with verifiable rewards (RLVR) reshapes LLMs in a manner that is sparse, targeted, and structured rather than uniformly diffused across tokens.
By analyzing token-level distributional shifts, we show that only a small subset of tokens undergo meaningful divergence, and that these divergences carry disproportionate functional importance: cross-sampling interventions confirm that performance gains hinge on precisely these positions. Moreover, our fine-grained analyses suggest that, even at high-divergence positions, RLVR typically refines behavior by reallocating probability mass within an existing candidate set rather than introducing fundamentally new tokens. However, the comparatively rare cases of substantial re-ranking and promotion of initially low-probability tokens may still be important for the observed improvements in reasoning performance. To complement these analyses, we explored divergence-weighted advantage, a simple modification that scales token-level advantages by per-token divergence. Our results suggest that such weighting strategies can influence learning dynamics, though stabilizing performance may require model-specific choices and further investigation. Together, these findings advance a token-level understanding of RL fine-tuning. They highlight that the essence of RLVR’s success lies not in widespread distributional changes, but in selective refinements aligned with varying entropy levels. Taken together, our findings suggest that RLVR operates not as a global policy shift, but as a sparse intervention on a small set of high-impact decision points that steer generation trajectories. Beyond clarifying the mechanics of existing methods, our work offers a perspective for designing future RL objectives and diagnostics that further incorporate distributional structure, opening avenues for more effective, interpretable, and controllable LLM post-training. 
8 Authors

Haoming Meng (University of Toronto), Kexin Huang, Shaohang Wei (Peking University), Chiyu Ma (Dartmouth College), Shuo Yang (Peking University), Xue Wang, Guoyin Wang (Alibaba Group; Qwen Pilot Team Lead), Bolin Ding (Alibaba Group), Jingren Zhou (Alibaba Group)

References

X. Chen, T. Li, and D. Zou (2026). Reshaping reasoning in LLMs: a theoretical analysis of RL training dynamics through pattern selection. In The Fourteenth International Conference on Learning Representations.
D. Cheng, S. Huang, X. Zhu, B. Dai, W. X. Zhao, Z. Zhang, and F. Wei (2025). Reasoning with exploration: an entropy perspective on reinforcement learning for LLMs. arXiv:2506.14758.
P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei (2023). Deep reinforcement learning from human preferences. arXiv:1706.03741.
T. Chu, S. Tong, J. Yang, T. Chu, Y. Zhai, Y. Ma, S. Xie, D. Schuurmans, Q. V. Le, and S. Levine (2025). SFT memorizes, RL generalizes: a comparative study of foundation model post-training. arXiv:2501.17161.
G. Cui, Y. Zhang, J. Chen, L. Yuan, Z. Wang, Y. Zuo, H. Li, Y. Fan, H. Chen, W. Chen, Z. Liu, H. Peng, L. Bai, W. Ouyang, Y. Cheng, B. Zhou, and N. Ding (2025). The entropy mechanism of reinforcement learning for reasoning language models. arXiv:2505.22617.
J. Deng, J. Chen, Z. Chen, D. Cheng, F. Bai, B. Zhang, Y. Min, Y. Gao, W. X. Zhao, and J. Wen (2025). From trial-and-error to improvement: a systematic analysis of LLM exploration mechanisms in RLVR. arXiv:2508.07534.
A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi (2020). The curious case of neural text degeneration. arXiv:1904.09751.
M. Huan, Y. Li, T. Zheng, X. Xu, S. Kim, M. Du, R. Poovendran, G. Neubig, and X. Yue (2025). Does math reasoning improve general LLM capabilities? Understanding transferability of LLM reasoning. arXiv:2507.00432.
A. Karan and Y. Du (2025). Reasoning with sampling: your base model is smarter than you think. arXiv:2510.14901.
S. Kim, K. Mangalam, S. Moon, J. Malik, M. W. Mahoney, A. Gholami, and K. Keutzer (2023). Speculative decoding with big little decoder. arXiv:2302.07863.
W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023). Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, et al. (2024). Tülu 3: pushing frontiers in open language model post-training. arXiv:2411.15124.
Y. Leviathan, M. Kalman, and Y. Matias (2023). Fast inference from transformers via speculative decoding. arXiv:2211.17192.
Z. Lin, T. Liang, J. Xu, Q. Lin, X. Wang, R. Luo, C. Shi, S. Li, Y. Yang, and Z. Tu (2025). Critical tokens matter: token-level contrastive estimation enhances LLM's reasoning capability. arXiv:2411.19943.
M. Liu, S. Diao, X. Lu, J. Hu, X. Dong, Y. Choi, J. Kautz, and Y. Dong (2025a). ProRL: prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv:2505.24864.
Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin (2025b). Understanding R1-Zero-like training: a critical perspective. arXiv:2503.20783.
MistralAI (2025). Mistral Small 3.
S. Mukherjee, L. Yuan, D. Hakkani-Tur, and H. Peng (2025). Reinforcement learning finetunes small subnetworks in large language models. arXiv:2505.11711.
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155.
Qwen Team: A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu (2025). Qwen2.5 technical report. arXiv:2412.15115.
N. Rajani, A. P. Gema, S. Goldfarb-Tarrant, and I. Titov (2025). Scalpel vs. hammer: GRPO amplifies existing capabilities, SFT replaces them. arXiv:2507.10616.
D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2023). GPQA: a graduate-level Google-proof Q&A benchmark. arXiv:2311.12022.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
J. Schulman (2020). Approximating KL divergence. Blog post, http://joschu.net/blog/kl-approx.html.
R. Shao, S. S. Li, R. Xin, S. Geng, Y. Wang, S. Oh, S. S. Du, N. Lambert, S. Min, R. Krishna, Y. Tsvetkov, H. Hajishirzi, P. W. Koh, and L. Zettlemoyer (2025). Spurious rewards: rethinking training signals in RLVR. arXiv:2506.10947.
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, Y. K. Li, Y. Wu, D. Guo, and M. Zhang (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300.
I. Shenfeld, J. Pari, and P. Agrawal (2025). RL's razor: why online reinforcement learning forgets less. arXiv:2509.04259.
G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu (2024). HybridFlow: a flexible and efficient RLHF framework. arXiv:2409.19256.
J. Vassoyan, N. Beau, and R. Plaud (2025). Ignore the KL penalty! Boosting exploration on critical tokens to enhance RL fine-tuning. arXiv:2502.06533.
S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, Y. Liu, A. Yang, A. Zhao, Y. Yue, S. Song, B. Yu, G. Huang, and J. Lin (2025). Beyond the 80/20 rule: high-entropy minority tokens drive effective reinforcement learning for LLM reasoning. arXiv:2506.01939.
X. Wen, Z. Liu, S. Zheng, Z. Xu, S. Ye, Z. Wu, X. Liang, Y. Wang, J. Li, Z. Miao, J. Bian, and M. Yang (2025). Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. arXiv:2506.14245.
T. Wu, R. Yang, J. Li, P. Hu, N. Wong, and Y. Yang (2025). Shadow-FT: tuning instruct via base. arXiv:2505.12716.
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu (2025a). Qwen3 technical report. arXiv:2505.09388.
A. Yang, B. Zhang, B. Hui, B. Gao, B. Yu, C. Li, D. Liu, J. Tu, J. Zhou, J. Lin, K. Lu, M. Xue, R. Lin, T. Liu, X. Ren, and Z. Zhang (2024). Qwen2.5-Math technical report: toward mathematical expert model via self-improvement. arXiv:2409.12122.
Z. Yang, X. Luo, Z. Wang, D. Han, Z. He, D. Li, and Y. Xu (2025b). Do not let low-probability tokens over-dominate in RL for LLMs. arXiv:2505.12929.
Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, T. Fan, G. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, J. Chen, C. Wang, H. Yu, Y. Song, X. Wei, H. Zhou, J. Liu, W. Ma, Y. Zhang, L. Yan, M. Qiao, Y. Wu, and M. Wang (2025). DAPO: an open-source LLM reinforcement learning system at scale. arXiv:2503.14476.
Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, Y. Yue, S. Song, and G. Huang (2025). Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv:2504.13837.
W. Zeng, Y. Huang, Q. Liu, W. Liu, K. He, Z. Ma, and J. He (2025). SimpleRL-Zoo: investigating and taming zero reinforcement learning for open base models in the wild. arXiv:2503.18892.
C. Zheng, S. Liu, M. Li, X. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, J. Zhou, and J. Lin (2025). Group sequence policy optimization. arXiv:2507.18071.
H. Zhu, Z. Zhang, H. Huang, D. Su, Z. Liu, J. Zhao, I. Fedorov, H. Pirsiavash, Z. Sha, J. Lee, D. Z. Pan, Z. Wang, Y. Tian, and K. S. Tai (2025). The path not taken: RLVR provably learns off the principals. arXiv:2511.08567.

Appendix A

A.1 Sequence-level Divergence Bounds for Cross-Sampling

Setup.
Let $(X_t)_{t \ge 1}$ be the decoded tokens (each in $\mathcal{V}$), with stopping time $\tau := \inf\{t \ge 1 : X_t = \mathrm{EOS}\} \wedge T_{\max}$. We may work on the fixed horizon $T_{\max}$ by absorbing the EOS token: once EOS is generated, the process deterministically outputs EOS thereafter. This yields an equivalent distribution over $X_{1:T_{\max}}$ and ensures all sequences have length $T_{\max}$. All results below therefore sum over $t = 1, \dots, T_{\max}$; terms after $\tau$ contribute zero since both policies become point masses on EOS.

Let $\pi_{\mathrm{prim}}$ be the primary policy and $\pi_{\mathrm{int}}$ the intervention policy. Given a switching rule $S : \mathcal{V}^{<\mathbb{N}} \to \{0, 1\}$, define $S_t := S(X_{<t})$ and the mixed policy

$$\pi_{\mathrm{mix}}(\cdot \mid X_{<t}) = (1 - S_t)\,\pi_{\mathrm{prim}}(\cdot \mid X_{<t}) + S_t\,\pi_{\mathrm{int}}(\cdot \mid X_{<t}).$$

Let $P_{\mathrm{mix}}$ and $P_{\mathrm{int}}$ denote the induced sequence-level distributions on $X_{1:T_{\max}}$.

A.1.1 KL Case

We first bound the sequence-level divergence between the cross-sampled policy and the target intervention policy, in the simpler case of KL divergence with a token-level KL switching rule.

Lemma A.1 (KL decomposition). Let $P, Q$ be probability distributions on $X_{1:T_{\max}}$ admitting factorizations $P(x_{1:T_{\max}}) = \prod_{t=1}^{T_{\max}} p_t(x_t \mid x_{<t})$ and $Q(x_{1:T_{\max}}) = \prod_{t=1}^{T_{\max}} q_t(x_t \mid x_{<t})$. For each $t$, let $P_{<t}$ and $Q_{<t}$ denote the marginals of $X_{<t}$ under $P$ and $Q$, respectively. Then

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{t=1}^{T_{\max}} \mathbb{E}_{X_{<t} \sim P_{<t}}\Big[D_{\mathrm{KL}}\big(p_t(\cdot \mid X_{<t}) \,\|\, q_t(\cdot \mid X_{<t})\big)\Big].$$

Proof. By definition, $D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{X \sim P}\big[\log \frac{P(X)}{Q(X)}\big]$. Using the factorizations,

$$\log \frac{P(X_{1:T_{\max}})}{Q(X_{1:T_{\max}})} = \sum_{t=1}^{T_{\max}} \log \frac{p_t(X_t \mid X_{<t})}{q_t(X_t \mid X_{<t})}.$$

Taking expectation under $P$ and exchanging sum and expectation gives

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{t=1}^{T_{\max}} \mathbb{E}_{X_{<t} \sim P_{<t}}\Big[\mathbb{E}_{P}\Big[\log \tfrac{p_t(X_t \mid X_{<t})}{q_t(X_t \mid X_{<t})} \,\Big|\, X_{<t}\Big]\Big] = \sum_{t=1}^{T_{\max}} \mathbb{E}_{X_{<t} \sim P_{<t}}\Big[D_{\mathrm{KL}}\big(p_t(\cdot \mid X_{<t}) \,\|\, q_t(\cdot \mid X_{<t})\big)\Big],$$

where the last equality follows by the definition of conditional expectation (or the law of total expectation) applied to the conditional distribution of $X_t$ given $X_{<t}$. $\square$

Proposition A.2 (Token-level KL threshold $\Rightarrow$ sequence-level KL bound). Assume the switching rule is defined by a KL threshold:

$$S(x_{<t}) = \mathbf{1}\big\{D_{\mathrm{KL}}\big(\pi_{\mathrm{prim}}(\cdot \mid x_{<t}) \,\|\, \pi_{\mathrm{int}}(\cdot \mid x_{<t})\big) > \varepsilon\big\}.$$

Define the number of non-intervention steps on a trajectory $N_0 := \sum_{t=1}^{\tau} \mathbf{1}\{S(X_{<t}) = 0\}$. Then

$$D_{\mathrm{KL}}(P_{\mathrm{mix}} \,\|\, P_{\mathrm{int}}) \le \varepsilon\, \mathbb{E}_{X \sim P_{\mathrm{mix}}}[N_0].$$

Proof. We apply Lemma A.1 with $P = P_{\mathrm{mix}}$ and $Q = P_{\mathrm{int}}$. For any history $h = x_{<t}$, since $S(h) \in \{0, 1\}$,

$$\pi_{\mathrm{mix}}(\cdot \mid h) = \begin{cases} \pi_{\mathrm{int}}(\cdot \mid h), & S(h) = 1, \\ \pi_{\mathrm{prim}}(\cdot \mid h), & S(h) = 0. \end{cases}$$

Hence $D_{\mathrm{KL}}\big(\pi_{\mathrm{mix}}(\cdot \mid h) \,\|\, \pi_{\mathrm{int}}(\cdot \mid h)\big) = \mathbf{1}\{S(h) = 0\}\, D_{\mathrm{KL}}\big(\pi_{\mathrm{prim}}(\cdot \mid h) \,\|\, \pi_{\mathrm{int}}(\cdot \mid h)\big)$. By the definition of $S$, whenever $S(h) = 0$ the token-level KL is at most $\varepsilon$, so for all $h$,

$$D_{\mathrm{KL}}\big(\pi_{\mathrm{mix}}(\cdot \mid h) \,\|\, \pi_{\mathrm{int}}(\cdot \mid h)\big) \le \varepsilon\, \mathbf{1}\{S(h) = 0\}.$$

Under absorbing EOS, $D_{\mathrm{KL}}\big(\pi_{\mathrm{mix}}(\cdot \mid X_{<t}) \,\|\, \pi_{\mathrm{int}}(\cdot \mid X_{<t})\big) = 0$ for $t > \tau$, so we may multiply by $\mathbf{1}\{t \le \tau\}$. Substituting into Lemma A.1 yields

$$D_{\mathrm{KL}}(P_{\mathrm{mix}} \,\|\, P_{\mathrm{int}}) \le \varepsilon \sum_{t=1}^{T_{\max}} \mathbb{E}_{X_{<t} \sim (P_{\mathrm{mix}})_{<t}}\big[\mathbf{1}\{t \le \tau\}\, \mathbf{1}\{S_t = 0\}\big].$$

Note that $\{t \le \tau\} = \{\mathrm{EOS} \notin X_{<t}\}$ depends only on $X_{<t}$, so $\mathbf{1}\{t \le \tau\}\,\mathbf{1}\{S_t = 0\}$ is $\sigma(X_{<t})$-measurable. Thus we may equivalently take the expectation under the full trajectory $X \sim P_{\mathrm{mix}}$:

$$D_{\mathrm{KL}}(P_{\mathrm{mix}} \,\|\, P_{\mathrm{int}}) \le \varepsilon\, \mathbb{E}_{X \sim P_{\mathrm{mix}}}\Big[\sum_{t=1}^{T_{\max}} \mathbf{1}\{t \le \tau\}\, \mathbf{1}\{S_t = 0\}\Big] = \varepsilon\, \mathbb{E}_{X \sim P_{\mathrm{mix}}}\Big[\sum_{t=1}^{\tau} \mathbf{1}\{S_t = 0\}\Big] = \varepsilon\, \mathbb{E}_{X \sim P_{\mathrm{mix}}}[N_0]. \qquad \square$$

Remark A.3 (Effective KL on non-intervention steps). Define the effective token-level KL on non-intervention steps

$$\bar\kappa := \frac{\mathbb{E}_{X \sim P_{\mathrm{mix}}}\Big[\sum_{t=1}^{\tau} \mathbf{1}\{S_t = 0\}\, D_{\mathrm{KL}}\big(\pi_{\mathrm{prim}}(\cdot \mid X_{<t}) \,\|\, \pi_{\mathrm{int}}(\cdot \mid X_{<t})\big)\Big]}{\mathbb{E}_{X \sim P_{\mathrm{mix}}}[N_0]}, \qquad \big(\bar\kappa := 0 \text{ if } \mathbb{E}[N_0] = 0\big).$$

Then

$$D_{\mathrm{KL}}(P_{\mathrm{mix}} \,\|\, P_{\mathrm{int}}) = \bar\kappa\, \mathbb{E}_{X \sim P_{\mathrm{mix}}}[N_0] \le \varepsilon\, \mathbb{E}_{X \sim P_{\mathrm{mix}}}[N_0],$$

and typically $\bar\kappa \ll \varepsilon$ in regimes where the models are already close (e.g., in the setting of a base model and its RL fine-tuned counterpart).

A.1.2 JS Case

We now give the corresponding sequence-level result for Jensen–Shannon divergence. Unlike the KL case, the exact decomposition involves a history-dependent skew Jensen–Shannon divergence rather than the ordinary JS used in our experiments.

Definition A.4 (Skew Jensen–Shannon divergence). For probability measures $\mu, \nu$ on a common measurable space and $\alpha \in [0, 1]$, define

$$D_{\mathrm{JS}}^{\alpha}(\mu \,\|\, \nu) := \alpha\, D_{\mathrm{KL}}\big(\mu \,\|\, \alpha\mu + (1 - \alpha)\nu\big) + (1 - \alpha)\, D_{\mathrm{KL}}\big(\nu \,\|\, \alpha\mu + (1 - \alpha)\nu\big).$$

When $\alpha = \tfrac{1}{2}$, this reduces to the usual Jensen–Shannon divergence: $D_{\mathrm{JS}}(\mu \,\|\, \nu) = D_{\mathrm{JS}}^{1/2}(\mu \,\|\, \nu)$.

Lemma A.5 (JS decomposition via skew JS). Let $P, Q$ be probability distributions on $X_{1:T_{\max}}$ admitting factorizations as in Lemma A.1. Let $M := \tfrac{1}{2}(P + Q)$ and for each $t$ let $P_{<t}, Q_{<t}, M_{<t}$ denote the marginals of $X_{<t}$ under $P, Q, M$, respectively. Define, for $M_{<t}$-almost every history $h = x_{<t}$,

$$\alpha_t(h) := \frac{1}{2} \frac{dP_{<t}}{dM_{<t}}(h) = \frac{P_{<t}(h)}{P_{<t}(h) + Q_{<t}(h)}.$$

Then

$$D_{\mathrm{JS}}(P \,\|\, Q) = \sum_{t=1}^{T_{\max}} \mathbb{E}_{H \sim M_{<t}}\Big[D_{\mathrm{JS}}^{\alpha_t(H)}\big(p_t(\cdot \mid H) \,\|\, q_t(\cdot \mid H)\big)\Big].$$

Proof. By definition, $D_{\mathrm{JS}}(P \,\|\, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M)$ with $M = \tfrac{1}{2}(P + Q)$. For each $t$, since $M_{<t} = \tfrac{1}{2}(P_{<t} + Q_{<t})$, we have $P_{<t}, Q_{<t} \ll M_{<t}$, so the Radon–Nikodym derivatives $\frac{dP_{<t}}{dM_{<t}}$ and $\frac{dQ_{<t}}{dM_{<t}}$ exist. Since the prefix space is discrete, these derivatives reduce to ratios of probability masses wherever $M_{<t}(h) > 0$.

For each $t$ and each history $h = x_{<t}$ with $M_{<t}(h) > 0$, define $m_t(\cdot \mid h) := M(X_t \in \cdot \mid X_{<t} = h)$. We first identify this conditional law. For any $x_t \in \mathcal{V}$,

$$m_t(x_t \mid h) = \frac{M(X_{1:t} = (h, x_t))}{M(X_{<t} = h)} = \frac{\tfrac{1}{2} P_{<t}(h)\, p_t(x_t \mid h) + \tfrac{1}{2} Q_{<t}(h)\, q_t(x_t \mid h)}{\tfrac{1}{2} P_{<t}(h) + \tfrac{1}{2} Q_{<t}(h)},$$

hence $m_t(\cdot \mid h) = \alpha_t(h)\, p_t(\cdot \mid h) + (1 - \alpha_t(h))\, q_t(\cdot \mid h)$. Since $M(x_{1:T_{\max}}) = \prod_{t=1}^{T_{\max}} m_t(x_t \mid x_{<t})$, Lemma A.1 applied to $(P, M)$ and $(Q, M)$ yields

$$D_{\mathrm{KL}}(P \,\|\, M) = \sum_{t=1}^{T_{\max}} \mathbb{E}_{H \sim P_{<t}}\Big[D_{\mathrm{KL}}\big(p_t(\cdot \mid H) \,\|\, m_t(\cdot \mid H)\big)\Big], \qquad D_{\mathrm{KL}}(Q \,\|\, M) = \sum_{t=1}^{T_{\max}} \mathbb{E}_{H \sim Q_{<t}}\Big[D_{\mathrm{KL}}\big(q_t(\cdot \mid H) \,\|\, m_t(\cdot \mid H)\big)\Big].$$

By the definition of $\alpha_t$, we have $\tfrac{1}{2} \frac{dP_{<t}}{dM_{<t}}(H) = \alpha_t(H)$ and thus $\tfrac{1}{2} \frac{dQ_{<t}}{dM_{<t}}(H) = 1 - \alpha_t(H)$, $M_{<t}$-a.s. (since $M_{<t} = \tfrac{1}{2} P_{<t} + \tfrac{1}{2} Q_{<t}$). Changing measure to $M_{<t}$ in both expectations,

$$\tfrac{1}{2}\, \mathbb{E}_{H \sim P_{<t}}\big[D_{\mathrm{KL}}(p_t \,\|\, m_t)\big] + \tfrac{1}{2}\, \mathbb{E}_{H \sim Q_{<t}}\big[D_{\mathrm{KL}}(q_t \,\|\, m_t)\big] = \mathbb{E}_{H \sim M_{<t}}\Big[\alpha_t(H)\, D_{\mathrm{KL}}\big(p_t(\cdot \mid H) \,\|\, m_t(\cdot \mid H)\big) + (1 - \alpha_t(H))\, D_{\mathrm{KL}}\big(q_t(\cdot \mid H) \,\|\, m_t(\cdot \mid H)\big)\Big] = \mathbb{E}_{H \sim M_{<t}}\Big[D_{\mathrm{JS}}^{\alpha_t(H)}\big(p_t(\cdot \mid H) \,\|\, q_t(\cdot \mid H)\big)\Big].$$

Summing over $t = 1, \dots, T_{\max}$ gives the claim. $\square$

Proposition A.6 (Token-level skew-JS control $\Rightarrow$ sequence-level JS bound). Let $P_{\mathrm{mix}}, P_{\mathrm{int}}$ be the sequence-level laws induced by $\pi_{\mathrm{mix}}$ and $\pi_{\mathrm{int}}$ on $X_{1:T_{\max}}$, and let $M := \tfrac{1}{2}(P_{\mathrm{mix}} + P_{\mathrm{int}})$. For each $t$, let $(P_{\mathrm{mix}})_{<t}$, $(P_{\mathrm{int}})_{<t}$, and $M_{<t}$ denote the corresponding marginals of $X_{<t}$, and define

$$\alpha_t(h) := \frac{1}{2} \frac{d(P_{\mathrm{mix}})_{<t}}{dM_{<t}}(h) = \frac{(P_{\mathrm{mix}})_{<t}(h)}{(P_{\mathrm{mix}})_{<t}(h) + (P_{\mathrm{int}})_{<t}(h)}, \qquad M_{<t}\text{-a.s.}$$

Assume the switching rule satisfies, for every $t$ and history $h = x_{<t}$ with $M_{<t}(h) > 0$,

$$S(h) = 0 \implies D_{\mathrm{JS}}^{\alpha_t(h)}\big(\pi_{\mathrm{prim}}(\cdot \mid h) \,\|\, \pi_{\mathrm{int}}(\cdot \mid h)\big) \le \varepsilon.$$

Define $N_0 := \sum_{t=1}^{\tau} \mathbf{1}\{S(X_{<t}) = 0\}$. Then

$$D_{\mathrm{JS}}(P_{\mathrm{mix}} \,\|\, P_{\mathrm{int}}) \le \varepsilon\, \mathbb{E}_{X \sim M}[N_0].$$
Equivalently,

$$D_{\mathrm{JS}}(P_{\mathrm{mix}} \,\|\, P_{\mathrm{int}}) \le \frac{\varepsilon}{2}\big(\mathbb{E}_{X \sim P_{\mathrm{mix}}}[N_0] + \mathbb{E}_{X \sim P_{\mathrm{int}}}[N_0]\big).$$

Proof. We apply Lemma A.5 with $P = P_{\mathrm{mix}}$ and $Q = P_{\mathrm{int}}$. For any history $h = x_{<t}$,

$$\pi_{\mathrm{mix}}(\cdot \mid h) = \begin{cases} \pi_{\mathrm{int}}(\cdot \mid h), & S(h) = 1, \\ \pi_{\mathrm{prim}}(\cdot \mid h), & S(h) = 0. \end{cases}$$

Hence $D_{\mathrm{JS}}^{\alpha_t(h)}\big(\pi_{\mathrm{mix}}(\cdot \mid h) \,\|\, \pi_{\mathrm{int}}(\cdot \mid h)\big) = \mathbf{1}\{S(h) = 0\}\, D_{\mathrm{JS}}^{\alpha_t(h)}\big(\pi_{\mathrm{prim}}(\cdot \mid h) \,\|\, \pi_{\mathrm{int}}(\cdot \mid h)\big)$. By the assumed token-level control, for every such history $h$,

$$D_{\mathrm{JS}}^{\alpha_t(h)}\big(\pi_{\mathrm{mix}}(\cdot \mid h) \,\|\, \pi_{\mathrm{int}}(\cdot \mid h)\big) \le \varepsilon\, \mathbf{1}\{S(h) = 0\}.$$

Under absorbing EOS, $D_{\mathrm{JS}}^{\alpha_t(X_{<t})}\big(\pi_{\mathrm{mix}}(\cdot \mid X_{<t}) \,\|\, \pi_{\mathrm{int}}(\cdot \mid X_{<t})\big) = 0$ for $t > \tau$, so we may multiply by $\mathbf{1}\{t \le \tau\}$. Substituting into Lemma A.5 gives

$$D_{\mathrm{JS}}(P_{\mathrm{mix}} \,\|\, P_{\mathrm{int}}) \le \sum_{t=1}^{T_{\max}} \mathbb{E}_{X_{<t} \sim M_{<t}}\big[\mathbf{1}\{t \le \tau\}\, \varepsilon\, \mathbf{1}\{S_t = 0\}\big].$$

Since $\mathbf{1}\{t \le \tau\}\,\mathbf{1}\{S_t = 0\}$ is $\sigma(X_{<t})$-measurable, we may equivalently take the expectation under the full trajectory $X \sim M$:

$$D_{\mathrm{JS}}(P_{\mathrm{mix}} \,\|\, P_{\mathrm{int}}) \le \varepsilon\, \mathbb{E}_{X \sim M}\Big[\sum_{t=1}^{T_{\max}} \mathbf{1}\{t \le \tau\}\, \mathbf{1}\{S_t = 0\}\Big] = \varepsilon\, \mathbb{E}_{X \sim M}\Big[\sum_{t=1}^{\tau} \mathbf{1}\{S_t = 0\}\Big] = \varepsilon\, \mathbb{E}_{X \sim M}[N_0].$$

Finally, since $M = \tfrac{1}{2}(P_{\mathrm{mix}} + P_{\mathrm{int}})$, we have $\mathbb{E}_{X \sim M}[N_0] = \tfrac{1}{2} \mathbb{E}_{X \sim P_{\mathrm{mix}}}[N_0] + \tfrac{1}{2} \mathbb{E}_{X \sim P_{\mathrm{int}}}[N_0]$, giving the equivalent form. $\square$

Remark A.7 (Effective skew-JS on non-intervention steps). Let $M := \tfrac{1}{2}(P_{\mathrm{mix}} + P_{\mathrm{int}})$, and define the effective token-level skew Jensen–Shannon divergence on non-intervention steps by

$$\bar\jmath := \frac{\mathbb{E}_{X \sim M}\Big[\sum_{t=1}^{\tau} \mathbf{1}\{S_t = 0\}\, D_{\mathrm{JS}}^{\alpha_t(X_{<t})}\big(\pi_{\mathrm{prim}}(\cdot \mid X_{<t}) \,\|\, \pi_{\mathrm{int}}(\cdot \mid X_{<t})\big)\Big]}{\mathbb{E}_{X \sim M}[N_0]}, \qquad \big(\bar\jmath := 0 \text{ if } \mathbb{E}_{X \sim M}[N_0] = 0\big).$$

Then $D_{\mathrm{JS}}(P_{\mathrm{mix}} \,\|\, P_{\mathrm{int}}) = \bar\jmath\, \mathbb{E}_{X \sim M}[N_0]$. Moreover, under the assumption of Proposition A.6, $\bar\jmath \le \varepsilon$, and hence

$$D_{\mathrm{JS}}(P_{\mathrm{mix}} \,\|\, P_{\mathrm{int}}) = \bar\jmath\, \mathbb{E}_{X \sim M}[N_0] \le \varepsilon\, \mathbb{E}_{X \sim M}[N_0].$$

Here $\bar\jmath$ averages the skew Jensen–Shannon terms appearing in the exact decomposition of Lemma A.5; this is the natural JS analogue of the effective KL in the preceding remark.

A.2 Experimental Details

A.2.1 Token Distribution Analyses

We run model inference using vllm (Kwon et al., 2023). On AIME 2024 and 2025, we apply nucleus sampling (Holtzman et al., 2020) with top-$p = 0.7$ and temperature $= 1$. For divergence calculations on AIME, we use the top-$p$ truncated distribution to reflect the effective sampling distribution and to provide a more accurate estimate for our cross-sampling experiments. We also examine the distribution of JS divergence values without truncation (and at other top-$p$ values) to ensure the main results are not affected by the truncation. For experiments on the training data, we use top-$p = 1$ to reflect the training sampling distribution.

A.2.2 Cross-Sampling

For cross-sampling experiments, we use the same inference setup as in the token analysis. Cross-sampling selectively swaps tokens between the base and RL models at positions where the JS divergence exceeds a threshold, allowing us to measure the functional importance of divergent token distributions. We perform forward and reverse cross-sampling experiments on the following model-dataset combinations.
The divergence thresholds used for each configuration are as follows:

• Qwen2.5-32B with SimpleRL:
  – AIME 2024: forward threshold ε = 0.03, reverse threshold ε = 0.05
  – AIME 2025: forward threshold ε = 0.05, reverse threshold ε = 0.05
• Qwen2.5-32B with DAPO:
  – AIME 2024: forward threshold ε = 0.08, reverse threshold ε = 0.06
  – AIME 2025: forward threshold ε = 0.1, reverse threshold ε = 0.08
• Mistral-Small-24B with SimpleRL:
  – AIME 2024: forward threshold ε = 0.002, reverse threshold ε = 0.02

A.2.3 Additional Training Details

We implement RLVR training experiments using verl (Sheng et al., 2024) with the standard DAPO recipe (Yu et al., 2025).

Qwen2.5-Math-7B DAPO Training. We follow the public DAPO recipe, with clip ratios ε_low = 0.2 and ε_high = 0.28; for token analysis, we also train a variant with ε_high = 0.2 for comparison. We optimize with AdamW at learning rate 1 × 10⁻⁶ with a 10-step warmup, and no explicit reference-KL penalty. Each RLVR step processes 512 prompts with 16 sampled responses per prompt; these are split into mini-batches of 32 prompts, yielding 16 gradient updates per RLVR step. The maximum generation length and the overlong-penalty threshold are set to 8k and 4k tokens, respectively.

Supervised Fine-Tuning (SFT) Training. For the SFT model based on Qwen2.5-32B, we sampled 42k instances from the AM-DeepSeek-R1-Distilled-1.4M dataset. The model underwent full-parameter fine-tuning for 5 epochs, employing DeepSpeed ZeRO-3 optimization.

Divergence-weighted Training. For the divergence-weighted advantage experiments on Qwen2.5-Math-7B, under the high-KL setting we use s = 0.3 and set α to increase linearly from 0 to 50 starting at step 100. In the low-KL setting, we use s = 0.3 and set α to decrease linearly from 0 to −50 beginning at step 150.
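The divergence-weighted training schedule described above can be sketched as follows. The `ramp_steps` argument is our assumption: the recipe specifies each ramp's start step and endpoint (0 to 50 from step 100 in the high-KL setting; 0 to −50 from step 150 in the low-KL setting) but the exact ramp duration is not restated here.

```python
import math

def alpha_schedule(step: int, start_step: int, alpha_end: float, ramp_steps: int) -> float:
    """Linear alpha ramp: 0 before start_step, then linear to alpha_end.

    ramp_steps is a hypothetical knob; only the start step and the
    endpoint value are taken from the paper's description.
    """
    if step < start_step:
        return 0.0
    frac = min((step - start_step) / ramp_steps, 1.0)
    return alpha_end * frac

def divergence_weighted_advantages(advantages, kls, alpha, s=0.3):
    """Scale each per-token advantage by w_t = 1 + s*(sigmoid(alpha*KL_t) - 1/2)."""
    weighted = []
    for adv, kl in zip(advantages, kls):
        w = 1.0 + s * (1.0 / (1.0 + math.exp(-alpha * kl)) - 0.5)
        weighted.append(adv * w)
    return weighted
```

A zero-KL token always keeps its original advantage (the sigmoid term is exactly 1/2 there), so the weighting only redistributes emphasis across tokens that have actually moved.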
A.3 Weight-Level Analysis of Changes Orthogonal to the analysis done in the main text, we also investigate the degree of modifications induced by RLVR at the parameter level. More specifically, we employ the relative gap ratio (Wu et al., 2025), denoted as σ, to quantify the magnitude of weight divergence pre- and post-fine-tuning. This ratio is formulated as: σ=∑|Woriginal−Wtuned|∑|Woriginal|+∑|Wtuned|σ= Σ|W_original-W_tuned|Σ|W_original|+Σ|W_tuned| where WoriginalW_original and WtunedW_tuned represent the model parameters before and after fine-tuning, respectively. A lower σ value signifies greater similarity between the parameter sets, indicating a smaller overall modification from the fine-tuning process. In our experiment, we utilized the Qwen2.5-32B and Qwen2.5-Math-7B models as foundations. Each model was independently fine-tuned via two distinct methodologies: RL and Supervised Fine-Tuning (SFT). To ensure a controlled and equitable comparison, the training regimen for both methods was standardized, employing an identical dataset size and the same number of training steps. Subsequently, the σ was computed between each original model and its corresponding tuned counterparts. The results are presented in the following table. Table 3: Relative gap ratio (σ) after RL and SFT fine-tuning. Model Qwen2.5-32B Qwen2.5-Math-7B σ after RL 0.00143 0.00136 σ after SFT 0.00347 0.00944 The results presented in the table demonstrate a consistent trend across both models: the σ values corresponding to RL fine-tuning are substantially lower than those from SFT. This quantitative analysis at the parameter level suggests that the cumulative weight modifications induced by RL are significantly less extensive than those resulting from SFT. This finding provides empirical support for the hypothesis that RL achieves performance gains through sparse and targeted parameter adjustments, contrasting with the more distributed updates characteristic of SFT. A.4 Additional RLVR vs. 
Supervised Fine-Tuning Results
This section provides additional results to supplement the discussion in Section 2.6. A natural question is whether the sparse, targeted distributional shifts we observe are specific to RLVR, or whether they also characterize other fine-tuning approaches. To address this, we compare RLVR-trained models with models refined through supervised fine-tuning (SFT). We analyze Qwen2.5-32B trained with SFT alongside Qwen2.5-32B DAPO.

Figure 12 shows JS divergence distributions for both approaches. SFT produces a noticeably larger high-divergence set with larger divergence values, whereas RLVR concentrates almost all token distributions below very small JS values. This directly reflects RLVR's extreme selectivity and the broader edits introduced by SFT. The top-k overlap analysis (Figure 16) shows that SFT consistently achieves lower overlap with the base model, indicating more aggressive re-ranking, while RLVR largely stays within the base model's existing candidate set. The rank reordering analysis (Figure 17) further shows that SFT promotes more tokens from outside the base model's top-3, whereas RLVR mainly promotes candidates that were already high-ranked.

(a) SFT: Histogram (b) SFT: Percentiles (c) RLVR (DAPO): Histogram (d) RLVR (DAPO): Percentiles
Figure 12: JS divergence distributions comparing supervised fine-tuning and RLVR on AIME 2024. RLVR exhibits sparser distributional shifts than SFT, suggesting more targeted refinement.

(a) SFT: Histogram (top-p = 1) (b) SFT: Percentile curve (top-p = 1)
Figure 13: JS divergence distributions computed using the top-p = 1 distribution (while sampling is still performed with top-p = 0.7) for Qwen2.5-32B SFT on AIME 2024.

(a) Qwen2.5-32B SFT (b) Qwen2.5-32B DAPO
Figure 14: Mean JS divergence by normalized token position comparing SFT and RLVR-trained models on AIME 2024. The positional patterns reveal differences in how SFT and RLVR concentrate their updates.
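For concreteness, the Jensen–Shannon divergence between two next-token distributions can be sketched as below. This is a minimal NumPy illustration using base-2 logarithms (so values lie in [0, 1]), not the authors' exact implementation.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two distributions, base-2 logs.
    Bounded in [0, 1]; 0 for identical distributions, 1 for disjoint ones."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)  # midpoint distribution
    kl = lambda a, b: float(np.sum(a * (np.log2(a + eps) - np.log2(b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

In the token-level analysis, p and q would be the base and fine-tuned models' next-token distributions at the same position.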
Taken together, the metrics highlight that SFT diverges from RLVR along several axes. The SFT model exhibits higher overall JS divergence as well as a larger mass of high-divergence tokens (Figure 12), and attains lower top-k overlap with the base model (Figure 16) alongside larger rank shifts (Figure 17). Moreover, SFT's divergent tokens more frequently elevate low base-probability choices compared to RLVR (Figure 18). These differences reinforce that RLVR acts as a targeted editor, while SFT drives broader, less selective reshaping of the distribution. These findings align with recent work suggesting that RL fine-tuning acts as a scalpel rather than a hammer, amplifying existing capabilities through localized changes compared to the broader modifications induced by supervised fine-tuning (Rajani et al., 2025; Chu et al., 2025). Our results corroborate this behavior from the perspective of token-level distributional changes: RLVR modifies far fewer token positions (as measured by JS divergence), and at those positions, the changes are more likely to be re-rankings within the base model's top candidates than introductions of entirely new tokens. In contrast, SFT exhibits more widespread token-level distributional shifts across a larger fraction of positions, as it learns to mimic provided outputs while adjusting token probabilities more broadly across the vocabulary.

(a) Qwen2.5-32B SFT, low JS bin (< 0.1) (b) Qwen2.5-32B SFT, high JS bin (> 0.1)
Figure 15: Entropy distributions across divergence bins using the full vocabulary for Qwen2.5-32B with SFT on AIME 2025.

(a) Qwen2.5-32B SFT (b) Qwen2.5-32B DAPO
Figure 16: Top-k token overlap between base and refined models at divergent positions (JS_t > 0.1) comparing SFT and RLVR-trained models on AIME 2024.
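The top-k overlap statistic used in Figure 16 can be sketched as follows. This is a simplified illustration of the metric as described (the fraction of shared tokens between the two models' top-k candidate sets at a position), not the authors' exact code; tie-breaking by token index is an assumption.

```python
def top_indices(probs, k):
    """Indices of the k highest-probability tokens (ties broken by lower index)."""
    return sorted(range(len(probs)), key=lambda i: -probs[i])[:k]

def topk_overlap(base_probs, tuned_probs, k=5):
    """Fraction of shared tokens between the base and tuned models'
    top-k candidate sets at a single position (1.0 = identical sets)."""
    base_top = set(top_indices(base_probs, k))
    tuned_top = set(top_indices(tuned_probs, k))
    return len(base_top & tuned_top) / k
```

A low overlap indicates aggressive re-ranking (as observed for SFT); an overlap near 1 means the fine-tuned model stays within the base model's candidate set (as observed for RLVR).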
(a) Qwen2.5-32B SFT (b) Qwen2.5-32B DAPO
Figure 17: Distribution of base-model ranks for fine-tuned models' top-3 tokens at high-divergence positions (JS > 0.1) comparing SFT and RLVR-trained models on AIME 2024.

(a) Qwen2.5-32B SFT (b) Qwen2.5-32B DAPO
Figure 18: Percentage of divergent tokens whose RL top-1 choice had base probability below a given threshold, comparing SFT and RLVR-trained models on AIME 2024.

(a) Qwen2.5-32B SFT: Histogram (b) Qwen2.5-32B SFT: Percentiles
Figure 19: JS divergence distributions for Qwen2.5-32B SFT on AIME 2025.

(a) SFT: Histogram (top-p = 1) (b) SFT: Percentile curve (top-p = 1)
Figure 20: JS divergence distributions computed using top-p = 1 for Qwen2.5-32B SFT on AIME 2025.

A.5 Additional Token Distribution Analyses
This section provides supplementary and extended token distribution analyses. We first present supplementary figures for the main models (Qwen2.5-32B with DAPO and SimpleRL on AIME 2024), then extend the analysis to additional models and datasets to demonstrate the generalizability of our findings.

A.5.1 Supplementary Figures for Main Models
We provide additional figures for Qwen2.5-32B with DAPO and SimpleRL that complement the analyses in the main text.

(a) Low JS bin (< 0.1) (b) High JS bin (> 0.1)
Figure 21: Entropy distributions across divergence bins for SimpleRL. Low-divergence tokens are mostly low-entropy, while high-divergence tokens are concentrated in higher-entropy regions, reflecting a more conservative update strategy.

(a) Qwen2.5-32B DAPO (b) Qwen2.5-32B SimpleRL
Figure 22: Distribution of base-model ranks for RL's top-3 tokens at high-divergence positions (JS > 0.1). Most RL-selected tokens were already highly ranked in the base model, especially under SimpleRL.

(a) Frequent high-JS tokens (b) Frequent low-JS tokens
Figure 23: Histogram of divergences for frequent high-JS tokens and frequent low-JS tokens (Qwen2.5-32B with DAPO).
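The entropy-by-divergence-bin analysis (e.g., Figure 21) can be sketched as below. This is a minimal illustration assuming per-token JS values and next-token distributions are already available; the 0.1 bin threshold matches the one used in the figures.

```python
import numpy as np

def entropy_bits(probs, eps=1e-12):
    """Shannon entropy (in bits) of a next-token distribution."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    return float(-np.sum(p * np.log2(p + eps)))

def bin_by_divergence(js_values, entropies, threshold=0.1):
    """Split per-token entropies into low- and high-divergence bins,
    using the same JS threshold as the paper's figures."""
    js = np.asarray(js_values, dtype=float)
    h = np.asarray(entropies, dtype=float)
    return h[js < threshold], h[js >= threshold]
```

Plotting histograms of the two returned arrays reproduces the style of the entropy-bin figures.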
(a) SimpleRL
Figure 24: Percentage of divergent tokens whose RL top-1 choice had base probability below a given threshold.

(a) DAPO (b) SimpleRL
Figure 25: Probability differences and ratios for top-3 tokens under DAPO and SimpleRL among divergent distributions (JS > 0.1).

(a) Near the start of generation (b) Near the answer span
Figure 26: Local averages of JS divergence as a function of distance from key regions (prompt beginning and answer) for Qwen2.5-32B models on AIME 2024. Average divergence peaks occur in the same early and late windows highlighted by the positional analysis.

(a) Qwen2.5-32B DAPO (b) Qwen2.5-32B SimpleRL
Figure 27: Per-sequence scatter plots relating entropy to JS divergence for Qwen2.5-32B DAPO and SimpleRL on AIME 2024. DAPO exhibits a broader entropy spread among divergent tokens, whereas SimpleRL concentrates divergence in higher-entropy regions.

Results on GPQA-Diamond. We extend our analysis to GPQA-Diamond to demonstrate the generalizability of our findings across different reasoning benchmarks. Figure 28 shows JS divergence percentile curves and positional concentration for Qwen2.5-32B with DAPO on GPQA-Diamond, revealing consistent sparsity patterns. Figure 29 shows entropy distributions across divergence bins.

(a) JS divergence percentiles (b) Positional concentration
Figure 28: JS divergence analysis for Qwen2.5-32B with DAPO on GPQA-Diamond. The sparsity patterns and positional concentration are consistent with findings on the AIME datasets.

(a) Low JS bin (< 0.1) (b) High JS bin (> 0.1)
Figure 29: Entropy distributions across divergence bins for Qwen2.5-32B with DAPO on GPQA-Diamond. Patterns are consistent with those observed on the AIME datasets.

Effect of Top-p Sampling on JS Divergence. To verify that our findings are robust to different top-p sampling settings, we compare JS divergence distributions across different sampling configurations. The default setting uses top-p = 0.7 for sampling.
We also evaluate configurations where sampling is performed with top-p = 0.8 and top-p = 0.9. Figure 30 shows that the sparsity patterns remain consistent across different sampling top-p values, confirming that our main claims are not sensitive to the specific sampling top-p value used.

(a) Sampling top-p = 0.8 (b) Sampling top-p = 0.9
Figure 30: JS divergence percentile curves for Qwen2.5-32B with DAPO on AIME 2024 under different top-p sampling settings. The sparsity patterns remain consistent across different sampling top-p values, indicating robustness to the specific sampling configuration.

JS Divergence on AIME 2025. Figure 31 shows JS divergence percentile curves for Qwen2.5-32B with DAPO and SimpleRL on AIME 2025, demonstrating consistent sparsity patterns across datasets.

(a) DAPO: Percentile curve (b) SimpleRL: Percentile curve
Figure 31: JS divergence distributions for Qwen2.5-32B with DAPO and SimpleRL on AIME 2025. The sparsity patterns are consistent with those observed on AIME 2024, confirming the robustness of our findings across datasets.

Effect of Top-p Truncation on JS Divergence. To verify that our use of top-p truncated distributions (with top-p = 0.7) does not significantly impact our findings, we compare JS divergence distributions computed using the estimated full distribution (top-p = 1) under the original sampling setting (top-p = 0.7) with those using truncated distributions. Figure 32 shows that the patterns remain consistent: distributional shifts are highly sparse regardless of truncation, with the vast majority of tokens showing near-zero divergence.

(a) DAPO: Percentile curve (top-p = 1) (b) SimpleRL: Percentile curve (top-p = 1)
Figure 32: JS divergence distributions computed using top-p = 1 for Qwen2.5-32B with DAPO and SimpleRL on AIME 2025. The sparsity patterns are consistent with those observed using top-p truncated distributions, confirming that truncation does not significantly impact our findings.
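A minimal sketch of the top-p (nucleus) truncation being compared here, assuming a full next-token distribution as input; the convention of including the token that crosses the cumulative-mass threshold is standard but is an assumption about the exact implementation.

```python
import numpy as np

def top_p_truncate(probs, p=0.7):
    """Nucleus (top-p) truncation: keep the smallest set of highest-probability
    tokens whose cumulative mass reaches p, zero out the rest, renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs)            # indices sorted by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1  # include the token crossing the threshold
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()
```

Comparing JS divergences computed on `top_p_truncate(probs, 0.7)` versus the untruncated `probs` mirrors the robustness check in Figure 32.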
A.5.2 Comparison of DAPO Variants: Clip-Higher Settings
DAPO's clip-higher mechanism controls the degree of upper clipping in the PPO updates. We compare two Qwen2.5-Math-7B models trained with DAPO: one with the default clip-higher setting (0.28) and another with a more restrictive setting (0.2). Figure 33 shows their JS divergence distributions on AIME 2024 and AIME 2025, revealing how the clip-higher parameter affects distributional shifts across datasets.

(a) DAPO (0.28) AIME 2024: Percentiles (b) DAPO (0.28) AIME 2025: Percentiles (c) DAPO (0.2) AIME 2024: Percentiles (d) DAPO (0.2) AIME 2025: Percentiles
Figure 33: JS divergence distributions for Qwen2.5-Math-7B trained with DAPO under different clip-higher settings on AIME 2024 and AIME 2025. The more restrictive clip-higher=0.2 setting leads to sparser distributional shifts than the default 0.28 setting across both datasets, with a smaller proportion of tokens exhibiting non-negligible divergence. However, on its divergent token set, the JS values are higher, as indicated by the higher upper percentiles.

Figure 34 compares positional concentration patterns on AIME 2024 and AIME 2025, while Figure 38 and Figure 39 examine top-k overlap and rank reordering, respectively. Figure 37 shows the percentage of divergent tokens whose RL top-1 choice had base probability below a given threshold for both DAPO variants across different datasets. Figure 35 shows entropy distributions across divergence bins for both DAPO variants.

(a) DAPO (0.28) AIME 2024 (b) DAPO (0.28) AIME 2025 (c) DAPO (0.2) AIME 2024 (d) DAPO (0.2) AIME 2025
Figure 34: Mean JS divergence by normalized token position for DAPO variants with different clip-higher settings on AIME 2024 and AIME 2025.

(a) Low JS bin (< 0.1) (b) High JS bin (> 0.1) (c) Low JS bin (< 0.1) (d) High JS bin (> 0.1)
Figure 35: Entropy distributions across divergence bins for Qwen2.5-Math-7B with DAPO variants on AIME 2024.
Top row: DAPO (clip-higher=0.28); bottom row: DAPO (clip-higher=0.2). Patterns are consistent with those observed in the main text, confirming the relationship between entropy and divergence across different clip-higher settings.

(a) Low JS bin (< 0.1) (b) High JS bin (> 0.1) (c) Low JS bin (< 0.1) (d) High JS bin (> 0.1)
Figure 36: Entropy distributions across divergence bins for Qwen2.5-Math-7B with DAPO variants on AIME 2025. Top row: DAPO (clip-higher=0.28); bottom row: DAPO (clip-higher=0.2).

(a) AIME 2024 (b) AIME 2025 (c) Fine-tune Data (d) AIME 2024 (e) AIME 2025 (f) Fine-tune Data
Figure 37: Percentage of divergent tokens whose RL top-1 choice had base probability below a given threshold for Qwen2.5-Math-7B with DAPO variants. Top row: DAPO (clip-higher=0.28); bottom row: DAPO (clip-higher=0.2). Consistent with findings in the main text, RL rarely promotes tokens with very low base probability, even under more exploratory clip-higher settings. We further observe a distinction between the two clip-higher settings, with the more restrictive setting (0.2) promoting fewer tokens with low base probability.

(a) DAPO (0.28) AIME 2024 (b) DAPO (0.28) AIME 2025 (c) DAPO (0.2) AIME 2024 (d) DAPO (0.2) AIME 2025
Figure 38: Top-k token overlap between base and RL models at divergent positions (JS_t > 0.1) for DAPO variants on AIME 2024 and AIME 2025.

(a) DAPO (0.28) AIME 2024 (b) DAPO (0.28) AIME 2025 (c) DAPO (0.2) AIME 2024 (d) DAPO (0.2) AIME 2025
Figure 39: Distribution of base-model ranks for RL's top-3 tokens at high-divergence positions (JS > 0.1) for DAPO variants on AIME 2024 and AIME 2025.

Fine-Tuning Data Results. We also analyze distributional shifts on the fine-tuning data to examine how models behave on the data they were fine-tuned on. Figure 40 shows JS divergence distributions, while Figures 41 and 42 show additional analyses.
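The clip-higher mechanism examined in this subsection corresponds to an asymmetric clipping range in the PPO-style surrogate objective; the ε_low = 0.2 and ε_high = 0.28 defaults match the training details in Section A.2.3. The per-token sketch below is an illustration of that objective, not the training code.

```python
import numpy as np

def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token PPO-style clipped surrogate with an asymmetric clip range,
    as in DAPO's clip-higher: ratios are clipped to [1-eps_low, 1+eps_high].
    A larger eps_high permits bigger upweighting of positively-rewarded tokens."""
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * advantage, clipped * advantage)
```

Setting eps_high = 0.2 recovers the symmetric, more restrictive variant compared in Figures 33–39.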
(a) DAPO (clip-higher=0.28): Percentiles (b) DAPO (clip-higher=0.2): Percentiles
Figure 40: JS divergence distributions for DAPO variants of Qwen2.5-Math-7B on fine-tuning data (computed using approximated full distributions instead of truncated ones).

(a) DAPO (clip-higher=0.28) (b) DAPO (clip-higher=0.2)
Figure 41: Mean JS divergence by normalized token position for DAPO variants on fine-tuning data.

(a) DAPO (clip-higher=0.28) (b) DAPO (clip-higher=0.2)
Figure 42: Top-k token overlap between base and RL models at divergent positions (JS_t > 0.1) for DAPO variants on fine-tuning data.

A.5.3 Qwen3-8B with DAPO
We analyze Qwen3-8B trained with DAPO on AIME 2024 and AIME 2025 to further demonstrate the robustness of our findings across model families and training configurations. Notably, this model was fine-tuned for nearly twice as many RLVR training steps as Qwen2.5-Math-7B, providing a useful point of comparison for understanding the effect of extended RL training. Figure 43 shows JS divergence percentile curves, revealing similarly sparse distributional shifts despite the longer training horizon. Figure 44 shows positional concentration, Figure 45 shows entropy distributions across divergence bins, and Figure 46 shows tail-behavior analysis.

(a) AIME 2024: Percentiles (b) AIME 2025: Percentiles
Figure 43: JS divergence distributions for Qwen3-8B with DAPO on AIME 2024 and AIME 2025. Sparse distributional shifts persist even under extended training.

(a) AIME 2024 (b) AIME 2025
Figure 44: Mean JS divergence by normalized token position for Qwen3-8B with DAPO on AIME 2024 and AIME 2025. Divergences remain concentrated at early and late positions, consistent with other models.

(a) Qwen3-8B DAPO, low JS bin (< 0.1) (b) Qwen3-8B DAPO, high JS bin (> 0.1)
Figure 45: Entropy distributions across divergence bins for Qwen3-8B with DAPO on AIME 2024.
(a) Qwen3-8B DAPO AIME 2024 (b) Qwen3-8B DAPO AIME 2025
Figure 46: Percentage of divergent tokens whose RL top-1 choice had base probability below a given threshold for Qwen3-8B with DAPO on AIME 2024 and AIME 2025.

(a) Qwen3-8B DAPO AIME 2024
Figure 47: Example of base vs. RL distributions at differing divergence levels.

A.5.4 Mistral-Small-24B with SimpleRL
We analyze Mistral-Small-24B trained with SimpleRL on AIME 2024 and AIME 2025 to demonstrate the generalizability of our findings across different model architectures. Figure 48 shows JS divergence percentile curves, revealing consistent sparsity patterns. Figure 49 shows positional concentration, Figure 50 shows entropy distributions across divergence bins, and Figure 51 shows tail-behavior analysis.

(a) AIME 2024: Percentiles (b) AIME 2025: Percentiles
Figure 48: JS divergence distributions for Mistral-Small-24B with SimpleRL on AIME 2024 and AIME 2025. Sparse distributional shifts are consistent with the findings in the main text across both datasets.

(a) AIME 2024 (b) AIME 2025
Figure 49: Mean JS divergence by normalized token position for Mistral-Small-24B with SimpleRL on AIME 2024 and AIME 2025. Consistent with the findings for other models, divergences are concentrated at the start and end of responses.

(a) Mistral-24B SimpleRL, low JS bin (< 0.1) (b) Mistral-24B SimpleRL, high JS bin (> 0.1)
Figure 50: Entropy distributions across divergence bins for Mistral-Small-24B with SimpleRL on AIME 2024.

(a) Mistral-24B SimpleRL AIME 2024 (b) Mistral-24B SimpleRL AIME 2025
Figure 51: Percentage of divergent tokens whose RL top-1 choice had base probability below a given threshold for Mistral-Small-24B with SimpleRL on AIME 2024 and AIME 2025.

A.6 Additional Cross-Sampling Results
This section provides supplementary cross-sampling results and a general description of the algorithm used for the cross-sampling experiments.
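The per-prompt cross-sampling loop, stated formally as Algorithm 1 below, can be sketched in Python as follows. The `primary` and `intervention` callables are hypothetical stand-ins that return next-token distributions given a prefix; greedy batching and evaluation logic are omitted.

```python
import numpy as np

def js_div(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base-2 logs) between two distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * (np.log2(a + eps) - np.log2(b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cross_sample(primary, intervention, eps=0.1, max_steps=20, eos=0, seed=0):
    """Generate with the primary policy, but sample from the intervention
    policy wherever the per-token JS divergence exceeds eps."""
    rng = np.random.default_rng(seed)
    prefix, k = [], 0  # k counts intervention steps
    for _ in range(max_steps):
        p, q = primary(prefix), intervention(prefix)
        if js_div(p, q) > eps:
            tok = int(rng.choice(len(q), p=q))  # intervene at divergent position
            k += 1
        else:
            tok = int(rng.choice(len(p), p=p))  # follow the primary policy
        prefix.append(tok)
        if tok == eos:
            break
    return prefix, k
```

In the forward direction the base model is primary and the RL model intervenes; the reverse direction swaps the roles.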
Algorithm 1 describes the general procedure for a single prompt (in practice, we batch prompts for efficiency): it generates a response primarily under a primary policy and selectively intervenes with an intervention policy at positions where the token-level divergence exceeds a fixed threshold. Note that to obtain the cross-sampling plots, for every k cross-sampling interventions, we complete the response with π_prim and evaluate, then continue cross-sampling from the prefix as it was before the primary-model completion.

Algorithm 1 Cross-Sampling for a Single Prompt
Input: prompt prefix x_{<1}, primary policy π_prim, intervention policy π_int, divergence threshold ε, maximum steps T
Output: generated sequence x_{1:t}, number of intervention steps k
1: k ← 0
2: Initialize prefix x_{<1}
3: for t = 1, …, T do
4:   Compute divergence JS_t = D_JS(π_prim(· | x_{<t}) ‖ π_int(· | x_{<t}))
5:   if JS_t > ε then
6:     Sample x_t ∼ π_int(· | x_{<t})
7:     k ← k + 1
8:   else
9:     Sample x_t ∼ π_prim(· | x_{<t})
10:  end if
11:  Append x_t to the prefix, forming x_{<t+1}
12:  if x_t = EOS then break
13:  end if
14: end for
15: return generated sequence x_{1:t} and intervention count k

Table 4: Summary of cross-sampled tokens required to reach approximate RL-level performance (forward) or base-level performance (reverse) for Qwen2.5-32B on AIME 2024 and AIME 2025 with a token budget of 8000. Effective token counts/percentages exclude identity swaps during cross-sampling. Token percentages are computed at the sequence level.

Dataset  Method         Eff. % Tokens  % Tokens  Eff. # Tokens  # Tokens  Initial Acc.  Final Acc.
AIME24   SimpleRL       3.86%          7.58%     38             75        8.23          >25
AIME24   SimpleRL Rev.  5%             8.3%      29             51        25.52         <8.3
AIME24   DAPO           7.8%           11.9%     280            410       8.23          >44
AIME24   DAPO Rev.      10.1%          14.9%     173            258       44.8          <8.5
AIME25   SimpleRL       1.53%          2.97%     13             26        5.3           >14
AIME25   SimpleRL Rev.  4.73%          7.87%     31             53        12.71         <4
AIME25   DAPO           6.47%          9.18%     230            326       4.8           >33
AIME25   DAPO Rev.      9.89%          14.19%    181            261       32            <4.5

(a) Forward cross-sampling, Qwen2.5-32B SimpleRL (b) Reverse cross-sampling, Qwen2.5-32B SimpleRL (c) Forward cross-sampling, Qwen2.5-32B DAPO (d) Reverse cross-sampling, Qwen2.5-32B DAPO
Figure 52: Cross-sampling token-pair histograms of the form (primary sampled token) → (intervention sampled token) at cross-sampling intervention points (excluding identity token swaps).

(a) Forward cross-sampling (b) Reverse cross-sampling
Figure 53: Cross-sampling results (DAPO on AIME 2025): injecting RL tokens into base generations progressively recovers RL accuracy, while replacing RL tokens with base tokens causes near-monotonic degradation toward base performance.

(a) Forward cross-sampling (b) Reverse cross-sampling
Figure 54: Cross-sampling results (DAPO on AIME 2024): injecting RL tokens into base generations progressively recovers RL accuracy, while replacing RL tokens with base tokens causes near-monotonic degradation toward base performance.

(a) Random baseline (b) DAPO
Figure 55: Comparison of random-baseline and DAPO cross-sampling on AIME 2024: average number of tokens replaced (including identity swaps) versus accuracy. The random baseline shows slow performance improvement, demonstrating that targeted RL token selection is critical for performance gains; random replacement may skip positions critical to the reasoning trajectories.

(a) Forward cross-sampling (b) Reverse cross-sampling
Figure 56: Cross-sampling results (Mistral-Small-24B SimpleRL on AIME 2024): injecting RL tokens into base generations progressively recovers RL accuracy, while replacing RL tokens with base tokens causes near-monotonic degradation toward base performance.

A.7 Additional Results
For completeness, Figure 57 shows a training run on Qwen3-32B-Base (Yang et al., 2025a) using the standard DAPO recipe and dataset.
We observe that AIME24 performance appears to plateau between approximately steps 80 and 180 (around mean@32 of 57–60%), before improving steadily beyond step 180 to exceed 70%. This delayed improvement suggests that additional performance gains may manifest only after extended training. Given the substantial computational cost of such runs (exceeding 140,000 GPU hours to reach roughly step 500 for a single training run), it is plausible that shorter training horizons could lead to prematurely terminated runs and consequently undertrained baselines in prior work. Due to these resource constraints, we do not investigate this phenomenon further here, but we highlight it as a potentially important consideration for future work. Figure 57: AIME24 performance of Qwen3-32B-Base trained with DAPO.