
Paper deep dive

PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

Ruishuo Chen, Yu Chen, Zhuoran Li, Longbo Huang

Year: 2026 · Venue: arXiv preprint · Area: cs.CL · Type: Preprint · Embeddings: 86

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/22/2026, 6:03:04 AM

Summary

PowerFlow is a principled framework for unsupervised fine-tuning of Large Language Models (LLMs) that reformulates the process as a distribution matching problem. By utilizing a length-aware Trajectory-Balance (LA-TB) objective, it targets alpha-power distributions to directionally elicit latent capabilities: sharpening (alpha > 1) for logical reasoning and flattening (alpha < 1) for expressive creativity, effectively mitigating degenerative biases like length collapse.

Entities (5)

PowerFlow · framework · 100%
Alpha-power distribution · mathematical-concept · 95%
GFlowNet · algorithm · 95%
LA-TB · objective-function · 95%
RLIF · paradigm · 95%

Relation Signals (3)

PowerFlow targets Alpha-power distribution

confidence 95% · By targeting α-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs

PowerFlow utilizes LA-TB

confidence 95% · we introduce a length-aware Trajectory-Balance (LA-TB) objective tailored for unsupervised LLM alignment.

GFlowNet serves as amortized variational sampler

confidence 90% · By casting GFlowNet as an amortized variational sampler for unnormalized densities

Cypher Suggestions (2)

Find all frameworks and their associated objective functions · confidence 90% · unvalidated

MATCH (f:Framework)-[:UTILIZES]->(o:ObjectiveFunction) RETURN f.name, o.name

Identify the target mathematical concepts for a specific framework · confidence 90% · unvalidated

MATCH (f:Framework {name: 'PowerFlow'})-[:TARGETS]->(m:Concept) RETURN m.name

Abstract

Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha > 1$) to intensify logical reasoning, or flattening it ($\alpha < 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.

Tags

ai-safety (imported, 100%) · cscl (suggested, 92%) · preprint (suggested, 88%)

Links


Full Text

85,811 characters extracted from source content.


PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

Ruishuo Chen, Yu Chen, Zhuoran Li, and Longbo Huang (IIIS, Tsinghua University). Corresponding author: longbohuang@tsinghua.edu.cn

Abstract. Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting α-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution (α > 1) to intensify logical reasoning, or flattening it (α < 1) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.

1. Introduction

The ongoing debate regarding whether reinforcement learning (RL) can empower Large Language Models (LLMs) to transcend the capabilities inherent in their base models has sparked renewed interest in the latent potential embedded within pre-trained weights Yue et al. (2025); Shao et al. (2025); Liu et al. (2025b).
Consequently, significant research effort has been directed toward eliciting these capabilities without relying on labeled trajectories or external reward signals. In this landscape, Reinforcement Learning from Internal Feedback (RLIF) has emerged as a dominant paradigm. By employing intrinsic rewards derived from self-certainty Zhao et al. (2025); Prabhudesai et al. (2025); Li et al. (2025) or ensemble-based consistency Zuo et al. (2025); Zhang et al. (2025c), RLIF seeks to induce a process of self-evolution, guiding models toward more confident and consistent outputs to bolster reasoning performance. Despite their empirical successes, existing RLIF methods predominantly rely on handcrafted reward designs that heuristically specify optimization directions. Lacking a principled theoretical objective, these approaches are often susceptible to the unintended biases inherent in their specific reward formulations. As a result, models frequently suffer from various pathological behaviors—particularly during over-optimization Ghimire et al. (2026)—such as distorted response lengths (e.g., collapse Shafayat et al. (2025) or explosion Zhao et al. (2025)), overconfidence Zhang et al. (2025d), and mode collapse Zuo et al. (2026). Recent research attributes reasoning gains from RL post-training to distributional sharpening Shao et al. (2025); Yue et al. (2025); Karan and Du (2025), characterizing LLM self-improvement, including existing RLIF paradigms, as the implicit concentration of probability mass relative to the base distribution Huang et al. (2025); Zuo et al. (2026). Conversely, empirical evidence also indicates that over-sharpened distributions, often a byproduct of alignment, can stifle generative diversity and creative expression Yang and Holtzman (2025); West and Potts (2025); Zhang et al. (2025b). Figure 1: Illustration of the PowerFlow framework for directional capability elicitation. 
By matching the length-aware α-power distribution, PowerFlow can either sharpen the distribution (α > 1) to enhance logical reasoning or flatten it (α < 1) to restore latent creativity. The right panels illustrate significant performance gains and a clear Pareto improvement over existing baselines. Inspired by these insights, we propose PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. Moving beyond heuristic rewards, we target the α-power distribution of the base model: p_α(y|q) ∝ p_base(y|q)^α (also known as the α-order escort distribution Beck and Schlögl (1993)). This target is uniquely motivated by its ability to modulate entropy while strictly preserving the intrinsic structural features and relative mode rankings of the base distribution. We treat the power exponent α as a controllable knob to directionally elicit the dual nature of LLMs: sharpening the distribution (α > 1) to concentrate mass on latent reasoning paths, or flattening it (α < 1) to release the creative potential typically suppressed in aligned models. We provide a conceptual overview of the PowerFlow framework in Figure 1. To operationalize this distribution matching, we build upon Generative Flow Networks (GFlowNets) Bengio et al. (2021); Malkin et al. (2022), a principled paradigm for learning stochastic policies that sample proportionally to unnormalized densities. However, applying GFlowNets to the autoregressive structure of LLMs reveals a critical challenge: the exponential decay of trajectory probabilities. Without explicit correction, standard distribution matching objectives are often dominated by length-related variance rather than semantic density, leading to pathological length collapse during sharpening (α > 1) or repetitive explosion during flattening (α < 1) Zhao et al. (2025); Shafayat et al. (2025).
To bridge this gap, we introduce a length-aware Trajectory-Balance (LA-TB) objective tailored for unsupervised LLM alignment. By reparameterizing the partition function into an amortized token-level energy term, we enable optimization on a length-normalized energy surface. This formulation effectively decouples the training gradient from response length and neutralizes the structural length bias inherent in autoregressive generation. As illustrated in Figure 3, this principled derivation allows PowerFlow to achieve stable optimization and monotonic performance gains that raw distribution matching objectives fail to maintain. Extensive evaluations across various models and benchmarks reveal that PowerFlow (α > 1) consistently outperforms current RLIF methods, matching or even surpassing the performance of supervised GRPO Shao et al. (2024) while preserving the diversity of reasoning paths. Furthermore, when applied to instruction-tuned models, PowerFlow (α < 1) yields simultaneous diversity and quality gains in creative writing, successfully unlocking the latent creativity that is typically suppressed during the alignment process. Our contributions are summarized as follows:

• PowerFlow Framework: We propose a principled framework that reformulates unsupervised LLM fine-tuning as a distribution matching problem. By targeting the α-power distribution, PowerFlow provides a unified theoretical foundation to directionally elicit the model's dual nature—enhancing reasoning or restoring creativity—via a single controllable parameter α.

• Length-Aware Objective: We derive a length-aware Trajectory-Balance objective that neutralizes the exponential length bias inherent in autoregressive generation. This formulation enables stable, principled distribution alignment on a length-normalized energy surface, preventing the degenerative length collapse common in heuristic RLIF methods.
• Dual-Nature Activation: Extensive experiments demonstrate that PowerFlow (α > 1) consistently outperforms existing RLIF methods and matches or exceeds supervised GRPO in reasoning accuracy. Conversely, PowerFlow (α < 1) successfully restores the latent creativity of aligned models, achieving simultaneous gains in both output diversity and quality.

2. Related Works

Reinforcement Learning from Internal Feedback (RLIF). RLIF was pioneered by Zhao et al. (2025) to facilitate unsupervised elicitation of reasoning capabilities by substituting external rewards—typically derived from pre-trained reward models or verifiers—with intrinsic feedback signals. This paradigm has catalyzed extensive research into diverse intrinsic reward mechanisms, including self-certainty Zhao et al. (2025), token-level entropy Prabhudesai et al. (2025), generation probabilities Li et al. (2025), semantic entropy Zhang et al. (2025c), and majority voting Zuo et al. (2025); Shafayat et al. (2025). Despite their empirical successes, most existing RLIF methods rely on handcrafted rewards that heuristically guide optimization without a principled target distribution. This lack of theoretical grounding leaves them vulnerable to the inherent biases of reward design, particularly in the later stages of policy optimization. For instance, entropy-based RLIF has been shown to induce overconfidence and performance degradation as training progresses, while exhibiting minimal impact on low-entropy instruct-tuned models Zhang et al. (2025d). Recent work further reveals that such internal-confidence-based methods may trigger the generation of inappropriate, repetitive, and predictable patterns as a shortcut to artificially reduce entropy Ghimire et al. (2026); Zuo et al. (2026). Similarly, majority-voting rewards are susceptible to reward hacking, where models generate high-entropy, stochastic chains-of-thought to arrive at consistent but irrelevant answers Shafayat et al. (2025).
Furthermore, probability-based rewards are prone to length collapse as models exploit the higher joint probabilities of shorter sequences Zhao et al. (2025); Zuo et al. (2026). In contrast, PowerFlow eliminates heuristic reward biases by matching the principled α-power distribution, while its length-aware objective prevents degenerative length collapse.

The Distribution Sharpening Mechanism. Following the remarkable success of Reinforcement Learning from Verifiable Rewards (RLVR) in enhancing LLM reasoning Shao et al. (2024); DeepSeek-AI et al. (2025), significant effort has been devoted to understanding its underlying mechanisms. Critical insights from prior work Yue et al. (2025); Shao et al. (2025) suggest that current RLVR training may not necessarily elicit entirely new reasoning patterns beyond the base model's intrinsic capacity. Instead, it operates through "distribution sharpening," a process of amplifying specific internalized reasoning paths to improve sample efficiency at low sampling rates (e.g., pass@1). While recent theoretical works have begun to analyze RLIF methods through this lens Huang et al. (2025); Zuo et al. (2026), they remain largely observational, lacking efficient practical sharpening methods and overlooking the structural length bias inherent in α-power distributions. Notably, Karan and Du (2025) demonstrated that MCMC sampling from the sharpened α-power distribution yields outstanding performance; however, its prohibitive inference cost and complex sampling logic render it impractical for general deployment. In contrast, PowerFlow amortizes the computational cost of distribution matching into the training phase, eliciting either logical reasoning via sharpening or expressive creativity via flattening under standard decoding at inference time.

3. PowerFlow: Unsupervised Fine-Tuning via Distribution Matching

3.1. The Distribution Matching Formulation

In the unsupervised fine-tuning of Large Language Models (LLMs), we operate in the absence of external reward signals or ground-truth trajectories. Our primary information source is the base model distribution p_base(y|q) = ∏_{t=1}^{T} p_base(y_t | q, y_{<t}), where q is a given query and y is the generated response. Consequently, any unsupervised training scheme can be conceptualized as a mechanism for redistributing the initial probability mass by leveraging information intrinsic to the base model itself. In this work, we introduce PowerFlow, a framework that returns to the probabilistic essence of the problem by reformulating unsupervised fine-tuning as a distribution matching problem. Our goal is to minimize the divergence between the fine-tuned policy π_θ and a target distribution p_target. We primarily focus on the α-power distribution, a self-derived and non-heuristic transformation of the base model:

p_α(y|q) = p_base(y|q)^α / Z(q, α),   (1)

where Z(q, α) = ∑_y p_base(y|q)^α is the partition function. Unlike handcrafted rewards, the α-power distribution, widely known as the escort distribution in statistical mechanics Beck and Schlögl (1993), provides a principled reshaping of p_base that modulates entropy while strictly preserving its relative probability rankings and mode structure via monotonicity. This ensures that the fine-tuning remains inherently grounded in the model's pre-trained knowledge, effectively circumventing the biased distributional drift often associated with heuristic rewards. Under the PowerFlow framework, we treat α as a controllable knob to directionally elicit the model's dual nature: Reasoning Elicitation (α > 1): Inspired by research linking reasoning gains to distribution sharpening Yue et al. (2025); Karan and Du (2025), we expect that matching p_α with α > 1 will enhance reasoning accuracy.
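The ranking-preserving, entropy-modulating behavior of Eq. (1) can be checked on a toy categorical distribution. This is a minimal sketch with hypothetical probabilities; `alpha_power` and `entropy` are our own helpers, not the paper's code:

```python
import numpy as np

def alpha_power(p, alpha):
    """Escort / alpha-power transform: p_alpha ∝ p**alpha (Eq. 1)."""
    q = p ** alpha
    return q / q.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

# Hypothetical base distribution over 4 candidate responses.
p_base = np.array([0.5, 0.3, 0.15, 0.05])

sharp = alpha_power(p_base, 2.0)   # alpha > 1: sharpen
flat  = alpha_power(p_base, 0.5)   # alpha < 1: flatten

# Mode rankings survive in both directions (monotonic transform)...
assert (np.argsort(sharp) == np.argsort(p_base)).all()
assert (np.argsort(flat)  == np.argsort(p_base)).all()
# ...while entropy moves directionally with alpha.
assert entropy(sharp) < entropy(p_base) < entropy(flat)
```

This is exactly the property the paper leans on: α changes how concentrated the distribution is without reordering which responses the base model prefers.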
This objective is grounded in the principle of verification-generation asymmetry, supported by both computational complexity theory Cook (1971) and empirical studies Weng et al. (2023); Yuan et al. (2024). This asymmetry, where recognizing a correct solution is easier than generating one, suggests that the model harbors substantial "hidden knowledge" Hinton et al. (2015) that is not fully manifested during default generation. We hypothesize that a significant performance-capability gap exists due to the base model's relatively flat distribution. Sharpening thus acts as a mechanism to bridge this gap by concentrating probability mass on latent, high-quality reasoning paths, thereby increasing the efficiency of surfacing correct trajectories during standard decoding. Indeed, we theoretically demonstrate that the empirically effective majority-voting based RLIF can be formalized as an implicit mechanism for extreme distribution sharpening, effectively driving the policy towards the dominant mode (see Theorem D.1 in Appendix D). Creativity Release (α < 1): When α < 1, the target distribution is flattened, redistributing probability mass toward the long-tail regions. While applying such flattening to a raw base model may lead to incoherent outputs due to its high intrinsic entropy, it serves a distinct purpose for aligned models. Recent studies observe that alignment procedures, such as RLHF Ouyang et al. (2022), frequently induce excessive distribution sharpening, which suppresses the inherent creativity and diversity of the original base model West and Potts (2025); Yang and Holtzman (2025); Lanchantin et al. (2025). Specifically, Zhang et al. (2025b) formalize the "typicality bias" inherent in reward models, asserting that aligned models implicitly sample from an α-power (α > 1) distribution of the reference model. This results in a pathological preference for high-probability, typical responses while discarding creative alternatives.
We hypothesize that these creative capabilities remain latent within sub-peak probability regions. By matching a flattened distribution, PowerFlow facilitates the recovery of these suppressed expressive capabilities, effectively counteracting typicality bias to enable a more diverse and creative generation space. Ultimately, PowerFlow transforms unsupervised fine-tuning from a heuristic pursuit of rewards into a principled optimization task, where α serves as a precise control mechanism to navigate the model's latent capability space without the brittleness of manual reward specification. To this end, we first frame GFlowNet as an amortized solution to this distribution matching problem from the perspective of variational inference. We then derive a length-aware Trajectory-Balance objective designed to neutralize the structural length bias inherent in autoregressive generation, thereby ensuring stable and robust optimization. The complete algorithmic workflow is illustrated in Figure 2.

Figure 2: The PowerFlow framework. During training (top), the policy π_θ and the log Z′_φ module are optimized via the LA-TB objective to match the α-power distribution of the base model while neutralizing length bias. The control knob α enables directional elicitation: sharpening (α > 1) for reasoning or flattening (α < 1) for creativity. The inference pipeline (bottom) remains standard.

3.2. GFlowNets as Amortized Samplers

A fundamental challenge in matching the target distribution p_target(y|q) is that its partition function, Z(q) = ∑_y p̃_target(y|q), is computationally intractable due to the combinatorial explosion of the response space Y. From the perspective of Variational Inference (VI), this intractability can be bypassed by expressing the reverse KL divergence via a variational surrogate:

D_KL(π_θ ‖ p_target) = E_{y∼π_θ}[ log( π_θ(y|q) / p̃_target(y|q) ) ] + log Z(q).   (2)

Since log Z(q) is constant with respect to the policy parameters θ, minimizing the reverse KL divergence is equivalent to minimizing the expectation term, which serves as a variational upper bound. Generative Flow Networks (GFlowNets) Bengio et al. (2021) provide a principled framework for this variational setting by learning a policy that acts as an amortized sampler for unnormalized densities. Zimmermann et al. (2023) formally established that the Trajectory Balance (TB) objective Malkin et al. (2022) acts as a specialized variational surrogate for minimizing Eq. (2).

Proposition 3.1 (Zimmermann et al. (2023)). For a trajectory τ = (s_0, s_1, …, s_T), define the Trajectory Balance objective as:

L_TB(θ, φ, ψ; τ) = ( log [ Z_φ ∏_{t=0}^{T−1} P_F(s_{t+1} | s_t; θ) / ( p̃_target(s_T) ∏_{t=1}^{T} P_B(s_{t−1} | s_t; ψ) ) ] )².   (3)

If the learned partition function Z_φ is optimized to its equilibrium, the expected gradient of L_TB with respect to the forward policy parameters θ satisfies:

E_{τ∼P_F}[ ∇_θ L_TB(θ, φ, ψ; τ) ] = 2 ∇_θ D_KL( P_F(τ; θ) ‖ p_target(τ) ).   (4)

In the context of LLMs, the autoregressive generation process naturally forms a tree-structured Directed Acyclic Graph (DAG), where each state s_t corresponds to a prefix y_{<t}. Since each state in a tree has a unique parent, the backward policy simplifies to P_B(y_{<t} | y_{≤t}, q) = 1 for all valid transitions. Identifying the forward policy P_F as our model π_θ, the TB loss for a given query q and response y reduces to a form tailored for autoregressive models:

L_TB(θ, φ; q, y) = ( log Z_φ(q) + ∑_{t=1}^{T} log π_θ(y_t | y_{<t}, q) − log p̃_target(y|q) )².   (5)

This formulation transforms the distribution matching problem into an RL-style on-policy optimization task:

min_{θ,φ} J(θ, φ) = E_q[ E_{y∼π_θ(·|q)}[ L_TB(θ, φ; q, y) ] ].   (6)

By optimizing Eq. (6), π_θ learns to align its sequence-level probabilities with p̃_target, while Z_φ(q) amortizes the estimation of the normalization constant to reduce gradient variance. Further implementation and training details are provided in Appendix A.1.

3.3. Length-Aware PowerFlow Objective

Figure 3: Stability analysis of distribution matching strategies. Matching the trajectory-level α-power distribution via standard TB or RL objectives (-traj) leads to rapid length collapse. Token-level normalization (-token) initially improves performance but eventually decays due to the exploitation of repetitive tokens. PowerFlow maintains both stable response length and superior reasoning accuracy (pass@1 on MATH) throughout training.

Autoregressive generation in LLMs is inherently plagued by structural length bias. Specifically, the log-probability of a trajectory, log p(y|q) = ∑_{t=1}^{|y|} log p(y_t | y_{<t}, q), is approximately negatively linear with respect to the sequence length |y|. Consequently, a naive distribution matching objective is often dominated by sequence length rather than semantic density. For instance, when targeting an α-power distribution with α > 1 (sharpening), the model tends to exploit the path probability by producing excessively short, trivial sequences. Conversely, when α < 1 (flattening), the model is prone to generating repetitive, deterministic long sequences to accumulate probability mass. Furthermore, the extreme sensitivity of path probabilities to |y| causes the gradient of the partition function Z_φ to exhibit massive variance, severely destabilizing the optimization process. As illustrated in Figure 3, directly matching the α-power (α > 1) distribution using either the Trajectory Balance (TB-traj) or RL-based KL-regularized objectives (RL-traj) leads to an immediate and pathological collapse of response length.
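The length bias in trajectory-level matching can be seen numerically: under a TB loss of the form of Eq. (5) with target p_base^α, the residual inside the square grows roughly linearly with |y|, so length rather than semantic density dominates the gradient. A minimal sketch with hypothetical per-token log-probabilities (not from a real model):

```python
def tb_residual(log_Z, logps_policy, logps_base, alpha):
    """Residual inside the square of an Eq.-(5)-style TB loss
    with unnormalized target p_base(y|q)**alpha."""
    return log_Z + sum(logps_policy) - alpha * sum(logps_base)

# Two hypothetical responses with identical average token confidence
# (-1.0 nats/token) but different lengths; policy == base for clarity.
short = [-1.0] * 10
long_ = [-1.0] * 100

r_short = tb_residual(0.0, short, short, alpha=2.0)   # -10 - 2*(-10) = 10.0
r_long  = tb_residual(0.0, long_, long_, alpha=2.0)   # -100 - 2*(-100) = 100.0

# Same per-token quality, 10x the residual (so 100x the squared loss):
# the objective pushes the policy toward shorter sequences.
assert r_long == 10 * r_short
```

With α > 1 the longer response incurs a far larger mismatch despite identical per-token quality, which is exactly the collapse pressure Figure 3 documents for TB-traj and RL-traj.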
We include the RL-based formulation as a baseline because the standard RL objective with KL regularization, max_π E_{y∼π}[r(y)] − β D_KL(π ‖ π_base), theoretically yields an optimal policy π*(y|q) ∝ π_base(y|q) exp(r(y)/β). By setting the intrinsic reward to the base model's log-probability, r(y) = log p_base(y|q), the target becomes an α-power distribution with α = 1 + 1/β. To counteract length bias, a common heuristic is to optimize the average token log-probability, (1/|y|) log p_base. While these token-level variants, TB-token and RL-token, exhibit initial performance gains, Figure 3 reveals a subsequent decay. This failure stems from a fundamental distortion of the target distribution's structural integrity. By optimizing for average token-level confidence, these methods reshape the energy surface such that local probability mass is decoupled from global semantic coherence. Consequently, the model exploits repetitive and meaningless tokens to artificially lower the average energy, effectively eroding the learned semantic structure to inflate likelihood metrics through repetitive generation. These observations underscore that both naive RL and standard GFlowNet objectives are profoundly susceptible to reward and structural biases, necessitating a more principled approach to distribution alignment. To bridge the gap between principled distribution matching and the non-stationary length distributions of LLMs, we introduce a structural reparameterization of the Trajectory-Balance objective. Standard GFlowNets Malkin et al. (2022) typically treat the partition function Z_φ(q) as a prompt-dependent scalar, a formulation that is ill-conditioned for autoregressive sequences where probabilities decay exponentially with length.
We instead reformulate the normalization constant as a length-aware energy term, Z_φ(q, y) = (Z′_φ(q))^{|y|}, which effectively projects the optimization onto a length-normalized energy surface. By further normalizing the log-trajectory mismatch by |y|, our LA-TB objective ensures that the optimization gradient remains scale-invariant across varying sequence lengths:

L_LA-TB(θ, φ; q, y) = ( log Z′_φ(q) + (1/|y|) log( π_θ(y|q) / p̃_target(y|q) ) )².   (7)

This objective converges to a length-normalized target distribution:

π*(y|q) ∝ p̃_target(y|q) / Z′_φ(q)^{|y|}.   (8)

While this formulation does not strictly maintain relative mode rankings across sequences of varying lengths, it achieves a robust balance by neutralizing structural biases while fundamentally preserving the target distribution's semantic essence. By operating within the space of amortized geometric mean probabilities, PowerFlow prioritizes semantic quality over sequence brevity or redundancy, effectively shifting the optimization focus toward the model's true latent capability space. Finally, we instantiate the target as the α-power distribution, p_base(y|q)^α. To ensure instruction-following integrity and logical structure, we incorporate a format penalty ψ(y). Specifically, ψ(y) is set to a negative constant (e.g., −0.5) if the output fails to match a predefined pattern (e.g., the absence of the required answer tag), and 0 otherwise. This yields the final PowerFlow objective:

L_PowerFlow = w · ( log Z′_φ(q) + (1/|y|) log π_θ(y|q) − α [ (1/|y|) log p_base(y|q) + ψ(y) ] )²   (9)

where w is the importance sampling ratio, detached from the gradient:

w = clip( π_θ(y|q) / π_old(y|q), 1 − ε, 1 + ε ).   (10)

The inclusion of w ensures compatibility with off-policy fine-tuning, where trajectories are sampled from a behavior policy π_old.
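The final objective in Eqs. (9)–(10) can be sketched as a plain-Python loss computation. This is a schematic with scalar log-probabilities; in an actual implementation these would be autograd tensors and `log_Zp` would be the output of the learned log Z′_φ(q) module, so the function and argument names here are our own illustrative choices:

```python
import math

def powerflow_loss(log_Zp, logp_policy, logp_base, logp_old, y_len,
                   alpha, fmt_ok, eps=0.2, psi_penalty=-0.5):
    """Length-aware PowerFlow objective (Eq. 9) with clipped IS ratio (Eq. 10)."""
    psi = 0.0 if fmt_ok else psi_penalty            # format penalty ψ(y)
    # Clipped importance ratio; treated as a constant (detached) weight.
    ratio = math.exp(logp_policy - logp_old)
    w = min(max(ratio, 1.0 - eps), 1.0 + eps)
    residual = (log_Zp
                + logp_policy / y_len               # (1/|y|) log π_θ(y|q)
                - alpha * (logp_base / y_len + psi))
    return w * residual ** 2

# A length-50 response whose per-token energies already satisfy Eq. (8):
# log Z' = -0.6, avg policy log-prob -1.0, avg base log-prob -0.8, alpha 2.
loss = powerflow_loss(log_Zp=-0.6, logp_policy=-50.0, logp_base=-40.0,
                      logp_old=-50.0, y_len=50, alpha=2.0, fmt_ok=True)
```

Because every term is divided by `y_len`, doubling the response length while keeping per-token quality fixed leaves the residual unchanged, which is the length-decoupling property the LA-TB derivation is after; a format violation simply shifts the target energy by α·ψ.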
Following Zhu et al. (2025), we apply clipping to maintain training stability and prevent gradient collapse during iterative optimization. As evidenced by the robust training dynamics in Figure 3, the PowerFlow objective effectively circumvents these structural distortions, achieving sustained length stability and monotonic performance gains by preserving the principled α-power density on a length-normalized surface.

4. Experiments

In this section, we evaluate the effectiveness of PowerFlow across two primary domains: complex logical reasoning and diverse creative writing. Following the experimental setup detailed in Section 4.1, we present our findings in two parts. First, Section 4.2 demonstrates that distribution sharpening (α > 1) effectively intensifies reasoning performance across various model variants. Subsequently, Section 4.3 reveals that distribution flattening (α < 1) restores the generative diversity typically suppressed in aligned models while simultaneously improving output quality. Together, these experiments illustrate that principled distribution matching serves as a robust mechanism for the directional elicitation of latent LLM capabilities without external supervision.

4.1. Experimental Setup

Data and Training Configuration. For reasoning tasks, we follow standard practices in the community Hugging Face (2025); Zhang et al. (2025a) by utilizing questions from the NuminaMath-CoT dataset LI et al. (2024) for unsupervised training. Specifically, we employ a subset of 18,000 queries filtered by Zhang et al. (2025c) to exclude instances with excessive response length or potential answer leakage. Each query is appended with a prompt instructing the model to "think step by step" and provide the final answer within a designated answer environment. For creative writing tasks covering poem continuation, story generation, and joke writing Lu et al. (2025), we select a training set of 300 prompts, drawn from the 500-prompt collection curated by Zhang et al.
(2025b) and sourced from PoemHunter.com, BookMIA Shi et al. (2024), and Reddit r/DadJokes Reddit (2023). All inputs are formatted using the models' official chat templates; detailed prompts and hyperparameters are provided in Appendix B and Appendix A.2, respectively.

Models and Baselines. To evaluate the generalizability of PowerFlow, we conduct experiments across several representative model families and scales, including the 1.5B, 3B, and 7B variants of the Qwen2.5 series, the domain-specific Qwen2.5-Math series, and the Llama-3.2 series. For reasoning benchmarks, we compare PowerFlow against a comprehensive suite of baselines: (i) Base: the original un-finetuned model. (ii) Low-temp: the Base model using a reduced inference temperature (T′ = T/α) following Karan and Du (2025). (iii) Instruct: the official instruction-tuned version of the respective base model. (iv) Format-only: a variant of our framework using α = 1, isolating the performance gains solely from improved answer extractability via the format penalty. (v) RLIF methods: state-of-the-art unsupervised methods including Intuitor (self-certainty rewards) Zhao et al. (2025), EMPO (semantic entropy rewards) Zhang et al. (2025c), and TTRL (majority-voting rewards) Zuo et al. (2025). (vi) PowerSampling: an inference-time baseline that samples from the α-power distribution using Markov Chain Monte Carlo (MCMC) methods Karan and Du (2025). (vii) One-shot EM: a method that directly minimizes token-level entropy via unsupervised optimization Gao et al. (2025). (viii) GRPO: Group Relative Policy Optimization Shao et al. (2024) trained on the NuminaMath-CoT dataset with external verifiable rewards, representing a supervised counterpart. For reproducibility, we utilize the official open-source checkpoints for RLIF baselines, given the substantial computational overhead of full-scale fine-tuning. We discuss this choice in Appendix E.
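The Low-temp baseline is the token-level analogue of α-power sharpening: rescaling logits by T′ = T/α at T = 1 gives softmax(αz), which equals the α-power transform of softmax(z) at each decoding step (though per-step renormalization is not the same as matching the sequence-level distribution of Eq. (1), which is why training or MCMC is needed). A quick check on toy, assumed logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.2, -0.5])   # hypothetical next-token logits
alpha = 2.0

p = softmax(z)
low_temp = softmax(z * alpha)               # temperature T' = 1/alpha
escort = p ** alpha / (p ** alpha).sum()    # token-level alpha-power transform

assert np.allclose(low_temp, escort)   # identical distributions
assert low_temp[0] > p[0]              # the mode gains mass (sharpening)
```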
For creative writing tasks, we compare PowerFlow (α = 0.5) with: (i) Instruct: The default instruct-tuned model. (ii) Base: The original un-finetuned (pre-trained) model. (iii) High-temp: The Instruct model with an increased sampling temperature. (iv) VS-Standard: The “Verbalized Sampling” method Zhang et al. (2025b), which improves diversity by adjusting prompts to elicit latent expressive variety.

Evaluation. For reasoning tasks, we report the mean accuracy across 16 independent samples (avg@16) per problem, with sampling temperature and top-p fixed at 1.0. Our evaluation spans various mathematical reasoning benchmarks, including MATH500 Hendrycks et al. (2021), OlympiadBench He et al. (2024), AIME24 LI et al. (2024), AIME25 Zhang and Math-AI (2025), and AMC23 LI et al. (2024). We also include GPQA (diamond) Rein et al. (2024) to assess natural scientific reasoning. Following Zeng et al. (2025), we extract answers within tags and employ a unified script to determine mathematical equivalence. To assess creative writing, we utilize the remaining 200 prompts. Per prompt, we sample 30 independent responses at a temperature of 0.8 (excluding the High-temp baseline at 1.0) to ensure a robust estimation of semantic diversity and quality. To maintain consistency across phases, both training and evaluation employ official chat templates with concise, generation-restricting instructions. These constraints are designed to minimize conversational overhead, preventing distortion of probability calculations and ensuring that the computed densities strictly reflect the creative generation. We quantify performance using three metrics: (i) Semantic Diversity: Following established practice Shaib et al. (2024); Cann et al. (2023); Lu et al. (2025), defined as 1 − s̄, where s̄ is the mean pairwise cosine similarity of response embeddings from bge-small-en-v1.5 Xiao et al. (2023). (ii) Lexical Redundancy: Measured by ROUGE-L scores among outputs following Shaib et al.
(2024), where lower scores indicate higher variety. (iii) Quality: Evaluated via LLM-as-a-judge using Qwen3-plus, scoring responses based on rubrics from Creative Writing v3 Paech (2023) and HumorBench Narad et al. (2025) (see Appendix B.2 for details).

Table 1: Model Performance Comparison (avg@16) across Benchmarks. Within each model block, horizontal lines separate unsupervised methods (top: PowerFlow and RLIF baselines) from reference models and supervised counterparts (bottom: Instruct and GRPO).

                        MATH500  Olympiad  AIME24  AIME25  AMC23  GPQA   Average
Qwen2.5-1.5B
  Base                    6.20     1.90     0.00    0.00    1.40  25.80    5.88
  Low-temp               18.60     4.90     0.80    0.40    7.50  26.60    9.80
  Intuitor               47.40    15.30     1.50    0.80   22.30  26.40   18.95
  PowerFlow (Ours)       49.30    16.00     0.80    1.50   23.80  27.70   19.85
  Instruct               34.90    10.00     0.80    0.00   13.60  28.00   14.55
  GRPO                   45.40    14.10     1.00    0.40   21.90  26.00   18.13
Qwen2.5-Math-1.5B
  Base                   43.30    20.90     4.60    1.90   28.40  26.10   20.87
  Low-temp               62.90    30.10     9.40    4.00   49.50  28.70   30.77
  Format-only            65.70    30.10     5.60    5.00   47.00  26.10   29.92
  EMPO                   69.90    32.20    12.30    4.60   46.20  29.50   32.45
  PowerFlow (Ours)       70.90    32.50    10.80   10.00   53.30  28.30   34.30
  Instruct               71.50    33.50    10.20    6.00   49.40  26.40   32.83
  GRPO                   71.40    34.00     8.10    6.70   49.50  26.80   32.75
Llama-3.2-3B-Instruct
  Base                   40.10    10.30     4.00    0.00   18.80  29.50   17.12
  Low-temp               50.00    17.50     9.60    0.60   24.80  28.80   21.88
  Intuitor               50.40    16.60     9.20    0.20   27.30  30.50   22.37
  PowerFlow (Ours)       50.60    16.60    10.70    0.40   28.80  30.20   22.88
  GRPO                   50.10    17.20    11.20    0.00   25.00  30.50   22.33
Qwen2.5-Math-7B
  Base                   46.70    22.30    12.30    4.20   34.50  29.70   24.95
  Low-temp               69.00    34.00    23.10    7.70   47.70  34.10   35.93
  PowerSampling          72.20    39.50    23.30   10.00   57.50  36.30   39.80
  TTRL                   80.40    39.60    21.70   11.90   58.80  34.70   41.18
  One-shot EM            61.40    29.80    18.10    6.20   48.90  32.80   32.87
  EMPO                   79.30    41.70    15.80   12.30   60.20  36.00   40.88
  PowerFlow (Ours)       78.10    40.10    20.00   14.40   63.40  37.00   42.17
  Instruct               74.50    31.20    12.50   12.30   65.90  35.00   38.57
  GRPO                   78.40    42.50    22.70   12.90   63.40  34.40   42.38
Qwen2.5-32B-Instruct     79.70    43.10    13.10    9.40   60.30  44.10   41.62

4.2. Eliciting Reasoning via Distribution Sharpening

This section evaluates the efficacy of PowerFlow in eliciting latent reasoning capabilities through distribution sharpening (α > 1). Based on preliminary tuning on Qwen2.5-Math-1.5B (detailed in Appendix C.1), we adopt a default sharpening parameter of α = 4 for base models, aligning with the optimal configurations identified by Karan and Du (2025). Notably, for instruction-tuned models, we find that a lower value of α = 2 is more effective; this is because these models have already undergone distribution sharpening during the alignment process, requiring less intensification to reach peak performance. To prioritize assessing our framework’s generalizability across diverse architectures and scales, we adopt these values as robust defaults rather than pursuing exhaustive per-model optimization. While individual models possess unique intrinsic entropy profiles that may benefit from specialized α-schedules, we leave the development of automated adjustment mechanisms for future work.

4.2.1 Main Performance Results

Table 1 summarizes the performance of PowerFlow across various model series and reasoning benchmarks. The results indicate that PowerFlow yields significant performance improvements that far exceed the gains achievable through simple temperature scaling (Low-temp), standard instruction tuning (Instruct), or targeted format regularization (Format-only). By performing principled distribution sharpening, PowerFlow effectively surfaces the latent reasoning capabilities inherent in the base models, consistently surpassing existing RLIF methods across all tested scales. Notably, our unsupervised approach outperforms the supervised GRPO on three configurations—Qwen2.5-1.5B (19.85 vs. 18.13), Qwen2.5-Math-1.5B (34.30 vs. 32.75), and Llama-3.2-3B-Instruct (22.88 vs. 22.33)—while achieving comparable results on Qwen2.5-Math-7B (42.17 vs. 42.38).
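To make the distribution-matching mechanism concrete, the following is a deliberately simplified, illustrative sketch of a trajectory-balance-style residual for one sampled response: the policy side log Z + log π_θ(y|q) is driven toward the unnormalized α-power target α·log p_ref(y|q), with an optional length normalization. The function name, toy log-probabilities, and log_z value are our own assumptions, not the paper's implementation of LA-TB:

```python
def tb_residual(log_pi, log_pref, log_z, alpha=4.0, length_normalize=True):
    """Trajectory-balance-style squared residual for one sampled response.

    log_pi:   per-token log-probs under the trainable policy
    log_pref: per-token log-probs under the frozen reference model
    log_z:    scalar estimate of the log-partition function for this query
    """
    n = len(log_pi)
    lhs = log_z + sum(log_pi)        # policy side of the balance equation
    rhs = alpha * sum(log_pref)      # unnormalized alpha-power target density
    delta = lhs - rhs
    if length_normalize:             # crude stand-in for length-bias correction
        delta /= n
    return delta ** 2                # squared balance residual to minimize

# toy per-token log-probs for a hypothetical 3-token response
loss = tb_residual([-1.0, -0.5, -0.8], [-1.2, -0.6, -0.9], log_z=-5.0)
```

At the optimum the residual is zero for every trajectory, which forces π_θ(y|q) ∝ p_ref(y|q)^α; without the per-length normalization, longer sequences contribute disproportionately large residuals, which is the structural bias the length-aware objective targets.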
These findings suggest that sharpening the distribution while preserving the structural integrity of the initial model provides a competitive elicitation mechanism that can rival, or even surpass, traditional reward-driven RL methods without the need for external labels or verifiers. We further analyze the underlying elicitation mechanism and provide the comparative Pass@n performance in Appendix C.2.

4.2.2 Preservation of Solution Diversity

Recent studies highlight the potential for models to converge toward less diverse output patterns during RLIF Zhang et al. (2025d); Ghimire et al. (2026); Zuo et al. (2026). To investigate if PowerFlow inherits this vulnerability, we assess the diversity of reasoning paths on the AIME24 and AIME25 benchmarks. Using DeepSeek-V3.2 Liu et al. (2025a) as an LLM-as-a-judge, we evaluate distinct solution strategies across 16 independent samples per problem, following the rubric in Appendix B.1.

Figure 4: Comparison of solution diversity scores on AIME24/25. PowerFlow maintains superior strategy variety.

As shown in Figure 4, PowerFlow achieves the highest diversity score (4.05), notably outperforming both EMPO (3.80) and supervised GRPO (3.93). This preservation of variety stems from the mathematical nature of the α-power distribution matching objective. Unlike traditional RL objectives that may prioritize the exploitation of a single high-reward trajectory, the α-power transformation rescales the entire density surface while strictly maintaining the relative rankings and multi-modal structure of the base model. Consequently, PowerFlow intensifies the probability of correct reasoning paths across the entire latent landscape, allowing the model to elicit a broad spectrum of viable strategies rather than collapsing into a monolithic output pattern.

4.3. Restoring Creativity via Distribution Flattening

We further investigate the effectiveness of distribution flattening (α = 0.5) in restoring the creative diversity of aligned models, which often suffer from mode collapse during instruction tuning. Figure 5 illustrates the averaged performance across four model families, mapping the Pareto frontier between generation quality and semantic diversity. Additional results regarding lexical redundancy and per-task breakdowns are provided in Appendix C.3 and Table 2.

Figure 5: Quality vs. Semantic Diversity on creative writing tasks. The shaded region indicates the area of Pareto improvement relative to the Instruct baseline. PowerFlow (stars) consistently shifts the Pareto frontier outward across all model scales.

As shown in Figure 5, the Instruct models (squares) maintain high quality but exhibit severely restricted diversity. This aligns with prior observations of “typicality bias,” where alignment suppresses high-quality but atypical semantic paths in favor of predictable responses. Conversely, while Base models (circles) possess high latent creativity, their excessive entropy and poor instruction-following frequently lead to incoherent outputs, preventing the realization of their full generative potential. Other baselines fail to bridge this gap effectively: increasing the sampling temperature (High-temp, diamonds) improves diversity only at the expense of quality, while VS-Standard (triangles) degrades quality on models at the 7B scale or smaller due to its reliance on advanced instruction-following capabilities. In contrast, PowerFlow (stars) amortizes the flattening of the energy surface to facilitate the exploration of high-quality, unconventional linguistic paths frequently suppressed by aligned distributions. By achieving superior semantic diversity and reduced lexical redundancy while simultaneously surpassing original Instruct models in quality, PowerFlow effects a Pareto-dominant shift.
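The Pareto-dominance claim can be checked mechanically: a method sits on the frontier if no other method is at least as good on both quality and diversity and strictly better on one. A small sketch with hypothetical scores (the method names reuse the baselines above, but the numbers are illustrative, not the paper's measurements):

```python
def pareto_front(points):
    """Indices of non-dominated points on (quality, diversity).

    A point is dominated if some other point is >= on both axes
    and strictly > on at least one axis.
    """
    front = []
    for i, (q, d) in enumerate(points):
        dominated = any(
            (q2 >= q and d2 >= d) and (q2 > q or d2 > d)
            for j, (q2, d2) in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# hypothetical (quality, semantic diversity) scores for four methods
methods = ["Instruct", "Base", "High-temp", "PowerFlow"]
scores = [(15.0, 0.30), (10.0, 0.55), (13.0, 0.40), (16.0, 0.50)]
front = [methods[i] for i in pareto_front(scores)]
```

With these toy numbers, Instruct and High-temp are dominated by PowerFlow, while Base survives only on diversity; a Pareto-dominant shift means the new point removes the baseline from the frontier entirely.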
Notably, it stands as the only evaluated method capable of enhancing generation quality, demonstrating a successful preservation of robust instruction-following while releasing latent expressive potential. The resulting “best-of-both-worlds” synergy ensures the model remains highly controllable and stylistically sophisticated, yet retains the structural variety and creative entropy inherent in the pre-trained distribution. Such outcomes confirm that distribution flattening serves as a principled and effective mechanism for revitalizing the creative potential of aligned large language models.

5. Conclusion and Discussion

In this work, we introduced PowerFlow, a principled framework for unsupervised capability elicitation in LLMs. By framing fine-tuning as a distribution-matching problem, we demonstrated that the latent capabilities of LLMs can be directionally activated without external labels or verifiable rewards. Our results demonstrate that distribution sharpening (α > 1) significantly enhances logical reasoning, rivaling or even surpassing supervised methods such as GRPO, while distribution flattening (α < 1) restores the creative diversity frequently suppressed during standard instruction tuning. The success of PowerFlow is fundamentally rooted in its length-aware Trajectory Balance objective, which neutralizes the inherent length bias of autoregressive generation and ensures stable, non-degenerative optimization. Beyond empirical gains, PowerFlow provides a robust diagnostic and optimization framework for investigating the distributional geometry of LLMs. Our findings suggest that pre-trained models possess remarkably sophisticated structural integrity, where RL-based fine-tuning often focuses more on optimizing distribution shapes than introducing novel knowledge.
By shifting from heuristic reward engineering toward explicit distribution matching, we offer a transparent methodology generalizable beyond α-power transformations to diverse distribution families. This paves the way for a unified unsupervised alignment paradigm where task-specific elicitation occurs by matching models to optimized target geometries. Future research on automating the discovery of these shapes and schedules could lead to a more efficient, theoretically grounded approach to developing versatile, specialized, and creative AI agents.

Impact Statement

This work presents a principled framework for the unsupervised elicitation of latent capabilities within Large Language Models (LLMs). By enabling the directional activation of reasoning and creativity through principled distribution matching, our framework mitigates the reliance on cost-intensive human-labeled datasets and external verifiable rewards, thereby democratizing the development of specialized and high-performing AI agents. However, the capacity to manipulate a model’s latent distributional modes warrants careful ethical consideration. Distribution flattening, while intended to restore generative diversity, may inadvertently resurface or amplify harmful content previously suppressed through safety alignment protocols like RLHF. Conversely, aggressive sharpening could prioritize biased or suboptimal reasoning paths if the induced target distribution grants undue prominence to such modes. We therefore emphasize that the deployment of this framework should be fortified by robust safety guardrails and rigorous red-teaming to ensure that elicited capabilities remain socially beneficial and ethically grounded.

References

C. Beck and F. Schlögl (1993) Thermodynamics of chaotic systems: an introduction. Vol. 4, Cambridge University Press. Cited by: §1, §3.1. E. Bengio, M. Jain, M. Korablyov, D. Precup, and Y.
Bengio (2021) Flow network based generative models for non-iterative diverse candidate generation. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.), External Links: Link Cited by: §1, §3.2. T. J. Cann, B. Dennes, T. Coan, S. O’Neill, and H. T. Williams (2023) Using semantic similarity and text embedding to measure the social media echo of strategic communications. arXiv preprint arXiv:2303.16694. Cited by: §4.1. S. A. Cook (1971) The complexity of theorem-proving procedures. Proceedings of the third annual ACM symposium on Theory of computing. External Links: Link Cited by: §3.1. DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, et al. (2025) DeepSeek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645, p. 633–638. External Links: Link Cited by: §2. Z. Gao, L. Chen, H. Luo, J. Zhou, and B. Dai (2025) One-shot entropy minimization. arXiv preprint arXiv:2505.20282. Cited by: §4.1. M. Ghimire, A. Feng, L. You, Y. Luo, F. Liu, and X. Zhu (2026) PRISM: a unified framework for post-training llms without verifiable rewards. arXiv preprint arXiv:2601.04700. Cited by: Appendix E, §1, §2, §4.2.2. C. He, R. Luo, Y. Bai, S. Hu, Z. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, et al. (2024) Olympiadbench: a challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 3828–3850. Cited by: §4.1. D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), External Links: Link Cited by: §4.1. G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §3.1. A. Huang, A. Block, D. J. Foster, D. Rohatgi, C. Zhang, M. Simchowitz, J. T. Ash, and A. Krishnamurthy (2025) Self-improvement in language models: the sharpening mechanism. In The Thirteenth International Conference on Learning Representations, External Links: Link Cited by: §1, §2. Hugging Face (2025) Open r1: a fully open reproduction of deepseek-r1.
External Links: Link Cited by: §4.1. A. Karan and Y. Du (2025) Reasoning with sampling: your base model is smarter than you think. arXiv preprint arXiv:2510.14901. Cited by: Appendix E, §1, §2, §3.1, §4.1, §4.2. J. Lanchantin, A. Chen, S. Dhuliawala, P. Yu, J. Weston, S. Sukhbaatar, and I. Kulikov (2025) Diverse preference optimization. arXiv preprint arXiv:2501.18101. Cited by: §3.1. J. LI, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. C. Huang, K. Rasul, L. Yu, A. Jiang, Z. Shen, Z. Qin, B. Dong, L. Zhou, Y. Fleureau, G. Lample, and S. Polu (2024) NuminaMath. Numina. Note: [https://huggingface.co/AI-MO/NuminaMath-CoT](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf) Cited by: §4.1, §4.1. P. Li, M. Skripkin, A. Zubrey, A. Kuznetsov, and I. Oseledets (2025) Confidence is all you need: few-shot rl fine-tuning of language models. arXiv preprint arXiv:2506.06395. Cited by: §1, §2. A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. (2025a) DeepSeek-V3.2: pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556. Cited by: §4.2.2. M. Liu, S. Diao, X. Lu, J. Hu, X. Dong, Y. Choi, J. Kautz, and Y. Dong (2025b) ProRL: prolonged reinforcement learning expands reasoning boundaries in large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: §1. X. Lu, M. Sclar, S. Hallinan, N. Mireshghallah, J. Liu, S. Han, A. Ettinger, L. Jiang, K. Chandu, N. Dziri, and Y. Choi (2025) AI as humanity’s salieri: quantifying linguistic creativity of language models via systematic attribution of machine text against web text. In The Thirteenth International Conference on Learning Representations, External Links: Link Cited by: §4.1, §4.1. N. Malkin, M. Jain, E. Bengio, C. Sun, and Y. Bengio (2022) Trajectory balance: improved credit assignment in GFlowNets.
In Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho (Eds.), External Links: Link Cited by: §1, §3.2, §3.3. R. Narad, S. Suresh, J. Chen, P. S. Dysart-Bricken, B. Mankoff, R. Nowak, J. Zhang, and L. Jain (2025) Which llms get the joke? probing non-stem reasoning abilities with humorbench. arXiv preprint arXiv:2507.21476. Cited by: §B.2, §4.1. L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in neural information processing systems 35, p. 27730–27744. Cited by: §3.1. S. J. Paech (2023) Eq-bench: an emotional intelligence benchmark for large language models. arXiv preprint arXiv:2312.06281. Cited by: §B.2, §4.1. M. Prabhudesai, L. Chen, A. Ippoliti, K. Fragkiadaki, H. Liu, and D. Pathak (2025) Maximizing confidence alone improves reasoning. arXiv preprint arXiv:2505.22660. Cited by: §1, §2. Reddit (2023) Reddit dad jokes. External Links: Link Cited by: §4.1. D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024) GPQA: a graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, External Links: Link Cited by: §4.1. S. Shafayat, F. Tajwar, R. Salakhutdinov, J. Schneider, and A. Zanette (2025) Can large reasoning models self-train?. arXiv preprint arXiv:2505.21444. Cited by: §1, §1, §2, §2. C. Shaib, J. Barrow, J. Sun, A. F. Siu, B. C. Wallace, and A. Nenkova (2024) Standardizing the measurement of text diversity: a tool and a comparative analysis of scores. arXiv preprint arXiv:2403.00553. Cited by: §4.1. R. Shao, S. S. Li, R. Xin, S. Geng, Y. Wang, S. Oh, S. S. Du, N. Lambert, S. Min, R. Krishna, et al. (2025) Spurious rewards: rethinking training signals in rlvr. arXiv preprint arXiv:2506.10947. Cited by: §1, §1, §2. Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. 
Wu, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §1, §2, §4.1. W. Shi, A. Ajith, M. Xia, Y. Huang, D. Liu, T. Blevins, D. Chen, and L. Zettlemoyer (2024) Detecting pretraining data from large language models. In The Twelfth International Conference on Learning Representations, External Links: Link Cited by: §4.1. Y. Weng, M. Zhu, F. Xia, B. Li, S. He, S. Liu, B. Sun, K. Liu, and J. Zhao (2023) Large language models are better reasoners with self-verification. In Findings of the Association for Computational Linguistics: EMNLP 2023, p. 2550–2575. Cited by: §3.1. P. West and C. Potts (2025) Base models beat aligned models at randomness and creativity. In Second Conference on Language Modeling, External Links: Link Cited by: §1, §3.1. S. Xiao, Z. Liu, P. Zhang, and N. Muennighoff (2023) C-pack: packaged resources to advance general chinese embedding. External Links: 2309.07597 Cited by: §4.1. C. Yang and A. Holtzman (2025) How alignment shrinks the generative horizon. arXiv preprint arXiv:2506.17871. Cited by: §1, §3.1. W. Yuan, R. Y. Pang, K. Cho, X. Li, S. Sukhbaatar, J. Xu, and J. E. Weston (2024) Self-rewarding language models. In Forty-first International Conference on Machine Learning, Cited by: §3.1. Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, Y. Yue, S. Song, and G. Huang (2025) Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: §1, §1, §2, §3.1. W. Zeng, Y. Huang, W. Liu, K. He, Q. Liu, Z. Ma, and J. He (2025) 7B model and 8k examples: emerging reasoning with reinforcement learning is both effective and efficient. Note: https://hkust-nlp.notion.site/simplerl-reasonNotion Blog Cited by: §4.1. H. Zhang, J. Yao, C. Ye, W. Xiong, and T. 
Zhang (2025a) Online-dpo-r1: unlocking effective reasoning without the ppo overhead, 2025. Notion Blog. Cited by: §4.1. J. Zhang, S. Yu, D. Chong, A. Sicilia, M. R. Tomz, C. D. Manning, and W. Shi (2025b) Verbalized sampling: how to mitigate mode collapse and unlock llm diversity. arXiv preprint arXiv:2510.01171. Cited by: §B.2, §1, §3.1, §4.1, §4.1. Q. Zhang, H. Wu, C. Zhang, P. Zhao, and Y. Bian (2025c) Right question is already half the answer: fully unsupervised LLM reasoning incentivization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: Appendix E, §1, §2, §4.1, §4.1. Y. Zhang, Z. Zhang, H. Guan, Y. Cheng, Y. Duan, C. Wang, Y. Wang, S. Zheng, and J. He (2025d) No free lunch: rethinking internal feedback for llm reasoning. arXiv preprint arXiv:2506.17219. Cited by: §1, §2, §4.2.2. Y. Zhang and T. Math-AI (2025) American invitational mathematics examination (aime) 2025. Cited by: §4.1. X. Zhao, Z. Kang, A. Feng, S. Levine, and D. Song (2025) Learning to reason without external rewards. arXiv preprint arXiv:2505.19590. Cited by: Appendix E, §1, §1, §1, §2, §2, §4.1. X. Zhu, D. Cheng, D. Zhang, H. Li, K. Zhang, C. Jiang, Y. Sun, E. Hua, Y. Zuo, X. Lv, et al. (2025) Flowrl: matching reward distributions for llm reasoning. arXiv preprint arXiv:2509.15207. Cited by: §A.1, §B.1, §3.3. H. Zimmermann, F. Lindsten, J. van de Meent, and C. A. Naesseth (2023) A variational perspective on generative flow networks. Transactions on Machine Learning Research. Note: External Links: ISSN 2835-8856, Link Cited by: §3.2, Proposition 3.1. Y. Zuo, B. He, Z. Liu, S. Zhao, Z. Fu, J. Yang, K. Zhang, Y. Fan, G. Cui, C. Qian, X. Chen, Y. Sun, X. Lv, X. Zhu, L. Sheng, R. Li, H. Gao, Y. Zhang, L. Yuan, Z. Liu, B. Zhou, and N. Ding (2026) How far can unsupervised RLVR scale LLM training?. In The Fourteenth International Conference on Learning Representations, External Links: Link Cited by: §1, §1, §2, §2, §4.2.2. Y. Zuo, K. 
Zhang, L. Sheng, S. Qu, G. Cui, X. Zhu, H. Li, Y. Zhang, X. Long, E. Hua, B. Qi, Y. Sun, Z. Ma, L. Yuan, N. Ding, and B. Zhou (2025) TTRL: test-time reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: §1, §2, §4.1.

Appendix A Implementation Details

In this section, we elaborate on the implementation details of the PowerFlow framework, focusing on the architecture of the partition function estimator and the specific configurations used to ensure stable and diverse distribution matching. Code and models will be available upon official publication.

A.1. Log-Partition Function Estimator (log Z′ Module)

To operationalize the length-aware Trajectory-Balance objective, we employ a projection network, ProjZModule, which provides an amortized estimation of the log-partition function log Z′(q). It is crucial to note that the module is designed to output log Z′ directly rather than the partition function Z′, which enhances numerical stability during training.

Architecture. The architecture is inspired by the projection network in FlowRL Zhu et al. (2025), consisting of a 3-layer MLP that maps the hidden states of the final token from the query prefix to a scalar value. We use a hidden dimension consistent with the base model’s hidden size, incorporating GELU activations, LayerNorm, and Dropout (0.1) to regularize the learning of the energy surface.

Principled Initialization. To stabilize the initial phase of optimization and accelerate convergence, we introduce a specialized initialization strategy for the Z module’s output. We initialize the log-partition function based on the base model’s own statistics: log Z′_φ(q) ≈ P̄_ref · (α − 1) + ε, where P̄_ref is the average token-level log-probability sampled from the reference model on the first training batch, and ε is a random noise term with a mean of 0.
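The initialization above can be sketched in a few lines; this is our own simplified illustration, where a flat array of per-token log-probs stands in for statistics gathered on the first training batch:

```python
import random

def init_log_z(ref_token_logprobs, alpha=4.0, noise_scale=0.01, seed=0):
    """Initialize the log-partition estimate from reference-model statistics.

    ref_token_logprobs: per-token log-probs sampled from the frozen
    reference model (a flat list here; the real setup aggregates a batch).
    """
    p_bar = sum(ref_token_logprobs) / len(ref_token_logprobs)  # average token log-prob
    eps = random.Random(seed).gauss(0.0, noise_scale)          # zero-mean noise term
    return p_bar * (alpha - 1.0) + eps

# hypothetical log-probs: mean is -1.4, so the estimate starts near -1.4 * 3 = -4.2
log_z0 = init_log_z([-1.5, -0.8, -2.1, -1.2], alpha=4.0)
```

Starting the estimate near P̄_ref · (α − 1) rather than zero keeps the initial balance residuals small, which is exactly the gradient-stability argument made above.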
This ensures that the estimated log-partition function starts at a magnitude compatible with the target α-power density, preventing the large gradient fluctuations that typically occur when the Z module is forced to adapt from a zero-initialized state.

A.2. Training Configuration and Hyperparameters

The effectiveness of PowerFlow relies on a principled exploration of the probability landscape. We employ a substantial sample size per prompt (N = 16) and conservative learning rates to ensure the model samples the distribution broadly, thereby preserving the structural information of the base distribution and avoiding collapse into local modes. For reasoning tasks where α > 1, we utilize a high sampling temperature (T = 1.0) to facilitate exploration. However, for creativity tasks where α < 1, the temperature is adjusted to 0.7. This modification is necessary because the flattened target distribution already inherently promotes diversity; a more moderate temperature prevents the model from generating incoherent or nonsensical sequences during the later stages of optimization.

Optimization and Learning Rates. We use a prompt batch size of B = 128. The learning rates are tailored to the specific optimization regime and model scale. For the reasoning regime (α > 1), we apply a learning rate of 3×10^-6 for 1.5B models and 1×10^-6 for larger variants. In the creativity-focused regime (α < 1), we use a more stringent learning rate of 5×10^-7 for all models to maintain control over the flattened distribution.

Clip Higher Strategy. To further enhance training stability, we implement the Clip Higher mechanism within the importance sampling objective. Specifically, we set an asymmetric clipping threshold with ε_high = 0.28 and ε_low = 0.2.
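As a sketch of the Clip Higher mechanism, the following is our own simplified, single-token illustration of an asymmetric clipped surrogate with the thresholds stated above; the paper's objective operates on full importance-weighted trajectories:

```python
def clip_higher(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with asymmetric bounds ("Clip Higher").

    ratio:     importance weight pi_theta / pi_old for a token
    advantage: update signal (positive for high-quality trajectories)
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # pessimistic minimum, as in the standard clipped surrogate
    return min(ratio * advantage, clipped * advantage)

# a high-quality trajectory (positive advantage) can push the ratio up to 1.28
assert abs(clip_higher(1.5, advantage=1.0) - 1.28) < 1e-9
# downward deviations are still capped at the tighter 0.8 bound
assert abs(clip_higher(0.5, advantage=-1.0) - (-0.8)) < 1e-9
```

Raising only the upper bound lets confident upward updates pass through while keeping the conservative lower bound that guards against collapse, matching the rationale given in the text.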
By allowing a slightly higher upper bound for the importance weights, we facilitate more effective updates from high-quality trajectories while still guarding against the gradient collapse and instability often encountered in off-policy GFlowNet training.

Appendix B Experimental Prompts

This appendix provides the specific prompts used for generation and evaluation in our experiments. We categorize these into reasoning-focused and creativity-focused tasks to ensure transparency and reproducibility.

B.1. Reasoning Prompts

The prompts for reasoning tasks include the input templates for model generation and the evaluation criteria used to assess trajectory diversity.

Generation Prompt. The prompt template is designed to elicit structured step-by-step reasoning while enforcing a specific output format to ensure the final answer is reliably captured by a parser.

Mathematical Reasoning Generation Prompt
System: Please reason step by step, and output your final answer within .
User: Question Let’s think step by step and output the final answer within .

GPQA Reasoning Generation Prompt
System: Please reason step by step, and output your final answer (A, B, C, or D) within .
User: Question Let’s think step by step and output the final answer (A, B, C, or D) within .

Diversity Scoring Prompt. To evaluate the diversity of the generated reasoning trajectories, we utilize a scoring mechanism adapted from the methodology established by Zhu et al. (2025).

Reasoning Diversity Evaluation Rubric
System: You are evaluating the DIVERSITY of solution approaches for a mathematics competition problem. Focus on detecting even SUBTLE differences in methodology that indicate different problem-solving strategies.
PROBLEM: problem
16 SOLUTION ATTEMPTS: formatted_responses

EVALUATION CRITERIA - Rate diversity from 1 to 5:

Score 1 - Minimal Diversity:
• 14+ responses use essentially identical approaches
• Same mathematical setup, same variable choices, same solution path
• Only trivial differences (arithmetic, notation, wording)
• Indicates very low exploration/diversity in the generation process

Score 2 - Low Diversity:
• 11-13 responses use the same main approach
• 1-2 alternative approaches appear but are rare
• Minor variations within the dominant method (different substitutions, orderings)
• Some exploration but heavily biased toward one strategy

Score 3 - Moderate Diversity:
• 7-10 responses use the most common approach
• 2-3 distinct alternative approaches present
• Noticeable variation in problem setup or mathematical techniques
• Balanced mix showing reasonable exploration

Score 4 - High Diversity:
• 4-6 responses use the most common approach
• 3-4 distinct solution strategies well-represented
• Multiple mathematical techniques and problem framings
• Strong evidence of diverse exploration strategies

Score 5 - Maximum Diversity:
• No single approach dominates (≤ 3 responses use same method)
• 4+ distinctly different solution strategies
• Wide variety of mathematical techniques and creative approaches
• Excellent exploration and generation diversity

IMPORTANT: Focus on the DIVERSITY of the attempted approaches. Return ONLY a number from 1 to 5.

B.2. Creativity Prompts

The prompts for our creative writing experiments are divided into two categories: those used for model generation and those used for output assessment.

Model Generation Prompts. The system and user prompts utilized to elicit creative responses from the models are adapted from Zhang et al. (2025b). These prompts are designed to provide sufficient stylistic constraints.

Creative Writing Direct Sampling System Prompt: Generate a response to the input prompt. The response should be approximately 200 words.
Output ONLY the response, with no explanations or extra text.

Creative Writing Verbalized Sampling System Prompt:
Generate 5 responses to the input prompt. Each response should be approximately 200 words. Return the responses in JSON format with the key: "responses" (list of dicts). Each dictionary must include:
• text: the response string only (no explanation or extra text).
• probability: the estimated probability from 0.0 to 1.0 of this response given the input prompt (relative to the full distribution).
Give ONLY the JSON object, with no explanations or extra text.

Example User Prompt - Poem Writing:
Please write a poem inspired by the line: 'Dear love, for nothing less than thee'.

Example User Prompt - Story Writing:
Please write a short story starting with the following prompt: The internet and the millions of computers associated with it, the virtual reality that therein exists and that will become more vivid in the years ahead, is not a reality.

Example User Prompt - Joke Writing:
Tell me a joke about beer.

Evaluation Prompts. To assess the quality of the creative outputs, we employ evaluation prompt templates consistent with the criteria defined in Creative Writing v3 (Paech, 2023) and HumorBench (Narad et al., 2025).

Poem/Story Quality Evaluation Rubric (Part 1):
You are an expert in assessing creative writing. Your task is to score the quality of the test model's response below, by several metrics, on a 0-20 scale.
[PROMPT START] writing_prompt [PROMPT END]
[TEST MODEL RESPONSE] response [TEST MODEL RESPONSE END]
[Task]
You are an expert in assessing creative writing. Your task is to score the quality of the test model's response below, by several metrics, on a 0-20 scale.
Scoring notes:
- A score of 20 represents a masterpiece.
- If a criterion is not relevant to the piece (e.g. criteria relating to dialogue but the piece has none), exclude it from your assessment.
- Do not add any commentary or explanation to the scores section.
- Do not penalize NSFW output; that is the default for some models. Just assess it on its merits.
- Everything within the "TEST MODEL RESPONSE" section was written by the test model. Sometimes models like to write comments on the piece after the piece is concluded; if this happens you should ignore their comments.
- When judging, ignore aspects of the response that are not relevant to the quality of the writing.
- In the output, write the metric names exactly as below so they can be parsed.
- Do not use markdown in your response. Use the designated output format exactly.
- You are to write a comprehensive analysis of the piece, then give your scores.
- You are a critic, and your job is to be critical, especially of any failings or amateurish elements.
- Output format is:
[Analysis]
Write your detailed analysis.
[Scores]
Metric 1 name: [Score 0-20]
Metric 2 name: ...
---

Poem/Story Quality Evaluation Rubric (Part 2):
Now, rate the supplied model output on the following criteria:
1. Surprising and Creative
2. Imagery and Descriptive Quality
3. Nuanced Characters
4. Emotionally Complex
5. Elegant Prose
6. Well-earned Lightness or Darkness
7. Emotionally Engaging
8. Consistent Voice/Tone of Writing
9. Sentences Flow Naturally
10. Overall Reader Engagement

Joke Quality Evaluation Rubric:
You will receive:
1. The original joke prompt (may or may not contain a topic).
2. The model-generated joke.
Your task is to evaluate the joke based on three qualitative metrics.
Evaluation rules:
- If the prompt includes a topic (e.g., "octopus," "coffee"), check whether the joke is on-topic and score Relevance from 0–5.
- If the prompt does not include a topic (e.g., "Tell me a joke"), automatically assign Relevance = 5.
- A good joke should use at least one recognizable comedic device (pun, irony, exaggeration, reversal, absurd logic, etc.).
- Assign scores on a 0–5 scale (0 = very poor, 5 = excellent) for each dimension:
  - Relevance (0–5): How well does the joke address the topic (or 5 if no topic is given).
  - Comedic Device (0–5): How clearly does the joke use a humor mechanism.
  - Humor Quality (0–5): How funny, witty, or clever is the joke overall.
Output format: Return a JSON object in the following format:
"Relevance": <int>, "Comedic Device": <int>, "Humor Quality": <int>
Input format:
Prompt: prompt
Generated joke: joke

Appendix C Additional Experimental Results

C.1. Sensitivity Analysis of the Power Exponent α

To evaluate the impact of the power exponent $\alpha$ on reasoning elicitation, we conducted a sensitivity analysis on Qwen2.5-Math-1.5B across $\alpha \in \{2, 4, 6\}$. As illustrated in the tuning results, $\alpha = 4$ achieves the best performance (34.30) compared to $\alpha = 2$ (34.08) and $\alpha = 6$ (33.64).

Figure 6: Performance comparison of Qwen2.5-Math-1.5B with varying α. The results demonstrate that α = 4 provides the optimal balance for distribution sharpening.

The results suggest that $\alpha = 4$ represents an ideal trade-off for the base model: $\alpha = 2$ provides insufficient sharpening to effectively prioritize high-quality reasoning paths, while $\alpha = 6$ tends to induce over-sharpening, potentially causing the model to converge prematurely to local optima. Consequently, we use $\alpha = 4$ as the default for all reasoning experiments.

C.2. Mechanisms of Reasoning Activation via PowerFlow

In the absence of external supervision, PowerFlow is formulated as a mechanism that elicits the model's latent internal capabilities rather than injecting novel skills. Indeed, the framework's efficacy rests largely on the robust knowledge representation and reasoning primitives established during the base model's initial training. As illustrated in Figure 7, the performance of PowerFlow, like that of GRPO, is eventually matched or even surpassed by the Base model on OlympiadBench as n increases toward 256.
This observation aligns with prior findings indicating that performance gains in this regime stem from increased sampling efficiency rather than the acquisition of new knowledge. Consequently, we view PowerFlow as a length-neutral amortized sampler that more effectively surfaces high-value reasoning paths already present within the model's intrinsic distribution.

Figure 7: Pass@n comparison on OlympiadBench. As n increases, the performance gap between fine-tuned models and the Base model narrows, suggesting that PowerFlow primarily improves elicitation efficiency.

C.3. Lexical Diversity Analysis

To complement our semantic analysis, we evaluate the lexical variety of generated responses. Figure 8 maps the Pareto frontier between generation quality and lexical diversity (measured by ROUGE-L scores among outputs). Consistent with our semantic findings, PowerFlow (stars) consistently shifts the frontier toward the upper-right quadrant. The shaded regions highlight the area of Pareto improvement over the original instruction-tuned baselines. While standard methods often exhibit a steep trade-off between variety and coherence, PowerFlow reduces lexical redundancy while simultaneously enhancing overall output quality.

Figure 8: Quality-Lexical Diversity Pareto frontier across four model families. Shaded regions indicate areas of Pareto dominance over the Instruct baseline. PowerFlow (stars) uniquely improves both metrics simultaneously.

C.4. Comprehensive Task-Level Performance on Creative Writing

Table 2 presents a granular performance breakdown across three creative genres: poem continuation, joke writing, and story generation. PowerFlow demonstrates notable consistency across all model series and tasks, achieving a superior overall balance between diversity and quality metrics. Specifically, our method consistently outperforms baselines in each category while uniquely realizing concurrent gains in both diversity and quality.
These findings confirm that PowerFlow's distribution flattening revitalizes creative potential without compromising the model's robust instruction-following performance.

Table 2: Comprehensive Performance Comparison across Creative Writing Tasks. Div. denotes Semantic Diversity, R-L denotes ROUGE-L (Lexical Redundancy), and Qual. denotes LLM-judged Quality.

| Model | Poem Div.↑ | Poem R-L↓ | Poem Qual.↑ | Joke Div.↑ | Joke R-L↓ | Joke Qual.↑ | Story Div.↑ | Story R-L↓ | Story Qual.↑ | Avg Div.↑ | Avg R-L↓ | Avg Qual.↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | | | | | | | | | | | | |
| Instruct | 0.1623 | 0.1882 | 0.4056 | 0.2981 | 0.4391 | 0.7077 | 0.2253 | 0.1816 | 0.3726 | 0.2286 | 0.2696 | 0.4953 |
| High-temp | 0.2313 | 0.1227 | 0.4051 | 0.3620 | 0.3141 | 0.6854 | 0.2863 | 0.1402 | 0.3624 | 0.2932 | 0.1923 | 0.4843 |
| Base | 0.1895 | 0.1720 | 0.2947 | 0.3795 | 0.1804 | 0.6725 | 0.2055 | 0.1693 | 0.2894 | 0.2582 | 0.1739 | 0.4189 |
| VS-Standard | 0.2404 | 0.2087 | 0.3203 | 0.3900 | 0.2742 | 0.7181 | 0.2817 | 0.2063 | 0.2670 | 0.3040 | 0.2298 | 0.4351 |
| PowerFlow (Ours) | 0.2492 | 0.1245 | 0.4123 | 0.4446 | 0.0960 | 0.7423 | 0.2994 | 0.1175 | 0.3862 | 0.3311 | 0.1127 | 0.5136 |
| Qwen2.5-3B-Instruct | | | | | | | | | | | | |
| Instruct | 0.1491 | 0.2024 | 0.4678 | 0.2894 | 0.3906 | 0.7939 | 0.1860 | 0.2401 | 0.4331 | 0.2082 | 0.2777 | 0.5650 |
| High-temp | 0.1799 | 0.1699 | 0.4579 | 0.3194 | 0.3377 | 0.7909 | 0.2073 | 0.2173 | 0.4446 | 0.2355 | 0.2416 | 0.5645 |
| Base | 0.1740 | 0.1777 | 0.3825 | 0.3330 | 0.2080 | 0.7575 | 0.2021 | 0.1700 | 0.3548 | 0.2364 | 0.1852 | 0.4983 |
| VS-Standard | 0.3020 | 0.1715 | 0.3369 | 0.3756 | 0.2432 | 0.8032 | 0.2710 | 0.2202 | 0.3194 | 0.3162 | 0.2117 | 0.4865 |
| PowerFlow (Ours) | 0.2085 | 0.1544 | 0.4675 | 0.3360 | 0.2931 | 0.8147 | 0.2093 | 0.1903 | 0.4466 | 0.2513 | 0.2126 | 0.5763 |
| Llama-3.2-3B-Instruct | | | | | | | | | | | | |
| Instruct | 0.1296 | 0.2143 | 0.4834 | 0.2783 | 0.4531 | 0.8680 | 0.1710 | 0.2136 | 0.5030 | 0.1930 | 0.2937 | 0.6182 |
| High-temp | 0.1542 | 0.1753 | 0.4723 | 0.3151 | 0.3796 | 0.8604 | 0.2027 | 0.1768 | 0.4585 | 0.2240 | 0.2439 | 0.5971 |
| Base | 0.3122 | 0.0741 | 0.3327 | 0.3298 | 0.0762 | 0.8144 | 0.3188 | 0.0987 | 0.2270 | 0.3203 | 0.0830 | 0.4580 |
| VS-Standard | 0.3095 | 0.1682 | 0.3280 | 0.3855 | 0.2795 | 0.8423 | 0.3860 | 0.1475 | 0.2604 | 0.3603 | 0.1984 | 0.4769 |
| PowerFlow (Ours) | 0.1546 | 0.1983 | 0.4963 | 0.3135 | 0.3004 | 0.8791 | 0.1971 | 0.2077 | 0.5144 | 0.2217 | 0.2355 | 0.6299 |
| Qwen2.5-7B-Instruct | | | | | | | | | | | | |
| Instruct | 0.1360 | 0.2147 | 0.5054 | 0.2803 | 0.4242 | 0.8383 | 0.1530 | 0.2100 | 0.5010 | 0.1898 | 0.2830 | 0.6149 |
| High-temp | 0.1613 | 0.1868 | 0.4809 | 0.3108 | 0.3668 | 0.8238 | 0.1721 | 0.1921 | 0.5176 | 0.2147 | 0.2485 | 0.6074 |
| Base | 0.1725 | 0.1744 | 0.4451 | 0.3416 | 0.2301 | 0.7942 | 0.1866 | 0.1667 | 0.3636 | 0.2336 | 0.1904 | 0.5343 |
| VS-Standard | 0.2416 | 0.2205 | 0.3647 | 0.3763 | 0.2444 | 0.8335 | 0.1921 | 0.2971 | 0.3449 | 0.2700 | 0.2540 | 0.5144 |
| PowerFlow (Ours) | 0.1620 | 0.1641 | 0.5519 | 0.3257 | 0.3193 | 0.8460 | 0.1634 | 0.1821 | 0.5630 | 0.2170 | 0.2219 | 0.6537 |

Appendix D Theoretical Analysis of Majority Voting Dynamics

In this section, we provide a formal analysis of the convergence behavior of Reinforcement Learning from Internal Feedback (RLIF) when utilizing majority voting rewards.

Theorem D.1 (Asymptotic Convergence to Dirac Distribution). Let $\mathcal{Y}$ be a finite output space with cardinality $|\mathcal{Y}| < \infty$. Consider a policy optimization process generating a sequence of policies $\{\pi_k\}_{k=0}^{\infty}$, where $\pi_{k+1}$ is obtained by solving the entropy-regularized objective defined on the expected reward:

$$\pi_{k+1} = \arg\max_{\pi \in \Delta(\mathcal{Y})} \Big( \mathbb{E}_{y \sim \pi}[\bar{r}_k(y)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_k) \Big). \qquad (11)$$

Here, $\Delta(\mathcal{Y})$ denotes the probability simplex, $\beta > 0$ is the regularization coefficient, and $\bar{r}_k(y)$ is the expected majority-voting reward derived from a batch of size $N \ge 2$ sampled from $\pi_k$:

$$\bar{r}_k(y) \triangleq \mathbb{P}_{\mathcal{D}_N \sim \pi_k^N}\big( y \in \operatorname{mode}(\mathcal{D}_N) \big), \qquad (12)$$

where ties in the mode calculation are broken uniformly at random. Assume the initial policy $\pi_0$ possesses a unique mode $y^* = \arg\max_{y \in \mathcal{Y}} \pi_0(y)$. Then, the policy sequence converges pointwise to the Dirac delta distribution concentrated at $y^*$:

$$\lim_{k \to \infty} \pi_k(y) = \delta_{y^*}(y) \triangleq \begin{cases} 1 & \text{if } y = y^* \\ 0 & \text{if } y \ne y^* \end{cases}. \qquad (13)$$

Proof. The optimization problem in Eq. (11) is convex, and its closed-form solution corresponds to the exponentiated gradient update based on the expected reward vector.
The policy update rule is:

$$\pi_{k+1}(y) = \frac{\pi_k(y)\,\exp\big(\bar{r}_k(y)/\beta\big)}{Z_k}, \quad \text{where } Z_k = \sum_{z \in \mathcal{Y}} \pi_k(z)\,\exp\big(\bar{r}_k(z)/\beta\big). \qquad (14)$$

Note that while the reward $\bar{r}_k(y)$ is derived from a stochastic process, it represents the expectation over the sampling noise. Thus, the update rule describes the deterministic trajectory of the policy under the exact gradient of the expected objective.

Let $y^*$ be the unique mode of $\pi_k$. To prove convergence, we analyze the log-likelihood ratio between the mode $y^*$ and any suboptimal candidate $y' \in \mathcal{Y} \setminus \{y^*\}$. Define $\Lambda_k(y') \triangleq \log\big( \pi_k(y^*) / \pi_k(y') \big)$. The recurrence relation is:

$$\Lambda_{k+1}(y') = \Lambda_k(y') + \frac{1}{\beta} \underbrace{\big( \bar{r}_k(y^*) - \bar{r}_k(y') \big)}_{\Delta \bar{r}_k(y')}. \qquad (15)$$

We now prove that $\pi_k(y^*) > \pi_k(y')$ implies $\bar{r}_k(y^*) > \bar{r}_k(y')$, ensuring the drift term $\Delta \bar{r}_k(y')$ is strictly positive.

Let $\mathbf{k} = (k_y)_{y \in \mathcal{Y}}$ be the frequency vector of a batch $\mathcal{D}_N$, which follows a Multinomial distribution $P(\mathbf{k}) = N! \prod_{y} \frac{\pi_k(y)^{k_y}}{k_y!}$. Let $E_z$ denote the set of configurations where $z$ is the strictly unique mode (ties are handled via symmetry and do not affect the strict inequality for $N \ge 2$).

Consider a bijective mapping $\phi: E_{y'} \to E_{y^*}$ that swaps the counts of $y^*$ and $y'$: for any $\mathbf{k} \in E_{y'}$, define $\phi(\mathbf{k})$ such that its components are $k'_{y^*} = k_{y'}$, $k'_{y'} = k_{y^*}$, and $k'_z = k_z$ otherwise. Since $\mathbf{k} \in E_{y'}$, $y'$ is the mode, so $k_{y'} > k_{y^*}$. Comparing the probability mass of the swapped configurations:

$$\frac{P(\phi(\mathbf{k}))}{P(\mathbf{k})} = \frac{\cdots\, \pi_k(y^*)^{k_{y'}}\, \pi_k(y')^{k_{y^*}} \cdots}{\cdots\, \pi_k(y^*)^{k_{y^*}}\, \pi_k(y')^{k_{y'}} \cdots} = \left( \frac{\pi_k(y^*)}{\pi_k(y')} \right)^{k_{y'} - k_{y^*}}. \qquad (16)$$

By hypothesis $\pi_k(y^*) > \pi_k(y')$, the base is $> 1$. Since $k_{y'} > k_{y^*}$, the exponent is positive. Thus, $P(\phi(\mathbf{k})) > P(\mathbf{k})$.
This implies that for every batch configuration in which the suboptimal candidate $y'$ wins, there exists a strictly more probable configuration in which $y^*$ wins. Summing over all configurations yields $\bar{r}_k(y^*) > \bar{r}_k(y')$.

Returning to the recurrence: since $\pi_0$ has a unique mode $y^*$, $\Lambda_0(y') > 0$. The strict positivity of $\Delta \bar{r}_k(y')$ ensures $\Lambda_k(y')$ is monotonically increasing. As long as the distribution has not collapsed ($\pi_k(y') > 0$), the gap $\Delta \bar{r}_k(y')$ remains positive. Consequently, $\lim_{k \to \infty} \Lambda_k(y') = \infty$. The probability of the mode is given by the sigmoid-like function of these ratios:

$$\pi_k(y^*) = \frac{1}{1 + \sum_{y' \ne y^*} \exp\big(-\Lambda_k(y')\big)}. \qquad (17)$$

As $\Lambda_k(y') \to \infty$, the terms $\exp(-\Lambda_k(y')) \to 0$. Thus, $\lim_{k \to \infty} \pi_k(y^*) = 1$, and $\lim_{k \to \infty} \pi_k(y') = 0$ for all $y' \ne y^*$. ∎

Appendix E Further Discussion

Considerations on Baseline Comparisons. We acknowledge that, due to computational constraints, our comparison with certain RLIF baselines and One-shot EM relies on their respective open-source checkpoints. This may introduce minor inconsistencies into the comparative analysis, as the original training recipes might differ from our internal pipeline. However, to ensure a controlled and equitable evaluation, we trained our GRPO baseline in-house using an experimental configuration identical to that of PowerFlow. Given that this setup is also virtually indistinguishable from the training environment described in the EMPO study (Zhang et al., 2025c), it provides a high-fidelity benchmark for assessing relative performance gains. This rigorous alignment of training conditions ensures that the superior results demonstrated by PowerFlow are attributable to the principled nature of our distribution matching framework rather than to disparate optimization settings.

Comparison with Temperature Scaling and Token-level Optimization.
It is crucial to delineate the fundamental distinctions between PowerFlow and common strategies such as temperature scaling or token-level log-probability optimization. As established in a recent analysis (Karan and Du, 2025), simply lowering the sampling temperature does not equate to sampling from a true power distribution $p^{\alpha}$ for $\alpha > 1$. Specifically, temperature scaling follows the conditional distribution

$$p_{\mathrm{temp}}(x_t \mid x_{<t}) \propto \Big( \sum_{x_{>t}} p(x_0, \ldots, x_T) \Big)^{\alpha},$$

which averages future likelihoods in a greedy, per-token manner before sharpening. In contrast, the power distribution targets

$$p_{\mathrm{pow}}(x_t \mid x_{<t}) \propto \sum_{x_{>t}} p(x_0, \ldots, x_T)^{\alpha},$$

thereby explicitly accounting for the sharpening of high-likelihood future paths, a property essential for surfacing correct reasoning trajectories. This theoretical gap explains why the Low-temp baseline in our experiments fails to match the robust elicitation achieved via principled distribution matching.

Furthermore, treating the token-level average log-probability as a reward objective discards critical information about the global trajectory density. Such methods frequently succumb to local optima by over-exploiting low-entropy tokens, leading to the vacuous or repetitive generation observed in prior RLIF studies (Zhao et al., 2025; Ghimire et al., 2026) and illustrated in Figure 3. PowerFlow instead remains theoretically anchored in the global distribution

$$\pi^*(y \mid q) \propto p_{\mathrm{base}}(y \mid q)^{\alpha} \,/\, Z'_{\phi}(q)^{|y|}.$$

By utilizing the $Z'_{\phi}(q)^{|y|}$ term to neutralize the structural length bias of autoregressive generation, our framework effectively filters the base model's density. Although this reparameterization does not strictly preserve mode rankings across varying sequence lengths, it largely maintains the model's semantic essence and relative structure while mitigating the degenerative biases inherent in generation probabilities.
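The gap between the two distributions can be seen in a toy numerical example. The sketch below (our illustration, not from the paper; the two-token "model" and its probabilities are made up) compares raising the marginal next-token probability to the power $\alpha$ (temperature scaling) against summing the $\alpha$-powered trajectory probabilities (the true power distribution): a prefix whose mass is spread over many continuations can win under the former while losing under the latter.

```python
# Toy joint distribution over two-token sequences (hypothetical numbers).
# Prefix "A" spreads its mass over two continuations; prefix "B"
# concentrates almost all of its mass on a single continuation.
p = {
    ("A", "x"): 0.30, ("A", "y"): 0.30,
    ("B", "x"): 0.39, ("B", "y"): 0.01,
}
alpha = 4.0

def normalize(d):
    s = sum(d.values())
    return {k: v / s for k, v in d.items()}

# Temperature scaling: raise the MARGINAL p(x_1) to the power alpha.
marginal = {t: sum(v for (t1, _), v in p.items() if t1 == t) for t in ("A", "B")}
p_temp = normalize({t: marginal[t] ** alpha for t in ("A", "B")})

# True alpha-power distribution: sum p(trajectory)^alpha over continuations.
p_pow = normalize({t: sum(v ** alpha for (t1, _), v in p.items() if t1 == t)
                   for t in ("A", "B")})

print(p_temp)  # prefers "A": its marginal (0.60) beats "B" (0.40)
print(p_pow)   # prefers "B": its single high-probability path dominates
```

Under $\alpha = 4$, temperature scaling picks the prefix with the larger averaged future likelihood, while the power distribution rewards the prefix that leads to an individually high-likelihood trajectory, matching the sharpening behavior described above.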
While this length-aware objective serves as a pragmatic compromise to neutralize the structural length bias of autoregressive models, it represents an initial step toward achieving true length invariance in distribution matching. We anticipate that future research will yield even more principled mechanisms for decoupling sequence length from semantic density.

Relationship with the FlowRL Framework. While PowerFlow incorporates architectural insights from FlowRL, such as the use of an amortized partition function and importance sampling, there are fundamental distinctions in our theoretical motivation and treatment of sequence length. FlowRL employs GFlowNets within the paradigm of reinforcement learning from verifiable rewards (RLVR) to specifically mitigate the challenges of mode collapse and the over-optimization of dominant reward signals. In contrast, PowerFlow is formulated as a purely unsupervised framework for directional capability elicitation, aligning the policy with the intrinsic $\alpha$-power distribution of the base model itself.

This conceptual shift fundamentally redefines the role of sequence length, transitioning it from a numerical stability concern into a structural alignment requirement. Specifically, in FlowRL, length normalization is primarily introduced as a reward-shaping technique to stabilize training and mitigate gradient explosion in long-trajectory reasoning. PowerFlow instead introduces a length-aware Trajectory-Balance (LA-TB) objective derived from a structural reparameterization of the energy surface, where the partition function is reformulated as

$$Z_{\phi}(q, y) = \big( Z'_{\phi}(q) \big)^{|y|}.$$

This effectively projects the distribution matching problem onto a space of geometric mean probabilities.
This reparameterization is not merely a numerical stabilizer but a theoretical necessity for robust, unsupervised distribution alignment; without it, the intensification of probability mass under $\alpha > 1$ would inevitably drive the model toward trivial, short sequences that exploit the exponential decay of path probabilities. Thus, PowerFlow establishes length invariance as a principled foundation for eliciting latent capabilities on a normalized energy landscape, moving beyond pragmatic modifications for optimization stability.
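To make the balance condition concrete, the sketch below shows one plausible reading of an LA-TB residual (our illustration under stated assumptions, not the paper's reference implementation): if the policy is to match $\pi^*(y \mid q) \propto p_{\mathrm{base}}(y \mid q)^{\alpha} / Z'_{\phi}(q)^{|y|}$, then a standard squared trajectory-balance loss places the learned per-token scalar $\log Z'_{\phi}(q)$ in the balance equation once per token. The function name `la_tb_loss` and the specific numbers are hypothetical.

```python
def la_tb_loss(logp_theta, logp_base, length, log_z_prime, alpha=4.0):
    """Squared length-aware trajectory-balance residual for one trajectory.

    A sketch of our reading of the LA-TB objective: the residual
        |y| * log Z'_phi(q) + log pi_theta(y|q) - alpha * log p_base(y|q)
    vanishes exactly when pi_theta(y|q) = p_base(y|q)^alpha / Z'_phi(q)^{|y|}.

    Args:
        logp_theta:  log pi_theta(y|q), summed over tokens.
        logp_base:   log p_base(y|q), summed over tokens.
        length:      number of tokens |y|.
        log_z_prime: learned scalar log Z'_phi(q) (per-token partition term).
    """
    residual = length * log_z_prime + logp_theta - alpha * logp_base
    return residual ** 2

# At the optimum the residual is zero for every trajectory: construct a
# policy log-probability that already satisfies the balance condition.
alpha, log_z, length = 4.0, -0.5, 20
logp_base = -12.0                                  # hypothetical base log-likelihood
logp_theta = alpha * logp_base - length * log_z    # exactly matched policy
print(la_tb_loss(logp_theta, logp_base, length, log_z, alpha))  # → 0.0
```

Because the $\log Z'_{\phi}(q)$ term is multiplied by $|y|$, a trajectory twice as long pays the partition cost twice, which is exactly the geometric-mean normalization that prevents the sharpened objective from collapsing onto trivially short sequences.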