Paper deep dive
P^2O: Joint Policy and Prompt Optimization
Xinyu Lu, Kaiqi Zhang, Jinglin Yang, Boxi Cao, Yaojie Lu, Hongyu Lin, Min He, Xianpei Han, Le Sun
Intelligence
Summary
P^2O is a novel framework that synergizes Prompt Optimization with Policy Optimization to address the exploration bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR). By identifying 'hard samples' and using the Genetic-Pareto (GEPA) algorithm to evolve prompt templates, P^2O guides models toward successful trajectories and uses context distillation to internalize these reasoning gains directly into model parameters, improving performance on complex reasoning tasks.
Entities (5)
Relation Signals (3)
P^2O → improves → RLVR
confidence 95% · P2O addresses the exploration bottleneck in RLVR
P^2O → utilizes → GEPA
confidence 95% · P2O identifies hard samples during training iterations and leverages the Genetic-Pareto (GEPA) prompt optimization algorithm
P^2O → implements → Context Distillation
confidence 90% · we employ Context Distillation (Snell et al., 2022) to transfer the reasoning capabilities
Cypher Suggestions (2)
Identify techniques used by P^2O to improve model performance. · confidence 95% · unvalidated
MATCH (p:Framework {name: 'P^2O'})-[:IMPLEMENTS|UTILIZES]->(t) RETURN t.name, labels(t)
Find all frameworks that utilize the GEPA algorithm. · confidence 90% · unvalidated
MATCH (f:Framework)-[:UTILIZES]->(a:Algorithm {name: 'GEPA'}) RETURN f.name
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR suffers from inefficient exploration, particularly when confronting "hard samples" that yield near-zero success rates. In such scenarios, the reliance on sparse outcome rewards typically results in zero-advantage estimates, effectively starving the model of supervision signals despite the high informational value of these instances. To address this, we propose P^2O, a novel framework that synergizes Prompt Optimization with Policy Optimization. P^2O identifies hard samples during training iterations and leverages the Genetic-Pareto (GEPA) prompt optimization algorithm to evolve prompt templates that guide the model toward discovering successful trajectories. Crucially, unlike traditional prompt engineering methods that rely on input augmentation, P^2O distills the reasoning gains induced by these optimized prompts directly into the model parameters. This mechanism provides denser positive supervision signals for hard samples and accelerates convergence. Extensive experiments demonstrate that P^2O not only achieves superior performance on in-distribution datasets but also exhibits strong generalization, yielding substantial improvements on out-of-distribution benchmarks (+4.7% avg.).
Tags
Links
- Source: https://arxiv.org/abs/2603.21877v1
- Canonical: https://arxiv.org/abs/2603.21877v1
Full Text
P2O: Joint Policy and Prompt Optimization

Xinyu Lu, Kaiqi Zhang, Jinglin Yang, Boxi Cao, Yaojie Lu, Hongyu Lin, Min He, Xianpei Han, Le Sun

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR suffers from inefficient exploration, particularly when confronting "hard samples" that yield near-zero success rates. In such scenarios, the reliance on sparse outcome rewards typically results in zero-advantage estimates, effectively starving the model of supervision signals despite the high informational value of these instances. To address this, we propose P2O, a novel framework that synergizes Prompt Optimization with Policy Optimization. P2O identifies hard samples during training iterations and leverages the Genetic-Pareto (GEPA) prompt optimization algorithm to evolve prompt templates that guide the model toward discovering successful trajectories. Crucially, unlike traditional prompt engineering methods that rely on input augmentation, P2O distills the reasoning gains induced by these optimized prompts directly into the model parameters. This mechanism provides denser positive supervision signals for hard samples and accelerates convergence. Extensive experiments demonstrate that P2O not only achieves superior performance on in-distribution datasets but also exhibits strong generalization, yielding substantial improvements on out-of-distribution benchmarks (+4.7% avg.).

Keywords: Reinforcement Learning, Prompt Optimization, RLVR, ICML

1 Introduction

Large Language Models (LLMs) have achieved remarkable proficiency across a spectrum of complex reasoning tasks (Jaech et al., 2024; Guo et al., 2025). A defining characteristic of these tasks is the existence of objective verification mechanisms capable of providing deterministic feedback.
Capitalizing on this, Reinforcement Learning with Verifiable Rewards (RLVR) (Lambert et al., 2024) has emerged as a dominant paradigm for reasoning alignment. By leveraging outcome-based supervision, RLVR enables models to surpass the limitations of imitation learning, facilitating autonomous exploration of the solution space to optimize reasoning trajectories aligned with rigorous verification signals (Liu et al., 2025a).

Despite its promise, the performance of RLVR is severely constrained when the training data intermixes "simple samples" with "hard samples": specifically, instances where the model achieves near-zero success rates across K rollouts. For hard samples, the model fails to capture reliable gradient feedback due to exploration difficulties and sparse rewards (Song et al., 2025). Consequently, the optimization process relies heavily on the accessible signals from simple samples. This causes the model to overfit to simple instances, trapping it in a suboptimal policy where it excels at trivial tasks but fails to resolve complex reasoning problems.

Figure 1: Conceptual illustration of the P2O framework. Standard policy optimization often gets trapped in local optima (π_init) due to sparse rewards on hard samples. P2O bridges this exploration gap using optimized prompts (the red arrow) to reach high-reward regions that are inaccessible via standard exploration. Subsequently, the model consolidates these gains (the white arrows) by updating its parameters to master the new region (π_opt), effectively internalizing the prompt-induced capabilities.

Common approaches to mitigate exploration challenges include curriculum learning strategies, which progressively introduce harder samples as training advances (Blakeman et al., 2025), and various reward shaping techniques that provide intermediate feedback.
However, curriculum learning requires heuristic-based and computationally expensive schedule generation (Bercovich et al., 2025), while reward shaping demands domain-specific expert heuristics.

The recent success of prompt optimization methods (Fernando et al., 2023; Opsahl-Ong et al., 2024; Yuksekgonul et al., 2025) offers a compelling way to break this stalemate. These methods demonstrate that even when a model fails to solve a hard problem under a standard policy, a carefully evolved prompt can often elicit the correct reasoning path. This implies that the solution lies within the model's latent search space but is inaccessible via standard gradient ascent (Zelikman et al., 2022). As illustrated in Figure 1, we visualize this mechanism as the red arrow: optimized prompts act as a bridge, enabling the model to "jump" out of the local optimum (π_init) and cross the reward-sparse valley. However, relying solely on inference-time prompts is insufficient; the ultimate goal is to internalize these capabilities. Therefore, the "jump" must be followed by consolidation, represented by the white arrows ascending the second peak, where the model parameters are updated to master the new region (π_opt).

Building on this intuition, we propose P2O (Joint Policy and Prompt Optimization), a novel framework that synergizes adaptive prompt evolution with reinforcement learning to overcome the hard-sample bottleneck. Specifically, P2O identifies challenging instances during training and employs the Genetic-Pareto (GEPA) (Agrawal et al., 2025) prompt optimization algorithm to evolve prompts that elicit successful reasoning chains. Rather than relying on inference-time prompting, we utilize these improved trajectories as supervision signals, enabling the policy to internalize reasoning patterns directly into its parameters.
This creates a virtuous cycle: as the model improves, previously hard samples become tractable, and GEPA focuses its efforts on the new frontier of difficulty. Finally, to verify the effectiveness of our method, we evaluate P2O on representative reasoning benchmarks, demonstrating its superior performance compared to baselines.

The contributions of this paper are summarized as follows:

- We propose P2O, a joint optimization framework that alternates between policy updates and prompt evolution. This approach addresses the exploration bottleneck in RLVR by leveraging optimized prompts to discover valid trajectories for hard samples.
- We introduce a context distillation strategy that allows the model to learn from prompt-guided trajectories without relying on them at inference. By computing gradients on the original input, the model internalizes the reasoning patterns elicited by the prompts.
- We demonstrate through experiments on six benchmarks that P2O consistently outperforms GRPO baselines, showing particularly strong improvements on hard reasoning tasks such as AIME (+12.3% avg.).

2 Preliminaries

2.1 Policy Optimization

Problem Definition. In the context of LLMs, policy optimization is formulated as a Reinforcement Learning (RL) problem where the LLM acts as a policy π_θ parameterized by weights θ. Given a dataset of queries D = {x}, the policy generates a response y consisting of a sequence of tokens. The quality of the generated response is evaluated by a reward function r(x, y), which may be derived from ground-truth verification (e.g., in reasoning tasks) or a preference model.

Optimization Objective. The goal of policy optimization is to maximize the expected reward while ensuring the updated policy does not deviate excessively from a reference policy π_ref.
The standard objective function is:

J_RL(θ) = E_{x∼D, y∼π_θ(·|x)} [ r(x, y) − β · KL(π_θ(·|x) ‖ π_ref(·|x)) ]

where β is a coefficient controlling the KL-divergence penalty to maintain training stability.

Group Relative Policy Optimization (GRPO). Traditional RL methods like PPO (Schulman et al., 2017) require a separate value function to estimate the baseline for advantage computation, which doubles the memory overhead. GRPO (Shao et al., 2024) eliminates the critic by employing a group-based baseline. For each query x, GRPO samples a group of K outputs {y_1, y_2, …, y_K} from the current policy. The advantage for the i-th output is computed by normalizing its reward against the group statistics:

A_i = (r(x, y_i) − μ_group) / σ_group

where μ_group and σ_group are the mean and standard deviation of the rewards within the group. GRPO updates the policy by maximizing a surrogate objective based on these relative advantages, significantly improving training efficiency for reasoning tasks.

2.2 Prompt Optimization

Problem Definition. Prompt optimization treats the LLM as a frozen black-box function M. The optimization variable is the discrete prompt z from the space P of all possible natural language strings. Given an input x, the model generates an output ŷ = M(z, x). The problem is to find an optimal prompt z* that maximizes a metric S(ŷ, y) (e.g., accuracy or F1 score) over the validation dataset D_val.

Optimization Objective. This is a discrete, derivative-free optimization problem. The objective is formally defined as:

z* = argmax_z E_{(x,y)∼D_val} [ S(M(z, x), y) ]    (1)

Unlike policy optimization, which adjusts continuous weights θ, prompt optimization searches the discrete semantic space of instructions to elicit better capabilities from the fixed model.

GEPA (Genetic-Pareto).
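Before turning to GEPA, the group-relative advantage of Section 2.1 is simple enough to sketch directly; the snippet below is an illustration, not the authors' implementation:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and (population) std of its own group -- no learned critic."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A mixed group yields informative, zero-mean advantages:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
# But a "hard sample" group (all rollouts fail) gives all-zero
# advantages, so no gradient signal survives:
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # → [0.0, 0.0, 0.0, 0.0]
```

The second, degenerate case is exactly the vanishing-advantage bottleneck that motivates P2O.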
To solve this optimization problem efficiently, we consider GEPA (Agrawal et al., 2025), a state-of-the-art evolutionary framework. GEPA conceptualizes prompt optimization as a genetic process driven by language-based reflection. Instead of random mutations, it employs a "Reflection LLM" to analyze error traces from the current prompt's performance and generate targeted semantic mutations that address specific failure modes. Furthermore, GEPA maintains a population of diverse prompts using a Pareto-based selection mechanism, optimizing for multiple objectives (e.g., accuracy and conciseness) simultaneously. This allows the framework to navigate the discrete prompt space effectively, escaping local optima that trap conventional prompt optimization methods.

3 P2O: Joint Policy and Prompt Optimization

3.1 Overview of P2O Framework

Figure 2: Overview of the P2O framework. The training process is formulated as an alternating maximization procedure between two phases: (1) Policy Optimization with Context Distillation, where the policy π_θ is updated to internalize reasoning patterns elicited by augmented inputs x̃; and (2) Evolutionary Prompt Optimization, where the prompt template set Z is evolved using GEPA to discover successful trajectories for the remaining hard samples (D_hard).

We propose P2O, a framework designed to overcome the exploration bottleneck in RLVR. Standard RL methods often fail on "hard samples" (instances where the current policy yields zero successful trajectories), leading to vanishing gradients. P2O addresses this by treating the prompt template as a dynamic latent variable that is jointly optimized with the policy parameters.
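As a bird's-eye view, the alternating procedure can be condensed into a toy loop. Every name and constant below is a hypothetical stand-in (the "policy" is just a per-question success probability, and GEPA is replaced by the trivial rule "every question that stayed hard gets an evolved prompt next epoch"), not the authors' code:

```python
import random

def rollout(p):
    """One toy rollout: success with probability p."""
    return 1.0 if random.random() < p else 0.0

def p2o_train(skill, epochs=3, K=6, tau=0.05, prompt_boost=1.0, lr=0.3):
    """Skeleton of P2O's alternating maximization. skill[x] plays the
    role of pi_theta's unprompted success rate on question x."""
    templates = set()  # hard questions currently holding an evolved prompt
    for _ in range(epochs):
        next_hard = set()
        # Phase 1: rollouts (prompt-augmented for hard samples) + update
        for x, p in list(skill.items()):
            # an evolved prompt boosts success; prompt_boost=1.0 means the
            # toy prompt fully unlocks the question
            p_eff = min(1.0, p + prompt_boost) if x in templates else p
            rewards = [rollout(p_eff) for _ in range(K)]
            if sum(rewards) / K < tau:       # Eq. (4)-style hardness test
                next_hard.add(x)
            # "context distillation" stand-in: prompted successes raise
            # the UNPROMPTED skill, i.e. gains are written into the policy
            skill[x] = min(1.0, p + lr * sum(rewards) / K)
        # Phase 2: "evolve" prompts for whatever remained hard
        templates = next_hard
    return skill

random.seed(0)
final = p2o_train({"easy": 0.9, "hard": 0.0})
```

In this caricature the "hard" question produces zero reward in epoch 1, receives a prompt in phase 2, and its unprompted skill rises in later epochs, mirroring the jump-then-consolidate picture of Figure 1.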
As illustrated in Figure 2, P2O operates as an alternating maximization procedure comprising two distinct phases: (1) Policy Optimization with Context Distillation, where the model internalizes the reasoning capabilities elicited by optimized prompts; and (2) Evolutionary Prompt Optimization, where a genetic algorithm discovers new prompts to crack the remaining hard samples.

3.2 Problem Formulation

We build upon the RLVR setting defined in Section 2.1. The standard objective is to maximize the expected reward J_RL(θ) = E_{x∼D, y∼π_θ}[r(x, y)]. However, in complex reasoning tasks, the solution space is vast and the reward signal is sparse.

The Exploration Bottleneck. We formally define a subset of the data, D_hard ⊂ D, as "hard samples" on which the current policy π_θ fails to discover almost any successful trajectory within a reasonable computational budget. Specifically, for x ∈ D_hard, the expected reward is nearly zero: E_{y∼π_θ(·|x)}[r(x, y)] ≈ 0. Under standard policy gradient methods, the gradient for these samples vanishes:

∇_θ J(x) = E_{y∼π_θ}[ (r(x, y) − b) · ∇_θ log π_θ(y|x) ] ≈ 0    (2)

where b is a baseline function (e.g., a value function V(x) or a moving average) used for variance reduction. Since both the realized reward r and the baseline b approach zero for hard samples, the model ceases to learn from D_hard, leading to sample inefficiency and sub-optimal convergence.

Prompt as a Latent Variable. To tackle this, we introduce a set of auxiliary prompt templates Z as discrete latent variables. We posit that for a hard sample x ∈ D_hard, while the direct mapping x → y* is inaccessible to the current policy, there exists a latent instruction z such that the augmented input x̃ = T(x, z) (where T is a template insertion function) significantly increases the density of successful trajectories. Therefore, we reformulate the optimization problem.
Instead of optimizing θ solely on the raw input x, we seek to maximize a joint objective over the policy parameters θ and the latent template space Z:

max_{θ, Z} J_joint = Σ_x max_{z ∈ Z ∪ {ε}} E_{y∼π_θ(·|T(x,z))}[ r(x, y) ]    (3)

where ε denotes the empty template (original input). This formulation implies a bilevel optimization challenge: we must identify the optimal latent prompt z* to unlock the gradient for x, and simultaneously update θ to maximize the likelihood of high-reward trajectories. Since directly optimizing the discrete Z and the continuous θ jointly is intractable, P2O decouples this into an iterative alternating maximization process.

3.3 Phase 1: Policy Optimization with Context Distillation

In this phase, we fix the template set Z and update the policy parameters θ. A naïve approach would be to train the policy to map the augmented input x̃ to the correct output y. However, this creates a dependency on inference-time prompting. Instead, we employ Context Distillation (Snell et al., 2022) to transfer the reasoning capabilities triggered by z directly into the parameters of π_θ for the original input x.

Hard Sample Mining. At epoch t, we dynamically identify the set of hard samples D_hard^(t+1). For each query x_i, we perform K rollouts. We define a hardness metric based on the empirical success rate:

x_i ∈ D_hard^(t+1)  ⟺  (1/K) Σ_{k=1}^{K} r(x_i, y_ik) < τ    (4)

where τ is a threshold (typically near zero). These hard samples are collected to serve as the seed data for the subsequent Prompt Optimization phase.

Trajectory Augmentation & Distillation. For samples identified as hard in the immediately preceding epoch, standard exploration is insufficient. We leverage the template set Z^(t) evolved in that epoch. For a hard sample x ∈ D_hard^(t), we sample a template z ∼ Z^(t) and generate trajectories using the augmented context x̃ = T(x, z).
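In toy form, the resulting update (rollouts from the augmented context, gradients on the original input) can be sketched as follows. The tabular "policy" and all names are illustrative assumptions, not the paper's implementation:

```python
import math

def context_distillation_step(pi, x, trajectories, advantages, lr=0.1):
    """One REINFORCE-style step on the ORIGINAL-input distribution pi[x],
    using trajectories that were sampled from the prompt-AUGMENTED input.
    pi maps a context string to a dict {response: probability}."""
    probs = pi[x]
    grads = {y: 0.0 for y in probs}
    for y, A in zip(trajectories, advantages):
        grads[y] += A  # advantage-weighted score-function gradient
    # take the step in logit space, then renormalize (softmax)
    logits = {y: math.log(p) + lr * grads[y] for y, p in probs.items()}
    zmax = max(logits.values())
    exps = {y: math.exp(l - zmax) for y, l in logits.items()}
    total = sum(exps.values())
    pi[x] = {y: e / total for y, e in exps.items()}
    return pi

# The bare question "Q" gives the right answer almost no mass, but the
# hint-augmented context finds it; the update is applied to pi["Q"]:
pi = {
    "Q":        {"wrong": 0.99, "right": 0.01},
    "hint + Q": {"wrong": 0.20, "right": 0.80},  # rollout distribution
}
# pretend these trajectories were sampled from pi["hint + Q"]
context_distillation_step(pi, "Q", ["right", "right", "wrong"], [1.0, 1.0, -0.5])
```

Note that the augmented distribution pi["hint + Q"] is never touched: only the unprompted policy moves, which is the decoupling the next paragraph makes precise.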
Crucially, while the generation is conditioned on x̃, the policy update is calculated with respect to the original input x. Let ỹ be an improved trajectory generated via prompt augmentation. The gradient update is computed as:

∇_θ J ≈ (1/N) Σ_{i=1}^{N} Σ_{ỹ ∈ Ỹ} A(x, ỹ) · ∇_θ log π_θ(ỹ | x)    (5)

where Ỹ denotes the set of improved trajectories generated from the augmented input x̃. By decoupling the rollout context (x̃) from the gradient context (x), P2O forces the model to learn the reasoning path ỹ as an intrinsic capability, independent of the auxiliary prompt z. This effectively distills the "teacher" distribution π(·|x̃) into the "student" distribution π(·|x).

3.4 Phase 2: Evolutionary Prompt Optimization

In the second phase, we fix the policy π_θ and optimize the template set Z^(t+1) to address the newly identified hard set D_hard^(t+1). We employ GEPA (Algorithm 3) for this discrete optimization task.

Reflective Evolution. Unlike traditional genetic algorithms that rely on random mutation, GEPA utilizes the reference model π_init as a mutation operator to "perform gradient descent in semantic space" (Yang et al., 2024). For each template z in the current Pareto front Z_front, we sample a mini-batch D_mini from D_hard^train and generate error feedback F containing failed predictions and ground truths. The mutation step then produces an improved candidate:

z′ ∼ π_init("Propose Improvement" | z, F)    (6)

where "Propose Improvement" denotes the meta-prompt for proposing new candidate templates. We adopt the same meta-prompt as in Agrawal et al. (2025). Candidates that demonstrate improvement on D_mini are evaluated on the development set D_hard^dev and added to the template pool Z.

Pareto Selection. Using only the best template is insufficient to cover the diversity of failure modes in reasoning tasks.
GEPA maintains a population of templates and employs Pareto optimization on D_hard^dev to identify non-dominated candidates (Algorithm 4). This approach ensures that Z remains diverse and generalizable across different error patterns. After evolution, we apply a greedy set-cover strategy (Algorithm 2) to select a minimal Pareto set Z_covered and assign sample-specific templates to each instance in D_hard for the next training epoch. This mechanism enables targeted intervention while maintaining template diversity.

The complete training procedure is summarized in Algorithm 1. By iteratively refining the policy to absorb prompt-induced capabilities and evolving prompts to target the policy's remaining weaknesses, P2O establishes a virtuous cycle of self-improvement.

Algorithm 1: P2O
Input: Training data D, initial template set Z^(0) = ∅
Input: Policy model π_θ, reference model π_init
Input: Epochs N_epoch, rollout num K, threshold τ
Output: Optimized policy π_θ*
1: for t = 0 to N_epoch − 1 do
2:   Initialize next hard set D_hard^(t+1) ← ∅
3:   // Phase 1: Policy Optimization
4:   for all batches (x_i, ŷ_i) ∈ D do
5:     for k = 1 to K do
6:       x_ik ← x_i
7:       if x_i ∈ D_hard^(t) and Z^(t) ≠ ∅ then
8:         Sample template z ∼ Z^(t)
9:         x_ik ← T(x_ik, z)   // T is a template insertion function
10:      end if
11:      Sample y_ik ∼ π_θ(·|x_ik)
12:      r_ik ← RewardFunc(y_ik, ŷ_i)
13:    end for
14:    r̄_i ← (1/K) Σ_{k=1}^{K} r_ik
15:    if r̄_i < τ then
16:      D_hard^(t+1) ← D_hard^(t+1) ∪ {x_i}
17:    end if
18:    Compute advantages A_ik
19:    Update π_θ using {x_i, y_ik, A_ik}
20:  end for
21:  // Phase 2: Prompt Optimization
22:  if D_hard^(t+1) ≠ ∅ then
23:    Z^(t+1) ← Gepa(D_hard^(t+1), π_θ, π_init)   // Alg. 3
24:  else
25:    Z^(t+1) ← ∅
26:  end if
27: end for

Algorithm 2: GreedyPromptAssignment
Input: Template set with scores Z = {(z_1, (r_{1,1}, …, r_{1,N})), …, (z_M, (r_{M,1}, …, r_{M,N}))}, hard samples D_hard, rollout num K
       // each (r_{i,1}, …, r_{i,N}) contains the scores of template z_i on N dev samples
Output: Sample-specific template assignments {(x_j, Z_j)}
// Step 1: Greedy set cover to find a minimal Pareto set
1: Initialize template Pareto set Z_covered ← ∅ and covered sample set S_covered ← ∅
2: while True do
3:   Find z* ← argmax_{z_i ∉ Z_covered} |C(z_i) \ S_covered|
     // C(z_i) = {x_j | r_{i,j} = 1} is the set of dev samples solved by z_i
4:   if |C(z*) \ S_covered| = 0 then break
5:   Z_covered ← Z_covered ∪ {z*}
6:   S_covered ← S_covered ∪ C(z*)
7: end while
// Step 2: Assign K templates to each hard sample
8: Compute weights r̄_i ← (1/N) Σ_{n=1}^{N} r_{i,n} for each z_i ∈ Z_covered
9: Initialize assignment A ← {}
10: for x_j in D_hard do
11:   if x_j ∈ S_covered then
12:     Z_j ← {z_i ∈ Z_covered | x_j ∈ C(z_i)}   // restrict to templates that solve this sample
13:   else
14:     Z_j ← Z_covered
15:   end if
16:   Sample Z_j = (z_{j,1}, …, z_{j,K}) from Z_j, K times, weighted by {r̄_i}
17:   A ← A ∪ {(x_j, Z_j)}
18: end for
19: return A

Table 1: Comparative evaluation on mathematical reasoning benchmarks. We report the accuracy (%) of P2O against baselines. P2O consistently outperforms the baselines, achieving the most significant gains on challenging benchmarks such as AIME24 and AIME25. Best results are highlighted in bold.
| Model | AIME24 | AIME25 | AMC | MATH500 | Minerva | Olympiad | AVG |
|---|---|---|---|---|---|---|---|
| Qwen3-0.6B | 2.1 | 1.9 | 30.2 | 53.8 | 12.5 | 19.4 | 19.6 |
| Qwen3-1.7B | 13.1 | 10.2 | 47.7 | 72.8 | 22.1 | 38.4 | 34.1 |
| Qwen3-4B-Base | 9.2 | 5.8 | 37.7 | 67.4 | 32.4 | 35.6 | 31.9 |
| Qwen3-4B | 23.8 | 19.4 | 68.0 | 82.8 | 31.6 | 49.9 | 45.9 |
| DeepMath-5K: GRPO | 33.8 | 29.0 | 79.5 | 87.6 | 41.5 | 57.5 | 54.8 |
| DeepMath-5K: Qwen3-4B-P2O (Self-Ref) | 52.1 | 39.8 | 85.3 | 90.2 | 41.9 | 60.9 | 61.7 |
| DeepMath-5K: Qwen3-4B-P2O (Teacher-Ref) | 44.6 | 34.4 | 81.9 | 90.2 | 36.0 | 60.0 | 57.9 |
| DeepScaler-5K: GRPO | 46.9 | 37.7 | 88.1 | 90.4 | 41.5 | 58.2 | 60.5 |
| DeepScaler-5K: Qwen3-4B-P2O (Self-Ref) | 51.5 | 42.1 | 88.1 | 91.0 | 40.1 | 61.8 | 62.4 |
| DeepScaler-5K: Qwen3-4B-P2O (Teacher-Ref) | 59.8 | 49.4 | 92.2 | 91.4 | 36.4 | 62.2 | 65.2 |

4 Experiments

4.1 Settings

Datasets. We conduct experiments on two distinct datasets, each comprising N = 5,000 samples. The first dataset is a subset randomly sampled from DeepScaler (Luo et al., 2025), while the second is derived from DeepMath (He et al., 2025) by selecting samples with a difficulty level of ≥ 7.

Method Configuration. We combine GRPO with a one-step off-policy strategy, excluding the KL-divergence penalty. The training hyperparameters include a maximum learning rate of 1×10^-6, a global batch size of 128, and a maximum generation length of 12k tokens. During the rollout phase, we use a temperature of T = 0.6 and sample K = 6 trajectories per prompt. All models are trained for 5 epochs. We use Qwen3-4B (Yang et al., 2025) as the training backbone. For GEPA, we use a validation set of 300 examples. Unlike the original GEPA setting (Agrawal et al., 2025), we use a beam size W to parallelize the Pareto search for improved computational efficiency. The candidate selection beam size is set to W = 16 (yielding ∼40 templates per iteration).
Regarding the reflection model in GEPA, we evaluate two variants: P2O (Self-Ref), which utilizes the reference model (Qwen3-4B) as the mutation operator, and P2O (Teacher-Ref), which employs Kimi-K2 (Team et al., 2025) as a stronger external teacher to provide high-quality feedback and prompt mutations.

Reward. We employ a strict binary reward (r ∈ {0, 1}). A reward of 1 is granted solely when the response follows the required format and matches the ground truth.

Evaluation Protocol. We employ the open-source evaluation suite (https://github.com/QwenLM/Qwen2.5-Math/tree/main/evaluation) provided by Qwen for all mathematical benchmarks. To ensure statistical robustness and encourage exploration on smaller datasets (fewer than 100 samples), we report the average performance across 16 rollouts per question, generated with a temperature of 0.6. For larger datasets, we adopt greedy sampling.

4.2 Main Results

Figure 3: Training dynamics of P2O. The plots illustrate the dynamics of training reward (left) and validation accuracy (right) throughout the optimization process using the Teacher-Ref variant on the DeepScaler-5K dataset. P2O maintains a higher reward compared to the baseline by dynamically cracking hard samples. Crucially, this reward advantage translates into robust gains in validation accuracy, confirming that the context distillation mechanism effectively transfers prompt-dependent success into intrinsic model capability.

Figure 4: Prompt optimization effectiveness. Left: optimized prompts (triangles) consistently yield gains over the standard policy (circles) on both Pass@1 and Pass@6 metrics, demonstrating that GEPA effectively assists the model in bridging the performance gap. Right: with this assistance, the model continuously conquers hard samples, resulting in a steady decline in the number of intractable instances throughout the training epochs. Data points are derived from the Teacher-Ref variant on the DeepScaler-5K dataset.
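As an aside on the reward design of Section 4.1, the strict binary reward can be sketched in a few lines. The \boxed{} answer convention below is an assumption of this illustration; the paper only states "follows the format and matches the ground truth":

```python
import re

def binary_reward(response: str, ground_truth: str) -> float:
    """Strict 0/1 reward: the format must be followed AND the final
    answer must match the ground truth; otherwise the reward is 0."""
    m = re.search(r"\\boxed\{([^{}]*)\}", response)
    if m is None:
        return 0.0  # format violation: no boxed answer found
    return 1.0 if m.group(1).strip() == ground_truth.strip() else 0.0

print(binary_reward(r"Thus the answer is \boxed{42}.", "42"))  # → 1.0
print(binary_reward("The answer is 42.", "42"))                # → 0.0
```

Such an all-or-nothing signal is precisely why hard samples yield zero-advantage groups, making the prompt-evolution rescue mechanism necessary.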
Table 1 summarizes the performance of models trained on DeepMath-5K and DeepScaler-5K across six challenging mathematical benchmarks, comparing P2O against Qwen3 base models and the GRPO baselines.

Comparison with Baselines. As shown in Table 1, P2O consistently outperforms both the backbone model and the GRPO baseline across most mathematical benchmarks. On the DeepScaler-5K dataset, our best configuration achieves an average accuracy of 65.2%, surpassing the GRPO baseline by 4.7%. Similar performance gains are observed on the DeepMath-5K dataset, where P2O yields a 6.9% improvement over GRPO, demonstrating robustness across different training distributions. The benefits of P2O are most pronounced in high-difficulty reasoning tasks that demand extensive exploration. Notably, on the AIME24 and AIME25 benchmarks under the DeepScaler-5K setting, P2O improves upon GRPO by 12.9% and 11.7% respectively. These results empirically validate our hypothesis: prompt evolution bridges the exploration gap for hard samples, recovering vital training signals from instances that would otherwise remain intractable.

Impact of Reflection Models on Prompt Evolution. Our comparison reveals that the optimal reflection source is task-dependent: on DeepScaler-5K, the Teacher-Ref variant dominates (65.2% vs. 62.4% avg.), whereas on DeepMath-5K, P2O (Self-Ref) (61.7% avg.) surprisingly outperforms the Teacher-Ref variant (57.9% avg.). This divergence suggests that for certain specialized domains, a policy model may benefit more from self-generated refinements that strictly align with its own reasoning capacity, whereas an external teacher might introduce complex priors that are harder for the policy to internalize. Crucially, all P2O variants consistently surpass the GRPO baseline across both datasets, confirming that the structural exploration enabled by GEPA is a robust driver of performance gains regardless of the reflection source.
4.3 Training Dynamics

Analysis of Learning Curves. As shown in Figure 3, the continuous integration of optimized prompts maintains a training reward consistently superior to the baseline. Crucially, this advantage translates into significant gains in validation accuracy, confirming that the model effectively internalizes the elicited reasoning patterns to achieve robust in-distribution generalization.

Analysis of Prompt Optimization Process. Figure 4 elucidates the mechanism driving P2O's performance. As the model's intrinsic capability grows during training, GEPA effectively assists in continuously conquering "hard samples", evidenced by the steady decline in the number of intractable instances (right). Furthermore, the performance breakdown (left) reveals that prompt optimization yields consistent gains across both Pass@1 and Pass@6 metrics. This demonstrates that the evolved templates do not merely facilitate a single lucky guess; rather, they robustly enhance the model's solution space, boosting both the deterministic success rate and the broader exploration coverage required for effective distillation.

4.4 Ablations

Table 2: Ablation study of P2O components. Comparison of performance on mathematical benchmarks when excluding context distillation or group prompt diversity. Data points are derived from the Teacher-Ref variant on the DeepScaler-5K dataset.

| Model | AIME24 | AIME25 | AMC | MATH500 | Minerva | Olympiad | AVG |
|---|---|---|---|---|---|---|---|
| Qwen3-4B | 23.8 | 19.4 | 68.0 | 82.8 | 31.6 | 49.9 | 45.9 |
| GRPO | 46.9 | 37.7 | 88.1 | 90.4 | 41.5 | 58.2 | 60.5 |
| Qwen3-4B-P2O (Teacher-Ref) | 59.8 | 49.4 | 92.2 | 91.4 | 36.4 | 62.2 | 65.2 |
| Qwen3-4B-P2O (w/o context distillation) | 36.5 | 31.5 | 81.1 | 89.6 | 40.1 | 54.8 | 55.6 |
| Qwen3-4B-P2O (same template in group) | 57.3 | 44.6 | 88.9 | 90.8 | 40.4 | 63.0 | 64.2 |

Importance of Context Distillation. We investigate the necessity of context distillation by ablating the supervision signal.
Instead of calculating the policy gradient on the original query x using trajectories generated from the augmented input x̃ = T(x, z), the ablated model (P2O w/o context distillation) is trained directly on x̃. As shown in Table 2, excluding this component results in severe performance degradation: the average accuracy drops from 65.2% (P2O Teacher-Ref) to 55.6%. Crucially, the ablated model not only fails to match the proposed method's performance but also falls behind the GRPO baseline (60.5%) by 4.9%. This result suggests that training directly on prompt-augmented data leads to a "dependency" effect, where the model learns to rely on the auxiliary templates to solve the task rather than internalizing the reasoning logic. Consequently, when these templates are absent during evaluation, the model fails to generalize. Thus, context distillation is not merely an enhancement but a prerequisite for effectively transferring capabilities from the teacher reference into the model's intrinsic parameters.

Impact of Group Prompt Diversity. We further investigate the impact of prompt diversity within each rollout group during training. In the standard P2O framework, we utilize a Pareto-based selection strategy and assign a diverse set of templates to each hard sample (i.e., different templates may be used across the K rollouts for the same query). To evaluate the necessity of this diversity, we compare it against a baseline variant, P2O (same template in group), where the model samples a single template z ∼ Z^(t) per query x_i and applies that fixed template to all K rollouts in the group. As shown in Table 2, the diversity-driven approach (P2O Teacher-Ref) achieves an average accuracy of 65.2%, outperforming the single-template baseline (64.2%). The gain is particularly pronounced on high-difficulty benchmarks such as AIME24 (+2.5%) and AIME25 (+4.8%).
This empirical evidence confirms that employing a diverse set of Pareto-optimal prompts within the same rollout group provides broader coverage of the reasoning space. By eliciting a wider variety of successful trajectories for a single hard sample, the model receives a denser and more robust supervision signal, which facilitates more effective distillation of complex reasoning capabilities into the model's parameters.

5 Related Works

Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) are widely used for LLM reasoning alignment, stabilizing training via group baselines and verification, but they still struggle with exploration in sparse or misleading reward landscapes. To address these scalability and stability issues, recent works have focused on algorithmic advances. For instance, DAPO (Yu et al., 2025) prunes prompts with accuracy equal to 1 or 0 to improve training stability. Despite these improvements, such methods primarily focus on optimizing the policy on a fixed manifold, leaving the initial task prompts untouched. To further aid the model in traversing complex reasoning paths, various "hint-based" or guidance-augmented strategies have been proposed. Approaches such as strong-model guidance (Nath et al., 2025; Liu et al., 2025b), experience replay data (Zhan et al., 2025), and expert solutions (Zhang et al., 2025b; Yan et al., 2025; Li et al., 2025) attempt to scaffold the reasoning process. Similarly, Critique-GRPO (Zhang et al., 2025a) utilizes feedback signals to guide the policy. Crucially, however, these methods typically rely on heuristically obtained hints or expert trajectory fragments derived from the dataset. There is a notable absence of optimization of the guidance itself; the prompts or hints are treated as static constants rather than dynamic variables. This reliance on predefined or oracle-dependent guidance limits the model's ability to generalize beyond the scope of the provided hints.
In contrast, P2O treats the prompt as an optimizable parameter, jointly evolving the task topology with the policy to achieve bi-level self-improvement.

6 Conclusion

In this paper, we proposed P2O, a framework that synergizes genetic prompt evolution with reinforcement learning to address the exploration bottleneck in reasoning tasks. By leveraging optimized prompts to unlock successful trajectories for hard samples and subsequently internalizing these capabilities through context distillation, P2O effectively recovers valid training signals for previously intractable instances, thereby preventing the policy from collapsing into local optima. Extensive experiments on mathematical benchmarks demonstrate that P2O significantly outperforms standard GRPO baselines, validating the effectiveness of jointly optimizing the prompt space and the parameter space to achieve robust, autonomous self-improvement.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning, specifically reinforcement learning and reasoning post-training for Large Language Models. Our proposed method, P2O, aims to automate the discovery of effective reasoning strategies to assist policy-model training, thereby potentially lowering the barrier to developing capable reasoning models. There are many potential societal consequences of our work, primarily related to the general advancement of AI reasoning capabilities and their downstream applications. However, we feel that none of them must be specifically highlighted here beyond the standard ethical considerations of the field.

References

L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, et al. (2025) GEPA: reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457.

A. Bercovich, I. Levy, I. Golan, M. Dabbah, R. El-Yaniv, O. Puny, I. Galil, Z. Moshe, T.
Ronen, N. Nabwani, et al. (2025) Llama-Nemotron: efficient reasoning models. arXiv preprint arXiv:2505.00949.

A. Blakeman, A. Grattafiori, A. Basant, A. Gupta, A. Khattar, A. Renduchintala, A. Vavre, A. Shukla, A. Bercovich, A. Ficek, et al. (2025) Nemotron 3 Nano: open, efficient mixture-of-experts hybrid Mamba-Transformer model for agentic reasoning. arXiv preprint arXiv:2512.20848.

C. Fernando, D. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel (2023) Promptbreeder: self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797.

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

Z. He, T. Liang, J. Xu, Q. Liu, X. Chen, Y. Wang, L. Song, D. Yu, Z. Liang, W. Wang, et al. (2025) DeepMath-103K: a large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. arXiv preprint arXiv:2504.11456.

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. (2024) OpenAI o1 system card. arXiv preprint arXiv:2412.16720.

N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, et al. (2024) Tulu 3: pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.

J. Li, H. Lin, H. Lu, K. Wen, Z. Yang, J. Gao, Y. Wu, and J. Zhang (2025) QuestA: expanding reasoning capacity in LLMs via question augmentation. arXiv preprint arXiv:2507.13266.

M. Liu, S. Diao, X. Lu, J. Hu, X. Dong, Y. Choi, J. Kautz, and Y. Dong (2025a) ProRL: prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864.

Z. Liu, C. Gong, X. Fu, Y. Liu, R.
Chen, S. Hu, S. Zhang, R. Liu, Q. Zhang, and D. Tu (2025b) GHPO: adaptive guidance for stable and efficient LLM reinforcement learning. arXiv preprint arXiv:2507.10628.

M. Luo, S. Tan, J. Wong, X. Shi, W. Y. Tang, M. Roongta, C. Cai, J. Luo, L. E. Li, R. A. Popa, and I. Stoica (2025) DeepScaleR: surpassing o1-preview with a 1.5B model by scaling RL. Notion blog.

V. Nath, E. Lau, A. Gunjal, M. Sharma, N. Baharte, and S. Hendryx (2025) Adaptive guidance accelerates reinforcement learning of reasoning models. arXiv preprint arXiv:2506.13923.

K. Opsahl-Ong, M. J. Ryan, J. Purtell, D. Broman, C. Potts, M. Zaharia, and O. Khattab (2024) Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 9340–9366.

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

C. Snell, D. Klein, and R. Zhong (2022) Learning by distilling context. arXiv preprint arXiv:2209.15189.

Y. Song, J. Kempe, and R. Munos (2025) Outcome-based exploration for LLM reasoning. arXiv preprint arXiv:2509.06941.

K. Team, Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen, et al. (2025) Kimi K2: open agentic intelligence. arXiv preprint arXiv:2507.20534.

J. Yan, Y. Li, Z. Hu, Z. Wang, G. Cui, X. Qu, Y. Cheng, and Y. Zhang (2025) Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945.

A. Yang, A.
Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.

C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen (2024) Large language models as optimizers. In ICLR.

Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. (2025) DAPO: an open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, P. Lu, Z. Huang, C. Guestrin, and J. Zou (2025) Optimizing generative AI by backpropagating language model feedback. Nature 639, pp. 609–616.

E. Zelikman, Y. Wu, J. Mu, and N. Goodman (2022) STaR: bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems 35, pp. 15476–15488.

R. Zhan, Y. Li, Z. Wang, X. Qu, D. Liu, J. Shao, D. F. Wong, and Y. Cheng (2025) ExGRPO: learning to reason from experience. arXiv preprint arXiv:2510.02245.

X. Zhang, H. Sun, Y. Zhang, K. Feng, C. Lu, C. Yang, and H. Meng (2025a) Critique-GRPO: advancing LLM reasoning with natural language and numerical feedback. arXiv preprint arXiv:2506.03106.

X. Zhang, Z. Huang, Y. Li, C. Ni, J. Chen, and S. Oymak (2025b) BREAD: branched rollouts from expert anchors bridge SFT & RL for reasoning. arXiv preprint arXiv:2506.17211.

Appendix A Details on GEPA

We provide the comprehensive pseudocode for the Genetic-Pareto (GEPA) prompt optimization process used in Phase 2 of our framework. Please refer to Algorithm 3 for the detailed execution of the main evolutionary loop, and Algorithm 4 for the specific implementation of the Pareto-based frontier selection strategy.
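As a concrete companion to the pseudocode, the dominance test at the heart of the Pareto-based selection can be sketched in Python. This is a minimal illustration under our reading of Algorithm 4 (names and data layout are assumptions), not the paper's implementation:

```python
def dominates(r_j: list[float], r_i: list[float]) -> bool:
    """Template j dominates template i if it scores at least as well
    on every dev sample and strictly better on at least one."""
    return all(a >= b for a, b in zip(r_j, r_i)) and any(
        a > b for a, b in zip(r_j, r_i)
    )

def select_pareto_front(scored: dict[str, list[float]]) -> list[str]:
    """Keep every template whose per-sample dev scores are not
    dominated by any other template's scores."""
    return [
        z for z, r in scored.items()
        if not any(dominates(r2, r) for z2, r2 in scored.items() if z2 != z)
    ]

scores = {
    "z1": [1.0, 0.0, 0.0],  # best on dev sample 1
    "z2": [0.0, 1.0, 1.0],  # best on dev samples 2 and 3
    "z3": [0.0, 1.0, 0.0],  # dominated by z2
}
print(select_pareto_front(scores))  # ['z1', 'z2']
```

Note that per-sample dominance keeps specialists like z1 on the front even though its mean score is lower than z2's; selecting by mean score alone would discard such templates.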
Algorithm 3 GEPA
Input: hard data D_hard, policy model π_θ, reference model π_init
Input: budget C_total, mini-batch size B, width W
Output: template set Z
 1: Split D_hard into D_hard^train and D_hard^dev
 2: Initialize Z ← {(ε, Eval(π_θ, ε, D_hard^dev))}
 3: // ε denotes the empty template.
 4: // Eval() returns the reward of every sample in the given dataset under the given template.
 5: C_left ← C_total
 6: while C_left > 0 do
 7:   Z_front ← SelectParetoFront(Z, W)
 8:   for all z ∈ Z_front do
 9:     Sample mini-batch D_mini ⊂ D_hard^train of size B
10:     r̄_old ← mean(Eval(π_θ, z, D_mini))
11:     Generate error feedback F from rollouts and rewards on D_mini
12:     z′ ← π_init("Propose Improvement", z, F)
13:     r̄_new ← mean(Eval(π_θ, z′, D_mini))
14:     Update cost: C_left ← C_left − 2B
15:     if r̄_new > r̄_old then
16:       Z ← Z ∪ {(z′, Eval(π_θ, z′, D_hard^dev))}
17:       C_left ← C_left − |D_hard^dev|
18:     end if
19:   end for
20: end while
21: Z ← GreedyPromptAssignment(Z, D_hard)
22: return Z

Algorithm 4 SelectParetoFront
Input: template set with scores Z = {(z_1, (r_1,1, …, r_1,N)), …, (z_M, (r_M,1, …, r_M,N))}, width W
// Each (r_i,1, …, r_i,N) contains the scores of template z_i on the N dev samples.
Output: selected templates Z_front
 1: // Step 1: identify Pareto-optimal templates
 2: Z_front ← ∅
 3: for i = 1 to M do
 4:   dominated ← False
 5:   for j = 1 to M do
 6:     if j ≠ i and z_j dominates z_i then
 7:       // z_j dominates z_i if ∃n: r_j,n > r_i,n and ∀n′: r_j,n′ ≥ r_i,n′
 8:       dominated ← True
 9:       break
10:     end if
11:   end for
12:   if not dominated then
13:     Z_front ← Z_front ∪ {(z_i, (r_i,1, …, r_i,N))}
14:   end if
15: end for
16: // Step 2: select W templates from the Pareto front
17: if |Z_front| ≤ W then
18:   Z_front ← {z | (z, (r_1, …, r_N)) ∈ Z_front}
19: else
20:   // Sample based on mean dev scores
21:   Compute weights r̄_i ← (1/N) Σ_{n=1}^{N} r_i,n for each (z_i, (r_i,1, …, r_i,N)) ∈ Z_front
22:   Sample W templates from Z_front weighted by {r̄_i}
23:   Z_front ← {sampled templates}
24: end if
25: return Z_front

Appendix B Case Study

Figure 5: Qualitative Analysis: Overcoming Local Optima in Geometric Reasoning.

To further demonstrate how P2O overcomes the exploration bottleneck, we analyze a representative "hard sample" from the training process in Figure 5. The problem asks for the radius of the smallest sphere enclosing four unit spheres.

Failure of the Base Policy. Without guidance, the model falls into a common cognitive trap: it defaults to a visually intuitive but mathematically suboptimal configuration. As shown in the generated trace, the model attempts to place "3 spheres on a plane ... with the 4th sphere resting on top." This corresponds to a localized packing that yields a parent-sphere radius of 1 + √2, failing to minimize the volume.

Mechanism of the Optimized Prompt. The template evolved by GEPA acts as a targeted intervention in the model's latent search space. It does not merely encourage the model to "think harder"; rather, it injects specific domain knowledge, namely the concept of the regular tetrahedron and the precise mathematical constant for its centroid-to-vertex distance (√(3/8) per unit edge). This prompt effectively prunes the invalid search space (planar configurations) and steers the reasoning trajectory toward the global optimum (tetrahedral packing), allowing the model to derive the correct radius of 1 + √6/2.
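For reference, the correct radius quoted in the case study follows from elementary geometry; the following is a sketch of the standard derivation, not reproduced from the paper:

```latex
% Four mutually tangent unit spheres: their centers form a regular
% tetrahedron with edge a = 2. The centroid-to-vertex distance of a
% regular tetrahedron with edge a is a\sqrt{3/8}, so the radius of the
% smallest enclosing sphere (one unit beyond each center) is
\[
R \;=\; 1 + a\sqrt{\tfrac{3}{8}}\,\Big|_{a=2}
  \;=\; 1 + \frac{\sqrt{6}}{2} \;\approx\; 2.2247 ,
\]
% which improves on the planar configuration's radius of
% $1 + \sqrt{2} \approx 2.4142$ reported in the failed trace.
```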